This collection of scripts allows maintainable and time-efficient relational online analytical processing through the specification of small, modular SQL queries. It is mainly suited for querying a rarely modified data set, such as that of GHTorrent. The result of each SQL query is saved in corresponding table, which can then be used by subsequent queries. As a result, each query can be independently developed and tested. Automatic dependency analysis of the queries ensures that new queries can use already calculated results and that every time a query is changed all tables that depend on it (and only those) are automatically repopulated.
When using the MySQL or the PostgreSQL engine,
the table population timestamps are
automatically propagated from the database to the files that track
the table creation times in the tables folder.
This allows multiple users to work together with separate instances
of a simple-rolap repository, without requiring the recalculation
of expensive tables.
When the database's information schema is locked (e.g. due to the creation
of new indices) this part of the dependency generation step can block
until the (potentially long) operation finishes.
To avoid this, set the SKIP_TIMESTAMPING environment variable,
for example by running make SKIP_TIMESTAMPING=1.
The provided functionality is mainly useful in cases where materialized views are unsupported or unusable. The scripts are written for MySQL, PostgreSQL, and SQLite, but they should be easy to port to other relational database systems.
For complex relational OLAP queries, simple-rolap can be combined particularly effectively with RDBUnit. You can find a complete tutorial on using simple-rolap with RDBUnit for mining Git repositories in a technical briefing presented at the 2017 International Conference on Software Engineering.
You can find simple-rolap introduced in the following article, which you can cite as its definitive description.
Diomidis Spinellis. Modular data analytics. IEEE Software, 41(5):20–23, September 2024. doi:10.1109/MS.2024.3409988
In addition, if you use simple-rolap for mining software repositories, consider reading and citing the following.
Georgios Gousios and Diomidis Spinellis. Mining software engineering data from GitHub. In Proceedings of the 39th International Conference on Software Engineering Companion, ICSE-C '17, pages 501–502, Piscataway, NJ, USA, May 2017. IEEE Press. Technical Briefing. doi:10.1109/ICSE-C.2017.164
The simple-rolap scripts are used by including the provided Makefile.
Consequently, all that is needed is to provide the repository in suitably
accessible location.
Here is an example.
cd /usr/local/lib
sudo git clone --depth=1 https://github.com/dspinellis/simple-rolap.gitAlternatively, you can perform a local install by adding two lines in
the project's Makefile. (See below.)
The system where simple-rolap is run must have and installation of the database being used, GNU make, GNU sed, and (if you want to visualize the associated dependency graphs) GraphViz.
To start using simple-rolap, create a Makefile that
a) specifies the project's configuration by defining a few variables,
and b) includes the simple-rolap Makefile.
Consider the following self-explanatory example.
# Database engine to use (One of sqlite or mysql)
export RDBMS?=sqlite
# The (default) database containing the data you want to query
export MAINDB?=rxjs-ghtorrent
# The (explicitly specified) database that will contain your queries' results
export ROLAPDB?=stratsel
include /usr/local/src/simple-rolap/MakefileTo work with a local install of simple-rolap change the Makefile's
last line into the following.
include simple-rolap/Makefile
simple-rolap/Makefile:
git clone https://github.com/dspinellis/simple-rolapIf some actions need to be performed before running the queries,
you can specify them in a variable named DEPENDENCIES and then
add corresponding rules.
Example:
export DEPENDENCIES=rxjs-ghtorrent.db
include simple-rolap/Makefile
rxjs-ghtorrent.db:
wget https://github.com/ghtorrent/tutorial/raw/master/rxjs-ghtorrent.dbThen comes the specification of SQL statements that will analyze the data. The simple-rolap system supports two types of modules: those that create tables and those that run queries to create a report. Below is a table creation module.
-- Projects that have been forked
CREATE TABLE stratsel.forked_projects AS
SELECT distinct forked_from AS id FROM projects
WHERE forked_from is not null;The SQL statement creates a table in the result database by using tables
from the data and the result database.
The module must reside in a file named after the table it creates,
with the suffix .sql, e.g. forked_projects.sql in the preceding example.
SELECT clauses must appear indented to avoid confusing the table
creation module with a query module.
By typing make the module will be run, if and only if its results have never
been produced, or if they are older than the result tables on which they
depend.
This is accomplished by creating a timestamped file for each module
execution in a directory named tables.
Each module can contain database engine-specific extensions and also SQL statements that create indices, as shown in the following example.
-- Projects in our candidate set that have their URL blacklisted
CREATE TABLE leadership.blacklisted_projects ENGINE=MyISAM AS
SELECT projects.id
FROM leadership.blacklisted_urls
LEFT join projects ON blacklisted_urls.url = projects.url;
ALTER TABLE leadership.blacklisted_projects add index(id);A query module is a simple SQL SELECT query, such as the following.
-- URLs of popular projects
SELECT projects.id, Concat('https://github.com/', Substr(url, 30)) AS url
FROM stratsel.popular_projects
LEFT join projects ON projects.id = popular_projects.id;When make is run on such a module, the results of the query will
appear in the directory reports in a file named after the module's name
with the suffix .sql replaced by .txt.
To authenticate yourself with the main database, setup suitable
environment variables (e.g. PGPASSWORD) or files (e.g. .my.cnf
or .pgpass).
By default simple-rolap will run all SQL queries.
If some additional actions need to be performed in addition
to running the queries, you can specify them in a variable
named ALL and then add corresponding rules.
Example:
export ALL=popular.svg
include simple-rolap/Makefile
popular.svg: reports/popular.dot
dot -Tsvg $? >$@Here are some more things that the provided Makefile allows you to do.
- Run
make helpto obtain a list of available targets. - Run
make graph.svg,make graph.png, ormake graph.pdfto generate a diagram of the ROLAP queries' dependencies. Usefull-graphrather thangraphto include the main database tables. - Run
make schema.svg,make schema.png, ormake schema.pdfto generate a diagram of the ROLAP SQLite3 database. This requires the tables to be created with one field per line and the opening and closing brackets in separate lines. - Run
make testto run any RDBUnit unit tests you may have written. PassUNIT=unit-name to execute only the specified unit test. - Run
make cleanto remove all auto-generated files, so that you can start a new analysis from scratch. - Run
make sync-timestampsto update the timestamps of unmodified files to match those of the committed files. In collaborative development settings this will avoid rerunning queries that have already been executed by others. - Run
make tagsto create atagsfile that many editors can use to automatically navigate between the queries. - Run
make ordered-dependenciesto create a file namedordered-dependencieswith the queries listed in the order determined by their dependencies. - Run
make V=1to see the commands executed within the Makefile. - Run
make V=1 TIME=timeto see timing information of executions. - Run
make V=2to also see the commands executed within the shell scripts. - Run
make V=3to see unit test output. - Run
make V=4to see the generated SQL code. - Run
make OMIT_QUERY_DEPS=trueto omit dependencies on the query's source code from the generated dependencies. - Put SQL statements that you want to precede each query (e.g. MySQL
optimizer tuning
SETcommands), in the file.config.sql.