The code for our leapfrog implementation for Apache Jena is available here.
The dataset used was a reduced version of the Wikidata truthy dump from November 15, 2018. The original dump and its reduced version are available at Zenodo.
-
Any x64 Linux distribution with glib support
-
Java 8
-
Python (either 2 or 3 works)
-
- On a Debian-based distro:
sudo apt install bzip2
-
Some of the following steps can take hours to complete, so we recommend using tmux to execute them.
-
Clone this repository.
git clone git@github.com:cirojas/leapfrog-benchmark.git
if you use SSH keys
or
git clone https://github.com/cirojas/leapfrog-benchmark.git
if you don't.
-
Download the dataset used and move it to the
benchmark
folder -
Extract it
bzip2 -d wikidata-wcg-filtered.nt.bz2
-
Alternatively, you can construct the dataset from the Wikidata truthy dump.
-
Download the file apache-jena-3.9.0.tar.gz from the Apache Jena downloads page or here and move it into the
jena
folder -
Change directory into
jena
folder -
Extract it
tar -xf apache-jena-3.9.0.tar.gz
-
Create the database for Jena
apache-jena-3.9.0/bin/tdbloader2 --loc=db/jena ../wikidata-wcg-filtered.nt
-
Edit the file
apache-jena-3.9.0/bin/tdbloader2index
with any text editor. After line 389, which reads
generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
add the following lines:
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
then save and exit.
-
Create the database for the leapfrog implementation
apache-jena-3.9.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-wcg-filtered.nt
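The three generate_index lines added to tdbloader2index follow the same pattern as the existing OSP line. Reading that line, $K1, $K2 and $K3 appear to stand for subject, predicate and object; this mapping is inferred from the OSP line, not taken from the script's documentation. The short Python sketch below, whose helper names are hypothetical, derives the key arguments from an index name and lists the orders that the unmodified script is assumed not to build (assuming it builds only SPO, POS and OSP).

```python
from itertools import permutations

# Assumed mapping, inferred from the existing line
#   generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
KEY_FOR = {"S": "$K1", "P": "$K2", "O": "$K3"}

# Index orders assumed to be built by the unmodified tdbloader2index script.
DEFAULT_ORDERS = ["SPO", "POS", "OSP"]


def index_line(order):
    """Return the generate_index line for an index order such as 'SOP'."""
    keys = " ".join(KEY_FOR[c] for c in order)
    return 'generate_index "{}" "$DATA_TRIPLES" {}'.format(keys, order)


def missing_orders():
    """Return the S/P/O permutations not in DEFAULT_ORDERS."""
    all_orders = ["".join(p) for p in permutations("SPO")]
    return [o for o in all_orders if o not in DEFAULT_ORDERS]


if __name__ == "__main__":
    # Prints one generate_index line per missing index order.
    for order in missing_orders():
        print(index_line(order))
```

Comparing the printed lines with the ones pasted into the script is a quick way to catch a swapped key before starting a load that can take hours.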
-
Download the Blazegraph jar from its SourceForge page or from here and move it into the
blazegraph
folder
-
Change directory into the
blazegraph
folder
-
Load the dataset
java -Xmx20g -cp blazegraph.jar com.bigdata.rdf.store.DataLoader load.properties ../wikidata-wcg-filtered.nt
-
Download Virtuoso Open Source Edition v7.2.5.1 from its GitHub releases page or from here and move it into the
virtuoso
folder
-
Change directory into the
virtuoso
folder
-
Extract it
tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
-
Start the server
virtuoso-opensource/bin/virtuoso-t -c virtuoso.ini
-
The server can take some time to start. Wait a minute, then start the interactive SQL client:
virtuoso-opensource/bin/isql localhost:1111
and enter the following commands:
ld_dir('..', '*.nt', 'http://wikidata.org');
rdf_loader_run();
exit();
-
Shut down the server
virtuoso-opensource/bin/isql localhost:1111 -K
-
Change directory into
benchmark
folder -
bash run-benchmark.sh queries/bgps
-
bash run-benchmark.sh queries/optionals
The results are now available in the folders
queries/bgps/output
and
queries/optionals/output
For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by semicolons:
queryNumber;numberOfResults;executionTimeInNanoseconds
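To compare the databases, each output file can be reduced to a few summary numbers. The following is a minimal Python 3 sketch, assuming all three fields on every line are plain integers; the function names are hypothetical and the sample lines are made up to show the format, not measured results.

```python
import statistics


def parse_result_line(line):
    """Split 'queryNumber;numberOfResults;executionTimeInNanoseconds' into ints."""
    query, results, nanos = line.strip().split(";")
    return int(query), int(results), int(nanos)


def summarize(lines):
    """Return (number of queries, mean time in ms, median time in ms)."""
    # Convert nanoseconds to milliseconds; skip blank lines.
    times_ms = [parse_result_line(l)[2] / 1e6 for l in lines if l.strip()]
    return len(times_ms), statistics.mean(times_ms), statistics.median(times_ms)


if __name__ == "__main__":
    # Made-up sample lines, only to illustrate the file format.
    sample = ["1;10;2000000", "2;0;4000000", "3;5;9000000"]
    print(summarize(sample))
```

For a real file, pass the open file object instead of the sample list, for example `summarize(open(path))`, where `path` points at one of the four files in a query-pattern folder.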
-
Download the Wikidata truthy dump
wikidata-wcg.nt.bz2
from here.
-
Extract it
bzip2 -d wikidata-wcg.nt.bz2
-
Move it to the
wikidata-filter
folder and change directory to that folder.
-
Execute
python remove_labels_and_descriptions.py
to remove labels and descriptions from Wikidata, along with strings in languages other than English.
-
Execute
python remove_properties.py
to remove all properties listed in
removed_properties.txt
In our case, we removed all properties that appeared more than 1,000,000 times or fewer than 1,000 times.
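The frequency thresholds above can be illustrated with a short Python sketch that counts how often each predicate occurs in an N-Triples file and lists those outside the kept range. This is not the repository's script: the function names are hypothetical, and the code assumes one triple per line with the predicate as the second whitespace-separated token.

```python
from collections import Counter


def predicate_of(line):
    """Return the predicate of an N-Triples line, or None if it is not a triple."""
    if line.lstrip().startswith("#"):
        return None  # comment line
    parts = line.split(None, 2)  # subject, predicate, rest of the line
    if len(parts) < 3:
        return None
    return parts[1]


def properties_to_remove(lines, min_count=1000, max_count=1000000):
    """List predicates occurring fewer than min_count or more than max_count times."""
    counts = Counter(p for p in map(predicate_of, lines) if p is not None)
    return sorted(p for p, n in counts.items() if n < min_count or n > max_count)


if __name__ == "__main__":
    # Tiny made-up input with small thresholds, just to show the behaviour.
    demo = ["<a> <p> <b> .", "<a> <p> <c> .", "<a> <q> <b> ."]
    print(properties_to_remove(demo, min_count=2, max_count=2))
```

Passing an open file object as `lines` keeps memory use proportional to the number of distinct predicates, not to the size of the dump.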
For each query pattern we created a Java program that finds 50 random sets of properties with at least one result.
The jars are in the
find-queries
folder.
To find a query, execute
java -jar find_XYZ.jar [jena-database-location] properties_wikidata.txt
where
properties_wikidata.txt
is a file listing the properties that can be chosen.
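The idea behind the search can be sketched in Python on a toy in-memory triple list. This is only an illustration, not the Java programs in the repository: the two-step path pattern, the function names, and the choice to collect distinct property pairs are all assumptions made for the example.

```python
import random


def path_has_result(triples, p1, p2):
    """True if the pattern ?x p1 ?y . ?y p2 ?z has at least one match."""
    objects_of_p1 = {o for s, p, o in triples if p == p1}
    return any(p == p2 and s in objects_of_p1 for s, p, o in triples)


def find_property_sets(triples, properties, wanted, rng, max_tries=10000):
    """Sample property pairs at random until `wanted` distinct pairs have a result."""
    found = set()
    for _ in range(max_tries):
        if len(found) == wanted:
            break
        pair = tuple(rng.sample(properties, 2))
        if path_has_result(triples, *pair):
            found.add(pair)
    return sorted(found)


if __name__ == "__main__":
    # Toy data: a -p-> b -q-> c -r-> d
    triples = [("a", "p", "b"), ("b", "q", "c"), ("c", "r", "d")]
    print(find_property_sets(triples, ["p", "q", "r"], 2, random.Random(0)))
```

The `max_tries` cap keeps the loop from running forever when fewer than `wanted` combinations have any result.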
You can find our results in our repository.