cirojas/leapfrog-benchmark

Code

The code for our leapfrog implementation for Apache Jena is available here.

Dataset

The dataset is a reduced version of the Wikidata truthy dump from November 15, 2018. Both the original dump and the reduced version are available at Zenodo.

Repeating the experiments

Prerequisites

  • Any x64 Linux distribution with glibc support

  • Java 8

  • Python (either 2 or 3 works)

  • bzip2

    • On a Debian-based distro: sudo apt install bzip2
  • pip

  • SPARQLWrapper

    Some of the following steps can take hours to complete, so we recommend using tmux to execute them.

Getting the repo and the dataset

  • Clone this repository.

    • git clone git@github.com:cirojas/leapfrog-benchmark.git if you use ssh keys

    or

    • git clone https://github.com/cirojas/leapfrog-benchmark.git if you don't.
  • Download the dataset and move it to the benchmark folder.

  • Extract it: bzip2 -d wikidata-wcg-filtered.nt.bz2

  • Alternatively, construct the dataset yourself from the Wikidata truthy dump (see Building the dataset).

Create the database for Jena and leapfrog

  • Download the file apache-jena-3.9.0.tar.gz from the Apache Jena downloads page or from here and move it into the jena folder.

  • Change directory into the jena folder.

  • Extract it: tar -xf apache-jena-3.9.0.tar.gz

  • Create the database for Jena: apache-jena-3.9.0/bin/tdbloader2 --loc=db/jena ../wikidata-wcg-filtered.nt

  • Edit the file apache-jena-3.9.0/bin/tdbloader2index with any text editor. After line 389

    generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
    

    add the following lines:

    generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP
    generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO
    generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
    

    then save and exit.

  • Create the database for the leapfrog implementation: apache-jena-3.9.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-wcg-filtered.nt
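The three added generate_index lines can be checked against the index names they produce. The sketch below assumes that $K1, $K2 and $K3 stand for subject, predicate and object (which is consistent with the OSP line quoted above) and that the unmodified script builds SPO, POS and OSP, the default TDB indexes. Under those assumptions, the edit leaves the leapfrog database with all six orderings of a triple. KEY_TO_LETTER and index_name are illustrative names, not part of Jena.

```python
from itertools import permutations

# Assumption: $K1 = subject, $K2 = predicate, $K3 = object.
KEY_TO_LETTER = {"$K1": "S", "$K2": "P", "$K3": "O"}


def index_name(key_order):
    """Translate a key order such as "$K3 $K1 $K2" into an index name."""
    return "".join(KEY_TO_LETTER[key] for key in key_order.split())


# Assumption: indexes built by the unmodified tdbloader2index script.
default_indexes = {"SPO", "POS", "OSP"}

# Key orders added by the edit described above.
added_indexes = {
    index_name("$K1 $K3 $K2"),
    index_name("$K2 $K1 $K3"),
    index_name("$K3 $K2 $K1"),
}

all_orders = {"".join(p) for p in permutations("SPO")}
missing = all_orders - default_indexes - added_indexes

print(sorted(added_indexes))  # names produced by the added lines
print(sorted(missing))        # orderings still not covered
```

Because the Jena database is created before the script is edited, only the database in db/leapfrog receives the extra indexes.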

Create the database for Blazegraph

  • Download the Blazegraph jar from its SourceForge page or from here and move it into the blazegraph folder.
  • Change directory into the blazegraph folder.
  • Load the dataset: java -Xmx20g -cp blazegraph.jar com.bigdata.rdf.store.DataLoader load.properties ../wikidata-wcg-filtered.nt

Create the database for Virtuoso Opensource

  • Download Virtuoso Open Source Edition v7.2.5.1 from its GitHub releases page or from here and move it into the virtuoso folder.
  • Change directory into the virtuoso folder.
  • Extract it: tar -xf virtuoso-opensource.x86_64-generic_glibc25-linux-gnu.tar.gz
  • Start the server: virtuoso-opensource/bin/virtuoso-t -c virtuoso.ini
  • The server can take some time to start. Wait a minute, then open the interactive SQL client with virtuoso-opensource/bin/isql localhost:1111 and enter the following commands:
    • ld_dir('..', '*.nt', 'http://wikidata.org');
    • rdf_loader_run();
    • exit();
  • Shut down the server: virtuoso-opensource/bin/isql localhost:1111 -K

Run the benchmark

  • Change directory into the benchmark folder.

  • bash run-benchmark.sh queries/bgps

  • bash run-benchmark.sh queries/optionals

    The results are then available in the folders queries/bgps/output and queries/optionals/output.

    For each query pattern you will find a folder containing four files, one for each database. Each line of a file contains three values separated by semicolons: queryNumber;numberOfResults;executionTimeInNanoseconds
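The result files can be summarized with a few lines of Python. The following is a minimal sketch, not part of the repository. It assumes that all three fields are plain integers and that every line follows the format described above. The names parse_results and summarize, and the sample values, are made up for illustration.

```python
import statistics


def parse_results(lines):
    """Parse lines of the form queryNumber;numberOfResults;executionTimeInNanoseconds."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_number, result_count, nanos = line.split(";")
        rows.append((int(query_number), int(result_count), int(nanos)))
    return rows


def summarize(rows):
    """Return the query count and the mean and median execution time in milliseconds."""
    times_ms = [nanos / 1e6 for _, _, nanos in rows]
    return {
        "queries": len(rows),
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
    }


# Made-up sample lines; real files live under queries/bgps/output
# and queries/optionals/output.
sample = ["1;10;2000000", "2;0;4000000", "3;5;9000000"]
summary = summarize(parse_results(sample))
print(summary)
```

To process a real file, pass an open file object to parse_results instead of the sample list. If the benchmark script records timeouts or errors in a different form, those lines would need separate handling.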

Building the dataset

  • Download the Wikidata truthy dump wikidata-wcg.nt.bz2 from here.
  • Extract it: bzip2 -d wikidata-wcg.nt.bz2
  • Move it to the wikidata-filter folder and change directory to that folder.
  • Execute python remove_labels_and_descriptions.py to remove labels and descriptions from Wikidata, along with strings in languages other than English.
  • Execute python remove_properties.py to remove all properties listed in removed_properties.txt. In our case, we removed all properties that appeared more than 1,000,000 times or fewer than 1,000 times.
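The frequency criterion in the last step can be illustrated with a short sketch. This is not the repository's remove_properties.py, and the layout of removed_properties.txt is not assumed here. The sketch only shows one way to count how often each predicate occurs in an N-Triples file and to list those outside the stated range. The function name properties_to_remove and the example IRIs are hypothetical.

```python
from collections import Counter


def properties_to_remove(lines, min_count=1000, max_count=1000000):
    """List predicates that occur fewer than min_count or more than max_count times."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # In N-Triples the subject and the predicate contain no whitespace,
        # so the second whitespace-separated token is the predicate.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return sorted(p for p, n in counts.items() if n < min_count or n > max_count)


# Tiny made-up example with thresholds lowered so the effect is visible.
triples = [
    '<http://example.org/a> <http://example.org/p1> <http://example.org/b> .',
    '<http://example.org/a> <http://example.org/p1> <http://example.org/c> .',
    '<http://example.org/a> <http://example.org/p1> "some text"@en .',
    '<http://example.org/b> <http://example.org/p2> <http://example.org/c> .',
    '<http://example.org/b> <http://example.org/p3> <http://example.org/a> .',
    '<http://example.org/c> <http://example.org/p3> <http://example.org/a> .',
]
print(properties_to_remove(triples, min_count=2, max_count=2))
```

For the full dump, iterate over the open file rather than loading it into a list, so that memory use depends on the number of distinct predicates and not on the number of triples.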

Getting random queries for the benchmark

For each query pattern we created a Java program that finds 50 random sets of properties with at least one result. The jars are in the find-queries folder. To find queries for a pattern, execute java -jar find_XYZ.jar [jena-database-location] properties_wikidata.txt, where properties_wikidata.txt is a file listing the properties that can be chosen.
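The selection procedure described above can be sketched as rejection sampling: draw a random set of properties, keep it only if the corresponding query returns at least one result, and repeat until enough sets are found. The code below is a hypothetical illustration in Python, not the logic of the provided jars. In particular, has_results is a placeholder callback that would have to run the query pattern against the Jena database, and find_queries, set_size and max_attempts are names introduced here.

```python
import random


def find_queries(properties, set_size, has_results, wanted=50, seed=None,
                 max_attempts=100000):
    """Collect up to `wanted` distinct random property tuples accepted by has_results."""
    rng = random.Random(seed)
    found = []
    seen = set()
    for _ in range(max_attempts):
        if len(found) == wanted:
            break
        candidate = tuple(rng.sample(properties, set_size))
        if candidate in seen:
            continue  # never test the same tuple twice
        seen.add(candidate)
        if has_results(candidate):
            found.append(candidate)
    return found


# Stand-in data: made-up property names and a fake has_results check.
properties = ["P%d" % i for i in range(20)]
queries = find_queries(properties, set_size=2,
                       has_results=lambda props: "P0" not in props,
                       wanted=5, seed=42)
print(queries)
```

The max_attempts bound keeps the loop from running forever when few property combinations produce results.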

Results

You can find our results in our repository.
