Internal and external vocabulary

Hannah Bast edited this page Apr 29, 2022 · 2 revisions

Why is there an internal and an external vocabulary?

Internally, QLever assigns every IRI and every literal an 8-byte ID, and carries out as much of the query processing as possible on these IDs rather than on the strings or values they represent. This is part of what makes QLever so fast.

When producing the final result for a query, the internal IDs need to be translated back to the strings or values they represent. For values that fit into an 8-byte integer, QLever uses special code. For all other IDs (in particular, for all IRIs and non-value literals), this translation is done via a simple map, called the vocabulary.

For large datasets, the vocabulary is huge and we cannot assume that it fits into RAM, even when compressed. For example, at the time of this writing, for the complete Wikidata (17 billion triples), the vocabulary has over 3 billion elements, with an uncompressed size of 190 GB. For the complete UniProt dataset (94 billion triples), the vocabulary has over 24 billion elements, with an uncompressed size of 1.5 TB.

QLever therefore distinguishes between an internal vocabulary (the part of the map stored in RAM) and an external vocabulary (the part of the map stored on disk).
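The split can be pictured with a small sketch. This is not QLever's actual data structure; the class `TwoTierVocabulary`, the helper `build_demo_vocabulary`, and the convention that the lowest IDs are the internal ones are all assumptions made for illustration.

```python
import os
import tempfile


class TwoTierVocabulary:
    """Toy ID-to-string map: one part lives in RAM, the other in a file on disk."""

    def __init__(self, internal, external_path, external_offsets):
        self.internal = internal                  # list of strings kept in RAM
        self.external_path = external_path        # file holding the external strings
        self.external_offsets = external_offsets  # (start, length) per external entry

    def lookup(self, vocab_id):
        # Assumption for this sketch: IDs 0..len(internal)-1 are internal.
        if vocab_id < len(self.internal):
            return self.internal[vocab_id]
        start, length = self.external_offsets[vocab_id - len(self.internal)]
        # An external lookup costs a disk access, unless the OS has cached the page.
        with open(self.external_path, "rb") as f:
            f.seek(start)
            return f.read(length).decode("utf-8")


def build_demo_vocabulary(internal, external):
    """Write the external entries to a temporary file and record their offsets."""
    offsets, position = [], 0
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for entry in external:
            data = entry.encode("utf-8")
            f.write(data)
            offsets.append((position, len(data)))
            position += len(data)
    return TwoTierVocabulary(internal, path, offsets)
```

In this sketch, a call such as `build_demo_vocabulary(["<a>"], ["<b>"]).lookup(1)` reads the string from the file, while `lookup(0)` is answered from RAM.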

Controlling what goes in which vocabulary

QLever uses several rules to decide what goes into the internal vocabulary and what goes into the external vocabulary. Some of these rules can be configured by the user via the settings.json file (passed to IndexBuilderMain via the -s option); others are currently still hard-coded. Two example settings.json files, one for Wikidata and one for UniProt, are given below. Here is a list of QLever's current rules.

  1. Literals below a fixed length are stored in the internal vocabulary; all longer literals are stored in the external vocabulary. The threshold is currently hard-coded as 1024 bytes in ConstantsIndexBuilding.h.

  2. Literals with an explicit language tag are external by default. They can be made internal via the languages-internal setting; see the Wikidata settings below for an example.

  3. All IRIs or literals matching a certain prefix can be made external using the prefixes-external setting; see the Wikidata and UniProt settings below for examples.

  4. There is currently no way to configure that all objects of a certain predicate are made external. However, this can easily be changed in the code; here is an example for UniProt.
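Rules 1 to 3 can be summarized as a small decision function. This is only a sketch and not QLever's code: the function names are hypothetical, entries are assumed to be written like `<iri>` or `"text"@lang`, and both the order in which the rules are checked and the treatment of a literal of exactly 1024 bytes are assumptions.

```python
MAX_INTERNAL_LITERAL_BYTES = 1024  # threshold named in rule 1


def language_tag(entry):
    """Return the language tag of a literal such as "chat"@fr, or None."""
    if not entry.startswith('"'):
        return None
    suffix = entry[entry.rfind('"') + 1:]
    return suffix[1:] if suffix.startswith("@") else None


def is_external(entry, languages_internal, prefixes_external):
    """Decide whether an IRI or literal would go into the external vocabulary."""
    # Rule 3: anything matching a configured prefix is external.
    if any(entry.startswith(prefix) for prefix in prefixes_external):
        return True
    if entry.startswith('"'):
        # Rule 1: long literals are external (boundary case is an assumption).
        if len(entry.encode("utf-8")) > MAX_INTERNAL_LITERAL_BYTES:
            return True
        # Rule 2: language-tagged literals are external unless the language is listed.
        tag = language_tag(entry)
        if tag is not None and tag not in languages_internal:
            return True
    return False
```

With `languages_internal=["en"]`, for example, the sketch treats `"Douglas Adams"@en` as internal and `"Douglas Adams"@de` as external.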

Performance impact

It can make a huge performance difference whether elements of the vocabulary are internal or external. The exact times depend on which elements are accessed, where exactly they are stored in memory or on disk, and which parts of the external vocabulary are cached by the operating system. For the sake of example, let us assume a SPARQL query that produces a table with 10 million rows and a single column, where each element is an IRI or literal of size 100 bytes (so 1 GB of data in total). Let us also assume a standard PC with an average RAM access time of 100 ns, and two scenarios for the disk: an HDD with an average access time of 5 ms and a throughput of 250 MB / second, and an SSD with an average access time of 5 µs and a throughput of 2 GB / second. In practice, there are then the following five cases:

  1. All elements are in the internal vocabulary. Then the translation of IDs will take around 1 second (100 ns times 10 M plus the time needed for string allocation).

  2. All elements are in the external vocabulary, but that part of the external vocabulary has been accessed before and cached by the OS. Then the translation time will be similar to that in the previous case.

  3. All the elements are in the external vocabulary, but they are right next to each other on disk. Then the translation will take around 4 seconds on HDD (1 GB divided by 250 MB / second) and around 0.5 seconds on SSD (1 GB divided by 2 GB / second).

  4. All the elements are in the external vocabulary, but at very different locations on disk (more than the default page size apart). Then the translation will take around 50,000 seconds (about 14 hours) on HDD (5 ms times 10 M) and around 50 seconds on SSD (5 µs times 10 M).

  5. The previous item describes the worst case. In practice, the elements will not be spread out maximally. The more of them are close to each other on disk, the closer the times will get to the times from Item 3. For example, if groups of ten elements are stored next to each other on disk, the translation will be ten times faster than in the worst case.
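The estimates above follow from two simple formulas: number of accesses times access time for scattered reads, and total size divided by throughput for sequential reads. The following sketch reproduces the arithmetic; the function names are made up for illustration, and the model deliberately ignores string allocation and transfer time within a group.

```python
NUM_ELEMENTS = 10_000_000                    # rows in the example result
ELEMENT_BYTES = 100                          # size of each IRI or literal
TOTAL_BYTES = NUM_ELEMENTS * ELEMENT_BYTES   # 1 GB in total


def random_access_seconds(access_time_seconds, group_size=1):
    """One access per group of neighbouring elements (group_size=1 is the worst case)."""
    return NUM_ELEMENTS / group_size * access_time_seconds


def sequential_seconds(bytes_per_second):
    """All elements are adjacent, so only the throughput matters."""
    return TOTAL_BYTES / bytes_per_second


estimates = {
    "item 1, RAM": random_access_seconds(100e-9),
    "item 3, HDD": sequential_seconds(250e6),
    "item 3, SSD": sequential_seconds(2e9),
    "item 4, HDD": random_access_seconds(5e-3),
    "item 4, SSD": random_access_seconds(5e-6),
    "item 5, HDD, groups of ten": random_access_seconds(5e-3, group_size=10),
}

for name, seconds in estimates.items():
    print(f"{name}: about {seconds:.1f} s")
```

Changing `group_size` shows how quickly locality on disk pulls the worst case of Item 4 toward the sequential times of Item 3.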

Example configurations

Here is an example settings.json for Wikidata:

{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://www.wikidata.org/entity/statement",
    "<http://www.wikidata.org/value",
    "<http://www.wikidata.org/reference"
  ],
  "locale": {
    "language": "en",
    "country": "US",
    "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-partial-vocab": 50000000
}

Here is an example settings.json for UniProt:

{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://purl.uniprot.org/uniprot/",
    "<http://purl.uniprot.org/uniparc/",
    "<http://purl.uniprot.org/uniref/",
    "<http://purl.uniprot.org/isoforms/",
    "<http://purl.uniprot.org/range/",
    "<http://purl.uniprot.org/position/",
    "<http://purl.uniprot.org/refseq/",
    "<http://purl.uniprot.org/embl-cds/",
    "<http://purl.uniprot.org/EMBL",
    "<http://purl.uniprot.org/PATRIC",
    "<http://purl.uniprot.org/SEED",
    "<http://purl.uniprot.org/gi",
    "<http://rdf.ebi.ac.uk/resource",
    "<http://purl.uniprot.org/SHA-384"
  ],
  "locale": {
    "language": "en",
    "country": "US",
    "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-partial-vocab": 20000000
}
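Since building an index for datasets of this size takes a long time, it can be worth checking that a settings.json file is well-formed before starting. The following sketch uses only Python's standard json module; the helper `check_settings` is hypothetical, and the expected types are inferred from the two examples above rather than taken from QLever's documentation.

```python
import json

# Value types as they appear in the example files (an assumption, not a specification).
EXPECTED_TYPES = {
    "languages-internal": list,
    "prefixes-external": list,
    "locale": dict,
    "ascii-prefixes-only": bool,
    "num-triples-per-partial-vocab": int,
}


def check_settings(text):
    """Return a list of problems found in the given settings.json text."""
    try:
        settings = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error.msg} (line {error.lineno})"]
    if not isinstance(settings, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        # Keys that are absent are skipped; other keys are ignored.
        if key in settings and not isinstance(settings[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems
```

An empty list means the sketch found nothing to complain about; it does not guarantee that IndexBuilderMain accepts the file.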