There are two kinds of identifiers in Blazegraph: external identifiers, which correspond to RDF Values (IRIs, blank nodes, and Literals), and Internal Values (IVs). Internal Values are generated by a number of different mechanisms, but the specific mechanisms in use must remain stable across the life cycle of a given triple or quad store instance. The typical mechanisms are:
- Dictionary coding (the TERM2ID, ID2TERM, and BLOBS indices, which encode the RDF Value into a TermId or BlobIV).
- Vocabulary declarations (which encode the RDF Value into 2-3 bytes).
- Inlining of numerical XSD data types (including fixed length types such as xsd:int, xsd:float, xsd:long, etc. as well as variable length types such as xsd:integer and xsd:decimal).
- Inlining of small XSD non-numeric types.
- Inlining of blank nodes.
- Inlining according to application specific logic.
There are several major advantages to inlining:
- Inlined IVs make it possible to recover the external form of the RDF Value without a dictionary lookup against an index. This is a huge performance gain when it comes time to externalize the results of a query.
- Inlined IVs allow FILTERS based on comparisons in the value space (other than equality and inequality) to be evaluated directly against the inline IV (rather than doing a JOIN against the dictionary indices).
- Inlined IVs do not need to be stored in the dictionary indices. This reduces the size of those indices on disk and reduces the I/O wait associated with updating the dictionary indices.
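The difference between an inlined value and a dictionary-coded value can be illustrated with a small toy model. The sketch below is not Blazegraph's actual IV encoding: the class `InlineSketch`, its methods, and the in-memory `ID2TERM` map are hypothetical names invented for illustration. It only shows the idea that an inlined xsd:int carries its value in the key bytes, so it can be decoded and compared without consulting an index, whereas a dictionary-coded identifier must first be resolved.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: this is NOT Blazegraph's actual IV encoding.
class InlineSketch {

    // Hypothetical inline form: the xsd:int value is carried in the key bytes.
    static byte[] inlineInt(int value) {
        // Flipping the sign bit makes unsigned byte order match numeric order.
        return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
    }

    // Recover the value from the key alone; no index is consulted.
    static int decodeInline(byte[] key) {
        return ByteBuffer.wrap(key).getInt() ^ Integer.MIN_VALUE;
    }

    // Compare two keys as unsigned byte strings, the way an index would order them.
    static int compareKeys(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    // Hypothetical in-memory stand-in for a reverse dictionary index.
    static final Map<Long, String> ID2TERM = new HashMap<>();
    static long nextId = 1;

    // Dictionary coding: the identifier is opaque and says nothing about the value.
    static long dictionaryEncode(String lexicalForm) {
        long id = nextId++;
        ID2TERM.put(id, lexicalForm);
        return id;
    }

    // A dictionary-coded value must be looked up before it can be compared.
    static int resolveInt(long id) {
        return Integer.parseInt(ID2TERM.get(id));
    }

    public static void main(String[] args) {
        byte[] forty = inlineInt(40);
        byte[] fortyTwo = inlineInt(42);
        // A range comparison such as (?x > 40) works directly on the inline keys.
        System.out.println(compareKeys(fortyTwo, forty) > 0);
        // Externalizing the value needs no lookup.
        System.out.println(decodeInline(fortyTwo));
        // The dictionary-coded path needs a lookup before the comparison.
        long id = dictionaryEncode("42");
        System.out.println(resolveInt(id) > 40);
    }
}
```

In this toy model the inline path never touches `ID2TERM`, which mirrors the three advantages listed above: no lookup when externalizing, value-space comparisons on the key itself, and no dictionary entry to store or update.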
Configuring the Inlining Behavior
For the most part, inlining is enabled by default. The LexiconConfiguration class is responsible for decisions about what can and cannot be inlined. Those decisions are made based on the AbstractTripleStore.Options.
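As a rough sketch of what such a configuration might look like, the following builds a `java.util.Properties` object with a few inlining-related keys. The key names are assumptions: they follow the pattern of the `AbstractTripleStore` class name plus an option suffix, and the suffixes shown (`inlineXSDDatatypeLiterals`, `inlineBNodes`, `inlineDateTimes`) should be checked against the AbstractTripleStore.Options documentation for your release. The class `InlineOptionsSketch` is a hypothetical helper, not part of Blazegraph.

```java
import java.util.Properties;

// Hypothetical helper; the option key names below are assumptions to verify
// against AbstractTripleStore.Options in your Blazegraph release.
class InlineOptionsSketch {

    // Assumed prefix: option keys are the AbstractTripleStore class name plus a suffix.
    static final String PREFIX = "com.bigdata.rdf.store.AbstractTripleStore.";

    static Properties inlineOptions() {
        Properties p = new Properties();
        // Assumed key: inline XSD datatype literals such as xsd:int or xsd:long.
        p.setProperty(PREFIX + "inlineXSDDatatypeLiterals", "true");
        // Assumed key: inline blank nodes.
        p.setProperty(PREFIX + "inlineBNodes", "true");
        // Assumed key: inline xsd:dateTime values.
        p.setProperty(PREFIX + "inlineDateTimes", "true");
        return p;
    }

    public static void main(String[] args) {
        // Print the properties that would be supplied when the store is created.
        inlineOptions().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Because the inlining mechanisms must stay stable for the life of a store instance, properties like these would be supplied when the triple or quad store is created rather than changed afterwards.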
Specifying a VocabularyClass
Blazegraph is capable of inlining the IRIs for pre-declared vocabulary items. The inlined IRIs occupy 2-3 bytes in the statement indices, making them very compact. Further, since these vocabulary items are pre-declared, they can be decoded without reference to the dictionary indices. Thus they incur no dictionary lookup overhead when externalizing RDF Value objects from inline values.
Blazegraph uses a VocabularyClass by default. If you use other ontologies, you may want to extend an existing bigdata Vocabulary to also specify your own Vocabulary. One good extension point is com.bigdata.rdf.vocab.RDFSVocabulary. This class provides the IRI declarations for RDF, RDFS, OWL, FOAF, SKOS, Dublin Core, XML Schema and openrdf.
Vocabulary classes determine how external IRIs are mapped into internal values. This mapping MUST be stable. Thus you MUST version your Vocabulary class (by creating a new implementation class) each time you modify it. Otherwise you risk having an inconsistent encoding of some IRIs through the dictionary indices and the vocabulary class. This would result in a failure to correctly query the data involving those IRIs.
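The reason for this rule can be seen in a toy model. The sketch below does not use Blazegraph's vocabulary API: `ToyVocabulary` is a hypothetical class in which an IRI's compact code is simply its position in the declaration list, and `http://example.org/myProp` is a made-up IRI. Under that assumption, editing a vocabulary in place shifts the codes, so a code already written to the statement indices decodes to a different IRI.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; this is not Blazegraph's vocabulary implementation.
class VocabularyStabilitySketch {

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";
    static final String MY_PROP = "http://example.org/myProp"; // hypothetical IRI

    // Hypothetical vocabulary: an IRI's code is its declaration position.
    static class ToyVocabulary {
        private final List<String> decls = new ArrayList<>();

        ToyVocabulary(String... iris) {
            for (String iri : iris) {
                decls.add(iri);
            }
        }

        int encode(String iri) {
            return decls.indexOf(iri);
        }

        String decode(int code) {
            return decls.get(code);
        }
    }

    public static void main(String[] args) {
        // Version 1 of the vocabulary is used to load data.
        ToyVocabulary v1 = new ToyVocabulary(RDF_TYPE, RDFS_LABEL);
        int storedCode = v1.encode(RDFS_LABEL); // written into the statement indices

        // The same class is then edited in place: a declaration is inserted in the middle.
        ToyVocabulary edited = new ToyVocabulary(RDF_TYPE, MY_PROP, RDFS_LABEL);

        // The stored code no longer decodes to the IRI that was loaded.
        System.out.println("stored as: " + v1.decode(storedCode));
        System.out.println("now reads: " + edited.decode(storedCode));
    }
}
```

Creating a new implementation class for each change leaves the original class, and therefore the original mapping, untouched for every store that was created with it.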
If you do define your own vocabulary class, you specify it in the configuration properties used when creating a new triple or quad store.
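A minimal sketch of that configuration step, using only `java.util.Properties`, might look like the following. The property key is an assumption (the `AbstractTripleStore` class name followed by `vocabularyClass`) and should be confirmed against AbstractTripleStore.Options; `com.example.vocab.MyVocabularyV1` is a hypothetical class name standing in for your own versioned vocabulary, and `VocabularyConfigSketch` is a hypothetical helper.

```java
import java.util.Properties;

// Hypothetical helper; verify the key against AbstractTripleStore.Options.
class VocabularyConfigSketch {

    // Assumed property key for selecting the vocabulary class.
    static final String VOCABULARY_CLASS_KEY =
            "com.bigdata.rdf.store.AbstractTripleStore.vocabularyClass";

    static Properties withVocabulary(String vocabularyClassName) {
        Properties p = new Properties();
        // The value is the fully qualified name of the vocabulary implementation.
        p.setProperty(VOCABULARY_CLASS_KEY, vocabularyClassName);
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical versioned vocabulary class name.
        Properties p = withVocabulary("com.example.vocab.MyVocabularyV1");
        System.out.println(VOCABULARY_CLASS_KEY + "=" + p.getProperty(VOCABULARY_CLASS_KEY));
    }
}
```

Putting a version suffix in the class name, as in the hypothetical `MyVocabularyV1`, is one way to make it visible which mapping a given store was created with.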
Example Custom Vocabulary
You can see an example of a custom vocabulary for the PubChem data set on GitHub.