# Working with OpenCitations Meta provenance data with RDFlib

This primer provides a guide on how to access, parse, and query the OpenCitations Meta provenance data dump.

Provenance data for OC Meta is currently provided only as RDF files (in JSON-LD format). Because of the large size of the dataset, it isn't (yet) stored in a triplestore, nor is it accessible via a SPARQL endpoint or REST API. Nonetheless, we might sometimes need to query it for a subset of entities, or even for single entities. This document illustrates how to work with the RDF dump to create a smaller, easily searchable RDF graph storing the provenance information for selected entities.

As a use case for demonstration, I will use version 7 of the dataset: [https://doi.org/10.6084/m9.figshare.21747536.v7](https://doi.org/10.6084/m9.figshare.21747536.v7). 

## Downloading the data and accessing the files

The OC Meta dump is published as a record on Figshare, comprising 5 files. Each file stores metadata and provenance information as RDF (in JSON-LD format) for a specific type of entity among the ones used in the OCDM (and therefore in OC Meta). The 5 entity types are:

* bibliographic resource (corresponding to http://purl.org/spar/fabio/Expression), in file **"br"**;
* identifier (corresponding to http://purl.org/spar/datacite/Identifier), in file **"id"**;
* agent role (corresponding to http://purl.org/spar/pro/RoleInTime), in file **"ar"**;
* responsible agent (corresponding to http://xmlns.com/foaf/0.1/Agent), in file **"ra"**;
* resource embodiment (corresponding to http://purl.org/spar/fabio/Manifestation), in file **"re"**.

Each of these files is a [tarball](https://en.wikipedia.org/wiki/Tar_(computing)) (an archive with `.tar.gz` extension that groups multiple files and directories into a single file, maintaining the directory structure). The JSON-LD files inside each tarball are in turn compressed as zip archives. Each tarball is very large: e.g. the "br" file taken as an example here is nearly 12 GB compressed and nearly 30 GB decompressed, so make sure you have enough space on your drive to decompress and store the files you want to process.

Here we consider the metadata and provenance concerning the bibliographic resource entities (`fabio:Expression`), stored in the "br.tar.gz" tarball. The following steps should also be reproducible for the other tarballs.

1. Download the tarball you are interested in processing from [Figshare](https://figshare.com/). As mentioned, here we use "br.tar.gz" from record https://doi.org/10.6084/m9.figshare.21747536.v7.
   
2. Decompress and extract the content of the tarball archive in a directory of your choice. This might take a while (it took around 1h30min on a Windows machine with an Intel i7-8565U CPU). I used Windows' File Explorer interface (which should integrate [libarchive](https://www.libarchive.org/)). In theory, you could also work directly on the compressed files in Python, but the time needed to access a single file makes this hugely impractical whenever we need to access multiple files.
   
3. The uncompressed directory has a complex structure. As mentioned in the Figshare record page:
   > The inner folders are named through the supplier prefix of the contained entities. It is a prefix that allows you to recognize the entity membership index (e.g., OpenCitations Meta corresponds to 06*0).
   >
   > After that, the folders have numeric names, which refer to the range of contained entities. For example, the 10000 folder contains entities from 1 to 10000. Inside, you can find the zipped RDF data. At the same level, additional folders containing the provenance are named with the same criteria already seen. Then, the 1000 folder includes the provenance of the entities from 1 to 1000. The provenance is located inside a folder called prov, also in zipped JSON-LD format.
   >
   > For example, data related to the entity is located in the folder /br/06250/10000/1000/1000.zip, while information about provenance in /br/06250/10000/1000/prov/1000.zip.
   >
   > The JSON-LD files inside the archives are further compressed using the zip algorithm. It is recommended to process these inner files as compressed without extracting them, to manage data more efficiently.

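If you prefer to script the extraction step instead of using a file manager, a minimal sketch with Python's standard `tarfile` module could look like the following. The function name `extract_tarball` and the example paths are hypothetical; the sketch assumes the tarball is a regular gzip-compressed tar archive and that you have enough disk space for the extracted content.

```python
import tarfile
from pathlib import Path


def extract_tarball(tarball_path: str, dest_dir: str) -> Path:
    """Extract a .tar.gz archive into dest_dir and return the destination path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball_path, mode="r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters: reject unsafe archive members
            tar.extractall(dest, filter="data")
        else:
            # Older Python versions without extraction filters
            tar.extractall(dest)
    return dest


# Hypothetical paths, adjust them to your setup:
# extract_tarball("E:/downloads/br.tar.gz", "E:/br_test")
```

Only the outer tarball is extracted here: the inner zip files are left compressed, as recommended in the Figshare record.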


## Accessing the provenance graphs for specific entities



Each of the zipped JSON-LD files contains named graphs. Each named graph represents the provenance information available for a given entity, and the graph IRI, i.e. the value of the "@id" field in the JSON-LD, always follows this pattern: <IRI of the entity to which the provenance graph refers> + "/prov/".
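
As a small illustration of the relation between an entity IRI and the IRI of its provenance graph, the two helpers below convert one into the other. The function names are hypothetical and not part of any library; the sketch only assumes that the graph IRI is the entity IRI followed by "/prov/".

```python
PROV_SUFFIX = "/prov/"


def provenance_graph_iri(entity_iri: str) -> str:
    """Build the IRI of the provenance named graph from the entity IRI."""
    return entity_iri.rstrip("/") + PROV_SUFFIX


def entity_iri_from_graph(graph_iri: str) -> str:
    """Recover the entity IRI from the IRI of its provenance named graph."""
    if not graph_iri.endswith(PROV_SUFFIX):
        raise ValueError(f"Not a provenance graph IRI: {graph_iri}")
    return graph_iri[: -len(PROV_SUFFIX)]


print(provenance_graph_iri("https://w3id.org/oc/meta/br/062708702"))
# → https://w3id.org/oc/meta/br/062708702/prov/
```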

Given the large size of the dataset, searching for the provenance graph of a given entity (a bibliographic resource in this case) by iterating over all the files in the directories would take too long. Luckily, by examining the structure of an entity's OMID we can infer the exact path and name of the file where its provenance graph is stored and access it directly. To this end, we can use the `get_provenance_graph()` function in the `src` module.

```python
import json
import os
from typing import Optional
from zipfile import ZipFile


def get_provenance_graph(entity_iri: str, data_root: str) -> Optional[dict]:
    """
    Uses the entity's IRI (i.e. its OMID) to find the exact path of the file
    storing its provenance graph in a subdirectory of data_root.
    Then, it reads the file and returns the provenance graph as a dictionary.

    param entity_iri: The IRI of the entity whose provenance graph is to be retrieved.
    param data_root: The path to the root directory storing the provenance data, i.e. the folder resulting from decompression of a .tar.gz file.
    return: The provenance graph of the entity as a dictionary, or None if it is not found.
    """
    digits = entity_iri.split('/')[-1]
    # The supplier prefix spans from the leading '0' to the next '0' included (e.g. '06150')
    supplier_prefix = digits[:digits.find('0', 1) + 1]
    sequential_number = int(digits.removeprefix(supplier_prefix))  # str.removeprefix needs Python 3.9+

    def find_range_dir(parent: str) -> Optional[str]:
        # Each folder is named after the upper bound of the range it holds
        # (e.g. folder 1000 holds entities 1 to 1000), hence the <= comparison.
        names = sorted((d for d in os.listdir(parent) if d.isdigit()), key=int)
        match = next((d for d in names if sequential_number <= int(d)), None)
        return os.path.join(parent, match) if match else None

    prefix_dir = os.path.join(data_root, supplier_prefix)
    if not os.path.isdir(prefix_dir):
        return None
    outer_dir = find_range_dir(prefix_dir)
    inner_dir = find_range_dir(outer_dir) if outer_dir else None
    if inner_dir is None:
        return None

    # Read the zipped JSON-LD file directly, without extracting it
    with ZipFile(os.path.join(inner_dir, 'prov', 'se.zip')) as archive:
        with archive.open('se.json') as f:
            data: list = json.load(f)
    for obj in data:
        if obj['@id'] == entity_iri + '/prov/':
            return obj
    return None
```
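
To make the path inference more concrete, the sketch below computes the expected location of the provenance file arithmetically, without touching the file system. It assumes that the two folder levels group entities in ranges of 10000 and 1000 (as in the Figshare description) and that the provenance archive is named `se.zip` (as in the function above); the function name `expected_prov_path` is hypothetical. The function above reads the folder names from disk instead, so it does not depend on these sizes.

```python
from pathlib import PurePosixPath


def expected_prov_path(entity_iri: str, outer_size: int = 10000,
                       inner_size: int = 1000) -> PurePosixPath:
    """Compute the relative path where the provenance of an entity is expected."""
    digits = entity_iri.rstrip("/").split("/")[-1]
    # Supplier prefix: from the leading '0' to the next '0' included
    supplier_prefix = digits[: digits.find("0", 1) + 1]
    number = int(digits[len(supplier_prefix):])
    # Round up to the upper bound of the range containing the entity
    outer = -(-number // outer_size) * outer_size
    inner = -(-number // inner_size) * inner_size
    return PurePosixPath(supplier_prefix, str(outer), str(inner), "prov", "se.zip")


print(expected_prov_path("https://w3id.org/oc/meta/br/061503302006"))
# → 06150/3310000/3303000/prov/se.zip
print(expected_prov_path("https://w3id.org/oc/meta/br/069301323"))
# → 06930/10000/2000/prov/se.zip
```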

As a demo, we can retrieve the RDF provenance of 3 randomly selected bibliographic resources: https://w3id.org/oc/meta/br/061503302006,     
https://w3id.org/oc/meta/br/069301323, and https://w3id.org/oc/meta/br/062708702.

In [None]:
from src import get_provenance_graph
from pprint import pprint

brs = ['https://w3id.org/oc/meta/br/061503302006', 'https://w3id.org/oc/meta/br/069301323', 'https://w3id.org/oc/meta/br/062708702']
root_dir = 'E:/br_test/br' # path to the root directory resulting from the decompression of the .tar.gz file

provenance_graphs = []
for iri in brs:
    prov_graph = get_provenance_graph(iri, root_dir)
    provenance_graphs.append(prov_graph)

print(f'Provenance graph for {brs[0]}⤵️\n')
pprint(provenance_graphs[0])



## Managing the RDF provenance graphs with RDFlib


We can use the `rdflib` Python library to deal with the RDF graphs. 
RDFLib provides a specific class for dealing with quads/[named graphs](https://www.w3.org/TR/rdf12-concepts/#dfn-named-graph): the `Dataset` class, which implements the [RDF 1.1 Dataset concept](https://www.w3.org/TR/rdf11-datasets/). One could also use the `rdflib.ConjunctiveGraph` class to read quads, but it is best to use `rdflib.Dataset`, since `ConjunctiveGraph` is deprecated.

<div class="alert alert-block alert-info">
Make sure you know the basic definitions of RDF graph, named graph, dataset, etc. It might be useful to review the W3C Recommendation explaining the core concepts: https://www.w3.org/TR/rdf11-concepts/.</div>

The following cell shows how to use an `rdflib.Dataset` object as an interface for the provenance graphs we retrieved earlier. ⚠️ Remember to set the `default_union` flag to `True` when instantiating the `Dataset` object, so that the default graph of the dataset behaves as the union of all the named graphs. This flatter structure allows us to query the *triples* via SPARQL without specifying a graph's IRI.

In [None]:
from rdflib import Dataset, URIRef
from rdflib.namespace import Namespace
import json


# Instantiate a Dataset object
ds = Dataset(default_union=True) # default_union=True is required to merge the named graphs in a single graph, i.e. the default graph

# Parse the provenance graphs into the dataset
prov_data = json.dumps(provenance_graphs) # convert the list of provenance graphs to a JSON string
ds.parse(data=prov_data, format='json-ld') # Dataset.parse() can also be directly passed a filepath (e.g. 'se.json') instead of the data parameter

# Optionally define prefixes for vocabularies that are not automatically recognized by rdflib
OC = Namespace('https://w3id.org/oc/ontology/') # Define the OC namespace
ds.namespace_manager.bind("oc", OC) # Bind the OC namespace to the prefix 'oc' in the ds.namespace_manager

# SERIALISATION
# ❗ Tip: Use TriG format to represent quads and also using prefixes (N-quads does not support prefixes).
# If 'destination' parameter is specified, the serialised data will be directly written to the file at the specified path.
serialisation = ds.serialize(format='trig')
print(serialisation)

## Basic operations with RDFlib

In [None]:
# ALL TRIPLES IN DEFAULT GRAPH (UNION OF ALL NAMED GRAPHS)
print("\nAll triples in the default graph:")
for t in ds.triples((None, None, None)):
    print(t)

# to print plain IRI/literal strings instead of the tuple representation, unpack each triple:
print("\nAll triples in the default graph (IRI strings):")
for s, p, o in ds.triples((None, None, None)):
    print(s,p,o)

In [None]:
# ALL QUADS IN THE DATASET (IN ALL NAMED GRAPHS)
print("\nAll quads in the dataset:")
for q in ds.quads((None, None, None, None)):
    print(q)

# as for triples, you can unpack the quads when iterating over them. 
# Also, you can filter quads (and triples) by specifying one or more values as arguments to the quads()/triples() method (see cell below).

In [None]:
# Get all the subjects and objects of http://purl.org/dc/terms/description

# SOLUTION 1: Using triples()
print("\n SOLUTION 1) Subjects and objects of http://purl.org/dc/terms/description:")
for s, p, o in ds.triples((None, URIRef('http://purl.org/dc/terms/description'), None)): # filter by predicate
    print(s, 'dc:description', o)

# SOLUTION 2: alternatively, you can directly use the subject_objects() method
print("\n SOLUTION 2) Using subject_objects():")
for s,o in ds.subject_objects(URIRef('http://purl.org/dc/terms/description')):
    print(s, 'dc:description', o)

In [None]:
# GRAPH NAMES (IRI OF NAMED GRAPHS)
print("\nNamed graphs IRIs:")
for g in ds.contexts():
    print(g.identifier) # prints the graph name

In [None]:
# QUERYING NAMED GRAPHS

# Pattern to query all quads in the dataset
query_all = """
SELECT ?s ?p ?o ?g
WHERE {
  GRAPH ?g {
    ?s ?p ?o.
  }
}
"""
print("\nAll Quads (graph IRIs and their triples):")
for s,p,o,g in ds.query(query_all):
    print(s,p,o,g) # s,p,o are the triple pattern, g is the graph name


# Pattern (example) for querying a specific named graph
query_named = """
SELECT ?s ?p ?o
WHERE {
  GRAPH <https://w3id.org/oc/meta/br/062708702/prov/> {
    ?s ?p ?o.
  }
}
"""
print("\nNamed Graph (https://w3id.org/oc/meta/br/062708702/prov/):")
for s,p,o in ds.query(query_named):
    print(s,p,o)

In [None]:
# SPARQL Query across the DEFAULT (union) GRAPH
query = """
SELECT ?s ?o
WHERE {
  ?s prov:hadPrimarySource ?o .
}
"""
## N.B. We can omit the prefix declaration for the namespaces that are present in the dataset namespace manager.
# for n in ds.namespace_manager.namespaces():
#     print(n)

# Execute the query on the union graph
print('\nExample SPARQL query results:')
query_res = ds.query(query)
for result_subj, result_obj in query_res:
    print(result_subj, 'prov:hadPrimarySource', result_obj)

# Very important

To read quads (named graphs) as a single graph while keeping the pairing between each graph and its IRI, remember to set the `default_union` parameter to `True` when instantiating the `Dataset`.

In fact, this gives you, with `Dataset`, an object equivalent to what you would get with `ConjunctiveGraph`, but without using the latter class, which is deprecated. In practice, it ***merges*** all the triples of the named graphs into a single graph (the default graph), just as `ConjunctiveGraph` would do by default. This allows you to run SPARQL queries without having to name the graphs to search in, while keeping the ability to iterate over the quads in the dataset (with the `Dataset.quads()` generator).

```python
from rdflib import Dataset
from rdflib import URIRef, Namespace

# Sample RDF data
rdf_data = """
[
    {
        "@graph": [
            {
                "@id": "https://w3id.org/oc/meta/br/06150340286/prov/se/1",
                "@type": [
                    "http://www.w3.org/ns/prov#Entity"
                ],
                "http://purl.org/dc/terms/description": [
                    {
                        "@value": "The entity 'https://w3id.org/oc/meta/br/06150340286' has been created."
                    }
                ],
                "http://www.w3.org/ns/prov#generatedAtTime": [
                    {
                        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
                        "@value": "2023-12-13T14:51:10.332887"
                    },
                    {
                        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
                        "@value": "2024-03-28T00:59:42+00:00"
                    }
                ],
                "http://www.w3.org/ns/prov#hadPrimarySource": [
                    {
                        "@id": "https://openalex.s3.amazonaws.com/browse.html"
                    }
                ],
                "http://www.w3.org/ns/prov#specializationOf": [
                    {
                        "@id": "https://w3id.org/oc/meta/br/06150340286"
                    }
                ],
                "http://www.w3.org/ns/prov#wasAttributedTo": [
                    {
                        "@id": "https://w3id.org/oc/meta/prov/pa/1"
                    }
                ]
            }
        ],
        "@id": "https://w3id.org/oc/meta/br/06150340286/prov/"
    },
    {
        "@graph": [
            {
                "@id": "https://w3id.org/oc/meta/br/06150340460/prov/se/1",
                "@type": [
                    "http://www.w3.org/ns/prov#Entity"
                ],
                "http://purl.org/dc/terms/description": [
                    {
                        "@value": "The entity 'https://w3id.org/oc/meta/br/06150340460' has been created."
                    }
                ],
                "http://www.w3.org/ns/prov#generatedAtTime": [
                    {
                        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
                        "@value": "2023-12-13T14:51:10.661001"
                    },
                    {
                        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
                        "@value": "2024-03-28T21:46:36+00:00"
                    }
                ],
                "http://www.w3.org/ns/prov#hadPrimarySource": [
                    {
                        "@id": "https://openalex.s3.amazonaws.com/browse.html"
                    }
                ],
                "http://www.w3.org/ns/prov#specializationOf": [
                    {
                        "@id": "https://w3id.org/oc/meta/br/06150340460"
                    }
                ],
                "http://www.w3.org/ns/prov#wasAttributedTo": [
                    {
                        "@id": "https://w3id.org/oc/meta/prov/pa/1"
                    }
                ]
            }
        ],
        "@id": "https://w3id.org/oc/meta/br/06150340460/prov/"
    },
    {
        "@graph": [
            {
                "@id": "https://w3id.org/oc/meta/br/06150340250/prov/se/1",
                "@type": [
                    "http://www.w3.org/ns/prov#Entity"
                ],
                "http://purl.org/dc/terms/description": [
                    {
                        "@value": "The entity 'https://w3id.org/oc/meta/br/06150340250' has been created."
                    }
                ],
                "http://www.w3.org/ns/prov#generatedAtTime": [
                    {
                        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
                        "@value": "2023-12-13T14:51:10.600630"
                    },
                    {
                        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
                        "@value": "2024-03-28T00:59:42+00:00"
                    }
                ],
                "http://www.w3.org/ns/prov#hadPrimarySource": [
                    {
                        "@id": "https://openalex.s3.amazonaws.com/browse.html"
                    }
                ],
                "http://www.w3.org/ns/prov#specializationOf": [
                    {
                        "@id": "https://w3id.org/oc/meta/br/06150340250"
                    }
                ],
                "http://www.w3.org/ns/prov#wasAttributedTo": [
                    {
                        "@id": "https://w3id.org/oc/meta/prov/pa/1"
                    }
                ]
            }
        ],
        "@id": "https://w3id.org/oc/meta/br/06150340250/prov/"
    }
]
"""

# Create a Dataset, define the namespaces prefixes (optional), and parse the data
dataset = Dataset(default_union=True) # default_union=True merges all named graphs into the default graph (producing a ConjunctiveGraph-like object)

OC = Namespace("https://w3id.org/oc/ontology/")
dataset.namespace_manager.bind("oc", OC)  # Binds 'oc' as the prefix for the OC namespace in the dataset namespace manager

dataset.parse(data=rdf_data, format="json-ld")


# SPARQL Query across the union graph
query = """
SELECT ?s ?o
WHERE {
  ?s prov:specializationOf ?o .
}
"""

# Execute the query on the union graph
print('\nExample SPARQL query results:')
query_res = dataset.query(query)
for result_subj, result_obj in query_res:
    print(result_subj, 'prov:specializationOf', result_obj)

# TRIPLES IN DEFAULT GRAPH (UNION OF ALL NAMED GRAPHS)
print("\nTriples in the default graph:")
for t in dataset.triples((None, None, None)):
    print(t)

# QUADS IN THE DATASET (ALL NAMED GRAPHS)
print("\nQuads in the dataset:")
for q in dataset.quads((None, None, None, None)):
    print(q)

# GRAPH NAMES (IRI OF NAMED GRAPHS)
print("\nNamed graphs IRIs:")
for g in dataset.contexts():
    print(g.identifier) # prints the graph name


# SERIALIZATION
print("\nSerialization in TriG:")
print(dataset.serialize(format='trig')) # serialize the dataset in TriG format (supports named graphs and prefixes)
```