# Configuring and populating a graph database

In this tutorial, we show how to use RDF and Blazegraph to create a graph database using Python.

## What is RDF

The [Resource Description Framework (RDF)](https://en.wikipedia.org/wiki/Resource_Description_Framework) is a high-level data model (sometimes improperly called a "language") based on *subject-predicate-object* triples called statements. For instance, a simple natural language sentence such as <span style="color: red">*Umberto Eco is author of The name of the rose*</span> can be expressed through an RDF statement by assigning to:

* <span style="color: red">*Umberto Eco*</span> the role of subject;
* <span style="color: red">*is author of*</span> the role of predicate;
* <span style="color: red">*The name of the rose*</span> the role of object.

The main entities comprising RDF are listed as follows.


### Resources

A *resource* is an object we want to talk about, and it is identified by an [IRI](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier). IRIs are the most generic class of Internet identifiers for resources, but often [HTTP URLs](https://en.wikipedia.org/wiki/URL) are used instead, which may be considered a subclass of IRIs (e.g. the [URL `http://www.wikidata.org/entity/Q12807`](http://www.wikidata.org/entity/Q12807) identifies Umberto Eco in [Wikidata](https://wikidata.org)).


### Properties

A *property* is a special type of resource, since it is used to describe a relation between resources, and it is identified by an IRI (e.g. the [URL `http://www.wikidata.org/prop/direct/P800`](http://www.wikidata.org/entity/P800) identifies the property *has notable work*, which mimics the <span style="color: red">*is author of*</span> predicate of the statement above).


### Statements

*Statements* enable one to assert properties between resources. Each statement is a subject-predicate-object triple, where the subject is a resource, the predicate is a property, and the object is either a resource or a literal (i.e. a simple value, such as a string or a number).

Several notations can be used to represent RDF statements in plain text files. The simplest (and most verbose) one is called [N-Triples](https://en.wikipedia.org/wiki/N-Triples). It allows one to define statements according to the following syntax:

```
# 1) statement with a resource as an object
<IRI subject> <IRI predicate> <IRI object> .

# 2) statement with a literal as an object
<IRI subject> <IRI predicate> "literal value"^^<IRI type of value> .
```

Type (1) statements must be used to state relationships between resources, while type (2) statements are generally used to associate attributes with a specific resource (the IRI defining the type of value is not specified for generic literals, i.e. strings). For instance, in Wikidata, the exemplar sentence above (<span style="color: red">*Umberto Eco is author of The name of the rose*</span>) is defined by three distinct RDF statements:

```
<http://www.wikidata.org/entity/Q12807> <http://www.w3.org/2000/01/rdf-schema#label> "Umberto Eco" .

<http://www.wikidata.org/entity/Q172850> <http://www.w3.org/2000/01/rdf-schema#label> "The Name of the Rose" .

<http://www.wikidata.org/entity/Q12807> <http://www.wikidata.org/prop/direct/P800> <http://www.wikidata.org/entity/Q172850> .
```

Strictly speaking, the relation described by the natural language sentence is defined by the third RDF statement above. The two additional statements associate the resources referring to *Umberto Eco* and *The name of the rose* with the strings representing their names. Be aware: literals (i.e. simple values) cannot be subjects in any statement.
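To make the N-Triples syntax concrete, here is a minimal sketch that builds the three statements above from plain Python strings. The helpers `iri`, `literal` and `statement` are hypothetical names introduced only for this illustration (they do not belong to any library), and the escaping is deliberately simplified: it handles only backslashes and double quotes.

```python
# Hypothetical helpers (not part of any library) that format
# N-Triples lines from plain Python strings.

def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value, datatype=None):
    """Quote a literal, escaping backslashes and double quotes only."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    text = '"' + escaped + '"'
    if datatype is not None:
        # Typed literals append ^^ followed by the IRI of the datatype
        text += "^^" + iri(datatype)
    return text


def statement(subject, predicate, obj):
    """Join three already-formatted terms into one N-Triples line."""
    return subject + " " + predicate + " " + obj + " ."


eco = "http://www.wikidata.org/entity/Q12807"
rose = "http://www.wikidata.org/entity/Q172850"
label = "http://www.w3.org/2000/01/rdf-schema#label"
notable_work = "http://www.wikidata.org/prop/direct/P800"

lines = [
    statement(iri(eco), iri(label), literal("Umberto Eco")),
    statement(iri(rose), iri(label), literal("The Name of the Rose")),
    statement(iri(eco), iri(notable_work), iri(rose)),
]
print("\n".join(lines))
```

In practice a library should take care of this serialisation; the sketch only shows where the angle brackets, quotes and final dot come from.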


### A special property

While all the properties you can use in your statements as predicates can be defined in several distinct vocabularies (the [Wikidata data model](https://www.wikidata.org/wiki/Wikidata:List_of_properties), [schema.org data model](https://schema.org/docs/datamodel.html), etc.), RDF defines a special property that is used to associate a resource with its intended type (e.g. another resource representing a class of resources). The IRI of this property is `http://www.w3.org/1999/02/22-rdf-syntax-ns#type`. For instance, we can use this property to assign the appropriate type of object to the two entities defined in the excerpt above, i.e. those referring to *Umberto Eco* and *The name of the rose*, as shown below:

```
<http://www.wikidata.org/entity/Q12807> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Person> .

<http://www.wikidata.org/entity/Q172850> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Book> .
```

In the example above, we reuse two existing classes of resources included in [schema.org](https://schema.org) for people and books. It is worth mentioning that an existing resource can be associated via `http://www.w3.org/1999/02/22-rdf-syntax-ns#type` to one or more types, if they apply.


### RDF Graphs

An *RDF Graph* is a set of RDF statements. For instance, a file that contains RDF statements represents an RDF graph, and the same IRI contained in different graphs actually refers to the same resource.

We talk about graphs in this context because all the RDF statements, and the resources they include, actually define a directed graph structure, where the directed edges are labelled with the predicates of the statements, and the subjects and objects are the nodes linked by such edges. For instance, the diagram below represents all the RDF statements introduced above using a visual graph.

![An image of the RDF graph presented in the RDF statements above](rdfgraph.png)
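The same structure can be sketched in a few lines of plain Python, treating each statement as a labelled edge from its subject to its object. The abbreviated names below (`wd:`, `wdt:`, `rdfs:`, `schema:`) are assumptions used only to keep the example readable; they stand for the full IRIs shown in the statements above.

```python
from collections import defaultdict

# Triples as plain (subject, predicate, object) tuples; the short names
# are hypothetical abbreviations of the full IRIs.
triples = [
    ("wd:Q12807", "rdfs:label", '"Umberto Eco"'),
    ("wd:Q172850", "rdfs:label", '"The Name of the Rose"'),
    ("wd:Q12807", "wdt:P800", "wd:Q172850"),
    ("wd:Q12807", "rdf:type", "schema:Person"),
    ("wd:Q172850", "rdf:type", "schema:Book"),
]

# Each subject node maps to the list of its outgoing labelled edges
outgoing = defaultdict(list)
for subject, predicate, obj in triples:
    outgoing[subject].append((predicate, obj))

# Print every edge of the directed graph
for node, edges in outgoing.items():
    for predicate, target in edges:
        print(node, "--" + predicate + "-->", target)
```

Note that literals only ever appear as targets of an edge, which reflects the rule that a literal cannot be the subject of a statement.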


### Triplestores

A *triplestore* is a database built for storing and retrieving RDF statements, and can contain one or more RDF graphs.

## Blazegraph, a database for RDF data

[Blazegraph DB](https://blazegraph.com/) is an ultra high-performance graph database supporting RDF/SPARQL APIs (thus, it is a triplestore). It supports up to 50 billion edges on a single machine. Its code is entirely [open source and available on GitHub](https://github.com/blazegraph/database).

Running this database as a server application is very simple. One just has to [download the .jar application](https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar), put it in a directory, and [run it](https://github.com/blazegraph/database/wiki/Quick_Start) from a shell as follows:

```
java -server -Xmx1g -jar blazegraph.jar
```

You need at least Java 9 installed on your system. If you do not have it, you can easily download and install it from the [Java webpage](https://www.java.com/it/download/manual.jsp). As you can see from the output of the command above, the database will be exposed via HTTP at a specific IP address:

![Screenshot of the execution of Blazegraph](blazegraph.png)

However, from your local machine, you can always contact it at the following URL:

```
http://127.0.0.1:9999/blazegraph/
```
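Before moving on, it can be handy to check from Python that the server is actually answering at that URL. The following is a minimal sketch using only the standard library; `is_reachable` is a hypothetical helper name, and the two-second timeout is an arbitrary choice.

```python
from urllib.request import urlopen


def is_reachable(url, timeout=2.0):
    """Return True if an HTTP request to `url` gets a successful response."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError, so this covers
        # refused connections, timeouts and HTTP error statuses; ValueError
        # is raised for strings that are not valid URLs.
        return False


if is_reachable("http://127.0.0.1:9999/blazegraph/"):
    print("Blazegraph is answering")
else:
    print("Blazegraph is not reachable: is the server running?")
```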

## From a diagram to a graph

As you can see, the UML diagram introduced in the previous lecture, which I recall below, is already organised as a (directed) graph. Thus, translating such a data model into an RDF graph database is fairly straightforward.

![UML diagram of a data model](../02/uml.png)

The important thing to decide, in this context, is the names (i.e. the URLs) of the classes and properties used to represent data compliant with the data model. In particular:

* supposing that each resource will be assigned to at least one of the types defined in the data model, we need to identify the names of all the most concrete classes (e.g. `JournalArticle`, `BookChapter`, `Journal`, `Book`);

* each attribute of each UML class will be represented by a distinct RDF property which will be involved in statements where the subjects are always resources of the class in consideration and the objects are simple literals (i.e. values). Of course, we have to identify the names of these properties (i.e. the URLs);

* each relation starting from a UML class and ending in another UML class will be represented by a distinct RDF property which will be involved in statements where the subjects are always resources of the source class while the objects are resources of the target class. Of course, we have to identify the names of these properties (i.e. the URLs);

* please bear in mind that all attributes and relations defined in a class are inherited by (i.e. can be used by) all its subclasses.

You can choose to reuse existing classes and properties (e.g. as defined in [schema.org](https://schema.org)) or create your own. In the latter case, you have to remember to use a URL you are in control of (e.g. your website or GitHub repository). For instance, a possible pattern for defining your own name for the class `Book` could be `https://<your website>/Book` (e.g. `https://essepuntato.it/Book`). Of course, there are strategies and guidelines that should be followed to implement a data model appropriately in RDF-compliant languages. However, these are out of the scope of the present course (and will be clarified in other courses).
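If you decide to create your own names, the pattern above can be applied programmatically. The sketch below assumes the example base URL `https://essepuntato.it/` mentioned in the text (replace it with a URL you control); `mint` is a hypothetical helper name, not a function of any library.

```python
from urllib.parse import quote

# Example base URL taken from the text: replace it with one you control
BASE = "https://essepuntato.it/"


def mint(local_name, base=BASE):
    """Build the URL of a class or property from its local name."""
    # quote() percent-encodes characters, such as spaces, that are
    # not allowed as-is in a URL path
    return base + quote(local_name)


book_class = mint("Book")
venue_property = mint("publicationVenue")
print(book_class)
print(venue_property)
```

Keeping the base URL in a single place makes it easy to change it later without touching every class and property name.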

The names of all the classes and properties I will use in the examples in this tutorial are as follows:

* UML class `JournalArticle`: `https://schema.org/ScholarlyArticle`;
* UML class `BookChapter`: `https://schema.org/Chapter`;
* UML class `Journal`: `https://schema.org/Periodical`;
* UML class `Book`: `https://schema.org/Book`;
* UML attribute `doi` of class `Publication`: `https://schema.org/identifier`;
* UML attribute `publicationYear` of class `Publication`: `https://schema.org/datePublished`;
* UML attribute `title` of class `Publication`: `https://schema.org/name`;
* UML attribute `issue` of class `JournalArticle`: `https://schema.org/issueNumber`;
* UML attribute `volume` of class `JournalArticle`: `https://schema.org/volumeNumber`;
* UML attribute `id` of class `Venue`: `https://schema.org/identifier`;
* UML attribute `name` of class `Venue`: `https://schema.org/name`;
* UML relation `publicationVenue` of class `Publication`: `https://schema.org/isPartOf`.

## Using RDF in Python

The [library `rdflib`](https://rdflib.readthedocs.io/en/stable/) provides classes and methods that allow one to create RDF graphs and populate them with RDF statements. It can be installed using the `pip` command as follows:

```
pip install rdflib
```

The [class `Graph`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph) is used to create an (initially empty) RDF graph, as shown as follows:

```python
from rdflib import Graph

my_graph = Graph()
```

All the resources (including the properties) are defined using the [class `URIRef`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.term.URIRef). The constructor of this class takes as input a string representing the IRI (or URL) of the resource in consideration. For instance, the code below shows all the resources mentioned above, i.e. those referring to classes, attributes and relations:

```python
from rdflib import URIRef

# classes of resources
JournalArticle = URIRef("https://schema.org/ScholarlyArticle")
BookChapter = URIRef("https://schema.org/Chapter")
Journal = URIRef("https://schema.org/Periodical")
Book = URIRef("https://schema.org/Book")

# attributes related to classes
doi = URIRef("https://schema.org/identifier")
publicationYear = URIRef("https://schema.org/datePublished")
title = URIRef("https://schema.org/name")
issue = URIRef("https://schema.org/issueNumber")
volume = URIRef("https://schema.org/volumeNumber")
identifier = URIRef("https://schema.org/identifier")
name = URIRef("https://schema.org/name")

# relations among classes
publicationVenue = URIRef("https://schema.org/isPartOf")
```

Instead, literals (i.e. values to specify as objects of RDF statements) can be created using the [class `Literal`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.term.Literal). The constructor of this class takes as input a value (of any basic type: it can be a string, an integer, etc.) and creates the related literal object in RDF, as shown in the next excerpt:

```python
from rdflib import Literal

a_string = Literal("a string with this value")
a_number = Literal(42)
a_boolean = Literal(True)
```

Using these classes, it is possible to create all the Python objects necessary to build statements describing the data to be pushed into an RDF graph. We need to use the [method `add`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.Graph.add) to add a new RDF statement to a graph. This method takes as input a tuple of three elements defining the subject (a `URIRef`), the predicate (another `URIRef`) and the object (either a `URIRef` or a `Literal`) of the statement.

The following code shows how to populate the RDF graph using the data obtained by processing the two CSV documents presented in previous tutorials, i.e. [that of the publications](../01/01-publications.csv) and [that of the venues](../01/01-venues.csv). For instance, all the venues are created using the following code:

```python
from pandas import read_csv, Series
from rdflib import RDF

# This is the string defining the base URL used to define
# the URLs of all the resources created from the data
base_url = "https://comp-data.github.io/res/"

venues = read_csv("../01/01-venues.csv",
                  keep_default_na=False,
                  dtype={
                      "id": "string",
                      "name": "string",
                      "type": "string"
                  })

venue_internal_id = {}
for idx, row in venues.iterrows():
    local_id = "venue-" + str(idx)

    # The shape of the new resources that are venues is
    # 'https://comp-data.github.io/res/venue-<integer>'
    subj = URIRef(base_url + local_id)

    # We store the new venue resources created here, to use them
    # when creating publications
    venue_internal_id[row["id"]] = subj

    if row["type"] == "journal":
        # RDF.type is the URIRef already provided by rdflib of the property
        # 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
        my_graph.add((subj, RDF.type, Journal))
    else:
        my_graph.add((subj, RDF.type, Book))

    my_graph.add((subj, name, Literal(row["name"])))
    my_graph.add((subj, identifier, Literal(row["id"])))
```

As you can see, all the RDF triples have been added to the graph, which currently contains the following number of distinct triples (which coincides with the number of cells in the original table):

```python
print("-- Number of triples added to the graph after processing the venues")
print(len(my_graph))
```

```
-- Number of triples added to the graph after processing the venues
12
```


The same approach can be used to add information about the publications, as shown as follows:

```python
publications = read_csv("../01/01-publications.csv",
                        keep_default_na=False,
                        dtype={
                            "doi": "string",
                            "title": "string",
                            "publication year": "int",
                            "publication venue": "string",
                            "type": "string",
                            "issue": "string",
                            "volume": "string"
                        })

for idx, row in publications.iterrows():
    local_id = "publication-" + str(idx)

    # The shape of the new resources that are publications is
    # 'https://comp-data.github.io/res/publication-<integer>'
    subj = URIRef(base_url + local_id)

    if row["type"] == "journal article":
        my_graph.add((subj, RDF.type, JournalArticle))

        # These two statements apply only to journal articles
        my_graph.add((subj, issue, Literal(row["issue"])))
        my_graph.add((subj, volume, Literal(row["volume"])))
    else:
        my_graph.add((subj, RDF.type, BookChapter))

    my_graph.add((subj, name, Literal(row["title"])))
    my_graph.add((subj, identifier, Literal(row["doi"])))

    # The original value here has been cast to string since the Date type
    # in schema.org ('https://schema.org/Date') is actually a string-like value
    my_graph.add((subj, publicationYear, Literal(str(row["publication year"]))))

    # The URL of the related publication venue is taken from the
    # dictionary defined previously, when processing the venues
    my_graph.add((subj, publicationVenue, venue_internal_id[row["publication venue"]]))
```

After the addition of these new statements, the total number of RDF triples in the graph is equal to the number of cells in the venue CSV plus the number of non-empty cells in the publication CSV:

```python
print("-- Number of triples added to the graph after processing venues and publications")
print(len(my_graph))
```

```
-- Number of triples added to the graph after processing venues and publications
31
```


It is worth mentioning that we should not map to RDF those cells of the original table that do not contain any value. Thus, if for instance there is an `issue` cell in the publication CSV which is empty (i.e. no information about the issue has been specified), you should not create any RDF statement mapping such a non-information.

## How to create and populate a graph database with Python

Once we have created our graph with all the triples we need, we can persistently upload the graph to our triplestore. To do that, we have to create an instance of the [class `SPARQLUpdateStore`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.plugins.stores.html#rdflib.plugins.stores.sparqlstore.SPARQLUpdateStore), which acts as a proxy to interact with the triplestore. The important thing is to open the connection with the store by passing, as input, a tuple of two strings (the first is the endpoint used for queries, the second the one used for updates). In our case both strings contain the same URL, i.e. the SPARQL endpoint of the triplestore where the data will be uploaded.

Then, we can upload triple by triple using a for-each iteration over the list of RDF statements obtained by using the [method `triples`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.triples) of the class `Graph`, passing as input a tuple with three `None` values, as shown as follows:

```python
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

store = SPARQLUpdateStore()

# The URL of the SPARQL endpoint is the same URL of the Blazegraph
# instance + '/sparql'
endpoint = 'http://127.0.0.1:9999/blazegraph/sparql'

# It opens the connection with the SPARQL endpoint instance
store.open((endpoint, endpoint))

for triple in my_graph.triples((None, None, None)):
    store.add(triple)

# Once finished, remember to close the connection
store.close()
```
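
To check that the upload went through, one option is to ask the endpoint how many triples it stores. The sketch below uses only the standard library and assumes the endpoint accepts a SPARQL query passed in the `query` parameter of a GET request and can answer in the SPARQL JSON results format; `parse_count` and `count_triples` are hypothetical helper names. The parsing step is shown on a hand-written sample response, so it can be tried without a running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

COUNT_QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def parse_count(results_json):
    """Extract the integer bound to ?n from SPARQL JSON results."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return int(bindings[0]["n"]["value"])


def count_triples(endpoint):
    """Ask the endpoint how many triples it stores (needs a running server)."""
    url = endpoint + "?" + urlencode({"query": COUNT_QUERY})
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request, timeout=10) as response:
        return parse_count(response.read().decode("utf-8"))


# Hand-written sample response in the SPARQL JSON results format
sample = """
{"head": {"vars": ["n"]},
 "results": {"bindings": [
   {"n": {"type": "literal",
          "datatype": "http://www.w3.org/2001/XMLSchema#integer",
          "value": "31"}}]}}
"""
print(parse_count(sample))

# With Blazegraph running, uncomment the following line:
# print(count_triples("http://127.0.0.1:9999/blazegraph/sparql"))
```

If the database already contained other data before the upload, the number returned by the server will of course be larger than `len(my_graph)`.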