<a href="https://colab.research.google.com/github/cw00dw0rd/ArangoNotebooks/blob/master/Intro_To_Knowledge_Graphs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook will:
- Further illustrate the RDF data model
- Show how to translate an RDF triple into a property graph
- Highlight potential challenges when importing

The first few code blocks are some basic setup, including:
- Importing needed libraries
- Initializing a temporary ArangoDB Oasis instance
- Loading some sample data 

TODO Intro to DBpedia

If you are running the notebook on your own, run the next few code blocks until you get to the ***STOP HERE*** text block.

In [4]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [5]:
import json
import requests
import sys
import oasis
import time
import textwrap

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [6]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="Intro-Knowledge-Graph", credentialProvider="https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [7]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

https://tutorials.arangodb.cloud:8529
Username: TUT302qvexbj8kximy60pkx8q
Password: TUTxbnvstv401s3filas9z9
Database: TUTdwdf04fgmoh262z5sjqgj5


Feel free to use the above URL to check out the ArangoDB WebUI!

In [None]:
import re
# Open the sample N-Triples file and peek at its first statement.
f = open("test.nt")
f.readline()

'<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .\n'

## STOP HERE

## Subject Predicate Object

Now that we have some of the basic setup completed let's dive into it.

The portion that still needs to be covered is the format of RDF files. As briefly mentioned in the article associated with this notebook, the RDF specification stores data as triples using the Subject, Predicate, Object (SPO) format.

{{ S-P-O IMAGE }}

This specification and formatting is what keeps the data uniform and machine readable.

### Serializing

Here is an example of RDF formatted data, serialized as triples. This is from the attached `doyle.nt` file. 
```
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Artist> .
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Writer> .
```


The `nt` format stands for [N-Triples](https://www.w3.org/TR/n-triples/) and is one of many methods for serializing RDF data. Some common formats for RDF data include:
- RDF/XML
- Turtle (ttl)
- N-Triples (nt)
- N-Quads (nq)
- JSON-LD

We are using `nt` as it is a format supplied directly from DBpedia. 

The `doyle.nt` sample shown earlier is a list of four triples that all share the Subject Arthur Conan Doyle. Let's examine just the first triple from the list to understand what is going on.
```
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
```
Here we can see that the triple consists of three different links. When dealing with linked data, these links are referred to as URIs instead of URLs. The difference between the two is that:
- URIs identify resources
- URLs identify resources AND the protocol to access them
- All URLs are URIs with additional access information
- Not all URIs are URLs

Everything we deal with in this example also happens to be a URL. For instance, try navigating to the first URL from the triple, `http://dbpedia.org/resource/Arthur_Conan_Doyle`. You should be brought to a page, similar to the Wikipedia page, with information about Arthur Conan Doyle. You will almost always see the identifiers referred to as URIs even when, as in this example, they could also be referred to as URLs.
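
To make the URI versus URL distinction a little more tangible, here is a minimal sketch using Python's standard `urllib.parse`. The helper name `has_network_location` and the `urn:isbn:` identifier are illustrative assumptions, not part of the notebook's data, and checking for a host is only a rough heuristic for "this URI can also be used as a URL".

```python
from urllib.parse import urlparse

def has_network_location(uri):
    """Rough heuristic: a URI with a host part can usually be used as a URL."""
    return bool(urlparse(uri).netloc)

subject_uri = "http://dbpedia.org/resource/Arthur_Conan_Doyle"
book_urn = "urn:isbn:0000000000"  # made-up placeholder identifier, not a real ISBN

for uri in (subject_uri, book_urn):
    parts = urlparse(uri)
    print(uri)
    print("  scheme:", parts.scheme,
          "| host:", parts.netloc or "(none)",
          "| usable as URL:", has_network_location(uri))
```

Both strings identify something, but only the first one also tells a client where to fetch it.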




### Subject

The first link supplied is the **Subject** of the triple. Upon inspecting it further, we can see it is a dbpedia.org resource for Arthur Conan Doyle. A resource is what an RDF expression is describing. A resource can be any number of things, not just a web page, which is why resource links won't always have associated web pages. Here is a snippet from the [W3 definition](https://www.w3.org/TR/PR-rdf-syntax/) of an RDF resource:

*..A resource may be a part of a Web page; e.g. a specific HTML or XML element within the document source. A resource may also be a whole collection of pages; e.g. an entire Web site. A resource may also be an object that is not directly accessible via the Web; e.g. a printed book...*

### Predicate
The next URI supplied is `http://www.w3.org/1999/02/22-rdf-syntax-ns#type`, and this is the predicate of the expression. The predicate is what actually describes the resource. This predicate refers to the W3 RDF syntax definition of type, indicated by the `#type` at the end of the URI. If you navigate to this link (since it is also a URL), you will see an XML page that contains definitions for various properties.
Here is the definition for type as defined by W3:
```
rdf:type a rdf:Property ;
	rdfs:isDefinedBy <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ;
	rdfs:label "type" ;
	rdfs:comment "The subject is an instance of a class." ;
	rdfs:range rdfs:Class ;
	rdfs:domain rdfs:Resource .
```
Don't worry too much about everything that is going on in this page. The important thing to take away is that this URI refers to a pre-defined descriptor. This page is essentially the glossary for RDF-specific vocabulary. The RDF XML schema will be covered in more detail (later/in another course?).
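
The `rdf:type` shorthand in the definition above is just a namespace plus a local name. As a rough sketch (the helper `split_uri` is a hypothetical name introduced here, and cutting at the last `#` or `/` is a common convention rather than a rule from the RDF specification), we can split the predicate URI the same way:

```python
def split_uri(uri):
    """Split a URI into (namespace, local_name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1], uri[cut + 1:]

namespace, local_name = split_uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
print(namespace)   # http://www.w3.org/1999/02/22-rdf-syntax-ns#
print(local_name)  # type

# The same helper also covers slash-style namespaces such as DBpedia's ontology.
print(split_uri("http://dbpedia.org/ontology/Person"))
```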

### Object
The final URI in this expression is the object `<http://www.w3.org/2002/07/owl#Thing>`. This provides the last piece of information to complete the RDF statement. Navigating to this link takes us to another list of vocabulary terms used in the w3.org ontology. 

As defined on that same page:

*This ontology partially describes the built-in classes and properties that together form the basis of the RDF/XML syntax of OWL 2...*

The definition supplied for the `Thing` class, shown here in Turtle syntax, is:

```
owl:Thing a owl:Class ;
     rdfs:label "Thing" ;
     rdfs:comment "The class of OWL individuals." ;
     rdfs:isDefinedBy <http://www.w3.org/2002/07/owl#> .
```
This tells us that `Thing` is simply the base class of all OWL individuals.


So, as a quick recap, with these URIs we know that:
- The subject of the RDF expression is `Arthur Conan Doyle`
- Who has a property of `type` 
- The `type` property is part of a pre-defined w3.org vocabulary
- He is a `type` of `Thing`, which is also pre-defined
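
The recap above can be reproduced in a few lines of Python. This is a minimal sketch, assuming statements that contain only `<...>` URIs as in our sample; it does not handle literals or blank nodes, and the names `Triple`, `TRIPLE_RE` and `parse_triple` are introduced here purely for illustration.

```python
import re
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Three <...> terms followed by the terminating '.'.
# Literals and blank nodes are deliberately not handled in this sketch.
TRIPLE_RE = re.compile(r"^\s*<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")

def parse_triple(line):
    """Parse one URI-only N-Triples statement into a Triple."""
    match = TRIPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a URI-only N-Triples statement: {line!r}")
    return Triple(*match.groups())

line = ("<http://dbpedia.org/resource/Arthur_Conan_Doyle> "
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
        "<http://www.w3.org/2002/07/owl#Thing> .")

triple = parse_triple(line)
print("Subject:  ", triple.subject)
print("Predicate:", triple.predicate)
print("Object:   ", triple.object)
```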

This small amount of information isn't immediately useful to humans, since we already know that a person is a thing. However, defining even the most basic ideas, such as what a `Thing` is, is what makes up the semantic web. Class definitions for even the most minute things are necessary because they allow the data to be read and understood by machines, which are not aware of what a `Thing` is until we explicitly define it.

Defining these ontologies and schemas is what allows for knowledge graphs to combine data from multiple sources. For example, if we take a look at the next statement:

```
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
```

The Subject and Predicate are the same, but the Object has changed. This object links to the DBpedia ontology definition for a `Person`. The `Person` class has multiple associated properties that define what a person is, so we now know which attributes can describe any `Person` defined in the DBpedia knowledge graph.



These types of defined classes allow for querying for things such as:
- Lists of Writers
- Lists of Artists
- Lists of Artists who are also Writers
- Lists of Artists who are also Writers, and who were born in Scotland
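
To illustrate how such class-based queries work, here is a small in-memory sketch over plain Python tuples. It is not how DBpedia or ArangoDB execute queries; the `Example_Painter` resource is a hypothetical entry added only so the filter has something to exclude, and the helper `instances_of` is likewise an assumption of this example. The last query in the list would also need a birthplace statement, which our sample does not contain.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBO = "http://dbpedia.org/ontology/"
DBR = "http://dbpedia.org/resource/"

triples = [
    (DBR + "Arthur_Conan_Doyle", RDF_TYPE, DBO + "Person"),
    (DBR + "Arthur_Conan_Doyle", RDF_TYPE, DBO + "Artist"),
    (DBR + "Arthur_Conan_Doyle", RDF_TYPE, DBO + "Writer"),
    # Hypothetical resource, not taken from the sample file.
    (DBR + "Example_Painter", RDF_TYPE, DBO + "Artist"),
]

def instances_of(class_uri):
    """Return every subject that has an rdf:type statement for class_uri."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == class_uri}

writers = instances_of(DBO + "Writer")
artists = instances_of(DBO + "Artist")
artists_who_write = artists & writers  # set intersection

print("Writers:", sorted(writers))
print("Artists:", sorted(artists))
print("Artists who are also Writers:", sorted(artists_who_write))
```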

### End Statement
The final note is that with N-Triples you end every statement with a `.`. Other serializations, such as Turtle, offer additional punctuation (for example `;` and `,`) for more compact formatting. The N-Triples grammar is [detailed here](https://www.w3.org/TR/n-triples/).
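
As a hedged illustration of the terminator rule, the sketch below walks over a block of N-Triples text, skips blank lines and `#` comment lines, and raises an error for any statement that lacks its final `.`. The function name `iter_statements` and the sample text are assumptions made for this example; a real import would more likely rely on a dedicated RDF parser.

```python
def iter_statements(text):
    """Yield each statement line, checking that it ends with '.'."""
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comment lines carry no statement
        if not line.endswith("."):
            raise ValueError(f"line {number}: statement is missing its final '.'")
        yield line

sample = """
# two type statements about the same subject
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Artist> .
<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Writer> .
"""

checked = list(iter_statements(sample))
print(len(checked), "well-terminated statements")
```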

## Importing to ArangoDB
These statements for Arthur Conan Doyle can already be represented as a property graph, but deciding how to approach that comes with many considerations.

In this section we will:
- Import the triples to the temporary Oasis database
- Look at possible complications depending on serialization format



The first hurdle is formatting the statements into separate documents that can be linked to each other as a property graph. 

Rather than viewing this as a single statement, we would like these to be associated nodes describing Arthur Conan Doyle. Doing this is typically a multi-part process.

In [9]:
statements = "<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> . <http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> . <http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Artist> . <http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Writer> ."
print(statements)

<http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> . <http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> . <http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Artist> . <http://dbpedia.org/resource/Arthur_Conan_Doyle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Writer> .
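
One possible first step, sketched below under several assumptions, is to split a string like the one above into individual triples and turn each subject and object into a vertex document, with the predicate stored on an edge between them. The collection name `nodes`, the `uri` and `predicate` attribute names, and the choice to hash each URI into a `_key` are all illustrative decisions made here (URIs contain `/`, which is not allowed in an ArangoDB document key). The sketch only builds Python dictionaries and does not write anything to the database.

```python
import hashlib
import re

RDF_SAMPLE = (
    "<http://dbpedia.org/resource/Arthur_Conan_Doyle> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Person> . "
    "<http://dbpedia.org/resource/Arthur_Conan_Doyle> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://dbpedia.org/ontology/Writer> ."
)

def make_key(uri):
    """Hash the URI so the result contains only key-safe characters."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

def to_property_graph(text, node_collection="nodes"):
    """Turn URI-only triples into lists of vertex and edge dictionaries."""
    vertices, edges = {}, []
    for s, p, o in re.findall(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.", text):
        for uri in (s, o):
            key = make_key(uri)
            vertices.setdefault(key, {"_key": key, "uri": uri})
        edges.append({
            "_from": f"{node_collection}/{make_key(s)}",
            "_to": f"{node_collection}/{make_key(o)}",
            "predicate": p,
        })
    return list(vertices.values()), edges

vertices, edges = to_property_graph(RDF_SAMPLE)
print(len(vertices), "vertices,", len(edges), "edges")
for edge in edges:
    print(edge)
```

Because the subject appears in both statements, it becomes a single vertex with two outgoing edges. Passing the notebook's `statements` string to `to_property_graph` should work the same way, since the regular expression does not depend on line breaks.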
