# Knowledge and Data 2020: Practical Assignment 3 
## RDF Data, RDFS knowledge and inferencing 

YOUR NAME: Vanshita Sharma Kumar 

YOUR VUNetID: vkr560

*(If you do not provide your name and VUNetID we will not accept your submission).* 

### Learning objectives

At the end of this exercise you should be able to:

1. Access local an external data via SPARQL both from within a python programming environment and stand-alone with a GUI, such as [YASGUI](https://yasgui.triply.cc/), and this way integrate data from different sources  
2. Model your own first knowledge base, in this case an RDF Schema knowledge graph
3. Implement inference rules 

Follow this Notebook step-by-step. 

Of course, you can do the exercises in any Programming Editor of your liking. 
But you do not have to. Feel free to simply write code in the Notebook. When 
everythink is filled in and works, safe the Notebook and submit it 
as a Jupyter Notebook, i.e. with an ipynb extension. Please use as name of the 
Notebook your studentID+Assignment3.ipynb.  

Other than in courses dedicated to programming we will not evaluate the style
of the programs. But we will test your programs on other data than we provide, 
and your program should give the correct answers to those test-data as well. 

Before you start, you need to:

- **Install the *rdflib* Python package:** *pip install rdflib* (should already be installed from the previous assignment)
- **Install the *SPARQLWrapper* Python package:** *pip install SPARQLWrapper*
- **Install the free edition of the GraphDB Triplestore:** please follow this short [GraphDB tutorial](https://github.com/ucds-vu/knowledge-data-vu/blob/master/Tutorials/Preliminaries/tutorial-GraphDB.md). 

Then, add the file example-from-slides.ttl to a newly created database, say called assignment-3. 

**Note that you should have an active internet connection to run the code in this notebook.**

## Task 1: (3.5 points) Integrate Local and External Data

You can integrate SPARQL queries into your Python code by using the *RDFLib* and *SPARQLWrapper* libraries. 

The following code accesses the DBPedia knowledge graph using its SPARQL endpoint, and returns the result of the SPARQL query requesting all the labels asserted to Amsterdam (test it!)  

In [1]:
# This code only works if you are online

from rdflib import Graph, RDF, Namespace, Literal, URIRef
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cityName
    WHERE { 
        <http://dbpedia.org/resource/Amsterdam> rdfs:label ?cityName 
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
    print(result["cityName"]["value"])  

Amsterdam
أمستردام
حكومة أمستردام
Amsterdam
Amsterdam
Amsterdam
Άμστερνταμ
Άμστερνταμ (δήμος)
Amsterdamo
Ámsterdam
Amsterdam
Amsterdam
Amsterdam (commune)
Amstardam
Amsterdam
アムステルダム
Amsterdam
암스테르담
Amsterdam (gemeente)
Amsterdam
Amsterdam
Амстердам
Amesterdão
Amsterdam
Амстердам
阿姆斯特丹


Your task is now the following:
1. Write a SPARQL query that extracts all the cities from your local knowledge graph (constructed by loading the file example-from-slides.ttl) 
2. Find the number of inhabitants of these cities and the longitude and latitude information (if available) from DBPedia.
3. Merge the triples from example-from-slides.ttl with the information extracted from DBpedia + Save all these triples into a new file 'extended-example.ttl' + Print all triples in Turtle Syntax.

For your convenience, we already wrote the following functions that might be useful to complete this task. 
In addition, we have loaded and printed the 'example-from-slides.ttl' dataset.

In [2]:
from rdflib import Graph, RDF, Namespace, Literal, URIRef
from SPARQLWrapper import SPARQLWrapper, JSON

g = Graph()



# Loads the data from a certain file given as input in Turtle syntax into the Graph g  
# -------------------------
def load_graph(filename):
    with open(filename, 'r') as f:
        g.parse(f, format='turtle')
        

# Prints a certain graph given as input in Turtle syntax
# -------------------------
def serialize_graph(myGraph):
     print(myGraph.serialize(format='turtle'))
        

# Saves the Graph g in Turtle syntax to a certain file given as input
# -------------------------
def save_graph(myGraph, filename):
    with open(filename, 'w') as f:
        myGraph.serialize(filename, format='turtle')
        
    
# Changes the namespace of a certain URI given as input to a DBpedia URI 
# Example: transformToDBR("http://example.com/kad2020/Amsterdam") returns "http://dbpedia.org/resource/Amsterdam"
# -------------------------
def transformToDBR(uri):
    if isinstance(uri, Literal):
        # changes the literal to uppercase so that the object with the same name refers to an object and not the string
        return uri.upper()
    components = g.namespace_manager.compute_qname(uri)
    return "http://dbpedia.org/resource/%s"%(components[2])

# -------------------------

load_graph('example-from-slides.ttl')
serialize_graph(g)


# Don't forget to run this cell before continuing the task.


@prefix ex: <http://example.com/kad2020/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Netherlands a ex:Country ;
    ex:contains ex:Ijsselmeer ;
    ex:containsCity ex:Rotterdam ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:neighbours rdfs:subPropertyOf ex:closeBy .

ex:Amsterdam a ex:Capital ;
    ex:closeBy ex:Germany .

ex:Belgium a ex:Country .

ex:EuropeanCountry rdfs:subClassOf ex:Country .

ex:Germany a ex:EuropeanCountry ;
    ex:hasCapital ex:Berlin .

ex:closeBy rdfs:domain ex:Location ;
    rdfs:range ex:Location .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City ;
    rdfs:subPropertyOf ex:contains .

ex:Capital rdfs:subClassOf ex:City .

ex:City rdfs:subClassOf ex:Location .

ex:Country rdfs:subClassOf ex:Location .




### 1. Write a SPARQL query that finds all the cities in the dataset

As there is no explicit class City, you will have to find those cities in the dataset (example-from-slides.ttl) using implicit information that can be deduced from the domain and ranges of the relations (e.g. things in a hasCapital relation are capitals and a capital is a city, etc.).

Save all the cities returned from the SPARQL query into the empty set "cities". 

In [3]:
cities = set()
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?city
        WHERE {
            ?country ex:hasCapital | ex:has_Capital | ex:containsCity ?city.
            
        }
       """)
for row in qres:
    cities.add(row[0])
    
for city in cities:
    print(city) 
    
    # make query which takes the city 
    

http://example.com/kad2020/Rotterdam
http://example.com/kad2020/Berlin
http://example.com/kad2020/Amsterdam


### 2. For each city, find from DBpedia its longitude & latitude, and its number of inhabitants (if available)

Don't forget to adapt the namespace of the cities in your dataset when querying DBpedia, using the above function *transformToDBR(uri)*

The empty graph g2 should only contain the triples extracted from DBpedia, but added to the URIs with the 'ex' namespace. 
An example of a triple in g2 is the following triple: 
       
       ex:Amsterdam dbo:populationTotal "872680"^^xsd:nonNegativeInteger .

In [4]:
g2 = Graph()

ex = Namespace("http://example.com/kad2020/") 
g2.bind("ex", ex)

dbo = Namespace("http://dbpedia.org/ontology/")
g2.bind("dbo", dbo)

geo = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")  
g2.bind("geo", geo)

for city in cities: 
    
    transform_to_dbr = transformToDBR(city) 
    
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?longitude ?latitude <"""+transform_to_dbr+"""> ?population
    WHERE {
      <"""+transform_to_dbr+"""> geo:lat ?latitude. 
      <"""+transform_to_dbr+"""> geo:long ?longitude .
      
        OPTIONAL{<"""+transform_to_dbr+"""> dbo:populationTotal ?population}

    } 
""")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    
    split_city = city.split('/')[-1] 
    # print(split_city)
                    
    for result in results["results"]["bindings"]:
        longitude = result["longitude"]["value"]
        latitude = result["latitude"]["value"]

        g2.add((ex[city], geo.long, Literal(longitude)))
        g2.add((ex[city], geo.lat, Literal(latitude)))
        
        if 'population' in result:
            population = result["population"]["value"]
            g2.add((ex[city], dbo.populationTotal, Literal(population)))
            
            
serialize_graph(g2)

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://example.com/kad2020/http://example.com/kad2020/Amsterdam> dbo:populationTotal "872680" ;
    geo:lat "52.3667" ;
    geo:long "4.9" .

<http://example.com/kad2020/http://example.com/kad2020/Berlin> dbo:populationTotal "3769495" ;
    geo:lat "52.52" ;
    geo:long "13.405" .

<http://example.com/kad2020/http://example.com/kad2020/Rotterdam> dbo:populationTotal "651157" ;
    geo:lat "51.9167" ;
    geo:long "4.5" .




### 3. Save your results

- Merge the triples from example-from-slides.ttl with the information extracted from DBpedia
- Save all these triples into a new file 'extended-example.ttl'
- Print all triples in Turtle Syntax.


In [5]:
# Your code here
# k.poon@student.vu.nl
new_graph = g + g2
save_graph(new_graph, 'extended-example.ttl')
serialize_graph(new_graph)

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix ex: <http://example.com/kad2020/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Netherlands a ex:Country ;
    ex:contains ex:Ijsselmeer ;
    ex:containsCity ex:Rotterdam ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

<http://example.com/kad2020/http://example.com/kad2020/Amsterdam> dbo:populationTotal "872680" ;
    geo:lat "52.3667" ;
    geo:long "4.9" .

<http://example.com/kad2020/http://example.com/kad2020/Berlin> dbo:populationTotal "3769495" ;
    geo:lat "52.52" ;
    geo:long "13.405" .

<http://example.com/kad2020/http://example.com/kad2020/Rotterdam> dbo:populationTotal "651157" ;
    geo:lat "51.9167" ;
    geo:long "4.5" .

ex:neighbours rdfs:subPropertyOf ex:closeBy .

ex:Amsterdam a ex:Capital ;
    ex:closeB

## Task 2: (2.5 points)  Implement Basic Inferencing Rules 

In the lecture we showed that the RDFS inference rules can be used to infer new knowledge. For example, infer class membership based on rdfs:domain or infer relationships between subjects and objects based on rdfs:subPropertyOf. 

Create rules (or 1 rule?!) to inference class membership based on the RDF Schema language features 
*	For example: infer that an instance belongs to a class because of domain and range restrictions
*	For example: infer that an instance belongs to a (super)class because it also belongs to a subclass

We implemented the rdfs2 rule. You should implement the 5 following remaining rules:  

*     (rdfs2) If G contains the triples (aaa rdfs:domain xxx.) and (uuu aaa yyy.)  then infer the triple (uuu rdf:type xxx.)

*     (rdfs3) If G contains the triples (aaa rdfs:range xxx.) and (uuu aaa vvv.) then infer the triple (vvv rdf:type xxx .)

*     (rdfs5) If G contains the triples (uuu rdfs:subPropertyOf vvv.) and (vvv rdfs:subPropertyOf xxx.) then infer the triple
(uuu rdfs:subPropertyOf xxx.) 

*     (rdfs7) If G contains the triples (aaa rdfs:subPropertyOf bbb.) and (uuu aaa yyy.) then infer the triple (uuu bbb yyy) 

*     (rdfs9) If G contains the triples (uuu rdfs:subClassOf xxx.) and (vvv rdf:type uuu.) then infer the triple
 (vvv rdf:type xxx.)   -> this one was not mentioned in the lecture, but is a very important one. 
 
 
*     (rdfs11) If G contains the triples (uuu rdfs:subClassOf vvv.) and (vvv rdfs:subClassOf xxx.) then infer the triple
(uuu rdfs:subClassOf xxx.)


Run your rule reasoner on your knowledge graph.

In [6]:
#sbj, pre, obj. s,p,o

def myRDFSreasoner(myGraph):
    inferredTriples = 0
    for sbj, prd, obj in myGraph:

        # --- rdfs2 ---
        if (prd.eq(URIRef("http://www.w3.org/2000/01/rdf-schema#domain"))):
            generator = myGraph.subject_objects(URIRef(sbj))
            for s, o in generator:
                inferredTriples += 1
                print("(rdfs 2) ", s, "rdf:type", obj)
        
        
        # --- rdfs3 ---
        if (prd.eq(URIRef("http://www.w3.org/2000/01/rdf-schema#range"))):
            generator = myGraph.subject_objects(URIRef(sbj))
            for s, o in generator:
                inferredTriples += 1
                print("(rdfs 3) ", o, "rdf:type", obj) 
        
        # --- rdfs5 ---
        if (prd.eq(URIRef("http://www.w3.org/2000/01/rdf-schema#subPropertyOf"))):
            generator = myGraph.subject_objects(URIRef("http://www.w3.org/2000/01/rdf-schema#subPropertyOf"))
            for s, o in generator:
                if obj==s:
                    inferredTriples += 1
                    print("(rdfs 5) ", sbj, "rdf:subPropertyOf", o)
        
        # --- rdfs7 ---
        if (prd.eq(URIRef("http://www.w3.org/2000/01/rdf-schema#subPropertyOf"))):
            generator = myGraph.subject_objects(URIRef(sbj))
            for s, o in generator:
                inferredTriples += 1
                print("(rdfs 7) ", s, "rdf:subPropertyOf", o)
        
        # --- rdfs9 ---
        if (prd.eq(URIRef("http://www.w3.org/2000/01/rdf-schema#subClassOf"))):
            generator = myGraph.subject_objects(URIRef("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"))
            for s, o in generator:
                if sbj==o:
                    inferredTriples += 1
                    print("(rdfs 9) ", s, "rdf:type", obj)
                
        #--- rdfs11 ---
        if (prd.eq(URIRef("http://www.w3.org/2000/01/rdf-schema#subClassOf"))):
            generator = myGraph.subject_objects(URIRef("http://www.w3.org/2000/01/rdf-schema#subClassOf"))
            for s,o in generator:
                if obj==s:
                    inferredTriples += 1
                    print("(rdfs 11) ", sbj, "rdf:subClassOf", o)

        
    print("---------------------------------")
    print("Number of inferred triples:", inferredTriples)
    print("---------------------------------")
    
myRDFSreasoner(g)


# rdf 11: first part get all triples with subclassof, take object from subclassof, search for the triples with subjects which is the same as the object from the previous line, with the predicates subclassof then make new triple 





(rdfs 7)  http://example.com/kad2020/Netherlands rdf:subPropertyOf http://example.com/kad2020/Belgium
(rdfs 9)  http://example.com/kad2020/Amsterdam rdf:type http://example.com/kad2020/City
(rdfs 11)  http://example.com/kad2020/Capital rdf:subClassOf http://example.com/kad2020/Location
(rdfs 3)  http://example.com/kad2020/Rotterdam rdf:type http://example.com/kad2020/City
(rdfs 7)  http://example.com/kad2020/Netherlands rdf:subPropertyOf http://example.com/kad2020/Rotterdam
(rdfs 9)  http://example.com/kad2020/Germany rdf:type http://example.com/kad2020/Country
(rdfs 11)  http://example.com/kad2020/EuropeanCountry rdf:subClassOf http://example.com/kad2020/Location
(rdfs 3)  http://example.com/kad2020/Berlin rdf:type http://example.com/kad2020/Capital
(rdfs 9)  http://example.com/kad2020/Netherlands rdf:type http://example.com/kad2020/Location
(rdfs 9)  http://example.com/kad2020/Belgium rdf:type http://example.com/kad2020/Location
(rdfs 3)  http://example.com/kad2020/Germany rdf:type h

## Task 3: (2 points) Build your very own RDFS knowledge graph. 


Define a small RDF Schema vocabulary in Turtle. You can choose your own domain (e.g. movies, geography, sports) respecting all the following rules:
*	The schema should define at least 4 classes, 4 properties, and 4 instances.
*   The properties should be used to relate the instances
*	The instances should be a member of your classes
*	All resources should have an rdfs:label in a suitable language.

You should use (at least) the following language features of RDF and RDFS:
* 	rdf:type (or 'a')
* 	rdfs:subClassOf
* 	rdfs:subPropertyOf
* 	rdfs:domain and rdfs:range
*	rdfs:label

Be sure to define the 'rdf:' and 'rdfs:' namespace prefixes for RDF and RDF Schema in your file (perhaps have a look at http://prefix.cc)

For creating your vocabulary, you can either use a text editor, or add the axioms directly (programatically) to your Knowledge Graph as you did last week. 

Play around with the inference rules you have created in the previous task to make sure that you some added some implicit knowledge, that becomes "visible" via inferencing (this will be useful for the next task). 

Finally:
- Add the knowledge you created into the RDFLIB graph datastructure *myRDFSgraph*, 
- Print *myRDFSgraph* in Turtle so that we can check your "design"
- Save *myRDFSgraph* into a new file 'myRDFSgraph.ttl'

In [33]:
myRDFSgraph = Graph()

ex = Namespace('http://example.org/')
myRDFSgraph.bind('ex',ex)
rdf = Namespace('http://www.w3.org/1999/02/22-rdf-syntax-ns#')
myRDFSgraph.bind('rdf',rdf)
rdfs = Namespace('http://www.w3.org/2000/01/rdf-schema#')
myRDFSgraph.bind('rdfs',rdfs)


myRDFSgraph.add((ex.Art, rdf.type, rdfs.Class))
myRDFSgraph.add((ex.ArtWork, rdf.subPropertyOf, rdfs.Art))


myRDFSgraph.add((ex.MonaLisa, rdfs.subClassOf, ex.Art))
myRDFSgraph.add((ex.DaVinci, rdfs.domain, ex.ArtWork))
myRDFSgraph.add((ex.common_theme, rdfs.range, ex.Themes))


myRDFSgraph.add((ex.Painter, rdf.type, rdfs.Class))
myRDFSgraph.add((ex.Sculpture, rdf.type, rdfs.Class))
myRDFSgraph.add((ex.Themes, rdf.type, rdfs.Class))
myRDFSgraph.add((ex.ArtWork, rdf.type, rdfs.Class))
myRDFSgraph.add((ex.David, ex.madeOf, ex.CarraraMarble))

myRDFSgraph.add((ex.MonaLisa, rdf.type, ex.Painting))
myRDFSgraph.add((ex.StarryNight, ex.is_a, ex.OilPainting))
myRDFSgraph.add((ex.TheNachtWacht, rdf.type, ex.Painting))
myRDFSgraph.add((ex.David, rdf.type, ex.Sculpture))

myRDFSgraph.add((ex.MonaLisa, ex.painted_by, ex.DaVinci))
myRDFSgraph.add((ex.MonaLisa, rdf.type, ex.Artwork))
myRDFSgraph.add((ex.DaVinci, ex.painted, ex.MonaLisa))
myRDFSgraph.add((ex.TheNachtWacht, ex.common_theme, ex.creative_expression))
myRDFSgraph.add((ex.StarryNight, rdf.type, ex.ArtWork))

myRDFSgraph.add((ex.StarryNight, ex.painted_by, ex.VanGogh))


myRDFSgraph.add((ex.David, ex.Located_in, Literal("Italy")))
myRDFSgraph.add((ex.DaVinci, rdf.type, ex.Painter))


myRDFSgraph.add((ex.Michelangelo, rdf.subClassOf, ex.Art))
myRDFSgraph.add((ex.Location, rdfs.subPropertyOf, ex.Located_in))
myRDFSgraph.add((ex.Art, rdfs.label, Literal("Kunst")))


print("Now let's check what we can infer from your knowledge graph...")
print("The more rules you cover, the better!")
myRDFSreasoner(myRDFSgraph)
print(myRDFSgraph.serialize(format='ttl'))
save_graph(myRDFSgraph, "Painters_graph.ttl")

Now let's check what we can infer from your knowledge graph...
The more rules you cover, the better!
(rdfs 3)  http://example.org/creative_expression rdf:type http://example.org/Themes
---------------------------------
Number of inferred triples: 1
---------------------------------
@prefix ex: <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Art a rdfs:Class ;
    rdfs:label "Kunst" .

ex:ArtWork a rdfs:Class ;
    rdf:subPropertyOf rdfs:Art .

ex:Painter a rdfs:Class .

ex:Sculpture a rdfs:Class .

ex:Themes a rdfs:Class .

ex:David a ex:Sculpture ;
    ex:Located_in "Italy" ;
    ex:madeOf ex:CarraraMarble .

ex:Location rdfs:subPropertyOf ex:Located_in .

ex:Michelangelo rdf:subClassOf ex:Art .

ex:StarryNight a ex:ArtWork ;
    ex:is_a ex:OilPainting ;
    ex:painted_by ex:VanGogh .

ex:TheNachtWacht a ex:Painting ;
    ex:common_theme ex:creative_expression .

ex:common_theme rdfs:range

## Task 4 (2 points) Compare local inferences with GraphDB results

Upload *myRDFSgraph.ttl* to GraphDB (check [the GraphDB tutorial](https://github.com/ucds-vu/knowledge-data-vu/blob/master/Tutorials/Preliminaries/tutorial-GraphDB.md) before starting to work with GraphDB).

Formulate two different SPARQL queries, and write a Python code that executes these queries over your GraphDB SPARQL endpoint (check example of Task 1).

**Each SPARQL query should return a different type of inferred knowledge** (at least one triple that was not explicitly asserted in the graph).

Specify below next to your query, which type of RDFS rule is the GraphDB reasoner using to infer this answer (rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, rdfs11). 

In [8]:
# Get your GraphDB repository URL and assign it to the variable 'myEndpoint' below. 
# It should be similar (but not the same) to this: 

myEndpoint = "http://145.108.229.246:7200/repositories/assignment-3"
sparql = SPARQLWrapper(myEndpoint)



In [9]:
# Query 1 - Specify which RDFS rule are you testing: 

# Check example of Task 1 on how to query remote SPARQL endpoints

sparql.setQuery("""

""")






In [10]:
# Query 2 - Specify which RDFS rule are you testing: 

# Check example of Task 1 on how to query remote SPARQL endpoints

sparql.setQuery("""

""")

