# Introduction

This notebook is a hands-on introduction to querying data with SPARQL.

It is built with a **Python3 kernel** and uses the **SPARQLWrapper**, **RDFLib** and **pandas** Python libraries.

In this tutorial all queries run locally in the Jupyter notebook, but they can also be run against public endpoints \(websites that accept SPARQL queries\).

## RDF

Resource Description Framework: a way to talk about data on the web.

 - Triples : Small sentences that have a subject, predicate and an object
 - subject : the thing the sentence is about
 - predicate: the concept linking the subject to the object
 - object: what the subject relates to.

Example:

```
I love chocolate .
```

I  = subject, love = predicate, chocolate = object .

There are different ways of writing the same thing; here we use the [Turtle](https://www.w3.org/TR/turtle/) syntax.

We replace words such as "I" with an IRI (similar to a URL, the thing shown in the address bar of your browser), e.g.

 - I = https://ch.linkedin.com/in/jervenbolleman/en
 - love = https://www.dictionary.com/browse/love
 - chocolate = https://www.wikidata.org/wiki/Q195

## SPARQL

Query language to find answers in data represented as RDF.



# Initialisation

This notebook uses a Python3 kernel.



## Python libraries

**Dependencies:**

- **SPARQLWrapper**  
  SPARQLWrapper is a wrapper around a SPARQL service. It helps in creating the query URI and, optionally, converting the result into a more manageable format. The package is licensed under the W3C license.  
  useful links: https://rdflib.github.io/sparqlwrapper/ and https://pypi.org/project/SPARQLWrapper/
- **RDFLib**  
  RDFLib is a Python library for working with RDF data. It has its own SPARQL engine built in and does not need \(but can work with\) a SPARQL-capable database. [Documentation](https://rdflib.readthedocs.io/en/stable/)
- **pandas**  
  pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured \(tabular, multidimensional, potentially heterogeneous\) and time series data both easy and intuitive.  
  useful link: https://pypi.org/project/pandas/



In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph

import pandas as pd
from pandas import json_normalize


In [3]:
g = Graph()
prefixes="""PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX dbp:<http://dbpedia.org/property/>
PREFIX dbpedia:<http://dbpedia.org/resource/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX tto:<http://example.org/tuto/ontology#>
PREFIX ttr:<http://example.org/tuto/resource#>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
"""

data = g.parse(format ="turtle" ,data = prefixes+"""

dbo:Person rdfs:subClassOf tto:Creature .

tto:Animal a rdfs:Class ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "animal" ;
	rdfs:subClassOf tto:Creature .

tto:Cat a rdfs:Class ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "cat" ;
	rdfs:subClassOf tto:Animal .

tto:Creature a rdfs:Class ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "creature" .

tto:Dog a rdfs:Class ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "dog" ;
	rdfs:subClassOf tto:Animal .

tto:Monkey a rdfs:Class ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "monkey" ;
	rdfs:subClassOf tto:Animal .

tto:pet a rdf:Property ;
	rdfs:domain dbo:Person ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "domestic animal" ;
	rdfs:range tto:Animal .

tto:sex a rdf:Property ;
	rdfs:domain tto:Creature ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "sex" ;
	rdfs:range xsd:string .

tto:weight a rdf:Property ;
	rdfs:comment "weight in kilograms" ;
	rdfs:domain tto:Creature ;
	rdfs:isDefinedBy <http://example.org/tuto/ontology#> ;
	rdfs:label "weight" ;
	rdfs:range xsd:decimal .

ttr:Eve dbo:parent ttr:William ;
	dbp:birthDate "2006-11-03"^^xsd:date ;
	dbp:name "Eve" ;
	tto:sex "female" ;
	a dbo:Person .

ttr:John dbp:birthDate "1942-02-02"^^xsd:date ;
	dbp:name "John" ;
	tto:pet ttr:LunaCat , ttr:TomCat ;
	tto:sex "male" ;
	a dbo:Person .

ttr:LunaCat dbp:name "Luna" ;
	tto:color "violet" ;
	tto:sex "female" ;
	tto:weight 4.2 ;
	a tto:Cat .

ttr:RexDog dbp:name "Rex" ;
	tto:color "brown" ;
	tto:sex "male" ;
	tto:weight 8.8 ;
	a tto:Dog .

ttr:SnuffMonkey dbp:name "Snuff" ;
	tto:color "golden" ;
	tto:sex "male" ;
	tto:weight 3.6 ;
	a tto:Monkey .

ttr:TomCat dbp:name "Tom" ;
	tto:color "grey" ;
	tto:sex "male" ;
	tto:weight 5.8 ;
	a tto:Cat .

ttr:William dbo:parent ttr:John ;
	dbp:birthDate "1978-07-20"^^xsd:date ;
	dbp:name "William" ;
	tto:pet ttr:RexDog ;
	tto:sex "male" ;
	a dbo:Person .""")


## Introduction

![Cartoon of how data items are related](https://sparql-playground.sib.swiss/queries/cartoon-rdf-type.png)

This example contains a very simple dataset of persons and their pets.

The diagram above shows the main resources it contains.

It is a simplified view \(all triples can be seen in the data section\).



## RDF Data used in this tutorial



In this tutorial we use a minimal dataset, just enough to practise on RDF data. Here we use a syntax called Turtle. RDF is about small sentences called triples, which have a subject, a predicate and an object.
In most cases we use an IRI (the thing that is in the address bar) to identify things, e.g. <http://purl.uniprot.org/uniprot/P05067>.
We use IRIs to avoid ambiguity: is 9606 the taxonomy code for human, a PubMed identifier, or just a number?
These IRIs are very long, so we shorten them with abbreviations called "prefixes", which are normally declared at the top of SPARQL queries and RDF files.

There are also literals, which have datatypes, e.g. "animal" is an xsd:string, while "1985"^^xsd:gYear is a year in the Gregorian calendar as defined by XML Schema.

The last type is the blank node. Blank nodes can be thought of as things that should have an IRI, but we just don't know it yet.

The prefixes we use are

```
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX dbp:<http://dbpedia.org/property/>
PREFIX dbpedia:<http://dbpedia.org/resource/>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX tto:<http://example.org/tuto/ontology#>
PREFIX ttr:<http://example.org/tuto/resource#>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
```



# SPARQL query examples



## Turning a data example into a query

Take one triple

```turtle
ttr:John a dbo:Person .
```

replace ttr:John with a variable

```sparql 
?person rdf:type dbo:Person .
```

add that we want to select the ?person .

```sparql
select ?person

  ?person rdf:type dbo:Person .

```

and separate what we want as a result from the pattern

```sparql
select ?person
where {
  ?person rdf:type dbo:Person .
}
```

### Q1: Select things that are persons

Selects subjects connected to the object dbo:Person via the predicate rdf:type
?thing is the only variable



In [0]:
q1 = g.query(prefixes+"""
select ?thing where {
  ?thing rdf:type dbo:Person .
}""")

# Remember that in the turtle data above we can see:
#
# 'ttr:John a dbo:Person'
#
# 'a' can also be used instead of 'rdf:type'
# 'a' is a synonym of 'rdf:type'

for row in q1:
    print("%s is a Person" % row)


### Q2: Select "things" that are Female

![Diagram](https://sparql-playground.sib.swiss/queries/cartoon-female.png) 



In [7]:
q2 = g.query(prefixes+"""
select ?thing where {
  ?thing tto:sex "female" .
}""")

# Notice that not only the persons
# but also the pets are taken

for row in q2:
    print("%s is female" % row)


http://example.org/tuto/resource#Eve is female
http://example.org/tuto/resource#LunaCat is female


### Q3: Select persons that are female

Selects subjects that are of type dbo:Person and are also connected to the literal "female" via the predicate tto:sex
?thing is the only variable

In [8]:
q3 = g.query(prefixes+"""
select ?thing where {
  ?thing a dbo:Person .
  ?thing tto:sex "female" .
}""")


# Use the same name of the variable in the 2 statements
# It is the name of the variable that enforces the constraint

# Note the dot "." which must be added in the first statement, otherwise we get a MalformedQueryException.
# Like good english we need to finish one sentence before starting the next.

for row in q3:
    print("q3: %s a female person" % row)

q3: http://example.org/tuto/resource#Eve a female person


In [17]:
# Hint: Use the semicolon ';' to refer to the previous subject
#
# select ?thing where {
#  ?thing a dbo:Person ;
#     tto:sex "female" .
#}
q3b = g.query(prefixes+"""
select ?thing where {
  ?thing a dbo:Person ;
  tto:sex "female" .
}""")

## Exercise: use a semicolon to avoid typing the subject twice

In [11]:
# Hint: Use the semicolon ';' to refer to the previous subject
#
# select ?thing where {
#  ?thing a dbo:Person ;
#     tto:sex "female" .
#}
e3 = g.query(prefixes+"""
select ?thing where {
  ?thing a dbo:Person ;
     tto:sex "female" .
}""")

for row in e3:
    print("e3: %s a female person" % row)

e3: http://example.org/tuto/resource#Eve a female person


In [18]:

# Note that we could also use the comma ',' if we had hermaphrodites in our dataset
# The following query selects things that are persons and both male and female at the same time
# select ?thing where {
#  ?thing a dbo:Person ;
#         tto:sex "female" , "male" .
#}
# Of course this does not return any value in our "normal" dataset....



for row in q3b:
    print("q3b: %s a female person" % row)


q3b: http://example.org/tuto/resource#Eve a female person


### Q4: Select things that have a "sex"

![diagram](https://sparql-playground.sib.swiss/queries/cartoon-female.png) 



In [19]:
q4 =g.query(prefixes+"""
select ?thing ?sex where {
  ?thing tto:sex ?sex .
}
""")
# Notice that we get 2 variables in our result set
#
# Explore the use of the keywords LIMIT and OFFSET at the end of the query
#
# example LIMIT 3
# example OFFSET 2 LIMIT 3
#
# Select only ?sex and add the word DISTINCT before it to get the distinct sexes in the dataset

for row in q4:
    print("q4: %s a thing that has a sex= %s" % row)

q4: http://example.org/tuto/resource#Eve a thing that has a sex= female
q4: http://example.org/tuto/resource#LunaCat a thing that has a sex= female
q4: http://example.org/tuto/resource#John a thing that has a sex= male
q4: http://example.org/tuto/resource#RexDog a thing that has a sex= male
q4: http://example.org/tuto/resource#SnuffMonkey a thing that has a sex= male
q4: http://example.org/tuto/resource#TomCat a thing that has a sex= male
q4: http://example.org/tuto/resource#William a thing that has a sex= male


### Q5: Select persons and their pets

```sparql
select ?person ?pet where {
  ?person rdf:type dbo:Person .
  ?person tto:pet ?pet .
}
```

Another example where we select 2 variables.
Notice that only the persons who actually have a pet
are returned in the result set.



In [20]:
q5 = g.query(prefixes+"""
select ?person ?pet where {
    ?person rdf:type dbo:Person .
	?person tto:pet ?pet .
}
""")
for row in q5:
    print("%s has %s as pet" % row)

http://example.org/tuto/resource#John has http://example.org/tuto/resource#LunaCat as pet
http://example.org/tuto/resource#John has http://example.org/tuto/resource#TomCat as pet
http://example.org/tuto/resource#William has http://example.org/tuto/resource#RexDog as pet


## OPTIONAL

For when some data is nice to have but not essential.





### Q6: Select persons and (if they have) their pets

![Diagram](.basic_sparql_tutorial_SWAT4HCLS_2024.ipynb.upload/paste-0.08931847913213065)

In [21]:
q6 = g.query(prefixes+"""
select ?person ?pet where {
    ?person rdf:type dbo:Person .
	optional {?person tto:pet ?pet }.
}
""")
# The use of the clause optional allows
# to extract their pets if they exist
# but will not exclude the persons who don't have pets
for row in q6:
    print("%s has %s as pet" % row)

http://example.org/tuto/resource#Eve has None as pet
http://example.org/tuto/resource#John has http://example.org/tuto/resource#LunaCat as pet
http://example.org/tuto/resource#John has http://example.org/tuto/resource#TomCat as pet
http://example.org/tuto/resource#William has http://example.org/tuto/resource#RexDog as pet


## Finding where something is missing

There are a few ways to select for missing data in SPARQL. Here we use the concept of filtering for something that does not exist.

```sparql
FILTER (NOT EXISTS {  })
```

### Q7: Select persons that have no pets



In [30]:
q7 = g.query(prefixes+"""
select
  ?person
  ?pet
where {
    ?person rdf:type dbo:Person .
    filter not exists {?person tto:pet ?_ }.
}""")


for row in q7:
    print(row)
    print("%s does not have a pet. Pet is %s" % row)

(rdflib.term.URIRef('http://example.org/tuto/resource#Eve'), None)
http://example.org/tuto/resource#Eve does not have a pet. Pet is None


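Note that the variable ?pet is not bound, even though the filter contains a pattern about pets: variables used only inside `filter exists` or `filter not exists` do not become part of the result.
 ```sparql
 filter not exists {?person tto:pet ?_ }.
 ```
Bound means that a variable is "filled in", i.e. it matches something in the graph.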

### Q8: William's and John's pets

Selects the pets of a list of owners using the clause union or values

In [29]:
q8 = g.query(prefixes+"""
select
 ?pet
where {
 {
    ttr:William tto:pet ?pet .
 } UNION {
    ttr:John tto:pet ?pet .
 }
}""")
for row in q8:
    print("%s is a pet" % row)

http://example.org/tuto/resource#RexDog is a pet
http://example.org/tuto/resource#LunaCat is a pet
http://example.org/tuto/resource#TomCat is a pet


In this scenario only 2 persons have pets, so it is not the best example,
but you can see the potential of UNION in a real scenario, where we would have many owners and would
like to select just a few of them.

Alternatively you can use the keyword VALUES, which lists the values that ?owner may take (usually more concise, and often faster)

In [26]:
q8b = g.query(prefixes+"""select
 ?owner
 ?pet
where {
  values (?owner) { (ttr:William) (ttr:John) }
  ?owner tto:pet ?pet .
}""")

for row in q8b:
    print("%s is the owner of the pet %s" % row)

http://example.org/tuto/resource#William is the owner of the pet http://example.org/tuto/resource#RexDog
http://example.org/tuto/resource#John is the owner of the pet http://example.org/tuto/resource#LunaCat
http://example.org/tuto/resource#John is the owner of the pet http://example.org/tuto/resource#TomCat


## Exercise:  Select Eve's grandparent

Replace *** with specific variables or resources.

Use the property dbo:parent to connect Eve to her parent



In [0]:
e1 = g.query(prefixes+"""
select ?grandparent where {
  ttr:Eve dbo:parent  ?parent .
  ?parent   dbo:parent   ?grandparent  .
}""")

for row in e1:
    print("%s is the grandparent of Eve" % row)

## Exercise:  Select persons who don't have cats

You should filter out any person with a pet which is of type tto:Cat

In [0]:
e2 = g.query("""
select ?person where {
  ?person rdf:type dbo:Person .
  *** *** *** {
    ?person tto:pet ?pet .
    ?pet rdf:type *** .
  }
}
""")

for row in e2:
    print("%s is a person who does not have a Cat" % row)

### Exercise 3: Select the relatives of William

We introduce the concept of `BIND`. BIND puts _one_ single value into a variable.

In [0]:
e3 =g.query("""
select
  ?relative
  ?william
where {
  bind (ttr:William as ?william)
  {?william *** ?relative}
  ***
  {?relative *** ?william}
}
""")

for row in e3:
    print("%s is a relative of %s" % row)

## Paths

Inverting a query pattern

```sparql
?a :prop ?b .
```

is equivalent to:
```sparql
?b ^:prop ?a .
```


Once that is done, try to write the query in one line using the pipe (|) which means OR
```sparql
ttr:William (:prop | ^:prop) ?relative .
```

In [0]:
e3b =g.query("""
select
  ?relative
  ?william
where {
  bind (ttr:William as ?william)
  ?william * *** * * **** * ?relative .
}
""")

for row in e3b:
    print("%s is a relative of %s" % row)

### Q9: Get the direct sub classes of class Creature

The single graph pattern finds all subjects that are declared as `rdfs:subClassOf tto:Creature`.

In [11]:
q9 = g.query(prefixes + """
select ?subSpecies   where {
  ?subSpecies rdfs:subClassOf tto:Creature .
}
""")

#The same rules applied to the data / resources 
#are applied to the ontology / classes

for row in q9:
    print("%s is a tto:Creature " % row)

http://dbpedia.org/ontology/Person is a tto:Creature 
http://example.org/tuto/ontology#Animal is a tto:Creature 


### Q10: Get the direct and indirect sub classes of class Creature

The + after rdfs:subClassOf retrieves solutions for ?subSpecies if it is connected to tto:Creature by one or more rdfs:subClassOf predicates

In [12]:
q10 = g.query(prefixes + """select ?subSpecies   where {
  ?subSpecies rdfs:subClassOf+ tto:Creature .
}""")

for row in q10:
    print("%s is a sub species of tto:Creature " % row)


http://dbpedia.org/ontology/Person is a sub species of tto:Creature 
http://example.org/tuto/ontology#Animal is a sub species of tto:Creature 
http://example.org/tuto/ontology#Cat is a sub species of tto:Creature 
http://example.org/tuto/ontology#Dog is a sub species of tto:Creature 
http://example.org/tuto/ontology#Monkey is a sub species of tto:Creature 


There are different ways to express how many times a property path may repeat

```
 path* | path+ | path?
```

```
  * -> means 0 or more
  + -> means 1 or more
  ? -> means 0 or 1 
```

 The same can be used for any property for example:

 ```sparql
 select ?parents where {
   ttr:Eve dbo:parent+ ?parents .
}
```



### Alternative for the grandparent exercise, without the unneeded variable

Try to write the two-line where part of the query in only one line using '/',
knowing that:

```sparql
?a prop ?c .
?c prop ?d .
```

can be simplified like this:

```sparql
?a prop / prop ?d
```

In [0]:
e1 = g.query(prefixes+"""
select ?grandparent where {
  ttr:Eve dbo:parent / dbo:parent ?grandparent  .
}""")

for row in e1:
    print("%s is the grandparent of Eve" % row)

### Q11: Select all things that are animals

Hint: use the property rdfs:subClassOf+, rdf:type and the class tto:Animal

In [9]:
q11 = g.query(prefixes+"""
select ?thing ?type where {
  ?type rdfs:subClassOf+ tto:Animal .
  ?thing a ?type .
}
""")



for row in q11:
    print("%s is a thing of type %s" % row)

# or alternatively

q11b = g.query(prefixes+"""
select ?thing where {
  ?thing a / rdfs:subClassOf+ tto:Animal .
}""")

print("type is not used in this variant")

for row in q11b:
    print("%s is a thing" % row)

http://example.org/tuto/resource#LunaCat is a thing of type http://example.org/tuto/ontology#Cat
http://example.org/tuto/resource#TomCat is a thing of type http://example.org/tuto/ontology#Cat
http://example.org/tuto/resource#RexDog is a thing of type http://example.org/tuto/ontology#Dog
http://example.org/tuto/resource#SnuffMonkey is a thing of type http://example.org/tuto/ontology#Monkey
type is not used in this variant
http://example.org/tuto/resource#LunaCat is a thing
http://example.org/tuto/resource#TomCat is a thing
http://example.org/tuto/resource#RexDog is a thing
http://example.org/tuto/resource#SnuffMonkey is a thing


## Adding data
When we have write access to a database via SPARQL, we can insert data. If we want to create new data but are not allowed to add it, we can CONSTRUCT it for download instead.

### Q12: Search for lonely pets and try to find them a nice owner :)

Insert the triple using an INSERT.

In [16]:
# First we show that ttr:SnuffMonkey has None as owner
q12 = g.query(prefixes+"""
select
  ?pet
  ?owner
where {
  ?pet a / rdfs:subClassOf+ tto:Animal .
  optional {?owner tto:pet ?pet}
}
""")

for row in q12:
    print("%s is a pet with the owner %s" % row)


http://example.org/tuto/resource#LunaCat is a pet with the owner http://example.org/tuto/resource#John
http://example.org/tuto/resource#TomCat is a pet with the owner http://example.org/tuto/resource#John
http://example.org/tuto/resource#RexDog is a pet with the owner http://example.org/tuto/resource#William
http://example.org/tuto/resource#SnuffMonkey is a pet with the owner None


In [17]:
# we use update instead of query.
i12 = g.update(prefixes+"""
INSERT DATA { dbpedia:Harrison_Ford tto:pet ttr:SnuffMonkey.}
""")

# Note the lack of a where clause! INSERT DATA adds the triple regardless of the data that is already there.


In [22]:
q12again = g.query("""
select
  ?pet
  ?owner
where {
  ?pet a / rdfs:subClassOf+ tto:Animal .
  optional {?owner tto:pet ?pet}
}
""")

for row in q12again:
    print("%s is a pet with the owner %s" % row)

http://example.org/tuto/resource#LunaCat is a pet with the owner http://example.org/tuto/resource#John
http://example.org/tuto/resource#TomCat is a pet with the owner http://example.org/tuto/resource#John
http://example.org/tuto/resource#RexDog is a pet with the owner http://example.org/tuto/resource#William
http://example.org/tuto/resource#SnuffMonkey is a pet with the owner http://dbpedia.org/resource/Harrison_Ford


## Federated querying

There are many databases in the world, and in the end we can't put them all into one system. Federated querying is a way to make two or more databases act as if they were one database.

We use `SERVICE` for this. `SERVICE` plus the IRI of the other database's endpoint is enough to tell your database to ask a second database for more data.

### Q13: Combining our mini database with a big one

The ?birthday variable is retrieved from [DBpedia](https://www.dbpedia.org/) using a federated query  
through the `SERVICE` keyword. The results retrieved remotely are combined with the local graph,
and solutions are built from both the remote and the local graph pattern matches.
The local graph pattern finds the pet, while the pattern inside `SERVICE` finds the birth date.



In [1]:
q13 = g.query(prefixes+"""
select * where {
   bind (dbpedia:Harrison_Ford as ?subj)
   ?subj tto:pet ?pet .
   service <https://dbpedia.org/sparql> {
       ?subj dbp:birthDate ?birthday .
 }
}""")

for row in q13:
    print(row)
