# Lab 3: Querying RDF graphs using SPARQL and Entity linking


## Wikidata and its SPARQL endpoint

Wikidata is an RDF dataset that contains knowledge about the world. Wikidata is a **knowledge graph**,
similar to the Google Knowledge Graph we discussed during the lectures (which Google uses to display the
information panel on the right of the results page for certain queries, such as people). It works similarly to
Wikipedia: volunteers insert high-quality information, and scripts may add
some simpler data.
Check the following resources and properties present on Wikidata:
- https://www.wikidata.org/wiki/Q1
- https://www.wikidata.org/wiki/Q2
- https://www.wikidata.org/wiki/Q3 (the identifiers Q1, Q2, ... are assigned sequentially; low numbers often
correspond to important concepts, but not always)
- https://www.wikidata.org/wiki/Q666
- https://www.wikidata.org/wiki/Wikidata:List_of_properties/all_in_one_table


In this lab we will follow a tutorial that, in addition to introducing Wikidata, gives a very
good presentation of SPARQL, the query language used to extract data from an RDF graph:

https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial

After you have completed the tutorial, check the list of example queries here:
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
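Queries can also be sent to the endpoint from Python. The following is a minimal sketch, not the only way to do it: it assumes the public endpoint URL `https://query.wikidata.org/sparql`, uses only the standard library, and the helper names `build_instance_query` and `run_query` as well as the User-Agent string are illustrative choices. The query lists instances (`wdt:P31`) of a class, for example house cats (`wd:Q146`) as in the tutorial.

```python
import json
import urllib.parse
import urllib.request

# Assumed public SPARQL endpoint of Wikidata.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_instance_query(class_qid: str, limit: int = 10) -> str:
    """Return a SPARQL query listing instances of the given Wikidata class."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def run_query(query: str) -> list:
    """Send the query to the endpoint and return the list of result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_ENDPOINT + "?" + params,
        # Illustrative User-Agent; replace it with one describing your script.
        headers={"User-Agent": "lab3-sparql-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


# Usage (requires network access):
# for row in run_query(build_instance_query("Q146", limit=5)):
#     print(row["item"]["value"], row["itemLabel"]["value"])
```

Keeping the query construction separate from the network call makes it easy to print and inspect the query, or paste it into the web interface, before running it.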


**Exercise 1** Based on your interests, formulate a question that could be answered using Wikidata, write the SPARQL query for it, and run it on the Wikidata SPARQL endpoint. You will very likely use Wikidata for your project, so make sure you understand the main concepts of SPARQL.

## Querying BnF data
We now look at the data shared by the BnF (Bibliothèque nationale de France). As with Wikidata, there is an online SPARQL endpoint for BnF: http://data.bnf.fr/sparql/
Answer the following questions. Limit the number of results returned by using the keyword LIMIT.
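As a starting point for exploring an unknown dataset, the two generic queries below use only standard SPARQL and make no assumption about the BnF vocabulary. The function names are illustrative, and the resulting strings can be pasted into the endpoint's web form.

```python
def types_query(limit: int = 20) -> str:
    """Query listing some of the classes (rdf:type values) used in a dataset."""
    return f"SELECT DISTINCT ?type WHERE {{ ?s a ?type }} LIMIT {limit}"


def predicates_query(resource_uri: str, limit: int = 50) -> str:
    """Query listing the properties attached to one given resource."""
    return f"SELECT DISTINCT ?p WHERE {{ <{resource_uri}> ?p ?o }} LIMIT {limit}"


print(types_query(10))
# http://example.org/resource is a placeholder, not a real BnF URI.
print(predicates_query("http://example.org/resource"))
```

Without `LIMIT`, such exploratory queries may return a very large number of rows or time out.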

Check:
- What types of publications are available?
- What information is stored on the works?
- What information is stored on the writers?

For this, look at the data model description https://data.bnf.fr/images/modele_donnees_2018_02.pdf and try the queries from this tutorial:
https://www.biblibre.com/fr/blog/comment-interroger-la-bnf-et-wikidata-avec-sparql/
Note that when you try the query on "The hitch hiker's guide to the galaxy", you can find the equivalent work in Wikidata under the property owl:sameAs.

**Exercise 2** Using this information, answer the following questions using SPARQL queries.
- Are there works published after 2000? If yes, find some examples.
- Are there any works published before 1800? If yes, find some examples.
- Which authors published novels in the 21st century? Return their city if possible.
- Which living authors have published at least one book in the last 20 years?
- How many authors have published a book in the 19th century?


## Equivalence Links in RDF
As discussed in class and seen in the previous exercise, OWL makes it possible to state that two entities, two classes, or two properties are equivalent.

- If r_1 **owl:sameAs** r_2 for two resources r_1, r_2,
then r_1 is the same resource as r_2 and they should be treated as such.
- If p_1 **owl:equivalentProperty** p_2 for two properties p_1, p_2,
then p_1 is the same property as p_2 and they should be treated as such.
- If c_1 **owl:equivalentClass** c_2 for two classes c_1, c_2,
then c_1 is the same class as c_2 and they should be treated as such.
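Because equivalence is symmetric and transitive, a set of `owl:sameAs` links partitions the resources into groups that all denote the same thing. The sketch below illustrates this with a small union-find structure; the function name and the prefixed identifiers in the example are hypothetical, not real links.

```python
from collections import defaultdict


def same_as_clusters(links):
    """Group resources connected by sameAs links into equivalence classes."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then follow parents upward.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return list(clusters.values())


# Hypothetical links: two of them chain three identifiers together.
links = [
    ("dbpedia:Victor_Hugo", "wikidata:Q535"),
    ("wikidata:Q535", "bnf:victor_hugo"),
    ("dbpedia:Paris", "wikidata:Paris"),
]
for cluster in same_as_clusters(links):
    print(sorted(cluster))
```

Here the first two links place all three Victor Hugo identifiers in one group even though no link connects the first and third directly.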


The website http://www.sameas.cc/ collects millions of such equivalence links.

**Exercise 3**
In the SPARQL endpoint of the French version of DBPedia, http://fr.dbpedia.org/sparqlEditor/index.html, find: 1) the resources linked by the property *owl:sameAs*, 2) the properties linked by the property *owl:equivalentProperty*, and 3) the classes linked by the property *owl:equivalentClass*.

**Exercise 4**
Compare different datasets containing information about Victor Hugo: DBPedia http://fr.dbpedia.org/page/Victor_Hugo, BNF https://data.bnf.fr/11907966/victor_hugo and Wikidata https://www.wikidata.org/wiki/Q535. Determine which knowledge graph contains the most properties about Victor Hugo.

## Similarity functions for strings in Python

**Exercise 5**
Check the jellyfish library https://pypi.org/project/jellyfish/ . Choose a word, for example *lillois*, and several variations of it obtained by changing the capitalization or some letters, for example *lilloise*. Compare the results of the different distance functions.
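To make clear what such a distance measures, here is a plain-Python sketch of the Levenshtein edit distance (the minimum number of single-character insertions, deletions, and substitutions). It is an illustration written for this lab, not the jellyfish implementation; jellyfish provides this metric as `levenshtein_distance`, alongside others.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("lillois", "lilloise"))                 # 1: one insertion
print(levenshtein("lillois", "Lillois"))                  # 1: case counts as a change
print(levenshtein("lillois".lower(), "Lillois".lower()))  # 0 after lowercasing
```

The last two lines show why normalizing the case before comparing strings can change the result.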


In [None]:
# Solution exercise 5

**Exercise 6**
Check the spaCy library, in particular the part on word embeddings https://spacy.io/usage/linguistic-features#vectors-similarity . Note that vector embeddings are language dependent: for English you need to download and use the English package (*python -m spacy download en_core_web_lg*), while for French you need *python -m spacy download fr_core_news_lg*. Test the library by computing similarities between words or between sentences.
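By default, spaCy estimates similarity as the cosine similarity between vectors. The sketch below computes that measure by hand on toy three-dimensional vectors, which are invented for illustration; real embeddings have hundreds of dimensions and come from the downloaded model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.0]
print(cosine_similarity(cat, dog))
print(cosine_similarity(cat, car))
```

The measure depends only on the direction of the vectors, not on their length, so scaling a vector does not change its similarity to others.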

## Similarity between Entities

First, we look at how to match entities from different files.
The goal is not to implement a particular algorithm, but to design your own method for the datasets at hand.

We will link the entities of two citation databases from two well-known websites: DBLP http://dblp.uni-trier.de and ACM https://dl.acm.org/ .
The files are listed on the website https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution and can be downloaded from http://dbs.uni-leipzig.de/file/DBLP-ACM.zip .
The zip file also contains the correct matches, so that you can compare the output of your approach with the ground truth.

To read these files, you can use Python's csv library or pandas, as in previous labs. Note that you might need to use the encoding *latin-1*.
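A minimal sketch of reading such a file with the csv module is shown below. The helper name `read_records` and the path in the usage comment are assumptions; adapt them to the files you extract from the zip archive.

```python
import csv


def read_records(path, encoding="latin-1"):
    """Read a CSV file with a header row into a list of dictionaries."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


# Usage (hypothetical path, adjust to your extracted files):
# records = read_records("DBLP-ACM/some_file.csv")
# print(len(records), records[0])
```

With pandas, the equivalent is `pandas.read_csv(path, encoding="latin-1")`.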

**Exercise 7.**
Propose an algorithm to find matching entities in the two files. Compare your results with the correct matches file and compute the usual measures (precision, recall, F1) to evaluate your algorithm.
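The evaluation measures can be computed as sketched below, assuming that both the predicted matches and the correct matches are represented as sets of identifier pairs. The function name and the identifiers in the example are made up for illustration.

```python
def evaluate(predicted, truth):
    """Return (precision, recall, F1) for predicted pairs against true pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Precision: share of predicted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of correct pairs that were found.
    recall = true_positives / len(truth) if truth else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up identifiers: 2 of 3 predictions are correct, 2 of 4 true pairs found.
predicted = {("d1", "a1"), ("d2", "a2"), ("d3", "a9")}
truth = {("d1", "a1"), ("d2", "a2"), ("d4", "a4"), ("d5", "a5")}
print(evaluate(predicted, truth))
```

In this example precision is 2/3, recall is 1/2, and F1 is 4/7.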

In [None]:
# Solution exercise 7

## Similarity between nodes and tuples

In this part, you will create an algorithm to link entities in an RDF graph to tuples in a CSV file.

**Exercise 8.**
Using the files deputy.csv and deputy.nt from Moodle, find matches between the entities of the two files.
Both files contain data about French deputies. The CSV file is extracted from the website
nosdeputes.fr https://github.com/regardscitoyens/nosdeputes.fr/blob/master/doc/api.md#liste-des-parlementaires and the RDF file from the French DBPedia website http://fr.dbpedia.org/sparqlEditor/index.html
To read an RDF graph in Python, use this library: https://rdflib.readthedocs.io/en/stable/#getting-started
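One possible first step, sketched below under assumptions, is to bring names from both sources into a comparable form before applying a similarity function. The sketch assumes that entity URIs end with a name-like local part, as DBpedia resource URIs usually do; the function names and the example person are hypothetical and do not come from the lab files.

```python
import unicodedata
from urllib.parse import unquote


def normalize_name(text: str) -> str:
    """Lowercase, strip accents, and unify separators in a name."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    cleaned = without_accents.lower().replace("_", " ").replace("-", " ")
    return " ".join(cleaned.split())  # collapse repeated whitespace


def label_from_uri(uri: str) -> str:
    """Normalized name taken from the last path segment of a resource URI."""
    return normalize_name(unquote(uri.rsplit("/", 1)[-1]))


# Hypothetical example, not taken from deputy.csv or deputy.nt.
csv_name = "Élise Dupont-Martin"
rdf_uri = "http://fr.dbpedia.org/resource/%C3%89lise_Dupont-Martin"
print(normalize_name(csv_name))
print(label_from_uri(rdf_uri))
```

Both calls print `elise dupont martin`, so an exact comparison already matches this pair; string distances can then handle the remaining near-misses.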

In [None]:
# Solution exercise 8