# WebSem Project: Constructing and Querying a Knowledge Graph in the Cycling Domain

## Introduction

The goal of this project is to extract information from multilingual textual documents about cycling and create a knowledge graph (KG) using the extracted entities and relations. The KG will be compatible with a cycling ontology and queries will be written in SPARQL to retrieve specific information from the KG. The project will be implemented using Jupyter Notebook and the following steps will be followed:

* Collect multilingual textual documents about cycling.
* Pre-process the documents to get clean text files.
* Run named entity recognition (NER) on the documents to extract named entities of the type Person, Organization and Location using spaCy.
* Run co-reference resolution on the input text using CoreNLP.
* Disambiguate the entities with Wikidata using OpenTapioca.
* Run relation extraction using Stanford OpenIE.
* Map the extracted entity types and relations to the cycling ontology you developed in Assignment 1 in order to build an RDF knowledge graph of the domain.
* Load your cycling ontology and the knowledge graph built in the previous step into the Corese engine, as you did for Assignment 2, and write SPARQL queries to retrieve specific information from the KG.

### Useful resources
* The GitHub repository "Building knowledge graph from input data" at https://github.com/varun196/knowledge_graph_from_unstructured_text can be used as inspiration.

### References
* NLTK: https://www.nltk.org/
* spaCy: https://spacy.io/
* Stanford OpenIE: https://nlp.stanford.edu/software/openie.html
* OpenTapioca: https://opentapioca.org/
* Corese engine: https://project.inria.fr/corese/
* Wikidata: https://www.wikidata.org/

## Step 1: Collect multilingual textual documents about cycling
For this mini project, we will collect multilingual textual documents about cycling from various sources such as news articles, blog posts, and race reports. We will download the documents and save them in a directory called `cycling_docs`.

The documents to download are:

* English:
 - https://en.wikipedia.org/wiki/2022_Tour_de_France
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21
 - https://www.bbc.com/sport/cycling/61940037
 - https://www.bbc.com/sport/cycling/62017114 (stage 1)
 - https://www.bbc.com/sport/cycling/62097721 (stage 7)
 - https://www.bbc.com/sport/cycling/62153759 (stage 11)
 - https://www.bbc.co.uk/sport/cycling/62285420 (stage 21)

* French:
 - https://fr.wikipedia.org/wiki/Tour_de_France_2022
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html

In [1]:
#
# Feel free to install more dependencies if needed!
#

# Install jusText for automatically extracting text from web pages
!pip install --quiet jusText

# Install nltk for text processing
!pip install --quiet nltk

# Install spaCy for NER extraction
!pip install --quiet spacy
!python -m spacy download en_core_web_lg
!python -m spacy download fr_core_news_lg

# Install pycorenlp for Stanford CoreNLP
!pip install --quiet pycorenlp

# Install pandas for data visualization
!pip install --quiet pandas

# Install rdflib for writing RDF
!pip install --quiet rdflib


In [2]:
# Import necessary modules
import requests
import justext
import os
from urllib.parse import urlsplit


# Define a function to get filename from URL
def get_filename_from_url(url):
  urlpath = urlsplit(url).path
  return os.path.basename(urlpath)


# Define a function to download URLs and extract text
def download_urls(urls_list, language):
  # Loop over each URL in the list
  for url in urls_list:
    # Fetch and extract text from the URL using jusText
    response = requests.get(url)
    paragraphs = justext.justext(
      response.content,
      justext.get_stoplist(language.capitalize()),
      no_headings=True,
      max_heading_distance=150,
      length_low=70,
      length_high=140,
      stopwords_low=0.2,
      stopwords_high=0.3,
      max_link_density=0.4
    )
    extracted_text = '\n'.join(list(filter(None, map(
      lambda paragraph: paragraph.text if not paragraph.is_boilerplate else '',
      paragraphs
    ))))

    # Truncate text if it's too long
    extracted_text = extracted_text[0:10000]

    # Create the output directory if it does not exist
    output_dir = os.path.join('cycling_docs', language)
    os.makedirs(output_dir, exist_ok=True)

    # Save extracted text as a .txt file
    filename = get_filename_from_url(url)
    output_path = os.path.join(output_dir, f'{filename}.txt')
    with open(output_path, 'w') as f:
      f.write(extracted_text)

    print(f'Downloaded {url} into {output_path}')


# List of URLs to download
urls_list_english = [
  'https://en.wikipedia.org/wiki/2022_Tour_de_France',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21',
  'https://www.bbc.com/sport/cycling/61940037',
  'https://www.bbc.com/sport/cycling/62017114',
  'https://www.bbc.com/sport/cycling/62097721',
  'https://www.bbc.com/sport/cycling/62153759',
  'https://www.bbc.co.uk/sport/cycling/62285420',
]
urls_list_french = [
  'https://fr.wikipedia.org/wiki/Tour_de_France_2022',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html',
]

# Download the listed URLs
download_urls(urls_list_english, 'english')
download_urls(urls_list_french, 'french')

Downloaded https://en.wikipedia.org/wiki/2022_Tour_de_France into cycling_docs/english/2022_Tour_de_France.txt
Downloaded https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11 into cycling_docs/english/2022_Tour_de_France,_Stage_1_to_Stage_11.txt
Downloaded https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21 into cycling_docs/english/2022_Tour_de_France,_Stage_12_to_Stage_21.txt
Downloaded https://www.bbc.com/sport/cycling/61940037 into cycling_docs/english/61940037.txt
Downloaded https://www.bbc.com/sport/cycling/62017114 into cycling_docs/english/62017114.txt
Downloaded https://www.bbc.com/sport/cycling/62097721 into cycling_docs/english/62097721.txt
Downloaded https://www.bbc.com/sport/cycling/62153759 into cycling_docs/english/62153759.txt
Downloaded https://www.bbc.co.uk/sport/cycling/62285420 into cycling_docs/english/62285420.txt
Downloaded https://fr.wikipedia.org/wiki/Tour_de_France_2022 into cycling_docs/french/Tour_de_France_2022.txt
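
One edge case of `get_filename_from_url` is worth noting: a URL whose path ends with a slash has an empty basename, so the file would be saved as `.txt`. The following is a minimal sketch of a more defensive variant. The name `safe_filename_from_url` and the `example.org` URL are hypothetical and only serve as an illustration.

```python
import os
from urllib.parse import urlsplit

def safe_filename_from_url(url):
  """Return the last path segment of a URL, or the host name if the path is empty."""
  parts = urlsplit(url)
  # Strip a trailing slash so that '/sport/cycling/' yields 'cycling'
  name = os.path.basename(parts.path.rstrip('/'))
  return name or parts.netloc

print(safe_filename_from_url('https://www.bbc.com/sport/cycling/61940037'))
print(safe_filename_from_url('https://www.bbc.com/sport/cycling/'))
print(safe_filename_from_url('https://example.org/'))
```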

## Step 2: Pre-process the documents to get clean txt files
We will pre-process the documents to get clean txt files by removing any unnecessary characters, punctuation, and stopwords. We will use Python's [re](https://docs.python.org/3/library/re.html) and [nltk](https://www.nltk.org/) libraries for this purpose. We will save the results in a `clean_docs` folder.

In [3]:
"""
Document class which holds all the necessary variables for the purpose of this
project.
"""
class Document:
  def __init__(self, text, language = None, raw_text = None, filepath = None):
    self.language = language   # Language of the document
    self.raw_text = raw_text   # Original text before cleaning
    self.text = text           # Text after cleaning
    self.resolved_text = None  # Text after resolving co-references
    self.filepath = filepath   # Path to the document file
    self.spacy_entities = []   # List of spaCy entities
    self.coreferences = None   # CoreNLP coreferences object
    self.wiki_entities = {}    # Dictionary of Wikidata entities
    self.relations = []        # List of OpenIE relations

In [4]:
# 📝 TODO: Import the necessary libraries for natural language processing
import re
import nltk
import os

nltk.download('punkt')
nltk.download('stopwords')

def clean_text(dirty_text, language):
  # 📝 TODO: Define a function to clean text (words tokenization, stopwords
  #          removal, ...).
  # `cleaned_text = ...`

  # Create list with stop words
  stop_words = set(nltk.corpus.stopwords.words(language))

  # Convert text to lower case
  cleaned_text = dirty_text.lower()

  # Remove punctuation marks, extra spaces and all unnecessary characters
  cleaned_text = re.sub(r"\[[^\]]*\]", " ", cleaned_text) # Remove annotations like "[2]"
  cleaned_text = re.sub(r"[^a-zA-Z0-9À-ÖØ-öø-ÿ']", " ", cleaned_text)
  cleaned_text = nltk.tokenize.word_tokenize(cleaned_text)

  # Remove all non-essential words
  cleaned_text = [w for w in cleaned_text if w not in stop_words]

  # Return the cleaned text
  return " ".join(cleaned_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
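
To see what the cleaning steps do to a single sentence, here is a minimal, dependency-free sketch. It is not the notebook's `clean_text`: the stop-word set `DEMO_STOP_WORDS` is a tiny hypothetical list, and tokenization is a plain whitespace split instead of `nltk.tokenize.word_tokenize`.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook uses
# nltk.corpus.stopwords instead.
DEMO_STOP_WORDS = {"the", "a", "in", "of", "and", "to"}

def clean_text_demo(dirty_text, stop_words=DEMO_STOP_WORDS):
  text = dirty_text.lower()
  text = re.sub(r"\[[^\]]*\]", " ", text)        # drop "[2]"-style annotations
  text = re.sub(r"[^a-z0-9à-öø-ÿ']", " ", text)  # keep letters, digits, apostrophes
  tokens = text.split()                          # whitespace tokenization
  return " ".join(t for t in tokens if t not in stop_words)

sample = "The 2022 Tour de France started in Copenhagen[1] and ended in Paris."
print(clean_text_demo(sample))
# → 2022 tour de france started copenhagen ended paris
```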


In [5]:
# Define a function to process a file and write the result to a new file
def process_file(file, language):
  # Open the file in read-only mode and read its content as a single string.
  # `read()` keeps the original line breaks; joining `readlines()` with '\n'
  # would double them, because each line already ends with a newline.
  with open(file, 'r') as f:
    raw_text = f.read()

  # Clean the text using the `clean_text` function
  cleaned_text = clean_text(raw_text, language)

  # Create a new document and return it
  doc = Document(cleaned_text, language=language, raw_text=raw_text, filepath=os.path.abspath(file))
  return doc


# Create a list to store all our documents
docs = []

# Loop through all the files in the "cycling_docs" folder
folder = 'cycling_docs'
for language in os.listdir(folder):
  for filename in os.listdir(os.path.join(folder, language)):
    # Construct the full path to the file
    file = os.path.join(folder, language, filename)

    # Check if the file is a regular file and has a .txt extension
    if os.path.isfile(file) and file.endswith('.txt'):
      # Process the file and append the new document to our list
      doc = process_file(file, language)
      docs.append(doc)

In [6]:
# Display the text of the first document
display(docs[0].text)

"yves lampaert stage one tour de france defending champion tadej pogacar took time main rivals opening individual time trial copenhagen lampaert rode superbly negotiate damp conditions finish five seconds clear fellow belgian wout van aert pogacar two seconds back third place nine seconds ahead slovenian compatriot primoz roglic britain 's adam yates geraint thomas finished 13th 18th fellow ineos grenadiers rider tom pidcock sandwiched pair taking advantage favourable later conditions roll line 15th thomas ' mixed fortunes thomas 36 tour 2018 rode stage gilet forgetting take start lost 18 seconds first part flat technical route around danish capital worst first half time trial ever done thomas said wanted start fairly conservatively power wise everyone telling go easy corners 's three weeks crash first corners cornered like wife n't ridden bike 12 years unbelievable realised still gilet went first time check 18 seconds took pin know could done better ride annoying ' 'm farmer 's son be

## Step 3: Run named entity recognition (NER) on the documents
We will use [spaCy](https://spacy.io)'s pre-trained models to perform NER on the documents and extract the entities of type PER/ORG/LOC. We will store the extracted entities in the `spacy_entities` field of each `Document`.

In [9]:
# 📝 TODO: Import spaCy and other libraries that might be required for entity
#          extraction
import spacy

def extract_entities(text, language):
  # 📝 TODO: Use spaCy to extract named entities and store them into a list.
  # The format of the end result should look like this:
  # ```
  # entities = [
  #   { "text": "Tour de France", "label": "ORG" },
  #   { "text": "Peter Sagan", "label": "PERSON" },
  # ]
  # ```
  if language == "english":
    nlp = spacy.load('en_core_web_lg')
  elif language == "french":
    nlp = spacy.load('fr_core_news_lg')
  else:
    raise Exception("Language not supported!")

  doc = nlp(text)
  entities = []
  for entity in doc.ents:
    entities.append({
        "text":entity.text,
        "label":entity.label_
    })

  # Return extracted entities
  return entities
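
`extract_entities` returns every entity spaCy finds, including dates, times and quantities. Since this step only targets persons, organizations and locations, the list can be filtered afterwards. The sketch below is one possible way to do it. The helper name `filter_entities` and the label set `KEPT_LABELS` are assumptions: spaCy's English models use `PERSON` and `GPE`, while the French models use `PER`, so the set covers both.

```python
# Labels to keep. This set is an assumption covering both the English
# (PERSON, ORG, GPE, LOC) and French (PER, ORG, LOC) spaCy label schemes.
KEPT_LABELS = {"PERSON", "PER", "ORG", "LOC", "GPE"}

def filter_entities(entities, kept_labels=KEPT_LABELS):
  """Keep entities with a wanted label and drop exact duplicates, preserving order."""
  seen = set()
  filtered = []
  for entity in entities:
    key = (entity["text"], entity["label"])
    if entity["label"] in kept_labels and key not in seen:
      seen.add(key)
      filtered.append(entity)
  return filtered

demo = [
  {"text": "Tour de France", "label": "ORG"},
  {"text": "18 seconds", "label": "TIME"},
  {"text": "Copenhagen", "label": "GPE"},
  {"text": "Tour de France", "label": "ORG"},
]
print(filter_entities(demo))
```

It could be applied as `doc.spacy_entities = filter_entities(extract_entities(doc.text, doc.language))`.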

In [None]:
# Extract entities for each document
for doc in docs:
  doc.spacy_entities = extract_entities(doc.text, doc.language)

Display entities which have been extracted:

In [None]:
# 📝 TODO: Display the extracted entities for the first document
for entity in docs[0].spacy_entities:
  print(entity)

## Step 4: Run co-reference resolution on the input text
We will use CoreNLP to perform [co-reference resolution](https://en.wikipedia.org/wiki/Coreference) on the input text and resolve coreferences.

For this project, we will use a hosted version of CoreNLP at: https://corenlp.tools.eurecom.fr/ (username: `websem`, password: `eurecom`). Feel free to try out the web interface before writing the code.

First, we compute the annotations and store them into the `coreferences` variable of our Document:

In [None]:
import json
from pycorenlp import StanfordCoreNLP


# Set up the CoreNLP client
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function which computes coreferences for a given text and language
def compute_coreferences(text, language):
  props = {
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,coref',
    'pipelineLanguage': language[:2],
    'outputFormat': 'json'
  }

  # Annotate the text for co-reference resolution
  corenlp_output = nlp.annotate(text, properties=props)
  try:
    corenlp_output = json.loads(corenlp_output)
  except Exception as err:
    print(f'Unexpected response: {corenlp_output}')
    raise

  return corenlp_output

In [None]:
# Test co-references computation
example = compute_coreferences("John is a software engineer. He is very talented. Sarah is a designer. She works with him.", language="en")

# Pretty-print them
print(json.dumps(example, indent=2))

{
  "sentences": [
    {
      "index": 0,
      "basicDependencies": [
        {
          "dep": "ROOT",
          "governor": 0,
          "governorGloss": "ROOT",
          "dependent": 5,
          "dependentGloss": "engineer"
        },
        {
          "dep": "nsubj",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 1,
          "dependentGloss": "John"
        },
        {
          "dep": "cop",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 2,
          "dependentGloss": "is"
        },
        {
          "dep": "det",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 3,
          "dependentGloss": "a"
        },
        {
          "dep": "compound",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 4,
          "dependentGloss": "software"
        },
        {
          "dep": "punct",
          "governor": 5,
          

In [None]:
# Compute co-references for all documents
for doc in docs:
  if doc.language == "english":  # CoreNLP Coref-resolution only supports english
    doc.coreferences = compute_coreferences(doc.raw_text, doc.language)

The first step is to display the co-reference of each mention in the text.

For example:

> "He" -> "John"
>
> "She" -> "Sarah"
>
> "him" -> "John"

In [None]:
for coref_cluster in example['corefs'].values():
  # 📝 TODO: Print each co-references like so: "He" -> "John"
  # 💡 Each cluster has one representative mention, flagged with `isRepresentativeMention: True`

  # Find first the representative mention
  representativeMention = None
  for r in coref_cluster:
    if r['isRepresentativeMention']:
      representativeMention = r
      break

  # For all the non representative mention we print them like: "him" -> "John"
  for r in coref_cluster:
    if not r['isRepresentativeMention']:
      print("\"{}\" -> \"{}\"".format(r['text'], representativeMention['text']))


"She" -> "Sarah"
"He" -> "John"
"him" -> "John"


### 🏆 Challenge

Replace values within the text with their resolved co-reference. For example, with the following text:

> **John** is a software engineer. **He** is very talented.

In the second sentence, the pronoun "He" would be replaced with its co-reference, and the final text would become:

> **John** is a software engineer. **John** is very talented.

In [None]:
# Define a function which resolves coreferences inside a document
def resolve_coreferences(raw_text, corefs):
  corenlp_output = corefs['corefs']
  resolved_text = raw_text

  # 📝 TODO: Replace values within the text with their resolved co-reference.
  # 💡 You can start by printing the `corenlp_output` object to understand its
  #    structure.

  # Build a dictionary with the strings we are going to replace and their replacement (representative mention)
  replacements = {}
  for coref_cluster in corenlp_output.values():
    # Find first the representative mention
    representativeMention = None
    for r in coref_cluster:
      if r['isRepresentativeMention']:
        representativeMention = r
        break
    # For all the non representative mention include them as keys in the replacements dictionary
    for r in coref_cluster:
      if not r['isRepresentativeMention']:
        replacements[r['text']] =  representativeMention['text']

  # Replace all non representative mentions!
  for key, value in replacements.items():
    # Only replace a mention when it stands alone as a word: it must be preceded
    # by a space and followed by a space, a period, a comma, or an apostrophe.
    # This replaces " he " without touching the "he" inside "the".
    resolved_text = resolved_text.replace(" " + key + " ", " " + value + " ")
    resolved_text = resolved_text.replace(" " + key + ".", " " + value + ".")
    resolved_text = resolved_text.replace(" " + key + ",", " " + value + ",")
    resolved_text = resolved_text.replace(" " + key + "'", " " + value + "'")

  return resolved_text

In [None]:
# Test resolving co-references
original_text = "John is a software engineer. He is very talented. Sarah is a designer. She works with him."
corefs = compute_coreferences(original_text, language="en")
resolved_text = resolve_coreferences(original_text, corefs)
print(original_text)
print(resolved_text)

John is a software engineer. He is very talented. Sarah is a designer. She works with him.
John is a software engineer. John is very talented. Sarah is a designer. Sarah works with John.
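
String replacement works on this short example, but on a long document it rewrites every occurrence of a mention's text, wherever it appears. A position-based alternative is sketched below. It assumes the CoreNLP JSON output exposes `sentNum`, `startIndex`, `endIndex` and `type` on each mention, and `index`, `originalText` and `after` on each token; check these field names against a real response before relying on them. The `demo_output` dictionary is hand-written for illustration and is not real server output.

```python
def resolve_by_position(corenlp_output, pronouns_only=True):
  """Rebuild the text token by token, replacing mentions at their exact position."""
  # Map (sentence number, start token index) -> (end token index, replacement)
  spans = {}
  for cluster in corenlp_output["corefs"].values():
    representative = next((m for m in cluster if m["isRepresentativeMention"]), None)
    if representative is None:
      continue
    for mention in cluster:
      if mention["isRepresentativeMention"]:
        continue
      if pronouns_only and mention.get("type") != "PRONOMINAL":
        continue
      key = (mention["sentNum"], mention["startIndex"])
      spans[key] = (mention["endIndex"], representative["text"])

  pieces = []
  for sent_num, sentence in enumerate(corenlp_output["sentences"], start=1):
    tokens = sentence["tokens"]
    i = 0
    while i < len(tokens):
      token = tokens[i]
      span = spans.get((sent_num, token["index"]))
      if span is None:
        pieces.append(token["originalText"] + token["after"])
        i += 1
      else:
        end_index, replacement = span
        # Token indices are 1-based and endIndex is exclusive
        last_token = tokens[end_index - 2]
        pieces.append(replacement + last_token["after"])
        i = end_index - 1
  return "".join(pieces)

def _tok(index, text, after):
  return {"index": index, "originalText": text, "after": after}

# Hand-written stand-in for a CoreNLP response to "John is talented. He works."
demo_output = {
  "sentences": [
    {"tokens": [_tok(1, "John", " "), _tok(2, "is", " "), _tok(3, "talented", ""), _tok(4, ".", " ")]},
    {"tokens": [_tok(1, "He", " "), _tok(2, "works", ""), _tok(3, ".", "")]},
  ],
  "corefs": {
    "1": [
      {"text": "John", "type": "PROPER", "sentNum": 1, "startIndex": 1, "endIndex": 2, "isRepresentativeMention": True},
      {"text": "He", "type": "PRONOMINAL", "sentNum": 2, "startIndex": 1, "endIndex": 2, "isRepresentativeMention": False},
    ]
  },
}
print(resolve_by_position(demo_output))
# → John is talented. John works.
```

Restricting the replacement to pronominal mentions is a design choice: replacing longer nominal mentions with their representative tends to produce awkward sentences.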


In [None]:
# Resolve co-references for all documents
for doc in docs:
  if doc.coreferences is not None:
    doc.resolved_text = resolve_coreferences(doc.raw_text, doc.coreferences)

In [None]:
# 📝 TODO: Display text with resolved co-references for any document of your choice
print(docs[0].resolved_text)

The 2022 Tour de France is the 109th edition of The 2022 Tour de France. a furious fight for the break 's started in Copenhagen, Denmark on 1 July[1] and ended with the final stage at Champs-Élysées, Paris on 24 July.[2]

The twelfth stage featured the race 's's queen stage as the riders travelled from Briançon to Alpe d'Huez. the riders gradually climbed from the get-go, passing through the intermediate sprint in Le Monêtier-les-Bains after 11.8 kilometres (7.3 mi). Immediately afterwards, the riders made Team Jumbo -- Visma proceeded to block the road and allowed the break 's to extend their advantage to more than 14 minutes , ensuring that the break 's will fight for the race 's queen stage win way back up the Col du Galibier but this time, the riders went up the riders categorie Col du Lautaret side, which is 23 kilometres (14 mi) long with an average of 5.1 percent. After descending the Galibier and the Télégraphe, the riders made Team Jumbo -- Visma proceeded to block the road an

## Step 5: Disambiguate the entities with Wikidata using OpenTapioca
We will use [OpenTapioca](https://opentapioca.org/) to disambiguate the entities with Wikidata and retrieve their unique identifiers (QIDs).

In [None]:
import requests

# Define the API endpoint URL
opentapioca_url = 'https://opentapioca.wordlift.io/api/annotate'

def opentapioca_annotate(text, language):
  # Define the request parameters
  params = {
    'query': text,
    'lang': language[:2]
  }

  # Send the GET request to the OpenTapioca API endpoint
  response = requests.get(opentapioca_url, params=params)

  # 📝 TODO: Extract the entities from the API response object
  # 💡 You can start by printing the `response` object to understand its structure.

  # Convert response to json
  j = json.loads(response.text)

  # Save only the annotations that have a best_qid
  entities = {}
  for annotation in j['annotations']:
    start = annotation['start']
    end = annotation['end']
    if annotation['best_qid'] is not None:
      entities[j['text'][start:end]] = annotation['best_qid']

  # Return entities
  return entities

In [None]:
for doc in docs:
  doc.wiki_entities = {}
  for j in range(0, len(doc.raw_text), 4000):
    doc.wiki_entities |= opentapioca_annotate(doc.raw_text[j:j+4000], doc.language)
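
Slicing the text every 4000 characters can cut a word, and therefore an entity name, in half. A minimal sketch of a chunker that prefers to cut at a line break or a space is shown below. The helper name `chunk_text` is an assumption, and multi-word names can still be split across two chunks, so this only reduces the problem.

```python
def chunk_text(text, max_chars=4000):
  """Split text into chunks of at most max_chars, cutting at whitespace when possible."""
  chunks = []
  start = 0
  while start < len(text):
    end = min(start + max_chars, len(text))
    if end < len(text):
      # Prefer the last line break inside the window, else the last space
      cut = text.rfind("\n", start, end)
      if cut <= start:
        cut = text.rfind(" ", start, end)
      if cut > start:
        end = cut + 1
    chunks.append(text[start:end])
    start = end
  return chunks

demo_text = "Tadej Pogacar won. Jonas Vingegaard won."
demo_chunks = chunk_text(demo_text, max_chars=12)
print(demo_chunks)
```

Each chunk can then be passed to `opentapioca_annotate(chunk, doc.language)` in place of the fixed slices.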

Display extracted Wikidata entities:

In [None]:
# 📝 TODO: Display extracted Wikidata entities for the first document
for entity in docs[0].wiki_entities.items():
    print("\"{}\" -> \"{}\"".format(entity[0], entity[1]))



## Step 6: Run relation extraction using Stanford OpenIE
We will use Stanford OpenIE to extract the relations between the entities in the input text.

In [None]:
import json
from pycorenlp import StanfordCoreNLP

# Create a StanfordCoreNLP object
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function to extract relations from input text using Stanford OpenIE
def extract_relations(input_text, language):
  output = nlp.annotate(input_text, properties={
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,openie',
    'outputFormat': 'json',
    'pipelineLanguage': language[:2]
  })
  try:
    output = json.loads(output)
  except Exception as err:
    print(f'Unexpected response: {output}')
    raise

  # 📝 TODO: Get relations from the `output` object (subject, relation, object)
  #    and append them to a `relations` list.
  # 💡 You can start by printing the `output` object to understand its structure.

  relations = []
  # Iterate over sentences (an empty list is used if the key is missing)
  for sentence in output.get('sentences', []):
    # Iterate over OpenIE entries, if the sentence has any
    for openie_entry in sentence.get('openie', []):
      # Append (subject, relation, object) to list
      relations.append((openie_entry['subject'], openie_entry['relation'], openie_entry['object']))

  # Return relations
  return relations

In [None]:
for doc in docs:
  if doc.language == "english":  # CoreNLP OpenIE only supports english
    doc.relations = extract_relations(doc.raw_text, doc.language)

Display relations which have been extracted:

In [None]:
# 📝 TODO: Display extracted relations for the first document
print(docs[0].relations)
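
OpenIE usually returns many triples, most of which do not connect two entities of interest. One possible way to narrow them down is to keep only the triples whose subject and object both mention a known entity. The sketch below uses a hypothetical helper, `filter_relations`, and hand-written demo data rather than real OpenIE output.

```python
def filter_relations(relations, entity_names):
  """Keep (subject, relation, object) triples whose subject and object both mention a known entity."""
  names = [name.lower() for name in entity_names]

  def mentions_entity(phrase):
    phrase = phrase.lower()
    return any(name in phrase for name in names)

  return [r for r in relations if mentions_entity(r[0]) and mentions_entity(r[2])]

demo_relations = [
  ("Yves Lampaert", "won", "stage one of the Tour de France"),
  ("Thomas", "forgot", "gilet"),
]
demo_entities = ["Yves Lampaert", "Tour de France", "Thomas"]
print(filter_relations(demo_relations, demo_entities))
```

In the notebook, `entity_names` could be the keys of `doc.wiki_entities`, since both the relations and the Wikidata entities are computed from `doc.raw_text`.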

## Step 7: Implement some mappings between the entity types and relations returned with a given cycling ontology
We will map the extracted entity types and relations to the cycling ontology available at https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl.

In [12]:
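import rdflib
from urllib.parse import quote

# Load the cycling ontology into a new graph
ontology_uri = "https://raw.githubusercontent.com/efrenbg1/WebSem/main/Lab%203/cycling.owl"
g = rdflib.Graph(identifier=rdflib.URIRef(ontology_uri))
g.parse(ontology_uri, format="xml")

# 📝 TODO: Create an RDF graph based on the cycling ontology and using the data
#    from `relations_en`, `entities_en`, and `wiki_entities_en`.
# In this notebook, that data is stored on each Document of the `docs` list.

# Namespace used by the SPARQL queries of Step 8, and the Wikidata entity namespace
CYCLING = rdflib.Namespace("https://purl.org/websem/cycling#")
WD = rdflib.Namespace("http://www.wikidata.org/entity/")
g.bind("cycling", CYCLING)

# ⚠️ Assumption: this label-to-class mapping is a simplification (not every
#    person is a rider, not every organization is a team). Adapt it to your
#    own ontology.
label_to_class = {"PERSON": CYCLING["Rider"], "PER": CYCLING["Rider"], "ORG": CYCLING["Team"]}

for doc in docs:
  # Lower-case the Wikidata surface forms, since NER ran on lower-cased text
  qids = {name.lower(): qid for name, qid in doc.wiki_entities.items()}
  for entity in doc.spacy_entities:
    rdf_class = label_to_class.get(entity["label"])
    if rdf_class is None:
      continue
    # Mint a URI from the entity text, then add its type and name
    node = CYCLING[quote(entity["text"].strip().replace(" ", "_"))]
    g.add((node, rdflib.RDF.type, rdf_class))
    g.add((node, CYCLING["name"], rdflib.Literal(entity["text"])))
    # Link to Wikidata when OpenTapioca found a QID for the same surface form
    if entity["text"] in qids:
      g.add((node, rdflib.OWL.sameAs, WD[qids[entity["text"]]]))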


<Graph identifier=https://raw.githubusercontent.com/efrenbg1/WebSem/main/Lab%203/cycling.owl (<class 'rdflib.graph.Graph'>)>

In [13]:
# Save the result into a file
g.serialize(destination='output.ttl', format='turtle')

<Graph identifier=https://raw.githubusercontent.com/efrenbg1/WebSem/main/Lab%203/cycling.owl (<class 'rdflib.graph.Graph'>)>
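
OpenIE relations are free-text phrases, so they need an explicit mapping before they can become ontology properties. The sketch below shows one possible shape for such a mapping, producing plain URI strings. The table `RELATION_TO_PROPERTY` and the helper names are hypothetical: only the namespace and the `isWinnerOf` property come from the Step 8 queries, and the phrases are illustrative.

```python
from urllib.parse import quote

CYCLING_NS = "https://purl.org/websem/cycling#"

# Hypothetical mapping from OpenIE relation phrases to ontology properties.
# Extend it with the phrases that actually occur in your extracted relations.
RELATION_TO_PROPERTY = {
  "won": "isWinnerOf",
  "wins": "isWinnerOf",
  "is winner of": "isWinnerOf",
}

def to_uri(name):
  """Mint a URI string in the cycling namespace from a free-text name."""
  return CYCLING_NS + quote(name.strip().replace(" ", "_"))

def map_relations(relations):
  """Turn (subject, relation, object) phrases into URI triples, skipping unmapped relations."""
  triples = []
  for subject, relation, obj in relations:
    prop = RELATION_TO_PROPERTY.get(relation.lower().strip())
    if prop is not None:
      triples.append((to_uri(subject), CYCLING_NS + prop, to_uri(obj)))
  return triples

demo_triples = map_relations([
  ("Yves Lampaert", "won", "stage one"),
  ("Thomas", "forgot", "gilet"),
])
print(demo_triples)
```

Each resulting tuple can be added to the graph by wrapping its three strings in `rdflib.URIRef`.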

## Step 8: Load the data in the Corese engine with the ontology and write the SPARQL queries to retrieve specific information from the KG
We will load the ontology and the knowledge graph into the [Corese](https://www.eurecom.fr/~troncy/teaching/websem2023/corese-3.2.3c.jar) engine (the same one you used in Assignment 2) and write SPARQL queries to retrieve specific information from the KG. We will write the following queries:

* 📝 List the name of the cycling teams

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cycling: <https://purl.org/websem/cycling#>

SELECT ?team ?teamName
WHERE {
  ?team rdf:type cycling:Team .
  OPTIONAL { ?team cycling:name ?teamName }
}

* 📝 List the name of the cycling riders

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cycling: <https://purl.org/websem/cycling#>

SELECT DISTINCT ?riderName
WHERE {
  ?rider rdf:type cycling:Rider .
  ?rider cycling:name ?riderName .
}


* 📝 Retrieve the name of the winner of the Prologue

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cycling: <https://purl.org/websem/cycling#>

SELECT DISTINCT ?winnerName
WHERE {
  ?prologue rdf:type cycling:Prologue .
  ?winnerRider cycling:isWinnerOf ?prologue .
  ?winnerRider rdf:type cycling:Rider .
  ?winnerRider cycling:name ?winnerName .
}


📝 We will also write the same 3 queries on Wikidata starting from `Q98043180` to compare the results.
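
As a starting point for the Wikidata comparison, the sketch below builds the request URL for the first query (list the teams) against the public Wikidata SPARQL endpoint. It only builds the URL and sends nothing. It assumes that the participating teams are attached to the race item through property `P1923` (participating team); verify this on the item page before relying on it. The function names are hypothetical.

```python
from urllib.parse import urlencode

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_teams_query(race_qid):
  """Return a SPARQL query listing the teams of a race item (assumes P1923 is used)."""
  return (
    "SELECT ?team ?teamLabel WHERE {\n"
    f"  wd:{race_qid} wdt:P1923 ?team .\n"
    '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
    "}"
  )

def build_wikidata_url(race_qid):
  """Return the GET URL that runs the teams query and asks for JSON results."""
  params = {"query": build_teams_query(race_qid), "format": "json"}
  return WIKIDATA_ENDPOINT + "?" + urlencode(params)

print(build_teams_query("Q98043180"))
print(build_wikidata_url("Q98043180"))
```

The URL can then be fetched with `requests.get(url)`, and the bindings read from `response.json()["results"]["bindings"]`.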