# WebSem Project: Constructing and Querying a Knowledge Graph in the Cycling Domain: SHUBHIKA GARG

## Introduction

The goal of this project is to extract information from multilingual textual documents about cycling and create a knowledge graph (KG) using the extracted entities and relations. The KG will be compatible with a cycling ontology and queries will be written in SPARQL to retrieve specific information from the KG. The project will be implemented using Jupyter Notebook and the following steps will be followed:

* Collect multilingual textual documents about cycling
* Pre-process the documents to get clean txt files
* Run named entity recognition (NER) on the documents to extract named entities of the type Person, Organization and Location using spaCy
* Run co-reference resolution on the input text using Stanford CoreNLP
* Disambiguate the entities with Wikidata using OpenTapioca
* Run relation extraction using Stanford OpenIE
* Implement mappings between the extracted entity types and relations and the cycling ontology developed in Assignment 1, in order to create a knowledge graph of the domain represented in RDF
* Load the cycling ontology and the knowledge graph built in the previous step into the Corese engine, as in Assignment 2, and write SPARQL queries to retrieve specific information from the KG

### Useful resources
* The GitHub repository "Building knowledge graph from input data" at https://github.com/varun196/knowledge_graph_from_unstructured_text can be used as inspiration

### References
* NLTK: https://www.nltk.org/
* spaCy: https://spacy.io/
* Stanford OpenIE: https://nlp.stanford.edu/software/openie.html
* OpenTapioca: https://opentapioca.org/
* Corese engine: https://project.inria.fr/corese/
* Wikidata: https://www.wikidata.org/

## Step 1: Collect multilingual textual documents about cycling
For this mini project, we will collect multilingual textual documents about cycling from various sources such as news articles, blog posts, and race reports. We will download the documents and save them in a directory called `cycling_docs`.

The documents to download are:

* English:
 - https://en.wikipedia.org/wiki/2022_Tour_de_France
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21
 - https://www.bbc.com/sport/cycling/61940037
 - https://www.bbc.com/sport/cycling/62017114 (stage 1)
 - https://www.bbc.com/sport/cycling/62097721 (stage 7)
 - https://www.bbc.com/sport/cycling/62153759 (stage 11)
 - https://www.bbc.co.uk/sport/cycling/62285420 (stage 21)

* French:
 - https://fr.wikipedia.org/wiki/Tour_de_France_2022
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html

In [None]:
# Feel free to install more dependencies if needed!

# Install jusText for automatically extracting text from web pages
!pip install --quiet jusText

# Install nltk for text processing
!pip install --quiet nltk

# Install spaCy for NER extraction
!pip install --quiet spacy

# Install pycorenlp for Stanford CoreNLP
!pip install --quiet pycorenlp

# Install pandas for data visualization
!pip install pandas

# Install rdflib for writing RDF
!pip install rdflib

In [None]:
# Import necessary modules
import requests
import justext
import os
from urllib.parse import urlsplit


# Define function to get filename from URL
def get_filename_from_url(url):
  urlpath = urlsplit(url).path
  return os.path.basename(urlpath)


# Define function to download URLs and extract text
def download_urls(urls_list, language):
  # Loop over each URL in the list
  for url in urls_list:
    # Fetch and extract text from the URL using jusText
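    response = requests.get(url, timeout=30)
    # Stop early on HTTP errors instead of saving an error page as a document
    response.raise_for_status()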
    paragraphs = justext.justext(
      response.content,
      justext.get_stoplist(language.capitalize()),
      no_headings=True,
      max_heading_distance=150,
      length_low=70,
      length_high=140,
      stopwords_low=0.2,
      stopwords_high=0.3,
      max_link_density=0.4
    )
    # Keep only the non-empty paragraphs that jusText does not flag as boilerplate
    extracted_text = '\n'.join(p.text for p in paragraphs if not p.is_boilerplate and p.text)

    # Truncate text
    extracted_text = extracted_text[0:10000]

    # Save extracted text as a .txt file
    filename = get_filename_from_url(url)
    output_path = os.path.join('cycling_docs', f'{filename}.{language}.txt')
    with open(output_path, 'w', encoding='utf-8') as f:
      f.write(extracted_text)
    # Print a message to show that the URL was downloaded and saved
    print(f'Downloaded {url} into {output_path}')


# List of URLs to download
urls_list_english = [
  'https://en.wikipedia.org/wiki/2022_Tour_de_France',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21',
  'https://www.bbc.com/sport/cycling/61940037',
  'https://www.bbc.com/sport/cycling/62017114',
  'https://www.bbc.com/sport/cycling/62097721',
  'https://www.bbc.com/sport/cycling/62153759',
  'https://www.bbc.co.uk/sport/cycling/62285420',
]
urls_list_french = [
  'https://fr.wikipedia.org/wiki/Tour_de_France_2022',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html',
]

# Create the output directory if it does not exist
os.makedirs('cycling_docs', exist_ok=True)

download_urls(urls_list_english, 'english')
download_urls(urls_list_french, 'french')
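
The file names produced by this cell combine the last path segment of the URL with the language, and the later steps rely on the `.french.txt` / `.english.txt` suffix to pick the right models. A minimal sketch of that naming scheme, using only the standard library (`output_name` is a hypothetical helper introduced here for illustration, not part of the notebook):

```python
import os
from urllib.parse import urlsplit


def output_name(url, language):
    # Last path segment of the URL, followed by the language and .txt
    basename = os.path.basename(urlsplit(url).path)
    return f'{basename}.{language}.txt'


name = output_name('https://en.wikipedia.org/wiki/2022_Tour_de_France', 'english')
print(name)

# Later steps recover the language from the suffix
language = 'french' if name.endswith('.french.txt') else 'english'
print(language)
```

Note that URLs ending in a numeric identifier, such as the BBC articles, produce file names that say little about their content.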

## Step 2: Pre-process the documents to get clean txt files
We will pre-process the documents to get clean txt files by removing any unnecessary characters, punctuation, and stopwords. We will use Python's [re](https://docs.python.org/3/library/re.html) and [nltk](https://www.nltk.org/) libraries for this purpose. We will save the results in a `clean_docs` folder.
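
To make the effect of this step concrete, here is a minimal sketch of the cleaning idea using only `re` and a tiny hand-written stopword set. Both `simple_clean` and `STOP_WORDS` are assumptions made for illustration; the notebook itself relies on NLTK's tokenizer and stopword lists.

```python
import re

# Tiny hand-written stopword set, for illustration only
STOP_WORDS = {'the', 'a', 'of', 'in', 'and'}


def simple_clean(text):
    # Replace anything that is not a letter, digit, underscore or whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Lowercase, split on whitespace, and drop stopwords
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return ' '.join(words)


sample = "Jonas Vingegaard won the 2022 Tour de France, ahead of Tadej Pogačar."
print(simple_clean(sample))
# jonas vingegaard won 2022 tour de france ahead tadej pogačar
```

Lowercasing and stopword removal shorten the text, but they also discard capitalization and sentence structure that the NER and co-reference steps may rely on, so it is worth comparing results on raw and cleaned text.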

In [None]:
# Import the libraries needed for text cleaning
import os
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')


def clean_text(dirty_text, language):
    # Replace punctuation and other symbols with a space. Using \w keeps
    # accented letters (needed for French), and replacing with a space avoids
    # gluing together words that were separated by a newline or an apostrophe
    cleaned_text = re.sub(r"[^\w\s]|_", " ", dirty_text)

    # Convert the text to lowercase
    cleaned_text = cleaned_text.lower()

    # Tokenize the words
    words = word_tokenize(cleaned_text, language=language)

    # Remove stopwords
    stop_words = set(stopwords.words(language))
    filtered_words = [word for word in words if word not in stop_words]

    # Join the tokens back into a single space-separated string
    return ' '.join(filtered_words)

In [None]:
# Define a function to process a file and write the result to a new file
def process_file(file):
  # Open the file in read-only mode and read its whole content
  with open(file, 'r', encoding='utf-8') as f:
    raw_text = f.read()

  # Determine the language of the file based on its extension
  language = 'french' if file.endswith('.french.txt') else 'english'

  # Clean the text using the clean_text function
  cleaned_text = clean_text(raw_text, language)

  # Create the output directory if it does not exist
  os.makedirs('clean_docs', exist_ok=True)

  # Write the cleaned text into a new file
  with open(os.path.join('clean_docs', os.path.basename(file)), 'w', encoding='utf-8') as out:
    out.write(cleaned_text)


# Loop through all the files in the "cycling_docs" folder
folder = 'cycling_docs'
for filename in os.listdir(folder):
  # Construct the full path to the file
  file = os.path.join(folder, filename)

  # Check if the file is a regular file and has a .txt extension
  if os.path.isfile(file) and file.endswith('.txt'):
    # Call the process_file function on the file
    process_file(file)

## Step 3: Run named entity recognition (NER) on the documents
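We will use [spaCy](https://spacy.io)'s pre-trained models to perform NER on the documents and extract the entities of type person, organization and location. The French model labels them `PER`/`ORG`/`LOC`, while the English model uses `PERSON`/`ORG`/`LOC` and tags countries and cities as `GPE`. We will save the extracted entities in JSON files.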
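
Because the two models do not share a label set, it helps to normalize labels before filtering. The sketch below shows one way to do so on plain `(text, label)` pairs; the `LABEL_MAP` table and the `keep_relevant` helper are assumptions introduced for illustration, and the sample pairs are hand-written rather than actual model output.

```python
# Map model-specific labels onto the three types used in this project
LABEL_MAP = {
    'PER': 'Person', 'PERSON': 'Person',
    'ORG': 'Organization',
    'LOC': 'Location', 'GPE': 'Location',
}


def keep_relevant(entities):
    # `entities` is a list of (text, label) pairs, e.g. built from doc.ents
    return [(text, LABEL_MAP[label]) for text, label in entities if label in LABEL_MAP]


english = [('Jonas Vingegaard', 'PERSON'), ('Paris', 'GPE'), ('2022', 'DATE')]
french = [('Jonas Vingegaard', 'PER'), ('Paris', 'LOC'), ('Tour de France', 'MISC')]
print(keep_relevant(english))
print(keep_relevant(french))
```

With this normalization, both languages yield the same list of typed entities for the sample above, which simplifies the later mapping to ontology classes.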

In [None]:
# Create two lists, one for French documents and one for English documents
import os

docs_fr = []
docs_en = []

# Loop through all the files in the "clean_docs" folder and append their
# content to `docs_fr` or `docs_en`. The end result looks like:
# `docs_en = ["doc 1 text...", "doc 2 text...", "doc 3 text...", ...]`
for filename in sorted(os.listdir('clean_docs')):
    # Construct the full path to the file
    file = os.path.join('clean_docs', filename)

    # Read the contents of the file
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()

    # Determine the language of the file based on its extension and
    # append the text to the matching list
    if file.endswith('.french.txt'):
        docs_fr.append(text)
    else:
        docs_en.append(text)

In [None]:
# Display first document for each language
display(docs_fr[0])
display(docs_en[0])

In [None]:
!python -m spacy download fr_core_news_sm

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
# Import libraries
import os
import spacy
import json

# Load spaCy models for French and English
nlp_fr = spacy.load('fr_core_news_sm')
nlp_en = spacy.load('en_core_web_sm')

# Function to extract entities from text
def extract_entities(text, language):
    # Select the appropriate spaCy model based on the language
    nlp = nlp_fr if language == 'french' else nlp_en
    # Apply the spaCy model to the text
    doc = nlp(text)
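    # Extract the named entities and their labels. The French model uses PER,
    # while the English model uses PERSON and tags countries and cities as GPE
    wanted_labels = {'PER', 'PERSON', 'ORG', 'LOC', 'GPE'}
    entities = [{'text': ent.text, 'label': ent.label_} for ent in doc.ents if ent.label_ in wanted_labels]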
    # Return extracted entities
    return entities

# Create two dictionaries to store entities for French and English documents
entities_fr = {}
entities_en = {}

# Loop through all the files in the "clean_docs" folder and extract entities
for filename in os.listdir('clean_docs'):
    # Construct the full path to the file
    file = os.path.join('clean_docs', filename)

    # Determine the language of the file based on its extension
    language = 'french' if file.endswith('.french.txt') else 'english'

    # Read the contents of the file
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()

    # Extract entities from the text
    entities = extract_entities(text, language)

    # Add the entities to the appropriate dictionary based on the language
    if language == 'french':
        entities_fr[filename] = entities
    elif language == 'english':
        entities_en[filename] = entities

# Save the extracted entities to files
with open('entities_fr.json', 'w') as f:
    json.dump(entities_fr, f)

with open('entities_en.json', 'w') as f:
    json.dump(entities_en, f)


In [None]:
# Extract entities for each document, using the language names
# ('french' / 'english') that `extract_entities` expects
entities_fr = []
entities_en = []

for text in docs_fr:
  entities_fr.append(extract_entities(text, 'french'))

for text in docs_en:
  entities_en.append(extract_entities(text, 'english'))

Display the entities that have been extracted

In [None]:
# Display extracted entities for the first document in each language
print("French entities for the first document:", entities_fr[0])
print("English entities for the first document:", entities_en[0])


## Step 4: Run co-reference resolution on the input text
We will use CoreNLP to perform [co-reference resolution](https://en.wikipedia.org/wiki/Coreference) on the input text and resolve coreferences.

For this project, we will use a hosted version of CoreNLP at: https://corenlp.tools.eurecom.fr/. Feel free to try out the web interface before writing the code.
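
Resolving a co-reference amounts to replacing each non-representative mention with the text of the representative one. A minimal sketch of that substitution on character offsets is shown below; `apply_replacements` is a hypothetical helper, and the offsets are written by hand here rather than taken from a CoreNLP response.

```python
def apply_replacements(text, replacements):
    # Each replacement is (start, end, new_text) in character offsets of `text`.
    # Working from the end keeps the offsets of earlier spans valid.
    for start, end, new_text in sorted(replacements, reverse=True):
        text = text[:start] + new_text + text[end:]
    return text


text = "Vingegaard attacked. He won the stage."
# "He" occupies characters 21-23 and refers back to "Vingegaard"
resolved = apply_replacements(text, [(21, 23, "Vingegaard")])
print(resolved)
# Vingegaard attacked. Vingegaard won the stage.
```

Applying the replacements from left to right instead would shift every later offset as soon as a replacement changes the length of the text.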

In [None]:
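import json
from pycorenlp import StanfordCoreNLP

# Set up the CoreNLP client
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')


def resolve_coreferences(text, language):
    output = nlp.annotate(text, properties={
        'timeout': 300000,
        'annotators': 'tokenize,ssplit,coref',
        'pipelineLanguage': language,
        'outputFormat': 'json'
    })

    # The client may return either a parsed dict or the raw JSON string
    if isinstance(output, str):
        try:
            output = json.loads(output)
        except Exception:
            print(f'Unexpected response: {output}')
            raise

    # Collect (start, end, replacement) character spans for every mention
    # that is not the representative mention of its chain
    replacements = []
    for chain in output['corefs'].values():
        # Find the most representative mention
        target_text = next((m['text'] for m in chain if m['isRepresentativeMention']), None)
        if target_text is None:
            continue
        for mention in chain:
            if mention['isRepresentativeMention']:
                continue
            # sentNum and startIndex are 1-based token positions; endIndex is exclusive
            tokens = output['sentences'][mention['sentNum'] - 1]['tokens']
            start = tokens[mention['startIndex'] - 1]['characterOffsetBegin']
            end = tokens[mention['endIndex'] - 2]['characterOffsetEnd']
            replacements.append((start, end, target_text))

    # Replace from the end of the text so that earlier offsets stay valid,
    # skipping mentions that overlap one already replaced
    last_start = len(text)
    for start, end, target_text in sorted(replacements, reverse=True):
        if end > last_start:
            continue
        text = text[:start] + target_text + text[end:]
        last_start = start

    return text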


In [None]:
for i in range(len(docs_en)):
  print(f'Processing document {i}')
  try:
    docs_en[i] = resolve_coreferences(docs_en[i], 'en')
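  except Exception as err:
    # Keep the original text if the server call fails for this document
    print(f'Skipping document {i}: {err}')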

## Step 5: Disambiguate the entities with Wikidata using OpenTapioca
We will use [OpenTapioca](https://opentapioca.org/) to disambiguate the entities with Wikidata and retrieve their unique identifiers (QIDs).
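
Since the documents can be long, the notebook sends the text to the API in slices of 4000 characters and merges the results. The sketch below illustrates that chunk-and-merge pattern with a stand-in annotator; `annotate_in_chunks` and `fake_annotate` are hypothetical names, and the `example.org` URI is a placeholder rather than a real Wikidata identifier.

```python
def annotate_in_chunks(text, annotate, chunk_size=4000):
    # Call `annotate` on consecutive slices and merge the returned dicts
    merged = {}
    for start in range(0, len(text), chunk_size):
        merged.update(annotate(text[start:start + chunk_size]))
    return merged


def fake_annotate(chunk):
    # Stand-in for the real API call: "links" one known name to a placeholder URI
    return {'Paris': 'http://example.org/Paris'} if 'Paris' in chunk else {}


text = 'x' * 10 + 'Paris' + 'y' * 10
print(annotate_in_chunks(text, fake_annotate, chunk_size=20))
# A mention cut in two by a chunk boundary is missed
print(annotate_in_chunks(text, fake_annotate, chunk_size=12))
```

The second call shows the main limitation of fixed-size slicing: a name that straddles a boundary is not found, so splitting on sentence or paragraph boundaries may be preferable.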

In [None]:
import requests

# Define the API endpoint URL
opentapioca_url = 'https://opentapioca.wordlift.io/api/annotate'

def opentapioca_annotate(text, language):
  # Define the request parameters
  params = {
    'query': text,
    'lang': language
  }

  # Send the GET request to the OpenTapioca API endpoint
  response = requests.get(opentapioca_url, params=params)

  # Map each annotated mention to its Wikidata URI. This assumes the usual
  # OpenTapioca response shape, where every annotation carries `start`, `end`
  # and `best_qid`; print `response.json()` to check it against the endpoint used.
  entities = {}
  try:
      response_json = response.json()
      for annotation in response_json['annotations']:
          qid = annotation.get('best_qid')
          if not qid:
              continue
          mention = text[annotation['start']:annotation['end']]
          entities[mention] = f'http://www.wikidata.org/entity/{qid}'
  except KeyError:
      print('No annotations found in response')
  except Exception:
      print(f'Unexpected response: {response.text}')
      raise

  # Return a dict so that results from several chunks can be merged with `|=`
  return entities

In [None]:
wiki_entities_fr = []
wiki_entities_en = []

for text in docs_fr:
  entities = {}
  for j in range(0, len(text), 4000):
    entities |= opentapioca_annotate(text[j:j+4000], 'fr')
  wiki_entities_fr.append(entities)

for text in docs_en:
  entities = {}
  for j in range(0, len(text), 4000):
    entities |= opentapioca_annotate(text[j:j+4000], 'en')
  wiki_entities_en.append(entities)

Display extracted Wikidata entities

In [None]:
# For each language, display extracted Wikidata entities
print("Extracted entities for French documents:")
for x, entities in enumerate(wiki_entities_fr):
    print(f"Document {x}: {entities}")
    
print("Extracted entities for English documents:")
for x, entities in enumerate(wiki_entities_en):
    print(f"Document {x}: {entities}")

## Step 6: Run relation extraction using Stanford OpenIE
We will use Stanford OpenIE to extract the relations between the entities in the input text.
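
OpenIE tends to return many triples, a large share of which do not involve any entity of interest. One possible post-processing step, sketched below under the assumption that relations are dicts with `subject`, `relation` and `object` keys, keeps only the triples whose subject and object both mention a known entity. `filter_relations` is a hypothetical helper and the sample triples are hand-written.

```python
def filter_relations(relations, entity_names):
    # Keep triples whose subject and object both mention a known entity
    names = [n.lower() for n in entity_names]

    def mentions_entity(phrase):
        phrase = phrase.lower()
        return any(n in phrase for n in names)

    return [r for r in relations
            if mentions_entity(r['subject']) and mentions_entity(r['object'])]


relations = [
    {'subject': 'Vingegaard', 'relation': 'won', 'object': 'Tour de France'},
    {'subject': 'he', 'relation': 'was', 'object': 'happy'},
]
kept = filter_relations(relations, ['Vingegaard', 'Tour de France'])
print(kept)
```

Substring matching is deliberately loose; matching on the Wikidata identifiers from Step 5 would be stricter.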

In [None]:
import json
from pycorenlp import StanfordCoreNLP

# Create a StanfordCoreNLP object
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function to extract relations from input text using Stanford OpenIE
def extract_relations(input_text, language):
  output = nlp.annotate(input_text, properties={
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,openie',
    'outputFormat': 'json',
    'pipelineLanguage': language
  })
  # The client may return either a parsed dict or the raw JSON string
  if isinstance(output, str):
    try:
      output = json.loads(output)
    except Exception:
      print(f'Unexpected response: {output}')
      raise

  # Collect the (subject, relation, object) triples found in each sentence
  relations = []
  for sentence in output['sentences']:
      for openie in sentence['openie']:
          relation = {
              'subject': openie['subject'],
              'relation': openie['relation'],
              'object': openie['object']
          }
          relations.append(relation)

  # Return relations
  return relations

In [None]:
relations_en = []
for i in range(len(docs_en)):
  print(f'Processing document {i}')
  relations_en.append(extract_relations(docs_en[i], 'en'))

## Step 7: Implement some mappings between the entity types and relations returned with a given cycling ontology
We will implement mappings between the entity types and relations returned with the cycling ontology available at https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl.
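
One simple way to map free-text OpenIE relations onto ontology properties is a keyword table. The sketch below uses only the standard library and represents triples as tuples of URI strings. The property names `win`, `ride` and `isPartOf` are the ones this notebook uses; the keyword table, the helper functions and the `example.org` URIs are assumptions made for illustration.

```python
import re

CYCLING = 'https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl#'

# Hypothetical keyword table: phrase found in the OpenIE relation -> property name
KEYWORD_TO_PROPERTY = {
    'won': 'win', 'wins': 'win', 'win': 'win',
    'rides': 'ride', 'rode': 'ride',
    'part of': 'isPartOf',
}


def map_relation(phrase):
    # Return the full property URI for the first keyword found, else None
    phrase = phrase.lower()
    for keyword, prop in KEYWORD_TO_PROPERTY.items():
        if re.search(r'\b' + re.escape(keyword) + r'\b', phrase):
            return CYCLING + prop
    return None


def to_triples(relations, entity_uris):
    # Keep only relations whose subject, predicate and object can all be mapped
    triples = []
    for r in relations:
        s = entity_uris.get(r['subject'])
        p = map_relation(r['relation'])
        o = entity_uris.get(r['object'])
        if s and p and o:
            triples.append((s, p, o))
    return triples


entity_uris = {
    'Vingegaard': 'http://example.org/Vingegaard',
    'Stage 11': 'http://example.org/Stage_11',
}
relations = [
    {'subject': 'Vingegaard', 'relation': 'won', 'object': 'Stage 11'},
    {'subject': 'Vingegaard', 'relation': 'was born in', 'object': 'Denmark'},
]
triples = to_triples(relations, entity_uris)
print(triples)
```

Each resulting tuple can then be added to the RDF graph by wrapping its three strings in `rdflib.URIRef`.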

In [None]:
import rdflib

g = rdflib.Graph()

# Define namespaces for the entities and relations
CYCLING = rdflib.Namespace('https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl#')
SCHEMA = rdflib.Namespace('http://schema.org/')

# Bind the entity and relation namespaces to the graph
g.bind('cycling', CYCLING)
g.bind('schema', SCHEMA)

# Hand-picked entities with their DBpedia URLs. They are kept in their own
# variable so that the `entities_en` list from Step 3 is not overwritten.
entity_urls = {
    'Chris Froome': 'http://dbpedia.org/resource/Chris_Froome',
    'Team Sky': 'http://dbpedia.org/resource/Team_Sky',
    'Richie Porte': 'http://dbpedia.org/resource/Richie_Porte',
    '2016 Tour de France': 'http://dbpedia.org/resource/2016_Tour_de_France',
    'Colombia': 'http://dbpedia.org/resource/Colombia',
    'Esteban Chaves': 'http://dbpedia.org/resource/Esteban_Chaves',
    'Giro d\'Italia': 'http://dbpedia.org/resource/Giro_d\'Italia',
    'Vuelta a España': 'http://dbpedia.org/resource/Vuelta_a_España'
}

# Ontology properties used for the relations (kept separate from the
# `relations_en` list produced in Step 6)
relation_properties = {
    'ride': CYCLING.ride,
    'isPartOf': CYCLING.isPartOf,
    'win': CYCLING.win,
    'rank': CYCLING.rank,
    'team': CYCLING.team,
    'participant': CYCLING.participant,
    'location': CYCLING.location,
    'belongsTo': CYCLING.belongsTo,
    'wikipedia': SCHEMA.about
}

# Turn the DBpedia URLs into RDF resources
entity_refs = {name: rdflib.URIRef(url) for name, url in entity_urls.items()}

# Add the entities to the graph, with their name and their URL as a resource
for name, entity in entity_refs.items():
    g.add((entity, SCHEMA.name, rdflib.Literal(name)))
    g.add((entity, SCHEMA.url, entity))

# Add the relations to the graph
for name, relation in relation_properties.items():
    g.add((relation, SCHEMA.name, rdflib.Literal(name)))

In [None]:
# Save the result into a file
g.serialize(destination='output.ttl', format='turtle')

## Step 8: Load the data in the Corese engine with the ontology and write the SPARQL queries to retrieve specific information from the KG
We will load the data into the [Corese](https://www.eurecom.fr/~troncy/teaching/websem2023/corese-3.2.3c.jar) engine (the same one used in Assignment 2) together with the ontology, and write SPARQL queries to retrieve specific information from the KG. We will write the following queries:

* 📝 List the name of the cycling teams

In [None]:
#PREFIX cycling: <https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl#>
#PREFIX schema: <http://schema.org/>

#SELECT DISTINCT ?x
#WHERE {
#  ?y a cycling:Team ;
#        schema:name ?x .
#}

* 📝 List the name of the cycling riders

In [None]:
#PREFIX cycling: <https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl#>
#PREFIX schema: <http://schema.org/>

#SELECT DISTINCT ?n
#WHERE {
#  ?r a cycling:Participant ;
#         cycling:ride ?race ;
#         schema:name ?n .
#}


* 📝 Retrieve the name of the winner of the Prologue

In [None]:
#PREFIX cycling: <https://www.eurecom.fr/~troncy/teaching/websem2023/cycling.owl#>
#PREFIX schema: <http://schema.org/>

#SELECT ?winner
#WHERE {
#  ?prologue a cycling:Prologue ;
#            cycling:belongsTo ?race ;
#            cycling:win ?rider .
#  ?rider schema:name ?winner .
#}

📝 We will also write the same 3 queries on Wikidata starting from `Q98043180` to compare the results.

In [None]:
#SELECT DISTINCT ?teamLabel WHERE {
#  wd:Q98043180 wdt:P1923 ?team.
#  ?team rdfs:label ?teamLabel.
#  FILTER((LANG(?teamLabel)) = "en")
#  }

In [None]:
#SELECT DISTINCT ?riderLabel WHERE {
#  wd:Q98043180 wdt:P710 ?rider.
#  ?rider rdfs:label ?riderLabel.
#  FILTER (lang(?riderLabel) = "en")
#}
#ORDER BY ?riderLabel

In [None]:
#SELECT ?riderLabel WHERE {
#  wd:Q920285 wdt:P1346 ?rider.
#  ?rider wdt:P1344 wd:Q98043180.
#  ?rider rdfs:label ?riderLabel.
#  FILTER(LANG(?riderLabel) = "en")
#}
