### Import the knowledge graph

In [1]:
from rdflib.term import URIRef, Literal
import rdflib

In [2]:
graph = rdflib.Graph()
graph.parse('/Users/gianmarcoalbano/Desktop/Advanced topics in AI/Speakeasy Project/Datasets/14_graph.nt', format='turtle')

<Graph identifier=N0af5a034821f41829c12fcba854e7a1f (<class 'rdflib.graph.Graph'>)>

### Some info on the knowledge graph

The entities are stored with different URIs. The most common namespaces are the following:


In [3]:
# define some prefixes
WD = rdflib.Namespace('http://www.wikidata.org/entity/')
WDT = rdflib.Namespace('http://www.wikidata.org/prop/direct/')
DDIS = rdflib.Namespace('http://ddis.ch/atai/')
RDFS = rdflib.namespace.RDFS
SCHEMA = rdflib.Namespace('http://schema.org/')

In [4]:
print('Some subjects from the knowledge graph')
for subj in list(set(graph.subjects()))[:10]:
    print(subj)

print('\n Some objects from the knowledge graph')
for obj in list(set(graph.objects()))[10:20]:
    print(obj)

Some subjects from the knowledge graph
http://www.wikidata.org/entity/Q44362
http://www.wikidata.org/entity/Q707064
http://www.wikidata.org/entity/Q399495
http://www.wikidata.org/entity/Q1657593
http://www.wikidata.org/entity/Q1209782
http://www.wikidata.org/entity/Q1410667
http://www.wikidata.org/entity/Q51596819
http://www.wikidata.org/entity/Q1347014
http://www.wikidata.org/entity/Q42337579
http://www.wikidata.org/entity/Q266451

 Some objects from the knowledge graph
https://commons.wikimedia.org/wiki/File:Florian_Teichtmeister_Nestroy-Theaterpreis_2015.jpg
Oliver Debuschewitz
Kazuchika Kise
http://www.wikidata.org/entity/Q17274824
https://commons.wikimedia.org/wiki/File:Jordan_Todosey_10.jpg
2011 film by Martin Scorsese
Marion Lécrivain
https://commons.wikimedia.org/wiki/File:Revolver_Golden_Gods_Awards.jpg
Karl T. Wright
actor (1902-1976)


One way to access the label of an entity among the graph subjects, given its URI:

In [5]:
for node in graph.subjects():
    if graph.value(subject=node, predicate=RDFS.label): # Check if the triple exists
        print(f"node {node} has label {graph.value(subject=node, predicate=RDFS.label)}")
    break


node http://www.wikidata.org/entity/Q15922298 has label Hector and the Search for Happiness


We want to check whether every subject in the graph has a label.

In [6]:
i = 0
j = 0
for node in graph.subjects():
    j += 1
    if graph.value(subject=node, predicate=RDFS.label): # Check if the triple exists
        i += 1

print(f"Number of subjects with a label: {i}\n")
print(f"Number of subjects in the graph: {j}\n")
if i != j:
    print(f"There are {j-i} subject entities without a label")

Number of subjects with a label: 2051387

Number of subjects in the graph: 2056777

There are 5390 subject entities without a label
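
A caveat on the counts above: as far as rdflib's iterator semantics go, `graph.subjects()` yields one subject per triple unless told otherwise, so the loop counts subject occurrences rather than distinct entities. The sketch below illustrates the difference on plain tuples standing in for triples; the data and the `has_label` helper are invented for illustration, and on the real graph one would presumably iterate over `set(graph.subjects())` instead.

```python
# Toy triples (subject, predicate, object); all values are invented for illustration.
LABEL = "rdfs:label"
triples = [
    ("wd:Q1", LABEL, "Film A"),
    ("wd:Q1", "wdt:P57", "wd:Q2"),
    ("wd:Q2", LABEL, "Director B"),
    ("wd:Q3", "wdt:P57", "wd:Q2"),  # wd:Q3 has no label triple
]

def has_label(subject, triples):
    """Hypothetical helper: True if the subject appears in at least one label triple."""
    return any(s == subject and p == LABEL for s, p, _ in triples)

# Counting one subject per triple (duplicates included)
per_triple = sum(1 for _ in triples)

# Counting distinct subjects only
unique_subjects = {s for s, _, _ in triples}
labelled = {s for s in unique_subjects if has_label(s, triples)}

print(f"Subject occurrences: {per_triple}")
print(f"Distinct subjects: {len(unique_subjects)}")
print(f"Distinct subjects without a label: {len(unique_subjects - labelled)}")
```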


### Make a dictionary of node URIs with their labels

We want to build a dictionary whose keys are the node URIs and whose values are the node labels.

In [7]:
# Function to extract the local part of a URI (e.g., after the last / or #)
def extract_label_from_uri(uri, namespaces):
    uri = str(uri)
    # Loop through all namespaces and strip the first matching prefix
    for namespace in namespaces:
        prefix = str(namespace)
        if uri.startswith(prefix):
            return uri[len(prefix):]
    # If no namespace matches, fall back to the last path segment of the URI
    return uri.split('/')[-1]

# Function to build a dictionary of nodes and their labels
def build_node_label_dict(graph, namespaces):
    nodes = {}
    
    for node in graph.all_nodes():
        if isinstance(node, rdflib.term.URIRef):  # Only process URIs
            # Check if the node has a label
            label = graph.value(node, RDFS.label)
            
            if label:
                # If label exists, use it
                nodes[node.toPython()] = str(label)
            else:
                # If no label, extract the local part of the URI
                local_label = extract_label_from_uri(node, namespaces)
                nodes[node.toPython()] = local_label
    
    return nodes

namespaces = [WD, WDT, DDIS, RDFS, SCHEMA]

# TODO: change the name of nodes into 'ent2lbl'
nodes = build_node_label_dict(graph, namespaces)

# Check the result
for uri, label in nodes.items():
    print(f"URI: {uri}, Label: {label}")
    break

URI: http://www.wikidata.org/entity/Q44362, Label: Theo Rossi


Make an inverse dictionary to find URIs of the entities given the labels

In [8]:
ent2uri = {ent: uri for uri, ent in nodes.items()}
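
Labels are not guaranteed to be unique (several entities can share a title), and a plain inverse dictionary keeps only the last URI seen for each label. A minimal sketch of a multi-valued alternative follows; the URIs are placeholders and the names `nodes_example`, `ent2uri_example` and `label2uris` are hypothetical, not part of the notebook.

```python
from collections import defaultdict

# Toy URI -> label dictionary; two different entities share the label "Batman".
# The URIs are placeholders invented for this sketch.
nodes_example = {
    "http://example.org/entity/E1": "Batman",
    "http://example.org/entity/E2": "Batman",
    "http://example.org/entity/E3": "Inception",
}

# Plain inversion: the second "Batman" silently overwrites the first.
ent2uri_example = {label: uri for uri, label in nodes_example.items()}

# Multi-valued inversion: every label maps to the list of all its URIs.
label2uris = defaultdict(list)
for uri, label in nodes_example.items():
    label2uris[label].append(uri)

print(len(ent2uri_example))   # number of distinct labels
print(label2uris["Batman"])   # every URI carrying that label
```

Whether the extra candidates are worth keeping depends on how the chatbot disambiguates entities later on.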

We also make another dictionary specifically for predicates

In [9]:
# Function to build a dictionary of predicates and their labels
def build_pred_label_dict(graph, namespaces):
    predicates = {}
    
    for node in graph.predicates():
        if isinstance(node, rdflib.term.URIRef):  # Only process URIs
            # Check if the node has a label
            label = graph.value(node, RDFS.label)
            
            if label:
                # If label exists, use it
                predicates[node.toPython()] = str(label)

            # This branch is never reached because all the predicates have labels
            else:
                # If no label, extract the local part of the URI
                local_label = extract_label_from_uri(node, namespaces)
                predicates[node.toPython()] = local_label
    
    return predicates

# TODO: change the name of predicates into 'pred2lbl'
predicates = build_pred_label_dict(graph, namespaces)

# Check the result
for uri, label in predicates.items():
    print(f"URI: {uri}, Label: {label}")
    break

URI: http://www.wikidata.org/prop/direct/P161, Label: cast member


Make an inverse dictionary to find URIs of the predicates given the labels

In [10]:
pred2uri = {pred: uri for uri, pred in predicates.items()}

### Matching function

Suppose we find an entity "Batman_1989" in the question we want to answer, but the knowledge graph registers it as "Batman 1989". We need a function that takes the entity from the question and finds the closest entity label in the knowledge graph.

In [70]:
import editdistance

def match_entity(entity, dictionary=nodes):
    """Return the (URI, label) pair whose label has the smallest edit distance to `entity`."""
    best_distance = float('inf')
    match_node = ""
    match_value = ""

    for key, value in dictionary.items():
        # Compute the distance once per label
        distance = editdistance.eval(value, entity)
        if distance < best_distance:
            best_distance = distance
            match_node = key
            match_value = value

    return match_node, match_value

We can also use the match_entity function to match a predicate to the closest predicate in the graph by specifying dictionary=predicates.


For example:

In [14]:
match_node, match_value = match_entity('direcror', predicates)
print(f"URI: {match_node}, label: {match_value}")

match_node, match_value = match_entity('movie')
print(f"URI: {match_node}, label: {match_value}")

URI: http://www.wikidata.org/prop/direct/P57, label: director
URI: http://www.wikidata.org/entity/Q15112439, label: Rosie
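
The second result shows a limit of pure edit distance: "movie" is only two substitutions away from "Rosie", so a semantically unrelated label wins. The sketch below is a plain-Python Levenshtein distance used as a stand-in for `editdistance.eval` (an assumption made so the example has no dependencies); the function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

# Compare the query against a few candidate labels
for label in ["Rosie", "film", "Batman"]:
    print(label, levenshtein("movie", label))
```

A distance threshold, or a similarity measure that takes meaning into account, is one way to avoid accepting such accidental matches.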


### Processing questions

Our first approach to answering factual questions is very naive. The following function takes a question and tries to fit it to a series of question patterns in order to extract a relation and an entity. For example, the question "Who is the director of Batman" corresponds to the pattern "who is the (?P<relation>.+?) of (?P<entity>.+)". Calling re.match with the pattern and the question produces a match object (called match in the function) whose named groups can be read as a dictionary: {'relation': 'director', 'entity': 'Batman'}. To access this dictionary we call .groupdict() on the match object. We then retrieve the relation and the entity using get('relation', "") and get('entity', ""), so that a missing key yields "" instead of raising an error.
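
The named-group mechanism described above can be tried in isolation. This is a minimal sketch using only the standard `re` module and the "who is the ... of ..." pattern; the second question is an invented example of a non-matching input.

```python
import re

pattern = r"who is the (?P<relation>.+?) of (?P<entity>.+)"

# A question that fits the pattern
match = re.match(pattern, "Who is the director of Batman", re.IGNORECASE)
params = match.groupdict() if match else {}

# .get with a default avoids a KeyError when a group is absent
relation = params.get("relation", "")
entity = params.get("entity", "")
print(params)

# A question that does not fit the pattern: re.match returns None
no_match = re.match(pattern, "List the highest rated movies", re.IGNORECASE)
print(no_match)
```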

In [18]:
import re

question_patterns = [
    
    # Pattern 0: who and what
    (r"who is the (?P<relation>.+?) of (?P<entity>.+)", 'who', 1),

    # Pattern 1: Find movies with (word) in their titles
    (r"(?:find|which) movies.*contain(?:s)?(?: the word)? (?P<word>\w+)", 'find_word_in_title', 0),
    (r"(?:find|which) movies with (?P<word>\w+) in (?:their )?titles?", 'find_word_in_title', 0),
    (r"(?:find|which) movies (?:whose )?(?:title|name) contains? (?P<word>\w+)", 'find_word_in_title', 0),
    
    # Pattern 2: Highest-rated movies above a certain rating
    (r"(?:what are|list)(?: the)?(?: highest-rated)? movies.*(?:rated )?(?:above|greater than) (?P<number>\d+(\.\d+)?)", 'movies_rating_above', 0),
    (r"movies rated above (?P<number>\d+(\.\d+)?)", 'movies_rating_above', 0),
    
    # Pattern 3: Lowest-rated movies below a certain rating
    (r"(?:what are|list)(?: the)?(?: lowest-rated)? movies.*(?:rated )?(?:below|less than) (?P<number>\d+(\.\d+)?)", 'movies_rating_below', 0),
    (r"movies rated below (?P<number>\d+(\.\d+)?)", 'movies_rating_below', 0),
    
    # Pattern 4: Entities in alphabetical order
    (r"which (?P<entity>.+) comes first alphabetically", 'entity_first_alphabetically', 1),
    (r"list (?P<entity>.+) in alphabetical order", 'entity_first_alphabetically', 1),
    
    # Pattern 5: Entities in reverse alphabetical order
    (r"which (?P<entity>.+) comes last alphabetically", 'entity_last_alphabetically', 1),
    (r"list (?P<entity>.+) in reverse alphabetical order", 'entity_last_alphabetically', 1),
]

# TODO: include queries from the OLAT page of the first intermediate evaluation

In [19]:
def process_question(question, entity_dictionary, predicate_dictionary):
    
    for pattern, qtype, matching in question_patterns:
        match = re.match(pattern, question, re.IGNORECASE)
        
        if match:
            params = match.groupdict()
            params['type'] = qtype  # Add the question type to the params
            
            if matching:
                # Extract and match the relation and entity
                relation = params.get('relation', "").lower()  # Set default as empty string
                entity = params.get('entity', "") # Set default as empty string (don't lower it)
                
                # Match the entity to the closest in the knowledge graph
                matched_entity_uri, matched_entity_label = match_entity(entity, dictionary=entity_dictionary) if entity else (None, None)
                
                # Match the relation to the closest in the knowledge graph
                matched_predicate_uri, matched_predicate_label = match_entity(relation, dictionary=predicate_dictionary) if relation else (None, None)
                
                # Add the matched URIs and labels to params
                params['matched_entity_uri'] = matched_entity_uri
                params['matched_entity_label'] = matched_entity_label
                params['matched_predicate_uri'] = matched_predicate_uri
                params['matched_predicate_label'] = matched_predicate_label
                
            return params
            
    return None

In [20]:
# Example usage
user_input = {
    0: "Who is the director of batman",
    1: "Which movies whose name contains italy",
    2: "List the highest rated movies",
    3: "what are the lowest-rated movies?",
    4: "Which films comes first alphabetically",
    5: "list actors in reverse alphabetical order"
}

for pattern, question in user_input.items():
    
    params = process_question(question, nodes, predicates)
    
    if params:
        print(f"Pattern {pattern}: Question: {question}\n")
        
        for key, value in params.items():
            print(f"{key} : {value}\n")
        print("\n\n")
        
    else:
        print(f"\nPattern {pattern} not matched\n\n\n\n")

Pattern 0: Question: Who is the director of batman

relation : director

entity : batman

type : who

matched_entity_uri : http://www.wikidata.org/entity/Q596699

matched_entity_label : Batman

matched_predicate_uri : http://www.wikidata.org/prop/direct/P57

matched_predicate_label : director




Pattern 1: Question: Which movies whose name contains italy

word : italy

type : find_word_in_title





Pattern 2 not matched





Pattern 3 not matched




Pattern 4: Question: Which films comes first alphabetically

entity : films

type : entity_first_alphabetically

matched_entity_uri : http://www.wikidata.org/entity/Q11424

matched_entity_label : film

matched_predicate_uri : None

matched_predicate_label : None




Pattern 5: Question: list actors in reverse alphabetical order

entity : actors

type : entity_last_alphabetically

matched_entity_uri : http://www.wikidata.org/entity/Q33999

matched_entity_label : actor

matched_predicate_uri : None

matched_predicate_label : None






In [21]:
def generate_sparql_query(params):
    qtype = params.get('type')

    if qtype == 'who':
        sparql_query = f"""
        SELECT ?result WHERE {{
            ?entity rdfs:label "{params['matched_entity_label']}"@en .  
            ?entity <{params['matched_predicate_uri']}> ?item . 
            ?item rdfs:label ?result .
            FILTER (lang(?result) = 'en')
        }}  
        """
        return sparql_query

    
    # FIXME: this query returns the labels of all entities whose label contains the word, not just movies
    elif qtype == 'find_word_in_title':
        word = params.get('word')
        sparql_query = f"""
        SELECT ?movieLabel WHERE {{
            ?movie rdfs:label ?movieLabel .
            FILTER(CONTAINS(LCASE(?movieLabel), LCASE("{word}"))) .
            FILTER (lang(?movieLabel) = 'en')
        }}
        """
        return sparql_query

    elif qtype == 'movies_rating_above':
        number = params.get('number')
        sparql_query = f"""
        SELECT ?movieLabel WHERE {{
            ?movie ddis:rating ?rating .
            FILTER(?rating > {number}) .
            ?movie rdfs:label ?movieLabel .
            FILTER (lang(?movieLabel) = 'en')
        }} ORDER BY DESC(?rating) LIMIT 1
        """
        return sparql_query

    elif qtype == 'movies_rating_below':
        number = params.get('number')
        sparql_query = f"""
        SELECT ?movieLabel WHERE {{
            ?movie ddis:rating ?rating .
            FILTER(?rating < {number}) .
            ?movie rdfs:label ?movieLabel .
            FILTER (lang(?movieLabel) = 'en')
        }} ORDER BY ASC(?rating)
        """
        return sparql_query

    elif qtype == 'entity_first_alphabetically':
        sparql_query = f"""
        SELECT ?entity_label WHERE {{
            ?entity wdt:P31 <{params['matched_entity_uri']}> .
            ?entity rdfs:label ?entity_label .
            FILTER (lang(?entity_label) = 'en')
        }} ORDER BY ASC(?entity_label)
        """
        return sparql_query

    elif qtype == 'entity_last_alphabetically':
        sparql_query = f"""
        SELECT ?entity_label WHERE {{
            ?entity wdt:P31 <{params['matched_entity_uri']}> .
            ?entity rdfs:label ?entity_label .
            FILTER (lang(?entity_label) = 'en')
        }} ORDER BY DESC(?entity_label)
        """
        return sparql_query

    else:
        return None

# TODO: include the PREFIX declarations in each query
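
One possible way to address the prefix TODO is to prepend the declarations to every generated query string. The sketch below assumes the namespace URIs defined earlier in the notebook plus the standard RDFS namespace; `PREFIXES` and `with_prefixes` are hypothetical names. rdflib's `Graph.query` also accepts an `initNs` mapping, which may be an alternative worth checking.

```python
# Namespace URIs as defined earlier in the notebook, plus the standard RDFS namespace.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
    "ddis": "http://ddis.ch/atai/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "http://schema.org/",
}

def with_prefixes(sparql_query, prefixes=PREFIXES):
    """Prepend one PREFIX declaration per namespace to a SPARQL query string."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return f"{header}\n{sparql_query.strip()}"

query = with_prefixes("""
    SELECT ?entity_label WHERE {
        ?entity wdt:P31 wd:Q11424 .
        ?entity rdfs:label ?entity_label .
    }
""")
print(query)
```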

In [None]:
# Example usage
user_input = {
    0: "Who is the director of batman",
    1: "Which movies whose name contains italy",
    2: "List the highest rated movies",
    3: "what are the lowest-rated movies?",
    4: "Which films comes first alphabetically",
    5: "list actors in reverse alphabetical order"
}

for pattern, question in user_input.items():
    
    params = process_question(question, nodes, predicates)

    if params:

        sparql_query = generate_sparql_query(params)

        print(f"Question: {question} has generated query:\n{sparql_query}\n")

        print(f"Checking if the query runs on the graph:\n")

        # Run the SPARQL query once and keep the first column of each row
        res = graph.query(sparql_query)
        result = [str(row[0]) for row in res]

        if result:
            print(result)

        else:
            print("QUERY NOT WORKING\n")

    else:

        print(f"pattern {pattern} has failed the parameter matching")
    

In [22]:
# Example Usage
user_input = "who is the director of Apocalypse Now"
params = process_question(user_input, nodes, predicates)
sparql_query = generate_sparql_query(params)
print(sparql_query)

# Check if SPARQL query runs on the graph
res = graph.query(sparql_query)
results = []
for row in res:
    results.append(row.result)
if results:
    for result in results:
        print(f"Answer: {result}")
else:
    print("No results found.")


        SELECT ?result WHERE {
            ?entity rdfs:label "Apocalypse Now"@en .  
            ?entity <http://www.wikidata.org/prop/direct/P57> ?item . 
            ?item rdfs:label ?result .
            FILTER (lang(?result) = 'en')
        }  
        
Answer: Francis Ford Coppola


In [23]:
def query_graph(graph, sparql_query):
    
    print("--- SPARQL query ---")
    print(sparql_query)

    # Execute the query
    qres = graph.query(sparql_query)

    # Collect the first selected variable of each row
    # (row[0] works for ?result, ?movieLabel and ?entity_label alike)
    results = [str(row[0]) for row in qres]
    
    # Print the answers, or a friendly message if there are none
    if results:
        for result in results:
            print(f"Answer: {result}")
    else:
        print("No results found for the given query.")

    # Always return a list so that callers can rely on the type
    return results


In [24]:
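def generate_answer(graph, question):

    params = process_question(question, nodes, predicates)

    # No pattern matched the question
    if params is None:
        print("The question did not match any known pattern.")
        return []

    sparql_query = generate_sparql_query(params)

    return query_graph(graph, sparql_query)
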

In [None]:
user_input = "who is the director of Apocalypse Now"
generate_answer(graph, user_input)

One problem with this approach is that the relation "release date" is not matched correctly: by edit distance it is closer to the relation "relative" than to "publication date", which is the predicate that actually exists in the knowledge graph.

IDEA TO FIX IT: modify the matching function for the predicates so that it relies on embedding similarity instead of edit distance. Come back to this once the embeddings are implemented.

## Embeddings

We will now implement an approach that relies on embeddings rather than querying the graph directly. For this we will need to extract entities from the graph in a more dynamic way and will resort to NER

### NER

We choose the model 'Babelscape/wikineural-multilingual-ner' because it was already trained on a large Wikidata dataset and it is by far the best at recognizing movie titles as 'MISC'.

In [156]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# aggregation_strategy replaces the deprecated grouped_entities argument
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


### Synonyms handling

To account for the presence of synonyms in the question we decided to implement a model that computes the similarity between a phrase and the list of predicates from the knowledge graph and returns the most similar matches
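
The ranking idea described above can be sketched without any NLP library: score each candidate by cosine similarity, drop the scores below a confidence threshold, and keep the top n. As far as we understand, spaCy's `similarity` on documents is by default a cosine similarity over averaged word vectors, but that is an assumption here. The three-dimensional vectors are made up, and `cosine_similarity` and `top_matches` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_matches(query_vector, candidates, n=5, confidence=0.6):
    """Return up to n candidate labels whose similarity to the query exceeds `confidence`."""
    scored = [(label, cosine_similarity(query_vector, vec)) for label, vec in candidates.items()]
    scored = [(label, score) for label, score in scored if score > confidence]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in scored[:n]]

# Made-up toy vectors standing in for real word embeddings
toy_vectors = {
    "publication date": [0.9, 0.1, 0.0],
    "director": [0.0, 1.0, 0.1],
    "cast member": [0.1, 0.8, 0.5],
}
print(top_matches([1.0, 0.2, 0.0], toy_vectors, n=2))
```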

In [157]:
import spacy

# this command downloads the Spacy model
spacy.cli.download("en_core_web_md")
    
nlp = spacy.load("en_core_web_md")



In [158]:
def find_match(phrase, predicate_dict, n=5, confidence=0.6):
    """
    Given a phrase, a dictionary of predicate values, an integer n, and a confidence threshold,
    return the top n most similar words to the phrase from the dictionary values
    that have a similarity score above the confidence threshold.
    """
    phrase_token = nlp(phrase)
    similarities = []

    # Calculate similarity between phrase and each predicate value
    for predicate in predicate_dict.values():
        predicate_token = nlp(predicate)
        similarity = phrase_token.similarity(predicate_token)
        
        # Only consider matches above the confidence threshold
        if similarity > confidence:
            similarities.append((predicate, similarity))

    # Sort by similarity in descending order and get the top n matches
    top_n_matches = sorted(similarities, key=lambda x: x[1], reverse=True)[:n]
    
    # Return only the most similar words
    return [match[0] for match in top_n_matches]


A weakness of this method: "Children" is not correctly associated with the predicate "Child".

In [159]:
# Example usage
phrase = "release date"
n = 3
print(find_match(phrase, predicates, n))  # Should return the top 3 most similar predicates

['collection', 'publication date', 'student of']


### Extracting Predicates

We have implemented the following pipeline to extract predicates from the question:

- Extract meaningful words from the question with spaCy
    - For example: from the question 'who directed...' only 'directed' is extracted
- Generate n-grams from the meaningful words
    - If the predicate consists of two words, like "publication date", the meaningful words would be ['publication', 'date'], which taken separately would not be mapped to 'publication date' but to other predicates. This is why we generate a list of n-grams like ["publication date", "publication", "date"].
- Starting with the longest n-gram, try to find the predicate in the predicate list that is closest to the n-gram. If a match is found, we return it. This means that we prioritize matching the longest n-grams
    - Before comparing the n-gram to the list of predicates, we lemmatize it and turn it into a noun using the verb_to_noun dictionary we wrote. This is because many predicates in the list take the form "director", "writer" rather than "direct" and "write"
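
The n-gram step in the list above can be sketched on its own. The function name `generate_ngrams` is hypothetical; the logic generates every contiguous n-gram, longest first.

```python
def generate_ngrams(words, max_ngram_size=3):
    """Return all contiguous n-grams of `words`, longest first."""
    ngrams = []
    for size in range(max_ngram_size, 0, -1):  # start with the largest n-grams
        for start in range(len(words) - size + 1):
            ngrams.append(" ".join(words[start:start + size]))
    return ngrams

print(generate_ngrams(["publication", "date"]))
# ['publication date', 'publication', 'date']
```

Because the longest n-grams come first, a multi-word predicate gets a chance to match before its individual words do.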


In [160]:
# Maps verb lemmas to predicate labels from the graph.
# Keys must be unique: in a dict literal a repeated key silently keeps only the last value.
verb_to_noun = {
    "affiliate": "affiliation",
    "animate": "animator",
    "base": "based on",
    "cast": "cast member",
    "characterize": "characters",
    "depict": "depicts",
    "describe": "node description",
    "design": "designed by",
    "distribute": "distributed by",
    "educate": "educated at",
    "employ": "employer",
    "found": "founded by",
    "influence": "influenced by",
    "locate": "location",
    "narrate": "narrator",
    "originate": "country of origin",
    "participate": "participant in",
    "perform": "performer",
    # "produce": "producer",  # shadowed: the "produce" entry further down wins
    "publish": "publication date",
    "rate": "rating",
    "receive": "award received",
    "represent": "represented by",
    "screen": "screenwriter",
    "study": "student of",
    "write": "screenwriter",
    "direct": "director",
    "photograph": "director of photography",
    "edit": "film editor",
    "speak": "languages spoken, written or signed",
    "produce": "production company",
    "confer": "conferred by",
    "broadcast": "broadcast by",
    "present": "presented in",
    "voice": "voice actor",
    "film": "filming location",
    "release": "publication date",
    "award": "award received",
    "create": "creator",
    "develop": "developer",
    "choreograph": "choreographer",
    "make": "production company",
    "assemble": "crew member(s)",
    "inspire": "inspired by",
    "contribute": "contributor to the creative work or subject",
    "style": "costume designer",
    "nominate": "nominated for",
    "portray": "cast member",
    "label": "node label",
    "set": "narrative location",
    "shot": "filming location",
    "main character": "characters"
}


In [218]:
# Check if some values in the dict do not correspond to actual predicate labels in the graph
for value in verb_to_noun.values():
    if value not in predicates.values():
        print(f"{value} to be deleted")

In [219]:
def extract_relation(sentence, predicate_dict, n=5, confidence=0.6, max_ngram_size=3):
    """
    Extracts the relation from a sentence by finding the most similar predicates,
    prioritizing longer n-grams first. If a match with similarity > confidence
    is found, it returns that result immediately.
    
    Args:
    - sentence (str): The input sentence from which to extract the relation.
    - predicate_dict (dict): Dictionary of known predicates with their descriptions.
    - n (int): Number of top matches to return.
    - confidence (float): Minimum similarity threshold for a match.
    - max_ngram_size (int): Maximum number of words in an n-gram to consider for matching.
    
    Returns:
    - list: Top `n` predicate matches that have a similarity score above the confidence threshold.
    """
    # Step 1: Parse the sentence to filter stop words and prioritize key phrases
    doc = nlp(sentence)
    meaningful_words = [token.text for token in doc if not token.is_stop and token.is_alpha]
    
    # Step 2: Generate prioritized n-grams from meaningful words (starting with the longest n-grams)
    ngrams = []
    for size in range(max_ngram_size, 0, -1):  # Start with larger n-grams
        ngrams += [" ".join(meaningful_words[i:i+size]) for i in range(len(meaningful_words) - size + 1)]
    
    # Step 3: Check each n-gram for similarity, starting with the longest
    for ngram in ngrams:

        # First check if the ngram corresponds exactly or almost exactly to a predicate using the edit distance
        close_match = match_entity_editdistance(ngram, dictionary=predicate_dict, threshold=2)
        if close_match:
            _, match_value, _ = close_match
            return [match_value]
        
        print(f"ngram: {ngram}")
        ngram = " ".join([verb_to_noun.get(token.lemma_, token.lemma_) for token in nlp(ngram)])
        print(f"lemma: {ngram}\n")
        
        matches = find_match(ngram, predicate_dict, n=n, confidence=confidence)
        
        # If a match with similarity > confidence is found, return immediately
        if matches:
            return matches
    
    # Step 4: If no matches above the confidence threshold are found, return an empty list
    return []


In [220]:
sentence = "Who is the release date of ()?"
relation = extract_relation(sentence, predicates, n=3, confidence=0.5)
print("Extracted Relation:", relation)

ngram: release date
lemma: publication date date

Extracted Relation: ['publication date', 'place of publication', 'student of']


### Entities handling

For entities the problem of synonyms is less relevant, because we can generally assume that people's names and movie titles have no synonyms. However, we still need to make sure that the entities recognized by the NER model correspond to real entities in the knowledge graph; otherwise we cannot map them to an embedding. To achieve this we can use an edit-distance-based matching function, a variant of match_entity with a distance threshold:

In [203]:
import editdistance

In [204]:
def match_entity_editdistance(entity, dictionary=nodes, threshold=5):
    """
    Matches the given entity to the closest node in the dictionary based on edit distance.
    Returns None if the closest match exceeds the specified distance threshold.
    
    Args:
    - entity (str): The entity to match.
    - dictionary (dict): The graph dictionary with nodes to match against.
    - threshold (int): The maximum allowable edit distance for a match.
    
    Returns:
    - (str, str, int) or None: Returns (node_key, node_value, distance) if a match is found within
      the threshold, otherwise returns None.
    """
    tmp = float('inf')  # Start with the highest possible distance
    match_node = None
    match_value = None
    
    for key, value in dictionary.items():
        # Calculate edit distance between the entity and current node value
        distance = editdistance.eval(value, entity)
        if distance < tmp:
            tmp = distance
            match_node = key
            match_value = value
    
    # Return None if the closest match exceeds the threshold
    if tmp > threshold:
        return None
    
    return match_node, match_value, tmp


In [205]:
# Example usage
phrase = "Incption"
print(match_entity_editdistance(phrase, nodes)) 

('http://www.wikidata.org/entity/Q25188', 'Inception', 1)


### Extract embeddings from the files

We load the pre-computed embeddings from the files. How to use them is explained after the process_question function. Because some relation embeddings are missing, we need to load them now and account for that in the process_question function.

In [230]:
import numpy as np
import csv

entity_matrix = np.load('/Users/gianmarcoalbano/Desktop/Advanced topics in AI/Chatbot-Project/ddis-graph-embeddings/entity_embeds.npy')
predicate_matrix = np.load('/Users/gianmarcoalbano/Desktop/Advanced topics in AI/Chatbot-Project/ddis-graph-embeddings/relation_embeds.npy')

with open('/Users/gianmarcoalbano/Desktop/Advanced topics in AI/Chatbot-Project/ddis-graph-embeddings/entity_ids.del') as ifile:
    ent2id = {ent: int(idx) for idx, ent in csv.reader(ifile, delimiter='\t')}
    id2ent = {v: k for k, v in ent2id.items()}
with open('/Users/gianmarcoalbano/Desktop/Advanced topics in AI/Chatbot-Project/ddis-graph-embeddings/relation_ids.del') as ifile:
    pred2id = {rel: int(idx) for idx, rel in csv.reader(ifile, delimiter='\t')}
    id2pred = {v: k for k, v in pred2id.items()}

### Predicates without embeddings

There seems to be a problem with the embeddings: some predicates have no embedding.

In [272]:
print(f"predicates list: {len(predicates)}")
print(f"relation embeddings list: {len(predicate_matrix)}\n")
#pred2id['http://www.wikidata.org/prop/direct/P577']
pred_without_embeddings = []
# Which predicates are missing an embedding?
for uri, label in predicates.items():
    # A predicate has an embedding only if its URI appears in the relation id file
    if uri not in pred2id:
        print(f"{label} has no embedding")
        pred_without_embeddings.append(label)

predicates list: 255
relation embeddings list: 248

node label has no embedding
IMDb ID has no embedding
image has no embedding
tag has no embedding
node description has no embedding
publication date has no embedding
box office has no embedding
rating has no embedding


### NER Pipeline

We can now combine these functions to extract both predicates and entities from a question. We use the model "..." to recognize entities (the choice of model is not final) and proceed as follows:
- Preprocess the question to remove special characters that carry no meaning here, such as ! or :
- Extract a list of dictionaries with all the entities from the question using NER
    - The dictionaries look like: {'entity_group': 'PER', 'word': 'Andrew Garfield'}
- Map the extracted entities to actual nodes in the graph via the editdistance function
    - If the distance between the entity in the question and the closest entity in the graph is > 5, no entity is matched
    - If the distance is below that but above a lower cutoff (3 in the code below), the chatbot asks the user to confirm that the right entity was matched
- Remove the entities from the question
- Pass the question without entities to the extract_relation function
- Add the extracted predicates to the list of dictionaries as {'entity_group': 'predicate', 'word': ['screenwriter']}
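
The matching step with its two cutoffs can be sketched in isolation. This is a minimal illustration, not the notebook's `match_entity_editdistance`: `levenshtein` and `classify_match` are hypothetical helpers, the labels are a toy list, and the cutoffs 5 and 3 are assumptions taken from the values used in this section.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def classify_match(mention, labels, reject_above=5, confirm_above=3):
    """Return (best_label, distance, decision) for a mention.

    Both cutoffs are assumptions for this sketch.
    """
    best = min(labels, key=lambda label: levenshtein(mention.lower(), label.lower()))
    distance = levenshtein(mention.lower(), best.lower())
    if distance > reject_above:
        return None, distance, "no match"
    if distance > confirm_above:
        return best, distance, "ask user"
    return best, distance, "accept"

toy_labels = ["Andrew Garfield", "The Godfather", "Batman"]
print(classify_match("Andrew Garfeild", toy_labels))
print(classify_match("The Godfather 1972", toy_labels))
print(classify_match("Inception", toy_labels))
```

A linear scan like this is fine for a toy list. For a graph with many labels, a faster distance implementation or a candidate pre-filter would probably be needed.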


A note on handling the 'Entity matching too distant' case: in the final notebook with the Speakeasy infrastructure, store the matched entities that were too distant in a variable, so that they are available for generating the next message if the user answers 'yes' to the question 'did you mean -matched_entity-?'
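
One possible shape for that variable is a dictionary keyed by conversation. The sketch below is an assumption about how this could look, not the Speakeasy API: `pending_matches`, `remember_suggestion`, `resolve_reply` and the `room_id` key are all hypothetical names.

```python
# Hypothetical in-memory store: one pending suggestion per chat room
pending_matches = {}

def remember_suggestion(room_id, question, suggested_label):
    """Store the suggestion so the next message in this room can confirm it."""
    pending_matches[room_id] = {"question": question, "suggestion": suggested_label}

def resolve_reply(room_id, reply):
    """Return the confirmed label if the user said yes, else None.

    The pending entry is removed either way, so it cannot leak into later turns.
    """
    pending = pending_matches.pop(room_id, None)
    if pending and reply.strip().lower() in {"yes", "y"}:
        return pending["suggestion"]
    return None

remember_suggestion("room-1", "Who directed The Godfahter?", "The Godfather")
first = resolve_reply("room-1", "Yes")
second = resolve_reply("room-1", "yes")  # nothing pending any more
print(first, second)
```

Storing the original question alongside the suggestion allows the bot to re-run the question with the confirmed entity instead of asking the user to repeat it.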

In [206]:
from transformers import pipeline
import re

# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [282]:
def preprocess_question(question):
    # Remove symbols like :, !, -, etc., by replacing them with an empty string
    cleaned_question = re.sub(r'[:!\\-]', '', question)
    # Remove any extra spaces that might result from removing symbols
    cleaned_question = re.sub(r'\s+', ' ', cleaned_question).strip()
    return cleaned_question

# Define a function to extract entities and relation from a given question
def extract_entities(question, predicate_dict=predicates, n=5, confidence=0.6, max_ngram_size=3):
    extracted_entities = []

    exit_status = ""
    question = preprocess_question(question)
    print(f"Question after preprocessing: {question}\n")

    # Step 1: Use the NER pipeline to get entities in the question
    entities = ner_pipeline(question)

    # Only process entities if NER found any; otherwise exit_status is set in the else branch
    # (the user can then be asked to double check capitalization)
    if entities:
    
        # Step 2: Turn dictionaries in entities into simplified dictionaries and concatenate words to join_entity['word']
        for entity in entities:
            simplified_entity = {
                'entity_group': entity['entity_group'],
                'word': entity['word']
            }
            extracted_entities.append(simplified_entity)

        # Step 3: Remove extracted entities from the question to isolate the predicate phrase
        question_no_entities = question
        for entity in extracted_entities:
            print(f"Extracted entity: {entity['word']}\n")
            # Convert both the question and entity to lowercase for consistent replacement
            question_no_entities = re.sub(r'\b' + re.escape(entity['word'].lower()) + r'\b', '', question_no_entities.lower(), flags=re.IGNORECASE)
    
        # Replace multiple spaces with a single space and trim leading/trailing whitespace
        question_no_entities = re.sub(r'\s+', ' ', question_no_entities).strip()
        
        print(f"Question after removing entities: {question_no_entities}\n")

        # Step 3.5: Match each entity to the closest node in the graph. Remove them if there is no match
        # Iterate over a copy so that removing from extracted_entities does not skip elements
        for entity in list(extracted_entities):
            # Call the matcher once and reuse its result
            match = match_entity_editdistance(entity['word'], threshold=5)
            if match:
                match_node, match_value, distance = match

                # If the closest entity we can find in the graph is still distant, return the best matched value
                # with exit status Entity matching too distant. Then ask the user if the match_value actually
                # corresponds to what they wanted
                if distance > 3:
                    exit_status = 'Entity matching too distant'
                    return match_value, exit_status
                else:
                    # Update 'word' in entity to be the best-matching node's label
                    entity['word'] = match_value
            else:
                # Remove the entity from extracted_entities if no match was found
                extracted_entities.remove(entity)

    else:
        exit_status = 'No entities found by NER'

    # Step 4: Extract the relation from the modified question using the extract_relation function
    relations = extract_relation(question_no_entities if entities else question, predicate_dict, n=n, confidence=confidence, max_ngram_size=max_ngram_size)

    # Step 4.5: Check if the extracted relations have an embedding
    for relation in relations:
        if relation in pred_without_embeddings:
            exit_status = 'predicate missing embedding'
            return relation, exit_status
                
    # Step 5: Add the relation to the extracted_entities list if a match is found
    if relations:
        print(f"Extracted predicates: {relations}\n")
        extracted_entities.append({'entity_group': 'predicate', 'word': []})
        for relation in relations:
            extracted_entities[-1]['word'].append(relation)

    return extracted_entities, exit_status
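
Step 3 (removing entity mentions) can be tried on its own. The sketch below is a standalone variant for illustration: `remove_entities` is a hypothetical helper, and unlike the function above it keeps the original casing of the question.

```python
import re

def remove_entities(question, entity_words):
    """Delete each entity mention (case-insensitively) and tidy the whitespace."""
    for word in entity_words:
        # re.escape keeps characters such as '.' or '(' in a title from acting as regex syntax
        question = re.sub(r'\b' + re.escape(word) + r'\b', '', question, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed mentions
    return re.sub(r'\s+', ' ', question).strip()

stripped = remove_entities(
    "Who is the director of Star Wars Episode VI Return of the Jedi?",
    ["Star Wars Episode VI Return of the Jedi"],
)
print(stripped)
```

The `\b` anchors assume that every mention starts and ends with a word character. A mention that ends in punctuation would need a different boundary check.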
    

In [283]:
sentence = "Who is the director of the godfather"
extracted_entities = extract_entities(sentence, predicates, n=2, confidence=0.6)
print("Extracted entities:", extracted_entities[0])
print("Exit status:", extracted_entities[1])

Question after preprocessing: Who is the director of the godfather

ngram: director godfather
lemma: director godfather

Extracted predicates: ['director', 'developer']

Extracted entities: [{'entity_group': 'predicate', 'word': ['director', 'developer']}]
Exit status: No entities found by NER


### Embeddings

Now that we have a reliable way of extracting entities and predicates from the question, we can turn them into embeddings.

ent2id gives the index of an entity in the embedding matrix given its URI. Retrieving the embedding of an entity from its label looks like this:

In [284]:
entity_label = 'The Godfather'

# Turn label into URI
Uri = ent2uri[entity_label]
print(f"The URI of {entity_label} is {Uri}\n")

# Turn URI into a row index
id = ent2id[Uri]
print(f"The id of {entity_label} is {id}\n")

# Look up the row index in the embedding matrix
entity_embedding = entity_matrix[id]
print(f"The embedding of {entity_label} has length {len(entity_embedding)}\n") # Not printed in full because it is long


The URI of The Godfather is http://www.wikidata.org/entity/Q47703

The id of The Godfather is 34515

The embedding of The Godfather has length 256



In [271]:
pred_label = 'director'

# Turn label into URI
Uri = pred2uri[pred_label]
print(f"The URI of {pred_label} is {Uri}\n")

# Turn URI into a row index
id = pred2id[Uri]
print(f"The id of {pred_label} is {id}\n")

# Look up the row index in the embedding matrix
pred_embedding = predicate_matrix[id]
print(f"The embedding of {pred_label} has length {len(pred_embedding)}\n") # Not printed in full because it is long

The URI of director is http://www.wikidata.org/prop/direct/P57

The id of director is 12

The embedding of director has length 256



### Turn labels into embeddings

We write a function to make embedding retrieval more straightforward:

In [232]:
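def extract_embedding(label, type='entity'):
    """Return the embedding for a label; type is 'entity', anything else means predicate."""

    # Pick the lookup tables for the requested type (named lookups to avoid
    # shadowing the transformers pipeline function imported above)
    if type=='entity':
        lookups = [ent2uri, ent2id, entity_matrix]
    else:
        lookups = [pred2uri, pred2id, predicate_matrix]

    # Turn label into URI
    Uri = lookups[0][label]

    # Turn URI into a row index
    idx = lookups[1][Uri]

    # Look up the row index in the embedding matrix
    embedding = lookups[2][idx]

    return embedding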

In [233]:
# Example usage

entity_label = 'The Godfather'
entity_embedding = extract_embedding(entity_label)
print(f"The embedding of {entity_label} has length {len(entity_embedding)}\n")

pred_label = 'director'
pred_embedding = extract_embedding(pred_label, 'predicate')
print(f"The embedding of {pred_label} has length {len(pred_embedding)}\n")

The embedding of The Godfather has length 256

The embedding of director has length 256



### Turn embeddings into labels

We also need a way to turn an embedding back into a label:

In [234]:
def extract_label(embedding, type='entity'):

    if type=='entity':
        lookups = [entity_matrix, id2ent, nodes]
    else:
        lookups = [predicate_matrix, id2pred, predicates]

    # Find the row of the embedding matrix that is exactly equal to the embedding vector
    idx = np.where((lookups[0] == embedding).all(axis=1))[0][0]

    # Turn the row index into a URI
    Uri = lookups[1][idx]

    # Turn the URI into a label
    label = lookups[2][Uri]

    return label
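
extract_label relies on exact equality between the query vector and a matrix row. That holds for vectors read straight from the matrix, but a vector produced by arithmetic will generally not be bit-identical to any row. The sketch below illustrates the difference with plain Python lists standing in for the NumPy matrix. `exact_row`, `nearest_row` and the toy values are assumptions made for the example.

```python
import math

toy_matrix = [[0.1, 0.2], [0.4, 0.6], [0.9, 0.1]]  # made-up 2-D "embeddings"

def exact_row(matrix, vector):
    """Index of the row equal to vector, or None if no row matches exactly."""
    for idx, row in enumerate(matrix):
        if row == vector:
            return idx
    return None

def nearest_row(matrix, vector):
    """Index of the row with the smallest Euclidean distance to vector."""
    return min(range(len(matrix)), key=lambda idx: math.dist(matrix[idx], vector))

stored = toy_matrix[1]
computed = [0.4, 0.2 + 0.4]  # floating-point sum, very close to 0.6 but not equal to it

print(exact_row(toy_matrix, stored))     # found: the vector came from the matrix
print(exact_row(toy_matrix, computed))   # not found: rounding error breaks equality
print(nearest_row(toy_matrix, computed)) # nearest-row lookup still recovers the row
```

For vectors that come out of a computation, a nearest-row lookup is therefore the safer choice.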
    

In [235]:
# Example usage

entity_label = 'The Godfather'
entity_embedding = extract_embedding(entity_label)
print(f"The embedding of {entity_label} has length {len(entity_embedding)}\n")

# Turn the embedding back into a label
label = extract_label(entity_embedding)
print(f"The extracted label for entity: {entity_label} is {label}\n")

pred_label = 'characters'
pred_embedding = extract_embedding(pred_label, 'predicate')
print(f"The embedding of {pred_label} has length {len(pred_embedding)}\n")

# Turn the embedding back into a label
label = extract_label(pred_embedding, 'predicate')
print(f"The extracted label for predicate: {pred_label} is {label}")

The embedding of The Godfather has length 256

The extracted label for entity: The Godfather is The Godfather

The embedding of characters has length 256

The extracted label for predicate: characters is characters


### Evaluate embeddings similarity

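Given the embedding of an entity, we want to find the entities in the graph that are most similar to it. We rank all entities by their distance to the query embedding (pairwise_distances uses Euclidean distance by default), so the query entity itself comes first: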

In [236]:
from sklearn.metrics import pairwise_distances

In [237]:
def find_similarities(embedding, n):

    embedding = np.atleast_2d(embedding)

    answer = []

    dist = pairwise_distances(embedding, entity_matrix)
    for idx in dist.argsort().reshape(-1)[:n]:
        answer.append(nodes[id2ent[idx]])

    return answer

In [238]:
# Example Usage

entity_embedding = extract_embedding('Batman')

print(find_similarities(entity_embedding, 5))

['Batman', 'Deathstroke', 'Harley Quinn', 'The Joker', 'Killer Croc']
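
Because the query entity is always at distance zero from itself, it appears first in the result. If that is not wanted, the ranking can drop it. The following is a small plain-Python sketch of that idea on made-up 2-D vectors. `most_similar` and `toy_embeddings` are illustrative names, not part of the notebook.

```python
import math

# Made-up 2-D vectors standing in for real entity embeddings
toy_embeddings = {
    "Batman": [1.0, 1.0],
    "The Joker": [1.1, 0.9],
    "Harley Quinn": [1.3, 0.8],
    "The Godfather": [5.0, 4.0],
}

def most_similar(label, embeddings, n, exclude_query=True):
    """Return the n labels closest to `label` by Euclidean distance."""
    query = embeddings[label]
    ranked = sorted(embeddings, key=lambda other: math.dist(query, embeddings[other]))
    if exclude_query:
        # Drop the query itself, which always ranks first at distance zero
        ranked = [other for other in ranked if other != label]
    return ranked[:n]

print(most_similar("Batman", toy_embeddings, 2))
print(most_similar("Batman", toy_embeddings, 2, exclude_query=False))
```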


### Answer questions

We can now answer questions with the following pipeline:
- Extract the entities and the relation from the question
- Turn the entities and the relation into embeddings
- If the entity is a subject, retrieve the object as _object = subject + relation_
- If the entity is an object, retrieve the subject as _subject = object - relation_

answer_question_embeddings, defined below, covers only the first case (subject + relation).
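
The vector arithmetic in the last two steps can be illustrated on toy data. The sketch assumes that relations act as translations in the embedding space, as the two formulas above suggest. All vectors and names (`toy_entities`, `directed_by`, `nearest_label`) are made up for the example.

```python
import math

# Made-up 2-D embeddings; the values are exact binary fractions, so the sums are exact
toy_entities = {"film": [1.0, 2.0], "director": [1.5, 1.0], "actor": [3.0, 3.0]}
directed_by = [0.5, -1.0]  # hypothetical relation vector

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]

def nearest_label(vector, entities):
    """Label whose embedding is closest to vector (Euclidean distance)."""
    return min(entities, key=lambda label: math.dist(entities[label], vector))

# Entity is the subject: object = subject + relation
predicted_object = nearest_label(add(toy_entities["film"], directed_by), toy_entities)
# Entity is the object: subject = object - relation
predicted_subject = nearest_label(sub(toy_entities["director"], directed_by), toy_entities)
print(predicted_object, predicted_subject)
```

With real embeddings the computed point rarely coincides with a stored vector, which is why the result is a ranked list of nearest entities and not a single exact hit.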

In [295]:
def answer_question_embeddings(question):
    entities, exit_status = extract_entities(question, predicates, n=3, confidence=0.6)

    # Handle cases based on exit_status
    if exit_status == 'No entities found by NER':
        return "We could not find any entities in the question. Could you verify that you have capitalized the right letters, such as movie titles or people’s names?"

    elif exit_status == 'Entity matching too distant':
        match_value = entities  # in this case extract_entities returns the match value itself
        return f"The closest entity match found was '{match_value}', but it seems too distant. Could you rephrase it or specify it more clearly?"

    elif exit_status == 'predicate missing embedding':
        relation = entities  # in this case extract_entities returns the relation label itself
        return f"Unfortunately, we were not provided with an embedding for the relation '{relation}'. Please try another question."

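    # Proceed if everything worked correctly
    if not exit_status:
        # Separate the predicate entry (a list of labels) from the entity entries
        predicate_entries = [d['word'] for d in entities if d['entity_group'] == 'predicate']
        extracted_entities = [d['word'] for d in entities if d['entity_group'] != 'predicate']

        # Without at least one entity and one predicate there is nothing to combine
        if not predicate_entries or not extracted_entities:
            return "We could not identify both an entity and a relation in the question. Could you rephrase it?"
        extracted_predicates = predicate_entries[0]

        # Convert predicates and entities to embeddings
        predicates_embeddings = [extract_embedding(pred, 'predicate') for pred in extracted_predicates]
        entities_embeddings = [extract_embedding(ent) for ent in extracted_entities]

        # Compute the answer: entities nearest to (first entity + first predicate)
        answer = find_similarities(entities_embeddings[0] + predicates_embeddings[0], 10)
        return answer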


In [296]:
# Example Usage

question = "Who is the director of Star Wars: Episode VI - Return of the Jedi?"

print(answer_question_embeddings(question))

Question after preprocessing: Who is the director of Star Wars Episode VI Return of the Jedi?

Extracted entity: Star Wars Episode VI Return of the Jedi

Question after removing entities: who is the director of ?

Extracted predicates: ['director']

['George Lucas', 'Anthony Daniels', 'Ellis Rubin', 'Lawrence Kasdan', 'Richard Driscoll', 'Mike Quinn', 'Kenny Baker', 'James Kahn', 'Sebastian Shaw', 'Ahmed Best']


# TODO

- Implement a way to handle double questions like "who is the director of ... AND who is the screenwriter of ...."
- Finish and refine the queries for factual questions
- Integrate a language model to generate more natural responses
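
For the first item, one naive starting point is to split the input on "and" only when a question word follows, so that titles containing "and" stay intact. This is only a sketch under that assumption: `split_compound_question` and the `QUESTION_WORDS` list are hypothetical and would need tuning on real questions.

```python
import re

# Assumed list of words that can start a new question
QUESTION_WORDS = r"(?:who|what|when|where|which|how)"

def split_compound_question(question):
    """Split on 'and' only when it is followed by a question word."""
    pattern = r"\s+and\s+(?=" + QUESTION_WORDS + r"\b)"
    parts = re.split(pattern, question.strip(), flags=re.IGNORECASE)
    # Normalise each part so that it ends with exactly one question mark
    return [part.strip(" ?") + "?" for part in parts if part.strip(" ?")]

print(split_compound_question(
    "Who is the director of Batman and who is the screenwriter of The Godfather?"
))
print(split_compound_question("Who directed Romeo and Juliet?"))
```

Each resulting part could then be passed separately to answer_question_embeddings.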