# Notebook 8 - Knowledge Representation (KR)

CSI4106 Artificial Intelligence   
Fall 2020  
Prepared by Julian Templeton and Caroline Barrière

***INTRODUCTION***:  

When reading text, understanding the types of the entities within the text helps extract additional information about each entity. Through Named Entity Recognition (NER), we can determine whether an entity is a Person, Organization, Country, ... When exploring text online, we also occasionally see entities with clickable links to webpages that provide more information on the entity. This is a form of text enhancement that lets readers easily access the information needed to understand each entity in the text.    

In this notebook we will be revisiting the Covid-19 related news dataset from notebook 7 to explore how we can improve spaCy's NER disambiguation and enhance the text from the news articles through the use of entity linking. This will be done in two parts, where we first use text coherence for NER disambiguation and then perform text enhancement with entity linking.    

**For this notebook, do not modify function definitions and be sure to use the setup that is provided to you AND only submit this file, nothing else is needed**.

This notebook uses libraries that have been used in previous notebooks, including spaCy and pandas. Recall that if you run into any issues loading 'en', you should comment out that line and uncomment the alternative line of code that is included (the same way that you may have done in notebook 7).

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook), and submit it.  

*The notebook will be marked on 30.  
Each **(TO DO)** has a number of points associated with it.*
***

In [1]:
# Before starting we will import every module that we will be using
import spacy
import pandas as pd

In [2]:
# The core spacy object that will be used for tokenization, lemmatization, POS Tagging, ...
# Note that this is specifically for the English language and requires the English package to be installed
# via pip to work as intended.

#sp = spacy.load('en')

# If the above causes an error after installing the package described in (2), install the package described
# in the Note section within the introduction and run this line of code instead of the above.
sp = spacy.load('en_core_web_sm')

**PART 1 - Text Coherence for NER Disambiguation**  
  
For the first part of this notebook we will use *spaCy* to perform NER disambiguation, and then use text coherence to improve the results on documents from the included file of Covid-19 related news articles from CBC News (the same file as in notebook 7). We will begin by looking at the NER disambiguation that is performed by spaCy and think of some simple methods that use the coherence of the entities within the text to potentially improve it.   


As with last notebook, the dataset is included with this notebook, but details regarding it can be found [here](https://www.kaggle.com/ryanxjhan/cbc-news-coronavirus-articles-march-26?select=news.csv). The first thing that we will do, as usual, is load the file into a pandas dataframe.  

In [3]:
# Read the dataset, show top ten rows
df = pd.read_csv("news.csv")
df.head(10)

Unnamed: 0.1,Unnamed: 0,authors,title,publish_date,description,text,url
0,0,[],'More vital now:' Gay-straight alliances go vi...,2020-05-03 1:30,Lily Overacker and Laurell Pallot start each g...,Lily Overacker and Laurell Pallot start each g...,https://www.cbc.ca/news/canada/calgary/gay-str...
1,1,[],Scientists aim to 'see' invisible transmission...,2020-05-02 8:00,Some researchers aim to learn more about how t...,"This is an excerpt from Second Opinion, a week...",https://www.cbc.ca/news/technology/droplet-tra...
2,2,['The Canadian Press'],Coronavirus: What's happening in Canada and ar...,2020-05-02 11:28,Canada's chief public health officer struck an...,The latest: The lives behind the numbers: Wha...,https://www.cbc.ca/news/canada/coronavirus-cov...
3,3,[],"B.C. announces 26 new coronavirus cases, new c...",2020-05-02 18:45,B.C. provincial health officer Dr. Bonnie Henr...,B.C. provincial health officer Dr. Bonnie Henr...,https://www.cbc.ca/news/canada/british-columbi...
4,4,[],"B.C. announces 26 new coronavirus cases, new c...",2020-05-02 18:45,B.C. provincial health officer Dr. Bonnie Henr...,B.C. provincial health officer Dr. Bonnie Henr...,https://www.cbc.ca/news/canada/british-columbi...
5,5,"['Senior Writer', 'Chris Arsenault Is A Senior...",Brazil has the most confirmed COVID-19 cases i...,2020-05-02 8:00,"From describing coronavirus as a ""little flu,""...","With infection rates spiralling, some big city...",https://www.cbc.ca/news/world/brazil-has-the-m...
6,6,['Cbc News'],The latest on the coronavirus outbreak for May 1,2020-05-01 20:43,The latest on the coronavirus outbreak from CB...,Coronavirus Brief (CBC) Canada is officiall...,https://www.cbc.ca/news/the-latest-on-the-coro...
7,7,['Cbc News'],Coronavirus: What's happening in Canada and ar...,2020-05-01 11:51,Nova Scotia announced Friday it is immediately...,The latest: The lives behind the numbers: Wha...,https://www.cbc.ca/news/canada/coronavirus-cov...
8,8,"['Senior Writer', ""Adam Miller Is Senior Digit...",Did the WHO mishandle the global coronavirus p...,2020-04-30 8:00,The World Health Organization has come under f...,The World Health Organization has come under f...,https://www.cbc.ca/news/health/coronavirus-who...
9,9,['Thomson Reuters'],Armed people in Michigan's legislature protest...,2020-04-30 21:37,"Hundreds of protesters, some armed, gathered a...","Hundreds of protesters, some armed, gathered a...",https://www.cbc.ca/news/world/protesters-michi...


In the previous notebook, when we explored how spaCy can perform the various steps of the NLP pipeline, we saw that it was able to perform Named Entity Recognition (NER). Below is the same example that we saw from the last notebook to showcase how we can access spaCy's NER type predictions for tokens in a text.

In [4]:
# Same example from notebook 7, recall that we loop through the iterator found in the .ents property of a parsed sentence
sentence_example = "Government guidelines in Canada recommend that people stay at least two metres away from others as part of physical distancing measures to curb the spread of COVID-19."
sentence_example_content = sp(sentence_example)
# Loop through all tokens that contain a NER type and print the token along with the corresponding NER type
for token in sentence_example_content.ents:
    print("\"" + token.text + "\" is a " + token.label_ )

"Canada" is a GPE
"at least two metres" is a QUANTITY


**(TO DO) Q1**  
Before performing NER with text coherence, you will first explore how spaCy performs NER disambiguation. In the text of the ***second document*** (index 1) of our corpus of documents, which words are *PER* (spaCy uses the *PERSON* type, rather than *PER*), *ORG* (Organization), and *GPE* (Geopolitical Entity)? You must do the following for this question:    
a) Print each *PER*, *ORG*, and *GPE* along with its NER type from spaCy.     
b) Are all of these NER type predictions correct? If not, provide three examples of incorrect outputs.    
c) Do any of the problems with the NER type predictions come from an earlier step in the NLP pipeline that is performed by spaCy? Describe the problem for two examples from the output above.    

**(TO DO) Q1 (a) - 2 marks**  
Print each *PER*, *ORG*, and *GPE* along with its NER type from spaCy  

In [42]:
# Select the second document (index 1)
doc = df["text"][1]
# TODO
doc_piped = sp(doc)
for token in doc_piped.ents:
    if token.label_ in ("PERSON", "ORG", "GPE"):
        print(token.text, token.label_)

the World Health Organization ORG
Touches ORG
WHO ORG
the Public Health Agency ORG
W.F. Wells PERSON
Harvard School of Public Health ORG
Wells ORG
Canada GPE
Lydia Bourouiba PERSON
the Fluid Dynamics of Disease Transmission Laboratory ORG
the Massachusetts Institute of Technology ORG
Bourouiba PERSON
Mark Loeb PERSON
Hamilton PERSON
McMaster University ORG
RNA GPE
Wuhan GPE
China GPE
Nebraska GPE
Loeb PERSON
Loeb PERSON
Canada GPE
Gary Moore/CBC PERSON
Allison McGeer PERSON
Sinai Health ORG
Toronto GPE
particles  PERSON
McGeer ORG
McGeer PERSON
Bourouiba GPE
Bourouiba PERSON
Bourouiba/MIT/ ORG
Samira Mubareka PERSON
Sunnybrook Hospital ORG
Toronto GPE
Bourouiba PERSON
JAMA Insights ORG
McMaster PERSON
Loeb PERSON
N95 ORG
U.S. GPE
Justin Trudeau PERSON
the New England Journal of Medicine ORG
the U.S. National Institutes of Health ORG
U.S. National Institutes of Health ORG
Journal of the Royal Society Interface ORG
U.S. GPE
Singapore GPE
N95 ORG
Gary S. Settles PERSON
Penn State Universi

**(TO DO) Q1 (b) - 1 mark**   
Are all of these NER type predictions correct? If not, provide two examples of incorrect outputs.     

No, not all of the NER types are correct. For example, McGeer is labeled as both a PERSON and an ORG, N95 is labeled as an ORG, particles is labeled as a PERSON, etc. 


**(TO DO) Q1 (c) - 2 marks**   
Do any of the problems with the NER type predictions come from an earlier step in the NLP pipeline that is performed by spaCy? Describe the problem for two examples from the output above.    

Yes, some of the problems with predicting NER types are due to tokens being parsed incorrectly in earlier NLP pipeline steps. For example, N95 should have been treated as a modifier of a noun (N95 masks); it is not a separate entity and therefore should not have an NER tag of ORG. Another example is Gary Moore/CBC, which is labeled (partially) incorrectly as a PERSON. CBC should be labeled separately as ORG. This error is due to an earlier NLP pipeline step not delimiting this token correctly. The same error also caused Bourouiba (in Bourouiba/MIT/) to be labeled as an ORG, since MIT was not tokenized as a separate entity.  


Now that you have seen that spaCy NER does not always perform correctly, we will try to use text coherence to modify the NER types that spaCy gave. In fact, spaCy assigns the entity types one sentence at a time. But when looking at the whole document, and knowing that text is usually coherent, we can do some post-processing on the output of spaCy's NER module and correct some mistakes. By text being coherent, we mean, for example, that if a person is referred to with a particular name, e.g. *McGeer*, chances are that each time we see *McGeer* in the document, it is the same person. So it is unlikely that *McGeer* would be a person in one place and an organization in another. It is not always true, but it is a common assumption. Therefore, we will explore two different strategies to use text coherence to post-process the output from the spaCy NER module.  

The first strategy (*explored in Q2/Q3*) is to find, among all NER types assigned to an entity, the most frequent one. For example, the entity *Bourouiba* was assigned GPE once and PERSON three times, so this information can be used to change the GPE type to PERSON.  
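
As a rough illustration of this first strategy, the sketch below counts the types assigned to one surface form and keeps the most frequent one. It works on plain `(text, type)` tuples rather than spaCy entities, and the `mentions` list and the `majority_type` helper are hypothetical names introduced only for this example (the data mirrors the *Bourouiba* predictions from Q1).

```python
from collections import Counter

# Hypothetical (surface form, NER type) pairs, mirroring the Q1 output for "Bourouiba"
mentions = [
    ("Bourouiba", "PERSON"),
    ("Bourouiba", "GPE"),
    ("Bourouiba", "PERSON"),
    ("Bourouiba", "PERSON"),
    ("Toronto", "GPE"),
]

def majority_type(name, mentions):
    """Return the NER type most often assigned to `name`, or None if it never occurs."""
    counts = Counter(label for text, label in mentions if text == name)
    if not counts:
        return None
    # most_common(1) returns a one-element list holding the (type, count) pair with the highest count
    return counts.most_common(1)[0][0]

print(majority_type("Bourouiba", mentions))  # PERSON
```

With these counts, the single GPE prediction is outvoted by the three PERSON predictions.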

The second strategy (explored in Q4) is to try to find a longer form in the text.  Since that longer form should be less ambiguous, we can use it to disambiguate the shorter, more ambiguous forms.  For example, *Lydia Bourouiba* occurs in the text and is assigned PERSON.  We can use that information to assign further occurrences of the short form *Bourouiba* to also be PERSON.   
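
The second strategy can be sketched in the same simplified setting. Again, the `(text, type)` tuples and the `longer_form` helper are assumptions made for illustration only, not part of spaCy or of the provided setup.

```python
# Hypothetical (surface form, NER type) pairs in document order
mentions = [
    ("Lydia Bourouiba", "PERSON"),
    ("Bourouiba", "GPE"),
    ("Toronto", "GPE"),
]

def longer_form(name, mentions):
    """Return the first (text, type) pair whose text strictly contains `name`, else None."""
    for text, label in mentions:
        if name in text and text != name:
            return text, label
    return None

print(longer_form("Bourouiba", mentions))  # ('Lydia Bourouiba', 'PERSON')
print(longer_form("Toronto", mentions))    # None
```

Here the short form *Bourouiba* inherits PERSON from the longer form, while *Toronto* has no longer form and keeps its own type.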

Once we have defined these two strategies, they can be combined in different ways. So, in Q5, you are asked to combine both strategies in a post-processing component for the spaCy NER module. Of course, using text coherence will not work every time, and will unfortunately introduce some errors... But let's try.

Through the remainder of this section, we will be working with the seventh document in the corpus (index 6). Below we load the document and explore all entities within the document along with their corresponding NER type.

In [12]:
# Load the document's text for the seventh document (index 6)
doc = df["text"][6]
# Parse the text with spaCy
doc_sp = sp(doc)

In [13]:
# Display all entities from the text along with their index in the .ents iterator and the
# corresponding NER type
for i, token in enumerate(doc_sp.ents):
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ )

0: "Coronavirus Brief" is a ORG
1: "CBC" is a ORG
2: "Canada" is a GPE
3: "C.D. Howe" is a PERSON
4: "Ontario" is a GPE
5: "Monday" is a DATE
6: "Alberta" is a GPE
7: "first" is a ORDINAL
8: "Saturday" is a DATE
9: "Air Canada" is a ORG
10: "Christmas" is a DATE
11: "Canadians" is a NORP
12: "more than $1.2 million" is a MONEY
13: "England" is a GPE
14: "Peter Cziborra/Reuters" is a PERSON
15: "months" is a DATE
16: "CBC" is a ORG
17: "Andre Mayer" is a PERSON
18: "Canada" is a GPE
19: "19th-century" is a DATE
20: "2013" is a DATE
21: "Calgary" is a GPE
22: "John Brown" is a PERSON
23: "the University of Calgary" is a ORG
24: "two-metre" is a TIME
25: "Last week" is a DATE
26: "Italian" is a NORP
27: "Milan" is a GPE
28: "35 kilometres" is a QUANTITY
29: "Berlin" is a GPE
30: "Budapest" is a GPE
31: "Mexico City" is a GPE
32: "Ahsan Habib" is a PERSON
33: "Dalhousie University" is a ORG
34: "U.S." is a GPE
35: "Atlanta" is a GPE
36: "Chicago" is a GPE
37: "Denver" is a GPE
38: "Habib" 

**(TO DO) Q2 - 3 marks**  
As you can see in the results, sometimes the same entity was assigned different entity types (e.g. in the document for Q1, *McGeer* was ORG once and PERSON once) since the NER algorithm looks sentence by sentence. In the following function, the purpose will be to find all the possible entity types assigned to a single entity.

Complete the definition of the *find_entity_types* function below. This function accepts as input a specific spaCy entity defined by the *entity* parameter (from the *.ents* iterable of entities) and a list of all spaCy entities defined by the *entities* parameter.     

The function must find all entities of the same name as *entity* from *entities* (the same surface form). For each match between the entities, add the NER type of the entity from the list to the dictionary *type_counts* and track the number of times each NER type appears.     

Ex: type_counts\[NER type\] = total number of times the NER type appears

In [54]:
def find_entity_types(entity, entities):
    '''
    Given a specific entity and a list of entities, finds all entities from the list that have the
    same text as the specified entity, whatever NER type they were assigned.
    
    Returns the different NER types that have been classified for an entity and the count per NER type
    as a dictionary with the keys as the NER type and the value as the count
    '''
    type_counts = {}
    for e in entities:
        if entity.text == e.text:
            if e.label_ not in type_counts:
                type_counts[e.label_] = 0
            type_counts[e.label_] += 1
    return type_counts

In [55]:
# Test the above to find the result when checking for the types of the entity 'Kenney' 
# from the document loaded above
print("All possible NER types for \"" + doc_sp.ents[85].text + "\" are " + str(find_entity_types(doc_sp.ents[85], doc_sp.ents)))


print("All possible NER types for \"" + doc_piped.ents[24].text + "\" are " + str(find_entity_types(doc_piped.ents[24], doc_piped.ents)))


All possible NER types for "Kenney" are {'ORG': 1, 'PERSON': 1}
All possible NER types for "Bourouiba" are {'PERSON': 3, 'GPE': 1}


**(TO DO) Q3 - 2 marks**  
In the previous method, *find_entity_types*, we found all the possible entity types for a single entity. Now, we want to use these to find the most common type. If we look again at the results for Q1, in the case of *McGeer*, it's a tie. But for *Bourouiba*, there is one GPE type and three PERSON types, so the most common would be PERSON.   

Complete the definition of the *most_common_type* function below. This function accepts as input a specific spaCy entity defined by the *entity* parameter (from the *.ents* iterable of entities) and a list of all spaCy entities defined by the *entities* parameter.        

Note: You can handle ties as you please.

In [75]:
def most_common_type(entity, entities):
    '''
    Given a specific entity and a list of entities, find the most similar entities and assign the
    NER type to entity based on the most common NER type assigned to entities of the same name (if there
    is a tie, you decide how to handle this).
    
    Returns the most common NER type based on similar entities
    '''
    # TODO
    type_counts = find_entity_types(entity, entities)
    max_count = 0
    max_types = []
    for ner_type, count in type_counts.items():
        if count == max_count:
            max_types.append(ner_type)
        elif count > max_count:
            max_count = count
            max_types = [ner_type]
    # Tie-breaking: PERSON seems the type most likely to be mislabeled, so prefer it
    # whenever it is among the tied types. Otherwise, take whichever tied type came first.
    if len(max_types) > 1 and "PERSON" in max_types:
        return "PERSON"
    return max_types[0] if max_types else ""



In [76]:
# Test the above to find the result when checking for the types of the entity 'Kenney' 
# from the document loaded above
print("The most common NER type to \"" + doc_sp.ents[85].text + "\" is " + most_common_type(doc_sp.ents[85], doc_sp.ents))


The most common NER type to "Kenney" is PERSON


**(TO DO) Q4 - 2 marks**  
Now we will work with a slightly more sophisticated method. We will once again work with the same *entity* and *entities* parameters, but this time you will need to assign *entity* the NER type of another entity in the *entities* iterator.    

Specifically, you must look through *entities* to find a normalized form of *entity*. In this scenario, any entity that contains *entity* as a substring will be considered a valid selection for the normalized form (where the selected entity does not have the same name as *entity*). If a normalized form is found, return the NER type of that entity, the name of that entity, and the entity itself.    

Ex: *CBC News Network* is the normalized form of *CBC*. Thus, if this entity is found, return the NER type from the entity (*ORG*) and the name of the entity (*CBC News Network*).

In [118]:
def assign_normalized_form(entity, entities):
    '''
    Given an entity and a list of entities, search the list of entities for any token that
    does not have the exact same text as entity and assign entity that token's NER type
    if entity is a substring of that token.
    
    Returns the empty string if no normalized forms are found and the NER type of the normalized form if it is found.
    Also returns the name of the entity found, if any (along with the entity).
    '''
    # MAY BE DONE SO THAT THE LAST GETS ADDED INSTEAD, THIS IS FINE.
    # Recall to return the three requested components (NER type, the text, and the actual entity)
    normalized_form = entity
    for e in entities:
        if entity.text in e.text and len(normalized_form.text) < len(e.text):
            normalized_form = e
    if normalized_form.text == entity.text:
        return ""
    else:
        return normalized_form.label_, normalized_form.text, normalized_form

In [120]:
# Test the above to find the result when checking for the types of the entity 'Kenney' 
# from the document loaded above
print(assign_normalized_form(doc_sp.ents[85], doc_sp.ents))
# Test the above to find the result when checking for the types of the entity 'CBC News' 
# from the document loaded above
print(assign_normalized_form(doc_sp.ents[153], doc_sp.ents))

('PERSON', 'Jason Kenney', Jason Kenney)
('ORG', 'CBC News Network', CBC News Network)


**(TO DO) Q5**  
Now that you have defined several algorithms to perform NER disambiguation with text coherence, you will test your algorithms and use them to define a slightly more robust method of NER disambiguation by combining the techniques performed. You will then explore whether these techniques always help with NER disambiguation.       

a) Revisit the document that was used in Q1 (index 1) and, for each entity, retrieve the normalized form of the entity (if any) and display only the normalized forms along with their NER types in the following format (only if there is a normalized form returned):    
&emsp;*Original_entity refers to Normalized_entity, and is a NER_Type_of_Normalized_Form*    
b) Define a more robust algorithm that combines the algorithms designed in the past few questions. This algorithm should accept a specific entity and list of entities as input, find the specific entity's normalized form (if any), and return an NER type for the normalized form based on the most common NER type for that entity. If no normalized form is found, the algorithm should continue by using the specific entity. You should also return the name of the normalized form (or of the original entity if there is no normalized form).       
c) For the seventh document (index 6), run the algorithm defined in b) for each entity, printing the following for each entity:    
&emsp;*Original_entity refers to Normalized_entity (if none, same as the original), and is a Most_common_NER_type_of_normalized_form*    
d) Do any of the results found from performing NER disambiguation with text coherence Q5(c) seem problematic? Give an example of a problem that is occurring with our approaches and explain why this issue occurs.

**(TO DO) Q5 (a) - 1 mark**     
a) Revisit the document that was used in Q1 (index 1) and, for each entity, retrieve the normalized form of the entity (if any) and display only the normalized forms along with their NER types in the following format (only if there is a normalized form returned):    
&emsp;*Original_entity refers to Normalized_entity, and is a NER_Type_of_Normalized_Form*    
For example "Bourouiba refers to Lydia Bourouiba, and is a PERSON" would be printed for one entity.

In [121]:
# Select document 2
doc = df["text"][1]
sp_doc_test = sp(doc)

In [122]:
# TODO: Loop through and print the assigned phrase with the appropriate text
# Example of the print statement structure (from document 1): Bourouiba refers to Lydia Bourouiba, and is a PERSON
for entity in sp_doc_test.ents:
    normalized_form = assign_normalized_form(entity, sp_doc_test.ents)
    if normalized_form != "":
        print(entity.text, "refers to", normalized_form[1], "and is a", normalized_form[0])

two metres refers to farther than two metres and is a QUANTITY
two refers to farther than two metres and is a QUANTITY
two metres refers to farther than two metres and is a QUANTITY
Wells refers to W.F. Wells and is a PERSON
Bourouiba refers to Lydia Bourouiba and is a PERSON
Loeb refers to Mark Loeb and is a PERSON
Loeb refers to Mark Loeb and is a PERSON
McGeer refers to Allison McGeer and is a PERSON
McGeer refers to Allison McGeer and is a PERSON
2 refers to April 2020 and is a DATE
Bourouiba refers to Lydia Bourouiba and is a PERSON
Bourouiba refers to Lydia Bourouiba and is a PERSON
Bourouiba refers to Lydia Bourouiba and is a PERSON
two metres refers to farther than two metres and is a QUANTITY
two refers to farther than two metres and is a QUANTITY
two refers to farther than two metres and is a QUANTITY
Second refers to Second Opinion and is a LAW
McMaster refers to McMaster University and is a ORG
Loeb refers to Mark Loeb and is a PERSON
U.S. refers to the U.S. National Instit

**(TO DO) Q5 (b) - 2 marks**     
b) Define a more robust algorithm that combines the algorithms designed in the past few questions. This algorithm should accept a specific entity and list of entities as input, find the specific entity's normalized form (if any), and return an NER type for the normalized form based on the most common NER type for that entity. If no normalized form is found, the algorithm should continue by using the specific entity. You should also return the name of the normalized form (or of the original entity if there is no normalized form).   

In [123]:
def normalized_most_common_type(entity, entities):
    '''
    Determine the normalized form of an entity (if any; if none just use the entity) and
    return the most frequent NER type for that normalized form from a list of entities.
    '''    
    # TODO (Recall to return the name and the NER type that is found)
    normalized_form = assign_normalized_form(entity, entities)
    if normalized_form == "":
        most_common = most_common_type(entity, entities)
        return entity.text, most_common
    else:
        most_common = most_common_type(normalized_form[2], entities)
        return normalized_form[1], most_common
    

**(TO DO) Q5 (c) - 1 mark**     
c) For the seventh document (index 6), run the algorithm defined in b) for each entity, printing the following for each entity:    
&emsp;*Original_entity refers to Normalized_entity (if none, same as the original), and is a Most_common_NER_type_of_normalized_form*    

In [124]:
# Load the document's text
doc = df["text"][6]
sp_doc_test = sp(doc)

In [125]:
# TODO: Loop through and print the assigned phrase with the appropriate text
for e in sp_doc_test.ents:
    normalized_most_common_e = normalized_most_common_type(e, sp_doc_test.ents)
    print(e.text, "refers to", normalized_most_common_e[0], "and is a", normalized_most_common_e[1])

Coronavirus Brief refers to Coronavirus Brief and is a ORG
CBC refers to CBC News Network and is a ORG
Canada refers to Transport Canada and is a ORG
C.D. Howe refers to the C.D. Howe Institute's Business Cycle Council and is a ORG
Ontario refers to Ontario and is a GPE
Monday refers to Monday and is a DATE
Alberta refers to Alberta and is a GPE
first refers to first and is a ORDINAL
Saturday refers to Saturday and is a DATE
Air Canada refers to Air Canada and is a ORG
Christmas refers to Christmas and is a DATE
Canadians refers to Canadians and is a NORP
more than $1.2 million refers to more than $1.2 million and is a MONEY
England refers to England and is a GPE
Peter Cziborra/Reuters refers to Peter Cziborra/Reuters and is a PERSON
months refers to less than two months old and is a DATE
CBC refers to CBC News Network and is a ORG
Andre Mayer refers to Andre Mayer and is a PERSON
Canada refers to Transport Canada and is a ORG
19th-century refers to 19th-century and is a DATE
2013 refe

**(TO DO) Q5 (d) - 2 marks**     
d) Do any of the results found from performing NER disambiguation with text coherence Q5(c) seem problematic? Give an example of a problem that is occurring with our approaches and explain why this issue occurs.


There are several issues with this technique. First, many of the normalized forms are incorrect. For example, "Canada" refers to "Transport Canada", which does not make sense but is the correct output given our simple criterion for a normalized form (i.e. that one entity is a substring of another). There are other examples of this, such as "20" referring to "2013". Second, because these normalized forms are incorrectly assigned, their associated NER types are also incorrect: "Canada" is now an ORG (should be GPE), and "20" is now a DATE (should be CARDINAL). 
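
To make the substring problem concrete, the sketch below compares the character-level criterion with a stricter word-level check. Both helper names are hypothetical, and splitting on whitespace is a simplifying assumption (spaCy's own tokenization would behave differently on punctuation).

```python
def is_substring(short, long):
    """The criterion described above: plain character-level containment."""
    return short != long and short in long

def is_word_subsequence(short, long):
    """Stricter check: the words of `short` must appear as consecutive whole words in `long`."""
    short_words, long_words = short.split(), long.split()
    if short == long or not short_words or len(short_words) > len(long_words):
        return False
    span = len(short_words)
    return any(long_words[i:i + span] == short_words
               for i in range(len(long_words) - span + 1))

pairs = [("20", "2013"), ("Kenney", "Jason Kenney"), ("Canada", "Transport Canada")]
for short, long in pairs:
    print(short, "->", long, is_substring(short, long), is_word_subsequence(short, long))
```

The word-level check rejects "20" as a short form of "2013" while still accepting "Kenney" for "Jason Kenney". It does not reject "Canada" for "Transport Canada", so that case would need an additional signal, for instance comparing the NER types of the two entities.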


**PART 2 - Entity Linking / Text enhancement**  

For the second part of this notebook, we will be exploring how we can enhance the text of documents. In this scenario, we will be enhancing the text by performing entity linking. This means that we will attempt several methods of linking the entities that are detected by spaCy's NER to an active webpage that a reader can click on to obtain more information regarding the entity. Many websites, such as Wikipedia, perform Entity Linking to allow for more context to be obtained when reading a document.     

Before going straight into an example through code, below is an example of how a text with no entity linking compares to a text with entity linking:    

No entity linking:    
During the pandemic, U.S. cities such as Atlanta, Chicago and Denver have made several adjustments to their transit systems.      

With entity linking:    
During the pandemic, U.S. cities such as <a href="http://en.wikipedia.org/wiki/Atlanta">Atlanta</a>, <a href="http://en.wikipedia.org/wiki/Chicago">Chicago</a> and <a href="http://en.wikipedia.org/wiki/Denver">Denver</a> have made several adjustments to their transit systems.

Since you will be designing several methods of performing simple entity linking, below is an example that showcases how we can manually perform entity linking without any resources. This will showcase how it can be performed so that you will be able to use and create resources to create simple entity linking algorithms yourself.

In [126]:
sentence_example = "During the pandemic, U.S. cities such as Atlanta, Chicago and Denver have made several adjustments to their transit systems"
# Parse the example sentence
text_sp = sp(sentence_example)
# This will store the enhanced version of the text
enhanced_text = sentence_example
# Loop through the entities that spaCy has found and replace them as needed to be in expanded form 
for token in text_sp.ents:
    if token.text == "Atlanta":
        enhanced_text = enhanced_text.replace(token.text, "<a href=\"http://en.wikipedia.org/wiki/Atlanta\">Atlanta</a>")
    elif token.text == "Chicago":
        enhanced_text = enhanced_text.replace(token.text, "<a href=\"http://en.wikipedia.org/wiki/Chicago\">Chicago</a>")
    elif token.text == "Denver":
        enhanced_text = enhanced_text.replace(token.text, "<a href=\"http://en.wikipedia.org/wiki/Denver\">Denver</a>")
    
# Write the result as an HTML file (open to view the enhanced text!)
with open("enhanced_example.html", "w", encoding="utf-8") as f:
    # The with-block closes the file automatically, so no explicit close() is needed
    f.write(enhanced_text)

By opening the *enhanced_example.html* file that is now in the same directory as this notebook, you will be able to see
that we linked the entities in the text.  

That said, the process above is quite poor. It required manually stating which entities to work with and which URL to link each one to. Thus, for the rest of this section you will be answering questions where you use and/or create manually assembled resources to link entities in more general and robust ways. There are many different string matching techniques that can be used to help with entity linking, but we will stick with basic approaches for this notebook.   

In the next question you will begin working with external resources. Below we load the *US_Cities.csv* file, which will be used to enhance the text with US cities in the following question. Note that each resource file contains two columns: *Text* and *URL*. *Text* is an entity name and *URL* is a corresponding URL that provides more information about the *Text*. The example below shows how these files should be loaded and how they can be accessed.

In [127]:
# Start with the string match approach (exact match)
# File content extracted from https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population
df_cities = pd.read_csv("US_Cities.csv")
# Print the Text and URL from each row, showcasing how to loop through the contents 
for i, row in df_cities.iterrows():
    print(row["Text"] + " - " + row["URL"])

New York City - https://en.wikipedia.org/wiki/New_York_City
Los Angelas - https://en.wikipedia.org/wiki/Los_Angeles
Chicago - https://en.wikipedia.org/wiki/Chicago
Houston - https://en.wikipedia.org/wiki/Houston
Phoenix - https://en.wikipedia.org/wiki/Phoenix,_Arizona
Philadelphia - https://en.wikipedia.org/wiki/Philadelphia
San Antonio - https://en.wikipedia.org/wiki/San_Antonio
San Diego - https://en.wikipedia.org/wiki/San_Diego
Dallas - https://en.wikipedia.org/wiki/Dallas
San Jose - https://en.wikipedia.org/wiki/San_Jose,_California
Austin - https://en.wikipedia.org/wiki/Austin,_Texas
Jacksonville - https://en.wikipedia.org/wiki/Jacksonville,_Florida
Fort Worth - https://en.wikipedia.org/wiki/Fort_Worth,_Texas
Columbus - https://en.wikipedia.org/wiki/Columbus,_Ohio
Charlotte - https://en.wikipedia.org/wiki/Charlotte,_North_Carolina
San Francisco - https://en.wikipedia.org/wiki/San_Francisco
Indianapolis - https://en.wikipedia.org/wiki/Indianapolis
Seattle - https://en.wikipedia.org
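
Exact matching is sensitive to spelling: the listing above contains the entry "Los Angelas", which an entity with the text "Los Angeles" would never match exactly. As one possible illustration of a more tolerant string matching technique, the sketch below falls back to the closest spelling using the standard library's `difflib`. The `find_url` helper, the `cutoff` value, and the small `resource` dictionary are assumptions made for this example and are not part of the notebook's setup.

```python
from difflib import get_close_matches

# Hypothetical excerpt of a resource as a Text -> URL mapping
# (spelling kept as it appears in the listing above)
resource = {
    "New York City": "https://en.wikipedia.org/wiki/New_York_City",
    "Los Angelas": "https://en.wikipedia.org/wiki/Los_Angeles",
    "Chicago": "https://en.wikipedia.org/wiki/Chicago",
}

def find_url(entity_text, resource, cutoff=0.85):
    """Return the URL for entity_text: exact match first, then closest spelling."""
    if entity_text in resource:
        return resource[entity_text]
    # Fall back to the single most similar name, if it is similar enough
    close = get_close_matches(entity_text, list(resource), n=1, cutoff=cutoff)
    return resource[close[0]] if close else None

print(find_url("Los Angeles", resource))
print(find_url("Toronto", resource))
```

A high cutoff keeps the fallback conservative. A lower value would link more entities, at the risk of linking some of them to the wrong page.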

**(TO DO) Q6 - 3 marks**  
Complete the *enhance_text_with_resource* function below. It receives the text from the document via *document_text*, the dataframe from the external resource to enhance the text with as *resource_df*, and the name of the file that you will output the results into (a .html file) as *filename*.   

This function parses the text of the document and replaces any *entities* (.ents) found within the text with:    
<a href=\"Some URL">Entity text</a\>     

After enhancing the text with entity linking, the function writes the enhanced text to an HTML file and returns it.
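
As a small illustration of the replacement string described above, the sketch below builds the anchor tag for one entity. The `make_link` helper is a hypothetical name introduced for this example. Escaping the values with `html.escape` is an added precaution that the question does not require.

```python
from html import escape

def make_link(entity_text, url):
    """Wrap entity_text in an <a> tag pointing at url, escaping both values."""
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(entity_text))

print(make_link("Chicago", "https://en.wikipedia.org/wiki/Chicago"))
```

Escaping matters only when the entity text or URL contains characters such as `&`, `<` or `"`. For plain names and Wikipedia URLs the output is the same as simple string concatenation.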

In [210]:
def enhance_text_with_resource(document_text, resource_df, filename):
    '''
    With a resource and document's text, enhance any entity found in the resource by linking the entity to
    the appropriate webpage.
    Write the file to the appropriate filename and return the enhanced text
    '''
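    enhanced_text = document_text
    # TODO: Parse the document with spaCy
    document_text_piped = sp(document_text)
    
    # Accept either an already loaded dataframe or the path to a resource .csv file
    resource = resource_df if isinstance(resource_df, pd.DataFrame) else pd.read_csv(resource_df)
    
    # TODO: Go through the entities and edit the document's text accordingly
    # Note: Be sure to not duplicate your enhancements
    replaced = set()  # surface forms that have already been linked
    for entity in document_text_piped.ents:
        for i, row in resource.iterrows():
            # Exact string match between the entity text and the resource entry
            if entity.text == row["Text"] and entity.text not in replaced:
                url_string = "<a href=\"" + row["URL"] + "\">" + entity.text + "</a>"
                # Replaces every occurrence of this surface form in the text
                enhanced_text = enhanced_text.replace(entity.text, url_string)
                replaced.add(entity.text)
    # The with-block closes the file automatically
    with open(filename, "w", encoding="utf-8") as f:
        f.write(enhanced_text)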
    return enhanced_text

**(TO DO) Q7 - 3 marks**  
With the text enhancement algorithm designed (*enhance_text_with_resource*), you will now test it on the documents already loaded in the code cell below, using the following three resources:    
1) A file containing several US cities: *US_Cities.csv*      
2) A file containing all Provinces in Canada: *Canada_Provinces.csv*        
3) A file containing several Canadian Universities: *Canada_Universities.csv*        

In [211]:
# Enhance the text for the document below with the US cities
doc = df["text"][6]
# TODO ...
enhance_text_with_resource(doc, "US_Cities.csv", "Enhanced_Text_USCities.html")
# Enhance the text for the document below with the Canadian provinces
# File extracted from https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
doc = df["text"][53]
# TODO ...
enhance_text_with_resource(doc, "Canada_Provinces.csv", "Enhanced_Text_Provinces.html")
# Enhance the text for the document below with the Canadian universities
# File extracted from https://en.wikipedia.org/wiki/List_of_universities_in_Canada
doc = df["text"][53]
# TODO ...
enhance_text_with_resource(doc, "Canada_Universities.csv", "Enhanced_Text_Universities.html")



Now, if you open the saved HTML files, you should find that the entities that appear in both the text and the resource link directly to relevant information about that entity. We could also enhance a document with several resources to link as many entities as possible.
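
One possible way to enhance a document with several resources is to merge them into a single lookup before linking. The sketch below assumes each resource has the *Text* and *URL* columns described earlier. The helper names `load_resource` and `merge_resources`, and the two in-memory CSV snippets, are hypothetical stand-ins rather than the notebook's actual files.

```python
import csv
import io

def load_resource(csv_file):
    """Read a Text,URL resource into a dict mapping entity text to URL."""
    return {row["Text"]: row["URL"] for row in csv.DictReader(csv_file)}

def merge_resources(*resources):
    """Combine several Text -> URL dicts; earlier resources win on conflicts."""
    merged = {}
    for resource in resources:
        for text, url in resource.items():
            merged.setdefault(text, url)
    return merged

# Hypothetical in-memory stand-ins for two resource files
cities = load_resource(io.StringIO(
    "Text,URL\nChicago,https://en.wikipedia.org/wiki/Chicago\n"))
provinces = load_resource(io.StringIO(
    "Text,URL\nOntario,https://en.wikipedia.org/wiki/Ontario\n"))

combined = merge_resources(cities, provinces)
print(sorted(combined))
```

Letting earlier resources take precedence is a design choice for this sketch. When the same name appears in two resources, NER type validation is another way to decide which link applies.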

**(TO DO) Q8 - 2 marks**  
Look through the enhanced texts that are generated by your tests in Q7. Do you see any universities that are not linked to when using the university resource? Why? Use the code cell below to output anything that you may need to investigate (if you have already noticed why this occurs earlier in the notebook, you do not need to investigate to find out why in the code cell) and answer the question in the markdown below that code cell.      

Note: To find out, you should look through the .csv files and the text itself (both the initial text and spaCy's entity detection).

In [213]:
# Look through any outputs that may seem off to help understand why (if not already known)

doc = df["text"][53]

doc_piped = sp(doc)

# doc_piped.ents yields entity spans (not single tokens)
for ent in doc_piped.ents:
    if "University" in ent.text:
        print(ent.text, ent.label_)

the University of Calgary ORG
the University of Saskatchewan  ORG
Johns Hopkins University ORG


TODO ...    

None of the universities in the document were linked. For the entities that spaCy detected with a leading "the" (for example, *the University of Calgary*), the entity text does not exactly match the name in the resource. In addition, some universities, such as the University of Saskatchewan, are not included in the resource at all. 
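
A minimal sketch of one way to reduce this mismatch is to normalize each surface form before comparing it with the resource. The `normalize` helper is hypothetical. It only trims whitespace and drops a leading "the", which is an assumption about what is safe to remove in this dataset.

```python
import re

# Matches a leading "the" followed by whitespace, in any letter case
_LEADING_THE = re.compile(r"^the\s+", flags=re.IGNORECASE)

def normalize(surface_form):
    """Collapse whitespace and drop a leading 'the' from an entity's text."""
    text = " ".join(surface_form.split())
    return _LEADING_THE.sub("", text)

for name in ["the University of Calgary",
             "the University of Saskatchewan ",
             "Johns Hopkins University"]:
    print(normalize(name))
```

Normalization only helps with names that are present in the resource. A university that is missing from the file still cannot be linked.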

**(TO DO) Q9 - 2 marks**  
We will now combine some of the work done in Part 1 of this notebook with the work done in this part of the notebook. Specifically, we will perform NER type validation to ensure that, when we enhance text with a resource, only entities of the correct NER type are enhanced. For example, when we use the resource of cities or provinces, we ensure that the entity we are looking at is classified as a GPE before enhancing it. The same concept applies to universities, which should be classified as an ORG.     

Copy over your definition of the *enhance_text_with_resource* function, extend it to also accept an NER type as input (ex: *PERSON*, *ORG*, ...) and ensure that the text enhancement only occurs if *at least one entity with the same surface form* from the document contains the same NER type that was provided to the input parameter. This new function is named *enhance_text_with_resource_and_type*.    
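
The "at least one entity with the same surface form" rule can be illustrated without spaCy. The sketch below uses hypothetical `(text, label)` pairs as stand-ins for the entities of a parsed document. The `linkable_surface_forms` helper and the sample data are assumptions made for this example.

```python
def linkable_surface_forms(entities, resource, ner_type):
    """Return resource entries for which at least one entity has ner_type."""
    return {text for text, label in entities
            if label == ner_type and text in resource}

# Hypothetical (text, label) pairs standing in for a document's entities
entities = [("Nova Scotia", "GPE"), ("Nova Scotia", "ORG"), ("Ontario", "ORG")]
resource = {
    "Nova Scotia": "https://en.wikipedia.org/wiki/Nova_Scotia",
    "Ontario": "https://en.wikipedia.org/wiki/Ontario",
}

print(linkable_surface_forms(entities, resource, "GPE"))
```

Here *Nova Scotia* qualifies because one of its two mentions is a GPE, so all of its occurrences would be linked. *Ontario* is skipped because its only mention has a different type.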

*NOTE (can ignore - just for more information):* Ideally, only the specific tokens of an entity with the specified type would be linked to the resource. However, this process can be tricky, since the logic involves tracking which entities in the text have already been handled (ex: If *Nova Scotia* appears twice in the text, each instance with its own NER type, then we need to know the set of tokens that we are editing for each of the entities). Thus, you only need to ensure that if at least one entity of the same surface form has the NER type and is in the text, then all instances of that entity are updated. If a resource contains entities from the text, but they are all of a different type, ignore them. You are free to implement the more robust method detailed above if you like, but the simpler approach that the question asks for is recommended.
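
For readers curious about the more robust method, one possible sketch is to replace by character offsets instead of by string. With spaCy, the offsets would come from each entity's `start_char` and `end_char` attributes. Here they are computed by hand on a made-up sentence, and the `link_spans` helper is a hypothetical name.

```python
def link_spans(text, spans):
    """Replace each (start, end, url) span with an anchor tag.

    Spans are processed right to left so earlier offsets remain valid.
    """
    for start, end, url in sorted(spans, reverse=True):
        anchor = '<a href="' + url + '">' + text[start:end] + "</a>"
        text = text[:start] + anchor + text[end:]
    return text

# Made-up sentence: only the first mention is assumed to be tagged as a GPE
text = "Nova Scotia reported cases. Nova Scotia Power did not comment."
start = text.find("Nova Scotia")
spans = [(start, start + len("Nova Scotia"),
          "https://en.wikipedia.org/wiki/Nova_Scotia")]

print(link_spans(text, spans))
```

Because each span is edited individually, the second *Nova Scotia* stays untouched. A plain string replacement would have linked both mentions.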

In [214]:
def enhance_text_with_resource_and_type(document_text, resource_df, filename, NER_type):
    '''
    With a NER type, a resource and document's text, enhance any entity found in the resource by linking the entity to
    the appropriate webpage if at least one surface form contains the specified NER type.
    Write the file to the appropriate filename and return the enhanced text
    '''
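    enhanced_text = document_text
    # TODO: Parse the document with spaCy
    document_text_piped = sp(document_text)
    
    # Accept either an already loaded dataframe or the path to a resource .csv file
    resource = resource_df if isinstance(resource_df, pd.DataFrame) else pd.read_csv(resource_df)
    
    # TODO: Go through the entities and edit the document's text accordingly
    # Note: Be sure to not duplicate your enhancements
    replaced = set()  # surface forms that have already been linked
    for entity in document_text_piped.ents:
        for i, row in resource.iterrows():
            # Link only when this entity matches the resource entry and has the requested NER type
            if entity.text == row["Text"] and entity.label_ == NER_type and entity.text not in replaced:
                url_string = "<a href=\"" + row["URL"] + "\">" + entity.text + "</a>"
                # Replaces every occurrence of this surface form in the text
                enhanced_text = enhanced_text.replace(entity.text, url_string)
                replaced.add(entity.text)
    # The with-block closes the file automatically
    with open(filename, "w", encoding="utf-8") as f:
        f.write(enhanced_text)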
    return enhanced_text

**(TO DO) Q10 - 2 marks**  
Redo the tests performed in Q7 with the newly defined *enhance_text_with_resource_and_type* function. Ensure that you use the appropriate NER type depending on the resource being used for entity linking.

In [215]:
# Enhance the text for the document below with the US cities
doc = df["text"][6]
enhance_text_with_resource_and_type(doc, "US_Cities.csv", "Enhanced_Text_USCities_with_type.html", "GPE")

# Enhance the text for the document below with the Canadian provinces
# File extracted from https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada
doc = df["text"][53]
enhance_text_with_resource_and_type(doc, "Canada_Provinces.csv", "Enhanced_Text_Provinces_with_type.html", "GPE")

# Enhance the text for the document below with the Canadian universities
# File extracted from https://en.wikipedia.org/wiki/List_of_universities_in_Canada
doc = df["text"][53]
enhance_text_with_resource_and_type(doc, "Canada_Universities.csv", "Enhanced_Text_Universities_with_type.html", "ORG")



***SIGNATURE:***
My name is --------------------------.
My student number is -----------------.
I certify being the author of this assignment.

In [216]:
print("My name is Charie Kaylan Brady. My student number is 300043672. I certify being the author of this assignment.")

My name is Charie Kaylan Brady. My student number is 300043672. I certify being the author of this assignment.
