# LAB3 - Entity Linking/Named Entity Disambiguation

In this notebook we will learn about the task of Named entity linking/disambiguation. We will cover the following aspects:
1. Task definition
2. Opportunities and challenges
3. The three components of every Entity linking system
4. Evaluating entity linking
5. Code example for evaluating entity linking
6. Performing disambiguation with existing tools
7. Entity disambiguation from scratch: recognition and disambiguation together

### 1. Task definition

**Entity tasks so far** So far, we have seen two tasks that relate to the entities mentioned in text: 
1. recognizing/spotting entity mentions in text in the task of *Named Entity Recognition*
2. classifying these entity mentions to their *type* (for example, Person or City) - this is done in the task of *Named Entity Classification/Typing*

**NED** Here, we will introduce Named Entity Disambiguation - NED, also called (Named) Entity Linking - (N)EL. NED is a central task in information extraction. The goal is to take the entity mentions that were found in text with the task of NER and "disambiguate" them. In this sense, the task of Entity Disambiguation builds on top of the output of the NER task. Sometimes the tasks are combined together in a task called Named Entity Recognition and Disambiguation (NERD).

**Disambiguation** Ok, so WHY do we need to disambiguate entity mentions found in text and HOW do we do that?

*WHY*: Let's take an entity mention we find in text, like "JFK". This phrase can mean different things. It can refer to John F. Kennedy (the former American president), or to the airport in NYC with the same name, or to "Justice For Khojaly", etc. With disambiguation, we want to say precisely which of these world entities is the correct one in a specific textual document.

*HOW*: To disambiguate, we need a way to map this ambiguous mention found in text to a unique, already existing "representation" with a clear meaning. The Wikipedia pages of these world entities are one such representation, because each Wikipedia page has a URL ("https://en.wikipedia.org/wiki/...") and each URL describes exactly one entity. For example, https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport describes the JFK airport and not the president.

Resources like Wikipedia are called *Knowledge Bases (KBs)*, because they contain knowledge about entities in the world. There are two types of knowledge bases: structured and unstructured. Named Entity Disambiguation in practice is sometimes performed with respect to unstructured and sometimes with respect to structured KBs.

Wikipedia is an example of an unstructured knowledge base, because most of its content is in unstructured (running text) form, such as: "John F. Kennedy international airport is a public airport owned by the city of New York ..." 

Examples of structured knowledge bases are DBpedia and Wikidata. In a structured knowledge base, we would not have a textual description, but rather a structured list of facts, such as:

John F. Kennedy International airport
```
* airport type: public
* owner: city of New York
* ...
```

Here is the representation of the JFK airport in these structured knowledge bases:
* http://dbpedia.org/resource/John_F._Kennedy_International_Airport
* https://www.wikidata.org/wiki/Q8685


Note that most entities have information in all three of these knowledge bases (Wikipedia, DBpedia, and Wikidata). For example, we can find information about the John F. Kennedy airport in Wikipedia (https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport) or in DBpedia (http://dbpedia.org/resource/John_F._Kennedy_International_Airport). In fact, the last part of the URL is always the same in Wikipedia and DBpedia ("John_F.\_Kennedy\_International_Airport" in this case), which makes it convenient to combine information from both places.
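
Because the last part of the URL is shared, moving from a Wikipedia page to the corresponding DBpedia resource comes down to swapping the prefix. Here is a minimal sketch of that idea; the function name `wikipedia_to_dbpedia` and the two prefix constants are our own choices for illustration, and the sketch only handles English Wikipedia URLs.

```python
WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"

def wikipedia_to_dbpedia(wikipedia_url):
    """Map an English Wikipedia page URL to the DBpedia resource URI with the same last part."""
    if not wikipedia_url.startswith(WIKIPEDIA_PREFIX):
        raise ValueError("not an English Wikipedia page URL: %s" % wikipedia_url)
    # Keep the entity name (the last part of the URL) and swap the prefix.
    entity_name = wikipedia_url[len(WIKIPEDIA_PREFIX):]
    return DBPEDIA_PREFIX + entity_name

print(wikipedia_to_dbpedia("https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport"))
```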

To summarize, we perform disambiguation of entity mentions in text by connecting them to existing entities in a knowledge base, like Wikipedia. This kind of disambiguation "links" the textual mention to an existing representation - for this reason the task is also called Entity Linking.

**Example** For example, let's consider the following sentence:

"_JetBlue_ begins direct service between _Barnstable Airport_ and _JFK_."

The entity mentions we find here are: "JetBlue", "Barnstable Airport" and "JFK". Let's say that we perform linking to DBpedia. Then, “JetBlue” should be linked to the entity http://dbpedia.org/resource/JetBlue, and “JFK” to http://dbpedia.org/resource/John_F._Kennedy_International_Airport. 

However, there is no entry in DBpedia for the Barnstable Municipal Airport, which is the meaning of the mention “Barnstable Airport”. We cannot link this entity then. The entities for which there is no representation in a chosen knowledge base are called *NIL entities*. When a system processes the text, it should simply say that the meaning of “Barnstable Airport” is _NIL_.
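
To make the expected output concrete, here is a small sketch of what linking this sentence could look like. `TOY_KB` is a hand-made stand-in for DBpedia that covers only this example, and `link_mention` is a hypothetical helper; a plain dictionary lookup like this ignores ambiguity completely, so it only illustrates the output format and the NIL case.

```python
# Toy knowledge base: a hand-made stand-in for DBpedia that covers only this example.
TOY_KB = {
    "JetBlue": "http://dbpedia.org/resource/JetBlue",
    "JFK": "http://dbpedia.org/resource/John_F._Kennedy_International_Airport",
}

def link_mention(mention, kb):
    """Return the KB entry for a mention, or 'NIL' when the KB has no entry for it."""
    return kb.get(mention, "NIL")

mentions = ["JetBlue", "Barnstable Airport", "JFK"]
links = {mention: link_mention(mention, TOY_KB) for mention in mentions}

for mention, link in links.items():
    print("%s -> %s" % (mention, link))
```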

### 2. Opportunities and challenges

**Connecting text and knowledge bases** This is the first time we encounter such a connection between the information in text and the knowledge bases in the external world in this course. Note that these knowledge bases were not created to improve the text processing. Instead, they exist independently in order to provide knowledge about the world - for example, Wikipedia, DBpedia, and Wikidata give us encyclopedic knowledge. 

**Opportunities** By creating a link between a phrase in text and a unique entry in a knowledge base, we directly get access to much more knowledge that we can use to enhance the information in text. If we know that 'JFK' refers to the airport, we allow our tools to have access to all facts about this airport, such as its location and founding year. In addition, if we want, we can now extract facts from text and store them in these knowledge bases, but this is another task for later :)

**Challenges** So, why is entity linking not an easy task? This relates to two aspects: ambiguity and variance. 

*Ambiguity* is the number of meanings that a certain entity mention can have. For example, imagine how many people in the world are called "John Smith". DBpedia contains entries for a few hundred of them, see http://dbpedia.org/page/John_Smith. How can we teach a computer to decide which of these is the one mentioned in text? And what if the John Smith mentioned in text is a NIL entity and is not stored in DBpedia?

There are also many cases where it is quite easy to link an entity to a knowledge base. Often the mentions in text have low ambiguity (for example, "Barack Obama"). Or they have multiple meanings, but one of them is used almost always: for example, there are multiple cities called "Paris", but the French capital is the one most often mentioned in text.

*Variance* is the number of different mentions that refer to the same entity. For example, http://dbpedia.org/resource/John_F._Kennedy_International_Airport can be called "JFK", or "John F. Kennedy Airport", or "The NYC airport" in text.
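
Both notions can be counted once we have a table of which mention refers to which entity. The sketch below does this for a handful of hand-picked (mention, entity) pairs taken from the examples in this notebook; the `pairs` list is illustrative only, and a real table would be collected from a large annotated corpus.

```python
from collections import defaultdict

# Illustrative (mention, entity) pairs based on the examples in this notebook.
pairs = [
    ("JFK", "John_F._Kennedy"),
    ("JFK", "John_F._Kennedy_International_Airport"),
    ("JFK", "JFK_(film)"),
    ("John F. Kennedy Airport", "John_F._Kennedy_International_Airport"),
    ("The NYC airport", "John_F._Kennedy_International_Airport"),
    ("Barack Obama", "Barack_Obama"),
]

entities_per_mention = defaultdict(set)
mentions_per_entity = defaultdict(set)
for mention, entity in pairs:
    entities_per_mention[mention].add(entity)
    mentions_per_entity[entity].add(mention)

# Ambiguity: how many entities a mention can refer to.
ambiguity = {mention: len(entities) for mention, entities in entities_per_mention.items()}
# Variance: how many different mentions refer to an entity.
variance = {entity: len(mentions) for entity, mentions in mentions_per_entity.items()}

print("Ambiguity of 'JFK':", ambiguity["JFK"])
print("Variance of the airport:", variance["John_F._Kennedy_International_Airport"])
```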

### 3. Named Entity Disambiguation in practice: 3 phases

In practice, most NED systems consist of three phases:

1. **Entity recognition/spotting** - this is done as described in the NER(C) task. In the example sentence "_JetBlue_ begins direct service between _Barnstable Airport_ and _JFK_.", the recognition phase will detect the entity mentions: "JetBlue", "Barnstable Airport", and "JFK".
2. **Candidate generation** - here we take each of the recognized mentions and look up in the knowledge base for potential meanings. For example, the phrase "JFK" could have these candidates:
    * http://dbpedia.org/resource/John_F._Kennedy
    * http://dbpedia.org/resource/John_F._Kennedy_International_Airport
    * http://dbpedia.org/resource/JFK_(film)
    * http://dbpedia.org/resource/JFK_University
    * http://dbpedia.org/resource/Justice_for_Khojaly
    
   and so on. Similar lists will be generated for the other mentions found in text: "JetBlue" and "Barnstable Airport". The candidate generation phase is not trivial because of the ambiguity and variation described above. Also, new entities are appearing all the time in news articles, so the number of options grows over time.
   
To understand the complexity of this step, think about the "candidate" classes in sentiment analysis. There, for each case we had to perform classification into one of the same three categories (positive, neutral, negative), while in entity linking the number of classes is different for each mention and can sometimes be very large.

3. **Disambiguation** - the goal of this final phase is to take the list of potential meanings generated in the candidate generation phase for each of the mentions and make a decision on which instance is the correct one. This decision can either be: choosing one of the possible candidates, or deciding that no candidate is the correct one (NIL entity).

As the list of candidates is different for each mention, it is not easy to perform this disambiguation with supervised learning approaches as in other tasks (e.g., NER). In practice, most systems use different methods; we will briefly describe two such methods in part 6 below.
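
The last two phases can be sketched in a few lines. The example below assumes that recognition has already been done, uses a hypothetical candidate table `CANDIDATES` with invented frequency counts, and disambiguates with a simple "most frequent meaning" baseline. This is only one possible strategy, chosen here because it is easy to follow; it is not how the tools discussed later work.

```python
# Hypothetical candidate table: for each mention, its candidate entities and how often
# the mention refers to each of them. The counts are invented for illustration.
CANDIDATES = {
    "JFK": {
        "John_F._Kennedy": 50,
        "John_F._Kennedy_International_Airport": 30,
        "JFK_(film)": 15,
    },
    "JetBlue": {"JetBlue": 40},
}

def generate_candidates(mention):
    """Phase 2: look up the potential meanings of a recognized mention."""
    return CANDIDATES.get(mention, {})

def disambiguate(mention):
    """Phase 3: pick the most frequent candidate, or 'NIL' when there are no candidates."""
    candidates = generate_candidates(mention)
    if not candidates:
        return "NIL"
    return max(candidates, key=candidates.get)

# Phase 1 (recognition) is assumed to have produced these mentions.
recognized_mentions = ["JetBlue", "Barnstable Airport", "JFK"]
decisions = [disambiguate(mention) for mention in recognized_mentions]
print(decisions)
```

With these invented counts, the baseline links "JFK" to the president, which is wrong for our example sentence. This illustrates why the disambiguation phase usually needs to look at the context of a mention as well.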

### 4. Evaluating entity linking

**Metric** The correctness of an entity linking system is measured in terms of precision, recall, and F1-score. For each of the mentions in text, we compare the system decision against the gold data:
* If the system chose entity X and the gold entity is also X, then we count a *true positive (TP)*
* If the system chose entity X, but the gold entity is Y, then we count a *false positive (FP)* and a *false negative (FN)*
* If the system opted for a NIL entity and the gold entity is X, then we count a *false negative (FN)*
* If the system opted for an entity X but the gold entity is NIL, then we count a *false positive (FP)*

Afterwards, we use these numbers for TP, FP, and FN, to compute precision, recall, and F1-score:

* `precision=TP/(TP+FP)` -> From the decisions made by the system, how many were true
* `recall=TP/(TP+FN)` -> From the gold entities, how many were found correctly by the system
* `f1=2*precision*recall/(precision+recall)` -> compute a harmonic mean between precision and recall, called F1-score

Note that precision, recall, and F1-score are all equal when neither the system output nor the gold data contains NIL entities: every wrong decision then counts as both a false positive and a false negative, so FP and FN are the same.

**Example** For the example sentence above, let's say that a system made the following decisions:
* "JetBlue" means http://dbpedia.org/resource/JetBlue (true positive)
* "Barnstable Airport" means http://dbpedia.org/resource/Barnstable,_Massachusetts (false positive)
* "JFK" means http://dbpedia.org/resource/John_F._Kennedy (false positive and false negative)

Then, we have in total: `TP=1, FP=2, FN=1`. 

The resulting precision is `1/3=0.33` and the resulting recall is `1/2=0.5`. 

The F1-score of this system on this sentence would be `0.40`. 
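
To see where the `0.40` comes from, the arithmetic can be checked with exact fractions. This is just a small sanity check of the numbers above, using Python's standard `fractions` module.

```python
from fractions import Fraction

tp, fp, fn = 1, 2, 1

precision = Fraction(tp, tp + fp)   # 1/3
recall = Fraction(tp, tp + fn)      # 1/2
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print("%.2f" % float(f1))
```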

### 5. An example evaluation in code

Now we provide code for this scenario. Note that for simplicity we assume that the entity recognition by the system is perfect. Also, we use a simple format for the gold and the system output, namely two paired lists; in practice this requires some more preprocessing. 

In [3]:
def evaluate_entity_linking(system_decisions, gold_decisions):
    """
    Compute precision, recall, and F1-score by comparing two paired lists of: system decisions and gold data decisions.
    """
    tp=0
    fp=0
    fn=0

    # The two lists are paired: position i in both lists refers to the same mention.
    for gold_entity, system_entity in zip(gold_decisions, system_decisions):
        # A NIL entity that the system also left unlinked is not counted at all.
        if gold_entity=='NIL' and system_entity=='NIL': continue
        if gold_entity==system_entity:
            tp+=1
        else:
            if gold_entity!='NIL':
                fn+=1
            if system_entity!='NIL':
                fp+=1

    print('TP: %d; \nFP: %d, \nFN: %d' % (tp, fp, fn))            

    # Guard against division by zero, e.g. when the system links nothing.
    precision=tp/(tp+fp) if tp+fp else 0.0
    recall=tp/(tp+fn) if tp+fn else 0.0
    f1=2*precision*recall/(precision+recall) if precision+recall else 0.0
    
    return precision, recall, f1

In [2]:
text="JetBlue begins direct service between Barnstable Airport and JFK."

gold_decisions=['http://dbpedia.org/resource/JetBlue', 
                'NIL',
                'http://dbpedia.org/resource/John_F._Kennedy_International_Airport']
system_decisions=['http://dbpedia.org/resource/JetBlue', 
                  'http://dbpedia.org/resource/Barnstable,_Massachusetts',
                 'http://dbpedia.org/resource/John_F._Kennedy']

num_entities=len(gold_decisions)

precision, recall, f1 = evaluate_entity_linking(system_decisions, gold_decisions)

print('Precision: %.2f, \nrecall: %.2f, \nf1-score: %.2f' % (precision, recall, f1))

TP: 1; 
FP: 2, 
FN: 1
Precision: 0.33, 
recall: 0.50, 
f1-score: 0.40


### 6. Disambiguating with existing tools

Here we will load a small dataset called Reuters-128 which contains 128 news documents with annotated entity mentions with their links in DBpedia.

Then we will parse this dataset with two modern tools, called AGDISTIS and DBpedia Spotlight.

#### 6.1 Setup your environment

**6.1.1 Install the needed modules**

For the purpose of this module, we will need some new libraries that are probably not installed on your computer:
* [rdflib](https://pypi.org/project/rdflib/)
* [agdistispy](https://pypi.org/project/agdistispy/)
* [pyspotlight](https://pypi.org/project/pyspotlight/)
* [lxml](https://pypi.org/project/lxml/)

You should install these libraries using `conda` or `pip`.

Let's now check if all needed libraries are installed on your computer and can be imported. 

In [5]:
from rdflib import Graph, URIRef
import urllib
from tqdm import tqdm
import sys
import requests
import urllib.parse
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
import time

from agdistispy.agdistis import Agdistis
import spotlight

If you see any errors with the imports: 
* read the error carefully
* install the library that is missing
* try to execute the imports again

**6.1.2 Let's download the N3 Reuters-128 dataset**

In [None]:
%%bash
# The %%bash cell magic has to be the first line of the cell.
# This code will not work on Windows. 
# If you are running Windows, please download the dataset manually from 
# https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl .

if [ ! -f Reuters-128.ttl ]; then
    wget https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl
fi

If the above command executed successfully, then you should see the dataset file 'Reuters-128.ttl' in the same directory as this notebook. In case you experience any errors, you can download the dataset manually from https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl.
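
If `wget` is not available (for example on Windows), the same download can be done from Python. Below is a minimal sketch using only the standard library; `download_if_missing` is a helper name we introduce here, and the actual download call is left commented out because it needs network access.

```python
import os
import urllib.request

DATASET_URL = "https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl"

def download_if_missing(url, path):
    """Download url to path unless path already exists. Return True if a download happened."""
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True

# Example call (needs network access):
# download_if_missing(DATASET_URL, "Reuters-128.ttl")
```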

This dataset contains 128 Reuters documents annotated with entity mentions and links to DBpedia. It is in a .ttl format called NIF (don't worry about these formats for now, we provide a function to parse this dataset to python classes).

*Before proceeding, please verify that your setup of libraries (6.1.1) is correct and that the dataset is downloaded in the right location (6.1.2).*

#### 6.2 Load the data from N3

**Let's now parse this dataset into a list of news item objects that contain the text and the entity mentions for each news item.**

In [None]:
reuters_file='Reuters-128.ttl'

In [None]:
class NewsItem:
    """
    class containing information about a news item
    """
    def __init__(self, identifier, content="",
                 dct=None):
        self.identifier = identifier  # string, the original document name in the dataset
        self.dct = dct                # e.g. "2005-05-14T02:00:00.000+02:00" -> document creation time
        self.content = content        # the text of the news article
        self.entity_mentions = []  # list of instances of EntityMention class
        
class EntityMention:
    """
    class containing information about an entity mention
    """

    def __init__(self, mention,
                 begin_index, end_index,
                 gold_link=None,
                 the_type=None, sentence=None, agdistis_link=None,
                 spotlight_link=None): #, exact_match=False):
        self.sentence = sentence         # e.g. 4 -> which sentence is the entity mentioned in
        self.mention = mention           # e.g. "John Smith" -> the mention of an entity as found in text
        self.the_type = the_type         # e.g. "Person" | "http://dbpedia.org/ontology/Person"
        self.begin_index = begin_index   # e.g. 15 -> begin offset
        self.end_index = end_index       # e.g. 25 -> end offset
        self.gold_link = gold_link       # gold link if existing
        self.agdistis_link = agdistis_link    # AGDISTIS link
        self.spotlight_link = spotlight_link             # Spotlight link

In [None]:
def normalizeURL(s):
    """
    Normalize a URI by removing its Wikipedia/DBpedia prefix.
    """
    if s:
        if s.startswith('http://aksw.org/notInWiki'):
            return 'NIL'
        else:
            return urllib.parse.unquote(s.replace("http://en.wikipedia.org/wiki/", "").
                                        replace("http://dbpedia.org/resource/", ""). 
                                        replace("http://dbpedia.org/page/", "").
                                        strip().
                                        strip('"'))
    else:
        return 'NIL'

In [None]:
def load_article_from_nif_file(nif_file):
    """
    Load a dataset in NIF format.
    """
    g=Graph()
    g.parse(nif_file, format="n3")

    news_items=[]

    articles = g.query(
    """ SELECT ?articleid ?date ?string
    WHERE {
        ?articleid nif:isString ?string .
        OPTIONAL { ?articleid <http://purl.org/dc/elements/1.1/date> ?date . }
    }
    """)
    for article in articles:
        news_item_obj=NewsItem(
            content=article['string'],
            identifier=article['articleid'], 
            dct=article['date']
        )
        query=""" SELECT ?id ?mention ?start ?end ?gold
        WHERE {
            ?id nif:anchorOf ?mention ;
            nif:beginIndex ?start ;
            nif:endIndex ?end ;
            nif:referenceContext <%s> .
            OPTIONAL { ?id itsrdf:taIdentRef ?gold . }
        } ORDER BY ?start""" % str(article['articleid'])
        qres_entities = g.query(query)
        for entity in qres_entities:
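            # If the optional ?gold is not bound, it comes back as None. We pass None on,
            # because str(None) would give the string 'None' instead of 'NIL'.
            # normalizeURL also maps the 'notInWiki' URIs to 'NIL'.
            gold=entity['gold']
            gold_link=normalizeURL(str(gold) if gold is not None else None)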
            entity_obj = EntityMention(
                begin_index=int(entity['start']),
                end_index=int(entity['end']),
                mention=str(entity['mention']),
                gold_link=gold_link
            )
            news_item_obj.entity_mentions.append(entity_obj)
        news_items.append(news_item_obj)
    return news_items

In [None]:
articles=load_article_from_nif_file(reuters_file)

#### 6.3 AGDISTIS

AGDISTIS (also called Multilingual AGDISTIS, or MAG) is an entity linking system that puts all entity candidates in a graph network and then performs a probabilistic optimization to find the best connected candidate in this graph for each of the entity mentions. 

More description can be found in their paper: https://arxiv.org/pdf/1707.05288.pdf

You can play with AGDISTIS by using their demo page: http://agdistis.aksw.org/demo/.
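
AGDISTIS expects the entity mentions to be marked inside the text with `<entity>` tags, and the disambiguation function in this section does this marking before it calls the tool. The sketch below isolates that step; `mark_entities` is a helper name we introduce for illustration, and it assumes that the spans do not overlap.

```python
def mark_entities(text, spans):
    """Wrap each (begin, end) character span of text in <entity>...</entity> tags."""
    marked = text
    # Work from the last span to the first, so that inserting the tags
    # does not shift the offsets of the spans that still have to be processed.
    for begin, end in sorted(spans, reverse=True):
        marked = marked[:begin] + "<entity>" + marked[begin:end] + "</entity>" + marked[end:]
    return marked

text = "Obama visited Paris today."
spans = [(0, 5), (14, 19)]  # "Obama" and "Paris"
print(mark_entities(text, spans))
```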

In [None]:
ag = Agdistis()

In [None]:
def agdistis_disambiguation(articles):
    """
    Perform disambiguation with AGDISTIS.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
                                    
            # AGDISTIS expects entity mentions that are pre-marked inside text. 
            # For example, the sentence "Obama visited Paris today", 
            # should be transformed to "<entity>Obama</entity> visited <entity>Paris</entity> today."
            # We do this in the next 5 lines of code.
            original_content = article.content
            new_content=original_content
            for entity in reversed(article.entity_mentions):
                entity_span=new_content[entity.begin_index: entity.end_index]
                new_content=new_content[:entity.begin_index] + '<entity>' + entity_span + '</entity>' + new_content[entity.end_index:]

            # Now, we can run the AGDISTIS library with this string.
            results = ag.disambiguate(new_content)
            
            # Let's normalize the disambiguated entities.
            # This means mostly removing the first part of the URI which is always the same (http://dbpedia.org/resource)
            # and leaving only the entity identification part (e.g., Barack_Obama).
            dis_entities={}
            for dis_entity in results:
                dis_entities[str(dis_entity['start'])] = normalizeURL(dis_entity['disambiguatedURL'])
                
            # We can now store the entity to our class instance for later processing.
            for entity in article.entity_mentions:
                start = entity.begin_index
                # Fall back to 'NIL' if no result was returned for this offset.
                dis_url = dis_entities.get(str(start), 'NIL')
                entity.agdistis_link = dis_url

            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
    return articles

In [None]:
processed_agdistis=agdistis_disambiguation(articles)

#### 6.4 DBpedia Spotlight

DBpedia Spotlight is an entity recognition and linking tool that performs linking to DBpedia. The core of its method is a vector space model that computes the similarity between the text to annotate and the Wikipedia pages of all entity candidates for a mention. The candidate with the largest similarity is then chosen.

Here is their paper: http://oa.upm.es/8923/1/DBpedia_Spotlight.pdf
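
To get a feeling for the vector space idea, here is a toy sketch that compares the context of a mention with a short description of each candidate, using raw word counts and cosine similarity. The descriptions in `candidate_descriptions` are made-up one-liners standing in for Wikipedia pages, and the plain word counts are a simplification: this is not the weighting scheme that Spotlight itself uses.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Represent a text as a vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up one-line descriptions standing in for the Wikipedia pages of two candidates.
candidate_descriptions = {
    "John_F._Kennedy": "american politician who served as president of the united states",
    "John_F._Kennedy_International_Airport":
        "international airport in new york city with direct service by many airlines",
}

context = "JetBlue begins direct service between Barnstable Airport and JFK."
context_vector = bag_of_words(context)

scores = {entity: cosine_similarity(context_vector, bag_of_words(description))
          for entity, description in candidate_descriptions.items()}
best = max(scores, key=scores.get)
print(best)
```

Here the words "airport", "direct", and "service" in the context overlap with the airport description, so the airport candidate gets the larger similarity.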

In [4]:
def spotlight_disambiguate(articles, spotlight_url):
    """
    Perform disambiguation with DBpedia Spotlight.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
            # As with AGDISTIS, we first prepare the document text and the mentions
            # in order to provide these to Spotlight as input.
            annotation = etree.Element("annotation", text=article.content)
            for mention in article.entity_mentions:
                sf = etree.SubElement(annotation, "surfaceForm")
                sf.set("name", mention.mention)
                sf.set("offset", str(mention.begin_index))
            my_xml=etree.tostring(annotation, xml_declaration=True, encoding='UTF-8')
            
            # Send a disambiguation request to spotlight
            results=requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': 0.5}), 
                                  headers={'Accept': 'application/json'})
            
            # Process the results and normalize the entity URIs
            j=results.json()
            dis_entities={}
            if 'Resources' in j: 
                resources=j['Resources']
            else: 
                resources=[]
            for dis_entity in resources:
                dis_entities[str(dis_entity['@offset'])] = normalizeURL(dis_entity['@URI'])
            
            # Let's now store the URLs by Spotlight to our class for later analysis.
            for entity in article.entity_mentions:
                start = entity.begin_index
                if str(start) in dis_entities:
                    dis_url = dis_entities[str(start)]
                else:
                    dis_url = 'NIL'
                entity.spotlight_link = dis_url
    
            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
                
            # Pause for 100ms to prevent overloading the server
            time.sleep(0.1)
    return articles

In [None]:
#spotlight_url="http://model.dbpedia-spotlight.org/en/disambiguate" # Uses data from February 2018
spotlight_url="http://spotlight.fii800.lod.labs.vu.nl/rest/disambiguate" # Uses data from April 2016 (same as AGDISTIS)

processed_spotlight=spotlight_disambiguate(processed_agdistis, spotlight_url)

#### 6.5 Comparing the output of AGDISTIS and Spotlight to the gold links

Let's pick an article and print the decisions made by AGDISTIS and Spotlight on that article, and then compare that to the gold link.

In [None]:
article_number=3

for m in articles[article_number].entity_mentions:
    print('%s\t%s\t%s' % (m.gold_link, m.agdistis_link, m.spotlight_link))
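
Printing the links is useful for a first inspection, but we can also score a tool over all mentions with the same TP/FP/FN rules as in part 4. The sketch below is one possible way to do that; `evaluate_links` is a helper we introduce here, it treats a missing link (`None`) as NIL, and the `toy_articles` with their links are invented stand-ins so that the example runs on its own (they do not reflect the real output of the tools).

```python
from types import SimpleNamespace

def evaluate_links(articles, link_attribute):
    """Compute precision, recall, and F1 of the links stored in link_attribute against gold_link."""
    tp = fp = fn = 0
    for article in articles:
        for mention in article.entity_mentions:
            gold = mention.gold_link
            system = getattr(mention, link_attribute) or 'NIL'  # a missing link counts as NIL
            if gold == 'NIL' and system == 'NIL':
                continue
            if gold == system:
                tp += 1
            else:
                if gold != 'NIL':
                    fn += 1
                if system != 'NIL':
                    fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented stand-in for the `articles` list, with links in normalized form.
toy_articles = [SimpleNamespace(entity_mentions=[
    SimpleNamespace(gold_link='JetBlue',
                    agdistis_link='JetBlue',
                    spotlight_link='JetBlue'),
    SimpleNamespace(gold_link='NIL',
                    agdistis_link='Barnstable,_Massachusetts',
                    spotlight_link='NIL'),
    SimpleNamespace(gold_link='John_F._Kennedy_International_Airport',
                    agdistis_link='John_F._Kennedy',
                    spotlight_link='John_F._Kennedy_International_Airport'),
])]

for attribute in ('agdistis_link', 'spotlight_link'):
    print(attribute, evaluate_links(toy_articles, attribute))
```

On the real data, the same function could be called as `evaluate_links(articles, 'agdistis_link')` and `evaluate_links(articles, 'spotlight_link')`.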

### 7. Entity annotation from scratch: Performing recognition and disambiguation together

Some tools only perform disambiguation (AGDISTIS is an example), whereas others (like Spotlight) can perform both recognition and disambiguation.

Here we will use Spotlight to annotate some text with entities without prior marking of entities.

In [None]:
text='''
On November 24, Aziz Karimov, a journalist based in Baku, received an email from Facebook notifying him of a request to reset his password. 
Karimov knew something was wrong since he hadn’t requested a password change. 
Ninety minutes later, as he struggled to regain access to his account, he received four more notifications from Facebook. 
He was informed that he had also been removed as an administrator from four other pages, including one belonging to Turan News Agency, 
Azerbaijan’s only independent news agency.
'''
annotations = spotlight.annotate('http://model.dbpedia-spotlight.org/en/annotate',
                                  text,
                                  confidence=0.5, support=20)

In [None]:
annotations
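
To work with such a result, it helps to turn it into a compact list of mentions with their offsets and links. The sketch below assumes that the result is a list of dictionaries with (at least) the keys `URI`, `surfaceForm`, and `offset`; please check this against the output you actually get. The `sample_text` and `sample_annotations` are hand-made stand-ins in that assumed shape, not real Spotlight output, and `summarize` is a helper name we introduce here.

```python
sample_text = "A journalist based in Baku received an email from Facebook."

def make_annotation(surface_form, uri):
    """Build a hand-made annotation in the assumed shape, with the offset taken from the text."""
    return {'URI': uri,
            'surfaceForm': surface_form,
            'offset': sample_text.find(surface_form)}

# Hand-made stand-ins for annotations; not real Spotlight output.
sample_annotations = [
    make_annotation('Baku', 'http://dbpedia.org/resource/Baku'),
    make_annotation('Facebook', 'http://dbpedia.org/resource/Facebook'),
]

def summarize(annotations):
    """Return (surface form, begin offset, end offset, URI) for each annotation."""
    rows = []
    for annotation in annotations:
        begin = int(annotation['offset'])
        end = begin + len(annotation['surfaceForm'])
        rows.append((annotation['surfaceForm'], begin, end, annotation['URI']))
    return rows

for surface_form, begin, end, uri in summarize(sample_annotations):
    print('%s\t%d\t%d\t%s' % (surface_form, begin, end, uri))
```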