# LAB3 - Entity Linking/Named Entity Disambiguation

### 1. Task definition

**Entity tasks so far** We have seen two tasks that relate to the entities mentioned in text:
1. recognizing/spotting entity mentions in text, in the task of Named Entity Recognition
2. classifying these entity mentions into the semantic class they belong to, in the task of Named Entity Classification/Typing

**NED** Here, we introduce Named Entity Disambiguation (NED), also called (Named) Entity Linking, or (N)EL. NED is a central task in information extraction. The goal is to take the entity mentions found in text by NER and resolve their meaning. In this sense, Entity Disambiguation builds on the output of the NER task. For this reason, the two tasks are sometimes combined into a single task called Named Entity Recognition and Disambiguation (NERD).

**Disambiguation in practice** The disambiguation in this task is done by linking a phrase found in text (for example, 'JFK') to its existing representation in an entity-centered knowledge base (for example, https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport ). 

An entity-centered knowledge base is a collection of facts about entities. It can be in a structured or unstructured format. Wikipedia is an example of an unstructured knowledge base, because most of its content is in unstructured (running text) form. Examples of structured knowledge bases are DBpedia and Wikidata. Here is the representation of the JFK airport in these structured knowledge bases:
* http://dbpedia.org/resource/John_F._Kennedy_International_Airport
* https://www.wikidata.org/wiki/Q8685

**Example** Consider the following sentence:

"_JetBlue_ begins direct service between _Barnstable Airport_ and _JFK_."

Let's say that we perform linking to DBpedia. Then, “JetBlue” should be linked to the entity http://dbpedia.org/resource/JetBlue, and “JFK” to http://dbpedia.org/resource/John_F._Kennedy_International_Airport. 

However, there is no entry in DBpedia for the Barnstable Municipal Airport, which is the meaning of the mention “Barnstable Airport”, so we cannot link this mention. Entities that have no representation in the chosen knowledge base are called *NIL entities*. When a system processes the text, it should simply say that the meaning of “Barnstable Airport” is _NIL_.
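
The expected output for this sentence can be sketched as follows. This is only an illustration: `toy_kb` and `link` are hypothetical names, and the hard-coded dictionary stands in for a real knowledge base such as DBpedia. It also ignores ambiguity, since each mention maps to at most one entry.

```python
# Toy stand-in for a knowledge base: maps a mention string to a DBpedia URI.
# Hypothetical and hard-coded for illustration only.
toy_kb = {
    "JetBlue": "http://dbpedia.org/resource/JetBlue",
    "JFK": "http://dbpedia.org/resource/John_F._Kennedy_International_Airport",
}

def link(mention, kb):
    """Return the URI stored for the mention, or 'NIL' if the KB has no entry."""
    return kb.get(mention, "NIL")

mentions = ["JetBlue", "Barnstable Airport", "JFK"]
links = {mention: link(mention, toy_kb) for mention in mentions}

for mention, uri in links.items():
    print(mention, "->", uri)
```

Here "Barnstable Airport" falls back to `NIL` because the toy knowledge base has no entry for it.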

### 2. Opportunities and challenges

**Connecting text and knowledge bases** This is the first time in this course that we connect the information in text to knowledge bases in the external world. Note that these knowledge bases were not created to improve text processing. Instead, they exist to provide knowledge about the world: for example, Wikipedia, DBpedia, and Wikidata give us encyclopedic knowledge.

**Opportunities** By creating a link between a phrase in text and a unique entry in a knowledge base, we directly get access to much more knowledge that we can use to enhance the information in text. If we know that 'JFK' refers to the airport, we allow our tools to have access to all facts about this airport, such as its location and founding year. In addition, if we want, we can now extract facts from text and store them in these knowledge bases, but this is another task for later :)

**Challenges** So, why is entity linking not an easy task? This relates to two aspects: ambiguity and variance. 

*Ambiguity* is the number of meanings that a certain entity mention can have. For example, imagine how many people in the world are called "John Smith". DBpedia contains entries for a few hundred of them, see http://dbpedia.org/page/John_Smith. How can we teach a computer to decide which of these is the one mentioned in text? And what if the John Smith mentioned in text is a NIL entity and is not stored in DBpedia?

There are also many cases where it is quite easy to link an entity to a knowledge base. Often the mentions in text have low ambiguity (for example, "Barack Obama"). Or they have multiple meanings, but one of them is used almost always: for example, there are multiple cities called "Paris", but the French capital is the one most often mentioned in text.

*Variance* is the number of different mentions that refer to the same entity. For example, http://dbpedia.org/resource/John_F._Kennedy_International_Airport can be called "JFK", or "John F. Kennedy Airport", or "The NYC airport" in text.
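
To make the two notions concrete, here is a minimal sketch that counts them from a handful of (mention, entity) pairs. The pairs are a hypothetical toy sample based on the examples in this section, and the entity names are shortened DBpedia resource names; real counts would come from a knowledge base or an annotated corpus.

```python
from collections import defaultdict

# Hypothetical (mention, entity) pairs, used for illustration only.
pairs = [
    ("JFK", "John_F._Kennedy_International_Airport"),
    ("JFK", "John_F._Kennedy"),
    ("JFK", "JFK_(film)"),
    ("John F. Kennedy Airport", "John_F._Kennedy_International_Airport"),
    ("The NYC airport", "John_F._Kennedy_International_Airport"),
    ("Barack Obama", "Barack_Obama"),
]

entities_per_mention = defaultdict(set)  # basis for ambiguity
mentions_per_entity = defaultdict(set)   # basis for variance

for mention, entity in pairs:
    entities_per_mention[mention].add(entity)
    mentions_per_entity[entity].add(mention)

# Ambiguity: number of distinct entities a mention can refer to
ambiguity = {m: len(ents) for m, ents in entities_per_mention.items()}
# Variance: number of distinct mentions that refer to an entity
variance = {e: len(ms) for e, ms in mentions_per_entity.items()}

print(ambiguity["JFK"])                                   # 3
print(variance["John_F._Kennedy_International_Airport"])  # 3
print(ambiguity["Barack Obama"])                          # 1
```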

### 3. Named Entity Disambiguation in practice: 3 phases

In practice, most NED systems consist of three phases:

1. **Entity recognition/spotting** - this is done as described for the NER(C) task. In the example sentence "_JetBlue_ begins direct service between _Barnstable Airport_ and _JFK_.", the recognition phase will detect the entity mentions: "JetBlue", "Barnstable Airport", and "JFK".
2. **Candidate generation** - here we take each of the recognized mentions and look up its potential meanings in the knowledge base. For example, the phrase "JFK" could have these candidates:
    * http://dbpedia.org/resource/John_F._Kennedy
    * http://dbpedia.org/resource/John_F._Kennedy_International_Airport
    * http://dbpedia.org/resource/JFK_(film)
    * http://dbpedia.org/resource/JFK_University
    * http://dbpedia.org/resource/Justice_for_Khojaly
    
   and so on. Similar lists will be generated for the other mentions found in text: "JetBlue" and "Barnstable Airport". The candidate generation phase is not trivial because of the ambiguity and variance described above. Also, new entities appear all the time in news articles, so the number of options grows over time.

3. **Disambiguation** - the goal of this final phase is to take the list of potential meanings generated in the candidate generation phase for each of the mentions and make a decision on which instance is the correct one. This decision can either be: choosing one of the possible candidates, or deciding that no candidate is the correct one (NIL entity).
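
The last two phases can be sketched in a few lines. This is a toy illustration under strong assumptions, not how any particular NED system works: recognition is assumed to be done already, the `alias_table` with its candidate lists and keywords is invented for this example, entity names are shortened DBpedia resource names, and disambiguation simply picks the candidate whose keywords overlap most with the words of the sentence. The sketch returns `NIL` only when no candidates are found, whereas a real system may also reject all candidates.

```python
# Phase 1 (recognition) is assumed to be done already: mentions are given.
sentence = "JetBlue begins direct service between Barnstable Airport and JFK."
mentions = ["JetBlue", "Barnstable Airport", "JFK"]

# Hypothetical alias table: mention -> {candidate entity: descriptive keywords}.
# Candidate lists and keywords are invented for this illustration.
alias_table = {
    "JetBlue": {"JetBlue": {"airline", "service", "flight"}},
    "JFK": {
        "John_F._Kennedy": {"president", "politician", "assassination"},
        "John_F._Kennedy_International_Airport": {"airport", "flight", "service"},
        "JFK_(film)": {"film", "director", "movie"},
    },
}

def generate_candidates(mention):
    """Phase 2: look up the possible meanings of a mention (may be empty)."""
    return alias_table.get(mention, {})

def disambiguate(candidates, context_words):
    """Phase 3: pick the candidate sharing most keywords with the context, or NIL."""
    if not candidates:
        return "NIL"
    return max(candidates, key=lambda c: len(candidates[c] & context_words))

# Lowercased words of the sentence serve as a very crude context
context = {word.strip(".").lower() for word in sentence.split()}
result = {m: disambiguate(generate_candidates(m), context) for m in mentions}
print(result)
```

With this toy data, "JFK" is linked to the airport because "airport" and "service" occur in the sentence, and "Barnstable Airport" becomes `NIL` because the alias table has no candidates for it.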

### 4. Evaluating entity linking

**Metric** The correctness of an entity linking system is measured in terms of precision, recall, and F1-score. For each of the mentions in text, we compare the system decision against the gold data:
* If the system chose entity X and the gold entity is also X, then we count a *true positive (TP)*
* If the system chose entity X, but the gold entity is Y, then we count a *false positive (FP)* and a *false negative (FN)*
* If the system opted for a NIL entity and the gold entity is X, then we count a *false negative (FN)*
* If the system opted for an entity X but the gold entity is NIL, then we count a *false positive (FP)*

Afterwards, we use these numbers for TP, FP, and FN, to compute precision, recall, and F1-score:

* `precision=TP/(TP+FP)` -> Of the decisions made by the system, how many were correct
* `recall=TP/(TP+FN)` -> Of the gold entities, how many were found correctly by the system
* `f1=2*precision*recall/(precision+recall)` -> the harmonic mean of precision and recall, called F1-score

Note that precision, recall, and F1-score are all equal when neither the system output nor the gold data contains NIL entities. In that case every wrong decision counts as both a false positive and a false negative, so FP and FN are the same.
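
This note can be checked with a small sketch. The helper name `count_errors` and the placeholder entities (`A`, `B`, `X`, ...) are made up for illustration; the counting follows the four cases listed above.

```python
def count_errors(gold, system):
    """Count TP, FP, and FN following the four cases listed above."""
    tp = fp = fn = 0
    for gold_entity, system_entity in zip(gold, system):
        if gold_entity == system_entity:
            # Agreement on NIL is not counted at all
            if gold_entity != "NIL":
                tp += 1
        else:
            if system_entity != "NIL":
                fp += 1
            if gold_entity != "NIL":
                fn += 1
    return tp, fp, fn

# Hypothetical decisions without any NIL: the letters stand for entity URIs.
gold = ["A", "B", "C", "D"]
system = ["A", "X", "C", "Y"]

tp, fp, fn = count_errors(gold, system)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)             # 2 2 2
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Each of the two wrong decisions adds one FP and one FN, which is why the three scores coincide.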

**Example** For the example sentence above, let's say that a system made the following decisions:
* "JetBlue" means http://dbpedia.org/resource/JetBlue (true positive)
* "Barnstable Airport" means http://dbpedia.org/resource/Barnstable,_Massachusetts (false positive)
* "JFK" means http://dbpedia.org/resource/John_F._Kennedy (false positive and false negative)

Then, we have in total: `TP=1, FP=2, FN=1`. 

The resulting precision is `1/3=0.33` and the resulting recall is `1/2=0.5`. 

The F1-score of this system on this sentence would be `0.40`. 
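
For reference, the arithmetic behind this value, using the exact fractions for precision and recall from this example:

$$
F_1 = \frac{2 \cdot \frac{1}{3} \cdot \frac{1}{2}}{\frac{1}{3} + \frac{1}{2}} = \frac{1/3}{5/6} = \frac{2}{5} = 0.40
$$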

### 5. An example in code

Now we provide code for this scenario. Note that for simplicity we assume that the entity recognition by the system is perfect. Also, we use a simple format for the gold and the system output, namely a list. In practice, this requires some more preprocessing.

```python
text = "JetBlue begins direct service between Barnstable Airport and JFK."

# One decision per mention, in the order the mentions appear in the text
gold_decisions = ['http://dbpedia.org/resource/JetBlue',
                  'NIL',
                  'http://dbpedia.org/resource/John_F._Kennedy_International_Airport']
system_decisions = ['http://dbpedia.org/resource/JetBlue',
                    'http://dbpedia.org/resource/Barnstable,_Massachusetts',
                    'http://dbpedia.org/resource/John_F._Kennedy']

tp = 0
fp = 0
fn = 0

for gold_entity, system_entity in zip(gold_decisions, system_decisions):
    # Both agree that the mention is NIL: nothing to count
    if gold_entity == 'NIL' and system_entity == 'NIL':
        continue
    if gold_entity == system_entity:
        tp += 1
    else:
        # A missed gold entity is a false negative
        if gold_entity != 'NIL':
            fn += 1
        # A wrong non-NIL decision is a false positive
        if system_entity != 'NIL':
            fp += 1

print('TP: %d; \nFP: %d, \nFN: %d' % (tp, fp, fn))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Precision: %.2f, \nrecall: %.2f, \nf1-score: %.2f' % (precision, recall, f1))
```

```
TP: 1; 
FP: 2, 
FN: 1
Precision: 0.33, 
recall: 0.50, 
f1-score: 0.40
```
