# LAB3.3 - Disambiguating with existing tools

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Many entity linking tools already exist. We will not build our own; instead, we will run two existing tools and dig deeper into their output. To do this, we will take the following steps:
1. Set up your environment (installation of modules and download of dataset)
2. Load the dataset
3. Run disambiguation with our tools on top of recognized mentions
4. Run entity annotation from scratch: recognition and disambiguation together

### 1. Set up your environment

#### 1.1 Install the needed modules

For the purpose of this week's coding exercises, we will need some new libraries that are probably not installed on your computer:
* [rdflib](https://pypi.org/project/rdflib/)
* (perhaps) [lxml](https://pypi.org/project/lxml/)

You should install these libraries using `conda` or `pip`.

Let's now check if all needed libraries are installed on your computer and can be imported. 

In [1]:
from rdflib import Graph, URIRef
from tqdm import tqdm
import sys
import requests
import urllib
import urllib.parse
from urllib.request import urlopen, Request
from urllib.parse import urlencode
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
import time
import json

# import our own utility functions and classes
import lab3_utils as utils
import lab3_classes as classes

**If you see any errors with the imports:**
* read the error carefully
* install the library that is missing
* try to execute the imports again

Don't proceed before the imports work; we will use these modules in the explanation below and in the assignment.
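
If you would rather see all missing libraries at once instead of fixing import errors one by one, a check along the following lines can help. This is only a sketch: the list `REQUIRED_MODULES` and the helper `find_missing` are names made up for this example, and the list is an assumption based on the third-party imports in the cell above.

```python
import importlib.util

# Assumed list of third-party modules imported by this notebook; adjust as needed.
REQUIRED_MODULES = ["rdflib", "tqdm", "requests", "lxml"]


def find_missing(modules):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any module reported as missing can then be installed with `conda` or `pip` before you rerun the import cell.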

#### 1.2 Obtain the N3 Reuters-128 dataset

**This step should already be done if you downloaded or cloned the course repository from GitHub.**

We will work with a small dataset called Reuters-128. This dataset contains 128 Reuters documents annotated with entity mentions and links to DBpedia. You probably have this dataset if you downloaded or cloned the course repository from GitHub. Otherwise, you can download it from https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl.

Store the dataset file 'Reuters-128.ttl' in the same directory as this notebook.
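
If the file is missing, you could also fetch it from within Python. The sketch below assumes the URL given above is still reachable; the helper name `ensure_dataset` is made up for this example. It downloads the file only when it is not already present.

```python
import os
from urllib.request import urlretrieve

DATASET_URL = "https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl"


def ensure_dataset(path="Reuters-128.ttl", url=DATASET_URL):
    """Download the dataset to `path` unless a file already exists there.

    Returns True if a download took place and False otherwise.
    """
    if os.path.exists(path):
        return False
    urlretrieve(url, path)
    return True
```

Calling `ensure_dataset()` once from the notebook directory should leave 'Reuters-128.ttl' next to the notebook.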

*Before proceeding, please verify that your setup of libraries (1.1) is correct and that the dataset is downloaded in the right location (1.2).*

Congratulations, you are all set up!

### 2. Load the data from N3

**Let's now parse this dataset to a list of news item objects that contain: 1) the text and 2) the entity mentions+links for each news item.**

The data is stored as a .ttl (Turtle) file that follows the [NIF](https://nif.readthedocs.io/en/latest/) format (don't worry about these formats for now; we provide a function that parses this dataset into Python classes).
You can open the .ttl file in Jupyter or a text editor for inspection. You will see records like the following, where the first record represents a Reuters text and the second represents an entity, "Dillon Read and Co Inc", that is mentioned in the text:

```
<http://aksw.org/N3/Reuters-128/41#char=0,481>
      a       nif:String , nif:Context , nif:RFC5147String ;
      nif:beginIndex "0"^^xsd:nonNegativeInteger ;
      nif:endIndex "481"^^xsd:nonNegativeInteger ;
      nif:isString "Avnet Inc said it filed with the Securities and Exchange Commission a registration statement for a proposed public offering of 150 mln dlrs of convertible subordinated debentures due 2012. Avnet said it will use the net proceeds for general working capital purposes and the anticipated domestic and foreign expansion of its distribution, assembly and manufacturing businesses. The company said an investment banking group managed by Dillon Read and Co Inc will handle the offering."@en ;
      nif:sourceUrl <http://www.research.att.com/~lewis/Reuters-21578/15054> .
      
<http://aksw.org/N3/Reuters-128/41#char=433,455>
      a       nif:RFC5147String ;
      nif:anchorOf "Dillon Read and Co Inc"^^xsd:string ;
      nif:beginIndex "433"^^xsd:nonNegativeInteger ;
      nif:endIndex "455"^^xsd:nonNegativeInteger ;
      nif:referenceContext
              <http://aksw.org/N3/Reuters-128/41#char=0,481> ;
      itsrdf:taIdentRef <http://dbpedia.org/resource/Dillon,_Read_%26_Co.> ;
      itsrdf:taSource "DBpedia_en_3.9"^^xsd:string .
```
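
Note how the character offsets appear twice in a mention record: in the `#char=begin,end` fragment of its URI and in the `nif:beginIndex` / `nif:endIndex` properties. As a small illustration (not the parsing code from *lab3_utils.py*; the helper `parse_char_range` is made up for this example), the offsets can be read from the URI and checked against the length of the anchor text:

```python
import re


def parse_char_range(uri):
    """Extract (begin, end) from a NIF-style '#char=begin,end' URI fragment."""
    match = re.search(r"#char=(\d+),(\d+)$", uri)
    if match is None:
        raise ValueError("no char range in URI: %s" % uri)
    return int(match.group(1)), int(match.group(2))


mention_uri = "http://aksw.org/N3/Reuters-128/41#char=433,455"
anchor = "Dillon Read and Co Inc"

begin, end = parse_char_range(mention_uri)
# The span length should equal the length of the anchor text.
print(begin, end, end - begin == len(anchor))
# prints: 433 455 True
```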

We provide Python code in *lab3_utils.py* and *lab3_classes.py* to process these NIF structures and extract the data. Make sure these files are in the same location as the notebook that you are running. If they are not, you should have seen an error in the import cell above. If the imports succeeded, you can now use the functions that are defined in these files.

In [16]:
reuters_file='Reuters-128.ttl'

In [17]:
articles=utils.load_article_from_nif_file(reuters_file)

**Data description**

If you check the *lab3_utils.py* file, you will see that `articles` is a list of news articles (instances of the `NewsItem` class, which we defined in the file *lab3_classes.py*). To help you understand the data, we now describe the information that we store for each news article.

Each news item contains the following fields: `identifier` (the original identifier of this document in the dataset), `dct` (the document creation time, if known), `content` (the text of this document), and `entity_mentions` (a list of entity mention occurrences that were found in this document).

For each of the entity mentions, we store the following information: `sentence` (the sentence in which this mention was found), `mention` (the exact phrase of this mention), `the_type` (its type, if known), `begin_index` (the starting offset of this mention), `end_index` (the ending offset of this mention), `gold_link` (the gold link for this mention), `aida_link` (the link for this mention proposed by AIDA), and `spotlight_link` (the link for this mention proposed by Spotlight).
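
To make this structure concrete, here is a rough stand-in for the two classes. It is not the code from *lab3_classes.py*: the class names `NewsItemSketch` and `EntityMentionSketch`, the field types, and the defaults are assumptions made for illustration. Only the field names follow the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityMentionSketch:
    mention: str                          # exact phrase of the mention
    begin_index: int                      # starting character offset
    end_index: int                        # ending character offset
    sentence: Optional[int] = None        # sentence of the mention (type assumed)
    the_type: Optional[str] = None        # entity type, if known
    gold_link: Optional[str] = None       # gold-standard link
    aida_link: Optional[str] = None       # filled in later by AIDA
    spotlight_link: Optional[str] = None  # filled in later by Spotlight


@dataclass
class NewsItemSketch:
    identifier: str
    content: str
    dct: Optional[str] = None             # document creation time, if known
    entity_mentions: List[EntityMentionSketch] = field(default_factory=list)


item = NewsItemSketch(identifier="example-doc", content="Obama visited Paris today")
item.entity_mentions.append(EntityMentionSketch("Obama", 0, 5, gold_link="Barack_Obama"))
item.entity_mentions.append(EntityMentionSketch("Paris", 14, 19, gold_link="Paris"))

for m in item.entity_mentions:
    # The offsets index directly into the document text.
    print(m.mention, item.content[m.begin_index:m.end_index], m.gold_link)
```

The two system fields start out empty and are only filled once the corresponding tool has been run.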

In [18]:
print(len(articles))

128


In [19]:
article=articles[0]
type(article)

lab3_classes.NewsItem

### 3. Disambiguation with existing entity linking tools

We will perform entity linking on this dataset with two modern tools, called AIDA and DBpedia Spotlight.

#### 3.1 AIDA

AIDA is an entity linking system that puts all entity candidates in a graph network and then performs a probabilistic optimization to find the best connected candidate in this graph for each of the entity mentions. We call this **collective** or **global** disambiguation, because all entity mentions are disambiguated together.

AIDA uses two types of connections in its algorithm: 
* The first type is between a mention and an entity instance; it tells us how often the mention refers to that instance. For example, the mention "Jordan" refers most often to the country and less often to the basketball player Michael Jordan. 
* The second connection type is between two entity instances; it tells us how well connected the two entities are. For example, the country Jordan is better connected with Saudi Arabia than it is with the basketball club San Antonio.
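
The interplay of the two connection types can be illustrated with a toy example. The sketch below is not AIDA's algorithm: it simply tries every joint assignment and keeps the one with the highest sum of mention-entity scores and entity-entity scores. All numbers, the second mention "Bulls", and the function names are invented for illustration.

```python
from itertools import product

# Invented mention-entity scores (first connection type).
prior = {
    "Jordan": {"Jordan_(country)": 0.7, "Michael_Jordan": 0.3},
    "Bulls": {"Chicago_Bulls": 0.6, "Bull_(animal)": 0.4},
}

# Invented entity-entity scores (second connection type); missing pairs count as 0.
coherence = {
    frozenset(["Michael_Jordan", "Chicago_Bulls"]): 0.9,
    frozenset(["Jordan_(country)", "Chicago_Bulls"]): 0.05,
    frozenset(["Jordan_(country)", "Bull_(animal)"]): 0.1,
}


def joint_score(assignment):
    """Sum the prior of every chosen entity and the coherence of every entity pair."""
    mentions = list(assignment)
    score = sum(prior[m][assignment[m]] for m in mentions)
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            pair = frozenset([assignment[mentions[i]], assignment[mentions[j]]])
            score += coherence.get(pair, 0.0)
    return score


def best_assignment():
    """Exhaustively search all joint assignments of entities to mentions."""
    mentions = list(prior)
    best = None
    for combo in product(*(prior[m] for m in mentions)):
        assignment = dict(zip(mentions, combo))
        if best is None or joint_score(assignment) > joint_score(best):
            best = assignment
    return best


print(best_assignment())
# prints: {'Jordan': 'Michael_Jordan', 'Bulls': 'Chicago_Bulls'}
```

Taken in isolation, "Jordan" would be linked to the country, but the strong connection between the two basketball entities tips the joint decision the other way. Exhaustive search is only feasible for tiny inputs, which is one reason real systems need a smarter optimization.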

You can play with AIDA by using their demo page: https://gate.d5.mpi-inf.mpg.de/webaida/.

Here is a text you can try:
```
The President opened a window into the state of his mind Sunday when he lashed out against Jennifer Williams, an aide to Vice President Mike Pence, who described his July 25 call with Ukrainian President Volodymyr Zelensky in her deposition as "inappropriate."
```

More description can be found in their paper: http://www.aclweb.org/anthology/D11-1072.

In the function `aida_disambiguation` below, we do the following: 
* we iterate the 128 documents of the Reuters-128 dataset (called `articles`)
* for each document, we combine the text and the marking of the entity mentions in a single string `new_content`
* we send a request to AIDA to disambiguate the entities in this string
* we store the entity links back to the `articles` in the field `aida_link`
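
The second step, marking the mentions inside the text, is the part that is easiest to get wrong. Here it is in isolation, as a sketch with a made-up helper name `mark_mentions` that assumes the spans do not overlap:

```python
def mark_mentions(text, spans):
    """Wrap every (begin, end) span of `text` in double square brackets."""
    marked = text
    # Go from the last span to the first: inserting brackets shifts everything
    # to the right of the insertion point, so earlier offsets stay valid.
    for begin, end in sorted(spans, reverse=True):
        marked = marked[:begin] + "[[" + marked[begin:end] + "]]" + marked[end:]
    return marked


print(mark_mentions("Obama visited Paris today", [(0, 5), (14, 19)]))
# prints: [[Obama]] visited [[Paris]] today
```

Processing the spans from left to right instead would require correcting every later offset by four characters per inserted pair of brackets.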

**Note:** You might encounter an SSL error such as `SSL: CERTIFICATE_VERIFY_FAILED` when running the `urlopen` command. This happens when Python cannot verify the certificate of the server; it is a safety mechanism. As a workaround, you can create an SSL context that skips certificate verification (only do this for services you trust):

```
import ssl
context = ssl._create_unverified_context()
```
Then pass this context to the `urlopen` call:

`urlopen(request, context=context)`

If you can't fix this immediately, please ask us during the lab session.

In [27]:
def aida_disambiguation(articles, aida_url):
    """
    Perform disambiguation with AIDA.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:  # pbar provides a progress bar for the iteration over the articles
        for i, article in enumerate(articles):
                                    
            # AIDA expects entity mentions that are pre-marked inside text. 
            # For example, the sentence "Obama visited Paris today", 
            # should be transformed to "[[Obama]] visited [[Paris]] today."
            # We do this in the next 5 lines of code.
            original_content = article.content
            new_content=original_content
            for entity in reversed(article.entity_mentions):
                entity_span=new_content[entity.begin_index: entity.end_index]
                new_content=new_content[:entity.begin_index] + '[[' + entity_span + ']]' + new_content[entity.end_index:]

            # Now, we can run the AIDA library with this string.
            params={"text": new_content, "tag_mode": 'manual'}
            request = Request(aida_url, urlencode(params).encode())
            this_json = urlopen(request).read().decode('unicode-escape')
            try:
                results=json.loads(this_json)
            except json.JSONDecodeError:
                # Skip documents for which AIDA did not return valid JSON
                continue
            # print(this_json)
            # Let's normalize the disambiguated entities.
            # This means mostly removing the first part of the URI which is always the same (YAGO:)
            # and leaving only the entity identification part (e.g., Barack_Obama).
            dis_entities={}
            for dis_entity in results['mentions']:
                # print(dis_entity)
                if 'bestEntity' in dis_entity.keys():
                    best_entity=dis_entity['bestEntity']['kbIdentifier']
                    clean_url=best_entity[5:] #SKIP YAGO:
                else:
                    clean_url='NIL'
                dis_entities[str(dis_entity['offset'])] = clean_url # BECOMES THE VALUE IN THE DICTIONARY FOR THE OFFSET(REPRESENTING THE START OF THE MENTION) IN THE TEXT
                
            # We can now store the entity to our class instance for later processing.
            for entity in article.entity_mentions:
                start = entity.begin_index
                # Look up the disambiguated URL; mentions that AIDA did not return get 'NIL'
                dis_url = dis_entities.get(str(start), 'NIL')
                entity.aida_link = dis_url  # THE ENTITY IS ENRICHED WITH THE AIDA_LINK

            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
    return articles
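
To see what the normalization step inside the function does without contacting the server, you can run it on a hand-written response. The dictionary `mock_response` below is invented and contains only the fields that the function above reads (`mentions`, `offset`, `bestEntity`, `kbIdentifier`); a real AIDA response may contain more. The helper name `extract_aida_links` is also made up for this sketch.

```python
def extract_aida_links(results, prefix="YAGO:"):
    """Map each mention offset (as a string) to a normalized entity name or 'NIL'."""
    links = {}
    for mention in results.get("mentions", []):
        best = mention.get("bestEntity")
        if best is None:
            # AIDA proposed no entity for this mention.
            links[str(mention["offset"])] = "NIL"
            continue
        identifier = best["kbIdentifier"]
        if identifier.startswith(prefix):
            identifier = identifier[len(prefix):]  # drop the constant 'YAGO:' part
        links[str(mention["offset"])] = identifier
    return links


# Invented response, for illustration only.
mock_response = {
    "mentions": [
        {"offset": 0, "bestEntity": {"kbIdentifier": "YAGO:Barack_Obama"}},
        {"offset": 14},
    ]
}
print(extract_aida_links(mock_response))
# prints: {'0': 'Barack_Obama', '14': 'NIL'}
```

Checking for the prefix before removing it is slightly more defensive than slicing off the first five characters unconditionally.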

In [28]:
# AIDA is running in an external location - for this reason, we need to send an HTTP request. This should take a few minutes.
aida_disambiguation_url = "https://gate.d5.mpi-inf.mpg.de/aida/service/disambiguate"
### We define a subset to test the function
test_items=articles[0:5]
processed_aida=aida_disambiguation(test_items, aida_disambiguation_url)

processed: 1:  20%|██        | 1/5 [00:01<00:05,  1.41s/it]



processed: 5: 100%|██████████| 5/5 [00:02<00:00,  2.10it/s]


In [8]:
#### Now we are ready to do the full set:
processed_aida=aida_disambiguation(articles, aida_disambiguation_url)

processed: 1:   1%|          | 1/128 [00:00<00:20,  6.31it/s]



processed: 12:   9%|▉         | 12/128 [00:26<04:17,  2.22s/it]


KeyboardInterrupt: 

**Note:** The progress bar sometimes stops before 128. To check whether the command has finished executing, look at the asterisk to the left of the cell. Once the '\*' turns into a number, you can proceed with the notebook. Also look at the first printed number (to the left of the progress bar); it should say “processed: 128” when it is done.

In [29]:
print(len(processed_aida))

5


In [31]:
# Print the AIDA link for every entity mention in the processed articles
for article in processed_aida:
    for entity in article.entity_mentions:
        print(entity.aida_link)
    

Japan
Banque_de_France
Bank_of_England
NIL
NIL
NIL
NIL
NIL
Atlantic_City,_New_Jersey
New_Jersey
NIL
Atlantic_City,_New_Jersey
NIL
NIL
NIL
NIL


#### 3.2 DBpedia Spotlight

DBpedia Spotlight ("Spotlight" for short) is an entity recognition and linking tool that links mentions to DBpedia. The core of its method is a vector space model that computes the similarity between the text to annotate and the Wikipedia pages of all entity candidates for a mention. The candidate with the largest similarity is then chosen for each mention.
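
The idea of comparing a text with candidate pages in a vector space can be sketched with plain word counts and cosine similarity. This is a simplification and not Spotlight's actual model: the candidate descriptions are invented stand-ins for Wikipedia pages, no term weighting is applied, and all function names are made up for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase alphabetic tokens of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(context, candidates):
    """Return (similarity, candidate) pairs, most similar first."""
    ctx = bag_of_words(context)
    scored = [(cosine(ctx, bag_of_words(desc)), name) for name, desc in candidates.items()]
    return sorted(scored, reverse=True)


context = "Jordan scored 40 points as the Bulls won the basketball game"
# Invented one-line descriptions standing in for the candidates' Wikipedia pages.
candidates = {
    "Michael_Jordan": "American basketball player who won championships with the Chicago Bulls",
    "Jordan": "Arab country in the Middle East bordering Saudi Arabia and Iraq",
}
print(rank_candidates(context, candidates)[0][1])
# prints: Michael_Jordan
```

Unlike AIDA's collective approach, each mention is scored here on its own against the surrounding text.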

You can try out their demo as well: http://dbpedia-spotlight.github.com/demo/.

Here is their paper: http://oa.upm.es/8923/1/DBpedia_Spotlight.pdf

In the function `spotlight_disambiguate` below, we do the following: 
* we iterate the 128 documents of the Reuters-128 dataset (called `articles`)
* for each document, we combine the text and the marking of the entity mentions in a single XML string `my_xml`
* we send a request to Spotlight to disambiguate the entities in this string
* we store the entity links back to the `articles` in the field `spotlight_link`

The DBpedia Spotlight API expects XML as input. Below is what the XML looks like for the first article:
```
<?xml version='1.0' encoding='UTF-8'?>
<annotation text="West Germanys total net direct investments abroad fell to 11.2 billion marks in 1986 from 13.6 billion in 1985, but investments in developing countries rose to 683 mln marks from 358 mln, the Economics Ministry said. Foreign investments in West Germany were a net 5.8 billion marks in 1986, up from 3.6 billion in 1985, with higher European investments largely responsible for this increase. The ministry noted that, despite last years rise in West German investments in developing countries, the 1986 level was below the investments of over two billion marks seen in 1981, 1982 and 1983.">
<surfaceForm name="West Germanys" offset="0"/>
<surfaceForm name="Economics Ministry" offset="192"/>
<surfaceForm name="West Germany" offset="240"/>
<surfaceForm name="The ministry" offset="392"/>
</annotation>
```

The XML consists of an ```<annotation>``` element that carries the text as an attribute and contains a list of surface forms, each with its offset position in the text to be annotated.
To create this XML, we use the *etree* module of the *lxml* package.
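
Before looking at the full function, it may help to see the XML construction on its own. The sketch below uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs; the notebook itself uses `lxml.etree`, which offers `Element` and `SubElement` calls of the same shape. The helper name `build_annotation_xml` and the example sentence are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_annotation_xml(text, mentions):
    """Build the <annotation> XML for a text and a list of (name, offset) mentions."""
    annotation = ET.Element("annotation", text=text)
    for name, offset in mentions:
        # One <surfaceForm> child per mention; attribute values must be strings.
        ET.SubElement(annotation, "surfaceForm", name=name, offset=str(offset))
    return ET.tostring(annotation, encoding="unicode")


xml_string = build_annotation_xml("Obama visited Paris today", [("Obama", 0), ("Paris", 14)])
print(xml_string)
```

Building the XML through an element tree rather than by string concatenation means special characters in the text, such as quotes or ampersands, are escaped automatically.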

In [33]:
def spotlight_disambiguate(articles, spotlight_url):
    """
    Perform disambiguation with DBpedia Spotlight.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
            # As with AIDA, we first prepare the document text and the mentions
            # in order to provide these to Spotlight as input.
            
            # We build up the XML structure that Spotlight expects as input.
            # The Element function creates the XML element with the text attribute.
            annotation = etree.Element("annotation", text=article.content)
            
            # We iterate over the entity mentions from our Reuters data to create the surface form elements
            for mention in article.entity_mentions:
                sf = etree.SubElement(annotation, "surfaceForm")
                sf.set("name", mention.mention)
                sf.set("offset", str(mention.begin_index))
            my_xml=etree.tostring(annotation, xml_declaration=True, encoding='UTF-8')
            # Send a disambiguation request to spotlight
            results=requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': 0.5}), 
                                  headers={'Accept': 'application/json'})
            # Note that you can adjust the confidence value. Check the online demo to see the effect. 
            # What will happen with the recall and precision if you increase the confidence?
            
            # Process the results and normalize the entity URIs
            j=results.json()
            dis_entities={}
            if 'Resources' in j: 
                resources=j['Resources']
            else: 
                resources=[]
            for dis_entity in resources:
                dis_entities[str(dis_entity['@offset'])] = utils.normalizeURL(dis_entity['@URI'])
            
            # Let's now store the URLs by Spotlight to our class for later analysis.
            for entity in article.entity_mentions:
                start = entity.begin_index
                if str(start) in dis_entities:
                    dis_url = dis_entities[str(start)]
                else:
                    dis_url = 'NIL'
                entity.spotlight_link = dis_url
    
            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
                
            # Pause for 100ms to prevent overloading the server
            time.sleep(0.1)
    return articles

In [34]:
# Spotlight is running in an external location - for this reason, we need to send an HTTP request. Hopefully, this will not take a long time.

spotlight_disambiguation_url="http://model.dbpedia-spotlight.org/en/disambiguate"

### We add the Spotlight links to the same processed data
processed_both=spotlight_disambiguate(processed_aida, spotlight_disambiguation_url)

processed: 5: 100%|██████████| 5/5 [00:05<00:00,  1.01s/it]


#### 3.3 Comparing the output of AIDA and Spotlight to the gold links

Because we stored the decisions by both tools in our list `articles`, we can now compare their output.

Let's pick an article and print the decisions made by AIDA and Spotlight on that article, and then compare that to the gold link.

In [35]:
article_number=1

an_article=processed_both[article_number]
doc_id=an_article.identifier
print(doc_id)
for m in an_article.entity_mentions:
    print('|mention: %s\t|gold:\t%s\t|aida:\t%s\t|spotlight:\t%s |' % (m.mention, m.gold_link, m.aida_link, m.spotlight_link))

http://aksw.org/N3/Reuters-128/108#char=0,453
|mention: Bank of France	|gold:	Banque_de_France	|aida:	Banque_de_France	|spotlight:	Bank_of_France |
|mention: Bank	|gold:	Banque_de_France	|aida:	Bank_of_England	|spotlight:	Bank |


## End of this notebook