# Disambiguating with existing tools

Many entity linking tools already exist. We will not build our own; instead, we will run two existing tools and take a closer look at their output. To do this, we will take the following steps:
1. Set up your environment (installation of modules and download of the dataset)
2. Load the dataset
3. Run disambiguation with our tools on top of recognized mentions
4. Run entity annotation from scratch: recognition and disambiguation together

### 1. Set up your environment

#### 1.1 Install the needed modules

For the purpose of this week's coding exercises, we will need some new libraries that are probably not installed on your computer:
* [rdflib](https://pypi.org/project/rdflib/)
* [pyspotlight](https://pypi.org/project/pyspotlight/)
* (perhaps) [lxml](https://pypi.org/project/lxml/)

You should install these libraries using `conda` or `pip`.
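Before running the import cell, it can help to see at a glance which libraries are still missing. The snippet below is a minimal sketch that uses only the standard library; the helper name `find_missing_modules` and the list `REQUIRED_MODULES` are our own illustration, not part of the lab code.

```python
import importlib.util

# Import names of the third-party libraries used in this notebook.
# Note that the pyspotlight package is imported under the name `spotlight`.
REQUIRED_MODULES = ["rdflib", "spotlight", "lxml", "tqdm", "requests"]


def find_missing_modules(module_names):
    """Return the names from `module_names` that cannot be found on this system."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

Any name that is reported as missing can then be installed with `conda` or `pip` before you continue.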

Let's now check if all needed libraries are installed on your computer and can be imported. 

In [1]:
from rdflib import Graph, URIRef
from tqdm import tqdm
import sys
import requests
import urllib
import urllib.parse
from urllib.request import urlopen, Request
from urllib.parse import urlencode
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
import time
import json

import spotlight

import lab4_utils as utils
import lab4_classes as classes

**If you see any errors with the imports:**
* read the error carefully
* install the library that is missing
* try to execute the imports again

Don't proceed until the imports work; we will use these modules in the explanation below and in the assignment.

#### 1.2 Obtain the N3 Reuters-128 dataset

We will work with a small dataset called Reuters-128. This dataset contains 128 Reuters documents annotated with entity mentions and links to DBpedia. 
You can download this dataset from Canvas or from https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl.

Store the dataset file 'Reuters-128.ttl' in the same directory as this notebook.
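If you prefer to fetch the file from within Python, the following sketch downloads it only when it is not yet present. The helper `ensure_dataset` is a hypothetical convenience function of ours; the URL is the GitHub location given above.

```python
import os
from urllib.request import urlretrieve

DATASET_URL = ("https://raw.githubusercontent.com/dice-group/"
               "n3-collection/master/Reuters-128.ttl")


def ensure_dataset(path="Reuters-128.ttl", url=DATASET_URL):
    """Download the dataset to `path` unless a file already exists there.

    Returns True if a download took place and False otherwise.
    """
    if os.path.exists(path):
        return False  # nothing to do, the file is already present
    urlretrieve(url, path)
    return True
```

Calling `ensure_dataset()` from the notebook directory stores the file next to the notebook, which is where the loading code expects it.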

*Before proceeding, please verify that your setup of libraries (1.1) is correct and that the dataset is downloaded in the right location (1.2).*

Congratulations, you are all set up!

### 2. Load the data from N3

**Let's now parse this dataset into a list of news item objects that contain: 1) the text and 2) the entity mentions and their links for each news item.**

The data is stored in Turtle (`.ttl`), an RDF serialization, and follows the NIF (NLP Interchange Format) vocabulary. Don't worry about these formats for now: we provide a function that parses this dataset into Python classes.

In [2]:
reuters_file='Reuters-128.ttl'

In [3]:
articles=utils.load_article_from_nif_file(reuters_file)

**Data description**

`articles` is a list of news articles (instances of the `NewsItem` class, which we define ourselves). To help you understand its structure, we now describe the information that we store for each news article. 

Each news item contains the following fields: `identifier` (the original identifier of this document in the dataset), `dct` (the document creation time, if known), `content` (the text of this document), and `entity_mentions` (a list of entity mention occurrences that were found in this document).

For each entity mention, we then store the following information: `sentence` (the sentence in which this mention was found), `mention` (the exact phrase of this mention), `the_type` (its type, if known), `begin_index` (the starting offset of this mention), `end_index` (the ending offset of this mention), `gold_link` (the gold link for this mention), `agdistis_link` (the link for this mention proposed by AGDISTIS), and `spotlight_link` (the link for this mention proposed by Spotlight).
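To make this structure concrete, here is a rough sketch of how such classes could look. The real definitions live in `lab4_classes` and may differ; the field names follow the description above, while the field types and default values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityMention:
    mention: str                          # exact phrase of the mention
    begin_index: int                      # starting character offset
    end_index: int                        # ending character offset
    sentence: Optional[int] = None        # assumed to be a sentence number
    the_type: Optional[str] = None        # entity type, if known
    gold_link: Optional[str] = None       # link from the gold annotation
    agdistis_link: Optional[str] = None   # filled in after running AGDISTIS
    spotlight_link: Optional[str] = None  # filled in after running Spotlight


@dataclass
class NewsItem:
    identifier: str                       # original document identifier
    content: str = ""                     # text of the document
    dct: Optional[str] = None             # document creation time, if known
    entity_mentions: List[EntityMention] = field(default_factory=list)


# A toy document (not taken from Reuters-128)
item = NewsItem(identifier="toy-doc", content="Obama visited Paris today")
item.entity_mentions.append(EntityMention("Obama", 0, 5, gold_link="Barack_Obama"))
item.entity_mentions.append(EntityMention("Paris", 14, 19, gold_link="Paris"))

for m in item.entity_mentions:
    # The offsets index directly into the document text
    print(m.mention, item.content[m.begin_index:m.end_index], m.gold_link)
```

The two tool-specific fields start out empty and are only filled in once the corresponding tool has been run.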

### 3. Disambiguation with existing entity linking tools

We will perform entity linking on this dataset with two modern tools, called AGDISTIS and DBpedia Spotlight.

#### 3.1 AGDISTIS

AGDISTIS (also called Multilingual AGDISTIS, or MAG) is an entity linking system that puts all entity candidates in a graph network and then performs a probabilistic optimization to find the best connected candidate in this graph for each of the entity mentions. 

More description can be found in their paper: https://arxiv.org/pdf/1707.05288.pdf

You can play with AGDISTIS by using their demo page: http://agdistis.aksw.org/demo/.

In the function `agdistis_disambiguation` below, we do the following: 
* we iterate over the 128 documents of the Reuters-128 dataset (called `articles`)
* for each document, we combine the text and the markup of the entity mentions into a single string `new_content`
* we send a request to AGDISTIS to disambiguate the entities in this string
* we store the entity links back in `articles`, in the field `agdistis_link`
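The markup step can be tried out in isolation. The sketch below uses a hypothetical helper `mark_entities` that takes plain `(begin, end)` offset pairs instead of our mention objects, and wraps every span in `<entity>` tags.

```python
def mark_entities(text, spans):
    """Wrap each (begin, end) span of `text` in <entity>...</entity> tags."""
    marked = text
    # Process the spans from right to left, so that inserting tags
    # does not shift the offsets of the spans that still have to be marked.
    for begin, end in sorted(spans, reverse=True):
        marked = (marked[:begin] + "<entity>" + marked[begin:end]
                  + "</entity>" + marked[end:])
    return marked


print(mark_entities("Obama visited Paris today", [(0, 5), (14, 19)]))
# prints: <entity>Obama</entity> visited <entity>Paris</entity> today
```

Working from the last span to the first is the important design choice here: every inserted tag makes the string longer, which would invalidate the offsets of all spans to its right.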

In [7]:
def agdistis_disambiguation(articles, agdistis_url):
    """
    Perform disambiguation with AGDISTIS.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
                                    
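            # AGDISTIS expects entity mentions that are pre-marked inside the text.
            # For example, the sentence "Obama visited Paris today"
            # should be transformed to "<entity>Obama</entity> visited <entity>Paris</entity> today".
            # We do this in the next 5 lines of code. We insert the tags from the last mention to the
            # first (assuming the mentions are ordered by offset), so that earlier offsets stay valid.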
            original_content = article.content
            new_content=original_content
            for entity in reversed(article.entity_mentions):
                entity_span=new_content[entity.begin_index: entity.end_index]
                new_content=new_content[:entity.begin_index] + '<entity>' + entity_span + '</entity>' + new_content[entity.end_index:]

            # Now we can send this string to the AGDISTIS web service.
            params={"text": new_content, "type": 'agdistis'}
            request = Request(agdistis_url, urlencode(params).encode())
            this_json = urlopen(request).read().decode()
            results=json.loads(this_json)
            
            # Let's normalize the disambiguated entities.
            # This mostly means removing the first part of the URI, which is always the same
            # (http://dbpedia.org/resource), and keeping only the entity identifier (e.g., Barack_Obama).
            dis_entities={}
            for dis_entity in results:
                dis_entities[str(dis_entity['start'])] = utils.normalizeURL(dis_entity['disambiguatedURL'])
                
            # We can now store the links in our class instances for later processing.
            # If AGDISTIS returned no result for a mention, we store 'NIL' (as we do for Spotlight).
            for entity in article.entity_mentions:
                start = entity.begin_index
                dis_url = dis_entities.get(str(start), 'NIL')
                entity.agdistis_link = dis_url

            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
    return articles

In [8]:
agdistis_url = "http://akswnc9.informatik.uni-leipzig.de:8113/AGDISTIS"

processed_agdistis=agdistis_disambiguation(articles, agdistis_url)

processed: 128: 100%|██████████| 128/128 [02:42<00:00,  1.45it/s]
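The normalization step relies on `utils.normalizeURL`, whose implementation is not shown in this notebook. As an illustration only, a function with the behaviour described in the code comments (dropping the shared DBpedia prefix) could look like the sketch below; the name `normalize_url` and its exact behaviour are assumptions, and the real helper may do more.

```python
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def normalize_url(uri):
    """Return the entity identifier of a DBpedia resource URI.

    URIs that do not start with the DBpedia prefix are returned unchanged.
    """
    if uri.startswith(DBPEDIA_PREFIX):
        return uri[len(DBPEDIA_PREFIX):]
    return uri


print(normalize_url("http://dbpedia.org/resource/Barack_Obama"))
# prints: Barack_Obama
```

Storing only the identifier makes the links of the tools directly comparable with the gold links.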


#### 3.2 DBpedia Spotlight

DBpedia Spotlight (or just "Spotlight" for brevity) is an entity recognition and linking tool that links mentions to DBpedia. The core of its method is a vector space model that computes the similarity between the text to annotate and the Wikipedia pages of all entity candidates for a mention. For each mention, the candidate with the largest similarity is chosen.

Here is their paper: http://oa.upm.es/8923/1/DBpedia_Spotlight.pdf

You can try out their demo as well: http://dbpedia-spotlight.github.com/demo/.

In the function `spotlight_disambiguate` below, we do the following: 
* we iterate over the 128 documents of the Reuters-128 dataset (called `articles`)
* for each document, we combine the text and the entity mentions (surface form and offset) into a single XML string `my_xml`
* we send a request to Spotlight to disambiguate the entities in this XML
* we store the entity links back in `articles`, in the field `spotlight_link` (or `NIL` when Spotlight returns no link for a mention)
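The input that `spotlight_disambiguate` sends to Spotlight is a small XML document: an `annotation` element that carries the text, with one `surfaceForm` child per mention. The sketch below builds such a document with the standard library instead of `lxml`; the helper name `build_annotation_xml` and the toy sentence are our own.

```python
import xml.etree.ElementTree as ET


def build_annotation_xml(text, mentions):
    """Serialize `text` and its (surface form, offset) pairs into one XML string."""
    annotation = ET.Element("annotation", text=text)
    for name, offset in mentions:
        # Attribute values must be strings, so the offset is converted
        ET.SubElement(annotation, "surfaceForm", name=name, offset=str(offset))
    return ET.tostring(annotation, encoding="unicode")


xml_string = build_annotation_xml("Obama visited Paris today",
                                  [("Obama", 0), ("Paris", 14)])
print(xml_string)
```

Unlike the AGDISTIS input, the text itself stays untouched here: the mentions are listed separately and point into the text through their offsets.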

In [10]:
def spotlight_disambiguate(articles, spotlight_url):
    """
    Perform disambiguation with DBpedia Spotlight.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
            # As with AGDISTIS, we first prepare the document text and the mentions
            # in order to provide these to Spotlight as input.
            annotation = etree.Element("annotation", text=article.content)
            for mention in article.entity_mentions:
                sf = etree.SubElement(annotation, "surfaceForm")
                sf.set("name", mention.mention)
                sf.set("offset", str(mention.begin_index))
            my_xml=etree.tostring(annotation, xml_declaration=True, encoding='UTF-8')
            
            # Send a disambiguation request to spotlight
            results=requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': 0.5}), 
                                  headers={'Accept': 'application/json'})
            
            # Process the results and normalize the entity URIs
            j=results.json()
            dis_entities={}
            if 'Resources' in j: 
                resources=j['Resources']
            else: 
                resources=[]
            for dis_entity in resources:
                dis_entities[str(dis_entity['@offset'])] = utils.normalizeURL(dis_entity['@URI'])
            
            # Let's now store the URLs by Spotlight to our class for later analysis.
            for entity in article.entity_mentions:
                start = entity.begin_index
                if str(start) in dis_entities:
                    dis_url = dis_entities[str(start)]
                else:
                    dis_url = 'NIL'
                entity.spotlight_link = dis_url
    
            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
                
            # Pause for 100ms to prevent overloading the server
            time.sleep(0.1)
    return articles

In [11]:
#spotlight_url="http://model.dbpedia-spotlight.org/en/disambiguate" # Uses data from February 2018
spotlight_url="http://spotlight.fii800.lod.labs.vu.nl/rest/disambiguate" # Uses data from April 2016 (same as AGDISTIS)

processed_both=spotlight_disambiguate(processed_agdistis, spotlight_url)

processed: 128: 100%|██████████| 128/128 [00:19<00:00,  6.64it/s]


#### 3.3 Comparing the output of AGDISTIS and Spotlight to the gold links

Because we stored the decisions by both tools in our list `articles`, we can now compare their output.

Let's pick an article and print the decisions made by AGDISTIS and Spotlight on that article, and then compare that to the gold link.

In [13]:
article_number=3

an_article=processed_both[article_number]
doc_id=an_article.identifier
print(doc_id)
for m in an_article.entity_mentions:
    print('%s\t%s\t%s\t%s' % (m.mention, m.gold_link, m.agdistis_link, m.spotlight_link))

http://aksw.org/N3/Reuters-128/16#char=0,409
Toronto Dominion Bank	Toronto-Dominion_Bank	Toronto-Dominion_Bank	Toronto-Dominion_Bank
Hambros Bank Ltd	Hambros_Bank	Hambros_Bank	NIL
London	London	London	London
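Looking at single articles is instructive, but the same comparison can also be summarized over many mentions. The sketch below computes the fraction of mentions for which a tool's link equals the gold link. The function `link_accuracy` is a hypothetical helper and a deliberately simple measure, and the sample data only mimics the three mentions printed above.

```python
from types import SimpleNamespace


def link_accuracy(articles, link_field):
    """Fraction of mentions whose `link_field` value equals the gold link."""
    total = 0
    correct = 0
    for article in articles:
        for m in article.entity_mentions:
            total += 1
            if getattr(m, link_field) == m.gold_link:
                correct += 1
    return correct / total if total else 0.0


# Stand-in objects with the same attribute names as our classes
sample_article = SimpleNamespace(entity_mentions=[
    SimpleNamespace(gold_link="Toronto-Dominion_Bank",
                    agdistis_link="Toronto-Dominion_Bank",
                    spotlight_link="Toronto-Dominion_Bank"),
    SimpleNamespace(gold_link="Hambros_Bank",
                    agdistis_link="Hambros_Bank",
                    spotlight_link="NIL"),
    SimpleNamespace(gold_link="London",
                    agdistis_link="London",
                    spotlight_link="London"),
])

print("AGDISTIS: %.2f" % link_accuracy([sample_article], "agdistis_link"))
print("Spotlight: %.2f" % link_accuracy([sample_article], "spotlight_link"))
```

With the real data, you could call `link_accuracy(processed_both, 'agdistis_link')` in the same way.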


### 4. Entity annotation from scratch: Performing recognition and disambiguation together

Some tools only perform disambiguation (AGDISTIS is an example), whereas others (like Spotlight) can perform both recognition and disambiguation.

Here we will use Spotlight to annotate some text with entities, without marking the entities beforehand. Notice that we now call the function `annotate` instead of sending a request to `disambiguate`.

In [14]:
text='''
On November 24, Aziz Karimov, a journalist based in Baku, received an email from Facebook notifying him of a request to reset his password. 
Karimov knew something was wrong since he hadn’t requested a password change. 
Ninety minutes later, as he struggled to regain access to his account, he received four more notifications from Facebook. 
He was informed that he had also been removed as an administrator from four other pages, including one belonging to Turan News Agency, 
Azerbaijan’s only independent news agency.
'''
annotations = spotlight.annotate('http://model.dbpedia-spotlight.org/en/annotate',
                                  text,
                                  confidence=0.5, support=20)

Let's see which entity mentions were recognized by Spotlight and which links were assigned to these entity mentions.
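The raw result is a list of dictionaries with quite a few fields. If you only care about the mention, its offset, and the linked entity, a compact view can be sketched as follows. The helper `summarize_annotations` is our own, and the sample list copies just three keys from the first two entries of the output shown below.

```python
def summarize_annotations(annotations):
    """Return (surface form, offset, entity name) tuples for Spotlight annotations."""
    prefix = "http://dbpedia.org/resource/"
    rows = []
    for a in annotations:
        uri = a["URI"]
        name = uri[len(prefix):] if uri.startswith(prefix) else uri
        rows.append((a["surfaceForm"], a["offset"], name))
    return rows


sample = [
    {"URI": "http://dbpedia.org/resource/Islam_Karimov",
     "surfaceForm": "Karimov", "offset": 22},
    {"URI": "http://dbpedia.org/resource/Baku",
     "surfaceForm": "Baku", "offset": 53},
]
for row in summarize_annotations(sample):
    print("%s\t%d\t%s" % row)
```

In the notebook, you would pass the full `annotations` list instead of `sample`.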

In [15]:
annotations

[{'URI': 'http://dbpedia.org/resource/Islam_Karimov',
  'support': 363,
  'types': 'Http://xmlns.com/foaf/0.1/Person,Wikidata:Q5,Wikidata:Q24229398,Wikidata:Q215627,DUL:NaturalPerson,DUL:Agent,Schema:Person,DBpedia:Person,DBpedia:OfficeHolder,DBpedia:Agent',
  'surfaceForm': 'Karimov',
  'offset': 22,
  'similarityScore': 0.9999999994710151,
  'percentageOfSecondRank': 0.0},
 {'URI': 'http://dbpedia.org/resource/Baku',
  'support': 10476,
  'types': 'Wikidata:Q486972,Schema:Place,DBpedia:Settlement,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location',
  'surfaceForm': 'Baku',
  'offset': 53,
  'similarityScore': 0.9998148103451905,
  'percentageOfSecondRank': 0.00018500987497263095},
 {'URI': 'http://dbpedia.org/resource/Facebook',
  'support': 24873,
  'types': 'Wikidata:Q43229,Wikidata:Q24229398,DUL:SocialPerson,DUL:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:Company,DBpedia:Agent',
  'surfaceForm': 'Facebook',
  'offset': 82,
  'similarityScore': 0.9999997646980595,
  '