# Disambiguating with existing tools

Many entity linking tools already exist. Rather than building our own, we will run two existing tools and take a closer look at their output. We will proceed in four steps:
1. Set up your environment (install the modules and download the dataset)
2. Load the dataset
3. Run disambiguation with our tools on top of recognized mentions
4. Run entity annotation from scratch: recognition and disambiguation together

### 1. Set up your environment

#### 1.1 Install the needed modules

For the purpose of this week's coding exercises, we will need some new libraries that are probably not installed on your computer:
* [rdflib](https://pypi.org/project/rdflib/)
* [agdistispy](https://pypi.org/project/agdistispy/)
* [pyspotlight](https://pypi.org/project/pyspotlight/)
* [lxml](https://pypi.org/project/lxml/)

You should install these libraries using `conda` or `pip`.

Let's now check if all needed libraries are installed on your computer and can be imported. 

In [5]:
from rdflib import Graph, URIRef
import urllib
from tqdm import tqdm
import sys
import requests
import urllib.parse
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
from lxml import etree
import time

from agdistispy.agdistis import Agdistis
import spotlight

**If you see any errors with the imports:**
* read the error carefully
* install the library that is missing
* try to execute the imports again
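
If you prefer to see all missing libraries at once instead of one import error at a time, a check along the following lines may help. This is only a sketch: the helper name `missing_modules` and the `REQUIRED` list are our own, and the list assumes the import names used in the cell above (note that the `pyspotlight` package is imported as `spotlight`).

```python
import importlib.util

# Import names of the packages used in this notebook (assumed from the import cell).
# Note: the pyspotlight package is imported as `spotlight`.
REQUIRED = ["rdflib", "agdistispy", "spotlight", "lxml", "tqdm", "requests"]

def missing_modules(names):
    """Return the names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing modules:", missing_modules(REQUIRED) or "none")
```

Any name that is printed still needs to be installed with `conda` or `pip`.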

#### 1.2 Let's download the N3 Reuters-128 dataset

We will work with a small dataset called Reuters-128. This dataset contains 128 Reuters documents annotated with entity mentions and links to DBpedia. 
You can download this dataset from Canvas or from https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl.

Store the dataset file 'Reuters-128.ttl' in the same directory as this notebook.
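
If you would rather fetch the file from Python, the following sketch downloads it only when it is not already present. The helper name `ensure_dataset` is hypothetical; the URL is the GitHub link given above.

```python
import os
import urllib.request

DATASET_URL = "https://raw.githubusercontent.com/dice-group/n3-collection/master/Reuters-128.ttl"

def ensure_dataset(path="Reuters-128.ttl", url=DATASET_URL):
    """Download the dataset to `path` unless a file already exists there."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

Calling `ensure_dataset()` from the notebook directory should leave `Reuters-128.ttl` next to the notebook.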

*Before proceeding, please verify that your setup of libraries (1.1) is correct and that the dataset is downloaded in the right location (1.2).*

### 2. Load the data from N3

**Let's now parse this dataset into a list of news item objects that contain the text and the entity mentions of each news item.**

The data is stored as a Turtle (.ttl) file that uses the NIF vocabulary (don't worry about these formats for now; we provide a function that parses the dataset into Python classes).

In [None]:
reuters_file='Reuters-128.ttl'

In [None]:
class NewsItem:
    """
    class containing information about a news item
    """
    def __init__(self, identifier, content="",
                 dct=None):
        self.identifier = identifier  # string, the original document name in the dataset
        self.dct = dct                # e.g. "2005-05-14T02:00:00.000+02:00" -> document creation time
        self.content = content        # the text of the news article
        self.entity_mentions = []     # list of EntityMention instances
        
class EntityMention:
    """
    class containing information about an entity mention
    """

    def __init__(self, mention,
                 begin_index, end_index,
                 gold_link=None,
                 the_type=None, sentence=None, agdistis_link=None,
                 spotlight_link=None):
        self.sentence = sentence         # e.g. 4 -> which sentence is the entity mentioned in
        self.mention = mention           # e.g. "John Smith" -> the mention of an entity as found in text
        self.the_type = the_type         # e.g. "Person" | "http://dbpedia.org/ontology/Person"
        self.begin_index = begin_index   # e.g. 15 -> begin offset
        self.end_index = end_index       # e.g. 25 -> end offset
        self.gold_link = gold_link       # gold link if existing
        self.agdistis_link = agdistis_link    # AGDISTIS link
        self.spotlight_link = spotlight_link             # Spotlight link

In [None]:
def normalizeURL(s):
    """
    Normalize a URI by removing its Wikipedia/DBpedia prefix.
    """
    if s:
        if s.startswith('http://aksw.org/notInWiki'):
            return 'NIL'
        else:
            return urllib.parse.unquote(s.replace("http://en.wikipedia.org/wiki/", "").
                                        replace("http://dbpedia.org/resource/", ""). 
                                        replace("http://dbpedia.org/page/", "").
                                        strip().
                                        strip('"'))
    else:
        return 'NIL'
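
To see what this normalization does, here is a small self-contained illustration. The function `normalize_uri` is a compact stand-in that we wrote to mirror `normalizeURL` for this example only, and the example URIs are made up.

```python
import urllib.parse

PREFIXES = (
    "http://en.wikipedia.org/wiki/",
    "http://dbpedia.org/resource/",
    "http://dbpedia.org/page/",
)

def normalize_uri(uri):
    """Compact stand-in for normalizeURL, used only for this illustration."""
    if not uri or uri.startswith("http://aksw.org/notInWiki"):
        return "NIL"
    for prefix in PREFIXES:
        uri = uri.replace(prefix, "")
    # Percent-encoded characters are decoded, e.g. %C3%A3 becomes ã.
    return urllib.parse.unquote(uri.strip().strip('"'))

examples = [
    "http://dbpedia.org/resource/Barack_Obama",
    "http://en.wikipedia.org/wiki/S%C3%A3o_Paulo",
    "http://aksw.org/notInWiki/Some_Company",
    None,
]
for uri in examples:
    print(repr(uri), "->", normalize_uri(uri))
```

The first two URIs are reduced to `Barack_Obama` and `São_Paulo`, while the `notInWiki` URI and the missing link both become `NIL`.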

In [None]:
def load_article_from_nif_file(nif_file):
    """
    Load a dataset in NIF format.
    """
    g=Graph()
    g.parse(nif_file, format="n3")

    news_items=[]

    articles = g.query(
    """ SELECT ?articleid ?date ?string
    WHERE {
        ?articleid nif:isString ?string .
        OPTIONAL { ?articleid <http://purl.org/dc/elements/1.1/date> ?date . }
    }
    """)
    for article in articles:
        news_item_obj=NewsItem(
            content=article['string'],
            identifier=article['articleid'], 
            dct=article['date']
        )
        query=""" SELECT ?id ?mention ?start ?end ?gold
        WHERE {
            ?id nif:anchorOf ?mention ;
            nif:beginIndex ?start ;
            nif:endIndex ?end ;
            nif:referenceContext <%s> .
            OPTIONAL { ?id itsrdf:taIdentRef ?gold . }
        } ORDER BY ?start""" % str(article['articleid'])
        qres_entities = g.query(query)
        for entity in qres_entities:
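            # The gold link is optional: pass None when it is missing, so that it normalizes to 'NIL'
            # (normalizeURL also maps 'notInWiki' URIs to 'NIL').
            gold = entity['gold']
            gold_link = normalizeURL(str(gold) if gold is not None else None)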
            entity_obj = EntityMention(
                begin_index=int(entity['start']),
                end_index=int(entity['end']),
                mention=str(entity['mention']),
                gold_link=gold_link
            )
            news_item_obj.entity_mentions.append(entity_obj)
        news_items.append(news_item_obj)
    return news_items

In [None]:
articles=load_article_from_nif_file(reuters_file)

### 3. Disambiguation with existing entity linking tools

We will now run two existing entity linking tools, AGDISTIS and DBpedia Spotlight, on this dataset.

#### 3.1 AGDISTIS

AGDISTIS (also called Multilingual AGDISTIS, or MAG) is an entity linking system that puts all entity candidates in a graph network and then performs a probabilistic optimization to find the best connected candidate in this graph for each of the entity mentions. 

More description can be found in their paper: https://arxiv.org/pdf/1707.05288.pdf

You can play with AGDISTIS by using their demo page: http://agdistis.aksw.org/demo/.

In [None]:
ag = Agdistis()

In [None]:
def agdistis_disambiguation(articles):
    """
    Perform disambiguation with AGDISTIS.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
                                    
            # AGDISTIS expects entity mentions that are pre-marked inside text. 
            # For example, the sentence "Obama visited Paris today", 
            # should be transformed to "<entity>Obama</entity> visited <entity>Paris</entity> today."
            # We do this in the next 5 lines of code.
            original_content = article.content
            new_content=original_content
            for entity in reversed(article.entity_mentions):
                entity_span=new_content[entity.begin_index: entity.end_index]
                new_content=new_content[:entity.begin_index] + '<entity>' + entity_span + '</entity>' + new_content[entity.end_index:]

            # Now, we can run the AGDISTIS library with this string.
            results = ag.disambiguate(new_content)
            
            # Let's normalize the disambiguated entities.
            # This means mostly removing the first part of the URI which is always the same (http://dbpedia.org/resource)
            # and leaving only the entity identification part (e.g., Barack_Obama).
            dis_entities={}
            for dis_entity in results:
                dis_entities[str(dis_entity['start'])] = normalizeURL(dis_entity['disambiguatedURL'])
                
            # We can now store the entity to our class instance for later processing.
            # Mentions for which AGDISTIS returned no result are marked as 'NIL'.
            for entity in article.entity_mentions:
                start = entity.begin_index
                dis_url = dis_entities.get(str(start), 'NIL')
                entity.agdistis_link = dis_url

            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
    return articles

In [None]:
processed_agdistis=agdistis_disambiguation(articles)
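
The tagging step inside `agdistis_disambiguation` can be tried out in isolation. The sketch below uses a hypothetical helper `mark_entities` and assumes that the spans do not overlap. It inserts the tags from right to left, so that the offsets of the mentions further to the left remain valid.

```python
def mark_entities(text, spans):
    """Wrap each (begin, end) span of `text` in <entity> tags."""
    # Insert from the rightmost span to the leftmost, so earlier offsets stay valid.
    for begin, end in sorted(spans, reverse=True):
        text = text[:begin] + "<entity>" + text[begin:end] + "</entity>" + text[end:]
    return text

sentence = "Obama visited Paris today"
print(mark_entities(sentence, [(0, 5), (14, 19)]))
# <entity>Obama</entity> visited <entity>Paris</entity> today
```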

#### 3.2 DBpedia Spotlight

DBpedia Spotlight is an entity recognition and linking tool that links mentions to DBpedia. At the core of its method is a vector space model that computes the similarity between the text to annotate and the Wikipedia pages of all entity candidates for a mention. The candidate with the highest similarity is then chosen.

Here is their paper: http://oa.upm.es/8923/1/DBpedia_Spotlight.pdf
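
To get an intuition for this idea, here is a toy sketch that picks a candidate by cosine similarity between bags of words. It is not Spotlight's actual implementation: it uses raw word counts rather than the weighting described in the paper, and the candidate descriptions are hand-written rather than real Wikipedia text. All function names are our own.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_candidate(context, candidates):
    """Return the candidate whose description is most similar to the context."""
    ctx = bag_of_words(context)
    return max(candidates, key=lambda name: cosine(ctx, bag_of_words(candidates[name])))

# Hand-written toy descriptions, not real Wikipedia text.
candidates = {
    "Paris": "paris is the capital city of france and home of the french president",
    "Paris_Hilton": "paris hilton is an american media personality and businesswoman",
}
context = "obama visited paris and met the french president"
print(best_candidate(context, candidates))
```

In this toy setting the context shares more words with the description of the city, so `Paris` is selected over `Paris_Hilton`.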

In [4]:
def spotlight_disambiguate(articles, spotlight_url):
    """
    Perform disambiguation with DBpedia Spotlight.
    """
    with tqdm(total=len(articles), file=sys.stdout) as pbar:
        for i, article in enumerate(articles):
            # As with AGDISTIS, we first prepare the document text and the mentions
            # so that we can provide them to Spotlight as input.
            annotation = etree.Element("annotation", text=article.content)
            for mention in article.entity_mentions:
                sf = etree.SubElement(annotation, "surfaceForm")
                sf.set("name", mention.mention)
                sf.set("offset", str(mention.begin_index))
            my_xml=etree.tostring(annotation, xml_declaration=True, encoding='UTF-8')
            
            # Send a disambiguation request to spotlight
            results=requests.post(spotlight_url, urllib.parse.urlencode({'text':my_xml, 'confidence': 0.5}), 
                                  headers={'Accept': 'application/json'})
            
            # Process the results and normalize the entity URIs
            j=results.json()
            dis_entities={}
            resources = j.get('Resources', [])
            for dis_entity in resources:
                dis_entities[str(dis_entity['@offset'])] = normalizeURL(dis_entity['@URI'])
            
            # Let's now store the URLs by Spotlight to our class for later analysis.
            for entity in article.entity_mentions:
                start = entity.begin_index
                if str(start) in dis_entities:
                    dis_url = dis_entities[str(start)]
                else:
                    dis_url = 'NIL'
                entity.spotlight_link = dis_url
    
            # The next two lines only update the progress bar
            pbar.set_description('processed: %d' % (1 + i))
            pbar.update(1)
                
            # Pause for 100ms to prevent overloading the server
            time.sleep(0.1)
    return articles

In [None]:
#spotlight_url="http://model.dbpedia-spotlight.org/en/disambiguate" # Uses data from February 2018
spotlight_url="http://spotlight.fii800.lod.labs.vu.nl/rest/disambiguate" # Uses data from April 2016 (same as AGDISTIS)

processed_spotlight=spotlight_disambiguate(processed_agdistis, spotlight_url)
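
To see what the XML input built in `spotlight_disambiguate` looks like, the sketch below builds the same kind of `annotation` element for one short sentence. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` and leaves out the XML declaration; the helper name `build_annotation_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_annotation_xml(text, mentions):
    """Serialize a text and its (surface form, offset) mentions as an <annotation> element."""
    annotation = ET.Element("annotation", text=text)
    for name, offset in mentions:
        ET.SubElement(annotation, "surfaceForm", name=name, offset=str(offset))
    return ET.tostring(annotation, encoding="unicode")

xml_payload = build_annotation_xml("Obama visited Paris today", [("Obama", 0), ("Paris", 14)])
print(xml_payload)
```

The printed element carries the full text as an attribute and one `surfaceForm` child per mention, each with its `name` and character `offset`.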

#### 3.3 Comparing the output of AGDISTIS and Spotlight to the gold links

Let's pick an article and print the decisions made by AGDISTIS and Spotlight on that article, and then compare that to the gold link.

In [None]:
article_number=3

for m in articles[article_number].entity_mentions:
    print('%s\t%s\t%s' % (m.gold_link, m.agdistis_link, m.spotlight_link))
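
Beyond reading the printed links line by line, one simple way to summarize the comparison is the fraction of mentions for which a tool's link equals the gold link. The sketch below assumes this accuracy measure (other measures are possible) and runs on hand-written toy mentions, not on actual tool output. `Mention` is a minimal stand-in for `EntityMention`.

```python
from collections import namedtuple

# Minimal stand-in for EntityMention, holding only the three links.
Mention = namedtuple("Mention", ["gold_link", "agdistis_link", "spotlight_link"])

def accuracy(mentions, system_attr):
    """Fraction of mentions whose system link equals the gold link."""
    if not mentions:
        return 0.0
    correct = sum(1 for m in mentions if getattr(m, system_attr) == m.gold_link)
    return correct / len(mentions)

# Hand-written toy mentions, not actual tool output.
toy = [
    Mention("Barack_Obama", "Barack_Obama", "Barack_Obama"),
    Mention("Paris", "Paris", "Paris_Hilton"),
    Mention("NIL", "Reuters", "NIL"),
    Mention("Reuters", "Reuters", "NIL"),
]
print("AGDISTIS:", accuracy(toy, "agdistis_link"))
print("Spotlight:", accuracy(toy, "spotlight_link"))
```

Since the function only reads attributes, it should also work on the notebook's own objects, for example `accuracy(articles[article_number].entity_mentions, 'agdistis_link')`.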

### 4. Entity annotation from scratch: Performing recognition and disambiguation together

Some tools only perform disambiguation (AGDISTIS is an example), whereas others (like Spotlight) can perform both recognition and disambiguation.

Here we will use Spotlight to annotate some text with entities, without marking the entity mentions beforehand.

In [None]:
text='''
On November 24, Aziz Karimov, a journalist based in Baku, received an email from Facebook notifying him of a request to reset his password. 
Karimov knew something was wrong since he hadn’t requested a password change. 
Ninety minutes later, as he struggled to regain access to his account, he received four more notifications from Facebook. 
He was informed that he had also been removed as an administrator from four other pages, including one belonging to Turan News Agency, 
Azerbaijan’s only independent news agency.
'''
annotations = spotlight.annotate('http://model.dbpedia-spotlight.org/en/annotate',
                                  text,
                                  confidence=0.5, support=20)

In [None]:
annotations
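
The raw result can be hard to read. The sketch below condenses it into one line per entity. It assumes that each annotation is a dictionary with the keys `surfaceForm`, `offset` and `URI`; please check this against the output of the cell above. The `sample` list is hand-written in that assumed shape and is not actual Spotlight output, and the helper name `summarize` is our own.

```python
def summarize(annotations):
    """Return (surface form, offset, entity name) triples from Spotlight-style dicts."""
    prefix = "http://dbpedia.org/resource/"
    return [
        (a["surfaceForm"], a["offset"], a["URI"].replace(prefix, ""))
        for a in annotations
    ]

# Hand-written sample in the assumed shape, not actual Spotlight output.
sample = [
    {"surfaceForm": "Baku", "offset": 53, "URI": "http://dbpedia.org/resource/Baku"},
    {"surfaceForm": "Facebook", "offset": 82, "URI": "http://dbpedia.org/resource/Facebook"},
]
for surface, offset, entity in summarize(sample):
    print(offset, surface, entity)
```

If the assumption about the keys holds, `summarize(annotations)` gives the same overview for the real result.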