# Entity Recognition Testing Notebook

**Goal:** Locate, extract, and merge all entities of interest from plain text inputs

Starting from plain text files in the MPQA dataset, we will use Stanford CoreNLP to find all the relevant entities. Entities will then be merged using several heuristics, including alias information from Wikidata (standing in for Freebase, which has since shut down) and coreference chains. From there, we can start calculating sentiment and faction relationships between them.

This notebook essentially implements Section 4.1 (_Document Preprocessing_) of the paper.

## Setup

We'll start by setting up the data and the tools. First we locate all the files we could use in MPQA. Then we set up the `pycorenlp` Python wrapper for Stanford CoreNLP's API (running locally).

In [1]:
import os
mpqa_dir = '../data/database.mpqa.3.0'
mpqa_doclist_path = os.path.join(mpqa_dir, 'doclist')
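# read the document list with a context manager so the file handle is closed
with open(mpqa_doclist_path, 'r') as doclist:
    mpqa_file_paths = [
        os.path.join(mpqa_dir, 'docs', path.strip())
        for path in doclist]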

In [2]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

Let's use a single file from the list for testing and development. Below are its relative path and its contents.

In [3]:
# test_file = mpqa_file_paths[0]
test_file = '../data/database.mpqa.3.0/docs/20010926/23.17.57-23406'
print(test_file)

../data/database.mpqa.3.0/docs/20010926/23.17.57-23406


In [4]:
with open(test_file, 'r') as f:
    text = f.read()
print(text)

	
TAIPEI, Sept 26 (AFP) -- Taiwan President Chen Shui-bian on Wednesday reiterated Taipei's full support for the United States as Washington prepared to launch reprisals against Afghanistan. 
	
	
 "On behalf of the government and people of the Republic of China (Taiwan's official name), I would like to extend our full support to the George W. Bush administration in its any decision and act against terrorists," Chen said while meeting Oregon governor John Kitzhaber. 
	
	
 Taiwan "would not stand idly by" because "the attacks were not only a challenge to the US but also a disruption of peace for mankind," Chen said in a statement released by the presidential office. 
	
	
 "The ROC government will be with the US government firmly." 
	
	
 Chen again voiced his condolences to the families of the thousands of Americans killed when hijacked planes plunged into the New York World Trade Center and Pentagon on September 11. 
	
	
 Chen's remarks came as the US was massing forces to launch reprisa

## Parsing

Now that we have our text input, let's feed it into CoreNLP to get the annotations from a parse. We will need named entity recognition (`ner`) and coreference resolution (`coref`).

In [5]:
output = nlp.annotate(text, properties={
    'annotators': 'ner,coref',
    'outputFormat': 'json'
})
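
Before indexing into the result, it can help to check that it really is a decoded JSON document. The sketch below assumes that a failed request may leave `output` as a raw string rather than a dict (worth verifying against the installed `pycorenlp` version); `require_parsed` is a hypothetical helper, not part of any library.

```python
def require_parsed(output):
    """Return `output` if it looks like a decoded CoreNLP JSON document.

    Raises a descriptive error otherwise, so a server problem is not
    mistaken for an indexing bug further down the notebook.
    """
    if isinstance(output, str):
        # Assumption: a raw string means the response was not valid JSON
        # (for example a timeout message from the server).
        raise RuntimeError('CoreNLP did not return JSON: ' + output[:200])
    if 'sentences' not in output:
        raise RuntimeError('CoreNLP output has no "sentences" key')
    return output


# Usage with a hand-written stand-in for a real response
fake_output = {'sentences': [{'index': 0, 'tokens': []}]}
assert require_parsed(fake_output) is fake_output
```

In the notebook this would be called as `output = require_parsed(output)` right after `nlp.annotate(...)`.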

### Extracting Entities with Named Entity Recognition (`ner` annotator)

The NER parse is stored at the token annotation level under the `ner` key. Each token is annotated with its entity type, or with `'O'` (for "outside", i.e. outside any entity) in the case of non-entity tokens.

Ex) `output['sentences'][0]['tokens'][4]['ner']` will contain the entity type for the fifth token in the first sentence.

In [6]:
annotation = output['sentences'][0]['tokens'][0]
(annotation['originalText'], annotation['ner'])

('TAIPEI', 'LOCATION')

And this is every (token, entity type) pair in the sample text.

In [7]:
[(
    annotated['originalText'], 
    annotated['ner'] if annotated['ner'] != 'O' else None
)
 for sentence in output['sentences'] 
 for annotated in sentence['tokens']]

[('TAIPEI', 'LOCATION'),
 (',', None),
 ('Sept', 'DATE'),
 ('26', 'DATE'),
 ('(', None),
 ('AFP', 'ORGANIZATION'),
 (')', None),
 ('--', None),
 ('Taiwan', 'LOCATION'),
 ('President', None),
 ('Chen', 'PERSON'),
 ('Shui-bian', 'PERSON'),
 ('on', None),
 ('Wednesday', 'DATE'),
 ('reiterated', None),
 ('Taipei', 'LOCATION'),
 ("'s", None),
 ('full', None),
 ('support', None),
 ('for', None),
 ('the', None),
 ('United', 'LOCATION'),
 ('States', 'LOCATION'),
 ('as', None),
 ('Washington', 'LOCATION'),
 ('prepared', None),
 ('to', None),
 ('launch', None),
 ('reprisals', None),
 ('against', None),
 ('Afghanistan', 'LOCATION'),
 ('.', None),
 ('"', None),
 ('On', None),
 ('behalf', None),
 ('of', None),
 ('the', None),
 ('government', None),
 ('and', None),
 ('people', None),
 ('of', None),
 ('the', None),
 ('Republic', 'LOCATION'),
 ('of', 'LOCATION'),
 ('China', 'LOCATION'),
 ('(', None),
 ('Taiwan', 'LOCATION'),
 ("'s", None),
 ('official', None),
 ('name', None),
 (')', None),
 (',', Non

Next, we'll need to identify all the relevant entities in the annotated text. The original authors of the paper omitted entities of type date, duration, money, time, and number. We include that filter in this step.

We will build a mapping from each entity to the list of its occurrences in the text, each stored as `(sentence_index, start_token_index, end_token_index)`. Entities will initially be matched solely on their token text; the mapping will then be extended with the entity merging heuristics mentioned in the paper.

Entities can span multiple tokens but never span sentences. Annotation is done on a per-token basis so any adjacent tokens with the same entity type will be merged into a single multi-token entity span.
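
The span-grouping rule can be sketched in isolation with `itertools.groupby`. This is an illustrative alternative to the explicit loop used in this notebook, not the paper's implementation; the token dicts are hand-written stand-ins that mimic only the `index` and `ner` keys of the CoreNLP output.

```python
from itertools import groupby

# Entity types kept by the filter described above
NAMED_TYPES = {'PERSON', 'LOCATION', 'ORGANIZATION', 'MISC'}


def entity_spans(tokens, named_types=NAMED_TYPES):
    """Yield (entity_type, start_idx, end_idx) for runs of same-typed tokens.

    `tokens` is a list of dicts with 'index' (1-based) and 'ner' keys,
    mirroring the token annotations used in this notebook.
    """
    for ner_type, run in groupby(tokens, key=lambda token: token['ner']):
        if ner_type not in named_types:
            continue  # skips 'O' as well as DATE, NUMBER, and so on
        run = list(run)
        yield ner_type, run[0]['index'], run[-1]['index']


# Hand-written tokens standing in for one parsed sentence
words = [('Taiwan', 'LOCATION'), ('President', 'O'), ('Chen', 'PERSON'),
         ('Shui-bian', 'PERSON'), ('on', 'O'), ('Wednesday', 'DATE')]
tokens = [{'index': i, 'originalText': word, 'ner': ner}
          for i, (word, ner) in enumerate(words, start=1)]

spans = list(entity_spans(tokens))
print(spans)  # [('LOCATION', 1, 1), ('PERSON', 3, 4)]
```

Because `groupby` only groups adjacent items, two different entities of the same type that sit directly next to each other would be fused into one span, which is the same behavior as the rule described above.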

In [8]:
next_id = 0
text_to_id = {}  # (entity_type, raw_text) -> entity id; keeping the raw text makes entity merging easier
occurances = {}  # entity id -> set of (sentence_idx, start_token_idx, end_token_idx)

def clear_occurances():
    """Reset all entity state."""
    global next_id, text_to_id, occurances
    next_id = 0
    text_to_id = {}
    occurances = {}

def create_entity(name, eid):
    """Register `name` under `eid`; several names may share one id."""
    assert name not in text_to_id
    text_to_id[name] = eid
    occurances.setdefault(eid, set())

def add_occurance(entity, sentence_idx, start_token_idx, end_token_idx):
    """Record a single occurrence, creating the entity if needed."""
    add_occurances(entity, {(sentence_idx, start_token_idx, end_token_idx)})

def add_occurances(entity, occurances_iter):
    """Record several occurrences, creating the entity if needed."""
    global next_id
    if entity not in text_to_id:
        create_entity(entity, next_id)
        next_id += 1
    occurances[text_to_id[entity]].update(occurances_iter)

def get_occurances_by_entity(type_entity_pair):
    """Return the occurrence set for an (entity_type, raw_text) key, or None."""
    if type_entity_pair not in text_to_id:
        return None
    return occurances[text_to_id[type_entity_pair]]

def get_text(tokens_slice):
    """Rebuild the surface text of a token span, keeping inner whitespace."""
    raw_text = []
    for token in tokens_slice:
        raw_text.append(token['originalText'])
        raw_text.append(token['after'])
    return ''.join(raw_text[:-1])  # drop the whitespace after the final token


In [9]:
clear_occurances()
named_types = set(['PERSON', 'LOCATION', 'ORGANIZATION', 'MISC'])
for sentence in output['sentences']:
    start_idx = None
    curr_type = None
    for token in sentence['tokens']:
#         if token['ner'] != 'O' and token['ner'] in named_types:
#             print('\t', token['originalText'], token['ner'])
        if token['ner'] != curr_type and curr_type is not None:
            end_idx = token['index'] - 1
            raw_text = get_text(sentence['tokens'][start_idx-1:end_idx])
            add_occurance((curr_type, raw_text), sentence['index'], start_idx, end_idx)
            start_idx = curr_type = None
        if token['ner'] in named_types and curr_type is None:
            start_idx = token['index']
            curr_type = token['ner']
    if curr_type is not None:
        # the sentence ended inside an entity span: flush it, including the last token
        end_idx = token['index']
        raw_text = get_text(sentence['tokens'][start_idx-1:end_idx])
        add_occurance((curr_type, raw_text), sentence['index'], start_idx, end_idx)

Let's take a look at all the entities that we have found!

In [10]:
print('count, entity')
for entity, entity_id in text_to_id.items():
    print(', '.join([str(len(occurances[entity_id])), str(entity)]))

count, entity
1, ('LOCATION', 'TAIPEI')
1, ('ORGANIZATION', 'AFP')
7, ('LOCATION', 'Taiwan')
1, ('PERSON', 'Chen Shui-bian')
3, ('LOCATION', 'Taipei')
1, ('LOCATION', 'United States')
3, ('LOCATION', 'Washington')
2, ('LOCATION', 'Afghanistan')
1, ('LOCATION', 'Republic of China')
1, ('PERSON', 'George W. Bush')
4, ('PERSON', 'Chen')
1, ('LOCATION', 'Oregon')
1, ('PERSON', 'John Kitzhaber')
5, ('LOCATION', 'US')
1, ('LOCATION', 'ROC')
1, ('MISC', 'Americans')
1, ('LOCATION', 'New York World Trade Center')
1, ('ORGANIZATION', 'Pentagon')
1, ('PERSON', 'Usama bin Laden')
1, ('PERSON', 'Tien Hung-mao')
2, ('PERSON', 'Bush')
1, ('PERSON', 'Bill Clinton')
1, ('LOCATION', 'China')
1, ('LOCATION', 'Beijing')


In [11]:
get_occurances_by_entity(('LOCATION', 'US'))

{(2, 20, 20), (3, 9, 9), (5, 7, 7), (6, 12, 12), (8, 7, 7)}
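
To read these triples, it helps to map one back to its surface text. The sketch below assumes sentence indices are 0-based and token indices are 1-based and inclusive, which is how the extraction loop above stores them; `occurrence_text` and `fake_sentences` are hypothetical names introduced for illustration.

```python
def occurrence_text(sentences, occurrence):
    """Return the surface text for one (sentence_idx, start_idx, end_idx) triple.

    Assumes 0-based sentence indices and 1-based, inclusive token indices.
    """
    sentence_idx, start_idx, end_idx = occurrence
    tokens = sentences[sentence_idx]['tokens'][start_idx - 1:end_idx]
    pieces = []
    for token in tokens:
        pieces.append(token['originalText'])
        pieces.append(token['after'])
    return ''.join(pieces[:-1])  # drop the whitespace after the final token


# Hand-written stand-in for output['sentences']
fake_sentences = [{
    'index': 0,
    'tokens': [
        {'index': 1, 'originalText': 'the', 'after': ' '},
        {'index': 2, 'originalText': 'United', 'after': ' '},
        {'index': 3, 'originalText': 'States', 'after': ' '},
        {'index': 4, 'originalText': 'as', 'after': ' '},
    ],
}]

print(occurrence_text(fake_sentences, (0, 2, 3)))  # United States
```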

### Extracting Entities with Coreference (`coref` annotator)

**TODO**
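
One possible starting point is to convert coreference mentions into the same occurrence triples used for named entities. This sketch assumes the JSON output carries a `corefs` dict of chain id to mention list, with `sentNum` and `startIndex` 1-based and `endIndex` exclusive; those field semantics should be checked against the actual output before relying on them. `coref_occurrences` is a hypothetical helper.

```python
def coref_occurrences(corefs):
    """Map each coreference chain id to a set of occurrence triples.

    Assumed mention fields: 'sentNum' (1-based), 'startIndex' (1-based)
    and 'endIndex' (exclusive). Triples are converted to the convention
    used for named entities: 0-based sentence, inclusive token range.
    """
    chains = {}
    for chain_id, mentions in corefs.items():
        chains[chain_id] = {
            (m['sentNum'] - 1, m['startIndex'], m['endIndex'] - 1)
            for m in mentions
        }
    return chains


# Hand-written stand-in for output['corefs']
fake_corefs = {
    '7': [
        {'text': 'Chen Shui-bian', 'sentNum': 1, 'startIndex': 11, 'endIndex': 13},
        {'text': 'Chen', 'sentNum': 2, 'startIndex': 40, 'endIndex': 41},
    ],
}
chains = coref_occurrences(fake_corefs)
print(chains)
```

A chain could then be attached to a named entity whenever one of its triples coincides with an occurrence already recorded for that entity.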

## Merging Extracted Entities

The paper's original authors merged named entities with several heuristics. These include:
> merging acronyms, merging named entity of person type with the same last name ... names listed as an alias on Freebase ... mentions in a co-reference chain with the named entity
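
As a rough illustration of the acronym heuristic, names can be paired purely from their initials. This is a simple string-based sketch under our own assumptions, not the paper's method and not what this notebook does (it relies on Wikidata aliases instead); `acronym_candidates` and `acronym_pairs` are hypothetical helpers.

```python
def acronym_candidates(name):
    """Return plausible acronyms for a multi-word name.

    Two variants are produced: initials of every word ('Republic of China'
    -> 'ROC') and initials of capitalized words only (-> 'RC').
    """
    words = name.split()
    all_words = ''.join(word[0] for word in words).upper()
    capitalized = ''.join(word[0] for word in words if word[0].isupper())
    return {all_words, capitalized}


def acronym_pairs(names):
    """Return (acronym, full_name) pairs found within `names`."""
    names = set(names)
    pairs = set()
    for full_name in names:
        if len(full_name.split()) < 2:
            continue  # a single word has no acronym worth matching
        for candidate in acronym_candidates(full_name):
            if candidate in names:
                pairs.add((candidate, full_name))
    return pairs


found = ['Republic of China', 'ROC', 'United States', 'US', 'Taiwan', 'AFP']
print(sorted(acronym_pairs(found)))
# [('ROC', 'Republic of China'), ('US', 'United States')]
```

A purely string-based rule like this cannot link "Taiwan" to "ROC", which is one reason an alias database is useful here.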

### Merging Aliases and Acronyms using Wikidata

[Freebase shut down at the end of August 2016](https://developers.google.com/freebase/). Much of that data was migrated to [Wikidata](https://wikidata.org). We will be using Wikidata's search and lookup APIs rather than running a local instance of the Freebase data dump.
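
For orientation, a search binding of this kind could be built on the public MediaWiki endpoint with the `wbsearchentities` action. The sketch below is an assumption about how such a wrapper might look, not the code of the bespoke `wikidata` module used in this notebook; the function names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'


def search_url(text, language='en'):
    """Build a `wbsearchentities` request URL for a free-text query."""
    return WIKIDATA_API + '?' + urlencode({
        'action': 'wbsearchentities',
        'search': text,
        'language': language,
        'format': 'json',
    })


def first_match(response):
    """Return the top search hit from a decoded response, or None."""
    results = response.get('search', [])
    return results[0] if results else None


def search(text):
    """Query Wikidata over the network and return the top hit, or None."""
    # Hypothetical User-Agent; identify your own client here
    request = Request(search_url(text),
                      headers={'User-Agent': 'entity-merging-notebook/0.1'})
    with urlopen(request) as reply:
        return first_match(json.load(reply))


# Offline usage with a canned response instead of a network call
canned = {'search': [{'id': 'Q865', 'label': 'Taiwan'}]}
print(first_match(canned)['id'])  # Q865
```

The lookup counterpart would follow the same pattern with the `wbgetentities` action, passing the entity id in `ids` and restricting the result with `props` and `languages`.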

In [12]:
import importlib # needed for `importlib.reload` - used while developing

# import our bespoke Wikidata search/retrieval bindings
import wikidata # `wikidata.search` and `wikidata.get`

In [13]:
importlib.reload(wikidata);

Let's test `wikidata.search` and `wikidata.get` with entities found in our sample document.

A search for both 'Republic of China' and 'Taiwan' should yield the same entity on Wikidata (allowing us to merge the two entities).

In [14]:
search_result = wikidata.search('Republic of China')
search_result

{'aliases': ['Republic of China'],
 'concepturi': 'http://www.wikidata.org/entity/Q865',
 'description': 'state in East Asia',
 'id': 'Q865',
 'label': 'Taiwan',
 'match': {'language': 'en', 'text': 'Republic of China', 'type': 'alias'},
 'pageid': 1185,
 'repository': '',
 'title': 'Q865',
 'url': '//www.wikidata.org/wiki/Q865'}

In [15]:
entity = wikidata.get(search_result['id'])
entity

{'aliases': {'en': [{'language': 'en', 'value': 'ROC'},
   {'language': 'en', 'value': 'Chinese Taipei'},
   {'language': 'en', 'value': 'Chunghwa Minkwo'},
   {'language': 'en', 'value': 'Chunghwa Minkuo'},
   {'language': 'en', 'value': 'Republic of China'},
   {'language': 'en', 'value': '🇹🇼'},
   {'language': 'en', 'value': 'tw'}]},
 'descriptions': {'en': {'language': 'en', 'value': 'state in East Asia'}},
 'id': 'Q865',
 'labels': {'en': {'language': 'en', 'value': 'Taiwan'}},
 'type': 'item'}

In [16]:
entity == wikidata.get_id('Taiwan')

True

They match! As long as this keeps working for ~~all~~ most entities with aliases and acronyms, we'll be all set.

Now for the actual merging...

In [17]:
# Recreate the mappings, merging entities and using Wikidata ids when available
old_text_to_id = text_to_id
old_occurances = occurances
text_to_id = {}
occurances = {}

for (e_type, text), old_eid in old_text_to_id.items():
    entity = wikidata.get_id(text)
    entity_id = ('wikidata', entity['id']) if entity is not None and 'id' in entity else ('manual', old_eid)
    create_entity((e_type, text), entity_id)
    add_occurances((e_type, text), old_occurances[old_eid])

In [18]:
print('count, id, entity')
for entity, entity_id in text_to_id.items():
    print(', '.join([str(len(occurances[entity_id])), str(entity_id), str(entity)]))

count, id, entity
4, ('wikidata', 'Q1867'), ('LOCATION', 'TAIPEI')
1, ('wikidata', 'Q40464'), ('ORGANIZATION', 'AFP')
9, ('wikidata', 'Q865'), ('LOCATION', 'Taiwan')
1, ('wikidata', 'Q22368'), ('PERSON', 'Chen Shui-bian')
4, ('wikidata', 'Q1867'), ('LOCATION', 'Taipei')
6, ('wikidata', 'Q30'), ('LOCATION', 'United States')
3, ('wikidata', 'Q61'), ('LOCATION', 'Washington')
2, ('wikidata', 'Q889'), ('LOCATION', 'Afghanistan')
9, ('wikidata', 'Q865'), ('LOCATION', 'Republic of China')
1, ('wikidata', 'Q207'), ('PERSON', 'George W. Bush')
4, ('wikidata', 'Q804988'), ('PERSON', 'Chen')
1, ('wikidata', 'Q824'), ('LOCATION', 'Oregon')
1, ('wikidata', 'Q740345'), ('PERSON', 'John Kitzhaber')
6, ('wikidata', 'Q30'), ('LOCATION', 'US')
9, ('wikidata', 'Q865'), ('LOCATION', 'ROC')
1, ('wikidata', 'Q846570'), ('MISC', 'Americans')
1, ('manual', 16), ('LOCATION', 'New York World Trade Center')
1, ('wikidata', 'Q127840'), ('ORGANIZATION', 'Pentagon')
1, ('wikidata', 'Q1317'), ('PERSON', 'Usama bin 

Sanity check: make sure that the new mapping has no more keys than the old mapping (merging can only keep the count the same or reduce it).

In this case, we expect to have several fewer entities through merges.

In [19]:
len(old_occurances), len(occurances)

(24, 20)
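
The effect of re-keying can be reproduced offline with a stub in place of the network lookup. This is a minimal sketch of the same idea as the merging cell above, written as a pure function; `merge_by_resolved_id`, `fake_ids`, and the sample data are hypothetical.

```python
def merge_by_resolved_id(text_to_id, occurrences, resolve):
    """Re-key entities by `resolve(text)`, falling back to the old id.

    `resolve` is any callable mapping surface text to a shared id or None;
    in this notebook that role is played by the Wikidata lookup.
    """
    new_text_to_id = {}
    new_occurrences = {}
    for (e_type, text), old_id in text_to_id.items():
        resolved = resolve(text)
        new_id = ('wikidata', resolved) if resolved is not None else ('manual', old_id)
        new_text_to_id[(e_type, text)] = new_id
        # names that resolve to the same id pool their occurrences
        new_occurrences.setdefault(new_id, set()).update(occurrences[old_id])
    return new_text_to_id, new_occurrences


# Hypothetical lookup table standing in for the network call
fake_ids = {'Taiwan': 'Q865', 'ROC': 'Q865', 'US': 'Q30'}

sample_text_to_id = {('LOCATION', 'Taiwan'): 0, ('LOCATION', 'ROC'): 1,
                     ('LOCATION', 'US'): 2, ('LOCATION', 'Somewhere'): 3}
sample_occurrences = {0: {(0, 9, 9)}, 1: {(3, 2, 2)}, 2: {(2, 20, 20)}, 3: {(4, 1, 1)}}

merged_ids, merged = merge_by_resolved_id(sample_text_to_id, sample_occurrences, fake_ids.get)
print(len(sample_occurrences), len(merged))  # 4 3
```

Here "Taiwan" and "ROC" collapse into one entity, while the unresolved name keeps a `('manual', ...)` id.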

### Merging last names

The paper authors also merged person named entities when there were exact last name matches. Example from the paper:
> e.g. Tiger Woods to Woods

The direction of merging might seem somewhat ambiguous: do we merge `person_entity -> last_name` or `last_name -> person_entity`? In either case, we need to take into account what happens when multiple people with the same last name are mentioned. We will only merge one-to-one mappings, which makes the direction irrelevant.

**NOTE** Because the parts of a name are not annotated, it is not possible to distinguish the "family name" or "surname" portion of full names. As a result, the assumptions around last names will fail in the case of cultures where the family name comes first in the full name. In the sample document the Chinese full name "Chen Shui-bian" will not be merged with his last name "Chen".

In our sample document, we will only be merging "Bush" and "George W. Bush".

In [20]:
names = list(filter(lambda pair: pair[0][0] == 'PERSON', text_to_id.items()))
# sort by last name (last name = last token (tokenized by string split))
names.sort(key=lambda pair: (pair[0][1].split(' ')[-1], len(pair[0][1])))
names

[(('PERSON', 'Bush'), ('wikidata', 'Q42295')),
 (('PERSON', 'George W. Bush'), ('wikidata', 'Q207')),
 (('PERSON', 'Chen'), ('wikidata', 'Q804988')),
 (('PERSON', 'Bill Clinton'), ('wikidata', 'Q1124')),
 (('PERSON', 'Tien Hung-mao'), ('wikidata', 'Q9317972')),
 (('PERSON', 'John Kitzhaber'), ('wikidata', 'Q740345')),
 (('PERSON', 'Usama bin Laden'), ('wikidata', 'Q1317')),
 (('PERSON', 'Chen Shui-bian'), ('wikidata', 'Q22368'))]

In [None]:
# Adversarial test to ensure that conflicting matches are filtered out properly
# Expect only (6, 7)
# names = [
#     (('', 'A'), 0),
#     (('', 'X A'), 1),
#     (('', 'XX A'), 1),
#     (('', 'XXX A'), 1),
#     (('', 'B'), 6),
#     (('', 'X B'), 7),
#     (('', 'XX B'), 7),
#     (('', 'C'), 2),
#     (('', 'X C'), 1),
#     (('', 'XX C'), 4),
#     (('', 'XXXX C'), 5),
# ]

In [21]:
potential_merges = set()
curr_last_name = None
curr_last_name_eid = None

for (_, name), eid in names:
    if len(name.split(' ')) == 1:
        curr_last_name = name
        curr_last_name_eid = eid
    elif curr_last_name is not None:
        if curr_last_name == name.split(' ')[-1]:
            potential_merges.add((curr_last_name_eid, eid))
            
potential_merges

{(('wikidata', 'Q42295'), ('wikidata', 'Q207'))}

In [22]:
# filter down to remove conflicting last name -> full name matches
from collections import Counter
count = Counter([eid for pair in potential_merges for eid in pair])
merges = [pair for pair in potential_merges if count[pair[0]] == 1 and count[pair[1]] == 1]
merges

[(('wikidata', 'Q42295'), ('wikidata', 'Q207'))]
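
The candidate and filter steps can be combined into one function and checked against the adversarial case from the commented-out cell above. This is a sketch that uses plain `(name_text, entity_id)` pairs rather than the notebook's `((type, text), id)` pairs; `one_to_one_last_name_merges` is a hypothetical name.

```python
from collections import Counter


def one_to_one_last_name_merges(names):
    """Return (last_name_id, full_name_id) pairs that are unambiguous.

    `names` is a list of (name_text, entity_id) pairs. A pair is kept only
    if neither id takes part in any other candidate pair.
    """
    # Sort so that each bare last name directly precedes its full names
    ordered = sorted(names, key=lambda pair: (pair[0].split(' ')[-1], len(pair[0])))
    candidates = set()
    last_name = last_name_id = None
    for name, eid in ordered:
        parts = name.split(' ')
        if len(parts) == 1:
            last_name, last_name_id = name, eid
        elif last_name == parts[-1]:
            candidates.add((last_name_id, eid))
    # Count how many candidate pairs each id appears in
    usage = Counter(eid for pair in candidates for eid in pair)
    return sorted(pair for pair in candidates
                  if usage[pair[0]] == 1 and usage[pair[1]] == 1)


adversarial = [
    ('A', 0), ('X A', 1), ('XX A', 1), ('XXX A', 1),
    ('B', 6), ('X B', 7), ('XX B', 7),
    ('C', 2), ('X C', 1), ('XX C', 4), ('XXXX C', 5),
]
print(one_to_one_last_name_merges(adversarial))  # [(6, 7)]
```

Id `1` appears under both "A" and "C", and id `2` pairs with several full names, so only the "B" group survives the filter.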

In [23]:
for eid_dest, eid_additional in merges:
    keys_to_change = [entity_key for entity_key, eid in text_to_id.items() if eid == eid_additional]
    # merge the occurrence sets once, then repoint every name that used the old id
    occurances[eid_dest] |= occurances.pop(eid_additional)
    for key in keys_to_change:
        text_to_id[key] = eid_dest

In [24]:
len(occurances)

19

Successfully merged "Bush" and "George W. Bush"!

### Merging Co-reference Mentions

**TODO**