# Entity Recognition Testing Notebook

**Goal:** Locate, extract, and merge all entities of interest from plain text inputs.

Starting from the plain text files in the MPQA dataset, we use Stanford CoreNLP to find all the relevant entities. Mentions of the same entity are then merged using several heuristics, including information from Freebase (Wikidata?) and coreference chains. From there, we can start calculating sentiment and faction relationships between the entities.

This notebook essentially implements Section 4.1 (_Document Preprocessing_) of the paper.

## Setup

We start by setting up the data and the tools. First we locate all the files we could use in MPQA. Then we set up the `pycorenlp` Python wrapper for the Stanford CoreNLP server API (running locally).

In [1]:
import os
mpqa_dir = '../data/database.mpqa.3.0'
mpqa_doclist_path = os.path.join(mpqa_dir, 'doclist')
# Read the document list, closing the file handle when done
with open(mpqa_doclist_path, 'r') as doclist:
    mpqa_file_paths = [
        os.path.join(mpqa_dir, 'docs', path.strip())
        for path in doclist]

In [2]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

Let's use only the first file in the list for testing and development. Below are the relative path to the file and the file's contents.

In [3]:
test_file = mpqa_file_paths[0]
print(test_file)

../data/database.mpqa.3.0/docs/temp_fbis/20.46.58-22510


In [4]:
with open(test_file, 'r') as f:
    text = f.read()
print(text)

	
["Opinion" U.S. Human Rights Claims Only Empty Rhetoric] 
	
	
 The U.S. State Department on Monday published its annual report on the status of human rights in other countries in the year 2001. In this report, when referring to Iran, the United States repeated its allegations against the Islamic Republic but failed to provide any evidence in support of its baseless charges. 
	
	
 Among the unfounded allegations was the claim that the Islamic Republic enjoys no social base and is an unpopular system because of its human rights violations. However, the massive participation of millions of Iranians in the grand rallies on Feb. 11 to mark the anniversary of the victory of the Islamic Revolution and defy U.S. threats against this country once again revealed the emptiness of U.S. charges against Iran. It is quite clear that such baseless accusations are only made to tarnish the image of Iran, since it follows an independent policy and refuses to bow to U.S. domination. 
	
	
 It is interest

## Parsing

Now that we have our text input, let's feed it into CoreNLP to get the annotations from a parse. We will need the dependency parse (`depparse`), named entity recognition (`ner`), and coreference resolution (`coref`). The request below asks only for `depparse` and `ner`; `coref` is not requested yet.

In [5]:
output = nlp.annotate(text, properties={
    'annotators': 'depparse,ner',
    'outputFormat': 'json'
})

### Extracting Entities with Named Entity Recognition (`ner` annotator)

The NER result is stored at the token annotation level under the `ner` key. Each token is annotated with its entity type, or with `'O'` (the conventional "outside" label) for non-entity tokens.

For example, `output['sentences'][0]['tokens'][4]['ner']` contains the entity type of the fifth token in the first sentence.

In [6]:
annotation = output['sentences'][0]['tokens'][4]
(annotation['originalText'], annotation['ner'])

('U.S.', 'LOCATION')

And this is every (token, entity type) pair in the sample text, with `'O'` shown as `None`.

In [7]:
[(
    annotated['originalText'], 
    annotated['ner'] if annotated['ner'] != 'O' else None
)
 for sentence in output['sentences'] 
 for annotated in sentence['tokens']]

[('[', None),
 ('"', None),
 ('Opinion', None),
 ('"', None),
 ('U.S.', 'LOCATION'),
 ('Human', None),
 ('Rights', None),
 ('Claims', None),
 ('Only', None),
 ('Empty', None),
 ('Rhetoric', None),
 (']', None),
 ('The', None),
 ('U.S.', 'ORGANIZATION'),
 ('State', 'ORGANIZATION'),
 ('Department', 'ORGANIZATION'),
 ('on', None),
 ('Monday', 'DATE'),
 ('published', None),
 ('its', None),
 ('annual', 'SET'),
 ('report', None),
 ('on', None),
 ('the', None),
 ('status', None),
 ('of', None),
 ('human', None),
 ('rights', None),
 ('in', None),
 ('other', None),
 ('countries', None),
 ('in', None),
 ('the', 'DATE'),
 ('year', 'DATE'),
 ('2001', 'DATE'),
 ('.', None),
 ('In', None),
 ('this', None),
 ('report', None),
 (',', None),
 ('when', None),
 ('referring', None),
 ('to', None),
 ('Iran', 'LOCATION'),
 (',', None),
 ('the', None),
 ('United', 'LOCATION'),
 ('States', 'LOCATION'),
 ('repeated', None),
 ('its', None),
 ('allegations', None),
 ('against', None),
 ('the', None),
 ('Islamic'

Next, we need to identify all the relevant entities in the annotated text. The original authors of the paper omitted entities of type date, duration, money, time, and number. We apply that filter in this step by keeping only the `PERSON`, `LOCATION`, `ORGANIZATION`, and `MISC` types.
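As a minimal sketch of the paper's exclusion rule on its own, assuming CoreNLP's uppercase tag names for the five omitted types: the names `EXCLUDED_TYPES` and `is_relevant` are hypothetical, and the sample pairs are copied from the listing above with `'O'` restored for the non-entity token.

```python
# Hypothetical names; tags assume CoreNLP's uppercase NER labels.
EXCLUDED_TYPES = {'DATE', 'DURATION', 'MONEY', 'TIME', 'NUMBER'}

def is_relevant(ner_tag):
    """Return True for entity tags that survive the exclusion list."""
    return ner_tag != 'O' and ner_tag not in EXCLUDED_TYPES

# A few (token, tag) pairs taken from the listing above.
sample = [('U.S.', 'LOCATION'), ('Monday', 'DATE'), ('annual', 'SET'), ('report', 'O')]
kept = [(tok, tag) for tok, tag in sample if is_relevant(tag)]
print(kept)
```

Note that `SET` survives a pure exclusion list, which is one reason an explicit allow-list of types can be the stricter choice.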

We will build a mapping from each entity to the set of its occurrences in the text, each stored as `(sentence_index, start_token_index, end_token_index)`. Entities are initially matched solely on their token text; this will later be extended with the entity merging heuristics mentioned in the paper.

Entities can span multiple tokens but never span sentences. Annotation is done per token, so adjacent tokens with the same entity type are merged into a single multi-token entity span.
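The span-merging rule can be sketched in isolation with `itertools.groupby`. This is an illustrative alternative rather than the loop used in this notebook; `merge_spans` is a hypothetical helper, the token and tag lists are hand-written from the sample sentence, and indices are assumed to be 1-based and inclusive to match CoreNLP's token `index` field.

```python
from itertools import groupby

def merge_spans(tags):
    """Yield (tag, start, end) for each run of identical non-'O' tags.

    Indices are 1-based and inclusive.
    """
    position = 1
    for tag, run in groupby(tags):
        length = len(list(run))
        if tag != 'O':
            yield (tag, position, position + length - 1)
        position += length

# Hand-written sample for one sentence fragment.
tokens = ['The', 'U.S.', 'State', 'Department', 'on', 'Monday']
tags = ['O', 'ORGANIZATION', 'ORGANIZATION', 'ORGANIZATION', 'O', 'DATE']
spans = list(merge_spans(tags))
for tag, start, end in spans:
    print(tag, tokens[start - 1:end])
```

One limitation of merging on tag equality alone: two distinct entities of the same type that sit directly next to each other are fused into one span, because the tags carry no begin/inside marker.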

In [8]:
next_id = 0
text_to_id = {} # raw text (tuple of N strings) -> entity id; this indirection will make entity merging easier
occurances = {} # entity id -> set of occurrences of that entity

def clear_occurances():
    global next_id, text_to_id, occurances
    next_id = 0
    text_to_id = {}
    occurances = {}

def add_occurance(entity, sentence_idx, start_token_idx, end_token_idx):
    global next_id
    if entity not in text_to_id:
        text_to_id[entity] = next_id
        next_id += 1
    if text_to_id[entity] not in occurances:
        occurances[text_to_id[entity]] = set()
    occurances[text_to_id[entity]].add((sentence_idx, start_token_idx, end_token_idx))
#     print('observed {} at {} {} {}'.format(entity, sentence_idx, start_token_idx, end_token_idx))

def get_occurances_by_entity(entity):
    if entity not in text_to_id:
        return None
    return occurances[text_to_id[entity]]

In [9]:
clear_occurances()
named_types = {'PERSON', 'LOCATION', 'ORGANIZATION', 'MISC'}
for sentence in output['sentences']:
    tokens = sentence['tokens']
    start_idx = None  # 1-based index of the first token in the open span
    curr_type = None  # entity type of the open span, if any
    for token in tokens:
        # Close the open span when the entity type changes
        if token['ner'] != curr_type and curr_type is not None:
            end_idx = token['index'] - 1
            raw_text = tuple(t['originalText'] for t in tokens[start_idx-1:end_idx])
            add_occurance(raw_text, sentence['index'], start_idx, end_idx)
            start_idx = curr_type = None
        # Open a new span on a token with a relevant entity type
        if token['ner'] in named_types and curr_type is None:
            start_idx = token['index']
            curr_type = token['ner']
    # Close a span that runs to the end of the sentence
    if curr_type is not None:
        end_idx = tokens[-1]['index']
        raw_text = tuple(t['originalText'] for t in tokens[start_idx-1:end_idx])
        add_occurance(raw_text, sentence['index'], start_idx, end_idx)

In [10]:
print(text_to_id.keys())

dict_keys([('U.S.',), ('U.S.', 'State', 'Department'), ('Iran',), ('United', 'States'), ('Islamic', 'Republic'), ('Iranians',), ('Islamic', 'Revolution'), ('Washington',), ('Afghanistan',), ('Taleban',), ('Al-Qaeda',), ('Guantanamo', 'Bay'), ('U.S', 'Administration'), ('Cincinnati',), ('Afro-American',)])
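The keys above still treat `('U.S.',)` and `('United', 'States')` as separate entities. As a rough sketch of why the text-to-id indirection makes merging straightforward, the following self-contained example folds one entity id into another. The helper `merge_entities` is hypothetical, and the data is hand-written with illustrative positions rather than taken from the notebook's state. Deciding which surface forms should be merged is the job of the paper's heuristics and is not covered here.

```python
def merge_entities(text_to_id, occurrences, keep_text, drop_text):
    """Fold drop_text's entity into keep_text's entity, in place."""
    keep_id, drop_id = text_to_id[keep_text], text_to_id[drop_text]
    if keep_id == drop_id:
        return  # already the same entity
    # Repoint every surface form of the dropped id at the kept id.
    for surface, entity_id in text_to_id.items():
        if entity_id == drop_id:
            text_to_id[surface] = keep_id
    # Union the occurrence sets and discard the old id.
    occurrences[keep_id] |= occurrences.pop(drop_id)

# Hand-written sample; positions are (sentence_idx, start_token_idx, end_token_idx).
text_to_id = {('U.S.',): 0, ('United', 'States'): 1}
occurrences = {0: {(0, 5, 5)}, 1: {(2, 11, 12)}}
merge_entities(text_to_id, occurrences, ('U.S.',), ('United', 'States'))
print(text_to_id)
print(occurrences)
```

After the call, both surface forms resolve to the same id, and that id owns the union of the two occurrence sets, so lookups by either text return every mention.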
