# Assignment

The assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

# CoNLL Data
From https://www.clips.uantwerpen.be/conll2003/ner/

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. 

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
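
To make the column layout concrete, here is a minimal sketch of a reader for this format. It is not the `conll.read_corpus_conll` helper used below; the function name `read_conll` and the sample lines are illustrative assumptions.

```python
def read_conll(lines):
    """Parse CoNLL-2003 lines into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # four whitespace-separated columns per token
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        # the file may not end with an empty line
        sentences.append(current)
    return sentences


# made-up sample in the four-column layout (hypothetical data)
sample = """\
-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
# drop the document-boundary marker
sentences = [s for s in sentences if s[0][0] != "-DOCSTART-"]
print(len(sentences), sentences[0][0])
```

Each sentence ends up as a list of 4-tuples, so the named entity tag of a token is always the last element.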


In [1]:
import conll
import my_ents

# token, POS tag, syntactic chunk tag (IOB), entity tag (IOB)
test = conll.read_corpus_conll('dataset/test.txt')
test = [my_ents.Sentence(sent) for sent in test if '-DOCSTART-' not in sent[0][0]]
# test = test[:100]

# 0. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total

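Convert spaCy's OntoNotes tags to the CoNLL format: `PERSON` becomes `PER`, `GPE` and `LOC` become `LOC`, `ORG` stays `ORG`, and every other OntoNotes type falls back to `MISC`.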

In [2]:
# conversion of tags from Ontonotes (spacy) to CoNLL format
def from_spacy_to_conll(predictions_spacy):
    switcher = {
                ' ': '',
                '': '',
                'ORG': '-ORG',
                'PER': '-PER',
                'LOC': '-LOC',
                'PERSON': '-PER',
                'GPE': '-LOC'
            }
    
    # LOC, PER, ORG, MISC
    predictions = []
    
    for sent in predictions_spacy:
        new = []
        
        for ent in sent:
            # merge iob and entity type
            new.append((ent.text, ent.ent_iob_ + switcher.get(ent.ent_type_, '-MISC')))
        
        predictions.append(new)
        
    return predictions
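
The effect of the fallback is easier to see on plain strings. The sketch below reuses the same mapping on `(iob, type)` pairs instead of spaCy tokens; `ONTONOTES_TO_CONLL` and `to_conll_tag` are hypothetical names introduced only for this illustration.

```python
# same mapping as the switcher above; unknown types fall back to MISC
ONTONOTES_TO_CONLL = {
    "": "",
    "ORG": "-ORG",
    "PER": "-PER",
    "LOC": "-LOC",
    "PERSON": "-PER",
    "GPE": "-LOC",
}


def to_conll_tag(iob, ent_type):
    # merge the IOB prefix with the converted entity type
    return iob + ONTONOTES_TO_CONLL.get(ent_type, "-MISC")


pairs = [("B", "PERSON"), ("I", "PERSON"), ("O", ""), ("B", "GPE"), ("B", "DATE"), ("B", "NORP")]
tags = [to_conll_tag(iob, ent_type) for iob, ent_type in pairs]
print(tags)
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-MISC', 'B-MISC']
```

Note that types such as `DATE` or `CARDINAL` also end up as `MISC`. This may contribute to the low `MISC` precision reported in the evaluation, although that link is an assumption and is not measured here.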

## SpaCy predictions with custom tokenizer
Define a custom tokenizer for spaCy; otherwise spaCy tokenizes the text differently from the CoNLL dataset. The predicted tokens would then not align with the gold tokens, making it impossible to compute the accuracy.
Because each sentence is predicted independently of the others, I used `Pool` to parallelize the task across multiple processes and speed up the computation. However, for convenience I used a test set of only 100 samples during development, and in that case it is faster not to use multiprocessing.

In [3]:
%%time

import spacy
from multiprocessing import Pool

nlp = spacy.load('en_core_web_sm')

def tokenizer_(sent):
    return spacy.tokens.Doc(nlp.vocab, sent.split())

nlp.tokenizer = tokenizer_

if len(test) > 100:
    with Pool(4) as p:
        # spacy predictions, multi-process
        predictions_spacy = p.map(nlp, [str(sent) for sent in test])
else:
    # spacy predictions
    predictions_spacy = [nlp(str(sent)) for sent in test]

CPU times: user 7.77 s, sys: 947 ms, total: 8.71 s
Wall time: 11 s


In [4]:
# convert to NLTK format so that conll.evaluate can be used
predictions = from_spacy_to_conll(predictions_spacy)

predictions[6]

[('Oleg', 'B-PER'),
 ('Shatskiku', 'I-PER'),
 ('made', 'O'),
 ('sure', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('win', 'O'),
 ('in', 'O'),
 ('injury', 'O'),
 ('time', 'O'),
 (',', 'O'),
 ('hitting', 'O'),
 ('an', 'O'),
 ('unstoppable', 'O'),
 ('left', 'O'),
 ('foot', 'O'),
 ('shot', 'O'),
 ('from', 'O'),
 ('just', 'O'),
 ('outside', 'O'),
 ('the', 'O'),
 ('area', 'O'),
 ('.', 'O')]

## Token-level accuracy, total and per-class

In [5]:
# organize test data in tuples (entity, tag)
test_set = [[(ent.text, ent.ent_tag) for ent in sent.ents] for sent in test]

test_set[6]

[('Oleg', 'B-PER'),
 ('Shatskiku', 'I-PER'),
 ('made', 'O'),
 ('sure', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('win', 'O'),
 ('in', 'O'),
 ('injury', 'O'),
 ('time', 'O'),
 (',', 'O'),
 ('hitting', 'O'),
 ('an', 'O'),
 ('unstoppable', 'O'),
 ('left', 'O'),
 ('foot', 'O'),
 ('shot', 'O'),
 ('from', 'O'),
 ('just', 'O'),
 ('outside', 'O'),
 ('the', 'O'),
 ('area', 'O'),
 ('.', 'O')]

### Total accuracy

In [6]:
def total_accuracy(predictions_labels, labels):
    # both lists must be aligned token by token
    if len(predictions_labels) != len(labels):
        raise ValueError('Prediction labels and test labels have different length')

    # count the positions where the predicted tag equals the gold tag
    correct = sum(p == l for p, l in zip(predictions_labels, labels))

    return correct / len(labels)

pred_labels = [ent[1] for sent in predictions for ent in sent]
test_labels = [ent[1] for sent in test_set for ent in sent]

round(total_accuracy(pred_labels, test_labels), 3)

0.811

### Per-class accuracy

In [7]:
from sklearn.metrics import classification_report
import pandas as pd

per_class_metrics = classification_report(test_labels, pred_labels, output_dict=True)
pd_tbl_class = pd.DataFrame().from_dict(per_class_metrics).transpose()
pd_tbl_class.round(decimals=3)

| tag | precision | recall | f1-score | support |
|---|---|---|---|---|
| B-LOC | 0.775 | 0.704 | 0.738 | 1668.0 |
| B-MISC | 0.108 | 0.564 | 0.181 | 702.0 |
| B-ORG | 0.5 | 0.303 | 0.378 | 1661.0 |
| B-PER | 0.786 | 0.609 | 0.686 | 1617.0 |
| I-LOC | 0.602 | 0.623 | 0.612 | 257.0 |
| I-MISC | 0.053 | 0.403 | 0.094 | 216.0 |
| I-ORG | 0.42 | 0.52 | 0.464 | 835.0 |
| I-PER | 0.815 | 0.756 | 0.785 | 1156.0 |
| O | 0.945 | 0.862 | 0.901 | 38323.0 |
| accuracy | 0.811 | 0.811 | 0.811 | 0.811 |
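
To show what the columns mean, here is a minimal standard-library sketch that computes the same three scores per tag from two aligned label lists. The recall of a tag is the fraction of its gold tokens that were tagged correctly, which is the per-class accuracy the assignment asks for. The function name `per_tag_scores` and the toy labels are assumptions for illustration; this is not how `classification_report` is implemented.

```python
from collections import Counter


def per_tag_scores(gold, pred):
    """Return {tag: (precision, recall, f1)} for two aligned label lists."""
    gold_n, pred_n = Counter(gold), Counter(pred)
    # true positives: positions where prediction and gold agree
    tp = Counter(g for g, p in zip(gold, pred) if g == p)

    scores = {}
    for tag in sorted(set(gold) | set(pred)):
        precision = tp[tag] / pred_n[tag] if pred_n[tag] else 0.0
        recall = tp[tag] / gold_n[tag] if gold_n[tag] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# toy labels (hypothetical data)
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "B-LOC"]
scores = per_tag_scores(gold, pred)
for tag, (p, r, f) in scores.items():
    print(tag, round(p, 3), round(r, 3), round(f, 3))
```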


### Chunk-level accuracy

In [8]:
results = conll.evaluate(test_set, predictions)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

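| class | precision | recall | f-measure | support |
|---|---|---|---|---|
| LOC | 0.766 | 0.695 | 0.729 | 1668 |
| PER | 0.761 | 0.59 | 0.665 | 1617 |
| ORG | 0.448 | 0.272 | 0.339 | 1661 |
| MISC | 0.105 | 0.55 | 0.177 | 702 |
| total | 0.397 | 0.523 | 0.451 | 5648 |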
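
Chunk-level scores are stricter than token-level ones: an entity counts as correct only if its type and both of its boundaries match. The following is a minimal sketch of that idea under stated assumptions; it is not the implementation of `conll.evaluate`, and the names `extract_chunks` and `chunk_prf` are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (type, start, end) spans in a list of IOB tags (end exclusive)."""
    chunks, start, current = set(), None, None
    # the trailing "O" is a sentinel that flushes a chunk ending at the last token
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("-")
        # an open chunk ends at O, at a B- tag, or when the type changes
        if current is not None and (prefix in ("O", "B") or etype != current):
            chunks.add((current, start, i))
            start, current = None, None
        # a chunk starts at B-, or at I- when no chunk is open
        if prefix in ("B", "I") and current is None:
            start, current = i, etype
    return chunks


def chunk_prf(gold_sents, pred_sents):
    """Precision, recall and F1 over exactly matching chunks."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_chunks(gold), extract_chunks(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# toy example: the PER entity is only partially recognized
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(chunk_prf(gold, pred))
```

In this toy example three of four tokens are tagged correctly, yet only one of the two gold entities is matched exactly, so the chunk-level scores are lower than the token-level accuracy.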


# 1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

In [9]:
def is_in_a_group(ent, groups):
    for group in groups:
        for elem in group:
            if elem.i == ent.i:
                return True
    return False


def insert_in_group(ent, groups):
    # search the group containing the previous token in sentence order
    for i, group in enumerate(groups):
        for elem in group:
            if elem.i == ent.i - 1:
                # add the ent as a new group right after that group
                groups.insert(i+1, [ent])
                return
    # note: if no group contains the previous token, ent is not inserted


def group_by_noun_chunk(sent_doc):
    groups = [[ent for ent in chunk if ent.ent_type_ != ''] for chunk in sent_doc.noun_chunks]
    groups = [group for group in groups if len(group) > 0]

    for ent in sent_doc:
        # check if ent already in a group
        if ent.ent_type_ != '' and not is_in_a_group(ent, groups):
            # insert the missing entity in sentence order
            insert_in_group(ent, groups)

    return groups


predictions_grouped = [group_by_noun_chunk(sent) for sent in predictions_spacy]
print(predictions_spacy[6])
predictions_grouped[6]

Oleg Shatskiku made sure of the win in injury time , hitting an unstoppable left foot shot from just outside the area . 


[[Oleg, Shatskiku]]

Replace each token with its entity type (the groups are modified in place).

In [10]:
def to_tag_groups(grouped_sents):
    for i,sent in enumerate(grouped_sents):
        for j,group in enumerate(sent):
            for k,ent in enumerate(group):
                grouped_sents[i][j][k] = ent.ent_type_

to_tag_groups(predictions_grouped)
predictions_grouped[6]

[['PERSON', 'PERSON']]

## Count frequencies of named entities combinations

In [11]:
from collections import defaultdict

def get_group_freqs(grouped_sents):
    # maps a combination such as 'PERSON-PERSON' to its number of occurrences
    freqs = defaultdict(int)

    for sent in grouped_sents:
        for group in sent:
            freqs['-'.join(str(ent) for ent in group)] += 1

    return freqs


pd_freq_tbl = pd.DataFrame().from_dict(get_group_freqs(predictions_grouped), orient='index', columns=['Count'])
pd_freq_tbl.sort_values(by='Count', ascending=False)

| Combination | Count |
|---|---|
| GPE | 1100 |
| DATE | 797 |
| ORG | 623 |
| PERSON-PERSON | 607 |
| CARDINAL | 472 |
| ... | ... |
| PRODUCT-PRODUCT | 1 |
| NORP-LOC-LOC | 1 |
| DATE-ORG | 1 |
| GPE-GPE-ORDINAL | 1 |
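
Because the groups hold one tag per token, a two-token person name shows up as `PERSON-PERSON`. To focus on which entity types go together, one option is to collapse consecutive repeats before counting. The sketch below assumes groups are lists of type strings, as produced by `to_tag_groups`; the function name `combination_counts`, its `collapse_repeats` parameter and the toy data are hypothetical.

```python
from collections import Counter


def combination_counts(grouped_sents, collapse_repeats=False):
    """Count tag combinations; optionally merge consecutive identical tags."""
    counts = Counter()
    for sent in grouped_sents:
        for group in sent:
            if collapse_repeats:
                # PERSON-PERSON -> PERSON, NORP-LOC-LOC -> NORP-LOC
                group = [t for k, t in enumerate(group) if k == 0 or t != group[k - 1]]
            counts["-".join(group)] += 1
    return counts


# toy input (hypothetical data)
toy = [
    [["PERSON", "PERSON"]],
    [["GPE"], ["NORP", "PERSON", "PERSON"]],
    [["GPE"], ["ORG", "ORG"]],
]

raw = combination_counts(toy)
collapsed = combination_counts(toy, collapse_repeats=True)
# keep only the groups that mix at least two different types
mixed = {combo: n for combo, n in collapsed.items() if "-" in combo}
print(raw.most_common(1), mixed)
```

Filtering on mixed combinations separates multi-token entities of a single type from groups that combine different types.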
