# Assignment

The assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using the `noun_chunks` property of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of the most frequent combinations (i.e. NER types that occur together).

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun compound, making use of the `compound` dependency relation.
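
As a rough illustration of the idea in task 2, the sketch below propagates entity types along `compound` arcs. It works on plain Python lists rather than on a spaCy `Doc`; with spaCy these lists could presumably be built as `[t.dep_ for t in doc]`, `[t.head.i for t in doc]` and `[t.ent_type_ for t in doc]`. The function name and the example parse are hypothetical, and this is only one possible reading of the task, not a reference solution.

```python
def extend_entities_over_compounds(deps, heads, ent_types):
    """Propagate entity types along `compound` arcs.

    deps      -- dependency label of each token
    heads     -- index of each token's syntactic head
    ent_types -- entity type of each token ('' when outside any entity)
    """
    types = list(ent_types)
    changed = True
    while changed:  # repeat until stable, so nested compounds are covered
        changed = False
        for i, (dep, head) in enumerate(zip(deps, heads)):
            if dep != 'compound':
                continue
            if not types[i] and types[head]:
                types[i] = types[head]  # modifier inherits from its head
                changed = True
            elif types[i] and not types[head]:
                types[head] = types[i]  # head inherits from its modifier
                changed = True
    return types


# Hand-written (hypothetical) parse of "Reuters news agency reported"
words = ['Reuters', 'news', 'agency', 'reported']
deps = ['compound', 'compound', 'nsubj', 'ROOT']
heads = [2, 2, 3, 3]
ent_types = ['ORG', '', '', '']

extended = extend_entities_over_compounds(deps, heads, ent_types)
print(list(zip(words, extended)))
```

The sketch only relabels tokens; turning the result back into contiguous spans with IOB tags is left out, and a real parse may attach compounds in ways that need extra checks.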

# CoNLL Data
From https://www.clips.uantwerpen.be/conll2003/ner/

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. 

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
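
The format described above can be made concrete with a small sketch: one function splits the four-column text into sentences, another turns a tag sequence into entity spans. Both function names and the sample text are illustrative assumptions, and this is not the `conll` module used in this notebook. The span extractor treats an `I-TYPE` tag that follows `O` or a different type as the start of a phrase, as the description implies, and also accepts an explicit `B-TYPE`.

```python
def read_conll(text):
    """Split CoNLL-style text into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(tags):
    """Return (type, start, end) spans with an exclusive end index."""
    spans, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a final span
        prefix, _, etype = tag.partition('-')
        starts_new = prefix == 'B' or (prefix == 'I' and etype != current)
        if current is not None and (tag == 'O' or starts_new):
            spans.append((current, start, i))
            current = None
        if starts_new:
            start, current = i, etype
    return spans


# Small illustrative sample in the four-column format
SAMPLE = """EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(SAMPLE)
for sent in sentences:
    print(entity_spans([ner for _, _, _, ner in sent]))
```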


In [1]:
import conll
import my_ner

# token, POS tag, syntactic chunk tag (IOB), entity tag (IOB)
test = conll.read_corpus_conll('dataset/test.txt')
test = [my_ner.Sentence(sent) for sent in test if '-DOCSTART-' not in sent[0][0]]
# test = test[:100]

# count_ = [len(sent) for sent in test]
# print(sum(count_))

# 0. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total

Conversion from OntoNotes tags (as produced by spaCy) to the CoNLL format

In [2]:
# conversion of tags from Ontonotes (spacy) to CoNLL format
def from_spacy_to_conll(predictions_spacy):
    switcher = {
                ' ': '',
                '': '',
                'ORG': '-ORG',
                'PER': '-PER',
                'LOC': '-LOC',
                'PERSON': '-PER',
                'GPE': '-LOC'
            }
    
    # LOC, PER, ORG, MISC
    predictions = []
    
    for sent in predictions_spacy:
        new = []
        
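        for token in sent:
            # merge the IOB prefix with the mapped entity type; tokens outside any entity
            # have an empty ent_type_ (plain 'O'), and unmapped spaCy types fall back to MISC
            new.append((token.text, token.ent_iob_ + switcher.get(token.ent_type_, '-MISC')))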
        
        predictions.append(new)
        
    return predictions

## Custom tokenizer
Define a custom tokenizer for spaCy; otherwise spaCy tokenizes the text differently from the CoNLL dataset. The predicted tokens would then not align with the gold tokens, making it impossible to compute the accuracy.

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')

# def tokenizer_(sent):
#     return spacy.tokens.Doc(nlp.vocab, sent.split())

nlp.tokenizer = lambda sent: spacy.tokens.Doc(nlp.vocab, sent.split())

In [4]:
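%time
# NOTE: a bare `%time` times an empty statement, so the timings printed below do not
# measure this cell; the cell magic `%%time` (as the first line) would time the whole cell.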

# spacy predictions
predictions = [nlp(str(sent)) for sent in test]

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.25 µs


In [5]:
%time
# NOTE: a bare `%time` does not time the cell; `%%time` as the first line would.

# convert to lists of (token, tag) tuples (NLTK-style) so that conll.evaluate can be used
predictions = from_spacy_to_conll(predictions)

predictions[6]

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.82 µs


[('Oleg', 'B-PER'),
 ('Shatskiku', 'I-PER'),
 ('made', 'O'),
 ('sure', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('win', 'O'),
 ('in', 'O'),
 ('injury', 'O'),
 ('time', 'O'),
 (',', 'O'),
 ('hitting', 'O'),
 ('an', 'O'),
 ('unstoppable', 'O'),
 ('left', 'O'),
 ('foot', 'O'),
 ('shot', 'O'),
 ('from', 'O'),
 ('just', 'O'),
 ('outside', 'O'),
 ('the', 'O'),
 ('area', 'O'),
 ('.', 'O')]

## Token-level accuracy, total and per-class

In [6]:
# organize the test data as (token, tag) tuples
test_set = [[(ent.text, ent.ent_tag) for ent in sent.ents] for sent in test]

test_set[6]

[('Oleg', 'B-PER'),
 ('Shatskiku', 'I-PER'),
 ('made', 'O'),
 ('sure', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('win', 'O'),
 ('in', 'O'),
 ('injury', 'O'),
 ('time', 'O'),
 (',', 'O'),
 ('hitting', 'O'),
 ('an', 'O'),
 ('unstoppable', 'O'),
 ('left', 'O'),
 ('foot', 'O'),
 ('shot', 'O'),
 ('from', 'O'),
 ('just', 'O'),
 ('outside', 'O'),
 ('the', 'O'),
 ('area', 'O'),
 ('.', 'O')]

### Total accuracy

In [7]:
def total_accuracy(predictions_labels, labels):
    # both sequences must be aligned token by token
    if len(predictions_labels) != len(labels):
        raise ValueError('Prediction labels and test labels have different lengths')

    # count the positions where the predicted tag equals the gold tag
    correct = sum(pred == gold for pred, gold in zip(predictions_labels, labels))

    return correct / len(labels)

pred_labels = [ent[1] for sent in predictions for ent in sent]
test_labels = [ent[1] for sent in test_set for ent in sent]

total_accuracy(pred_labels, test_labels)

0.8109184882093249

### Per-class accuracy
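
One possible way to compute per-class token-level accuracy is sketched below: for every gold tag, the fraction of tokens carrying that tag that received the same predicted tag. The function name `per_class_accuracy` is hypothetical, and grouping by the full IOB tag (rather than by entity type) is an assumption about what "per class" means here.

```python
from collections import Counter


def per_class_accuracy(predicted, gold):
    """Map each gold tag to the share of its tokens that were predicted correctly."""
    if len(predicted) != len(gold):
        raise ValueError('Predicted and gold labels have different lengths')

    total = Counter(gold)  # number of tokens per gold tag
    correct = Counter(g for p, g in zip(predicted, gold) if p == g)

    return {tag: correct[tag] / total[tag] for tag in sorted(total)}


# Toy example with made-up labels
example = per_class_accuracy(
    ['B-PER', 'O', 'O', 'B-LOC'],
    ['B-PER', 'I-PER', 'O', 'B-ORG'],
)
print(example)
```

With the flat lists built in the previous cell, this would be called as `per_class_accuracy(pred_labels, test_labels)`.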

### Chunk-level performance

In [10]:
import pandas as pd

results = conll.evaluate(test_set, predictions)

pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

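| class | precision (p) | recall (r) | F-measure (f) | support (s) |
|-------|---------------|------------|---------------|-------------|
| PER   | 0.761         | 0.59       | 0.665         | 1617        |
| LOC   | 0.766         | 0.695      | 0.729         | 1668        |
| MISC  | 0.105         | 0.55       | 0.177         | 702         |
| ORG   | 0.448         | 0.272      | 0.339         | 1661        |
| total | 0.397         | 0.523      | 0.451         | 5648        |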
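
For reference, chunk-level precision, recall and F-measure are conventionally computed over whole entity spans: a predicted chunk counts as correct only if both its boundaries and its type match a gold chunk. The sketch below shows that computation on sets of `(type, start, end)` spans. The function name `chunk_prf` is hypothetical, and this is not claimed to be how `conll.evaluate` is implemented.

```python
def chunk_prf(gold_spans, pred_spans):
    """Precision, recall and F1 over exact-match (type, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)  # spans matching in type and boundaries

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: one of three predicted chunks matches one of two gold chunks
gold = [('PER', 0, 2), ('LOC', 5, 6)]
pred = [('PER', 0, 2), ('ORG', 5, 6), ('MISC', 8, 9)]
p, r, f = chunk_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
```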
