# Assignment

This assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
    - report CoNLL chunk-level performance (per class and total)
        - precision, recall and F-measure of correctly recognizing whole named entities (chunks), per class and total

1. Grouping of Entities.
Write a function to group recognized named entities using the `noun_chunks` property of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks) `Doc` objects. Analyze the groups in terms of the most frequent combinations (i.e. NER types that occur together).

2. One possible post-processing step is to fix segmentation errors.
Write a function that extends an entity span to cover the full noun compound. Make use of the `compound` dependency relation.
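
As a rough illustration of the idea behind task 2, the sketch below works on plain Python lists rather than on spaCy objects. It assumes that `heads[i]` holds the index of token `i`'s syntactic head and `deps[i]` its dependency label, as could be read from spaCy's `token.head.i` and `token.dep_`. The function name `extend_entity_span` and the example parse are hypothetical and hand-written, not parser output.

```python
def extend_entity_span(start, end, heads, deps):
    """Grow the token span [start, end) until it covers every token linked
    to it through a `compound` dependency relation (in either direction)."""
    covered = set(range(start, end))
    changed = True
    while changed:
        changed = False
        for i, (head, dep) in enumerate(zip(heads, deps)):
            if dep != "compound":
                continue
            # a compound modifier and its head belong to the same noun compound,
            # so if exactly one of them is covered, pull in the other one
            if (i in covered) != (head in covered):
                covered.update((i, head))
                changed = True
    return min(covered), max(covered) + 1


# hand-written example parse (an assumption, not real parser output)
words = ["He", "visited", "San", "Francisco", "Bay", "."]
heads = [1, 1, 3, 4, 1, 1]
deps = ["nsubj", "ROOT", "compound", "compound", "dobj", "punct"]

# suppose the recognizer only tagged "Francisco", i.e. tokens [3, 4)
extended = extend_entity_span(3, 4, heads, deps)
print(words[extended[0]:extended[1]])
```

The loop runs to a fixed point so that chains of compounds (`San` → `Francisco` → `Bay`) are followed in both directions. Returning the minimum and maximum index keeps the result a contiguous span.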

## CoNLL Data
From https://www.clips.uantwerpen.be/conll2003/ner/

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.

Note that in the sample output shown in this notebook the first token of an entity carries a B-TYPE tag (for example `Oleg B-PER`), so the provided files appear to mark every phrase start with B-TYPE rather than only the ambiguous ones.
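
To make the file layout concrete, here is a minimal sketch of a reader for the four-column format described above. It is independent of the `conll.read_corpus_conll` helper used in this notebook, whose implementation is not shown. The function name `parse_conll` is hypothetical, and the POS and chunk tags in the sample are illustrative only.

```python
def parse_conll(text):
    """Split CoNLL-2003 formatted text into sentences,
    each a list of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # document boundary marker, not a real token
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        # the last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences


# illustrative sample; the POS and chunk columns are made up for the example
sample = """Oleg NNP I-NP B-PER
Shatskiku NNP I-NP I-PER
made VBD I-VP O

. . O O
"""

sents = parse_conll(sample)
print(len(sents), sents[0][0])
```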


In [1]:
import conll
import my_ner

# token, POS tag, syntactic chunk tag (IOB), entity tag (IOB)
test = conll.read_corpus_conll('dataset/test.txt')
test = [my_ner.Sentence(sent) for sent in test]

In [2]:
test[7]

Oleg Shatskiku made sure of the win in injury time , hitting an unstoppable left foot shot from just outside the area .

In [3]:
for ent in test[7].ents:
    print('{}\t\t{}'.format(ent.text, ent.ent_tag))

Oleg		B-PER
Shatskiku		I-PER
made		O
sure		O
of		O
the		O
win		O
in		O
injury		O
time		O
,		O
hitting		O
an		O
unstoppable		O
left		O
foot		O
shot		O
from		O
just		O
outside		O
the		O
area		O
.		O


In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')
txt = str(test[7])
doc = nlp(txt)

print([ent.text for ent in doc.ents])
print([(t.text, t.ent_type_, t.ent_iob_) for t in doc])
spacy.displacy.render(doc, style="ent")

['Oleg Shatskiku']
[('Oleg', 'PERSON', 'B'), ('Shatskiku', 'PERSON', 'I'), ('made', '', 'O'), ('sure', '', 'O'), ('of', '', 'O'), ('the', '', 'O'), ('win', '', 'O'), ('in', '', 'O'), ('injury', '', 'O'), ('time', '', 'O'), (',', '', 'O'), ('hitting', '', 'O'), ('an', '', 'O'), ('unstoppable', '', 'O'), ('left', '', 'O'), ('foot', '', 'O'), ('shot', '', 'O'), ('from', '', 'O'), ('just', '', 'O'), ('outside', '', 'O'), ('the', '', 'O'), ('area', '', 'O'), ('.', '', 'O')]


## 0. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
- report CoNLL chunk-level performance (per class and total)
    - precision, recall and F-measure of correctly recognizing whole named entities (chunks), per class and total
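
One possible reading of the token-level measure is sketched below. It assumes that gold and predicted sentences are lists of `(token, tag)` pairs with identical tokenization, and that "per class" means the share of tokens with a given gold tag that received exactly that tag. Both the function name `token_accuracy` and this interpretation are assumptions rather than part of the assignment text.

```python
from collections import Counter


def token_accuracy(gold_sents, pred_sents):
    """Return (per-tag accuracy dict, overall accuracy) for aligned
    sentences given as lists of (token, tag) pairs."""
    correct, total = Counter(), Counter()
    for gold, pred in zip(gold_sents, pred_sents):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences must be token-aligned")
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            total[gold_tag] += 1
            if gold_tag == pred_tag:
                correct[gold_tag] += 1
    per_class = {tag: correct[tag] / total[tag] for tag in total}
    n_tokens = sum(total.values())
    overall = sum(correct.values()) / n_tokens if n_tokens else 0.0
    return per_class, overall


# toy data: the second token of the person name is missed
gold = [[("Oleg", "B-PER"), ("Shatskiku", "I-PER"), ("made", "O")]]
pred = [[("Oleg", "B-PER"), ("Shatskiku", "O"), ("made", "O")]]

per_class, overall = token_accuracy(gold, pred)
print(per_class, overall)
```

The explicit length check matters here because spaCy may tokenize the raw sentence string differently from the CoNLL tokens, in which case tag sequences cannot be compared position by position.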

In [21]:
# spacy predictions
predictions_spacy = [nlp(str(sent)) for sent in test[:50]]
print([(t.text, t.ent_iob_, t.ent_type_) for t in predictions_spacy[7]])

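# convert to NLTK format, i.e. a list of (token, tag) pairs per sentence,
# so that conll.evaluate can be used
predictions = []
for sent in predictions_spacy:
    new = []
    for tok in sent:
        # ent_iob_ is 'B', 'I' or 'O'; append the entity type when there is one
        tag = tok.ent_iob_
        if tok.ent_type_ != '':
            tag += '-' + tok.ent_type_
        new.append((tok.text, tag))
    predictions.append(new)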

[('Oleg', 'B', 'PERSON'), ('Shatskiku', 'I', 'PERSON'), ('made', 'O', ''), ('sure', 'O', ''), ('of', 'O', ''), ('the', 'O', ''), ('win', 'O', ''), ('in', 'O', ''), ('injury', 'O', ''), ('time', 'O', ''), (',', 'O', ''), ('hitting', 'O', ''), ('an', 'O', ''), ('unstoppable', 'O', ''), ('left', 'O', ''), ('foot', 'O', ''), ('shot', 'O', ''), ('from', 'O', ''), ('just', 'O', ''), ('outside', 'O', ''), ('the', 'O', ''), ('area', 'O', ''), ('.', 'O', '')]
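
The output above shows spaCy labelling the name as `PERSON`, while the CoNLL gold tags use `PER`. Before the two can be compared, the label sets have to be reconciled. The sketch below shows one possible mapping onto the four CoNLL types. The particular assignments (for instance `GPE` to `LOC`, `NORP` to `MISC`, numeric and temporal types dropped) are assumptions made for illustration, and the names `SPACY_TO_CONLL` and `to_conll_tag` are hypothetical.

```python
# assumed mapping from spaCy (OntoNotes-style) labels to CoNLL-2003 types;
# labels that are absent here (DATE, MONEY, CARDINAL, ...) are treated as O
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "MISC",
}


def to_conll_tag(iob, ent_type):
    """Build a CoNLL-style tag such as 'B-PER' from spaCy's IOB flag and type."""
    conll_type = SPACY_TO_CONLL.get(ent_type)
    if iob == "O" or conll_type is None:
        return "O"
    return iob + "-" + conll_type


print(to_conll_tag("B", "PERSON"), to_conll_tag("I", "GPE"), to_conll_tag("B", "DATE"))
```

Because every token of one spaCy entity shares the same type, dropping an unmapped type turns the whole entity into `O` and leaves no dangling `I-` tags.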


In [13]:
test_set = [[(ent.text, ent.ent_tag) for ent in sent.ents] for sent in test[:50]]
test_set[7]

[('Oleg', 'B-PER'),
 ('Shatskiku', 'I-PER'),
 ('made', 'O'),
 ('sure', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('win', 'O'),
 ('in', 'O'),
 ('injury', 'O'),
 ('time', 'O'),
 (',', 'O'),
 ('hitting', 'O'),
 ('an', 'O'),
 ('unstoppable', 'O'),
 ('left', 'O'),
 ('foot', 'O'),
 ('shot', 'O'),
 ('from', 'O'),
 ('just', 'O'),
 ('outside', 'O'),
 ('the', 'O'),
 ('area', 'O'),
 ('.', 'O')]
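
With gold and predicted sentences in this `(token, tag)` form, chunk-level precision, recall and F-measure can be computed by comparing whole entity spans. The sketch below is a plain-Python illustration under the assumption that both sides are token-aligned. It is not the implementation of `conll.evaluate`, which is not shown here, and the names `extract_chunks` and `chunk_scores` are hypothetical.

```python
from collections import Counter


def extract_chunks(tags):
    """Return the set of (start, end, type) chunks in an IOB tag sequence (end exclusive)."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        prefix, _, etype = tag.partition("-")
        # an open chunk ends at O, at B-, or when the entity type changes
        if start is not None and (prefix != "I" or etype != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # a chunk starts at B-, or at an I- that does not continue an open chunk
        if prefix in ("B", "I") and start is None:
            start, ctype = i, etype
    return chunks


def prf(n_correct, n_pred, n_gold):
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def chunk_scores(gold_sents, pred_sents):
    """Per-type and total (precision, recall, F1) over exactly matching chunks."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_sents, pred_sents):
        gold_chunks = extract_chunks([tag for _, tag in gold])
        pred_chunks = extract_chunks([tag for _, tag in pred])
        n_gold.update(t for _, _, t in gold_chunks)
        n_pred.update(t for _, _, t in pred_chunks)
        # a chunk counts as correct only if boundaries and type both match
        tp.update(t for _, _, t in gold_chunks & pred_chunks)
    scores = {t: prf(tp[t], n_pred[t], n_gold[t]) for t in set(n_gold) | set(n_pred)}
    scores["total"] = prf(sum(tp.values()), sum(n_pred.values()), sum(n_gold.values()))
    return scores


# toy data: the PER entity is cut short, the ORG entity matches exactly
gold = [[("Oleg", "B-PER"), ("Shatskiku", "I-PER"), ("made", "O")],
        [("Ajax", "B-ORG"), ("won", "O")]]
pred = [[("Oleg", "B-PER"), ("Shatskiku", "O"), ("made", "O")],
        [("Ajax", "B-ORG"), ("won", "O")]]

scores = chunk_scores(gold, pred)
print(scores)
```

In the toy data the truncated person name earns no credit at chunk level even though one of its two tokens is tagged correctly, which is the main difference from the token-level measure.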