# Assignment

The assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function that groups recognized named entities using the `noun_chunks` property of a [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks) `Doc`. Analyze the groups in terms of the most frequent combinations (i.e. NER types that occur together).

2. One possible post-processing step is to fix segmentation errors.
Write a function that extends each entity span to cover the full noun compound. Make use of the `compound` dependency relation.
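
Tasks 1 and 2 could be approached roughly as follows. This is only a sketch under stated assumptions, not a reference solution: the function names `group_entities`, `most_frequent_combinations` and `extend_over_compounds` are made up for illustration, and the code assumes a parsed spaCy `Doc`, relying only on `doc.ents`, `doc.noun_chunks`, the span attributes `start`, `end`, `label_` and the token attributes `i`, `dep_`, `head`.

```python
from collections import Counter


def group_entities(doc):
    """Group entity labels by the noun chunk that contains the entity.

    An entity that lies in no noun chunk forms a group of its own.
    """
    chunks = [(chunk.start, chunk.end) for chunk in doc.noun_chunks]
    groups = {}
    for ent in doc.ents:
        # Key: span of the containing chunk, else the entity's own span.
        key = next(
            ((s, e) for s, e in chunks if s <= ent.start and ent.end <= e),
            (ent.start, ent.end),
        )
        groups.setdefault(key, []).append(ent.label_)
    return [groups[key] for key in sorted(groups)]


def most_frequent_combinations(docs, n=10):
    """Count label combinations (in document order) over many docs."""
    counts = Counter(tuple(g) for doc in docs for g in group_entities(doc))
    return counts.most_common(n)


def extend_over_compounds(doc):
    """Return (start, end, label) per entity, widened over `compound` arcs.

    One step only: compound dependents of entity tokens and the heads
    that entity tokens modify are added. Overlaps between the widened
    entities are not resolved here.
    """
    extended = []
    for ent in doc.ents:
        start, end = ent.start, ent.end
        for token in doc:
            if token.dep_ != "compound":
                continue
            if ent.start <= token.head.i < ent.end:
                # token modifies a word inside the entity
                start, end = min(start, token.i), max(end, token.i + 1)
            if ent.start <= token.i < ent.end:
                # entity word modifies a head, possibly outside the entity
                start, end = min(start, token.head.i), max(end, token.head.i + 1)
        extended.append((start, end, ent.label_))
    return extended
```

With a loaded pipeline, these would be called as `group_entities(nlp(text))` and `extend_over_compounds(nlp(text))`. The returned triples could then be turned back into spans, for example with `Span(doc, start, end, label=label)`, once any overlapping spans have been resolved.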

## CoNLL Data
From https://www.clips.uantwerpen.be/conll2003/ner/

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. 

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
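
To make the file layout and the tagging scheme concrete, here is a minimal sketch of a reader and a span extractor. It is not the `conll.read_corpus_conll` helper used in this notebook; `read_conll` and `iob_spans` are hypothetical names, and the sample sentence is included only for illustration.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NE tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group four-column lines into sentences, splitting on empty lines."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


def iob_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (type, start, end) spans, end exclusive, from IOB tags.

    A span starts at a B- tag, or at an I- tag that follows O or a
    different type, which covers the I-TYPE/B-TYPE scheme described above.
    """
    spans, start, kind = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if start is not None and not continues:
            spans.append((kind, start, i))
            start, kind = None, None
        if prefix in ("B", "I") and start is None:
            start, kind = i, label
    return spans


SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentence = read_conll(SAMPLE.splitlines())[0]
ner_tags = [ner for _, _, _, ner in sentence]
print(iob_spans(ner_tags))  # [('ORG', 0, 1), ('PER', 2, 3), ('LOC', 5, 6)]
```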


In [1]:
import conll
import my_ner

# token, POS tag, syntactic chunk tag (IOB), entity tag (IOB)
raw_test = [sent for sent in conll.read_corpus_conll('dataset/test.txt') if '-DOCSTART-' not in sent[0][0]]
test = [my_ner.Sentence(sent) for sent in raw_test]

count_ = [len(sent) for sent in test]
print(sum(count_))

46435


In [2]:
# for ent in test[6].ents:
#     print('{}\t\t{}'.format(ent.text, ent.ent_tag))

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')
# txt = str(test[6])
# doc = nlp(txt)

# print([ent.text for ent in doc.ents])
# print([(t.text, t.ent_type_, t.ent_iob_) for t in doc])
# spacy.displacy.render(doc, style="ent")

## 0. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  
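
The two kinds of scores can be sketched in plain Python as below. This is an illustration, not the `conll.evaluate` function used later, and it rests on a few assumptions: gold and predicted sequences are token-aligned, per-class token accuracy is read as the share of gold tokens of each tag that were predicted correctly, a chunk counts as correct only if type and both boundaries match, and the `total` row is micro-averaged. The names `token_accuracy` and `chunk_prf` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]  # (entity type, start, end), end exclusive
PRF = Tuple[float, float, float]  # (precision, recall, F1)


def token_accuracy(gold: List[List[str]], pred: List[List[str]]) -> Dict[str, float]:
    """Per-tag and total accuracy over aligned gold/predicted tag sequences."""
    correct, total = Counter(), Counter()
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must be token-aligned")
        for g, p in zip(gold_sent, pred_sent):
            for key in (g, "total"):
                total[key] += 1
                correct[key] += g == p
    return {key: correct[key] / total[key] for key in total}


def chunk_prf(gold: List[List[Span]], pred: List[List[Span]]) -> Dict[str, PRF]:
    """Per-type and micro-averaged total precision, recall and F1 over chunks."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for gold_sent, pred_sent in zip(gold, pred):
        gold_set, pred_set = set(gold_sent), set(pred_sent)
        # Exact matches (same type and boundaries) are the true positives.
        for counter, spans in ((n_gold, gold_set), (n_pred, pred_set), (tp, gold_set & pred_set)):
            for kind, _, _ in spans:
                counter[kind] += 1
                counter["total"] += 1
    scores = {}
    for kind in set(n_gold) | set(n_pred):
        p = tp[kind] / n_pred[kind] if n_pred[kind] else 0.0
        r = tp[kind] / n_gold[kind] if n_gold[kind] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[kind] = (p, r, f)
    return scores
```

Here each sentence is a list of tags for `token_accuracy` and a list of `(type, start, end)` spans for `chunk_prf`; how the spans are derived from the tags is left to the caller.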

In [4]:
# convert spaCy (OntoNotes) entity labels to CoNLL-2003 tags
def from_spacy_to_conll(predictions_spacy):
    # entity types that map to PER/ORG/LOC; any other non-empty type becomes MISC
    switcher = {
                ' ': '',
                '': '',
                'ORG': '-ORG',
                'PER': '-PER',
                'LOC': '-LOC',
                'PERSON': '-PER',
                'ORGANIZATION': '-ORG',
                'LOCATION': '-LOC',
                'GPE': '-LOC'
            }

    # LOC, PER, ORG, MISC
    predictions = []

    for doc in predictions_spacy:
        new = []

        for token in doc:
            # merge IOB prefix and entity type, e.g. 'B' + '-PER' -> 'B-PER';
            # tokens outside any entity have an empty type and stay 'O'
            new.append((token.text, token.ent_iob_ + switcher.get(token.ent_type_, '-MISC')))

        predictions.append(new)

    return predictions


# spacy predictions
predictions_spacy = [nlp(str(sent)) for sent in test[:100]]
# print([(t.text, t.ent_iob_, t.ent_type_) for t in predictions_spacy[7]])

# convert to NLTK format so that conll.evaluate can be used
predictions = from_spacy_to_conll(predictions_spacy)

In [5]:
# organize test data in tuples (entity, tag)
test_set = [[(ent.text, ent.ent_tag) for ent in sent.ents] for sent in test[:100]]

In [6]:
# count_test = [len(sent) for sent in test_set]
# sum(count_test)

In [7]:
# count_pred = [len(sent) for sent in predictions]
# sum(count_pred)

In [8]:
# results = conll.evaluate(test_set, predictions)