# Assignment

The assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

# CoNLL Data
From https://www.clips.uantwerpen.be/conll2003/ner/

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. 

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
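To make the four-column layout concrete, here is a minimal parsing sketch. It is independent of the `conll` helper module used in this notebook (whose internals are not shown), and the sample text is hand-written for illustration, not taken from the provided files.

```python
# Hand-written sample in the four-column layout: word, POS, chunk tag, NE tag.
SAMPLE = """\
John NNP I-NP I-PER
lives VBZ I-VP O
in IN I-PP O
New NNP I-NP I-LOC
York NNP I-NP I-LOC
. . O O

He PRP I-NP O
works VBZ I-VP O
"""


def parse_conll(text):
    """Split CoNLL-style text into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split(" ")
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
print(len(sentences))     # number of sentences found
print(sentences[0][3])    # fourth token of the first sentence
```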


In [1]:
import conll
import my_tokens
import spacy
from multiprocessing import Pool
from sklearn.metrics import classification_report
import pandas as pd
from collections import defaultdict

# token, POS tag, syntactic chunk tag (IOB), entity tag (IOB)
test = conll.read_corpus_conll('dataset/test.txt')
test = [my_tokens.Sentence(sent) for sent in test
        if '-DOCSTART-' not in sent[0][0]]
# test = test[:100]

# 0. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total

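spaCy predicts OntoNotes entity labels, so they have to be converted to the four CoNLL types before evaluation. Labels without an entry in the mapping below (for example `TIME` or `CARDINAL`) are converted to `O`.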

In [2]:
# conversion of tags from Ontonotes (spacy) to CoNLL format
def from_spacy_to_conll(predictions_spacy):
    switcher = {
                'PERSON': 'PER',
                'NORP': 'MISC',
                'LOC': 'LOC',
                'FAC': 'LOC',
                'GPE': 'LOC',
                'ORG': 'ORG',
                'PRODUCT': 'MISC',
                'EVENT': 'MISC',
                'LANGUAGE': 'MISC'
            }

    # LOC, PER, ORG, MISC
    predictions = []

    for sent in predictions_spacy:
        new = []

        for token in sent:
            # merge iob and entity type
            tag = switcher.get(token.ent_type_, 'O')

            if tag != 'O':
                tag = token.ent_iob_ + '-' + tag

            new.append((token.text, tag))

        predictions.append(new)

    return predictions
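The effect of the mapping on single tokens can be checked without loading a spaCy model. The sketch below is an illustration only: `to_conll_tag` is a hypothetical per-token helper that applies the same mapping, and the `SimpleNamespace` objects are stand-ins for spaCy tokens.

```python
from types import SimpleNamespace

# Same label mapping as in from_spacy_to_conll.
ONTONOTES_TO_CONLL = {
    'PERSON': 'PER', 'NORP': 'MISC', 'LOC': 'LOC', 'FAC': 'LOC',
    'GPE': 'LOC', 'ORG': 'ORG', 'PRODUCT': 'MISC', 'EVENT': 'MISC',
    'LANGUAGE': 'MISC',
}


def to_conll_tag(token):
    """Map one token (anything with ent_iob_ and ent_type_) to a CoNLL tag."""
    tag = ONTONOTES_TO_CONLL.get(token.ent_type_, 'O')
    return tag if tag == 'O' else token.ent_iob_ + '-' + tag


# Stand-ins for spaCy tokens (hypothetical, for illustration only).
fake_tokens = [
    SimpleNamespace(text='China', ent_iob_='B', ent_type_='GPE'),
    SimpleNamespace(text='minute', ent_iob_='I', ent_type_='TIME'),
    SimpleNamespace(text='controlled', ent_iob_='O', ent_type_=''),
]
tags = [(t.text, to_conll_tag(t)) for t in fake_tokens]
print(tags)
```

Note that the `TIME` token ends up as `O`: any OntoNotes type missing from the mapping is treated as a non-entity in the CoNLL evaluation.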

## spaCy predictions with custom tokenizer
Define a custom tokenizer for spaCy; otherwise spaCy tokenizes the text differently from how the CoNLL dataset has been tokenized. The predicted tokens would then not align with the gold tokens, making it impossible to compute the accuracy.
Because the prediction for each sentence is independent of the other sentences, I used `Pool` to parallelize the task across multiple processes and speed up the computation. However, for convenience I used a test set of only 100 samples during development, and in that case it is faster not to use multiprocessing.
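The alignment requirement can be verified explicitly before computing any score. The following is a minimal sketch, assuming gold and predicted sentences are lists of `(token, tag)` tuples; `check_alignment` is a hypothetical helper, and the re-split example is invented to show what a mismatch looks like, not a claim about how spaCy splits that particular word.

```python
def check_alignment(gold_sents, pred_sents):
    """Return the indices of sentences whose token sequences differ."""
    mismatched = []
    for i, (gold, pred) in enumerate(zip(gold_sents, pred_sents)):
        if [tok for tok, _ in gold] != [tok for tok, _ in pred]:
            mismatched.append(i)
    return mismatched


gold = [[('U.S.-based', 'B-MISC'), ('firm', 'O')]]
# Same tokens, different tag: still aligned, so it can be scored.
aligned = [[('U.S.-based', 'B-LOC'), ('firm', 'O')]]
# Hypothetical re-tokenization that splits on punctuation: no longer aligned.
resplit = [[('U.S.', 'B-LOC'), ('-', 'O'), ('based', 'O'), ('firm', 'O')]]

print(check_alignment(gold, aligned))
print(check_alignment(gold, resplit))
```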

In [3]:
# %%time

nlp = spacy.load('en_core_web_sm')


def tokenizer_(sent):
    return spacy.tokens.Doc(nlp.vocab, sent.split())


nlp.tokenizer = tokenizer_

if len(test) > 100:
    with Pool(4) as p:
        # spacy predictions, multi-process
        predictions_spacy = p.map(nlp, [str(sent) for sent in test])
else:
    # spacy predictions
    predictions_spacy = [nlp(str(sent)) for sent in test]

[(token.text, token.ent_iob_, token.ent_type_) for token in predictions_spacy[5][0:3]]

[('China', 'B', 'GPE'), ('controlled', 'O', ''), ('most', 'O', '')]

In [4]:
# convert to NLTK format so that conll.evaluate can be used
predictions = from_spacy_to_conll(predictions_spacy)

predictions[5][0:3]

[('China', 'B-LOC'), ('controlled', 'O'), ('most', 'O')]

## Token-level accuracy, total and per-class

In [5]:
# organize test data in tuples (token, tag)
test_set = [[(token.text, token.ent_tag) for token in sent.tokens] for sent in test]

test_set[5][0:3]

[('China', 'B-LOC'), ('controlled', 'O'), ('most', 'O')]

In [6]:
pred_labels = [token[1] for sent in predictions for token in sent]
test_labels = [token[1] for sent in test_set for token in sent]

per_class_metrics = classification_report(test_labels, pred_labels,
                                          output_dict=True)

pd_tbl_class = pd.DataFrame().from_dict(per_class_metrics).transpose()
pd_tbl_class.round(decimals=3)

| tag | precision | recall | f1-score | support |
|---|---|---|---|---|
| B-LOC | 0.77 | 0.711 | 0.739 | 1668.0 |
| B-MISC | 0.77 | 0.55 | 0.642 | 702.0 |
| B-ORG | 0.5 | 0.303 | 0.378 | 1661.0 |
| B-PER | 0.786 | 0.609 | 0.686 | 1617.0 |
| I-LOC | 0.577 | 0.658 | 0.615 | 257.0 |
| I-MISC | 0.593 | 0.338 | 0.431 | 216.0 |
| I-ORG | 0.42 | 0.52 | 0.464 | 835.0 |
| I-PER | 0.815 | 0.756 | 0.785 | 1156.0 |
| O | 0.949 | 0.981 | 0.964 | 38323.0 |
| accuracy | 0.909 | 0.909 | 0.909 | 0.909 |
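The `accuracy` row is the fraction of all tokens whose predicted tag equals the gold tag, so it is dominated by the very frequent `O` class. As a rough illustration of what the report computes, here is a dependency-free sketch on toy labels; `token_level_scores` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter


def token_level_scores(gold_tags, pred_tags):
    """Overall accuracy plus per-tag (precision, recall) from flat tag lists."""
    assert len(gold_tags) == len(pred_tags)
    gold_n = Counter(gold_tags)   # tag -> number of gold tokens
    pred_n = Counter(pred_tags)   # tag -> number of predicted tokens
    correct = Counter()           # tag -> tokens where gold == pred == tag
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            correct[g] += 1
    accuracy = sum(correct.values()) / len(gold_tags)
    per_tag = {}
    for tag in sorted(set(gold_tags) | set(pred_tags)):
        precision = correct[tag] / pred_n[tag] if pred_n[tag] else 0.0
        recall = correct[tag] / gold_n[tag] if gold_n[tag] else 0.0
        per_tag[tag] = (precision, recall)
    return accuracy, per_tag


# Toy labels, invented for illustration.
gold = ['B-LOC', 'O', 'O', 'B-PER', 'I-PER']
pred = ['B-LOC', 'O', 'B-ORG', 'B-PER', 'O']
accuracy, per_tag = token_level_scores(gold, pred)
print(accuracy)
print(per_tag)
```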


## Chunk-level performance (precision, recall, f-measure)

In [7]:
results = conll.evaluate(test_set, predictions)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| type | p | r | f | s |
|---|---|---|---|---|
| PER | 0.761 | 0.59 | 0.665 | 1617 |
| LOC | 0.76 | 0.702 | 0.73 | 1668 |
| ORG | 0.448 | 0.272 | 0.339 | 1661 |
| MISC | 0.758 | 0.541 | 0.632 | 702 |
| total | 0.687 | 0.524 | 0.594 | 5648 |
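Chunk-level scoring counts an entity as correct only when both its boundaries and its type match the gold annotation exactly. The sketch below shows this exact-match idea on toy data; it is an assumption about the usual CoNLL scoring scheme, not the code of `conll.evaluate`, and `extract_chunks` and `chunk_prf` are hypothetical names.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) spans in an IOB tag sequence."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):   # sentinel closes last chunk
        prefix, _, etype = tag.partition('-')
        # an open chunk ends at O, at a B- tag, or when the type changes
        if start is not None and (prefix in ('O', 'B') or etype != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # open a new chunk on B- or on an I- tag with no chunk open
        if prefix in ('B', 'I') and start is None:
            start, ctype = i, etype
    return chunks


def chunk_prf(gold_sents, pred_sents):
    """Exact-match precision, recall and F1 over all sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_chunks(gold), extract_chunks(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the PER entity is cut short, so only the LOC chunk matches.
gold = [['B-PER', 'I-PER', 'O', 'B-LOC']]
pred = [['B-PER', 'O', 'O', 'B-LOC']]
p, r, f = chunk_prf(gold, pred)
print(p, r, f)
```

A truncated entity counts as both a false positive and a false negative here, which is why chunk-level scores are typically lower than token-level ones.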


# 1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

In [8]:
def is_in_a_group(token, groups):
    for group in groups:
        for elem in group:
            if elem.i == token.i:
                return True
    return False


def insert_in_group(token, groups):
    # find the group containing the previous token in sentence order and
    # insert the token as a new single-token group right after it;
    # if no group contains the previous token, the token stays ungrouped
    for i, group in enumerate(groups):
        for elem in group:
            if elem.i == token.i - 1:
                groups.insert(i + 1, [token])
                return


def group_by_noun_chunk(sent_doc):
    groups = [[token for token in chunk if token.ent_type_ != '']
              for chunk in sent_doc.noun_chunks]
    groups = [group for group in groups if len(group) > 0]

    for token in sent_doc:
        # check if token already in a group
        if token.ent_type_ != '' and not is_in_a_group(token, groups):
            # insert the missing token in sentence order
            insert_in_group(token, groups)

    return groups


predictions_grouped = [group_by_noun_chunk(sent) for sent in predictions_spacy]
print(predictions_spacy[5])
predictions_grouped[5]

China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net . 


[[China], [the, 78th, minute], [Uzbek, Igor, Shkvyrin], [Chinese]]

Replace each token with its entity type (the conversion is done in place).

In [9]:
def to_tag_groups(grouped_sents):
    for i, sent in enumerate(grouped_sents):
        for j, group in enumerate(sent):
            for k, token in enumerate(group):
                grouped_sents[i][j][k] = token.ent_type_


to_tag_groups(predictions_grouped)
predictions_grouped[5]

[['GPE'], ['TIME', 'TIME', 'TIME'], ['NORP', 'PERSON', 'PERSON'], ['NORP']]
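The grouping idea can also be illustrated on plain lists, without a parser. The sketch below is a simplified variant, not the notebook's function: it assumes the noun chunks are given as `(start, end)` token offsets and, unlike `insert_in_group`, always keeps entity tokens that fall outside every chunk. The chunk span in the example is hypothetical.

```python
def group_entities(ent_types, chunk_spans):
    """Group entity labels by the noun chunk that contains their tokens.

    ent_types:   one entity label per token ('' when not an entity)
    chunk_spans: (start, end) token offsets of noun chunks, end exclusive
    Entity tokens outside every chunk become single-token groups.
    """
    groups, covered = [], set()
    for start, end in chunk_spans:
        members = [i for i in range(start, end) if ent_types[i]]
        if members:
            groups.append(members)
            covered.update(members)
    for i, label in enumerate(ent_types):
        if label and i not in covered:
            groups.append([i])
    # restore sentence order, then replace indices with labels
    groups.sort(key=lambda g: g[0])
    return [[ent_types[i] for i in g] for g in groups]


# tokens: Uzbek striker Igor Shkvyrin beat China
ent_types = ['NORP', '', 'PERSON', 'PERSON', '', 'GPE']
chunk_spans = [(0, 4)]   # hypothetical: only the first four tokens form a chunk
grouped = group_entities(ent_types, chunk_spans)
print(grouped)
```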

## Count frequencies of named entities combinations

In [10]:
def get_group_freqs(grouped_sents):
    # count how often each combination of entity types occurs in a group
    freqs = defaultdict(int)

    for sent in grouped_sents:
        for group in sent:
            freqs['-'.join(str(tag) for tag in group)] += 1

    return freqs


pd_freq_tbl = pd.DataFrame().from_dict(get_group_freqs(predictions_grouped),
                                       orient='index', columns=['Count'])
pd_freq_tbl.sort_values(by='Count', ascending=False)

| combination | Count |
|---|---|
| GPE | 1100 |
| DATE | 797 |
| ORG | 623 |
| PERSON-PERSON | 607 |
| CARDINAL | 472 |
| ... | ... |
| PRODUCT-PRODUCT | 1 |
| NORP-LOC-LOC | 1 |
| DATE-ORG | 1 |
| GPE-GPE-ORDINAL | 1 |
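The same counts can be obtained with `collections.Counter`, which also offers `most_common` for ranking. The sketch below uses a tiny hand-made input. The second count, which collapses consecutive repetitions of the same type (so `PERSON-PERSON` becomes `PERSON`), is an optional variant I am assuming could make type combinations easier to compare; it is not part of the analysis above.

```python
from collections import Counter
from itertools import groupby

# Hand-made tag groups in the same shape as predictions_grouped.
tag_groups = [
    [['GPE'], ['TIME', 'TIME', 'TIME'], ['NORP', 'PERSON', 'PERSON'], ['NORP']],
    [['GPE'], ['PERSON', 'PERSON']],
]

# one key per group, e.g. 'NORP-PERSON-PERSON'
freqs = Counter('-'.join(group) for sent in tag_groups for group in sent)

# variant: collapse runs of the same type, e.g. 'NORP-PERSON'
collapsed = Counter(
    '-'.join(tag for tag, _ in groupby(group))
    for sent in tag_groups for group in sent
)

print(freqs.most_common(1))
print(collapsed['NORP-PERSON'], collapsed['PERSON'])
```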


# 2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

In [11]:
def fix_sent_segmentation(sent, recursive=True):
    updated = True
    keep_expanding = True

    # if a token in an entity span has a compound child that is not in another
    # entity span, expand the span to include it
    while keep_expanding and updated:
        updated = False
        keep_expanding = recursive

        for span_i, _ in enumerate(sent.ents):
            for token in sent.ents[span_i]:
                for child in token.children:
                    # re-read the span on every iteration: if a span has
                    # been expanded, sent.ents holds the new span, whereas a
                    # loop variable from 'for span in sent.ents' would still
                    # point to the original one
                    span = sent.ents[span_i]

                    # the child is a compound of its parent and, since its
                    # IOB tag is O, it does not belong to any other span,
                    # so the parent's span has to be expanded to include
                    # the child
                    if child.dep_ == 'compound' and child.ent_iob_ == 'O':
                        # child adjacent and on the left of the span
                        if child.i == span[0].i - 1:
                            # find the span's boundaries to create a new span
                            updated = True

                            sent.set_ents(
                                [
                                    spacy.tokens.Span(
                                        doc=sent,
                                        start=child.i,
                                        end=span[-1].i + 1,
                                        label=span[0].ent_type_
                                    )
                                ],
                                default='unmodified'
                            )
                        # adjacent and on the right side
                        elif child.i == span[-1].i + 1:
                            updated = True

                            sent.set_ents(
                                [
                                    spacy.tokens.Span(
                                        doc=sent,
                                        start=span[0].i,
                                        end=child.i + 1,
                                        label=span[0].ent_type_
                                    )
                                ],
                                default='unmodified'
                            )


def fix_segmentation(spacy_sents, recursive=True):
    for pred in spacy_sents:
        fix_sent_segmentation(pred, recursive)


# before
spacy.displacy.render(predictions_spacy[5], style='ent')

fix_segmentation(predictions_spacy)

# after
spacy.displacy.render(predictions_spacy[5], style='ent')
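The expansion rule can be traced on plain lists, which makes the behaviour easier to inspect than the rendered output. The sketch below applies the same rule (an adjacent, untagged `compound` token whose head lies inside the span is absorbed) to a hypothetical parse; `expand_spans` and the dependency values are assumptions for illustration, not spaCy output.

```python
def expand_spans(spans, deps, heads):
    """Extend entity spans over adjacent untagged `compound` children.

    spans: list of [start, end, label] with end exclusive (modified in place)
    deps:  dependency label per token
    heads: head token index per token
    """
    tagged = {i for s, e, _ in spans for i in range(s, e)}
    changed = True
    while changed:                       # repeat until no span grows
        changed = False
        for span in spans:
            start, end, _ = span
            left, right = start - 1, end
            if (left >= 0 and left not in tagged
                    and deps[left] == 'compound'
                    and start <= heads[left] < end):
                span[0] = left           # absorb the left neighbour
                tagged.add(left)
                changed = True
            elif (right < len(deps) and right not in tagged
                    and deps[right] == 'compound'
                    and start <= heads[right] < end):
                span[1] = right + 1      # absorb the right neighbour
                tagged.add(right)
                changed = True
    return spans


# tokens: Uzbek striker Igor Shkvyrin scored   (hypothetical parse)
deps = ['amod', 'compound', 'compound', 'nsubj', 'ROOT']
heads = [3, 3, 3, 4, 4]
spans = [[0, 1, 'NORP'], [2, 4, 'PERSON']]
expanded = expand_spans(spans, deps, heads)
print(expanded)
```

In this toy parse the PERSON span grows to the left and swallows "striker", which shows how the rule can also lengthen a span that was already correct.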

## Test post-processing

In [12]:
predictions = from_spacy_to_conll(predictions_spacy)
pred_labels = [token[1] for sent in predictions for token in sent]

per_class_metrics = classification_report(test_labels, pred_labels,
                                          output_dict=True)

pd_tbl_class = pd.DataFrame().from_dict(per_class_metrics).transpose()
pd_tbl_class.round(decimals=3)

| tag | precision | recall | f1-score | support |
|---|---|---|---|---|
| B-LOC | 0.748 | 0.691 | 0.719 | 1668.0 |
| B-MISC | 0.77 | 0.55 | 0.642 | 702.0 |
| B-ORG | 0.497 | 0.302 | 0.375 | 1661.0 |
| B-PER | 0.657 | 0.509 | 0.574 | 1617.0 |
| I-LOC | 0.486 | 0.658 | 0.559 | 257.0 |
| I-MISC | 0.583 | 0.343 | 0.431 | 216.0 |
| I-ORG | 0.409 | 0.527 | 0.46 | 835.0 |
| I-PER | 0.666 | 0.763 | 0.711 | 1156.0 |
| O | 0.95 | 0.973 | 0.961 | 38323.0 |
| accuracy | 0.898 | 0.898 | 0.898 | 0.898 |


In [13]:
results = conll.evaluate(test_set, predictions)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| type | p | r | f | s |
|---|---|---|---|---|
| PER | 0.633 | 0.49 | 0.553 | 1617 |
| LOC | 0.738 | 0.682 | 0.709 | 1668 |
| ORG | 0.442 | 0.269 | 0.334 | 1661 |
| MISC | 0.756 | 0.54 | 0.63 | 702 |
| total | 0.64 | 0.488 | 0.554 | 5648 |
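To compare the chunk-level results before and after post-processing, the f-measure columns of the two chunk-level tables can be subtracted. The values below are copied by hand from those tables; the dictionary names are only illustrative.

```python
# Chunk-level f-measure, copied from the tables before and after the
# compound-based span expansion.
f_before = {'PER': 0.665, 'LOC': 0.730, 'ORG': 0.339, 'MISC': 0.632,
            'total': 0.594}
f_after = {'PER': 0.553, 'LOC': 0.709, 'ORG': 0.334, 'MISC': 0.630,
           'total': 0.554}

# negative values mean the score dropped after post-processing
delta = {k: round(f_after[k] - f_before[k], 3) for k in f_before}
for name, change in delta.items():
    print(f'{name:5s} {change:+.3f}')
```

With these numbers every class loses some f-measure after the expansion, and the largest drop is on PER.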
