# Assignment

The assignment lies at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.

# CoNLL Data
From https://www.clips.uantwerpen.be/conll2003/ner/

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. 

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
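
The column layout described above can be read with a few lines of plain Python. The following is only a minimal sketch: `parse_conll` is a hypothetical helper (it is not the `conll.read_corpus_conll` function used in this notebook), and the sample lines are illustrative rather than taken from the provided files.

```python
def parse_conll(text):
    """Parse CoNLL-2003 formatted text into a list of sentences.

    Each sentence is a list of (word, pos, chunk, ner) tuples.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # four whitespace-separated columns per token
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        # the last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences


# illustrative sample in the four-column format
sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

China NNP I-NP I-LOC
won VBD I-VP O
"""

parsed = parse_conll(sample)
print(len(parsed), parsed[0][0])
```

Each sentence comes back as a list of 4-tuples, so the named entity tag of a token is always the last element.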


In [1]:
import conll
import my_tokens
import spacy
from multiprocessing import Pool
from sklearn.metrics import classification_report
import pandas as pd
from collections import defaultdict

# token, POS tag, syntactic chunk tag (IOB), entity tag (IOB)
test = conll.read_corpus_conll('dataset/test.txt')
test = [my_tokens.Sentence(sent) for sent in test
        if '-DOCSTART-' not in sent[0][0]]
test = test[:100]

# 0. Evaluate spaCy NER on CoNLL 2003 data (provided)
- report token-level performance (per class and total)
    - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
- report CoNLL chunk-level performance (per class and total);
    - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total
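
To make the difference between the two levels concrete, here is a minimal sketch of chunk-level scoring on a single sentence. It assumes IOB2 tags (every entity starts with `B-`); `extract_chunks` and `chunk_prf` are hypothetical helpers written for illustration and are not the `conll.evaluate` implementation used later.

```python
def extract_chunks(tags):
    """Return the set of (type, start, end) entity chunks in a tag sequence."""
    chunks, start, ctype = set(), None, None
    # the trailing 'O' is a sentinel that closes a chunk ending the sentence
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, etype = tag.partition('-')
        # an open chunk ends at O, at B-, or when the entity type changes
        if start is not None and (prefix != 'I' or etype != ctype):
            chunks.add((ctype, start, i))
            start, ctype = None, None
        # a chunk starts at B-, or at an I- when no chunk is open
        if prefix in ('B', 'I') and start is None:
            start, ctype = i, etype
    return chunks


def chunk_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ['B-PER', 'I-PER', 'O', 'B-LOC']
pred = ['B-PER', 'O', 'O', 'B-LOC']

# token level: 3 of the 4 tags are correct
token_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
# chunk level: the truncated PER span counts as an error
p, r, f = chunk_prf(gold, pred)
print(token_acc, p, r, f)
```

In this toy sentence a single wrong tag costs one token at the token level, but it invalidates the whole `PER` chunk at the chunk level, which is why chunk-level scores are usually lower.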

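spaCy's English models predict OntoNotes entity types, so the predictions are converted to the four CoNLL types: `PERSON` maps to `PER`, `GPE` maps to `LOC`, `ORG` and `LOC` are kept, and every other type falls back to `MISC`.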

In [2]:
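# conversion of tags from OntoNotes (spaCy) to CoNLL format
def from_spacy_to_conll(predictions_spacy):
    # map spaCy entity types to CoNLL tag suffixes; tokens outside any
    # entity have an empty type, so they keep the bare 'O' tag
    switcher = {
        ' ': '',
        '': '',
        'ORG': '-ORG',
        'PER': '-PER',
        'LOC': '-LOC',
        'PERSON': '-PER',
        'GPE': '-LOC'
    }

    # every other entity type falls back to MISC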
    predictions = []

    for sent in predictions_spacy:
        new = []

        for token in sent:
            # merge iob and entity type
            new.append((token.text, token.ent_iob_ + switcher.get(
                                            token.ent_type_, '-MISC')))

        predictions.append(new)

    return predictions

## spaCy predictions with custom tokenizer
A custom tokenizer is defined for spaCy; otherwise spaCy would tokenize the text differently from the CoNLL dataset. The predicted tokens would then not align with the gold tokens, making it impossible to compute the accuracy.
Because the prediction for each sentence is independent of the other sentences, I used `Pool` to parallelize the task across multiple processes and speed up the computation. However, for convenience I used a test set of only 100 sentences during development, and in that case it is faster not to use multiprocessing.

In [3]:
# %%time

nlp = spacy.load('en_core_web_sm')


def tokenizer_(sent):
    return spacy.tokens.Doc(nlp.vocab, sent.split())


nlp.tokenizer = tokenizer_

if len(test) > 100:
    with Pool(4) as p:
        # spacy predictions, multi-process
        predictions_spacy = p.map(nlp, [str(sent) for sent in test])
else:
    # spacy predictions
    predictions_spacy = [nlp(str(sent)) for sent in test]

[(token.text, token.ent_iob_, token.ent_type_) for token in predictions_spacy[5][0:3]]

CPU times: user 1.34 s, sys: 54.8 ms, total: 1.39 s
Wall time: 1.56 s


[('China', 'B', 'GPE'), ('controlled', 'O', ''), ('most', 'O', '')]

In [4]:
# convert to NLTK format so that conll.evaluate can be used
predictions = from_spacy_to_conll(predictions_spacy)

predictions[5][0:3]

[('China', 'B-LOC'), ('controlled', 'O'), ('most', 'O')]

## Token-level accuracy, total and per-class

In [5]:
# organize test data in (token, tag) tuples
test_set = [[(token.text, token.ent_tag) for token in sent.tokens] for sent in test]

test_set[5][0:3]

[('China', 'B-LOC'), ('controlled', 'O'), ('most', 'O')]

### Total accuracy

In [6]:
def total_accuracy(predictions_labels, labels):
    if len(predictions_labels) != len(labels):
        raise ValueError('Prediction labels and test labels have different length')

    # count the positions where the predicted tag matches the gold tag
    correct = sum(pred == gold
                  for pred, gold in zip(predictions_labels, labels))

    return correct / len(labels)


pred_labels = [token[1] for sent in predictions for token in sent]
test_labels = [token[1] for sent in test_set for token in sent]

round(total_accuracy(pred_labels, test_labels), 3)

0.797

### Per-class accuracy

In [7]:
per_class_metrics = classification_report(test_labels, pred_labels,
                                          output_dict=True)
pd_tbl_class = pd.DataFrame().from_dict(per_class_metrics).transpose()
pd_tbl_class.round(decimals=3)

| Tag | precision | recall | f1-score | support |
|---|---|---|---|---|
| B-LOC | 0.84 | 0.768 | 0.803 | 82.0 |
| B-MISC | 0.086 | 0.545 | 0.148 | 22.0 |
| B-ORG | 0.133 | 1.0 | 0.235 | 2.0 |
| B-PER | 0.958 | 0.622 | 0.754 | 111.0 |
| I-LOC | 1.0 | 0.889 | 0.941 | 9.0 |
| I-MISC | 0.098 | 0.5 | 0.164 | 12.0 |
| I-ORG | 0.059 | 1.0 | 0.111 | 1.0 |
| I-PER | 0.873 | 0.711 | 0.784 | 97.0 |
| O | 0.946 | 0.831 | 0.885 | 1086.0 |
| accuracy | 0.797 | 0.797 | 0.797 | 0.797 |
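
The recall column of the report above is the fraction of gold tokens of a class that received the correct tag, which is one way to read "per-class accuracy". As a rough sketch on toy data, the same number can be computed directly; `per_class_accuracy` is a hypothetical helper and the tag lists are made up for illustration.

```python
from collections import Counter


def per_class_accuracy(gold, pred):
    """Fraction of gold tokens of each class that were tagged correctly."""
    total = Counter(gold)
    correct = Counter()
    for g, p in zip(gold, pred):
        if g == p:
            correct[g] += 1
    # a Counter returns 0 for classes that were never tagged correctly
    return {label: correct[label] / total[label] for label in total}


toy_gold = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
toy_pred = ['B-PER', 'O', 'O', 'O', 'B-LOC']
toy_acc = per_class_accuracy(toy_gold, toy_pred)
print(toy_acc)
```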


### Chunk-level performance

In [8]:
results = conll.evaluate(test_set, predictions)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| Type | p | r | f | s |
|---|---|---|---|---|
| MISC | 0.079 | 0.5 | 0.136 | 22 |
| PER | 0.903 | 0.586 | 0.71 | 111 |
| LOC | 0.84 | 0.768 | 0.803 | 82 |
| ORG | 0.067 | 0.5 | 0.118 | 2 |
| total | 0.464 | 0.645 | 0.539 | 217 |


# 1. Grouping of Entities.
Write a function to group recognized named entities using `noun_chunks` method of [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks). Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

In [9]:
def is_in_a_group(token, groups):
    for group in groups:
        for elem in group:
            if elem.i == token.i:
                return True
    return False


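def insert_in_group(token, groups):
    # insert the token as a new group before the first group that starts
    # after it, so that the groups stay in sentence order
    for i, group in enumerate(groups):
        if group[0].i > token.i:
            groups.insert(i, [token])
            return

    # no later group exists (or no group at all): append at the end, so
    # that the token is never silently dropped
    groups.append([token])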


def group_by_noun_chunk(sent_doc):
    groups = [[token for token in chunk if token.ent_type_ != '']
              for chunk in sent_doc.noun_chunks]
    groups = [group for group in groups if len(group) > 0]

    for token in sent_doc:
        # check if token already in a group
        if token.ent_type_ != '' and not is_in_a_group(token, groups):
            # insert the missing token in sentence order
            insert_in_group(token, groups)

    return groups


predictions_grouped = [group_by_noun_chunk(sent) for sent in predictions_spacy]
print(predictions_spacy[5])
predictions_grouped[5]

China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net . 


[[China], [the, 78th, minute], [Uzbek, Igor, Shkvyrin], [Chinese]]

Replace each token with its entity type.

In [10]:
def to_tag_groups(grouped_sents):
    for i, sent in enumerate(grouped_sents):
        for j, group in enumerate(sent):
            for k, token in enumerate(group):
                grouped_sents[i][j][k] = token.ent_type_


to_tag_groups(predictions_grouped)
predictions_grouped[5]

[['GPE'], ['TIME', 'TIME', 'TIME'], ['NORP', 'PERSON', 'PERSON'], ['NORP']]

## Count frequencies of named-entity combinations

In [11]:
def get_group_freqs(grouped_sents):
    freqs = defaultdict(int)

    for sent in grouped_sents:
        for group in sent:
            # key each group by its tag sequence, e.g. 'PERSON-PERSON'
            freqs['-'.join(str(tag) for tag in group)] += 1

    return freqs


pd_freq_tbl = pd.DataFrame().from_dict(get_group_freqs(predictions_grouped),
                                       orient='index', columns=['Count'])
pd_freq_tbl.sort_values(by='Count', ascending=False)

| Combination | Count |
|---|---|
| GPE | 63 |
| PERSON-PERSON | 38 |
| CARDINAL | 24 |
| CARDINAL-PERSON-PERSON | 17 |
| DATE | 13 |
| ORG | 9 |
| NORP | 8 |
| TIME-TIME-TIME | 7 |
| ORG-ORG | 7 |
| TIME-TIME-TIME-TIME | 6 |
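
Keys such as `TIME-TIME-TIME` and `TIME-TIME-TIME-TIME` differ only in the length of a single entity. One optional variant, sketched here with the standard library only, collapses consecutive repetitions of a type so that the counts reflect which types co-occur. `collapse` is a hypothetical helper, and `tag_groups` reuses the tag groups printed for sentence 5 above.

```python
from collections import Counter
from itertools import groupby


def collapse(group):
    """Join a tag group, keeping one tag per run of identical tags."""
    return '-'.join(tag for tag, _ in groupby(group))


# tag groups of sentence 5, as printed earlier in the notebook
tag_groups = [['GPE'], ['TIME', 'TIME', 'TIME'],
              ['NORP', 'PERSON', 'PERSON'], ['NORP']]

collapsed_counts = Counter(collapse(group) for group in tag_groups)
print(collapsed_counts.most_common())
```

With this variant, `NORP-PERSON-PERSON` becomes `NORP-PERSON`, which makes combinations of different types easier to spot in the frequency table.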


# 2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends the entity span to cover the full noun-compounds. Make use of `compound` dependency relation.
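
Before the spaCy implementation, the idea can be illustrated on plain data. This is only a sketch under stated assumptions: `extend_span` is a hypothetical helper, spans are half-open `[start, end)` index pairs, and the heads and dependency labels below are an assumed parse of "Uzbek striker Igor Shkvyrin", not actual spaCy output.

```python
def extend_span(start, end, heads, deps, in_entity):
    """Extend the half-open span [start, end) over its compound children.

    heads[i] is the index of the head of token i, deps[i] its dependency
    label, and in_entity[i] tells whether token i already belongs to an
    entity (such tokens are never absorbed).
    """
    changed = True
    while changed:
        changed = False
        for i, (head, dep) in enumerate(zip(heads, deps)):
            outside = i < start or i >= end
            head_in_span = start <= head < end
            if dep == 'compound' and outside and head_in_span \
                    and not in_entity[i]:
                # grow the span so that it covers the compound child
                start, end = min(start, i), max(end, i + 1)
                changed = True
    return start, end


tokens = ['Uzbek', 'striker', 'Igor', 'Shkvyrin']
heads = [1, 3, 3, 3]                      # assumed parse
deps = ['amod', 'compound', 'compound', 'ROOT']
in_entity = [True, False, True, True]     # 'Uzbek' is tagged NORP

# the PERSON span 'Igor Shkvyrin' covers tokens 2 and 3
new_start, new_end = extend_span(2, 4, heads, deps, in_entity)
print(tokens[new_start:new_end])
```

Under these assumptions the span grows to include "striker", while "Uzbek" is left alone because it is not a compound and already belongs to another entity.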

In [56]:
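# work on a copy so that the original predictions are left untouched
ts = predictions_spacy[5].copy()

# build a toy example: keep only 'Igor Shkvyrin' as a PERSON entity
new_span = spacy.tokens.Span(ts, 18, 20, label='PERSON')
print(new_span)
ts.set_ents([new_span], default='outside')

# force compound relations on the two tokens right after the span
# ('took' and 'advantage') to test the extension to the right
ts[20].head = ts[19]
ts[20].dep_ = 'compound'

ts[21].head = ts[20]
ts[21].dep_ = 'compound'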

# spacy.displacy.render(ts, style='dep')
spacy.displacy.render(ts, style='ent')
for token in ts:
    if token.dep_ == 'compound':
        print(f'{token.text}, {token.dep_}, {token.ent_iob_}, {token.ent_type_}, {token.i}')

Igor Shkvyrin


Uzbek, compound, O, , 16
striker, compound, O, , 17
Igor, compound, B, PERSON, 18
took, compound, O, , 20
advantage, compound, O, , 21


In [57]:
def fix_segmentation(spacy_sents):
    # if a token in an entity span has a compound child that is not in another
    # entity span, expand the span to include it
    for sent in spacy_sents:
        # reset the flag for every sentence, otherwise only the first
        # sentence would ever be processed
        updated = True

        while updated:
            updated = False

            for span_i, _ in enumerate(sent.ents):
                for token in sent.ents[span_i]:
                    for child in token.children:
                        # this way if a span gets expanded, this variable will be
                        # updated accordingly instead of being set to the original
                        # span passed by the 'for span in sent', which is never
                        # updated
                        span = sent.ents[span_i]

                        # the child is a compound of its parent and, because
                        # it has an IOB tag of O, it is not part of any
                        # other span. So, its parent's span has to be
                        # expanded to include the child
                        if child.dep_ == 'compound' and child.ent_iob_ == 'O':
                            # child on the left of the parent
                            if child.i < token.i:
                                # find the span's boundaries to create a new span
                                start = child.i
                                end = span[-1].i + 1
                            else:
                                # child.i > token.i since they can't be the same
                                # token
                                start = span[0].i
                                end = child.i + 1

                            sent.set_ents(
                                [
                                    spacy.tokens.Span(
                                        doc=sent,
                                        start=start,
                                        end=end,
                                        label=span[0].ent_type_
                                    )
                                ],
                                default='unmodified'
                            )
                            updated = True


fix_segmentation([ts])
spacy.displacy.render(ts, style='ent')