<a href="https://colab.research.google.com/github/adefgreen98/NLU2021-Assignment2/blob/main/code/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Understanding 2021 - Assignment 2: NERs & Dependency Parsing

_Federico Pedeni, 223993_

### Current issues
- During spaCy parsing of `test.txt`, the number of sentences grows from 3453 to 4205
- Spacy has many more entity tags than the ground truth, it should be checked if they have statistical importance for the purpose of our study
- Check what is the equivalent of ground truth's MISC (should be NORP)

### Requirements


In [1]:
!git clone https://github.com/adefgreen98/NLU2021-Assignment2.git
!mv /content/NLU2021-Assignment2/code/conll.py ./

fatal: destination path 'NLU2021-Assignment2' already exists and is not an empty directory.
mv: cannot stat '/content/NLU2021-Assignment2/code/conll.py': No such file or directory


In [2]:
import spacy
import nltk
import zipfile
import re
from conll import *

In [3]:
# Initialize parser
nlp = spacy.load('en')

# correct tokenizer to use only whitespaces
nlp.tokenizer = lambda input: spacy.tokens.Doc(nlp.vocab, input.split()[:-1] + (['.'] if input[-1] == '.' else []))

In [4]:
# Extract assignment data
with zipfile.ZipFile("/content/NLU2021-Assignment2/data/conll2003.zip") as zipref:
    zipref.extractall('data')

In [5]:
# Format of dataset: <TOKEN> <POS> <IOB part-of speech tag> <TAG>


def load_dataset(mode):
    res = {
        'sentences': [],
        'ners': []
    }
    pth = f'data/{mode}.txt'
    
    idx = 0

    tmpsentence = []
    tmpentity = {}
    tmp_entities_in_sentence = []

    tmpmisc = None

    with open(pth, 'rt') as file:
        for line in file:
            idx += 1
            if line == '\n':
                if len(tmpsentence) > 0:
                    # adding artificial punctuation for only nominal sentences so that they are correclty parsed
                    if tmpsentence[-1] != '.': 
                        tmpsentence.append('.')

                    # flushes the current sentence
                    res['sentences'].append(' '.join(tmpsentence[:-1]) + tmpsentence[-1])
                    tmpsentence = []

                    # flushes the last entity in entity list for sentence
                    if len(tmpentity) > 0: tmp_entities_in_sentence.append(tmpentity)
                    tmpentity = []
                    
                    # adds artificial punctuation if needed also to the entity list
                    if tmp_entities_in_sentence[-1][-1][0] != '.':
                        tmp_entities_in_sentence.append([('.', 'O')])
                    
                    # flushes entity list
                    res['ners'].append(tmp_entities_in_sentence)
                    tmp_entities_in_sentence = []
                continue
            elif line.startswith('-DOCSTART-'):
                continue
            else:
                if len(line.split()) != 4: 
                    print(f"Error: line with size {len(line.split())} at index {index}")
                token, pos, tag1, tag2 = line.split()
                tmpsentence.append(token)

                if tag2.startswith('B'):
                    if len(tmpentity) > 0:
                        tmp_entities_in_sentence.append(tmpentity)
                        tmpentity = [(token, tag2)]
                    else:
                        tmpentity = [(token, tag2)]
                elif tag2.startswith('I'):
                    currtag = tag2.split('-')[1]
                    oldtag =  tmpentity[-1][1].split('-')[1]
                    if currtag != oldtag: 
                        raise RuntimeError(f"not corresponding tags at index {idx}; tags are '{currtag}' (new) and '{oldtag}' (old)")
                    tmpentity.append((token, tag2))
                elif tag2.startswith('O'): 
                    if len(tmpentity) > 0: tmp_entities_in_sentence.append(tmpentity)
                    tmpentity = [(token, tag2)]
                else:
                    print(f"Error: wrong tag detected at line {idx}, line: {line.encode()}")

    # TODO: solve the issue of MISC tagged-tokens that seem compound but appear without a subject (eg: 'German', 'British')
    return res


In [6]:
# Utility to return iob-tagging for a single string
def tag_iob_string(sentence:str):
    return [(token.text, "-".join([token.ent_iob_, token.ent_type_])) for token in nlp(sentence)]

### Named entity lanbels conversion from SpaCy format to CoNLL format
Labelmaps converted to CoNLL according to [this](https://www.clips.uantwerpen.be/conll2003/ner/annotation.txt).

In [7]:
for el in nlp.entity.labels:
    print(el, ": ", spacy.explain(el))

CARDINAL :  Numerals that do not fall under another type
DATE :  Absolute or relative dates or periods
EVENT :  Named hurricanes, battles, wars, sports events, etc.
FAC :  Buildings, airports, highways, bridges, etc.
GPE :  Countries, cities, states
LANGUAGE :  Any named language
LAW :  Named documents made into laws.
LOC :  Non-GPE locations, mountain ranges, bodies of water
MONEY :  Monetary values, including unit
NORP :  Nationalities or religious or political groups
ORDINAL :  "first", "second", etc.
ORG :  Companies, agencies, institutions, etc.
PERCENT :  Percentage, including "%"
PERSON :  People, including fictional
PRODUCT :  Objects, vehicles, foods, etc. (not services)
QUANTITY :  Measurements, as of weight or distance
TIME :  Times smaller than a day
WORK_OF_ART :  Titles of books, songs, etc.


In [8]:

# Labelmaps converted to CoNLL according to https://www.clips.uantwerpen.be/conll2003/ner/annotation.txt
labelmap = {
    'CARDINAL': 'out',
    'DATE': 'out',
    'EVENT': 'MISC',
    'FAC': 'LOC',
    'GPE': 'LOC',
    'LANGUAGE': 'MISC',
    'LAW': 'out',
    'LOC': 'LOC',
    'MONEY': 'out',
    'NORP': 'MISC',
    'ORDINAL': 'out',
    'ORG': 'ORG',
    'PERCENT': 'out',
    'PERSON': 'PER',
    'PRODUCT': 'out',
    'QUANTITY': 'out',
    'TIME': 'out',
    'WORK_OF_ART': 'out',
    '': 'out'
}

In [9]:
dataset = load_dataset('test')

In [10]:
for sent, ents in zip(dataset['sentences'][:4], dataset['ners'][:4]):
    print("Sentence: ", sent)
    print("Ents: ", *ents, sep='\n')
    print("----------------------------")

Sentence:  SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT.
Ents: 
[('SOCCER', 'O')]
[('-', 'O')]
[('JAPAN', 'B-LOC')]
[('GET', 'O')]
[('LUCKY', 'O')]
[('WIN', 'O')]
[(',', 'O')]
[('CHINA', 'B-PER')]
[('IN', 'O')]
[('SURPRISE', 'O')]
[('DEFEAT', 'O')]
[('.', 'O')]
----------------------------
Sentence:  Nadim Ladki.
Ents: 
[('Nadim', 'B-PER'), ('Ladki', 'I-PER')]
[('.', 'O')]
----------------------------
Sentence:  AL-AIN , United Arab Emirates 1996-12-06.
Ents: 
[('AL-AIN', 'B-LOC')]
[(',', 'O')]
[('United', 'B-LOC'), ('Arab', 'I-LOC'), ('Emirates', 'I-LOC')]
[('1996-12-06', 'O')]
[('.', 'O')]
----------------------------
Sentence:  Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday.
Ents: 
[('Japan', 'B-LOC')]
[('began', 'O')]
[('the', 'O')]
[('defence', 'O')]
[('of', 'O')]
[('their', 'O')]
[('Asian', 'B-MISC'), ('Cup', 'I-MISC')]
[('title', 'O')]
[('with', 'O')]
[('a', 'O')]
[('lucky', 'O')]
[('2-1', 

### 1) Evaluate spaCy NER model using CoNLL evaluation script on CoNLL 2003 data 
+ report token-level performance (per class and total)
> + accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
+ report CoNLL chunk-level performance (per class and total); 
> + precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total


In [11]:
# Part 1
import itertools 

accuracies = {k: {'matching': 0, 'total': 0} for k in [el[0] + el[1] for el in itertools.product(['B-', 'I-'], ['ORG', 'PER', 'LOC', 'MISC'])] + ['O']}

# Computing ACCURACY sentence by sentence
for sent, ents in zip(dataset['sentences'], dataset['ners']):
    doc = nlp(sent)
    for tk in doc:
        # label converted according to CoNLL standard (if it is not 'O')
        conv_label = (tk.ent_iob_ + ('-' + labelmap[tk.ent_type_])) if labelmap[tk.ent_type_] != 'out' else 'O'
        # tokens in a sentence (to get the index)
        sent_tks = [enttoken[0] for ent in ents for enttoken in ent]
        idx = sent_tks.index(tk.text)
        # finds corresponding ground truth
        labels = [enttoken[1] for ent in ents for enttoken in ent]
        if conv_label == labels[idx]:
            accuracies[conv_label]['matching'] += 1
        accuracies[conv_label]['total'] += 1

for cls, vals in accuracies.items():
    print(f"Accuracy for {cls}: {vals['matching'] / vals['total']}")

global_accuracy = sum([d['matching'] for d in accuracies.values()]) / sum([d['total'] for d in accuracies.values()])
print(f"Global Accuracy: {global_accuracy}")

Accuracy for B-ORG: 0.4899267399267399
Accuracy for B-PER: 0.7568659127625202
Accuracy for B-LOC: 0.7636986301369864
Accuracy for B-MISC: 0.796976241900648
Accuracy for I-ORG: 0.4116485686080948
Accuracy for I-PER: 0.679635761589404
Accuracy for I-LOC: 0.4825174825174825
Accuracy for I-MISC: 0.6166666666666667
Accuracy for O: 0.9578903042425518
Global Accuracy: 0.9092165077101346


In [12]:
# Part 2



### 2) Grouping of Entities. Write a function to group recognized named entities using noun_chunks method of spaCy. Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).



### 3) One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun-compounds. Make use of compound dependency relation.

### Utilities

In [13]:
str_dataset = " ".join([sent for sent in dataset['sentences']])
print(*str_dataset.split('.')[:5], sep='\n')

SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT
 Nadim Ladki
 AL-AIN , United Arab Emirates 1996-12-06
 Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday
 But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan


In [14]:
doc = nlp(str_dataset)

In [15]:
entities_counts = {'total': 0}
for sent in doc.sents:
    for tk in sent:
        if tk.ent_type_ == 'NORP' and entities_counts['total'] < 1000:
            print(tk)
        try: entities_counts[tk.ent_type_] += 1
        except KeyError: entities_counts[tk.ent_type_] = 1
        entities_counts['total'] += 1
print(*entities_counts.items(), sep='\n')

Chinese
Soviet
Syrian
Syrian
Syrian
Syrian
Syrians
Marcello
Syrian
Syrian
('total', 44876)
('', 33377)
('ORG', 2080)
('GPE', 1579)
('EVENT', 195)
('ORDINAL', 138)
('TIME', 253)
('PERSON', 2339)
('NORP', 421)
('DATE', 1389)
('CARDINAL', 1798)
('FAC', 103)
('LOC', 128)
('QUANTITY', 226)
('PRODUCT', 166)
('MONEY', 347)
('PERCENT', 200)
('WORK_OF_ART', 74)
('LAW', 61)
('LANGUAGE', 2)


In [16]:
print("After Parsing: ", len(list(doc.sents)))
print("Before Parsing: ", len(dataset['sentences']))
print("From read_corpus_conll(): ", len(read_corpus_conll('/content/data/test.txt')))

After Parsing:  3454
Before Parsing:  3453
From read_corpus_conll():  3684


In [17]:
# here i show that my loading method works perfectly
i = 0
for sent in read_corpus_conll('/content/data/test.txt'):
    if sent[0][0] == '-DOCSTART- -X- -X- O':
        i += 1
print(i)
print(len(read_corpus_conll('/content/data/test.txt')) - len(dataset['sentences']))

231
231
