<a href="https://colab.research.google.com/github/adefgreen98/NLU2021-Assignment2/blob/main/code/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Understanding 2021 - Assignment 2: NERs & Dependency Parsing

_Federico Pedeni, 223993_

### Current issues
- When spaCy parses `test.txt`, the number of sentences grows from 3453 to 4205, since spaCy applies its own sentence segmentation to the joined text
- spaCy produces many more entity tags than the ground truth; whether the extra tags are statistically relevant to this study still has to be checked
- Find the spaCy equivalent of the ground truth's MISC tag (probably NORP)
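
One possible way to handle the tag mismatch listed above is to map spaCy labels onto the four CoNLL 2003 classes before evaluating. The sketch below is only an assumption about a reasonable correspondence (for instance NORP to MISC and GPE to LOC); `SPACY_TO_CONLL` and `to_conll_tag` are hypothetical names, and labels left out of the mapping (DATE, CARDINAL, MONEY and so on) are simply treated as `O`.

```python
# Hypothetical mapping from spaCy labels to CoNLL 2003 labels.
# These choices are assumptions, not an official correspondence.
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
}


def to_conll_tag(ent_iob, ent_type):
    """Convert a spaCy (ent_iob_, ent_type_) pair into a CoNLL-style IOB tag."""
    label = SPACY_TO_CONLL.get(ent_type)
    # Tokens outside entities, or with an unmapped label, become 'O'
    if ent_iob not in ("B", "I") or label is None:
        return "O"
    return f"{ent_iob}-{label}"
```

Whether EVENT, PRODUCT and similar labels really belong under MISC should be checked against the data before relying on this table.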

### Requirements


In [1]:
!cp -R "drive/MyDrive/Colab Notebooks/NLU/asgnmt2_data/" ./

In [2]:
import spacy
import nltk
import zipfile
from asgnmt2_data.conll import *

In [3]:
# Initialize parser
# The 'en' shortcut was removed in spaCy 3, so the model is loaded by its full name
nlp = spacy.load('en_core_web_sm')

In [4]:
# Extract assignment data
with zipfile.ZipFile("asgnmt2_data/conll2003.zip") as zipref:
    zipref.extractall('data')

In [5]:
# Dataset format: <TOKEN> <POS> <syntactic chunk IOB tag> <NER IOB tag>


def load_dataset(mode):
    res = {
        'sentences': [],
        'ners': {}
    }
    pth = f'data/{mode}.txt'
    
    idx = 0

    tmpsentence = []
    tmpentity = []

    tmpmisc = None

    with open(pth, 'rt') as file:
        for line in file:
            idx += 1
            if line == '\n':
                if len(tmpsentence) > 0:
                    # Flush the current sentence: join tokens with spaces, but attach
                    # a trailing '.', ';' or ',' directly to the preceding word
                    if idx < 100: print(tmpsentence)
                    last = tmpsentence[-1]
                    sep = '' if last in {'.', ';', ','} else ' '
                    res['sentences'].append((' '.join(tmpsentence[:-1]) + sep + last).lstrip())
                    tmpsentence = []
                continue
            elif line.startswith('-DOCSTART-'):
                continue
            else:
                fields = line.split()
                if len(fields) != 4:
                    # Skip malformed lines instead of failing on the unpacking below
                    print(f"Error: line with {len(fields)} fields at line {idx}")
                    continue
                token, pos, tag1, tag2 = fields
                tmpsentence.append(token)
                if tag2.startswith('B'):
                    if len(tmpentity) > 0:
                        currtag = tmpentity[-1][1].split('-')[1]
                        try: res['ners'][currtag].append(tmpentity)
                        except KeyError: res['ners'][currtag] = [tmpentity]
                        tmpentity = [(token, tag2)]
                    else:
                        print(f"New Entity detected at line {idx}, data: {line}")
                        tmpentity = [(token, tag2)]
                elif tag2.startswith('I'):
                    tmpentity.append((token, tag2))
                elif tag2.startswith('O'): 
                    try: res['ners']['O'].append(token)
                    except KeyError: res['ners']['O'] = [token]
                else:
                    print(f"Error: wrong tag detected at line {idx}, line: {line.encode()}")

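    # Flush the last entity, which no following B- tag would otherwise close
    if len(tmpentity) > 0:
        res['ners'].setdefault(tmpentity[-1][1].split('-')[1], []).append(tmpentity)

    # TODO: handle MISC-tagged tokens that look like part of a compound but appear without a head noun (e.g. 'German', 'British')
    return res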


In [6]:
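# Utility to return the IOB tagging of a single string
def tag_iob_string(sentence: str):
    # Tokens outside any entity have an empty ent_type_, so emit a bare 'O' for them
    return [(token.text, f"{token.ent_iob_}-{token.ent_type_}" if token.ent_type_ else "O") for token in nlp(sentence)]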


In [7]:
dataset = load_dataset('test')

New Entity detected at line 5, data: JAPAN NNP B-NP B-LOC

['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
['Nadim', 'Ladki']
['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06']
['Japan', 'began', 'the', 'defence', 'of', 'their', 'Asian', 'Cup', 'title', 'with', 'a', 'lucky', '2-1', 'win', 'against', 'Syria', 'in', 'a', 'Group', 'C', 'championship', 'match', 'on', 'Friday', '.']
['But', 'China', 'saw', 'their', 'luck', 'desert', 'them', 'in', 'the', 'second', 'match', 'of', 'the', 'group', ',', 'crashing', 'to', 'a', 'surprise', '2-0', 'defeat', 'to', 'newcomers', 'Uzbekistan', '.']


In [8]:
for k,v in dataset['ners'].items():
    print(k, ": ", v[:4])

O :  ['SOCCER', '-', 'GET', 'LUCKY']
LOC :  [[('JAPAN', 'B-LOC')], [('AL-AIN', 'B-LOC')], [('United', 'B-LOC'), ('Arab', 'I-LOC'), ('Emirates', 'I-LOC')], [('Japan', 'B-LOC')]]
PER :  [[('CHINA', 'B-PER')], [('Nadim', 'B-PER'), ('Ladki', 'I-PER')], [('Igor', 'B-PER'), ('Shkvyrin', 'I-PER')], [('Oleg', 'B-PER'), ('Shatskiku', 'I-PER')]]
MISC :  [[('Asian', 'B-MISC'), ('Cup', 'I-MISC')], [('Uzbek', 'B-MISC')], [('Chinese', 'B-MISC')], [('Soviet', 'B-MISC')]]
ORG :  [[('FIFA', 'B-ORG')], [('RUGBY', 'B-ORG'), ('UNION', 'I-ORG')], [('Plymouth', 'B-ORG')], [('Exeter', 'B-ORG')]]


In [9]:
str_dataset = "\n".join(dataset['sentences'])
print(*str_dataset.split('\n')[:5], sep='\n')

SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT.
Nadim Ladki
AL-AIN , United Arab Emirates 1996-12-06
Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday.
But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan.


### 1) Evaluate the spaCy NER model using the CoNLL evaluation script on CoNLL 2003 data
+ report token-level performance (per class and total)
> + accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
+ report CoNLL chunk-level performance (per class and total)
> + precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total
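
The two metrics requested above can be illustrated with a minimal pure-Python sketch. It is not the evaluation script shipped in `asgnmt2_data/conll.py`; the function names `extract_chunks`, `token_accuracy` and `chunk_scores` are hypothetical, and the inputs are assumed to be two aligned, flat lists of IOB2 tags (gold and predicted).

```python
from collections import defaultdict


def extract_chunks(tags):
    """Return the set of (start, end, label) entity chunks in an IOB2 tag sequence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last chunk
        prefix, _, kind = tag.partition("-")
        # The open chunk ends unless this tag continues it
        if start is not None and not (prefix == "I" and kind == label):
            chunks.add((start, i, label))
            start, label = None, None
        # A B- tag, or an I- tag with no open chunk, starts a new chunk
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, kind
    return chunks


def token_accuracy(gold, pred):
    """Tag-level accuracy per gold class and in total."""
    total, correct = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        label = g.partition("-")[2] or "O"
        for key in (label, "total"):
            total[key] += 1
            correct[key] += int(g == p)
    return {key: correct[key] / total[key] for key in total}


def chunk_scores(gold, pred):
    """Chunk-level precision, recall and F1 per class and in total."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
    labels = sorted({c[2] for c in gold_chunks | pred_chunks})
    scores = {}
    for label in labels + ["total"]:
        g = {c for c in gold_chunks if label in ("total", c[2])}
        p = {c for c in pred_chunks if label in ("total", c[2])}
        tp = len(g & p)  # a chunk counts only if boundaries and label both match
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = {"p": prec, "r": rec, "f": f1}
    return scores
```

A chunk is counted as correct only on an exact match, so a prediction that finds `Nadim` but misses `Ladki` earns token-level credit but no chunk-level credit.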


In [None]:
doc = nlp(str_dataset)

In [22]:
outliers = {'total': 0}
for sent in doc.sents:
    for tk in sent:
        if tk.ent_type_ == '' and outliers['total'] < 100:
            print(tk)
        try: outliers[tk.ent_type_] += 1
        except KeyError: outliers[tk.ent_type_] = 1
        outliers['total'] += 1
print(*outliers.items(), sep='\n')

SOCCER
-
JAPAN
GET
LUCKY
,
IN
SURPRISE
DEFEAT
.




,


began
the
defence
of
their
Cup
title
with
a
lucky
-
1
win
against
in
a
championship
match
on
.


But
saw
their
luck
desert
them
in
the
match
of
the
group
,
crashing
to
a
surprise
-
0
defeat
to
newcomers
.


controlled
most
of
the
match
and
saw
several
chances
missed
until
('total', 52592)
('', 38534)
('ORG', 2137)
('GPE', 1664)
('PERSON', 2504)
('DATE', 2985)
('NORP', 458)
('CARDINAL', 2397)
('ORDINAL', 146)
('TIME', 267)
('EVENT', 198)
('FAC', 93)
('LOC', 151)
('QUANTITY', 264)
('LANGUAGE', 12)
('PRODUCT', 106)
('MONEY', 341)
('PERCENT', 221)
('WORK_OF_ART', 68)
('LAW', 46)


In [27]:
i = 0
for sent in doc.sents:
    if sent.text != dataset['sentences'][i]:
        print("Found difference at index {}".format(i))
        print(f"     '{sent}' != '{dataset['sentences'][i]}' ")
    i += 1
print(i)
print(len(dataset['sentences']))

Output streaming truncated to the last 5000 lines.
Found difference at index 1670
     'Newsroom +541 318-0655
' != 'Mills was scheduled to die Wednesday but had his sentence temporarily postponed by the Florida Supreme Court.' 
Found difference at index 1671
     'Mexican daily port , shipping update for Dec 6.
' != 'On Thursday , the 11th Circuit U.S. Court of Appeals in Atlanta denied his appeal in federal court.' 
Found difference at index 1672
     'MEXICO CITY 1996-12-06
' != 'In March 1982 , Mills and accomplice Michael Frederick knocked on the door of Lester Lawhon 's trailer in an attempt to rob it , police said.' 
Found difference at index 1673
     'All major ports were open as of 1000 local /' != 'Lester Lawhon was taken to a nearby airstrip where he was bludgeoned with a tire iron.' 
Found difference at index 1674
     '1600 GMT , the Communications and Transportation Ministry said in a daily update.
' != 'Mills then fired two shots that killed Lawhon as the v

IndexError: ignored
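
The comparison above fails because spaCy segments and tokenizes the text differently from the dataset. One possible workaround, sketched here under the assumption that both tokenizations spell out the same characters once whitespace is ignored, is to compare at token level and project the predicted tags back onto the gold tokens by character offset. `project_tags` is a hypothetical helper, not part of spaCy.

```python
def project_tags(gold_tokens, pred_tokens, pred_tags):
    """Give each gold token the tag of the predicted token covering its first character."""
    # Tag every character of the whitespace-free text with its predicted tag;
    # pure-whitespace tokens (e.g. '\n') contribute no characters.
    char_tags = []
    for text, tag in zip(pred_tokens, pred_tags):
        char_tags.extend([tag] * len("".join(text.split())))
    if len(char_tags) != sum(len(token) for token in gold_tokens):
        raise ValueError("the two token sequences do not spell the same text")
    projected, offset = [], 0
    for token in gold_tokens:
        projected.append(char_tags[offset])
        offset += len(token)
    return projected
```

For example, a gold token `2-1` that the parser splits into `2`, `-`, `1` receives the tag predicted for `2`. Taking the first character is a design choice; a majority vote over the characters would be an alternative.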


### 2) Grouping of Entities. Write a function to group recognized named entities using noun_chunks method of spaCy. Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).
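
A minimal sketch of such a grouping function follows. It assumes `doc` behaves like a spaCy `Doc`, i.e. `doc.noun_chunks` and `doc.ents` yield spans exposing `start`, `end` and (for entities) `label_`; the names `group_entities` and `group_frequencies` are hypothetical. Entities that are not fully contained in any noun chunk are kept as singleton groups, which is one possible convention among several.

```python
from collections import Counter


def group_entities(doc):
    """Group entity labels by the noun chunk that fully contains them."""
    groups, grouped = [], set()
    for chunk in doc.noun_chunks:
        members = [ent for ent in doc.ents
                   if ent.start >= chunk.start and ent.end <= chunk.end]
        if members:
            groups.append([ent.label_ for ent in members])
            grouped.update((ent.start, ent.end) for ent in members)
    # Entities outside every noun chunk form groups of their own
    for ent in doc.ents:
        if (ent.start, ent.end) not in grouped:
            groups.append([ent.label_])
    return groups


def group_frequencies(groups):
    """Count how often each combination of labels occurs."""
    return Counter(tuple(group) for group in groups)
```

Calling `group_frequencies(group_entities(doc)).most_common(10)` would then list the most frequent label combinations.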



### 3) One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun-compounds. Make use of compound dependency relation.
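
The post-processing step described above could be sketched as follows. The code assumes `doc` behaves like a spaCy `Doc`: iterating it yields tokens exposing `i`, `dep_` and `head`, and `doc.ents` yields spans exposing `start`, `end` and `label_`. `extend_with_compounds` is a hypothetical name, and the function returns plain `(start, end, label)` tuples rather than modifying the document.

```python
def extend_with_compounds(doc):
    """Return (start, end, label) spans widened to cover compound relations."""
    spans = []
    for ent in doc.ents:
        start, end = ent.start, ent.end
        changed = True
        while changed:  # repeat until the span stops growing
            changed = False
            for token in doc:
                if token.dep_ != "compound":
                    continue
                inside = start <= token.i < end
                head_inside = start <= token.head.i < end
                # The compound relation crosses the span boundary: absorb both ends
                if inside != head_inside:
                    start = min(start, token.i, token.head.i)
                    end = max(end, token.i + 1, token.head.i + 1)
                    changed = True
        spans.append((start, end, ent.label_))
    return spans
```

The sketch scans the whole document for every entity and does not resolve overlaps between extended spans; restricting the scan to the entity's sentence and merging overlapping spans would be natural refinements.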

In [23]:
print(*read_corpus_conll('/content/data/test.txt')[:10], sep='\n')

[('-DOCSTART- -X- -X- O',)]
[('SOCCER NN B-NP O',), ('- : O O',), ('JAPAN NNP B-NP B-LOC',), ('GET VB B-VP O',), ('LUCKY NNP B-NP O',), ('WIN NNP I-NP O',), (', , O O',), ('CHINA NNP B-NP B-PER',), ('IN IN B-PP O',), ('SURPRISE DT B-NP O',), ('DEFEAT NN I-NP O',), ('. . O O',)]
[('Nadim NNP B-NP B-PER',), ('Ladki NNP I-NP I-PER',)]
[('AL-AIN NNP B-NP B-LOC',), (', , O O',), ('United NNP B-NP B-LOC',), ('Arab NNP I-NP I-LOC',), ('Emirates NNPS I-NP I-LOC',), ('1996-12-06 CD I-NP O',)]
[('Japan NNP B-NP B-LOC',), ('began VBD B-VP O',), ('the DT B-NP O',), ('defence NN I-NP O',), ('of IN B-PP O',), ('their PRP$ B-NP O',), ('Asian JJ I-NP B-MISC',), ('Cup NNP I-NP I-MISC',), ('title NN I-NP O',), ('with IN B-PP O',), ('a DT B-NP O',), ('lucky JJ I-NP O',), ('2-1 CD I-NP O',), ('win VBP B-VP O',), ('against IN B-PP O',), ('Syria NNP B-NP B-LOC',), ('in IN B-PP O',), ('a DT B-NP O',), ('Group NNP I-NP O',), ('C NNP I-NP O',), ('championship NN I-NP O',), ('match NN I-NP O',), ('on IN B-PP O'