<a href="https://colab.research.google.com/github/adefgreen98/NLU2021-Assignment2/blob/main/code/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Understanding 2021 - Assignment 2: NERs & Dependency Parsing

_Federico Pedeni, 223993_

### Current issues
- When spaCy parses the text of `test.txt`, the number of sentences grows from 3453 to 4205, which suggests that spaCy re-segments the sentences
- spaCy has many more entity tags than the ground truth; check whether the extra tags are statistically relevant for the purpose of this study
- Find the spaCy equivalent of the ground truth's MISC tag (probably NORP)
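
The tag mismatch listed above could be handled with a label mapping. The following is only a sketch: `SPACY_TO_CONLL` and `to_conll_tag` are hypothetical names, and the mapping itself is an assumption (in particular NORP to MISC, as guessed above) that still has to be checked against the data.

```python
# Hypothetical mapping from spaCy entity types to the four CoNLL-2003 types.
# Every entry is an assumption to be validated on the data.
SPACY_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
}


def to_conll_tag(iob, ent_type):
    """Convert a spaCy (ent_iob_, ent_type_) pair into a CoNLL-style tag."""
    if iob in ("", "O") or not ent_type:
        return "O"
    label = SPACY_TO_CONLL.get(ent_type)
    if label is None:
        # Types without a counterpart (e.g. DATE, CARDINAL) are treated as non-entities
        return "O"
    return f"{iob}-{label}"
```

Dropping the unmapped types to `O` is one possible design choice; whether it is appropriate depends on how often those types overlap with ground-truth entities.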

### Requirements


In [1]:
!cp -R "drive/MyDrive/Colab Notebooks/NLU/asgnmt2_data/" ./

In [2]:
import spacy
import nltk
import zipfile
from asgnmt2_data.conll import *

In [3]:
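# Initialize parser
# ('en' is a spaCy v2 shortcut link; spaCy v3 requires the full package name, e.g. 'en_core_web_sm')
nlp = spacy.load('en')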

In [4]:
# Extract assignment data
with zipfile.ZipFile("asgnmt2_data/conll2003.zip") as zipref:
    zipref.extractall('data')

In [5]:
# Format of dataset: <TOKEN> <POS tag> <IOB syntactic chunk tag> <IOB named-entity tag>


def load_dataset(mode):
    res = {
        'sentences': [],
        'ners': {}
    }
    pth = f'data/{mode}.txt'
    
    idx = 0

    tmpsentence = []
    tmpentity = []

    tmpmisc = None

    with open(pth, 'rt') as file:
        for line in file:
            idx += 1
            if line == '\n':
                if len(tmpsentence) > 0:
                    # flushes the current sentence
                    # creates a space-separated sentence, but the last word must be joined to the final punctuation
                    # also: adds artificial punctuation to purely nominal sentences so that they are correctly parsed
                    if tmpsentence[-1] != '.': tmpsentence.append('.')
                    res['sentences'].append(' '.join(tmpsentence[:-1]) + tmpsentence[-1])
                    tmpsentence = []
                continue
            elif line.startswith('-DOCSTART-'):
                continue
            else:
                fields = line.split()
                if len(fields) != 4:
                    # malformed line: report it and skip it instead of failing on unpacking
                    print(f"Error: line with size {len(fields)} at index {idx}")
                    continue
                token, pos, tag1, tag2 = fields
                tmpsentence.append(token)
                if tag2.startswith('B'):
                    if len(tmpentity) > 0:
                        currtag = tmpentity[-1][1].split('-')[1]
                        try: res['ners'][currtag].append(tmpentity)
                        except KeyError: res['ners'][currtag] = [tmpentity]
                        tmpentity = [(token, tag2)]
                    else:
                        tmpentity = [(token, tag2)]
                elif tag2.startswith('I'):
                    tmpentity.append((token, tag2))
                elif tag2.startswith('O'): 
                    try: res['ners']['O'].append(token)
                    except KeyError: res['ners']['O'] = [token]
                else:
                    print(f"Error: wrong tag detected at line {idx}, line: {line.encode()}")

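    # flushes the last pending entity, which no later B- tag would otherwise store
    if len(tmpentity) > 0:
        currtag = tmpentity[-1][1].split('-')[1]
        res['ners'].setdefault(currtag, []).append(tmpentity)

    # TODO: solve the issue of MISC-tagged tokens that look like compound modifiers but appear without a head noun (e.g. 'German', 'British')
    return res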


In [6]:
# Utility to return iob-tagging for a single string
def tag_iob_string(sentence:str):
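    # non-entity tokens have an empty ent_type_, so they are tagged 'O' rather than 'O-'
    return [(token.text, f"{token.ent_iob_}-{token.ent_type_}" if token.ent_type_ else "O") for token in nlp(sentence)]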


In [7]:
dataset = load_dataset('test')

In [8]:
for k,v in dataset['ners'].items():
    print(k, ": ", v[:4])

O :  ['SOCCER', '-', 'GET', 'LUCKY']
LOC :  [[('JAPAN', 'B-LOC')], [('AL-AIN', 'B-LOC')], [('United', 'B-LOC'), ('Arab', 'I-LOC'), ('Emirates', 'I-LOC')], [('Japan', 'B-LOC')]]
PER :  [[('CHINA', 'B-PER')], [('Nadim', 'B-PER'), ('Ladki', 'I-PER')], [('Igor', 'B-PER'), ('Shkvyrin', 'I-PER')], [('Oleg', 'B-PER'), ('Shatskiku', 'I-PER')]]
MISC :  [[('Asian', 'B-MISC'), ('Cup', 'I-MISC')], [('Uzbek', 'B-MISC')], [('Chinese', 'B-MISC')], [('Soviet', 'B-MISC')]]
ORG :  [[('FIFA', 'B-ORG')], [('RUGBY', 'B-ORG'), ('UNION', 'I-ORG')], [('Plymouth', 'B-ORG')], [('Exeter', 'B-ORG')]]


### 1) Evaluate spaCy NER model using CoNLL evaluation script on CoNLL 2003 data 
+ report token-level performance (per class and total)
> + accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)
+ report CoNLL chunk-level performance (per class and total); 
> + precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total
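
The two kinds of scores requested above can be illustrated with a small pure-Python sketch. It is not the CoNLL evaluation script from `conll.py`; the names `token_accuracy`, `extract_chunks` and `chunk_scores` are hypothetical, and the sketch assumes IOB2 tags and hypothesis tags already aligned token by token with the reference.

```python
from collections import defaultdict


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    assert len(gold) == len(pred), "tag sequences must be aligned"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold) if gold else 0.0


def extract_chunks(tags):
    """Return the set of (start, end_exclusive, label) chunks in an IOB2 sequence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last chunk
        prefix, _, typ = tag.partition("-")
        # An open chunk ends on O, on B-, or on an I- tag of a different type
        if start is not None and (prefix != "I" or typ != label):
            chunks.add((start, i, label))
            start, label = None, None
        # B- opens a chunk; a stray I- is treated as an opening tag too
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, typ
    return chunks


def chunk_scores(gold, pred):
    """Per-label and total precision, recall and F1 over exact-match chunks."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
    counts = defaultdict(lambda: {"tp": 0, "gold": 0, "pred": 0})
    groups = (("gold", gold_chunks), ("pred", pred_chunks), ("tp", gold_chunks & pred_chunks))
    for key, chunks in groups:
        for _, _, label in chunks:
            counts[label][key] += 1
            counts["total"][key] += 1
    scores = {}
    for label, c in counts.items():
        p = c["tp"] / c["pred"] if c["pred"] else 0.0
        r = c["tp"] / c["gold"] if c["gold"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores
```

A chunk counts as correct only if both its boundaries and its label match, so a partially recognized entity lowers precision and recall together. For a whole corpus, the per-sentence tag lists can be concatenated before calling these functions.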


In [20]:
doc = nlp(dataset['sentences'][0])
print(doc)
print([(tk.text, tk.ent_iob_, tk.ent_type_) for tk in doc])
print([[(tk.text, tk.ent_iob_, tk.ent_type_) for tk in nlp('japan')]])

SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT.
[('SOCCER', 'O', ''), ('-', 'O', ''), ('JAPAN', 'O', ''), ('GET', 'O', ''), ('LUCKY', 'O', ''), ('WIN', 'B', 'ORG'), (',', 'O', ''), ('CHINA', 'B', 'GPE'), ('IN', 'O', ''), ('SURPRISE', 'O', ''), ('DEFEAT', 'O', ''), ('.', 'O', '')]
[[('japan', 'B', 'GPE')]]



### 2) Grouping of Entities. Write a function to group recognized named entities using noun_chunks method of spaCy. Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).
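
A possible shape for this function is sketched below. It relies only on attributes that spaCy exposes (`doc.noun_chunks`, `doc.ents`, and the `start`, `end` and `label_` attributes of spans); the names `group_entities` and `most_frequent_groups` are hypothetical, and treating an entity outside every noun chunk as a group of its own is an assumption.

```python
from collections import Counter


def group_entities(doc):
    """Group the entity labels of a parsed document by noun chunk."""
    groups = []
    grouped = set()  # (start, end) offsets of entities already assigned to a chunk
    for chunk in doc.noun_chunks:
        # Keep only the entities that lie entirely inside this noun chunk
        inside = [e for e in doc.ents if e.start >= chunk.start and e.end <= chunk.end]
        if inside:
            groups.append([e.label_ for e in inside])
            grouped.update((e.start, e.end) for e in inside)
    # Entities not covered by any noun chunk become single-label groups
    for ent in doc.ents:
        if (ent.start, ent.end) not in grouped:
            groups.append([ent.label_])
    return groups


def most_frequent_groups(docs, n=10):
    """Count how often each combination of labels occurs across documents."""
    counter = Counter()
    for doc in docs:
        counter.update(tuple(group) for group in group_entities(doc))
    return counter.most_common(n)
```

With the parser loaded above, this could be called as `most_frequent_groups(nlp(s) for s in dataset['sentences'])`. The groups are returned chunk-first rather than in document order, which does not affect the frequency counts.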



### 3) One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun-compounds. Make use of compound dependency relation.
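
One way to approach this step is sketched here, using the `dep_`, `head`, `children` and `i` attributes of spaCy tokens. The function name `extend_entity_spans` is hypothetical, and the sketch makes two simplifying assumptions: it follows compound links for a single step only, and it does not resolve overlaps between extended entities.

```python
def extend_entity_spans(doc):
    """Return (start, end_exclusive, label) triples with compound-extended boundaries."""
    extended = []
    for ent in doc.ents:
        start, end = ent.start, ent.end
        for token in ent:
            # The entity token is a compound modifier: include its head
            if token.dep_ == "compound":
                start = min(start, token.head.i)
                end = max(end, token.head.i + 1)
            # The entity token is modified by compounds: include those modifiers
            for child in token.children:
                if child.dep_ == "compound":
                    start = min(start, child.i)
                    end = max(end, child.i + 1)
        extended.append((start, end, ent.label_))
    return extended
```

The triples can be turned back into spans with `spacy.tokens.Span(doc, start, end, label=label)`. If two extended spans overlap, `spacy.util.filter_spans` could be used to keep the longest one before assigning them to `doc.ents`.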

### Utilities

In [None]:
str_dataset = " ".join(dataset['sentences'])
print(*str_dataset.split('.')[:5], sep='\n')

In [None]:
doc = nlp(str_dataset)

In [None]:
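# Count tokens per spaCy entity type (the '' key collects the non-entity tokens)
entities_counts = {'total': 0}
for sent in doc.sents:
    for tk in sent:
        # print the NORP tokens found among the first 1000 tokens, for manual inspection
        if tk.ent_type_ == 'NORP' and entities_counts['total'] < 1000:
            print(tk)
        try: entities_counts[tk.ent_type_] += 1
        except KeyError: entities_counts[tk.ent_type_] = 1
        entities_counts['total'] += 1
print(*entities_counts.items(), sep='\n')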

In [None]:
print("After Parsing: ", len(list(doc.sents)))
print("Before Parsing: ", len(dataset['sentences']))
print("From read_corpus_conll(): ", len(read_corpus_conll('/content/data/test.txt')))

In [None]:
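# Sanity check for load_dataset(): the number of -DOCSTART- pseudo-sentences returned by
# read_corpus_conll() should equal the difference between the two sentence counts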
i = 0
for sent in read_corpus_conll('/content/data/test.txt'):
    if sent[0][0] == '-DOCSTART- -X- -X- O':
        i += 1
print(i)
print(len(read_corpus_conll('/content/data/test.txt')) - len(dataset['sentences']))