# NLU: Mid-Term Assignment 2022
### Description
In this notebook, we ask you to complete four main tasks to show what you have learnt during the NLU labs. To complete the assignment, please refer to the concepts, libraries and other materials shown and used during the labs. The last task is not mandatory: it is a *BONUS* that earns an extra mark towards the laude.

### Instructions
- **Dataset**: in this notebook, you are asked to work with the *CoNLL 2003* dataset, which we provide in the *data* folder. Please load the files from the *data* folder and **do not** change the names or paths of the files inside it.
- **Output**: for each part of your task, print your results and leave them in the notebook. Please **do not** send a Jupyter notebook without the printed outputs.
- **Other**: carefully follow all further instructions and suggestions given in the question descriptions.

### Deadline
The deadline is two weeks after the project presentation. Please refer to the *piazza* channel for the exact date.

## Setup

In [209]:
from nltk import FreqDist
from nltk.lm import Vocabulary
from nltk.corpus import ConllCorpusReader

CORPUS_ROOT = 'data'
CORPUS_FILEIDS = ['train.txt', 'test.txt', 'valid.txt']
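# NOTE: assuming the standard CoNLL-2003 column order (word, POS, chunk, NE),
# these names are shifted by one. As a result, iob_words() (which reads the
# 'pos' and 'chunk' columns) yields (word, chunk tag, NE tag); Q1.4 relies on
# the NE tag being in the last position.
CORPUS_COLUMNTYPES = ['words', 'ne', 'pos', 'chunk', 'tree']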

corpus = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS, CORPUS_COLUMNTYPES)
corpus_train = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS[0], CORPUS_COLUMNTYPES)
corpus_test = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS[1], CORPUS_COLUMNTYPES)
corpus_val = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS[2], CORPUS_COLUMNTYPES)


import spacy
nlp = spacy.load('en_core_web_sm')

# Utilities
def nbest(d, n=1):
    """
    get n max values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of top n key-value pairs
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

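def get_flat_sents(corpus):
    """
    join each tokenized sentence into a single space-separated string
    :param corpus: iterable of sentences, each a list of tokens
    :return: list of strings (the '.' tokens are dropped)
    """
    sents = list()
    for sent in corpus:
        # Skip full stops, then join the remaining tokens with single spaces
        flat_sent = " ".join(w for w in sent if w != '.')
        sents.append(flat_sent.strip())
    return sents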

### Task 1: Analysis of the dataset

#### Q 1.1
- Create the Vocabulary and Frequency Dictionary of the:
    1. Whole dataset
    2. Train set
    3. Test set
    
**Attention**: print the first 20 words of the Dictionary of each set

In [18]:
def q11():
    # Create vocabulary
    vocab = set([w.lower() for w in corpus.words()])
    vocab_train = set([w.lower() for w in corpus_train.words()])
    vocab_test = set([w.lower() for w in corpus_test.words()])

    # Create frequency distribution
    fd = FreqDist([w.lower() for w in corpus.words()])
    fd_train = FreqDist([w.lower() for w in corpus_train.words()])
    fd_test = FreqDist([w.lower() for w in corpus_test.words()])

    # Print vocabulary length
    print("Length of whole dataset: %d" % len(vocab))
    print("Length of train set: %d" % len(vocab_train))
    print("Length of test set: %d" % len(vocab_test))

    # Print the first 20 words for each dict
    print("\nFirst 20 words of whole dataset:")
    print(nbest(fd, 20))
    print("\nFirst 20 words of train set:")
    print(nbest(fd_train, 20))
    print("\nFirst 20 words of test set:")
    print(nbest(fd_test, 20))

q11()

Length of whole dataset: 26869
Length of train set: 21009
Length of test set: 8548

First 20 words of whole dataset:
{'the': 12310, ',': 10876, '.': 10874, 'of': 5502, 'in': 5405, 'to': 5129, 'a': 4731, '(': 4226, ')': 4225, 'and': 4223, '"': 3239, 'on': 3115, 'said': 2694, "'s": 2339, 'for': 2109, '-': 1866, '1': 1845, 'at': 1679, 'was': 1593, '2': 1342}

First 20 words of train set:
{'the': 8390, '.': 7374, ',': 7290, 'of': 3815, 'in': 3621, 'to': 3424, 'a': 3199, 'and': 2872, '(': 2861, ')': 2861, '"': 2178, 'on': 2092, 'said': 1849, "'s": 1566, 'for': 1465, '1': 1421, '-': 1243, 'at': 1146, 'was': 1095, '2': 973}

First 20 words of test set:
{'the': 1765, ',': 1637, '.': 1626, 'to': 805, 'of': 789, 'in': 761, '(': 686, ')': 684, 'a': 658, 'and': 598, 'on': 467, '"': 421, 'said': 399, "'s": 347, '-': 287, 'for': 286, 'at': 251, 'was': 224, '4': 201, 'with': 185}
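
Since `FreqDist` subclasses `collections.Counter`, the ranking produced by `nbest` can presumably also be obtained with `most_common`. The sketch below checks this on a toy token list; the sentence is made up, and `nbest` is copied from the Setup cell only so that the block runs on its own.

```python
from collections import Counter

def nbest(d, n=1):
    # Copy of the notebook helper: sort by value, keep the top n pairs
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

# Toy data (hypothetical, not taken from the corpus)
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)

# most_common(n) returns (token, count) pairs, highest count first
top3 = dict(counts.most_common(3))

print(top3)
print(top3 == nbest(counts, 3))
```

On a full corpus this avoids sorting every entry when only the first 20 or 50 are needed.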


#### Q 1.2
- Obtain the list of:
    1. Out-Of-Vocabulary (OOV) tokens
    2. Overlapping tokens between train and test sets  

In [127]:
def q12(cutoff=1):


    test_lower = [w.lower() for w in corpus_test.words()]
    val_lower = [w.lower() for w in corpus_val.words()]
    # Get vocabs
    vocab_train = Vocabulary([w.lower() for w in corpus_train.words()], unk_cutoff=cutoff)
    vocab_test = Vocabulary(test_lower, unk_cutoff=cutoff)
    vocab_valid = Vocabulary(val_lower, unk_cutoff=cutoff)
    vocab_tv = Vocabulary([*test_lower, *val_lower], unk_cutoff=cutoff)

    # Get list of tokens
    tokens_train = set(vocab_train.counts.keys())
    tokens_test = set(vocab_test.counts.keys())
    tokens_val = set(vocab_valid.counts.keys())
    tokens_tv = set(vocab_tv.counts.keys())

    # Get OOV 
    oov_test = tokens_test.difference(tokens_train)
    oov_valid = tokens_val.difference(tokens_train)
    # OOV of the combined test + valid vocabulary with respect to the train set
    oov_tv = tokens_tv.difference(tokens_train)
    print("[Q1.2.1]\n>\tOOV tokens:")
    print(">\t (test) Found {} OOV".format(len(oov_test)))
    print(">\t (valid) Found {} OOV".format(len(oov_valid)))
    print(">\t (test + valid) Found {} OOV".format(len(oov_tv)))

    print()

    # Get overlapping tokens w/ test set
    intersection_test = tokens_train.intersection(tokens_test)
    intersection_val = tokens_train.intersection(tokens_val)
    intersection_tv = tokens_train.intersection(tokens_tv)
    print("[Q1.2.2]\n>\tOverlapping tokens:")
    print(">\t (test) Found {} overlapping tokens".format(len(intersection_test)))
    print(">\t (valid) Found {} overlapping tokens".format(len(intersection_val)))
    print(">\t (test + val) Found {} overlapping tokens".format(len(intersection_tv)))
    

q12()

[Q1.2.1]
>	OOV tokens:
>	 (test) Found 3268 OOV
>	 (valid) Found 2856 OOV
>	 (test + valid) Found 5860 OOV

[Q1.2.2]
>	Overlapping tokens:
>	 (test) Found 5280 overlapping tokens
>	 (valid) Found 6146 overlapping tokens
>	 (test + val) Found 8066 overlapping tokens
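
The OOV and overlap sets are complementary views of the same vocabulary: every test token type is either in the train vocabulary or not, so the two counts add up to the test vocabulary size (for the test set above, 3268 + 5280 = 8548, the vocabulary size printed in Q 1.1). A minimal sketch on made-up sentences, using plain Python sets instead of `nltk.lm.Vocabulary`:

```python
# Toy data (hypothetical, not taken from the corpus)
train_tokens = "the match was played in london".split()
test_tokens = "the match was postponed in leeds".split()

train_vocab = set(train_tokens)
test_vocab = set(test_tokens)

# OOV: token types seen in the test set but never in the train set
oov = test_vocab - train_vocab
# Overlap: token types shared by both sets
overlap = test_vocab & train_vocab

# Share of test token types that a train-only vocabulary would not cover
oov_rate = len(oov) / len(test_vocab)

print(sorted(oov))
print(sorted(overlap))
print(round(oov_rate, 3))
```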


#### Q 1.3
- Perform a complete data analysis of the whole dataset (train + test sets) to obtain:
    1. Average sentence length computed in number of tokens
    2. The 50 most-common tokens
    3. Number of sentences

In [66]:
def q13():

    # Get average sentence length
    print("[Q1.3.1]\n>\tAverage sentence length in tokens: {:.4f}\n".format(len(corpus.words())/len(corpus.sents())))

    # Get 50 most common tokens
    vocab = Vocabulary([w.lower() for w in corpus.words()])
    most_common_tokens = nbest(vocab.counts, 50)
    print("[Q1.3.2]\n>\t50 most common tokens:")
    # print(">\t", most_common_tokens)
    count = 1
    for key in most_common_tokens:
        print(">\t[{}] {}: {}".format(count, key, most_common_tokens[key]))
        count += 1

    # Get number of sentences
    print("\n[Q1.3.3]\n>\tNumber of sentences: %d" % len(corpus.sents()))

q13()

[Q1.3.1]
>	Average sentence length in tokens: 13.6160

[Q1.3.2]
>	50 most common tokens:
>	[1] the: 12310
>	[2] ,: 10876
>	[3] .: 10874
>	[4] of: 5502
>	[5] in: 5405
>	[6] to: 5129
>	[7] a: 4731
>	[8] (: 4226
>	[9] ): 4225
>	[10] and: 4223
>	[11] ": 3239
>	[12] on: 3115
>	[13] said: 2694
>	[14] 's: 2339
>	[15] for: 2109
>	[16] -: 1866
>	[17] 1: 1845
>	[18] at: 1679
>	[19] was: 1593
>	[20] 2: 1342
>	[21] with: 1267
>	[22] 3: 1264
>	[23] 0: 1232
>	[24] that: 1212
>	[25] he: 1166
>	[26] from: 1146
>	[27] by: 1113
>	[28] it: 1082
>	[29] :: 1057
>	[30] is: 984
>	[31] 4: 973
>	[32] as: 920
>	[33] his: 867
>	[34] had: 841
>	[35] were: 804
>	[36] an: 796
>	[37] but: 786
>	[38] not: 786
>	[39] after: 780
>	[40] has: 768
>	[41] be: 754
>	[42] have: 738
>	[43] new: 656
>	[44] first: 645
>	[45] who: 643
>	[46] 5: 636
>	[47] will: 591
>	[48] 6: 584
>	[49] two: 579
>	[50] they: 567

[Q1.3.3]
>	Number of sentences: 22137


#### Q 1.4
- Create the dictionary of Named Entities and their Frequencies for the:
    1. Whole dataset
    2. Train set
    3. Test set

In [61]:
def q14():
    (WORD, POS, NE) = range(3)

    # Whole dataset
    ne_all = [w[NE] for w in corpus.iob_words() if w[NE] != 'O']
    fd_all = FreqDist(ne_all)
    print("[Q1.4.1]\n>\tFrequency dist of Named Entities for the whole dataset\n>\t", nbest(fd_all, 20))

    # Train set
    ne_train = [w[NE] for w in corpus_train.iob_words() if w[NE] != 'O']
    fd_train = FreqDist(ne_train)
    print("[Q1.4.2]\n>\tFrequency dist of Named Entities for the training set\n>\t", nbest(fd_train, 20))

    # Test set
    ne_test = [w[NE] for w in corpus_test.iob_words() if w[NE] != 'O']
    fd_test = FreqDist(ne_test)
    print("[Q1.4.3]\n>\tFrequency dist of Named Entities for the test set\n>\t", nbest(fd_test, 20))

q14()

[Q1.4.1]
>	Frequency dist of Named Entities for the whole dataset
>	 {'B-LOC': 10645, 'B-PER': 10059, 'B-ORG': 9323, 'I-PER': 6991, 'I-ORG': 5290, 'B-MISC': 5062, 'I-MISC': 1717, 'I-LOC': 1671}
[Q1.4.2]
>	Frequency dist of Named Entities for the training set
>	 {'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155}
[Q1.4.3]
>	Frequency dist of Named Entities for the test set
>	 {'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617, 'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702, 'I-LOC': 257, 'I-MISC': 216}
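
The counts above are per tag, so a multi-token entity contributes one `B-` tag and one or more `I-` tags. If the tags follow the IOB2 scheme (an assumption, though the dominance of `B-` tags in the output suggests it), counting only the `B-` tags gives the number of entities per category. A small sketch on a hand-written tag sequence; `count_entities` is a hypothetical helper, not part of NLTK:

```python
from collections import Counter

def count_entities(tags):
    """Count entities per type, assuming IOB2 (each entity starts with B-)."""
    return Counter(tag[2:] for tag in tags if tag.startswith("B-"))

# Hand-written example sequence (hypothetical)
tags = ["B-PER", "I-PER", "O", "B-LOC", "O",
        "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]

entity_counts = count_entities(tags)                       # one count per entity
token_counts = Counter(tag for tag in tags if tag != "O")  # one count per tagged token

print(dict(entity_counts))
print(dict(token_counts))
```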


### Task 2: Working with Dependency Tree
*Suggestions: use the spaCy pipeline to retrieve the Dependency Tree*


#### Q 2.1
- Given each sentence in the dataset, write the required functions to provide:
    1. Subject, objects (direct and indirect)
    2. Noun chunks
    3. The head noun in each noun chunk
    
**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [131]:
def q21(corpus):

    def get_subj_obj_dict(doc):
        deps_dict = dict()
        deps = ['nsubj', 'dobj', 'pobj']
        for dep in deps:
            deps_dict[dep] = list()
        for token in doc:
            if token.dep_ in deps:
                deps_dict[token.dep_].append(token.text)
        return deps_dict
                    
    def get_noun_chunks(doc):
        return doc.noun_chunks

    def get_head_of_chunk(doc):
        return [(c.root.text, c.text) for c in doc.noun_chunks]


    def q211(doc):
        print("[Q2.1.1]\n>\tProviding subjects and objects:")
        deps_dict = get_subj_obj_dict(doc)
        for key in deps_dict:
            print(">\t {}: {}".format(key, deps_dict[key]))
        print()


    def q212(doc):
        print("[Q2.1.2]\n>\tProviding noun chunks:")
        noun_chunks = get_noun_chunks(doc)
        for chunk in noun_chunks:
            print(">\t", chunk)
        print()

    def q213(doc):
        print("[Q2.1.3]\n>\tProviding head noun for each noun chunk:")
        print(">\t'CHUNK' -> HEAD\n>")
        heads = get_head_of_chunk(doc)
        for head, chunk in heads:
            print(">\t'{}' -> {}".format(chunk, head))
        print()


    sents = get_flat_sents(corpus)
    for sent in sents:
        doc = nlp(sent)
        print (sent, "\n")
        q211(doc)
        q212(doc)
        q213(doc)

    # doc = nlp(sents[14])
    # print (sents[14], "\n")
    # q211(doc)
    # q212(doc)
    # q213(doc)


q21(["I saw the man with a telescope".split(" ")])
# q21(corpus.sents())

I saw the man with a telescope  

[Q2.1.1]
>	Providing subjects and objects:
>	 nsubj: ['I']
>	 dobj: ['man']
>	 pobj: ['telescope']

[Q2.1.2]
>	Providing noun chunks:
>	 I
>	 the man
>	 a telescope

[Q2.1.3]
>	Providing head noun for each noun chunk:
>	'CHUNK' -> HEAD
>
>	'I' -> I
>	'the man' -> man
>	'a telescope' -> telescope



#### Q 2.2
- Given the dependency tree of a sentence and a segment of that sentence, write the required functions that output the dependency subtree of that segment.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope" (the segment could be any e.g. "saw the man", "a telescope", etc.)*

In [241]:
def q22(corpus):

    def get_root(doc):
        """Return the text of the first token labelled ROOT (None if there is none)."""
        for token in doc:
            if token.dep_ == 'ROOT':
                return token.text
        return None

    def get_subtree(chunk_text, doc):
        subtree = None
        chunk_doc = nlp(chunk_text)
        chunk_root = get_root(chunk_doc)
        for token in doc:
            if token.text == chunk_root:
                # print(list(token.subtree))
                leftmost = list(token.subtree)[0].i
                rightmost = list(token.subtree)[-1].i
                # print(leftmost, rightmost)
                subtree = doc[leftmost:rightmost+1]
        # print()
        return subtree


    sents = get_flat_sents(corpus)
    # sents = [sents[14]]
    for sent in sents:
        print(sent)
        doc = nlp(sent)
        spacy.displacy.render(doc, style="dep")
        for chunk in doc.noun_chunks:
            subtree = get_subtree(chunk.text, doc)
            print("Chunk: {}\n>\tSubtree: {}".format(chunk, subtree))
            spacy.displacy.render(subtree, style="dep")
    
q22([
    "I saw the man with the telescope".split(" "),
    "I saw the man with a telescope".split(" ")
])
# q22(corpus.sents())

I saw the man with the telescope


Chunk: I
>	Subtree: I


Chunk: the man
>	Subtree: the man with the telescope


Chunk: the telescope
>	Subtree: the telescope


I saw the man with a telescope


Chunk: I
>	Subtree: I


Chunk: the man
>	Subtree: the man


Chunk: a telescope
>	Subtree: a telescope
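
Matching the segment root by its text can pick the wrong token when a word occurs twice in the sentence. A position-based variant is sketched below on a hand-written parse: `heads[i]` is the index of the head of token `i` (the root points to itself), mirroring the attachments printed in Q 2.3. The helper names `segment_root` and `subtree` are hypothetical; in spaCy the same idea would presumably be expressed with `doc[start:end].root` and `token.subtree`.

```python
from collections import defaultdict

tokens = ["I", "saw", "the", "man", "with", "a", "telescope"]
# Hand-written head indices (assumption): the root "saw" is its own head
heads = [1, 1, 3, 1, 1, 6, 4]

def segment_root(heads, start, end):
    """Index of the only token in [start, end) whose head is outside the segment."""
    roots = [i for i in range(start, end)
             if heads[i] == i or not (start <= heads[i] < end)]
    if len(roots) != 1:
        raise ValueError("segment is not a single connected subtree")
    return roots[0]

def subtree(heads, root):
    """Sorted indices of `root` and all of its descendants."""
    children = defaultdict(list)
    for i, h in enumerate(heads):
        if h != i:
            children[h].append(i)
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children[node])
    return sorted(found)

# Segment "a telescope" covers token positions 5..6
root = segment_root(heads, 5, 7)
print(tokens[root], [tokens[i] for i in subtree(heads, root)])

# Segment "saw the man" contains the sentence root, so its subtree is the sentence
root = segment_root(heads, 1, 4)
print(tokens[root], [tokens[i] for i in subtree(heads, root)])
```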


#### Q 2.3
- Given a token in a sentence, write the required functions that output the dependency path from the root of the dependency tree to that given token.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

In [114]:
def q23(corpus):
    def compute_dependency_path(token):
        path = list()
        path.append(token.text)
        while token.dep_ != 'ROOT':
            token = token.head
            path.append(token.text)
        return path

    sents = get_flat_sents(corpus)
    for sent in sents:
        print(sent)
        doc = nlp(sent)
        print("TOKEN ---> ['path', 'to', 'root']\n")
        for token in doc:
            print("{}\n\t---> {}".format(token.text, compute_dependency_path(token)))

        spacy.displacy.render(doc, style="dep")

q23(["I saw the man with a telescope on the hill".split(" ")])
# q23(corpus.sents())

I saw the man with a telescope on the hill 
TOKEN ---> ['path', 'to', 'root']

I
	---> ['I', 'saw']
saw
	---> ['saw']
the
	---> ['the', 'man', 'saw']
man
	---> ['man', 'saw']
with
	---> ['with', 'saw']
a
	---> ['a', 'telescope', 'with', 'saw']
telescope
	---> ['telescope', 'with', 'saw']
on
	---> ['on', 'telescope', 'with', 'saw']
the
	---> ['the', 'hill', 'on', 'telescope', 'with', 'saw']
hill
	---> ['hill', 'on', 'telescope', 'with', 'saw']
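
The paths above run from the token up to the root, while the question asks for the path from the root down to the token; reversing the list gives that direction. A minimal sketch without spaCy, using a hand-written head-index list that reproduces the attachments printed above (`path_from_root` is a hypothetical helper name):

```python
tokens = ["I", "saw", "the", "man", "with", "a", "telescope", "on", "the", "hill"]
# heads[i] is the index of the head of token i; the root "saw" is its own head.
# These indices are copied by hand from the printed paths (assumption).
heads = [1, 1, 3, 1, 1, 6, 4, 6, 9, 7]

def path_from_root(tokens, heads, index):
    """Walk up from token `index` to the root, then reverse the collected path."""
    path = [tokens[index]]
    while heads[index] != index:
        index = heads[index]
        path.append(tokens[index])
    return path[::-1]

print(path_from_root(tokens, heads, 9))   # path to "hill"
print(path_from_root(tokens, heads, 1))   # the root itself
```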


### Task 3: Named Entity Recognition
*Suggestion: use scikit-learn metric functions. See classification_report*

#### Q 3.1
- Benchmark the spaCy Named Entity Recognition model on the test set by:
    1. Providing the list of categories in the dataset (person, organization, etc.)
    2. Computing the overall accuracy on NER
    3. Computing the performance of the Named Entity Recognition model for each category:
        - Compute the performance at the token level (e.g. B-Person, I-Person, B-Organization, I-Organization, O, etc.)
        - Compute the performance at the entity level (e.g. Person, Organization, etc.)

In [135]:
def q31():
    sents = get_flat_sents(corpus_val.sents())
    # print(sents[14])
    # for sent in sents:
    #     doc = nlp(sent)
    doc = nlp(sents[14])
    print(doc)
    ents = [e.label_ for e in doc.ents]
    print(ents)
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)


    def q311(sents):
        pass
        


q31()

After the frustration of seeing the opening day of their match badly affected by the weather , Kent stepped up a gear to dismiss Nottinghamshire for 214 
['ORG', 'CARDINAL']
Kent 95 99 ORG
214 149 152 CARDINAL
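
The cell above only inspects one sentence. One possible way to approach the evaluation is sketched below in plain Python, on hand-written tag sequences. The label mapping `SPACY_TO_CONLL` is an assumption (spaCy predicts labels such as `CARDINAL` that have no CoNLL counterpart and are mapped to `O` here), and all helper names are hypothetical. The inputs to `to_conll_tag` are meant to correspond to spaCy's `token.ent_iob_` and `token.ent_type_`.

```python
# Assumed mapping from spaCy labels to the four CoNLL categories
SPACY_TO_CONLL = {"PERSON": "PER", "ORG": "ORG", "GPE": "LOC", "LOC": "LOC",
                  "FAC": "LOC", "NORP": "MISC", "EVENT": "MISC", "LANGUAGE": "MISC"}

def to_conll_tag(ent_iob, ent_type):
    """Turn an IOB flag and a spaCy label into a CoNLL-style tag ('O' if unmapped)."""
    kind = SPACY_TO_CONLL.get(ent_type)
    return f"{ent_iob}-{kind}" if ent_iob in ("B", "I") and kind else "O"

def extract_entities(tags):
    """Return the set of (type, start, end) spans found in a tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans

def token_accuracy(gold_tags, pred_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

def entity_prf(gold_tags, pred_tags):
    """Per-category (precision, recall, F1) over exactly matching spans."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    scores = {}
    for kind in sorted({k for k, _, _ in gold | pred}):
        g = {s for s in gold if s[0] == kind}
        p = {s for s in pred if s[0] == kind}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[kind] = (precision, recall, f1)
    return scores

# Hand-written example: the prediction truncates the two-token location
gold = ["B-ORG", "O", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-ORG", "O", "O", "B-LOC", "O", "O"]

scores = entity_prf(gold, pred)
print(round(token_accuracy(gold, pred), 3))
print(scores)
print(to_conll_tag("B", "CARDINAL"), to_conll_tag("B", "GPE"))
```

The example shows why the two levels differ: five of six tokens are correct, yet the truncated location counts as a miss at the entity level. For this comparison to be meaningful on the real data, the predicted tokens have to be aligned with the CoNLL tokens first; the token-level report itself could then be produced with scikit-learn's `classification_report`, as the task suggests.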


### Task 4: BONUS PART (extra mark for laude)

#### Q 4.1
- Modify the `Configuration` class of NLTK's transition parser to use better features.

#### Q 4.2
- Evaluate the features comparing performance to the original.

#### Q 4.3
- Replace SVM classifier with an alternative of your choice.