# NLU: Mid-Term Assignment 2022
### Description
In this notebook, we ask you to complete four main tasks that show what you have learnt during the NLU labs. To complete the assignment, please refer to the concepts, libraries and other materials shown and used during the labs. The last task is not mandatory: it is a *BONUS* that earns an extra mark towards the laude.

### Instructions
- **Dataset**: in this notebook, you are asked to work with the *CoNLL 2003* dataset, provided by us in the *data* folder. Please load the files from the *data* folder and **do not** change the names or paths of the files inside it.
- **Output**: for each part of your task, print your results and leave them in the notebook. Please **do not** submit a Jupyter notebook without the printed outputs.
- **Other**: carefully follow all further instructions and suggestions given in the question descriptions.

### Deadline
The deadline is two weeks after the project presentation. Please refer to the *piazza* channel for the exact date.

## Setup

In [2]:
from nltk import FreqDist
from nltk.lm import Vocabulary
from nltk.corpus import ConllCorpusReader

CORPUS_ROOT = 'data'
CORPUS_FILEIDS = ['train.txt', 'test.txt', 'valid.txt']
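# Note: these names do not describe what each file column holds. They are
# ordered so that iob_words(), which reads the 'words', 'pos' and 'chunk'
# columns, returns the fourth file column (the named-entity tag, as the
# output of Q 1.4 shows) as the last element of each tuple.
CORPUS_COLUMNTYPES = ['words', 'ne', 'pos', 'chunk']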

corpus = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS, CORPUS_COLUMNTYPES)
corpus_train = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS[0], CORPUS_COLUMNTYPES)
corpus_test = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS[1], CORPUS_COLUMNTYPES)
corpus_val = ConllCorpusReader(CORPUS_ROOT, CORPUS_FILEIDS[2], CORPUS_COLUMNTYPES)

# Utilities
def nbest(d, n=1):
    """
    Get the n largest values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of the top n key-value pairs, sorted by decreasing value
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])
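
As a side note on the `nbest` utility: `FreqDist` subclasses `collections.Counter`, so the same top-n selection is also available through `most_common`. The sketch below uses a plain `Counter` on a toy string (not the corpus) to compare the two; it is only an illustration, and the notebook keeps using `nbest`.

```python
from collections import Counter

def nbest(d, n=1):
    """Same helper as in the Setup cell: top n key-value pairs by value."""
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

# Toy counts built from the characters of a word (illustration only)
counts = Counter("abracadabra")

top3_nbest = list(nbest(counts, 3).items())
top3_most_common = counts.most_common(3)

print(top3_nbest)
print(top3_most_common)
```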

### Task 1: Analysis of the dataset

#### Q 1.1
- Create the Vocabulary and Frequency Dictionary of the:
    1. Whole dataset
    2. Train set
    3. Test set
    
**Attention**: print the first 20 words of the Dictionary of each set

In [2]:
def q11():
    # Create vocabulary
    vocab = set([w.lower() for w in corpus.words()])
    vocab_train = set([w.lower() for w in corpus_train.words()])
    vocab_test = set([w.lower() for w in corpus_test.words()])

    # Create frequency distribution
    fd = FreqDist([w.lower() for w in corpus.words()])
    fd_train = FreqDist([w.lower() for w in corpus_train.words()])
    fd_test = FreqDist([w.lower() for w in corpus_test.words()])

    # Print vocabulary length
    print("Length of whole dataset: %d" % len(vocab))
    print("Length of train set: %d" % len(vocab_train))
    print("Length of test set: %d" % len(vocab_test))

    # Print the first 20 words for each dict
    print("\nFirst 20 words of whole dataset:")
    print(nbest(fd, 20))
    print("\nFirst 20 words of train set:")
    print(nbest(fd_train, 20))
    print("\nFirst 20 words of test set:")
    print(nbest(fd_test, 20))

q11()

Length of whole dataset: 26869
Length of train set: 21009
Length of test set: 8548

First 20 words of whole dataset:
{'the': 12310, ',': 10876, '.': 10874, 'of': 5502, 'in': 5405, 'to': 5129, 'a': 4731, '(': 4226, ')': 4225, 'and': 4223, '"': 3239, 'on': 3115, 'said': 2694, "'s": 2339, 'for': 2109, '-': 1866, '1': 1845, 'at': 1679, 'was': 1593, '2': 1342}

First 20 words of train set:
{'the': 8390, '.': 7374, ',': 7290, 'of': 3815, 'in': 3621, 'to': 3424, 'a': 3199, 'and': 2872, '(': 2861, ')': 2861, '"': 2178, 'on': 2092, 'said': 1849, "'s": 1566, 'for': 1465, '1': 1421, '-': 1243, 'at': 1146, 'was': 1095, '2': 973}

First 20 words of test set:
{'the': 1765, ',': 1637, '.': 1626, 'to': 805, 'of': 789, 'in': 761, '(': 686, ')': 684, 'a': 658, 'and': 598, 'on': 467, '"': 421, 'said': 399, "'s": 347, '-': 287, 'for': 286, 'at': 251, 'was': 224, '4': 201, 'with': 185}


#### Q 1.2
- Obtain the list of:
    1. Out-Of-Vocabulary (OOV) tokens
    2. Overlapping tokens between train and test sets  

In [39]:
def q12(cutoff=1):
    # Get vocabs
    vocab_train = Vocabulary([w.lower() for w in corpus_train.words()], unk_cutoff=cutoff)
    vocab_test = Vocabulary([w.lower() for w in corpus_test.words()], unk_cutoff=cutoff)

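    # Get list of tokens
    # Note: `counts` holds every token seen, regardless of `unk_cutoff`,
    # so the cutoff does not filter the sets built here
    tokens_train = set(vocab_train.counts.keys())
    tokens_test = set(vocab_test.counts.keys())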

    # Get OOV
    oov = tokens_test.difference(tokens_train)
    print("Found {} OOV".format(len(oov)))

    # Get overlapping tokens
    intersection = tokens_train.intersection(tokens_test)
    print("Found {} overlapping tokens".format(len(intersection)))
    

q12()

Found 3268 OOV
Found 5280 overlapping tokens
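
To make the role of a frequency cutoff concrete, here is a minimal sketch in plain Python, assuming that tokens seen fewer than `cutoff` times in a split should be treated as unknown. The helper `vocab_with_cutoff` and the toy token lists are hypothetical and are not part of the notebook or of NLTK.

```python
from collections import Counter

def vocab_with_cutoff(tokens, cutoff=1):
    """Return the set of lowercased tokens seen at least `cutoff` times."""
    counts = Counter(t.lower() for t in tokens)
    return {tok for tok, c in counts.items() if c >= cutoff}

# Toy data invented for illustration (not taken from CoNLL 2003)
train_tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
test_tokens = ["The", "dog", "sat", "."]

train_vocab = vocab_with_cutoff(train_tokens, cutoff=1)
test_vocab = vocab_with_cutoff(test_tokens, cutoff=1)

# OOV: test tokens never seen in training; overlap: tokens seen in both
oov = sorted(test_vocab - train_vocab)
overlap = sorted(test_vocab & train_vocab)
print("OOV:", oov)
print("Overlap:", overlap)

# With a stricter cutoff on the training side, more test tokens become OOV
strict_train_vocab = vocab_with_cutoff(train_tokens, cutoff=2)
oov_strict = sorted(test_vocab - strict_train_vocab)
print("OOV with train cutoff 2:", oov_strict)
```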


#### Q 1.3
- Perform a complete data analysis of the whole dataset (train + test sets) to obtain:
    1. Average sentence length computed in number of tokens
    2. The 50 most-common tokens
    3. Number of sentences

In [24]:
def q13():

    # Get average sentence length
    print("Average sentence length: {:.4f}".format(len(corpus.words())/len(corpus.sents())))

    # Get 50 most common tokens
    vocab = Vocabulary([w.lower() for w in corpus.words()])
    most_common_tokens = nbest(vocab.counts, 50)
    print("50 most common tokens:")
    for key in most_common_tokens:
        print("{}: {}".format(key, most_common_tokens[key]))

    # Get number of sentences
    print("Number of sentences: %d" % len(corpus.sents()))

q13()

Average sentence length: 13.6160
50 most common tokens:
the: 12310
,: 10876
.: 10874
of: 5502
in: 5405
to: 5129
a: 4731
(: 4226
): 4225
and: 4223
": 3239
on: 3115
said: 2694
's: 2339
for: 2109
-: 1866
1: 1845
at: 1679
was: 1593
2: 1342
with: 1267
3: 1264
0: 1232
that: 1212
he: 1166
from: 1146
by: 1113
it: 1082
:: 1057
is: 984
4: 973
as: 920
his: 867
had: 841
were: 804
an: 796
but: 786
not: 786
after: 780
has: 768
be: 754
have: 738
new: 656
first: 645
who: 643
5: 636
will: 591
6: 584
two: 579
they: 567
Number of sentences: 22137


#### Q 1.4
- Create the dictionary of Named Entities and their Frequencies for the:
    1. Whole dataset
    2. Train set
    3. Test set

In [24]:
def q14():
    (WORD, POS, NE) = range(3)

    # Whole dataset
    ne_all = [w[NE] for w in corpus.iob_words() if w[NE] != 'O']
    fd_all = FreqDist(ne_all)
    print("Frequency dist of Named Entities for the whole dataset\n", nbest(fd_all, 20))

    # Train set
    ne_train = [w[NE] for w in corpus_train.iob_words() if w[NE] != 'O']
    fd_train = FreqDist(ne_train)
    print("Frequency dist of Named Entities for the training set\n", nbest(fd_train, 20))

    # Test set
    ne_test = [w[NE] for w in corpus_test.iob_words() if w[NE] != 'O']
    fd_test = FreqDist(ne_test)
    print("Frequency dist of Named Entities for the test set\n", nbest(fd_test, 20))

q14()

Frequency dist of Named Entities for the whole dataset
 {'B-LOC': 10645, 'B-PER': 10059, 'B-ORG': 9323, 'I-PER': 6991, 'I-ORG': 5290, 'B-MISC': 5062, 'I-MISC': 1717, 'I-LOC': 1671}
Frequency dist of Named Entities for the training set
 {'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155}
Frequency dist of Named Entities for the test set
 {'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617, 'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702, 'I-LOC': 257, 'I-MISC': 216}


### Task 2: Working with Dependency Tree
*Suggestion: use the spaCy pipeline to retrieve the Dependency Tree*


#### Q 2.1
- Given each sentence in the dataset, write the required functions to provide:
    1. Subject, objects (direct and indirect)
    2. Noun chunks
    3. The head noun in each noun chunk
    
**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*
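
One possible shape for these functions is sketched below in plain Python. It works on a hand-written parse of the example sentence, which is an assumption and not parser output; the `Tok` class, the label sets and the simplified chunker are hypothetical. With spaCy, the corresponding information would come from `token.dep_`, `token.head`, `doc.noun_chunks` and `chunk.root`, and the object labels of the loaded model may differ from the ones assumed here.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    pos: str   # coarse part-of-speech tag
    dep: str   # dependency label linking the token to its head
    head: int  # index of the head token (the root points to itself)

# Hand-written parse of the example sentence (assumption, not parser output)
SENT = [
    Tok("I", "PRON", "nsubj", 1),
    Tok("saw", "VERB", "ROOT", 1),
    Tok("the", "DET", "det", 3),
    Tok("man", "NOUN", "dobj", 1),
    Tok("with", "ADP", "prep", 1),
    Tok("a", "DET", "det", 6),
    Tok("telescope", "NOUN", "pobj", 4),
]

def subjects_and_objects(sent):
    """Group tokens by grammatical role, based on their dependency label."""
    roles = {
        "subject": {"nsubj", "nsubjpass"},
        "direct object": {"dobj", "obj"},
        "indirect object": {"dative", "iobj"},
    }
    return {role: [t.text for t in sent if t.dep in labels]
            for role, labels in roles.items()}

def noun_chunks(sent):
    """Simplified chunker: each noun or pronoun plus the contiguous
    dependents immediately to its left. Returns (chunk text, head noun)."""
    chunks = []
    for i, tok in enumerate(sent):
        if tok.pos not in {"NOUN", "PROPN", "PRON"}:
            continue
        start = i
        while start > 0 and sent[start - 1].head == i:
            start -= 1
        chunks.append((" ".join(t.text for t in sent[start:i + 1]), tok.text))
    return chunks

print(subjects_and_objects(SENT))
print(noun_chunks(SENT))
```

The chunker is deliberately naive: it only looks at direct left dependents, so it will miss nested modifiers that a real parser-based chunker would include.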

#### Q 2.2
- Given a dependency tree of a sentence and a segment of that sentence, write the required functions that output the dependency subtree of that segment.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope" (the segment could be any e.g. "saw the man", "a telescope", etc.)*
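
The question leaves some room for interpretation. The sketch below takes one reading: find the tokens of the segment whose head lies outside the segment (its local roots) and return the full subtree under each of them. The parse is hand-written (an assumption, not parser output) and all function names are hypothetical; with spaCy, `span.root` and `token.subtree` expose similar information.

```python
# (text, dependency label, head index); the root points to itself.
# Hand-written parse of the example sentence (assumption, not parser output).
SENT = [
    ("I", "nsubj", 1),
    ("saw", "ROOT", 1),
    ("the", "det", 3),
    ("man", "dobj", 1),
    ("with", "prep", 1),
    ("a", "det", 6),
    ("telescope", "pobj", 4),
]

def find_segment(sent, segment):
    """Return (start, end) token indices of the first match of `segment`."""
    words = segment.split()
    texts = [t[0] for t in sent]
    for start in range(len(texts) - len(words) + 1):
        if texts[start:start + len(words)] == words:
            return start, start + len(words)
    raise ValueError(f"segment not found: {segment!r}")

def subtree(sent, root):
    """Indices of `root` and all of its descendants, in sentence order."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for i, (_, _, head) in enumerate(sent):
            if head in nodes and i not in nodes:
                nodes.add(i)
                changed = True
    return sorted(nodes)

def segment_subtree(sent, segment):
    """Map each local root of the segment to the words of its subtree."""
    start, end = find_segment(sent, segment)
    roots = [i for i in range(start, end)
             if sent[i][2] == i or not (start <= sent[i][2] < end)]
    return {sent[r][0]: [sent[i][0] for i in subtree(sent, r)] for r in roots}

print(segment_subtree(SENT, "a telescope"))
print(segment_subtree(SENT, "saw the man"))
```

Because "saw" is the root of the sentence in this parse, the second call returns the whole sentence as its subtree; restricting the result to tokens inside the segment would be an equally defensible reading.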

#### Q 2.3
- Given a token in a sentence, write the required functions that output the dependency path from the root of the dependency tree to that given token.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*
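
A minimal sketch of the root-to-token path follows, again on a hand-written parse (an assumption, not parser output). The idea is to climb from the token to the root through the head links and then reverse the result. The function name `path_from_root` is hypothetical; with spaCy the same climb can be done with `token.head`, stopping when a token is its own head.

```python
# (text, dependency label, head index); the root points to itself.
# Hand-written parse of the example sentence (assumption, not parser output).
SENT = [
    ("I", "nsubj", 1),
    ("saw", "ROOT", 1),
    ("the", "det", 3),
    ("man", "dobj", 1),
    ("with", "prep", 1),
    ("a", "det", 6),
    ("telescope", "pobj", 4),
]

def path_from_root(sent, index):
    """Return the (word, label) pairs on the path from the root to `index`."""
    path = []
    while True:
        text, dep, head = sent[index]
        path.append((text, dep))
        if head == index:  # reached the root
            break
        index = head
    return path[::-1]

for i, (text, _, _) in enumerate(SENT):
    steps = " -> ".join(f"{w} ({d})" for w, d in path_from_root(SENT, i))
    print(f"{text}: {steps}")
```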

### Task 3: Named Entity Recognition
*Suggestion: use scikit-learn's metric functions; see `classification_report`*

#### Q 3.1
- Benchmark spaCy's Named Entity Recognition model on the test set by:
    1. Providing the list of categories in the dataset (person, organization, etc.)
    2. Computing the overall accuracy on NER
    3. Computing the performance of the Named Entity Recognition model for each category:
        - Compute the performance at the token level (e.g. B-Person, I-Person, B-Organization, I-Organization, O, etc.)
        - Compute the performance at the entity level (e.g. Person, Organization, etc.)
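
The difference between the two evaluation levels can be sketched in plain Python on toy tag sequences (invented for illustration, not model output). Token-level scoring compares tags one by one, while the entity-level scoring assumed here counts a prediction as correct only when both type and boundaries match exactly. All function names are hypothetical. In practice, the token-level report can come from scikit-learn's `classification_report`, and spaCy's predicted labels would likely first need to be mapped to the dataset's categories and aligned with its tokenization.

```python
def extract_entities(tags):
    """Convert an IOB tag sequence into a set of (type, start, end) spans.
    An I- tag with no matching open span leniently starts a new span."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def token_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def entity_scores(gold, pred):
    """Exact-match (precision, recall, F1) per entity type."""
    g, p = extract_entities(gold), extract_entities(pred)
    scores = {}
    for etype in sorted({e[0] for e in g | p}):
        g_t = {e for e in g if e[0] == etype}
        p_t = {e for e in p if e[0] == etype}
        tp = len(g_t & p_t)
        prec = tp / len(p_t) if p_t else 0.0
        rec = tp / len(g_t) if g_t else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[etype] = (prec, rec, f1)
    return scores

# Toy sequences: the prediction truncates the two-token ORG entity
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "O"]

print("Token accuracy: {:.4f}".format(token_accuracy(gold, pred)))
for etype, (prec, rec, f1) in entity_scores(gold, pred).items():
    print("{}: P={:.2f} R={:.2f} F1={:.2f}".format(etype, prec, rec, f1))
```

In this toy case six of seven tags are correct at the token level, yet the ORG entity scores zero at the entity level because its boundary is wrong, which is why the two levels are reported separately.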

### Task 4: BONUS PART (extra mark for laude)

#### Q 4.1
- Modify the NLTK Transition parser's Configuration class to use better features.

#### Q 4.2
- Evaluate the new features by comparing their performance with that of the original ones.

#### Q 4.3
- Replace SVM classifier with an alternative of your choice.