# NLU: Mid-Term Assignment 2022
### Description
In this notebook, we ask you to complete four main tasks to show what you have learnt during the NLU labs. To complete the assignment, please refer to the concepts, libraries and other materials shown and used during the labs. The last task is not mandatory: it is a *BONUS* to get an extra mark for the laude.

### Instructions
- **Dataset**: in this notebook, you are asked to work with the *CoNLL 2003* dataset provided by us in the *data* folder. Please load the files from the *data* folder and **do not** change the names or paths of the inner files.
- **Output**: for each part of your task, print your results and leave them in the notebook. Please **do not** send a Jupyter notebook without the printed outputs.
- **Other**: follow carefully all the further instructions and suggestions given in the question descriptions.

### Deadline
The deadline is two weeks from the project presentation. Please refer to the *piazza* channel for the exact date.

### Task 1: Analysis of the dataset

#### Q 1.1
- Create the Vocabulary and Frequency Dictionary of the:
    1. Whole dataset
    2. Train set
    3. Test set
    
**Attention**: print the first 20 words of the Dictionary of each set

##### <p style='color:lightskyblue'> Q 1.1 Development </p>
<p style='color:lightskyblue'>To produce a Vocabulary and a Frequency Dictionary we will use the <code>nltk</code> and <code>spaCy</code> libraries.</p>
<p style='color:lightskyblue'>The former is used to load <a href="data/test.txt">Test</a>, <a href="data/train.txt">Train</a>, and <a href="data/valid.txt">Validation</a> datasets.</p>

<p style='color:lightskyblue'>
    For the <b>tokenization</b> task we use the <code>nltk.word_tokenize</code> method, which splits the raw text into tokens (words).
    To create the vocabulary, we lowercase each token and collect the results in a <code>set</code>, which keeps only unique values. Because a set is unordered, the 20 words printed below are an arbitrary sample of the vocabulary rather than the first 20 words of the text.
</p>
<p style='color:lightskyblue'>
    For the <b>frequency</b> task we use the <code>nltk.FreqDist</code> class, which gives us a dictionary mapping each word to the number of times it appears in the input text. We then display the <i>top-20</i> words by frequency for each text file. To do this we use a modified version of the <code>nbest</code> function provided in the first lab session, available in the <a href="utils.py">utils</a> file.
</p>
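
The `utils.py` file is not shown in this notebook, so the following is only a minimal sketch of what an `nbest`-style helper might look like, assuming it takes a frequency dictionary and an `n` keyword as in the cells below. It uses `collections.Counter` in place of `nltk.FreqDist` (which is a `Counter` subclass) and a toy sentence instead of the dataset files.

```python
from collections import Counter


def nbest(freq_dict, n=1):
    """Return the n most frequent entries of freq_dict as a dict, highest count first.

    Hypothetical stand-in for the helper in utils.py, which is not shown here.
    """
    ranked = sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Toy input standing in for the tokenized dataset
tokens = "the cat sat on the mat near the cat".split()

# Vocabulary: unique lowercased tokens
vocabulary = {token.lower() for token in tokens}

# Frequency dictionary: token -> number of occurrences
freq_dict = Counter(tokens)

top_2 = nbest(freq_dict, n=2)
print(top_2)
```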


In [2]:
# IMPORTS
import nltk
import spacy
import matplotlib.pyplot as plt
import pandas as pd
# Stopwords
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP_WORDS

from utils import nbest

In [3]:
%%capture
# IMPORTING DATASETS
raw_test = nltk.load('./data/test.txt')
raw_train = nltk.load('./data/train.txt')
raw_val = nltk.load('./data/valid.txt')

<p style='color:lightskyblue'>Test</p>

In [4]:
# Tokenization
test_words = nltk.word_tokenize(raw_test)

test_vocabulary = set([ word.lower() for word in test_words])
# printing first 20 words in test Vocab
list(test_vocabulary)[:20]

['franco',
 'spots',
 'advance',
 'kong',
 'affecting',
 'try',
 'subject',
 'largest',
 'first-class',
 'north',
 'morale-boosting',
 'safe',
 'waited',
 'urska',
 'above',
 'iseas',
 'megan',
 'sergio',
 'hindu',
 'start']

In [None]:
# Frequency Dictionary from raw text
test_freq_dict = nltk.FreqDist(test_words)

top_20_test = nbest(test_freq_dict, n=20)
top_20_test


<p style='color:lightskyblue'>Train</p>

In [32]:
# TRAIN

# Tokenization
train_words = nltk.word_tokenize(raw_train)

train_vocab = set([ word.lower() for word in train_words])
# printing first 20 words in train Vocab
list(train_vocab)[:20]


['employer',
 'tutsis',
 '25-30',
 'reid',
 'peso',
 'uag',
 'col',
 'housing',
 'detain',
 'volgograd',
 '369.00',
 'as-safir',
 'watkinson',
 'budge',
 'kapoor',
 'please',
 'jonathon',
 'sutjeska',
 'steptoe',
 'lasted']

In [None]:
# Frequency Dictionary from raw text
train_freq_dict = nltk.FreqDist(train_words)

top_20_train = nbest(train_freq_dict, n=20)
top_20_train


<p style='color:lightskyblue'>Validation</p>

In [None]:
# VALIDATION

# Tokenization
val_sents = nltk.sent_tokenize(raw_val)
val_words = nltk.word_tokenize(raw_val)

val_vocab = set([ word.lower() for word in val_words])
# printing first 20 words in val Vocab
list(val_vocab)[:20]


In [None]:
# Frequency Dictionary from raw text
val_freq_dict = nltk.FreqDist(val_words)

top_20_val = nbest(val_freq_dict, n=20)
top_20_val


<p style="color:lightskyblue">Here we repeat the same <b>tokenization</b> and <b>frequency-list</b> tasks using the <code>spaCy</code> library.</p>

In [None]:
%%capture
# USING SPACY

# Tokenization using SpaCy
nlp = spacy.load("en_core_web_sm")

test_words = nlp(raw_test)
test_vocab = set([ token.text for token in test_words ])
list(test_vocab)[:20]

In [65]:
# Tokenization using SpaCy
train_words = nlp(raw_train)
train_vocab = set([ token.text for token in train_words])
list(train_vocab)[:20]

821087

#### Q 1.2
- Obtain the list of:
    1. Out-Of-Vocabulary (OOV) tokens
    2. Overlapping tokens between train and test sets  

##### <p style='color:lightskyblue'> Q 1.2 Development </p>
<p style='color:lightskyblue'>
    We treat as <strong>OOVs</strong> tokens such as <em>punctuation</em> and <em>words containing numbers</em>.
    For convenience, both <code>spaCy</code> and <code>nltk</code> provide a list of stopwords; the cell below loads both lists.
</p>
<p style='color:lightskyblue'>
    In the cells above, we tried to use spaCy's <code>.is_oov</code> token attribute, but the result is ambiguous. On one hand, it flags tokens like <code>-DOCSTART-</code>, <code>-X-</code>, <code>NN</code>, etc., which are not meaningful and hence not desirable training inputs. On the other hand, it also flags meaningful words such as <code>SOCCER</code>, <code>JAPAN</code>, etc.
</p>

In [None]:
%%capture
# Lists of Stopwords
NLTK_STOP_WORDS = set(stopwords.words('english'))
SPACY_STOP_WORDS

In [None]:
# Word Overlap between Test and Train sets
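
A common alternative reading of OOV, assumed in the sketch below, is "tokens that appear in the test vocabulary but never in the train vocabulary"; the overlap is then the intersection of the two vocabularies. The function name `oov_and_overlap` and the toy token lists are illustrative only and are not part of the notebook.

```python
def oov_and_overlap(train_tokens, test_tokens):
    """Return (oov, overlap) as sets of lowercased tokens.

    oov: test tokens never seen in train (assumed definition of OOV).
    overlap: tokens present in both train and test.
    """
    train_vocab = {token.lower() for token in train_tokens}
    test_vocab = {token.lower() for token in test_tokens}
    oov = test_vocab - train_vocab
    overlap = test_vocab & train_vocab
    return oov, overlap


# Toy token lists standing in for the tokenized train and test sets
oov, overlap = oov_and_overlap(
    ["Japan", "won", "the", "match"],
    ["Syria", "lost", "the", "match"],
)
print(sorted(oov))
print(sorted(overlap))
```

With the notebook's variables, the same function could be called on the train and test token lists produced by `nltk.word_tokenize`.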


#### Q 1.3
- Perform a complete data analysis of the whole dataset (train + test sets) to obtain:
    1. Average sentence length computed in number of tokens
    2. The 50 most-common tokens
    3. Number of sentences

In [6]:
# Loading sentences
def load_as_sents(path):
    # One list element per line of the file, with surrounding whitespace stripped
    with open(path, 'r') as f:
        sents = [line.strip() for line in f]
    return sents

test_sents = load_as_sents('./data/test.txt')
train_sents = load_as_sents('./data/train.txt')
dataset = train_sents + test_sents

In [8]:
# Average sentence length computed in number of tokens 
from statistics import mean
def tokens_in_sent(text_sents):
    '''
    Computes the average sentence length, in tokens per sentence, of :param text_sents:.
    Params:
    - :param text_sents: list of sentences.
    Returns:
    - the mean number of space-separated tokens per element of :param text_sents:.
    '''
    sent_len = []
    for sent in text_sents:
        sent_ = sent.split(' ')
        sent_len.append(len(sent_))
    return mean(sent_len)

words_per_sent = tokens_in_sent(dataset)
round(words_per_sent,2)

3.79

In [None]:
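# The 50 most-common tokens
dataset_freq_dict = nltk.FreqDist(token for sent in dataset for token in sent.split())
top_50_dataset = nbest(dataset_freq_dict, n=50)
top_50_dataset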


In [102]:
# Number of sentences
len(dataset)

269904
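
The tokens `-DOCSTART-`, `-X-` and `NN` mentioned in Q 1.2 suggest that the files use the CoNLL column layout: one token per line followed by its tags, with a blank line between sentences. Under that assumption, the loader above counts lines rather than sentences. The sketch below shows one possible way to group lines into sentences before computing the statistics; `read_conll_sentences` and the toy sample are illustrative and not part of the notebook.

```python
from statistics import mean


def read_conll_sentences(lines):
    """Group CoNLL-style lines into sentences (lists of tokens).

    Assumes the token is the first whitespace-separated column, that blank
    lines separate sentences, and that -DOCSTART- lines are document markers.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split()[0])
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


# Toy sample in the assumed column layout: token, POS, chunk, NER tag
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll_sentences(sample)
num_sentences = len(sentences)
avg_length = mean(len(sentence) for sentence in sentences)
print(num_sentences, avg_length)
```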

#### Q 1.4
- Create the dictionary of Named Entities and their Frequencies for the:
    1. Whole dataset
    2. Train set
    3. Test set
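
One possible reading of this question is to count how many entities of each category occur, using the NER tag column of the dataset. The sketch below assumes BIO-style tags (`B-PER`, `I-PER`, `O`, ...) and counts one entity each time a new span starts; `entity_label_counts` is a hypothetical helper and the tag list is a toy example.

```python
from collections import Counter


def entity_label_counts(tags):
    """Count entities per label in one sentence's BIO tag sequence.

    A new entity starts at every B- tag, and also at an I- tag whose label
    differs from the previous tag's label (covers the IOB1 variant).
    """
    counts = Counter()
    previous = "O"
    for tag in tags:
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and previous[2:] != tag[2:]
        )
        if starts_entity:
            counts[tag[2:]] += 1
        previous = tag
    return counts


# Toy tag sequence: three entities (ORG, MISC, and a two-token PER)
counts = entity_label_counts(["B-ORG", "O", "B-MISC", "O", "B-PER", "I-PER"])
print(dict(counts))
```

Calling the function once per sentence and summing the resulting `Counter` objects would give the frequencies for a whole split.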

### Task 2: Working with Dependency Tree
*Suggestion: use the spaCy pipeline to retrieve the Dependency Tree*


#### Q 2.1
- Given each sentence in the dataset, write the required functions to provide:
    1. Subject, objects (direct and indirect)
    2. Noun chunks
    3. The head noun in each noun chunk
    
**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*

#### Q 2.2
- Given a dependency tree of a sentence and a segment of that sentence, write the required functions that output the dependency subtree of that segment.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope" (the segment could be any e.g. "saw the man", "a telescope", etc.)*
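
As a library-independent illustration of the subtree logic, the sketch below represents a parse as a list of head indices, where the root points to itself (the convention spaCy uses for `token.head`). The parse of the example sentence is hand-written and assumed, not produced by a model; with spaCy the list could be built as `[token.head.i for token in doc]`.

```python
def subtree(heads, index):
    """Return the sorted indices of the token at `index` and all its descendants."""
    # Build a head -> children map, skipping the root's self-loop
    children = {}
    for child, head in enumerate(heads):
        if child != head:
            children.setdefault(head, []).append(child)
    result, stack = [], [index]
    while stack:
        node = stack.pop()
        result.append(node)
        stack.extend(children.get(node, []))
    return sorted(result)


tokens = ["I", "saw", "the", "man", "with", "a", "telescope"]
# Assumed hand-written parse: "saw" is the root, "with" attaches to "saw"
heads = [1, 1, 3, 1, 1, 6, 4]

# Subtree of each token in the segment "a telescope"
segment = [5, 6]
segment_subtrees = {tokens[i]: [tokens[j] for j in subtree(heads, i)] for i in segment}
print(segment_subtrees)
```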

#### Q 2.3
- Given a token in a sentence, write the required functions that output the dependency path from the root of the dependency tree to that given token.

**Attention**: *print only the results of these functions by using the sentence "I saw the man with a telescope"*
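
The path from the root to a token can be found by following head links upward from the token and reversing the result. The sketch below again uses a hand-written, assumed list of head indices in which the root points to itself; `path_from_root` is an illustrative name.

```python
def path_from_root(heads, index):
    """Return token indices on the path from the root down to `index`."""
    path = [index]
    # Climb towards the root; the root is the token that is its own head
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path[::-1]


tokens = ["I", "saw", "the", "man", "with", "a", "telescope"]
# Assumed hand-written parse: "saw" is the root, "with" attaches to "saw"
heads = [1, 1, 3, 1, 1, 6, 4]

path = [tokens[i] for i in path_from_root(heads, tokens.index("telescope"))]
print(" -> ".join(path))
```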

### Task 3: Named Entity Recognition
*Suggestion: use scikit-learn metric functions. See classification_report*

#### Q 3.1
- Benchmark the spaCy Named Entity Recognition model on the test set by:
    1. Providing the list of categories in the dataset (person, organization, etc.)
    2. Computing the overall accuracy on NER
    3. Computing the performance of the Named Entity Recognition model for each category:
        - Compute the performance at the token level (eg. B-Person, I-Person, B-Organization, I-Organization, O, etc.)
        - Compute the performance at the entity level (eg. Person, Organization, etc.)
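
Token-level scores can be computed directly on the tag sequences, while entity-level scores first require grouping tags into spans. The sketch below shows one way to do that grouping and to compare gold and predicted spans by exact match. It assumes BIO-style tags and that predicted labels have already been mapped to the dataset's label set; the helper names and toy sequences are illustrative.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside and label is not None:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def span_precision_recall(gold_tags, pred_tags):
    """Exact-match precision and recall over entity spans."""
    gold, pred = set(bio_to_spans(gold_tags)), set(bio_to_spans(pred_tags))
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
precision, recall = span_precision_recall(gold, pred)
print(bio_to_spans(gold), precision, recall)
```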

### Task 4: BONUS PART (extra mark for laude)

#### Q 4.1
- Modify NLTK Transition parser's Configuration class to use better features.

#### Q 4.2
- Evaluate the features comparing performance to the original.

#### Q 4.3
- Replace SVM classifier with an alternative of your choice.