# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubts, questions, or issues, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

In this assignment, you will address the task of part-of-speech (POS) tagging.

<center>
    <img src="images/pos_tagging.png" alt="POS tagging" />
</center>

# Solution

### 0.1 Imports

In [1]:
import importlib

packages = ['keras_preprocessing']

for p in packages:
  try:
      importlib.import_module(p)
  except ImportError:
      !pip install {p}

Collecting keras_preprocessing
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Installing collected packages: keras_preprocessing
Successfully installed keras_preprocessing-1.1.2


In [8]:
import gc
import numpy as np
import os
import pandas as pd
import random
import re
import urllib.request
import zipfile

import keras
import plotly
import plotly.express as px
import plotly.graph_objs as go
import progressbar
import tensorflow as tf
from IPython.display import display_html
from collections import defaultdict, OrderedDict
from itertools import chain, cycle
from sklearn.metrics import f1_score, classification_report
from tabulate import tabulate

import nltk
from keras import backend as K
from keras.layers import Input, Dense, LSTM, InputLayer, Bidirectional, TimeDistributed, Embedding, Activation
from keras.models import Sequential
from keras.optimizers import Adam
from keras_preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

In [9]:
plotly.offline.init_notebook_mode(connected=True)

### 0.2 Miscellaneous

In [10]:
pbar = None

def show_progress(block_num, block_size, total_size):
    """
    Displays a progress bar to track the download progress of a file.

    Parameters:
        block_num (int): The current block number being downloaded.
        block_size (int): The size of each block.
        total_size (int): The total size of the file being downloaded.

    Returns:
        None
    """
    global pbar
    if pbar is None:
        pbar = progressbar.ProgressBar(maxval=total_size)
        pbar.start()

    downloaded = block_num * block_size
    if downloaded < total_size:
        pbar.update(downloaded)
    else:
        pbar.finish()
        pbar = None

def set_reproducibility(seed):
    """
    Sets the random seed for reproducibility.

    Parameters:
        seed (int): The seed value to set.

    Returns:
        None
    """
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

def display_df(*args, titles=cycle([''])):
    """
    Displays multiple pandas DataFrames side by side.

    Parameters:
        *args: A variable number of pandas DataFrames to display.
        titles (list): A list of titles for the DataFrames.

    Returns:
        None
    """
    html_str=''
    for df, title in zip(args, chain(titles, cycle(['</br>']))):
        html_str+='<th style="text-align:left"><td style="vertical-align:top">'
        html_str+=f'<h4 style="text-align: left;">{title}</h4>'
        html_str+=df.to_html().replace('table', 'table style="display:inline"')
        html_str+='</td></th>'
    display_html(html_str, raw=True)

def tags_mismatch(tags1, tags2, tags3, name1, name2, name3):
    """
    Prints information about tag mismatches between different sets.

    Parameters:
        tags1 (list): Tags in the first set.
        tags2 (list): Tags in the second set.
        tags3 (list): Tags in the third set.
        name1 (str): Name of the first set.
        name2 (str): Name of the second set.
        name3 (str): Name of the third set.

    Returns:
        None
    """
    print(f'{name1} tags number: {len(tags1)}')
    print(f'{name1} tags list: {tags1}')

    exceeding_validation = [el for el in tags1 if el not in tags2]
    if exceeding_validation:
        print(f'\tClasses in {name1} set for which there are no samples in {name2} set: {exceeding_validation}')

    exceeding_test = [el for el in tags1 if el not in tags3]
    if exceeding_test:
        print(f'\tClasses in {name1} set for which there are no samples in {name3} set: {exceeding_test}\n')

In [11]:
def plot_value_counts(df, key, name):
    """
    Plots a histogram to visualize the occurrences of words by tag.

    Parameters:
        df (pandas DataFrame): The DataFrame containing the data.
        key (str): The column key for which to plot the histogram.
        name (str): Name of the dataset.

    Returns:
        None
    """
    values = df[key].value_counts()
    fig = px.bar(x=values.index, y=values.values)
    fig.update_layout(xaxis_title=key,
                      yaxis_title='Occurrences of words',
                      title=f'{name} set words per tag')
    fig.show()

def plot_tag_distribution(tag_lists, name):
    """
    Plots the tag distribution per sentence.

    Parameters:
        tag_lists (list): List of lists containing tags for each sentence.
        name (str): Name of the dataset.

    Returns:
        None
    """
    tag_counts = []
    for tags in tag_lists:
        tag_dict = {}
        for tag in tags:
            if tag in tag_dict:
                tag_dict[tag] += 1
            else:
                tag_dict[tag] = 1
        tag_counts.append(tag_dict)

    df = pd.DataFrame(tag_counts)
    df = df.fillna(0)
    df = df.apply(lambda x: x / sum(x) * 100)

    fig = px.line(df, title=f'Tag Distribution per {name} Sentence')
    fig.show()

def plot_train(baseline_model_recaps, add_lstm_model_recaps, add_fc_model_recaps, metric, epochs=50):
    """
    Plots the training metrics for different models.

    Parameters:
        baseline_model_recaps (list): List of dictionaries containing baseline model recaps.
        add_lstm_model_recaps (list): List of dictionaries containing additional LSTM model recaps.
        add_fc_model_recaps (list): List of dictionaries containing additional FC model recaps.
        metric (str): The metric to plot.
        epochs (int): Maximum number of epochs to plot.

    Returns:
        None
    """
    fig = go.Figure()
    for i, model_recaps in enumerate([baseline_model_recaps, add_lstm_model_recaps, add_fc_model_recaps]):
        for j, m in enumerate(model_recaps):
            # Early stopping can end a run before `epochs`, so the x-axis follows each history's length
            values = m['history'].history[metric][:epochs]
            x = np.arange(1, len(values) + 1)
            fig.add_trace(go.Scatter(x=x, y=values, name=m['name'], mode='lines+markers',
                                     marker=dict(color=f'rgb({i * 90}, {j * 90}, {i * 90})')))
    fig.update_layout(title=f'{metric} during training', height=750)
    fig.show()

def plot_vocab_pie(counts_sum):
    """
    Plots a pie chart to visualize the distribution of in-vocabulary (IV) and out-of-vocabulary (OOV) words.

    Parameters:
        counts_sum (list): A list containing the counts of OOV and IV words, in that order.

    Returns:
        None
    """
    fig = px.pie(values=counts_sum, names=['OOV words', 'IV words'])
    fig.show()

def plot_sentence_lengths(lengths):
    """
    Plots a boxplot of the lengths of sentences.

    Parameters:
        lengths (list): List containing the lengths of sentences.

    Returns:
        None
    """
    fig = px.box(lengths)
    fig.update_layout(xaxis_title='',
                      yaxis_title='',
                      title='Words per sentence')
    fig.show()

def configure_plotly_browser_state():
    """
    Configures Plotly to display graphs in Colab.
    """
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column, use **only** the words/symbols and their POS label.

### Example

```
Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```

### Splits

The corpus contains 200 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it in training, validation, and test sets.

In [12]:
# Download the dataset
nltk.download('treebank')

# Download the GloVe embeddings file
url = 'http://nlp.stanford.edu/data/glove.6B.zip'
urllib.request.urlretrieve(url, 'glove.6B.zip', show_progress)

# Extract the zip file (the context manager closes it afterwards)
with zipfile.ZipFile('glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall()

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
100% (862182613 of 862182613) |##########| Elapsed Time: 0:02:39 Time:  0:02:39


In [13]:
# Get the files' list
fileids = nltk.corpus.treebank.fileids()

# Get the Penn Treebank tagged sentences
train_corpus = nltk.corpus.treebank.tagged_sents(fileids[:100])
val_corpus = nltk.corpus.treebank.tagged_sents(fileids[100:150])
test_corpus = nltk.corpus.treebank.tagged_sents(fileids[150:])

# Flatten the lists
train_corpus = [tuple(list(item)+[str(idx)]) for idx,sublist in enumerate(train_corpus) for item in sublist]
val_corpus = [tuple(list(item)+[str(idx)]) for idx,sublist in enumerate(val_corpus) for item in sublist]
test_corpus = [tuple(list(item)+[str(idx)]) for idx,sublist in enumerate(test_corpus) for item in sublist]

# Create the Dataframes
train_df = pd.DataFrame(train_corpus, columns = ['word', 'tag', 'sentence'])
val_df = pd.DataFrame(val_corpus, columns = ['word', 'tag', 'sentence'])
test_df = pd.DataFrame(test_corpus, columns = ['word', 'tag', 'sentence'])

# Summary of the created Dataframes
display_df(train_df.describe(), val_df.describe(), test_df.describe(), titles = [f'Training set {train_df.shape}', f'Validation set {val_df.shape}', f'Test set {test_df.shape}'])

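Training set (50748, 3)

| | word | tag | sentence |
|---|---|---|---|
| count | 50748 | 50748 | 50748 |
| unique | 8442 | 46 | 1963 |
| top | `,` | NN | 1854 |
| freq | 2570 | 6270 | 271 |

Validation set (33260, 3)

| | word | tag | sentence |
|---|---|---|---|
| count | 33260 | 33260 | 33260 |
| unique | 6068 | 45 | 1299 |
| top | `,` | NN | 339 |
| freq | 1528 | 4513 | 90 |

Test set (16668, 3)

| | word | tag | sentence |
|---|---|---|---|
| count | 16668 | 16668 | 16668 |
| unique | 3648 | 41 | 652 |
| top | `,` | NN | 232 |
| freq | 787 | 2383 | 64 |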


# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**

### 2.1 Pre-processing

In the NLTK sample of the Penn Treebank, the `-NONE-` tag marks empty elements (traces such as `*T*-1` or `*-1`) that the treebank annotation inserts into the sentence; they are not words that actually occur in the text. Removing these occurrences is useful for POS tagging because it reduces noise in the data: the model no longer has to fit tokens that carry no part of speech and can focus on the relationships between real words and their tags.
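
As a small illustration of the filtering step, the sketch below drops the `-NONE-` tokens from one tagged sentence. The sentence and the helper name `drop_empty_elements` are made up for this example; only the `(word, tag)` tuple format mirrors what `tagged_sents()` returns.

```python
# Toy tagged sentence in the (word, tag) format returned by NLTK's tagged_sents().
# The sentence itself is invented for illustration.
tagged_sentence = [
    ("The", "DT"), ("plan", "NN"), ("*T*-1", "-NONE-"),
    ("was", "VBD"), ("approved", "VBN"), ("*-2", "-NONE-"), (".", "."),
]

def drop_empty_elements(sentence, empty_tag="-NONE-"):
    """Return the sentence without the tokens tagged as empty elements."""
    return [(word, tag) for word, tag in sentence if tag != empty_tag]

filtered = drop_empty_elements(tagged_sentence)
print(filtered)
```

Every real word and punctuation mark survives the filter; only the annotation-level placeholders are removed.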

In [14]:
# Get the Penn Treebank tagged sentences
train_corpus = nltk.corpus.treebank.tagged_sents(fileids[:100])
val_corpus = nltk.corpus.treebank.tagged_sents(fileids[100:150])
test_corpus = nltk.corpus.treebank.tagged_sents(fileids[150:])

# Flatten the lists
train_corpus = [tuple(list(item)+[str(idx)]) for idx,sublist in enumerate(train_corpus) for item in sublist if item[1] != '-NONE-']
val_corpus = [tuple(list(item)+[str(idx)]) for idx,sublist in enumerate(val_corpus) for item in sublist if item[1] != '-NONE-']
test_corpus = [tuple(list(item)+[str(idx)]) for idx,sublist in enumerate(test_corpus) for item in sublist if item[1] != '-NONE-']

# Create the Dataframes
train_df = pd.DataFrame(train_corpus, columns = ['word', 'tag', 'sentence'])
val_df = pd.DataFrame(val_corpus, columns = ['word', 'tag', 'sentence'])
test_df = pd.DataFrame(test_corpus, columns = ['word', 'tag', 'sentence'])

# Summary of the created Dataframes
display_df(train_df.describe(), val_df.describe(), test_df.describe(), titles = [f'Training set {train_df.shape}', f'Validation set {val_df.shape}', f'Test set {test_df.shape}'])

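Training set (47356, 3)

| | word | tag | sentence |
|---|---|---|---|
| count | 47356 | 47356 | 47356 |
| unique | 8009 | 45 | 1963 |
| top | `,` | NN | 1854 |
| freq | 2570 | 6270 | 249 |

Validation set (31183, 3)

| | word | tag | sentence |
|---|---|---|---|
| count | 31183 | 31183 | 31183 |
| unique | 5892 | 44 | 1299 |
| top | `,` | NN | 339 |
| freq | 1528 | 4513 | 81 |

Test set (15545, 3)

| | word | tag | sentence |
|---|---|---|---|
| count | 15545 | 15545 | 15545 |
| unique | 3623 | 40 | 652 |
| top | `,` | NN | 232 |
| freq | 787 | 2383 | 58 |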


The three sets differ in size: the training set has the most tokens and unique words, and the test set has the fewest.

In every set the most frequent word is `,` and the most frequent tag is `NN` (noun, singular or mass). Nouns are therefore likely the dominant part of speech in the text; the comma itself will be ignored when computing the final scores.

In [15]:
configure_plotly_browser_state()

# Ordering tags in the sets
tags_train = sorted(list(set([x for x in train_df.tag])))
tags_val = sorted(list(set([x for x in val_df.tag])))
tags_test = sorted(list(set([x for x in test_df.tag])))

max_tags_list = max([len(tags_train),len(tags_val),len(tags_test)])

# Training set tags list
tags_mismatch(tags_train,tags_val,tags_test,'Training','Validation','Test')

# Validation set tags list
tags_mismatch(tags_val,tags_train,tags_test,'Validation','Training','Test')

# Test set tags list
tags_mismatch(tags_test,tags_train,tags_val,'Test','Training','Validation')

# Histograms of word occurrences per tag
plot_value_counts(train_df, 'tag', 'Training')
plot_value_counts(val_df, 'tag', 'Validation')
plot_value_counts(test_df, 'tag', 'Test')

Training tags number: 45
Training tags list: ['#', '$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
	Classes in Training set for which there are no samples in Validation set: ['SYM']
	Classes in Training set for which there are no samples in Test set: ['#', 'FW', 'LS', 'SYM', 'UH']

Validation tags number: 44
Validation tags list: ['#', '$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
	Classes in Validation set for which there are no samples in Test set: ['#', 'FW', 'LS', 'UH']

Test tags number: 40
Test tags list: ['$', "'

The most frequent tags are similar across the three sets: singular nouns (`NN`) come first, followed by prepositions (`IN`), determiners (`DT`), and proper nouns (`NNP`). This suggests that the sets share a similar tag distribution.

These distributions will be crucial when setting the weights that counteract the class imbalance of the datasets in the evaluation metrics.
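
One common way to turn tag frequencies into weights is inverse-frequency ("balanced") weighting, where a class seen less often gets a proportionally larger weight. The sketch below is an assumption about how such weights could be derived, not necessarily the scheme used later in this notebook; the function name and the toy tag list are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(tags):
    """Weight each tag by n_samples / (n_classes * tag_count)."""
    counts = Counter(tags)
    total = sum(counts.values())
    n_classes = len(counts)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}

# Toy, heavily imbalanced tag list (invented for illustration)
toy_tags = ["NN"] * 6 + ["IN"] * 3 + ["UH"] * 1
weights = inverse_frequency_weights(toy_tags)

for tag, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{tag}: {weight:.3f}")
```

With this rule a perfectly balanced dataset would give every class a weight of 1, so the weights read as a correction factor relative to the balanced case.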

In [16]:
# Retrieving preprocessed data and grouping it into sentences
X_train_raw = train_df.groupby('sentence').word.apply(list).reset_index()['word']
X_val_raw = val_df.groupby('sentence').word.apply(list).reset_index()['word']
X_test_raw = test_df.groupby('sentence').word.apply(list).reset_index()['word']

y_train_raw = train_df.groupby('sentence').tag.apply(list).reset_index()['tag']
y_val_raw = val_df.groupby('sentence').tag.apply(list).reset_index()['tag']
y_test_raw = test_df.groupby('sentence').tag.apply(list).reset_index()['tag']

In [17]:
configure_plotly_browser_state()

# Plot tag distributions per sentence
plot_tag_distribution(y_train_raw,'Training')
plot_tag_distribution(y_val_raw,'Validation')
plot_tag_distribution(y_test_raw,'Test')

As expected, the per-sentence tag distribution plots show that the minority classes occur in only a few sentences in all three datasets.

### 2.2 Vocabulary and GloVe embedding

GloVe (Global Vectors for Word Representation) is a method for learning vector representations of words, called "word embeddings," from a large corpus of text. Word embeddings are numerical representations of words that capture the semantic relationships between words in a continuous, low-dimensional space. They are commonly used as input to natural language processing models, such as language translation and language modeling.

GloVe works by learning the co-occurrence statistics of words in a corpus, and using this information to learn word embeddings that capture the semantic relationships between words. The GloVe method produces word embeddings that are trained on a global corpus, as opposed to embeddings that are trained on a specific task or dataset.

The `glove.6B` release used here ships 50-, 100-, 200-, and 300-dimensional vectors. Lower-dimensional embeddings are cheaper to store and compute with, while higher-dimensional ones can encode finer distinctions; we use the 100-dimensional vectors.

By using GloVe embeddings as the initial weights for a model, we can take advantage of these pre-trained word representations and fine-tune them for a specific task.
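
To make the idea of "semantic relationships in a vector space" concrete, the sketch below parses a few lines in the GloVe text format (a token followed by its components) and compares words by cosine similarity. The three-dimensional vectors are invented for illustration; real GloVe vectors have 50 to 300 components, and the helper names are hypothetical.

```python
import math

# Made-up lines in the GloVe text format: token, then space-separated components
toy_lines = [
    "cat 0.9 0.1 0.0",
    "dog 0.8 0.2 0.1",
    "car 0.0 0.1 0.9",
]

def parse_glove_line(line):
    """Split one line into (token, vector)."""
    token, *values = line.split()
    return token, [float(v) for v in values]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

embeddings = dict(parse_glove_line(line) for line in toy_lines)
sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat-dog: {sim_cat_dog:.3f}, cat-car: {sim_cat_car:.3f}")
```

In this toy space "cat" sits much closer to "dog" than to "car", which is the kind of structure pre-trained embeddings are expected to provide to the tagger.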

In [18]:
# Use the 100-dimensional GloVe word embeddings
glove_dir = './'

embedding_dim = 100
embedding_dict = {} # initialize dictionary
with open(os.path.join(glove_dir, f'glove.6B.{embedding_dim}d.txt'), encoding="utf8") as f:
    lines = f.readlines()

pbar = progressbar.ProgressBar()
for line in pbar(lines):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embedding_dict[word] = coefs

print('Found %s word vectors.' % len(embedding_dict))

100% (400000 of 400000) |################| Elapsed Time: 0:00:12 Time:  0:00:12


Found 400000 word vectors.


### Vocabulary and OOV handling

To compute the embeddings for out-of-vocabulary (OOV) words, we took the mean of existing embeddings related to the words with the same part of speech (POS) tag and added noise. This approach is based on the assumption that words with similar POS tags are also semantically similar, and therefore, their embeddings should be similar. Taking the mean of the existing embeddings provides a general representation of the semantic space of the words with similar POS tags, and adding noise helps to avoid overfitting by making the embeddings for the OOV words slightly different from each other.

This approach can be effective in some cases, especially if the number of OOV words is small and their semantic similarity to the in-vocabulary (IV) words is high. However, it's important to keep in mind that this may not always be the best option, as the quality of the OOV word embeddings depends on the quality and diversity of the IV word embeddings used to compute the mean. If the IV word embeddings are not representative of the words with similar POS tags, the OOV word embeddings may not be of good quality.
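
The "tag mean plus noise" idea can be sketched on toy data as follows. This is a simplified stand-in that uses only the standard library and two-dimensional vectors; the helper names and values are hypothetical, and the noise range of ±0.05 is taken from the `update_vocab` function below.

```python
import random

def mean_vector(vectors):
    """Component-wise mean of a list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def oov_embedding(tag_mean, rng, noise=0.05):
    """Tag mean plus uniform noise in [-noise, noise] on every component."""
    return [x + rng.uniform(-noise, noise) for x in tag_mean]

# Invented embeddings of three in-vocabulary words tagged NN
known_nn_vectors = [[0.2, 0.4], [0.4, 0.0], [0.6, 0.2]]
nn_mean = mean_vector(known_nn_vectors)

rng = random.Random(42)  # seeded for reproducibility
oov_a = oov_embedding(nn_mean, rng)
oov_b = oov_embedding(nn_mean, rng)
print(nn_mean, oov_a, oov_b)
```

Both OOV vectors stay within 0.05 of the `NN` mean on every component, yet they differ from each other, so two unknown nouns do not collapse onto the same point.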

In [19]:
def mean_embed4tag(df, tags, embedding_dict, embedding_dim):
    """
    Computes embeddings based on the respective tag means.

    Parameters:
        df (pandas DataFrame): The DataFrame containing word-tag pairs.
        tags (list): List of tags to compute mean embeddings for.
        embedding_dict (dict): Dictionary mapping words to their embeddings.
        embedding_dim (int): Dimensionality of the embeddings.

    Returns:
        dict: A dictionary containing mean embeddings for each tag.
    """
    tag_dict = {tag: np.zeros(embedding_dim) for tag in tags}
    tag_count = {tag: 0 for tag in tags}

    # Accumulate the GloVe vectors of the in-vocabulary words seen with each tag
    for _, row in df.iterrows():
        tag, word = row['tag'], row['word'].lower()
        if tag in tag_dict and word in embedding_dict:
            tag_count[tag] += 1
            tag_dict[tag] += embedding_dict[word]

    # Average only the tags that matched at least one word (avoids division by zero)
    for tag in tags:
        if tag_count[tag] > 0:
            tag_dict[tag] = tag_dict[tag] / tag_count[tag]
    print(f'Computed mean embeddings for {len(tags)} tags.')
    return tag_dict

def update_vocab(df, vocabulary, embeddings_dict, mean_embedding_dict, embedding_dim, seed=42):
    """
    Updates the vocabulary with out-of-vocabulary (OOV) words and their embeddings.

    Parameters:
        df (pandas DataFrame): The DataFrame containing word-tag pairs.
        vocabulary (dict): The current vocabulary mapping words to embeddings.
        embeddings_dict (dict): Dictionary mapping words to their GloVe embeddings.
        mean_embedding_dict (dict): Dictionary containing mean embeddings for each tag.
        embedding_dim (int): Dimensionality of the embeddings.
        seed (int): Random seed for generating noise. Default is 42.

    Returns:
        dict: Updated vocabulary with OOV words and respective embeddings.
        list: `[oov_count, 0]`: the number of OOV words added, followed by a placeholder for the IV count.
    """
    oov_c = 0
    np.random.seed(seed)
    for _, row in df.iterrows():
        if row['word'].lower() not in embeddings_dict.keys():
            if row['word'].lower() not in vocabulary.keys():
                oov_c += 1
                noise = np.random.uniform(low=-0.05, high=0.05, size=embedding_dim)
                vocabulary[row['word'].lower()] = mean_embedding_dict[row['tag']] + noise
        else:
            vocabulary[row['word'].lower()] = embeddings_dict[row['word'].lower()]
    counts = [oov_c, 0]
    print(f'Added {oov_c} OOV words + respective embeddings to the vocabulary.')

    return vocabulary, counts

def encode_sentences(raw_sentences, raw_tags, vocab, tags):
    """
    Encodes sentences and tags using the provided vocabulary and tag mapping.

    Parameters:
        raw_sentences (list): List of sentences, where each sentence is a list of words.
        raw_tags (list): List of tag sequences for each sentence.
        vocab (dict): Vocabulary mapping lowercased words to their integer indices.
        tags (dict): Dictionary mapping tags to their indices.

    Returns:
        list: Encoded sentences as lists of word indices.
        list: Encoded tags as lists of tag indices.
    """
    encoded_sentences = []
    encoded_tags = []
    for sentence in raw_sentences:
        sent_int = []
        for word in sentence:
            sent_int.append(vocab[word.lower()])
        encoded_sentences.append(sent_int)

    for sent_tags in raw_tags:
        encoded_tags.append([tags[tag] for tag in sent_tags])

    return encoded_sentences, encoded_tags


In [20]:
# Computing mean embeddings per tag
mean_embedding_dict = mean_embed4tag(train_df, tags_train, embedding_dict, embedding_dim)

Computed mean embeddings for 45 tags.


V1 + Training set OOV (V2)

In [21]:
counts = []
vocabulary = {} #initialize vocabulary

# Computing the embeddings for the OOV words found in training set
vocabulary, counts_1 = update_vocab(train_df,vocabulary,embedding_dict,mean_embedding_dict,embedding_dim)
counts.append(counts_1)

Added 359 OOV words + respective embeddings to the vocabulary.


V2 + Validation set OOV (V3)

In [22]:
# Computing the embeddings for the OOV words found in validation set
vocabulary, counts_2 = update_vocab(val_df,vocabulary,embedding_dict,mean_embedding_dict,embedding_dim)
counts.append(counts_2)

Added 189 OOV words + respective embeddings to the vocabulary.


V3 + Test set OOV (V4)

In [23]:
# Computing the embeddings for the OOV words found in test set
vocabulary, counts_3 = update_vocab(test_df,vocabulary,embedding_dict,mean_embedding_dict,embedding_dim)
counts.append(counts_3)

Added 128 OOV words + respective embeddings to the vocabulary.


In [24]:
configure_plotly_browser_state()

# Building the actual word vocabulary
index2word = OrderedDict()
word2index = OrderedDict()

# Adding the entry for padding
index2word[0] = '-PAD-'
word2index['-PAD-'] = 0

curr_idx = 1
for key in vocabulary.keys():
  word2index[key] = curr_idx
  index2word[curr_idx] = key
  curr_idx += 1

vocab_length = len(word2index)
print(f'[Debug] Index -> Word vocabulary size: {len(index2word)}')
print(f'[Debug] Word -> Index vocabulary size: {len(word2index)}')


counts_sum = [sum(x) for x in zip(*counts)]
counts_sum[1] = vocab_length - counts_sum[0] - counts_sum[1] -1

plot_vocab_pie(counts_sum)

[Debug] Index -> Word vocabulary size: 10948
[Debug] Word -> Index vocabulary size: 10948


In [25]:
# Tag vocabulary
tag2index = OrderedDict()
index2tag = OrderedDict()

# Adding the entry for padding
index2tag[0] = '-PAD-'
tag2index['-PAD-'] = 0

curr_id = 1
for tag in tags_train:
  tag2index[tag] = curr_id
  index2tag[curr_id] = tag
  curr_id += 1

print(f'[Debug] Index -> Tag vocabulary size: {len(index2tag)}')
print(f'[Debug] Tag -> Index vocabulary size: {len(tag2index)}')

# Description dictionary
# Mapping tags to their descriptions
tag2description = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal number',
    'DT': 'Determiner',
    'EX': 'Existential there',
    'FW': 'Foreign word',
    'IN': 'Preposition or subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative',
    'JJS': 'Adjective, superlative',
    'LS': 'List item marker',
    'MD': 'Modal',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun',
    'PRP$': 'Possessive pronoun',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'SYM': 'Symbol',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund or present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, non-3rd person singular present',
    'VBZ': 'Verb, 3rd person singular present',
    'WDT': 'Wh-determiner',
    'WP': 'Wh-pronoun',
    'WP$': 'Possessive wh-pronoun',
    'WRB': 'Wh-adverb',
    ':': ':',
    '#': '#',
    '$': '$',
    '-LRB-': 'Left Round Bracket',
    '-RRB-': 'Right Round Bracket',
    ',': ',',
    '.': '.',
    "''": "''",
    '``': '``',
    '-PAD-': 'Padding'
}

[Debug] Index -> Tag vocabulary size: 46
[Debug] Tag -> Index vocabulary size: 46


In [26]:
# Tokenizing words and tags by their indexes in vocabulary
X_train_np, y_train_np = encode_sentences(X_train_raw,y_train_raw,word2index,tag2index)
X_val_np, y_val_np = encode_sentences(X_val_raw,y_val_raw,word2index,tag2index)
X_test_np, y_test_np = encode_sentences(X_test_raw,y_test_raw,word2index,tag2index)

# Examples
print('-Not encoded')
print('\t',X_train_raw[0])
print('\t',y_train_raw[0])
print('-Encoded')
print('\t',X_train_np[0])
print('\t',y_train_np[0])

-Not encoded
	 ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
	 ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT', 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.']
-Encoded
	 [1, 2, 3, 4, 5, 6, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
	 [21, 21, 4, 10, 23, 15, 4, 19, 35, 11, 20, 14, 11, 15, 20, 21, 10, 7]


In [27]:
configure_plotly_browser_state()

# Checking the lengths of the sentences
lengths = [len(sentence) for sentence in X_train_raw]
lengths.sort()

# Showing a boxplot of the lengths of the sentences
plot_sentence_lengths(lengths)

In [28]:
# Maximum words in a sentence
MAX_LENGTH = lengths[-1]
# Second longest sentence
PAD_LENGTH = lengths[-2]

print(f'Length of longest sentence: {MAX_LENGTH}')
print(f'Second longest sentence length: {PAD_LENGTH}')

# Padding the sequences
X_train = pad_sequences(X_train_np, maxlen=PAD_LENGTH, padding='post')
X_val = pad_sequences(X_val_np, maxlen=PAD_LENGTH, padding='post')
X_test = pad_sequences(X_test_np, maxlen=PAD_LENGTH, padding='post')

y_train = pad_sequences(y_train_np, maxlen=PAD_LENGTH, padding='post')
y_val = pad_sequences(y_val_np, maxlen=PAD_LENGTH, padding='post')
y_test = pad_sequences(y_test_np, maxlen=PAD_LENGTH, padding='post')

print('-Padded')
print('\tX:',X_train[0])
print('\n\ty:',y_train[0])

Length of longest sentence: 249
Second longest sentence length: 114
-Padded
	X: [ 1  2  3  4  5  6  3  7  8  9 10 11 12 13 14 15 16 17  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]

	y: [21 21  4 10 23 15  4 19 35 11 20 14 11 15 20 21 10  7  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
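
The effect of padding to the second-longest length can be sketched in plain Python. The helper below is an illustrative stand-in, not the Keras implementation: it pads at the end (`padding='post'`) and, assuming the Keras default `truncating='pre'` is left unchanged as in the cell above, it drops tokens from the start of any sequence longer than `maxlen`.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Pad at the end; if the sequence is too long, keep only its last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([1, 2, 3], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [1, 2, 3, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

Under that assumption, the single 249-token training sentence is reduced to its last 114 tokens, while every other training sentence is kept whole and padded with zeros. Because words and tags are padded with the same settings, they stay aligned.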


In [29]:
# One hot encoding the sets
y_train_one_hot = to_categorical(y_train, len(tag2index))
y_val_one_hot = to_categorical(y_val, len(tag2index))
y_test_one_hot = to_categorical(y_test, len(tag2index))

In [30]:
#Building the Embedding Layer
embedding_matrix = np.zeros((len(word2index), embedding_dim))
for word, i in word2index.items():
  if word != '-PAD-':
    embedding_vector = vocabulary.get(word)
    embedding_matrix[i] = embedding_vector

In [31]:
if 'embedding_dict' in globals():
    del embedding_dict
    print(f'Garbage Collection: {gc.collect()}')

Garbage Collection: 6799


# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.

The final hyperparameters were primarily selected through a trial-and-error process.
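All of our models use two training callbacks, `EarlyStopping` and `ReduceLROnPlateau`: the first mitigates overfitting by stopping training once the validation loss stops improving, and the second lowers the learning rate when the validation loss plateaus, which helps the optimizer escape stagnation.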
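
The patience rule behind early stopping can be sketched in a few lines. This is a simplified model of the behaviour (it ignores options such as `min_delta`), and the loss values are invented; only the `patience=6` setting matches the callback defined below.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if training runs to the end."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Invented validation losses: the minimum is reached at epoch 3
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=6)
print(stop_epoch, best_epoch)  # 9 3
```

With `restore_best_weights=True`, the model would then be rolled back to the weights of the best epoch (epoch 3 in this toy run) rather than keeping those of the last one.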

In [32]:
# Callbacks
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=6, restore_best_weights=True,verbose=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.33, patience=3, verbose=True, min_lr=0.001)
]

# Embedding layer parameters
embedding_params = {'input_dim': vocab_length,'output_dim': embedding_dim,
                    'embeddings_initializer': tf.keras.initializers.Constant(embedding_matrix)}

### 3.1 Baseline
Bidirectional LSTM layers are able to process sequential data in both the forward and backward directions, which can allow the model to capture contextual information from both the past and the future. This can be particularly useful for natural language processing tasks, where the meaning of a word can depend on the context in which it is used.

In the context of POS tagging, TimeDistributed can be used to apply a tag prediction layer to each word in a sentence. For example, you might have an RNN that processes a sequence of words in a sentence, and at each time step, the RNN outputs a hidden state. You could then apply a TimeDistributed dense layer to the hidden states, which would allow you to predict the POS tag for each word in the sentence.

One advantage of using TimeDistributed for POS tagging is that the same dense layer, with shared weights, is applied to the hidden state of every time step in a single operation, so the tags of all the words in a sentence are predicted in one forward pass rather than with a separate classifier call per word. This can be particularly useful when dealing with long sentences, as it makes the tagging process more efficient.
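The weight sharing described above can be illustrated with a small NumPy sketch. It is not the Keras layer itself: the shapes, the random hidden states and the names `h`, `W` and `b` are assumptions chosen for the example. It shows that applying one dense projection to every time step at once gives the same result as looping over the time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, hidden, n_tags = 2, 5, 8, 4  # illustrative sizes only

h = rng.normal(size=(batch, steps, hidden))  # stand-in for biLSTM outputs
W = rng.normal(size=(hidden, n_tags))        # one weight matrix for all steps
b = np.zeros(n_tags)


def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# All time steps in a single batched operation
probs_all = softmax(h @ W + b)

# The same projection applied one time step at a time
probs_loop = np.stack(
    [softmax(h[:, t] @ W + b) for t in range(steps)], axis=1
)

predicted_tags = probs_all.argmax(axis=-1)  # one tag index per token
```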

Several combinations of hyperparameters were considered during the tuning phase.

The most evident changes in the scores were due to the `units` used in the LSTM layer and the `batch_size`:

*  Generally, a larger number of units in the LSTM layer gives the model more capacity to learn complex representations. However, too many units can lead to overfitting, where the model memorizes the training data instead of learning general patterns, while too few units can lead to underfitting, where the model fails to capture important features of the data.
In our case, reducing the units from 256 to 128 may have helped prevent the model from memorizing the training data, resulting in a model that generalizes better to new data.

*   We observed better results with a batch size of `32`.
An unbalanced dataset like ours can bias the model towards the classes with more samples. A plausible explanation for our result is that smaller batches produce more frequent and noisier weight updates, which can act as a mild regularizer and give rare tags more opportunities to influence training. With very large batches, the contribution of the minority classes to each update may be swamped by that of the frequent ones.
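To give a sense of how the number of units affects model capacity, the sketch below counts the trainable parameters of a standard LSTM (four gates, each with input weights, recurrent weights and a bias). The helper names are hypothetical; the embedding size of 100 and the 46 output classes are taken from the `model.summary()` output shown later in this notebook.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units):
    # A bidirectional wrapper holds one forward and one backward LSTM
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units


baseline_bilstm = bilstm_params(100, 128)  # 128 units, as in the final model
wider_bilstm = bilstm_params(100, 256)     # 256 units, the earlier setting
tag_head = dense_params(2 * 128, 46)       # dense layer on the biLSTM output

print(baseline_bilstm, wider_bilstm, tag_head)
```

The first and last values agree with the parameter counts reported by `model.summary()` for the baseline, while doubling the units roughly triples the size of the recurrent layer.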

In [33]:
bl_lr = 1e-2

bl_layer_params = [{'layer_type': 'Bidirectional',
                          'layer_kwargs': {'units': 128}},
                         {'layer_type': 'Dense',
                          'layer_kwargs':{'units': len(tag2index),'activation': 'softmax'}}]

bl_training_params = {'x': X_train, 'y': y_train_one_hot, 'validation_data': (X_val, y_val_one_hot),
                  'batch_size': 32, 'epochs': 50, 'callbacks': callbacks}

### 3.2 Additional LSTM layer

A second biLSTM layer increases the capacity of the network, allowing it to model more complex dependencies in the data. This can be especially beneficial for tasks like POS tagging, where the relationships between words can span multiple time steps.

However, it's also possible that adding a second biLSTM layer could lead to overfitting, especially if the model is already sufficiently large to model the data. We hence counteracted this by means of dropout and an increased batch size. We did not notice benefits by increasing the number of units.

By adding a dropout layer explicitly, the regularization is applied to the output of the preceding layer, i.e., to the activations of the LSTM layer: during training, each activation is set to zero with a probability given by the dropout rate.

Due to the constraints on the layers in our architecture, a separate Dropout layer was not applicable, so we used the `dropout` parameter of the LSTM layer instead, which applies dropout to the inputs of the LSTM layer. During training, each input connection is set to zero with a probability given by the dropout rate.
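As an illustration of dropout on the layer inputs, here is a minimal NumPy sketch of inverted dropout. It is a simplification and not what Keras executes internally; the function name `input_dropout` and the toy input are assumptions. Kept inputs are rescaled by `1 / (1 - rate)` so that the expected value is unchanged, and nothing is dropped at inference time.

```python
import numpy as np


def input_dropout(x, rate, rng, training=True):
    """Inverted dropout on the inputs (illustrative helper, not Keras code)."""
    if not training or rate == 0.0:
        return x
    # Keep each entry with probability 1 - rate
    keep = rng.random(x.shape) >= rate
    # Rescale the kept entries so the expected value stays the same
    return np.where(keep, x / (1.0 - rate), 0.0)


rng = np.random.default_rng(42)
x = np.ones((1000, 10))  # toy stand-in for a batch of embedded tokens

dropped = input_dropout(x, rate=0.4, rng=rng)             # training mode
unchanged = input_dropout(x, rate=0.4, rng=rng, training=False)

zero_fraction = float((dropped == 0.0).mean())  # should be close to 0.4
```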

The implementation of this regularization technique has proven to be effective in achieving better results, whereas increasing the batch size did not yield significant improvements. The learning rate remains the same as that utilized for the baseline.

In [34]:
add_lstm_lr = 1e-2

add_lstm_layer_params = [{'layer_type': 'Bidirectional',
                          'layer_kwargs': {'units': 128, 'dropout': 0.4}},
                         {'layer_type': 'Bidirectional',
                          'layer_kwargs': {'units': 128}},
                         {'layer_type': 'Dense',
                          'layer_kwargs':{'units': len(tag2index),'activation': 'softmax'}}]

add_lstm_training_params = {'x': X_train, 'y': y_train_one_hot, 'validation_data': (X_val, y_val_one_hot),
                  'batch_size': 32, 'epochs': 50, 'callbacks': callbacks}

### 3.3 Additional dense layer

Using two dense layers, one with a non-linear activation function and one with a softmax activation function, is a common pattern in neural network architectures for classification tasks, because the introduction of non-linearity allows the model to learn more complex patterns in the data and make more accurate predictions.

Introducing dropout improved the results with this architecture as well; again, neither increasing the batch size nor using a smaller learning rate showed significant benefits.

In [35]:
add_fc_lr = 1e-2

add_fc_layer_params = [{'layer_type': 'Bidirectional',
                          'layer_kwargs': {'units': 128, 'dropout': 0.4}},
                         {'layer_type': 'Dense',
                          'layer_kwargs':{'units': PAD_LENGTH,'activation': 'relu'}},
                         {'layer_type': 'Dense',
                          'layer_kwargs':{'units': len(tag2index),'activation': 'softmax'}}]

add_fc_training_params = {'x': X_train, 'y': y_train_one_hot, 'validation_data': (X_val, y_val_one_hot),
                          'batch_size': 32, 'epochs': 50, 'callbacks': callbacks}

# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using macro F1-score, compute over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively)
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)

### Custom metrics

The provided code comprises three functions designed to fulfil the metrics requirements.

The `compute_weights` function assigns each tag a weight equal to one minus its relative frequency in the dataset, so that rare tags contribute more to the weighted accuracy than frequent ones.

The `weighted_masked_accuracy_wrapper` function computes accuracy, considering the previous weights and ignoring specified classes. This function first computes the per-sample accuracy, which is a binary tensor indicating whether the prediction for each sample is correct or not. Then, it multiplies the per-sample accuracy with the weights for the corresponding true class to obtain a weighted per-sample accuracy. Next, it creates a binary ignore mask indicating which samples should be ignored in the computation of the overall accuracy. The mask is initialized as all ones and then updated to exclude the samples with the class labels specified in the classes argument. Finally, it computes the overall weighted accuracy by summing the weighted per-sample accuracy and dividing by the number of non-ignored samples.

Finally, the `evaluate` function computes average evaluation metrics across multiple model runs, including per-class F1 scores and macro F1 score. It is also used to evaluate models on the Test set.
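The steps of the weighted masked accuracy can be followed on a toy example with the NumPy sketch below. It mirrors the logic described above but is not the Keras metric used for training; the function name, the four made-up classes and their weights are assumptions for illustration.

```python
import numpy as np


def weighted_masked_accuracy_np(y_true, y_pred, weights, classes_to_ignore):
    """NumPy analogue of the metric (illustrative, not the training code)."""
    y_true_class = y_true.argmax(axis=-1)
    y_pred_class = y_pred.argmax(axis=-1)

    # 1.0 where the prediction is correct, 0.0 otherwise
    correct = (y_true_class == y_pred_class).astype(float)
    # Scale each correct prediction by the weight of its true class
    weighted = correct * np.asarray(weights)[y_true_class]
    # Mask out tokens whose true class must be ignored
    mask = ~np.isin(y_true_class, classes_to_ignore)

    # Divide by the number of non-ignored tokens (at least 1)
    return float((weighted * mask).sum() / max(int(mask.sum()), 1))


# Toy setup: class 0 is padding, class 3 is a punctuation tag
weights = [0.99, 0.5, 0.8, 0.9]
y_true = np.eye(4)[[1, 2, 3, 0]]  # one-hot ground truth
y_pred = np.eye(4)[[1, 1, 3, 0]]  # one-hot "predictions"

acc = weighted_masked_accuracy_np(y_true, y_pred, weights, [0, 3])
```

Only the first two tokens count here: the first is correct and contributes its class weight of 0.5, the second is wrong, so the result is 0.5 / 2. Note that, because correct predictions are scaled by weights below one, the value cannot reach 1 even for a perfect model.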


In [36]:
def compute_weights(df, tag2index):
    """
    Computes weights for each tag based on their frequency in the dataset.

    Parameters:
        df (pandas DataFrame): DataFrame containing word-tag pairs.
        tag2index (dict): Dictionary mapping tags to their indices.

    Returns:
        list: List of weights for each tag.
    """
    # Words per tag in train set
    tag_counts = df['tag'].value_counts()

    # Encoding and sorting by the tag vocab index
    index = tag_counts.index.map(lambda x: tag2index.get(x, 0))
    encoded_tc = pd.DataFrame(tag_counts.values, index=index).sort_index()
    sorted_tc = encoded_tc.values

    # Normalizing the values
    weights = sorted_tc / sorted_tc.sum()

    # Adding the pad weight
    weights = np.insert(weights, 0, 0.01)

    # Reversing the values for weights
    weights = [1 - i for i in weights]

    return weights


def weighted_masked_accuracy_wrapper(weights, classes_to_ignore=[]):
    """
    Creates a wrapper function for weighted masked accuracy calculation.

    Parameters:
        weights (list): List of weights for each tag.
        classes_to_ignore (list): List of classes to ignore while calculating accuracy.

    Returns:
        function: A wrapper function for weighted masked accuracy.
    """
    @tf.function
    def weighted_masked_accuracy(y_true, y_pred):
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)

        per_sample_accuracies = K.cast(K.equal(y_true_class, y_pred_class), 'float32')
        weighted_per_sample_accuracies = per_sample_accuracies * K.gather(weights, y_true_class)

        mask = K.ones_like(y_pred_class, dtype='int32')
        for to_ignore in classes_to_ignore:
            mask = K.cast(K.not_equal(y_true_class, to_ignore), 'int32') * mask

        weighted_acc = K.sum(weighted_per_sample_accuracies * K.cast(mask, 'float32')) / K.maximum(K.cast(K.sum(mask), 'float32'), 1)

        return weighted_acc

    return weighted_masked_accuracy


def evaluate(true_labels, model_recaps, tag2index, ignore, test = False, verbose=True):
    """
    Computes average metrics across multiple runs of a model made with different seeds.

    Parameters:
        true_labels (numpy array): True labels.
        model_recaps (list): List of dictionaries containing model recaps.
        tag2index (dict): Dictionary mapping tags to their indices.
        ignore (list): List of classes to ignore.
        test (bool): Control function behaviour. Default is False. If True,
        operates in test mode.
        verbose (bool): Whether to print the average metrics. Default is True.

    Returns:
        dict: Dictionary containing average metrics.
    """
    if test:
      mean_test = []
      best_macro_f1 = 0.
      best_seed = None
      best_predictions = None

    per_class_f1 = {tag: 0. for tag in tag2index.keys() if tag not in ignore}
    for m in model_recaps:

      print('#' * 50)
      if test:
        print('Evaluating model with seed', m['seed'], 'with the Test set:')

        predictions_one_hot_encode = m["model"].predict(X_test)

        # Convert the class probabilities into class labels
        predictions = np.argmax(predictions_one_hot_encode, axis=-1)

        report_dict = classification_report(true_labels.flatten(), predictions.flatten(),
                                            labels=[tag2index[tag] for tag in tag2index.keys() if tag not in ignore],
                                            target_names=[tag for tag in tag2index.keys() if tag not in ignore],
                                            digits=4, zero_division=0, output_dict=True)

        mean_test.append(report_dict['macro avg']['f1-score'])

        if report_dict['macro avg']['f1-score'] > best_macro_f1:
          best_macro_f1 = report_dict['macro avg']['f1-score']
          best_seed = m['seed']
          best_predictions = predictions

      else:
        print('Evaluating model with seed', m['seed'], 'with the Validation set:')

        report_dict = classification_report(true_labels.flatten(), m['predictions'].flatten(),
                                              labels=[tag2index[tag] for tag in tag2index.keys() if tag not in ignore],
                                              target_names=[tag for tag in tag2index.keys() if tag not in ignore],
                                              digits=4, zero_division=0, output_dict=True)

      per_class_f1.update({tag: per_class_f1[tag] + report_dict[tag]['f1-score'] for tag in per_class_f1.keys()})

      print('Macro f1 score for model with seed', m['seed'], ':', report_dict['macro avg']['f1-score'], '\n')
      print('#' * 50)

    per_class_f1 = {tag: per_class_f1[tag] / len(model_recaps) for tag in per_class_f1.keys()}

    # Print the average metrics
    if test and verbose:
      avg_values = {
          'macro_f1': np.mean(mean_test),
          'per_class_f1': per_class_f1,
      }
      print("AVERAGE METRICS")
      print("\tMacro F1 Score:", avg_values['macro_f1'])
      print("\tPer Class F1 Scores:")
      for tag, f1_score in avg_values['per_class_f1'].items():
          print(f"\t\t- {tag}: {round(f1_score, 4)} (No samples in set)" if f1_score == 0. else
          f"\t\t- {tag}: {round(f1_score, 4)}")

      return {'best_macro_f1': best_macro_f1, 'best_seed': best_seed, 'best_predictions': best_predictions}
    elif verbose:
      avg_values = {
          'macro_f1': np.mean([r['macro_f1'] for r in model_recaps]),
          'per_class_f1': per_class_f1,
          'weighted_masked_accuracy': [(r['seed'], r['scores']['weighted_masked_accuracy']) for r in model_recaps]
      }
      print("AVERAGE METRICS")
      print("\tMacro F1 Score:", avg_values['macro_f1'])
      print("\tPer Class F1 Scores:")
      for tag, f1_score in avg_values['per_class_f1'].items():
          print(f"\t\t- {tag}: {round(f1_score, 4)} (No samples in set)" if f1_score == 0. else
          f"\t\t- {tag}: {round(f1_score, 4)}")
      print("\tWeighted Masked Accuracy:")
      for seed, accuracy in avg_values['weighted_masked_accuracy']:
          print(f"\t\tModel with seed {seed}: {accuracy}")

      return avg_values

We also decided to exclude the tags `SYM` and `LS`, representing "Symbol" and "List item marker" respectively, from the computation of the metrics, since we consider them symbols rather than word classes.
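For reference, the sketch below shows one way to compute a macro F1-score restricted to the non-ignored classes by accumulating TP, FP and FN over batches, as the task hint suggests. It is a NumPy illustration under assumed toy labels, not the code used in this notebook, which relies on `sklearn`; the function name `macro_f1_accumulated` is hypothetical.

```python
import numpy as np


def macro_f1_accumulated(y_true_batches, y_pred_batches, labels):
    """Macro F1 over `labels` only, accumulating counts batch by batch."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}

    for y_true, y_pred in zip(y_true_batches, y_pred_batches):
        y_true = np.asarray(y_true).ravel()
        y_pred = np.asarray(y_pred).ravel()
        for c in labels:
            tp[c] += int(np.sum((y_true == c) & (y_pred == c)))
            fp[c] += int(np.sum((y_true != c) & (y_pred == c)))
            fn[c] += int(np.sum((y_true == c) & (y_pred != c)))

    f1_per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class with no true or predicted samples scores 0
        f1_per_class.append(2 * tp[c] / denom if denom else 0.0)
    return float(np.mean(f1_per_class))


# Toy labels: 0 = padding, 3 = punctuation (both left out of `labels`)
true_batches = [[[1, 2, 3, 0]], [[2, 2, 1, 0]]]
pred_batches = [[[1, 1, 3, 0]], [[2, 1, 1, 0]]]

score = macro_f1_accumulated(true_batches, pred_batches, labels=[1, 2])
```

In this example class 1 has F1 = 2/3 and class 2 has F1 = 1/2, so the macro average is 7/12; the padding and punctuation tokens do not enter the average as classes of their own.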

In [37]:
# Tags to ignore from the metrics
ignore = [':', '#', '$', '-LRB-', '-RRB-', ',', '.', "''", '``', 'SYM','LS','-PAD-']

# Custom metrics ignoring classes
metrics = [weighted_masked_accuracy_wrapper(compute_weights(train_df,tag2index),[tag2index[tag] for tag in ignore])]

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.

The `run_models` function creates and trains multiple models with different seeds and evaluates them on the validation set. Its parameters specify each model's structure; all models share the same input and embedding layers.
Adam and categorical cross-entropy are two other choices common to all models; the latter is also the quantity monitored during training, since `sklearn`'s `f1_score` cannot be used directly as a metric for a TensorFlow model.

The `pos_tagging_sample` function serves to visually inspect the model's performance by presenting a random sentence alongside its ground truth and predicted tags. This function aids in qualitative assessment by displaying the tags associated with each word and highlighting any discrepancies between the ground truth and predicted tags.
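Model selection across seeds can be summarized as in the sketch below, which averages the per-seed validation macro F1-scores and picks the model with the highest mean. The scores are copied from the validation outputs printed further down in this notebook; the dictionary layout and variable names are assumptions made for the example.

```python
from statistics import mean, stdev

# Validation macro F1 per seed (16, 29, 192), as reported in this notebook
val_macro_f1 = {
    'Baseline': [0.7903116973849472, 0.7920461657011612, 0.7850634514344828],
    'Additional_LSTM': [0.8054984029489812, 0.7955553353408082, 0.7887889159132006],
    'Additional_FC': [0.7991818958803185, 0.7956613321012745, 0.8027764963181309],
}

# Mean and sample standard deviation over the three seeds
summary = {name: (mean(scores), stdev(scores))
           for name, scores in val_macro_f1.items()}

# Best model according to the mean validation score
best_model_name = max(summary, key=lambda name: summary[name][0])

for name, (avg, std) in summary.items():
    print(f'{name}: {avg:.4f} +/- {std:.4f}')
print('Best on validation:', best_model_name)
```

Reporting the spread next to the mean helps judge whether the gap between two models is larger than the variation caused by the seed alone.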

In [38]:
def build_model(model_name, layer_params, LR):
    # Define the model
    model = Sequential(name=model_name)

    # Add the Embedding layer
    model.add(Embedding(**embedding_params, trainable=False, mask_zero=True))

    # Add layers
    for layer_param in layer_params:
      layer_type = layer_param['layer_type']
      layer_kwargs = layer_param['layer_kwargs']
      if layer_type == 'Bidirectional':
          layer = Bidirectional(LSTM(**layer_kwargs, return_sequences=True))
      elif layer_type == 'Dense':
          layer = TimeDistributed(Dense(**layer_kwargs))
      else:
          # Fail early instead of silently re-adding the previous layer
          raise ValueError(f"Unsupported layer type: {layer_type}")
      model.add(layer)

    # Compile the model
    model.compile(optimizer=Adam(LR), loss='categorical_crossentropy', metrics=metrics)

    return model

def run_models(name, layer_params, embedding_params, training_params, X_val, y_val_one_hot, metrics, LR, seeds, tag2index, ignore):
    """
    Trains multiple models with different seeds and evaluates them on the validation set.

    Parameters:
        name (str): Name of the model.
        layer_params (list): List of dictionaries containing layer type and layer parameters.
        embedding_params (dict): Parameters for the embedding layer.
        training_params (dict): Parameters for training the models.
        X_val (numpy.ndarray): Validation set input data.
        y_val_one_hot (numpy.ndarray): One-hot encoded validation set labels.
        metrics (list): List of evaluation metrics.
        LR (float): Learning rate for model training.
        seeds (list): List of random seeds for reproducibility.
        tag2index (dict): Mapping of tags to their index.
        ignore (list): List of tags to ignore during evaluation.

    Returns:
        list: A list of dictionaries containing information about each trained model.
    """
    model_recaps = []

    for seed in seeds:
        print('#' * 50)
        print('Running with seed:', seed)
        set_reproducibility(seed)

        # Build the model
        tf.keras.backend.clear_session()
        model_name = name + '_' + str(seed)
        model = build_model(model_name, layer_params, LR)

        # Summary
        model.summary()
        tf.keras.utils.plot_model(model,to_file=f'{name}.png')

        # Fitting the model
        print('Fitting the', name, '(seed', seed, ') model...')
        history = model.fit(**training_params)

        # Evaluating the model
        print('Evaluating the', name, '(seed', seed, ') model...')
        scores = model.evaluate(X_val, y_val_one_hot, return_dict=True)

        # Making predictions
        print('Obtaining predictions from the', name, '(seed', seed, ') model...')
        predictions_one_hot_encode = model.predict(X_val)
        predictions = np.argmax(predictions_one_hot_encode, axis=-1)

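        # Calculating macro F1 score against the integer labels recovered
        # from the one-hot targets passed to this function
        y_val_labels = np.argmax(y_val_one_hot, axis=-1)
        macro_f1 = f1_score(y_val_labels.flatten(), predictions.flatten(),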
                            labels=[tag2index[tag] for tag in tag2index.keys() if tag not in ignore],
                            average='macro', zero_division=0)

        model_recap = {
            'name': model_name,
            'model': model,
            'history': history,
            'scores': scores,
            'predictions': predictions,
            'macro_f1': macro_f1,
            'seed': seed
        }

        model_recaps.append(model_recap)

        print('Macro f1 score:', macro_f1)
        print('Garbage collection:', gc.collect())
        print('#' * 50)

    return model_recaps

def pos_tagging_sample(samples, labels, predictions, index2word, index2tag, tag2description):
    """
    Prints a random sentence with ground truth and predicted tags.

    Parameters:
        samples (numpy.ndarray): Input sentences.
        labels (numpy.ndarray): Ground truth labels.
        predictions (numpy.ndarray): Predicted labels.
        index2word (dict): Mapping of word indices to words.
        index2tag (dict): Mapping of tag indices to tags.
        tag2description (dict): Mapping of tags to descriptions.

    Returns:
        None
    """
    idx = np.random.randint(0, len(samples))

    data = []
    for word, tag_true, tag_pred in zip(samples[idx], labels[idx], predictions[idx]):
        mismatch = 'Mismatch' if tag_true != tag_pred else ''
        data.append([index2word[word], index2tag[tag_true], tag2description[index2tag[tag_true]], index2tag[tag_pred], tag2description[index2tag[tag_pred]], mismatch])

    print(tabulate(data, headers=['Word', 'Ground Truth', 'Description', 'Prediction', 'Description', 'Mismatch'], tablefmt="rounded_grid"))

In [39]:
# Seeds
seeds = [16, 29, 192]

os.makedirs('checkpoints', exist_ok=True)

### 5.1 Baseline

In [40]:
baseline_model_recaps = run_models('Baseline', bl_layer_params, embedding_params, bl_training_params, X_val, y_val_one_hot,\
           metrics, bl_lr, seeds, tag2index, ignore)

##################################################
Running with seed: 16
Model: "Baseline_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         1094800   
                                                                 
 bidirectional (Bidirection  (None, None, 256)         234496    
 al)                                                             
                                                                 
 time_distributed (TimeDist  (None, None, 46)          11822     
 ributed)                                                        
                                                                 
Total params: 1341118 (5.12 MB)
Trainable params: 246318 (962.18 KB)
Non-trainable params: 1094800 (4.18 MB)
_________________________________________________________________
Fitting the Baseline (seed 16 ) model...
Epoch 1/50
Epoch 2/50
Epoch 3/

In [41]:
avg_metrics_bl = evaluate(y_val, baseline_model_recaps, tag2index, ignore, test=False)

##################################################
Evaluating model with seed 16 with the Validation set:
Macro f1 score for model with seed 16 : 0.7903116973849472 

##################################################
##################################################
Evaluating model with seed 29 with the Validation set:
Macro f1 score for model with seed 29 : 0.7920461657011612 

##################################################
##################################################
Evaluating model with seed 192 with the Validation set:
Macro f1 score for model with seed 192 : 0.7850634514344828 

##################################################
AVERAGE METRICS
	Macro F1 Score: 0.7891404381735304
	Per Class F1 Scores:
		- CC: 0.9876
		- CD: 0.9685
		- DT: 0.9858
		- EX: 0.9854
		- FW: 0.0 (No samples in set)
		- IN: 0.9708
		- JJ: 0.8417
		- JJR: 0.8458
		- JJS: 0.8687
		- MD: 0.9894
		- NN: 0.8908
		- NNP: 0.0388
		- NNPS: 0.2352
		- NNS: 0.9176
		- PDT: 0.3186
		- POS: 0.9975
		- P

In [42]:
best_predictions_bl = evaluate(y_test, baseline_model_recaps, tag2index , ignore, test=True)

##################################################
Evaluating model with seed 16 with the Test set:
Macro f1 score for model with seed 16 : 0.7747939064877759 

##################################################
##################################################
Evaluating model with seed 29 with the Test set:
Macro f1 score for model with seed 29 : 0.7896529172563154 

##################################################
##################################################
Evaluating model with seed 192 with the Test set:
Macro f1 score for model with seed 192 : 0.7787667121762102 

##################################################
AVERAGE METRICS
	Macro F1 Score: 0.7810711786401004
	Per Class F1 Scores:
		- CC: 0.9964
		- CD: 0.9899
		- DT: 0.9887
		- EX: 1.0
		- FW: 0.0 (No samples in set)
		- IN: 0.9688
		- JJ: 0.8406
		- JJR: 0.8404
		- JJS: 0.9102
		- MD: 0.9823
		- NN: 0.8998
		- NNP: 0.0435
		- NNPS: 0.2603
		- NNS: 0.9191
		- PDT: 0.1333
		- POS: 0.9967
		- PRP: 0.9895
		- PRP$: 

In [43]:
pos_tagging_sample(X_test_np, y_test_np, best_predictions_bl['best_predictions'], index2word, index2tag, tag2description)

╭──────────────────┬────────────────┬──────────────────────────────────────────┬──────────────┬──────────────────────────────────────────┬────────────╮
│ Word             │ Ground Truth   │ Description                              │ Prediction   │ Description                              │ Mismatch   │
├──────────────────┼────────────────┼──────────────────────────────────────────┼──────────────┼──────────────────────────────────────────┼────────────┤
│ freeport-mcmoran │ NNP            │ Proper noun, singular                    │ NNP          │ Proper noun, singular                    │            │
├──────────────────┼────────────────┼──────────────────────────────────────────┼──────────────┼──────────────────────────────────────────┼────────────┤
│ inc.             │ NNP            │ Proper noun, singular                    │ NNP          │ Proper noun, singular                    │            │
├──────────────────┼────────────────┼──────────────────────────────────────────┼────────

In [44]:
best_bl_model = max(baseline_model_recaps, key=lambda x: x['macro_f1'])


if not os.path.exists('checkpoints/best_bl_model/'):
    os.makedirs('checkpoints/best_bl_model')

best_bl_model['model'].save_weights('checkpoints/best_bl_model/bl_weights')

### 5.2 Additional LSTM layer

In [45]:
add_lstm_lr = 1e-2

add_lstm_model_recaps = run_models('Additional_LSTM', add_lstm_layer_params, embedding_params, add_lstm_training_params, X_val, y_val_one_hot,\
           metrics, add_lstm_lr, seeds, tag2index, ignore)

##################################################
Running with seed: 16
Model: "Additional_LSTM_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         1094800   
                                                                 
 bidirectional (Bidirection  (None, None, 256)         234496    
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, None, 256)         394240    
 onal)                                                           
                                                                 
 time_distributed (TimeDist  (None, None, 46)          11822     
 ributed)                                                        
                                                                 
Total params: 1735358 (6.62 MB)
Trainable

In [46]:
avg_metrics_add_lstm = evaluate(y_val, add_lstm_model_recaps, tag2index, ignore, test=False)

##################################################
Evaluating model with seed 16 with the Validation set:
Macro f1 score for model with seed 16 : 0.8054984029489812 

##################################################
##################################################
Evaluating model with seed 29 with the Validation set:
Macro f1 score for model with seed 29 : 0.7955553353408082 

##################################################
##################################################
Evaluating model with seed 192 with the Validation set:
Macro f1 score for model with seed 192 : 0.7887889159132006 

##################################################
AVERAGE METRICS
	Macro F1 Score: 0.7966142180676634
	Per Class F1 Scores:
		- CC: 0.6654
		- CD: 0.9801
		- DT: 0.6759
		- EX: 0.9952
		- FW: 0.0 (No samples in set)
		- IN: 0.9795
		- JJ: 0.8641
		- JJR: 0.8528
		- JJS: 0.8268
		- MD: 0.9913
		- NN: 0.9092
		- NNP: 0.8983
		- NNPS: 0.283
		- NNS: 0.925
		- PDT: 0.3102
		- POS: 0.9975
		- PRP

In [47]:
best_predictions_add_lstm = evaluate(y_test, add_lstm_model_recaps, tag2index , ignore, test=True)

##################################################
Evaluating model with seed 16 with the Test set:
Macro f1 score for model with seed 16 : 0.793977696016351 

##################################################
##################################################
Evaluating model with seed 29 with the Test set:
Macro f1 score for model with seed 29 : 0.778690030428912 

##################################################
##################################################
Evaluating model with seed 192 with the Test set:
Macro f1 score for model with seed 192 : 0.7783705252980494 

##################################################
AVERAGE METRICS
	Macro F1 Score: 0.7836794172477708
	Per Class F1 Scores:
		- CC: 0.6685
		- CD: 0.9907
		- DT: 0.6739
		- EX: 1.0
		- FW: 0.0 (No samples in set)
		- IN: 0.9773
		- JJ: 0.8576
		- JJR: 0.7727
		- JJS: 0.9043
		- MD: 0.9824
		- NN: 0.9122
		- NNP: 0.9122
		- NNPS: 0.2683
		- NNS: 0.9249
		- PDT: 0.0 (No samples in set)
		- POS: 0.9956
		- PRP: 0.

In [48]:
pos_tagging_sample(X_test_np, y_test_np, best_predictions_add_lstm['best_predictions'], index2word, index2tag, tag2description)

╭──────────────────┬────────────────┬──────────────────────────────────────────┬──────────────┬──────────────────────────────────────────┬────────────╮
│ Word             │ Ground Truth   │ Description                              │ Prediction   │ Description                              │ Mismatch   │
├──────────────────┼────────────────┼──────────────────────────────────────────┼──────────────┼──────────────────────────────────────────┼────────────┤
│ freeport-mcmoran │ NNP            │ Proper noun, singular                    │ NNP          │ Proper noun, singular                    │            │
├──────────────────┼────────────────┼──────────────────────────────────────────┼──────────────┼──────────────────────────────────────────┼────────────┤
│ inc.             │ NNP            │ Proper noun, singular                    │ NNP          │ Proper noun, singular                    │            │
├──────────────────┼────────────────┼──────────────────────────────────────────┼────────

In [49]:
best_add_lstm_model = max(add_lstm_model_recaps, key=lambda x: x['macro_f1'])

if not os.path.exists('checkpoints/best_add_lstm_model/'):
    os.makedirs('checkpoints/best_add_lstm_model')

best_add_lstm_model['model'].save_weights('checkpoints/best_add_lstm_model/add_lstm_weights')

### 5.3 Additional dense layer

In [50]:
add_fc_model_recaps = run_models('Additional_FC', add_fc_layer_params, embedding_params, add_fc_training_params, X_val, y_val_one_hot,\
           metrics, add_fc_lr, seeds, tag2index, ignore)

##################################################
Running with seed: 16
Model: "Additional_FC_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         1094800   
                                                                 
 bidirectional (Bidirection  (None, None, 256)         234496    
 al)                                                             
                                                                 
 time_distributed (TimeDist  (None, None, 114)         29298     
 ributed)                                                        
                                                                 
 time_distributed_1 (TimeDi  (None, None, 46)          5290      
 stributed)                                                      
                                                                 
Total params: 1363884 (5.20 MB)
Trainable p

In [51]:
avg_metrics_add_fc = evaluate(y_val, add_fc_model_recaps, tag2index, ignore, test=False)

##################################################
Evaluating model with seed 16 with the Validation set:
Macro f1 score for model with seed 16 : 0.7991818958803185 

##################################################
##################################################
Evaluating model with seed 29 with the Validation set:
Macro f1 score for model with seed 29 : 0.7956613321012745 

##################################################
##################################################
Evaluating model with seed 192 with the Validation set:
Macro f1 score for model with seed 192 : 0.8027764963181309 

##################################################
AVERAGE METRICS
	Macro F1 Score: 0.7992065747665746
	Per Class F1 Scores:
		- CC: 0.9912
		- CD: 0.9835
		- DT: 0.9906
		- EX: 0.9903
		- FW: 0.0 (No samples in set)
		- IN: 0.9789
		- JJ: 0.8619
		- JJR: 0.8525
		- JJS: 0.8482
		- MD: 0.9909
		- NN: 0.0657
		- NNP: 0.8959
		- NNPS: 0.2372
		- NNS: 0.927
		- PDT: 0.3277
		- POS: 0.9975
		- PR

In [52]:
best_predictions_add_fc = evaluate(y_test, add_fc_model_recaps, tag2index , ignore, test=True)

##################################################
Evaluating model with seed 16 with the Test set:
Macro f1 score for model with seed 16 : 0.7781961996447441 

##################################################
##################################################
Evaluating model with seed 29 with the Test set:
Macro f1 score for model with seed 29 : 0.7861894866924903 

##################################################
##################################################
Evaluating model with seed 192 with the Test set:
Macro f1 score for model with seed 192 : 0.7909249407519173 

##################################################
AVERAGE METRICS
	Macro F1 Score: 0.7851035423630505
	Per Class F1 Scores:
		- CC: 0.9955
		- CD: 0.993
		- DT: 0.9918
		- EX: 0.9697
		- FW: 0.0 (No samples in set)
		- IN: 0.9792
		- JJ: 0.8586
		- JJR: 0.7711
		- JJS: 0.8863
		- MD: 0.9795
		- NN: 0.068
		- NNP: 0.909
		- NNPS: 0.2722
		- NNS: 0.9268
		- PDT: 0.0 (No samples in set)
		- POS: 0.9956
		- PRP: 

In [53]:
pos_tagging_sample(X_test_np, y_test_np, best_predictions_add_fc['best_predictions'], index2word, index2tag, tag2description)

╭──────────────────┬────────────────┬──────────────────────────────────────────┬──────────────┬──────────────────────────────────────────┬────────────╮
│ Word             │ Ground Truth   │ Description                              │ Prediction   │ Description                              │ Mismatch   │
├──────────────────┼────────────────┼──────────────────────────────────────────┼──────────────┼──────────────────────────────────────────┼────────────┤
│ freeport-mcmoran │ NNP            │ Proper noun, singular                    │ NNP          │ Proper noun, singular                    │            │
├──────────────────┼────────────────┼──────────────────────────────────────────┼──────────────┼──────────────────────────────────────────┼────────────┤
│ inc.             │ NNP            │ Proper noun, singular                    │ NNP          │ Proper noun, singular                    │            │
├──────────────────┼────────────────┼──────────────────────────────────────────┼────────

In [54]:
best_add_fc_model = max(add_fc_model_recaps, key=lambda x: x['macro_f1'])

os.makedirs('checkpoints/best_add_fc_model', exist_ok=True)

best_add_fc_model['model'].save_weights('checkpoints/best_add_fc_model/add_fc_weights')

# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.





### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible)
* Comment on the errors and propose possible solutions to address them.

In [55]:
bl_f1 = best_predictions_bl['best_macro_f1']
add_lstm_f1 = best_predictions_add_lstm['best_macro_f1']
add_fc_f1 = best_predictions_add_fc['best_macro_f1']

In [56]:
print(f'The best macro f1_score obtained with baseline model is {bl_f1}')
print(f'The best macro f1_score obtained with double lstm layer model is {add_lstm_f1}')
print(f'The best macro f1_score obtained with double dense layer model is {add_fc_f1}')

The best macro f1_score obtained with baseline model is 0.7896529172563154
The best macro f1_score obtained with double lstm layer model is 0.793977696016351
The best macro f1_score obtained with double dense layer model is 0.7909249407519173


Although the three models perform similarly, the one with two LSTM layers achieved the highest macro `f1_score` on the test set (0.7940, against 0.7909 for the additional dense layer and 0.7897 for the baseline).

The gap is small, so this only weakly suggests that the additional LSTM layer helps capture sequential dependencies and long-range relationships within the input data, which are particularly relevant for POS tagging tasks.

### 6.1 Compare the errors made on the validation and test sets

In [57]:
def compare_errors(X_val, y_val, X_test, y_test, val_predictions, test_predictions):
    """
    Compare the errors made on the validation and test sets.

    Relies on the global `index2word` and `index2tag` mappings to decode indices.

    Args:
    - X_val: Validation input data (sequences of word indices)
    - y_val: Validation true labels (sequences of tag indices)
    - X_test: Test input data (sequences of word indices)
    - y_test: Test true labels (sequences of tag indices)
    - val_predictions: Predicted tag indices on the validation set
    - test_predictions: Predicted tag indices on the test set

    Returns:
    - validation_errors: List of [word, true_tag, predicted_tag] entries, one per mislabelled token in the validation set
    - test_errors: List of [word, true_tag, predicted_tag] entries, one per mislabelled token in the test set
    """

    validation_errors, test_errors = [], []

    # Compare predictions with true labels
    for sent_idx in range(len(X_val)):
        for word, tag_true, tag_pred in zip(X_val[sent_idx], y_val[sent_idx], val_predictions[sent_idx]):
            if tag_true != tag_pred:
                validation_errors.append([index2word[word], index2tag[tag_true], index2tag[tag_pred]])

    for sent_idx in range(len(X_test)):
        for word, tag_true, tag_pred in zip(X_test[sent_idx], y_test[sent_idx], test_predictions[sent_idx]):
            if tag_true != tag_pred:
                test_errors.append([index2word[word], index2tag[tag_true], index2tag[tag_pred]])

    return validation_errors, test_errors

def error_per_tag(errors_list, name):
    tags = set([error[1] for error in errors_list])
    tag_dict = {tag: 0 for tag in tags}

    total_errors = 0
    for error in errors_list:
        tag_dict[error[1]] += 1  # Increment count for the true tag
        total_errors += 1

    sorted_tag_dict = dict(sorted(tag_dict.items(), key=lambda item: item[1], reverse=True))

    tags_list = list(sorted_tag_dict.keys())
    error_counts = list(sorted_tag_dict.values())

    fig = px.bar(x=tags_list, y=error_counts)
    fig.update_layout(title=f"Number of Errors per tag in {name} set (Total Errors: {total_errors})",
                      xaxis_title="Tag",
                      yaxis_title="Number of Errors")
    fig.show()

    return total_errors

In [69]:
# Predict on validation and test sets
val_predictions = np.argmax(best_add_lstm_model['model'].predict(X_val), axis=-1)
test_predictions = np.argmax(best_add_lstm_model['model'].predict(X_test), axis=-1)



In [70]:
validation_errors, test_errors = compare_errors(X_val_np, y_val_np, X_test_np, y_test_np, val_predictions, test_predictions)

In [71]:
configure_plotly_browser_state()

# Plot error counts for validation set
total_errors_val = error_per_tag(validation_errors, "validation")

# Plot error counts for test set
total_errors_test = error_per_tag(test_errors, "test")

We observe more errors on the validation set than on the test set. This may indicate that the samples in the validation set are more challenging or contain more edge cases. Note, however, that raw error counts also grow with the size of a set, so they are directly comparable only if the two sets have similar sizes.

Four of the top-5 mismatched tags are shared by the two sets: `NN`, `NNP`, `JJ`, and `NNS`. The larger presence of these classes in the validation set, along with `VBN` tags, is consistent with the higher number of errors.
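To compare the two sets independently of their size, the error counts can be normalised by how often each tag occurs. The following is a minimal sketch, assuming the `[word, true_tag, predicted_tag]` error format produced by `compare_errors` and a flat list of gold tags for the set. The `error_rate_per_tag` helper and the toy data are hypothetical and are not part of the notebook's results.

```python
from collections import Counter


def error_rate_per_tag(errors_list, gold_tags):
    """Return {tag: errors / occurrences}, sorted by decreasing rate.

    errors_list: [word, true_tag, predicted_tag] triples (as built by compare_errors).
    gold_tags: flat iterable with the true tag of every token in the set.
    """
    support = Counter(gold_tags)
    errors = Counter(true_tag for _, true_tag, _ in errors_list)
    rates = {tag: errors[tag] / support[tag] for tag in support}
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Toy data, for illustration only (not taken from the notebook's results).
toy_gold = ["NN", "NN", "NN", "NN", "JJ", "JJ", "VBN"]
toy_errors = [["gold", "NN", "JJ"], ["broken", "VBN", "JJ"]]
toy_rates = error_rate_per_tag(toy_errors, toy_gold)
print(toy_rates)
```

A tag with few occurrences but a high rate (here the toy `VBN`) would stand out in this view, even though it contributes little to the raw counts.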

### 6.2 Typology of errors

In [72]:
def errors_typology(errors_list, mf_tags, name):
    """Plot, for each true tag in mf_tags, how often it was mistaken for each other tag in mf_tags.

    Returns the number of errors in which both the true and the predicted tag belong to mf_tags.
    """

    # Initialize a dictionary to count the occurrences of mistaken tags
    mistaken_tag_counts = {mf_tag: {tag: 0 for tag in mf_tags if tag != mf_tag} for mf_tag in mf_tags}

    # Count occurrences of mistaken tags
    total_errors = 0
    for error in errors_list:
        true_tag = error[1]
        if true_tag in mf_tags:
            mistaken_tag = error[2]
            if mistaken_tag in mistaken_tag_counts[true_tag]:
                mistaken_tag_counts[true_tag][mistaken_tag] += 1
                total_errors += 1

    # Prepare data for plotting
    tags_list = []
    mistaken_tags_list = []
    error_counts = []
    for true_tag, mistaken_counts in mistaken_tag_counts.items():
        for mistaken_tag, count in mistaken_counts.items():
            tags_list.append(true_tag)
            mistaken_tags_list.append(mistaken_tag)
            error_counts.append(count)

    # Plotting
    fig = px.bar(x=tags_list, y=error_counts, color=mistaken_tags_list)
    fig.update_layout(title=f"Mistaken Tags in {name} set (Total Errors: {total_errors})",
                      xaxis_title="True Tag",
                      yaxis_title="Number of Errors",
                      legend_title="Mistaken Tag")
    fig.show()

    return total_errors

In [73]:
configure_plotly_browser_state()

mf_tags = ['NN', 'NNP', 'JJ', 'NNS']

# Plot error counts for validation set
errors_val = errors_typology(validation_errors, mf_tags, 'validation')

# Plot error counts for test set
errors_test = errors_typology(test_errors, mf_tags, 'test')

In [74]:
percentage_val = (errors_val / total_errors_val) * 100
print(f"-The errors made between classes more frequently wrongly predicted represent {percentage_val:.2f}% of the total errors on the validation set.")

percentage_test = (errors_test / total_errors_test) * 100
print(f"\n-The errors made between classes more frequently wrongly predicted represent {percentage_test:.2f}% of the total errors on the test set.")

-The errors made between classes more frequently wrongly predicted represent 44.31% of the total errors on the validation set.

-The errors made between classes more frequently wrongly predicted represent 47.61% of the total errors on the test set.


Confusions among these four classes account for almost half of the mistakes made by the model. Let us inspect their meanings.

In [75]:
for tag in mf_tags:
    print(f"Tag: {tag} \tDescription: {tag2description[tag]}")

Tag: NN 	Description: Noun, singular or mass
Tag: NNP 	Description: Proper noun, singular
Tag: JJ 	Description: Adjective
Tag: NNS 	Description: Noun, plural


We hypothesize that the frequent mistakes between these classes stem from their semantic and syntactic similarities, which confuse the model. For instance:

- **NN (Noun, singular or mass) and NNS (Noun, plural)**: Both represent nouns but differ in number (singular vs. plural). However, certain words might function as both singular and plural forms, contributing to misclassifications.
  
- **NN (Noun, singular or mass) and NNP (Proper noun, singular)**: While NNP represents singular proper nouns, NN encompasses singular common nouns. Many proper nouns are built from words that also occur as common nouns, and since the tokens in the sample output above appear lowercased, capitalization cues may not be available to the model, resulting in misclassifications.
  
- **NN (Noun, singular or mass) and JJ (Adjective)**: Adjectives often modify nouns, and distinguishing between them can be challenging, especially when dealing with descriptive language.
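
The pairs discussed above can also be ranked directly from the error list. The following is a minimal sketch, assuming the `[word, true_tag, predicted_tag]` format returned by `compare_errors`. The `top_confusions` helper and the toy errors are hypothetical.

```python
from collections import Counter


def top_confusions(errors_list, k=5):
    """Return the k most frequent (true_tag, predicted_tag) pairs with their counts."""
    pairs = Counter((true_tag, pred_tag) for _, true_tag, pred_tag in errors_list)
    return pairs.most_common(k)


# Hand-written toy errors, for illustration only.
toy_errors = [
    ["bank", "NN", "NNP"],
    ["gold", "NN", "JJ"],
    ["plant", "NN", "JJ"],
    ["sheep", "NNS", "NN"],
]
toy_top = top_confusions(toy_errors, k=2)
print(toy_top)
```

Unlike the per-tag bar charts, this keeps the direction of each confusion, so `NN` predicted as `JJ` is counted separately from `JJ` predicted as `NN`.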

Since `VBN` is among the top-5 mistaken classes, let us extend the same reasoning to verb tags. We also include the tag `JJ`, as many errors are due to confusion between `VBN` and adjectives.

In [76]:
configure_plotly_browser_state()

verbs_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ']

# Plot error counts for validation set
errors_val = errors_typology(validation_errors, verbs_tags, 'validation')

# Plot error counts for test set
errors_test = errors_typology(test_errors, verbs_tags, 'test')

In [77]:
for tag in verbs_tags:
    print(f"Tag: {tag} \tDescription: {tag2description[tag]}")

Tag: VB 	Description: Verb, base form
Tag: VBD 	Description: Verb, past tense
Tag: VBG 	Description: Verb, gerund or present participle
Tag: VBN 	Description: Verb, past participle
Tag: VBP 	Description: Verb, non-3rd person singular present
Tag: VBZ 	Description: Verb, 3rd person singular present
Tag: JJ 	Description: Adjective


Each of these tags represents a different verb form. Past participles are often mistaken for past tense verbs or for adjectives.

- **VBN (Verb, past participle) and VBD (Verb, past tense)**: For regular verbs, and for many irregular ones, the past tense and past participle forms are identical (e.g., "finished", "bought"), so only the context, such as a preceding auxiliary, distinguishes them. For example, "finished" is a past tense verb in "He finished the book" but a past participle, or an adjective, in "the finished product".

- **VBN (Verb, past participle) and JJ (Adjective)**: Past participle verbs can also function as adjectives in certain contexts. For example, "the broken window" uses "broken" as an adjective modifying "window."

The last class that is frequently mistaken in both the validation and test sets is `IN` (Preposition or subordinating conjunction). This tag is often confused with `RP` and `RB`.

In [78]:
ps_tags = ['IN', 'RP', 'RB']

for tag in ps_tags:
    print(f"Tag: {tag} \tDescription: {tag2description[tag]}")

Tag: IN 	Description: Preposition or subordinating conjunction
Tag: RP 	Description: Particle
Tag: RB 	Description: Adverb


In [79]:
configure_plotly_browser_state()

# Plot error counts for validation set
errors_val = errors_typology(validation_errors, ps_tags, 'validation')

# Plot error counts for test set
errors_test = errors_typology(test_errors, ps_tags, 'test')

These errors likely arise because the same word form (e.g., "up") can act as a preposition, a particle, or an adverb, depending on the context of the sentence.
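
One way to check this hypothesis is to list the words that occur with more than one of these tags in the gold annotations. The following is a minimal sketch, assuming access to a flat list of `(word, tag)` pairs. The `ambiguous_words` helper and the toy tokens are hypothetical.

```python
from collections import defaultdict


def ambiguous_words(tagged_tokens, tags_of_interest):
    """Map each word to the tags of interest it occurs with,
    keeping only words seen with at least two of them."""
    seen = defaultdict(set)
    for word, tag in tagged_tokens:
        if tag in tags_of_interest:
            seen[word.lower()].add(tag)
    return {word: tags for word, tags in seen.items() if len(tags) > 1}


# Hand-written toy tokens, for illustration only.
toy_tokens = [("up", "RP"), ("up", "IN"), ("up", "RB"), ("of", "IN"), ("quickly", "RB")]
toy_ambiguous = ambiguous_words(toy_tokens, {"IN", "RP", "RB"})
print(toy_ambiguous)
```

If most `IN`/`RP`/`RB` errors fall on words returned by such a check, the confusion is more likely lexical ambiguity than a lack of training examples.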

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Execution Order

You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, TensorFlow, PyTorch, JAX, etc.).

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation

Each model is trained with at least 3 random seeds.

Task 4 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.
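
As an illustration, the aggregation over seeds can be sketched with the standard library, using the test-set macro F1 scores printed above for the three `Additional_FC` runs. Whether the sample or the population standard deviation is expected is an assumption left to the reader; the sketch uses the sample version.

```python
import statistics

# Test-set macro F1 of the three Additional_FC runs, copied from the output above.
test_macro_f1 = {16: 0.7781961996447441, 29: 0.7861894866924903, 192: 0.7909249407519173}

scores = list(test_macro_f1.values())
mean_f1 = statistics.mean(scores)
# Sample standard deviation (n - 1 denominator);
# statistics.pstdev would give the population version instead.
std_f1 = statistics.stdev(scores)
print(f"Macro F1: {mean_f1:.4f} +/- {std_f1:.4f}")
```

The mean agrees with the `AVERAGE METRICS` value reported for the test set.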

### Model Selection for Analysis

To carry out the error analysis you are **free** to either

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.
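
A possible token-level majority vote is sketched below, assuming each seed run yields a tag sequence of the same length for a given sentence. The `majority_vote` helper, its tie-breaking rule, and the toy runs are hypothetical choices, not a prescribed method.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Combine token-level predictions from several models.

    per_model_predictions: list (one entry per model) of equally long tag sequences.
    Ties are resolved in favour of the earliest model in the list.
    """
    voted = []
    for token_votes in zip(*per_model_predictions):
        counts = Counter(token_votes)
        best = max(counts.values())
        # The first vote (in model order) that reaches the highest count wins ties.
        voted.append(next(tag for tag in token_votes if counts[tag] == best))
    return voted


toy_runs = [
    ["DT", "NN", "VBD"],  # hypothetical run 1
    ["DT", "JJ", "VBN"],  # hypothetical run 2
    ["DT", "NN", "VBZ"],  # hypothetical run 3
]
toy_ensemble = majority_vote(toy_runs)
print(toy_ensemble)
```

With three runs, a full three-way disagreement (as on the last toy token) falls back to the first run, so the order of the runs matters.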

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.
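
The effect of ignoring punctuation at evaluation time can be sketched with a dependency-free macro F1. This is an illustrative assumption, not the evaluation code of this notebook: the `macro_f1` helper is hypothetical, and it averages only over classes that appear among the gold tags, which may differ from the defaults of a library such as scikit-learn.

```python
def macro_f1(y_true, y_pred, ignore=()):
    """Macro-averaged F1 over flat tag sequences, skipping tags listed in `ignore`.

    Tokens whose gold tag is ignored are dropped before counting; the remaining
    gold tags define the classes that are averaged.
    """
    ignore = set(ignore)
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t not in ignore]
    classes = sorted({t for t, _ in pairs})
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy sequences, for illustration only.
toy_true = ["DT", "NN", ",", "NN", "."]
toy_pred = ["DT", "JJ", ",", "NN", "."]
toy_with_punct = macro_f1(toy_true, toy_pred)
toy_without_punct = macro_f1(toy_true, toy_pred, ignore={",", "."})
print(toy_with_punct, toy_without_punct)
```

Punctuation tags are usually easy to predict, so including them tends to inflate the macro average, as the toy example shows.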

# The End