Assignment 3: Sequence labelling with RNNs
In this assignment we will ask you to perform POS tagging.

You are asked to follow these steps:

- Download the corpus and split it into training and test sets, structuring it as a dataframe.
- Embed the words using GloVe embeddings
- Create a baseline model, using a simple neural architecture
- Experiment with small modifications to the model
- Evaluate your best model
- Analyze the errors of your model

Corpus: ignore the numeric value in the third column; use only the words/symbols and their labels. https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip

Splits: documents 1-100 are the train set, 101-150 validation set, 151-199 test set.

Baseline: a two-layer architecture: a bidirectional LSTM with a Dense/Fully-Connected layer on top.

Modifications: experiment with using a GRU instead of the LSTM, adding an additional LSTM layer, and using a CRF in addition to the LSTM. Each of these changes must be applied on its own (don't mix these modifications).

Training and Experiments: all the experiments must involve only the training and validation sets.

Evaluation: in the end, only the best model of your choice must be evaluated on the test set. The main metric must be macro-averaged F1, computed over the part-of-speech classes and excluding the punctuation classes.
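
As a rough illustration of the metric described above, the sketch below computes a macro-averaged F1 in plain Python while skipping a set of punctuation tags. The function name `macro_f1` and the contents of `PUNCTUATION_TAGS` are assumptions made for this example. The assignment does not list which tags count as punctuation, so that set should be adapted to the corpus.

```python
def macro_f1(y_true, y_pred, ignore_tags=frozenset()):
    """Unweighted mean of per-class F1, skipping classes in ignore_tags."""
    # Classes are taken from both gold and predicted tags, minus ignored ones.
    classes = sorted((set(y_true) | set(y_pred)) - set(ignore_tags))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical punctuation tag set; adjust to the tags found in the corpus.
PUNCTUATION_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

gold = ["DT", "NN", "VBZ", ",", "NN", "."]
pred = ["DT", "NN", "VBD", ",", "VBZ", "."]
score = macro_f1(gold, pred, ignore_tags=PUNCTUATION_TAGS)
```

Because every remaining class contributes equally, rare tags weigh as much as frequent ones, which is why macro F1 is usually lower than token-level accuracy.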

Error Analysis (optional): analyze the errors made by your model, try to understand their possible causes, and think about how to improve it.

Report: You are asked to deliver a small report of about 4-5 lines in the .txt file that sums up your findings.

# Sequence labeling

In [195]:
import re
import random

from tqdm import tqdm
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

import nltk

from nltk.corpus import dependency_treebank
nltk.download('dependency_treebank')

import utils

%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package dependency_treebank to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package dependency_treebank is already up-to-date!


In [3]:
RANDOM_SEED = 42


def fix_random(seed):
    """
    Fix all the possible sources of randomness
    """
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


fix_random(RANDOM_SEED)

In [144]:
file_prefix = "wsj_"
file_ext = ".dp"
train_files = [f"{file_prefix}{i:04d}{file_ext}" for i in range(1, 101)]
val_files = [f"{file_prefix}{i:04d}{file_ext}" for i in range(101, 151)]
test_files = [f"{file_prefix}{i:04d}{file_ext}" for i in range(151, 200)]
splits = (
    ["train"] * len(train_files) + ["val"] * len(val_files) + ["test"] * len(test_files)
)
whole_files = train_files + val_files + test_files

In [145]:
whole_files[0:-1:20]

['wsj_0001.dp',
 'wsj_0021.dp',
 'wsj_0041.dp',
 'wsj_0061.dp',
 'wsj_0081.dp',
 'wsj_0101.dp',
 'wsj_0121.dp',
 'wsj_0141.dp',
 'wsj_0161.dp',
 'wsj_0181.dp']

In [147]:
dependency_treebank.raw(fileids=whole_files[0])

'Pierre\tNNP\t2\nVinken\tNNP\t8\n,\t,\t2\n61\tCD\t5\nyears\tNNS\t6\nold\tJJ\t2\n,\t,\t2\nwill\tMD\t0\njoin\tVB\t8\nthe\tDT\t11\nboard\tNN\t9\nas\tIN\t9\na\tDT\t15\nnonexecutive\tJJ\t15\ndirector\tNN\t12\nNov.\tNNP\t9\n29\tCD\t16\n.\t.\t8\n\nMr.\tNNP\t2\nVinken\tNNP\t3\nis\tVBZ\t0\nchairman\tNN\t3\nof\tIN\t4\nElsevier\tNNP\t7\nN.V.\tNNP\t12\n,\t,\t12\nthe\tDT\t12\nDutch\tNNP\t12\npublishing\tVBG\t12\ngroup\tNN\t5\n.\t.\t3\n'

In [157]:
def parse_file(fileid):
    file_str = dependency_treebank.raw(fileids=fileid)
    splitted_file_str = [
        x for x in re.split("\t|\n", file_str.strip()) if x.strip() != ""
    ]
    tokens, tags = [], []
    for i in range(0, len(splitted_file_str), 3):
        token = splitted_file_str[i]
        tag = splitted_file_str[i + 1]
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


def parse_files(fileids):
    tokens_list, tags_list = [], []
    for fileid in fileids:
        tokens, tags = parse_file(fileid)
        tokens_list.append(tokens)
        tags_list.append(tags)
    return tokens_list, tags_list
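
`parse_file` above returns one long sequence per document. The raw text shown earlier separates sentences with a blank line, so a sentence-level variant is possible if shorter sequences are preferred. The sketch below is only an illustration under that assumption: `parse_sentences` is a hypothetical helper, and the `sample` string is a toy input in the same tab-separated format.

```python
def parse_sentences(raw_text):
    """Split raw treebank text into (tokens, tags) pairs, one per sentence."""
    sentences = []
    # Sentences are assumed to be separated by one blank line.
    for block in raw_text.strip().split("\n\n"):
        tokens, tags = [], []
        for line in block.splitlines():
            if not line.strip():
                continue
            # Each line holds: token, POS tag, head index (ignored here).
            token, tag, _head = line.split("\t")
            tokens.append(token)
            tags.append(tag)
        if tokens:
            sentences.append((tokens, tags))
    return sentences


# Toy input mimicking the corpus format (not taken from a real file).
sample = "Mr.\tNNP\t2\nVinken\tNNP\t3\nis\tVBZ\t0\n\nHe\tPRP\t2\nleft\tVBD\t0\n.\t.\t2\n"
parsed = parse_sentences(sample)
```

With the real corpus, the input would be `dependency_treebank.raw(fileids=fileid)`, as in `parse_file`.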

In [158]:
whole_tokens, whole_tags = parse_files(whole_files)

In [165]:
def flatten(a):
    return [i for s in a for i in s]


flattened_tags = flatten(whole_tags)
flattened_tokens = flatten(whole_tokens)

In [200]:
unique_tags = np.unique(flattened_tags)
len(unique_tags)

45
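
The tagger's output layer has one unit per tag, so at training time each tag string has to be mapped to an integer class id. The notebook does not show this step. The sketch below is one possible way to do it, and the helper names `build_tag_index` and `encode_tags` are made up for this example.

```python
def build_tag_index(tags):
    """Map each distinct tag to an integer id, in sorted order."""
    index_to_tag = dict(enumerate(sorted(set(tags))))
    tag_to_index = {tag: i for i, tag in index_to_tag.items()}
    return index_to_tag, tag_to_index


def encode_tags(tags, tag_to_index):
    """Convert a list of tag strings into a list of integer ids."""
    return [tag_to_index[tag] for tag in tags]


index_to_tag, tag_to_index = build_tag_index(["NNP", "NNP", ",", "CD", "NNS"])
encoded = encode_tags(["NNP", ",", "CD"], tag_to_index)
```

In the notebook, `flattened_tags` would be passed to `build_tag_index`, mirroring how `build_vocabulary` is applied to the tokens.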

In [171]:
tags_fd = nltk.probability.FreqDist(flattened_tags)
tags_fd.most_common(10)

[('NN', 13166),
 ('IN', 9857),
 ('NNP', 9410),
 ('DT', 8165),
 ('NNS', 6047),
 ('JJ', 5834),
 (',', 4886),
 ('.', 3874),
 ('CD', 3546),
 ('VBD', 3043)]

In [168]:
tokens_fd = nltk.probability.FreqDist(flattened_tokens)
tokens_fd.most_common(10)

[(',', 4885),
 ('the', 4045),
 ('.', 3828),
 ('of', 2319),
 ('to', 2164),
 ('a', 1878),
 ('in', 1572),
 ('and', 1511),
 ("'s", 864),
 ('for', 817)]

In [163]:
df = pd.DataFrame(
    {
        "tokens": whole_tokens,
        "tags": whole_tags,
        "split": splits,
        "fileid": whole_files,
    }
)
df.head()

Unnamed: 0,tokens,tags,split,fileid
0,"[Pierre, Vinken, ,, 61, years, old, ,, will, j...","[NNP, NNP, ,, CD, NNS, JJ, ,, MD, VB, DT, NN, ...",train,wsj_0001.dp
1,"[Rudolph, Agnew, ,, 55, years, old, and, forme...","[NNP, NNP, ,, CD, NNS, JJ, CC, JJ, NN, IN, NNP...",train,wsj_0002.dp
2,"[A, form, of, asbestos, once, used, to, make, ...","[DT, NN, IN, NN, RB, VBN, TO, VB, NNP, NN, NNS...",train,wsj_0003.dp
3,"[Yields, on, money-market, mutual, funds, cont...","[NNS, IN, JJ, JJ, NNS, VBD, TO, VB, ,, IN, NNS...",train,wsj_0004.dp
4,"[J.P., Bolduc, ,, vice, chairman, of, W.R., Gr...","[NNP, NNP, ,, NN, NN, IN, NNP, NNP, CC, NNP, ,...",train,wsj_0005.dp


In [164]:
train_df = df[df["split"] == "train"]
val_df = df[df["split"] == "val"]
test_df = df[df["split"] == "test"]

In [192]:
embedding_dimension = 50
embedding_model = utils.load_embedding_model("glove", embedding_dimension=embedding_dimension)

In [174]:
def build_vocabulary(tokens):
    """
    Given a list of tokens, builds the corresponding word vocabulary
    """
    words = sorted(set(tokens))
    vocabulary, inverse_vocabulary = dict(), dict()
    for i, w in tqdm(enumerate(words)):
        vocabulary[i] = w
        inverse_vocabulary[w] = i
    return vocabulary, inverse_vocabulary, words

In [184]:
index_to_word, word_to_index, word_listing = build_vocabulary(flattened_tokens)

11968it [00:00, 714310.13it/s]


In [185]:
list(index_to_word.items())[:10]

[(0, '!'),
 (1, '#'),
 (2, '$'),
 (3, '%'),
 (4, '&'),
 (5, "'"),
 (6, "''"),
 (7, "'30s"),
 (8, "'40s"),
 (9, "'50s")]

In [186]:
def check_oov_terms(embedding_model, word_listing):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms
    """
    oov_terms = []
    for word in word_listing:
        if word not in embedding_model.vocab:
            oov_terms.append(word)
    return oov_terms

In [188]:
oov_terms = check_oov_terms(embedding_model, word_listing)
print(
    f"Total OOV terms: {len(oov_terms)} ({len(oov_terms) / len(word_listing):.2%})"
)

Total OOV terms: 3745 (31.29%)
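
Part of the OOV count may come from casing: the vocabulary above keeps tokens as they appear in the corpus, while some GloVe releases contain only lowercased words. The notebook does not say which vectors `utils.load_embedding_model` loads, so this is an assumption. Under it, a lowercase fallback could recover some terms. `lookup_key` and `toy_vocab` below are hypothetical stand-ins for illustration.

```python
def lookup_key(word, embedding_vocab):
    """Return the key to use for `word` in the embedding vocabulary, or None."""
    if word in embedding_vocab:
        return word
    lowered = word.lower()
    if lowered in embedding_vocab:
        return lowered
    return None


# Toy stand-in for the pre-trained vocabulary (hypothetical contents).
toy_vocab = {"the", "board", "pierre", ","}
words = ["Pierre", "Vinken", "the", "Board", ","]
still_oov = [w for w in words if lookup_key(w, toy_vocab) is None]
```

Only words with no match in either form remain OOV and would still need a random vector.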


In [190]:
def build_embedding_matrix(
    embedding_model,
    embedding_dimension,
    word_to_index,
    oov_terms,
    method="normal",
):
    """
    Builds the embedding matrix of a specific dataset given a pre-trained Gensim word embedding model
    """

    def uniform_embedding(embedding_dimension, interval=(-1, 1)):
        # Sample each component uniformly from [interval[0], interval[1])
        return np.random.uniform(interval[0], interval[1], embedding_dimension)

    def normal_embedding(embedding_dimension):
        # Sample each component from a standard normal distribution
        return np.random.normal(size=embedding_dimension)

    if method not in ("normal", "uniform"):
        raise ValueError(f"Unknown OOV initialisation method: {method}")

    oov_terms = set(oov_terms)  # constant-time membership tests
    embedding_matrix = np.zeros((len(word_to_index), embedding_dimension))
    for word, index in word_to_index.items():
        # Words that are not OOV are taken from the Gensim model
        if word not in oov_terms:
            word_vector = embedding_model[word]
        # OOV words are initialised as random standard-normal vectors
        elif method == "normal":
            word_vector = normal_embedding(embedding_dimension)
        # OOV words are initialised as uniform vectors in range [-1, 1)
        else:
            word_vector = uniform_embedding(embedding_dimension)
        embedding_matrix[index, :] = word_vector
    return embedding_matrix

In [194]:
embedding_matrix = build_embedding_matrix(
    embedding_model, embedding_dimension, word_to_index, oov_terms, method="normal",
)
embedding_matrix[word_to_index["the"]]

array([ 4.18000013e-01,  2.49679998e-01, -4.12420005e-01,  1.21699996e-01,
        3.45270008e-01, -4.44569997e-02, -4.96879995e-01, -1.78619996e-01,
       -6.60229998e-04, -6.56599998e-01,  2.78430015e-01, -1.47670001e-01,
       -5.56770027e-01,  1.46579996e-01, -9.50950012e-03,  1.16579998e-02,
        1.02040000e-01, -1.27920002e-01, -8.44299972e-01, -1.21809997e-01,
       -1.68009996e-02, -3.32789987e-01, -1.55200005e-01, -2.31309995e-01,
       -1.91809997e-01, -1.88230002e+00, -7.67459989e-01,  9.90509987e-02,
       -4.21249986e-01, -1.95260003e-01,  4.00710011e+00, -1.85939997e-01,
       -5.22870004e-01, -3.16810012e-01,  5.92130003e-04,  7.44489999e-03,
        1.77780002e-01, -1.58969998e-01,  1.20409997e-02, -5.42230010e-02,
       -2.98709989e-01, -1.57490000e-01, -3.47579986e-01, -4.56370004e-02,
       -4.42510009e-01,  1.87849998e-01,  2.78489990e-03, -1.84110001e-01,
       -1.15139998e-01, -7.85809994e-01])

In [197]:
class BiLSTMPOSTagger(nn.Module):
    def __init__(
        self,
        input_dimension,
        embedding_dimension,
        hidden_dimension,
        output_dimension,
        num_layers=1,
        bidirectional=True,
        dropout_rate=0.0,
    ):
        super().__init__()
        self.embedding = nn.Embedding(input_dimension, embedding_dimension)
        self.lstm = nn.LSTM(
            embedding_dimension,
            hidden_dimension,
            num_layers=num_layers,
            bidirectional=bidirectional,
            dropout=dropout_rate if num_layers > 1 else 0,
        )
        self.fc = nn.Linear(
            hidden_dimension * 2 if bidirectional else hidden_dimension,
            output_dimension,
        )
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        outputs, (hidden, cell) = self.lstm(embedded)
        predictions = self.fc(self.dropout(outputs))
        return predictions
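
The GRU modification from the assignment could be sketched by swapping `nn.LSTM` for `nn.GRU` in the class above. The class name `BiGRUPOSTagger` and the dummy sizes used at the bottom are assumptions for illustration only. The one structural difference is that a GRU returns only a hidden state, with no cell state.

```python
import torch
import torch.nn as nn


class BiGRUPOSTagger(nn.Module):
    def __init__(self, input_dimension, embedding_dimension, hidden_dimension,
                 output_dimension, num_layers=1, bidirectional=True,
                 dropout_rate=0.0):
        super().__init__()
        self.embedding = nn.Embedding(input_dimension, embedding_dimension)
        self.gru = nn.GRU(
            embedding_dimension,
            hidden_dimension,
            num_layers=num_layers,
            bidirectional=bidirectional,
            dropout=dropout_rate if num_layers > 1 else 0,
        )
        self.fc = nn.Linear(
            hidden_dimension * 2 if bidirectional else hidden_dimension,
            output_dimension,
        )
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, text):
        # text: LongTensor of shape (sequence_length, batch_size)
        embedded = self.dropout(self.embedding(text))
        outputs, hidden = self.gru(embedded)  # a GRU has no cell state
        return self.fc(self.dropout(outputs))


# Dummy sizes, chosen only to check the output shape.
gru_model = BiGRUPOSTagger(100, 50, 128, 45)
dummy_batch = torch.randint(0, 100, (7, 2))  # 7 time steps, batch of 2
logits = gru_model(dummy_batch)
```

The output has one score per tag for every token position, in the same layout as the LSTM version.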

In [201]:
hidden_dimension = 128
num_layers = 1
bidirectional = True
dropout_rate = 0.0

model = BiLSTMPOSTagger(
    len(word_to_index),
    embedding_dimension,
    hidden_dimension,
    len(unique_tags),
    num_layers=num_layers,
    bidirectional=bidirectional,
    dropout_rate=dropout_rate,
)

In [205]:
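def init_weights(m):
    # Draw every parameter from a normal distribution with mean 0 and std 0.1;
    # the embedding weights are overwritten with the pre-trained vectors later.
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.1)


model.apply(init_weights)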

BiLSTMPOSTagger(
  (embedding): Embedding(11968, 50)
  (lstm): LSTM(50, 128, bidirectional=True)
  (fc): Linear(in_features=256, out_features=45, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
)

In [206]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 794,285 trainable parameters


In [207]:
model.embedding.weight.data.copy_(torch.tensor(embedding_matrix))

tensor([[-0.5840,  0.3903,  0.6528,  ..., -1.2338,  0.4672,  0.7886],
        [-1.5925,  0.7557,  1.4947,  ..., -0.8408,  0.4949,  0.3405],
        [ 0.4389,  0.9030,  1.4060,  ...,  0.0744, -0.6935,  0.8135],
        ...,
        [-0.7826, -0.3365,  1.2228,  ..., -0.3224,  0.2601, -0.0665],
        [-0.3452, -0.0519,  0.0702,  ...,  0.2199,  0.1535, -0.1313],
        [ 0.0380, -0.9954,  1.7751,  ..., -0.1395, -0.4050, -0.6642]])

In [208]:
optimizer = optim.Adam(model.parameters())
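
With the optimizer in place, a single training step might look like the sketch below. This is not the notebook's training loop, which is not shown. The helper `train_step`, the cross-entropy criterion and the tiny stand-in model are assumptions, chosen so that the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim


def train_step(model, optimizer, criterion, token_ids, tag_ids):
    """Run one optimisation step and return the loss as a float."""
    model.train()
    optimizer.zero_grad()
    logits = model(token_ids)  # (seq_len, batch, n_tags)
    # Flatten time and batch dimensions so each token is one classification.
    loss = criterion(logits.reshape(-1, logits.shape[-1]), tag_ids.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()


# Tiny stand-in model so the sketch runs on its own (hypothetical sizes).
toy_model = nn.Sequential(nn.Embedding(20, 8), nn.Linear(8, 5))
toy_optimizer = optim.Adam(toy_model.parameters())
toy_criterion = nn.CrossEntropyLoss()

token_ids = torch.randint(0, 20, (6, 3))  # (seq_len, batch)
tag_ids = torch.randint(0, 5, (6, 3))
losses = [
    train_step(toy_model, toy_optimizer, toy_criterion, token_ids, tag_ids)
    for _ in range(50)
]
```

In the notebook, `model` and `optimizer` would replace the toy objects. If padded batches are used, `nn.CrossEntropyLoss(ignore_index=...)` can exclude the padding positions from the loss.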