# Lab session 3: Word embedding

This lab covers word embedding as seen in the theory lectures (DL lecture 5).

General instructions:
- Complete the code where needed
- Provide answers to questions only in the cell where indicated
- **Do not alter the evaluation cells** (`## evaluation`) in any way as they are needed for the partly automated evaluation process

## **Embedding; the Steroids for NLP!**

Pre-trained embedding have brought NLP a long way. Most of the recent methods include word embeddings into their pipeline to obtain state-of-the-art performance. `Word2vec` is among the most famous methods to efficiently create word embeddings and has been around since 2013. Word2Vec has two different model architectures, namely `Skip-gram` and `CBOW`. `Skip-gram` was explained in more detail in the theory lecture, and today we will play with `CBOW`. We will train our own little embeddings, and use them to visualize text corpora. In the last part, we will download and utilize other pretrained embeddings to build a Part-of-Speech tagging (PoS) model.

<img src="http://3g1o5q2sqh3w32ohtj4dwggw.wpengine.netdna-cdn.com/wp-content/uploads/2012/08/steroids-before-and-after-480x321.jpg" alt="img" width="512px"/>



In [1]:
# import necessary packages
import random
import math
import numpy as np

from random import shuffle
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [0]:
# for reproducibility

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## 1. Data preparation

As always, let's first prepare the data. We shall use the `text8` dataset, which offers cleaned English Wikipedia text. The data is clean UTF-8 and all characters are lower-cased with valid encodings.

In [3]:
!wget "http://mattmahoney.net/dc/text8.zip" -O text8.zip
!unzip -o text8.zip
!rm text8.zip
!head -c 1b text8 # print first bytes of text8 data

--2020-04-28 14:51:47--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.75
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.75|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2020-04-28 14:52:22 (888 KB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   
 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the b

In [0]:
# read text8
with open('text8', 'r') as input_file:
    text = input_file.read()

### Tokenization
We first chop our text into pieces using NLTK's `WordPuncTokenizer`:

In [5]:
from nltk.tokenize import WordPunctTokenizer

tknzr = WordPunctTokenizer()
tokenized_text = tknzr.tokenize(text)

print(tokenized_text[0:20])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']


### Build dictionary
In this step, we convert each word to a unique id. We can define our vocabulary trimming rules, which specify whether certain words should remain in the vocabulary, be trimmed away, or handled differently. In following, we limit our vocabulary size to `vocab_size` words and replace the remaining tokens with `UNK`:

In [0]:
def get_data(text, vocab_size = None):
    
    word_counts = Counter(text)
    
    sorted_token = sorted(word_counts, key=word_counts.get, reverse=True) # sort by frequency
    
    if vocab_size: # keep most frequent words
        sorted_token = sorted_token[:vocab_size-1] 
    
    sorted_token.insert(0, 'UNK') # reserve 0 for UNK
    
    id_to_token = {k: w for k, w in enumerate(sorted_token)}
    token_to_id = {w: k for k, w in id_to_token.items()}
    
    # tokenize words in vocab and replace rest with UNK
    tokenized_ids = [token_to_id[w] if w in token_to_id else 0 for w in text]

    return tokenized_ids, id_to_token, token_to_id

In [7]:
tokenized_ids, id_to_token, token_to_id = get_data(tokenized_text)
print('-' * 50)
print('Number of uniqe tokens: {}'.format(len(id_to_token)))
print('-' * 50)
print("tokenized text: {}".format(tokenized_text[0:20]))
print('-' * 50)
print("tokenized ids: {}".format(tokenized_ids[0:20]))

--------------------------------------------------
Number of uniqe tokens: 253855
--------------------------------------------------
tokenized text: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english']
--------------------------------------------------
tokenized ids: [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156, 128, 742, 477, 10572, 134, 1, 27350, 2, 1, 103]


### Generate samples
 
The `CBOW` model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). The training data thus comprises pairs of `(context_window, target_word)`, for which the model should predict the `target_word` based on the `context_window` words.

Considering a simple sentence, __the quick brown fox jumps over the lazy dog__, with a `context_window` of size 1, we have examples like __([quick, fox], brown)__, __([the, brown], quick)__, __([the, dog], lazy)__ and so on. 

<img src="
https://cdn-images-1.medium.com/max/800/1*UVe8b6CWYykcxbBOR6uCfg.png" alt="img" width="400px"/>



Now let us convert our tokenized text from `tokenized_ids` into `(context_window, target_word)` pairs.

You should loop over the `tokenized_ids` and build a __generator__ which yields a target word of length 1 and surrounding context of length (2 $\times$ `window_size`) where we take `window_size` words before and after the target word in our corpus. Remember to pad context words with zeroes to a fixed length if needed.

In [0]:
def generate_sample(tknzd_ids, window_size = 5):
    for index, target in enumerate(tknzd_ids):
        ############### for student ################
        n_before = window_size
        n_after = window_size

        if index < window_size:
            n_before = index

        if index > len(tknzd_ids) - window_size - 1:
            n_after = len(tknzd_ids) - index - 1

        context_window = [0]*(window_size-n_before) + tknzd_ids[index-n_before:index] + tknzd_ids[index+1:index+n_after+1] + [0]*(window_size-n_after)
        ############################################
        yield context_window, target

In [9]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_gen = generate_sample([11, 12, 13, 14, 15], 2)
dummy_example = list(dummy_gen)
print(dummy_example)

assert isinstance(dummy_example[0], tuple), "Is it a pair?" 
assert len(dummy_example[0][0]) == 4, "Context length should be 2 * window_size"
assert dummy_example[0][1] == 11, "Did you return the correct target word?"
assert dummy_example[0][0][0] == dummy_example[0][0][1]==0, "Did you add 0 pads where needed?"
assert len(dummy_example[0]) == len(dummy_example[-1]), "Length of all instances should be the same due to the padding"
assert dummy_example[0][0] == [0, 0, 12, 13], "Did you consider contexts before and after the target word?" 

print('Well done!')

[([0, 0, 12, 13], 11), ([0, 11, 13, 14], 12), ([11, 12, 14, 15], 13), ([12, 13, 15, 0], 14), ([13, 14, 0, 0], 15)]
Well done!


To train our model faster, it is good idea to batchify our data. For your convenience, we implemented it for you: 

In [0]:
def batch_gen(tknzd_ids, batch_size = 4,  window_size = 5):
    
    # shuffling all tokenized ids changes the context of each target word
    # shuffle(tknzd_ids) # shuffle is in place and does not return anything
    
    single_gen = generate_sample(tknzd_ids, window_size) # get sample generator
    
    while True:
        try: 
            # The end of iterations is indicated by an exception 
            context_batch = np.zeros([batch_size, window_size * 2], dtype=np.int32)
            target_batch = np.zeros([batch_size], dtype=np.int32)
            for index in range(batch_size):
                context_batch[index], target_batch[index] = next(single_gen)
            yield context_batch, target_batch
        except StopIteration:
            break

In [11]:
dummy_batches = batch_gen([11, 12, 13, 14, 15, 16, 17, 18], batch_size=4, window_size=2)

print("First batch:\n", next(dummy_batches))
print('-' * 50)
print("Second batch:\n", next(dummy_batches))

First batch:
 (array([[ 0,  0, 12, 13],
       [ 0, 11, 13, 14],
       [11, 12, 14, 15],
       [12, 13, 15, 16]], dtype=int32), array([11, 12, 13, 14], dtype=int32))
--------------------------------------------------
Second batch:
 (array([[13, 14, 16, 17],
       [14, 15, 17, 18],
       [15, 16, 18,  0],
       [16, 17,  0,  0]], dtype=int32), array([15, 16, 17, 18], dtype=int32))


## 2. CBOW Model

We now leverage pytorch to build our CBOW model. For this, our inputs will be our context words which are first converted into one-hot vectors, and next projected into a word-vector. Word-vectors will be obtained from an embedding-matrix ($W$) which represents the distributed feature vectors associated with each word in the vocabulary. This embedding-matrix is initialized with a normal distribution.

Next, the projected words are averaged out (hence we don’t really consider the order or sequence in the context words when averaged) and then we multiply this averaged vector with another embedding matrix ($W'$), which defines so-called context embeddings to project the CBOW representation back to the one-hot space to match with the target word. (Note: in the theory, this is introduced as the linear output layer, with dimensions equal to the transposed of the embedding matrix.)  We thus apply a log-softmax on the resulting context vectors, to predict the most probable target word given the input context.

We match the predicted word with the actual target word, compute the loss by leveraging the cross entropy loss and perform back-propagation with each iteration to update the embedding-matrix in the process.

<img src="https://cdn-images-1.medium.com/freeze/max/1000/1*uATTt40gbJ1HJQgIqE-VPA.png?q=20" alt="img" width="512px"/>



### Question-1

- How could we modify the `CBOW` architecture to consider the order and position of the context words?  

**<font color=blue>By using max pooling instead of average pooling.</font>**

Now, complete the CBOW class below, following the instructions in the comments.

In [0]:
class CBOW(nn.Module):

    def __init__(self, embedding_dim=100, vocab_size=10000):
        super(CBOW, self).__init__()
        
        self.vocab_size = vocab_size
        
        # use nn.Parameter to define the two matrices W and W' from above, 
        # thus one for word (W) and one for context (W') embeddings:
        # self.embed_in = ...  # word embedding
        # self.embed_out = ... # context embedding
        ############### for student ################
        self.embed_in = nn.Parameter(torch.zeros((embedding_dim, self.vocab_size)))
        self.embed_out = nn.Parameter(torch.zeros((self.vocab_size, embedding_dim)))
        ############################################
        
        self.reset_parameters()
            
    
    def reset_parameters(self):
        # Initialize parameters
        nn.init.kaiming_uniform_(self.embed_in, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.embed_out, a=math.sqrt(5))
    
    def get_word_embedding(self):
        return self.embed_in
    
    def get_context_embedding(self):
        return self.embed_out
    
    
    def forward(self, inps):
        """
        Convert given indices to log-probablities. 
        Follow these steps:
        1) convert the inputs' word indices to one-hot vectors
        2) project the one-hot vectors to their embedding (use F.linear, do *NOT* use nn.Embedding)
        3) calculate the mean of the embedded vectors
        4) project back with the context embedding matrix 
        5) calculate the log-probability (with F.log_softmax)
                
        :argument:
            inps (list): List of indices
        
        :return:
            log-probablity of words
        """
        ############### for student ################
        one_hot = F.one_hot(inps, self.vocab_size)
        word_embed = F.linear(one_hot.float(), self.embed_in)
        mean = word_embed.mean(dim=1)
        context_embed = F.linear(mean, self.embed_out)
        log_probs = F.log_softmax(context_embed)
        ############################################
        return log_probs

In [13]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = CBOW(20, 10)
dummy_inps1 = torch.tensor([[6, 7, 9, 0]], dtype=torch.long)
dummy_inps2 = torch.tensor([[6, 7, 9, 0], [1, 2, 3, 4]], dtype=torch.long)
dummy_pred1 = dummy_model(dummy_inps1)
dummy_pred2 = dummy_model(dummy_inps2)

assert isinstance(dummy_model.embed_in, nn.Parameter), "Use nn.Parameter for embed_in"
assert isinstance(dummy_model.embed_out, nn.Parameter), "Use nn.Parameter for embed_out"
assert dummy_model.embed_in.shape == torch.Size([20, 10]), "param_in shape is not correct"
assert dummy_model.embed_out.shape == torch.Size([10, 20]), "param_out shape is not correct"
assert dummy_pred1.shape == torch.Size([1,10]), "Prediction shape is not correct"
assert dummy_pred2.shape == torch.Size([2,10]), "Prediction shape is not correct"
assert dummy_pred1.grad_fn.__class__.__name__ == 'LogSoftmaxBackward', "softmax layer?"

print('Well done!')

Well done!




### Train Model

Before jumping into the training part, we need to define some hyper-parameters:

In [0]:
# embedding hyper-parameters

EMBED_DIM = 100
WINDOW_SIZE = 5
BATCH_SIZE = 128
VOCAB_SIZE = 10_000

EPOCHS = 1 # to make things faster in this basic setup
interval = 1000

In [0]:
# get data

tokenized_ids, id_to_token, _ = get_data(tokenized_text, VOCAB_SIZE)

Now we define our main training loop. Please implement the typical steps for training:
- Reset all gradients
- Compute output and loss value
- Perform back-propagation
- Update the network’s parameters

In [16]:
model = CBOW(EMBED_DIM, VOCAB_SIZE)
model = model.to(device)

criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters())

loss_history = []

for e in range(EPOCHS):
    
    batches = batch_gen(tokenized_ids, batch_size=BATCH_SIZE, window_size=WINDOW_SIZE)
    total_loss = 0.0
    
    for iteration, (context, target) in enumerate(batches):
        
        # Step 1. Prepare the inputs to be passed to the model (wrap integer indices in tensors)
        # Step 2. Recall that torch *accumulates* gradients. Before passing a
        #         new instance, you need to zero out the gradients from the old instance
        # Step 3. Run the forward pass, getting predicted target words log probabilities
        # Step 4. Compute your loss function. 
        # Step 5. Do the backward pass and update the gradient
        
        ############### for student ################
        context = torch.LongTensor(context).to(device)
        target = torch.LongTensor(target).to(device)

        optimizer.zero_grad()

        log_probs = model.forward(context)
        loss = criterion(log_probs, target)

        loss.backward()
        optimizer.step()
        ############################################
        
        total_loss += loss.item()
        
        if iteration % interval == 0:
            print('Epoch:{}/{},\tIteration:{},\tLoss:{}'.format(e, EPOCHS, iteration, total_loss / interval))#, end = "\r", flush = True)
            loss_history.append(total_loss / interval)
            total_loss = 0.0



Epoch:0/1,	Iteration:0,	Loss:0.009210083961486816
Epoch:0/1,	Iteration:1000,	Loss:6.98003680229187
Epoch:0/1,	Iteration:2000,	Loss:6.41237203502655
Epoch:0/1,	Iteration:3000,	Loss:6.355920181751252
Epoch:0/1,	Iteration:4000,	Loss:6.268236405968666
Epoch:0/1,	Iteration:5000,	Loss:6.011583807945251
Epoch:0/1,	Iteration:6000,	Loss:6.0881432962417605
Epoch:0/1,	Iteration:7000,	Loss:5.935342714190483
Epoch:0/1,	Iteration:8000,	Loss:5.704964931488037
Epoch:0/1,	Iteration:9000,	Loss:5.688196800708771
Epoch:0/1,	Iteration:10000,	Loss:6.082110327959061
Epoch:0/1,	Iteration:11000,	Loss:6.02968112373352
Epoch:0/1,	Iteration:12000,	Loss:5.972457123041153
Epoch:0/1,	Iteration:13000,	Loss:5.8920275294780735
Epoch:0/1,	Iteration:14000,	Loss:5.933533553361893
Epoch:0/1,	Iteration:15000,	Loss:5.758544758796692
Epoch:0/1,	Iteration:16000,	Loss:6.000870287895203
Epoch:0/1,	Iteration:17000,	Loss:5.954390692472458
Epoch:0/1,	Iteration:18000,	Loss:5.969454768419266
Epoch:0/1,	Iteration:19000,	Loss:5.9801271

In [17]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert loss_history[-1] < 6.5

print('Well done!')

Well done!


### Nearest words

So far, we trained the __CBOW__ successfully, now it is time to explore it more. In this part, we want to find the $k$ nearest word to a given word, i.e., nearby in the vector space.

<img src="https://i0.wp.com/i.imgur.com/IeZt839.png" alt="img" width="480px"/>



Define a helper function to retrieve the corresponding vector for a given word:

In [0]:
# be sure jupyter session is not terminated!
# use token_to_id to retrieve the index

def get_vector(embedding, word):
    """
    :argument:
        embedding (matrix): embedding matrix 
        word (str): The given input
    :return:
        word-vector for a given word
    """
    ############### for student ################
    index = token_to_id[word]
    word_vector = embedding[:,index].unsqueeze(1)
    return word_vector
    ############################################

In [19]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

embedding = model.embed_in.data

assert get_vector(embedding, 'the').shape == torch.Size([100, 1]), "vector size should be (embed_dim, 1)"
assert np.allclose(embedding[:,(0,)].data.cpu().numpy(), get_vector(embedding, 'UNK').data.cpu().numpy()), "Do you retrieve correct vector?"
print('Well done!')

Well done!


Define a function to return the list of $k$ most similar words, e.g., based on `cosine-similarity`, to a given word:

In [0]:
def most_similar_words(embedding, word, k=1):
    """
    return k similar (based on cosine similarity) items
    :argument:
        embedding (matrix): embedding matrix 
        word (str): The given input
        k (int): The number of similar items    
    :return:
        list of k similar items
    """
    x = get_vector(embedding, word) # 100, 1
    ############### for student ################
    distances = F.cosine_similarity(embedding.T, x.T)
    ids = torch.argsort(distances, descending=True)[1:k+1]
    most_similar = [id_to_token[id.item()] for id in ids]
    ############################################
    return most_similar

In [21]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

embedding = model.embed_in.data

dummy_list = most_similar_words(embedding, "mutual", 3)
s1 = F.cosine_similarity(get_vector(embedding, dummy_list[0]).T, get_vector(embedding, "mutual").T)
s2 = F.cosine_similarity(get_vector(embedding, dummy_list[1]).T, get_vector(embedding, "mutual").T)
s3 = F.cosine_similarity(get_vector(embedding, dummy_list[2]).T, get_vector(embedding, "mutual").T)

assert len(dummy_list) == 3, "return k nearest words"
assert s1.data.cpu().numpy()[0] >= s2.data.cpu().numpy()[0], "first item should have higher probablity to the given word"
assert s2.data.cpu().numpy()[0] >= s3.data.cpu().numpy()[0], "second item should have higher probability"
assert s1.data.cpu().numpy()[0] != 1 , "Similarity score of one means you return the word itself"

print('Well done!')

Well done!


### Linear projection


The simplest linear dimensionality reduction method is __P__rincipial __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.


<img src="https://hackernoon.com/hn-images/1*ZFqnPuxa1PtUece-OHBoTA.png" alt="img" width="512px"/>

Under the hood, it attempts to decompose an object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing the *mean squared error*:

$$\min_{W, \hat{W}} \ \ \|(X W) \hat{W} - X\|^2_2 $$

with
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;


In [0]:
from sklearn.decomposition import PCA

# Map word vectors onto a 2D plane with PCA. Use the good old sklearn API (fit, transform).
# Finally, normalize the mapped vectors, to make sure they have zero mean and unit variance 

############### for student ################
word_vectors = embedding.data.cpu().numpy()
pca = PCA(n_components=2, whiten=True)
word_vectors_pca = pca.fit_transform(word_vectors)
############################################

In [23]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2D vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

print('Well done')

Well done


In [0]:
# !pip install bokeh

import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [25]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=list(id_to_token.values()))



### Visualizing neighbors with t-SNE
PCA is nice but it's strictly linear and thus only able to capture coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points near, we could use TSNE, which is itself an embedding method. Here you can read __[more on TSNE](https://distill.pub/2016/misread-tsne/)__.

In [26]:
from sklearn.manifold import TSNE

# Map word vectors onto a 2d plane with TSNE. (Hint: use verbose=100 to see what it's doing.)
# Normalize them just like with PCA into word_tsne

############### for student ################
tsne = TSNE(n_components=2,verbose=100)
word_tsne = tsne.fit_transform(word_vectors)
############################################

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 100 samples in 0.001s...
[t-SNE] Computed neighbors for 100 samples in 0.022s...
[t-SNE] Computed conditional probabilities for sample 100 / 100
[t-SNE] Mean sigma: 17.458122
[t-SNE] Computed conditional probabilities in 0.017s
[t-SNE] Iteration 50: error = 78.5850754, gradient norm = 0.4589480 (50 iterations in 1.593s)
[t-SNE] Iteration 100: error = 80.7489929, gradient norm = 0.4475091 (50 iterations in 0.601s)
[t-SNE] Iteration 150: error = 78.4852066, gradient norm = 0.4269891 (50 iterations in 0.531s)
[t-SNE] Iteration 200: error = 77.9681549, gradient norm = 0.3681235 (50 iterations in 0.862s)
[t-SNE] Iteration 250: error = 82.4722595, gradient norm = 0.3647962 (50 iterations in 0.450s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 82.472260
[t-SNE] Iteration 300: error = 1.6267133, gradient norm = 0.0027350 (50 iterations in 0.430s)
[t-SNE] Iteration 350: error = 1.2485536, gradient norm = 0.0015089 

In [27]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=list(id_to_token.values()))



## 3. POS tagging task

The embeddings by themselves are nice to have, but the main objective of course is to solve a particular (NLP) task. Further, so far we have trained our own embedding from a given corpus, but often it is beneficial to use existing word embeddings.

Now, let's use embeddings to train a simple Part of Speech (PoS) tagging model, using pretrained word embeddings. We shall use [50d glove word vectors](https://nlp.stanford.edu/projects/glove/) for the rest of this section.

Before jumping into our neural POS tagger, it is better to set up a baseline to give us an intuition how the neural model performs compared to other models. The baseline model is the [Conditional-Random-Field (CRF)](https://en.wikipedia.org/wiki/Conditional_random_field, also discussed in lecture `NLP_03_PoS_tagging_and_NER_20201`) which is a discriminative sequence labelling model. The evaluation is done on a 10\% sample of the Penn Treebank (which is offered through NLTK).

Download data from `nltk` repository and split it into test (20%) and training (80%) sets:

In [28]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

# download necessary packages from nltk
nltk.download('treebank')
nltk.download('universal_tagset')

tagged_sentence = nltk.corpus.treebank.tagged_sents(tagset='universal')
print("Number of Tagged Sentences ", len(tagged_sentence))
print(tagged_sentence[0])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
Number of Tagged Sentences  3914
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]


In [29]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(tagged_sentence, test_size=0.20, random_state=42)

print("Train size: {}".format(len(train)))
print("Test size: {}".format(len(test)))

Train size: 3131
Test size: 783


### Setup a baseline

In [0]:
def features(sentence, index):
    """
    Return hand designed features for a given word
    :argument:
        sentence: tokenized sentence [w1, w2, ...] 
        index: index of the word    
    :return:
        a feature set for given word
    """

    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        ############### for student ################
        'length': len(sentence[index]),
        # 'sentence_length':len(sentence),
        # 'index': index,
        'is_number': sentence[index].isdigit(),
        # 'is_stopword': sentence[index] in stopwords,
        # 'prev_prev_word': '' if index <= 1 else sentence[index - 2],
        ############################################
    }

### Question-2

- Suggest about 6 more features that you could improve the above feature-set and add them to the code above. After running the model with these features: which features worked best, and how much did your new features help in improving the model?   

**<font color=blue>results:</font>**

* none: 93.76 %
* length: 93.92 %
* sentense_length: 93.41 %
* index: 93.69 %
* is_number: 94.09 %
* is_stopword: 93.66 %
* prev_prev_word: 93.77 %
* all: 93.98 %
* is_number and length: 94.17 %

**<font color=blue>
The new features do not have a big influence on the accuracy of the model.
sentence_length, index, and is_stopword even makes the model perform worse.
The best features are is_number and length.
Using those together results in the best model.
</font>**

In [0]:
def transform2feature_label(tagged_sentence):
    X, y = [], []
 
    for tagged in tagged_sentence:
        X.append([features([w for w, t in tagged], i) for i in range(len(tagged))])
        y.append([tagged[i][1] for i in range(len(tagged))])
    
    return X,y

In [0]:
X_train, y_train = transform2feature_label(train)
X_test, y_test = transform2feature_label(test)

In [33]:
X_train[0][0]

{'is_capitalized': True,
 'is_first': True,
 'is_last': False,
 'is_number': False,
 'length': 6,
 'next_word': 'Vinken',
 'prev_word': '',
 'word': 'Pierre'}

In [34]:
# install crf-classifier
!pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3
[?25l  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8eebfb0a3d30776334e0097f8432b631a9a3a19/python_crfsuite-0.9.7-cp36-cp36m-manylinux1_x86_64.whl (743kB)
[K     |████████████████████████████████| 747kB 6.6MB/s 
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6


In [35]:
import sklearn_crfsuite

# fit crfsuite classifier on train data
############### for student ################
crf = sklearn_crfsuite.CRF()
crf.fit(X_train, y_train)
############################################

print ("Accuracy:", crf.score(X_test, y_test))

Accuracy: 0.9417003260499295


### Build neural model 

Now it's time to build our Neural PoS-tagger. The model we want to play with is a bi-directional LSTM on top of pretrained word embeddings. First, we prepare the embedding part and then go into the model itself:

In [36]:
# download glove 50d
!wget "https://www.dropbox.com/s/lc3yjhmovq7nyp5/glove6b50dtxt.zip?dl=1" -O glove6b50dtxt.zip
!unzip -o glove6b50dtxt.zip
!rm glove6b50dtxt.zip

--2020-04-28 15:00:15--  https://www.dropbox.com/s/lc3yjhmovq7nyp5/glove6b50dtxt.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:601b:1::a27d:801
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/lc3yjhmovq7nyp5/glove6b50dtxt.zip [following]
--2020-04-28 15:00:15--  https://www.dropbox.com/s/dl/lc3yjhmovq7nyp5/glove6b50dtxt.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uce8e2bab1bcb0c1e081ebfa5ae5.dl.dropboxusercontent.com/cd/0/get/A2uEQEbAaGLBVmsjN2eYewGO8yQneTKHb4oiOLXfrBVeOFkIzGRzgCSmQIkGqUm3EalRyxAQQkoJyT6IWmOo-hea84X-tugFfYbcO7U__9d8keol9X7nkGDYF87_i2foxfI/file?dl=1# [following]
--2020-04-28 15:00:15--  https://uce8e2bab1bcb0c1e081ebfa5ae5.dl.dropboxusercontent.com/cd/0/get/A2uEQEbAaGLBVmsjN2eYewGO8yQneTKHb4oiOLXfrBVeOFkIzGRzgCSmQIkGqUm3EalRyxAQQkoJyT6IWmOo-hea84X-tugFf

In [0]:
GLOVE_PATH = 'glove.6B.50d.txt'

We build two dictionaries for mapping words and tags to uniqe ids, which we need later on:

In [38]:
word_to_id = {}
tag_to_id = {}

for sentence in tagged_sentence:
    for word, pos_tag in sentence:
        if word not in word_to_id.keys():
            word_to_id[word] = len(word_to_id)
        if pos_tag not in tag_to_id.keys():
            tag_to_id[pos_tag] = len(tag_to_id)
            
word_vocab_size = len(word_to_id)
tag_vocab_size = len(tag_to_id)

print("Unique words: {}".format(word_vocab_size))
print("Unique tags: {}".format(tag_vocab_size))

Unique words: 12408
Unique tags: 12


We created a wrapper for the embedding module to encapsulate it from the other parts. This module aims to load word vectors from file and assign the weights into the corresponding embedding.

Create an embedding layer (this time use `nn.Embedding`), and assign the pretrained embeddings to its `weight` field. In this exercise, you can continue to finetune the embeddings while training the end task; no need to freeze them: this means the pre-trained embeddings serve as a smart initialization of the embedding layer.

In [0]:
class PretrainedEmbeddings(nn.Module):
    def __init__(self, filename, word_to_id, dim_embedding):
        super(PretrainedEmbeddings, self).__init__()
        
        wordvectors = self.load_word_vectors(filename, word_to_id, dim_embedding)
        ############### for student ################
        self.embed = nn.Embedding(num_embeddings=len(word_to_id), embedding_dim=dim_embedding)
        self.embed.weight = nn.Parameter(wordvectors)
        ############################################

    def forward(self, inputs):
        return self.embed(inputs)
    
    def load_word_vectors(self, filename, word_to_id, dim_embedding):
        wordvectors = torch.zeros(len(word_to_id), dim_embedding)
        with open(filename, 'r') as file:
            for line in file.readlines():
                data = line.split(' ')
                word = data[0]
                vector = data[1:]
                if word in word_to_id.keys():
                    wordvectors[word_to_id[word],:] = torch.Tensor([float(x) for x in vector])
        
        return wordvectors

In [40]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = PretrainedEmbeddings(GLOVE_PATH, word_to_id, 50)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)

assert dummy_model.embed.weight.shape == torch.Size([word_vocab_size, 50]), "embedding shape is not correct"
assert dummy_model(dummy_inps).shape == torch.Size([5, 50]), "word embedding shape is not correct"
assert np.allclose(dummy_model.embed.weight.detach().numpy()[0], [0] * 50), "Load weights from glove?"
assert np.allclose(dummy_model.embed.weight.detach().numpy()[714], [0] * 50), "Are you sure you load from glove correctly?"

print('Well done')

Well done


Let’s now define the model. Here’s what we need:

- We’ll need an embedding layer that computes a word vector for each word in a given sentence
- We’ll need a bidirectional-LSTM layer to incorporate context from both directions  (reshape the embedding since `nn.LSTM` needs 3-dimensional inputs)
- After the LSTM Layer we need a Linear layer that picks the appropriate POS tag (note that this layer is applied to each element of the sequence).
- Apply the LogSoftmax to calculate the log probabilities from the resulting scores.

Complete the forward path of the POSTagger model: 

In [0]:
class POSTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, word_to_id, tag_to_id, embedding_file_path):
        super(POSTagger, self).__init__()
        
        self.embed = PretrainedEmbeddings(embedding_file_path, word_to_id, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim * 2, len(tag_to_id))
        self.logsoftmax = nn.LogSoftmax()
        
    def forward(self, sentence):
        ############### for student ################
        embeddings = self.embed(sentence)
        hidden, _ = self.lstm(embeddings.unsqueeze(1))
        tag = self.hidden2tag(hidden).squeeze(1)
        tag_scores = self.logsoftmax(tag)
        ############################################
        return tag_scores

In [42]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = POSTagger(50, 50, word_to_id, tag_to_id, GLOVE_PATH)
dummy_inps = torch.tensor([0, 4, 3, 5, 9], dtype=torch.long)

assert dummy_model(dummy_inps).grad_fn.__class__.__name__ == 'LogSoftmaxBackward', "softmax layer?"
assert dummy_model(dummy_inps).shape == torch.Size([5, len(tag_to_id)]), "The output has wrong shape! Probably you need some reshaping!"

print("Well done!")

Well done!


  from ipykernel import kernelapp as app


Perfect! Now train your model:

In [43]:
# Training start
model = POSTagger(50, 64, word_to_id, tag_to_id, GLOVE_PATH)
model = model.to(device)
criterion = nn.NLLLoss()
optimizer = optim.AdamW(model.parameters())

accuracy_list = []
loss_list = []

interval = round(len(train) / 100.)
EPOCHS = 6
e_interval = round(EPOCHS / 10.)

for e in range(EPOCHS):
    acc = 0 
    loss = 0
    
    model.train()
    
    for i, sentence_tag in enumerate(train):
        
        sentence = [word_to_id[s[0]] for s in sentence_tag]
        sentence = torch.tensor(sentence, dtype=torch.long)
        sentence = sentence.to(device)
        targets = [tag_to_id[s[1]] for s in sentence_tag]
        targets = torch.tensor(targets, dtype=torch.long)
        targets = targets.to(device)
        
        model.zero_grad()
        
        tag_scores = model(sentence)
        
        loss = criterion(tag_scores, targets)
        
        loss.backward()
        
        optimizer.step()
        
        loss += loss.item()
        
        _, indices = torch.max(tag_scores, 1)

        acc += torch.mean((targets == indices).float())
        
        if i % interval == 0:
            print("Epoch {} Running;\t{}% Complete".format(e + 1, i / interval), end = "\r", flush = True)
    
    loss = loss / len(train)
    acc = acc / len(train)
    loss_list.append(float(loss))
    accuracy_list.append(float(acc))
    
    if (e + 1) % e_interval == 0:
        print("Epoch {} Completed,\tLoss {}\tAccuracy: {}".format(e + 1, np.mean(loss_list[-e_interval:]), np.mean(accuracy_list[-e_interval:])))



  from ipykernel import kernelapp as app


Epoch 1 Completed,	Loss 3.27258967445232e-05	Accuracy: 0.8677334785461426
Epoch 2 Completed,	Loss 1.4633681530540343e-05	Accuracy: 0.9653012156486511
Epoch 3 Completed,	Loss 1.3041941201663576e-05	Accuracy: 0.9827902913093567
Epoch 4 Completed,	Loss 1.2006984434265178e-05	Accuracy: 0.9901929497718811
Epoch 5 Completed,	Loss 4.038249244331382e-06	Accuracy: 0.994438648223877
Epoch 6 Completed,	Loss 3.2298216865456197e-06	Accuracy: 0.996758222579956


So far, so good! It's time to test our classifier. Complete the evaluation part. Compute accuracy on the test data:

In [0]:
def evaluate(model, data):

    model.eval()
    
    acc = 0.0
    
    # calculate accuracy based on predictions
    ############### for student ################
    for i, sentence_tag in enumerate(data):
        sentence = [word_to_id[s[0]] for s in sentence_tag]
        sentence = torch.LongTensor(sentence).to(device)
        targets = [tag_to_id[s[1]] for s in sentence_tag]
        targets = torch.LongTensor(targets).to(device)

        tag_scores = model(sentence)
        _, indices = torch.max(tag_scores, 1)
        acc += torch.mean((targets == indices).float())
    
    score = acc.item() / len(data)
    ############################################   
    return score

In [47]:
score = evaluate(model, test)
print("Accuracy:", score)

assert score > 0.96, "accuracy should be above 96%"
assert score < 1.00, "accuracy should be less than 100!%"

print('Well done!')

  from ipykernel import kernelapp as app


Accuracy: 0.9592359209121568


AssertionError: ignored

### Question-3

- Whether or not to fine-tune the pre-trained embeddings, the number of epochs you need (whether or not to use 'early stopping'), to apply regularization... are hyperparameters that should be properly tuned on a validation set. We did not do this here. It is therefore hard to make strong claims about the model at this point. However, as a quick test, please train the POS model with the same settings, but with a standard randomly initialized embedding layer instead of the pretrained embeddings. What do you observe compared to the CRF baseline / compared to the GloVe initialization? (Note: for your final code in `POSTagger`, please make sure it again loads the pretrained embeddings).

**<font color=blue>Results:</font>**

* CRF baseline: 94.17 %
* POSTagger random: 91.05 %
* POSTagger GloVe: 95.68 %

**<font color=blue>Without the pre-trained GloVe embeddings, the model clearly performs worse than with pre-trained weights. The model with pre-trained GloVe embeddings performs better than the CRF baseline. After tuning some hyperparameters, it should be possible to outperform CRF.</font>**

### Acknowledgment

If you received help or feedback from fellow students, please acknowledge that here. We count on your academic honesty:

I did not receive any help from fellow students, besides Jarne Verhaeghe, who found a small bug in reshaping the inputs of the LSTM.
This is explained into more detail on the forum on Ufora.