# Week 4 - NLP and Deep Learning

---

# Lecture 7. RNN

In the assignments for this lecture, we are going to implement an RNN POS tagger in PyTorch.

You can use the following function for data reading:

In [4]:
def read_conll_file(path):
    """
    read in conll file
    
    :param path: path to read from
    :returns: list with sequences of words and labels for each sentence
    """
    data = []
    current_words = []
    current_tags = []

    # use a context manager so that the file is closed after reading
    with open(path, encoding='utf-8') as conll_file:
        for line in conll_file:
            line = line.strip()

            if line:
                if line[0] == '#':
                    continue  # skip comments
                tok = line.split('\t')

                current_words.append(tok[0])
                current_tags.append(tok[1])
            else:
                if current_words:  # skip empty lines
                    data.append((current_words, current_tags))
                current_words = []
                current_tags = []

    # the last sentence is not necessarily followed by an empty line
    if current_tags:
        data.append((current_words, current_tags))
    return data

train_data = read_conll_file('pos-data/en_ewt-train.conll')
dev_data = read_conll_file('pos-data/en_ewt-dev.conll')

print(len(train_data))
print(len(dev_data))
print(max([len(x[0]) for x in train_data ]))

12543
2001
159


## 1. Prepare data for use in PyTorch

* a) Convert the data to a format that can be used in a PyTorch module. This means we require:

  * training data: a matrix of the number of instances (12543) by the maximum sentence length (159), filled with word indices
  * training labels: a matrix of the same shape, but filled with label indices instead (17 labels in total)
  * the same two matrices for the dev data (note that no new word indices can be added at this point, because the vocabulary is fixed after reading the training data)
  
A special `<PAD>` token can be used for padding sentences shorter than 159 words. For unknown words in the dev and test sets, you can use the `<PAD>` token as well.

**HINT** It will be beneficial in the long run to make a function to convert your data to the right format, as you would have to do it for the train, dev and test sets, and for any other dataset you want to evaluate on.
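As a rough illustration of the target format, the sketch below pads two toy sentences to a common length. The toy sentences and the helper names `build_index` and `to_padded_matrix` are assumptions made for this example and are not part of the assignment code; it is one possible approach, not the reference solution.

```python
# Toy stand-in for the output of read_conll_file (hypothetical sentences).
toy_data = [
    (["I", "like", "cats"], ["PRON", "VERB", "NOUN"]),
    (["Dogs", "bark"], ["NOUN", "VERB"]),
]

PAD = "<PAD>"


def build_index(sequences):
    """Map every item to an integer index; index 0 is reserved for <PAD>."""
    item2idx = {PAD: 0}
    for sequence in sequences:
        for item in sequence:
            if item not in item2idx:
                item2idx[item] = len(item2idx)
    return item2idx


def to_padded_matrix(sequences, item2idx, max_len):
    """Convert sequences to index rows of length max_len, padded with 0."""
    matrix = []
    for sequence in sequences:
        # Unseen items fall back to the <PAD> index, as suggested above.
        row = [item2idx.get(item, item2idx[PAD]) for item in sequence[:max_len]]
        row += [item2idx[PAD]] * (max_len - len(row))
        matrix.append(row)
    return matrix


toy_max_len = max(len(words) for words, _ in toy_data)
word2idx = build_index(words for words, _ in toy_data)
tag2idx = build_index(tags for _, tags in toy_data)

toy_words = to_padded_matrix([w for w, _ in toy_data], word2idx, toy_max_len)
toy_tags = to_padded_matrix([t for _, t in toy_data], tag2idx, toy_max_len)

print(toy_words)  # [[1, 2, 3], [4, 5, 0]]
print(toy_tags)   # [[1, 2, 3], [3, 2, 0]]
```

Passing such a nested list of integers to `torch.tensor(...)` gives an integer tensor that can be used as input for an embedding layer.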

In [3]:
import torch


class Vocab():
    def __init__(self, pad_unk='<PAD>'):
        """
        A convenience class that can help store a vocabulary
        and retrieve indices for inputs.
        """
        self.pad_unk = pad_unk
        self.word2idx = {self.pad_unk: 0}
        self.idx2word = [self.pad_unk]

    def getIdx(self, word, add=False):
        if word not in self.word2idx:
            if add:
                self.word2idx[word] = len(self.idx2word)
                self.idx2word.append(word)
            else:
                return self.word2idx[self.pad_unk]
        return self.word2idx[word]

    def getWord(self, idx):
        # idx2word is a list, so it has to be indexed instead of called
        return self.idx2word[idx]

max_len = max(len(x[0]) for x in train_data)

# Your implementation goes here:



* b) Until now, we have used a batch size of 1 in our implemented models, meaning that the model's weights are updated after each sentence. This is not very computationally efficient. Larger batch sizes increase the training speed, and can also lead to better performance (more stable training). You can easily convert existing training data to batches by splitting it up into chunks of `batch_size` sentences, like this (*Make sure you understand this code!*):

In [None]:
import torch
# 200 instances, 100 features/weights
tmp_feats = torch.zeros((200, 100))

batch_size = 32
num_batches = len(tmp_feats) // batch_size

print(num_batches)

print(tmp_feats.shape)

tmp_feats_batches = tmp_feats[:batch_size*num_batches].view(num_batches,batch_size, 100)

# 6 batches with 32 instances with 100 features
print(tmp_feats_batches.shape)

print()
for feats_batch in tmp_feats_batches:
    print(feats_batch.shape)
    # Here you can call forward/calculate the loss etc.

Note that this throws away a small part of the data (the last `len(tmp_feats) % batch_size` = 8 samples). An alternative would be to pad the last batch and ignore the padded part when computing the loss. For the following assignments you can leave the remaining samples out (with 2001 dev sentences and a batch size of 16, only a single sentence is dropped). Furthermore, note that PyTorch supports a more advanced method for batching: [data loaders](https://pytorch.org/docs/stable/data.html), which we will not cover in this course (but you can use them for the final project).
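The padding alternative mentioned above could look roughly as follows. This is a minimal sketch on the same toy sizes as the cell above; the mask `is_real` is a name introduced here for illustration.

```python
import torch

# Same toy setup as the cell above: 200 instances, 100 features.
feats = torch.ones((200, 100))
batch_size = 32

# Round the number of batches up instead of down.
num_batches = (len(feats) + batch_size - 1) // batch_size
num_missing = num_batches * batch_size - len(feats)

# Append all-zero rows so that the data fills the last batch completely.
padding = torch.zeros((num_missing, feats.shape[1]))
feats_batches = torch.cat([feats, padding]).view(num_batches, batch_size, -1)

# Mask that is True for real instances and False for padded rows.
is_real = torch.arange(num_batches * batch_size) < len(feats)
is_real = is_real.view(num_batches, batch_size)

print(feats_batches.shape)       # torch.Size([7, 32, 100])
print(int(is_real[-1].sum()))    # 8 real instances in the last batch
```

For matrices of word and label indices, the padded rows would simply consist of `<PAD>` indices (0), which a loss created with `ignore_index=0` skips.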

Convert your training data and labels to batches with a batch size of 16.

In [None]:
# Your implementation goes here:



## 2. Train an RNN

* a) Implement a POS tagger model in PyTorch using a [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer for word representations and a [`torch.nn.RNN`](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) layer. You can use a [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layer for prediction of the label. Train this tagger on the POS data, and evaluate its performance. Note that during each training step, you now get the predictions and loss on a whole batch directly. Use the following hyperparameters: 5 epochs over the full training data, a word embedding dimension of 100, an RNN dimension of 50, a learning rate of 0.01 in an [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and a [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).

Hints:
* **Set batch_first to true!**, as explained on the [`torch.nn.RNN`](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) page. By default, the RNN expects its input in the shape `(seq_len, batch, input_dim)`, where `input_dim` is the word embedding dimension; when `batch_first` is set to true, the shape should be `(batch, seq_len, input_dim)`, which is usually easier to work with.
* Training an RNN is generally much slower than training the machine learning models we implemented before on this data, so we suggest starting with only a subset of the data, like 100 or 1,000 sentences. It is also OK to use only 1,000 sentences for your final model (or use the HPC to train the full model).
* To calculate the cross-entropy loss, the predictions need to be two-dimensional, with one row of label scores per token. We can convert the predicted values from our model (32\*159\*18 for one batch of 32 sentences) to a flatter representation (5088\*18) by using `.view(BATCH_SIZE * max_len, -1)`. Of course, we also have to flatten the labels from 32\*159 to a vector of length 5088.
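To make the shapes in these hints concrete, the following sketch pushes one random batch through an embedding layer, an RNN with `batch_first=True` and a linear layer, and then flattens the result for the loss. The vocabulary size of 1000 and the random indices are assumptions for illustration only, and no training is performed here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Sizes taken from the hints above; the vocabulary size is a made-up value.
batch_size, max_len, n_tags = 32, 159, 18
n_words, dim_embedding, rnn_hidden = 1000, 100, 50

embedding = nn.Embedding(n_words, dim_embedding)
rnn = nn.RNN(dim_embedding, rnn_hidden, batch_first=True)
output_layer = nn.Linear(rnn_hidden, n_tags)

# Random indices stand in for a real batch; index 0 plays the role of <PAD>.
words = torch.randint(1, n_words, (batch_size, max_len))
labels = torch.randint(1, n_tags, (batch_size, max_len))
words[:, 100:] = 0
labels[:, 100:] = 0

word_vectors = embedding(words)          # (32, 159, 100)
rnn_out, _ = rnn(word_vectors)           # (32, 159, 50)
scores = output_layer(rnn_out)           # (32, 159, 18)

flat_scores = scores.view(batch_size * max_len, -1)   # (5088, 18)
flat_labels = labels.view(batch_size * max_len)       # (5088,)

# ignore_index=0 leaves the <PAD> positions out of the loss.
loss_function = nn.CrossEntropyLoss(ignore_index=0, reduction='sum')
loss = loss_function(flat_scores, flat_labels)
print(flat_scores.shape, flat_labels.shape)
```

Because of `ignore_index=0`, only the positions with a non-padding label contribute to the summed loss.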

For more information on how to implement a PyTorch module, we refer to the code used to obtain the weights in the assignment of week 3 (`week4/train_ff.py`), and the following tutorial series: https://pytorch.org/tutorials/beginner/nlp/index.html (the 2nd and 4th tutorials are especially relevant). You can use the code below as a starting point:

In [None]:
from torch import nn
import torch
torch.manual_seed(0)
DIM_EMBEDDING = 100
RNN_HIDDEN = 50
BATCH_SIZE = 32
LEARNING_RATE = 0.01
EPOCHS = 5

class TaggerModel(torch.nn.Module):
    def __init__(self, nwords, ntags):
        super().__init__()
        # TODO
        
    def forward(self, inputData):
        word_vectors = _  # TODO
        
model = TaggerModel(len(idx2word), len(idx2label))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = torch.nn.CrossEntropyLoss(ignore_index=0, reduction='sum')


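for epoch in range(EPOCHS):
    model.train()
    # loop over batches
    for batch in _: #TODO
        # reset the gradient for every batch, otherwise gradients accumulate
        model.zero_grad()
        # call the model itself (not model.forward) on the inputs of this batch
        predicted_values = model(_) #TODO
        # calculate loss (and print)
        # TODO
        
        # update
        # TODO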
        
# set to evaluation mode
model.eval()

* b) Now evaluate the tagger on the dev data (`pos-data/en_ewt-dev.conll`) with accuracy (make sure to not count the padding tokens).
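One way to leave the padding out of the accuracy is to build a boolean mask from the gold labels. The sketch below uses small hand-written scores in place of real model output; the tensors `scores` and `gold` are made up for illustration.

```python
import torch

# Hypothetical model output for 2 sentences of length 4 with 3 labels (0 = <PAD>).
scores = torch.tensor([
    [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [0.9, 0.2, 0.1], [0.9, 0.2, 0.1]],
    [[0.1, 0.2, 1.1], [0.3, 1.2, 0.4], [0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
])
gold = torch.tensor([
    [1, 2, 0, 0],
    [2, 2, 1, 0],
])

predictions = scores.argmax(dim=-1)   # most likely label per token
is_token = gold != 0                  # False on padding positions

correct = ((predictions == gold) & is_token).sum().item()
total = is_token.sum().item()
accuracy = correct / total
print(correct, total, accuracy)       # 4 5 0.8
```

If the padding positions were counted as well, this toy example would report 7 correct out of 8, which overestimates the quality of the tagger.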

## 3. Implement a Bi-RNN in PyTorch
In this assignment we are going to implement a bidirectional RNN classifier in PyTorch, including dropout layers, and train it for the task of topic classification.

You can use the following function for data reading:

In [None]:
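def load_topics(path):
    """
    read in a topic classification file with one utterance per line

    :param path: path to read from (format: label<TAB>text)
    :returns: list of tokenized texts and list of labels
    """
    text = []
    labels = []
    # use a context manager so that the file is closed after reading
    with open(path, encoding='utf-8') as topic_file:
        for line in topic_file:
            tok = line.strip().split('\t')
            labels.append(tok[0])
            text.append(tok[1].split(' '))
    return text, labels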

topic_train_text, topic_train_labels = load_topics('topic-data/train.txt')
topic_dev_text, topic_dev_labels = load_topics('topic-data/dev.txt')

* a) Convert the data to a format that can be used in a PyTorch module. In this assignment, we can cap the length of an utterance, as each utterance only needs one label. Use a maximum length of 64 words; for longer utterances, only keep the first 64 words. A special `<PAD>` token can be used for padding utterances shorter than 64 words. For unknown words in the dev and test sets, you can use the `<PAD>` token as well.

**HINT** The shape of the training data should be 13,000 by 64.
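The capping and padding step can be sketched as below, assuming the words have already been converted to indices. The helper `truncate_or_pad` and the toy labels are hypothetical names and values used only for this example.

```python
MAX_LEN = 64
PAD_IDX = 0


def truncate_or_pad(indices, max_len=MAX_LEN, pad_idx=PAD_IDX):
    """Cut a list of word indices to max_len, or right-pad it with pad_idx."""
    clipped = indices[:max_len]
    return clipped + [pad_idx] * (max_len - len(clipped))


short = truncate_or_pad([5, 8, 2])
long = truncate_or_pad(list(range(1, 101)))

print(len(short), short[:5])   # 64 [5, 8, 2, 0, 0]
print(len(long), long[-1])     # 64 64

# One label per utterance: map label strings to indices (hypothetical labels).
toy_labels = ["sports", "politics", "sports"]
label2idx = {label: idx for idx, label in enumerate(sorted(set(toy_labels)))}
label_indices = [label2idx[label] for label in toy_labels]
print(label_indices)           # [1, 0, 1]
```

Unlike in the tagging task, the labels here form a single vector with one entry per utterance, so only the word matrix needs padding.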

In [None]:
import torch


# Your implementation goes here:


* b) Convert your input into batches of size 32, as you did in assignment 1b.

* c) Implement a classification model in PyTorch using a [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer and a [`torch.nn.RNN`](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) layer. Train this classification model on the topic classification data, and evaluate its performance. Note that during each training step, you now get the predictions and loss on a whole batch directly. Use the following hyperparameters: 5 epochs over the full training data, a word embedding dimension of 100, an RNN dimension of 50, a learning rate of 0.01 in an [Adam optimizer](https://pytorch.org/docs/stable/optim.html) and a [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).

Hints:
* See also the hints for assignment 2a.
* Set `bidirectional=True` for the RNN layer (so that we are training a Bi-RNN). Note that the input dimension of the next layer should then be `rnn_dim*2`.
* We use words as inputs and need only one label per sentence, so you should use the output of the last item from the forward layer and the output of the first item from the backward layer.

In [5]:
# Hint: In torch, the BiRNN returns a concatenation of the forward and 
# backward layer. Here is an example of how these can be extracted again
#     backward_out = rnn_out[:,0,-size:].squeeze()
#     forward_out = rnn_out[:,-1,:size].squeeze()
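The slicing from the hint above can be checked on random input. The sketch below uses made-up sizes and compares the two slices with the final hidden states `h_n` that `torch.nn.RNN` returns as its second output.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up sizes for illustration.
batch_size, seq_len, dim_embedding, size = 4, 6, 10, 5

rnn = nn.RNN(dim_embedding, size, batch_first=True, bidirectional=True)
inputs = torch.randn(batch_size, seq_len, dim_embedding)

rnn_out, h_n = rnn(inputs)
print(rnn_out.shape)   # torch.Size([4, 6, 10]): forward and backward concatenated
print(h_n.shape)       # torch.Size([2, 4, 5]): final state per direction

forward_out = rnn_out[:, -1, :size]    # forward pass after reading the last item
backward_out = rnn_out[:, 0, -size:]   # backward pass after reading the first item

# For a single-layer RNN these slices equal the final hidden states.
print(torch.allclose(forward_out, h_n[0]))    # True
print(torch.allclose(backward_out, h_n[1]))   # True

sentence_repr = torch.cat([forward_out, backward_out], dim=1)
print(sentence_repr.shape)   # torch.Size([4, 10])
```

Keep in mind that with right-padded utterances the last position usually holds a `<PAD>` token, so the forward output at that position has also processed the padding.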


* d) Add a `torch.nn.Dropout` layer with a masking probability of 0.2 between the word embeddings and the RNN layer, and
  another dropout layer with a masking probability of 0.3 between the RNN layer and the output layer. Evaluate the
  performance again. Is the performance higher? Why would this be the case?
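When comparing the results, it helps to recall that dropout behaves differently in training and evaluation mode. The small sketch below applies a dropout layer to a vector of ones; it is meant as an illustration of the layer itself, not of the full model.

```python
import torch
from torch import nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.2)
x = torch.ones(1000)

dropout.train()             # training mode: elements are zeroed at random
train_out = dropout(x)
kept = train_out != 0
print(kept.float().mean())            # roughly 0.8
print(train_out[kept].unique())       # surviving values are scaled by 1 / (1 - p)

dropout.eval()              # evaluation mode: dropout does nothing
eval_out = dropout(x)
print(torch.equal(eval_out, x))       # True
```

This is why `model.train()` and `model.eval()` in the training script matter once dropout layers are part of the model.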