# Week 06

Exercises:
1. watch the first part of the video [Makemore part III](https://www.youtube.com/watch?v=P6sfmUTpUmc) (until 1:18:34) and answer the questions:
    - Why is initialization important? How can we initialize the weights of a neural network in an educated way?
    - What is meant by "saturated tanh"? Can this also happen with ReLU or sigmoid?
    - explain batch normalization in your own words and explain how it can be implemented in a neural network (during training and testing).
2. read chapter [1.3 Example: BERT](https://arxiv.org/abs/2501.09223) and answer these questions:
    - Is BERT an encoder or decoder? Explain.
    - What parts does the loss function consist of?
    - What is the difference between the masked language model and the next sentence prediction task?
    - explain the training process of BERT in your own words
    - what is the [CLS] token used for and how is it used in the training process?
    - what is positional encoding and why is it used in BERT?
3. RNN with attention

## 1. Batch-Norm and initialization
**Q**: Why is initialization important? How can we initialize the weights of a neural network in an educated way?


**Q**: What is meant by "saturated tanh"? Can this also happen with ReLU or sigmoid?

**Q**: explain batch normalization in your own words and explain how it can be implemented in a neural network (during training and testing).

## 2. BERT

**Q**: Is BERT an encoder or decoder? Explain.

**Q**: What parts does the loss function consist of?


**Q**: What is the difference between the masked language model and the next sentence prediction task?

**Q**: explain the training process of BERT in your own words. What masking strategy is used and why?


**Q**: what is the [CLS] token used for and how is it used in the training process?

**Q**: what is positional encoding and why is it used in BERT?


## 3. RNN with attention (not graded)

This exercise is rather large; feel free to play around with it. Gradient clipping techniques are relevant for the exam:

RNNs can suffer from exploding gradients. One way to deal with this is to clip the gradient: if its norm (or a single component) exceeds a chosen threshold, it is scaled back (or cut off) to that threshold.
- see this [video](https://www.youtube.com/watch?v=KrQp1TxTCUY)
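
To make the idea concrete, here is a minimal sketch of clipping by the L2 norm in plain Python. The function name `clip_by_norm` and the threshold `max_norm` are illustrative assumptions and not part of the exercise code; PyTorch offers this operation as `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a flat list of gradient values so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # small gradients are left untouched
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # approximately [0.6, 0.8]: same direction, norm 1
```

Clipping by norm keeps the direction of the update and only shortens it, which is why it is usually preferred over cutting off individual components.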

In [26]:
import pandas as pd
import gensim.downloader as api
import numpy as np

import torch 
if torch.cuda.is_available():
    print("GPU is available!")
    device = "cuda"
else:
    print("GPU is not available.")
    device = "cpu"

GPU is available!


In [27]:
glove = api.load('glove-wiki-gigaword-100')

Text to lower case:

In [28]:
df = pd.read_csv("../data/movie.csv")
df["text"] = df["text"].str.lower()
df.head()

Unnamed: 0,text,label
0,i grew up (b. 1965) watching and loving the th...,0
1,"when i put this movie in my dvd player, and sa...",0
2,why do people who do not know what a particula...,0
3,even though i have great interest in biblical ...,0
4,im a die hard dads army fan and nothing will e...,1


### Prepare text embeddings

Goal: generate batches of text embeddings for training a simple RNN.

In [29]:
def get_sentence_embedding(text_list, model):
    # collect embeddings for each word that is also in embedding model
    res = []
    for word in text_list:
        if model.has_index_for(word):
            res.append(model.get_vector(word))
    x = np.array(res)

    return torch.Tensor(x)

In [30]:
# example
first_text = df["text"].iloc[0].split()
print("First text:\n", first_text)
print("First text length:", len(first_text))
print("Embedding shape:", get_sentence_embedding(first_text, glove).shape)
print("")

First text:
 ['i', 'grew', 'up', '(b.', '1965)', 'watching', 'and', 'loving', 'the', 'thunderbirds.', 'all', 'my', 'mates', 'at', 'school', 'watched.', 'we', 'played', '"thunderbirds"', 'before', 'school,', 'during', 'lunch', 'and', 'after', 'school.', 'we', 'all', 'wanted', 'to', 'be', 'virgil', 'or', 'scott.', 'no', 'one', 'wanted', 'to', 'be', 'alan.', 'counting', 'down', 'from', '5', 'became', 'an', 'art', 'form.', 'i', 'took', 'my', 'children', 'to', 'see', 'the', 'movie', 'hoping', 'they', 'would', 'get', 'a', 'glimpse', 'of', 'what', 'i', 'loved', 'as', 'a', 'child.', 'how', 'bitterly', 'disappointing.', 'the', 'only', 'high', 'point', 'was', 'the', 'snappy', 'theme', 'tune.', 'not', 'that', 'it', 'could', 'compare', 'with', 'the', 'original', 'score', 'of', 'the', 'thunderbirds.', 'thankfully', 'early', 'saturday', 'mornings', 'one', 'television', 'channel', 'still', 'plays', 'reruns', 'of', 'the', 'series', 'gerry', 'anderson', 'and', 'his', 'wife', 'created.', 'jonatha', 'fra

The first text originally has 151 tokens, of which 130 are found in the GloVe vocabulary (see the embedding shapes printed further below). Each of them is represented by a 100-dimensional vector:

In [31]:
get_sentence_embedding(first_text, glove)

tensor([[-0.0465,  0.6197,  0.5665,  ..., -0.3762, -0.0325,  0.8062],
        [ 0.8328,  0.3953,  0.1417,  ...,  0.0155,  0.7329,  0.2221],
        [ 0.2147,  0.4337,  0.3396,  ...,  0.0466,  0.8300,  0.4030],
        ...,
        [-0.1246,  0.8897, -0.0183,  ..., -0.1471,  0.8906,  0.2021],
        [-0.3775, -0.1636,  0.9482,  ..., -0.7871, -0.2698, -0.4416],
        [-0.1529, -0.2428,  0.8984,  ..., -0.5910,  1.0039,  0.2066]])
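
Part of the tokens are dropped because splitting on whitespace leaves punctuation attached to the words (e.g. `'thunderbirds.'` or `'(b.'` in the list above). The following sketch shows one possible way to reduce this, assuming a simple regular-expression tokenizer is good enough; `simple_tokenize` is a hypothetical helper that is not used in the rest of this notebook.

```python
import re

def simple_tokenize(text):
    """Lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "I grew up (b. 1965) watching and loving the Thunderbirds."
print(sample.lower().split())    # punctuation stays attached: '(b.', '1965)', 'thunderbirds.'
print(simple_tokenize(sample))   # punctuation removed: 'b', '1965', 'thunderbirds'
```

Whether a cleaned token is actually contained in the embedding vocabulary still has to be checked with `model.has_index_for(word)`, as `get_sentence_embedding` does.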

The second text has 263 embedded tokens. This is a problem because all sequences in a batch must have the same shape before they can be stacked into one tensor. We can solve this by [padding](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html) the shorter text.

In [32]:
second_text = df["text"].iloc[1].split()
print("Second text length:", len(second_text))
print("Embedding shape:", get_sentence_embedding(second_text, glove).shape)

Second text length: 326
Embedding shape: torch.Size([263, 100])


Padding:

In [33]:
from torch.nn.utils.rnn import pad_sequence

def generate_tensor(text_list, model = glove, do_padding=True):
    sequences = []

    for text in text_list:
        sequence = get_sentence_embedding(text, model)
        sequences.append(sequence)

    # pad
    if do_padding:
        sequences = pad_sequence(sequences, batch_first=True)

    return sequences

See the difference: when we apply padding, we can collect all data into one tensor. This is impossible without padding because the tensors have different sizes.

In [34]:
test = [first_text, second_text]

yes_pad = generate_tensor(test)
print("shapes with padding:", yes_pad.shape)
no_pad = generate_tensor(test, do_padding=False)
print("shapes without padding:", [x.shape for x in no_pad])

shapes with padding: torch.Size([2, 263, 100])
shapes without padding: [torch.Size([130, 100]), torch.Size([263, 100])]


Verify that the padding works as expected: the first all-zero position marks where the padding starts for the shorter text, which is at position 130 as expected. Since the second text is the longest one, it is not padded at all:

In [35]:
(yes_pad.sum(-1) == 0).nonzero()[0]

tensor([  0, 130])
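
The same check can be turned into a count of real tokens per text. This is a small sketch on toy tensors (the shapes and the names `is_padding` and `lengths` are made up for illustration); it assumes that no real embedding vector is exactly zero in every dimension.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# two toy "texts" with 3 and 5 embedded tokens, embedding dimension 4
a = torch.ones(3, 4)
b = torch.ones(5, 4)
padded = pad_sequence([a, b], batch_first=True)   # shape (2, 5, 4)

# a position counts as padding if its whole embedding vector is zero
is_padding = (padded == 0).all(dim=-1)            # shape (2, 5), boolean
lengths = (~is_padding).sum(dim=1)                # number of real tokens per text

print(padded.shape)  # torch.Size([2, 5, 4])
print(lengths)       # tensor([3, 5])
```

Testing every dimension with `.all(dim=-1)` is slightly stricter than testing whether the sum is zero, since a real vector could sum to zero by coincidence.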

#### Dataset, Dataloader and collate function

Let's put this into pytorch objects:

A Dataset lets us retrieve data from any source that we want: [Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)



In [36]:
from torch.utils.data import Dataset

class MoviesDataset(Dataset):
    def __init__(self, df, text_col="text", target_col="label") -> None:
        super().__init__()
        self.df = df
        self.text_col = text_col
        self.target_col = target_col
        self.N = len(self.df)

    def __len__(self):
        return self.N

    def __getitem__(self, index):
        return self.df[self.text_col].iloc[index], self.df[self.target_col].iloc[index]


Check:

In [37]:
tt = MoviesDataset(df)

# we can now index our class using [index] notation
tt[10]

("i can't believe people are looking for a plot in this film. this is laural and hardy. lighten up already. these two were a riot. their comic genius is as funny today as it was 70 years ago. not a filthy word out of either mouth and they were able to keep audiences in stitches. their comedy wasn't sophisticated by any stretch. if a whoopee cushion can't make you grin, there's no reason to watch any of the stuff these guys did. it was a simpler time, and people laughed at stuff that was funny without a plot. i guess it takes a simple mind to enjoy this stuff, so i qualify. two man comedy teams don't compute, we're just too sophisticated... aren't we fortunate?",
 1)

Split into training and test sets:

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(df, test_size=0.2, random_state=54321)
print(X_train.shape, X_test.shape)

(32000, 2) (8000, 2)


Now we can create datasets for train and test. Recall that here we still have dataframes and not tensors (yet):

In [39]:
movie_train = MoviesDataset(X_train)
movie_test = MoviesDataset(X_test)

In [40]:
print(f"{len(movie_train) = }")
print(f"{len(movie_test) = }")
next(iter(movie_test))

len(movie_train) = 32000
len(movie_test) = 8000


("<br /><br />the first thing i have to say is that i own jake speed. i've seen it at least 10 times. this movie is one of the most fun movies ever made. the film begins with margaret (karen kopins) trying to find her sister. her sister was kidnapped in paris and the family has heard nothing. along comes jake speed (wayne crawford), telling her exactly where her sister is and making an offer to find her. jake speed is a hero. he doesn't work for money because he just wants to help and have a good adventure. his partner (dennis christopher) follows him around and writes their adventures into novels. this film is a great adventure. it's hilarious, it's action-packed, it's just great. i guess it's a cult film with a very small cult following. crawford is perfect as jake speed and throws out some one-liners that you'll never forget. kopins and christopher are also good as the girl and the sidekick, respectively. john hurt, the guy who's stomach blew up in alien, plays the devilish, pervert

One way to map words into the embedding space would be to pre-embed everything beforehand. Unfortunately, this is too memory-intensive. An alternative is to use a [collate function](https://pytorch.org/docs/stable/data.html#working-with-collate-fn). With a custom collate function we can ensure that all observations in one batch have the same sequence length.

The collate-function takes a batch and then applies a transformation to it (in our case we have to ensure that the sequence lengths are equal):

In [41]:
def collate_fn(batch, model = glove, device=device):

    text = []
    labels = []

    for t, l in batch:
        text.append(t.split(" "))
        labels.append(l)
        
    return generate_tensor(text_list=text, model=model).to(device=device), torch.tensor(labels, device=device)


Important:
 - batch size is also a hyperparameter
 - often it is a good strategy to choose large batch sizes (as long as they fit on the GPU)
 - large batch sizes often lead to more stable gradient updates (think about the extreme case where your batch has only one sample => large variation of gradients)
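
The effect of the batch size on the stability of the gradient can be illustrated without any neural network. The sketch below uses hypothetical per-sample "gradients" (random numbers around a true value of 1.0, an assumption made purely for illustration) and measures how much the mini-batch average fluctuates.

```python
import random
import statistics

random.seed(0)

# hypothetical per-sample "gradients": noisy values around a true gradient of 1.0
population = [random.gauss(1.0, 2.0) for _ in range(20_000)]

def batch_gradient_std(batch_size, n_batches=500):
    """Standard deviation of the mini-batch mean over many randomly drawn batches."""
    means = [statistics.fmean(random.sample(population, batch_size))
             for _ in range(n_batches)]
    return statistics.stdev(means)

std_small = batch_gradient_std(1)
std_large = batch_gradient_std(128)
print(f"batch size   1: std = {std_small:.2f}")
print(f"batch size 128: std = {std_large:.2f}")
```

For independent samples the spread of the batch mean shrinks roughly with the square root of the batch size, so a batch of 128 should fluctuate about eleven times less than a single sample.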


In [42]:
from torch.utils.data import DataLoader
batch_size = 128

train_dataloader = DataLoader(movie_train, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)
test_dataloader = DataLoader(movie_test, shuffle=False, batch_size=batch_size, collate_fn=collate_fn) # shuffling makes no difference

Test:

In [43]:
print(f"number of batches: {len(train_dataloader) = },\
      \nsamples in total = {len(train_dataloader)*batch_size}")

number of batches: len(train_dataloader) = 250,      
samples in total = 32000


In [44]:
# verify
print(f"{train_dataloader.dataset.N = }")
print(f"{movie_train.df.shape = }")

train_dataloader.dataset.N = 32000
movie_train.df.shape = (32000, 2)


Now we can iterate over batches. In the example below, the collate function turns the words of each text into a dense vector representation:

In [45]:
for batch in train_dataloader:
    x = batch[0]
    y = batch[1]
    break

print(f"{x.shape = }, {x.device = }")
print(f"{y.shape = }, {y.device = }")

x.shape = torch.Size([128, 800, 100]), x.device = device(type='cuda', index=0)
y.shape = torch.Size([128]), y.device = device(type='cuda', index=0)


In [46]:
# first example in the first batch:
x[0,:,:].shape, x[0,:,:]

(torch.Size([800, 100]),
 tensor([[-0.5706,  0.4418,  0.7010,  ..., -0.6610,  0.4720,  0.3725],
         [-0.5426,  0.4148,  1.0322,  ..., -1.2969,  0.7622,  0.4635],
         [-0.2709,  0.0440, -0.0203,  ..., -0.4923,  0.6369,  0.2364],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],
        device='cuda:0'))

In [47]:
# these are the labels
y

tensor([1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
        0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
        0, 1, 1, 0, 0, 1, 0, 0], device='cuda:0')

### RNN model using the last hidden state

Let's start with a model:
 - your class here must have at least a forward pass
 - other functions are here to make everything self contained, s.t. we can easily play around with different models

In [48]:
import torch
import torch.nn as nn 
import torch.optim as optim

class MyRNN(nn.Module):
    def __init__(self, rnn, fc_layers, learning_rate, device=device) -> None:
        super(MyRNN, self).__init__()
        self.device = device
        
        # layers
        self.rnn = rnn
        self.fc_layers = nn.Sequential(*fc_layers)

        # optimizer: usually this is also a hyperparameter
        # especially the learning rate is very important
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)

        self.criterion = nn.BCEWithLogitsLoss()  # Binary cross-entropy loss for binary classification

        print(f"model has {self.count_parameters()} parameters.")

    def count_parameters(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, x):
        # get last hidden state
        _, hn = self.rnn(x)
        return self.fc_layers(hn)
    

    def train_one_batch(self, inputs, labels):
        labels = labels.to(torch.float)

        # Zero your gradients for every batch
        self.optimizer.zero_grad()

        # predict and compute loss
        outputs = self(inputs)
        outputs = outputs.squeeze().to(torch.float)
        loss = self.criterion(outputs, labels)

        # compute gradients
        loss.backward()
    
        self.optimizer.step()

        # batch accuracy and loss
        label_hat = torch.round(torch.sigmoid(outputs))
        acc = (label_hat == labels).sum().item() / len(labels)

        return loss.detach(), acc

    def train_epochs(self, nb_epochs, training_loader, test_loader = None, verbosity_level=None):

        for epoch_index in range(nb_epochs):
            print("---------------------------------------------------------------")
            print(f"training epoch {epoch_index}")
    
            # put model in training mode
            self.train()

            losses = []
            accuracies = []

            for i, (inputs, labels) in enumerate(training_loader):
                loss, acc = self.train_one_batch(inputs, labels)

                losses.append(loss.item())
                accuracies.append(acc)

                if verbosity_level is not None and (i % verbosity_level) == 0:
                    print(f"  - batch {i = }: loss = {loss:.2f} \t accuracy = {acc:.2f}")

            print(f" - epoch loss = {np.mean(losses):.2f} \t accuracy = {np.mean(accuracies):.2f}")
            if test_loader is not None:
                self.prediction(test_loader)
        print("\n")


    def prediction(self, test_loader):
        self.eval() # put model in eval mode

        test_losses = []
        test_accuracies = []

        with torch.no_grad():
            for inputs, labels in test_loader:
                labels = labels.to(torch.float)

                outputs = self(inputs)
                outputs = outputs.squeeze().to(torch.float)
                test_loss = self.criterion(outputs, labels)

                label_hat = torch.round(torch.sigmoid(outputs))
                test_acc = (label_hat == labels).sum().item() / len(labels)

                test_losses.append(test_loss.item())
                test_accuracies.append(test_acc)

        print(f" - test loss = {np.mean(test_losses):.2f} \t accuracy = {np.mean(test_accuracies):.2f}")

        print("\n")

In [49]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# here we use two fully connected layers that map from hidden_dim (128) -> 100 -> 1
# (note: without a nonlinearity in between, the two layers together are still one linear map)
fc_layers = [nn.Linear(hidden_dim, 100), nn.Linear(100, output_dim)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model has 42441 parameters.


MyRNN(
  (rnn): RNN(100, 128, batch_first=True)
  (fc_layers): Sequential(
    (0): Linear(in_features=128, out_features=100, bias=True)
    (1): Linear(in_features=100, out_features=1, bias=True)
  )
  (criterion): BCEWithLogitsLoss()
)
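
The reported number of parameters can be reproduced by hand. The following sketch only redoes the arithmetic for the layer sizes used above; the variable names are chosen for illustration.

```python
embedding_dim, hidden_dim, fc_dim, output_dim = 100, 128, 100, 1

# nn.RNN with one layer: input-to-hidden and hidden-to-hidden weights plus two bias vectors
rnn_params = embedding_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim

# nn.Linear(in, out): weight matrix plus bias vector
fc1_params = hidden_dim * fc_dim + fc_dim
fc2_params = fc_dim * output_dim + output_dim

total = rnn_params + fc1_params + fc2_params
print(rnn_params, fc1_params, fc2_params, total)  # 29440 12900 101 42441
```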

In [50]:
model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=50)

---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.69 	 accuracy = 0.52
  - batch i = 50: loss = 0.69 	 accuracy = 0.55
  - batch i = 100: loss = 0.69 	 accuracy = 0.48
  - batch i = 150: loss = 0.69 	 accuracy = 0.50
  - batch i = 200: loss = 0.70 	 accuracy = 0.42
 - epoch loss = 0.69 	 accuracy = 0.50
 - test loss = 0.69 	 accuracy = 0.50


---------------------------------------------------------------
training epoch 1
  - batch i = 0: loss = 0.69 	 accuracy = 0.48
  - batch i = 50: loss = 0.69 	 accuracy = 0.52
  - batch i = 100: loss = 0.69 	 accuracy = 0.53
  - batch i = 150: loss = 0.69 	 accuracy = 0.58
  - batch i = 200: loss = 0.69 	 accuracy = 0.52
 - epoch loss = 0.69 	 accuracy = 0.50
 - test loss = 0.69 	 accuracy = 0.51






Exercise:
- it seems that we are not learning anything. Can you find out why? Tip: we always retrieve the last hidden state, but here the padding might be a problem. Can you think of a solution?

Try to solve the problem yourself. You have to adapt our collate function. The solution is below.

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

Hint: 
- Change the collate function such that it returns the triple (batch-embedding-vectors, batch-labels, sequence-token-lengths) for each batch
- The above version of our collate function returns the tuple (batch-embedding-vectors, batch-labels)

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

- in the first model, the final hidden state `hn` is only taken after the RNN has also stepped through the padding (all-zero embeddings); the only exceptions are sequences that have as many tokens as the longest sequence in the batch
- for most samples, the state that we feed into the FNN therefore carries hardly any information about the text, and the prediction stays close to $\sigma(0) = \frac{1}{1 + \exp(-0)} = \frac{1}{2}$, i.e. a loss of about $\ln 2 \approx 0.69$
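
The constant loss of 0.69 in the training log above is exactly what a model produces when it always predicts a probability of 1/2. A short check in plain Python (the helper names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p, y):
    """Binary cross-entropy of a predicted probability p for a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = sigmoid(0.0)
print(p)                                      # 0.5
print(round(binary_cross_entropy(p, 1), 4))   # 0.6931
print(round(binary_cross_entropy(p, 0), 4))   # 0.6931
```

The loss is $\ln 2$ regardless of the label, which matches an accuracy of about 50% on a balanced data set.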

In [51]:
def generate_tensor(text_list, model = glove, do_padding=True):
    sequences = []

    for text in text_list:
        sequence = get_sentence_embedding(text, model)
        sequences.append(sequence)

    # additionally we add the lengths of the sequences to get the last hidden state
    lengths = [len(seq) for seq in sequences]
    
    # pad
    if do_padding:
        sequences = pad_sequence(sequences, batch_first=True)

    return sequences, lengths


def collate_fn(batch, model = glove, device=device):
    """ returns now also length of each sequence. With this information we can easily slice at the correct position, where
     the true last hidden state is in the RNN for a given sequence. """

    text = []
    labels = []

    for t, l in batch:
        text.append(t.split(" "))
        labels.append(l)
    
    embeddings, lengths = generate_tensor(text_list=text, model=model)
    
    return embeddings.to(device=device), torch.tensor(labels, device=device), torch.tensor(lengths, device=device)


In [52]:
train_dataloader = DataLoader(movie_train, shuffle=True, batch_size=batch_size, collate_fn=collate_fn)
test_dataloader = DataLoader(movie_test, shuffle=False, batch_size=batch_size, collate_fn=collate_fn) # shuffling makes no difference

Now the forward function has to be changed slightly:
- previously it took simply the batch of embedding-features
- now it takes also the sequence length of each sample in the batch, s.t. we can extract the last hidden embedding from the RNN

Exercise: 
 - we now have to retrieve the last hidden unit for each sequence in our batch
 - e.g. one sequence has a token length of 11 and another one of 123: for the first one we need the hidden unit at position 11 (index 10, since indexing starts at 0) and for the latter the one at position 123 (index 122)
 - write a function $f$ that takes a batch of hidden units and a vector of sample lengths (as returned by the new collate function) and returns the correct hidden unit for each sample
 - next update the forward pass in our MyRNN-class:
    - the forward pass gets now additionally the lengths vector
    - you push the features through the RNN (same as before)
    - then you apply $f$ to gather the correct hidden units and send those through the fully-connected layers `fc_layers`

The following functions are helpful: `torch.arange` and [`torch.gather`](https://pytorch.org/docs/stable/generated/torch.gather.html)

 

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

&nbsp;

Let's see for a dummy RNN what the return values are and how we can filter the correct ones:

In [53]:
rnn = nn.RNN(input_size=10, hidden_size=7, batch_first=True, )

# let's create a dummy batch of data: 2 observations, where each of them has 5 timesteps. 10 is the feature dimension
input = torch.randn(2, 5, 10)

# put it through the rnn and collect both: final hidden state and intermediate ones
output, hn = rnn(input)

In [54]:
# Goal: combine them s.t. we have a resulting tensor of shape (2, 6, 7): 2=batch size, 6=all hidden units, 7=dimension of hidden units
output.shape, hn.shape

(torch.Size([2, 5, 7]), torch.Size([1, 2, 7]))

More precisely you can see that the hidden states $h_n$ are the latest hidden states and already part of `output`:

In [55]:
output

tensor([[[ 0.8713, -0.3952, -0.3394,  0.0560,  0.3431,  0.1890,  0.5058],
         [ 0.7042, -0.9685,  0.6398, -0.7054, -0.1472, -0.1512,  0.8225],
         [ 0.8866,  0.1049, -0.6180, -0.7283,  0.2903, -0.9167,  0.5151],
         [ 0.8283,  0.7434, -0.2989, -0.9555, -0.3861, -0.8627,  0.4620],
         [ 0.9597,  0.0974, -0.6873,  0.4536, -0.7505,  0.4791,  0.1225]],

        [[-0.5444, -0.3370, -0.6820,  0.1762, -0.0344, -0.3766, -0.2422],
         [ 0.4514, -0.8578,  0.1957, -0.5108,  0.7105,  0.3281, -0.0221],
         [ 0.0843,  0.2750, -0.9111, -0.2197, -0.7814, -0.1471, -0.5853],
         [ 0.3917,  0.3306, -0.3886, -0.9585, -0.7365,  0.6093, -0.5776],
         [ 0.8507, -0.8646, -0.4349,  0.7716, -0.9789,  0.1340, -0.2637]]],
       grad_fn=<TransposeBackward1>)

In [56]:
hn

tensor([[[ 0.9597,  0.0974, -0.6873,  0.4536, -0.7505,  0.4791,  0.1225],
         [ 0.8507, -0.8646, -0.4349,  0.7716, -0.9789,  0.1340, -0.2637]]],
       grad_fn=<StackBackward0>)

In [57]:
hn.squeeze(0).unsqueeze(1).shape

torch.Size([2, 1, 7])

In [58]:
combined = torch.cat((output, hn.squeeze(0).unsqueeze(1)), dim=1)
combined.shape, combined

(torch.Size([2, 6, 7]),
 tensor([[[ 0.8713, -0.3952, -0.3394,  0.0560,  0.3431,  0.1890,  0.5058],
          [ 0.7042, -0.9685,  0.6398, -0.7054, -0.1472, -0.1512,  0.8225],
          [ 0.8866,  0.1049, -0.6180, -0.7283,  0.2903, -0.9167,  0.5151],
          [ 0.8283,  0.7434, -0.2989, -0.9555, -0.3861, -0.8627,  0.4620],
          [ 0.9597,  0.0974, -0.6873,  0.4536, -0.7505,  0.4791,  0.1225],
          [ 0.9597,  0.0974, -0.6873,  0.4536, -0.7505,  0.4791,  0.1225]],
 
         [[-0.5444, -0.3370, -0.6820,  0.1762, -0.0344, -0.3766, -0.2422],
          [ 0.4514, -0.8578,  0.1957, -0.5108,  0.7105,  0.3281, -0.0221],
          [ 0.0843,  0.2750, -0.9111, -0.2197, -0.7814, -0.1471, -0.5853],
          [ 0.3917,  0.3306, -0.3886, -0.9585, -0.7365,  0.6093, -0.5776],
          [ 0.8507, -0.8646, -0.4349,  0.7716, -0.9789,  0.1340, -0.2637],
          [ 0.8507, -0.8646, -0.4349,  0.7716, -0.9789,  0.1340, -0.2637]]],
        grad_fn=<CatBackward0>))

You can see that the last two rows in each example are identical: `output` from the RNN contains the hidden states of all time steps, and `hn` is the hidden state of the last time step.
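
Instead of comparing the printed numbers by eye, the relation can also be checked programmatically. This is a minimal sketch for a single-layer, unidirectional `nn.RNN` with the same toy dimensions as above; for stacked or bidirectional RNNs the layout of `hn` is different.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=7, batch_first=True)
x = torch.randn(2, 5, 10)          # (batch, time steps, features)

output, hn = rnn(x)

# hn[0] is the hidden state of the (only) layer after the last time step
same = torch.allclose(output[:, -1, :], hn[0])
print(output.shape, hn.shape)  # torch.Size([2, 5, 7]) torch.Size([1, 2, 7])
print(same)                    # True
```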

In [59]:
def forward(self, x, lengths):
    lengths = lengths - 1
    out, _ = self.rnn(x)
        
    # Create indices for gathering
    indices = torch.arange(out.size(1), device=self.device)  # Assuming the sequence length is along the second dimension

    # Expand the indices to match the shape of ind
    indices = indices.unsqueeze(0).expand(lengths.size(0), -1).to(device=self.device)

    # Gather the slices
    gathered_slices = torch.gather(out, 1, indices.unsqueeze(-1).expand(-1, -1, out.size(-1))).to(device=self.device)

    # Select the slices based on the provided indices
    out = gathered_slices[torch.arange(out.size(0)), lengths]

    return self.fc_layers(out)

def train_one_batch(self, inputs, labels, lengths):
        labels = labels.to(torch.float)

        # Zero your gradients for every batch
        self.optimizer.zero_grad()

        # predict and compute loss
        outputs = self(inputs, lengths)
        outputs = outputs.squeeze().to(torch.float)
        loss = self.criterion(outputs, labels)

        # compute gradients
        loss.backward()
        
        self.optimizer.step()

        # batch accuracy and loss
        label_hat = torch.round(torch.sigmoid(outputs))
        acc = (label_hat == labels).sum().item() / len(labels)

        return loss.detach(), acc
    

def train_epochs(self, nb_epochs, training_loader, test_loader = None, verbosity_level=None):

    for epoch_index in range(nb_epochs):
        print("---------------------------------------------------------------")
        print(f"training epoch {epoch_index}")
    
        # put model in training mode
        self.train()

        losses = []
        accuracies = []

        for i, (inputs, labels, lengths) in enumerate(training_loader):
            loss, acc = self.train_one_batch(inputs, labels, lengths)

            losses.append(loss.item())
            accuracies.append(acc)

            if verbosity_level is not None and (i % verbosity_level) == 0:
                print(f"  - batch {i = }: loss = {loss:.2f} \t accuracy = {acc:.2f}")

        print(f" - epoch loss = {np.mean(losses):.2f} \t accuracy = {np.mean(accuracies):.2f}")

        if test_loader is not None:
            self.prediction(test_loader)

    print("\n")


def prediction(self, test_loader):
    self.eval()

    test_losses = []
    test_accuracies = []

    with torch.no_grad():
        for (inputs, labels, lengths) in test_loader:
            labels = labels.to(torch.float)

            outputs = self(inputs, lengths)
            outputs = outputs.squeeze().to(torch.float)
            test_loss = self.criterion(outputs, labels)

            label_hat = torch.round(torch.sigmoid(outputs))
            test_acc = (label_hat == labels).sum().item() / len(labels)

            test_losses.append(test_loss.item())
            test_accuracies.append(test_acc)

    print(f" - test loss = {np.mean(test_losses):.2f} \t accuracy = {np.mean(test_accuracies):.2f}")

    print("\n")
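
The selection step in `forward` can be checked on a toy tensor. The sketch below uses made-up shapes and values; it shows that direct advanced indexing and `torch.gather` both pick the hidden state at each sequence's last real token. It is meant as an illustration of the indexing only, not as a replacement for the code above.

```python
import torch

batch_size, max_len, hidden_dim = 3, 4, 2
# toy "RNN output": the entry at (b, t, :) equals 10*b + t, so it is easy to read off
out = (10 * torch.arange(batch_size).view(-1, 1, 1)
       + torch.arange(max_len).view(1, -1, 1)).expand(-1, -1, hidden_dim).float()

lengths = torch.tensor([2, 4, 1])      # true (unpadded) lengths per sequence
last_idx = lengths - 1                 # zero-based position of the last real token

# variant 1: advanced indexing with one (batch, time) index pair per sample
last_hidden = out[torch.arange(batch_size), last_idx]          # (batch, hidden_dim)

# variant 2: torch.gather along the time dimension
idx = last_idx.view(-1, 1, 1).expand(-1, 1, hidden_dim)        # (batch, 1, hidden_dim)
gathered = out.gather(1, idx).squeeze(1)                       # (batch, hidden_dim)

print(last_hidden)   # rows: [1, 1], [13, 13], [20, 20]
print(torch.equal(last_hidden, gathered))  # True
```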

Old forward pass:

In [60]:
import inspect
print(inspect.getsource(MyRNN.forward))

    def forward(self, x):
        # get last hidden state
        _, hn = self.rnn(x)
        return self.fc_layers(hn)



Update forward pass:

In [61]:
MyRNN.forward = forward
MyRNN.train_one_batch = train_one_batch
MyRNN.train_epochs = train_epochs
MyRNN.prediction = prediction
print(inspect.getsource(MyRNN.forward))

def forward(self, x, lengths):
    lengths = lengths - 1
    out, _ = self.rnn(x)
        
    # Create indices for gathering
    indices = torch.arange(out.size(1), device=self.device)  # Assuming the sequence length is along the second dimension

    # Expand the indices to match the shape of ind
    indices = indices.unsqueeze(0).expand(lengths.size(0), -1).to(device=self.device)

    # Gather the slices
    gathered_slices = torch.gather(out, 1, indices.unsqueeze(-1).expand(-1, -1, out.size(-1))).to(device=self.device)

    # Select the slices based on the provided indices
    out = gathered_slices[torch.arange(out.size(0)), lengths]

    return self.fc_layers(out)
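
An alternative to selecting the hidden states by hand is to pack the padded batch with `torch.nn.utils.rnn.pack_padded_sequence`, so that the RNN stops at the true length of each sequence. The sketch below compares both variants on toy data; the dimensions and lengths are assumptions for illustration, and this is not the approach used in the rest of this notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=10, hidden_size=7, batch_first=True)

x = torch.randn(3, 5, 10)              # padded batch: (batch, max_len, features)
lengths = torch.tensor([5, 2, 3])      # true lengths; must be a CPU tensor for packing
for b, n in enumerate(lengths.tolist()):
    x[b, n:] = 0.0                     # emulate zero padding after the real tokens

# variant 1: run on the padded batch, then pick the state at position length - 1
out, _ = rnn(x)
picked = out[torch.arange(x.size(0)), lengths - 1]

# variant 2: pack the batch so the RNN stops at each sequence's true length
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, hn = rnn(packed)

print(torch.allclose(picked, hn[-1], atol=1e-5))  # True
```

With packing, `hn` can be used directly, as in the original forward pass. Note that the collate function above moves `lengths` to the GPU, so they would have to be moved back with `lengths.cpu()` before packing.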



We test again using only one layer:

In [62]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# now we try only one layer
fc_layers = [nn.Linear(hidden_dim, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model has 29569 parameters.


MyRNN(
  (rnn): RNN(100, 128, batch_first=True)
  (fc_layers): Sequential(
    (0): Linear(in_features=128, out_features=1, bias=True)
  )
  (criterion): BCEWithLogitsLoss()
)

In [63]:
model.train_epochs(2, training_loader=train_dataloader, verbosity_level=20)

---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.71 	 accuracy = 0.43
  - batch i = 20: loss = 0.72 	 accuracy = 0.48
  - batch i = 40: loss = 0.66 	 accuracy = 0.63
  - batch i = 60: loss = 0.67 	 accuracy = 0.58
  - batch i = 80: loss = 0.67 	 accuracy = 0.63
  - batch i = 100: loss = 0.61 	 accuracy = 0.71
  - batch i = 120: loss = 0.68 	 accuracy = 0.60
  - batch i = 140: loss = 0.66 	 accuracy = 0.66
  - batch i = 160: loss = 0.60 	 accuracy = 0.73
  - batch i = 180: loss = 0.64 	 accuracy = 0.66
  - batch i = 200: loss = 0.70 	 accuracy = 0.52
  - batch i = 220: loss = 0.69 	 accuracy = 0.52
  - batch i = 240: loss = 0.68 	 accuracy = 0.53
 - epoch loss = 0.67 	 accuracy = 0.59
---------------------------------------------------------------
training epoch 1
  - batch i = 0: loss = 0.68 	 accuracy = 0.57
  - batch i = 20: loss = 0.67 	 accuracy = 0.62
  - batch i = 40: loss = 0.68 	 accuracy = 0.55
  - batch i = 60: loss = 

There is still a strange pattern: the model gets better and better during the first epoch and then gets worse again in the second epoch.

Let's look at the gradients:
- here we update the `train_one_batch` and the `train_epochs` functions: 

In [64]:
def train_one_batch(self, inputs, labels, lengths):
        labels = labels.to(torch.float)

        # Zero your gradients for every batch
        self.optimizer.zero_grad()

        # predict and compute loss
        outputs = self(inputs, lengths)
        outputs = outputs.squeeze().to(torch.float)
        loss = self.criterion(outputs, labels)

        # compute gradients
        loss.backward()
        
        self.optimizer.step()

        # get gradients
        total_norm = 0.
        for p in self.parameters():
            param_norm = p.grad.detach().data.norm(2)
            total_norm += param_norm.item() ** 2
        total_norm = total_norm ** 0.5

        # batch accuracy and loss
        label_hat = torch.round(torch.sigmoid(outputs))
        acc = (label_hat == labels).sum().item() / len(labels)

        return loss.detach(), acc, total_norm

def train_epochs(self, nb_epochs, training_loader, test_loader=None, verbosity_level=None):

        for epoch_index in range(nb_epochs):
            print("---------------------------------------------------------------")
            print(f"training epoch {epoch_index}")
    
            # put model in training mode
            self.train()

            losses = []
            accuracies = []

            for i, (inputs, labels, lengths) in enumerate(training_loader):
                loss, acc, total_norm = self.train_one_batch(inputs, labels, lengths)

                losses.append(loss.item())
                accuracies.append(acc)

                if verbosity_level is not None and (i % verbosity_level) == 0:
                    print(f"  - batch {i = }: loss = {loss:.2f} \t accuracy = {acc:.2f} \tgradient_norm = {total_norm:.2f}")

            print(f" - epoch loss = {np.mean(losses):.2f} \taccuracy = {np.mean(accuracies):.2f}")
            
            if test_loader is not None:
                self.prediction(test_loader)
        print("\n")


MyRNN.train_one_batch = train_one_batch
MyRNN.train_epochs = train_epochs

In [66]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# now we try only one layer
fc_layers = [nn.Linear(hidden_dim, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)


model has 29569 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.70 	 accuracy = 0.50 	gradient_norm = 0.36
  - batch i = 20: loss = 0.70 	 accuracy = 0.55 	gradient_norm = 0.57
  - batch i = 40: loss = 0.59 	 accuracy = 0.70 	gradient_norm = 6.89
  - batch i = 60: loss = 0.61 	 accuracy = 0.67 	gradient_norm = 36.73
  - batch i = 80: loss = 0.68 	 accuracy = 0.53 	gradient_norm = 0.34
  - batch i = 100: loss = 0.68 	 accuracy = 0.55 	gradient_norm = 0.26
  - batch i = 120: loss = 0.67 	 accuracy = 0.59 	gradient_norm = 0.33
  - batch i = 140: loss = 0.65 	 accuracy = 0.65 	gradient_norm = 0.24
  - batch i = 160: loss = 0.61 	 accuracy = 0.66 	gradient_norm = 0.49
  - batch i = 180: loss = 0.71 	 accuracy = 0.53 	gradient_norm = 0.63
  - batch i = 200: loss = 0.62 	 accuracy = 0.70 	gradient_norm = 0.32
  - batch i = 220: loss = 0.66 	 accuracy = 0.56 	gradient_norm = 0.78
  - batch i = 240: loss = 0.65 	 accuracy = 

When you run it a few times, you see this pattern: large gradients and low accuracy often go hand in hand. This is a sign of exploding gradients.

One approach to deal with this issue is to clip the gradient, i.e. if it is larger than some threshold, we cut it back to that threshold:
- watch this [video](https://www.youtube.com/watch?v=KrQp1TxTCUY)
- you don't have to, but if you are interested, here is an [article](https://neptune.ai/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem) that goes into more detail

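As you saw, there are two approaches to address this problem:
- gradient clipping by norm: the whole gradient vector is rescaled if its norm exceeds a threshold
- gradient clipping by value (threshold): each gradient component is clamped to a fixed interval

Update your code and implement gradient clipping by norm.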



In [67]:
def train_one_batch(self, inputs, labels, lengths, do_norm_clipping=True):
    labels = labels.to(torch.float)

    # Zero your gradients for every batch
    self.optimizer.zero_grad()

    # predict and compute loss
    outputs = self(inputs, lengths)
    outputs = outputs.squeeze().to(torch.float)
    loss = self.criterion(outputs, labels)

    # compute gradients
    loss.backward()

    if do_norm_clipping:
        torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)
        
    self.optimizer.step()

    # get gradients
    total_norm = 0.
    for p in self.parameters():
        param_norm = p.grad.detach().data.norm(2)
        total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5

    # batch accuracy and loss
    label_hat = torch.round(torch.sigmoid(outputs))
    acc = (label_hat == labels).sum().item() / len(labels)

    return loss.detach(), acc, total_norm

MyRNN.train_one_batch = train_one_batch
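
For comparison, here is a minimal sketch of how the two clipping approaches differ. It uses a toy `nn.Linear` layer with hand-set gradients instead of the notebook's `MyRNN`; the helper names `set_toy_gradients` and `total_grad_norm` are made up for this illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 1)  # 4 weights + 1 bias = 5 parameters

def set_toy_gradients(module, value=3.0):
    """Overwrite all gradients with a fixed, deliberately large value."""
    for p in module.parameters():
        p.grad = torch.full_like(p, value)

def total_grad_norm(module):
    """L2 norm of all gradients, viewed as one long vector."""
    return torch.sqrt(sum((p.grad ** 2).sum() for p in module.parameters())).item()

# clipping by norm: the whole gradient vector is rescaled, its direction is kept
set_toy_gradients(layer)
norm_before = total_grad_norm(layer)  # sqrt(5 * 3^2)
nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
norm_after = total_grad_norm(layer)

# clipping by value: every component is clamped to [-clip_value, clip_value]
set_toy_gradients(layer)
nn.utils.clip_grad_value_(layer.parameters(), clip_value=0.5)
largest_component = max(p.grad.abs().max().item() for p in layer.parameters())

print(f"{norm_before=:.3f}, {norm_after=:.3f}, {largest_component=:.3f}")
```

Clipping by norm keeps the direction of the update and only shortens it, whereas clipping by value can change the direction because large and small components are treated differently.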

Test: it looks better (at least for the runs here)

In [68]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# now we try only one layer
fc_layers = [nn.Linear(hidden_dim, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)

model has 29569 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.70 	 accuracy = 0.53 	gradient_norm = 0.16
  - batch i = 20: loss = 0.69 	 accuracy = 0.54 	gradient_norm = 0.29
  - batch i = 40: loss = 0.66 	 accuracy = 0.61 	gradient_norm = 0.23
  - batch i = 60: loss = 0.66 	 accuracy = 0.62 	gradient_norm = 1.00
  - batch i = 80: loss = 0.57 	 accuracy = 0.71 	gradient_norm = 1.00
  - batch i = 100: loss = 0.63 	 accuracy = 0.65 	gradient_norm = 1.00
  - batch i = 120: loss = 0.59 	 accuracy = 0.72 	gradient_norm = 1.00
  - batch i = 140: loss = 0.60 	 accuracy = 0.69 	gradient_norm = 1.00
  - batch i = 160: loss = 0.60 	 accuracy = 0.69 	gradient_norm = 1.00
  - batch i = 180: loss = 0.59 	 accuracy = 0.70 	gradient_norm = 0.91
  - batch i = 200: loss = 0.58 	 accuracy = 0.70 	gradient_norm = 1.00
  - batch i = 220: loss = 0.57 	 accuracy = 0.73 	gradient_norm = 1.00
  - batch i = 240: loss = 0.50 	 accuracy = 0
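
Note that `train_one_batch` measures the gradient norm after clipping, which is why the logged values never exceed 1.00. If the size of the unclipped gradient is of interest, the return value of `clip_grad_norm_` can be used, since it is the total norm before clipping. A minimal sketch on a toy layer (the layer, the scaled inputs and the squared loss are assumptions for illustration only):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(10, 1)

# large inputs and a squared loss produce a deliberately large gradient
x = torch.randn(32, 10) * 50.0
loss = layer(x).pow(2).mean()
loss.backward()

# the returned value is the total norm *before* clipping
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0).item()

# recomputing the norm afterwards gives the clipped value
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in layer.parameters())).item()

print(f"{norm_before=:.2f}, {norm_after=:.2f}")
```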

Another important issue, before we start with different hyperparameter settings, is the weight initialization. By default the weights are initialized with random numbers, and the chosen initialization scheme often turns out to matter a lot.

Luckily there are appropriate initializations (→ Deep Learning Courses):
- below I copied our latest changes and made the weight initialization accessible through the `__init__` function
- in `__init__` we now use Xavier initialization, look [here](https://www.deeplearning.ai/ai-notes/initialization/index.html) if you like; you can also use He initialization (video exercise 1)
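
As a rough illustration of what the two schemes do, the sketch below initializes a single linear layer and compares the largest sampled weight with the bound of the uniform distribution it is drawn from. The layer sizes are arbitrary assumptions; Xavier uniform samples from $[-a, a]$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$, and He (Kaiming) uniform for ReLU from $[-b, b]$ with $b = \sqrt{6 / \text{fan\_in}}$.

```python
import math
import torch
import torch.nn as nn
import torch.nn.init as init

torch.manual_seed(0)
fan_in, fan_out = 128, 64  # arbitrary sizes for illustration
linear = nn.Linear(fan_in, fan_out)

# Xavier / Glorot: scale depends on both fan_in and fan_out
init.xavier_uniform_(linear.weight)
xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
xavier_max = linear.weight.abs().max().item()

# He / Kaiming: scale depends on fan_in only, with a gain of sqrt(2) for ReLU
init.kaiming_uniform_(linear.weight, nonlinearity="relu")
he_bound = math.sqrt(2.0) * math.sqrt(3.0 / fan_in)
he_max = linear.weight.abs().max().item()

print(f"{xavier_max=:.4f} <= {xavier_bound=:.4f}")
print(f"{he_max=:.4f} <= {he_bound=:.4f}")
```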

In [69]:
import torch.nn.init as init

class MyRNN(nn.Module):
    def __init__(self, rnn, fc_layers, learning_rate, do_weight_init = True, device=device) -> None:
        super(MyRNN, self).__init__()
        self.device = device
        
        # layers
        self.rnn = rnn
        self.fc_layers = nn.Sequential(*fc_layers)

        # optimizer: usually this is also a hyperparameter
        # especially the learning rate is very important
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)

        self.criterion = nn.BCEWithLogitsLoss()  # Binary cross-entropy loss for binary classification


        if do_weight_init:
            self._weight_initialization()
        
        print(f"model has {self.count_parameters()} parameters.")

    def _weight_initialization(self):
        for name, param in self.named_parameters():
            if 'weight' in name:
                init.xavier_uniform_(param)

    def count_parameters(self):
            return sum(p.numel() for p in self.parameters() if p.requires_grad)
    

    def forward(self, x, lengths):
        lengths = lengths - 1
        
        out, _ = self.rnn(x)
        
        # Create indices for gathering
        indices = torch.arange(out.size(1), device=self.device)  # Assuming the sequence length is along the second dimension

        # Expand the indices to match the shape of ind
        indices = indices.unsqueeze(0).expand(lengths.size(0), -1).to(device=self.device)

        # Gather the slices
        gathered_slices = torch.gather(out, 1, indices.unsqueeze(-1).expand(-1, -1, out.size(-1))).to(device=self.device)

        # Select the slices based on the provided indices
        out = gathered_slices[torch.arange(out.size(0)), lengths]

        return self.fc_layers(out)
    
    def train_one_batch(self, inputs, labels, lengths, do_norm_clipping=True):
        labels = labels.to(torch.float)

        # Zero your gradients for every batch
        self.optimizer.zero_grad()

        # predict and compute loss
        outputs = self(inputs, lengths)
        outputs = outputs.squeeze().to(torch.float)
        loss = self.criterion(outputs, labels)

        # compute gradients
        loss.backward()

        if do_norm_clipping:
            torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)
        
        self.optimizer.step()

        # get gradients
        total_norm = 0.
        for p in self.parameters():
            param_norm = p.grad.detach().data.norm(2)
            total_norm += param_norm.item() ** 2
        total_norm = total_norm ** 0.5

        # batch accuracy and loss
        label_hat = torch.round(torch.sigmoid(outputs))
        acc = (label_hat == labels).sum().item() / len(labels)

        return loss.detach(), acc, total_norm
    

    def train_epochs(self, nb_epochs, training_loader, test_loader=None, verbosity_level=None):

        for epoch_index in range(nb_epochs):
            print("---------------------------------------------------------------")
            print(f"training epoch {epoch_index}")
    
            # put model in training mode
            self.train()

            losses = []
            accuracies = []

            for i, (inputs, labels, lengths) in enumerate(training_loader):
                loss, acc, total_norm = self.train_one_batch(inputs, labels, lengths)

                losses.append(loss.item())
                accuracies.append(acc)

                if verbosity_level is not None and (i % verbosity_level) == 0:
                    print(f"  - batch {i = }: loss = {loss:.2f} \t accuracy = {acc:.2f} \tgradient_norm = {total_norm:.2f}")

            print(f" - epoch loss = {np.mean(losses):.2f} \taccuracy = {np.mean(accuracies):.2f}")
            if test_loader is not None:
                self.prediction(test_loader)

        print("\n")


    def prediction(self, test_loader):
        self.eval()

        test_losses = []
        test_accuracies = []

        with torch.no_grad():
            for (inputs, labels, lengths) in test_loader:
                labels = labels.to(torch.float)

                outputs = self(inputs, lengths)
                outputs = outputs.squeeze().to(torch.float)
                test_loss = self.criterion(outputs, labels)

                label_hat = torch.round(torch.sigmoid(outputs))
                test_acc = (label_hat == labels).sum().item() / len(labels)

                test_losses.append(test_loss.item())
                test_accuracies.append(test_acc)

        print(f" - test loss = {np.mean(test_losses):.2f} \t accuracy = {np.mean(test_accuracies):.2f}")

        print("\n")

In [70]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# now we try only one layer
fc_layers = [nn.Linear(hidden_dim, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)

model has 29569 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.76 	 accuracy = 0.48 	gradient_norm = 1.00
  - batch i = 20: loss = 0.72 	 accuracy = 0.50 	gradient_norm = 0.44
  - batch i = 40: loss = 0.68 	 accuracy = 0.59 	gradient_norm = 0.42
  - batch i = 60: loss = 0.67 	 accuracy = 0.59 	gradient_norm = 0.50
  - batch i = 80: loss = 0.68 	 accuracy = 0.56 	gradient_norm = 0.85
  - batch i = 100: loss = 0.68 	 accuracy = 0.63 	gradient_norm = 1.00
  - batch i = 120: loss = 0.63 	 accuracy = 0.63 	gradient_norm = 1.00
  - batch i = 140: loss = 0.55 	 accuracy = 0.73 	gradient_norm = 0.92
  - batch i = 160: loss = 0.65 	 accuracy = 0.62 	gradient_norm = 0.55
  - batch i = 180: loss = 0.56 	 accuracy = 0.70 	gradient_norm = 1.00
  - batch i = 200: loss = 0.70 	 accuracy = 0.54 	gradient_norm = 1.00
  - batch i = 220: loss = 0.60 	 accuracy = 0.72 	gradient_norm = 1.00
  - batch i = 240: loss = 0.69 	 accuracy = 0

It doesn't make a huge difference. Next we can apply batch normalization:

In [71]:
def forward(self, x, lengths):
        lengths = lengths - 1

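        # add BatchNorm on the input (alternative: LayerNorm, commented out below)
        # note: the layer is created anew on every call, so its parameters are never trained
        # and no running statistics are kept; for x of shape (batch, seq_len, emb),
        # BatchNorm1d(x.shape[1]) treats the sequence position as the channel dimension
        x = nn.BatchNorm1d(x.shape[1], device=self.device)(x)
        #x = nn.LayerNorm(x.shape[2], device=self.device)(x)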
        out, _ = self.rnn(x)

        # Create indices for gathering
        indices = torch.arange(out.size(1), device=self.device)  # Assuming the sequence length is along the second dimension

        # Expand the indices to match the shape of ind
        indices = indices.unsqueeze(0).expand(lengths.size(0), -1).to(device=self.device)

        # Gather the slices
        gathered_slices = torch.gather(out, 1, indices.unsqueeze(-1).expand(-1, -1, out.size(-1))).to(device=self.device)

        # Select the slices based on the provided indices
        out = gathered_slices[torch.arange(out.size(0)), lengths]

        # add BatchNorm on the selected hidden state (alternative: LayerNorm, commented out below)
        #out = nn.LayerNorm(out.shape[1], device=self.device)(out)
        out = nn.BatchNorm1d(out.shape[1], device=self.device)(out)
        
        return self.fc_layers(out)

MyRNN.forward = forward
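
The forward pass above builds its normalization layers inside `forward`, so they are re-created on every call. A more common pattern is to register the layer once in `__init__`, so that its scale and shift are trained and its running statistics are used in evaluation mode. The following is a minimal sketch of that pattern, not the notebook's `MyRNN`: the class `NormalizedRNN`, its default sizes and the random input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NormalizedRNN(nn.Module):  # hypothetical, simplified stand-in for MyRNN
    def __init__(self, embedding_dim=100, hidden_dim=128):
        super().__init__()
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.norm = nn.BatchNorm1d(hidden_dim)  # registered once, so it is trained
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x, lengths):
        out, _ = self.rnn(x)
        # pick the hidden state of the last valid time step of each sequence
        last = out[torch.arange(out.size(0)), lengths - 1]
        return self.fc(self.norm(last))

torch.manual_seed(0)
model = NormalizedRNN()
x = torch.randn(8, 12, 100)              # (batch, time, embedding)
lengths = torch.randint(1, 13, (8,))     # valid lengths between 1 and 12

model.train()                            # training: batch statistics are used
train_logits = model(x, lengths)
batches_seen = int(model.norm.num_batches_tracked)

model.eval()                             # testing: running statistics are used
with torch.no_grad():
    alone = model(x[:1], lengths[:1])    # first sample on its own
    in_batch = model(x, lengths)[:1]     # first sample as part of the batch
```

In evaluation mode the prediction for a sample no longer depends on which other samples are in the batch, which is the behaviour we want at test time.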

In [72]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(
    input_size = embedding_dim,
    hidden_size = hidden_dim,
    num_layers=num_layers,
    batch_first = True,
    dropout = drop_out)
        
# now we try only one layer
fc_layers = [nn.Linear(hidden_dim, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)

model has 29569 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.87 	 accuracy = 0.53 	gradient_norm = 1.00
  - batch i = 20: loss = 0.74 	 accuracy = 0.55 	gradient_norm = 1.00
  - batch i = 40: loss = 0.73 	 accuracy = 0.55 	gradient_norm = 1.00
  - batch i = 60: loss = 0.75 	 accuracy = 0.55 	gradient_norm = 1.00
  - batch i = 80: loss = 0.71 	 accuracy = 0.50 	gradient_norm = 1.00
  - batch i = 100: loss = 0.72 	 accuracy = 0.58 	gradient_norm = 1.00
  - batch i = 120: loss = 0.72 	 accuracy = 0.59 	gradient_norm = 1.00
  - batch i = 140: loss = 0.63 	 accuracy = 0.69 	gradient_norm = 1.00
  - batch i = 160: loss = 0.61 	 accuracy = 0.67 	gradient_norm = 1.00
  - batch i = 180: loss = 0.72 	 accuracy = 0.59 	gradient_norm = 1.00
  - batch i = 200: loss = 0.66 	 accuracy = 0.61 	gradient_norm = 1.00
  - batch i = 220: loss = 0.72 	 accuracy = 0.55 	gradient_norm = 1.00
  - batch i = 240: loss = 0.65 	 accuracy = 0

Since it is not really better, we can try to include a non-linearity before the `fc_layers`:

In [73]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# now we try only one layer
fc_layers = [nn.ReLU(),nn.Linear(hidden_dim, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)

model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)

model has 29569 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.76 	 accuracy = 0.55 	gradient_norm = 1.00
  - batch i = 20: loss = 0.72 	 accuracy = 0.51 	gradient_norm = 0.72
  - batch i = 40: loss = 0.68 	 accuracy = 0.62 	gradient_norm = 0.69
  - batch i = 60: loss = 0.67 	 accuracy = 0.56 	gradient_norm = 0.73
  - batch i = 80: loss = 0.66 	 accuracy = 0.58 	gradient_norm = 0.61
  - batch i = 100: loss = 0.61 	 accuracy = 0.66 	gradient_norm = 0.65
  - batch i = 120: loss = 0.61 	 accuracy = 0.69 	gradient_norm = 0.73
  - batch i = 140: loss = 0.67 	 accuracy = 0.60 	gradient_norm = 0.79
  - batch i = 160: loss = 0.66 	 accuracy = 0.60 	gradient_norm = 1.00
  - batch i = 180: loss = 0.58 	 accuracy = 0.71 	gradient_norm = 0.67
  - batch i = 200: loss = 0.58 	 accuracy = 0.74 	gradient_norm = 1.00
  - batch i = 220: loss = 0.60 	 accuracy = 0.64 	gradient_norm = 0.97
  - batch i = 240: loss = 0.67 	 accuracy = 0

It does not seem to be very helpful. More layers?

In [74]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification

rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# now we try two layers: recall that consecutive layers have to match in dimension, i.e. hidden_dim -> 64 -> 1
fc_layers = [nn.ReLU(), nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)
model.train_epochs(2, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)

model has 37761 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.78 	 accuracy = 0.45 	gradient_norm = 0.80
  - batch i = 20: loss = 0.71 	 accuracy = 0.53 	gradient_norm = 0.63
  - batch i = 40: loss = 0.67 	 accuracy = 0.58 	gradient_norm = 0.58
  - batch i = 60: loss = 0.68 	 accuracy = 0.55 	gradient_norm = 0.59
  - batch i = 80: loss = 0.68 	 accuracy = 0.57 	gradient_norm = 0.63
  - batch i = 100: loss = 0.67 	 accuracy = 0.61 	gradient_norm = 0.73
  - batch i = 120: loss = 0.67 	 accuracy = 0.62 	gradient_norm = 1.00
  - batch i = 140: loss = 0.64 	 accuracy = 0.56 	gradient_norm = 0.71
  - batch i = 160: loss = 0.62 	 accuracy = 0.70 	gradient_norm = 0.49
  - batch i = 180: loss = 0.69 	 accuracy = 0.57 	gradient_norm = 0.62
  - batch i = 200: loss = 0.62 	 accuracy = 0.66 	gradient_norm = 1.00
  - batch i = 220: loss = 0.66 	 accuracy = 0.64 	gradient_norm = 0.60
  - batch i = 240: loss = 0.64 	 accuracy = 0

Of course we could train longer:

It does not really work very well. Remember that we achieved a test accuracy of 0.79 using elastic net!

Let's see if we should change the learning rate:

In [75]:
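for lr in [1e-1, 1e-2, 1e-4]:
    # build fresh layers for every run; otherwise the biases trained in the
    # previous run are carried over (only the weights are re-initialized)
    rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers, batch_first=True, dropout=drop_out)
    fc_layers = [nn.ReLU(), nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1)]
    model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=lr)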
    model.to(device)
    print(f"learning rate: {lr}")
    model.train_epochs(1, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=50)
    print("============================================================\n")

model has 37761 parameters.
learning rate: 0.1
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.74 	 accuracy = 0.51 	gradient_norm = 0.97
  - batch i = 50: loss = 0.67 	 accuracy = 0.60 	gradient_norm = 0.06
  - batch i = 100: loss = 0.71 	 accuracy = 0.45 	gradient_norm = 0.10
  - batch i = 150: loss = 0.69 	 accuracy = 0.53 	gradient_norm = 0.07
  - batch i = 200: loss = 0.69 	 accuracy = 0.45 	gradient_norm = 0.07
 - epoch loss = 0.76 	accuracy = 0.50
 - test loss = 0.69 	 accuracy = 0.50





model has 37761 parameters.
learning rate: 0.01
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.72 	 accuracy = 0.51 	gradient_norm = 0.67
  - batch i = 50: loss = 0.70 	 accuracy = 0.61 	gradient_norm = 0.26
  - batch i = 100: loss = 0.69 	 accuracy = 0.51 	gradient_norm = 0.24
  - batch i = 150: loss = 0.68 	 accuracy = 0.55 	gradient_norm = 0.19
  - batch i = 200: loss = 0.6

Not a huge difference, but maybe we can try using a larger learning rate of 0.01.

## Attention

Let's give attention a try! Recall that in the forward pass of our current version we ignore all but the last hidden state:

In [76]:
# a mock-up example of our logic at the moment
rnn = nn.RNN(input_size=3, hidden_size=6, batch_first=True)

# let's create a dummy batch of data: 2 observations, each with 5 time steps of 3 features
input = torch.randn(2, 5, 3)

# put it through the rnn: `out` contains the hidden states of all time steps
out, _ = rnn(input)
out.shape, out

(torch.Size([2, 5, 6]),
 tensor([[[ 0.0325,  0.5809,  0.0696,  0.0675,  0.3881,  0.3815],
          [ 0.0966, -0.1537,  0.1609,  0.3482, -0.0783, -0.1984],
          [-0.0360,  0.6509, -0.4764, -0.2346,  0.0666,  0.1816],
          [-0.3213,  0.0098,  0.4404,  0.4634,  0.3899,  0.2281],
          [ 0.6624,  0.4431,  0.4900,  0.2121, -0.1384, -0.0854]],
 
         [[ 0.3022, -0.1820,  0.4094,  0.2782,  0.3534,  0.2801],
          [ 0.0847, -0.0765, -0.7470, -0.2452, -0.3739, -0.3442],
          [-0.0805,  0.2477,  0.2492,  0.3788,  0.5024,  0.2896],
          [ 0.6400,  0.2728,  0.3495,  0.3839, -0.3025, -0.4391],
          [ 0.8876, -0.1754, -0.4368,  0.5128, -0.5406, -0.7968]]],
        grad_fn=<TransposeBackward1>))

But what we actually want is to retrieve the last hidden state $\mathbf h_l$ and calculate its dot product with all hidden states $\mathbf h_i, i\le l$, before feeding the resulting scores to the softmax:

In [81]:
# let's ignore here the lengths variable we have to deal with
out, hn = rnn(input)
print(f"{out.shape=}, {hn.shape=}")
out, hn

out.shape=torch.Size([2, 5, 6]), hn.shape=torch.Size([1, 2, 6])


(tensor([[[ 0.0325,  0.5809,  0.0696,  0.0675,  0.3881,  0.3815],
          [ 0.0966, -0.1537,  0.1609,  0.3482, -0.0783, -0.1984],
          [-0.0360,  0.6509, -0.4764, -0.2346,  0.0666,  0.1816],
          [-0.3213,  0.0098,  0.4404,  0.4634,  0.3899,  0.2281],
          [ 0.6624,  0.4431,  0.4900,  0.2121, -0.1384, -0.0854]],
 
         [[ 0.3022, -0.1820,  0.4094,  0.2782,  0.3534,  0.2801],
          [ 0.0847, -0.0765, -0.7470, -0.2452, -0.3739, -0.3442],
          [-0.0805,  0.2477,  0.2492,  0.3788,  0.5024,  0.2896],
          [ 0.6400,  0.2728,  0.3495,  0.3839, -0.3025, -0.4391],
          [ 0.8876, -0.1754, -0.4368,  0.5128, -0.5406, -0.7968]]],
        grad_fn=<TransposeBackward1>),
 tensor([[[ 0.6624,  0.4431,  0.4900,  0.2121, -0.1384, -0.0854],
          [ 0.8876, -0.1754, -0.4368,  0.5128, -0.5406, -0.7968]]],
        grad_fn=<StackBackward0>))

We use PyTorch broadcasting to calculate the dot product:

In [82]:
print(f"{hn.shape=}")
hn = hn.squeeze(0).unsqueeze(1)
print(f"{hn.shape=}")
hn, hn.shape

hn.shape=torch.Size([1, 2, 6])
hn.shape=torch.Size([2, 1, 6])


(tensor([[[ 0.6624,  0.4431,  0.4900,  0.2121, -0.1384, -0.0854]],
 
         [[ 0.8876, -0.1754, -0.4368,  0.5128, -0.5406, -0.7968]]],
        grad_fn=<UnsqueezeBackward0>),
 torch.Size([2, 1, 6]))

In [84]:
pointwise = out * hn
pointwise.shape, pointwise

(torch.Size([2, 2, 5, 6]),
 tensor([[[[ 0.0215,  0.2574,  0.0341,  0.0143, -0.0537, -0.0326],
           [ 0.0640, -0.0681,  0.0789,  0.0739,  0.0108,  0.0169],
           [-0.0239,  0.2884, -0.2334, -0.0498, -0.0092, -0.0155],
           [-0.2129,  0.0044,  0.2158,  0.0983, -0.0540, -0.0195],
           [ 0.4388,  0.1963,  0.2401,  0.0450,  0.0192,  0.0073]],
 
          [[ 0.2002, -0.0806,  0.2006,  0.0590, -0.0489, -0.0239],
           [ 0.0561, -0.0339, -0.3660, -0.0520,  0.0517,  0.0294],
           [-0.0534,  0.1097,  0.1221,  0.0803, -0.0695, -0.0247],
           [ 0.4239,  0.1209,  0.1713,  0.0814,  0.0419,  0.0375],
           [ 0.5879, -0.0777, -0.2140,  0.1088,  0.0748,  0.0680]]],
 
 
         [[[ 0.0288, -0.1019, -0.0304,  0.0346, -0.2098, -0.3040],
           [ 0.0858,  0.0270, -0.0703,  0.1786,  0.0423,  0.1581],
           [-0.0320, -0.1142,  0.2081, -0.1203, -0.0360, -0.1447],
           [-0.2852, -0.0017, -0.1924,  0.2376, -0.2108, -0.1818],
           [ 0.5879, -0.07

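Now we can sum to get the scalar product. Note that the output above has shape `[2, 2, 5, 6]`, which means that `hn` carried an extra dimension when the cell was run: every last hidden state was multiplied with every sequence in the batch, and `sum(2)` below adds up the time steps. With `out` of shape `[2, 5, 6]` and `hn` of shape `[2, 1, 6]`, the product has shape `[2, 5, 6]` and `sum(2)` sums over the hidden dimension, which gives one score per time step: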

In [85]:
score = pointwise.sum(2)
score

tensor([[[ 0.2876,  0.6783,  0.3354,  0.1817, -0.0869, -0.0433],
         [ 1.2148,  0.0384, -0.0860,  0.2775,  0.0500,  0.0863]],

        [[ 0.3854, -0.2686, -0.2990,  0.4393, -0.3395, -0.4042],
         [ 1.6277, -0.0152,  0.0767,  0.6710,  0.1953,  0.8051]]],
       grad_fn=<SumBackward1>)

Check:

In [86]:
x = pointwise[0,0,:] # first row
x, sum(x)

(tensor([[ 0.0215,  0.2574,  0.0341,  0.0143, -0.0537, -0.0326],
         [ 0.0640, -0.0681,  0.0789,  0.0739,  0.0108,  0.0169],
         [-0.0239,  0.2884, -0.2334, -0.0498, -0.0092, -0.0155],
         [-0.2129,  0.0044,  0.2158,  0.0983, -0.0540, -0.0195],
         [ 0.4388,  0.1963,  0.2401,  0.0450,  0.0192,  0.0073]],
        grad_fn=<SliceBackward0>),
 tensor([ 0.2876,  0.6783,  0.3354,  0.1817, -0.0869, -0.0433],
        grad_fn=<AddBackward0>))

Next calculate the softmax (once `score` has shape `[batch, time]`, `dim=1` is the time dimension):

In [87]:
weights = torch.nn.functional.softmax(score, dim=1)
weights

tensor([[[0.2835, 0.6547, 0.6038, 0.4761, 0.4658, 0.4676],
         [0.7165, 0.3453, 0.3962, 0.5239, 0.5342, 0.5324]],

        [[0.2240, 0.4370, 0.4072, 0.4423, 0.3694, 0.2298],
         [0.7760, 0.5630, 0.5928, 0.5577, 0.6306, 0.7702]]],
       grad_fn=<SoftmaxBackward0>)

In [89]:
weights.sum(1, keepdim=True)

tensor([[[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]],

        [[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]]],
       grad_fn=<SumBackward1>)
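
As a compact cross-check of the steps above, here is a minimal sketch of dot-product attention with the last hidden state as the query. It uses random numbers in place of the RNN output and, like the mock-up, ignores the sequence lengths; the variable names are chosen for this illustration. With `out` of shape (batch, time, hidden) and the query of shape (batch, 1, hidden), the product keeps the shape (batch, time, hidden), and summing over the last dimension gives one score per time step.

```python
import torch

torch.manual_seed(0)
out = torch.randn(2, 5, 6)   # stand-in for the RNN output: (batch, time, hidden)
last = out[:, -1, :]         # last hidden state as query: (batch, hidden)

# dot product of the query with every hidden state via broadcasting
scores = (out * last.unsqueeze(1)).sum(dim=-1)        # (batch, time)

# softmax over the time dimension: one weight per time step
weights = torch.softmax(scores, dim=1)                # (batch, time)

# weighted sum of the hidden states
context = (weights.unsqueeze(-1) * out).sum(dim=1)    # (batch, hidden)

# the same scores, computed with a batched matrix product
scores_bmm = torch.bmm(out, last.unsqueeze(-1)).squeeze(-1)

print(scores.shape, weights.shape, context.shape)
```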

Our data also contains the actual length of each sequence. We only use hidden states up to that length; hidden states at padded positions are ignored. The easiest way is to set their weights to 0 (or equivalently to set the corresponding arguments of the softmax to $-\infty$):

In [90]:
# the actual lengths 
l = torch.tensor([1,2])

# some batch that we want to mask
X = torch.tensor([[[1, 2, 3], [4, 5, 6], [1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12], [1, 2, 3], [4, 5, 6]]])
l, X

(tensor([1, 2]),
 tensor([[[ 1,  2,  3],
          [ 4,  5,  6],
          [ 1,  2,  3],
          [ 4,  5,  6]],
 
         [[ 7,  8,  9],
          [10, 11, 12],
          [ 1,  2,  3],
          [ 4,  5,  6]]]))

masking:

In [91]:
l[:, None].shape,  torch.arange(X.size(1)).unsqueeze(0).shape

(torch.Size([2, 1]), torch.Size([1, 4]))

In [92]:
l[:, None],  torch.arange(X.size(1)).unsqueeze(0) 

(tensor([[1],
         [2]]),
 tensor([[0, 1, 2, 3]]))

In [93]:
mask = torch.arange(X.size(1)).unsqueeze(0) >= l[:, None]
mask

tensor([[False,  True,  True,  True],
        [False, False,  True,  True]])

In [94]:
X[mask] = -9999
X

tensor([[[    1,     2,     3],
         [-9999, -9999, -9999],
         [-9999, -9999, -9999],
         [-9999, -9999, -9999]],

        [[    7,     8,     9],
         [   10,    11,    12],
         [-9999, -9999, -9999],
         [-9999, -9999, -9999]]])
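
The same mask can also be applied directly to the attention scores. The sketch below is one possible variant, not the notebook's implementation: it fills masked scores with $-\infty$ via `masked_fill` instead of writing `-9999`, so the padded positions receive a weight of exactly 0. The random `scores` tensor is an assumption standing in for real dot products.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(2, 4)          # stand-in for attention scores: (batch, time)
lengths = torch.tensor([1, 2])      # actual sequence lengths

# True for every position at or beyond the sequence length
positions = torch.arange(scores.size(1)).unsqueeze(0)   # (1, time)
mask = positions >= lengths.unsqueeze(1)                # (batch, time)

# masked scores become -inf, so the softmax assigns them weight 0
masked_scores = scores.masked_fill(mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=1)

print(weights)
```

This only works if every sequence has at least one valid position; a row consisting entirely of $-\infty$ would produce `nan` in the softmax.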

In [95]:
def forward(self, x, lengths):
    lengths = lengths - 1

    # add LayerNorm
    x = nn.LayerNorm(x.shape[2], device=self.device)(x)
    out, _ = self.rnn(x)        

    # get last hidden state:
    # Create indices for gathering
    indices = torch.arange(out.size(1), device=self.device)  # Assuming the sequence length is along the second dimension

    # Expand the indices to match the shape of ind
    indices = indices.unsqueeze(0).expand(lengths.size(0), -1).to(device=self.device)

    # Gather the slices
    gathered_slices = torch.gather(out, 1, indices.unsqueeze(-1).expand(-1, -1, out.size(-1))).to(device=self.device)

    # Select the slices based on the provided indices
    last_hidden = gathered_slices[torch.arange(out.size(0)), lengths]
        
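    # attention calculations
    last_hidden = last_hidden.unsqueeze(1)  # (batch, 1, hidden)
    pointwise = out * last_hidden  # broadcast to (batch, time, hidden)

    # mask padded positions; `lengths` already holds the index of the last valid step, hence `>`
    mask = torch.arange(out.size(1), device=self.device).unsqueeze(0) > lengths[:, None]
    pointwise[mask] = -9999

    score = pointwise.sum(2)  # dot products, (batch, time)
    weights = nn.Softmax(dim=1)(score)  # softmax over the time dimension
    res = torch.sum(weights.unsqueeze(-1) * out, 1)  # weighted sum of hidden states, (batch, hidden)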

        

    # add LayerNorm
    out = nn.LayerNorm(res.shape[1], device=self.device)(res)

    return self.fc_layers(out)

MyRNN.forward = forward

In [96]:
embedding_dim = 100
hidden_dim = 128 # a hyperparameter
learning_rate = 1e-3 # a hyperparameter
drop_out = 0 # dropout is a regularization technique, and also a hyperparameter
num_layers = 1 # we can stack layers on top of each other if we like
output_dim = 1 # we are doing a binary classification


rnn = nn.RNN(input_size = embedding_dim,
                 hidden_size = hidden_dim,
                 num_layers=num_layers,
                 batch_first = True,
                 dropout = drop_out)
        
# again two fully connected layers: hidden_dim -> 64 -> 1
fc_layers = [nn.ReLU(), nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1)]

model = MyRNN(rnn = rnn, fc_layers=fc_layers, learning_rate=learning_rate)
model.to(device)
model.train_epochs(10, training_loader=train_dataloader, test_loader=test_dataloader, verbosity_level=20)

model has 37761 parameters.
---------------------------------------------------------------
training epoch 0
  - batch i = 0: loss = 0.71 	 accuracy = 0.57 	gradient_norm = 0.79
  - batch i = 20: loss = 0.71 	 accuracy = 0.50 	gradient_norm = 0.61
  - batch i = 40: loss = 0.69 	 accuracy = 0.55 	gradient_norm = 0.91
  - batch i = 60: loss = 0.67 	 accuracy = 0.57 	gradient_norm = 0.70
  - batch i = 80: loss = 0.66 	 accuracy = 0.60 	gradient_norm = 0.59
  - batch i = 100: loss = 0.65 	 accuracy = 0.63 	gradient_norm = 1.00
  - batch i = 120: loss = 0.61 	 accuracy = 0.66 	gradient_norm = 1.00
  - batch i = 140: loss = 0.62 	 accuracy = 0.70 	gradient_norm = 1.00
  - batch i = 160: loss = 0.61 	 accuracy = 0.70 	gradient_norm = 0.61
  - batch i = 180: loss = 0.83 	 accuracy = 0.50 	gradient_norm = 1.00
  - batch i = 200: loss = 0.59 	 accuracy = 0.72 	gradient_norm = 1.00
  - batch i = 220: loss = 0.53 	 accuracy = 0.75 	gradient_norm = 1.00
  - batch i = 240: loss = 0.66 	 accuracy = 0

### Summary

Take home messages:

- how a PyTorch training loop goes: zero_grad, forward, loss, backward, optimizer step, etc.
- there are many hyperparameters (which ones again?)
- padding and retrieving the correct hidden state is non-trivial
- gradient clipping, batch norm (and later layer norm), non-linearities
- attention
- we use way too many parameters for this problem. Remember that we have achieved a test accuracy of 0.79 using elastic net!
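
The order of the steps in one training iteration can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the notebook's `MyRNN`: the data is random, the model is a single linear layer, and the learning rate and number of steps are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 3)
y = (X[:, 0] > 0).float()            # toy binary labels

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

losses = []
for step in range(50):
    optimizer.zero_grad()                                        # 1. reset old gradients
    logits = model(X).squeeze(1)                                 # 2. forward pass
    loss = criterion(logits, y)                                  # 3. compute the loss
    loss.backward()                                              # 4. compute gradients
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # 5. clip by norm
    optimizer.step()                                             # 6. update the parameters
    losses.append(loss.item())

print(f"first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}")
```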