# TL;DR

1. In this lab scenario you will compare the performance of a classic RNN and an LSTM on a toy example.
2. The toy example shows that maintaining memory over even 20 steps is non-trivial.
3. Finally, you will see how curriculum learning can make it possible to train a model on longer sequences.

# Problem definition

Here we consider a toy example where the goal is to discriminate between two types of binary sequences:
* [Type 0] a sequence with exactly one zero (the remaining entries are equal to one),
* [Type 1] a sequence full of ones.

We are especially interested in how well the trained models discriminate between a sequence full of ones and a sequence with a leading zero followed by ones. In this case the model effectively has to output the first element of the sequence, because the label (sequence type) is fully determined by that element.
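The two sequence types can be written down in a few lines of plain Python. This is a minimal sketch; the helper names `make_sequence` and `label_of` are illustrative and do not appear in the lab code.

```python
def make_sequence(length, zero_pos=None):
    """Return a list of `length` ones, with a single zero at `zero_pos` if given."""
    seq = [1.0] * length
    if zero_pos is not None:
        seq[zero_pos] = 0.0
    return seq


def label_of(seq):
    """Label 1 (Type 1) if the sequence is all ones, label 0 (Type 0) otherwise."""
    return 1 if all(x == 1.0 for x in seq) else 0


hardest_type0 = make_sequence(5, zero_pos=0)  # leading zero followed by ones
type1 = make_sequence(5)                      # sequence full of ones
print(hardest_type0, label_of(hardest_type0))
print(type1, label_of(type1))
```

For the hardest pair the label equals the first element, so the information has to survive all remaining steps of the sequence.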

# Implementation

## Importing torch

Install `torch` and `torchvision`

In [1]:
!pip3 install torch torchvision



In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(1)

<torch._C.Generator at 0x7fee01e1e610>

## Understand dimensionality

Check the input and output specifications of [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) and [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html). The following snippet shows how to process
a sequence with an LSTM and output a vector of size `hidden_dim` after reading
each token of the sequence.

In [28]:
hidden_dim = 5
lstm = nn.LSTM(1, hidden_dim)  # Input sequence contains elements - vectors of size 1

# create a random sequence
sequence = [torch.randn(1) for _ in range(10)]

# initialize the hidden state (including cell state)
hidden = (torch.zeros(1, 1, 5),
          torch.zeros(1, 1, 5))

for i, elem in enumerate(sequence):
  # The input has shape (seq_len, batch, input_size) = (1, 1, 1): we feed
  # a single element of the sequence, there is only one sample (sequence)
  # in the batch, and each element of the sequence is treated as a vector
  # of size 1.
  out, hidden = lstm(elem.view(1, 1, 1), hidden)
  print(f'i={i} out={out.detach()}')
print(f'Final hidden state={hidden[0].detach()} cell state={hidden[1].detach()}')

i=0 out=tensor([[[ 0.0643, -0.0487,  0.1485, -0.0262,  0.1324]]])
i=1 out=tensor([[[-0.0315, -0.0625,  0.2592,  0.0080,  0.1264]]])
i=2 out=tensor([[[ 0.0315, -0.0643,  0.3079, -0.0314,  0.1696]]])
i=3 out=tensor([[[ 0.1544, -0.0443,  0.3007, -0.1117,  0.2052]]])
i=4 out=tensor([[[ 0.0084, -0.0472,  0.3254, -0.0395,  0.1552]]])
i=5 out=tensor([[[ 0.0901, -0.0520,  0.3421, -0.0814,  0.1843]]])
i=6 out=tensor([[[ 0.1868, -0.0343,  0.3160, -0.1368,  0.2074]]])
i=7 out=tensor([[[ 0.1509, -0.0336,  0.3085, -0.1008,  0.1976]]])
i=8 out=tensor([[[ 0.2490, -0.0251,  0.2757, -0.1662,  0.2150]]])
i=9 out=tensor([[[ 0.0179, -0.0416,  0.3123, -0.0531,  0.1389]]])
Final hidden state=tensor([[[ 0.0179, -0.0416,  0.3123, -0.0531,  0.1389]]]) cell state=tensor([[[ 0.0352, -0.1567,  0.8403, -0.1186,  0.3227]]])


## To implement

Process the whole sequence all at once by calling `lstm` only once and check that the output is exactly the same as above (remember to initialize the hidden state the same way).

In [32]:
# #########################################################
#                    To implement
# #########################################################
hidden = (torch.zeros(1, 1, 5),
          torch.zeros(1, 1, 5))

seq = torch.stack(sequence)
seq = seq.reshape(10, 1, 1)
out, hidden = lstm(seq, hidden)

In [33]:
hidden

(tensor([[[ 0.0179, -0.0416,  0.3123, -0.0531,  0.1389]]],
        grad_fn=<StackBackward0>),
 tensor([[[ 0.0352, -0.1567,  0.8403, -0.1186,  0.3227]]],
        grad_fn=<StackBackward0>))
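Comparing printed tensors by eye is error-prone, so the check can also be done programmatically. The sketch below is self-contained (it builds its own `lstm` and random sequence with a different seed, so the numbers differ from the ones printed above) and uses `torch.allclose` with a small tolerance, since the two code paths need not agree bit for bit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 5
lstm = nn.LSTM(1, hidden_dim)
sequence = [torch.randn(1) for _ in range(10)]

with torch.no_grad():
    # Step by step: one call per element, carrying the state forward.
    state = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))
    step_outputs = []
    for elem in sequence:
        out, state = lstm(elem.view(1, 1, 1), state)
        step_outputs.append(out)
    stepwise = torch.cat(step_outputs, dim=0)  # shape (10, 1, hidden_dim)

    # All at once: a single call on a (seq_len, batch, input_size) tensor.
    init = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))
    batched, final_state = lstm(torch.stack(sequence).view(10, 1, 1), init)

outputs_match = torch.allclose(stepwise, batched, atol=1e-6)
# The output at the last time step is the final hidden state.
last_matches_hidden = torch.allclose(batched[-1], final_state[0][0], atol=1e-6)
print(outputs_match, last_matches_hidden)
```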

## Training a model

Below we define a very simple model: a single LSTM layer whose output is passed through a ReLU and then through a single fully connected layer that produces one number. We use only
the number generated after reading the last element of the sequence;
it serves as the logit for our classification problem.

In [34]:
class Model(nn.Module):

    def __init__(self, hidden_dim):
        super(Model, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(1, self.hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x has shape (seq_len, batch=1, input_size=1)
        out, _ = self.lstm(x)
        # Only the output at the last time step is fed to the classifier.
        logits = self.hidden2label(F.relu(out[-1].view(-1)))
        return logits

Below is a training loop, where we only train on the two hardest examples.

In [35]:
SEQUENCE_LEN = 10

# Pairs of (sequence, label)
HARD_EXAMPLES = [([0.]+(SEQUENCE_LEN-1)*[1.], 0),
                 (SEQUENCE_LEN*[1.], 1)]


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in HARD_EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = Model(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % 100 == 0:
            eval_on_hard_examples(model)

        for sequence, label in HARD_EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step()   

In [36]:
train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([-0.1427]), tensor([-0.1430])]
Logits for hard examples=[tensor([0.0459]), tensor([0.0459])]
Logits for hard examples=[tensor([-0.0013]), tensor([-0.0012])]
Logits for hard examples=[tensor([-0.0029]), tensor([-0.0028])]
Logits for hard examples=[tensor([0.0038]), tensor([0.0039])]
Logits for hard examples=[tensor([0.0005]), tensor([0.0007])]
Logits for hard examples=[tensor([0.0015]), tensor([0.0018])]
Logits for hard examples=[tensor([0.0005]), tensor([0.0012])]
Logits for hard examples=[tensor([0.0021]), tensor([0.0043])]
Logits for hard examples=[tensor([0.3903]), tensor([0.5400])]
Logits for hard examples=[tensor([-0.2760]), tensor([-0.2749])]
Logits for hard examples=[tensor([0.1003]), tensor([0.1016])]
Logits for hard examples=[tensor([0.0585]), tensor([0.0597])]
Logits for hard examples=[tensor([0.0128]), tensor([0.0136])]
Logits for hard examples=[tensor([-0.0128]), tensor([-0.0115])]
Logits for hard examples=[tensor([0.0126]), tensor([0.0153])
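To read these numbers, recall that `BCEWithLogitsLoss` applies a sigmoid to the logit, so a logit near zero corresponds to a predicted probability near 0.5. The sketch below uses two illustrative helpers (`sigmoid` and `discriminates` are not part of the lab code) and a decision threshold of 0.5, which is an assumption about how "successful" discrimination is judged.

```python
import math


def sigmoid(logit):
    """Map a logit to the predicted probability of label 1 (Type 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def discriminates(logit_type0, logit_type1, threshold=0.5):
    """True if Type 0 falls below the threshold and Type 1 above it."""
    return sigmoid(logit_type0) < threshold < sigmoid(logit_type1)


# Last complete pair of logits printed above: both probabilities are close to 0.5.
print(sigmoid(0.0126), sigmoid(0.0153))
print(discriminates(0.0126, 0.0153))
```

A pair of logits that are nearly identical, as in the run above, means the model gives both hard examples almost the same probability and has not learned to tell them apart.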

## To implement

1. Check for what values of `SEQUENCE_LEN` the model is able to discriminate between the two hard examples (after training).
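One way to run this check without copying the training cell for every length is to make the sequence length a function argument. The following is a sketch under assumptions: `final_logits` is a hypothetical helper, the model mirrors the `Model` class defined earlier, and `num_steps` is kept small so the example runs quickly; a real sweep would use more steps and longer sequences.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Model(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.hidden2label(F.relu(out[-1].view(-1)))


def final_logits(seq_len, hidden_dim=20, lr=0.01, num_steps=200):
    """Train on the two hard examples of length `seq_len`; return their logits."""
    examples = [([0.] + (seq_len - 1) * [1.], 0.), (seq_len * [1.], 1.)]
    data = [(torch.tensor(s).view(-1, 1, 1), torch.tensor([y])) for s, y in examples]
    model = Model(hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)
    for _ in range(num_steps):
        for x, y in data:
            optimizer.zero_grad()
            loss_function(model(x), y).backward()
            optimizer.step()
    with torch.no_grad():
        return [model(x).item() for x, _ in data]


torch.manual_seed(1)
for seq_len in (2, 5):  # extend this tuple (and num_steps) for a real sweep
    logit0, logit1 = final_logits(seq_len)
    print(f'seq_len={seq_len} type0={logit0:.3f} type1={logit1:.3f} '
          f'separated={logit0 < 0 < logit1}')
```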




In [37]:
SEQUENCE_LEN = 100

# Pairs of (sequence, label)
HARD_EXAMPLES = [([0.]+(SEQUENCE_LEN-1)*[1.], 0),
                 (SEQUENCE_LEN*[1.], 1)]


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in HARD_EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = Model(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % 100 == 0:
            eval_on_hard_examples(model)

        for sequence, label in HARD_EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step()   

In [38]:
train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([-0.1585]), tensor([-0.1585])]
Logits for hard examples=[tensor([0.0024]), tensor([0.0024])]
Logits for hard examples=[tensor([0.0241]), tensor([0.0241])]
Logits for hard examples=[tensor([-0.0002]), tensor([-0.0002])]
Logits for hard examples=[tensor([-0.0011]), tensor([-0.0011])]
Logits for hard examples=[tensor([0.0023]), tensor([0.0023])]
Logits for hard examples=[tensor([0.0014]), tensor([0.0014])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0014]), tensor([0.0014])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Lo

2. Instead of training on `HARD_EXAMPLES` only, modify the training loop to train on sequences where the zero may be at any position (so any valid sequence of `Type 0`, not just the hardest one). After modifying the training loop, check for what values of `SEQUENCE_LEN` you can train the model successfully.
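A possible way to cover every position of the zero is to draw a fresh training example at each step. The sketch below is an assumption about how this could be done, not the reference solution: `sample_example` is a hypothetical helper, and the 50/50 split between the two types is a design choice made here.

```python
import random

import torch


def sample_example(seq_len, rng=random):
    """Draw a Type 0 or Type 1 sequence with equal probability.

    For Type 0 the position of the zero is chosen uniformly at random.
    Returns (input of shape (seq_len, 1, 1), label of shape (1,)).
    """
    sequence = seq_len * [1.]
    label = rng.randint(0, 1)
    if label == 0:
        sequence[rng.randrange(seq_len)] = 0.
    x = torch.tensor(sequence).view(-1, 1, 1)
    y = torch.tensor([label], dtype=torch.float32)
    return x, y


random.seed(0)
x, y = sample_example(10)
print(x.view(-1).tolist(), y.item())
```

Inside `train_model`, the inner loop over `HARD_EXAMPLES` could then be replaced by one or more calls to `sample_example(SEQUENCE_LEN)` per step, while `eval_on_hard_examples` keeps reporting the logits for the two hardest sequences.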

In [43]:
SEQUENCE_LEN = 10


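# Pairs of (sequence, label).
# Note: the zero is placed in the last position here, which is the easiest
# Type 0 case; other positions of the zero are not covered by this cell.
EXAMPLES = [((SEQUENCE_LEN-1)*[1.]+[0.], 0),
            (SEQUENCE_LEN*[1.], 1)]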


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = Model(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % 100 == 0:
            eval_on_hard_examples(model)

        for sequence, label in EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step()   

In [44]:
train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([0.1264]), tensor([0.1092])]
Logits for hard examples=[tensor([-4.1522]), tensor([4.2849])]
Logits for hard examples=[tensor([-9.3930]), tensor([8.9250])]
Logits for hard examples=[tensor([-10.1448]), tensor([9.5306])]
Logits for hard examples=[tensor([-10.3701]), tensor([9.6762])]
Logits for hard examples=[tensor([-10.4821]), tensor([9.7562])]
Logits for hard examples=[tensor([-10.5620]), tensor([9.8250])]
Logits for hard examples=[tensor([-10.6233]), tensor([9.8919])]
Logits for hard examples=[tensor([-10.6672]), tensor([9.9561])]
Logits for hard examples=[tensor([-10.7036]), tensor([10.0230])]
Logits for hard examples=[tensor([-10.7339]), tensor([10.0935])]
Logits for hard examples=[tensor([-10.7594]), tensor([10.1686])]
Logits for hard examples=[tensor([-10.7804]), tensor([10.2497])]
Logits for hard examples=[tensor([-10.7967]), tensor([10.3378])]
Logits for hard examples=[tensor([-10.8090]), tensor([10.4341])]
Logits for hard examples=[tensor([-10.

In [45]:
SEQUENCE_LEN = 100


# Pairs of (sequence, label); the zero is again in the last position.
EXAMPLES = [((SEQUENCE_LEN-1)*[1.]+[0.], 0),
            (SEQUENCE_LEN*[1.], 1)]


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = Model(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % 100 == 0:
            eval_on_hard_examples(model)

        for sequence, label in EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step()   

In [46]:
train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([-0.0153]), tensor([-0.0160])]
Logits for hard examples=[tensor([-3.5542]), tensor([4.5477])]
Logits for hard examples=[tensor([-9.2871]), tensor([9.3131])]
Logits for hard examples=[tensor([-9.9792]), tensor([10.1100])]
Logits for hard examples=[tensor([-10.0684]), tensor([10.2785])]
Logits for hard examples=[tensor([-10.0797]), tensor([10.3536])]
Logits for hard examples=[tensor([-10.0814]), tensor([10.4098])]
Logits for hard examples=[tensor([-10.0825]), tensor([10.4574])]
Logits for hard examples=[tensor([-10.0842]), tensor([10.4996])]
Logits for hard examples=[tensor([-10.0864]), tensor([10.5384])]
Logits for hard examples=[tensor([-10.0892]), tensor([10.5753])]
Logits for hard examples=[tensor([-10.0924]), tensor([10.6105])]
Logits for hard examples=[tensor([-10.0961]), tensor([10.6425])]
Logits for hard examples=[tensor([-10.1001]), tensor([10.6725])]
Logits for hard examples=[tensor([-10.1044]), tensor([10.7018])]
Logits for hard examples=[tenso

3. Replace LSTM by a classic RNN and check for what values of `SEQUENCE_LEN` you can train the model successfully.


In [50]:
class ModelRNN(nn.Module):

    def __init__(self, hidden_dim):
        super(ModelRNN, self).__init__()
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(1, self.hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.rnn(x)
        sequence_len = x.shape[0]
        logits = self.hidden2label(F.relu(out[-1].view(-1)))
        return logits
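Since `Model` and `ModelRNN` differ only in the recurrent layer, the comparison can also be run with a single class that takes the layer type as an argument. This is a sketch of one possible refactoring; `SequenceClassifier` and its `cell` parameter are names introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceClassifier(nn.Module):
    """Same classification head as `Model`, with the recurrent layer chosen by name."""

    CELLS = {'lstm': nn.LSTM, 'rnn': nn.RNN}

    def __init__(self, hidden_dim, cell='lstm'):
        super().__init__()
        self.recurrent = self.CELLS[cell](1, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # Both nn.LSTM and nn.RNN return (output, state); only output is used.
        out, _ = self.recurrent(x)
        return self.hidden2label(F.relu(out[-1].view(-1)))


x = torch.ones(10, 1, 1)
for cell in ('lstm', 'rnn'):
    model = SequenceClassifier(hidden_dim=20, cell=cell)
    n_params = sum(p.numel() for p in model.parameters())
    print(cell, tuple(model(x).shape), n_params)
```

With this class, `train_model` would only need an extra `cell` argument instead of a second copy of the training code.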

In [51]:
SEQUENCE_LEN = 10

# Pairs of (sequence, label)
HARD_EXAMPLES = [([0.]+(SEQUENCE_LEN-1)*[1.], 0),
                 (SEQUENCE_LEN*[1.], 1)]


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in HARD_EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = ModelRNN(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % 100 == 0:
            eval_on_hard_examples(model)

        for sequence, label in HARD_EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step()   

In [52]:
train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([-0.0428]), tensor([-0.0432])]
Logits for hard examples=[tensor([-6.7880]), tensor([10.1588])]
Logits for hard examples=[tensor([-13.7029]), tensor([19.4810])]
Logits for hard examples=[tensor([-14.6298]), tensor([20.7230])]
Logits for hard examples=[tensor([-14.7542]), tensor([20.8893])]
Logits for hard examples=[tensor([-14.7711]), tensor([20.9116])]
Logits for hard examples=[tensor([-14.7735]), tensor([20.9145])]
Logits for hard examples=[tensor([-14.7740]), tensor([20.9149])]
Logits for hard examples=[tensor([-14.7743]), tensor([20.9149])]
Logits for hard examples=[tensor([-14.7745]), tensor([20.9148])]
Logits for hard examples=[tensor([-14.7747]), tensor([20.9148])]
Logits for hard examples=[tensor([-14.7750]), tensor([20.9147])]
Logits for hard examples=[tensor([-14.7752]), tensor([20.9147])]
Logits for hard examples=[tensor([-14.7754]), tensor([20.9147])]
Logits for hard examples=[tensor([-14.7756]), tensor([20.9146])]
Logits for hard examples=[t

In [53]:
SEQUENCE_LEN = 100

# Pairs of (sequence, label)
HARD_EXAMPLES = [([0.]+(SEQUENCE_LEN-1)*[1.], 0),
                 (SEQUENCE_LEN*[1.], 1)]


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in HARD_EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = ModelRNN(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % 100 == 0:
            eval_on_hard_examples(model)

        for sequence, label in HARD_EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step()   

In [54]:
train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([0.2561]), tensor([0.2561])]
Logits for hard examples=[tensor([-0.0146]), tensor([-0.0146])]
Logits for hard examples=[tensor([-0.0136]), tensor([-0.0136])]
Logits for hard examples=[tensor([-0.0045]), tensor([-0.0045])]
Logits for hard examples=[tensor([0.0020]), tensor([0.0020])]
Logits for hard examples=[tensor([0.0017]), tensor([0.0017])]
Logits for hard examples=[tensor([0.0010]), tensor([0.0010])]
Logits for hard examples=[tensor([0.0014]), tensor([0.0014])]
Logits for hard examples=[tensor([0.0012]), tensor([0.0012])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Logits for hard examples=[tensor([0.0013]), tensor([0.0013])]
Lo

4. Write a proper curriculum learning loop that considers longer and longer sequences, where the sequence length is increased only after the model has been trained successfully on the current length.

Note that for steps 2-4 you may need to change the value of `num_steps`.
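A curriculum loop for step 4 could look like the sketch below. It rests on several assumptions that the lab does not fix: "trained successfully" is taken to mean that the two hard-example logits lie outside a margin around zero, the length grows by one at a time, and training stops when a per-length step budget runs out. The names `hard_examples`, `solved` and `curriculum_train` are hypothetical. The key point is that the same model instance is reused across lengths, so each stage starts from weights that already solve the shorter problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Model(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.hidden2label(F.relu(out[-1].view(-1)))


def hard_examples(seq_len):
    """The two hardest examples of a given length as (input, target) tensors."""
    pairs = [([0.] + (seq_len - 1) * [1.], 0.), (seq_len * [1.], 1.)]
    return [(torch.tensor(s).view(-1, 1, 1), torch.tensor([y])) for s, y in pairs]


def solved(model, data, margin):
    """True if the Type 0 logit is below -margin and the Type 1 logit above +margin."""
    with torch.no_grad():
        (x0, _), (x1, _) = data
        return model(x0).item() < -margin and model(x1).item() > margin


def curriculum_train(model, start_len, max_len, lr=0.01, margin=2.0,
                     max_steps_per_len=500):
    """Increase the sequence length by one each time the current length is solved.

    Returns the list of lengths that were solved, in order.
    """
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)
    solved_lengths = []
    for seq_len in range(start_len, max_len + 1):
        data = hard_examples(seq_len)
        for _ in range(max_steps_per_len):
            if solved(model, data, margin):
                break
            for x, y in data:
                optimizer.zero_grad()
                loss_function(model(x), y).backward()
                optimizer.step()
        if not solved(model, data, margin):
            break  # budget exhausted: do not move on to a longer sequence
        solved_lengths.append(seq_len)
    return solved_lengths


torch.manual_seed(1)
print(curriculum_train(Model(hidden_dim=20), start_len=1, max_len=4))
```

Larger values of `max_len` and `max_steps_per_len` are needed to reach the sequence lengths used earlier in the lab; the small values here only keep the example fast.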