# Training an LSTM on a single sentence

In my early days of practising PyTorch and training LSTMs on text for next-character prediction, I struggled with some of the basics and often ended up with errors.

But then, I managed to successfully implement one.

This notebook trains a very simple LSTM on a very simple training dataset, the single sequence "i am passionate about computational drug discovery", with commentary along the way.

Let's get the dataset ready first:

In [15]:
data = "i am passionate about computational drug discovery"

#Length of the sequence
seq_len = len(data)

For simplicity, let's keep the character set small: the allowed characters are the lowercase letters, a space and a terminator symbol (#, marking the end of the sentence).

This can be generated in Python as:

In [16]:
import string

vocabulary = string.ascii_lowercase+' #'

n_vocabulary = len(vocabulary)
print(n_vocabulary)
print('Letter set is '+vocabulary)

28
Letter set is abcdefghijklmnopqrstuvwxyz #


We can’t feed characters directly to the LSTM model, so we need to convert them into numbers using some form of encoding.

The method of choice for most ML practitioners is usually one-hot encoding, which is simple and straightforward. There are 28 allowed characters (26 letters + 1 space + 1 ‘#’), so each character is converted into a 28-dimensional vector. Every dimension has the value 0 except the one at the character’s index in the vocabulary string defined above, which has the value 1.

So, letter ```a``` will be encoded into ```[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]```. Similarly ```b``` will be encoded as ```[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]``` and so on...

```#``` is encoded as ```[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]```. 

We will be dealing with these vectors from now on. In PyTorch, the basic datatype is the tensor, a multidimensional array, so we store all our encodings in tensors.

The function below converts a letter to a tensor:

In [17]:
#This function takes a character and returns an encoded vector

import torch
import torch.nn as nn

def ltt(ch):
    ans = torch.zeros(n_vocabulary)
    ans[vocabulary.find(ch)]=1
    return ans

print("Encoding of 'a' ",ltt('a'))
print("Encoding of 'b' ",ltt('b'))
print("Encoding of '#' ",ltt('#'))

Encoding of 'a'  tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Encoding of 'b'  tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Encoding of '#'  tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1.])
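As an aside, the same encoding can be sketched with PyTorch's built-in `torch.nn.functional.one_hot`. The helper name `encode` is my own and not part of the notebook. It uses `str.index` instead of `str.find`, on the assumption that a character outside the vocabulary should raise an error: `find` returns -1 for such characters, which would silently set the last slot (the one for `#`).

```python
import string
import torch
import torch.nn.functional as F

vocabulary = string.ascii_lowercase + ' #'

def encode(text):
    """Hypothetical helper: one-hot encode a string into a (len(text), 28) float tensor."""
    # str.index raises ValueError for characters outside the vocabulary,
    # whereas str.find returns -1 and would silently select the last slot ('#').
    indices = torch.tensor([vocabulary.index(ch) for ch in text])
    # one_hot returns an integer tensor, so cast to float before feeding an LSTM
    return F.one_hot(indices, num_classes=len(vocabulary)).float()

encoded = encode('ab#')
print(encoded.shape)  # torch.Size([3, 28])
```

Each row is the same vector that `ltt` produces for the corresponding character.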


In [18]:
# The following function takes a string and returns a 3D tensor

def getLine(s):
    ans = []
    for c in s:
        ans.append(ltt(c))
    return torch.cat(ans,dim=0).view(len(s),1,n_vocabulary)

Now let’s define our Neural Network Class

In [19]:
#This is our neural network class. every Neural Network in pytorch extends nn.Module

class MyLSTM(nn.Module):
    
    def __init__(self,input_dim,hidden_dim):
        super(MyLSTM,self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        #LSTM takes, input dimensions, hidden dimensions and num layers in case of stacked LSTMs (Default is 1)
        self.LSTM = nn.LSTM(input_dim,hidden_dim)
        
    #Input must be 3 dimensional (sequence length, batch, input dimensions)
    #hc is a tuple which contains the vectors h (hidden/feedback) and c (cell state vector)

    def forward(self,inp,hc):
        #this gives the output for each input step, plus the final (hidden, cell) state tuple, which we discard
        output,_= self.LSTM(inp,hc)
        return output
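To make the shapes concrete, here is a small standalone sketch that calls `nn.LSTM` directly on dummy data. The sequence length of 5 and the all-zero tensors are arbitrary choices for illustration only.

```python
import torch
import torch.nn as nn

seq_len, batch, n_vocabulary = 5, 1, 28
lstm = nn.LSTM(n_vocabulary, n_vocabulary)  # input_size, hidden_size; one layer by default

inp = torch.zeros(seq_len, batch, n_vocabulary)  # (sequence length, batch, input dimensions)
h0 = torch.zeros(1, batch, n_vocabulary)         # (num_layers, batch, hidden_size)
c0 = torch.zeros(1, batch, n_vocabulary)

output, (hn, cn) = lstm(inp, (h0, c0))
print(output.shape)         # torch.Size([5, 1, 28])
print(hn.shape, cn.shape)   # torch.Size([1, 1, 28]) torch.Size([1, 1, 28])

# For a single-layer, unidirectional LSTM the last output step is the final hidden state
print(torch.allclose(output[-1], hn[0]))  # True
```

The `output` tensor holds one hidden vector per input step, which is why it can later be compared against one target per character.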

Now, let’s use the above class to initialize our network

In [20]:
#Dimensions of output of neural network is (seq_len, batch , hidden_dim). Since we want output dimensions to be
#the same as n_vocabulary, hidden_dim = n_vocabulary (output dimensions = hidden_dimensions)
hidden_dim = n_vocabulary     

#Invoking model. Input dimensions = n_vocabulary i.e 28. output dimensions = hidden_dimensions = 28
model = MyLSTM(n_vocabulary,hidden_dim)

#I'm using Adam optimizer here
optimizer = torch.optim.Adam(params = model.parameters(),lr=0.01)

#Loss function is CrossEntropyLoss
LOSS = torch.nn.CrossEntropyLoss()
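Setting hidden_dim = n_vocabulary keeps things simple, but it ties the size of the LSTM to the size of the vocabulary, and the LSTM's outputs are squashed into the range (-1, 1) before being used as scores. A common alternative, shown here only as a sketch and not what this notebook does, is to put a linear layer after the LSTM. The class name `LSTMWithHead` and the hidden size of 64 are my own illustrative choices.

```python
import torch
import torch.nn as nn

class LSTMWithHead(nn.Module):
    """Hypothetical variant: an LSTM followed by a linear layer that produces the scores."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, output_dim)

    def forward(self, inp, hc=None):
        # When hc is None, nn.LSTM starts from all-zero hidden and cell states
        output, _ = self.lstm(inp, hc)
        # The linear layer is applied to the last dimension of every time step
        return self.head(output)

net = LSTMWithHead(input_dim=28, hidden_dim=64, output_dim=28)
logits = net(torch.zeros(50, 1, 28))
print(logits.shape)  # torch.Size([50, 1, 28])
```

With this layout the hidden size can be tuned freely, and the scores passed to CrossEntropyLoss are no longer bounded.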

The model we are trying to build is a character generation model: we feed it the sequence “i am passionate about computational drug discovery”. Since this is a supervised technique, we need labelled data to train on. Here, the output for each input sequence is the next letter in the original sequence.

That means, for the input sequence “i am passio”, output character must be “n”, since n is the next letter in the original sequence. 

Similarly, for the input sequence “i am pa”, the output character is “s”.

Now, let’s try to build actual outputs. Let’s call them ‘targets’.

In [21]:
#List to store targets
targets = []

#Iterate through all chars in the sequence, starting from second letter. Since output for 1st letter is the 2nd letter
for x in data[1:]+'#':
    #Find the target index. For a, it is 0, For 'b' it is 1 etc..
    targets.append(vocabulary.find(x))

#Convert into tensor
targets = torch.tensor(targets)

targets

tensor([26,  0, 12, 26, 15,  0, 18, 18,  8, 14, 13,  0, 19,  4, 26,  0,  1, 14,
        20, 19, 26,  2, 14, 12, 15, 20, 19,  0, 19,  8, 14, 13,  0, 11, 26,  3,
        17, 20,  6, 26,  3,  8, 18,  2, 14, 21,  4, 17, 24, 27])
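A quick way to convince yourself that the targets are aligned correctly is to decode them back into characters. The sketch below rebuilds the targets on its own (using `str.index`, which is equivalent to `find` for characters inside the vocabulary) and prints a few prefix/target pairs.

```python
import string
import torch

vocabulary = string.ascii_lowercase + ' #'
data = "i am passionate about computational drug discovery"

# Target for position i is the character at position i + 1; the last target is '#'
targets = torch.tensor([vocabulary.index(ch) for ch in data[1:] + '#'])
decoded = ''.join(vocabulary[i] for i in targets.tolist())

print(decoded)
for prefix_len in (7, 11):
    # The target for a prefix of length n sits at index n - 1
    print(repr(data[:prefix_len]), '->', repr(decoded[prefix_len - 1]))
# 'i am pa' -> 's'
# 'i am passio' -> 'n'
```

The decoded string is simply the original sentence shifted left by one character, with `#` appended.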

Now, let’s build input sequence to train

In [22]:
#List to store input
inpl = []

#Iterate through all inputs in the sequence
for c in data:
    #Convert into tensor
    inpl.append(ltt(c))

#Convert list to tensor
inp = torch.cat(inpl, dim=0)

#Reshape tensor into 3 dimensions (sequence length, batches = 1, dimensions = n_vocabulary (28))
inp = inp.view(seq_len, 1, n_vocabulary)

### Let the training begin

In [23]:
#Let's note down start time to track the training time
import time
start = time.time()

#Number of iterations
n_iters = 150

for itr in range(n_iters):
    
    #Zero the previous gradients
    model.zero_grad()
    optimizer.zero_grad()
    
    #Initialize h and c vectors
    h = torch.rand(hidden_dim).view(1,1,hidden_dim)
    c = torch.rand(hidden_dim).view(1,1,hidden_dim)
    
    #Find the output
    output = model(inp,(h,c))
    
    #Reshape the output to 2 dimensions. This is done, so that we can compare with target and get loss
    output = output.view(seq_len,n_vocabulary)
    
    #Find loss
    loss = LOSS(output,targets)
    
    #Print loss for every 10th iteration
    if itr%10==0:
        print(itr,' ',(loss) )
        
    #Back propagate the loss
    loss.backward()
    
    #Perform the weight update
    optimizer.step()
    
print((time.time()-start))

0   tensor(3.3191, grad_fn=<NllLossBackward0>)
10   tensor(3.0017, grad_fn=<NllLossBackward0>)
20   tensor(2.6948, grad_fn=<NllLossBackward0>)
30   tensor(2.4703, grad_fn=<NllLossBackward0>)
40   tensor(2.3628, grad_fn=<NllLossBackward0>)
50   tensor(2.2722, grad_fn=<NllLossBackward0>)
60   tensor(2.2023, grad_fn=<NllLossBackward0>)
70   tensor(2.1528, grad_fn=<NllLossBackward0>)
80   tensor(2.1192, grad_fn=<NllLossBackward0>)
90   tensor(2.0845, grad_fn=<NllLossBackward0>)
100   tensor(2.0473, grad_fn=<NllLossBackward0>)
110   tensor(2.0167, grad_fn=<NllLossBackward0>)
120   tensor(2.1217, grad_fn=<NllLossBackward0>)
130   tensor(2.0452, grad_fn=<NllLossBackward0>)
140   tensor(2.0015, grad_fn=<NllLossBackward0>)
0.39878296852111816
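A side note on reading these loss values: the network's output is the LSTM's hidden state, and each element of that state is an output gate value multiplied by tanh of the cell state, so it always lies strictly between -1 and 1. With scores bounded like that and 28 classes, the cross-entropy cannot drop all the way to zero. The sketch below is a quick check of that bound, not part of the training run above. It evaluates the best possible case, where the correct class scores +1 and every other class scores -1.

```python
import math
import torch
import torch.nn as nn

n_vocabulary = 28

# Best case for bounded scores: correct class at +1, all 27 others at -1
logits = -torch.ones(1, n_vocabulary)
logits[0, 0] = 1.0
target = torch.tensor([0])

floor = nn.CrossEntropyLoss()(logits, target).item()
print(round(floor, 3))  # 1.538

# Same value in closed form: log(1 + 27 * e^-2)
print(round(math.log(1 + 27 * math.exp(-2)), 3))  # 1.538
```

So with this architecture a loss of around 2.0 is much closer to the achievable minimum than it first looks.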


A utility function that predicts the next letter, given a sequence:

In [24]:
#This utility method predicts the next letter given the sequence  

def predict(s):
    #Get the vector for input
    inp = getLine(s)
    
    #Initialize h and c vectors
    h = torch.rand(1,1,hidden_dim)
    c = torch.rand(1,1,hidden_dim)
    
    #Get the output
    out = model(inp,(h,c))
    
    #Find the corresponding letter from the output
    return vocabulary[out[-1][0].topk(1)[1].detach().numpy().item()]
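When debugging, it can help to look at more than the single best guess. The sketch below returns the top few candidate characters with their softmax probabilities. The helper `top_candidates` is hypothetical and takes a raw `nn.LSTM` (for the model above that would be `model.LSTM`). It also starts from zero hidden and cell states rather than random ones, which is an assumption on my part, since the model here was trained with random initial states. An untrained LSTM is used as a stand-in, so the printed candidates are meaningless.

```python
import string
import torch
import torch.nn as nn
import torch.nn.functional as F

vocabulary = string.ascii_lowercase + ' #'
n_vocabulary = len(vocabulary)

def top_candidates(lstm, s, k=3):
    """Hypothetical helper: the k most likely next characters with their probabilities."""
    indices = torch.tensor([vocabulary.index(ch) for ch in s])
    # Shape (len(s), 1, 28): sequence length, batch of 1, one-hot dimensions
    inp = F.one_hot(indices, num_classes=n_vocabulary).float().unsqueeze(1)
    with torch.no_grad():  # inference only, no gradients needed
        output, _ = lstm(inp)  # omitting (h, c) starts from zero states
    probs = torch.softmax(output[-1, 0], dim=0)
    values, idx = probs.topk(k)
    return [(vocabulary[i], round(p, 3)) for i, p in zip(idx.tolist(), values.tolist())]

# Stand-in, untrained LSTM used only to show the call
untrained = nn.LSTM(n_vocabulary, n_vocabulary)
print(top_candidates(untrained, 'i am pa'))
```

Using fixed initial states also makes the prediction deterministic, whereas `torch.rand` states can change the result from call to call.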

Perform a prediction:

In [25]:
predict('i am passionate abou')

't'

Given the input ‘i am passionate abou’, the model correctly predicts the next letter as ‘t’. It does not get every prefix right, though, as the generated sequence further down shows. :-)

The following function recursively generates the sequence from a given input:


In [26]:
#This method recursively generates the sequence using the trained model

def gen(s):
    #If generated sequence length is too large, or terminate char is generated, we can print the generated output so far
    if s[-1]=='#' or len(s)>=len(data)+5:
        print(s)
        return

    #Predict with sequence s
    pred = predict(s)
    
    #Continue prediction with sequence s + predicted value
    print(s+pred)
    
    #Recurse
    gen(s+pred)
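The same idea can also be written as a loop that returns the final string instead of printing every step. This is only a sketch: `generate` is a hypothetical name, and it takes the prediction function as an argument so that it can be tried out here with a stand-in predictor that already "knows" the sentence.

```python
def generate(seed, predict_fn, max_len, terminator='#'):
    """Hypothetical iterative counterpart of gen: returns the generated string."""
    s = seed
    # Stop once the terminator is produced or the sequence gets too long
    while s[-1] != terminator and len(s) < max_len:
        s += predict_fn(s)
    return s

# Stand-in predictor used only to exercise the loop; a trained model's
# predict function would be passed in instead
data = "i am passionate about computational drug discovery"
full = data + '#'

def perfect_predict(s):
    return full[len(s)] if full.startswith(s) and len(s) < len(full) else '#'

result = generate("i am pa", perfect_predict, max_len=len(data) + 5)
print(result)  # i am passionate about computational drug discovery#
```

A loop avoids deep recursion for long sequences, and returning the string makes the result easier to inspect or compare.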


In [27]:
gen("i am pa")

i am pas
i am pass
i am passi
i am passio
i am passion
i am passiona
i am passionaa
i am passionaaa
i am passionaaat
i am passionaaat 
i am passionaaat b
i am passionaaat bo
i am passionaaat bou
i am passionaaat bout
i am passionaaat bout 
i am passionaaat bout c
i am passionaaat bout co
i am passionaaat bout com
i am passionaaat bout comp
i am passionaaat bout compu
i am passionaaat bout comput
i am passionaaat bout computa
i am passionaaat bout computat
i am passionaaat bout computati
i am passionaaat bout computatio
i am passionaaat bout computation
i am passionaaat bout computationa
i am passionaaat bout computational
i am passionaaat bout computational 
i am passionaaat bout computational d
i am passionaaat bout computational dr
i am passionaaat bout computational dru
i am passionaaat bout computational drud
i am passionaaat bout computational drud 
i am passionaaat bout computational drud i
i am passionaaat bout computational drud io
i am passionaaat bout computational drud io 
i

Clearly, the model has overfitted: it was trained for many iterations on the same sentence again and again.

Important points to remember :

* With the Adam optimizer, the loss comes down quickly compared to SGD.

* I clearly ran into the instability of a vanilla RNN implementation: the model weights quickly turned into NaNs, which stopped the model from training any further. NaN weights are usually a symptom of exploding gradients rather than vanishing ones.

* Incorrect dimensions are a common problem during implementation. Take special care that the dimensions of all inputs are correct.

* Don’t forget to zero the existing gradient values with optimizer.zero_grad(). Otherwise the gradients accumulate across iterations, the model is not trained correctly, and it may never converge.
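
The last point is easy to see on a single scalar parameter. The sketch below is a toy example, unrelated to the LSTM above: it calls backward twice without clearing the gradient, then once more after clearing it.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
first = w.grad.item()    # d(3w)/dw = 3

(w * 3).backward()       # no zeroing in between: the new gradient is added on top
second = w.grad.item()   # 3 + 3 = 6

# Reset the stored gradient, as optimizer.zero_grad() does for each parameter
# (recent PyTorch versions set it to None instead of filling it with zeros)
w.grad.zero_()
(w * 3).backward()
third = w.grad.item()    # back to 3

print(first, second, third)  # 3.0 6.0 3.0
```

Without the reset, every optimizer step would use the sum of all gradients seen so far rather than the gradient of the current iteration.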