In [1]:
import time
import torch
import torch.nn as nn
import torchvision.datasets as dsets
import torchvision.transforms as transforms
from torch.autograd import Variable

In this lecture we will build a Long Short-Term Memory (LSTM) based neural network and train it on the same dataset as before, i.e. MNIST.

Let us import the training and test data from MNIST. 

In [2]:
train_dataset = dsets.MNIST(root='./data', 
                            train=True, 
                            transform=transforms.ToTensor(),  
                            download=True)

In [3]:
test_dataset = dsets.MNIST(root='./data', 
                           train=False, 
                           transform=transforms.ToTensor())

These will get saved in the $data$ folder (it should already be present from lecture 2), and their processed forms will be stored in the $data/processed$ folder.

The next step is setting up the hyperparameters.

In [4]:
# Number of time steps per image (no of rows, one row per step)
sequence_length = 28
# Number of features per time step (no of columns, i.e. pixels in one row)
input_size = 28
# Size of the hidden state of each LSTM layer
hidden_size = 128
# How many recurrent layers do we want
num_layers = 2
# 10 classes since 10 digits
num_classes = 10
# Depending on how powerful your machine is you can increase the batch size
batch_size = 100
# Let us train it for 5 epochs i.e. 5 full passes over the training data
num_epochs = 5
# Learning rate (step size) for the optimizer; lower it if training becomes unstable
learning_rate = 0.01

Now that we have procured the data and set the hyperparameters, we will set up the pipeline for the input as we did before.

In [5]:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

In [6]:
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)
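
To see what such a loader yields without downloading anything, here is a minimal sketch. The random tensors and the names `fake_images`, `fake_labels`, `fake_dataset` and `fake_loader` are stand-ins introduced only for illustration; they are not part of the lecture's pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 250 random "images" shaped like MNIST tensors
fake_images = torch.rand(250, 1, 28, 28)
fake_labels = torch.randint(0, 10, (250,))
fake_dataset = TensorDataset(fake_images, fake_labels)

fake_loader = DataLoader(dataset=fake_dataset, batch_size=100, shuffle=True)

# Each iteration returns one (images, labels) batch
batch_shapes = [tuple(images.shape) for images, labels in fake_loader]
print(len(fake_loader))   # 3
print(batch_shapes)
```

The last batch is smaller because 250 is not a multiple of 100. With MNIST and a batch size of 100 every batch is full.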

The data preparation phase is over: we have the data and a configured pipeline. Now we will create the LSTM model.

We will create a model with 2 stacked LSTM layers followed by a fully connected layer that maps the hidden state of the last time step to the 10 digit classes.

In [11]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # We create a 2 layer LSTM with input size of 28 (since each row has 28 pixels)
        # And a hidden layer size of 128
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        # Fully connected layer maps the last hidden state to the class scores
        self.fc = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        # Set initial hidden and cell states to zero: (num_layers, batch, hidden_size)
        h0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))
        c0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))
        
        # Forward propagate LSTM; out has shape (batch, seq_len, hidden_size)
        out, _ = self.lstm(x, (h0, c0))
        # Decode hidden state of last time step
        out = self.fc(out[:, -1, :])
        return out

Congratulations on creating your first LSTM model. Go ahead and experiment with the __learning rate__, __batch size__ and __number of epochs__ once you are done to understand how they affect your model's performance.

Here we pass each $28\times28$ image as $28$ rows with a length of $28$ each. Go ahead and try another approach.
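
The reshaping described above can be sketched as follows. The random `batch` is a stand-in for a batch from the loader, and the column-wise variant is only one possible "other approach", assumed here for illustration rather than taken from the lecture.

```python
import torch

sequence_length, input_size = 28, 28
batch = torch.rand(4, 1, 28, 28)   # stand-in for a batch of 4 MNIST images

# Drop the channel dimension: each image becomes 28 time steps of 28 pixels
rows_as_steps = batch.view(-1, sequence_length, input_size)

# One alternative (an assumption): feed the columns as time steps instead
cols_as_steps = rows_as_steps.transpose(1, 2)

print(tuple(rows_as_steps.shape), tuple(cols_as_steps.shape))
```

Time step `t` of `rows_as_steps` is row `t` of the image, while time step `t` of `cols_as_steps` is column `t`.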

In [8]:
lstm = LSTM(input_size, hidden_size, num_layers, num_classes)

Note that we used a fully connected layer as the output layer. It is applied to the hidden state that the top LSTM layer produces at the last time step.
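
To make the tensor shapes concrete, here is a small sketch using `nn.LSTM` directly with the lecture's sizes. The names `lstm_layer`, `fc` and `x` are illustrative, and the input is random.

```python
import torch
import torch.nn as nn

lstm_layer = nn.LSTM(input_size=28, hidden_size=128, num_layers=2, batch_first=True)
fc = nn.Linear(128, 10)

x = torch.rand(4, 28, 28)        # (batch, seq_len, input_size)
out, (hn, cn) = lstm_layer(x)    # initial states default to zeros when omitted

last_step = out[:, -1, :]        # top-layer hidden state at the last time step
scores = fc(last_step)           # one score per digit class

print(tuple(out.shape), tuple(hn.shape), tuple(scores.shape))
```

`nn.LSTM` returns a tuple, so the output sequence has to be unpacked before indexing. For a unidirectional LSTM, `out[:, -1, :]` matches `hn[-1]`, the final hidden state of the top layer.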

Now that we have our model we need to decide what loss function we want to use and the optimizer that will minimize our loss function. We choose the Cross Entropy (or log) loss function and the [Adam Optimizer](https://arxiv.org/abs/1412.6980v8).
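
As a rough illustration of what the cross entropy loss measures for a single sample, the following plain-Python sketch computes the negative log of the softmax probability of the true class. The function name `cross_entropy` and the example scores are made up for illustration. `nn.CrossEntropyLoss` takes raw scores in the same way and, by default, averages this quantity over the batch.

```python
import math

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true class."""
    top = max(logits)
    shifted = [z - top for z in logits]   # shift for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[label]

logits = [2.0, 0.5, -1.0]                 # made-up scores for 3 classes
loss_correct = cross_entropy(logits, 0)   # true class has the highest score
loss_wrong = cross_entropy(logits, 2)     # true class has the lowest score
print(loss_correct, loss_wrong)
```

The loss is small when the model assigns a high score to the true class and grows as that score falls behind the others.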

In [9]:
criterion = nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(lstm.parameters(), lr=learning_rate)

We now train the model, i.e. we iterate over the training data and pass it to the model. For each batch we calculate the loss, backpropagate and let the optimizer update the weights to reduce the loss.

A common practice while training is to print the status and the value of the loss after every $n$ steps. Here we choose $100$. 

Let us get an estimate of how long this process takes using the $time$ module.

In [10]:
start = time.time()
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Convert torch tensor to Variable for calculations
        images = Variable(images.view(-1, sequence_length, input_size))
        labels = Variable(labels)
        
        # Forward
        optimizer.zero_grad()  # zero the gradient buffer
        outputs = lstm(images)
        loss = criterion(outputs, labels)
        # Backward
        loss.backward()
        # Optimize
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print('Epoch [%d/%d], Iteration [%d/%d], Loss: %.4f' 
                   %(epoch+1, num_epochs, i+1, len(train_dataset)//batch_size, loss.item()))

end = time.time()
print(end - start)

Epoch [1/5], Iteration [100/600], Loss: 0.5682
Epoch [1/5], Iteration [200/600], Loss: 0.2132
Epoch [1/5], Iteration [300/600], Loss: 0.3123
Epoch [1/5], Iteration [400/600], Loss: 0.1894
Epoch [1/5], Iteration [500/600], Loss: 0.2319
Epoch [1/5], Iteration [600/600], Loss: 0.0868
Epoch [2/5], Iteration [100/600], Loss: 0.1751
Epoch [2/5], Iteration [200/600], Loss: 0.0617
Epoch [2/5], Iteration [300/600], Loss: 0.1325
Epoch [2/5], Iteration [400/600], Loss: 0.0404
Epoch [2/5], Iteration [500/600], Loss: 0.0960
Epoch [2/5], Iteration [600/600], Loss: 0.0485
Epoch [3/5], Iteration [100/600], Loss: 0.0653
Epoch [3/5], Iteration [200/600], Loss: 0.0883
Epoch [3/5], Iteration [300/600], Loss: 0.0746
Epoch [3/5], Iteration [400/600], Loss: 0.1601
Epoch [3/5], Iteration [500/600], Loss: 0.1081
Epoch [3/5], Iteration [600/600], Loss: 0.0699
Epoch [4/5], Iteration [100/600], Loss: 0.0679
Epoch [4/5], Iteration [200/600], Loss: 0.0089
Epoch [4/5], Iteration [300/600], Loss: 0.0421
Epoch [4/5], 

After $5$ epochs we get a loss of $0.09$ and the overall training time was $450$ seconds on the CPU. 

Note: Number of iterations per epoch = size_of_dataset / batch_size, here $60000 / 100 = 600$.
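
The arithmetic behind the iteration counter in the log can be checked in a few lines. The variable names are illustrative; the numbers are the MNIST training set size and the hyperparameters chosen earlier.

```python
train_size = 60000   # number of MNIST training images
batch_size = 100
num_epochs = 5

iterations_per_epoch = train_size // batch_size
total_updates = iterations_per_epoch * num_epochs
print(iterations_per_epoch, total_updates)   # 600 3000
```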

Congratulations on training your first LSTM model with an image dataset. Now go ahead and test the model's performance, then repeat this process after changing the hyperparameters.

In [12]:
lstm.eval()
correct = 0
total = 0
for images, labels in test_loader:
    images = Variable(images.view(-1, sequence_length, input_size))
    outputs = lstm(images)
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum().item()
    
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

Accuracy of the network on the 10000 test images: 98 %
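
The accuracy computation can be traced on a tiny hand-made example. The scores and labels below are invented for illustration and have 3 classes instead of 10.

```python
import torch

outputs = torch.tensor([[0.1, 2.0, -1.0],
                        [1.5, 0.2, 0.3],
                        [0.0, 0.1, 3.0],
                        [0.9, 0.8, 0.7]])
labels = torch.tensor([1, 0, 2, 1])

# torch.max along dim 1 returns (max values, their indices); the index is the predicted class
_, predicted = torch.max(outputs, 1)
correct = (predicted == labels).sum().item()
accuracy = 100.0 * correct / labels.size(0)
print(predicted.tolist(), accuracy)   # [1, 0, 2, 0] 75.0
```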


When you are finally satisfied with your results you can even save your model for future use as below.

In [14]:
torch.save(lstm.state_dict(), 'mnist_lstm.pkl')
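
Loading the weights back later could look like the sketch below. It uses a small stand-in `nn.LSTM` and a temporary file named `demo_lstm.pkl`, both assumptions made so the example runs on its own. For the model from this lecture you would instead recreate `LSTM(input_size, hidden_size, num_layers, num_classes)` and load `mnist_lstm.pkl`.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Stand-in model (assumption): a small LSTM instead of the lecture's class
model = nn.LSTM(input_size=28, hidden_size=16, num_layers=2, batch_first=True)
path = os.path.join(tempfile.mkdtemp(), 'demo_lstm.pkl')   # hypothetical file name
torch.save(model.state_dict(), path)

# Recreate the same architecture, then load the saved weights into it
restored = nn.LSTM(input_size=28, hidden_size=16, num_layers=2, batch_first=True)
restored.load_state_dict(torch.load(path))
restored.eval()

x = torch.rand(3, 28, 28)
with torch.no_grad():
    same = torch.allclose(model(x)[0], restored(x)[0])
print(same)
```

Since only the `state_dict` is saved, the architecture has to be constructed with the same sizes before the weights are loaded.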