# MNIST Classification using RNNs and PyTorch

## Advanced topics on machine learning <br>

Mauricio A. Álvarez, PhD <br>
Cristian Guarnizo, PhD, and Hernán F. García, PhD (c) - TA

Jupyter notebook for the course of Bayesian Machine Learning, section: Introduction to Recurrent Neural Networks. Universidad Tecnológica de Pereira.

In this example we use PyTorch 1.1. If this version is already installed on your machine, you can skip this step. However, if you are running this code in a Google Colab session, you need to execute the next line.

In [None]:
!pip install -q torch==1.1.0 torchvision

As usual, we start by importing all the libraries required for this example.

In [None]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets

## Loading MNIST Dataset

In [None]:
train_dataset = dsets.MNIST(root='./data', 
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data', 
                           train=False, 
                           transform=transforms.ToTensor())

## Dataset set-up

We define the number of iterations, the number of epochs and the batch size that will be used during the training phase.

In [None]:
batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)
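
As a quick check of the arithmetic above, the following sketch recomputes the number of epochs from the iteration budget. It assumes the standard MNIST training split of 60,000 images instead of reading the size from the dataset object.

```python
# Assumption: the MNIST training split contains 60,000 images.
train_size = 60000
batch_size = 100
n_iters = 3000

# One epoch is one full pass over the training set.
iters_per_epoch = train_size // batch_size  # parameter updates per epoch
num_epochs = n_iters // iters_per_epoch     # full passes needed for n_iters updates

print(iters_per_epoch, num_epochs)  # 600 5
```

With these settings, the accuracy check that runs every 500 iterations is triggered six times over the whole run.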

## Create the model based on RNNs

In [None]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTMModel, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim

        # Number of hidden layers
        self.layer_dim = layer_dim

        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)

        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Initialize hidden state with zeros, on the same device as the input
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)

        # Initialize cell state with zeros
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)

        # Run the LSTM over the whole sequence (all 28 time steps).
        # The initial states are constants, so no gradient flows into them.
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> 100, 10
        return out
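
To make the tensor shapes in the forward pass concrete, the sketch below runs a standalone nn.LSTM on a random batch. The random input is only a stand-in for a batch of images, and the sizes are copied from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same sizes as in this notebook: 28 rows of 28 pixels, 100 hidden units.
batch, seq_dim, input_dim, hidden_dim = 100, 28, 28, 100
lstm = nn.LSTM(input_dim, hidden_dim, num_layers=1, batch_first=True)

x = torch.randn(batch, seq_dim, input_dim)  # random stand-in for a batch of images
out, (hn, cn) = lstm(x)                     # initial states default to zeros

print(out.shape)  # torch.Size([100, 28, 100]): one hidden vector per time step
print(hn.shape)   # torch.Size([1, 100, 100]): final hidden state of each layer

# The last time step of `out` is the final hidden state of the top layer.
same = torch.allclose(out[:, -1, :], hn[-1])
print(same)
```

The readout layer then maps the 100 hidden units of the last time step to the 10 class scores.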

## Initialize the RNN model

Here, we define the dimensions of our RNN. There are 10 classes, and each image row contains 28 pixels, so every time step receives a 28-dimensional input.

In [None]:
input_dim = 28
hidden_dim = 100
layer_dim = 1
output_dim = 10

model = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim)

### Question: 
If our input $x_t$ is a row from the image, how many rows do we need to process in order to evaluate the error function? <br>
Analyze the size of the hidden-to-hidden weights when the number of layers is 1 and 2. You can access them through model.lstm.weight_hh_l0 (where l0 indicates the first layer).
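
One possible way to carry out this inspection is to list the weight tensors of freshly built LSTMs with one and two layers. This is a minimal sketch that uses standalone nn.LSTM modules with the sizes from this notebook; the dictionary name shapes is only for illustration.

```python
import torch.nn as nn

input_dim, hidden_dim = 28, 100  # same sizes as in this notebook

shapes = {}  # hypothetical helper dict: (number of layers, parameter name) -> shape
for layer_dim in (1, 2):
    lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
    for name, param in lstm.named_parameters():
        if name.startswith("weight"):
            shapes[(layer_dim, name)] = tuple(param.shape)
            print(layer_dim, name, tuple(param.shape))
```

When reading the output, keep in mind that PyTorch stacks the matrices of the four LSTM gates along the first dimension of each weight tensor.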

Next, we let PyTorch use the GPU if one is available and fall back to the CPU otherwise.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

## Setup Training process

First, we define the Error function as Cross-Entropy because we are dealing with a classification task.

In [None]:
criterion = nn.CrossEntropyLoss()
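
Note that nn.CrossEntropyLoss expects raw scores (logits) and applies the log-softmax internally, which is why the model has no softmax layer. The sketch below checks this on hypothetical logits and labels that are not taken from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # hypothetical raw scores: 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])  # hypothetical integer class labels

# Built-in criterion applied directly to the logits.
loss = nn.CrossEntropyLoss()(logits, labels)

# Same quantity computed in two steps: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(loss.item(), manual.item())
```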

We use stochastic gradient descent (SGD) with a learning rate of 0.1 as our optimizer.

In [None]:
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
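
For intuition, a plain SGD step without momentum replaces each parameter $w$ by $w - \eta \nabla_w L$, where $\eta$ is the learning rate. The sketch below verifies this on a toy parameter vector and a toy loss, both chosen only for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # hypothetical parameter vector
learning_rate = 0.1
optimizer = torch.optim.SGD([w], lr=learning_rate)

loss = (w ** 2).sum()  # toy loss whose gradient is 2 * w
loss.backward()

# Update computed by hand: w - lr * grad
expected = w.detach() - learning_rate * w.grad
optimizer.step()

print(w.detach(), expected)  # both are approximately [0.8, -1.6]
```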

## Train the model

Following the answer to the question above, we need to process 28 rows (one per time step) in order to cover a sample (an image).

In [None]:
# Number of steps to unroll
seq_dim = 28  
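
The data loader yields image batches of shape (batch, 1, 28, 28), while the LSTM expects (batch, seq_dim, input_dim). The sketch below, which uses a synthetic stand-in batch of two images, shows that a view is enough to turn each image into a sequence of rows.

```python
import torch

seq_dim, input_dim = 28, 28

# Stand-in batch shaped like the loader output: (batch, channels, height, width).
images = torch.arange(2 * 1 * 28 * 28, dtype=torch.float32).view(2, 1, 28, 28)

# Drop the channel dimension: each image becomes 28 time steps of 28 pixels.
sequences = images.view(-1, seq_dim, input_dim)
print(sequences.shape)  # torch.Size([2, 28, 28])

# Time step t of sample n is row t of image n.
rows_match = torch.equal(sequences[0, 5], images[0, 0, 5])
print(rows_match)
```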

Here, we define the training loop and evaluate the accuracy on the test set every 500 iterations.

In [None]:
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Reshape each image into a sequence of 28 rows with 28 pixels each
        images = images.view(-1, seq_dim, input_dim).to(device)
        labels = labels.to(device)

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        # outputs.size() --> 100, 10
        outputs = model(images)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        iter += 1
        # Every 500 iterations the accuracy is evaluated
        if iter % 500 == 0:
            # Calculate accuracy on the test set
            correct = 0
            total = 0
            # No gradients are needed for evaluation
            with torch.no_grad():
                # Iterate through test dataset
                for images, labels in test_loader:
                    images = images.view(-1, seq_dim, input_dim).to(device)
                    labels = labels.to(device)

                    # Forward pass only to get logits/output
                    outputs = model(images)

                    # Get predictions from the maximum value
                    _, predicted = torch.max(outputs, 1)

                    # Total number of labels
                    total += labels.size(0)

                    # Total correct predictions (both tensors are on the same device)
                    correct += (predicted == labels).sum().item()

            # Float division, independent of the PyTorch version
            accuracy = 100.0 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))
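
Since the same evaluation is needed for every model in the exercises, it can be convenient to factor it into a helper. The sketch below is one possible version and is not part of the original notebook: the function name evaluate and the AlwaysZero test model are hypothetical, and the loader is assumed to yield inputs that already have the shape the model expects.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, device):
    """Return the accuracy in percent over all batches in `loader`."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no autograd graph is needed for evaluation
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    model.train()
    return 100.0 * correct / total

# Tiny self-check with a hypothetical model that always predicts class 0.
class AlwaysZero(nn.Module):
    def forward(self, x):
        scores = torch.zeros(x.size(0), 10)
        scores[:, 0] = 1.0
        return scores

# A list of (inputs, labels) pairs stands in for a DataLoader here.
loader = [(torch.zeros(4, 28, 28), torch.tensor([0, 0, 1, 2]))]
accuracy = evaluate(AlwaysZero(), loader, torch.device("cpu"))
print(accuracy)  # 50.0
```

With the loaders of this notebook, the images would first have to be reshaped with view(-1, seq_dim, input_dim), as in the training loop.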

## Exercises

1. Evaluate the model using one, two and three layers and save each result. <br>
Note: Remember that GRU and RNN models don't have a cell state $C_t$. ["RNNs in PyTorch"](https://pytorch.org/docs/stable/nn.html#recurrent-layers) <br>
2. Repeat step 1 using a vanilla RNN model with ReLU and tanh activation functions. (See the example below) <br>
3. Repeat step 1 using a GRU model (nn.GRU). <br>
4. Compare the results obtained for all models.

In [None]:
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim

        # Number of hidden layers
        self.layer_dim = layer_dim

        # Building your RNN
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, feature_dim)
        # nonlinearity can be 'tanh' or 'relu'
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')

        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Initialize hidden state with zeros, on the same device as the input
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)

        # Run the RNN over the whole sequence (all 28 time steps).
        # The initial state is a constant, so no gradient flows into it.
        out, hn = self.rnn(x, h0)

        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> 100, 10
        return out

layer_dim = 3
model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  
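
When comparing the three recurrent layers, it may help to look at what each one returns and how many parameters it has. The sketch below uses standalone single-layer modules with the sizes from this notebook and a random batch of five sequences; the helper n_params is hypothetical.

```python
import torch
import torch.nn as nn

x = torch.randn(5, 28, 28)  # hypothetical batch: 5 sequences of 28 rows

lstm = nn.LSTM(28, 100, 1, batch_first=True)
gru = nn.GRU(28, 100, 1, batch_first=True)
rnn = nn.RNN(28, 100, 1, batch_first=True, nonlinearity='relu')

out_lstm, (hn_lstm, cn_lstm) = lstm(x)  # LSTM returns hidden and cell state
out_gru, hn_gru = gru(x)                # GRU returns only the hidden state
out_rnn, hn_rnn = rnn(x)                # so does the vanilla RNN

def n_params(module):
    """Count all trainable and non-trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters())

print(n_params(rnn), n_params(gru), n_params(lstm))
```

For equal sizes, the GRU has three times and the LSTM four times as many parameters as the vanilla RNN, which is worth keeping in mind when interpreting accuracy differences.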

In the cell below, you can code the GRUModel.