## Model B: 2 Hidden Layers (w. ReLU)
* Unroll 28 time steps
    * Each step input size 28 x 1
    * Total per unroll: 28 x 28
        * Feedforward Neural Network input size: 28 x 28
* **2 Hidden Layers**
* ReLU activation function
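
To make the unrolling concrete, here is a minimal sketch of how one batch of 28 x 28 images becomes 28 time steps of 28 features each. The all-zeros `batch` tensor is a hypothetical stand-in for a real MNIST batch; `seq_dim` and `input_dim` are the names used later in this notebook.

```python
import torch

# Hypothetical batch shaped like MNIST images: (batch, channel, height, width)
batch = torch.zeros(100, 1, 28, 28)

seq_dim = 28    # number of time steps (one per image row)
input_dim = 28  # features per time step (pixels in one row)

# Same reshape the training loop applies before calling the model
sequences = batch.view(-1, seq_dim, input_dim)
print(sequences.shape)   # torch.Size([100, 28, 28])

# Time step 0 of every sequence is the first row of the corresponding image
first_step = sequences[:, 0, :]
print(first_step.shape)  # torch.Size([100, 28])
```

With `batch_first=True`, this `(batch, seq, feature)` layout is the shape `nn.RNN` expects.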

### Steps
* STEP 1: Load dataset
* STEP 2: Make dataset iterable
* STEP 3: Create model class
* **STEP 4: Instantiate model class**
* STEP 5: Instantiate loss class
* STEP 6: Instantiate optimizer class
* STEP 7: Train Model! 
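
For STEP 2, the notebook derives the number of epochs from a target iteration count rather than setting it directly. A quick sketch of that arithmetic, assuming the standard MNIST training split of 60,000 images (`train_size` is a name introduced here for illustration):

```python
train_size = 60000  # assumption: size of the standard MNIST training split
batch_size = 100
n_iters = 3000

# One epoch is one full pass over the training set
iters_per_epoch = train_size / batch_size
num_epochs = int(n_iters / iters_per_epoch)

print(iters_per_epoch)  # 600.0
print(num_epochs)       # 5
```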

In [1]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [2]:
'''
STEP 1: Load Dataset
'''
train_dataset = dsets.MNIST(root="./data",
                            train=True,
                            transform=transforms.ToTensor(), 
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
STEP 2: Make Dataset Iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = int(n_iters / (len(train_dataset)/batch_size))

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

'''
STEP 3: Create Model Class
'''

class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        # Hidden Dimensions
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # Building your RNN
        # batch_first=True causes input/output Tensors to be of shape
        # (batch_dim, seq_dim, input_dim)
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='relu')
        
        # Readout Layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Initialize hidden state with zeros
        # (layer_dim, batch_size, hidden_dim)
        h0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # Run the RNN over the whole sequence (all 28 time steps)
        out, hn = self.rnn(x, h0)
        
        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> 100, 10
        return out

'''
STEP 4: Instantiate Model Class
'''

input_dim = 28
hidden_dim = 100
layer_dim = 2   # This is the only change to add another hidden layer to the model! Simple! 
output_dim = 10

model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

# Printing model and parameters to illustrate the difference with our 2 hidden layer neural network

print(model)
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())
    
'''
STEP 5: Instantiate Loss Class
'''

criterion = nn.CrossEntropyLoss()

'''
STEP 6: Instantiate Optimizer Class
'''

learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

'''
STEP 7: Train the Model
'''

# Number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
    
        images = Variable(images.view(-1, seq_dim, input_dim)) 
        labels = Variable(labels)
        
        #Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        #Forward pass to get outputs / logits
        # outputs.size() --> (100, 10)
        outputs = model(images)
        
        #Calculate Loss: softmax --> Cross Entropy Loss
        loss = criterion(outputs, labels)
        
        #Get gradients w.r.t. parameters
        loss.backward()
        
        #Update parameters
        optimizer.step()
        
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0 
            total = 0
            #Iterate through the test dataset
            for images, labels in test_loader:
                # Load images to Torch Variable
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get outputs/logits
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                # .item() converts the count to a Python number on current PyTorch
                correct += (predicted == labels).sum().item()
                    
            accuracy = 100 * correct / total
            
            # Print Loss
            # loss.item() replaces loss.data[0], which fails on 0-dim tensors in current PyTorch
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

RNNModel (
  (rnn): RNN(28, 100, num_layers=2, batch_first=True)
  (fc): Linear (100 -> 10)
)
10
torch.Size([100, 28])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([100, 100])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([10, 100])
torch.Size([10])
Iteration: 500. Loss: 1.021502137184143. Accuracy: 62.14
Iteration: 1000. Loss: 0.6722710132598877. Accuracy: 73.9
Iteration: 1500. Loss: 0.24204854667186737. Accuracy: 91.98
Iteration: 2000. Loss: 0.31326016783714294. Accuracy: 93.07
Iteration: 2500. Loss: 2.313302993774414. Accuracy: 11.35
Iteration: 3000. Loss: 2.301129102706909. Accuracy: 11.35
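
A quick sanity check on the last two lines of output: the numbers look like what a model produces when it ignores its input. The sketch below is an interpretation, not a diagnosis, and it assumes the standard MNIST test split, which contains 1,135 images of the digit 1 out of 10,000.

```python
import math

num_classes = 10

# Cross-entropy of a uniform prediction over 10 classes: -log(1/10) = ln(10)
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 4))  # 2.3026

# Assumption: 1,135 of the 10,000 MNIST test images are the digit 1,
# so always predicting "1" would score exactly this accuracy
constant_guess_accuracy = 100 * 1135 / 10000
print(constant_guess_accuracy)  # 11.35
```

Both values line up with the losses (about 2.30) and accuracy (11.35) printed at iterations 2500 and 3000, which is consistent with the network having collapsed to a near-constant output.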


**Note:** On an earlier run the loss came out as NaN, even though the code was copied over from the previous model. I suspect stale variables reused across cells were the cause: after shutting down and restarting the notebook, the results mostly mirror the instructor's. One oddity remains. At iteration 2500 the run collapsed from 93.07% accuracy to 11.35%. I'm moving on for now, but will come back later to figure out why the loss balloons.
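
One common guard against this kind of sudden collapse in recurrent networks is gradient clipping. This is not part of the original notebook and has not been verified as the fix here; the sketch below only shows where such a call would sit in a training step. The random batch and the `max_norm` value are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the notebook's model: same layer types, toy random data
rnn = nn.RNN(28, 100, 2, batch_first=True, nonlinearity='relu')
fc = nn.Linear(100, 10)
params = list(rnn.parameters()) + list(fc.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

images = torch.randn(100, 28, 28)       # hypothetical batch, not MNIST
labels = torch.randint(0, 10, (100,))

optimizer.zero_grad()
out, _ = rnn(images)                    # hidden state defaults to zeros
loss = criterion(fc(out[:, -1, :]), labels)
loss.backward()

# Rescale all gradients so their global L2 norm is at most max_norm
max_norm = 1.0  # assumption: illustrative value, not tuned
norm_before = torch.nn.utils.clip_grad_norm_(params, max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in params))

optimizer.step()
print(float(norm_before), float(norm_after))
```

The clipping call goes between `loss.backward()` and `optimizer.step()`. A smaller learning rate than 0.1 would be another thing to try.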

* **10 sets of parameters** 
* First Hidden Layer
    * $A_1 = [100,28]$
    * $A_3 = [100,100]$
    * $B_1 = [100]$
    * $B_3 = [100]$
* Second Hidden Layer
    * $A_2 = [100,100]$
    * $A_5 = [100,100]$
    * $B_2 = [100]$
    * $B_5 = [100]$
* Readout Layer
    * $A_6 = [10,100]$
    * $B_6 = [10]$
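
The ten parameter sets can also be listed by the names PyTorch gives them, which makes it easier to see which tensor belongs to which layer. A minimal sketch that rebuilds the same two modules outside the `RNNModel` class:

```python
import torch.nn as nn

# Same layer configuration as the model above
rnn = nn.RNN(28, 100, 2, batch_first=True, nonlinearity='relu')
fc = nn.Linear(100, 10)

# weight_ih_l0 / weight_hh_l0: input-to-hidden and hidden-to-hidden, layer 0
# weight_ih_l1 / weight_hh_l1: the same pair for layer 1
for name, param in rnn.named_parameters():
    print(name, tuple(param.shape))

# Readout layer
for name, param in fc.named_parameters():
    print('fc.' + name, tuple(param.shape))
```

The printed order matches the `torch.Size` lines in the output above: four tensors for the first hidden layer, four for the second, and two for the readout.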

## Model C: 2 Hidden Layers (w. Tanh)
* Unroll 28 time steps
    * Each step input size 28 x 1
    * Total per unroll: 28 x 28
        * Feedforward Neural Network input size: 28 x 28
* **2 Hidden Layers**
* Tanh activation function

### Steps
* STEP 1: Load dataset
* STEP 2: Make dataset iterable
* **STEP 3: Create model class**
* STEP 4: Instantiate model class
* STEP 5: Instantiate loss class
* STEP 6: Instantiate optimizer class
* STEP 7: Train Model! 

In [3]:
'''
STEP 3: Create Model Class
'''

class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        # Hidden Dimensions
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # Building your RNN
        # batch_first=True causes input/output Tensors to be of shape
        # (batch_dim, seq_dim, input_dim)
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')
        
        # Readout Layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Initialize hidden state with zeros
        # (layer_dim, batch_size, hidden_dim)
        h0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # Run the RNN over the whole sequence (all 28 time steps)
        out, hn = self.rnn(x, h0)
        
        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> 100, 10
        return out

'''
STEP 4: Instantiate Model Class
'''

input_dim = 28
hidden_dim = 100
layer_dim = 2   # This is the only change to add another hidden layer to the model! Simple! 
output_dim = 10

model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

# Printing model and parameters to illustrate the difference with our 2 hidden layer neural network

print(model)
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())
    
'''
STEP 5: Instantiate Loss Class
'''

criterion = nn.CrossEntropyLoss()

'''
STEP 6: Instantiate Optimizer Class
'''

learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

'''
STEP 7: Train the Model
'''

# Number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
    
        images = Variable(images.view(-1, seq_dim, input_dim)) 
        labels = Variable(labels)
        
        #Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        #Forward pass to get outputs / logits
        # outputs.size() --> (100, 10)
        outputs = model(images)
        
        #Calculate Loss: softmax --> Cross Entropy Loss
        loss = criterion(outputs, labels)
        
        #Get gradients w.r.t. parameters
        loss.backward()
        
        #Update parameters
        optimizer.step()
        
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0 
            total = 0
            #Iterate through the test dataset
            for images, labels in test_loader:
                # Load images to Torch Variable
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get outputs/logits
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                # .item() converts the count to a Python number on current PyTorch
                correct += (predicted == labels).sum().item()
                    
            accuracy = 100 * correct / total
            
            # Print Loss
            # loss.item() replaces loss.data[0], which fails on 0-dim tensors in current PyTorch
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

RNNModel (
  (rnn): RNN(28, 100, num_layers=2, batch_first=True)
  (fc): Linear (100 -> 10)
)
10
torch.Size([100, 28])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([100, 100])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([10, 100])
torch.Size([10])
Iteration: 500. Loss: 0.4070070683956146. Accuracy: 84.93
Iteration: 1000. Loss: 0.3393940031528473. Accuracy: 91.68
Iteration: 1500. Loss: 0.2780081629753113. Accuracy: 92.05
Iteration: 2000. Loss: 0.17639566957950592. Accuracy: 94.96
Iteration: 2500. Loss: 0.111669160425663. Accuracy: 95.99
Iteration: 3000. Loss: 0.101287841796875. Accuracy: 95.93
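
One difference between the two activations that may matter for stability: tanh keeps every hidden value inside [-1, 1], while ReLU has no upper bound, so a recurrent state can grow across time steps. The scalar recurrence below is a toy illustration only; the weight 1.5 and input 1.0 are made-up values, not taken from the trained models.

```python
import math

def relu(z):
    return max(0.0, z)

def unroll(activation, weight=1.5, x=1.0, steps=28):
    """Apply h = activation(weight * h + x) for a fixed number of steps."""
    h = 0.0
    for _ in range(steps):
        h = activation(weight * h + x)
    return h

h_relu = unroll(relu)       # grows geometrically with a weight above 1
h_tanh = unroll(math.tanh)  # stays bounded by 1 in magnitude

print(h_relu > 1e4)         # True
print(abs(h_tanh) <= 1.0)   # True
```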


### Summary of Results 

**Model A** | **Model B** | **Model C**
--- | --- | ---
ReLU | ReLU | Tanh
1 Hidden Layer | 2 Hidden Layers | 2 Hidden Layers
100 Hidden Units | 100 Hidden Units | 100 Hidden Units
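~92.2% | ~95.0% | ~95.9%

Note: the Model B run recorded above peaked at 93.07% (iteration 2000) before collapsing to 11.35%, so that run does not reproduce the ~95.0% figure.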

### Deep Learning

* 2 ways to expand a recurrent neural network
    * More non-linear activation units (neurons)
    * More hidden layers
* Cons
    * Need a larger dataset
        * Curse of dimensionality
    * Does not necessarily mean higher accuracy
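
Both ways of expanding the network increase the number of parameters to fit, which is one reason a larger dataset may be needed. A small sketch of the count for the architecture used in this notebook (stacked `nn.RNN` plus a linear readout); `rnn_param_count` is a helper written for this illustration, not a PyTorch function.

```python
def rnn_param_count(input_dim, hidden_dim, layer_dim, output_dim):
    """Count parameters of a stacked vanilla RNN followed by a linear readout."""
    total = 0
    for layer in range(layer_dim):
        in_features = input_dim if layer == 0 else hidden_dim
        total += hidden_dim * in_features  # input-to-hidden weights
        total += hidden_dim * hidden_dim   # hidden-to-hidden weights
        total += 2 * hidden_dim            # two bias vectors per layer
    total += output_dim * hidden_dim + output_dim  # readout weights and bias
    return total

print(rnn_param_count(28, 100, 1, 10))  # 14010  (1 hidden layer, 100 units)
print(rnn_param_count(28, 100, 2, 10))  # 34210  (2 hidden layers, 100 units)
print(rnn_param_count(28, 200, 1, 10))  # 48010  (1 hidden layer, 200 units)
```

Doubling the hidden units grows the hidden-to-hidden matrix quadratically, while adding a layer adds a fixed block of about `2 * hidden_dim**2` weights.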