# 8. Long Short-Term Memory (LSTM) network with PyTorch
## 1. About LSTMs: A Special Kind of RNN
* Capable of learning long-term dependencies
* LSTM = RNN on super-juice (haha!)

[graphic illustrating LSTM network]

The explanation is pretty involved, but reading other material helped me follow the instructor's walkthrough.

**Ooh, now a video on the math behind LSTMs. Yowza!**
A lot to digest, but the idea is that an LSTM processes a sequence one step at a time: at each step it scales the information carried over from the previous state (forgetting some of it), brings in some new information from the current input, and passes the result on as the new state.
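
To make the "forget some, bring in some" idea concrete, here is a minimal sketch of a single LSTM time step written out by hand and checked against `nn.LSTMCell`. The helper name `manual_lstm_step` and the toy batch size are my own choices, not from the course; the gate order (input, forget, candidate, output) follows how PyTorch stacks its gate weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def manual_lstm_step(x, h, c, cell):
    """One LSTM step by hand, using the weights stored in an nn.LSTMCell."""
    # All four gates come out of one matrix multiply and are then split apart.
    gates = (x @ cell.weight_ih.t() + cell.bias_ih
             + h @ cell.weight_hh.t() + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)   # input gate: how much new information to let in
    f = torch.sigmoid(f)   # forget gate: how much of the old cell state to keep
    g = torch.tanh(g)      # candidate values to add to the cell state
    o = torch.sigmoid(o)   # output gate: how much of the cell state to expose
    c_next = f * c + i * g             # forget some old info, bring in some new
    h_next = o * torch.tanh(c_next)    # new hidden state
    return h_next, c_next

cell = nn.LSTMCell(input_size=28, hidden_size=100)
x = torch.randn(5, 28)     # 5 samples, one 28-pixel row each
h = torch.randn(5, 100)    # previous hidden state
c = torch.randn(5, 100)    # previous cell state

h_manual, c_manual = manual_lstm_step(x, h, c, cell)
h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_manual, h_ref, atol=1e-5),
      torch.allclose(c_manual, c_ref, atol=1e-5))
```

If both comparisons agree, the hand-written step and PyTorch's cell are computing the same update.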

## 2. Building an LSTM with PyTorch
### Model A: 1 Hidden Layer
* Unroll 28 time steps
    * Each step input size: 28 x 1
    * Total per unroll: 28 x 28
        * Feedforward neural network input size: 28 x 28
* 1 Hidden Layer
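
A small sketch of what "unroll 28 time steps" means for the data, assuming a random tensor as a stand-in for one MNIST batch (no download needed): each image row becomes the input for one time step.

```python
import torch

batch_size, seq_dim, input_dim = 100, 28, 28

# Stand-in for one MNIST batch as the DataLoader yields it: (batch, channel, height, width)
images = torch.rand(batch_size, 1, 28, 28)

# Treat each image row as one time step: (batch, seq_dim, input_dim)
sequences = images.view(-1, seq_dim, input_dim)

# The input at time step 0 is the first pixel row of every image in the batch
step_0 = sequences[:, 0, :]
print(sequences.shape, step_0.shape)
```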

### Steps
1. Load Dataset
2. Make Dataset iterable
3. Create model class
4. Instantiate model class
5. Instantiate loss class
6. Instantiate optimizer class
7. Train model!

In [1]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

### Step 1: Load Dataset

In [2]:
train_dataset = dsets.MNIST(root="./data",
                            train=True,
                            transform=transforms.ToTensor(), 
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

In [3]:
# .data is the current name for the image tensor (.train_data is deprecated)
print(train_dataset.data.size())

torch.Size([60000, 28, 28])


In [4]:
# .targets is the current name for the label tensor (.train_labels is deprecated)
print(train_dataset.targets.size())

torch.Size([60000])


In [5]:
# .data is the current name for the image tensor (.test_data is deprecated)
print(test_dataset.data.size())

torch.Size([10000, 28, 28])


In [6]:
# .targets is the current name for the label tensor (.test_labels is deprecated)
print(test_dataset.targets.size())

torch.Size([10000])


### Step 2: Make Dataset iterable

In [7]:
batch_size = 100
n_iters = 3000
num_epochs = int(n_iters / (len(train_dataset)/batch_size))

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
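
The epoch count above falls out of simple arithmetic. A quick sketch with the MNIST training-set size written in by hand (so it runs without the dataset):

```python
train_size = 60000   # number of MNIST training images
batch_size = 100
n_iters = 3000

iters_per_epoch = train_size / batch_size      # parameter updates in one pass over the data
num_epochs = int(n_iters / iters_per_epoch)    # full passes needed to reach n_iters updates
print(iters_per_epoch, num_epochs)
```

So 3000 iterations at batch size 100 means 5 passes over the 60,000 training images.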

### Step 3: Create model class

In [8]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTMModel, self).__init__()
        # Hidden Dimensions
        self.hidden_dim = hidden_dim
        
        # Number of Hidden Layers
        self.layer_dim = layer_dim
        
        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_size, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        
        # Readout Layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Initialize hidden state with zeros
        h0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # Initialize cell state
        c0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # 28 time steps
        out, (hn, cn) = self.lstm(x, (h0, c0))
        
        # Index hidden state of last time step
        # out.size() --> (100, 28, 100)
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> (100, 10)
        return out
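
To check the shape comments in `forward`, here is a sketch that calls `nn.LSTM` directly on random input (a stand-in for a batch of 100 images). It also shows why indexing the last time step works: for a one-directional LSTM, `out[:, -1, :]` is the final hidden state of the last layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=28, hidden_size=100, num_layers=1, batch_first=True)
x = torch.randn(100, 28, 28)      # (batch, seq_dim, input_dim)

# Leaving out (h0, c0) makes PyTorch start from zeros, same as building them by hand
out, (hn, cn) = lstm(x)

print(out.shape)   # hidden state of the last layer at every time step
print(hn.shape)    # final hidden state of every layer

last_step = out[:, -1, :]
print(torch.allclose(last_step, hn[-1], atol=1e-6))
```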

### Step 4: Instantiate Model Class
* 28 time steps
    * Each time step: input dimension = 28
* 1 hidden layer
* MNIST 0-9 digits $\to$ output dimension = 10

In [9]:
input_dim = 28
hidden_dim = 100
layer_dim = 1
output_dim = 10

In [10]:
model = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim)

### Step 5: Instantiate Loss Class
The LSTM also uses **Cross Entropy Loss**, like the RNN, FNN, CNN, and logistic regression models

In [11]:
criterion = nn.CrossEntropyLoss()
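
`nn.CrossEntropyLoss` applies log-softmax and negative log-likelihood internally, which is why the model returns raw logits. A plain-Python sketch of that computation for a single sample (the helper `cross_entropy` is my own, written only for illustration):

```python
import math

def cross_entropy(logits, target):
    """Softmax followed by negative log-likelihood for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# A model that has learned nothing gives every digit the same score
uniform_loss = cross_entropy([0.0] * 10, target=3)
print(uniform_loss, math.log(10))

# A confident, correct prediction drives the loss toward zero
confident_loss = cross_entropy([0.0] * 3 + [10.0] + [0.0] * 6, target=3)
print(confident_loss)
```

The uniform case is a handy reference point: a 10-class model at chance level sits at a loss of ln(10), roughly 2.30.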

### Step 6: Instantiate Optimizer Class

In [12]:
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

### LSTM Model Parameters In-depth

In [13]:
print(len(list(model.parameters())))

6


In [14]:
# Print the shape of every parameter tensor in the model
for param in model.parameters():
    print(param.size())

torch.Size([400, 28])
torch.Size([400, 100])
torch.Size([400])
torch.Size([400])
torch.Size([10, 100])
torch.Size([10])


**Parameters**
* **Input $\to$ Gates**
    * $[400,28] \to w_1, w_3, w_5, w_7$
    * $[400] \to b_1, b_3, b_5, b_7$
* **Hidden State $\to$ Gates**
    * $[400,100] \to w_2, w_4, w_6, w_8$
    * $[400] \to b_2, b_4, b_6, b_8$
* **Hidden State $\to$ Output**
    * $[10,100] \to w_9$
    * $[10] \to b_9$
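
The 400s come from stacking the four gates: 4 x 100 hidden units. A plain-Python sketch that rebuilds the shape list from the hyperparameters (the helper `lstm_param_shapes` is hypothetical, written just for this check):

```python
def lstm_param_shapes(input_dim, hidden_dim, layer_dim, output_dim):
    """Shapes in the order model.parameters() lists them: LSTM layers, then readout."""
    shapes = []
    for layer in range(layer_dim):
        # Layer 0 reads the input; deeper layers read the hidden state of the layer below
        in_features = input_dim if layer == 0 else hidden_dim
        shapes.append((4 * hidden_dim, in_features))   # input -> 4 gates (weights)
        shapes.append((4 * hidden_dim, hidden_dim))    # hidden state -> 4 gates (weights)
        shapes.append((4 * hidden_dim,))               # input -> 4 gates (biases)
        shapes.append((4 * hidden_dim,))               # hidden state -> 4 gates (biases)
    shapes.append((output_dim, hidden_dim))            # readout weights
    shapes.append((output_dim,))                       # readout bias
    return shapes

for shape in lstm_param_shapes(28, 100, 1, 10):
    print(shape)
```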

### Step 7: Train Model! 
**Process**
1. **Convert inputs/labels to Variables**
    * LSTM input: (1, 28)
    * RNN input: (1, 28)
    * CNN input: (1, 28, 28)
    * FNN input: (1, 28*28)
2. Clear gradient buffers
3. Get output given inputs
4. Get loss
5. Get gradients w.r.t. parameters
6. Update parameters using gradients
    * parameters = parameters - learning_rate * parameters_gradients
7. REPEAT
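
A minimal sketch of one pass through that process, checking that `optimizer.step()` for plain SGD really is `parameters - learning_rate * gradients`. The tiny `nn.Linear` model and random data are stand-ins so the example runs in a moment; they are not the LSTM or MNIST.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                   # toy stand-in for the LSTM model
criterion = nn.CrossEntropyLoss()
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

optimizer.zero_grad()                     # clear gradient buffers
loss = criterion(model(inputs), labels)   # get output given inputs, then the loss
loss.backward()                           # get gradients w.r.t. parameters

# What the update rule predicts, computed by hand before stepping
expected = [p.detach() - learning_rate * p.grad for p in model.parameters()]

optimizer.step()                          # update parameters using gradients

matches = all(torch.allclose(p.detach(), e)
              for p, e in zip(model.parameters(), expected))
print(matches)
```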

In [15]:
# Number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
    
        images = Variable(images.view(-1, seq_dim, input_dim)) 
        labels = Variable(labels)
        
        #Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        #Forward pass to get outputs / logits
        # outputs.size() --> (100, 10)
        outputs = model(images)
        
        #Calculate Loss: softmax --> Cross Entropy Loss
        loss = criterion(outputs, labels)
        
        #Get gradients w.r.t. parameters
        loss.backward()
        
        #Update parameters
        optimizer.step()
        
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0 
            total = 0
            #Iterate through the test dataset
            for images, labels in test_loader:
                # Load images to Torch Variable
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get outputs/logits
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
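                # Total correct predictions (.item() turns the 0-dim tensor into a Python int)
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print loss and accuracy; loss is a 0-dim tensor, so read its value with .item()
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))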

Iteration: 500. Loss: 2.215986490249634. Accuracy: 21.04
Iteration: 1000. Loss: 0.7689857482910156. Accuracy: 73.39
Iteration: 1500. Loss: 0.46011725068092346. Accuracy: 87.85
Iteration: 2000. Loss: 0.7676885724067688. Accuracy: 91.3
Iteration: 2500. Loss: 0.1691460758447647. Accuracy: 93.64
Iteration: 3000. Loss: 0.29353076219558716. Accuracy: 95.97
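
The accuracy check inside the loop could also live in its own function. A sketch under my own assumptions: the helper name `evaluate` is hypothetical, and `torch.no_grad()` is added so no gradients are tracked during evaluation. The fake model and one-batch loader are there only to make the example runnable without MNIST.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, seq_dim=28, input_dim=28):
    """Return accuracy in percent over every batch in loader."""
    correct, total = 0, 0
    model.eval()
    with torch.no_grad():                 # no gradients needed for evaluation
        for images, labels in loader:
            outputs = model(images.view(-1, seq_dim, input_dim))
            _, predicted = torch.max(outputs, 1)   # index of the largest logit
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    model.train()
    return 100 * correct / total

class AlwaysZero(nn.Module):
    """Fake model that predicts class 0 for every input."""
    def forward(self, x):
        logits = torch.zeros(x.size(0), 10)
        logits[:, 0] = 1.0
        return logits

# One fake batch of 4 images; two of the four labels are class 0
fake_loader = [(torch.rand(4, 1, 28, 28), torch.tensor([0, 0, 1, 2]))]
toy_accuracy = evaluate(AlwaysZero(), fake_loader)
print(toy_accuracy)
```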


### Same non-convergence issue as before. I'm not sure whether this is a PyTorch thing, a Jupyter thing, a hardware issue, or something else entirely. I'll try re-running tomorrow.

### Update (next day): It worked after clearing the outputs and restarting the kernel. I still wonder what causes the issue, but at least I have a fix for it.

### Model B: 2 Hidden Layers
* Unroll 28 time steps
    * Each step input size: 1 x 28
    * Total per unroll 28 x 28
        * Feedforward network input size: 28 x 28
* **2 Hidden Layers**

### Steps
1. Load Dataset
2. Make Dataset iterable
3. Create model class
4. **Instantiate model class**
5. Instantiate loss class
6. Instantiate optimizer class
7. Train model!

### Adding another hidden layer is as simple as changing the line of code that sets the number of hidden layers. Starting from Step 4 below (Instantiate model class), since everything before it is exactly the same!

In [17]:
'''
Step 4: Instantiate model class
'''

input_dim = 28
hidden_dim = 100
layer_dim = 2  #The only change from one layer to two layers is here!! 
output_dim = 10

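model = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim)

# The optimizer from Step 6 still holds the parameters of the previous model,
# so it has to be re-created for the new one. Otherwise the new model is never
# updated and the loss stays near ln(10) ~ 2.30.
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Since nothing else changes, I'm only adding the new code for printing the model layers and
# Step 7 (Training the model)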

print(model)
print(len(list(model.parameters())))
for param in model.parameters():
    print(param.size())

'''
Step 7: Train the model
'''

# Number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
    
        images = Variable(images.view(-1, seq_dim, input_dim)) 
        labels = Variable(labels)
        
        #Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        #Forward pass to get outputs / logits
        # outputs.size() --> (100, 10)
        outputs = model(images)
        
        #Calculate Loss: softmax --> Cross Entropy Loss
        loss = criterion(outputs, labels)
        
        #Get gradients w.r.t. parameters
        loss.backward()
        
        #Update parameters
        optimizer.step()
        
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0 
            total = 0
            #Iterate through the test dataset
            for images, labels in test_loader:
                # Load images to Torch Variable
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get outputs/logits
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
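                # Total correct predictions (.item() turns the 0-dim tensor into a Python int)
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print loss and accuracy; loss is a 0-dim tensor, so read its value with .item()
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))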

LSTMModel (
  (lstm): LSTM(28, 100, num_layers=2, batch_first=True)
  (fc): Linear (100 -> 10)
)
10
torch.Size([400, 28])
torch.Size([400, 100])
torch.Size([400])
torch.Size([400])
torch.Size([400, 100])
torch.Size([400, 100])
torch.Size([400])
torch.Size([400])
torch.Size([10, 100])
torch.Size([10])
Iteration: 500. Loss: 2.3123300075531006. Accuracy: 8.92
Iteration: 1000. Loss: 2.3015358448028564. Accuracy: 8.92
Iteration: 1500. Loss: 2.3122527599334717. Accuracy: 8.92


KeyboardInterrupt: 
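
A loss pinned near 2.30 (about ln 10) with accuracy frozen at exactly 8.92% means the weights are not changing at all. That is the signature of an optimizer that is not bound to the model being trained, for example when `model` is re-created but the optimizer from Step 6 is not. A toy sketch of that failure mode, with tiny `nn.Linear` models as stand-ins for the LSTM:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()
inputs, labels = torch.randn(8, 4), torch.randint(0, 3, (8,))

old_model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(old_model.parameters(), lr=0.1)   # bound to old_model

new_model = nn.Linear(4, 3)           # the optimizer knows nothing about this one
before = [p.detach().clone() for p in new_model.parameters()]

optimizer.zero_grad()
criterion(new_model(inputs), labels).backward()
optimizer.step()                      # only looks at old_model, which has no gradients
stale_unchanged = all(torch.equal(p.detach(), b)
                      for p, b in zip(new_model.parameters(), before))

# Re-create the optimizer for the new model: the same step now changes its weights
optimizer = torch.optim.SGD(new_model.parameters(), lr=0.1)
optimizer.zero_grad()
criterion(new_model(inputs), labels).backward()
optimizer.step()
fresh_changed = any(not torch.equal(p.detach(), b)
                    for p, b in zip(new_model.parameters(), before))

print(stale_unchanged, fresh_changed)
```

This would also be consistent with a kernel restart "fixing" things, since re-running every cell in order re-creates the optimizer after the model.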

**Parameters (Layer 1)**
* **Input $\to$ Gates**
    * $[400,28] \to$ `lstm.weight_ih_l0`
    * $[400] \to$ `lstm.bias_ih_l0`
* **Hidden State $\to$ Gates**
    * $[400,100] \to$ `lstm.weight_hh_l0`
    * $[400] \to$ `lstm.bias_hh_l0`
    
**Parameters (Layer 2)**
* **Input $\to$ Gates**
    * $[400,100] \to$ `lstm.weight_ih_l1`
    * $[400] \to$ `lstm.bias_ih_l1`
* **Hidden State $\to$ Gates**
    * $[400,100] \to$ `lstm.weight_hh_l1`
    * $[400] \to$ `lstm.bias_hh_l1`
    
**Parameters (Readout Layer)**
* **Hidden State $\to$ Output**
    * $[10,100] \to$ `fc.weight`
    * $[10] \to$ `fc.bias`
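
To see which tensor is which, `named_parameters()` gives PyTorch's own name for each shape. A sketch that builds the two-layer LSTM and the readout layer separately (so it runs without the `LSTMModel` class), with the `fc.` prefix added by hand:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=28, hidden_size=100, num_layers=2, batch_first=True)
fc = nn.Linear(100, 10)

# Collect (name, shape) pairs in the same order model.parameters() would list them
named = [(name, tuple(p.shape)) for name, p in lstm.named_parameters()]
named += [("fc." + name, tuple(p.shape)) for name, p in fc.named_parameters()]

for name, shape in named:
    print(name, shape)
```

The `_l0` / `_l1` suffix is the layer index; `ih` means input to gates and `hh` means hidden state to gates. Layer 2's "input" is layer 1's hidden state, which is why its `weight_ih` is 400 x 100 rather than 400 x 28.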

### Model C: 3 Hidden Layers
* Unroll 28 time steps
    * Each step input size: 1 x 28
    * Total per unroll 28 x 28
        * Feedforward network input size: 28 x 28
* **3 Hidden Layers**

### Steps
1. Load Dataset
2. Make Dataset iterable
3. Create model class
4. **Instantiate model class**
5. Instantiate loss class
6. Instantiate optimizer class
7. Train model!

### The cell below can be run standalone for 3 hidden layers. I'm not running it now, since it will take a long time on this machine. It may also be that 3 hidden layers only add time and not accuracy for this example (MNIST).

In [None]:
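# Imports repeated here so the cell really can run standalone
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

'''
Step 1: Load dataset
'''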
train_dataset = dsets.MNIST(root="./data",
                            train=True,
                            transform=transforms.ToTensor(), 
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())
'''
Step 2: Make dataset iterable
'''
batch_size = 100
n_iters = 3000
num_epochs = int(n_iters / (len(train_dataset)/batch_size))

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
'''
Step 3: Create model class
'''
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTMModel, self).__init__()
        # Hidden Dimensions
        self.hidden_dim = hidden_dim
        
        # Number of Hidden Layers
        self.layer_dim = layer_dim
        
        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_size, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        
        # Readout Layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Initialize hidden state with zeros
        h0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # Initialize cell state
        c0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # 28 time steps
        out, (hn, cn) = self.lstm(x, (h0, c0))
        
        # Index hidden state of last time step
        # out.size() --> (100, 28, 100)
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> (100, 10)
        return out
'''
Step 4: Instantiate model class
'''

input_dim = 28
hidden_dim = 100
layer_dim = 3  #The only change from two layers to three layers is here!! 
output_dim = 10

model = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim)

# Print the model layers and the shape of every parameter tensor

print(model)
print(len(list(model.parameters())))
for param in model.parameters():
    print(param.size())
'''
Step 5: Instantiate Loss Class
'''
criterion = nn.CrossEntropyLoss()
'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
'''
Step 7: Train the model
'''

# Number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
    
        images = Variable(images.view(-1, seq_dim, input_dim)) 
        labels = Variable(labels)
        
        #Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        #Forward pass to get outputs / logits
        # outputs.size() --> (100, 10)
        outputs = model(images)
        
        #Calculate Loss: softmax --> Cross Entropy Loss
        loss = criterion(outputs, labels)
        
        #Get gradients w.r.t. parameters
        loss.backward()
        
        #Update parameters
        optimizer.step()
        
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0 
            total = 0
            #Iterate through the test dataset
            for images, labels in test_loader:
                # Load images to Torch Variable
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get outputs/logits
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
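                # Total correct predictions (.item() turns the 0-dim tensor into a Python int)
                correct += (predicted == labels).sum().item()

            accuracy = 100 * correct / total

            # Print loss and accuracy; loss is a 0-dim tensor, so read its value with .item()
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))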

### Comparison with RNN

**RNN Model A** | **RNN Model B** | **RNN Model C**
--- | --- | ---
ReLU | ReLU | Tanh
1 Hidden Layer | 2 Hidden Layers | 2 Hidden Layers
100 Hidden Units | 100 Hidden Units | 100 Hidden Units
~92.2% | ~95.0% | ~95.9%

**LSTM Model A** | **LSTM Model B** | **LSTM Model C**
--- | --- | ---
1 Hidden Layer | 2 Hidden Layers | 3 Hidden Layers
100 Hidden Units | 100 Hidden Units | 100 Hidden Units
~96.0% | ~95.2% | ~91.2%

### Deep Learning

* 2 ways to expand a recurrent (LSTM) neural network
    * More hidden units
        * `(o, i, f, g) gates`
    * More hidden layers
* Cons
    * Need a larger dataset
        * Curse of dimensionality
    * Does not necessarily mean higher accuracy
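
To put numbers on the two ways of growing the network, here is a plain-Python sketch that counts trainable parameters for an LSTM stack plus a linear readout. The helper `count_params` is my own; the formula is 4 gates, each with input weights, hidden weights, and two bias vectors per layer.

```python
def count_params(input_dim, hidden_dim, layer_dim, output_dim):
    """Total trainable parameters of an LSTM stack plus a linear readout layer."""
    total = 0
    for layer in range(layer_dim):
        # Layer 0 reads the input; deeper layers read the layer below
        in_features = input_dim if layer == 0 else hidden_dim
        total += 4 * hidden_dim * (in_features + hidden_dim + 2)
    total += output_dim * hidden_dim + output_dim   # readout weights and bias
    return total

# More hidden layers, 100 units each
for layers in (1, 2, 3):
    print(layers, "layer(s), 100 units:", count_params(28, 100, layers, 10))

# More hidden units, single layer
print("1 layer, 200 units:", count_params(28, 200, 1, 10))
```

Either route multiplies the number of weights to fit on the same 60,000 training images, which is one way to read the "need a larger dataset" point above.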