# 9. Feedforward NN Transition to Convolutional Neural Networks

[Graphic illustrating 1 hidden layer feed forward neural network]

## 1. Basic Convolutional Neural Network

* Add Convolution and Pooling layers before the feedforward neural network
* Layer with linear function & non-linearity: **fully connected layer**

## 1.2. One Convolutional Layer: High Level View

More diagrams that I can't recreate. I can always refer to the lectures if need be. 
* Input Depth = 1: **grayscale image**
* Convolution is like mapping from your input layer to a matrix of numbers as output. This is called a feature map, or activation map. 
* Each filter or kernel takes **strides** across your image, and the output of each stride adds one value to the feature map
* This is the basis of those cool demos that show which filters get activated when a CNN "looks" at input images. The filters (or kernels) are receptive to different features (horizontal lines, curves, circles, edges, etc.)
* Filter $\cdot$ pixel data in the receptive field = each data point in the feature map

## Input Depth of 3: One Convolutional Layer

* Input depth = 1: **grayscale**
* Input depth = 3: **color (RGB)**
* The filter / kernel must have the same depth as the input image, because each patch is multiplied element-wise with the kernel across all channels and then summed.
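
The depth requirement can be checked directly in PyTorch. This is a minimal sketch: the image size (32x32), the number of kernels (8), and the variable names are assumptions made up for illustration, not values from the lecture.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
gray = torch.randn(1, 1, 32, 32)   # (batch, depth=1, height, width)
rgb = torch.randn(1, 3, 32, 32)    # (batch, depth=3, height, width)

conv_gray = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5)
conv_rgb = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)

# Each kernel spans the full input depth: (out_channels, in_channels, K, K)
gray_kernel_shape = tuple(conv_gray.weight.shape)   # (8, 1, 5, 5)
rgb_kernel_shape = tuple(conv_rgb.weight.shape)     # (8, 3, 5, 5)

# One feature map per kernel, regardless of the input depth.
gray_out_shape = tuple(conv_gray(gray).shape)       # (1, 8, 28, 28)
rgb_out_shape = tuple(conv_rgb(rgb).shape)          # (1, 8, 28, 28)

# Feeding an RGB image to a depth-1 layer fails because the depths differ.
try:
    conv_gray(rgb)
    depth_mismatch_raised = False
except RuntimeError:
    depth_mismatch_raised = True
```

The output depth is set by the number of kernels, not by the input depth, which is why both layers produce 8 feature maps here.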


## High Level Summary: One Convolutional Layer
* As the kernel is sliding/convolving across the image $\to$ 2 operations done **per patch**
    1. Element-wise multiplication
    2. Summation
* More kernels = more feature map channels
    * Can capture more information about the input
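
The two per-patch operations can be written out by hand and compared against PyTorch's own convolution. This is a sketch under assumed sizes (a random 6x6 image and a random 3x3 kernel, stride 1, no padding); none of these values come from the lecture.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(6, 6)    # hypothetical 6x6 grayscale image
kernel = torch.randn(3, 3)   # hypothetical 3x3 kernel

K = kernel.shape[0]
out_size = image.shape[0] - K + 1   # stride 1, no padding -> 4
feature_map = torch.zeros(out_size, out_size)

for row in range(out_size):
    for col in range(out_size):
        patch = image[row:row + K, col:col + K]
        # 1. element-wise multiplication, 2. summation
        feature_map[row, col] = (patch * kernel).sum()

# Cross-check against PyTorch's built-in 2D convolution (no bias).
reference = F.conv2d(image.view(1, 1, 6, 6), kernel.view(1, 1, 3, 3))
reference = reference.view(out_size, out_size)
matches = torch.allclose(feature_map, reference, atol=1e-5)
```

Each entry of `feature_map` is one stride of the kernel, so the loop makes the "one value per patch" idea explicit.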

## 1.3. Multiple Convolutional Layers: High Level View
* After a convolutional layer, there is usually a pooling / downsampling layer.
* Subsequent convolutional layers can have more kernels, allowing us to learn more about the image, going deeper and deeper into the model. 


## 1.4. Pooling Layers: High Level View
* 2 Common Types
    1. Max Pooling
    2. Average Pooling
* Question: Does having pooling layers help prevent overfitting, i.e. does it make the model better able to generalize?
* Pooling slides a window (kernel) of its own across the feature map created in the convolutional layer and keeps only the max (or average) value inside the window at each stride. Because each window yields a single number, the subsequent layer is downsampled. The size of the kernel determines how much the layer downsamples.
* The convolution-pooling-convolution-pooling pattern can be stacked repeatedly, until the feature maps become too small to pool further. That stacking is the essence of "Deep" learning.
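
A small worked example of both pooling types, assuming a hand-written 4x4 feature map (the numbers are made up so the result is easy to verify by eye):

```python
import torch
import torch.nn as nn

# A hypothetical 4x4 feature map, shaped (batch, channels, height, width).
feature_map = torch.tensor([[1., 2., 5., 6.],
                            [3., 4., 7., 8.],
                            [9., 10., 13., 14.],
                            [11., 12., 15., 16.]]).view(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)   # stride defaults to kernel_size
avg_pool = nn.AvgPool2d(kernel_size=2)

max_out = max_pool(feature_map).view(2, 2)
avg_out = avg_pool(feature_map).view(2, 2)
# max_out -> [[4, 8], [12, 16]]
# avg_out -> [[2.5, 6.5], [10.5, 14.5]]
```

A 2x2 kernel turns the 4x4 map into a 2x2 map: each non-overlapping window contributes exactly one number.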

## 1.6. Padding
* Method to either preserve or modify the size of the image from input to feature map. 
* Types
    * Valid Padding (zero padding, i.e. no padding is added): Output size < Input size
    * Same Padding (non-zero padding): Output size = Input size. Adds zeros around the image as needed
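
The difference shows up in the output shapes. A minimal sketch, assuming a 28x28 single-channel input and a 5x5 kernel with stride 1 (the same kernel size used later in these notes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)   # one 28x28 grayscale image

# Valid padding: nothing is added around the image.
valid_conv = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding=0)
# Same padding: 2 rows/columns of zeros on every side for a 5x5 kernel.
same_conv = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding=2)

valid_shape = tuple(valid_conv(x).shape)   # (1, 1, 24, 24)
same_shape = tuple(same_conv(x).shape)     # (1, 1, 28, 28)
```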

## 1.8. Dimension Calculations
* $O=\frac{W-K+2P}{S}+1$
    * $O$: output height/length
    * $W$: input height/length
    * $K$: filter size / kernel size
    * $P$: padding
        * For same padding with stride 1 and odd $K$: $P=\frac{K-1}2$
    * $S$: stride
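
The formula translates into a small helper. The function names below are hypothetical, and the floor division is an assumption about how non-integer results are handled (PyTorch rounds down in this case).

```python
def conv_output_size(w, k, p=0, s=1):
    """Output height/width for input size w, kernel k, padding p, stride s."""
    return (w - k + 2 * p) // s + 1


def same_padding(k):
    """Padding that preserves the input size for stride 1 and odd k."""
    return (k - 1) // 2


same_size = conv_output_size(28, 5, p=same_padding(5), s=1)   # 28
valid_size = conv_output_size(28, 5, p=0, s=1)                # 24
strided_size = conv_output_size(28, 5, p=0, s=2)              # 12
```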

## 2. Building a Convolutional Neural Network with PyTorch
* Here we go! 
### Model A:
* 2 Convolutional Layers
    * Same Padding: output size = input size
* 2 Max Pooling Layers
* 1 Fully Connected Layer

### Steps
* STEP 1: Load dataset
* STEP 2: Make dataset iterable
* STEP 3: Create model class
* STEP 4: Instantiate model class
* STEP 5: Instantiate loss class
* STEP 6: Instantiate optimizer class
* STEP 7: Train model

### Step 1: Load Dataset

In [2]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [3]:
train_dataset = dsets.MNIST(root="./data",
                            train=True,
                            transform=transforms.ToTensor(), 
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

In [4]:
print(train_dataset.data.size())  # `train_data` is deprecated in newer torchvision

torch.Size([60000, 28, 28])


In [5]:
print(train_dataset.targets.size())  # `train_labels` is deprecated in newer torchvision

torch.Size([60000])


In [6]:
print(test_dataset.data.size())  # `test_data` is deprecated in newer torchvision

torch.Size([10000, 28, 28])


In [7]:
print(test_dataset.targets.size())  # `test_labels` is deprecated in newer torchvision

torch.Size([10000])


### Step 2: Make dataset iterable

In [8]:
batch_size = 100
n_iters = 3000
num_epochs = int(n_iters / (len(train_dataset)/batch_size))

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
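
The epoch arithmetic and the shape of one batch can be checked without downloading MNIST. In this sketch the stand-in dataset (1000 all-zero images) is hypothetical and exists only to exercise the `DataLoader`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 100
n_iters = 3000
dataset_size = 60000                           # MNIST training set size

iters_per_epoch = dataset_size // batch_size   # 600 batches per pass
num_epochs = n_iters // iters_per_epoch        # 5 passes over the data

# Small stand-in dataset (hypothetical zeros, not MNIST) to inspect one batch.
fake_dataset = TensorDataset(torch.zeros(1000, 1, 28, 28),
                             torch.zeros(1000, dtype=torch.long))
fake_loader = DataLoader(fake_dataset, batch_size=batch_size, shuffle=True)

images, labels = next(iter(fake_loader))
batch_image_shape = tuple(images.shape)   # (100, 1, 28, 28)
batch_label_shape = tuple(labels.shape)   # (100,)
batches_in_stand_in = len(fake_loader)    # 1000 / 100 = 10
```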

### Step 3: Create Model Class (Herein lies the good stuff!)

* $O=\frac{W-K+2P}{S}+1$
    * $O$: output height/length
    * $W$: input height/length
    * $K$: filter size / kernel size: **5**
    * $P$: padding: **same padding (non-zero)**
        * $P=\frac{K-1}2=\frac{5-1}2=2$
    * $S$: stride: **=1**

**Output size for Max Pooling**

* $O=\frac WK$
    * $W$: Input height/width
    * $K$: **filter size = 2** (downsamples to half input size, i.e. 2 in the denominator)
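
Applying the two formulas layer by layer gives the sizes needed for Model A on 28x28 MNIST images. The helper names are hypothetical; the kernel, padding, and stride values are the ones listed above.

```python
def conv_out(w, k, p, s):
    """Convolution output size: (W - K + 2P) / S + 1."""
    return (w - k + 2 * p) // s + 1


def pool_out(w, k):
    """Max pooling output size: W / K."""
    return w // k


size = 28                          # MNIST input height/width
after_conv1 = conv_out(size, k=5, p=2, s=1)          # 28 (same padding)
after_pool1 = pool_out(after_conv1, k=2)             # 14
after_conv2 = conv_out(after_pool1, k=5, p=2, s=1)   # 14 (same padding)
after_pool2 = pool_out(after_conv2, k=2)             # 7

# 32 feature maps of 7x7 feed the fully connected layer.
fc_inputs = 32 * after_pool2 * after_pool2           # 1568
```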

In [16]:
# Above two equations will give us the numbers we need as we build our model. 

class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()
        
        # Convolution 1
        self.cnn1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2)
        self.relu1 = nn.ReLU()
        
        # Max Pooling 1
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        
        # Convolution 2
        self.cnn2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2)
        self.relu2 = nn.ReLU()
        
        # Max Pooling 2
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        
        # Fully Connected Layer
        self.fc1 = nn.Linear(32 * 7 * 7, 10)
        
    def forward(self, x):
        
        # Convolution 1
        out = self.cnn1(x)
        out = self.relu1(out)
        
        # Max Pooling 1
        out = self.maxpool1(out)
        
        # Convolution 2
        out = self.cnn2(out)
        out = self.relu2(out)
        
        # Max Pooling 2
        out = self.maxpool2(out)
        
        # Resize for the linear function in the readout layer
        # Original size: (100, 32, 7, 7)
        # out.size(0): 100
        # Desired out size: (100, 32*7*7)
        out = out.view(out.size(0), -1)  # Flatten each sample to one vector; -1 lets PyTorch infer 32*7*7 = 1568
        
        # Linear Function (readout) Layer
        out = self.fc1(out)
        
        return out
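
To see what each step does to the tensor shape, including the `view` call, a random batch can be pushed through the same layers one at a time. This is a sketch: the layers are re-created standalone with the hyperparameters from the class above, and the random input is a stand-in for a real MNIST batch.

```python
import torch
import torch.nn as nn

x = torch.randn(100, 1, 28, 28)   # stand-in batch: 100 grayscale 28x28 images

cnn1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2)
cnn2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2)
pool = nn.MaxPool2d(kernel_size=2)
relu = nn.ReLU()
fc1 = nn.Linear(32 * 7 * 7, 10)

shapes = []
out = relu(cnn1(x))
shapes.append(tuple(out.shape))   # (100, 16, 28, 28)
out = pool(out)
shapes.append(tuple(out.shape))   # (100, 16, 14, 14)
out = relu(cnn2(out))
shapes.append(tuple(out.shape))   # (100, 32, 14, 14)
out = pool(out)
shapes.append(tuple(out.shape))   # (100, 32, 7, 7)

# Keep the batch dimension, collapse everything else into one axis.
out = out.view(out.size(0), -1)
shapes.append(tuple(out.shape))   # (100, 1568)

logits = fc1(out)
shapes.append(tuple(logits.shape))  # (100, 10)
```

`view` does not copy data; it reinterprets the same 100 x 32 x 7 x 7 values as 100 rows of 1568 numbers, which is the shape `nn.Linear` expects.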

### Step 4: Instantiate Model Class

In [17]:
model = CNNModel()

### Step 5: Instantiate Loss Class
* Convolutional Neural Network: **Cross Entropy Loss**

In [18]:
criterion = nn.CrossEntropyLoss()
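
`nn.CrossEntropyLoss` takes raw logits and applies log-softmax internally, which is why the model has no softmax layer. The sketch below checks this against a manual computation; the logits and labels are random, hypothetical values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # hypothetical raw outputs: 4 images, 10 classes
labels = torch.tensor([3, 0, 9, 1])   # hypothetical class indices

loss = nn.CrossEntropyLoss()(logits, labels)

# Manual equivalent: log-softmax, then the mean negative
# log-probability assigned to the true class of each sample.
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(4), labels].mean()

same = torch.allclose(loss, manual_loss)
```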

### Step 6: Instantiate Optimizer Class

In [19]:
learning_rate = 0.01

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
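
With no momentum or weight decay, one `optimizer.step()` of plain SGD amounts to `parameter = parameter - learning_rate * gradient`. A minimal sketch on a tiny, hypothetical linear layer (not the CNN above) to confirm that:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 0.01
layer = nn.Linear(3, 1)   # tiny stand-in model
optimizer = torch.optim.SGD(layer.parameters(), lr=learning_rate)

x = torch.randn(5, 3)     # hypothetical inputs
y = torch.randn(5, 1)     # hypothetical targets
loss = ((layer(x) - y) ** 2).mean()

optimizer.zero_grad()
loss.backward()

# What plain SGD should produce: p - lr * grad for every parameter.
expected = [p.detach().clone() - learning_rate * p.grad for p in layer.parameters()]

optimizer.step()
matches = all(torch.allclose(p.detach(), e)
              for p, e in zip(layer.parameters(), expected))
```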

### In-depth Look at Our Model's Parameters

In [20]:
print(model.parameters())
print(len(list(model.parameters()))) #This will show us how many parameter tensors there are (one weight and one bias per trainable layer)

# Convolution 1: 16 kernels
print(list(model.parameters())[0].size())

# Convolution 1 bias: one per kernel (16)
print(list(model.parameters())[1].size())

# Convolution 2: 32 kernels with depth = 16
print(list(model.parameters())[2].size())

# Convolution 2 bias: one per kernel (32)
print(list(model.parameters())[3].size())

# Fully Connected Layer 1
print(list(model.parameters())[4].size())

# Fully Connected Layer 1 Bias: 
print(list(model.parameters())[5].size())

<generator object Module.parameters at 0x1152f6570>
6
torch.Size([16, 1, 5, 5])
torch.Size([16])
torch.Size([32, 16, 5, 5])
torch.Size([32])
torch.Size([10, 1568])
torch.Size([10])
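
The printed sizes also give the total number of trainable values. This sketch re-creates only the three trainable layers (ReLU and max pooling have no parameters) with the same hyperparameters and counts the entries.

```python
import torch.nn as nn

# Same trainable layers as the model above, re-created standalone.
layers = nn.ModuleList([
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
    nn.Linear(32 * 7 * 7, 10),
])

counts = [p.numel() for p in layers.parameters()]
# [400, 16, 12800, 32, 15680, 10]
#  16*1*5*5, 16, 32*16*5*5, 32, 10*1568, 10
total = sum(counts)   # 28938
```

More than half of the parameters sit in the single fully connected layer, even though the convolutional layers do most of the computation.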


### Step 7: Train the model!!

#### Process
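1. **Convert inputs & labels to Variables** (since PyTorch 0.4, plain tensors work directly and `Variable` is only a thin wrapper)
    * CNN input: (1, 28, 28), i.e. (channels, height, width). The CNN takes each image as a 2-dimensional grid (28 by 28), so no flattening is needed
    * Feedforward input: (1, 28*28), the image flattened into a single vector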
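
The difference between the two input layouts, sketched on a random stand-in batch of 100 images (not real MNIST data):

```python
import torch

images = torch.randn(100, 1, 28, 28)   # stand-in batch as a DataLoader would yield it

# CNN: use the batch as-is, (batch, channels, height, width).
cnn_input = images

# Feedforward network: flatten every image into a 784-element vector.
ff_input = images.view(-1, 28 * 28)

cnn_shape = tuple(cnn_input.shape)   # (100, 1, 28, 28)
ff_shape = tuple(ff_input.shape)     # (100, 784)
```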

In [21]:
iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
    
        images = Variable(images) #No need to resize, because the image is already in a form that the CNN can use.
        labels = Variable(labels)
        
        #Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        #Forward pass to get outputs / logits
        outputs = model(images)
        
        #Calculate Loss: softmax --> Cross Entropy Loss
        loss = criterion(outputs, labels)
        
        #Get gradients w.r.t. parameters
        loss.backward()
        
        #Update parameters
        optimizer.step()
        
        iter += 1
        if iter % 500 == 0:
            # Calculate Accuracy
            correct = 0 
            total = 0
            #Iterate through the test dataset
            for images, labels in test_loader:
                # Load images to Torch Variable
                images = Variable(images)
                
                # Forward pass only to get outputs/logits
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                correct += (predicted == labels).sum().item()
                    
            accuracy = 100 * correct / total
            
            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

Iteration: 500. Loss: 0.31840232014656067. Accuracy: 90.05
Iteration: 1000. Loss: 0.16880546510219574. Accuracy: 93.02
Iteration: 1500. Loss: 0.2797934412956238. Accuracy: 94.54
Iteration: 2000. Loss: 0.3394652307033539. Accuracy: 95.7
Iteration: 2500. Loss: 0.16251006722450256. Accuracy: 96.39
Iteration: 3000. Loss: 0.07263852655887604. Accuracy: 97.07


As we can see, this model reaches about 97% test accuracy after just 5 epochs. One thing to note, though, is that training took considerably longer than with the previous models, presumably because the CNN does far more computation per image. The speedup on a GPU will likely be very noticeable. I'm going to move to a new notebook to explore different CNNs.
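
As a starting point for the GPU experiments, the accuracy check can be pulled into a helper that runs on whichever device is available. This is only a sketch: the `evaluate` function, the tiny stand-in model, and the random data are all hypothetical and are not part of the notebook above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device):
    """Return the accuracy (in percent) of model over all batches in loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():   # no gradients are needed for evaluation
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            predicted = model(images).argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100 * correct / total


# Use the GPU when one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-in model and random data, just to exercise the helper.
stand_in_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
fake_loader = DataLoader(
    TensorDataset(torch.randn(20, 1, 28, 28), torch.randint(0, 10, (20,))),
    batch_size=10,
)
accuracy = evaluate(stand_in_model, fake_loader, device)
```

Both the model and every batch have to live on the same device, which is why `.to(device)` appears in two places.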