# Deep Learning: An Overview

## What will we cover in this workshop?

- Understand what DL is, how it is different from ML, and why it is so effective.
- Understand the Perceptron algorithm and how it is the building block for a Neural Network.
- Explore prominent DL architectures in PyTorch.

`pip install matplotlib numpy torch torchvision`

In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

In [None]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Initialize the Hyperparameters 
input_size = 784
hidden_size = 500
num_classes = 10
num_layers = 2
sequence_length = 28
num_epochs = 5
batch_size = 100
learning_rate = 0.001

## The MNIST Dataset

![](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

In [None]:
# MNIST dataset 
train_dataset = torchvision.datasets.MNIST(root='./data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),  
                                           download=True)

In [None]:
test_dataset = torchvision.datasets.MNIST(root='./data', 
                                          train=False, 
                                          transform=transforms.ToTensor())

In [None]:
# Train data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)


In [None]:
# Test data loader
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

In [None]:
print(train_dataset)
print(test_dataset)
print(train_loader)
print(test_loader)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

# function to show an image
def imshow(img):
    # ToTensor() already scales pixels to [0, 1], so no un-normalization is needed
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))  # (C, H, W) -> (H, W, C)
    plt.show()


# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)  # iterator objects no longer expose .next()

# show images
fig, ax = plt.subplots(figsize=(14, 10))
imshow(torchvision.utils.make_grid(images))

## What is Deep Learning?

- **The Objective of ML**: Build intelligent systems by extracting patterns from raw data. Build a **hierarchy of concepts**, with each concept being built out of simpler ones. 
- **Deep Learning**: The hierarchy of concepts allows complicated concepts to be built out of simpler ones, and the **graph** representing the relations between concepts is very deep. Hence the name 'Deep' Learning.

## Why is Deep Learning Necessary?

- Performance of ML algorithms depends heavily on the *representation of the data*. The representation is composed of many pieces of information on the data point, also known as **features** of the data point.

- A simple algorithm for building AI:
    1. Design an appropriate set of features for the task.
    2. Feed the data to an ML algorithm.

- **The Problem**: *It is often difficult to construct features for certain problems*. If for example we wanted to recognize wheels on cars, it is not easy to describe what wheels look like in terms of pixel values.

- One solution is to use ML to learn the representations themselves. This is called **representation learning**. But it can be difficult to obtain such a representation directly. This is where Deep Learning helps us; *it introduces representations that are expressed in terms of simpler representations*.

### The Perceptron

<img src=https://miro.medium.com/max/732/1*74YD-gADYB8xC7MQ36apFA.jpeg width="200">

- Simple model of computation that takes in several binary inputs $x_1, x_2, \dots, x_n$ and their corresponding weights $w_1, w_2, \dots, w_n$ and produces a single binary output. The output depends on whether the weighted sum of the inputs is above or below a *threshold* value.

$$
\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      1 & \mbox{if } \sum_j w_j x_j\geq \mbox{ threshold} \\
      0 & \mbox{if } \sum_j w_j x_j< \mbox{ threshold}
      \end{array} \right.
\tag{1}\end{eqnarray}
$$

- **Intuition**: Just a device that makes a decision by *weighing* up evidence.
- Varying the weights and the thresholds can allow us to produce different models of decision making.

- Extending this idea to a network of perceptrons allows us to make more abstract and complex decisions:
![](http://neuralnetworksanddeeplearning.com/images/tikz1.png)

- **Rewriting the Perceptron equation with a bias**: move the threshold to the left-hand side and define the bias $b \equiv -\mbox{threshold}$:
$$
\begin{eqnarray}
  \mbox{output} & = & \left\{ \begin{array}{ll}
      1 & \mbox{if } \sum_j w_j x_j + b \geq 0 \\
      0 & \mbox{if } \sum_j w_j x_j + b < 0
      \end{array} \right.
\tag{1}\end{eqnarray}
$$

- **Intuition**: The bias is a measure of how easy it is for the perceptron to output a 1. 
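
As a minimal sketch of the decision rule above, the snippet below implements a perceptron in NumPy. It assumes the threshold is folded into the bias ($b = -\mbox{threshold}$), so the unit outputs 1 when $\sum_j w_j x_j + b \geq 0$. The function name `perceptron` and the gate weights are illustrative choices, not part of the workshop code.

```python
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum plus bias is non-negative, else 0."""
    return int(np.dot(w, x) + b >= 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
w = np.array([1.0, 1.0])  # hypothetical weights: both inputs count equally

# With b = -2 the weighted sum must reach 2, so both inputs have to be 1 (logical AND).
and_outputs = [perceptron(np.array(x), w, -2.0) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]

# Raising the bias to -1 makes it easier to output a 1 (logical OR).
or_outputs = [perceptron(np.array(x), w, -1.0) for x in inputs]
print(or_outputs)  # [0, 1, 1, 1]
```

Only the bias changed between the two gates, which is one way to see the bias as a measure of how easily the perceptron outputs a 1.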

<img src=https://miro.medium.com/max/15164/1*-oWHnqj0hjipXyeaUy8k8A.jpeg width="700">

### Learning Algorithms

- **Core Idea**: Automatically tune the weights and biases of the network of neurons.
- **Problem**: Small changes in the weights and biases might cause the output of the perceptron to flip completely.

#### The Sigmoid Neuron
- Similar to a perceptron, with two differences: each input can take any value between 0 and 1 (not just 0 or 1), and the output varies smoothly, so small changes in the weights and biases cause only small changes in the output.
- The output becomes $\sigma(w \cdot x + b)$, where $\sigma$ is the sigmoid function:
$$
\begin{eqnarray} 
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{2}\end{eqnarray}
$$

- Because $\sigma$ is smooth, a small change in the weights and bias changes the output approximately linearly:
$$
\begin{eqnarray} 
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j}
  \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b.
\tag{3}\end{eqnarray}
$$
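
The following sketch checks Eq. (3) numerically for a single sigmoid neuron: it nudges the weights and bias slightly and compares the actual change in output with the linear estimate. The input, weight, and perturbation values are arbitrary assumptions chosen for illustration, and the partial derivatives use the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Output of a sigmoid neuron: sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values (assumptions, not taken from the workshop)
x = np.array([0.2, 0.7, 0.5])
w = np.array([0.4, -0.6, 0.9])
b = 0.1

# Small perturbations of the weights and the bias
dw = np.array([0.01, -0.02, 0.005])
db = 0.01

# Actual change in the output
actual = neuron(x, w + dw, b + db) - neuron(x, w, b)

# Linear estimate from Eq. (3): d(output)/dw_j = sigma'(z) * x_j, d(output)/db = sigma'(z)
out = neuron(x, w, b)
slope = out * (1.0 - out)
estimate = np.sum(slope * x * dw) + slope * db

print(f"actual change:   {actual:.8f}")
print(f"linear estimate: {estimate:.8f}")
```

The two numbers should agree closely, which is the property a learning algorithm relies on when it makes small adjustments to the weights.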

#### Generalizing the Notion: Activation Functions
- The neuron computes $f(w \cdot x + b)$, where $f$ is the activation function. The choice of $f$ lets us control the values of the partial derivatives in Eq. (3). For instance, the sigmoid activation function is popular because *exponentials have lovely properties when differentiated*.
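
To make the remark about exponentials concrete, here is a short worked derivation (standard calculus, not taken from the workshop material). Differentiating Eq. (2) gives a derivative written in terms of the sigmoid itself, so the partial derivatives in Eq. (3) can be computed from the neuron's output alone:

$$
\sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr),
\qquad
\frac{\partial \, \mbox{output}}{\partial w_j} = \sigma'(w \cdot x + b)\, x_j,
\qquad
\frac{\partial \, \mbox{output}}{\partial b} = \sigma'(w \cdot x + b).
$$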

## The Neural Network (a.k.a. Multilayer Perceptron)

![](http://neuralnetworksanddeeplearning.com/images/tikz11.png)

- Consists of three types of layers: **an input layer, one or more hidden layers, and an output layer**.
- Each neuron in a layer is connected to every neuron in the next layer.
- The output of one layer is used as the input to the next layer. That's why we call them **feedforward neural networks**. There are no loops.

### Learning using Gradient Descent and Backpropagation

- A loss function is needed to quantify the difference between the desired output and the actual output. Common choices include root-mean-square error (RMSE), cross-entropy, and hinge loss (the loss used by SVMs).

- *Our objective is to minimize the loss function*. This is done by moving in the negative direction of the gradient.

$$
x_{t+1} = x_t - \eta_t \nabla f(x_t)
$$

where $x_t$ denotes the parameters, and $\eta_t$ and $\nabla f(x_t)$ are the learning rate and the gradient of the cost function, respectively, at step $t$.
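
The update rule can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: a constant learning rate instead of a schedule $\eta_t$, and a one-dimensional cost $f(x) = (x - 3)^2$ picked only because its minimum is known. The helper name `gradient_descent` is hypothetical.

```python
def gradient_descent(grad, x0, learning_rate=0.1, num_steps=100):
    """Repeatedly apply x_{t+1} = x_t - eta * grad(x_t) with a constant eta."""
    x = x0
    for _ in range(num_steps):
        x = x - learning_rate * grad(x)
    return x

# Toy cost f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_min, 4))  # 3.0
```

Each step shrinks the distance to the minimum by a factor of 0.8 here; a learning rate that is too large (above 1.0 for this cost) would make the iterates diverge instead.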

- Notice that we need to update the parameters corresponding to each neuron in the network. But, how does one compute the derivatives when we have multiple layers? We use the chain rule!
- We start with the derivative of the loss function with respect to the last layer and then 'propagate' those errors backward through the hidden layers until we arrive at the input layer. This process is called **backpropagation**.
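
As a small sketch of the chain rule at work, the snippet below builds a hypothetical network with one hidden sigmoid unit and a squared-error loss, computes the gradients by hand from the output backward, and compares them with what PyTorch's autograd produces. All numeric values are arbitrary assumptions.

```python
import torch

# Tiny network: h = sigmoid(w1 * x + b1), y_hat = w2 * h + b2, loss = (y_hat - y)^2
x, y = torch.tensor(0.5), torch.tensor(1.0)
w1 = torch.tensor(0.8, requires_grad=True)
b1 = torch.tensor(-0.1, requires_grad=True)
w2 = torch.tensor(1.5, requires_grad=True)
b2 = torch.tensor(0.2, requires_grad=True)

# Forward pass
h = torch.sigmoid(w1 * x + b1)
y_hat = w2 * h + b2
loss = (y_hat - y) ** 2

# Backward pass computed by autograd
loss.backward()

# The same gradients by hand, applying the chain rule from the output backward
with torch.no_grad():
    d_yhat = 2 * (y_hat - y)         # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat  # gradients of the output layer
    d_h = d_yhat * w2                # error propagated back to the hidden unit
    d_z = d_h * h * (1 - h)          # through the sigmoid, using sigma' = sigma * (1 - sigma)
    d_w1, d_b1 = d_z * x, d_z        # gradients of the hidden layer

manual = torch.stack([d_w1, d_b1, d_w2, d_b2])
autograd = torch.stack([w1.grad, b1.grad, w2.grad, b2.grad])
print(manual)
print(autograd)
```

The two printed tensors should match up to floating-point rounding, since `loss.backward()` performs the same backward sweep automatically.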

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('Ilg3gGewQ5U', width=1000, height=500)

In [None]:
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

In [None]:
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

In [None]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  

In [None]:
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

In [None]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

In [None]:
# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')

## Learning to see better: Convolutional Neural Networks

![](http://cs231n.github.io/assets/cnn/cnn.jpeg)

### The Convolution Layer

#### What does it do?
- Consists of a set of **learnable filters**. Each filter is small spatially (along the width and height) but extends through the full depth of the input volume.
- **Process**: Slide the filter across the input volume and compute dot products between the entries of the filter and the corresponding input. This produces a **2D activation map**. 
- **Intuition**: The network learns filters that activate in response to visual stimuli, such as an edge or a colored blotch. Each layer has an entire set of filters, and we stack their activation maps along the depth dimension to produce the output volume.

![](https://miro.medium.com/max/1052/1*GcI7G-JLAQiEoCON7xFbhg.gif)
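
The sliding-window process can be sketched directly. The helper `slide_filter` below is a hypothetical, deliberately slow implementation (single channel, stride 1, no padding) that is compared against `torch.nn.functional.conv2d`; the vertical-edge kernel is hand-picked for illustration, whereas a CNN would learn its filter values.

```python
import torch
import torch.nn.functional as F

def slide_filter(image, kernel):
    """Slide `kernel` over a 2D `image` (stride 1, no padding) and take dot products."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter and the patch it currently covers
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

torch.manual_seed(0)
image = torch.rand(6, 6)
kernel = torch.tensor([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])

activation_map = slide_filter(image, kernel)
# F.conv2d expects (batch, channels, height, width) inputs and (out, in, kh, kw) weights
reference = F.conv2d(image[None, None], kernel[None, None]).squeeze()
print(activation_map.shape)  # torch.Size([4, 4])
print(torch.allclose(activation_map, reference, atol=1e-5))
```

A 6x6 input and a 3x3 filter give a 4x4 activation map; a layer with several filters would produce one such map per filter.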

#### Local Connectivity
- Impractical to connect each neuron to all neurons in the previous volume. Instead, we connect each neuron to a local region of the input volume. The spatial extent of this is a hyperparameter called the **receptive field** of the neuron. 

- The following hyperparameters control the size of the output volume:
1. **Depth**: The number of filters we would like to use, each learning to look for something different in the input.
2. **Stride**: The stride with which we slide the filter.
3. **Zero Padding**: Sometimes, it is convenient to pad the input volume with zeros around the border to control the size of the output volume. The size of this zero padding is a hyperparameter.

Let $W$ be the input volume size, $F$ be the receptive field size of the filters, $S$ be the stride of the filters, and $P$ be the amount of zero padding on the border. The size of the output volume $O$ is:

$$
O = \frac{W - F + 2P}{S} + 1
$$
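
As a worked example of the formula, the sketch below traces the spatial size of a 28x28 MNIST image through the layer settings used by the `ConvNet` defined later in this notebook (5x5 convolutions with stride 1 and padding 2, each followed by 2x2 max pooling with stride 2). The helper `conv_output_size` is a hypothetical name, and raising an error when the filter does not tile the input evenly is a design choice made here for clarity.

```python
def conv_output_size(w, f, s=1, p=0):
    """Output width of a conv or pooling layer: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("filter does not tile the input evenly with this stride")
    return span // s + 1

size = 28                                     # MNIST images are 28 x 28
size = conv_output_size(size, f=5, s=1, p=2)  # 5x5 conv, padding 2 -> 28
size = conv_output_size(size, f=2, s=2)       # 2x2 max pool        -> 14
size = conv_output_size(size, f=5, s=1, p=2)  # 5x5 conv, padding 2 -> 14
size = conv_output_size(size, f=2, s=2)       # 2x2 max pool        -> 7
print(size)  # 7
```

The final 7x7 size, with 32 channels after the second convolution, is where the `7*7*32` input width of the fully connected layer comes from.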

#### Parameter Sharing
- Used to control the number of parameters. The key idea is to make all neurons in a particular feature map share a single set of weights and biases. Makes computation far more efficient.
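
To get a feel for the savings, this sketch counts the parameters of a convolution layer with shared weights and of a hypothetical fully connected layer that maps the same 28x28 input to an output of the same size (16 maps of 28x28). The helper `count_parameters` and the dense comparison are illustrative assumptions.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())

# Convolution: 16 filters of size 5x5 over 1 input channel, reused at every position
conv = nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2)

# Dense alternative: every output value gets its own weight for every input pixel
dense = nn.Linear(28 * 28, 16 * 28 * 28)

conv_params = count_parameters(conv)
dense_params = count_parameters(dense)
print(conv_params)   # 416
print(dense_params)  # 9847040
```

The convolution needs 16 x 25 weights plus 16 biases, while the dense layer needs close to ten million parameters for the same output shape.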


### The Pooling Layer
- **Purpose:** to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network, and hence also helps control overfitting.
- There are various types such as max pooling, average pooling, and L2-norm pooling.

![](http://cs231n.github.io/assets/cnn/maxpool.jpeg)
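
A minimal sketch of pooling on a single 4x4 feature map, using PyTorch's `nn.MaxPool2d` and `nn.AvgPool2d` with a 2x2 window and stride 2. The input values are made up for illustration.

```python
import torch
import torch.nn as nn

# A single 4x4 feature map, shaped (batch, channels, height, width)
feature_map = torch.tensor([[1., 1., 2., 4.],
                            [5., 6., 7., 8.],
                            [3., 2., 1., 0.],
                            [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Each 2x2 window collapses to one value, halving the width and height
pooled_max = max_pool(feature_map).squeeze()
pooled_avg = avg_pool(feature_map).squeeze()
print(pooled_max.tolist())  # [[6.0, 8.0], [3.0, 4.0]]
print(pooled_avg.tolist())  # [[3.25, 5.25], [2.0, 2.0]]
```

Pooling layers have no learnable parameters; they only summarize each window.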

### The Fully Connected Layer

- Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks.

### Build a CNN (two layers) with PyTorch

- PyTorch is an open-source library developed by Facebook for machine learning, natural language processing, and more.
- You can also try other popular DL libraries such as TensorFlow (and Keras, a high-level wrapper for TensorFlow) and Theano. These libraries offer different levels of readability and abstraction, letting you assemble neural networks like building blocks.
- We highly recommend that beginners try PyTorch and Keras (which is more high-level) before moving on to the details of the TensorFlow APIs.

In [None]:
# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)
        self.features = [self.layer1, self.layer2, self.fc]
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

In [None]:
# Create a ConvNet instance
model = ConvNet(num_classes).to(device)

In [None]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

In [None]:
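import os
if os.path.exists('cnnmodel.ckpt'):  # skipped on a first run, before the save cell below has written the file
    model = ConvNet(num_classes).to(device)
    model.load_state_dict(torch.load('cnnmodel.ckpt', map_location=device))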

In [None]:
# Test the model
model.eval()  # eval mode (batchnorm uses moving mean/variance instead of mini-batch mean/variance)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

In [None]:
# Save the model checkpoint
torch.save(model.state_dict(), 'cnnmodel.ckpt')

### Visualize the filters of our ConvNet

In [None]:
def plot_filters_single_channel(t):
    
    #kernels depth * number of kernels
    nplots = t.shape[0]*t.shape[1]
    ncols = 12
    
    nrows = 1 + nplots//ncols
    #convert tensor to numpy image
    npimg = np.array(t.numpy(), np.float32)
    
    count = 0
    fig = plt.figure(figsize=(ncols, nrows))
    
    #looping through all the kernels in each channel
    for i in range(t.shape[0]):
        for j in range(t.shape[1]):
            count += 1
            ax1 = fig.add_subplot(nrows, ncols, count)
            npimg = np.array(t[i, j].numpy(), np.float32)
            npimg = (npimg - np.mean(npimg)) / np.std(npimg)
            npimg = np.minimum(1, np.maximum(0, (npimg + 0.5)))
            ax1.imshow(npimg)
            ax1.set_title(str(i) + ',' + str(j))
            ax1.axis('off')
            ax1.set_xticklabels([])
            ax1.set_yticklabels([])
   
    plt.tight_layout()
    plt.show()

In [None]:
def plot_filters_multi_channel(t):
    
    #get the number of kernels
    num_kernels = t.shape[0]    
    
    #define number of columns for subplots
    num_cols = 12
    #rows = num of kernels
    num_rows = num_kernels
    
    #set the figure size
    fig = plt.figure(figsize=(num_cols,num_rows))
    
    #looping through all the kernels
    for i in range(t.shape[0]):
        ax1 = fig.add_subplot(num_rows,num_cols,i+1)
        
        #for each kernel, we convert the tensor to numpy 
        npimg = np.array(t[i].numpy(), np.float32)
        #standardize the numpy image
        npimg = (npimg - np.mean(npimg)) / np.std(npimg)
        npimg = np.minimum(1, np.maximum(0, (npimg + 0.5)))
        npimg = npimg.transpose((1, 2, 0))
        ax1.imshow(npimg)
        ax1.axis('off')
        ax1.set_title(str(i))
        ax1.set_xticklabels([])
        ax1.set_yticklabels([])
          
    plt.tight_layout()
    plt.show()

In [None]:
def plot_weights(model, layer_num, single_channel = True, collated = False):
  
    #extracting the model features at the particular layer number
    layer = model.features[layer_num][0]

    #checking whether the layer is convolution layer or not 
    if isinstance(layer, nn.Conv2d):
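        #getting the weight tensor data (moved to the CPU so it can be converted to numpy)
        weight_tensor = layer.weight.data.cpu()

        if single_channel:
            if collated:
                #plot_filters_single_channel_big is not defined in this notebook
                raise NotImplementedError("collated plotting is not available in this notebook")
            else:
                plot_filters_single_channel(weight_tensor)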

        else:
            if weight_tensor.shape[1] == 3:
                plot_filters_multi_channel(weight_tensor)
            else:
                print("Can only plot weights with three channels with single channel = False")

    else:
        print("Can only visualize layers which are convolutional")
        
#visualize weights for 1st convolutional layers
plot_weights(model, 0, single_channel = True)

### Visualize the convolved image (feature map) for each layer

In [None]:
activation = {}
def get_activation(name):
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

model.features[0][0].register_forward_hook(get_activation('conv1'))
model.features[1][0].register_forward_hook(get_activation('conv2'))
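data, _ = test_dataset[0]
# add a batch dimension and move the image to the same device as the model
data = data.unsqueeze(0).to(next(model.parameters()).device)
output = model(data)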

In [None]:
# Visualize the feature maps of the second conv layer (32 kernels);
# use activation['conv1'] instead for the first conv layer (16 kernels)
act = activation['conv2'].squeeze().cpu()
num_cols = 4
num_rows = (act.size(0) + num_cols - 1) // num_cols  # just enough rows for all maps
fig = plt.figure(figsize = (20, 5 * num_rows))

for i in range(act.size(0)):
    ax1 = fig.add_subplot(num_rows,num_cols,i + 1)
    ax1.set_title(str(i))
    ax1.imshow(act[i])

### References

- "MIT Deep Learning Basics: Introduction and Overview with TensorFlow": https://medium.com/tensorflow/mit-deep-learning-basics-introduction-and-overview-with-tensorflow-355bcd26baf0
- "Chapter 1: An Introduction to Deep Learning": http://www.deeplearningbook.org/contents/intro.html
- "Chapter 2: How the Backpropagation algorithm works?": http://neuralnetworksanddeeplearning.com/chap2.html
- "Deep Learning with PyTorch: A 60-Minute Blitz": https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
- "Neural Networks and Deep Learning": http://neuralnetworksanddeeplearning.com/chap1.html
- "Convolutional Neural Networks": http://cs231n.github.io/convolutional-networks/
- "3Blue1Brown Neural networks": https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi