In [29]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.figsize'] = (9,9)
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.datasets as dsets
import torchvision.transforms as transforms
from torch.autograd import Variable
np.random.seed(0)

![simpleresnet.png](simpleresnet.png)

This exercise uses a simple implementation of a deep neural network to explore the vanishing gradient problem.

In [292]:
# Choose an activation function
activation = torch.tanh

# Choose a number of iterations
n = 4

# Store the feed-forward steps
w_list = []
z_list = []
a_list = []

# Make up some data
z_obs = torch.tensor([1.0])

# Initial value
x = torch.tensor([10.],requires_grad=True)
z_prev = x

# Loop over a number of hidden layers
for i in range(n):
    # New weight
    w_i = torch.tensor([1.0],requires_grad=True)

    # Linear transform
    a_i = z_prev*w_i

    # Activation
    zprime_i = activation(a_i)

    #TODO: replace the line below with one that would add a skip connection
    z_i = zprime_i
    
    # Store forward model stuff
    w_list.append(w_i)
    z_list.append(zprime_i)
    a_list.append(a_i)

    # output of layer i becomes input for layer i+1
    z_prev = z_i

# Objective function
L = 0.5*(z_i - z_obs)**2

# Reverse-mode AD
L.backward()

# Print each weight's gradient
print([w_.grad for w_ in w_list])


[tensor([-0.]), tensor([-0.0727]), tensor([-0.1319]), tensor([-0.1892])]
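
One way to see why the earliest weights receive the smallest gradients is to apply the chain rule by hand. The sketch below is a minimal illustration and not part of the original exercise. It rebuilds the same chain of tanh layers without a skip connection, and the helper name `chain_gradients` and its arguments are assumptions introduced here. Each backward step multiplies the upstream gradient by tanh'(a_i) = 1 - z_i^2, which is never larger than 1 and is essentially 0 for the saturated first layer (a_1 = 10).

```python
import torch

def chain_gradients(x0=10.0, n=4, z_obs=1.0):
    """Compare autograd with a hand-written chain rule for n tanh layers.

    Hypothetical helper: mirrors the cell above (unit weights, no skip
    connection) but is not taken from the original notebook.
    """
    weights = [torch.tensor([1.0], requires_grad=True) for _ in range(n)]
    inputs, outputs = [], []

    # Forward pass: z_i = tanh(z_{i-1} * w_i)
    z = torch.tensor([x0])
    for w in weights:
        inputs.append(z)
        z = torch.tanh(z * w)
        outputs.append(z)

    loss = 0.5 * (z - z_obs) ** 2
    loss.backward()
    autograd = [w.grad.item() for w in weights]

    # Backward pass by hand, from the last layer to the first
    manual = []
    with torch.no_grad():
        upstream = outputs[-1] - z_obs              # dL/dz_n
        for i in reversed(range(n)):
            local = 1 - outputs[i] ** 2             # tanh'(a_i)
            manual.append((upstream * local * inputs[i]).item())  # dL/dw_i
            upstream = upstream * local * weights[i]               # dL/dz_{i-1}
    manual.reverse()
    return autograd, manual

autograd, manual = chain_gradients()
for i, (g_auto, g_manual) in enumerate(zip(autograd, manual), start=1):
    print(f"layer {i}: autograd={g_auto:.4f}, manual={g_manual:.4f}")
```

Because every extra layer contributes another factor of at most 1, the product can only shrink as it travels toward the input, which is the pattern visible in the printed gradients.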


Now that we have seen how skip connections appear to solve the vanishing gradient problem, we have learned all we can from the paper. Let's look at some applications.

------------

Below is a simple example of an image processing problem where the vanishing gradient becomes an issue (no need to demonstrate it this time).

The training and test sets consist of randomly generated images. If these small problems are too easy, feel free to increase the size of the datasets to make them more challenging.

After you finish the conceptual questions below, feel free to change the architecture of the net below. Make 3 changes to the architecture, record the difference in loss after 100 iterations, and come up with a justification for that difference.

In [293]:
# basic net class
class Net(nn.Module):
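    def __init__(self, num_input_images):
        super(Net, self).__init__()
        # the batch size is needed to reshape the conv output in forward()
        self.num_input_images = num_input_images
        # 1 input channel -> 5 feature maps, 3x3 kernel (28x28 -> 26x26)
        self.conv1 = nn.Conv2d(1, 5, 3)
        # flattened feature maps -> 10 class scores
        self.linearization = nn.Linear(5*26*26,10)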
        
    def forward(self, x):
        # convolution
        x = self.conv1(x)
        # activation
        x = F.relu(x)
        # the output feature maps need to be flattened for the linear layer
        x = x.view(self.num_input_images, 5*26*26)
        # map the flattened features to 10 class scores
        x = self.linearization(x)
        return x    
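
The `5*26*26` input size of the linear layer comes from the convolution arithmetic: a 3x3 kernel with stride 1 and no padding trims one pixel from each border of a 28x28 image. As a quick check, the sketch below computes the output side length with the standard formula and compares it with what `nn.Conv2d` actually returns. The helper `conv_output_size` and the batch size of 4 are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Side length of a conv output (hypothetical helper, dilation = 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

side = conv_output_size(28, 3)             # 28 - 3 + 1
conv = nn.Conv2d(1, 5, 3)                  # same layer shape as Net.conv1
features = conv(torch.randn(4, 1, 28, 28)) # batch of 4 random images
flat = features.flatten(start_dim=1)       # one row per image

print(side, tuple(features.shape), tuple(flat.shape))
```

If you change the kernel size, stride, padding, or number of output channels, the first argument of `nn.Linear` has to change to match the flattened width.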

In [298]:
num_input_images = 100
num_epochs = 20
num_classes = 10

# Everyone's playing with the same seed, same data
torch.manual_seed(0)
rand_train_data = torch.randn(num_input_images, 1, 28, 28)
rand_train_labels = torch.LongTensor(num_input_images).random_(0, 10)
rand_test_data = torch.randn(num_input_images, 1, 28, 28)
rand_test_labels = torch.LongTensor(num_input_images).random_(0, 10)

learning_rate = 1e-3  # Step size for the optimizer's updates

# net class
net = Net(num_input_images)

# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

In [303]:
for epoch in range(num_epochs):
    optimizer.zero_grad() # Reset the accumulated gradients to zero
    outputs = net(rand_train_data) # Forward pass: compute class scores for each image
    loss = criterion(outputs, rand_train_labels) # Compute the loss: cross-entropy between the class scores and the given labels
    loss.backward() # Backward pass: compute the gradient of the loss with respect to each weight
    optimizer.step() # Update the weights using the gradients
    with torch.no_grad(): # No gradients are needed to evaluate on the test set
        test_output = net(rand_test_data)
        test_loss = criterion(test_output, rand_test_labels)
    print(test_loss.item())

2.7982895374298096
2.7994678020477295
2.8006298542022705
2.8017754554748535
2.8029046058654785
2.804018974304199
2.8051183223724365
2.8062021732330322
2.807271957397461
2.808326005935669
2.809366464614868
2.8103935718536377
2.811405658721924
2.8124053478240967
2.8133926391601562
2.8143680095672607
2.8153305053710938
2.8162829875946045
2.8172245025634766
2.8181543350219727
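
When comparing architecture changes, it helps to record the training loss alongside the test loss, since the loop above prints only the test loss. The sketch below is one possible way to do that and is not the notebook's own method. The function `record_losses`, the small linear baseline model, and the dataset size of 32 are all assumptions chosen to keep the example self-contained.

```python
import torch
import torch.nn as nn

def record_losses(model, train_x, train_y, test_x, test_y, n_iters=100, lr=1e-3):
    """Train with Adam and return per-iteration train and test losses.

    Hypothetical helper for comparing architectures on the same data.
    """
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = {"train": [], "test": []}
    for _ in range(n_iters):
        optimizer.zero_grad()
        train_loss = criterion(model(train_x), train_y)  # loss before the update
        train_loss.backward()
        optimizer.step()
        with torch.no_grad():
            test_loss = criterion(model(test_x), test_y)  # loss after the update
        history["train"].append(train_loss.item())
        history["test"].append(test_loss.item())
    return history

# Random images and labels, in the same spirit as the data above
torch.manual_seed(0)
train_x, test_x = torch.randn(32, 1, 28, 28), torch.randn(32, 1, 28, 28)
train_y, test_y = torch.randint(0, 10, (32,)), torch.randint(0, 10, (32,))

baseline = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
history = record_losses(baseline, train_x, train_y, test_x, test_y)
print("train: %.4f -> %.4f" % (history["train"][0], history["train"][-1]))
print("test:  %.4f -> %.4f" % (history["test"][0], history["test"][-1]))
```

Passing a different `model` with the same data and seed gives a like-for-like loss comparison after 100 iterations. Keep in mind that the labels here are random, so a falling training loss reflects memorization and not generalization.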


**Questions**

1. What is the vanishing gradient problem, and what is its primary cause?

2. What are 4 limitations to optimizing a deep convolutional neural network?

3. In terms of how a given block of a network is "fitted", what is the key difference between using skip connections and traditional blocks?

4. In the context of model hyper-parameters, what additional parameters are added in the ResNet implementation?

5. How do skip connections resolve the "vanishing gradient" problem?

6. Give an appropriate analogy for how kernels are used to extract features from images (e.g. sanding wood).

7. Max's questions: Was this a good paper when it was released? Is it a good paper now? What has changed between its initial release and now? What other methods are there for solving the vanishing gradient problem?