# TP1: Not a PyTorch Tutorial

To work with neural networks we will use PyTorch. We will give a very brief review of the main elements that we are going to use. The objective is not to give a comprehensive PyTorch tutorial. If you want to learn more about PyTorch, you can start with the *official* [tutorials](https://pytorch.org/tutorials/) and [documentation](https://pytorch.org/docs/).

In [None]:
# This notebook can also run on colab (https://colab.research.google.com/)
# The following lines install the necessary packages in the colab environment
try: 
    from google.colab import files
    !pip install torch==0.4.0
    !pip install torchvision 
    !pip install Pillow==4.0.0
    !pip install scikit-image
    !pip install hdf5storage
    
    !pip install git+https://github.com/szagoruyko/pytorchviz
    !apt-get install graphviz
    
    !rm -fr MVA2018-denoising
    !git clone  https://github.com/gfacciol/MVA2018-denoising.git
    !cp -r MVA2018-denoising/* .

except ImportError:
    pass


## Setup code for the notebook

# These are all the includes used through the notebook
import numpy as np     
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from skimage import io # read and write images
import vistools        # image visualization toolbox

#%matplotlib notebook
# Autoreload external python modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

-----------------------------

### Data in torch: N dimensional tensors

torch stores data in tensors (n-dimensional arrays). Let's create some tensors and show some manipulations.

In [None]:
# let's import pytorch (or torch)
import torch

# 5x3 matrix of zeros
x = torch.zeros(5,3)
print('\nx = torch.zeros(5,3)')
print(x)

# a 3D tensor with random entries (3 'channels', 2 rows, 4 columns)
x = torch.rand(3,2,4)
print('\nx = torch.rand(3,2,4)')
print(x)

# check size
print('\nx.size()')
print(x.size())

# we can add, multiply tensors, etc
print('\nz = 2*torch.ones(x.size())')
z = 2*torch.ones(x.size())

print('\nz + x')
print(z + x)

print('\nz * x')
print(z * x) # element-wise product

# we can 'slice' tensors
print('\nx[1,:,:]')
print(x[1,:,:]) # second 'channel' of x

print('\nx[:,:,2]')
print(x[:,:,2]) # 3rd column of x
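
Later in this notebook, images are read as NumPy arrays and passed to torch as 4D tensors. As a minimal sketch of that round trip (the small array `a` is made up for illustration), the conversion in both directions could look like this:

```python
import numpy as np
import torch

# a made-up NumPy array standing in for a tiny grayscale image (2 rows, 3 columns)
a = np.arange(6, dtype=np.float32).reshape(2, 3)

# torch.from_numpy wraps the array without copying it
t = torch.from_numpy(a)

# add leading 'mini-batch' and 'channel' dimensions: 2x3 -> 1x1x2x3
t4 = t[None, None, :, :]
print(t4.size())

# back to NumPy, dropping the two singleton dimensions again
back = t4.squeeze().numpy()
print(back.shape)
```

Indexing with `None` plays the same role as `np.newaxis` does for NumPy arrays.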

### Functions and gradients

torch uses backpropagation to compute the gradients of any function of one or more tensors. We will define a simple function fun of two tensors, and compute the gradient with respect to each of them.

In [None]:
# we tell torch that we intend to compute a gradient with 
# respect to these tensors
w = torch.rand(1,3, requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# suppose we aren't interested in the grad of fun w.r.t. x 
x = torch.rand(1,3)

# let's show their values
print('\nw = ')
print(w)

print('\nx = ')
print(x)

print('\nb = ')
print(b)

# now let's define fun as fun(w,x,b) = <w, x> + b
fun = (w * x).sum() + b

# For torch, fun is a tensor, but it has kept track of the 
# operations performed so far, and is thus able to run 
# the back propagation algorithm to compute the gradients
print('\nfun = (w * x).sum() + b =')
print(fun)

# to run backprop, just call
fun.backward()

# The gradients of fun w.r.t. all parameters having 
# requires_grad=True have been computed. We can access them

print('\ndfun/dw =')
print(w.grad) # grad of fun w.r.t. w

print('\ndfun/db =')
print(b.grad) # grad of fun w.r.t. b

print('\ndfun/dx =')
print(x.grad) # None: x was created without requires_grad=True
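
Since fun(w,x,b) = <w, x> + b is linear in w and b, its gradients are known in closed form: dfun/dw = x and dfun/db = 1. The following sketch repeats the computation above with a fixed seed (added here only to make the run repeatable) and compares the result of backprop with these expressions:

```python
import torch

torch.manual_seed(0)  # fixed seed, only to make the run repeatable

w = torch.rand(1, 3, requires_grad=True)
b = torch.tensor(3., requires_grad=True)
x = torch.rand(1, 3)  # no requires_grad: torch does not store a gradient for x

fun = (w * x).sum() + b
fun.backward()

# compare the gradients from backprop with the closed-form ones
print(torch.allclose(w.grad, x))                  # dfun/dw = x
print(torch.allclose(b.grad, torch.tensor(1.)))   # dfun/db = 1
print(x.grad is None)                             # nothing was computed for x
```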

### Our first network

We have all the elements needed to define a neural network and train it: we can define functions of tensors and compute their gradients. In the following example, we will define an affine fully connected (fc) layer of a neural net as a class whose parameters are the weights and the biases.

**Note:** A *class* is just a way of bundling together data and the functions that use these data. In this case the data are going to be the weights and biases, and the function is the actual computation of the fc layer given an input x.

In [None]:
class fc_layer():
    """ 
    An example fc_layer class.
    """
    
    # The __init__ function initializes the class when it's created.
    # Every class has an __init__ function. Don't pay attention to the
    # self parameter.
    def __init__(self, in_size, out_size):
        
        # initialize random weights distributed as N(0,1) and 0 bias
        self.weight = torch.randn(out_size, in_size, requires_grad=True)
        self.bias = torch.zeros(out_size, 1, requires_grad=True)
        
    # We define a forward function that applies the fc layer
    # to a tensor x
    def forward(self, x):
        # torch.mm(A,B) is the matrix multiplication AB
        return torch.mm(self.weight, x) + self.bias

In [None]:
# We are done with the definition of our fc class. Let's use it to build
# a network with two layers (a hidden layer with a tanh non-linearity
# and an output layer).

fc1 = fc_layer(3, 5) # first fc layer
fc2 = fc_layer(5, 4) # output fc layer

# Let's look at a layer
print('Initial parameters of fc layer 1:')
print(fc1.weight)
print(fc1.bias)

# Let's test it on some input data
x = torch.randn(3,1)

# Run forward pass of the network
out = fc1.forward(x)   # first fc layer
out = torch.tanh(out)  # activation function
out = fc2.forward(out) # output fc layer

print("\nNetwork's output:")
print(out)


# During training, we know the desired output or 'label' y for x, and
# want to minimize the loss. 

y = torch.randn(4,1) # we invent a desired output

# Let's use the squared L2 norm as loss
loss = ((out - y)**2).sum()

print("\nLoss between network's output and label:")
print(loss)

# To compute the gradients, we backprop from the loss
loss.backward()

# gradients with respect to the first layer's params
print('\nGradient of the loss w.r.t. first layer params:')
print(fc1.weight.grad)
print(fc1.bias.grad)

# gradients with respect to the second layer's params
print('\nGradient of the loss w.r.t. second layer params:')
print(fc2.weight.grad)
print(fc2.bias.grad)

# Note how the gradients w.r.t. the first layer's weights are often
# smaller: they are scaled by the derivative of tanh, which is at most 1
# and close to 0 where tanh saturates

# To train, we loop over a data set extracting mini-batches of data.
# We compute the loss over each mini-batch (i.e. the loss is the sum 
# over the mini-batch, instead of a single x), and compute the gradients
# using backpropagation.
# The parameters are updated according to an update rule. In the case of
# gradient descent, the rule is simply the one below. The update is done
# in place and inside torch.no_grad(), so that the parameters remain
# tensors with requires_grad=True and the update itself is not tracked.
learning_rate = 1e-3
with torch.no_grad():
    fc1.weight -= learning_rate * fc1.weight.grad
    fc1.bias   -= learning_rate * fc1.bias.grad
    fc2.weight -= learning_rate * fc2.weight.grad
    fc2.bias   -= learning_rate * fc2.bias.grad
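
To see gradient descent at work over several iterations, here is a minimal sketch under simplifying assumptions: a single affine layer stored as plain tensors, and one made-up pair (x, y) instead of mini-batches from a data set. The seed, the learning rate and the number of steps are arbitrary choices for illustration. Note that torch accumulates gradients across calls to `backward()`, so they are reset to zero after each update.

```python
import torch

torch.manual_seed(0)  # fixed seed, only to make the run repeatable

# a single affine layer, stored as plain tensors as in fc_layer
weight = torch.randn(4, 3, requires_grad=True)
bias = torch.zeros(4, 1, requires_grad=True)

x = torch.randn(3, 1)  # made-up input
y = torch.randn(4, 1)  # made-up label

learning_rate = 1e-2
losses = []
for step in range(100):
    # forward pass and squared L2 loss
    out = torch.mm(weight, x) + bias
    loss = ((out - y)**2).sum()
    losses.append(loss.item())

    # backprop fills weight.grad and bias.grad
    loss.backward()

    # in-place gradient descent step, not tracked by autograd
    with torch.no_grad():
        weight -= learning_rate * weight.grad
        bias -= learning_rate * bias.grad

    # reset the accumulated gradients before the next iteration
    weight.grad.zero_()
    bias.grad.zero_()

print('first loss:', losses[0])
print('last loss: ', losses[-1])
```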

### A CNN using torch modules

torch has a series of modules to facilitate defining and training neural networks. The `torch.nn` package has a large number of useful classes implementing layers and combinations of layers. The previous fc net could be implemented using `torch.nn` layers as follows:

`fc1 = torch.nn.Linear(3,5)`<br>
`fc2 = torch.nn.Linear(5,4)`
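
As a short sketch of how these two layers could be used (the input `x` is made up for illustration): unlike the `fc_layer` class above, which takes column vectors, `torch.nn.Linear` expects the features along the last dimension, so a mini-batch is a matrix with one input vector per row. The layers are also initialized by torch's own default scheme, not with the N(0,1) weights used above.

```python
import torch

fc1 = torch.nn.Linear(3, 5)  # 3 inputs, 5 outputs
fc2 = torch.nn.Linear(5, 4)  # 5 inputs, 4 outputs

# the weight has the same out_size x in_size layout as in fc_layer,
# while the bias is a 1D tensor of length out_size
print(fc1.weight.size())
print(fc1.bias.size())

# a made-up mini-batch of 2 input vectors, one per row
x = torch.randn(2, 3)

# layers are called like functions; this runs their forward pass
out = fc2(torch.tanh(fc1(x)))
print(out.size())
```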

We will define our second network: a convolutional network with 2 layers and ReLU activations. In practice, it's useful (if not necessary) to encapsulate the whole network in a class. Let's show an example.

In [None]:
class simple_cnn(torch.nn.Module):
    """
    A network with two conv layers and a ReLU non-linearity.

    Note: the class inherits from torch.nn.Module. Don't worry if you are not
    familiar with classes and inheritance, you can ignore this comment. Essentially,
    it means that the functions and data of torch.nn.Module are automatically
    available here.

    It has the following parameters:
        - im_channels: number of input and output image channels
        - num_features: number of features (or channels) of the hidden layer
        - kernel_size: size of convolution kernels on both layers
    """

    def __init__(self, im_channels, num_features, kernel_size):
        super(simple_cnn, self).__init__() # related with the inheritance from nn.Module

        # create convolutional layers (parameters are initialized at random)
        self.conv1 = torch.nn.Conv2d(im_channels, num_features, kernel_size)
        self.conv2 = torch.nn.Conv2d(num_features, im_channels, kernel_size)

        # relu activation ('inplace' to overwrite input data with output)
        self.relu  = torch.nn.ReLU(inplace=True)

    def forward(self, x):

        # run the network
        out = self.conv1(x)
        out = self.relu(out)
        out = self.conv2(out)

        return out

# We can now create a network. For example one image channel, 3 hidden features, 5x5 kernels
net = simple_cnn(1,3,5)

In [None]:
# Let's print its parameters (check the sizes of the weights)
print('\n1st layer convolution kernels:')
print(net.conv1.weight.size())
print(net.conv1.weight)
print('\n1st layer biases:')
print(net.conv1.bias)
print('\n2nd layer convolution kernels:')
print(net.conv2.weight.size())
print(net.conv2.weight)
print('\n2nd layer biases:')
print(net.conv2.bias)
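
The sizes printed above follow a fixed layout: convolution weights are stored as out_channels x in_channels x kernel_height x kernel_width. The sketch below uses a stand-in for `simple_cnn(1,3,5)` written with `torch.nn.Sequential` (an assumption made only to keep the example self-contained) to count the parameters and to check the output size on a made-up 32x32 input. Since the convolutions use no padding, each 5x5 layer trims 4 pixels from the height and from the width.

```python
import torch

# stand-in for simple_cnn(1, 3, 5): same layers, written with nn.Sequential
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 3, 5),
    torch.nn.ReLU(),
    torch.nn.Conv2d(3, 1, 5),
)

# list every parameter tensor with its size
for name, p in net.named_parameters():
    print(name, tuple(p.size()))

# total number of trainable values: (3*1*5*5 + 3) + (1*3*5*5 + 1)
n_params = sum(p.numel() for p in net.parameters())
print(n_params)

# made-up input: 1 image, 1 channel, 32x32 pixels
x = torch.randn(1, 1, 32, 32)
with torch.no_grad():
    out = net(x)
print(out.size())
```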

In [None]:
# We will apply our network to an image.
# Try running this several times and changing parameters.
net = simple_cnn(1,5,3)

# Load an image to test the net on it
image = io.imread('datasets/BSD68/test002.png').astype('float32')

# Let's convert it to a 4D torch floating point tensor
# The input data to convolutional layers in torch has to be 4D
# since it expects a mini-batch, the size is:
# mini-batch_size x channels x height x width
img = torch.FloatTensor(image[np.newaxis, np.newaxis, :, :])

# Now we can test our network on the image. 
# The following line tells torch that we don't need gradients here,
# so it does not record the operations for backpropagation.
with torch.no_grad():
    # run the forward pass of the network
    out = net.forward(img)

# Let's scale the output's range to 0, 255.
out_scaled = (out - out.min())*255./(out.max() - out.min())

# Show as a gallery
vistools.display_gallery([out_scaled.numpy().clip(0,255), image], 
                         ['scaled output', 'input'])
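
The effect of `torch.no_grad()` can be checked on a small example. This is a sketch with a made-up layer and input: an output computed inside the block is not attached to the parameters, so it can be converted to NumPy directly, whereas a tracked output has to be detached first.

```python
import torch

conv = torch.nn.Conv2d(1, 2, 3)  # made-up layer: 1 -> 2 channels, 3x3 kernels
x = torch.randn(1, 1, 8, 8)      # made-up input

# regular forward pass: torch records the operations for backprop
out_tracked = conv(x)

# forward pass without recording anything
with torch.no_grad():
    out_untracked = conv(x)

print(out_tracked.requires_grad)
print(out_untracked.requires_grad)

# .numpy() refuses tensors that require grad, so detach() comes first
arr = out_tracked.detach().numpy()
print(arr.shape)
```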

### What's left?

We covered tensors, automatic differentiation using backprop, and the construction of neural nets using the `torch.nn` package.

torch has other very useful packages, but we won't cover them in this tutorial. We will use two of them in the rest of the TP:
- `torch.optim` is a package implementing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can easily be integrated in the future. It greatly simplifies the update of the weights.
- `torch.utils.data` is a set of tools for handling datasets, such as splitting a data set into training, evaluation and testing sets, loading mini-batches, etc.
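
To give a first idea of how these two packages fit together, here is a minimal sketch of a training loop. Everything specific in it is an assumption for illustration: the synthetic data set, the small fc network, the choice of `SGD`, the learning rate and the number of epochs.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # fixed seed, only to make the run repeatable

# synthetic data set: 64 inputs of size 3, targets given by a random linear map
inputs = torch.randn(64, 3)
targets = torch.mm(inputs, torch.randn(3, 4))

# the loader shuffles the data and serves mini-batches of 16 samples
loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)

# small fc network with a tanh hidden layer
net = torch.nn.Sequential(
    torch.nn.Linear(3, 5),
    torch.nn.Tanh(),
    torch.nn.Linear(5, 4),
)
optimizer = torch.optim.SGD(net.parameters(), lr=5e-2)
loss_fn = torch.nn.MSELoss()

def full_loss():
    # loss over the whole data set, computed without tracking gradients
    with torch.no_grad():
        return loss_fn(net(inputs), targets).item()

loss_before = full_loss()
for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()      # reset the accumulated gradients
        loss = loss_fn(net(x), y)  # forward pass on the mini-batch
        loss.backward()            # backprop
        optimizer.step()           # update all the parameters
loss_after = full_loss()

print('loss before training:', loss_before)
print('loss after training: ', loss_after)
```

The optimizer replaces the hand-written update of each weight and bias: `optimizer.step()` updates every parameter that was passed to it.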