## Deep Learning: Introduction

The goal of this exercise is to discover PyTorch. We will start with a simple 2D regression example. You will have the opportunity to experiment with the network architecture and the optimization algorithm.

In [None]:
import numpy as np

# PyTorch:
import torch

# For visualization:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

### A 2D function
We will first approximate the following 2D function with a 2-layer network:


In [None]:
def F(x1, x2):
  return np.sin(x1) * np.sin(x2)

You can use the code below to visualize the function:

In [None]:
def show_fn(fn, A):
  Size = 200
  step = 2 * A / Size
  U,V = np.mgrid[-A:+A:step, -A:+A:step]
  fn_X = fn(U, V)
  I = fn_X.reshape(Size, Size)
  plt.imshow(I, cmap = cm.Greys)

In [None]:
show_fn(F, np.pi)

We will now generate training samples using this function. Make sure you understand the code.

In [None]:
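nb_samples = 1000
# Samples are drawn uniformly from [-A, A] x [-A, A]:
A = np.pi
X_train = 2 * A * torch.rand(nb_samples, 2) - A 
y_train = np.vectorize(F)(X_train[:,0], X_train[:,1])
# Float column of shape (nb_samples, 1), to match the network output:
y_train = torch.tensor(y_train).float().reshape(-1, 1)
print(X_train[0:10])
print(y_train[0:10])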

### First prototype

Let's try a first prototype of a 2-layer network. This network has the structure:

$h_1 = \text{ReLU}(W_1 x) \> , h_2 = w_2^\top h_1$

(note there is no bias yet)

In [None]:
nb_neurons = 3
W1 = torch.randn(2, nb_neurons, requires_grad=True)
W2 = torch.randn(nb_neurons, 1, requires_grad=True)

# X_train.mm(W1) is xW1
# (PyTorch uses right side multiplication)
# here, W1 is applied to all the samples: 
h1 = X_train.mm(W1)
# clamp(min=0) is ReLU:
h1 = h1.clamp(min=0)

h2 = h1.mm(W2)

loss = (h2 - y_train).pow(2).sum() / nb_samples

**Question:** What are the size and type of $h_1$, $h_2$, and $loss$?  (Hint: you can use the .size() function)

**Question:** What is the current value of the loss? (Hint: you have to use the .item() function)
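
The two hints can be tried out on a small stand-in tensor first. The sketch below uses made-up values (the names `pred` and `target` are hypothetical and not part of the exercise). It also illustrates why the prediction and the target should have the same shape before they are subtracted:

```python
import torch

# Hypothetical stand-ins for a network output and its targets.
pred = torch.zeros(4, 1)                      # shape (4, 1), a column of outputs
target = torch.tensor([1.0, 2.0, 3.0, 4.0])   # shape (4,)

# Size and element type of each tensor:
print(pred.size(), pred.dtype)
print(target.size(), target.dtype)

# Broadcasting pitfall: (4, 1) - (4,) gives a (4, 4) tensor, not (4, 1).
diff_bad = pred - target
print(diff_bad.size())

# Reshaping the target to a column gives the intended elementwise difference.
diff_ok = pred - target.reshape(-1, 1)
print(diff_ok.size())

# A 0-dimensional tensor is converted to a Python float with .item().
mse = diff_ok.pow(2).mean()
print(mse.size(), mse.item())
```

If the shapes differ as in `diff_bad`, the loss is still a single number, so the mistake is easy to miss without checking `.size()`.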

### First optimization

Let's try to optimize our network with gradient descent:

In [None]:
learning_rate = 1e-5
for t in range(10):
  # the same code as before but in 1 line only:
  h1 = X_train.mm(W1).clamp(min=0)
  h2 = h1.mm(W2)
  loss = (h2 - y_train).pow(2).sum() / nb_samples
  print(t, loss.item())
  # Computes all the partial derivatives:
  loss.backward()
  # Does not include these computations in the derivative graph: 
  with torch.no_grad():
    W1 -= learning_rate * W1.grad
    W2 -= learning_rate * W2.grad
    # .grad.zero_() zeroes the gradients after calling backward()
    # Required because AutoGrad does not simply
    # replace the gradient values but accumulates (sums) them: 
    W1.grad.zero_()
    W2.grad.zero_()

**Make sure you understand the code above!**
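
The comment about gradient accumulation can be checked on a tiny example. This is a minimal sketch with a hypothetical two-element parameter `w`, not the network above:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# The gradient of sum(w**2) with respect to w is 2*w, i.e. [2., 4.].
(w ** 2).sum().backward()
grad_first = w.grad.clone()

# A second backward() without zeroing adds to the stored gradient.
(w ** 2).sum().backward()
grad_accumulated = w.grad.clone()

# After zeroing, backward() gives the plain gradient again.
w.grad.zero_()
(w ** 2).sum().backward()
grad_after_zero = w.grad.clone()

print(grad_first, grad_accumulated, grad_after_zero)
```

Without the calls to `.grad.zero_()`, each iteration of the training loop would therefore apply the sum of all gradients computed so far.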

**Question:** What happens when you use 1e-6 as learning rate?  1e-2?

### Adding the biases

**Question:** Starting from the code below, add the biases to the architecture of our network, i.e., the network should now have the structure:

$h_1 = \text{ReLU}(W_1 x + b_1) \> , h_2 = w_2^\top h_1 + b_2$


In [None]:
nb_neurons = 3
W1 = torch.randn(2, nb_neurons, requires_grad=True)
#b1 = ... What should be the size of b1?
W2 = torch.randn(nb_neurons, 1, requires_grad=True)
#b2 = ...
ones = torch.ones(nb_samples,1)

learning_rate = 1e-5
for t in range(10):
  # Note the use of ones. Why is it needed?
  b1_ = ones.mm(b1)
  # Change this line to introduce b1_:
  h1 = X_train.mm(W1).clamp(min=0)
  # Change this line to introduce b2_:
  h2 = h1.mm(W2)
  loss = (h2 - y_train).pow(2).sum() / nb_samples
  print(t, loss.item())
  # Computes all the partial derivatives:
  loss.backward()
  # Does not include these computations in the derivative graph: 
  with torch.no_grad():
    W1 -= learning_rate * W1.grad
    W2 -= learning_rate * W2.grad
    # What do you have to add here?

    # .grad.zero_() zeroes the gradients after calling backward()
    # Required because AutoGrad does not simply
    # replace the gradient values but accumulates (sums) them: 
    W1.grad.zero_()
    W2.grad.zero_()
    # What do you have to add here?
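
As a hint for the question about `ones`: multiplying a column of ones by a row vector copies that row once per sample. The sketch below uses toy sizes and a hypothetical bias row `b_row`, chosen only for illustration. It also shows that adding the row directly, through PyTorch broadcasting, gives the same result:

```python
import torch

n, k = 4, 3                                # toy sizes: 4 samples, 3 neurons
ones = torch.ones(n, 1)
b_row = torch.tensor([[0.1, 0.2, 0.3]])    # hypothetical bias stored as a (1, k) row

# (n, 1) x (1, k) -> (n, k): every row of the result equals b_row.
b_tiled = ones.mm(b_row)
print(b_tiled.size())

# Adding the (1, k) row directly relies on broadcasting and gives the same sum.
z = torch.zeros(n, k)
same = torch.equal(z + b_tiled, z + b_row)
print(same)
```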


### Using a class from PyTorch

We can use classes from PyTorch for standard architectures:

In [None]:
nb_samples = 1000
X_train = 2 * np.pi * torch.rand(nb_samples, 2) - np.pi
y_train = np.vectorize(F)(X_train[:,0], X_train[:,1])
# Float column of shape (nb_samples, 1), to match the network output:
y_train = torch.tensor(y_train).float().reshape(-1, 1)

In [None]:
nb_neurons = 3
twolayer_net = torch.nn.Sequential(
  torch.nn.Linear(2, nb_neurons),
  torch.nn.ReLU(),
  torch.nn.Linear(nb_neurons, 1),
)

We can also use a predefined loss function (MSE stands for Mean Squared Error; with `reduction='sum'`, the squared errors are summed rather than averaged):

In [None]:
loss_fn = torch.nn.MSELoss(reduction='sum')

and we can use an optimization method implemented by PyTorch:

In [None]:
# Using stochastic gradient descent for optimization:
optimizer = torch.optim.SGD(twolayer_net.parameters(), lr = 1e-7)
for t in range(10):
  # Forward pass:
  y_pred = twolayer_net(X_train)
  # Computes loss:
  loss = loss_fn(y_pred, y_train) 
  print(t, loss.item())
  # Resets the gradients accumulated during the previous iteration:
  optimizer.zero_grad()
  # Computes all the partial derivatives:
  loss.backward()
  # 1 iteration step of the optimizer:
  optimizer.step()
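
The effect of the `reduction` argument of `torch.nn.MSELoss` can be seen on a small example. This is a minimal sketch with made-up predictions and targets (`y_pred` and `y_true` here are hypothetical values, not the training data):

```python
import torch

y_pred = torch.tensor([[0.0], [1.0], [2.0]])
y_true = torch.tensor([[1.0], [1.0], [4.0]])

loss_sum = torch.nn.MSELoss(reduction='sum')(y_pred, y_true)
loss_mean = torch.nn.MSELoss(reduction='mean')(y_pred, y_true)

# The squared errors are 1, 0 and 4: their sum is 5, their mean is 5/3.
print(loss_sum.item(), loss_mean.item())
```

With `reduction='sum'`, the loss and its gradients grow with the number of samples, so a suitable learning rate is typically much smaller than with `reduction='mean'`.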

**Question:** Try to get a good approximation of the F function. You can use the code below to visualise the approximation by the network.

Things you can do:
- increase the number of iterations;
- change the number of neurons;
- change the optimizer (try using Adam);
- add more layers.

If the optimization diverges (i.e., you get nan for the loss function), you have to reinitialize the network by running the line `twolayer_net = ...` again, and then recreate the optimizer so that it refers to the parameters of the new network.
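
As one possible starting point for these suggestions, the sketch below combines a deeper network with the Adam optimizer. The layer sizes, learning rate, and number of iterations are assumptions chosen for illustration, not tuned values. The data is regenerated with `torch.sin` (which computes the same function as `F`) so that the block is self-contained:

```python
import numpy as np
import torch

torch.manual_seed(0)

# Same kind of training data as above, regenerated for self-containment.
nb_samples = 1000
X = 2 * np.pi * torch.rand(nb_samples, 2) - np.pi
y = (torch.sin(X[:, 0]) * torch.sin(X[:, 1])).reshape(-1, 1)

# Assumed architecture: two hidden layers of 50 neurons each.
net = torch.nn.Sequential(
  torch.nn.Linear(2, 50),
  torch.nn.ReLU(),
  torch.nn.Linear(50, 50),
  torch.nn.ReLU(),
  torch.nn.Linear(50, 1),
)

loss_fn = torch.nn.MSELoss()  # default reduction='mean'
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)  # assumed learning rate

losses = []
for t in range(300):
  loss = loss_fn(net(X), y)
  losses.append(loss.item())
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

# Loss at the first and at the last iteration:
print(losses[0], losses[-1])
```

The resulting network can be passed to `show_nn` in the same way as `twolayer_net` to compare it visually with the plot of `F`.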


In [None]:
def show_nn(nn, A):
  Size = 200
  step = 2.0 * A / Size
  U,V = np.mgrid[-A:+A:step, -A:+A:step]
  UV = np.vstack((U.flatten(), V.flatten())).T
  nn_X = nn(torch.tensor(UV).float()).detach().numpy()
  I = nn_X.reshape(Size, Size)
  plt.imshow(I, cmap = cm.Greys)

In [None]:
show_nn(twolayer_net, np.pi)

### Creating our own class

We can encapsulate our code into a class inheriting from `torch.nn.Module`:

In [None]:
class TwoLayerNet(torch.nn.Module): 
  def __init__(self, D_in, H, D_out):
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    h_relu = self.linear1(x).clamp(min=0) 
    y_pred = self.linear2(h_relu)
    return y_pred

    
# Instantiates the class defined above:
twolayer_net = TwoLayerNet(2, 20, 1)

The rest of the code remains the same.
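
For reference, here is a minimal self-contained sketch of how such a class might be plugged into the same kind of training loop. The class is repeated in compact form so that the block runs on its own. The learning rate and the number of iterations are assumptions, and the data is regenerated with `torch.sin`, which computes the same function as `F`:

```python
import numpy as np
import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super().__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    # clamp(min=0) is ReLU
    return self.linear2(self.linear1(x).clamp(min=0))

torch.manual_seed(0)
X = 2 * np.pi * torch.rand(1000, 2) - np.pi
y = (torch.sin(X[:, 0]) * torch.sin(X[:, 1])).reshape(-1, 1)

net = TwoLayerNet(2, 20, 1)

# Layers assigned as attributes in __init__ are registered automatically,
# so parameters() yields two weight matrices and two bias vectors.
shapes = [tuple(p.shape) for p in net.parameters()]
print(shapes)

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(net.parameters(), lr=1e-5)  # assumed learning rate
for t in range(10):
  loss = loss_fn(net(X), y)
  print(t, loss.item())
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
```

Because the module registers its layers, `net.parameters()` can be handed to any optimizer from `torch.optim` without listing the weights and biases by hand.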