# Lecture 2 - Basic Neural Network APIs

We will build a small two-layer neural network using PyTorch, following the example in the PyTorch documentation written by Justin Johnson: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html. All code here is based on that tutorial. I will describe the code in detail and add comments where necessary.

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

We will mimic the input shape of the MNIST dataset, where each image is $28\times28$ in size. Before using real datasets, we will learn the PyTorch APIs using randomly generated data. 

Most major deep learning frameworks expect image batches in either the $NCHW$ or the $NHWC$ format, where $N$ is the batch size, $C$ is the number of channels, $H$ is the height, and $W$ is the width of the data. PyTorch uses $NCHW$ by default. This is easiest to relate to images. For example, a grayscale square image with side length 10 would have $H=10, W=10$ and $C=1$. If we have 20 such images in each batch of data, $N=20$. 
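To make the two layouts concrete, here is a minimal sketch that builds the batch from the example above (20 grayscale images of side 10) and rearranges it from $NCHW$ to $NHWC$. The variable names are illustrative and not part of the original tutorial.

```python
import torch

# A batch of 20 single-channel 10x10 images in NCHW layout.
n, c, h, w = 20, 1, 10, 10
batch_nchw = torch.randn(n, c, h, w)
print(batch_nchw.shape)  # torch.Size([20, 1, 10, 10])

# The same data viewed as NHWC (channels last) by permuting the axes.
batch_nhwc = batch_nchw.permute(0, 2, 3, 1)
print(batch_nhwc.shape)  # torch.Size([20, 10, 10, 1])
```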

## Multi-Layer Perceptron (MLP) or Fully-Connected Network
### Pure Numpy Implementation

Let us start with the pure Numpy version of training a two-layer neural network. We're going to use dense/fully-connected layers in this exercise and create random data with the MNIST dimensions $28\times28$. So our flattened input dimension is $D_{in} = 28 \times 28 = 784$. MNIST has 10 classes, so $D_{out} = 10$. Let us consider one hidden layer with 100 neurons, $H=100$. Also, recall that neural networks optimize their parameters using gradients computed by backpropagation. Using pure Numpy, we have to implement the forward and backward passes and define the derivatives ourselves. We saw that torch has `autograd`. We'll look into it in a bit. For now, pure Numpy...
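As a quick illustration of where $D_{in} = 784$ comes from, the sketch below flattens a hypothetical batch of MNIST-sized images so that each image becomes one row. The array `images` is random stand-in data, not the real dataset.

```python
import numpy as np

# Hypothetical batch of 64 MNIST-sized grayscale images in NCHW layout.
images = np.random.randn(64, 1, 28, 28)

# Flatten everything except the batch dimension: 1 * 28 * 28 = 784.
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (64, 784)
```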

<img alt="2 Layer Neural Network" src="images/nn2layer.jpg" width=600>

In [None]:
N, D_in, H, D_out = 64, 784, 100, 10

# Create random input and output data simulating the MNIST dimensions.
# Note that the first dimension is the batch size 
# Note that D_in is 784, simulating the MNIST input dimensions flattened.
# Note that D_out is 10, for the 0-9 digits.
x = np.random.randn(N, D_in) 
y = np.random.randn(N, D_out)

In [None]:
print(x.shape)

In [None]:
print(y.shape)

In [None]:
# We have to create two weight matrices: one between input and hidden layers, 
# and another between hidden and output layers.

# We will randomly initialize weights, but there are many other weight initializations
# commonly used in practice.
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [None]:
print(w1.shape)

In [None]:
print(w2.shape)

These are the parameters of the neural network that we are trying to optimize.

Now, we will define the learning rate at which we do our optimizations. In the future we will see that learning rate can be actively changed during training based on the validation loss or other criteria.

In [None]:
learning_rate = 1e-6

Now, we have to iterate through the data and change the weights according to the predictions that we get.

Basic algorithm:

Repeat for each epoch:

1. Fetch the input `x`.
2. Do the forward pass: pass `x` to the hidden layer `h`, and then `h` to the output. The value generated at the output is the prediction $\tilde{y}$ (`y_pred` in the code).
3. Find the L2 (sum of squared errors) loss based on $y$ and $\tilde{y}$. The final loss is $\mathbb{L} = \sum_{n=1}^{N} \left \|\tilde{y}_n - y_n\right \|^2_2$.
4. Compute the gradients for backpropagation. Differentiating the final loss with respect to the prediction gives $2\times (\tilde{y}_n - y_n)$.
5. Backpropagate: multiply this gradient with the weights of the layers above to obtain the gradient at each earlier layer, down to the first layer.
6. After we have all the gradients, update each weight matrix using the equation `w = w - lr * gradient`, where `lr` is the learning rate and `gradient` is the gradient calculated for that weight matrix.
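Written out for our network, steps 4 and 5 amount to the chain of derivatives below. This is a sketch in matrix form, where $x$, $y$ and $\tilde{y}$ stack all $N$ samples as rows, $W_1$ and $W_2$ are the two weight matrices, $a$ (our notation, called `h_relu` in the code) is the hidden activation after ReLU, and $\odot$ is the elementwise product. The indicator uses $h \ge 0$ to match the `h < 0` mask used in the code.

$$
\begin{aligned}
h &= x W_1, \qquad a = \max(h, 0), \qquad \tilde{y} = a W_2 \\
\frac{\partial \mathbb{L}}{\partial \tilde{y}} &= 2\,(\tilde{y} - y) \\
\frac{\partial \mathbb{L}}{\partial W_2} &= a^\top \frac{\partial \mathbb{L}}{\partial \tilde{y}} \\
\frac{\partial \mathbb{L}}{\partial a} &= \frac{\partial \mathbb{L}}{\partial \tilde{y}}\, W_2^\top \\
\frac{\partial \mathbb{L}}{\partial h} &= \frac{\partial \mathbb{L}}{\partial a} \odot \mathbb{1}[h \ge 0] \\
\frac{\partial \mathbb{L}}{\partial W_1} &= x^\top \frac{\partial \mathbb{L}}{\partial h}
\end{aligned}
$$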

In [None]:
loss_as_list = []
NUM_EPOCHS = 500

for EPOCH in range(NUM_EPOCHS):
    # Forward pass: compute predicted y
    h = x.dot(w1) # first, multiply x with w1
    h_relu = np.maximum(h, 0) # here we remove negative values for ReLU activation.
    y_pred = h_relu.dot(w2) # multiply h with w2 to generate the predictions.

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    loss_as_list.append(loss)
    print("Epoch: {}, Loss: {}".format(EPOCH, loss))

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y) # Note that we have to manually define the derivatives
    grad_w2 = h_relu.T.dot(grad_y_pred) # First backprop step. 
    # Note that we are multiplying h_relu with the output gradients to directly find the gradients of w2.
    
    # Finding grad_w1 requires a few additional calculations.
    grad_h_relu = grad_y_pred.dot(w2.T) # One step in, we find gradients of hidden layer.
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0 # Derivative of ReLU: no gradient flows where h was negative.
    grad_w1 = x.T.dot(grad_h) # and finally compute the gradients of w1.

    # Update weights w1 and w2 based on the equation from step 6 in our algorithm.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    
# This will run the optimization algorithm for 500 epochs. 
# An epoch is where you traverse the entire dataset once. 
# In our case, we are sending the whole dataset at once. So each pass is one epoch.

In [None]:
plt.plot(loss_as_list, 'k')
_ = plt.title("Loss Curve")
_ = plt.xlabel("Epochs")
_ = plt.ylabel("Loss")

Awesome! The loss drops steadily, so our neural network learned to fit the targets, even though they are random data! We'll soon look at actual MNIST code, but for now, let's understand the APIs better.
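Since the backward pass above was written by hand, it is worth checking it once. The sketch below is one possible sanity check, not part of the original tutorial: it compares the hand-written gradients with central finite differences on a deliberately tiny network. The helper names `loss_and_grads` and `numerical_grad` and the small sizes are our own choices.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward pass and hand-written backward pass of the two-layer network."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

def numerical_grad(x, y, w1, w2, which, eps=1e-6):
    """Central-difference estimate of d(loss)/dw for w1 (which=0) or w2 (which=1)."""
    params = [w1.copy(), w2.copy()]
    w = params[which]
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        loss_plus = loss_and_grads(x, y, *params)[0]
        w[idx] = old - eps
        loss_minus = loss_and_grads(x, y, *params)[0]
        w[idx] = old
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Tiny hypothetical sizes keep the element-by-element loop fast.
rng = np.random.default_rng(0)
x_small, y_small = rng.standard_normal((4, 5)), rng.standard_normal((4, 2))
w1_small, w2_small = rng.standard_normal((5, 3)), rng.standard_normal((3, 2))

_, g1, g2 = loss_and_grads(x_small, y_small, w1_small, w2_small)
err_w1 = np.abs(g1 - numerical_grad(x_small, y_small, w1_small, w2_small, 0)).max()
err_w2 = np.abs(g2 - numerical_grad(x_small, y_small, w1_small, w2_small, 1)).max()
print("max abs error, w1:", err_w1)
print("max abs error, w2:", err_w2)
```

Both errors should be close to zero. A large value would point to a mistake in one of the derivative lines.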

### PyTorch Tensors

The pure Numpy code can be improved by using PyTorch tensors. This allows for GPU compute and easier integration with other powerful APIs within PyTorch. We will still be doing forward and backpropagation by hand, since we are not using `autograd` methods yet.
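The translation from NumPy to PyTorch is mostly a matter of renaming calls. The following sketch, using small made-up arrays, shows the correspondences we rely on in this section and checks that both libraries give the same numbers.

```python
import numpy as np
import torch

a_np = np.random.randn(3, 4)
b_np = np.random.randn(4, 2)

# torch.from_numpy wraps a NumPy array as a CPU tensor that shares its memory.
a_t = torch.from_numpy(a_np)
b_t = torch.from_numpy(b_np)

# NumPy's .dot() corresponds to Tensor.mm() for 2-D inputs.
prod_np = a_np.dot(b_np)
prod_t = a_t.mm(b_t)
print(np.allclose(prod_np, prod_t.numpy()))

# np.maximum(h, 0) corresponds to Tensor.clamp(min=0), i.e. ReLU.
relu_np = np.maximum(prod_np, 0)
relu_t = prod_t.clamp(min=0)
print(np.allclose(relu_np, relu_t.numpy()))

# .T and .copy() in NumPy correspond to .t() and .clone() in PyTorch.
print(a_t.t().shape, a_t.clone().shape)
```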

In [None]:
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("YEAH")
else:
    device = torch.device('cpu')

N, D_in, H, D_out = 64, 784, 100, 10

# Create random input and output data
# Instead of Numpy arrays, we will initialize PyTorch tensors.
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
# Note the changes in APIs.
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

loss_as_list = []
learning_rate = 1e-6
NUM_EPOCHS = 500

for EPOCH in range(NUM_EPOCHS):
    # Forward pass: compute predicted y
    h = x.mm(w1) # We will use tensor.mm() to do matrix multiplications.
    h_relu = h.clamp(min=0) # We can use the tensor.clamp() function to apply ReLU.
    y_pred = h_relu.mm(w2)

    # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
    # of shape (); we get its value as a Python number with loss.item(), which
    # also lets matplotlib plot the list when the tensors live on the GPU.
    loss = (y_pred - y).pow(2).sum()
    loss_as_list.append(loss.item())
    print("Epoch: {}, Loss: {}".format(EPOCH, loss.item()))

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Note that the equations are almost the same; mainly the API calls changed.
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

In [None]:
plt.plot(loss_as_list, 'k')
_ = plt.title("Loss Curve")
_ = plt.xlabel("Epochs")
_ = plt.ylabel("Loss")

### PyTorch Autograd

In [None]:
N, D_in, H, D_out = 64, 784, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

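Recall that setting `requires_grad=True` tells PyTorch to track every operation on these tensors. Each tensor produced by such an operation carries a `grad_fn` attribute recording how it was computed. This way, calling `backward()` on the loss computes the gradients automatically and stores them in `w1.grad` and `w2.grad`, without us manually defining the derivatives.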
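A small sketch can make this concrete. It uses a single hypothetical weight matrix `w_demo` (a simplification of our network) and compares the gradient that autograd produces with the one we would have written by hand.

```python
import torch

torch.manual_seed(0)
x_demo = torch.randn(4, 5)
y_demo = torch.randn(4, 2)
w_demo = torch.randn(5, 2, requires_grad=True)

# A tensor created directly by the user is a leaf and has no grad_fn...
print(w_demo.grad_fn)  # None
# ...but results of operations on it record how they were computed.
loss_demo = (x_demo.mm(w_demo) - y_demo).pow(2).sum()
print(loss_demo.grad_fn is not None)  # True

# backward() fills w_demo.grad with d(loss)/d(w_demo).
loss_demo.backward()

# The same gradient written out by hand, as in the earlier sections.
with torch.no_grad():
    manual_grad = x_demo.t().mm(2.0 * (x_demo.mm(w_demo) - y_demo))
print(torch.allclose(w_demo.grad, manual_grad, atol=1e-5))
```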

In [None]:
learning_rate = 1e-6
NUM_EPOCHS = 500
loss_as_list = []

for EPOCH in range(NUM_EPOCHS):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

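    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value. We store the number rather than the
    # Tensor so the list can be plotted and keeps no reference to the graph.
    loss = (y_pred - y).pow(2).sum()
    loss_as_list.append(loss.item())
    print("Epoch: {}, Loss: {}".format(EPOCH, loss.item()))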

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()

In [None]:
plt.plot(loss_as_list, 'k')
_ = plt.title("Loss Curve")
_ = plt.xlabel("Epochs")
_ = plt.ylabel("Loss")
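The training loop above zeroes `w1.grad` and `w2.grad` after every update because `backward()` adds new gradients to whatever is already stored in `.grad`. The following minimal sketch, with a hypothetical three-element tensor `w_acc`, shows this accumulation.

```python
import torch

w_acc = torch.ones(3, requires_grad=True)

# First backward pass: the derivative of sum(2 * w) is 2 for every element.
(2 * w_acc).sum().backward()
first = w_acc.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old.
(2 * w_acc).sum().backward()
second = w_acc.grad.clone()

# After zeroing in place, the next backward pass starts from zero again.
w_acc.grad.zero_()
(2 * w_acc).sum().backward()
third = w_acc.grad.clone()

print(first, second, third)
```

Here `first` and `third` hold 2 in every entry, while `second` holds 4. Without the `zero_()` calls, each weight update in the loop would use the sum of all past gradients.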

### PyTorch nn Module

The PyTorch `nn` module (https://pytorch.org/docs/stable/nn.html) provides APIs for linear layers, activations, convolution, pooling, etc., abstracting away the manual building of the neural network architecture. In our case, the network has an input layer, one hidden layer, and an output layer, connected by two fully-connected layers of weights (which is why we call it a two-layer network). We can use `nn.Linear()` to build fully-connected linear layers. ReLU activation is available as `nn.ReLU()` and does not need any arguments. Overall, a neural network can be easily built as a sequence of layers by using the `nn.Sequential()` container. Once the model is defined, we can move it to the correct device by calling `.to(device)`, where the device is either the CPU or a GPU.
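Before training such a model, it can help to look at what it contains. The sketch below builds the same architecture under the illustrative name `demo_model` and lists its learnable tensors. Note that, unlike our hand-written version, each `nn.Linear` layer also holds a bias vector.

```python
import torch

demo_model = torch.nn.Sequential(
    torch.nn.Linear(784, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)

# named_parameters() yields every learnable tensor; ReLU contributes none.
for name, param in demo_model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# nn.Linear stores its weight as (out_features, in_features) plus a bias vector.
num_params = sum(p.numel() for p in demo_model.parameters())
print(num_params)  # 784*100 + 100 + 100*10 + 10 = 79510

# A batch of 64 flattened inputs maps to 64 rows of 10 outputs.
out = demo_model(torch.randn(64, 784))
print(out.shape)  # torch.Size([64, 10])
```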

In [None]:
N, D_in, H, D_out = 64, 784, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='mean' (the default).
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4 # Note that we increased the learning rate here.
NUM_EPOCHS = 500
loss_as_list = []

for EPOCH in range(NUM_EPOCHS):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the loss.
    # We store loss.item(), a plain Python number, so the list can be plotted.
    loss = loss_fn(y_pred, y)
    loss_as_list.append(loss.item())
    print("Epoch: {}, Loss: {}".format(EPOCH, loss.item()))

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can update it in place and access its gradient like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

In [None]:
plt.plot(loss_as_list, 'k')
_ = plt.title("Loss Curve")
_ = plt.xlabel("Epochs")
_ = plt.ylabel("Loss")

### PyTorch Optim Method

So far, we have updated the weights manually. Even when we used `autograd`, we had to update the weights ourselves using `w1 -= learning_rate * w1.grad`. PyTorch has an `optim` package that supports various optimization algorithms such as the popular stochastic gradient descent (SGD), Adam, etc. (https://pytorch.org/docs/stable/optim.html). In this example, we shall see how we can use the Adam (adaptive moment estimation) optimizer to carry out the weight updates automatically.
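To see that an optimizer does nothing mysterious, the sketch below uses plain `torch.optim.SGD` (rather than Adam, because its update rule is exactly the one we wrote by hand) and checks that one call to `step()` matches `w - lr * grad`. The tensor names and the learning rate are made up for this demonstration.

```python
import torch

torch.manual_seed(0)
lr = 0.1  # hypothetical learning rate for this demo
w_opt = torch.randn(5, 2, requires_grad=True)
x_opt, y_opt = torch.randn(4, 5), torch.randn(4, 2)

# The optimizer is told which tensors it is responsible for updating.
optimizer_demo = torch.optim.SGD([w_opt], lr=lr)

loss_opt = (x_opt.mm(w_opt) - y_opt).pow(2).sum()
optimizer_demo.zero_grad()
loss_opt.backward()

# What the manual rule w = w - lr * grad would produce.
expected = (w_opt - lr * w_opt.grad).detach().clone()

# The optimizer applies the same update in place.
optimizer_demo.step()
print(torch.allclose(w_opt.detach(), expected))
```

Adam follows the same `zero_grad()`, `backward()`, `step()` pattern, but scales each update using running estimates of the gradient's first and second moments.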

In [None]:
N, D_in, H, D_out = 64, 784, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        )
loss_fn = torch.nn.MSELoss(reduction='sum')

# The learning rate must be defined before it is passed to the optimizer.
learning_rate = 1e-4
NUM_EPOCHS = 500
loss_as_list = []

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for EPOCH in range(NUM_EPOCHS):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss. We store loss.item(), a plain Python number.
    loss = loss_fn(y_pred, y)
    loss_as_list.append(loss.item())
    print("Epoch: {}, Loss: {}".format(EPOCH, loss.item()))

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()

In [None]:
plt.plot(loss_as_list, 'k')
_ = plt.title("Loss Curve")
_ = plt.xlabel("Epochs")
_ = plt.ylabel("Loss")