# Feed forward network

In this part of the assignment we will develop our own building blocks for constructing a feed forward network.
We will follow a modular approach so that we can use these building blocks in a feed forward architecture of our choice.

We will follow the logic of computation graphs: the layers and the loss act as compute nodes, each working only with local information and communicating with its upstream and downstream blocks.

Instead of defining the forward and backward steps as functions that need to pass around cached variables, we will implement the compute nodes as stateful objects, i.e. instances of Python classes with `forward` and `backward` methods.
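
To make the idea of a stateful compute node concrete, here is a minimal sketch. The `Square` class and its attribute names (`ins`, `outs`, `ins_grad`) are made up for illustration and are not taken from `ann_code/layers.py`; the point is only that the node caches what it needs in `forward` and reuses it in `backward`.

```python
import torch


class Square:
    """Hypothetical compute node y = x**2 that keeps its own state."""

    def __init__(self):
        self.ins = None       # inputs seen in the last forward pass
        self.outs = None      # outputs produced in the last forward pass
        self.ins_grad = None  # gradient of the loss w.r.t. the inputs

    def forward(self, ins):
        # cache the inputs so that backward does not need them as arguments
        self.ins = ins
        self.outs = ins ** 2
        return self.outs

    def backward(self, upstream_grad):
        # chain rule: local gradient (2 * x) times the upstream gradient
        self.ins_grad = 2 * self.ins * upstream_grad
        return self.ins_grad


node = Square()
x_sq = torch.tensor([[1.0, -2.0, 3.0]])
y_sq = node.forward(x_sq)                    # values 1, 4, 9
g_sq = node.backward(torch.ones_like(y_sq))  # values 2, -4, 6
```

No cache is passed between the two calls: everything `backward` needs already lives on the object.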

We will then construct a 2 layer neural network and use our newly developed functionality to predict the target values and compute the parameter gradients.

Work with the code in `ann_code/layers.py` and complete it as instructed below.

In [1]:
# necessary initialization
%load_ext autoreload
%autoreload 2

import torch

In [2]:
# load data
from ann_code.helpers import load_data
in_data, labels = load_data(filename='./ann_data/toy_data.csv') # correct filename if necessary

# get data dimensions
num_inst, num_dim = in_data.shape
print(f"Number of instances: {num_inst}, input dimensions: {num_dim}.")

Number of instances: 90, input dimensions: 3.


## 1) Forward pass

We first work on the forward pass functionality of our layer objects.

### Linear layer

We start by defining the linear layer.
Complete the `__init__` and `forward` methods of the `Linear` class in `ann_code/layers.py`.

Each instance shall be initialized with the linear function parameters (weight and bias) stored as instance attributes.
The other local information (inputs, outputs and their gradients) shall also be defined as instance attributes and will be populated by the `forward` and `backward` methods.
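
As a rough sketch of the computation the linear layer performs, assume the shape convention used in the check cell below: the weight has shape `(out_features, in_features)` and the bias has shape `(1, out_features)`. The helper name `linear_forward` and the toy numbers are illustrative only and are not part of `ann_code/layers.py`.

```python
import torch


def linear_forward(ins, W, b):
    # ins: (num_inst, in_features)
    # W:   (out_features, in_features)
    # b:   (1, out_features), broadcast over the instances
    return ins @ W.T + b


x_lin = torch.tensor([[1.0, 2.0, 3.0],
                      [0.0, -1.0, 1.0]])
W_lin = torch.tensor([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 1.0]])
b_lin = torch.tensor([[0.5, -0.5]])

out_lin = linear_forward(x_lin, W_lin, b_lin)
print(out_lin)  # rows: [1.5, 4.5] and [0.5, -0.5]
```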

In [3]:
# after implementing Linear class, check it here
from ann_code.layers import Linear

# initiate w and b buffers
# we use these for initiating the model parameters instead of the usual random init
# this is to make sure that your results and mine match
w_buffer = torch.logspace(start=0.1, end=10, steps=1000)
b_buffer = torch.logspace(start=0.1, end=10, steps=1000, base=2)

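# linear layer dimensions
in_features = num_dim
out_features = 10
# W has shape (out_features, in_features), b has shape (1, out_features)
W = w_buffer[:out_features * in_features].view(out_features, in_features)
b = b_buffer[:out_features].view(1, out_features)
linearlayer = Linear(W, b)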

# forward pass in_data through the layer
outputs = linearlayer.forward(in_data)

# check outputs for the first two data instances
print(f'Your outputs {outputs[:2,:]}')

Your outputs tensor([[ 1.0220,  1.0258,  1.0295,  1.0329,  1.0361,  1.0391,  1.0418,  1.0441,
          1.0462,  1.0479],
        [-0.4527, -0.5533, -0.6615, -0.7779, -0.9030, -1.0374, -1.1819, -1.3370,
         -1.5037, -1.6827]])


Expected outputs

`tensor([[ 1.0220,  1.0258,  1.0295,  1.0329,  1.0361,  1.0391,  1.0418,  1.0441,
          1.0462,  1.0479],
        [-0.4527, -0.5533, -0.6615, -0.7779, -0.9030, -1.0374, -1.1819, -1.3370,
         -1.5037, -1.6827]])`

### ReLU nonlinearity

We next define the class for the Rectified Linear Unit, which is an element-wise operation defined as $\sigma(x) = \max(0, x)$.

Complete the `forward` method of the `Relu` class in `ann_code/layers.py`. Note that in this case, there are no parameters that should be included in the object instances as initial states.
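
The element-wise definition can be sketched in one line with `torch.clamp`. The function name `relu_forward` and the input values are invented for this illustration.

```python
import torch


def relu_forward(ins):
    # element-wise max(0, x): negative entries become 0, the rest pass through
    return torch.clamp(ins, min=0)


x_relu = torch.tensor([[0.5, -1.0, 2.0],
                       [-0.1, 0.0, 3.0]])
out_relu = relu_forward(x_relu)
print(out_relu)  # rows: [0.5, 0.0, 2.0] and [0.0, 0.0, 3.0]
```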

In [4]:
# After implementing Relu class, check it here
from ann_code.layers import Relu

# relu instance
relu1 = Relu()

# forward pass in_data through the layer
outputs = relu1.forward(in_data)

# check outputs for the first two data instances
print(f'Your outputs {outputs[:2,:]}')

Your outputs tensor([[0.8872, 0.0000, 0.3707],
        [0.0000, 1.3094, 0.0000]])


Expected outputs

`tensor([[0.8872, 0.0000, 0.3707],
        [0.0000, 1.3094, 0.0000]])`

### Define 2 layer network

We use the `Linear` and `Relu` classes to create a network with the following architecture.
We combine the layers through the `Model` class that I defined for you in `ann_code/layers.py`.

We will add the MSE loss in a later step; for now, do just the forward pass through the layers to obtain the predictions.
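
The forward pass through a stack of layers amounts to feeding each layer's output into the next one. The sketch below shows this with hypothetical stand-in classes (`TinyLinear`, `TinyRelu`, `TinyModel`); the real `Model` class in `ann_code/layers.py` may be organized differently.

```python
import torch


class TinyLinear:
    """Hypothetical stand-in for a linear layer (forward pass only)."""

    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, ins):
        return ins @ self.W.T + self.b


class TinyRelu:
    """Hypothetical stand-in for the ReLU layer (forward pass only)."""

    def forward(self, ins):
        return torch.clamp(ins, min=0)


class TinyModel:
    """Hypothetical container that chains the layers' forward passes."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, ins):
        outs = ins
        for layer in self.layers:
            outs = layer.forward(outs)
        return outs


torch.manual_seed(0)
x_demo = torch.randn(5, 3)                                   # 5 instances, 3 features
demo_lin1 = TinyLinear(torch.randn(4, 3), torch.randn(1, 4)) # 3 -> 4
demo_lin2 = TinyLinear(torch.randn(1, 4), torch.randn(1, 1)) # 4 -> 1
demo_model = TinyModel([demo_lin1, TinyRelu(), demo_lin2])

demo_pred = demo_model.forward(x_demo)
print(demo_pred.shape)  # torch.Size([5, 1])
```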

<center><img src="net_diagram.png"></center>



In [5]:
# work with Model class to do the forward pass through the network
from ann_code.layers import Model

W2=w_buffer[:12].view(4,3)
b2=b_buffer[:4].view(1,4)
lin1=Linear(W2,b2)
relu1=Relu()
W3=w_buffer[12:16].view(1,4)
b3=b_buffer[4:5].view(1,1)
lin2=Linear(W3,b3)

layers = [lin1,relu1,lin2]

model = Model(layers)
ypred = model.forward(in_data)


# check outputs for the first two data instances
print(f'Your outputs {ypred[:2,:]}')

Your outputs tensor([[8.1458],
        [1.1016]])


Expected output

`tensor([[8.1458],
        [1.1016]])`

## 2) MSE loss

We use the MSE loss functions defined in `ann_code/linear_regression.py` to get the MSE loss for our predictions and its gradient with respect to the predictions.
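
For orientation, here is a sketch of what an MSE forward and backward pair typically computes. It assumes the plain mean of squared errors, $\frac{1}{N}\sum_i (\hat{y}_i - y_i)^2$; the actual `mse_forward` and `mse_backward` in `ann_code/linear_regression.py` are not shown here and may differ in scaling (for example an extra factor of 1/2) or in what they return. The function names with the `_sketch` suffix are hypothetical.

```python
import torch


def mse_forward_sketch(ypred, y):
    # ypred and y are assumed to have the same shape, e.g. (num_inst, 1)
    diff = ypred - y
    loss = (diff ** 2).mean()
    cache = (diff,)  # keep what the backward step needs
    return loss, cache


def mse_backward_sketch(cache):
    (diff,) = cache
    # d/d ypred of mean(diff**2) is 2 * diff / N
    return 2 * diff / diff.numel()


pred_mse = torch.tensor([[1.0], [3.0]])
y_mse = torch.tensor([[0.0], [1.0]])
loss_mse, cache_mse = mse_forward_sketch(pred_mse, y_mse)
grad_mse = mse_backward_sketch(cache_mse)
print(loss_mse)  # tensor(2.5000)
print(grad_mse)  # rows: [1.0] and [2.0]
```

If the predictions have shape `(num_inst, 1)` while the labels have shape `(num_inst,)`, the subtraction broadcasts to `(num_inst, num_inst)`, so it is worth checking that both shapes agree.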

In [6]:
# use mse functions defined for linear regression to get the MSE and gradient with respect to predictions
from ann_code.linear_regression import mse_forward, mse_backward

loss, mse_cache = mse_forward(ypred, labels)
ypredgrad, _ = mse_backward(mse_cache)

## 3) Backward propagation

Finally, you need to implement the `backward` methods for the `Linear` and `Relu` classes.

Remember that you need to use the chain rule and combine the local and the upstream gradients to obtain the global gradients. Do not forget that ReLU is an element-wise operation.
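
The chain rule for the two layer types can be sketched as follows, assuming the same shapes as in the forward pass (weight `(out_features, in_features)`, bias `(1, out_features)`). The helper names are hypothetical, and the convention that the ReLU gradient is 0 at exactly `x = 0` is an assumption of this sketch.

```python
import torch


def linear_backward(ins, W, upstream):
    # upstream: gradient of the loss w.r.t. the layer outputs, (num_inst, out_features)
    W_grad = upstream.T @ ins                    # (out_features, in_features)
    b_grad = upstream.sum(dim=0, keepdim=True)   # (1, out_features)
    ins_grad = upstream @ W                      # (num_inst, in_features)
    return W_grad, b_grad, ins_grad


def relu_backward(ins, upstream):
    # local gradient is 1 where the input was positive and 0 elsewhere
    return upstream * (ins > 0)


x_bw = torch.tensor([[1.0, 2.0],
                     [3.0, -1.0],
                     [0.5, 0.5]])
W_bw = torch.tensor([[2.0, -1.0]])   # one output feature
up_bw = torch.ones(3, 1)             # upstream gradient of ones

gW, gb, gx = linear_backward(x_bw, W_bw, up_bw)
print(gW)  # column sums of the inputs: [[4.5, 1.5]]
print(gb)  # number of instances: [[3.]]
print(gx)  # every row equals the weight row [2., -1.]

g_relu = relu_backward(torch.tensor([[1.0, -2.0], [0.0, 3.0]]),
                       torch.tensor([[10.0, 20.0], [30.0, 40.0]]))
print(g_relu)  # rows: [10., 0.] and [0., 40.]
```

With an upstream gradient of ones, the bias gradient is simply the number of instances and each row of the input gradient repeats the weight row, which is a handy sanity check.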

In [7]:
# After implementing the backward passes of Linear class test it here

# do the backward pass of last linear layer
lin2.backward(torch.ones(num_inst, 1))

# check global gradients
print(f'Global gradient of loss with respect to weight parameters {lin2.W.g}')
print(f'Global gradient of loss with respect to bias parameters {lin2.b.g}')
print(f'Global gradient of loss with respect to linear layer inputs {lin2.ins.g[:2,:]}')

Global gradient of loss with respect to weight parameters tensor([[106.2968, 108.7577, 111.4530, 114.4143]])
Global gradient of loss with respect to bias parameters tensor([[90.]])
Global gradient of loss with respect to linear layer inputs tensor([[1.6555, 1.6937, 1.7328, 1.7728],
        [1.6555, 1.6937, 1.7328, 1.7728]])


Expected results

`Global gradient of loss with respect to weight parameters tensor([[106.2968, 108.7577, 111.4530, 114.4143]])
Global gradient of loss with respect to bias parameters tensor([[90.]])
Global gradient of loss with respect to linear layer inputs tensor([[1.6555, 1.6937, 1.7328, 1.7728],
        [1.6555, 1.6937, 1.7328, 1.7728]])`

In [8]:
# After implementing the backward passes of relu class test it here

# do the backward pass of relu

relu1.backward(torch.arange(num_inst*4).view(num_inst, 4))

# check global gradients
print(f'Global gradient of loss with respect to relu inputs {relu1.ins.g[:2,:]}')

Global gradient of loss with respect to relu inputs tensor([[0, 1, 2, 3],
        [0, 0, 0, 0]])


Expected results

`Global gradient of loss with respect to relu inputs tensor([[0., 1., 2., 3.],
        [0., 0., 0., 0.]])`

## Complete backward pass

We shall use the `Model` class to get the gradients of the loss with respect to the inputs and parameters of all the layers.
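
One way to sanity-check a hand-written backward pass is to compare it against PyTorch's autograd on a small random problem. The sketch below does this for a Linear -> ReLU -> Linear network with a mean-squared-error loss; all variable names are made up for the illustration, and it is only an assumption that the `grad_model` helper used below works in a similar spirit.

```python
import torch

torch.manual_seed(0)
x_chk = torch.randn(8, 3)
y_chk = torch.randn(8, 1)
Wa = torch.randn(4, 3, requires_grad=True)
ba = torch.randn(1, 4, requires_grad=True)
Wb = torch.randn(1, 4, requires_grad=True)
bb = torch.randn(1, 1, requires_grad=True)

# forward pass: Linear -> ReLU -> Linear -> MSE
z_chk = x_chk @ Wa.T + ba
h_chk = torch.clamp(z_chk, min=0)
pred_chk = h_chk @ Wb.T + bb
loss_chk = ((pred_chk - y_chk) ** 2).mean()

# reference gradients from autograd
loss_chk.backward()

# manual backward pass, visiting the nodes in reverse order
with torch.no_grad():
    g_pred = 2 * (pred_chk - y_chk) / pred_chk.numel()
    g_Wb = g_pred.T @ h_chk
    g_bb = g_pred.sum(dim=0, keepdim=True)
    g_h = g_pred @ Wb
    g_z = g_h * (z_chk > 0)
    g_Wa = g_z.T @ x_chk
    g_ba = g_z.sum(dim=0, keepdim=True)

# each manual gradient should agree with autograd up to rounding
for name, manual, ref in [("Wa", g_Wa, Wa.grad), ("ba", g_ba, ba.grad),
                          ("Wb", g_Wb, Wb.grad), ("bb", g_bb, bb.grad)]:
    print(name, torch.allclose(manual, ref, atol=1e-5))
```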

In [9]:
from ann_code.helpers import grad_model

# do the backward pass through the model
model.backward(ypredgrad)

# print out your gradients of loss with respect to the parameters of the 1st model layer
print(f'Your dLoss/dW1: {model.layers[0].W.g}')
print(f'Your dLoss/db1: {model.layers[0].b.g}')
print(f'Your dLoss/dins: {model.layers[0].ins.g[:2, :]}')

# print out correct gradients of loss with respect to the parameters of the 1st model layer
# these should be the same as your gradients from above
model_check = grad_model(model, in_data, labels)
print(f'Correct dLoss/dW1: {model_check.layers[0].W.grad}')
print(f'Correct dLoss/db1: {model_check.layers[0].b.grad}')
print(f'Correct dLoss/dins: {model_check.layers[0].ins.grad[:2, :]}')

Your dLoss/dW1: tensor([[10.4693,  6.8379,  4.1449],
        [10.5790,  7.0695,  4.3389],
        [10.8324,  7.2315,  4.4382],
        [11.0693,  7.3818,  4.5600]])
Your dLoss/db1: tensor([[31.2568, 31.9208, 32.6484, 33.4148]])
Your dLoss/dins: tensor([[1.6884, 1.7274, 1.7673],
        [0.0000, 0.0000, 0.0000]])
Correct dLoss/dW1: tensor([[10.4693,  6.8379,  4.1449],
        [10.5790,  7.0695,  4.3389],
        [10.8324,  7.2315,  4.4382],
        [11.0693,  7.3818,  4.5600]])
Correct dLoss/db1: tensor([[31.2568, 31.9208, 32.6484, 33.4148]])
Correct dLoss/dins: tensor([[1.6884, 1.7274, 1.7673],
        [0.0000, 0.0000, 0.0000]])




## 4) Multilayer feed forward network

Finally, use your `Linear` and `Relu` classes and combine them with the `Model` class to construct a more complicated network.

Define a network with the following architecture:
Linear: input_dim = 3, output_dim = 5 -> Relu ->
Linear: input_dim = 5, output_dim = 10 -> Relu ->
Linear: input_dim = 10, output_dim = 4 -> Relu ->
Linear: input_dim = 4, output_dim = 1

Initialize all the linear layers with parameters W and b sampled randomly from the standard normal distribution.

Combine the layers using the `Model` class and get the predictions (`forward` method).

Use the MSE forward and backward functions to get the loss and the gradient with respect to the predictions.

Use the `backward` method of `Model` to get all the gradients.

In [10]:

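# activation functions (Relu has no parameters)
relu_1 = Relu()
relu_2 = Relu()
relu_3 = Relu()

# linear layers: W has shape (out, in), b has shape (1, out), both sampled from N(0, 1)
layer1 = Linear(torch.randn((5, 3)), torch.randn((1, 5)))
layer2 = Linear(torch.randn((10, 5)), torch.randn((1, 10)))
layer3 = Linear(torch.randn((4, 10)), torch.randn((1, 4)))
layer4 = Linear(torch.randn((1, 4)), torch.randn((1, 1)))

# combine the layers and do the forward pass
layers1 = (layer1, relu_1, layer2, relu_2, layer3, relu_3, layer4)
mffn = Model(layers1)
y_pred_1 = mffn.forward(in_data)

# MSE loss and its gradient with respect to the predictions
loss, mse_cache_1 = mse_forward(y_pred_1, labels)
ypredgrad_1, _ = mse_backward(mse_cache_1)
print(mffn.layers[0].ins.shape)

# backward pass through the whole model
mffn.backward(ypredgrad_1)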


torch.Size([90, 3])


tensor([[-2.9881e-01, -1.1856e+00,  2.5806e+00],
        [-1.6428e+00,  5.3316e+00, -3.1710e+00],
        [ 5.4719e-01,  2.6398e+00, -1.4216e+00],
        [ 4.1268e-01, -8.1125e-01,  1.4487e+00],
        [ 2.9934e-01, -4.9563e-01,  1.0859e+00],
        [ 2.4937e-01, -3.1799e-01,  6.7148e-01],
        [-9.2622e-01,  5.6796e+00, -3.1998e+00],
        [ 4.6507e-01, -5.8765e-01,  1.3185e+00],
        [ 2.1730e-01, -3.5979e-01,  7.8831e-01],
        [ 4.4679e-01, -3.9645e-01,  1.0944e+00],
        [ 4.7645e-01, -6.0202e-01,  1.3507e+00],
        [-5.2619e-01,  3.2266e+00, -1.8178e+00],
        [-9.6453e-03, -3.8756e-01,  8.9241e-01],
        [ 6.5319e-01, -5.7960e-01,  1.6000e+00],
        [ 6.4715e-01, -5.7423e-01,  1.5851e+00],
        [-2.6455e-01, -1.0497e+00,  2.2847e+00],
        [ 2.0613e-01, -2.6720e-01,  5.8783e-01],
        [ 3.9051e-01, -6.4657e-01,  1.4166e+00],
        [-8.6004e-03, -3.4558e-01,  7.9574e-01],
        [ 6.5478e-01,  3.1589e+00, -1.7011e+00],
        [ 2.8304e-01

#### Check model architecture

In [11]:
# check architecture
from ann_code.helpers import check_architecture

check_architecture(mffn)

You NN architecture definitions seems CORRECT.


#### Check gradient computation

In [12]:
# print out your gradients of loss with respect to the parameters of the 1st model layer
print(f'Your dLoss/dW1: {mffn.layers[0].W.g}')
print(f'Your dLoss/db1: {mffn.layers[0].b.g}')
print(f'Your dLoss/dins: {mffn.layers[0].ins.g[:2, :]}') 
    
# print out correct gradients of loss with respect to the parameters of the 1st model layer
# these should be the same as your gradients from above
model_check = grad_model(mffn, in_data, labels)
print(f'Correct dLoss/dW1: {model_check.layers[0].W.grad}')
print(f'Correct dLoss/db1: {model_check.layers[0].b.grad}')
print(f'Correct dLoss/dins: {model_check.layers[0].ins.grad[:2, :]}')

Your dLoss/dW1: tensor([[ -8.7668,  11.2050,  -9.2887],
        [ -1.0134,   4.6907,  -5.7011],
        [ -7.3668,  55.9515, -58.8643],
        [ -1.9684, -13.3189,  14.9649],
        [  0.2436, -12.4558,  15.9174]], grad_fn=<MmBackward0>)
Your dLoss/db1: tensor([[  8.8145, -12.5458,  51.7239,  22.9312,  28.0691]],
       grad_fn=<MmBackward0>)
Your dLoss/dins: tensor([[-0.2988, -1.1856,  2.5806],
        [-1.6428,  5.3316, -3.1710]], grad_fn=<SliceBackward0>)
Correct dLoss/dW1: tensor([[ -8.7668,  11.2050,  -9.2887],
        [ -1.0134,   4.6907,  -5.7011],
        [ -7.3668,  55.9515, -58.8643],
        [ -1.9684, -13.3189,  14.9649],
        [  0.2436, -12.4558,  15.9174]])
Correct dLoss/db1: tensor([[  8.8145, -12.5458,  51.7239,  22.9312,  28.0691]])
Correct dLoss/dins: tensor([[-0.2988, -1.1856,  2.5806],
        [-1.6428,  5.3316, -3.1710]])
