<center>
NNTI Assignment 7

December 21, 2023

Name: Aleksey Morshnev
Student ID: 7042691
Email: almo00008@stud.uni-saarland.de
</center>

# Exercise 7.3 (5 points)

In this exercise you will build a toy library and use it to perform backpropagation on a neural network model that you will also build here.  
  
In this toy library, we are not implementing autograd or any other form of automatic differentiation. Still, it will be extremely helpful for you to know the basics of how the PyTorch autograd functionality works (e.g. for checking your implementation of gradient calculations). A good starting point is the [PyTorch autograd tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html).   

All classes that you will implement must have a `grad` function that computes and returns gradients. The `grad` function in the classes of the following *loss* functions (`MSELoss` and `CrossEntropyLoss`) must compute gradients of the loss w.r.t. its input. The `grad` function for *activation functions* must take the incoming gradient (possibly from the previous layer or the loss function) and compute gradients of the loss w.r.t. its input. The `grad` function for *layers* (in this exercise we have only the `Linear` layer) must take the incoming gradient and compute gradients of the loss w.r.t. both its input and weights (you can ignore computing the gradients w.r.t. biases).

For each gradient calculation, we are providing some low-dimensional data. After you finish the implementation of each `grad` function, simply run the corresponding cell (**do not** change the contents of these cells). To check the correctness of the implementation, we ask you to *call* the corresponding function from PyTorch on the same input data, compute gradients, and compare them with the gradients from your implementation. If your solution is correct, they must be the same (up to very small differences, below $10^{-3}$).
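
The comparison described above can be sketched as follows. The helper name `gradients_match` is hypothetical (it is not part of the provided code), and the tolerance simply mirrors the $10^{-3}$ mentioned in the text. A PyTorch gradient would first be converted to NumPy, for example with `tensor.grad.detach().numpy()`.

```python
import numpy as np

def gradients_match(ours, reference, atol=1e-3):
    """Return True if two gradient arrays agree elementwise within atol.

    Hypothetical helper for illustration; not part of the exercise code.
    """
    ours = np.asarray(ours, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if ours.shape != reference.shape:
        return False
    # rtol=0 so that only the absolute tolerance is applied.
    return bool(np.allclose(ours, reference, rtol=0.0, atol=atol))

print(gradients_match([0.5, -1.0], [0.5004, -1.0002]))  # True
print(gradients_match([0.5, -1.0], [0.6, -1.0]))        # False
```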

Please remember that everything is processed in minibatches, so gradients must be calculated accordingly. The input for each of the model components has dimensions `N*D`, where `N` is the number of datapoints in the minibatch and `D` is the number of features. Ideally, all gradient computations should be implemented in vectorized form (without using any loops).

### Note: do not modify the code in the notebook

In [1]:
import numpy as np
from activations import ReLU, Sigmoid
from losses import CrossEntropy, MSELoss
from layers import Linear, Dropout
from model import Model

import torch
np.random.seed(23)

## Exercise 7.3.1 Implement `__call__` and `grad` methods for MSE loss (0.25 points)
Check the correctness of the gradient by calculating it on the same data using PyTorch  
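
As a rough guide, a mean-reduced MSE loss and its gradient could look like the sketch below. The class name `MSELossSketch` is hypothetical and is not the `MSELoss` from `losses`. It assumes mean reduction (matching `reduction='mean'` in the PyTorch cell) and caches the inputs in `__call__` so that `grad()` can be called without arguments, as in the provided cells.

```python
import numpy as np

class MSELossSketch:
    """Hypothetical stand-in for the exercise's MSELoss (mean reduction assumed)."""

    def __call__(self, y_pred, y_true):
        # Cache the inputs so grad() can be called without arguments.
        self.y_pred = np.asarray(y_pred, dtype=np.float64)
        self.y_true = np.asarray(y_true, dtype=np.float64)
        return np.mean((self.y_pred - self.y_true) ** 2)

    def grad(self):
        # d/dy_pred of mean((y_pred - y_true)^2) = 2 * (y_pred - y_true) / n
        return 2.0 * (self.y_pred - self.y_true) / self.y_pred.size

mse_sketch = MSELossSketch()
mse_value = mse_sketch(np.array([0, 1, 2]), np.array([1, 3, 3]))
mse_gradient = mse_sketch.grad()
print(mse_value)     # 2.0
print(mse_gradient)  # approximately [-0.667 -1.333 -0.667]
```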

In [None]:
y_pred = np.array([0, 1, 2])
y_true = np.array([1, 3, 3])
loss = MSELoss()

print('predictions')
print(y_pred)
print('true values')
print(y_true)

print('MSE loss')
print(loss(y_pred, y_true))

print('MSE gradient')
print(loss.grad())

In [None]:
y_pred = torch.tensor([0, 1, 2], requires_grad=True, dtype=torch.float32)
y_true = torch.tensor([1, 3, 3], dtype=torch.float32)
loss = torch.nn.MSELoss(reduction='mean')

print('predictions')
print(y_pred)
print('true values')
print(y_true)

print('MSE loss')
out = loss(y_pred, y_true)
print(out.item())
out.backward()

print('MSE gradient')
print(y_pred.grad)

## Exercise 7.3.2 Implement `__call__` and `grad` methods for Cross Entropy Loss (0.75 points)
Show that <br> $\frac{\delta L}{\delta o_i} = p_i - y_i$  
where <br> $o_i$ is one of the input variables,   
$p_i$ is the probability for that input variable calculated using softmax,   
$y_i$ is the label for that input variable ($y_i \in \{0, 1\}$).  
For simplicity, you can write the proof for just one datapoint, but in the code you should properly extend it to compute the gradients for the whole minibatch (`N` datapoints).  
  
Please remember that a typical Cross Entropy Loss implementation, including ours, implicitly applies Softmax before calculating the CE loss.
  
Check the correctness of the gradient by calculating it on the same data using PyTorch  
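
The formula $p_i - y_i$ can be checked numerically before comparing against PyTorch. The sketch below is an illustration under assumptions, not the exercise's `CrossEntropy` class: the function names are hypothetical, and the loss is assumed to be averaged over the `N` datapoints, which is why the gradient is divided by `N`. The loop appears only in the finite-difference check; the loss and gradient themselves are vectorized.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean softmax cross entropy and its gradient w.r.t. the logits."""
    n = logits.shape[0]
    probs = softmax(logits)
    loss = -np.sum(targets * np.log(probs)) / n
    grad = (probs - targets) / n  # (p - y), averaged over the minibatch
    return loss, grad

logits = np.array([[0.4, 0.35, 0.71, 0.30],
                   [0.01, 0.01, 0.01, 0.65]])
targets = np.array([[0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
ce_value, ce_grad = cross_entropy(logits, targets)

# Independent check with central finite differences.
eps = 1e-6
numeric = np.zeros_like(logits)
for idx in np.ndindex(*logits.shape):
    up, down = logits.copy(), logits.copy()
    up[idx] += eps
    down[idx] -= eps
    numeric[idx] = (cross_entropy(up, targets)[0]
                    - cross_entropy(down, targets)[0]) / (2 * eps)

print(np.allclose(ce_grad, numeric, atol=1e-6))
```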

In [None]:
ce_loss = CrossEntropy(average=True)
predictions = np.array([[0.4,0.35,0.71,0.30],
                        [0.01,0.01,0.01,0.65]])
targets = np.array([[0,0,1,0],
                  [0,0,0,1]])

print('predictions')
print(predictions)
print('targets')
print(targets)

print('cross entropy loss')
print(ce_loss(predictions, targets))

print('gradient of the cross entropy loss')
print(ce_loss.grad())

In [None]:
ce_loss = torch.nn.CrossEntropyLoss()
predictions = torch.tensor([[0.4,0.35,0.71,0.30],
                            [0.01,0.01,0.01,0.65]], dtype=torch.float32, requires_grad=True)
targets = torch.tensor([[0,0,1,0],
                        [0,0,0,1]], dtype=torch.float32)

print('predictions')
print(predictions)
print('targets')
print(targets)

out = ce_loss(predictions, targets)
print(out.item())

out.backward()
print(predictions.grad)


## Exercise 7.3.3 Implement the `__call__` and `grad` methods for linear layer (1.0 points)

$\frac{\delta L}{\delta X} = \frac{\delta L}{\delta Y} W^T$ and $\frac{\delta L}{\delta W} = X^T \frac{\delta L}{\delta Y}$  
where $Y = XW$ <br> ($X$ is the input data matrix of dimension `N * in_features` and $W$ is the weight matrix of dimension `in_features * out_features`),  
$\frac{\delta L}{\delta Y}$ is the incoming gradient of dimension `N * out_features` (e.g. from the loss function that is applied to the outputs of the linear layer).  

Check the correctness of the gradient by calculating it on the same data using PyTorch.

Note that, to get the same output, the *weights* and *biases* of your `Linear` layer and the `Linear` layer from PyTorch must be the same. 
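
The two formulas above can be illustrated with plain NumPy arrays. This is a sketch under assumptions rather than the `Linear` class itself: the variable names are made up for illustration, and the bias is set to zero because its gradient is ignored in this exercise. With an incoming gradient of all ones, the loss is effectively the sum of all outputs, which gives a simple closed form to check against.

```python
import numpy as np

rng = np.random.default_rng(0)
n, in_features, out_features = 4, 5, 2

X = rng.standard_normal((n, in_features))             # N x in_features
W = rng.standard_normal((in_features, out_features))  # in_features x out_features
b = np.zeros((1, out_features))                       # bias, broadcast over rows

Y = X @ W + b                                         # forward pass
dL_dY = np.ones((n, out_features))                    # incoming gradient

dL_dW = X.T @ dL_dY   # gradient w.r.t. weights, same shape as W
dL_dX = dL_dY @ W.T   # gradient w.r.t. input, same shape as X

print(dL_dW.shape, dL_dX.shape)  # (5, 2) (4, 5)

# With dL_dY = 1 everywhere, L = sum(Y), so every column of dL_dW is the
# column sum of X and every row of dL_dX is the row sum of W.
print(np.allclose(dL_dW[:, 0], X.sum(axis=0)))
print(np.allclose(dL_dX[0], W.sum(axis=1)))
```

PyTorch stores the weight of `torch.nn.Linear` with shape `(out_features, in_features)`, which is why the comparison cell transposes the weights when copying them and transposes the gradient again before printing it.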

In [None]:
minibatch_size = 4
in_features = 5
out_features = 2
minibatch = np.random.randn(minibatch_size, in_features)
print('input data')
print(minibatch)

layer = Linear(in_features, out_features)
print('output of the linear layer')
print(layer(minibatch))

in_gradient = np.ones((minibatch_size, out_features,))
gradient_weights, gradient_input = layer.grad(in_gradient)
print('gradient w.r.t weights')
print(gradient_weights)
print('gradient w.r.t. inputs')
print(gradient_input)

In [None]:
print('input data')
minibatch = torch.tensor(minibatch, dtype=torch.float32, requires_grad=True)
print(minibatch)

linear = torch.nn.Linear(in_features, out_features, bias=True)
with torch.no_grad():
    linear.weight.copy_(torch.tensor(layer.weights).t())
    linear.bias.copy_(torch.tensor(layer.bias[0,:]))
out = linear(minibatch)
print('output of the linear layer')
print(out)

in_gradient = torch.ones((minibatch_size, out_features,))
out.backward(gradient=in_gradient)
print('gradient w.r.t weights')
print(linear.weight.grad.t())
print('gradient w.r.t. inputs')
print(minibatch.grad)

## Exercise 7.3.4 Implement `__call__` and `grad` methods for activation functions (0.5 point)
Check the correctness of the gradients by calculating them on the same data using PyTorch.  
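
For orientation, the local derivatives are $\sigma(x)(1 - \sigma(x))$ for the sigmoid and $1$ for $x > 0$ (else $0$) for ReLU; each is multiplied elementwise by the incoming gradient. The classes below are hypothetical sketches, not the `Sigmoid` and `ReLU` from `activations`, and they assume that `__call__` caches whatever `grad` needs. The ReLU gradient at exactly $0$ is taken to be $0$, which is also what PyTorch returns.

```python
import numpy as np

class SigmoidSketch:
    """Hypothetical sigmoid activation with a cached forward output."""

    def __call__(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def grad(self, in_gradient):
        # Chain rule: incoming gradient times sigma(x) * (1 - sigma(x)).
        return in_gradient * self.out * (1.0 - self.out)

class ReLUSketch:
    """Hypothetical ReLU activation with a cached positivity mask."""

    def __call__(self, x):
        self.mask = x > 0
        return np.where(self.mask, x, 0.0)

    def grad(self, in_gradient):
        # Gradient passes through only where the input was positive.
        return in_gradient * self.mask

x_demo = np.array([[-1.0, 0.0, 2.0]])
ones = np.ones_like(x_demo)

sigmoid_sketch = SigmoidSketch()
sigmoid_out = sigmoid_sketch(x_demo)
sigmoid_grad = sigmoid_sketch.grad(ones)

relu_sketch = ReLUSketch()
relu_out = relu_sketch(x_demo)
relu_grad = relu_sketch.grad(ones)

print(sigmoid_grad[0, 1])  # 0.25, since sigma(0) = 0.5
print(relu_grad)           # [[0. 0. 1.]]
```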

In [None]:
x = np.array([[0.1, -0.3, 0.5, 0.9, 0, -1.0],
              [0.2, -0.4, 1.1, 0.4, 0.3, 0]])
sigmoid = Sigmoid()
print(sigmoid(x))

in_gradient = np.ones((2, 6,))
print(sigmoid.grad(in_gradient))

In [None]:
x_torch = torch.tensor(x, requires_grad=True)
torch_sigmoid = torch.nn.Sigmoid()
out = torch_sigmoid(x_torch)
print(out)

in_gradient_torch = torch.ones((2, 6, ))
out.backward(gradient=in_gradient_torch)
print(x_torch.grad)

In [None]:
relu = ReLU()
print(relu(x))
print(relu.grad(in_gradient))

In [None]:
x_torch = torch.tensor(x, requires_grad=True)
torch_relu = torch.nn.ReLU()
out = torch_relu(x_torch)
print(out)

in_gradient_torch = torch.ones((2, 6, ))
out.backward(gradient=in_gradient_torch)
print(x_torch.grad)

## Exercise 7.3.5 Implement a model class (2.0 points)
Implement a model class which stores a list of components of the model (in this exercise those are only the *layers* and *activation functions*). 
It must perform the forward pass and also be able to calculate and store the gradients for all the layers, and perform a parameter update step (here we deviate from PyTorch since we don't use *autograd*).  
For simplicity, you don't have to compare the value of each parameter of the model with the PyTorch implementation; just check the value of the resulting loss (before and after the parameter update step). We provide all the code below, including the code for PyTorch. You don't have to change the cells below; just check whether your implementation of the model achieves the same decrease in loss as the equivalent implementation in PyTorch.
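
One possible shape for such a class is sketched below. Everything here is an assumption made for illustration: `ModelSketch`, `TinyLinear`, `TinyReLU` and `toy_loss` are hypothetical names, layers are assumed to return a `(gradient_weights, gradient_input)` tuple from `grad` (as in the linear-layer cell above) while activations return only the input gradient, and the update is plain gradient descent on the weights. For a small enough learning rate the printed loss is expected to decrease.

```python
import numpy as np

class TinyLinear:
    """Hypothetical linear layer computing x @ weights + bias."""

    def __init__(self, in_features, out_features, rng):
        self.weights = 0.1 * rng.standard_normal((in_features, out_features))
        self.bias = np.zeros((1, out_features))

    def __call__(self, x):
        self.x = x  # cached for the backward pass
        return x @ self.weights + self.bias

    def grad(self, in_gradient):
        # Returns (gradient w.r.t. weights, gradient w.r.t. input).
        return self.x.T @ in_gradient, in_gradient @ self.weights.T

class TinyReLU:
    def __call__(self, x):
        self.mask = x > 0
        return x * self.mask

    def grad(self, in_gradient):
        return in_gradient * self.mask

class ModelSketch:
    def __init__(self, components):
        self.components = components

    def forward(self, x):
        for component in self.components:
            x = component(x)
        return x

    def backward(self, in_gradient):
        # Walk the components in reverse, collecting weight gradients by index.
        grads = {}
        for index in reversed(range(len(self.components))):
            result = self.components[index].grad(in_gradient)
            if isinstance(result, tuple):   # layer with weights
                grads[index], in_gradient = result
            else:                           # activation function
                in_gradient = result
        return grads

    def update_parameters(self, grads, learning_rate):
        # Plain gradient descent step on the weights only.
        for index, weight_grad in grads.items():
            self.components[index].weights -= learning_rate * weight_grad

def toy_loss(out):
    # Simple quadratic loss whose gradient w.r.t. out is out itself.
    return 0.5 * np.sum(out ** 2)

rng = np.random.default_rng(0)
model = ModelSketch([TinyLinear(4, 8, rng), TinyReLU(), TinyLinear(8, 2, rng)])
x_batch = rng.standard_normal((3, 4))

out = model.forward(x_batch)
loss_before = toy_loss(out)
grads = model.backward(out)
model.update_parameters(grads, 0.01)
loss_after = toy_loss(model.forward(x_batch))
print(loss_before, loss_after)
```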

In [None]:
from model import Model
np.random.seed(123)

layer1 = Linear(1000, 100)
activation1 = ReLU()
layer2 = Linear(100, 10)
activation2 = ReLU()
loss = CrossEntropy()

x = np.random.randn(2, 1000)
y_true = np.zeros((2, 10,))
y_true[0, 4] = 1
y_true[1, 1] = 1
m = Model([layer1, activation1, layer2, activation2])
out = m.forward(x)
print(loss(out, y_true))

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.layer1 = nn.Linear(1000, 100, bias=True)
        self.layer2 = nn.Linear(100, 10, bias=True)
    
    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return x
criterion = nn.CrossEntropyLoss()
net = Net()

with torch.no_grad():
    net.layer1.weight.copy_(torch.tensor(layer1.weights).t())
    net.layer1.bias.copy_(torch.tensor(layer1.bias[0,:]))
    net.layer2.weight.copy_(torch.tensor(layer2.weights).t())
    net.layer2.bias.copy_(torch.tensor(layer2.bias[0,:]))

optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0)

x_torch = torch.tensor(x, dtype=torch.float32)
out = net(x_torch)
y_true_torch = torch.tensor(y_true, dtype=torch.float32)
loss_torch = criterion(out, y_true_torch)
print(loss_torch.item())

In [None]:
grads = m.backward(loss.grad())
m.update_parameters(grads, 0.001)
out = m.forward(x)
model_loss_ours = loss(out, y_true)
print(model_loss_ours)

In [None]:
loss_torch.backward()
optimizer.step()

out = net(x_torch)
model_loss_pt = criterion(out, y_true_torch).item()
print(model_loss_pt)

In [None]:
# sanity check within some acceptable tolerance level

np.allclose(model_loss_ours, model_loss_pt, atol=1e-3)

## Exercise 7.3.6 Implement `__call__` and `grad` methods for dropout (0.5 point)
In this exercise we are going to implement inverted dropout. We implement dropout as a layer wrapper: the `Dropout` class takes two arguments, the wrapped layer and the dropout probability (`Dropout(layer, p)`). Although dropout can be applied to several types of layers, we only apply it to linear layers in this exercise. Implement dropout in `./layers/Dropout.py`; it transforms the input by setting randomly chosen activations to 0 with probability `p`.
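
The core of inverted dropout can be sketched independently of the layer wrapper. The class below is hypothetical and deliberately simpler than the `Dropout(layer, p)` wrapper asked for: it only masks and rescales an array, the `training` flag is an assumption added for illustration, and whether the mask is applied before or after the wrapped linear layer is left to the exercise. Kept activations are scaled by `1 / (1 - p)` so that the expected activation is unchanged and no rescaling is needed at test time.

```python
import numpy as np

class InvertedDropoutSketch:
    """Hypothetical inverted dropout on a plain array (no wrapped layer)."""

    def __init__(self, p, rng):
        if not 0.0 <= p < 1.0:
            raise ValueError("p must be in [0, 1)")
        self.p = p
        self.rng = rng

    def __call__(self, x, training=True):
        if not training:
            # Identity at test time; inverted dropout needs no rescaling here.
            self.scale = np.ones_like(x)
            return x
        keep = self.rng.random(x.shape) >= self.p  # True with probability 1 - p
        self.scale = keep / (1.0 - self.p)         # 0 for dropped, 1/(1-p) for kept
        return x * self.scale

    def grad(self, in_gradient):
        # The same mask and scaling are applied to the incoming gradient.
        return in_gradient * self.scale

dropout_sketch = InvertedDropoutSketch(p=0.5, rng=np.random.default_rng(0))
activations = np.ones((1000, 100))
dropped = dropout_sketch(activations)
dropped_grad = dropout_sketch.grad(np.ones_like(activations))

print(np.unique(dropped))  # kept entries are scaled to 2.0, dropped ones are 0.0
print(dropped.mean())      # close to 1.0 on average
```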

In [None]:
np.random.seed(123)

layer1 = Linear(1000, 100)
activation1 = ReLU()
layer2 = Dropout(Linear(100, 10), p=0.5)
activation2 = ReLU()
loss = CrossEntropy()

x = np.random.randn(2, 1000)
y_true = np.zeros((2, 10,))
y_true[0, 4] = 1
y_true[1, 1] = 1
m = Model([layer1, activation1, layer2, activation2])
out = m.forward(x)

# numpy seed is fixed so you should get the same value after each run
# print(loss(out, y_true)) # = 2.15028 with tolerance 5e-3
np.allclose(loss(out, y_true), 2.15028, atol=5e-3)