# Linear regression

In the first part of the assignment we will implement the forward pass and the backward local and global gradient computation for a simple linear regression problem with mean squared error loss.

Our decomposition of the linear regression corresponds to the middle computation graph in slide 11 of Lecture 04 with 2 compute nodes.

In [1]:
# necessary initialization
%load_ext autoreload
%autoreload 2

import torch

In [2]:
# load data
from ann_code.helpers import load_data
in_data, labels = load_data(filename='./ann_data/toy_data.csv') # correct filename if necessary
# get data dimensions
num_inst, num_dim = in_data.shape
print(f"Number of instances: {num_inst}, input dimensions: {num_dim}.")

Number of instances: 90, input dimensions: 3.


## 1) Linear regression - single data point

To make things easy, we first work over a single data example.

The prediction function is an affine function (linear with bias) with parameters $\theta = \{\mathbf{w}, b\}$.
We write it here in full detail to see the individual scalar parameters (elements of the vector $\mathbf{w}$):

$$\hat{y} = f_\theta(\mathbf{x}) = \sum_{j=1}^d x_j \, w_j + b \enspace .$$

The loss is the squared error (SE)

$$\mathcal{L}_{SE}(\hat{y}, y) = (\hat{y} - y)^2 \enspace .$$

Work with the code in `code/linear_regression.py` and complete it as instructed below.

In [3]:
# get single data example
x = in_data[0, :]
y = labels[0]
print(f"x: {x}, \ny: {y}")


x: tensor([ 0.8872, -1.2852,  0.3707]), 
y: tensor([0.2390])


### Forward propagation

I have implemented the forward pass for you in the `linear_single_forward` function, which uses `for` loops to calculate the inner product.

In [4]:
# get prediction using the provided function linear_single_forward
from ann_code.linear_regression import linear_single_forward

# initialize parameters w, b
w = torch.tensor([ 1.5410, -0.2934, -2.1788]) 
b = torch.tensor([0.8380])

# print data and parameters for info
print(f"x: {x} \nw: {w}, \nb: {b}")

# get predictions
yhat, lin_cache = linear_single_forward(x, w, b)
print(f"Prediction {yhat}")

x: tensor([ 0.8872, -1.2852,  0.3707]) 
w: tensor([ 1.5410, -0.2934, -2.1788]), 
b: tensor([0.8380])
Prediction tensor([1.7745])
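
The source of `linear_single_forward` is not reproduced in this notebook. As a rough sketch of what a loop-based affine forward pass might look like, the snippet below uses plain Python floats instead of tensors. The name `affine_forward_loop` and the cache layout `(x, w, b)` are assumptions made for illustration, not the provided implementation.

```python
def affine_forward_loop(x, w, b):
    """Hypothetical loop-based affine map: yhat = sum_j x_j * w_j + b."""
    yhat = b
    for x_j, w_j in zip(x, w):
        yhat += x_j * w_j
    cache = (x, w, b)  # assumed cache layout: the inputs needed for gradients later
    return yhat, cache


# values copied from the cell output above (rounded to 4 decimals)
x = [0.8872, -1.2852, 0.3707]
w = [1.5410, -0.2934, -2.1788]
b = 0.8380

yhat, cache = affine_forward_loop(x, w, b)
print(yhat)
```

Because the inputs are the rounded printed values, the result should agree with the prediction above only to about three decimals.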


I have also implemented the `squared_error_forward` function for you to check the accuracy of the prediction against the true label.

In [5]:
# get squared error using the provided function squared_error_forward
from ann_code.linear_regression import squared_error_forward

# calculate squared error
loss, loss_cache = squared_error_forward(yhat, y)
print(f"Squared error: {loss}")
print(loss_cache)

Squared error: tensor([2.3577])
(tensor([1.7745]), tensor([0.2390]))


### Backward propagation

You will now implement the backward propagation functions. 

#### Local gradients

Remember that each computation node needs to be able to compute its local gradient before it can combine it with the upstream gradient through the chain rule.
Local gradients are simply the gradients of a function's outputs with respect to its inputs, ignoring for now any upstream or downstream operations.

Derive the local gradients and implement them in `linear_single_lgrad` and `squared_error_lgrad`.
Then use the cell below to check your implementation. 
The relative errors should all be rather small (e.g. 1e-4).

Note 1: I use the term *gradient* for simplicity even though for scalar objects *derivative* would be more appropriate.

Note 2: We will get the gradients with respect to all inputs of the functions including the data $\mathbf{x}$ and $y$. It will become clear why these may be useful in later parts of the assignment.
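
For reference, differentiating the two formulas from the start of this section term by term gives the local gradients below. This is a worked derivation from those formulas, not a listing of the provided code.

$$\frac{\partial \hat{y}}{\partial x_j} = w_j \enspace , \qquad \frac{\partial \hat{y}}{\partial w_j} = x_j \enspace , \qquad \frac{\partial \hat{y}}{\partial b} = 1 \enspace ,$$

$$\frac{\partial \mathcal{L}_{SE}}{\partial \hat{y}} = 2(\hat{y} - y) \enspace , \qquad \frac{\partial \mathcal{L}_{SE}}{\partial y} = -2(\hat{y} - y) \enspace .$$

With the prediction and label from above, $2(1.7745 - 0.2390) = 3.0710$.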

In [6]:
# After implementing the local gradient functions, you can check them here
from ann_code.helpers import numerical_gradient, grad_checker
from ann_code.linear_regression import linear_single_lgrad, squared_error_lgrad
yhatg=torch.zeros(1)
# get local gradients of the linear function
xg, wg, bg = linear_single_lgrad(lin_cache)

# get local gradients of the squared error
yhatg, yg = squared_error_lgrad(loss_cache)
print(yhatg)
# check local gradients
# check xg
print(f"Checking xg")
f = lambda theta: linear_single_forward(theta, w, b)[0]
xng = numerical_gradient(f, x)
grad_checker(xg, xng)

# check wg
print(f"Checking wg")
f = lambda theta: linear_single_forward(x, theta, b)[0]
wng = numerical_gradient(f, w)
grad_checker(wg, wng)

# check bg
print(f"Checking bg")
f = lambda theta: linear_single_forward(x, w, theta)[0]
bng = numerical_gradient(f, b)
grad_checker(bg, bng)

# check yhatg
print(f"Checking yhatg")
f = lambda theta: squared_error_forward(theta, y)[0]
yhatng = numerical_gradient(f, yhat)
grad_checker(yhatg, yhatng)

# check yg
print(f"Checking yg")
f = lambda theta: squared_error_forward(yhat, theta)[0]
yng = numerical_gradient(f, y)
grad_checker(yg, yng)

tensor([3.0710])
Checking xg
analytic grad: tensor([ 1.5410, -0.2934, -2.1788])
numerical grad: tensor([ 1.5408, -0.2933, -2.1791], grad_fn=<ViewBackward0>)
relative error: tensor([9.6748e-08, 4.2130e-08, 2.3903e-07], grad_fn=<MulBackward0>)
Checking wg
analytic grad: tensor([ 0.8872, -1.2852,  0.3707])
numerical grad: tensor([ 0.8875, -1.2851,  0.3695], grad_fn=<ViewBackward0>)
relative error: tensor([1.8547e-07, 1.3927e-08, 2.8144e-06], grad_fn=<MulBackward0>)
Checking bg
analytic grad: tensor([1.])
numerical grad: tensor([1.0002], grad_fn=<ViewBackward0>)
relative error: tensor([5.5072e-08], grad_fn=<MulBackward0>)
Checking yhatg
analytic grad: tensor([3.0710])
numerical grad: tensor([3.0708], grad_fn=<ViewBackward0>)
relative error: tensor([4.2442e-08], grad_fn=<MulBackward0>)
Checking yg
analytic grad: tensor([-3.0710])
numerical grad: tensor([-3.0708], grad_fn=<ViewBackward0>)
relative error: tensor([4.2442e-08], grad_fn=<MulBackward0>)
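
The helpers `numerical_gradient` and `grad_checker` are not shown in this notebook, so their exact method is unknown here. One common way to do such a check is a central finite difference, sketched below in plain Python. The names `central_difference` and `relative_error`, the step size `eps`, and the relative error formula are all assumptions for illustration and may differ from what the helpers do.

```python
def central_difference(f, theta, eps=1e-6):
    """Approximate df/dtheta_j for a scalar-valued f by central differences."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += eps
        minus[j] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def relative_error(a, n, tiny=1e-12):
    """One common relative error definition; the course helper may use another."""
    return [abs(ai - ni) / max(abs(ai), abs(ni), tiny) for ai, ni in zip(a, n)]


# rounded values from the cells above
x = [0.8872, -1.2852, 0.3707]
w = [1.5410, -0.2934, -2.1788]
b = 0.8380

# prediction as a function of w only, with x and b held fixed
f = lambda theta: sum(x_j * t_j for x_j, t_j in zip(x, theta)) + b

wg_numeric = central_difference(f, w)
wg_analytic = x  # local gradient of the affine map with respect to w
errors = relative_error(wg_analytic, wg_numeric)
print(errors)
```

For an affine function the central difference is exact up to floating-point rounding, so the errors here should be far below the 1e-4 guideline.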




#### Global gradients

To get the global gradients you need to use the chain rule: each local gradient needs to be multiplied by the upstream global gradient.
Implement the global gradient calculation in `linear_single_ggrad`. Remember that the local and global gradients of the loss function are equal, so there is no need to write a specific function for these.
The relative errors should all be rather small (e.g. 1e-4).
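
As a minimal sketch of this chain-rule step, the snippet below multiplies each local gradient by the scalar upstream gradient using plain Python floats. The function name `ggrad_sketch` is hypothetical. It mirrors the call pattern `linear_single_ggrad((xg, wg, bg), yhatg)` used in this notebook but is not the reference solution.

```python
def ggrad_sketch(local_grads, upstream):
    """Scale each local gradient by the scalar upstream gradient (chain rule)."""
    xg, wg, bg = local_grads
    xgrad = [upstream * g for g in xg]
    wgrad = [upstream * g for g in wg]
    bgrad = upstream * bg
    return xgrad, wgrad, bgrad


# rounded values taken from the local-gradient check above
xg = [1.5410, -0.2934, -2.1788]   # d yhat / d x = w
wg = [0.8872, -1.2852, 0.3707]    # d yhat / d w = x
bg = 1.0                          # d yhat / d b
yhatg = 3.0710                    # d loss / d yhat

xgrad, wgrad, bgrad = ggrad_sketch((xg, wg, bg), yhatg)
print(xgrad, wgrad, bgrad)
```

Since the inputs are rounded to four decimals, the products should match the analytic global gradients only to about three decimals.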

In [7]:
# After implementing the global gradient functions, you can check them here
from ann_code.linear_regression import linear_single_ggrad

# get global gradients of the linear function parameters w, b and data x
xgrad, wgrad, bgrad = linear_single_ggrad((xg, wg, bg), yhatg)

# check global gradients
# check xgrad
print(f"Checking xgrad")
f = lambda theta: squared_error_forward(linear_single_forward(theta, w, b)[0], y)[0]
xng = numerical_gradient(f, x)
grad_checker(xgrad, xng)

# check wgrad
print(f"Checking wgrad")
f = lambda theta: squared_error_forward(linear_single_forward(x, theta, b)[0], y)[0]
wng = numerical_gradient(f, w)
grad_checker(wgrad, wng)

# check bgrad
print(f"Checking bgrad")
f = lambda theta: squared_error_forward(linear_single_forward(x, w, theta)[0], y)[0]
bng = numerical_gradient(f, b)
grad_checker(bgrad, bng)

Checking xgrad
analytic grad: tensor([ 4.7324, -0.9010, -6.6910])
numerical grad: tensor([ 4.7314, -0.9012, -6.6924], grad_fn=<ViewBackward0>)
relative error: tensor([1.8391e-06, 7.8083e-08, 3.7249e-06], grad_fn=<MulBackward0>)
Checking wgrad
analytic grad: tensor([ 2.7246, -3.9467,  1.1385])
numerical grad: tensor([ 2.7251, -3.9458,  1.1349], grad_fn=<ViewBackward0>)
relative error: tensor([5.5576e-07, 1.5071e-06, 2.6592e-05], grad_fn=<MulBackward0>)
Checking bgrad
analytic grad: tensor([3.0710])
numerical grad: tensor([3.0708], grad_fn=<ViewBackward0>)
relative error: tensor([4.2442e-08], grad_fn=<MulBackward0>)


## 2) Linear regression - vectorize and refactor

We will now vectorize our code (drop for loops), make it work for a whole batch of examples in one go, and merge the local and global gradient calculations into one step.

For a set of $n$ examples with inputs $\mathbf{x}_i \in \mathbb{R}^{d}$ and outputs $\mathbf{y} \in \mathbb{R}^n$, the affine (linear with bias) prediction function with parameters $\theta = \{\mathbf{w}, b\}$ is

$$\mathbf{\hat{y}} = f_\theta(\mathbf{X}) = \mathbf{Xw} + b \enspace ,$$
where $\mathbf{X}$ is the $(n \times d)$ data matrix and the scalar bias $b$ is broadcast across the whole prediction vector.

The loss is the mean squared error (MSE)

$$\mathcal{L}_{MSE}(\mathbf{\hat{y}, y}) = \frac{1}{n} ||\mathbf{\hat{y}} - \mathbf{y}||_2^2 \enspace .$$
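
Written out per example, the same loss is the average of the single-point squared errors from part 1, where $X_{ij}$ denotes the $j$-th input dimension of example $i$ (index notation introduced here for illustration):

$$\mathcal{L}_{MSE}(\mathbf{\hat{y}}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 \enspace , \qquad \hat{y}_i = \sum_{j=1}^d X_{ij} \, w_j + b \enspace .$$

For $n = 1$ this reduces to the squared error $\mathcal{L}_{SE}$.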

Work with the code in `code/linear_regression.py` and complete it as instructed below.

In [8]:
# get input and output data
X = in_data
y = labels
w = torch.tensor([ 1.5410, -0.2934, -2.1788]) 
b = torch.tensor([0.8380])
print(f"X: {X.shape}, \ny: {y.shape}")

X: torch.Size([90, 3]), 
y: torch.Size([90, 1])


### Forward propagation

Implement the vectorized versions of the forward pass in `linear_forward` and `mse_forward`. 
Avoid using `for`, `while` or other loops!

Check your implementation by comparing your MSE loss with the correct value (the difference should be tiny).

In [9]:
# get predictions and mse loss using the implemented functions
from ann_code.linear_regression import linear_forward, mse_forward

# parameter values
if w.dim() == 1:
    w = w[:, None] # add dimensions to w to make it (d, 1) tensor
b = b # b is left unchanged; its single element broadcasts over all predictions

# get predictions (use the same parameter values as above)
yhat, lin_cache = linear_forward(X, w, b)

# get mse loss
loss, loss_cache = mse_forward(yhat, y)

print(f"Your loss: {loss}, correct loss: 10.318559646606445")

Your loss: 10.318559646606445, correct loss: 10.318559646606445
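
The cell above reshapes `w` to a `(d, 1)` tensor before the forward pass. The sketch below, on hypothetical all-zero tensors rather than the toy data, illustrates why that shape matters when the labels are stored as an `(n, 1)` column: with a flat `w` the predictions have shape `(n,)`, and subtracting the labels silently broadcasts to an `(n, n)` matrix.

```python
import torch

n, d = 4, 3                  # hypothetical small batch, not the toy data
X = torch.zeros(n, d)
y = torch.zeros(n, 1)        # labels stored as a column, as printed earlier
b = torch.zeros(1)

w_flat = torch.zeros(d)      # shape (d,)
w_col = w_flat[:, None]      # shape (d, 1)

yhat_flat = X @ w_flat + b   # shape (n,)
yhat_col = X @ w_col + b     # shape (n, 1)

print(yhat_flat.shape, yhat_col.shape)
# subtracting an (n, 1) tensor from an (n,) tensor broadcasts to (n, n)
print((yhat_flat - y).shape, (yhat_col - y).shape)
```

Printing shapes like this is a cheap way to catch an unintended broadcast before it distorts the loss value.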


### Backward propagation

Next you shall implement the backward pass. Instead of creating two separate functions for local and global gradient calculation, we will merge these into a single function for each compute node.

Derive the local gradients and combine them appropriately with the upstream gradient to obtain the global gradient necessary for the backward propagation in `linear_backward` and `mse_backward`.

Then use the cell below to check your implementation. 
The relative errors should all be rather small (e.g. 1e-4).
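
For reference, a worked derivation from the vectorized formulas above (not a listing of the expected code) gives the loss gradients

$$\frac{\partial \mathcal{L}_{MSE}}{\partial \mathbf{\hat{y}}} = \frac{2}{n} (\mathbf{\hat{y}} - \mathbf{y}) \enspace , \qquad \frac{\partial \mathcal{L}_{MSE}}{\partial \mathbf{y}} = -\frac{2}{n} (\mathbf{\hat{y}} - \mathbf{y}) \enspace .$$

Writing $\mathbf{g} = \partial \mathcal{L}_{MSE} / \partial \mathbf{\hat{y}}$ for the upstream gradient (a symbol introduced here for brevity), the chain rule through $\mathbf{\hat{y}} = \mathbf{Xw} + b$ gives

$$\frac{\partial \mathcal{L}_{MSE}}{\partial \mathbf{w}} = \mathbf{X}^\top \mathbf{g} \enspace , \qquad \frac{\partial \mathcal{L}_{MSE}}{\partial b} = \sum_{i=1}^n g_i \enspace , \qquad \frac{\partial \mathcal{L}_{MSE}}{\partial \mathbf{X}} = \mathbf{g} \, \mathbf{w}^\top \enspace .$$

The shapes are $(d \times 1)$, scalar, and $(n \times d)$ respectively, matching the shapes of $\mathbf{w}$, $b$ and $\mathbf{X}$. The sum for $b$ appears because the bias is broadcast to every example.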

In [10]:
# After implementing the backward pass functions, you can check them here
from ann_code.linear_regression import linear_backward, mse_backward

# get mse gradients
yhatgrad, ygrad = mse_backward(loss_cache)
#print("yhatgrad====",yhatgrad)
# get linear func gradients
Xgrad, wgrad, bgrad = linear_backward(lin_cache, yhatgrad)

# check global gradients
# check Xgrad
print(f"Checking Xgrad")
f = lambda theta: mse_forward(linear_forward(theta, w, b)[0], y)[0]
xng = numerical_gradient(f, X)
grad_checker(Xgrad, xng, rnd=True)

# check wgrad
print(f"Checking wgrad")
f = lambda theta: mse_forward(linear_forward(X, theta, b)[0], y)[0]
wng = numerical_gradient(f, w)
grad_checker(wgrad, wng, rnd=True)

# check bgrad
print(f"Checking bg")
f = lambda theta: mse_forward(linear_forward(X, w, theta)[0], y)[0]
bng = numerical_gradient(f, b)
grad_checker(bgrad, bng)

Checking Xgrad
To save space, printing only randomly selected elements:
analytic grad: tensor([[-0.0846, -0.0019, -0.0705, -0.1331, -0.0728]])
numerical grad: tensor([[-0.0858, -0.0048, -0.0715, -0.1335, -0.0763]])
relative error: tensor([[3.1964e-06, 1.5998e-05, 2.2464e-06, 4.2331e-07, 2.4990e-05]])
Checking wgrad
To save space, printing only randomly selected elements:
analytic grad: tensor([[ 4.8180, -4.8611,  4.8180,  4.8180, -4.8611]])
numerical grad: tensor([[ 4.8256, -4.8542,  4.8256,  4.8256, -4.8542]])
relative error: tensor([[1.1640e-04, 9.5097e-05, 1.1640e-04, 1.1640e-04, 9.5097e-05]])
Checking bg
analytic grad: 1.6759978532791138
numerical grad: tensor([1.6785])
relative error: tensor([1.2191e-05])
