# Linear Regression

In this section we solve a linear regression problem by manually implementing the gradient descent algorithm using only NumPy. We use scikit-learn only to generate the data.

In [1]:
import numpy as np
import sklearn.datasets as datasets

The ```make_regression``` function from ```sklearn.datasets``` creates features and targets for our regression task.
- ```n_samples```: number of samples in the features and targets matrices
- ```n_features```: number of features in the features matrix
- ```n_informative```: how many of the features actually contribute to the output; in our case all features have an impact on the targets
- ```noise```: standard deviation of the Gaussian noise added to the output, which makes the linear model imperfect

In [2]:
X, y = datasets.make_regression(n_samples=100, n_features=10, n_informative=10, noise=0.01)
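As a side note, ```make_regression``` can also return the coefficients of the underlying linear model, which is handy for checking learned weights later. The sketch below assumes the optional ```coef=True``` and ```random_state``` arguments, which the cell above does not use; the ```_demo``` names are illustrative so that the notebook's ```X``` and ```y``` are left untouched.

```python
import numpy as np
import sklearn.datasets as datasets

# coef=True additionally returns the ground-truth coefficients;
# random_state fixes the seed so the data is reproducible (both are assumptions,
# the original cell uses neither).
X_demo, y_demo, true_coef = datasets.make_regression(
    n_samples=100, n_features=10, n_informative=10,
    noise=0.01, coef=True, random_state=0,
)

print(X_demo.shape, y_demo.shape, true_coef.shape)  # (100, 10) (100,) (10,)

# With the default bias of 0, the targets differ from X @ coef only by the noise.
max_residual = np.abs(y_demo - X_demo @ true_coef).max()
print(f"Largest deviation from the noiseless linear model: {max_residual:.4f}")
```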

The dimensions of the $\mathbf{X}$ matrix are as expected: ```(number of samples, number of features)```.

In [3]:
print(f'The shape of the feature matrix: {X.shape}')

The shape of the feature matrix: (100, 10)


The $\mathbf{y}$ vector has the shape ```(100,)```.

Our predictions $\mathbf{\hat{y}}$ will have the shape ```(100, 1)```, therefore we reshape the output vector to match our predictions.

In [4]:
y = y.reshape(100, 1)
print(f'The dimension of the output vector: {y.shape}')

The dimension of the output vector: (100, 1)
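The reshape matters because of NumPy broadcasting: subtracting a ```(100, 1)``` array from a ```(100,)``` array does not fail, it silently produces a ```(100, 100)``` matrix. A minimal sketch with placeholder arrays (the names ```y_flat``` and ```y_hat_demo``` are illustrative only):

```python
import numpy as np

y_flat = np.zeros(100)            # shape (100,), like the raw targets
y_hat_demo = np.zeros((100, 1))   # shape (100, 1), like the predictions

# Mismatched shapes broadcast to a full matrix instead of raising an error.
bad_diff = y_flat - y_hat_demo
print(bad_diff.shape)   # (100, 100)

# After reshaping, the difference is elementwise, one value per sample.
good_diff = y_flat.reshape(-1, 1) - y_hat_demo
print(good_diff.shape)  # (100, 1)
```

Without the reshape, the mean over the errors would average 10,000 pairwise differences rather than 100 per-sample errors.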


We create the weight vector $\mathbf{w}$ and the bias $b$ and initialize the values using the standard normal distribution.

In [5]:
w = np.random.randn(1, 10)
b = np.random.randn(1, 1)

Linear regression trained with gradient descent does not require many hyperparameters. We generally only need the number of epochs and the learning rate $\alpha$.

In [6]:
alpha = 0.1
epochs = 100

Below is the actual implementation of gradient descent.
1. We start by calculating the output of the linear regression $\mathbf{\hat{y}} = \mathbf{X} \mathbf{w}^T + b$
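2. The gradient with respect to a single weight $w_j$, averaged over all $n$ samples, is $\dfrac{\partial{L}}{\partial w_j} = \dfrac{1}{n} \sum_{i=1}^{n} -x_j^{(i)}(y^{(i)} - \hat{y}^{(i)})$. This is the derivative of half the mean squared error, $L = \dfrac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$; for the plain mean squared error the gradient is twice as large, which only rescales the learning rate. We calculate the derivatives for all weights and samples simultaneously using the broadcast elementwise product $-\mathbf{X} \odot (\mathbf{y} - \mathbf{\hat{y}})$ and end up with a matrix of shape ```(n_samples, n_features)```. Each entry of this matrix holds the partial derivative with respect to a particular weight for a particular sample. We average the derivatives over all samples and end up with a gradient vector $\mathbf{\nabla}$. The derivative with respect to the bias $b$ is calculated in the same way: $\dfrac{\partial L}{\partial b} = \dfrac{1}{n} \sum_{i=1}^{n} -(y^{(i)} - \hat{y}^{(i)})$.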
3. Finally we apply batch gradient descent $\mathbf{w} := \mathbf{w} - \alpha \mathbf{\nabla}$ and $b := b - \alpha \dfrac{\partial L}{\partial b}$.
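The gradient expressions from step 2 can be checked numerically before they go into the training loop. The sketch below is not part of the original notebook: it uses random placeholder data (the ```_chk``` names are illustrative) and compares the broadcast formula with central finite differences of the half-MSE loss, and with the equivalent matrix product $-\frac{1}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{\hat{y}})$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_chk = rng.standard_normal((100, 10))   # placeholder features
y_chk = rng.standard_normal((100, 1))    # placeholder targets
w_chk = rng.standard_normal((1, 10))
b_chk = rng.standard_normal((1, 1))

def half_mse(w, b):
    """Loss whose gradient matches the formulas in step 2: mean of 0.5 * error**2."""
    y_hat = X_chk @ w.T + b
    return 0.5 * ((y_chk - y_hat) ** 2).mean()

# Analytic gradients, written the same way as in the steps above.
y_hat_chk = X_chk @ w_chk.T + b_chk
grad_w = (-X_chk * (y_chk - y_hat_chk)).mean(axis=0)
grad_b = -(y_chk - y_hat_chk).mean()

# Equivalent matrix form: -(1/n) X^T (y - y_hat)
grad_w_matmul = -(X_chk.T @ (y_chk - y_hat_chk)).ravel() / X_chk.shape[0]

# Central finite differences, one weight at a time.
eps = 1e-6
numeric_w = np.zeros(10)
for j in range(10):
    step = np.zeros((1, 10))
    step[0, j] = eps
    numeric_w[j] = (half_mse(w_chk + step, b_chk) - half_mse(w_chk - step, b_chk)) / (2 * eps)
numeric_b = (half_mse(w_chk, b_chk + eps) - half_mse(w_chk, b_chk - eps)) / (2 * eps)

print(np.allclose(grad_w, numeric_w, atol=1e-5))
print(np.allclose(grad_w, grad_w_matmul))
print(np.isclose(grad_b, numeric_b, atol=1e-5))
```

If all three checks agree, the broadcast product and the matrix product are interchangeable; the matrix form avoids building the intermediate ```(n_samples, n_features)``` array.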

In [7]:
for epoch in range(epochs):
    # 1. calculate output of linear regression
    y_hat = X @ w.T + b
    # output info every 10 epochs
    if epoch % 10 == 0:
        mse = ((y - y_hat)**2).mean()
        print(f"Epoch: {epoch} | MSE: {mse:.2f}")
    # 2. calculate the gradients 
    grad_w = (-X * (y - y_hat)).mean(axis=0) 
    grad_b = -(y - y_hat).mean()
    # 3. apply batch gradient descent
    w = w - alpha * grad_w
    b = b - alpha * grad_b

Epoch: 0 | MSE: 27536.11
Epoch: 10 | MSE: 4520.28
Epoch: 20 | MSE: 981.36
Epoch: 30 | MSE: 246.61
Epoch: 40 | MSE: 66.19
Epoch: 50 | MSE: 18.34
Epoch: 60 | MSE: 5.18
Epoch: 70 | MSE: 1.48
Epoch: 80 | MSE: 0.43
Epoch: 90 | MSE: 0.13


The mean squared error shrinks fast: it falls from about 27,500 at the start to below 1 by epoch 80 and reaches 0.13 by epoch 90.
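How fast the error shrinks depends on the learning rate. As a rough illustration, the sketch below reruns the same update rule with three values of $\alpha$. It is an assumption-laden stand-in rather than the notebook's experiment: the data is generated with NumPy instead of ```make_regression```, and ```final_mse``` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: standard normal features and a random linear model plus small noise.
X_lr = rng.standard_normal((100, 10))
true_w = rng.standard_normal((1, 10))
y_lr = X_lr @ true_w.T + 0.01 * rng.standard_normal((100, 1))

def final_mse(alpha, epochs=100):
    """Run batch gradient descent with the given learning rate and return the final MSE."""
    init = np.random.default_rng(1)  # identical starting point for every alpha
    w = init.standard_normal((1, 10))
    b = init.standard_normal((1, 1))
    for _ in range(epochs):
        y_hat = X_lr @ w.T + b
        grad_w = (-X_lr * (y_lr - y_hat)).mean(axis=0)
        grad_b = -(y_lr - y_hat).mean()
        w = w - alpha * grad_w
        b = b - alpha * grad_b
    return ((y_lr - (X_lr @ w.T + b)) ** 2).mean()

for alpha in (0.001, 0.01, 0.1):
    print(f"alpha={alpha}: final MSE after 100 epochs = {final_mse(alpha):.4f}")
```

On this stand-in data, smaller learning rates leave a larger error after the same number of epochs. Learning rates that are too large can instead make the updates overshoot and diverge.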