# Vectorization of Linear Regression
This notebook shows how to vectorize the common calculations in linear regression with NumPy.

In [18]:
import numpy as np

#### Training Dataset and Labels

Assume there are ten samples in the training set and each sample has 3 features.

In [19]:
np.random.seed(42)

X_transpose = np.random.rand(10, 3)  # Named X_transpose assuming X stores each sample as a column, so each row here is one sample.
X_transpose

array([[0.37454012, 0.95071431, 0.73199394],
       [0.59865848, 0.15601864, 0.15599452],
       [0.05808361, 0.86617615, 0.60111501],
       [0.70807258, 0.02058449, 0.96990985],
       [0.83244264, 0.21233911, 0.18182497],
       [0.18340451, 0.30424224, 0.52475643],
       [0.43194502, 0.29122914, 0.61185289],
       [0.13949386, 0.29214465, 0.36636184],
       [0.45606998, 0.78517596, 0.19967378],
       [0.51423444, 0.59241457, 0.04645041]])

Add a dummy feature of ones so that the bias term can be treated as an ordinary weight.

In [20]:
feature_for_bias = np.ones(len(X_transpose))
feature_for_bias

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [21]:
features = np.column_stack((feature_for_bias, X_transpose))
features

array([[1.        , 0.37454012, 0.95071431, 0.73199394],
       [1.        , 0.59865848, 0.15601864, 0.15599452],
       [1.        , 0.05808361, 0.86617615, 0.60111501],
       [1.        , 0.70807258, 0.02058449, 0.96990985],
       [1.        , 0.83244264, 0.21233911, 0.18182497],
       [1.        , 0.18340451, 0.30424224, 0.52475643],
       [1.        , 0.43194502, 0.29122914, 0.61185289],
       [1.        , 0.13949386, 0.29214465, 0.36636184],
       [1.        , 0.45606998, 0.78517596, 0.19967378],
       [1.        , 0.51423444, 0.59241457, 0.04645041]])
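The column of ones lets the bias be folded into the weight vector. The following is a minimal sketch of why that works; `X`, `w`, and `b` are illustrative names with made-up values, not the notebook's data.

```python
import numpy as np

# Illustrative data: 4 samples, 3 features (not the notebook's dataset).
X = np.array([[0.2, 0.5, 0.1],
              [0.9, 0.3, 0.7],
              [0.4, 0.8, 0.6],
              [0.1, 0.0, 0.5]])
w = np.array([1.5, -2.0, 0.5])  # one weight per feature (assumed values)
b = 0.25                        # bias term (assumed value)

# Explicit bias: add b to every prediction.
explicit = X.dot(w) + b

# Folded bias: prepend a column of ones and put b first in the weight vector.
X_with_ones = np.column_stack((np.ones(len(X)), X))
w_with_bias = np.concatenate(([b], w))
folded = X_with_ones.dot(w_with_bias)

print(np.allclose(explicit, folded))  # True
```

Because the first feature is always 1, the first weight contributes a constant to every prediction, which is exactly what a bias does.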

In [22]:
targets = 10 + (np.random.rand(10) * 10)
targets

array([16.07544852, 11.70524124, 10.65051593, 19.48885537, 19.65632033,
       18.08397348, 13.04613769, 10.97672114, 16.84233027, 14.40152494])

In [23]:
weights_transpose = np.random.rand(len(features[0]))
weights_transpose

array([0.12203823, 0.49517691, 0.03438852, 0.9093204 ])

# Linear Regression

## W <sup>T</sup> X (or equivalently X<sup>T</sup>W) 

#### In stochastic gradient descent (using a single sample)

In [24]:
single_sample = features[0]

w_T_X = weights_transpose.dot(single_sample)
w_T_X

1.0058125380960208

In [25]:
X_T_w = single_sample.T.dot(weights_transpose.T)
X_T_w

1.0058125380960208

#### In minibatch gradient descent (using multiple samples)

In [26]:
w_T_X = (weights_transpose).dot(features.T)
w_T_X

array([1.00581254, 0.56569434, 0.72719256, 1.35532611, 0.70688379,
       0.70049008, 0.90231269, 0.53429909, 0.55644204, 0.43928582])

In [27]:
X_T_w = (features).dot(weights_transpose.T)
X_T_w

array([1.00581254, 0.56569434, 0.72719256, 1.35532611, 0.70688379,
       0.70049008, 0.90231269, 0.53429909, 0.55644204, 0.43928582])
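To make the benefit of vectorization concrete, the sketch below compares an explicit per-sample loop with the single matrix-vector product. It rebuilds the data with the same seed and call order as the cells above; the names `predictions_loop` and `predictions_vectorized` are introduced here for illustration.

```python
import numpy as np

# Rebuild the data in the same order as the notebook cells.
np.random.seed(42)
X_transpose = np.random.rand(10, 3)
features = np.column_stack((np.ones(len(X_transpose)), X_transpose))
targets = 10 + np.random.rand(10) * 10
weights_transpose = np.random.rand(features.shape[1])

# Loop version: one dot product per sample.
predictions_loop = np.array([weights_transpose.dot(sample) for sample in features])

# Vectorized version: a single matrix-vector product.
predictions_vectorized = features.dot(weights_transpose)

print(np.allclose(predictions_loop, predictions_vectorized))  # True

# For a 1-D NumPy array, .T returns the array unchanged.
print(weights_transpose.T.shape == weights_transpose.shape)  # True
```

Since `.T` is a no-op on 1-D arrays, `features.dot(weights_transpose.T)` and `features.dot(weights_transpose)` compute the same thing; the transposes in the cells above mainly mirror the mathematical notation.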

## Error

## W<sup>T</sup>X - Y<sup>T</sup> (or equivalently X<sup>T</sup>W - Y) 

In [28]:
X_T_w - targets

array([-15.06963598, -11.13954689,  -9.92332337, -18.13352926,
       -18.94943654, -17.3834834 , -12.143825  , -10.44242205,
       -16.28588822, -13.96223911])

## Mean Squared Error

### MSE = $\frac{1}{2}$  $\sum_{i}$ (w<sup>T</sup>x<sup>(i)</sup> - y<sup>(i)</sup>)<sup>2</sup> = $\frac{1}{2}$ (w<sup>T</sup>X - Y<sup>T</sup>) . (w<sup>T</sup>X - Y<sup>T</sup>) = $\frac{1}{2}$ (X<sup>T</sup>w - Y) . (X<sup>T</sup>w - Y)

In [29]:
MSE = (1/2) * (X_T_w - targets).dot(X_T_w - targets)
MSE

1078.2191219484507
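As a sanity check, the sketch below computes the same quantity with an explicit loop and with a dot product. Note that the formula above is half the sum of squared errors; dividing the sum by the number of samples gives the mean in the usual sense, which the last line computes. The variable names are introduced here for illustration.

```python
import numpy as np

# Rebuild the data in the same order as the notebook cells.
np.random.seed(42)
X_transpose = np.random.rand(10, 3)
features = np.column_stack((np.ones(len(X_transpose)), X_transpose))
targets = 10 + np.random.rand(10) * 10
weights_transpose = np.random.rand(features.shape[1])

errors = features.dot(weights_transpose) - targets

# Loop version: accumulate half the squared error, one sample at a time.
half_sse_loop = 0.0
for error in errors:
    half_sse_loop += 0.5 * error ** 2

# Vectorized version: the dot product of the error vector with itself.
half_sse_vectorized = 0.5 * errors.dot(errors)

# Mean of the squared errors (sum divided by the number of samples).
mean_squared_error = np.mean(errors ** 2)

print(np.isclose(half_sse_loop, half_sse_vectorized))  # True
```

The factor of 1/2 is a common convenience: it cancels the 2 that appears when the square is differentiated, and it does not change where the minimum lies.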

## Gradient
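The gradient of the error E with respect to the model weights w is the vector of partial derivatives. Three weights are shown for brevity; the code in this notebook uses four, including the bias.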

## $\frac{\partial E}{\partial w}$ = $ \begin{bmatrix} \frac{\partial E}{\partial w _{0}}\\ \frac{\partial E}{\partial w _{1}} \\ \frac{\partial E}{\partial w _{2}} \end{bmatrix}$

Also, we know each component in the gradient is

## $\frac{\partial E}{\partial w _{0}}$ =  $\sum_{i} $ (w<sup>T</sup> x <sup>(i)</sup>  - Y<sup>(i)</sup>)  . x <sup>(i)</sup> <sub>0</sub>



#### The sum of products can be rewritten as the dot product of a row vector and a column vector (shown for three samples):

###  $\frac{\partial E}{\partial w _{0}}$ =  $ \begin{bmatrix} w^{T}x^{(1)} - Y^{(1)} & w^{T}x^{(2)} - Y^{(2)} & w^{T}x^{(3)}- Y^{(3)}\end{bmatrix}$ . $ \begin{bmatrix} x^{(1)}_{0} \\  x^{(2)}_{0} \\ x^{(3)}_{0}\end{bmatrix}$ 

## $\frac{\partial E}{\partial w _{1}}$ =  $\sum_{i} $ (w<sup>T</sup> x <sup>(i)</sup>  - Y<sup>(i)</sup>)  . x <sup>(i)</sup> <sub>1</sub>

## $\frac{\partial E}{\partial w _{2}}$ =  $\sum_{i} $ (w<sup>T</sup> x <sup>(i)</sup>  - Y<sup>(i)</sup>)  . x <sup>(i)</sup> <sub>2</sub>

#### Also, notice that every component $\frac{\partial E}{\partial w _{j}}$ shares the same row vector of errors:

## $ \begin{bmatrix} w^{T}x^{(1)} - Y^{(1)} & w^{T}x^{(2)} - Y^{(2)} & w^{T}x^{(3)}- Y^{(3)}\end{bmatrix}$

### Hence, $\frac{\partial E}{\partial w}$ can be rewritten in matrix notation as

###  $\frac{\partial E}{\partial w }$ =  $ \begin{bmatrix} w^{T}x^{(1)} - Y^{(1)} & w^{T}x^{(2)} - Y^{(2)} & w^{T}x^{(3)}- Y^{(3)}\end{bmatrix}$ . $ \begin{bmatrix} x^{(1)}_{0} & x^{(1)}_{1} & x^{(1)}_{2} \\  x^{(2)}_{0} & x^{(2)}_{1} & x^{(2)}_{2}\\ x^{(3)}_{0} & x^{(3)}_{1} & x^{(3)}_{2}\end{bmatrix}$ 

## $\frac{\partial E}{\partial w}$ = (W<sup>T</sup>X - Y<sup>T</sup>) . X<sup>T</sup>

In [30]:
dE_by_dw = (w_T_X - targets).dot(features)
dE_by_dw

array([-143.43332982,  -66.00122048,  -61.99206272,  -64.04546937])
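One way to gain confidence in the vectorized gradient is to compare it against a plain double loop and against a numerical estimate. The sketch below does both; the helper `half_squared_error` and the step size `eps` are assumptions introduced for this check, not part of the notebook.

```python
import numpy as np

# Rebuild the data in the same order as the notebook cells.
np.random.seed(42)
X_transpose = np.random.rand(10, 3)
features = np.column_stack((np.ones(len(X_transpose)), X_transpose))
targets = 10 + np.random.rand(10) * 10
weights = np.random.rand(features.shape[1])


def half_squared_error(w):
    """E(w) = 1/2 * sum_i (w^T x_i - y_i)^2 (hypothetical helper)."""
    errors = features.dot(w) - targets
    return 0.5 * errors.dot(errors)


# Vectorized gradient: the error row vector times X^T (`features`).
grad_vectorized = (features.dot(weights) - targets).dot(features)

# Loop gradient: accumulate error_i * x_ij for every sample i and weight j.
grad_loop = np.zeros_like(weights)
for sample, target in zip(features, targets):
    error = weights.dot(sample) - target
    for j in range(len(weights)):
        grad_loop[j] += error * sample[j]

# Numerical gradient via central differences (eps is an assumed step size).
eps = 1e-6
grad_numeric = np.zeros_like(weights)
for j in range(len(weights)):
    step = np.zeros_like(weights)
    step[j] = eps
    grad_numeric[j] = (half_squared_error(weights + step)
                       - half_squared_error(weights - step)) / (2 * eps)

print(np.allclose(grad_vectorized, grad_loop))                 # True
print(np.allclose(grad_vectorized, grad_numeric, rtol=1e-4))   # True
```

The numerical check is only approximate, which is why it uses a looser tolerance than the loop comparison.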

## Gradient Update

In [33]:
print("Weights before gradient update: ", weights_transpose)

Weights before gradient update:  [0.12203823 0.49517691 0.03438852 0.9093204 ]


In [34]:
lr = 0.0001
weights_transpose = weights_transpose - lr * dE_by_dw
print("Weights after gradient update: ", weights_transpose)

Weights after gradient update:  [0.13638157 0.50177703 0.04058773 0.91572495]
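A single update only nudges the weights. As a rough sketch of how the same vectorized step might be repeated, the loop below applies it a few times and records the error before each update. `n_steps` is an assumed value chosen for illustration, and `losses` is a name introduced here.

```python
import numpy as np

# Rebuild the data in the same order as the notebook cells.
np.random.seed(42)
X_transpose = np.random.rand(10, 3)
features = np.column_stack((np.ones(len(X_transpose)), X_transpose))
targets = 10 + np.random.rand(10) * 10
weights = np.random.rand(features.shape[1])

lr = 0.0001   # same learning rate as the notebook
n_steps = 5   # assumed number of iterations, for illustration only

losses = []
for _ in range(n_steps):
    errors = features.dot(weights) - targets
    losses.append(0.5 * errors.dot(errors))  # error before this update
    gradient = errors.dot(features)          # vectorized gradient
    weights = weights - lr * gradient        # gradient update

# With a learning rate this small, each recorded error is lower than the previous one.
print(all(later < earlier for earlier, later in zip(losses, losses[1:])))  # True
```

The decrease per step is small at this learning rate, so many more iterations would be needed before the weights approach the least-squares solution.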
