# Mathematical Formulation

- Suppose we have $m$ features and $n$ training examples.  
  Then, our model predicts:

$$
\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im}, 
\quad i = 1, \dots, n
$$

- Let

$$
\hat{Y} =
\begin{pmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_n
\end{pmatrix}.
$$

Then, in **matrix form** we can write:

$$
\begin{pmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_n
\end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1m} \\
1 & x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nm}
\end{pmatrix}
\begin{pmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_m
\end{pmatrix}.
$$

- Here:
  - The feature matrix $\boldsymbol{X}$ (of shape $n \times (m+1)$) contains all training examples, one per row, with a leading column of ones that multiplies the intercept $\beta_0$.
  - The weight vector $\boldsymbol{\beta}$ (of shape $(m+1) \times 1$) contains the model coefficients.
  - The predicted outputs $\boldsymbol{\hat{Y}}$ (of shape $n \times 1$) are obtained as $$\boldsymbol{\hat{Y}} = \boldsymbol{X\beta}$$
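
As a small illustration of the matrix form, the sketch below builds $\boldsymbol{X}$ by prepending a column of ones and checks that $\boldsymbol{X\beta}$ matches the per-example formula. The toy data and the coefficient values are made up for this example.

```python
import numpy as np

# Hypothetical toy data: n = 4 examples, m = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])
beta = np.array([0.5, 2.0, -1.0])  # (beta_0, beta_1, beta_2), chosen arbitrarily

# Prepend a column of ones so that beta_0 acts as the intercept.
n = X_raw.shape[0]
X = np.column_stack([np.ones(n), X_raw])  # shape (n, m + 1)

# Matrix form: all n predictions in a single product.
y_hat = X @ beta

# The same predictions, one example at a time, from the scalar formula.
y_hat_loop = np.array([beta[0] + beta[1] * x1 + beta[2] * x2 for x1, x2 in X_raw])

print(X.shape)                          # (4, 3)
print(np.allclose(y_hat, y_hat_loop))   # True
```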

## Loss Function
The residual sum of squares is $\sum_{i=1}^n (y_{i} - \hat{y}_{i})^2$, where $y_{i} \in \boldsymbol{Y}$ and $\hat{y}_{i} \in \boldsymbol{\hat{Y}}$. 
Thus, we have $$
J = (\boldsymbol{Y}-\boldsymbol{\hat{Y}})^T(\boldsymbol{Y}-\boldsymbol{\hat{Y}})
$$
Expanding the RHS, we get: $$\boldsymbol{Y^TY}-\boldsymbol{Y^T\hat{Y}}-\boldsymbol{\hat{Y}^TY}+\boldsymbol{\hat{Y}^T \hat{Y}}$$
Since both $\boldsymbol{Y}$ and $\boldsymbol{\hat{Y}}$ are matrices of shape $(n, 1)$, that is, vectors, the terms $\boldsymbol{Y^T\hat{Y}}$ and $\boldsymbol{\hat{Y}^TY}$ are both scalars. Each is the transpose of the other, and a scalar equals its own transpose, so the two terms are equal. Thus, we have $$
J = \boldsymbol{Y^TY}-2\boldsymbol{Y^T\hat{Y}}+\boldsymbol{\hat{Y}^T \hat{Y}}
$$
Now, substituting $\boldsymbol{\hat{Y}} = \boldsymbol{X\beta}$, we get:
$$
J(\boldsymbol{\beta}) = \boldsymbol{Y^TY}-2\boldsymbol{Y^TX\beta} + \boldsymbol{\beta ^TX^TX\beta}
$$
where we used $\boldsymbol{\hat{Y}^T} = \boldsymbol{(X\beta)^T} = \boldsymbol{\beta ^TX^T}$.
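
A quick numerical check can confirm the algebra. The sketch below, on randomly generated data (an assumption for illustration only), compares the element-wise sum of squared residuals, the vector form $(\boldsymbol{Y}-\boldsymbol{\hat{Y}})^T(\boldsymbol{Y}-\boldsymbol{\hat{Y}})$, and the expanded form $\boldsymbol{Y^TY} - 2\boldsymbol{Y^TX\beta} + \boldsymbol{\beta^TX^TX\beta}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2

# Random design matrix with an intercept column, random targets and weights.
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
Y = rng.normal(size=n)
beta = rng.normal(size=m + 1)

Y_hat = X @ beta

# Three ways of writing the same loss.
rss = np.sum((Y - Y_hat) ** 2)                                   # element-wise sum
J_vec = (Y - Y_hat) @ (Y - Y_hat)                                # vector form
J_expanded = Y @ Y - 2 * Y @ X @ beta + beta @ X.T @ X @ beta    # expanded form

print(np.isclose(rss, J_vec), np.isclose(rss, J_expanded))
```

All three expressions should agree up to floating-point rounding.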

## Minimizing the loss function
We need $\boldsymbol{\beta}$ such that $J(\boldsymbol{\beta})$ is minimized. To this end, we differentiate $J$ with respect to $\boldsymbol{\beta}$ and set its derivative to $0$. $$
 \frac{\partial J(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = 0 - 2\boldsymbol{Y^TX} + 2 \boldsymbol{\beta^TX^T X}
$$
Here the derivative is written as a row vector, and the third term was found using the identity $\frac{\partial (\boldsymbol{\beta^TA\beta})}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta^T(A+A^T)} = 2\boldsymbol{\beta^T A}$, where the second step follows since $\boldsymbol{A} = \boldsymbol{X^TX}$ is symmetric.
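
One way to sanity-check a matrix derivative is to compare it against finite differences. The sketch below does this for the gradient of the squared-error loss, written as a column vector $-2\boldsymbol{X^TY} + 2\boldsymbol{X^TX\beta}$. The helper names `loss`, `analytic_grad` and `numeric_grad` and the random data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def loss(beta, X, Y):
    """Squared-error loss J(beta) = (Y - X beta)^T (Y - X beta)."""
    r = Y - X @ beta
    return r @ r

def analytic_grad(beta, X, Y):
    """Gradient from the derivation, as a column vector: -2 X^T Y + 2 X^T X beta."""
    return -2 * X.T @ Y + 2 * X.T @ X @ beta

def numeric_grad(beta, X, Y, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (loss(beta + step, X, Y) - loss(beta - step, X, Y)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
Y = rng.normal(size=8)
beta = rng.normal(size=3)

g_analytic = analytic_grad(beta, X, Y)
g_numeric = numeric_grad(beta, X, Y)
print(np.allclose(g_analytic, g_numeric, atol=1e-4))
```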

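Thus, setting the derivative to zero, we have $$
\boldsymbol{Y^TX} = \boldsymbol{\beta^TX^T X}
$$
Assuming $\boldsymbol{X^TX}$ is invertible (which holds when the columns of $\boldsymbol{X}$ are linearly independent), we can multiply both sides on the right by its inverse:
$$
\boldsymbol{\beta^T} = \boldsymbol{Y^TX(X^TX)^{-1}}
$$
Taking the transpose of both sides:
$$
\boldsymbol{\beta} = \boldsymbol{[(X^TX)^{-1}]^T X^TY}
$$
Since $\boldsymbol{X^TX}$ is symmetric, so is its inverse, and we have
$$
\boldsymbol{\beta} = \boldsymbol{(X^TX)^{-1} X^TY}
$$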
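
The closed-form solution can be checked numerically before wrapping it in a class. The sketch below uses synthetic data (the sizes, `true_beta` and the noise level are arbitrary assumptions) to compute $\boldsymbol{\beta}$ from the normal equation, compare it with NumPy's least-squares solver, and confirm that the gradient vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 50, 3

# Synthetic linear data with a small amount of noise.
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equation: beta = (X^T X)^{-1} X^T Y.
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ Y

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The gradient of J should be (numerically) zero at the minimizer.
grad = -2 * X.T @ Y + 2 * X.T @ X @ beta_normal

print(np.allclose(beta_normal, beta_lstsq))
print(np.allclose(grad, 0.0, atol=1e-6))
```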

In [25]:
import numpy as np

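class CustomLinearRegressor:
    """Ordinary least squares fitted with the closed-form normal equation."""

    def __init__(self):
        self.beta = None  # coefficients (beta_0, ..., beta_m), set by fit()

    def fit(self, X_train, y_train):
        n = X_train.shape[0]
        # Prepend a column of ones so that beta_0 acts as the intercept.
        X = np.concatenate((np.ones((n, 1)), X_train), axis=1)
        # Normal equation: beta = (X^T X)^{-1} X^T y.
        self.beta = np.linalg.inv(X.T @ X) @ X.T @ y_train

    def predict(self, X_test):
        n = X_test.shape[0]
        # The test matrix needs the same leading column of ones.
        X = np.concatenate((np.ones((n, 1)), X_test), axis=1)
        return X @ self.beta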
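
The explicit inverse in `fit` mirrors the derivation, but it can fail or give numerically unreliable coefficients when $\boldsymbol{X^TX}$ is singular or close to it, for example when two features are collinear. A possible alternative, sketched below under the hypothetical name `LstsqLinearRegressor`, delegates the solve to `np.linalg.lstsq`, which returns a minimum-norm least-squares solution even for rank-deficient inputs. This is a variant for comparison, not the notebook's implementation.

```python
import numpy as np

class LstsqLinearRegressor:
    """Hypothetical variant of CustomLinearRegressor that avoids an explicit inverse."""

    def __init__(self):
        self.beta = None

    @staticmethod
    def _add_intercept(X):
        # Prepend a column of ones for the intercept term.
        return np.column_stack([np.ones(X.shape[0]), X])

    def fit(self, X_train, y_train):
        X = self._add_intercept(X_train)
        # lstsq minimizes ||X beta - y||^2 without forming (X^T X)^{-1}.
        self.beta, *_ = np.linalg.lstsq(X, y_train, rcond=None)
        return self

    def predict(self, X_test):
        return self._add_intercept(X_test) @ self.beta

# Toy data where the second feature is exactly twice the first,
# so X^T X is singular and the normal equation has no unique solution.
f1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X_toy = np.column_stack([f1, 2 * f1])
y_toy = 1.0 + 3.0 * f1

model = LstsqLinearRegressor().fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(np.allclose(preds, y_toy))
```

The individual coefficients are not unique in the collinear case, but the fitted predictions still reproduce the targets.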

In [4]:
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=2, n_informative=2, n_targets=1, noise=50)

In [7]:
import pandas as pd
df = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'target':y})
df.sample(5)

|    | feature1  | feature2  | target      |
|---:|----------:|----------:|------------:|
| 26 |  1.251678 | -2.103232 |  -46.145934 |
| 46 | -0.882336 | -0.971738 | -150.088081 |
| 99 |  1.875331 |  0.380150 |   78.699235 |
| 81 | -1.011850 |  0.115377 |  -23.836923 |
| 15 |  1.549113 |  0.931258 |  163.291496 |


In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [26]:
custom_lr = CustomLinearRegressor()
custom_lr.fit(X_train, y_train)

In [27]:
y_preds = custom_lr.predict(X_test)

In [28]:
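def rmse(y_true, y_pred):
    # Root mean squared error, in the same units as the target.
    return np.sqrt(np.mean((y_true - y_pred)**2))

def r2_score(y_true, y_pred):
    # Residual sum of squares of the model's predictions.
    ss_lr = np.sum((y_true - y_pred)**2)
    # Total sum of squares, i.e. the error of always predicting the mean.
    ss_m = np.sum((y_true - np.mean(y_true))**2)
    return 1 - ss_lr/ss_m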
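
To get a feel for what these two metrics report, it can help to evaluate them on cases with known answers. The sketch below repeats the two definitions so that it runs on its own, then scores a perfect prediction and a constant mean prediction on a made-up target vector.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2_score(y_true, y_pred):
    ss_lr = np.sum((y_true - y_pred) ** 2)
    ss_m = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_lr / ss_m

# Hypothetical targets used only for this check.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = y_true.copy()                          # predicts every target exactly
baseline = np.full_like(y_true, y_true.mean())   # always predicts the mean (2.5)

rmse_perfect, r2_perfect = rmse(y_true, perfect), r2_score(y_true, perfect)
rmse_baseline, r2_baseline = rmse(y_true, baseline), r2_score(y_true, baseline)

print(rmse_perfect, r2_perfect)     # 0.0 1.0
print(rmse_baseline, r2_baseline)   # sqrt(1.25) and 0.0
```

A perfect model gives an RMSE of 0 and an $R^2$ of 1, while always predicting the mean gives an $R^2$ of 0, so $R^2$ measures improvement over that baseline.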

In [29]:
print(rmse(y_test, y_preds))
print(r2_score(y_test, y_preds))

51.53060006119548
0.75784075038077


In [30]:
custom_lr.beta

array([ 3.17128145, 69.18110011, 78.20907969])