### Linear Regression Derivation

Let the best-fit line be $ \hat Y_i = b X_i + a $. The sum of squared errors, $ S = \sum_i ( Y_i - \hat Y_i )^2 $, is the quantity to be minimized.

S is minimized at the values of a and b where both partial derivatives vanish, so $ \partial S / \partial a = 0 $ and $ \partial S / \partial b = 0 $.
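To make the two conditions concrete, here is a minimal sketch that evaluates S and its partial derivatives at a least-squares line. The data and the helper names `sse` and `sse_gradient` are made up for illustration, and `np.polyfit` is used only as a reference fit; the derivative expressions are the ones derived in the steps that follow.

```python
import numpy as np

def sse(a, b, x, y):
    """Sum of squared errors S for the line y_hat = b*x + a."""
    return np.sum((y - (b * x + a)) ** 2)

def sse_gradient(a, b, x, y):
    """Analytic partial derivatives (dS/da, dS/db) of S."""
    resid = y - (b * x + a)
    return -2.0 * np.sum(resid), -2.0 * np.sum(x * resid)

# Made-up illustrative data (not the notebook's dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Reference least-squares line; polyfit returns the slope first, then the intercept
b_opt, a_opt = np.polyfit(x, y, 1)

# Both partial derivatives are numerically zero at the minimum
print(sse_gradient(a_opt, b_opt, x, y))

# Moving away from (a_opt, b_opt) increases S
print(sse(a_opt, b_opt, x, y) < sse(a_opt + 0.1, b_opt, x, y))
```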

First condition:

$$ \frac{\partial S}{ \partial a} = 0 $$

$$ \frac{\partial}{\partial a} \bigg( \sum_{i=1}^n ( Y_i - \hat Y_i )^2 \bigg) = 0 $$

$$ \frac{\partial}{\partial a} \bigg( \sum_{i=1}^n ( Y_i - b X_i - a )^2 \bigg) = 0 $$

$$ \sum_{i=1}^n -2( Y_i - b X_i - a ) = 0 $$

$$ 2 \bigg( na - \sum_{i=1}^n Y_i + b \sum_{i=1}^n X_i \bigg) = 0 $$

$$ a = \frac{ \sum_{i=1}^n Y_i - b \sum_{i=1}^n X_i }{n} $$

$$ a = \bar Y - b \bar X $$

This means the constant a (the y-intercept) is set so that the line passes through the point $ (\bar X, \bar Y) $, the means of x and y. This makes sense because that point is the "center" of the data cloud.
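As a quick numerical check of this property, the sketch below picks a few arbitrary slopes and sets the intercept from $ a = \bar Y - b \bar X $. The data and the helper name `intercept_for_slope` are made up for illustration. Whatever the slope, the line passes through the mean point and the residuals sum to zero, which is exactly what the first condition requires.

```python
import numpy as np

def intercept_for_slope(x, y, b):
    """Intercept from the first condition: a = mean(y) - b * mean(x)."""
    return np.mean(y) - b * np.mean(x)

# Made-up illustrative data
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 3.0, 7.0, 9.0])

for b in (0.5, 1.0, 2.0):
    a = intercept_for_slope(x, y, b)
    residuals = y - (b * x + a)
    passes_through_mean = np.isclose(b * x.mean() + a, y.mean())
    residuals_sum_to_zero = np.isclose(residuals.sum(), 0.0)
    print(b, passes_through_mean, residuals_sum_to_zero)
```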

Second condition:

$$ \frac{\partial S}{ \partial b} = 0 $$

$$ \frac{\partial}{\partial b} \bigg( \sum_{i=1}^n ( Y_i - \hat Y_i )^2 \bigg) = 0 $$

$$ \frac{\partial}{\partial b} \bigg( \sum_{i=1}^n ( Y_i - b X_i - a )^2 \bigg) = 0 $$

$$  \sum_{i=1}^n -2X_i( Y_i - b X_i - a ) = \sum_{i=1}^n -2( X_iY_i - b X_i^2 - aX_i ) = 0 $$

Substituting $ a = \bar Y - b \bar X $ and dividing out the factor of $-2$,

$$  \sum_{i=1}^n ( X_iY_i - b X_i^2 -  X_i \bar Y + b  X_i \bar X ) = 0 $$

$$   \sum_{i=1}^n (X_iY_i -  X_i \bar Y) - b \sum_{i=1}^n ( X_i^2 - X_i \bar X) = 0 $$

$$   b = \frac{\sum_{i=1}^n (X_iY_i -  X_i \bar Y)}{\sum_{i=1}^n ( X_i^2 - X_i \bar X)} $$

We can also note that

$$ \sum_{i=1}^n ( \bar X ^2 - X_i \bar X ) = 0 \quad \text{and} \quad \sum_{i=1}^n ( \bar X \bar Y - Y_i \bar X ) = 0 $$
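Both identities follow directly from the definition of the mean, $ \sum_i X_i = n \bar X $ and $ \sum_i Y_i = n \bar Y $. Written out as a short worked step (sums run over all n data points):

$$ \sum_i ( \bar X ^2 - X_i \bar X ) = n \bar X^2 - \bar X \sum_i X_i = n \bar X^2 - n \bar X^2 = 0 $$

$$ \sum_i ( \bar X \bar Y - Y_i \bar X ) = n \bar X \bar Y - \bar X \sum_i Y_i = n \bar X \bar Y - n \bar X \bar Y = 0 $$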

Using this, b can also be written as

$$ b = \frac{\sum_{i=1}^n ( X_iY_i -  X_i \bar Y) + \sum_{i=1}^n ( \bar X \bar Y - Y_i \bar X ) }{\sum_{i=1}^n ( X_i^2 - X_i \bar X) + \sum_{i=1}^n ( \bar X ^2 - X_i \bar X ) }  $$

$$ b = \frac{ \sum_{i=1}^n \bigg( X_i (Y_i -\bar Y) - \bar X ( Y_i - \bar Y ) \bigg) }{ \sum_{i=1}^n \bigg( X_i^2 - 2 X_i \bar X +  \bar X ^2 \bigg) }   $$

$$ b = \frac{ \frac{1}{n} \sum_{i=1}^n ( X_i - \bar X )(Y_i - \bar Y)  }{ \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2 }  $$

$$ b = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}  $$

$$ b = \frac{ r \sigma_x \sigma_y }{ \sigma_x^2 }, \quad \text{where } r \text{ is Pearson's correlation coefficient} $$

$$ b = \frac{ r \sigma_y }{\sigma_x} $$
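The closed-form results above can be sketched in a few lines of NumPy. This is an illustrative sketch on made-up data, and the helper name `fit_line` is hypothetical. It computes the slope as covariance over variance, recomputes it as $ r \sigma_y / \sigma_x $, and compares both against `np.polyfit` as a reference.

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b) for y_hat = b*x + a, using b = cov(x, y) / var(x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    b = np.sum(dx * dy) / np.sum(dx ** 2)   # the 1/n factors cancel
    a = y.mean() - b * x.mean()             # line passes through the mean point
    return a, b

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = fit_line(x, y)

# Same slope from Pearson's r and the (population) standard deviations
r = np.corrcoef(x, y)[0, 1]
b_from_r = r * y.std() / x.std()

# Reference fit; polyfit returns the slope first, then the intercept
b_ref, a_ref = np.polyfit(x, y, 1)

print(np.isclose(b, b_from_r), np.isclose(b, b_ref), np.isclose(a, a_ref))
```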


#### For multiple independent variables

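$$ a = \bar Y - \sum_j b_j \bar X_j $$

$$ b = (X^TX)^{-1} X^T Y $$

Here $X$ is the $m \times n$ matrix of features (m samples, n features) with each column mean-centered, and $b$ is the vector of the n coefficients $b_j$.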
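A common alternative to handling the intercept separately is to prepend a column of ones to the feature matrix and solve for the intercept and the coefficients together. The sketch below takes that approach on synthetic, noise-free data; the helper name `normal_equation` is hypothetical, and `np.linalg.solve` is used in place of an explicit inverse. The recovered intercept agrees with $ a = \bar Y - \sum_j b_j \bar X_j $.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (Xd^T Xd) theta = Xd^T y, where Xd is X with a leading column of ones."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(X)), X])
    theta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
    return theta[0], theta[1:]   # intercept a, coefficient vector b

# Synthetic data generated from a known line, so the fit can be checked
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + X @ np.array([1.5, -2.0])

a, b = normal_equation(X, y)

# The intercept matches mean(y) - sum_j b_j * mean(X_j)
print(np.isclose(a, y.mean() - np.sum(b * X.mean(axis=0))))
```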

> Computing $ (X^TX)^{-1} $ costs $ O(n^3) $, where n is the number of features, so the normal equation becomes slow when there are many features. In practice, once n exceeds roughly 10,000 it may be worth switching from the normal equation to an iterative method such as gradient descent.
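For comparison, here is a minimal sketch of one such iterative process, batch gradient descent on the mean squared error. The function name `gradient_descent`, the learning rate, and the iteration count are assumptions chosen for this synthetic example, not values from the notebook. It assumes the features are on a comparable scale (for example, standardized), which keeps a fixed learning rate stable.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=2000):
    """Minimize (1 / 2m) * sum((X @ b + a - y)^2) by batch gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape                 # m samples, n features
    a, b = 0.0, np.zeros(n)
    for _ in range(n_iter):
        err = X @ b + a - y        # current prediction error
        a -= lr * err.mean()       # gradient with respect to the intercept
        b -= lr * (X.T @ err) / m  # gradient with respect to the coefficients
    return a, b

# Synthetic data with roughly standardized features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

a_gd, b_gd = gradient_descent(X, y)

# Reference least-squares solution on the design matrix with a column of ones
Xd = np.column_stack([np.ones(len(X)), X])
theta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

print(np.isclose(a_gd, theta[0], atol=1e-6), np.allclose(b_gd, theta[1:], atol=1e-6))
```

Each iteration costs on the order of m times n operations, so no n by n matrix is ever formed or inverted.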

> Non-invertibility of $X^TX$ has two common causes:

> 1. Redundant features: two or more features are linearly dependent.

> 2. Too many features: fewer samples than features, i.e. m < n, where m is the number of samples.
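Both cases can be illustrated with a small sketch on synthetic data. It relies on the fact that $X^TX$ has the same rank as $X$, so a rank smaller than the number of features means $X^TX$ is singular. Using the pseudo-inverse (`np.linalg.pinv`) as a fallback is an assumption of this sketch, not something the notebook itself does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: redundant features (second column is exactly twice the first)
x1 = rng.normal(size=20)
X_redundant = np.column_stack([x1, 2.0 * x1])
rank_redundant = np.linalg.matrix_rank(X_redundant)   # rank 1 with 2 features

# Case 2: too many features (m = 3 samples, n = 5 features)
X_wide = rng.normal(size=(3, 5))
rank_wide = np.linalg.matrix_rank(X_wide)              # at most 3 with 5 features

print(rank_redundant, X_redundant.shape[1])
print(rank_wide, X_wide.shape[1])

# The pseudo-inverse still yields a least-squares solution (the minimum-norm one)
y_wide = rng.normal(size=3)
b_wide = np.linalg.pinv(X_wide) @ y_wide
print(np.allclose(X_wide @ b_wide, y_wide))
```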

In [44]:
import pandas as pd
import numpy as np

In [45]:
data = pd.read_csv('student.csv')
data.head()

Unnamed: 0,Math,Reading,Writing
0,48,68,63
1,62,81,72
2,79,80,78
3,76,83,79
4,59,64,62


In [46]:
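# Features: Math and Reading scores; target: Writing score.
# Selecting by name avoids picking up the 'Unnamed: 0' index column.
X = data[['Math', 'Reading']].values
y = data['Writing'].values
# Standardize each feature to zero mean and unit variance
X = (X - X.mean(axis=0)) / X.std(axis=0)
X.shape, y.shape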

((1000, 2), (1000,))

In [47]:
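# Normal equation: b = (X^T X)^{-1} X^T y
b = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
# Intercept: a = mean(y) - sum_j b_j * mean(X_j); X is standardized, so this equals mean(y)
a = y.mean() - np.sum(b * X.mean(axis=0))
b, a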

(array([ 1.44746983, 13.33854736]), 68.616)

In [50]:
y_pred = np.dot(X,b)+a

In [51]:
def rsquare(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot
rsquare(y, y_pred)

0.9098901726717316