## Introduction

The multiple linear regression model is defined as

$ {y_i} = {\beta_0} + {\beta_1}x_{i1} + {\beta_2}x_{i2} + \ldots + {\beta_k}x_{ik} + u_i $

This model represents the 'population' we are interested in, where $u_i$ is the error term. We cannot know the true values of the $\beta$ parameters, so we will use the following model to 'estimate' them.

$ \hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_{i1} + \hat{\beta_2}x_{i2} + \ldots + \hat{\beta_k}x_{ik} $

Notice that $u_i$ isn't present in this model. The error term is unknown to us during estimation, but can be calculated later as the residual.

As it is currently defined, the model inputs have the subscript $i$ to show that a different prediction will be made for each data point. Writing the regression out for every observation from $i = 1$ to $n$ makes this explicit.

$ \hat{y_1} = \hat{\beta_0} + \hat{\beta_1}x_{11} + \hat{\beta_2}x_{12} + \ldots + \hat{\beta_k}x_{1k} $

$ \hat{y_2} = \hat{\beta_0} + \hat{\beta_1}x_{21} + \hat{\beta_2}x_{22} + \ldots + \hat{\beta_k}x_{2k} $

$ \vdots $

$ \hat{y_n} = \hat{\beta_0} + \hat{\beta_1}x_{n1} + \hat{\beta_2}x_{n2} + \ldots + \hat{\beta_k}x_{nk} $

Because the same $\hat{\beta}$ parameters appear in every one of these equations, we can simplify the above representation with matrices.

$ \hat{Y} = 
\begin{bmatrix}
    \hat{y_1} \\
    \hat{y_2} \\
    \vdots \\
    \hat{y_n}
\end{bmatrix}  
; \hat{\beta} = 
\begin{bmatrix}
    \hat{\beta_0} \\
    \hat{\beta_1} \\
    \vdots \\
    \hat{\beta_k}
\end{bmatrix}
; X = 
\begin{bmatrix}
    1 & x_{11} & x_{12} & \dots & x_{1k} \\
    1 & x_{21} & x_{22} & \dots & x_{2k} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & x_{n1} & x_{n2} & \dots & x_{nk} \\
\end{bmatrix}
$

If you are familiar with matrix algebra, you will notice that $\hat{Y}$ is simply the matrix product of $X$ and $\hat{\beta}$, that is, $\hat{Y} = X\hat{\beta}$. Also notice the 1s in the first column of $X$. These are the values that will be multiplied by the constant $\hat{\beta_0}$.
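To make the matrix form concrete, here is a minimal sketch with made-up numbers. The data and the values in `beta_hat` are hypothetical and chosen only for illustration; they are not estimates from any dataset.

```python
import numpy as np

# Hypothetical toy data: n = 3 observations, k = 2 regressors.
x = np.array([[2.0, 5.0],
              [4.0, 1.0],
              [6.0, 3.0]])

# Prepend a column of ones so the intercept is multiplied by 1 in every row.
X = np.column_stack([np.ones(len(x)), x])

# Hypothetical estimates (beta_0, beta_1, beta_2), for illustration only.
beta_hat = np.array([1.0, 0.5, 2.0])

# All n predictions at once: Y_hat = X @ beta_hat.
Y_hat = X @ beta_hat

# The same predictions written out row by row, as in the scalar equations.
Y_hat_rowwise = np.array([
    beta_hat[0] + beta_hat[1] * row[0] + beta_hat[2] * row[1] for row in x
])

print(Y_hat)          # [12.  5. 10.]
print(Y_hat_rowwise)  # [12.  5. 10.]
```

Both approaches give the same three predictions, which is all the matrix notation is saying.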


## Derive the estimator for $ \hat{\beta} $ by minimizing the sum of squared residuals

We want to find the matrix of parameters that will minimize the sum of squared residuals, or mathematically, we will be finding $ \min\sum{\hat{u}_i^2} $

Before we start, note that $ \sum{\hat{u}_i^2} $ is the product of a row vector of residuals with the matching column vector of residuals. This allows us to rewrite the sum of squared residuals as $ \hat{U}'\hat{U} $, where $\hat{U}$ is an $(n \times 1)$ matrix of residuals.
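A quick numerical check of this identity, using a hypothetical residual vector (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical (n x 1) residual vector with n = 4.
U = np.array([[1.0], [-2.0], [0.5], [3.0]])

# Element-wise version: square each residual and add them up.
ssr_sum = float(np.sum(U ** 2))

# Matrix version: (1 x n) times (n x 1) gives a 1 x 1 result.
ssr_matrix = (U.T @ U).item()

print(ssr_sum, ssr_matrix)  # 14.25 14.25
```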

With that taken care of, we can now derive the estimator.

$$ \min{\hat{U}'\hat{U}} $$

Remember that the residual is defined as $\hat{U} = Y -\hat{Y}$ and $\hat{Y} = X\hat{\beta} $, so 

$$ \min{ (Y -\hat{Y})'(Y -\hat{Y}) } $$

$$ \min{ (Y - X\hat{\beta})'(Y - X\hat{\beta}) } $$

$$ \min{ Y'Y - Y'X\hat{\beta} - (X\hat{\beta})'Y + (X\hat{\beta})'X\hat{\beta} } $$

$$ \min{ Y'Y - Y'X\hat{\beta} - \hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} } $$
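
The expansion stops one step short of the estimator itself. One standard way to finish, assuming $X'X$ is invertible (no regressor is an exact linear combination of the others), goes as follows. The term $Y'X\hat{\beta}$ is a scalar, so it equals its own transpose $\hat{\beta}'X'Y$, and the two middle terms can be combined:

$$ \min{ Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} } $$

Taking the derivative with respect to $\hat{\beta}$ and setting it equal to zero gives the first-order condition:

$$ -2X'Y + 2X'X\hat{\beta} = 0 $$

$$ X'X\hat{\beta} = X'Y $$

$$ \hat{\beta} = (X'X)^{-1}X'Y $$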

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('WAGE2_SMALL.CSV')

In [4]:
df.head(5)

Unnamed: 0,wage,tenure,IQ,educ
0,769,2,93,12
1,808,16,119,18
2,825,9,108,14
3,650,7,96,12
4,562,5,74,11


In [5]:
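# Y: (n x 1) column of observed wages
Y = np.matrix(df.wage).T
# X: (n x 4) design matrix with a leading column of ones for the intercept
X = np.matrix([np.ones(len(df)), df.tenure, df.IQ, df.educ]).T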

In [6]:
Y[:10]

matrix([[ 769],
        [ 808],
        [ 825],
        [ 650],
        [ 562],
        [1400],
        [ 600],
        [1081],
        [1154],
        [1000]], dtype=int64)

In [7]:
X[:10]

matrix([[  1.,   2.,  93.,  12.],
        [  1.,  16., 119.,  18.],
        [  1.,   9., 108.,  14.],
        [  1.,   7.,  96.,  12.],
        [  1.,   5.,  74.,  11.],
        [  1.,   2., 116.,  16.],
        [  1.,   0.,  91.,  10.],
        [  1.,  14., 114.,  18.],
        [  1.,   1., 111.,  15.],
        [  1.,  16.,  95.,  12.]])
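
NumPy's documentation no longer recommends `np.matrix` for new code, so here is a minimal sketch of the same setup with plain arrays. It rebuilds only the five rows shown by `df.head(5)` in memory, so it does not need the CSV file. The names `df_head`, `Y_arr`, and `X_arr` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(5), rebuilt in memory for illustration.
df_head = pd.DataFrame({
    "wage": [769, 808, 825, 650, 562],
    "tenure": [2, 16, 9, 7, 5],
    "IQ": [93, 119, 108, 96, 74],
    "educ": [12, 18, 14, 12, 11],
})

# Response as a 1-D float array.
Y_arr = df_head["wage"].to_numpy(dtype=float)

# Design matrix: a leading column of ones, then tenure, IQ, educ.
X_arr = np.column_stack([
    np.ones(len(df_head)),
    df_head[["tenure", "IQ", "educ"]].to_numpy(dtype=float),
])

print(X_arr.shape)  # (5, 4)
```

With plain arrays, `*` is element-wise, so matrix products are written with `@`.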

In [8]:
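# OLS estimator: B = (X'X)^{-1} X'Y (with np.matrix, * is matrix multiplication and .I is the inverse)
B = (X.T * X).I * X.T * Y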

Let's look at the results. This matrix contains the parameters for our model. The first parameter is the intercept ($\hat{\beta_0}$), and the following ones are the coefficients on tenure, IQ, and educ, in the same order as the columns of $X$.

In [9]:
B

matrix([[-199.55416715],
        [  10.3007383 ],
        [   4.85024855],
        [  43.93506431]])
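
Forming the inverse explicitly mirrors the formula, but it is not the only way to compute the estimate. The sketch below compares three routes on synthetic data. The data and `true_beta` are made up and are not the WAGE2 sample; `np.linalg.solve` and `np.linalg.lstsq` are standard NumPy routines, used here as alternatives rather than as the method of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the WAGE2 sample): n = 50 rows, 3 regressors.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([1.0, 2.0, -0.5, 0.25])  # made-up coefficients
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations with an explicit inverse, mirroring (X'X)^{-1} X'Y.
beta_inverse = np.linalg.inv(X.T @ X) @ X.T @ Y

# Solve X'X b = X'Y directly, without forming the inverse.
beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

# Least-squares solver that works on X itself.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# All three routes should agree up to floating-point error.
print(np.allclose(beta_inverse, beta_solve))
print(np.allclose(beta_inverse, beta_lstsq))
```

On well-conditioned data the three agree. When regressors are highly correlated, avoiding the explicit inverse is usually the numerically safer choice.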

In [10]:
def predict_wage(t, iq, ed):
    # B is a (4 x 1) matrix, so B[i, 0] extracts each coefficient as a scalar
    return float(B[0, 0] + (B[1, 0] * t) + (B[2, 0] * iq) + (B[3, 0] * ed))

In [11]:
df.head(1)

Unnamed: 0,wage,tenure,IQ,educ
0,769,2,93,12


In [12]:
predict_wage(2,93,12)

799.3411964727363
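
The introduction noted that the error term can be calculated after estimation as the residual. Here is a minimal sketch of that step for the five rows shown by `df.head(5)`, with the coefficients copied by hand from the `B` output. The names `X_head`, `wage_head`, `fitted`, and `residuals` are introduced here for illustration.

```python
import numpy as np

# Estimates copied from the B output (intercept, tenure, IQ, educ).
B = np.array([-199.55416715, 10.3007383, 4.85024855, 43.93506431])

# The five rows shown by df.head(5): a leading 1, then tenure, IQ, educ.
X_head = np.array([
    [1, 2, 93, 12],
    [1, 16, 119, 18],
    [1, 9, 108, 14],
    [1, 7, 96, 12],
    [1, 5, 74, 11],
], dtype=float)
wage_head = np.array([769, 808, 825, 650, 562], dtype=float)

fitted = X_head @ B             # one prediction per row
residuals = wage_head - fitted  # u_hat_i = y_i - y_hat_i

print(np.round(fitted, 2))
print(np.round(residuals, 2))
```

The first fitted value is about 799.34, matching `predict_wage(2, 93, 12)`, so the first residual is about 769 - 799.34 = -30.34: the model overpredicts this worker's wage.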