### 1. Linear Model ###

Consider a vector $y_{n \times 1}$ (**dependent, endogenous**) that we will try to explain using the information contained in a matrix $X_{n\times k}$ (**independent, exogenous**). We assume a linear relationship between $X$ and $y$, measured by $k$ effects stored in $b_{k\times 1}$. There will be an estimation error $e_{n \times 1}$ (**residuals**). Note that we are not yet using any of the seven assumptions.

$$ y = Xb + e  $$

To judge whether this is a "good" regression, we use a criterion function that depends on the residuals.

$$S(b) = f(e)$$


### 2. OLS ###

OLS takes the criterion function to be the sum of squared residuals, which is also the squared norm of the residual vector:

$$S(b) = \sum_{i=1}^{n} e_i^2 = e^Te = \|e \|^2 $$

Minimizing:

$$S(b) = e^Te = (y - Xb)^T(y - Xb) $$

$$\frac{dS(b)}{db} = -2(y - Xb)^TX = 0 \quad \Longrightarrow \quad X^T(y - Xb) = 0 $$

This yields the **normal equations**, the set of equations to be solved:

$$ X^Ty = X^TXb$$

Provided that $X^TX$ has full rank (so that it is invertible), solving for $b$:

$$ b = (X^TX)^{-1}X^Ty  $$
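As an illustration, the closed-form solution can be checked numerically. The sketch below uses NumPy on simulated data; the sample size, coefficients, and variable names are arbitrary choices made for this example, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative values only): n observations, k regressors
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Solve the normal equations X^T X b = X^T y without forming the inverse
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)
print(b_lstsq)
```

Using `np.linalg.solve` instead of explicitly inverting $X^TX$ is a common numerical choice; both routes should agree up to floating-point error when $X^TX$ is well conditioned.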

As a side note, replacing $y$ by the linear model in the normal equations gives:

$$ X^Ty = X^T(Xb + e)= X^TXb + X^Te, \quad \text{and since } X^Ty = X^TXb, \quad X^Te = 0 $$

If $X$ has a column of ones $i$ (that is, the model has an intercept), it follows that $i^Te = 0$ and therefore $\bar{e}=0$. Because the residuals then have zero mean, the sample covariance between each column of $X$ and $e$ is proportional to $X^Te$, which is always 0.
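The orthogonality $X^Te = 0$ is purely algebraic, so it should hold even when $y$ is unrelated to $X$. The following sketch checks this with NumPy, using an arbitrary random $y$ (an assumption made only for this example).

```python
import numpy as np

rng = np.random.default_rng(1)

n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes an intercept
y = rng.normal(size=n)  # y is deliberately unrelated to X

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b  # OLS residuals

orthogonality = X.T @ e   # should be numerically zero in every component
mean_residual = e.mean()  # zero because X contains a column of ones

print(orthogonality)
print(mean_residual)
```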






### 3. Geometrical properties ###

Replacing the estimated $b$ in the linear model:

$$ y = Xb + e = X(X^TX)^{-1}X^Ty + e $$

Defining $H = X(X^TX)^{-1}X^T$, we can check that this matrix projects a vector onto the column space of $X$. What is the dimension of this space? Since $H$ is symmetric and idempotent ($H^2 = H$), its rank equals its trace:

$$rank(H) = Tr(H) = Tr(X(X^TX)^{-1}X^T) = Tr((X^TX)^{-1}X^TX) = Tr(I_k) = k $$

The estimated values of $y$ lie in the column space of $X$, which has lower dimension ($k < n$). The residual vector $e$ can then be obtained by another projection of $y$, using the matrix $M=I-H$, which projects onto the $n-k$ dimensional space orthogonal to the columns of $X$. This is why it is said that **the residuals of the estimation have n-k degrees of freedom**.
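These projection properties can be verified numerically. The sketch below builds $H$ and $M$ for a random full-rank $X$ (dimensions chosen arbitrarily for illustration) and checks idempotency, the trace identities, and that $M$ annihilates $X$.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 30, 4
X = rng.normal(size=(n, k))  # random X, full column rank with probability one

H = X @ np.linalg.solve(X.T @ X, X.T)  # projection onto the column space of X
M = np.eye(n) - H                      # projection onto its orthogonal complement

trace_H = np.trace(H)
trace_M = np.trace(M)
rank_H = np.linalg.matrix_rank(H, tol=1e-8)

print(np.allclose(H @ H, H), np.allclose(H, H.T))  # idempotent and symmetric
print(trace_H, rank_H)                             # both equal k up to rounding
print(trace_M)                                     # equals n - k up to rounding
print(np.allclose(M @ X, 0.0))                     # M maps the columns of X to zero
```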

### 4. Statistical properties of b ###

The estimator $b$ is a random variable. What can we say about its properties? (note the use of $\stackrel{\text{A*}}{=}$ to highlight the assumption used)

$$b = (X^TX)^{-1}X^Ty \stackrel{\text{A6}}{=} (X^TX)^{-1}X^T( X\beta  + \epsilon) = \beta + (X^TX)^{-1}X^T\epsilon $$

**Expectation:**

$$ E[b] = \beta + E[(X^TX)^{-1}X^T\epsilon] \stackrel{\text{A1}}{=} \beta + (X^TX)^{-1}X^TE[\epsilon]  \stackrel{\text{A2}}{=} \beta   $$

Under assumptions A1, A2 and A6, $b$ is unbiased.


**Variance:**

$$ Var(b) = E[(b-E[b])(b-E[b])^T] = E[(b-\beta)(b-\beta)^T] = E[ (X^TX)^{-1}X^T\epsilon \epsilon^T X (X^TX)^{-1}]  $$

$$E[ (X^TX)^{-1}X^T\epsilon \epsilon^T X (X^TX)^{-1}]  \stackrel{\text{A1}}{=}  (X^TX)^{-1}X^TE[\epsilon \epsilon^T] X (X^TX)^{-1} $$

$$ (X^TX)^{-1}X^TE[\epsilon \epsilon^T] X (X^TX)^{-1}  \stackrel{\text{A3,A4}}{=}  \sigma^2(X^TX)^{-1}  $$

Under these assumptions, the variance of $b$ is:

$$ Var(b) = \sigma^2(X^TX)^{-1} $$

The diagonal elements contain the variances of the estimators $b_j$, and the off-diagonal elements contain the covariances between pairs of estimators.
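A small Monte Carlo experiment can make these two results concrete. The sketch below holds $X$ fixed, redraws homoskedastic, uncorrelated normal disturbances many times, and compares the empirical mean and covariance of $b$ with $\beta$ and $\sigma^2(X^TX)^{-1}$. The values of `beta`, `sigma`, and `reps` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n, sigma = 100, 2.0
beta = np.array([1.0, -2.0, 0.5])  # "true" coefficients, chosen for this example
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # held fixed below
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))  # one row of disturbances per replication
Y = X @ beta + eps                             # one simulated y per row

# Each row of B is b^T = y^T X (X^T X)^{-1}, using the symmetry of (X^T X)^{-1}
B = Y @ X @ XtX_inv

empirical_mean = B.mean(axis=0)
empirical_cov = np.cov(B, rowvar=False)
theoretical_cov = sigma**2 * XtX_inv

print(empirical_mean)
print(empirical_cov)
print(theoretical_cov)
```

The agreement is only approximate, since it is subject to simulation error that shrinks as `reps` grows.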


### 5. Distribution of e ###

Recall that $e$ can be constructed from $y$ by projecting it using the $M$ matrix, with $rank(M)=n-k$ (note the use of $\stackrel{\text{A*}}{=}$ to highlight the assumption used)

$$ e = My \stackrel{\text{A6}}{=} M(X\beta + \epsilon) = (I-H) X\beta + M\epsilon $$

Naturally, projecting $X$ onto its own column space yields $X$ (that is, $HX = X$):

$$ e = (X-X)\beta + M\epsilon = M\epsilon $$

Under assumption 6, the residual vector is a projection of the disturbances onto a subspace of dimension $n-k$, so it has $n-k$ degrees of freedom.

Furthermore, using assumptions 2-4 and 7:

$$\epsilon \sim N_n(0, \sigma^2I_n) $$

$$ e = M\epsilon \sim N_n(0, \sigma^2M) $$

The residual vector $e$ follows a degenerate normal distribution, because its covariance matrix $\sigma^2M$ has rank $n-k < n$ and is therefore singular. However, we can find a distribution for the squared norm of this vector. Using previous results:

$$ \frac{\epsilon}{\sigma} \sim N_n(0, I_n)  $$


$$ e^Te = (M\epsilon)^TM\epsilon = \epsilon^T M \epsilon $$

$$ \frac{e^Te}{\sigma^2} = \left(\frac{\epsilon}{\sigma}\right)^TM\left(\frac{\epsilon}{\sigma}\right) \sim \chi^2(n-k) \quad \text{given that } M \text{ is symmetric and idempotent with } rank(M)=n-k $$

Then, $\frac{e^Te}{\sigma^2}$ follows a $\chi^2(n-k)$ distribution.
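As a rough check of this result, the sketch below simulates normal disturbances, computes $e^Te/\sigma^2$ for each draw, and compares the sample mean and variance with those of a $\chi^2(n-k)$ variable, namely $n-k$ and $2(n-k)$. The dimensions, `sigma`, and `reps` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k, sigma = 30, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # residual-maker matrix

reps = 50000
eps = rng.normal(scale=sigma, size=(reps, n))  # disturbances, one draw per row
E = eps @ M                                    # rows are e^T = eps^T M (M is symmetric)
q = (E**2).sum(axis=1) / sigma**2              # e^T e / sigma^2 for each draw

sample_mean = q.mean()
sample_var = q.var()

print(sample_mean, n - k)       # mean of chi2(n-k) is n-k
print(sample_var, 2 * (n - k))  # variance of chi2(n-k) is 2(n-k)
```

Matching the first two moments does not prove the distributional claim, but a clear mismatch would signal an error in the setup.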
