# THE GAUSS-MARKOV THEOREM
<br>


## Introduction

<br>
The reason OLS estimation is so popular is that, under the appropriate assumptions, the OLS coefficient estimators have several desirable statistical properties. In this notebook we will examine some of these statistical properties, primarily (but not only) in terms of the OLS slope coefficient estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ .

<br>
Let's start by recalling the two equations we defined for linear regression (the population and the sample regression equations) and the OLS estimators : <br>


<blockquote>
$
    \begin{align}
        \boldsymbol{\mathbf{Y}_i} = \boldsymbol{\beta}\boldsymbol{\mathbf{X}_i} + \boldsymbol{\varepsilon_i}
        & \quad \boldsymbol{\text{PRE}}
        \newline
        \boldsymbol{\mathbf{Y}_i}
        = \boldsymbol{\hat{\beta}}\boldsymbol{\mathbf{X}_i} + \boldsymbol{\mathbf{e}_i}
        & \quad \boldsymbol{\text{SRE}}
    \end{align}
$
</blockquote>

<blockquote>
$
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
        &= 
            \dfrac
                { 
                    \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
                    - m \ \overline{\mathbf{X}} \overline{\mathbf{Y}} 
                }
                { \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} - m \ \boldsymbol{\overline{\mathbf{X}}^2} }
        \newline
        &= 
            \dfrac
                { 
                    \sum_{i=1}^{m}
                    \big( \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}} \big)
                    \big( \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} \big)
                }
                {
                    \sum_{i=1}^{m} 
                    \big( \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}} \big)^2
                }
        = 
            \dfrac
                {\mathrm{Cov}(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})}
                {\mathrm{Var}(\boldsymbol{\mathbf{X}_i})} 
        \newline
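        &= 
            \dfrac
                { \sum_{i=1}^{m} \boldsymbol{\mathbf{x}_i} \boldsymbol{\mathbf{y}_i} }
                { \sum_{i=1}^{m} \boldsymbol{{\mathbf{x}_i}^2} }
            \qquad \text{where } \boldsymbol{\mathbf{x}_i} = \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}
            \text{ and } \boldsymbol{\mathbf{y}_i} = \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}}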
    \end{align}
$
</blockquote>

<blockquote>
$
    \quad
    \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} 
    = \overline{\mathbf{Y}} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
$
</blockquote>
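
<br>
As a quick sanity check, the equivalent forms of the slope estimator above can be compared numerically. The snippet below is a minimal sketch using only the Python standard library; the data values and the names `slope_raw`, `slope_dev` and `intercept` are hypothetical and chosen purely for illustration.

```python
# Hypothetical toy data, chosen only for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
m = len(X)

x_bar = sum(X) / m
y_bar = sum(Y) / m

# Form 1: raw sums of products and squares.
slope_raw = (sum(x * y for x, y in zip(X, Y)) - m * x_bar * y_bar) / (
    sum(x * x for x in X) - m * x_bar ** 2
)

# Form 2: sums of deviations from the sample means.
slope_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)

# Intercept estimator: Y-bar minus slope times X-bar.
intercept = y_bar - slope_dev * x_bar

print(slope_raw, slope_dev, intercept)
```

Both forms should agree up to floating-point rounding, which is all the algebra above claims.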

## The Gauss–Markov theorem

<br>
In statistics, the Gauss–Markov theorem states that, under the assumptions [<b>A1 - A8</b>] of the Classical Linear Regression Model (<b>CLRM</b>), the ordinary least squares (<b>OLS</b>) estimator of the regression coefficients is the <b>B</b>est <b>L</b>inear <b>U</b>nbiased <b>E</b>stimator (<b>BLUE</b>), provided it exists.

<br>
Equivalently, the theorem establishes that under the <b>CLRM</b> assumptions, the OLS coefficient estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j}$ (j = 0, 1) are the minimum-variance estimators in the class of all linear
unbiased estimators of the corresponding population parameters.

<br>
Although all the <b>CLRM</b> assumptions are actually needed in the broader context of linear regression, only a few of them are usually cited in the Gauss-Markov theorem, mainly those concerned with the disturbance term $\boldsymbol{\varepsilon}$, together with linearity and full rank : 

<br>
<ul style="list-style-type:square">
    <li>
        <b>linearity (A1)</b>
    </li>
    <br>
    <li>
        <b>strict exogeneity (A2)</b>
    </li>  
    <li>
        <b>spherical errors (A3 + A4)</b>
    </li>
    <br>
    <li>
        <b>full rank (A6 + A8)</b>
    </li>
</ul>

<br>
We will now proceed to prove the theorem.

## [GM1] Proof of Linearity

<br>
A linear estimator of $\boldsymbol{\beta_j}$ is a linear combination

<br>
$
    \quad
    \hat{\boldsymbol{\beta}}_\boldsymbol{j} 
    = \boldsymbol{\mathbf{c}_{1j}} \boldsymbol{\mathbf{Y}_1} + \cdots + \boldsymbol{\mathbf{c}_{mj}} \boldsymbol{\mathbf{Y}_m}
$

<br>
in which the coefficients $\boldsymbol{\mathbf{c}_{ij}}$ are not allowed to depend on the underlying coefficients $\boldsymbol{\beta_j}$, since those are not observable, but are allowed to depend on the values $\boldsymbol{\mathbf{X}_{ij}}$, since these data are observable.


<br>
Let's now re-write the formula for $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ :

<br>
$ 
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} &=
            \frac
                {\sum_{i=1}^{m} 
                    (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})
                    (\boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}})
                }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            = \frac
                {
                      \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}) \boldsymbol{\mathbf{Y}_i}
                    - \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}) \overline{\mathbf{Y}}
                }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            = \frac
                {
                      \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}) \boldsymbol{\mathbf{Y}_i}
                    - \overline{\mathbf{Y}} \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})
                }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \newline
            & = \frac
                {
                      \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}) \boldsymbol{\mathbf{Y}_i}
                    - \overline{\mathbf{Y}} \cdot 0
                }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            = \frac
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}) \boldsymbol{\mathbf{Y}_i} }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{Y}_i}            
    \end{align}
$

<br>
where the $\boldsymbol{\mathbf{c}_i}$ are defined by

<br>
$
    \quad
    \boldsymbol{\mathbf{c}_i} = \dfrac
        {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
        {\sum_{k=1}^{m} (\boldsymbol{\mathbf{X}_k} - \overline{\mathbf{X}})^2}
$

<br>
and have the following properties :

<br>
<ul style="list-style-type:square">
    <li>
        $
            \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} 
            \ = \ \dfrac
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})}
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \ = \ \dfrac
                {1}{\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
                \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})
            \ = \ 0
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            [\textbf{P1}] 
        $ 
    </li>
    <br>
    <li>
        $
            \sum_{i=1}^{m} {\boldsymbol{\mathbf{c}_i}}^2 
            \ = \ \dfrac
                { \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2 }
                { \big[ \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2 \big] ^2 }
            \ = \ \dfrac
                {1}
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            [\textbf{P2}] 
        $ 
    </li>
    <br>
    <li>
        $
            \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{X}_i}
            \ = \  
                  \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{X}_i}
                - \overline{\mathbf{X}} \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i}
            \ = \  
                  \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{X}_i}
                - \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \overline{\mathbf{X}} 
            \ = \  \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})
            \ = \  \dfrac
                {
                    \sum_{i=1}^{m} 
                    (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}) 
                    (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})
                }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \ = \  1
            \qquad \qquad \quad
            [\textbf{P3}] 
        $ 
    </li>
</ul>
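
<br>
The three properties can be checked numerically for any sample with at least two distinct regressor values. The sketch below assumes the weights are $c_i = (X_i - \overline{X}) / \sum_k (X_k - \overline{X})^2$; the data and the names `p1`, `p2`, `p3` are hypothetical, introduced only for this illustration.

```python
# Hypothetical regressor values, chosen only for illustration.
X = [1.0, 2.0, 4.0, 7.0, 11.0]
m = len(X)

x_bar = sum(X) / m
sxx = sum((x - x_bar) ** 2 for x in X)  # sum of squared deviations

# Weights c_i = (X_i - X_bar) / sum_k (X_k - X_bar)^2
c = [(x - x_bar) / sxx for x in X]

p1 = sum(c)                               # P1: should be 0
p2 = sum(ci ** 2 for ci in c)             # P2: should be 1 / sxx
p3 = sum(ci * x for ci, x in zip(c, X))   # P3: should be 1

print(p1, p2, 1.0 / sxx, p3)
```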

<br>
Now that we know the properties of the coefficients $\boldsymbol{\mathbf{c}_i}$, we will see that :

<br>
$
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
        &= \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{Y}_i}
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i}
             (\boldsymbol{\beta_0} + \boldsymbol{\beta_1}\boldsymbol{\mathbf{X}_i} + \boldsymbol{\varepsilon_i})
        \newline
        &= 
              \boldsymbol{\beta_0} \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i}
            + \boldsymbol{\beta_1} \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{X}_i}
            + \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\varepsilon_i}
            & \text{since } \textbf{P1} \text{ and } \textbf{P3}
        \newline
        &= \boldsymbol{\beta_1} + \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\varepsilon_i}
        & \qquad \qquad \qquad \qquad \qquad \qquad \qquad
        \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
        [\textbf{E1}] 
    \end{align}
$

<br>
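Therefore, the OLS estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ (and by analogy  $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}$) is a linear estimator: it is a linear function of the observations $\boldsymbol{\mathbf{Y}_i}$ and, by <b>E1</b>, it equals the true parameter plus a linear function of the disturbance terms.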
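
<br>
The decomposition <b>E1</b> can be illustrated with simulated data, where the disturbances are known because we generate them ourselves. This is only a sketch: the "true" parameters `beta_0`, `beta_1`, the seed and the regressor values are hypothetical choices.

```python
import random

rng = random.Random(0)           # fixed seed so the run is reproducible
beta_0, beta_1 = 1.0, 2.0        # hypothetical true parameters

X = [1.0, 2.0, 4.0, 7.0, 11.0]   # hypothetical regressor values
eps = [rng.gauss(0.0, 1.0) for _ in X]
Y = [beta_0 + beta_1 * x + e for x, e in zip(X, eps)]

x_bar = sum(X) / len(X)
sxx = sum((x - x_bar) ** 2 for x in X)
c = [(x - x_bar) / sxx for x in X]

# Left-hand side of E1: the estimator as a weighted sum of the Y_i.
slope_from_y = sum(ci * y for ci, y in zip(c, Y))

# Right-hand side of E1: true slope plus a weighted sum of disturbances.
slope_from_eps = beta_1 + sum(ci * e for ci, e in zip(c, eps))

print(slope_from_y, slope_from_eps)
```

The two numbers should coincide up to rounding, whatever disturbances are drawn.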

### [GM1] Proof of Linearity in matrix form

<br>
A quick look at the notebook regarding the OLS estimation will remind us of the following matrix form :

<br>
$
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
        &= (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{Y} 
        \newline
        &= \boldsymbol{\beta_1} + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\boldsymbol{\varepsilon}
    \end{align}
$

<br>
Since we can write $ \quad \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} = {\boldsymbol{\beta}_1} + \mathbf{A}\boldsymbol{\varepsilon} \quad $ where $\mathbf{A} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}$, it is easy to see that $\quad \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ is a linear function of the disturbance terms.


## [GM2] Proof of Unbiasedness

<br>


### [GM2 | Unbiasedness of $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ ]

<br>
This proof follows directly from <b>E1</b> : 

<br>
$
    \quad
    \begin{align}
        \mathbf{E} \big[ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \big] 
        &=
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            & \text{since } \textbf{E1}
        \newline 
        &= 
              \mathbf{E} \big[ \boldsymbol{\beta_1} \big] 
            + \mathbf{E} \big[ \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\varepsilon_i} \big] 
            & \text{conditioning on } \mathbf{X}
        \newline 
        &=
              \boldsymbol{\beta_1} 
            + \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} 
              \mathbf{E} \big[ \boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i} \big] 
              & \text{strict exogeneity (} \textbf{A2} \text{)}
        \newline             
        &= \boldsymbol{\beta_1}
    \end{align}
$

<br>
Conditioning on the sample values of the regressor $\mathbf{X}$ means that the coefficients $\boldsymbol{\mathbf{c}_i}$ are treated as non-random, since they are functions only of the sample values $\boldsymbol{\mathbf{X}_i}$. 

<br>
Therefore, the OLS estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ is an unbiased estimator of the corresponding population parameter.
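
<br>
Unbiasedness is a statement about the average of the estimator over repeated samples, so a small Monte Carlo experiment is a natural illustration. The sketch below keeps the regressor values fixed and redraws only the disturbances; the parameter values, the seed and the number of replications are hypothetical choices, and normal disturbances are assumed only for convenience.

```python
import random
from statistics import fmean

rng = random.Random(42)                   # fixed seed for reproducibility
beta_0, beta_1, sigma = 1.0, 2.0, 1.5     # hypothetical true values

X = [float(i) for i in range(1, 21)]      # fixed (non-random) regressor
x_bar = fmean(X)
sxx = sum((x - x_bar) ** 2 for x in X)
c = [(x - x_bar) / sxx for x in X]


def draw_slope():
    """Generate one sample with fresh disturbances and return the OLS slope."""
    Y = [beta_0 + beta_1 * x + rng.gauss(0.0, sigma) for x in X]
    return sum(ci * y for ci, y in zip(c, Y))


slopes = [draw_slope() for _ in range(5000)]
mean_slope = fmean(slopes)

print(mean_slope)  # should be close to beta_1
```

Any single estimate can be off by a visible amount; it is the average over replications that settles near the true slope.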


### [GM2 | Unbiasedness of $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}$ ]

<br>
This proof follows directly from the population regression equation (<b>PRE</b>) : 

<br>
$
    \quad
    \begin{align*}
        \quad & \quad
            \boldsymbol{\mathbf{Y}_i} 
            = \boldsymbol{\beta_0} + \boldsymbol{\beta_1} \boldsymbol{\mathbf{X}_i} + \boldsymbol{\varepsilon_i}
            & \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \text{averaging over the } m \text{ observations}
        \newline
        \Rightarrow & \quad
            \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{\mathbf{Y}_i} 
            = \frac{1}{m} \ m \ \boldsymbol{\beta_0}
            + \frac{\boldsymbol{\beta_1}} {m} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} 
            + \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{\varepsilon_i}     
        \newline 
        \Rightarrow & \quad
            \overline{\mathbf{Y}} 
            = \boldsymbol{\beta_0} + \boldsymbol{\beta_1} \overline{\mathbf{X}} + \overline{\boldsymbol{\varepsilon}}  
        \newline \newline
        &\quad
             & \text{by definition of the OLS estimator } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
        \newline
        \Rightarrow &\quad
            \begin{aligned}[T]
                \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                &=   \overline{\mathbf{Y}}
                  - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
                \newline
                &=  (\boldsymbol{\beta_0} + \boldsymbol{\beta_1} \overline{\mathbf{X}} + \overline{\boldsymbol{\varepsilon}})
                  - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
                \newline
                &=   \boldsymbol{\beta_0} 
                   + (\boldsymbol{\beta_1} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) \overline{\mathbf{X}} 
                   + \overline{\boldsymbol{\varepsilon}}
            \end{aligned}
        \newline \newline
        &\quad
             & \text{conditioning on } \mathbf{X}
        \newline
        &\quad
             & \text{zero conditional mean (} \textbf{A2} \text{)}
        \newline
        \Rightarrow &\quad
            \begin{aligned}[T]
                \mathbf{E} \big[ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} \big]
                &=     \mathbf{E} \big[ \boldsymbol{\beta_0} \big]
                    + \mathbf{E} 
                      \big[ \ (\boldsymbol{\beta_1} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) \overline{\mathbf{X}} \ \big]
                    + \mathbf{E} \big[ \overline{\boldsymbol{\varepsilon}} \big]
                \newline
                &=                
                      \boldsymbol{\beta_0}
                    + \overline{\mathbf{X}} \mathbf{E} 
                      \big[ \ \boldsymbol{\beta_1} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \ \big]
                    + \mathbf{E} \big[ \overline{\boldsymbol{\varepsilon}} \big]
                \newline
                &=
                      \boldsymbol{\beta_0}
                    + \overline{\mathbf{X}} 
                      \big[ \ 
                          \mathbf{E} [\boldsymbol{\beta_1}] - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}] 
                       \ \big]
                    + \mathbf{E} \big[ \overline{\boldsymbol{\varepsilon}} \big]
                \newline
                &=
                      \boldsymbol{\beta_0}
                    + \overline{\mathbf{X}} \big[ \ \boldsymbol{\beta_1} - \boldsymbol{\beta_1} \ \big]
                \newline
                &= \boldsymbol{\beta_0}
            \end{aligned}
    \end{align*}
$

<br>
Conditioning on the sample values of the regressor $\mathbf{X}$ means that $\overline{\mathbf{X}}$ is treated as non-random in taking expectations, since it is a function only of the sample values $\boldsymbol{\mathbf{X}_i}$. 

<br>
Therefore, the OLS estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}$ is an unbiased estimator of the corresponding population parameter.


### [GM2 | Unbiasedness of $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ in matrix form ] 

<br>
We start again from the matrix form we used before to prove linearity :

<br>
$
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
        &= (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{Y} 
        \newline
        &= \boldsymbol{\beta_1} + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\boldsymbol{\varepsilon}
    \end{align}
$

<br>
It is easy to show that, as long as $\mathbf{X}$ is either non-stochastic or stochastic but independent of the disturbance terms, the OLS estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j}$ are unbiased estimators : 

<br>
If $\mathbf{X}$ is non-stochastic (fixed) :

<br>
$
    \quad
    \begin{align}
        \mathbf{E} \big[ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \big]
        &= 
              \boldsymbol{\beta_1} 
            + \mathbf{E} \big[ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \big]
        \newline
        &= 
            \boldsymbol{\beta_1} 
            + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{E} \big[ \boldsymbol{\varepsilon} \big]
            & \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad 
            \text{strict exogeneity (} \textbf{A2} \text{)}
        \newline
        &= \boldsymbol{\beta_1} 
    \end{align}
$

<br>
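If $\mathbf{X}$ is stochastic but independent of the disturbance term, so that $\mathbf{E} [ \boldsymbol{\varepsilon} \mid \mathbf{X} ] = \mathbf{E} [ \boldsymbol{\varepsilon} ] = 0$ :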

<br>
$
    \quad
    \begin{align}
        \mathbf{E} \big[ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \big]
        &= 
              \boldsymbol{\beta_1} 
            + \mathbf{E} \big[ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \big]
        \newline
        &= 
            \boldsymbol{\beta_1} 
            + \mathbf{E} \big[ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{E} [ \boldsymbol{\varepsilon} \mid \mathbf{X} ] \big]
            & \qquad \qquad \qquad \qquad 
            \text{iterated expectations and strict exogeneity (} \textbf{A2} \text{)}
        \newline
        &= \boldsymbol{\beta_1} 
    \end{align}
$
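
<br>
The stochastic-regressor case can be illustrated in the same Monte Carlo spirit, this time redrawing the regressor in every replication, independently of the disturbances. This is a sketch under stated assumptions: the uniform distribution for the regressor, the normal disturbances, the parameter values and the helper `ols_fit` are all hypothetical choices made for the illustration.

```python
import random
from statistics import fmean

rng = random.Random(7)                    # fixed seed for reproducibility
beta_0, beta_1, sigma = 1.0, 2.0, 1.0     # hypothetical true values
m, n_reps = 20, 4000


def ols_fit(X, Y):
    """Return (intercept, slope) of the simple OLS regression of Y on X."""
    x_bar, y_bar = fmean(X), fmean(Y)
    sxx = sum((x - x_bar) ** 2 for x in X)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx
    return y_bar - slope * x_bar, slope


intercepts, slopes = [], []
for _ in range(n_reps):
    # The regressor is random here, but drawn independently of the disturbances.
    X = [rng.uniform(0.0, 10.0) for _ in range(m)]
    eps = [rng.gauss(0.0, sigma) for _ in range(m)]
    Y = [beta_0 + beta_1 * x + e for x, e in zip(X, eps)]
    b0, b1 = ols_fit(X, Y)
    intercepts.append(b0)
    slopes.append(b1)

mean_intercept = fmean(intercepts)
mean_slope = fmean(slopes)

print(mean_intercept, mean_slope)  # should be close to beta_0 and beta_1
```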

## Covariance Matrix of the OLS estimators

<br>
In probability theory and statistics, a <b>covariance matrix</b> (also known as dispersion matrix or <b>variance–covariance matrix</b>) is a matrix whose element in the $ij$ position is the covariance between the $i^\text{th}$ and $j^\text{th}$ elements of a random vector. A random vector is a random variable with multiple dimensions.

<br>
Because the covariance of the $i^\text{th}$ random variable with itself is simply that random variable's variance, each element on the principal diagonal of the covariance matrix is the variance of one of the random variables. Because the covariance of the $i^\text{th}$ random variable with the $j^\text{th}$ one is the same thing as the covariance of the $j^\text{th}$ random variable with the $i^\text{th}$ one, every covariance matrix is symmetric. In addition, every covariance matrix is positive semi-definite.

<br>
Intuitively, the covariance matrix generalizes the notion of variance to multiple dimensions. We will now determine the covariance matrix of the OLS estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ :

<br>
$
    \quad
    \begin{align}
        \mathrm{V}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}) 
        &=
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            & \text{when } \star \text{ is a vector random variable}            
        \newline
        &
            & 
            \mathrm{Var}(\star) = \mathbf{E} 
            \big[ 
                \big(\star - \mathbf{E}[\star]\big)
                \big(\star - \mathbf{E}[\star]\big)^{\top} 
            \big]
        \newline
        &= 
            \mathbf{E} 
            \big[ \
                \big(
                          \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
                        - \mathbf{E}[\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}]
                \big)
                \big(
                          \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
                        - \mathbf{E}[\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}]
                \big)^{\top}
            \ \big] 
            & \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} \text{ is an unbiased estimator of } \boldsymbol{\beta}
        \newline
        &= 
            \mathbf{E} 
            \big[ \
                \big(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS} - \boldsymbol{\beta}\big)
                \big(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS} - \boldsymbol{\beta}\big)^{\top}
            \ \big] 
        \newline
        &= 
            \mathbf{E} 
            \big[ \ 
                \big(
                      \boldsymbol{\beta} 
                    + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} 
                    - \boldsymbol{\beta}
                \big)
                \big(
                      \boldsymbol{\beta} 
                    + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} 
                    - \boldsymbol{\beta}
                \big)^{\top}
            \ \big] 
        \newline
        &= 
            \mathbf{E} 
            \big[ \
                    \big( (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \big)
                    \big( (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \big) ^{\top}
            \ \big]
            & 
            \big[ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \big] ^{\top}
            = \boldsymbol{\varepsilon}^{\top} \ \big[ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \big] ^{\top}
        \newline
        &
            & 
            = \boldsymbol{\varepsilon}^{\top} \ \mathbf{X} \big[(\mathbf{X}^{\top}\mathbf{X})^{-1} \big] ^{\top}
            = \boldsymbol{\varepsilon}^{\top} \ \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}
        \newline
        &= 
            \mathbf{E} 
            \big[ \
                    (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} 
                    \ \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top} \
                    \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}
            \ \big]
            & \text{conditioning on } \mathbf{X}
        \newline
        &= 
            (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}
            \ \mathbf{E} \big[ \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top} \big] \         
            \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}         
            & [\textbf{E2}] 
    \end{align}
$

<br>
$\mathbf{E} \big[ \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top} \big]$ is the covariance matrix of the disturbance term $\boldsymbol{\varepsilon}$. Although it is an $m \times m$ matrix, and thus potentially a large one, under the assumption of spherical errors (homoscedasticity and no autocorrelation of the error terms) it simplifies greatly : 

<br>
$
    \quad
    \begin{align}
    \mathbf{E} [ \varepsilon \varepsilon^{\top} \mid X ]
    \quad &= \quad
    \mathbf{E}
    \begin{bmatrix}
        \varepsilon_1 \mid X \\
        \varepsilon_2 \mid X \\
        \vdots               \\
        \vdots               \\
        \varepsilon_m \mid X
    \end{bmatrix}_\textit{ m x 1 }
    \begin{bmatrix}
        \varepsilon_1 \mid X &
        \varepsilon_2 \mid X &
        \dots                &
        \dots                &
        \varepsilon_m \mid X 
    \end{bmatrix}_\textit{ 1 x m }
    \newline \newline
    &= \quad
    \mathbf{E}
    \begin{bmatrix}
        {\varepsilon_1}^2 \mid X           &  \varepsilon_1\varepsilon_2 \mid X & \dots  & \varepsilon_1\varepsilon_m \mid X  \\
        \varepsilon_2\varepsilon_1 \mid X  &  {\varepsilon_2}^2 \mid X          & \dots  & \varepsilon_2\varepsilon_m \mid X  \\
        \vdots                             &  \vdots                            & \vdots & \vdots                             \\
        \vdots                             &  \vdots                            & \ddots & \vdots                             \\
        \varepsilon_m\varepsilon_1 \mid X  &  \varepsilon_m\varepsilon_2 \mid X & \dots  & {\varepsilon_m}^2 \mid X  
    \end{bmatrix}_\textit{ m x m }
    \quad = \quad 
    \begin{bmatrix}
          \mathbf{E} [ {\varepsilon_1}^2 \mid X ]
        & \mathbf{E} [ \varepsilon_1\varepsilon_2 \mid X ]
        & \dots  
        & \mathbf{E} [ \varepsilon_1\varepsilon_m \mid X ] 
        \\
          \mathbf{E} [ \varepsilon_2\varepsilon_1 \mid X ]  
        & \mathbf{E} [  {\varepsilon_2}^2 \mid X ]          
        & \dots  
        & \mathbf{E} [ \varepsilon_2\varepsilon_m \mid X ]  
        \\
        \vdots & \vdots & \vdots & \vdots 
        \\
        \vdots & \vdots & \ddots & \vdots              
        \\
          \mathbf{E} [ \varepsilon_m\varepsilon_1 \mid X ]  
        & \mathbf{E} [ \varepsilon_m\varepsilon_2 \mid X ] 
        & \dots  
        & \mathbf{E} [ {\varepsilon_m}^2 \mid X ]
    \end{bmatrix}_\textit{ m x m }
    \newline \newline
    &= \quad 
    \begin{bmatrix}
          \mathrm{Var}(\varepsilon_1 \mid X) 
        & \mathrm{Cov}(\varepsilon_1, \varepsilon_2 \mid X)
        & \dots
        & \mathrm{Cov}(\varepsilon_1, \varepsilon_m \mid X)
        \\
          \mathrm{Cov}(\varepsilon_2, \varepsilon_1 \mid X)
        & \mathrm{Var}(\varepsilon_2 \mid X) 
        & \dots
        & \mathrm{Cov}(\varepsilon_2, \varepsilon_m \mid X)
        \\
        \vdots & \vdots & \vdots & \vdots 
        \\
        \vdots & \vdots & \ddots & \vdots              
        \\
          \mathrm{Cov}(\varepsilon_m, \varepsilon_1 \mid X)
        & \mathrm{Cov}(\varepsilon_m, \varepsilon_2 \mid X)
        & \dots 
        & \mathrm{Var}(\varepsilon_m \mid X) 
    \end{bmatrix}
    \newline \newline
    &= \quad 
    \begin{bmatrix}
        \sigma^2  &  0         &  \dots   &  0         \\
        0         &  \sigma^2  &  \dots   &  0         \\
        \vdots    &  \vdots    &  \vdots  &  \vdots    \\
        \vdots    &  \vdots    &  \vdots  &  \vdots    \\
        0         &  0         &  \dots   &  \sigma^2  \\
    \end{bmatrix}
    \quad = \quad 
    \sigma^2 I
    \end{align}
$

<br>
Now we know that under the mentioned assumptions, we can re-write the covariance matrix of 
$\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ as :

<br>
$
    \quad
    \begin{align}
        \mathrm{V}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}) 
        \newline
        &= 
            (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}
            \ \mathbf{E} \big[ \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top} \big] \         
            \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}  
            & \text{by spherical errors (} \textbf{A3 + A4} \text{)}
        \newline
        &= 
            (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}
            \ \boldsymbol{\sigma^2I} \ \
            \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1}  
        \newline
        &= 
            \boldsymbol{\sigma^2} 
            \ (\mathbf{X}^{\top}\mathbf{X})^{-1} 
            \ \mathbf{X}^{\top} \mathbf{X} \
            (\mathbf{X}^{\top}\mathbf{X})^{-1}  
        \newline
        &= 
            \boldsymbol{\sigma^2} \ (\mathbf{X}^{\top}\mathbf{X})^{-1} 
            & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad 
            [\textbf{E3}]
    \end{align}
$
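
<br>
The result $\sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}$ can be compared with the sample covariance matrix of simulated estimates. The sketch below assumes a design matrix with columns $[1, X_i]$, so that $\mathbf{X}^{\top}\mathbf{X}$ is $2 \times 2$ and can be inverted in closed form with the standard library alone. Parameter values, seed, number of replications and the helper names are hypothetical.

```python
import random
from statistics import fmean

rng = random.Random(1)                    # fixed seed for reproducibility
beta_0, beta_1, sigma = 1.0, 2.0, 1.0     # hypothetical true values
X = [float(i) for i in range(1, 11)]      # fixed regressor values
m = len(X)

# X'X = [[m, sum X], [sum X, sum X^2]] for design columns [1, X_i];
# its inverse is written out explicitly via the 2x2 determinant.
s_x = sum(X)
s_xx = sum(x * x for x in X)
det = m * s_xx - s_x ** 2
theory = [
    [sigma ** 2 * s_xx / det, -sigma ** 2 * s_x / det],
    [-sigma ** 2 * s_x / det, sigma ** 2 * m / det],
]

x_bar = s_x / m
sxx_centered = sum((x - x_bar) ** 2 for x in X)


def draw_estimates():
    """Return (intercept, slope) for one sample with fresh disturbances."""
    Y = [beta_0 + beta_1 * x + rng.gauss(0.0, sigma) for x in X]
    y_bar = fmean(Y)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx_centered
    return y_bar - b1 * x_bar, b1


def sample_cov(u, v):
    """Unbiased sample covariance of two equally long sequences."""
    u_bar, v_bar = fmean(u), fmean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (len(u) - 1)


draws = [draw_estimates() for _ in range(5000)]
b0s = [d[0] for d in draws]
b1s = [d[1] for d in draws]
simulated = [
    [sample_cov(b0s, b0s), sample_cov(b0s, b1s)],
    [sample_cov(b1s, b0s), sample_cov(b1s, b1s)],
]

print(theory)
print(simulated)
```

The simulated matrix only approximates the theoretical one; the agreement should tighten as the number of replications grows.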


### <font color='#28B463'>Variance of the OLS Estimators</font>

<br>
The covariance matrix of the OLS estimators collects the variances of the estimators on its main diagonal and their covariances off the diagonal. Our goal, for the moment, is to derive these two quantities in scalar form so that we will have a more detailed representation of the covariance matrix.

<br>
The notation below is based on the disturbance terms $\boldsymbol{\varepsilon_i}$ : 

<br>
$
    \quad
    \begin{align}
        \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) 
        &=
            & \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \text{by definition of variance}
        \newline
        &= \mathbf{E} 
            \Big[ \
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big) ^2
            \ \Big]  
            & \text{by } \textbf{E1}
        \newline
        &= \mathbf{E} \Big[ \ \big( \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i}\boldsymbol{\varepsilon_i} \big) ^2 \ \Big] 
        \newline
        &=  \mathbf{E} 
            \Big[
                  \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i}^2 \boldsymbol{\varepsilon_i}^2
                + 2 \sum_{i=1}^{m} \sum_{j > i}^{m} 
                  \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{c}_j} \boldsymbol{\varepsilon_i} \boldsymbol{\varepsilon_j}
            \Big]
            & [\textbf{E4}]
    \end{align}
$

<br>
Just like its equivalent in matrix form, this last equation can be further simplified under the assumption of spherical errors :  

<br>
$
    \quad
    \begin{align}
        \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) 
        &=
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            & \text{by } \textbf{E4}
        \newline
        &= \mathbf{E} 
            \Big[
                  \sum_{i=1}^{m} \boldsymbol{{\mathbf{c}_i}^2} \boldsymbol{\varepsilon_i}^2
                + 2 \sum_{i=1}^{m} \sum_{j > i}^{m} 
                  \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{c}_j} \boldsymbol{\varepsilon_i} \boldsymbol{\varepsilon_j}
            \Big]
            & \text{conditioning on } \mathbf{X}
        \newline
        &=  \sum_{i=1}^{m} \boldsymbol{{\mathbf{c}_i}^2} 
                \mathbf{E} \Big[ \boldsymbol{{\varepsilon_i}^2} \mid \boldsymbol{\mathbf{X}_i} \Big]
            + 2 \sum_{i=1}^{m} \sum_{j > i}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{c}_j} 
                \mathbf{E} 
                \Big[   
                    \boldsymbol{\varepsilon_i} \boldsymbol{\varepsilon_j} 
                    \mid \boldsymbol{\mathbf{X}_i} , \boldsymbol{\mathbf{X}_j}
                \Big]
        \newline
        &=  \sum_{i=1}^{m} \boldsymbol{{\mathbf{c}_i}^2} 
                \mathrm{Var} \Big( \boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i} \Big)
            + 2 \sum_{i=1}^{m} \sum_{j > i}^{m} \boldsymbol{\mathbf{c}_i} \boldsymbol{\mathbf{c}_j} 
                \mathrm{Cov} 
                \Big(
                    \boldsymbol{\varepsilon_i} , \boldsymbol{\varepsilon_j} 
                    \mid \boldsymbol{\mathbf{X}_i} , \boldsymbol{\mathbf{X}_j}
                \Big) 
                & \text{by } \textbf{A3} \text{ and } \textbf{A4}
        \newline
        &=  \boldsymbol{\sigma^2} \sum_{i=1}^{m} \boldsymbol{{\mathbf{c}_i}^2}
            & \text{by } \textbf{P2}
        \newline
        &=  \dfrac
                { \boldsymbol{\sigma^2} } 
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2 }
            & [\textbf{E5}]
    \end{align}
$
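
<br>
The scalar variance of the slope, $\sigma^2 / \sum_i (X_i - \overline{X})^2$, should coincide with both $\sigma^2 \sum_i c_i^2$ and the slope entry on the diagonal of $\sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}$. The check below is a small deterministic sketch assuming a design matrix with columns $[1, X_i]$; the data, the value of `sigma` and the variable names are hypothetical.

```python
sigma = 1.5                               # hypothetical disturbance std. dev.
X = [1.0, 2.0, 4.0, 7.0, 11.0]            # hypothetical regressor values
m = len(X)

x_bar = sum(X) / m
sxx = sum((x - x_bar) ** 2 for x in X)
c = [(x - x_bar) / sxx for x in X]

# Route 1: sigma^2 times the sum of squared weights (uses property P2).
var_from_c = sigma ** 2 * sum(ci ** 2 for ci in c)

# Route 2: the closed-form scalar expression.
var_scalar = sigma ** 2 / sxx

# Route 3: slope entry of sigma^2 (X'X)^{-1}, with X'X inverted by hand:
# det(X'X) = m * sum(X^2) - (sum X)^2, and the slope entry is m / det.
det = m * sum(x * x for x in X) - sum(X) ** 2
var_matrix = sigma ** 2 * m / det

print(var_from_c, var_scalar, var_matrix)
```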

### <font color='#28B463'>Estimation of the variance</font>

<br>
It's important to notice that, since the variance of the error terms $\boldsymbol{\sigma^2}$ is unobservable (the error terms themselves being unobservable), we will actually compute an estimate $\boldsymbol{s^2}$ of it, based on the regression residuals $\boldsymbol{e_i}$ and on the number $p$ of estimated coefficients ($p = 2$ in the simple regression considered here) :

<br>
$
    \quad
    \boldsymbol{s^2} 
    \ = \ \dfrac
        { \sum_{i=1}^{m} \boldsymbol{{e_i}^2} }
        { m - p }
    \ = \ \dfrac
        { \boldsymbol{e}^{\top}\boldsymbol{e} }
        { m - p }
    \ = \ \dfrac
        { \text{SSR} }
        { m - p }
    \qquad \Rightarrow \qquad
    \widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}) = \boldsymbol{s^2} \ (\mathbf{X}^{\top}\mathbf{X})^{-1} 
$
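
<br>
As an illustration of how $\boldsymbol{s^2}$ and the estimated covariance matrix could be computed in practice, here is a short sketch under stated assumptions: NumPy is used, the data are simulated with hypothetical parameters, and the design matrix `X` contains a column of ones for the intercept, so that `p = 2`.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed, for reproducibility only

m, p = 200, 2                            # observations, estimated coefficients
x = rng.uniform(0.0, 10.0, size=m)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=m)   # simulated data, true sigma = 1.5

X = np.column_stack([np.ones(m), x])     # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients

e = y - X @ beta_hat                     # residuals
ssr = e @ e                              # sum of squared residuals, e'e
s2 = ssr / (m - p)                       # estimate of sigma^2

cov_hat = s2 * np.linalg.inv(X.T @ X)    # estimated covariance matrix of beta_hat
std_err = np.sqrt(np.diag(cov_hat))      # standard errors of intercept and slope

print(s2, std_err)
```

The element `cov_hat[1, 1]` coincides with $s^2 / \sum (X_i - \overline{X})^2$, the sample counterpart of the slope variance.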


### Covariance of the OLS Estimators

<br>
$
    \quad
    \begin{align}
        \mathrm{Cov}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) 
        &=
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            & \text{by definition of covariance}
        \newline
        &= \mathbf{E} 
            \Big[ \
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} ] 
                \big)
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big)
            \ \Big]  
            & \text{by definition of } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
        \newline
        &= \mathbf{E} 
            \Big[ \
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                    - \mathbf{E} [ \overline{\mathbf{Y}} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}} ] 
                \big)
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big)
            \ \Big]  
        \newline
        &= \mathbf{E} 
            \Big[ \
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                    - \mathbf{E} [ \overline{\mathbf{Y}} ] + \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] \overline{\mathbf{X}} 
                \big)
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big)
            \ \Big]  
            & \text{by definition of } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
        \newline
        &= \mathbf{E} 
            \Big[ \
                \big(
                      \overline{\mathbf{Y}} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
                    - \mathbf{E} [ \overline{\mathbf{Y}} ] + \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] \overline{\mathbf{X}} 
                \big)
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big)
            \ \Big]  
        \newline
        &= \mathbf{E} 
            \Big[ \ 
                \big( 
                      \overline{\mathbf{Y}} - \mathbf{E} [ \overline{\mathbf{Y}} ]
                \big)
                \big(
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big)
            \ \Big]
            - \overline{\mathbf{X}} \ \mathbf{E} 
            \Big[ \ 
                \big( 
                      \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} 
                    - \mathbf{E} [\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} ] 
                \big)^2
            \ \Big]  
        \newline
        &=  \mathrm{Cov}(\overline{\mathbf{Y}}, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
            - \overline{\mathbf{X}} \ \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
            & \text{since } 
              \mathrm{Cov}(\overline{\mathbf{Y}}, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) 
              = \tfrac{\boldsymbol{\sigma^2}}{m} \sum_{i=1}^{m} \boldsymbol{\mathbf{c}_i} = 0
              \text{ by } \textbf{A3} \text{ and } \textbf{A4}
        \newline
        &=  - \overline{\mathbf{X}} \ \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
            & [\textbf{E6}]
    \end{align}
$
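
<br>
Equation <b>E6</b> can be cross-checked against the matrix form of the covariance. This is a small sketch under the assumption of spherical errors, so that the covariance matrix is $\sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}$; the regressor values `x` and the error variance `sigma2` are arbitrary illustrative numbers, and NumPy is assumed.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])    # arbitrary illustrative regressor values
sigma2 = 2.0                                 # hypothetical error variance

X = np.column_stack([np.ones_like(x), x])    # intercept column first, then the regressor
V = sigma2 * np.linalg.inv(X.T @ X)          # covariance matrix of (beta0_hat, beta1_hat)

var_slope = V[1, 1]                          # Var(beta1_hat)
cov_intercept_slope = V[0, 1]                # Cov(beta0_hat, beta1_hat)

# E6: the off-diagonal element equals minus the sample mean times the slope variance
print(cov_intercept_slope, -x.mean() * var_slope)
```

Because the sample mean of `x` is positive here, the covariance between the two estimators comes out negative.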


### Internal representation of the Covariance Matrix

<br>
<b>E2</b> and <b>E4</b> give the variance of the OLS estimator in the general case : 

<br>
$
    \quad
    \mathrm{V}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}) 
    \quad = \quad
    \begin{bmatrix}
          \mathrm{Var}(\hat{\beta}_{OLS-1}) 
        & \mathrm{Cov}(\hat{\beta}_{OLS-1} , \hat{\beta}_{OLS-2})
        & \dots
        & \mathrm{Cov}(\hat{\beta}_{OLS-1} , \hat{\beta}_{OLS-p})
        \\
          \mathrm{Cov}(\hat{\beta}_{OLS-2} , \hat{\beta}_{OLS-1})
        & \mathrm{Var}(\hat{\beta}_{OLS-2}) 
        & \dots
        & \mathrm{Cov}(\hat{\beta}_{OLS-2} , \hat{\beta}_{OLS-p})
        \\
        \vdots & \vdots & \ddots & \vdots              
        \\
          \mathrm{Cov}(\hat{\beta}_{OLS-p} , \hat{\beta}_{OLS-1})
        & \mathrm{Cov}(\hat{\beta}_{OLS-p} , \hat{\beta}_{OLS-2})
        & \dots 
        & \mathrm{Var}(\hat{\beta}_{OLS-p}) 
    \end{bmatrix}
$

<br>
while <b>E3</b> and <b>E5</b> are simplified versions that arise under the assumption of spherical errors, where the matrix reduces to $\sigma^2 (X^{\top}X)^{-1}$. Here $p$ denotes the number of coefficients, and the off-diagonal entries are in general different from zero : 

<br>
$
    \quad
    \mathrm{V}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}) 
    \quad = \quad
    \begin{bmatrix}
          \sigma^2 {(X^{\top}X)^{-1}}_{11} 
        & \sigma^2 {(X^{\top}X)^{-1}}_{12} 
        & \dots 
        & \sigma^2 {(X^{\top}X)^{-1}}_{1p}
        \\
          \sigma^2 {(X^{\top}X)^{-1}}_{21}  
        & \sigma^2 {(X^{\top}X)^{-1}}_{22}  
        & \dots 
        & \sigma^2 {(X^{\top}X)^{-1}}_{2p}
        \\
        \vdots & \vdots & \ddots & \vdots
        \\ 
          \sigma^2 {(X^{\top}X)^{-1}}_{p1} 
        & \sigma^2 {(X^{\top}X)^{-1}}_{p2} 
        & \dots 
        & \sigma^2 {(X^{\top}X)^{-1}}_{pp} 
    \end{bmatrix}
$


### Interpretation of the covariance matrix

<br>
$\mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})$ and $\mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0})$ measure the statistical precision of the corresponding OLS estimators : 

<br>
$
    \quad
    \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
    \quad = \quad
    \boldsymbol{\sigma^2} \ (\mathbf{X}^{\top}\mathbf{X})^{-1}
    \qquad \qquad = \quad
    \dfrac
           { \boldsymbol{\sigma^2} } 
           { \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2 }
$

$
    \quad    
    \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0})
    \quad = \quad
    \dfrac
        { \boldsymbol{\sigma^2} }
        { m }
    \dfrac
        { \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i})^2 }
        { \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2 }
$

<br>
The variance of both OLS coefficient estimators becomes smaller : 

<br>
<ul style="list-style-type:square">
    <li>
        the smaller the (unobservable) variance of the disturbance terms $\boldsymbol{\sigma^2}$
    </li>
    <br>
    <li>
        the larger the variation of the sample values $\boldsymbol{\mathbf{X}_i}$ about their sample mean
        $\overline{\mathbf{X}}$
    </li>
    <br>
    <li>
        the larger the sample size $m$
    </li>
</ul>
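
<br>
These three effects can be made concrete for the slope estimator. The sketch that follows is only illustrative: `slope_variance` is a hypothetical helper (not defined elsewhere in this notebook) that evaluates $\sigma^2 / \sum (X_i - \overline{X})^2$ with NumPy, and the regressor grids are arbitrary.

```python
import numpy as np

def slope_variance(x, sigma2):
    """Var(beta1_hat) = sigma^2 / sum((x_i - x_bar)^2), assuming spherical errors."""
    x = np.asarray(x, dtype=float)
    return sigma2 / np.sum((x - x.mean()) ** 2)

base = np.linspace(0.0, 10.0, 20)                 # 20 evenly spaced regressor values

v_base = slope_variance(base, 1.0)
v_less_noise = slope_variance(base, 0.25)         # smaller error variance
v_more_spread = slope_variance(2.0 * base, 1.0)   # same sample size, X twice as spread out
v_more_data = slope_variance(np.linspace(0.0, 10.0, 80), 1.0)   # larger sample, same range

print(v_base, v_less_noise, v_more_spread, v_more_data)
```

Each of the three modified settings gives a smaller variance than the baseline; doubling the spread of the regressor divides the variance by four.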

<br><br>
Under the assumption of spherical errors, equation <b>E6</b> can be simplified into :

<br>
$
    \quad
    \begin{align}
        \mathrm{Cov}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
        &= 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            & \text{by } \textbf{E6}
        \newline
        &=  \ - \overline{\mathbf{X}} \ \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
            & \text{by } \textbf{E5}
        \newline
        &=  \ - \overline{\mathbf{X}} \
            \dfrac
                { \boldsymbol{\sigma^2} } 
                { \sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2 }
    \end{align}
$

<br>
Since both the numerator and the denominator of $\mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})$ are positive, the sign of $\mathrm{Cov}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})$ is the opposite of the
sign of the sample mean $\overline{\mathbf{X}}$ :

<br>
<ul style="list-style-type:square">
    <li>
        $ 
            \overline{\mathbf{X}} > 0 
            \quad \Rightarrow \quad 
            \mathrm{Cov}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} , \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) < 0 \quad 
        $
        , the sampling errors $(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} - \boldsymbol{\beta_0})$ and
        $(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} - \boldsymbol{\beta_1})$ tend to be of opposite sign
    </li>
    <br>
    <li>
        $ 
            \overline{\mathbf{X}} < 0 
            \quad \Rightarrow \quad 
            \mathrm{Cov}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} , \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}) > 0 \quad 
        $
        , the sampling errors $(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} - \boldsymbol{\beta_0})$ and
        $(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} - \boldsymbol{\beta_1})$ tend to be of the same sign
    </li>
</ul>


## [GM3] Proof of Efficiency

<br>
In this section of the notebook we will demonstrate that the OLS estimators have the minimum variance in the class of all linear unbiased estimators of the population parameters. 

<br>
Recall that $\quad$ <b>Efficiency = Unbiasedness + Minimum Variance</b>

<br>
The demonstration is structured in three points :

<br>
<ul style="list-style-type:square">
    <li>
        the definition of an arbitrary estimator $\tilde{\boldsymbol{\beta}}_\boldsymbol{1}$ linear in $\mathbf{Y}$
    </li>
    <br>
    <li>
        the imposition on $\tilde{\boldsymbol{\beta}}_\boldsymbol{1}$ of restrictions implied by unbiasedness
    </li>
    <br>
    <li>
        the core of the demonstration, where we will show that the variance of the arbitrary estimator
        $\tilde{\boldsymbol{\beta}}_\boldsymbol{1}$ must be larger than, or at least equal to, the variance of
        $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$
    </li>
</ul>

<br>
Let's define an arbitrary estimator $\tilde{\boldsymbol{\beta}}_\boldsymbol{1}$ linear in $\mathbf{Y}$ :

<br>
$
    \quad \tilde{\boldsymbol{\beta}}_\boldsymbol{1} 
    = \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} \boldsymbol{\mathbf{Y}_i}
    = 
        \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} 
        (\boldsymbol{\beta_0} + \boldsymbol{\beta_1}\boldsymbol{\mathbf{X}_i} + \boldsymbol{\varepsilon_i})
    = 
        \boldsymbol{\beta_0} \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} 
        + \boldsymbol{\beta_1} \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} \boldsymbol{\mathbf{X}_i}
        + \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} \boldsymbol{\varepsilon_i}
$

<br>
A quick look at <b>GM1</b> will remind us that for the estimator $\tilde{\boldsymbol{\beta}}_\boldsymbol{1}$ to be unbiased, the following restrictions must hold :

<br>
$
    \quad
    \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} = 0 
    \quad \text{and} \quad 
    \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} \boldsymbol{\mathbf{X}_i}= 1
    \quad \Rightarrow \quad
    \tilde{\boldsymbol{\beta}}_\boldsymbol{1} 
    = \boldsymbol{\beta_1} + \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i} \boldsymbol{\varepsilon_i}        
$


<br>
We will now show that the OLS estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ has the minimum variance in the class of all linear unbiased estimators of $\boldsymbol{\beta_1}$ :

<br>
$
    \quad
    \begin{align}
        \mathrm{Var}(\tilde{\boldsymbol{\beta}}_\boldsymbol{1}) 
        &=
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad 
            & \text{by definition of variance}
        \newline
        &=
            \mathbf{E} 
            \big[ \
                \tilde{\boldsymbol{\beta}}_\boldsymbol{1} 
                - \mathbf{E} [\tilde{\boldsymbol{\beta}}_\boldsymbol{1}] 
            \ \big]^2
            & \text{by unbiasedness of } \tilde{\boldsymbol{\beta}}_\boldsymbol{1}
        \newline
        &= 
            \mathbf{E} \big[ \ \tilde{\boldsymbol{\beta}}_\boldsymbol{1} - \boldsymbol{\beta_1} \ \big]^2
            & \text{by } \textbf{E1}
        \newline
        &= 
            \mathbf{E} \Big[ \ \sum_{i=1}^{m} \boldsymbol{\mathbf{h}_i}\boldsymbol{\varepsilon_i} \ \Big] ^2
            = \sum_{i=1}^{m} \Big[ \boldsymbol{{\mathbf{h}_i}^2} \ \mathbf{E} [ \boldsymbol{{\varepsilon_i}^2} ] \Big]
            = \boldsymbol{\sigma^2} \ \sum_{i=1}^{m} \boldsymbol{{\mathbf{h}_i}^2}
        \newline  \newline
        &= 
            \boldsymbol{\sigma^2} \sum_{i=1}^{m}
            \Big[ \
                 (\boldsymbol{\mathbf{h}_i} - \boldsymbol{\mathbf{c}_i} + \boldsymbol{\mathbf{c}_i}) ^2 
            \ \Big] 
        \newline
        &= 
              \boldsymbol{\sigma^2}   \sum_{i=1}^{m} \Big[ \ (\boldsymbol{\mathbf{h}_i} - \boldsymbol{\mathbf{c}_i})^2 \ \Big] 
            + \boldsymbol{\sigma^2}   \sum_{i=1}^{m} \Big[ \ {\boldsymbol{\mathbf{c}_i}} ^2 \ \Big]  
            + 2 \boldsymbol{\sigma^2} \sum_{i=1}^{m} 
                \Big[ \ (\boldsymbol{\mathbf{h}_i} - \boldsymbol{\mathbf{c}_i})  \boldsymbol{\mathbf{c}_i} \ \Big] 
        \newline  
        &= 
              \boldsymbol{\sigma^2}   \sum_{i=1}^{m} \Big[ \ (\boldsymbol{\mathbf{h}_i} - \boldsymbol{\mathbf{c}_i})^2 \ \Big] 
            + \boldsymbol{\sigma^2}   \sum_{i=1}^{m} \Big[ \ {\boldsymbol{\mathbf{c}_i}} ^2 \ \Big]  
            + 2 \ \boldsymbol{\sigma^2} \sum_{i=1}^{m} \Big[ \ \boldsymbol{\mathbf{h}_i} \boldsymbol{\mathbf{c}_i} \ \Big]
            - 2 \ \boldsymbol{\sigma^2} \sum_{i=1}^{m} \Big[ \ {\boldsymbol{\mathbf{c}_i}} ^2 \ \Big] 
        \newline 
        \end{align}
$

$
    \quad
    \qquad \qquad \ \
    \begin{align}
        \newline
        &= 
            \boldsymbol{\sigma^2} \sum_{i=1}^{m} 
            \Bigg[ 
                \boldsymbol{\mathbf{h}_i} 
                - \dfrac
                    {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                    {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \Bigg] ^2
            + \boldsymbol{\sigma^2} \sum_{i=1}^{m}  
            \Bigg[ 
                \dfrac
                    {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                    {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2} 
            \Bigg] ^2 
        \newline
        &
            \quad
            + 2 \boldsymbol{\sigma^2} \sum_{i=1}^{m}   
                \Bigg[              
                    \boldsymbol{\mathbf{h}_i} \
                    \dfrac
                        {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                        {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
                \Bigg] 
            - 2 \boldsymbol{\sigma^2} \sum_{i=1}^{m}    
                \Bigg[ 
                        \dfrac
                            {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                            {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
                \Bigg] ^2
        \newline \newline
        &= 
            \boldsymbol{\sigma^2} \sum_{i=1}^{m}
            \Bigg[
                \boldsymbol{\mathbf{h}_i} 
                    - \dfrac
                        {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                        {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \Bigg] ^2
            + \boldsymbol{\sigma^2} \sum_{i=1}^{m} 
            \Bigg[
                \dfrac
                    {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                    {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}                
            \Bigg] ^2 
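            + \dfrac
                { 2 \ \boldsymbol{\sigma^2} }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            - \dfrac
                { 2 \ \boldsymbol{\sigma^2} }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}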
        \newline \newline
        &= 
            \boldsymbol{\sigma^2} \sum_{i=1}^{m}
            \Bigg[
                \boldsymbol{\mathbf{h}_i} 
                    - \dfrac
                        {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                        {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \Bigg] ^2
            + \boldsymbol{\sigma^2} \sum_{i=1}^{m} 
            \Bigg[
                \dfrac
                    {\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}}
                    {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}                
            \Bigg] ^2 
            & \text{by } \textbf{P2}
        \newline \newline
        &= 
            \boldsymbol{\sigma^2} \sum_{i=1}^{m}
                \Big[ \ (\boldsymbol{\mathbf{h}_i} - \boldsymbol{\mathbf{c}_i}) ^2 \ \Big] 
            + \dfrac
                { \boldsymbol{\sigma^2} }
                {\sum_{i=1}^{m} (\boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}})^2}
            \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \quad 
            & \text{by } \textbf{E5}
        \newline
        &= 
            \boldsymbol{\sigma^2} \sum_{i=1}^{m}
                \Big[ \ (\boldsymbol{\mathbf{h}_i} - \boldsymbol{\mathbf{c}_i}) ^2 \ \Big] 
            + \mathrm{Var}(\ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \ )
    \end{align}
$

<br>
The first term (on the right-hand side) of the last equation is a sum of squares, so it can never be negative. It is equal to zero only when $\boldsymbol{\mathbf{h}_i} = \boldsymbol{\mathbf{c}_i}$ for every $i$, in other words when $\tilde{\boldsymbol{\beta}}_\boldsymbol{1} = \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$, and in that case the two estimators have the same variance. In every other case the variance of $\tilde{\boldsymbol{\beta}}_\boldsymbol{1}$ is strictly larger :

<br>
$
    \quad 
    \mathrm{Var}(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1})
    \quad \leq \quad
    \mathrm{Var}(\tilde{\boldsymbol{\beta}}_\boldsymbol{1}) 
$
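
<br>
To see the inequality at work, one can compare OLS with a specific competing linear unbiased estimator. The example here is an assumption chosen for illustration, not something taken from the proof: the alternative estimator is the slope of the line through the first and last observations, the regressor values `x` and the error variance `sigma2` are arbitrary, and NumPy is assumed.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])    # arbitrary illustrative regressor values
sigma2 = 2.0                                 # hypothetical error variance
m = x.size

sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / sxx                     # OLS weights c_i

# Alternative weights h_i: (Y_last - Y_first) / (X_last - X_first)
h = np.zeros(m)
h[0] = -1.0 / (x[-1] - x[0])
h[-1] = 1.0 / (x[-1] - x[0])

# Unbiasedness restrictions: sum(h_i) = 0 and sum(h_i * X_i) = 1
print(h.sum(), h @ x)

var_ols = sigma2 * np.sum(c**2)              # Var of the OLS slope
var_alt = sigma2 * np.sum(h**2)              # Var of the alternative estimator
gap = sigma2 * np.sum((h - c) ** 2)          # the non-negative extra term of the proof

print(var_ols, var_alt, var_ols + gap)
```

The alternative estimator satisfies both restrictions, yet its variance exceeds the OLS variance by exactly the term $\sigma^2 \sum (h_i - c_i)^2$.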

## <font color='red'>Finite sample properties</font>

Under the assumption of strict exogeneity, the OLS estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\sigma}}^2$ are unbiased, meaning that their expected values coincide with the true values of the population parameters :

$
\mathbf{E} [\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta} , 
\quad \mathbf{E} [\hat{\boldsymbol{\sigma}}^2 \mid \mathbf{X}] = \boldsymbol{\sigma}^2
$

The estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors : 
$ {\displaystyle \operatorname {E} [\,\mathbf{x}_{i}\varepsilon _{i}\,] = 0 } $


$
{
    \displaystyle { \hat{\boldsymbol {\beta}} } = 
        (\mathbf {X} ^{\top }\mathbf {X} )^{-1} 
        \mathbf {X} ^{\top }\mathbf {y} =
    \left(\sum \mathbf {x} _{i}\mathbf {x} _{i}^{\top }\right)^{-1}
    \left(\sum \mathbf {x} _{i}y_{i}\right)
}
$

It is also efficient under the assumption that the errors have finite variance and are homoscedastic, meaning that 
$ {\displaystyle \operatorname {E} [\,\varepsilon_{i}^{2} | \mathbf {x}_{i}\,] } $ does not depend on $i$. 
The condition that the errors are uncorrelated with the regressors will generally be satisfied in an experiment, but in the case of observational data, it is difficult to exclude the possibility of an omitted covariate $z$ that is related to both the observed covariates and the response variable. The existence of such a covariate will generally lead to a correlation between the regressors and the error term, and hence to an inconsistent estimator of $\boldsymbol{\beta}$. The condition of homoscedasticity can fail with either experimental or observational data. 

If the goal is either inference or predictive modeling, the performance of OLS estimates can be poor if multicollinearity is present, unless the sample size is large.

In simple linear regression, where there is only one regressor (with a constant), the OLS coefficient estimates have a simple form that is closely related to the correlation coefficient between the covariate and the response.
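
The link with the correlation coefficient can be written as $\hat{\beta}_1 = r_{xy} \, s_y / s_x$, where $r_{xy}$ is the sample correlation and $s_x$, $s_y$ are the sample standard deviations. A brief sketch on simulated data, assuming NumPy and hypothetical population parameters, confirms that the two expressions for the slope coincide :

```python
import numpy as np

rng = np.random.default_rng(2)               # fixed seed, for reproducibility only
x = rng.normal(5.0, 2.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=100)   # simulated data

# OLS slope and intercept from the deviation formulas
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Same slope from the sample correlation and the sample standard deviations
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(slope, slope_from_r, intercept)
```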

## <font color='#28B463'>References</font>

<br>
<ul style="list-style-type:square">
    <li>
         Queen's University at Kingston - Economics 351 - M.G. Abbott -
         <a href="https://bit.ly/2IFUS3n">
         Statistical Properties of the OLS Coefficient Estimators</a>
    </li>
    <br>
    <li>
         University of Valencia - Ezequiel Uriel - 
         <a href="https://bit.ly/2x9cSh6">
         The simple regression model : estimation and properties</a>        
    </li>
    <br>
    <li>
        Wake Forest University - Allin Cottrell - 
        <a href="https://bit.ly/2Ls3tV9">
        Regression Basics in Matrix terms</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2IKZelN">
        Gauss Markov Theorem</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2IJEWcc">
        Covariance Matrix</a>
    </li>
</ul>