# INTRINSIC PROPERTIES OF OLS
<br>


## Introduction

<br>
The defining property of the OLS estimators is that they satisfy the criterion of minimizing the sum of squared residuals. There are, however, a number of other intrinsic properties. These properties are algebraic implications of the estimation itself, derived exclusively from applying the OLS procedure to the linear regression model. In other words, they do not depend on any assumptions about how the data were generated.


## [I1] The observed values of X are uncorrelated with the residuals

<br>
Let's start by writing down the OLS normal equation in matrix form :

<br>
$
    \quad
    \begin{align*}
        &
        \begin{aligned}[t]
            (\mathbf{X}^{\top} \mathbf{X}) \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
            &= \mathbf{X}^{\top} \mathbf{Y} 
            \newline   
            &= 
                \mathbf{X}^{\top} (\mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} + \boldsymbol{e})
               \newline
            &= 
                  (\mathbf{X}^{\top} \mathbf{X}) \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
                + \mathbf{X}^{\top} \boldsymbol{e}     
        \end{aligned}        
        \newline
        \quad \Rightarrow \quad
        & \mathbf{X}^{\top} \boldsymbol{e} = 0
    \end{align*}
$

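What does the vector $\mathbf{X}^{\top} \boldsymbol{e}$ look like? Below, $\mathbf{X}^{\top}$ has one row per regressor ($p$ rows) and one column per observation ($m$ columns):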

<br>
$
    \quad
    \begin{bmatrix}
        X_{11} & X_{12} & \dots  & X_{1m} \\
        X_{21} & X_{22} & \dots  & X_{2m} \\
        \vdots & \vdots & \dots & \vdots \\
        \vdots & \vdots & \ddots & \vdots \\
        X_{p1} & X_{p2} & \dots  & X_{pm} 
    \end{bmatrix}_\textit{ p x m}  
    \quad
    \begin{bmatrix}
        e_1    \\
        e_2    \\
        \vdots \\
        \vdots \\
        e_m    
    \end{bmatrix}_\textit{ m x 1}
    \quad = \qquad
    \begin{bmatrix}
        X_{11}e_1  + X_{12}e_2  + \dots + X_{1m}e_m \\
        X_{21}e_1  + X_{22}e_2  + \dots + X_{2m}e_m  \\
        \vdots \\
        \vdots \\
        X_{p1}e_1  + X_{p2}e_2  + \dots + X_{pm}e_m 
    \end{bmatrix}_\textit{ p x 1}
    \quad = \qquad
    \begin{bmatrix}
        0 \\
        0 \\
        \vdots \\
        \vdots \\
        0
    \end{bmatrix}_\textit{ p x 1}
$

<br>
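$\mathbf{X}^{\top} \boldsymbol{e} = 0$ says that each regressor is orthogonal to the residual vector. When the model includes a constant, the residuals also have zero mean (see <b>I2</b> and <b>I3</b>), so each regressor then has exactly zero sample correlation with the residuals. This does not mean that $\mathbf{X}$ is uncorrelated with the disturbance terms; that has to be assumed (see <b>Further considerations</b>).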
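
<br>
As a quick numerical illustration, the sketch below fits OLS on synthetic data and evaluates $\mathbf{X}^{\top} \boldsymbol{e}$. The data, the sizes `m` and `p`, and the use of NumPy's `np.linalg.lstsq` are assumptions made for this example only. Note that `Y` is deliberately generated independently of `X`: the orthogonality holds anyway, because it comes from the estimation procedure and not from the model being correct.

```python
import numpy as np

rng = np.random.default_rng(0)

m, p = 50, 3                       # m observations, p regressors (no constant here)
X = rng.normal(size=(m, p))        # synthetic design matrix, purely illustrative
Y = rng.normal(size=m)             # Y is unrelated to X on purpose

# Least-squares solution of the normal equations (X'X) b = X'Y
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat               # residuals

xte = X.T @ e                      # one entry per regressor
print(xte)                         # every entry is zero up to floating-point rounding
```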

<br>
If matrix $\mathbf{X}$ includes a constant term, then the following properties also hold.

## [I2] The sum of the residuals is zero

<br>
If there is a constant, then the first column in $\mathbf{X}$ will be a column of ones; equivalently, the first row in $\mathbf{X}^{\top}$ will be a row of ones. This means that for the first element in the $\mathbf{X}^{\top} \boldsymbol{e}$ vector to be zero, it must be the case that the sum of residuals itself is zero :

<br>
$
    \quad 
      \boldsymbol{\mathbf{X}_{11}} \boldsymbol{e_1}  
    + \boldsymbol{\mathbf{X}_{12}} \boldsymbol{e_2}  
    + \dots 
    + \boldsymbol{\mathbf{X}_{1m}} \boldsymbol{e_m} = 0
    \quad \Rightarrow \quad
    \sum_{i=1}^{m} \boldsymbol{e_i} = 0
$

## [I3] The sample mean of the residuals is zero

<br>
This follows straightforwardly from the previous property : 

<br>
$
    \quad 
    \sum_{i=1}^{m} \boldsymbol{e_i} = 0
    \quad \Rightarrow \quad
    \overline{\boldsymbol{e}} = \frac{1}{m}\sum_{i=1}^{m} \boldsymbol{e_i} = 0
$
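
<br>
The role of the constant can be checked numerically. The sketch below uses hypothetical data with a nonzero intercept and fits the model twice, once with a column of ones and once without; the helper `residuals` is an illustrative name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
x = rng.uniform(0.0, 10.0, size=m)
Y = 3.0 + 2.0 * x + rng.normal(size=m)        # hypothetical data, nonzero intercept

def residuals(X, Y):
    """Return the OLS residuals of Y regressed on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ beta_hat

X_const = np.column_stack([np.ones(m), x])    # first column of ones
X_noconst = x.reshape(-1, 1)                  # same regressor, no constant

e_const = residuals(X_const, Y)
e_noconst = residuals(X_noconst, Y)

print(e_const.sum(), e_const.mean())          # both zero up to rounding
print(e_noconst.sum())                        # generally far from zero
```

Without the column of ones there is no normal equation forcing the residuals to sum to zero, so the second fit leaves a clearly nonzero sum.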


## [I4] The mean of the predicted values of Y for the sample equals the mean of observed values   

<br>
This follows from the previous implication <b>I3</b>:

<br>
$
    \quad 
    \begin{align*}
    &
        \begin{aligned}[t]
            \overline{\boldsymbol{e}} = 0         
            \quad \Rightarrow \quad
            \overline{ \mathbf{Y} - \hat{\mathbf{Y}} }
            &= \overline{ \mathbf{Y} } - \overline{ \hat{\mathbf{Y}} }
            \newline
            &= 0          
        \end{aligned}              
        \newline
        \Rightarrow \quad
        &
        \overline{\mathbf{Y}} = \overline{ \hat{\mathbf{Y}} }
    \end{align*}
$


## [I4] The regression hyperplane passes through the means of the observed values 

<br>
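This follows from implication <b>I3</b> as well. Because averaging is linear, $\overline{\mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}} = \overline{\mathbf{X}} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$, where $\overline{\mathbf{X}}$ is the row vector of column means of $\mathbf{X}$, so the fitted hyperplane contains the point of sample means: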

<br>
$
    \quad 
    \begin{align*}
    &
        \begin{aligned}[t]
            \overline{\boldsymbol{e}} = 0         
            \quad \Rightarrow \quad
            \overline{ \mathbf{Y} - \hat{\mathbf{Y}} }
            &= \overline{ \mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} }
            \newline
            &= \overline{\mathbf{Y}} - \overline{\mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}} 
            \newline
            &= 0          
        \end{aligned}              
        \newline
        \Rightarrow \quad
        &
        \overline{\mathbf{Y}} = \overline{ \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} }
    \end{align*}
$
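
<br>
Both mean properties can be verified together. The sketch below is an illustration on synthetic data (the sizes and the random design are assumptions): it compares the mean of the observed values, the mean of the fitted values, and the fitted hyperplane evaluated at the column means of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 60
# Constant plus two synthetic regressors
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
Y = rng.normal(size=m)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat               # fitted values

x_bar = X.mean(axis=0)             # column means; the first entry is 1 (the constant)
at_means = x_bar @ beta_hat        # hyperplane evaluated at the point of means

print(Y.mean(), Y_hat.mean(), at_means)   # three equal numbers up to rounding
```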


## [I5] The predicted values of Y are uncorrelated with the residuals  

<br>
$
    \quad 
    \begin{align*}
        \hat{\mathbf{Y}}^{\top} \boldsymbol{e} 
        &= (\mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS})^{\top} \boldsymbol{e}
        \newline
        &= \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \ (\mathbf{X}^{\top} \boldsymbol{e})
        \newline
        &= 0
    \end{align*}
$
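
<br>
A small check of this orthogonality, again on synthetic data chosen only for illustration. With a constant in the model the residuals have zero mean, so the zero inner product also shows up as a zero sample correlation (computed here with `np.corrcoef`).

```python
import numpy as np

rng = np.random.default_rng(3)
m = 30
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])   # constant + 2 regressors
Y = rng.normal(size=m)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat               # fitted values
e = Y - Y_hat                      # residuals

inner = Y_hat @ e                  # inner product of fitted values and residuals
corr = np.corrcoef(Y_hat, e)[0, 1] # sample correlation between them

print(inner, corr)                 # both zero up to rounding
```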


## [I6] Decomposition of the variance of Y  

<br>
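It must be stressed that this last property depends on the first OLS normal equation, the one associated with the intercept. If the model does not include an intercept term, the decomposition obtained below will not hold in general. In the derivation, the cross term vanishes because $\sum_{i=1}^{m} \hat{Y}_i e_i = \hat{\mathbf{Y}}^{\top} \boldsymbol{e} = 0$ and, thanks to the constant, $\sum_{i=1}^{m} e_i = 0$.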

<br>
$
    \quad
    \begin{align*}      
        & \quad 
            \boldsymbol{\mathbf{Y}_i} = \hat{\mathbf{Y}}\boldsymbol{_i} + \boldsymbol{e_i}
        \newline \newline       
        \Rightarrow   
        & \quad
        \begin{aligned}[t]  
            \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} 
            &= \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\mathbf{Y}} + \boldsymbol{e_i}
            \newline           
            &= \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} + \boldsymbol{e_i}
        \end{aligned}
        \newline \newline         
        \Rightarrow 
        & \quad     
        \begin{aligned}[t]  
            \big( \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} \big)^2 
            &= \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} + \boldsymbol{e_i} \big)^2 
            \newline    
            &= 
                  \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big)^2 
                + \boldsymbol{e_i}^2 
                + 2 \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big) \boldsymbol{e_i}     
        \end{aligned}
        \newline \newline         
        \Rightarrow 
        & \quad   
        \begin{aligned}[t]
            \sum_{i=1}^{m} \big( \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} \big)^2 
            &= \sum_{i=1}^{m} \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big)^2  
              + \sum_{i=1}^{m} \boldsymbol{e_i}^2 
              + 2 \sum_{i=1}^{m} \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big) \boldsymbol{e_i} 
            \newline
            &= \sum_{i=1}^{m} \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big)^2  
              + \sum_{i=1}^{m} \boldsymbol{e_i}^2 
              + 2 \big[ 
                          \sum_{i=1}^{m} \hat{\mathbf{Y}}\boldsymbol{_i}\boldsymbol{e_i}  
                          - \overline{\hat{\mathbf{Y}}} \sum_{i=1}^{m} \boldsymbol{e_i} 
                  \big]
            \newline
            &= \sum_{i=1}^{m} \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big)^2  
              + \sum_{i=1}^{m} \boldsymbol{e_i}^2 
        \end{aligned}
        \newline \newline         
        \Rightarrow 
        & \quad   
            \frac{1}{m} \sum_{i=1}^{m} \big( \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} \big)^2
            =   \frac{1}{m} \sum_{i=1}^{m} \big( \hat{\mathbf{Y}}\boldsymbol{_i} - \overline{\hat{\mathbf{Y}}} \big)^2  
              + \frac{1}{m} \sum_{i=1}^{m} \boldsymbol{e_i}^2 
        \newline \newline        
        \Rightarrow 
        & \quad \textbf{Total Sum of Squares (TSS) = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS)}
        \newline \newline        
        \Rightarrow 
        & \quad \textbf{Total Variance (TVAR) = Explained Variance (EVAR) + Unexplained variance (UVAR)}
    \end{align*}
$
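
<br>
The decomposition, and its dependence on the intercept, can be illustrated as follows. This is a sketch on hypothetical data; `sums_of_squares` is a helper name introduced here, and each sum is computed around the mean of the corresponding series, as in the derivation above.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50
x = rng.uniform(0.0, 10.0, size=m)
Y = 3.0 + 2.0 * x + rng.normal(size=m)        # hypothetical data, nonzero intercept

def sums_of_squares(X, Y):
    """Return (TSS, ESS, RSS) for the OLS fit of Y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ beta_hat
    e = Y - Y_hat
    tss = np.sum((Y - Y.mean()) ** 2)
    ess = np.sum((Y_hat - Y_hat.mean()) ** 2)
    rss = np.sum(e ** 2)
    return tss, ess, rss

tss, ess, rss = sums_of_squares(np.column_stack([np.ones(m), x]), Y)
tss0, ess0, rss0 = sums_of_squares(x.reshape(-1, 1), Y)   # no intercept

print(tss - (ess + rss))       # zero up to rounding: TSS = ESS + RSS
print(tss0 - (ess0 + rss0))    # clearly nonzero: the cross term does not vanish
```

Dividing each sum by `m` gives the corresponding variance decomposition.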


## Further considerations  

<br>
We should be careful not to infer anything from the residuals about the disturbance terms. 

<br>
For example, we cannot infer that the sum of the disturbance terms is zero just because this is true of the residuals. It holds for the residuals simply because we chose to minimize the sum of squared residuals in a model with a constant; in other words, it is true of the residuals by construction.
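
<br>
A simulation makes the distinction concrete. In the sketch below the disturbances `u` are observable only because the data are simulated; `beta_true` and the sample size are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 25
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # constant + one regressor
beta_true = np.array([1.0, 2.0])   # hypothetical "true" parameters
u = rng.normal(size=m)             # disturbances: never observed in real data
Y = X @ beta_true + u

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ beta_hat               # residuals: always computable

print(e.sum())                     # zero up to rounding, by construction
print(u.sum())                     # generally not zero in a finite sample
print(X.T @ u)                     # generally not the zero vector either
```

The residuals satisfy the algebraic properties exactly in every sample, while the disturbances satisfy analogous conditions only in expectation, and only if we assume so.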

<br>
It is important to note that we know nothing about $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ except that it satisfies all of the properties discussed above. In order to make any inferences about the true population parameters $\boldsymbol{\beta}$ from $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$, we need to make further assumptions about the true model. 

<br>
These further assumptions are commonly referred to as the Gauss-Markov assumptions and are described in more detail in the related notebook.


## References

<br>
<ul style="list-style-type:square">
    <li>
         University of Valencia - Ezequiel Uriel - 
         <a href="https://bit.ly/2x9cSh6">
         The simple regression model : estimation and properties</a>        
    </li>
    <br>
    <li>
        New York University - 
        <a href="https://stanford.io/2KMHCGL">
        OLS in Matrix Form</a>
    </li>
</ul>