# ORDINARY LEAST SQUARES
<br>


## Introduction

<br>
In statistics, ordinary least squares (<b>OLS</b>), or linear least squares, is a method for estimating the unknown parameters of a linear regression model; the OLS estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ are the values that minimize the sum of the squared residuals (<b>RSS</b>) : <br>

$
    \quad    
    RSS
    \ = \ \sum_{i=1}^{m} \boldsymbol{{\mathbf{e}_i}^2} 
    \ = \ \sum _{i=1}^{m}(\boldsymbol{\mathbf{Y}_i} - \hat{\boldsymbol{\mathbf{Y}}}\boldsymbol{_i})^{2}  
$
    
<br>
Why minimize the sum of squared residuals? The OLS estimation criterion corresponds intuitively to the idea of the "best fit" of the estimated sample regression function (SRF) to the given sample data $(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})$ (i = 1, ... , m).

<br>
We will also show that, under the assumptions of the classical linear regression model (CLRM), the OLS coefficient estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ have several desirable statistical properties.


## Residual Sum of Squares

<br>
The sum of squared residuals (also called the Residual Sum of Squares, RSS) is a measure of the overall model fit and constitutes the loss function for OLS estimation; our aim is to minimize this function :

$ 
    \quad
    \begin{align}
        RSS &= 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad & \text{by definition}
        \newline
        &= \sum_{i=1}^{m} \boldsymbol{{\mathbf{e}_i}^2}
        \ = \ \sum _{i=1}^{m}(\boldsymbol{\mathbf{Y}_i} - \hat{\mathbf{Y}}_\boldsymbol{i})^{2} 
            & \text{see derivation}
        \newline
        &= \sum _{i=1}^{m}(\boldsymbol{\mathbf{Y}_i} - \boldsymbol{\mathbf{X}_i}^{\top} \hat{\boldsymbol{\beta}})^{2}
        \newline
        &= (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\top} (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})  
    \end{align}
$
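
<br>
To make the definition concrete, here is a minimal sketch in plain Python that evaluates the RSS of a candidate line $\hat{Y}_i = b_0 + b_1 X_i$. The function name `rss` and the four-point sample are illustrative assumptions, not part of the original text.

```python
def rss(xs, ys, b0, b1):
    """Residual sum of squares for the candidate line y_hat = b0 + b1 * x."""
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return sum(e * e for e in residuals)


# Hypothetical sample of m = 4 observations lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

rss_exact = rss(xs, ys, 1.0, 2.0)    # every residual is zero
rss_shifted = rss(xs, ys, 0.0, 2.0)  # intercept off by 1: every residual is 1

print(rss_exact, rss_shifted)
```

Any line other than the one that generated the data gives a strictly larger value here, which is the sense in which OLS picks the "best fit".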

<br>
Squaring the residuals $\boldsymbol{\mathbf{e}_i}$ does several things: <br>

<ul style="list-style-type:square">
    <li>
        it avoids the possibility that large positive residuals and large negative residuals could offset each other and still
        lead to a small (or even zero) value of the RSS
    </li>
    <br>
    <li>
        it avoids the complex computation required by other objective functions such as Least Absolute Deviations
    </li>
    <br>
    <li>
        it assigns a larger weight to (it penalizes more) numerically large residuals, regardless of whether they are positive
        or negative
    </li>
</ul>
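
<br>
The first and third points in the list can be illustrated with a small sketch. The residual values below are hypothetical, chosen only so that positives and negatives cancel exactly.

```python
# Hypothetical residuals: two large ones and two small ones.
residuals = [5.0, -5.0, 0.5, -0.5]

plain_sum = sum(residuals)                # positives and negatives offset
sum_abs = sum(abs(e) for e in residuals)  # criterion used by Least Absolute Deviations
sum_sq = sum(e * e for e in residuals)    # criterion used by OLS (the RSS)

# Share of each criterion that comes from the two large residuals.
share_abs = (abs(5.0) + abs(-5.0)) / sum_abs
share_sq = (5.0 ** 2 + (-5.0) ** 2) / sum_sq

print(plain_sum, sum_abs, sum_sq)
print(round(share_abs, 3), round(share_sq, 3))
```

The plain sum is zero even though the fit is poor, while the squared criterion attributes a larger share of the total to the two large residuals than the absolute-value criterion does.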

<br>
A large residual can be due either to a poor estimate of the population parameters or to a large stochastic component in the regression equation. Since the measure is additive, no single residual is of utmost relevance on its own.

The RSS function $S(\hat{\boldsymbol{\beta}})$ is quadratic in $\hat{\boldsymbol{\beta}}$. Provided the columns of $\mathbf{X}$ are linearly independent (so that $\mathbf{X}^{\top}\mathbf{X}$ is invertible), its Hessian $2 \, \mathbf{X}^{\top}\mathbf{X}$ is positive definite, and the function therefore possesses a unique global minimum, given by the closed-form formula :

$
    \quad
    \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}
    \ = \ {\rm {arg}}\min _{\hat{\beta} \in \mathbb{R}^{p}} S(\hat{\beta})
    \ = \ 
        \left({\dfrac {1}{m}}\sum_{i=1}^{m} \mathbf{X}_{i}\mathbf{X}_{i}^{\top}\right)^{\!-1}\!\!\cdot \,
        {\dfrac {1}{m}}\sum _{i=1}^{m}\mathbf{X}_{i}\mathbf{Y}_{i}
$

or equivalently in matrix form <br>

$
    \quad
    \hat{\boldsymbol{\beta}}_{OLS} 
    \ = \ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{Y}
    \ = \ {\boldsymbol{\beta}} + (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\boldsymbol{\varepsilon}
$

## Derivation

Derivation of the OLS estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ is performed in two stages: <br>

<ul style="list-style-type:square">
    <li>
        <b>stage 1</b> consists of determining the first-order conditions (FOC) for minimizing the residual sum of squares
        function RSS;<br> these first-order conditions are also called the "OLS normal equations" <br>
        [<b>D1</b>]
    </li>
    <br>
    <li>
        <b>stage 2</b> consists of solving the OLS normal equations in order to obtain the explicit expressions for the 
        estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ <br>
        [<b>D2</b>]
    </li>
</ul>


## [D1] Determination of the OLS Normal Equations

<br>

### [D1.1] Partial Differentiation

<br>
Let's start by re-writing the RSS as a generic function of the residuals $\boldsymbol{{\mathbf{e}_i}}$ :

<br>
$ 
    \quad
    RSS 
    \ = \ S(\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j}) 
    \ = \ \sum_{i=1}^{m} \boldsymbol{{\mathbf{e}_i}^2}
    \ = \ \sum_{i=1}^{m} f(\boldsymbol{{\mathbf{e}_i}})
$

Now we can proceed with the differentiation : 

<br>
$
    \quad
    \begin{align}
        \dfrac
            {\partial \text{ } RSS}
            {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j} }
        &= 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad  
            & \text{by the chain rule of differentiation}
        \newline
        &= 
            \sum_{i=1}^{m}
            \frac
                {d \text{ } f}
                {d \text{ } \boldsymbol{{\mathbf{e}_i}}}
             \frac
                 {\partial \text{ } \boldsymbol{{\mathbf{e}_i}}}
                 {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j} }
            & \text{by the power rule of differentiation}
        \newline       
        &= 
            \sum_{i=1}^{m}
            2 \ \boldsymbol{{\mathbf{e}_i}}
            \frac
                {\partial \text{ } \boldsymbol{{\mathbf{e}_i}}}
                {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j} }
        \newline
        &= 
            2 \ \sum_{i=1}^{m}
            \boldsymbol{{\mathbf{e}_i}}
            \frac
                {\partial \text{ } \boldsymbol{{\mathbf{e}_i}}}
                {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j} }
            \qquad \qquad \text{j = 0,1}     
    \end{align}     
$

The partial derivatives thus take the form : 

<br>
$
    \quad
    \dfrac
        {\partial \text{ } RSS}
        {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}}
    \ = \
        2 \sum_{i=1}^{m}
        \boldsymbol{{\mathbf{e}_i}}
        \dfrac
            {\partial \text{ } \boldsymbol{{\mathbf{e}_i}}}
            {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}}
    \ = \ 
        2 \sum_{i=1}^{m} 
        \boldsymbol{{\mathbf{e}_i}} (-1)
    \ = \ 
        - 2 \sum_{i=1}^{m} 
        \boldsymbol{{\mathbf{e}_i}}
$

$
    \quad
    \dfrac
        {\partial \text{ } RSS}
        {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}}
    \ = \ 
        2 \sum_{i=1}^{m}
        \boldsymbol{{\mathbf{e}_i}}
        \dfrac
            {\partial \text{ } \boldsymbol{{\mathbf{e}_i}}}
            {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}}
    \ = \ 
        2 \sum_{i=1}^{m} 
        \boldsymbol{{\mathbf{e}_i}} (-\boldsymbol{\mathbf{X}_i})    
    \ = \
        - 2 \sum_{i=1}^{m} 
        \boldsymbol{\mathbf{X}_i} \boldsymbol{{\mathbf{e}_i}} 
$
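
<br>
The two partial derivatives can be checked numerically. The sketch below compares the analytic expressions $-2\sum e_i$ and $-2\sum X_i e_i$ with central finite differences of the RSS; the function names, the sample and the evaluation point are assumptions made for illustration.

```python
def rss(xs, ys, b0, b1):
    """Residual sum of squares for the line y_hat = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


def rss_gradient(xs, ys, b0, b1):
    """Analytic partial derivatives: -2 * sum(e_i) and -2 * sum(x_i * e_i)."""
    residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    d_b0 = -2.0 * sum(residuals)
    d_b1 = -2.0 * sum(x * e for x, e in zip(xs, residuals))
    return d_b0, d_b1


def numeric_gradient(xs, ys, b0, b1, h=1e-6):
    """Central finite differences, used only as an independent check."""
    d_b0 = (rss(xs, ys, b0 + h, b1) - rss(xs, ys, b0 - h, b1)) / (2 * h)
    d_b1 = (rss(xs, ys, b0, b1 + h) - rss(xs, ys, b0, b1 - h)) / (2 * h)
    return d_b0, d_b1


# Hypothetical sample and an arbitrary (non-optimal) evaluation point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 9.0]
analytic = rss_gradient(xs, ys, 0.5, 1.5)
numeric = numeric_gradient(xs, ys, 0.5, 1.5)

print(analytic)
print(numeric)
```

Because the RSS is quadratic in the coefficients, the central differences agree with the analytic derivatives up to floating-point rounding.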

### [D1.2] First-Order Conditions

<br>
$
    \quad
    \begin{align}
        \dfrac
        {\partial \text{ } RSS}
        {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}}
        &= 
            - 2 \sum_{i=1}^{m} 
            \boldsymbol{{\mathbf{e}_i}}
        = 0
        \newline       
        &\Rightarrow
            \sum_{i=1}^{m}
            \boldsymbol{{\mathbf{e}_i}} = 0
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            & [\textbf{E1}] 
        \newline       
        &\Rightarrow
            \sum_{i=1}^{m}
            (
                  \boldsymbol{\mathbf{Y}_i}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}\boldsymbol{\mathbf{X}_i}
            ) = 0
            & [\textbf{E2}] 
    \end{align}     
$

<br>
$
    \quad
    \begin{align}
        \dfrac
        {\partial \text{ } RSS}
        {\partial \text{ } \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}}
        &= 
            - 2 \sum_{i=1}^{m} 
            \boldsymbol{\mathbf{X}_i} \boldsymbol{{\mathbf{e}_i}} 
        = 0
        \newline       
        &\Rightarrow
            \sum_{i=1}^{m}
            \boldsymbol{\mathbf{X}_i} \boldsymbol{{\mathbf{e}_i}}  = 0
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            & [\textbf{E3}] 
        \newline       
        &\Rightarrow
            \sum_{i=1}^{m}
            \boldsymbol{\mathbf{X}_i} 
            (
                  \boldsymbol{\mathbf{Y}_i}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}\boldsymbol{\mathbf{X}_i}
            ) = 0
            & [\textbf{E4}]
    \end{align}     
$

<br>
Equations <b>E1</b> and <b>E3</b> are the most compact way of writing the FOCs for the OLS estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$, while equations <b>E2</b> and <b>E4</b> are used to obtain the actual formulas for $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$.


### [D1.3] OLS Normal Equations

<br>
$
    \quad
    \begin{align}
        & \quad 
            \sum_{i=1}^{m}
            (
                  \boldsymbol{\mathbf{Y}_i}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}\boldsymbol{\mathbf{X}_i}
            ) = 0
        \newline        
        \Rightarrow 
        & \quad         
              \sum_{i=1}^{m} \boldsymbol{\mathbf{Y}_i}
            - m \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            = 0
        \newline
        \Rightarrow 
        & \quad 
              m \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{Y}_i}
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \quad
            & [\textbf{E5}] 
    \end{align}     
$

<br>
$
    \quad
    \begin{align}
        & \quad 
            \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            (
                  \boldsymbol{\mathbf{Y}_i}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
                - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}\boldsymbol{\mathbf{X}_i}
            ) = 0
        \newline        
        \Rightarrow 
        & \quad         
              \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i}
            - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2}
            = 0
        \newline
        \Rightarrow 
        & \quad 
              \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i}
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad \quad 
            & [\textbf{E6}] 
    \end{align}     
$

<br>
Equations <b>E5</b> and <b>E6</b> are called the OLS normal equations : solving these two equations yields explicit expressions (formulas) for the OLS estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ .
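
<br>
Since <b>E5</b> and <b>E6</b> form a linear system of two equations in two unknowns, they can be solved directly. The sketch below uses Cramer's rule, which is one possible method and not necessarily the one a library would use; the function name and the sample data are hypothetical.

```python
def solve_normal_equations(xs, ys):
    """Solve E5 and E6 for (b0, b1) using Cramer's rule."""
    m = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # E5:  m  * b0 + sx  * b1 = sy
    # E6:  sx * b0 + sxx * b1 = sxy
    det = m * sxx - sx * sx
    if det == 0:
        # Happens when all x values are identical (exact check; floats may
        # instead yield a tiny non-zero determinant in near-degenerate cases).
        raise ValueError("no unique solution: the x values do not vary")
    b0 = (sy * sxx - sx * sxy) / det
    b1 = (m * sxy - sx * sy) / det
    return b0, b1


# Hypothetical sample lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = solve_normal_equations(xs, ys)
print(b0, b1)
```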

## [D2] Solving the OLS Normal Equations 

<br>
There is more than one way to solve the OLS normal equations <b>E5</b> and <b>E6</b>; the following steps describe only one of the possible methods.

<br>
$
    \quad
    \begin{align}
        & \quad 
              m \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{Y}_i}
        \newline        
        \Rightarrow 
        & \quad         
              \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
              \dfrac
                  {\sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}}
                  {m}
            = \dfrac
                {\sum_{i=1}^{m} \boldsymbol{\mathbf{Y}_i}}
                {m}
        \newline
        \Rightarrow 
        & \quad 
              \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
            = \overline{\mathbf{Y}}
        \newline
        \Rightarrow 
        & \quad
            \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            =   \overline{\mathbf{Y}}
              - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
            & [\textbf{E7}] 
    \end{align}     
$

<br>
$
    \quad
    \begin{align}
        & \quad
              \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i}
        \newline        
        \Rightarrow 
        & \quad 
            \Big(
                      \overline{\mathbf{Y}}
                    - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
            \Big) \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i}
        \newline
        \Rightarrow 
        & \quad 
              m \ \overline{\mathbf{X}} \overline{\mathbf{Y}}
            - m \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \boldsymbol{\overline{\mathbf{X}}^2}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2}
            = \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i}   
        \newline
        \Rightarrow 
        & \quad 
            \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
            \Big(
                      \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2}
                    - m \boldsymbol{\overline{\mathbf{X}}^2}
             \Big)
             = \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i}   
               - m \overline{\mathbf{X}} \overline{\mathbf{Y}}
        \newline
        \Rightarrow 
        & \quad 
            \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
            = \dfrac
                { 
                    \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
                    - m \overline{\mathbf{X}} \overline{\mathbf{Y}} 
                }
                { \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} - m \ \boldsymbol{\overline{\mathbf{X}}^2} }
                \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
                \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
                & [\textbf{E8}] 
    \end{align}     
$

<br>
Equations <b>E7</b> and <b>E8</b> represent the solution of the OLS normal equations; that is, they represent the solution of the FOCs for minimizing the residual sum-of-squares function RSS.


## Further considerations

<br>
If we were actually using formulas <b>E7</b> and <b>E8</b> to compute estimates of $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}$ and $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ for a given sample of m observations 
$(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})$ (i = 1, ... , m) we would employ the following two-step computational
procedure : <br>

<ul style="list-style-type:square">
    <li>
        first use <b>E8</b> to compute the estimate of $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ <br>
        
        <br>
        $
            \quad 
            \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
            = \dfrac
                { 
                    \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
                    - m \ \overline{\mathbf{X}} \overline{\mathbf{Y}} 
                }
                { \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} - m \ \boldsymbol{\overline{\mathbf{X}}^2} }
        $
    </li>
    <br>
    <li>
        second, substitute $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ (our estimate of the unknown population parameter 
        $\boldsymbol{{\beta}_1}$), into <b>E7</b> to obtain the estimate of $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}$ <br>
        
        <br>
        $
            \quad
            \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-0}
            =   \overline{\mathbf{Y}}
              - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1} \overline{\mathbf{X}}
        $
    </li>
</ul>
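
<br>
The two-step procedure above can be sketched in plain Python as follows. The function name `ols_simple` and the five-point sample are assumptions for illustration; the last lines also check the first-order conditions <b>E1</b> and <b>E3</b> on the resulting residuals.

```python
def ols_simple(xs, ys):
    """Two-step OLS for a simple regression: slope from E8, intercept from E7."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m

    # Step 1 (E8): slope estimate from raw sums.
    numerator = sum(x * y for x, y in zip(xs, ys)) - m * x_bar * y_bar
    denominator = sum(x * x for x in xs) - m * x_bar ** 2
    b1 = numerator / denominator  # fails if all x values are identical

    # Step 2 (E7): intercept estimate from the sample means and the slope.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample of m = 5 observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = ols_simple(xs, ys)

# First-order conditions: residuals sum to zero (E1) and are orthogonal to x (E3).
residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
sum_e = sum(residuals)
sum_xe = sum(x * e for x, e in zip(xs, residuals))

print(b0, b1)
print(sum_e, sum_xe)
```

For this sample the slope is 0.6 and the intercept 2.2 (up to rounding), and both residual sums vanish up to floating-point error.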

<br>
The OLS coefficient estimators $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-j}$ are functions only of the observed sample values $(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})$ (i = 1, ... , m) of the observable variables $\mathbf{Y}$ and $\mathbf{X}$; they can therefore be computed for any given set of sample data. 


## Alternative Formulations

<br>

### Matrix form

<br>
The sum of squared residuals (RSS) can be written in matrix notation as (equivalently) : 


<br>
$
    \quad
    e^{\top}e
    \quad = \qquad
    \begin{bmatrix}
        e_1 & e_2 & \dots & \dots & e_m 
    \end{bmatrix}_\textit{ 1 x m}
    \quad 
    \begin{bmatrix}
        e_1 \\
        e_2 \\
        \vdots \\
        \vdots \\
        e_m
    \end{bmatrix}_\textit{ m x 1}
    \quad = \qquad
    \begin{bmatrix}
        {e_1}^2 + {e_2}^2 + \dots + {e_m}^2 
    \end{bmatrix}_\textit{ 1 x 1}
$


<br>
$
    \quad
    \begin{align}
        e^{\top}e 
        \quad &= \qquad 
            (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS})^{\top}
            (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS})
        \newline
        &= \qquad 
            (\mathbf{Y}^{\top} - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top})
            (\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS})
        \newline
        &= \qquad 
              \mathbf{Y}^{\top} \mathbf{Y}
            - \mathbf{Y}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}
            - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} \mathbf{Y}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} 
              \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
              & 
              \text{the scalar } (\mathbf{Y}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS})
              \text{ is equal to its own transpose}
        \newline        
        &= \qquad 
            \mathbf{Y}^{\top} \mathbf{Y}
            - (\mathbf{Y}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS})^{\top}
            - \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} \mathbf{Y}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} 
              \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
        \newline        
        &= \qquad 
            \mathbf{Y}^{\top} \mathbf{Y}
            - 2 \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} \mathbf{Y}
            + \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} 
              \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}     
            \qquad \qquad \qquad \qquad \qquad \qquad & [\textbf{E9}] 
    \end{align}
$
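
<br>
Identity <b>E9</b> holds for any coefficient vector, not only for the OLS solution, and this can be verified numerically. The sketch below uses small hand-written helpers (`dot`, `matvec`, `transpose`, all hypothetical names) instead of a linear algebra library so that each term of <b>E9</b> is visible; the design matrix and candidate vector are illustrative.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product, with A stored as a list of rows."""
    return [dot(row, v) for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


# Hypothetical design matrix (intercept column plus one regressor) and response.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
Y = [2.0, 4.5, 5.5, 9.0]
beta = [0.5, 1.5]  # arbitrary candidate coefficient vector

# Left-hand side: e'e with e = Y - X beta.
e = [y - fitted for y, fitted in zip(Y, matvec(X, beta))]
lhs = dot(e, e)

# Right-hand side: Y'Y - 2 beta'X'Y + beta'X'X beta.
Xt = transpose(X)
XtY = matvec(Xt, Y)
XtX = [[dot(r, c) for c in Xt] for r in Xt]
rhs = dot(Y, Y) - 2.0 * dot(beta, XtY) + dot(beta, matvec(XtX, beta))

print(lhs, rhs)
```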


<br>
In order to determine the OLS normal equations in matrix form, we will take the partial derivative of equation <b>E9</b> with respect to $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}$ and set it to zero :


<br>
$
    \quad
    \begin{align}
         \dfrac
            {\partial \ e^{\top}e}
            {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}}
        &=
        \newline
        &=
            \dfrac
                {\partial \ (\mathbf{Y}^{\top} \mathbf{Y}) }
                {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}}
            - 2 \ \dfrac
                {\partial \ (\hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} \mathbf{Y})}
                {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}}
            + \dfrac
                {\partial \ (
                    \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} \mathbf{X}^{\top} 
                    \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} )
                }
                {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}}
        & \qquad 
            \text{since } \dfrac{\partial \ a^{\top}b}{\partial \ b} = \dfrac{\partial \ b^{\top}a}{\partial \ b} = a
            \qquad \text{when } \textit{a} \text{ and } \textit{b} \text{ are } p \times 1 \text{ vectors} 
        \newline
        &
        & \qquad 
            \text{since } \dfrac{\partial \ b^{\top}Ab}{\partial \ b} = 2 \ Ab
            \qquad \text{when } \textit{A} \text{ is any symmetric matrix} 
        \newline
        &=
            - 2 \ \dfrac
                {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} (\mathbf{X}^{\top} \mathbf{Y})}
                {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}}
            + \dfrac
                { 
                    \partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}^{\top} 
                    (\mathbf{X}^{\top} \mathbf{X}) \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}
                }
                {\partial \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}}
        \newline
        &=
            - 2 \ \mathbf{X}^{\top} \mathbf{Y}
            + 2 \ \mathbf{X}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}      
        \newline
        &= 0
    \end{align}
$


<br>
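From this last equation we obtain the OLS normal equations in matrix form, $(\mathbf{X}^{\top}\mathbf{X}) \, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} = \mathbf{X}^{\top}\mathbf{Y}$, whose solution is :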

<br>
$
    \quad
    \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} \ = \ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top}\mathbf{Y} 
$

<br>
Two important things should be said about the matrix $ (\mathbf{X}^{\top}\mathbf{X}) $ . First, it is always square, since it has dimension $ \ \text{p x p} \ $. Second, it is always symmetric.

<br>
$
    \quad
    \begin{align}
        & \qquad
            (\mathbf{X}^{\top} \mathbf{X}) \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}  \ = \ \mathbf{X}^{\top} \mathbf{Y}
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
            & \text{if } (\mathbf{X}^{\top}\mathbf{X})^{-1} \text{ exists, then it's possible to pre-multiply both sides}
        \newline
        \Rightarrow & \qquad
            (\mathbf{X}^{\top}\mathbf{X})^{-1} \ (\mathbf{X}^{\top} \mathbf{X}) \ \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
            \ = \ (\mathbf{X}^{\top}\mathbf{X})^{-1} \ \mathbf{X}^{\top} \mathbf{Y}
            & \text{by definition } (\mathbf{X}^{\top}\mathbf{X})^{-1} \ \mathbf{X}^{\top} \mathbf{X} = \mathbf{I}
        \newline
        \Rightarrow & \qquad
            \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} 
            \ = \ (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{Y}
    \end{align}
$
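
<br>
As a worked special case, assume the simple regression treated earlier, with an intercept and a single regressor, so that each row of $\mathbf{X}$ is $(1, \mathbf{X}_i)$ and $p = 2$. Under that assumption the two matrices in the normal equations are :

<br>
$
    \quad
    \mathbf{X}^{\top}\mathbf{X}
    \ = \
    \begin{bmatrix}
        m & \sum_{i=1}^{m} \mathbf{X}_i \\
        \sum_{i=1}^{m} \mathbf{X}_i & \sum_{i=1}^{m} {\mathbf{X}_i}^2
    \end{bmatrix}
    \qquad \qquad
    \mathbf{X}^{\top}\mathbf{Y}
    \ = \
    \begin{bmatrix}
        \sum_{i=1}^{m} \mathbf{Y}_i \\
        \sum_{i=1}^{m} \mathbf{X}_i \mathbf{Y}_i
    \end{bmatrix}
$

<br>
Written row by row, $(\mathbf{X}^{\top}\mathbf{X}) \, \hat{\boldsymbol{\beta}}_\boldsymbol{OLS} = \mathbf{X}^{\top}\mathbf{Y}$ is then exactly <b>E5</b> (first row) and <b>E6</b> (second row). The determinant of $\mathbf{X}^{\top}\mathbf{X}$ is $m \sum_{i=1}^{m} {\mathbf{X}_i}^2 - \big( \sum_{i=1}^{m} \mathbf{X}_i \big)^2 = m \sum_{i=1}^{m} (\mathbf{X}_i - \overline{\mathbf{X}})^2$, so in this case the inverse fails to exist only when all the $\mathbf{X}_i$ values are equal.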

### Deviation from the means

<br>
Formula <b>E8</b> for the OLS slope coefficient estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ can conveniently be re-written in terms of <b>deviation-from-means</b>, which uses lower case letters to denote the deviation of each observed sample value from its corresponding sample mean : <br>

<ul style="list-style-type:square">
    <li>
        $
            \boldsymbol{\mathbf{y}_i} \ = \ \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}}
            \qquad \text{i = (1, } \dots \text{, m)} \qquad
        $
        where $\overline{\mathbf{Y}}$ is the sample mean of the $\boldsymbol{\mathbf{Y}_i}$ values            
    </li>
    <br>
    <li>
        $
            \boldsymbol{\mathbf{x}_i} \ = \ \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}}
            \qquad \text{i = (1, } \dots \text{, m)} \qquad
        $
        where $\overline{\mathbf{X}}$ is the sample mean of the $\boldsymbol{\mathbf{X}_i}$ values           
    </li>
</ul>


<br>
Let's start by re-writing the numerator of <b>E8</b> in this alternative formulation (deviation-from-means) : <br>

<br>
$
    \quad
    \begin{align}
        \sum_{i=1}^{m} 
        \boldsymbol{\mathbf{x}_i} \boldsymbol{\mathbf{y}_i}
        &= 
            \sum_{i=1}^{m}
            \big( \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}} \big)
            \big( \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} \big)
        \newline
        &= 
            \sum_{i=1}^{m}
            \big( 
                  \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
                - \boldsymbol{\mathbf{X}_i} \overline{\mathbf{Y}}
                - \overline{\mathbf{X}} \boldsymbol{\mathbf{Y}_i}
                + \overline{\mathbf{X}} \overline{\mathbf{Y}}
            \big)
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
            - \overline{\mathbf{Y}} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} 
            - \overline{\mathbf{X}} \sum_{i=1}^{m} \boldsymbol{\mathbf{Y}_i}
            + m \ \overline{\mathbf{X}} \overline{\mathbf{Y}}
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
            - m \ \overline{\mathbf{X}} \overline{\mathbf{Y}}
            - m \ \overline{\mathbf{X}} \overline{\mathbf{Y}}
            + m \ \overline{\mathbf{X}} \overline{\mathbf{Y}}
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
            - m \ \overline{\mathbf{X}} \overline{\mathbf{Y}}
\end{align}
$


<br>
And now the denominator : <br>

<br>
$
    \quad
    \begin{align}
        \sum_{i=1}^{m} 
            \boldsymbol{{\mathbf{x}_i}^2}
        &= 
            \sum_{i=1}^{m} 
            \big( \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}} \big)^2
        \newline
        &= 
            \sum_{i=1}^{m} 
            \big( 
                \boldsymbol{{\mathbf{X}_i}^2} 
                -2 \ \boldsymbol{\mathbf{X}_i} \overline{\mathbf{X}} 
                + \ \boldsymbol{\overline{\mathbf{X}}^2} 
            \big)
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} 
            - 2 \ \overline{\mathbf{X}} \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i}
            + m \ \boldsymbol{\overline{\mathbf{X}}^2} 
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} 
            - 2m \ \boldsymbol{\overline{\mathbf{X}}^2}
            + m \ \boldsymbol{\overline{\mathbf{X}}^2} 
        \newline
        &= 
            \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} 
            - m \ \boldsymbol{\overline{\mathbf{X}}^2}
    \end{align}
$


<br>
The OLS slope coefficient estimator $\hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}$ can thus be written in (at least) the following equivalent ways : <br>

<br>
$
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS-1}
        &= 
            \dfrac
                { 
                    \sum_{i=1}^{m} \boldsymbol{\mathbf{X}_i} \boldsymbol{\mathbf{Y}_i} 
                    - m \ \overline{\mathbf{X}} \overline{\mathbf{Y}} 
                }
                { \sum_{i=1}^{m} \boldsymbol{{\mathbf{X}_i}^2} - m \ \boldsymbol{\overline{\mathbf{X}}^2} }
        \newline
        &= 
            \dfrac
                { 
                    \sum_{i=1}^{m}
                    \big( \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}} \big)
                    \big( \boldsymbol{\mathbf{Y}_i} - \overline{\mathbf{Y}} \big)
                }
                {
                    \sum_{i=1}^{m} 
                    \big( \boldsymbol{\mathbf{X}_i} - \overline{\mathbf{X}} \big)^2
                }
        = 
            \dfrac
                {\mathrm{Cov}(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})}
                {\mathrm{Var}(\boldsymbol{\mathbf{X}_i})} 
        \newline
        &= 
            \dfrac
                { \sum_{i=1}^{m} \boldsymbol{\mathbf{x}_i} \boldsymbol{\mathbf{y}_i} }
                { \sum_{i=1}^{m} \boldsymbol{{\mathbf{x}_i}^2} }
    \end{align}
$

<br>
$
    \quad
    \begin{align}
        \hat{\boldsymbol{\beta}}_\boldsymbol{OLS}
        &= (\mathbf{X}^{\top}\mathbf{X})^{-1} \ \mathbf{X}^{\top}\mathbf{Y} 
        \newline
        &= {\boldsymbol{\beta}} + (\mathbf{X}^{\top}\mathbf{X})^{-1} \ \mathbf{X}^{\top}\boldsymbol{\varepsilon}
    \end{align}
$

<br>
where $\mathrm{Cov}(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})$ and $\mathrm{Var}(\boldsymbol{\mathbf{X}_i})$ are, respectively, the sample covariance of the observed $(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\mathbf{Y}_i})$ values and the sample variance of the observed $\boldsymbol{\mathbf{X}_i}$ values.

<br>
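It's important to notice that $\mathrm{Var}(\boldsymbol{\mathbf{X}_i})$, when computed with divisor m as $ \dfrac{ \sum_{i=1}^{m} \boldsymbol{{\mathbf{x}_i}^2} } {m} $, is a biased (but consistent) estimator of the population variance of $\mathbf{X}$, conventionally denoted as $\boldsymbol{{\sigma_X}^2}$ . The unbiased (and
consistent) estimator of $\boldsymbol{{\sigma_X}^2}$ is given by 
$ \dfrac{ \sum_{i=1}^{m} \boldsymbol{{\mathbf{x}_i}^2} } {(m - 1)} $ . The choice of divisor does not affect the slope estimate, because the same divisor appears in the covariance and the variance and cancels in the ratio.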
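
<br>
The equivalence of the slope formulas can be checked on a small sample. The sketch below computes the slope from raw sums (<b>E8</b>), from deviations from the means, and as a covariance-to-variance ratio with two different divisors; all function names and the data are illustrative assumptions.

```python
def deviations(values):
    """Deviations of each value from the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope_raw_sums(xs, ys):
    """E8: slope from raw sums of products and squares."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    numerator = sum(x * y for x, y in zip(xs, ys)) - m * x_bar * y_bar
    denominator = sum(x * x for x in xs) - m * x_bar ** 2
    return numerator / denominator


def slope_deviations(xs, ys):
    """Slope in deviation-from-means form."""
    dx, dy = deviations(xs), deviations(ys)
    return sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)


def slope_cov_var(xs, ys, divisor):
    """Sample covariance over sample variance, both using the same divisor."""
    dx, dy = deviations(xs), deviations(ys)
    cov = sum(a * b for a, b in zip(dx, dy)) / divisor
    var = sum(a * a for a in dx) / divisor
    return cov / var


# Hypothetical sample of m = 5 observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
m = len(xs)

slopes = [
    slope_raw_sums(xs, ys),
    slope_deviations(xs, ys),
    slope_cov_var(xs, ys, m),      # biased variance and covariance
    slope_cov_var(xs, ys, m - 1),  # unbiased variance and covariance
]
var_biased = sum(d * d for d in deviations(xs)) / m
var_unbiased = sum(d * d for d in deviations(xs)) / (m - 1)

print(slopes)
print(var_biased, var_unbiased)
```

The four slope values agree up to rounding, while the two variance estimates differ (2.0 versus 2.5 for this sample).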

### Alternative derivations

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations $\mathbf{Y} \approx \mathbf{X} \hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is the unknown. Assuming the system cannot be solved exactly (the number of equations m being larger than the number of unknown coefficients p), we look for the solution with the smallest discrepancy between the left- and right-hand sides. In other words, we are looking for the solution that satisfies <br><br>
$
    \quad
    \hat{\boldsymbol{\beta}}_{OLS} 
    \ = \ {\rm {arg}}\min _{\hat{\beta}}\,\lVert \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}} \rVert
$

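where $\lVert \cdot \rVert$ is the standard L2 norm in the m-dimensional Euclidean space $\mathbb{R}^{m}$. Minimizing this norm is equivalent to minimizing its square, which is the RSS. The predicted quantity $\mathbf{X}\hat{\boldsymbol{\beta}}$ is just a certain linear combination of the columns of $\mathbf{X}$ (the vectors of regressors). 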
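
<br>
The norm-minimizing property can be illustrated by comparing the residual norm at the OLS solution with the norm at nearby coefficient values. This is a sketch on a hypothetical sample, with illustrative function names; it samples only a few perturbations and is not a proof.

```python
import math


def fit_line(xs, ys):
    """OLS intercept and slope for a simple regression (deviation form)."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


def residual_norm(xs, ys, b0, b1):
    """L2 norm of the residual vector Y - X beta for the line b0 + b1 * x."""
    return math.sqrt(sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)))


# Hypothetical sample of m = 5 observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(xs, ys)
best = residual_norm(xs, ys, b0, b1)

# Residual norms at eight perturbed coefficient pairs around the OLS solution.
steps = (-0.1, 0.0, 0.1)
nearby = [
    residual_norm(xs, ys, b0 + d0, b1 + d1)
    for d0 in steps
    for d1 in steps
    if (d0, d1) != (0.0, 0.0)
]

print(best, min(nearby))
```

Every perturbed pair gives a strictly larger residual norm than the OLS coefficients for this sample.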


## References

<br>
<ul style="list-style-type:square">
    <li>
         Queen's University at Kingston - Economics 351 - M.G. Abbott - 
         <a href="https://bit.ly/2KT4CnF">
         Ordinary Least Squares (OLS) Estimation of the Simple CLRM</a>         
    </li>
    <br>
    <li>
        New York University - 
        <a href="https://stanford.io/2KMHCGL">
        OLS in Matrix Form</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2s9aFwm">
        Ordinary Least Squares</a>
    </li>
</ul>
