# CLRM ASSUMPTIONS
<br>


## Introduction

<br>
The <b>C</b>lassical <b>L</b>inear <b>R</b>egression <b>M</b>odel, defined as the standard implementation of the linear regression algorithm with standard estimation techniques, makes a number of further assumptions about the regressors (independent variables), the response variables, and their relationship.

<br>
Before going into the details, note that these assumptions fall into three sets : <br>
<br>
<ul style="list-style-type:square">
    <li>
        assumptions regarding the formulation of the population regression equation <br>
        [<b>A1</b>]
    </li>
    <br>
    <li>
        assumptions regarding the statistical properties of the disturbance term and the dependent variable <br>
        [<b>A2 - A4</b>]
    </li>
    <br>
    <li>
        assumptions regarding the properties of the sample data <br>
        [<b>A5 - A8</b>]
    </li>
</ul>


## [A1] Linearity

<br>
Also known as the assumption on the functional form, it states that the population regression equation takes the form 

<br>
$
    \quad
    \begin{align}
        &
        \mathbf{Y} \ = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} 
        \newline 
        \text{or} \quad
        &
        \mathbf{Y}_i 
        \ = \ \mathbf{E}[\mathbf{Y}_i \mid \mathbf{X}_i] + {\boldsymbol{\varepsilon}_i}
        \ = \ \boldsymbol{\mathbf{X}_i}^{\top} \boldsymbol{\beta} + {\boldsymbol{\varepsilon}_i}
    \end{align}
$

<br>
The expected value of the response variable is assumed to be a linear (with respect to the parameters) combination of the
independent variables; note that this assumption is much less restrictive than it may at first seem. 

<br>
It does not mean that there must be a linear relationship between the independent and dependent variables; the specification only has to be linear in its parameters. In other words, the independent variables can take non-linear forms (or be arbitrarily transformed, as in the case of polynomial regression) as long as the parameters enter the equation linearly.
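
<br>
A minimal sketch of this point, assuming NumPy is available: the simulated population below is quadratic in the regressor but linear in the parameters, so ordinary least squares on the transformed columns can still be applied. The coefficient values, the sample size and the names `beta_true`, `X` and `beta_hat` are illustrative assumptions, not part of the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: Y = 1 + 2*x - 0.5*x^2 + eps
# (non-linear in x, but linear in the parameters)
beta_true = np.array([1.0, 2.0, -0.5])
x = rng.uniform(-3.0, 3.0, size=5_000)
eps = rng.normal(0.0, 1.0, size=x.size)

# Design matrix built from transformed regressors: columns 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])
y = X @ beta_true + eps

# Ordinary least squares via a least-squares solver
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The estimates should land close to `beta_true`, even though the relationship between `x` and `y` is a parabola.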

<br>
<b>A1</b> incorporates three distinct assumptions : <br>
   

### [A1.1] Additive Disturbance Term

<br>
This assumption implies that the partial derivative of $\mathbf{Y}_i$ with respect to $\boldsymbol{\varepsilon_i}$ will be equal to 1 :

$
    \quad
    \dfrac
        {\partial \mathbf{Y}_i}
        {\partial \boldsymbol{\varepsilon_i}}
    \ = \ 1
    \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}
$


### [A1.2] Linearity in Parameters

<br>
The population regression equation (PRE) is linear in the population parameters 
$ \boldsymbol{\beta_j} \ \text{(j = 0,} \dots \text{, p)} $.

<br>
This assumption implies that the partial derivative of $\mathbf{Y}$ with respect to each of the
population parameters will be a function only of known constants and/or the regressor $\mathbf{X}_i$; it will not be a function of any unknown parameters. 

$
    \quad
    \dfrac
        {\partial \mathbf{Y}_i}
        {\partial \boldsymbol{\beta_j}}
    \ = \
        \mathbf{f}_j(\mathbf{X}_i)
        \quad
        \text{(j = 0,} \dots \text{, p)}   
        \quad \text{where} \ \mathbf{f}_j(\mathbf{X}_i) \ \text{contains no unknown parameters}
$
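
<br>
As a worked illustration, the two specifications below are hypothetical examples (the second assumes $\mathbf{X}_i > 0$ so that the logarithm is defined). The first is non-linear in the regressor but linear in the parameters; the second is non-linear in a parameter :

$
    \quad
    \begin{align*}
        &
        \mathbf{Y}_i = \beta_0 + \beta_1 \mathbf{X}_i^2 + \varepsilon_i
        \quad \Rightarrow \quad
        \dfrac{\partial \mathbf{Y}_i}{\partial \beta_1} = \mathbf{X}_i^2
        & \text{(no unknown parameters, so A1.2 holds)}
        \newline
        &
        \mathbf{Y}_i = \beta_0 + \mathbf{X}_i^{\beta_1} + \varepsilon_i
        \quad \Rightarrow \quad
        \dfrac{\partial \mathbf{Y}_i}{\partial \beta_1} = \mathbf{X}_i^{\beta_1} \ln \mathbf{X}_i
        & \text{(depends on } \beta_1 \text{, so A1.2 fails)}
    \end{align*}
$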

### [A1.3] Parameter Constancy

<br>
The population parameters $\boldsymbol{\beta_j} \ \text{(j = 0,} \dots \text{, p)}$ are (unknown) constants that do not vary across observations. 

<br>
Symbolically, if $\boldsymbol{\beta_{ji}}$ is the value of the j-th parameter for the i-th observation, this assumption states that 
$ \ \boldsymbol{\beta_{ji}} = \boldsymbol{\beta_j} \ \forall i$ 


## [A2] Zero Conditional Mean or Strict Exogeneity

<br>
The conditional mean, or conditional expectation, of the disturbance terms $\boldsymbol{\varepsilon_i}$ for any given value $\boldsymbol{\mathbf{X}_i}$ of the regressor is equal to zero.

<br>
$
    \quad
    \mathbf{E} [\boldsymbol{\varepsilon} \mid \mathbf{X} ]
    \quad = \quad 
    \mathbf{E} 
    \begin{bmatrix}
        \boldsymbol{\varepsilon_{1}} \mid \mathbf{X} \\
        \boldsymbol{\varepsilon_{2}} \mid \mathbf{X} \\
        \vdots \\
        \boldsymbol{\varepsilon_{m}} \mid \mathbf{X}
    \end{bmatrix}    
    \quad = \quad
    \begin{bmatrix}
        \mathbf{E} [\boldsymbol{\varepsilon_{1}} \mid \mathbf{X}] \\
        \mathbf{E} [\boldsymbol{\varepsilon_{2}} \mid \mathbf{X}] \\
        \vdots \\
        \mathbf{E} [\boldsymbol{\varepsilon_{m}} \mid \mathbf{X}] \\
    \end{bmatrix}    
    \quad = \quad
    \begin{bmatrix}
        \mathbf{E} [\boldsymbol{\varepsilon_{1}}] \\
        \mathbf{E} [\boldsymbol{\varepsilon_{2}}] \\
        \vdots \\
        \mathbf{E} [\boldsymbol{\varepsilon_{m}}] \\
    \end{bmatrix}    
    \quad = \quad
    \begin{bmatrix}
        0      \\
        0      \\
        \vdots \\
        0      \\
   \end{bmatrix} 
   \quad = \quad
   0
$

<br>
$ 
    \quad
    \mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] = 0    
    \quad \text{or} \quad
    \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}] = 0 
    \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}
$

<br>
This assumption states two things : <br>

<ul style="list-style-type:square">
    <li>
        the conditional mean of the disturbance term $\boldsymbol{\varepsilon}$ is the same for all population values of
        $\boldsymbol{\mathbf{X}}$; <br>
        it does not depend, either linearly or non-linearly, on
        $\boldsymbol{\mathbf{X}}$
    </li>
    <br>
    <li>
        the common conditional population mean of $\boldsymbol{\varepsilon}$ for all values of $\boldsymbol{\mathbf{X}}$ is zero
    </li>
</ul>

<br>
Each value $\boldsymbol{\mathbf{X}_i}$ identifies a segment or subset of the relevant population $\boldsymbol{\mathbf{X}}$. This assumption says that, for each of these population segments or subsets, the expectation (conditional on the subset) of the disturbance term $\boldsymbol{\varepsilon}$ is zero. In other words, for each population segment $\boldsymbol{\mathbf{X}_i}$, positive and negative values of $\boldsymbol{\varepsilon_i}$ "cancel out" so that the average value of the disturbance term equals zero. 
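
<br>
The "population segment" reading can be sketched with a small simulation, assuming NumPy. The regressor here takes only three values and the disturbance is drawn independently of it, so the zero conditional mean holds by construction; the segment values, the sample size and the name `segment_means` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: a discrete regressor with three segments and a
# disturbance drawn independently of it, so E[eps | X] = 0 by construction.
n = 60_000
x = rng.choice([1.0, 2.0, 3.0], size=n)
eps = rng.normal(0.0, 1.0, size=n)

# Sample analogue of E[eps | X = x_k] for each segment x_k
segment_means = {float(v): float(eps[x == v].mean()) for v in np.unique(x)}
print(segment_means)
```

Each segment mean should be close to zero, which is the sample counterpart of positive and negative disturbances cancelling out within every segment.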

<br><br>
Common ways to restate the assumption of strict exogeneity are the following (the last one is an implication of <b>A2</b> rather than an equivalent statement) : <br>

<br>
<ul style="list-style-type:square">
    <li>
        the disturbance terms average out to zero for any value of  $\mathbf{X}$
    </li>
    <br>
    <li>
        no observation of the independent variables conveys any information about the expected value of the disturbance terms
    </li>
    <br>
    <li>
        it must not be possible to explain $\boldsymbol{\varepsilon}$ through  $\mathbf{X}$
    </li>
    <br>
    <li>
        the regressors are uncorrelated with the disturbance terms
    </li>
</ul>

<br>
The regressors $\mathbf{X}$ are assumed to be uncorrelated with the disturbance term $\boldsymbol{\varepsilon}$. 
If one or more of the independent variables is correlated with the disturbance term, then the OLS estimates of the regression coefficients will be biased; however, if the correlation is not contemporaneous, the coefficient estimates may still be consistent. There are many methods of correcting the bias, including instrumental variable regression and Heckman selection correction.
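
<br>
The bias can be illustrated with a simulation, assuming NumPy. In this hypothetical data-generating process an unobserved factor `u` enters both the regressor and the disturbance, which makes the two correlated; the true slope `beta`, the sample size and the helper `ols_slope` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
beta = 2.0   # hypothetical true slope

# An unobserved factor u enters both the regressor and the disturbance,
# so Cov(x_endog, eps_endog) != 0.
u = rng.normal(size=n)
x_exog = rng.normal(size=n)           # regressor unrelated to its disturbance
x_endog = rng.normal(size=n) + u      # regressor contaminated by u
eps_exog = rng.normal(size=n)
eps_endog = rng.normal(size=n) + u    # disturbance that also contains u

def ols_slope(x, y):
    """Slope of a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_exog = ols_slope(x_exog, beta * x_exog + eps_exog)
slope_endog = ols_slope(x_endog, beta * x_endog + eps_endog)
print(slope_exog, slope_endog)
```

With these settings the first slope should be close to 2, while the second should settle near 2.5, that is, the true slope plus $\mathrm{Cov}(x, \varepsilon) / \mathrm{Var}(x) = 1/2$.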

<br>
Assumption <b>A2</b> has several implications, which we discuss in the paragraphs below and label using the notation <b>Ax-Iy</b>.
   

### [A2 | Implication 1 : zero unconditional mean]

<br>
Assumption <b>A2</b> implies that the unconditional mean of the population values of $\boldsymbol{\varepsilon}$ equals zero : <br>

$
    \quad
    \begin{align}
        &
        \mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] = 0
        \quad \Rightarrow 
        \mathbf{E} [\boldsymbol{\varepsilon}] = 0
        \newline 
        \text{or} \quad
        &
        \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}] = 0
        \quad \Rightarrow 
        \mathbf{E} [\boldsymbol{\varepsilon_i}] = 0 \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}
    \end{align}
$


<br>
This implication follows from the so-called law of iterated expectations, which states that 

$
    \quad
      \mathbf{E} \Big[ \mathbf{E}[\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] \Big] 
    = \mathbf{E} [\boldsymbol{\varepsilon}] 
    \quad 
        \text{since } 
        \mathbf{E}[\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] = 0 
        \text{ by } \boldsymbol{A2} \text{, it follows that} 
     \quad
      \mathbf{E} [\boldsymbol{\varepsilon}] 
    = \mathbf{E} \Big[ \mathbf{E}[\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] \Big] 
    = \mathbf{E} [0]
    = 0
$

<br>
The logic of <b>A2-I1</b> is straightforward: if the conditional mean of $\boldsymbol{\varepsilon}$ for each and
every population value of $\mathbf{X}$ equals zero, then the mean of these zero conditional means must also be zero. 
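
<br>
The law of iterated expectations can be checked exactly on a small discrete example using only the Python standard library. The joint distribution below is a hypothetical one, chosen so that the conditional mean of the disturbance is zero for each value of the regressor.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, eps = e) on a small grid,
# chosen so that E[eps | X = x] = 0 for every x.
joint = {
    (0, -1): F(1, 4), (0, 1): F(1, 4),
    (1, -2): F(1, 6), (1, 1): F(1, 3),
}

# Marginal distribution of X
p_x = {}
for (x, e), p in joint.items():
    p_x[x] = p_x.get(x, F(0)) + p

# Conditional means E[eps | X = x]
cond_mean = {
    x: sum(e * p for (xx, e), p in joint.items() if xx == x) / p_x[x]
    for x in p_x
}

# Outer expectation over X of the conditional means, and the direct mean
lie_lhs = sum(cond_mean[x] * p_x[x] for x in p_x)
uncond_mean = sum(e * p for (_, e), p in joint.items())
print(cond_mean, lie_lhs, uncond_mean)
```

Both conditional means are exactly zero, and so are their probability-weighted average and the unconditional mean computed directly.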


### [A2 | Implication 2 : orthogonality]

<br>
Assumption <b>A2</b> also implies that the population values of the regressors and of the disturbance term (respectively $\boldsymbol{\mathbf{X}_i}$ and $\boldsymbol{\varepsilon_i}$) have zero covariance, i.e. the population values of $\boldsymbol{\mathbf{X}}$ and $\boldsymbol{\varepsilon}$ are uncorrelated: 

<br>
$
    \quad
    \begin{align*}
        &
        \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}] = 0
        \quad \Rightarrow 
        \begin{aligned}[t]
            \mathrm{Cov}(\boldsymbol{\mathbf{X}_i}, \boldsymbol{\varepsilon_i}) &= 0         
            \newline
            &=  \mathbf{E} 
                \Big[ 
                    (\boldsymbol{\mathbf{X}_i} - \overline{\boldsymbol{\mathbf{X}_i}})
                    (\boldsymbol{\varepsilon_i} - \overline{\boldsymbol{\varepsilon_i}})
                \Big] 
                \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
                & \text{by definition}
                \newline
            &= 
                \mathbf{E} 
                    \Big[ (\boldsymbol{\mathbf{X}_i} - \overline{\boldsymbol{\mathbf{X}_i}}) \boldsymbol{\varepsilon_i} \Big]
                & \text{since } \mathbf{E}[\boldsymbol{\varepsilon_i}] = 0 \text{ by } \textbf{A2-I1}
                \newline
            &=  \mathbf{E} 
                \Big[ 
                      \boldsymbol{\mathbf{X}_i} \boldsymbol{\varepsilon_i} 
                    - \overline{\boldsymbol{\mathbf{X}_i}} \boldsymbol{\varepsilon_i}  
                 \Big]
                \newline
            &=  
                  \mathbf{E} \Big[\boldsymbol{\mathbf{X}_i} \boldsymbol{\varepsilon_i} \Big]
                - \mathbf{E} \Big[\overline{\boldsymbol{\mathbf{X}_i}} \boldsymbol{\varepsilon_i} \Big]
                & 
                    \text{since } 
                    \overline{\boldsymbol{\mathbf{X}_i}} = \mathbf{E}[\boldsymbol{\mathbf{X}_i}] 
                    \text{ is a constant}
                \newline
            &=  
                  \mathbf{E} \Big[\boldsymbol{\mathbf{X}_i} \boldsymbol{\varepsilon_i} \Big]
                - \overline{\boldsymbol{\mathbf{X}_i}} \mathbf{E} \Big[\boldsymbol{\varepsilon_i} \Big]
                 & \text{since } \mathbf{E}[\boldsymbol{\varepsilon_i}] = 0 \text{ by } \textbf{A2-I1}
                \newline
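            &= 
                \mathbf{E} \Big[\boldsymbol{\mathbf{X}_i} \boldsymbol{\varepsilon_i} \Big]
                & \text{by the law of iterated expectations}
                \newline
            &= 
                \mathbf{E} \Big[ \boldsymbol{\mathbf{X}_i} \, \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}] \Big]
                & \text{by } \textbf{A2}
                \newline
            &= 0
                    \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}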
        \end{aligned}              
        \newline
        \text{or} \quad
        &
        \mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] = 0
        \quad \Rightarrow 
        \mathrm{Cov}(\boldsymbol{\mathbf{X}}, \boldsymbol{\varepsilon}) = 0   
    \end{align*}
$

<br>
Saying that the population disturbance terms $\boldsymbol{\varepsilon_i}$ have zero covariance with the corresponding population regressor values $\boldsymbol{\mathbf{X}_i}$ is equivalent to saying that there is no linear
association between the two (or that the two are uncorrelated). Let's see this in detail :

$
    \quad
    \boldsymbol{\rho} (\boldsymbol{\mathbf{X}_i}, \boldsymbol{\varepsilon_i}) 
    = \dfrac
        { \mathrm{Cov} (\boldsymbol{\mathbf{X}_i}, \boldsymbol{\varepsilon_i}) }
        { \sqrt { \mathrm{Var}(\boldsymbol{\mathbf{X}_i})\mathrm{Var}(\boldsymbol{\varepsilon_i}) } }
        \quad \text{zero covariance implies zero correlation}            
$


### [A2 | Implication 3 : conditional mean of the population response variable]

<br>
Assumption <b>A2</b> also implies that the conditional mean of the population values $\boldsymbol{\mathbf{Y}_i}$ corresponding to a given value $\boldsymbol{\mathbf{X}_i}$ of the regressor, equals the population regression function (PRF) :

$
    \quad
    \begin{align*}
        &
        \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}] = 0
        \quad \Rightarrow 
        \begin{aligned}[t]
            & 
            \mathbf{E} [\boldsymbol{\mathbf{Y}_i} \mid \boldsymbol{\mathbf{X}_i}]  
            & \qquad \qquad \qquad \qquad \qquad \qquad 
            \qquad \qquad \qquad \qquad \qquad \qquad
            \text{by }\textbf{A1}
            \newline
            &=  \mathbf{E} 
                [ \boldsymbol{\beta\mathbf{X}_i} + \boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i} ]                  
                \newline
            &=    \mathbf{E} [ \boldsymbol{\beta\mathbf{X}_i} \mid \boldsymbol{\mathbf{X}_i} ]  
                + \mathbf{E} [ \boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i} ]   
                & 
                    \text{since } 
                    \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}] = 0  
                    \text{ by } \textbf{A2}
                \newline
            &=  \mathbf{E} [ \boldsymbol{\beta\mathbf{X}_i} \mid \boldsymbol{\mathbf{X}_i} ]                   
                \newline
            &=  \boldsymbol{\beta\mathbf{X}_i}
                \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}
        \end{aligned}              
        \newline \newline
        \text{or} \quad
        &
        \mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] = 0
        \quad \Rightarrow 
        \mathbf{E} [\boldsymbol{\mathbf{Y}} \mid \boldsymbol{\mathbf{X}}] = \boldsymbol{\beta\mathbf{X}} 
    \end{align*}
$

### [A2 | Further considerations]

<br>
Assumption <b>A2</b> rules out both linear and non-linear dependence of the conditional mean of $\boldsymbol{\varepsilon}$ on $\boldsymbol{\mathbf{X}}$; i.e. it requires $\boldsymbol{\varepsilon}$ to be mean-independent of $\boldsymbol{\mathbf{X}}$, a condition that is stronger than zero correlation but weaker than full statistical independence: <br>

<ul style="list-style-type:square">
    <li>
        the absence of linear dependence between $\boldsymbol{\mathbf{X}}$ and $\boldsymbol{\varepsilon}$ means that the two are
        uncorrelated, or equivalently that the two have zero covariance.
    </li>
    <br>
    <li>
        linear independence between $\boldsymbol{\mathbf{X}}$ and $\boldsymbol{\varepsilon}$ is not sufficient to exclude the
        existence of a different relationship between the two; it is possible for $\boldsymbol{\mathbf{X}}$ and
        $\boldsymbol{\varepsilon}$ to be uncorrelated (or linearly independent), and non-linearly related.
    </li>
    <br>
    <li>
        assumption <b>A2</b> therefore also requires that the conditional mean of $\boldsymbol{\varepsilon}$ not depend
        non-linearly on $\boldsymbol{\mathbf{X}}$
    </li>
</ul>

<br>
The disturbance term $\boldsymbol{\varepsilon}$ represents all the unknown, unobservable and unmeasured variables other than the regressor $\boldsymbol{\mathbf{X}}$ that determine the population values of the dependent variable $\boldsymbol{\mathbf{Y}}$.
Anything that causes the disturbance term to be correlated with the regressor will violate assumption <b>A2</b> :

$ 
    \quad 
    \mathrm{Cov}(\boldsymbol{\mathbf{X}}, \boldsymbol{\varepsilon}) \neq 0 
    \quad \text{or} \quad
    \boldsymbol{\rho} (\boldsymbol{\mathbf{X}}, \boldsymbol{\varepsilon}) \neq 0 
    \quad \Rightarrow \quad
    \mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] \neq 0 
$

<br>
It's important to notice that the converse is not true :

$ 
    \quad 
    \mathrm{Cov}(\boldsymbol{\mathbf{X}}, \boldsymbol{\varepsilon}) = 0 
    \quad \text{or} \quad
    \boldsymbol{\rho} (\boldsymbol{\mathbf{X}}, \boldsymbol{\varepsilon}) = 0 
    \quad \text{does not imply that} \quad
    \mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}] = 0 
$

This is because the covariance (or the correlation) measures only linear dependence, whereas a non-linear dependence between $\boldsymbol{\mathbf{X}}$ and $\boldsymbol{\varepsilon}$ can also cause $\mathbf{E} [\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}]$ to depend on $\boldsymbol{\mathbf{X}}$, and hence to differ from zero.
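
<br>
A small exact counterexample, using only the Python standard library, may make this concrete. The population is hypothetical: the regressor takes the values -1, 0 and 1 with equal probability, and the disturbance is the non-linear function $\varepsilon = X^2 - \mathbf{E}[X^2]$ of it.

```python
from fractions import Fraction as F

# Hypothetical population: X takes the values -1, 0, 1 with equal probability,
# and the disturbance is the deterministic function eps = X^2 - E[X^2].
support = [-1, 0, 1]
prob = F(1, 3)

mean_x = sum(prob * x for x in support)
mean_x2 = sum(prob * x * x for x in support)     # E[X^2] = 2/3

eps = {x: x * x - mean_x2 for x in support}      # eps as a function of X
mean_eps = sum(prob * eps[x] for x in support)

# Covariance between X and eps
cov = sum(prob * (x - mean_x) * (eps[x] - mean_eps) for x in support)

# Because eps is a function of X, E[eps | X = x] is simply eps[x]
cond_mean = dict(eps)
print(cov, cond_mean)
```

The covariance is exactly zero, yet the conditional mean equals $-2/3$ at $X = 0$ and $1/3$ at $X = \pm 1$, so zero correlation holds while <b>A2</b> fails.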


## [A3] Homoscedasticity (or Constant Variance) of the disturbance term

<br>
The conditional variances of the disturbance terms $\boldsymbol{\varepsilon_i}$ are identical for all observations (all population values $\boldsymbol{\mathbf{X}_i}$) and equal the same finite positive (unknown) constant $\boldsymbol{\sigma^2}$ :

<br>
$
    \quad
    \begin{align*}
        &
        \begin{aligned}[t]
            \mathrm{Var} (\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}) 
            &= 
                \qquad \qquad \qquad \qquad \qquad \quad \qquad
            \qquad \qquad \qquad \qquad \qquad \qquad
                &\text{by definition of conditional variance}
            \newline
            &= \mathbf{E}
               \Big[ 
                   \Big(
                         \boldsymbol{\varepsilon_i}
                       - \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}]
                   \Big)^2 
                   \mid \boldsymbol{\mathbf{X}_i}
               \Big]
               & \text{by } \textbf{A2}
            \newline   
            &= \mathbf{E} \Big[ \Big(\boldsymbol{\varepsilon_i} - 0 \Big)^2 \mid \boldsymbol{\mathbf{X}_i} \Big]
               \newline
            &= \mathbf{E} \Big[ \boldsymbol{\varepsilon_i}^2 \mid \boldsymbol{\mathbf{X}_i} \Big]
            \newline
            & = \boldsymbol{\sigma^2} > 0      
            \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}            
        \end{aligned}        
        \newline \newline
        \text{or} \quad
        &
        \mathrm{Var} (\boldsymbol{\varepsilon} \mid \boldsymbol{\mathbf{X}}) 
        = \mathbf{E}[\boldsymbol{\varepsilon}^2 \mid \boldsymbol{\mathbf{X}}] 
        = \boldsymbol{\sigma^2} > 0     
    \end{align*}
$

<br>
For each population value $\boldsymbol{\mathbf{X}_i}$ of $\boldsymbol{\mathbf{X}}$, there is a corresponding conditional distribution of disturbance terms, and a corresponding conditional distribution of population $\boldsymbol{\mathbf{Y}_i}$ values.

<br>
The disturbance term is assumed to have a constant variance, regardless of the values of the regressors.

<br>
Assumption <b>A3</b> also implies that, if the standard deviations of the disturbance terms are constant and do not depend on the value of the regressors, then each probability distribution for the response variable also has the same standard deviation regardless of the regressors. 

<br>
A violation of homoscedasticity typically shows up as a 'fanning' or 'funnelling' pattern in a residuals-vs-fitted plot (where fitted stands for the predicted values). Looking for such a pattern is a necessary first check, although its absence may not be sufficient to establish that the assumption holds.
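
<br>
A numerical stand-in for the visual check can be sketched as follows, assuming NumPy. The helper `spread_ratio` is a hypothetical diagnostic written for this illustration (it is not a standard test): it compares the residual spread in the upper half of the fitted values with the spread in the lower half. The two disturbance processes are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.uniform(1.0, 10.0, size=n)

# Hypothetical disturbances: constant spread vs. spread that grows with x
eps_homo = rng.normal(0.0, 2.0, size=n)
eps_hetero = rng.normal(0.0, 1.0, size=n) * 0.4 * x

def spread_ratio(x, y):
    """Fit y on x by OLS, then compare the residual standard deviation
    in the upper half of the fitted values with that in the lower half."""
    X = np.column_stack([np.ones_like(x), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()

ratio_homo = spread_ratio(x, 1.0 + 0.5 * x + eps_homo)
ratio_hetero = spread_ratio(x, 1.0 + 0.5 * x + eps_hetero)
print(ratio_homo, ratio_hetero)
```

A ratio near 1 is what a flat band of residuals would produce; a ratio well above 1 corresponds to the 'fanning' shape in the plot.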


### [A3 | Implication 1 : unconditional variance of the disturbance term]

<br>
Assumption <b>A3</b> implies that the unconditional variance of the disturbance term is also equal to $\boldsymbol{\sigma^2}$ :
   
$ 
    \quad
    \begin{align*}
        &
        \begin{aligned}[t]
            \mathrm{Var} (\boldsymbol{\varepsilon_i})
            &= 
                \qquad \qquad \qquad \qquad \qquad \quad \qquad \qquad \qquad \qquad\qquad \qquad \qquad\qquad
                &\text{by definition of variance}
            \newline
            &= \mathbf{E} \Big[ \Big( \boldsymbol{\varepsilon_i} - \mathbf{E}[\boldsymbol{\varepsilon_i}] \Big)^2 \Big]
            \newline
            &= \mathbf{E} 
               \Big[
                     {\boldsymbol{\varepsilon_i}}^2
                   - 2 \boldsymbol{\varepsilon_i}\mathbf{E}[\boldsymbol{\varepsilon_i}]
                   + \mathbf{E}[\boldsymbol{\varepsilon_i}]^2
               \Big]
            \newline
            &=   \mathbf{E} \Big[ {\boldsymbol{\varepsilon_i}}^2 \Big]
               - 2 \mathbf{E} \Big[ \boldsymbol{\varepsilon_i}\mathbf{E}[\boldsymbol{\varepsilon_i}] \Big]
               + \mathbf{E} \Big[ \mathbf{E}[\boldsymbol{\varepsilon_i}]^2 \Big]
               & \text{by } \textbf{A2-I1} \text{ (zero unconditional mean)}
            \newline
            &= \mathbf{E} \Big[ {\boldsymbol{\varepsilon_i}}^2 \Big]
               & \text{by the law of iterated expectations}
            \newline
            &= \mathbf{E} \Big[ \mathbf{E}[{\boldsymbol{\varepsilon_i}}^2 \mid \boldsymbol{\mathbf{X}_i}] \Big]
               & \text{by } \textbf{A3}
            \newline
            &= \mathbf{E} \Big[ \boldsymbol{\sigma^2} \Big] = \boldsymbol{\sigma^2}
        \end{aligned}
        \newline \newline
        \text{or} \quad
        &
        \mathrm{Var} (\boldsymbol{\varepsilon}) 
        = \boldsymbol{\sigma^2}  
    \end{align*}
$


### [A3 | Implication 2 : conditional variance of population response variable]

<br>
Assumption <b>A3</b> implies that the conditional variance of the population values $\boldsymbol{\mathbf{Y}_i}$ corresponding to a given population value $\boldsymbol{\mathbf{X}_i}$ equals the constant variance $\boldsymbol{\sigma^2}$ :

$
    \quad
    \begin{align*}
        &
        \begin{aligned}[t]
            \mathrm{Var} (\boldsymbol{\mathbf{Y}_i} \mid \boldsymbol{\mathbf{X}_i}) 
            &= 
                \qquad \qquad \qquad \qquad \qquad \quad \qquad \qquad
                &\text{by definition of conditional variance}
            \newline
            &= \mathbf{E}
               \Big[ 
                   \Big(
                         \boldsymbol{\mathbf{Y}_i}
                       - \mathbf{E} [\boldsymbol{\mathbf{Y}_i} \mid \boldsymbol{\mathbf{X}_i}]
                   \Big)^2 
                   \mid \boldsymbol{\mathbf{X}_i}
               \Big]
               & \text{by } \textbf{A2-I3} \text{ (conditional mean of the population response variable)}
            \newline   
            &= \mathbf{E}
               \Big[ 
                   \Big( \boldsymbol{\mathbf{Y}_i} - \boldsymbol{\beta\mathbf{X}_i} \Big)^2 
                   \mid \boldsymbol{\mathbf{X}_i}                   
               \Big]
               & \text{by } \boldsymbol{A1}
               \newline
            &= \mathbf{E} \Big[ \boldsymbol{\varepsilon_i}^2 \mid \boldsymbol{\mathbf{X}_i} \Big]
               & \text{by } \textbf{A3}
            \newline
            & = \boldsymbol{\sigma^2} > 0      
            \quad \forall i \quad \text{(i = 1,} \dots \text{, m)}            
        \end{aligned}        
        \newline \newline
        \text{or} \quad
        &
        \mathrm{Var} (\boldsymbol{\mathbf{Y}} \mid \boldsymbol{\mathbf{X}}) 
        = \boldsymbol{\sigma^2}     
    \end{align*}
$
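
<br>
The difference between the conditional and the unconditional variance of the response can be sketched with a simulation, assuming NumPy. The parameter values `beta0`, `beta1` and `sigma`, and the three-valued regressor, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 90_000
sigma = 2.0                  # hypothetical disturbance standard deviation
beta0, beta1 = 1.0, 3.0      # hypothetical population parameters

x = rng.choice([0.0, 1.0, 2.0], size=n)
eps = rng.normal(0.0, sigma, size=n)   # homoscedastic: spread does not depend on x
y = beta0 + beta1 * x + eps

# Sample analogue of Var(Y | X = x_k) for each regressor value
cond_var_y = {float(v): float(y[x == v].var()) for v in np.unique(x)}
print(cond_var_y, y.var())
```

Each conditional variance should be close to $\sigma^2 = 4$, whereas the unconditional variance of `y` is larger because it also includes the variation coming from the regressor.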


## [A4] Independence of the disturbance terms

<br>
Also known as the assumption of Zero Error Covariances, Non-autoregressive Errors, or Non-autocorrelated Errors, <b>A4</b> states that the disturbance terms have zero conditional covariance across observations of the regressors :

$
    \quad
    \begin{align}
        \mathrm{Cov}
        (\boldsymbol{\varepsilon_i},\boldsymbol{\varepsilon_s} \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}) 
        &=
            \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
            & \text{by definition of conditional covariance}
        \newline
        &= \mathbf{E}
           \Big[ 
               \Big(
                     \boldsymbol{\varepsilon_i}
                   - \mathbf{E} [\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i}]
               \Big)
               \Big(
                     \boldsymbol{\varepsilon_s}
                   - \mathbf{E} [\boldsymbol{\varepsilon_s} \mid \boldsymbol{\mathbf{X}_s}]
               \Big)
               \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}
           \Big]
            & \text{by } \textbf{A2}   
        \newline
        &= \mathbf{E} 
           \Big[ 
               \boldsymbol{\varepsilon_i}\boldsymbol{\varepsilon_s} 
               \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}
            \Big]
        \newline
        &= 0 \quad \forall i \neq s
    \end{align}
$   

<br>
Assumption <b>A4</b> states that : <br>

<ul style="list-style-type:square">
    <li>
        the disturbance terms $\boldsymbol{\varepsilon_i}$ corresponding to 
        $\boldsymbol{\mathbf{X}} = \boldsymbol{\mathbf{X}_i}$ have zero covariance (or are uncorrelated) with the disturbance
        terms $\boldsymbol{\varepsilon_s}$ corresponding to any other regressor value 
        $\boldsymbol{\mathbf{X}} = \boldsymbol{\mathbf{X}_s}$ 
        (where $\boldsymbol{\mathbf{X}_i} \neq \boldsymbol{\mathbf{X}_s}$)
    </li>
    <br>
    <li>
        the population values $\boldsymbol{\mathbf{Y}_i}$ corresponding to 
        $\boldsymbol{\mathbf{X}} = \boldsymbol{\mathbf{X}_i}$ have zero covariance (or are uncorrelated) with the population
        values $\boldsymbol{\mathbf{Y}_s}$ corresponding to any other regressor value 
        $\boldsymbol{\mathbf{X}} = \boldsymbol{\mathbf{X}_s}$ 
        (where $\boldsymbol{\mathbf{X}_i} \neq \boldsymbol{\mathbf{X}_s}$)
    </li>
</ul>

<br>
This means there is no systematic linear dependence or association between $\boldsymbol{\varepsilon_i}$ and $\boldsymbol{\varepsilon_s}$, or between $\boldsymbol{\mathbf{Y}_i}$ and $\boldsymbol{\mathbf{Y}_s}$, across observations of the regressors $\boldsymbol{\mathbf{X}}$.

<br> 
Assumption <b>A4</b> may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies.
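
<br>
A time-series violation can be sketched as follows, assuming NumPy. The AR(1) process, its coefficient `rho` and the helper `lag1_corr` are illustrative assumptions used to contrast independent disturbances with serially correlated ones.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
rho = 0.7   # hypothetical autoregressive coefficient

# Independent disturbances (consistent with A4)
eps_iid = rng.normal(size=n)

# AR(1) disturbances: eps_t = rho * eps_{t-1} + innovation_t (violates A4)
innov = rng.normal(size=n)
eps_ar = np.empty(n)
eps_ar[0] = innov[0]
for t in range(1, n):
    eps_ar[t] = rho * eps_ar[t - 1] + innov[t]

def lag1_corr(e):
    """Sample correlation between e_t and e_{t-1}."""
    return np.corrcoef(e[1:], e[:-1])[0, 1]

corr_iid = lag1_corr(eps_iid)
corr_ar = lag1_corr(eps_ar)
print(corr_iid, corr_ar)
```

The lag-one correlation should be near zero for the independent draws and near `rho` for the autoregressive ones.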

### [A4 | Implication 1 : independence of the response variable terms]

<br>
Assumption <b>A4</b> implies that the response variable terms have zero conditional covariance across observations of the regressors as well :

$
    \quad
    \begin{align}
        \mathrm{Cov}
        (\boldsymbol{\mathbf{Y}_i},\boldsymbol{\mathbf{Y}_s} \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}) 
        &=
            & \text{by definition of conditional covariance}
        \newline
        &= \mathbf{E}
           \Big[ 
               \Big(
                     \boldsymbol{\mathbf{Y}_i}
                   - \mathbf{E} [\boldsymbol{\mathbf{Y}_i} \mid \boldsymbol{\mathbf{X}_i}]
               \Big)
               \Big(
                     \boldsymbol{\mathbf{Y}_s}
                   - \mathbf{E} [\boldsymbol{\mathbf{Y}_s} \mid \boldsymbol{\mathbf{X}_s}]
               \Big)
               \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}
           \Big]
            & \text{by } \textbf{A2-I3} \text{ (conditional mean of the population response variable)} 
        \newline
        &=  \mathbf{E}
            \Big[ 
               \Big( \boldsymbol{\mathbf{Y}_i} - \boldsymbol{\beta\mathbf{X}_i} \Big)
               \Big( \boldsymbol{\mathbf{Y}_s} - \boldsymbol{\beta\mathbf{X}_s} \Big)
               \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}
            \Big]
            & \text{by } \textbf{A1}
        \newline
        &=  \mathbf{E}
            \Big[ 
                \boldsymbol{\varepsilon_i}\boldsymbol{\varepsilon_s}
                \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}
            \Big]            
        \newline
        &=  \mathrm{Cov}
            (\boldsymbol{\varepsilon_i},\boldsymbol{\varepsilon_s} \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s})
            & \text{by } \textbf{A4}
        \newline
        &= 0 \quad \forall i \neq s
    \end{align}
$

## [A3 + A4] Spherical Errors

<br>
The disturbance (or error) terms are said to be <b>spherical</b> when we have both homoscedasticity (<b>A3</b>) and no serial (or auto) correlation (<b>A4</b>); in this case the covariance matrix of the disturbance term is :

<br>
$
    \quad
    \begin{align}
    \mathbf{E} \ [ \varepsilon \varepsilon^{\top} \mid X ]
    \quad &= \quad
    \mathbf{E} \ 
    \begin{bmatrix}
        \varepsilon_1 \mid X \\
        \varepsilon_2 \mid X \\
        \vdots               \\
        \vdots               \\
        \varepsilon_m \mid X
    \end{bmatrix}_\textit{ m x 1 }
    \begin{bmatrix}
        \varepsilon_1 \mid X &
        \varepsilon_2 \mid X &
        \dots                &
        \dots                &
        \varepsilon_m \mid X 
    \end{bmatrix}_\textit{ 1 x m }
    \newline \newline
    &= \quad
    \mathbf{E} \ 
    \begin{bmatrix}
        {\varepsilon_1}^2 \mid X           &  \varepsilon_1\varepsilon_2 \mid X & \dots  & \varepsilon_1\varepsilon_m \mid X  \\
        \varepsilon_2\varepsilon_1 \mid X  &  {\varepsilon_2}^2 \mid X          & \dots  & \varepsilon_2\varepsilon_m \mid X  \\
        \vdots                             &  \vdots                            & \vdots & \vdots                             \\
        \vdots                             &  \vdots                            & \ddots & \vdots                             \\
        \varepsilon_m\varepsilon_1 \mid X  &  \varepsilon_m\varepsilon_2 \mid X & \dots  & {\varepsilon_m}^2 \mid X  
    \end{bmatrix}_\textit{ m x m }
    \quad = \quad 
    \begin{bmatrix}
          \mathbf{E} [ {\varepsilon_1}^2 \mid X ]
        & \mathbf{E} [ \varepsilon_1\varepsilon_2 \mid X ]
        & \dots  
        & \mathbf{E} [ \varepsilon_1\varepsilon_m \mid X ] 
        \\
          \mathbf{E} [ \varepsilon_2\varepsilon_1 \mid X ]  
        & \mathbf{E} [  {\varepsilon_2}^2 \mid X ]          
        & \dots  
        & \mathbf{E} [ \varepsilon_2\varepsilon_m \mid X ]  
        \\
        \vdots & \vdots & \vdots & \vdots 
        \\
        \vdots & \vdots & \ddots & \vdots              
        \\
          \mathbf{E} [ \varepsilon_m\varepsilon_1 \mid X ]  
        & \mathbf{E} [ \varepsilon_m\varepsilon_2 \mid X ] 
        & \dots  
        & \mathbf{E} [ {\varepsilon_m}^2 \mid X ]
    \end{bmatrix}_\textit{ m x m }    
    \newline \newline
    &= \quad 
    \begin{bmatrix}
          \mathrm{Var}(\varepsilon_1 \mid X) 
        & \mathrm{Cov}(\varepsilon_1, \varepsilon_2 \mid X)
        & \dots
        & \mathrm{Cov}(\varepsilon_1, \varepsilon_m \mid X)
        \\
          \mathrm{Cov}(\varepsilon_2, \varepsilon_1 \mid X)
        & \mathrm{Var}(\varepsilon_2 \mid X) 
        & \dots
        & \mathrm{Cov}(\varepsilon_2, \varepsilon_m \mid X)
        \\
        \vdots & \vdots & \vdots & \vdots 
        \\
        \vdots & \vdots & \ddots & \vdots              
        \\
          \mathrm{Cov}(\varepsilon_m, \varepsilon_1 \mid X)
        & \mathrm{Cov}(\varepsilon_m, \varepsilon_2 \mid X)
        & \dots 
        & \mathrm{Var}(\varepsilon_m \mid X) 
    \end{bmatrix}
    \quad & \text{by } \textbf{A3}
    \newline \newline
    &= \quad 
    \begin{bmatrix}
          \sigma^2  
        & \mathrm{Cov}(\varepsilon_1, \varepsilon_2 \mid X)
        & \dots 
        & \mathrm{Cov}(\varepsilon_1, \varepsilon_m \mid X)         
        \\
          \mathrm{Cov}(\varepsilon_2, \varepsilon_1 \mid X)
        & \sigma^2  
        & \dots
        & \mathrm{Cov}(\varepsilon_2, \varepsilon_m \mid X)
        \\
        \vdots & \vdots & \vdots & \vdots 
        \\
        \vdots & \vdots & \ddots & \vdots              
        \\
          \mathrm{Cov}(\varepsilon_m, \varepsilon_1 \mid X)
        & \mathrm{Cov}(\varepsilon_m, \varepsilon_2 \mid X)
        & \dots 
        & \sigma^2  
    \end{bmatrix}  
    & \text{by } \textbf{A4}
    \newline \newline
    &= \quad 
    \begin{bmatrix}
        \sigma^2 & 0        & \dots  & 0        \\
        0        & \sigma^2 & \dots  & 0        \\
        \vdots   & \vdots   & \vdots & \vdots   \\
        \vdots   & \vdots   & \ddots & \vdots   \\
        0        & 0        & \dots  & \sigma^2 \\ 
    \end{bmatrix}  
    \quad = \quad \boldsymbol{\sigma^2} \boldsymbol{\textit{I}}
    \end{align}
$
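
<br>
As a rough numerical check of the result above, the sketch below simulates disturbance vectors that are zero-mean, homoskedastic and independent across observations, then estimates $\mathbf{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top}]$ by averaging. The normal distribution, the values of `m`, `sigma` and `n_draws`, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, n_draws = 4, 2.0, 200_000   # illustrative values, not from the text

# Each row is one draw of the disturbance vector (eps_1, ..., eps_m):
# zero mean, common variance sigma^2, independent across observations.
eps = rng.normal(loc=0.0, scale=sigma, size=(n_draws, m))

# Monte Carlo estimate of E[eps eps^T], an m x m matrix
cov_hat = eps.T @ eps / n_draws

# Largest entry-wise deviation from sigma^2 * I
max_abs_err = np.abs(cov_hat - sigma**2 * np.eye(m)).max()
```

With enough draws, `cov_hat` should be close to $\sigma^2 I$: diagonal entries near `sigma**2` and off-diagonal entries near zero.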



## [A5] Random Sampling or Independent Random Sampling

<br>
The sample data consist of $\boldsymbol{m}$ randomly selected observations on the regressand $\mathbf{Y}$ and the regressor $\mathbf{X}$, the two observable variables in the population regression equation (PRE) described by <b>A1</b>. In other words, the sample observations are drawn at random from the underlying population; they are a random subset of the population data points.

<br>
Assumption <b>A5</b> implies that the sample observations are statistically independent, which means that : <br>

<ul style="list-style-type:square">
    <li>
         the disturbance terms $\boldsymbol{\varepsilon_i}$ and $\boldsymbol{\varepsilon_s}$ are therefore statistically
         independent as well, and hence have zero covariance (or are uncorrelated) for any two observations $\boldsymbol{i}$ and
         $\boldsymbol{s}$ 
    </li>
    <br>
    <li>
        the population values $\boldsymbol{\mathbf{Y}_i}$ and $\boldsymbol{\mathbf{Y}_s}$ are therefore statistically
        independent as well, and hence have zero covariance (or are uncorrelated) for any two observations $\boldsymbol{i}$ and
        $\boldsymbol{s}$ 
    </li>
</ul>

<br>
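Random sampling is therefore sufficient for assumption <b>A4</b> (zero covariance between observations) but not necessary for it: independence implies zero covariance, whereas uncorrelated disturbances need not be independent. <b>A5</b> is thus the stronger of the two assumptions.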
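
<br>
To see why independence is stronger than zero covariance, here is a minimal sketch with two hypothetical variables `u` and `v` (standing in for a pair of disturbances; they are not part of the model above). `v` is a deterministic function of `u`, so the two are clearly dependent, yet their covariance is zero because $\mathbf{E}[u^3] = 0$ for a standard normal `u`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000                      # illustrative sample size

u = rng.standard_normal(n)
v = u**2 - 1.0                   # fully determined by u, hence dependent on it

# Sample covariance between u and v: close to 0
cov_uv = np.cov(u, v)[0, 1]

# The dependence becomes visible after a non-linear transform of u
corr_sq = np.corrcoef(u**2, v)[0, 1]
```

Here `cov_uv` should be close to zero while `corr_sq` is essentially one, so zero covariance alone does not rule out dependence.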


### [A5 | Further considerations]

<br>
The random sampling assumption is usually appropriate for cross-sectional regression models, i.e. for regression models formulated for <b>cross-sectional data</b>. What are cross-sectional data ? <br>

<ul style="list-style-type:square">
    <li>
         <b>definition : </b>a cross-sectional data set consists of a sample of observations on individual economic agents or
         other units taken at a single point in time or over a single period of time
    </li>
    <br>
    <li>
        a distinguishing characteristic of any cross-sectional data set is that the individual observations have no natural
        ordering
    </li>
    <br>
    <li>
        a common, almost universal characteristic of cross-sectional data sets is that they are constructed by random
        sampling from underlying populations
    </li>
</ul>

<br>
The random sampling assumption is hardly ever appropriate for time-series regression models, i.e. for regression models formulated for <b>time-series data</b>. What are time-series data ? <br>

<ul style="list-style-type:square">
    <li>
         <b>definition : </b>a time-series data set consists of a sample of observations on one or more variables over several
         successive periods or intervals of time
    </li>
    <br>
    <li>
        a distinguishing characteristic of any time-series data set is that the observations have a natural ordering,
        specifically a chronological ordering
    </li>
    <br>
    <li>
        a common, almost universal characteristic of time-series data sets is that the sample observations exhibit a high degree
        of time dependence, and therefore cannot be assumed to be generated by random sampling
    </li>
</ul>
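
<br>
The contrast between the two kinds of data can be sketched numerically. The example below compares the lag-1 sample autocorrelation of an independently drawn sample with that of a simulated first-order autoregressive series. The AR(1) process, the coefficient `rho = 0.9` and the helper `lag1_autocorr` are assumptions made for illustration; they are one simple way to generate time dependence, not the only one.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample correlation between x_t and x_{t-1}."""
    return np.corrcoef(x[1:], x[:-1])[0, 1]

rng = np.random.default_rng(2)
n, rho = 5_000, 0.9              # illustrative length and persistence

# Independent draws: no natural ordering, no dependence between neighbours
iid_sample = rng.standard_normal(n)

# AR(1) series: each value depends on the previous one
shocks = rng.standard_normal(n)
ar1 = np.empty(n)
ar1[0] = shocks[0]
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + shocks[t]

r_iid = lag1_autocorr(iid_sample)
r_ar1 = lag1_autocorr(ar1)
```

`r_iid` should be near zero, whereas `r_ar1` should be near `rho`, which is the kind of dependence that makes random sampling implausible for time-series data.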



## [A6] Number of observations

<br>
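The number of sample observations $\boldsymbol{m}$ is greater than the number of unknown parameters $\boldsymbol{p}$. 

<br>
If $\boldsymbol{m} < \boldsymbol{p}$, the design matrix cannot have full column rank (see <b>A8</b>), so it is not possible to compute, from a given sample of $\boldsymbol{m}$ observations, unique estimates of all the $\boldsymbol{p}$ unknown parameters in the model. With $\boldsymbol{m} = \boldsymbol{p}$ the model fits the sample exactly and leaves no degrees of freedom for estimating the disturbance variance $\sigma^2$.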
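
<br>
A small sketch of what goes wrong with too few observations: with a randomly generated design matrix (an assumption for illustration, as are the values `m = 3` and `p = 5`), the rank of $\mathbf{X}$ cannot exceed the number of rows, so $\mathbf{X}^{\top}\mathbf{X}$ is a singular $p \times p$ matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 3, 5                              # fewer observations than parameters

X = rng.standard_normal((m, p))          # hypothetical design matrix
rank_X = np.linalg.matrix_rank(X)        # at most min(m, p) = m

gram = X.T @ X                           # p x p matrix X^T X
rank_gram = np.linalg.matrix_rank(gram)  # same rank as X, hence below p
```

Because `rank_gram` is smaller than `p`, `gram` has no inverse and the least squares estimates are not unique.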


## [A7] Non-constant Regressor 

<br>
The values $\boldsymbol{\mathbf{X}_i}$ of the regressor $\boldsymbol{\mathbf{X}}$ in the sample (and hence in the
population) are not all the same; i.e. the regressor is not constant :

$ 
    \quad 
    \text{there is no constant } \boldsymbol{c} \text{ such that } 
    \boldsymbol{\mathbf{X}_i} = \boldsymbol{c} 
    \quad \forall i \quad (i = 1, \dots, m)
$

<br>
More formally, assumption <b>A7</b> requires the sample variance of the regressor values $\boldsymbol{\mathbf{X}_i}$ to be a finite positive number for any sample size $\boldsymbol{m}$ :

$
    \quad 
    \text{sample variance of } \boldsymbol{\mathbf{X}_i} 
    \ = \ \mathrm{Var}(\boldsymbol{\mathbf{X}_i})
    \ = \ \dfrac
        {\sum_{i=1}^{m} (\mathbf{X}_i - \overline{\mathbf{X}})^2}
        {m - 1}
    \ = \ \boldsymbol{{\sigma_X}^2} > 0 
$

<br>
In order to estimate, from the sample data, the effect of changes in the regressors on the regressand, the sample values $\boldsymbol{\mathbf{X}_i}$ of the regressors $\boldsymbol{\mathbf{X}}$ must vary across observations in any given sample.
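
<br>
The role of this assumption shows up directly in the standard simple-regression slope formula, $\sum_i (X_i - \overline{X})(Y_i - \overline{Y}) \, / \, \sum_i (X_i - \overline{X})^2$, whose denominator is zero when the regressor does not vary. The sketch below uses a hypothetical helper `ols_slope` and made-up data to illustrate this.

```python
import numpy as np

def ols_slope(x, y):
    """Simple-regression slope; returns None when x has no sample variation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)    # (m - 1) times the sample variance of x
    if np.isclose(sxx, 0.0):
        return None                      # slope is undefined: division by zero
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    return sxy / sxx

y = [1.0, 2.0, 3.0, 4.0]
slope_varying = ols_slope([0.0, 1.0, 2.0, 3.0], y)    # regressor varies
slope_constant = ols_slope([2.0, 2.0, 2.0, 2.0], y)   # regressor is constant
```

With a varying regressor the slope is well defined; with a constant one the helper returns `None`, since the data carry no information about how the regressand responds to changes in the regressor.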


## [A8] No Perfect Multicollinearity 

<br>
For standard least squares estimation methods, the design matrix $\mathbf{X}$ must have full column rank $\boldsymbol{p}$, so that $\mathbf{X}^{\top}\mathbf{X}$ is invertible; otherwise the predictor variables exhibit a condition known as perfect multicollinearity.

<br>
Perfect multicollinearity arises when one predictor variable is an exact linear combination of others (the same regressor mistakenly given twice, a linear transformation of another regressor, etc.), but it can also happen if there is too little data available compared to the number of parameters to be estimated (see <b>A6</b>).

<br>
In the presence of perfect multicollinearity, the parameter vector $\boldsymbol{\beta}$ is non-identifiable (the identification problem): the least squares problem has no unique solution.
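
<br>
The identification problem can be illustrated with a minimal sketch in which one regressor is an exact multiple of another. The data, the factor `3.0` and the two coefficient vectors `beta_a` and `beta_b` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 50

x1 = rng.standard_normal(m)
x2 = 3.0 * x1                            # exact linear transformation of x1
X = np.column_stack([np.ones(m), x1, x2])

rank_X = np.linalg.matrix_rank(X)        # 2, although X has p = 3 columns

# Two different parameter vectors that produce identical fitted values,
# because 2 * x1 == -1 * x1 + 1 * (3 * x1)
beta_a = np.array([1.0, 2.0, 0.0])
beta_b = np.array([1.0, -1.0, 1.0])
same_fit = np.allclose(X @ beta_a, X @ beta_b)
```

Since `beta_a` and `beta_b` give the same fitted values, no amount of data generated this way can distinguish between them.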

## Extensions

<br>
Numerous extensions have been developed that allow some of these assumptions to be relaxed (reduced to a weaker form), and in some cases eliminated entirely. Some methods are general enough that they can relax multiple assumptions at once, and in other cases this can be achieved by combining different extensions. Generally these extensions make the estimation procedure more complex and time-consuming, and may also require more data in order to produce an equally precise model.   


## References

<br>
<ul style="list-style-type:square">
    <li>
         Queen's University at Kingston - Economics 351 - M.G. Abbott - 
         <a href="https://bit.ly/2IGh79a">
         Specification : Assumptions of the Simple Classical Linear Regression Model (CLRM) </a>         
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2GL9l8j">
        Linear regression</a>       
    </li>
</ul>
