# VIOLATION OF INDEPENDENCE OF THE ERROR TERMS

<br>

## Introduction

<br>
The assumption of independence of the disturbance terms (<b>A4</b>) states that : 

<br>
<blockquote>
The disturbance terms have zero conditional covariance across observations of the regressors
</blockquote>

<br>
<blockquote>
$
    \mathrm{Cov} 
    (\boldsymbol{\varepsilon_i},\boldsymbol{\varepsilon_s} \mid \boldsymbol{\mathbf{X}_i},\boldsymbol{\mathbf{X}_s}) 
    \ = \ 0 \quad \forall i \neq s
$
</blockquote>

<br>
When this assumption is not justified, the dependency usually appears because of a temporal component. Error terms correlated over time are said to be autocorrelated or serially correlated. 

<br>
This notebook often refers to violations of homoscedasticity: since <b>A3</b> and <b>A4</b> are the two assumptions that shape the covariance matrix of the disturbance terms, violating either one has similar consequences. 


## Consequences

<br>
Just like homoscedasticity, independence of the error terms is not required for the OLS estimates to be <b>unbiased, consistent, and asymptotically normal</b>. 

<br>
Violations of <b>A4</b> will affect our estimation however :

<br>
<ul style="list-style-type:square">
    <li>
        the OLS estimator $\hat{\boldsymbol{\beta}}_{\boldsymbol{OLS}}$ will no longer have the lowest variance in the class of
        the linear unbiased estimators (loss of efficiency)
    </li>
    <br>
    <li>
        as a direct consequence, OLS is no longer the Best Linear Unbiased Estimator of the population parameters
    </li>
    <br>
    <li>
        the OLS estimator of the variance, $\mathrm{V}(\hat{\boldsymbol{\beta}}_{\boldsymbol{OLS}})$, and thus the standard errors
        as well, may be both biased and inconsistent
    </li>
    <br>
    <li>
        biased standard errors lead to biased inference, meaning confidence intervals and the results of hypothesis tests are no
        longer reliable        
    </li>
</ul>

<br>
It's important to notice that the consequences listed above are the same as those arising from violations of homoscedasticity.


### Impact on test statistics

<br>
The F-statistic to test for overall significance of the regression may be inflated under positive serial correlation because the mean squared error (MSE) will tend to underestimate the population error variance. 

<br>
Furthermore, positive serial correlation typically causes the ordinary least squares (OLS) standard errors for the regression coefficients to underestimate the true standard errors. As a consequence, if positive serial correlation is present, standard linear regression analysis will typically produce artificially small standard errors for the regression coefficients. These small standard errors inflate the estimated t-statistics, suggesting significance where perhaps there is none. The inflated t-statistics may, in turn, lead us to reject null hypotheses about the population values of the regression parameters more often than we would if the standard errors were correctly estimated.
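
<br>
The understatement of the standard errors can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the regressor and the errors are both simulated as AR(1) processes with coefficient 0.8, and the sample size, number of replications, true coefficients, and helper names (`ar1_series`, `ols_slope_and_se`) are arbitrary choices that do not come from the original notebook.

```python
import numpy as np

def ar1_series(n, rho, rng):
    """Simulate a stationary AR(1) series with unit-variance innovations."""
    shocks = rng.standard_normal(n)
    series = np.empty(n)
    # Draw the first value from the stationary distribution.
    series[0] = shocks[0] / np.sqrt(1.0 - rho**2)
    for t in range(1, n):
        series[t] = rho * series[t - 1] + shocks[t]
    return series

def ols_slope_and_se(x, y):
    """Return the OLS slope and its textbook standard error for y = b0 + b1*x + e."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    mse = residuals @ residuals / (n - 2)          # usual estimate of the error variance
    cov = mse * np.linalg.inv(X.T @ X)             # textbook OLS covariance matrix
    return beta[1], np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
n_obs, n_reps, rho = 100, 500, 0.8                 # illustrative values (assumptions)

slopes, reported_se = [], []
for _ in range(n_reps):
    x = ar1_series(n_obs, rho, rng)                # autocorrelated regressor
    eps = ar1_series(n_obs, rho, rng)              # positively autocorrelated errors
    y = 1.0 + 2.0 * x + eps
    b1, se = ols_slope_and_se(x, y)
    slopes.append(b1)
    reported_se.append(se)

empirical_sd = np.std(slopes, ddof=1)              # actual sampling variability of the slope
average_se = np.mean(reported_se)                  # what OLS reports on average
print(f"mean slope estimate:    {np.mean(slopes):.3f}")
print(f"empirical SD of slope:  {empirical_sd:.3f}")
print(f"average OLS std. error: {average_se:.3f}")
```

In this setup the slope estimates stay centred on the true value, while the standard error reported by OLS is noticeably smaller than the spread of the estimates across replications, which is the pattern described above.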


## Autoregressive Models

<br>
A time series is a sequence of measurements of the same variable(s) made over time. Usually the measurements are made at evenly spaced times. To emphasize that we have measured values over time, we use the subscript $t$ instead of the usual $i$.

<br>
We have an autoregressive model when values from a time series are regressed on previous values from that same time series; in the regression model described by the equation below (which is just an example), the value of the response variable at the previous time period has become the main regressor : 

<br>
$
    \quad
    \boldsymbol{\mathbf{Y}_{t}} 
    \ = \ \boldsymbol{\beta_0} + \boldsymbol{\beta_1} \ \boldsymbol{\mathbf{Y}_{t-1}} + \boldsymbol{\varepsilon_{t}}
$

<br>
The order of the autoregression is the number of immediately preceding values in the series that are used to predict the value at the present time. With this definition in mind, we can say that the model in the example is a first-order autoregression, written as <b><i>AR(1)</i></b>.

<br>
More generally, a $k^\text{th}$ order autoregression, <b><i>AR(k)</i></b>, is a multiple linear regression in which the value of the series at any time $t$ is a (linear) function of the values at times $ \ t-1, t-2, \dots, t-k \ $.
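
<br>
As a minimal sketch of the first-order model above, the snippet below simulates an <i>AR(1)</i> series and then recovers the two coefficients by regressing the series on its own first lag. The parameter values, the seed, and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, beta_0, beta_1 = 2000, 1.0, 0.6      # illustrative values (assumptions)

# Simulate Y_t = beta_0 + beta_1 * Y_{t-1} + eps_t
y = np.empty(n_obs)
y[0] = beta_0 / (1.0 - beta_1)              # start at the stationary mean
for t in range(1, n_obs):
    y[t] = beta_0 + beta_1 * y[t - 1] + rng.standard_normal()

# Regress Y_t on an intercept and Y_{t-1}
response = y[1:]
design = np.column_stack([np.ones(n_obs - 1), y[:-1]])
estimates, *_ = np.linalg.lstsq(design, response, rcond=None)

print(f"estimated intercept: {estimates[0]:.3f}")
print(f"estimated AR(1) coefficient: {estimates[1]:.3f}")
```

An <i>AR(k)</i> model would be fitted the same way, with one additional column in the design matrix for each extra lag.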

### Autocorrelation and Partial Autocorrelation

<br>
The coefficient of correlation between two values in a time series is called the <b>autocorrelation function (ACF)</b>, and is a way to measure the linear relationship between an observation at time $t$ and the observations at previous times :

<br>
$
    \quad
    ACF \ = \ \mathrm{Corr}(\boldsymbol{\mathbf{Y}_t}, \boldsymbol{\mathbf{Y}_{t-k}})
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
    [\textbf{E1}] 
$

<br>
Here, the value of $k$ represents the time gap being considered and is called the <b>lag</b>. In general, a lag-$k$ autocorrelation is the coefficient of correlation between values that are $k$ time periods apart.
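
<br>
The sample version of <b>E1</b> can be computed directly. The function below is a sketch that uses the common convention of dividing the lag-$k$ cross-product of the centered series by the total sum of squares; the name `sample_acf` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    total = centered @ centered                    # sum of squared deviations
    acf = [1.0]                                    # lag 0 is always 1
    for k in range(1, max_lag + 1):
        # Cross-product between the series and its copy shifted by k periods.
        acf.append((centered[k:] @ centered[:-k]) / total)
    return np.array(acf)

# Tiny worked example: deviations are [-2, -1, 0, 1, 2], sum of squares is 10.
acf = sample_acf([1, 2, 3, 4, 5], max_lag=2)
print(acf)   # lag 1: 4/10 = 0.4, lag 2: -1/10 = -0.1
```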

<br>
If we assume an <i>AR(k)</i> model, then we may wish to only measure the association between $\boldsymbol{\mathbf{Y}_t}$ and $\boldsymbol{\mathbf{Y}_{t-k}}$ and filter out the linear influence of the random variables that lie in between (i.e. $ \ \boldsymbol{\mathbf{Y}_{t-1}}, \boldsymbol{\mathbf{Y}_{t-2}}, \dots, \boldsymbol{\mathbf{Y}_{t-(k-1)}} \ $), which requires a transformation of the time series. By calculating the correlation of the transformed time series we then obtain the <b>partial autocorrelation function (PACF)</b>.

<br>
The PACF is most useful for identifying the order of an autoregressive model. Specifically, sample partial autocorrelations that are significantly different from 0 indicate lagged terms of $\mathbf{Y}$ that are useful predictors of $\boldsymbol{\mathbf{Y}_t}$. To help differentiate between ACF and PACF, think of them as analogues to $R^2$ and partial $R^2$ values as discussed previously.
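
<br>
One way to compute sample partial autocorrelations is by regression: the lag-$k$ value is the coefficient on $\mathbf{Y}_{t-k}$ when $\mathbf{Y}_t$ is regressed on its first $k$ lags. The sketch below follows that approach; `sample_pacf` is a hypothetical helper, the simulated series is an assumption made for illustration, and the $\pm 1.96 / \sqrt{n}$ band is the usual large-sample rule of thumb rather than an exact test.

```python
import numpy as np

def sample_pacf(series, max_lag):
    """Partial autocorrelations for lags 0..max_lag via successive OLS fits."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    pacf = [1.0]
    for k in range(1, max_lag + 1):
        response = series[k:]
        # Column j holds Y_{t-j} aligned with the response Y_t, for j = 1..k.
        lags = [series[k - j : n - j] for j in range(1, k + 1)]
        design = np.column_stack([np.ones(n - k)] + lags)
        coefs, *_ = np.linalg.lstsq(design, response, rcond=None)
        pacf.append(coefs[-1])                     # coefficient on the longest lag
    return np.array(pacf)

# Simulated AR(1) series with coefficient 0.7 (illustrative assumption).
rng = np.random.default_rng(1)
n_obs = 3000
y = np.zeros(n_obs)
for t in range(1, n_obs):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

pacf = sample_pacf(y, max_lag=3)
band = 1.96 / np.sqrt(n_obs)                       # rough significance band
print("PACF at lags 1-3:", np.round(pacf[1:], 3))
print("approximate 95% band: +/-", round(band, 3))
```

For a series generated by an <i>AR(1)</i> process, the lag-1 partial autocorrelation should be close to the autoregressive coefficient, while the values at higher lags should be close to zero.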

### Assessing the lag

<br>
Graphical approaches to assessing the lag of an autoregressive model include looking at the ACF and PACF values versus the lag. 

<br>
In a plot of ACF versus the lag, if you see large ACF values and a non-random pattern, then likely the values are serially correlated. In a plot of PACF versus the lag, the pattern will usually appear random, but large PACF values at a given lag indicate this value as a possible choice for the order of an autoregressive model. 

<br>
It is important that the choice of the order makes sense. For example, suppose you have blood pressure readings for every day over the past two years. You may find that an <i>AR(1)</i> or <i>AR(2)</i> model is appropriate for modeling blood pressure. However, the PACF may indicate a large partial autocorrelation value at a lag of 17, but such a large order for an autoregressive model likely does not make much sense.



### Regression with Autoregressive Errors

<br>
The difficulty that often arises in the context of autoregressive models is that the disturbance terms may be correlated with each other. In other words, we have autocorrelation (or dependency) between the errors.

<br>
We may consider situations in which the error at one specific time is linearly related to the error at the previous time. That is, the errors themselves follow a simple linear regression model that can be written as :

<br>
$
    \quad
    \boldsymbol{\varepsilon_{t}} \ = \ \boldsymbol{\rho} \ \boldsymbol{\varepsilon_{t-1}} + \boldsymbol{\omega_{t}}
    \qquad \qquad \text{where } \mid \ \boldsymbol{\rho} \mid \ < 1
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
    \qquad \qquad \qquad \qquad \qquad \qquad \quad \
    [\textbf{E2}] 
$

<br>
Here, $\boldsymbol{\rho}$ is called the <b>autocorrelation parameter</b> and $ \ \boldsymbol{\omega_{t}} \ $ is a new error term that follows the usual assumptions we make about disturbance terms : $ \ \boldsymbol{\omega_{t}} \sim IID \ \textit{N}(0,\boldsymbol{\sigma^2}) \ $. This model says that the error at time $t$ can be predicted from a fraction of the error at the previous time period plus some new random perturbation.
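
<br>
As a short worked consequence of <b>E2</b> (a sketch, assuming the error process is stationary and that $\omega_t$ is uncorrelated with past errors), the variance and the lag-$k$ correlation of the errors follow directly :

<br>
$
    \quad
    \begin{align}
        \mathrm{Var}(\varepsilon_t) 
        &= \rho^2 \ \mathrm{Var}(\varepsilon_{t-1}) + \sigma^2
        \quad \Longrightarrow \quad
        \mathrm{Var}(\varepsilon_t) = \dfrac{\sigma^2}{1 - \rho^2}
        \newline
        \mathrm{Cov}(\varepsilon_t, \varepsilon_{t-k}) 
        &= \rho^k \ \mathrm{Var}(\varepsilon_{t-k}) = \dfrac{\rho^k \sigma^2}{1 - \rho^2}
        \newline
        \mathrm{Corr}(\varepsilon_t, \varepsilon_{t-k}) 
        &= \rho^k
    \end{align}
$

<br>
The correlation between errors therefore decays geometrically with the lag but is never exactly zero when $\rho \neq 0$, so the off-diagonal entries of the error covariance matrix are non-zero and <b>A4</b> fails.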

<br>
We can use partial autocorrelation function (PACF) plots to help us assess appropriate lags for the errors in a regression model with autoregressive errors. Specifically, we first fit a multiple linear regression model to our time series data and store the residuals. Then we can look at a plot of the PACF for the residuals versus the lag. Large sample partial autocorrelations that are significantly different from 0 indicate lagged terms of $\boldsymbol{\varepsilon}$ that may be useful predictors of $\boldsymbol{\varepsilon_{t}}$.

<br>
There are several different methods for estimating the regression parameters when we have errors with an autoregressive structure and we will introduce a few of these methods later in this notebook.


## Detection

<br>
The easiest way to assess the presence of dependency is by producing a scatterplot of the residuals versus the time measurement for that observation, assuming the data are arranged according to a time sequence order. If the data are independent, then the residuals should look randomly scattered about zero; however, if a noticeable pattern emerges (particularly one that is cyclical) then dependency is likely an issue.

<br>
If we suspect first-order autocorrelation with the errors, then a formal test does exist regarding the autocorrelation parameter $\boldsymbol{\rho}$.

<br>
In statistics, the <b>Durbin – Watson statistic</b> is a test statistic used to detect the presence of autocorrelation in the residuals from a regression analysis. Durbin and Watson applied this statistic to the residuals from least squares regressions, and developed bounds tests for the null hypothesis that the errors are serially uncorrelated against the alternative that they follow a first order autoregressive process. 

Formally, the Durbin - Watson test is constructed as :

<br>
$
    \quad
    \begin{align} 
        H_{0} \ &: \ \boldsymbol{\rho} = 0 
        \newline
        H_{A} \ &: \ \boldsymbol{\rho} \neq 0
    \end{align}    
$

<br>
$
    \quad
    \boldsymbol{d} \ = \
        \dfrac 
            { \sum _{t=2}^{m} (\boldsymbol{e_{t}} - \boldsymbol{e_{t-1}})^\boldsymbol{2} } 
            { \sum _{t=1}^{m} \boldsymbol{e_{t}^2} } 
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \
    [\textbf{E3}] 
$

<br>
Under the null hypothesis $ H_{0} $ we have $ \ \boldsymbol{\rho} = 0 \ $ and therefore $ \ \boldsymbol{\varepsilon_{t}} = \boldsymbol{\omega_{t}} \ $; in other words, the error term in one period is not correlated with the error term in the previous period. The alternative hypothesis $ H_{A} $ means the error term in one period is either positively or negatively correlated with the error term in the previous period. In <b>E3</b>, $\boldsymbol{e_{t}}$ is the residual at time $t$ and $m$ is the number of observations.

<br>
The value of $\boldsymbol{d}$ always lies between 0 and 4; since $\boldsymbol{d}$ is approximately equal to $ \ 2(1 - r) \ $ (where $r$ is the sample autocorrelation of the residuals), $ \boldsymbol{d} = 2 \ $ indicates no autocorrelation. 
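
<br>
The approximation can be seen by expanding the numerator of <b>E3</b>. The following is a sketch of the argument, taking $ r = \sum_{t=2}^{m} e_t e_{t-1} \ / \ \sum_{t=1}^{m} e_t^2 $ as the lag-1 sample autocorrelation of the residuals :

<br>
$
    \quad
    \begin{align}
        \boldsymbol{d} 
        &= \dfrac
            { \sum_{t=2}^{m} e_t^2 + \sum_{t=2}^{m} e_{t-1}^2 - 2 \sum_{t=2}^{m} e_t e_{t-1} }
            { \sum_{t=1}^{m} e_t^2 }
        \newline
        &\approx \dfrac
            { 2 \sum_{t=1}^{m} e_t^2 - 2 \sum_{t=2}^{m} e_t e_{t-1} }
            { \sum_{t=1}^{m} e_t^2 }
        \ = \ 2 (1 - r)
    \end{align}
$

<br>
The approximation holds because each of the first two sums in the numerator differs from $ \sum_{t=1}^{m} e_t^2 $ by a single term, which is negligible when $m$ is large.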

If the Durbin – Watson statistic is substantially less than 2, there is evidence of positive serial correlation. As a rough rule of thumb, if the test statistic is less than 1, there may be cause for alarm. Small values of $\boldsymbol{d}$ indicate successive error terms are, on average, close in value to one another, or positively correlated. If $\boldsymbol{d} > 2$, successive error terms are, on average, much different in value from one another, i.e., negatively correlated.
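
<br>
The statistic in <b>E3</b> is straightforward to compute from a vector of residuals. The function below is a minimal sketch (the name `durbin_watson` is chosen here for illustration; the statsmodels library provides a function with the same name in `statsmodels.stats.stattools`), applied to two tiny hand-made residual vectors.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Residuals that flip sign every period: successive values are far apart.
d_alternating = durbin_watson([1, -1, 1, -1])    # (4 + 4 + 4) / 4 = 3.0

# Residuals that stay on the same side for a while: successive values are close.
d_persistent = durbin_watson([1, 1, -1, -1])     # (0 + 4 + 0) / 4 = 1.0

print(d_alternating, d_persistent)
```

The alternating pattern gives a value above 2, as expected under negative serial correlation, while the persistent pattern gives a value below 2, as expected under positive serial correlation.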

<br>
<b>Positive serial correlation</b> is serial correlation in which a positive error for one observation increases the chances of a positive error for another observation. To test for positive autocorrelation at significance $\mathbf{\alpha}$, the test statistic $\boldsymbol{d}$ is compared to lower and upper critical values $\boldsymbol{d_L(\alpha)}$ and $\boldsymbol{d_U(\alpha)}$ :

<br>
<ul style="list-style-type:square">
    <li>
        if $ \ \boldsymbol{d} < \boldsymbol{d_L(\alpha)} \ $, there is statistical evidence that the error terms are positively
        autocorrelated
    </li>
    <br>
    <li>
        if $ \ \boldsymbol{d} > \boldsymbol{d_U(\alpha)} \ $, there is no statistical evidence that the error terms are
        positively autocorrelated
    </li>
    <br>
    <li>
        if $ \ \boldsymbol{d_L(\alpha)} < \boldsymbol{d} < \boldsymbol{d_U(\alpha)} \ $, the test is inconclusive      
    </li>
</ul>

<br>
<b>Negative serial correlation</b> implies that a positive error for one observation increases the chance of a negative error for another observation, and vice versa. To test for negative autocorrelation at significance $\mathbf{\alpha}$, the test statistic $(4 - \boldsymbol{d})$ is compared to lower and upper critical values $\boldsymbol{d_L(\alpha)}$ and $\boldsymbol{d_U(\alpha)}$ :

<br>
<ul style="list-style-type:square">
    <li>
        if $ \ (4 − \boldsymbol{d}) < \boldsymbol{d_L(\alpha)} \ $, there is statistical evidence that the error terms are
        negatively autocorrelated
    </li>
    <br>
    <li>
        if $ \ (4 − \boldsymbol{d}) > \boldsymbol{d_U(\alpha)} \ $, there is no statistical evidence that the error terms are
        negatively autocorrelated
    </li>
    <br>
    <li>
        if $ \ \boldsymbol{d_L(\alpha)} < (4 − \boldsymbol{d}) < \boldsymbol{d_U(\alpha)} \ $, the test is inconclusive   
    </li>
</ul>

<br>
The critical values $\boldsymbol{d_L(\alpha)}$ and $\boldsymbol{d_U(\alpha)}$ vary by level of significance ($\alpha$), the number of observations, and the number of predictors in the regression equation. Their derivation is complex, and statisticians typically obtain these two values from statistical texts.
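
<br>
The two decision rules above can be sketched as a single helper. The function name and its string return values are hypothetical, and the critical values used in the usage example are placeholders chosen only to exercise the three branches: real values of $d_L(\alpha)$ and $d_U(\alpha)$ must be looked up for the actual sample size and number of predictors.

```python
def durbin_watson_decision(d, d_lower, d_upper, alternative="positive"):
    """Apply the Durbin-Watson bounds test for positive or negative autocorrelation."""
    if not 0.0 <= d <= 4.0:
        raise ValueError("the Durbin-Watson statistic must lie between 0 and 4")
    if alternative == "positive":
        statistic = d
    elif alternative == "negative":
        statistic = 4.0 - d                 # mirror the statistic around 2
    else:
        raise ValueError("alternative must be 'positive' or 'negative'")

    if statistic < d_lower:
        return f"evidence of {alternative} autocorrelation"
    if statistic > d_upper:
        return f"no evidence of {alternative} autocorrelation"
    return "inconclusive"

# Placeholder critical values, NOT taken from a published table.
D_LOWER, D_UPPER = 1.2, 1.4

print(durbin_watson_decision(0.9, D_LOWER, D_UPPER))               # d below d_L
print(durbin_watson_decision(1.3, D_LOWER, D_UPPER))               # d between the bounds
print(durbin_watson_decision(3.1, D_LOWER, D_UPPER, "negative"))   # 4 - d = 0.9, below d_L
```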


## Correction

<br>
If the Durbin – Watson statistic indicates the presence of serial correlation of the residuals (and consequently of the error terms), one of the first remedial measures should be to investigate the omission of a key predictor variable. 

<br>
If such a predictor does not aid in reducing or eliminating the autocorrelation, then certain transformations on the variables can be performed.



### Cochrane – Orcutt Estimation

<br>Cochrane–Orcutt estimation is a procedure which adjusts a linear model for serial correlation in the error term. Consider the model : 

<br>
$
    \quad
    \boldsymbol{\mathbf{Y}_t} \ = \
        \boldsymbol{\beta_0} + \boldsymbol{\beta_1} \ \boldsymbol{\mathbf{X}_t} + \boldsymbol{\varepsilon_t}
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad  
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
    [\textbf{E4}] 
$


<br>
If the process generating the error terms is found to be a stationary first-order autoregressive structure ($ \boldsymbol{\varepsilon_{t}} = \boldsymbol{\rho} \ \boldsymbol{\varepsilon_{t-1}} + \boldsymbol{\omega_{t}} $ with the errors $\boldsymbol{\omega_{t}}$ being white noise), then the Cochrane – Orcutt procedure transforms the model by taking a quasi-difference :

<br>
$
    \quad
    \begin{align}
        \boldsymbol{\mathbf{Y}_t} - \boldsymbol{\rho} \ \boldsymbol{\mathbf{Y}_{t-1}} 
        &= 
            ( 
                      \boldsymbol{\beta_0}
                    + \boldsymbol{\beta_1} \ \boldsymbol{\mathbf{X}_t} 
                    + \boldsymbol{\varepsilon_t} 
            ) 
            - \boldsymbol{\rho} (
                      \boldsymbol{\beta_0} 
                    + \boldsymbol{\beta_1} \ \boldsymbol{\mathbf{X}_{t-1}} 
                    + \boldsymbol{\varepsilon_{t-1}}
            )
            & \qquad \qquad \text{since } \textbf{E4}
        \newline
        &= 
              \boldsymbol{\beta_0} (1 - \boldsymbol{\rho})
            + \boldsymbol{\beta_1} (\boldsymbol{\mathbf{X}_t} - \boldsymbol{\rho} \ \boldsymbol{\mathbf{X}_{t-1}})
            + (\boldsymbol{\varepsilon_t} - \boldsymbol{\rho} \ \boldsymbol{\varepsilon_{t-1}})
        \newline
        &= 
              \boldsymbol{\beta_0} (1 - \boldsymbol{\rho})
            + \boldsymbol{\beta_1} (\boldsymbol{\mathbf{X}_t} - \boldsymbol{\rho} \ \boldsymbol{\mathbf{X}_{t-1}})
            + \boldsymbol{\omega_{t}}
            & \qquad \qquad \text{since } \textbf{E2}
    \end{align}
$

$
    \quad
    \boldsymbol{ {\mathbf{Y}_t}^* } =
          \boldsymbol{ {\beta_0}^* } 
        + \boldsymbol{ {\beta_1}^* } \ \boldsymbol{ {\mathbf{X}_t}^* } 
        + \boldsymbol{ {\varepsilon_t}^* }  
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad 
    \qquad \qquad \qquad \qquad \qquad \qquad  \qquad \qquad \qquad
    [\textbf{E5}] 
$


<br>
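Here $ \ \boldsymbol{ {\mathbf{Y}_t}^* } = \boldsymbol{\mathbf{Y}_t} - \boldsymbol{\rho} \ \boldsymbol{\mathbf{Y}_{t-1}} \ $, $ \ \boldsymbol{ {\mathbf{X}_t}^* } = \boldsymbol{\mathbf{X}_t} - \boldsymbol{\rho} \ \boldsymbol{\mathbf{X}_{t-1}} \ $, $ \ \boldsymbol{ {\beta_0}^* } = \boldsymbol{\beta_0} (1 - \boldsymbol{\rho}) \ $, $ \ \boldsymbol{ {\beta_1}^* } = \boldsymbol{\beta_1} \ $ and $ \ \boldsymbol{ {\varepsilon_t}^* } = \boldsymbol{\omega_{t}} \ $, so the transformed model has uncorrelated errors.

<br>
The procedure can be described as an iterative process, consisting of the following steps :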

<br>
<ul style="list-style-type:square">
    <li>
        estimate the autocorrelation parameter $\boldsymbol{\rho}$ : if $\boldsymbol{\rho}$ is not known, then it can be
        estimated by first regressing the original model and obtaining the residuals $\boldsymbol{e}$, then regressing
        $\boldsymbol{e_{t}}$ on $\boldsymbol{e_{t-1}}$ to compute an estimate of the autocorrelation parameter which is often
        called $\boldsymbol{r}$
    </li>
    <br>
    <li>
        transform the original model <b>E4</b> into <b>E5</b> 
    </li>
    <br>
    <li>
        regress $\boldsymbol{ {\mathbf{Y}_t}^* }$ on the transformed predictors using ordinary least squares to obtain estimates
        of the transformed parameters $ \boldsymbol{ {\beta_0}^* }, \dots, \boldsymbol{ {\beta_{p-1}}^* }$
    </li>
    <br>
    <li>
        examine the current regression residuals and determine if autocorrelation is still present (using the Durbin - Watson
        test for example). If autocorrelation is still present, then iterate this procedure.
    </li>
    <br>
    <li>
        if autocorrelation appears to be corrected, then we can transform the estimated parameters back to their original scale
        by setting <br><br>
        $ 
            \quad
            \begin{align}
                \boldsymbol{\hat{\beta_0}} &= \boldsymbol{\hat{\beta_0}^*} / \ (1 - r)
                \newline
                \boldsymbol{\hat{\beta_j}} &= 
                    \boldsymbol{\hat{\beta_j}^*} \quad \forall j \quad \text{(j = 1,} \dots \text{, p - 1)}
            \end{align}
        $ <br><br>       
        Notice that only the intercept parameter requires a transformation. Furthermore, the standard errors of the regression
        estimates for the original scale can also be obtained by setting <br><br>
        $
            \quad
            \begin{align}
                SE(\boldsymbol{\hat{\beta_0}}) &= \ SE(\boldsymbol{\hat{\beta_0}^*}) \ / \ (1 - r) 
                \newline
                SE(\boldsymbol{\hat{\beta_j}}) &= \
                    SE(\boldsymbol{\hat{\beta_j}^*}) \quad \forall j \quad \text{(j = 1,} \dots \text{, p - 1)}
            \end{align}
        $
    </li>
</ul>
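
<br>
The steps above can be sketched for the single-predictor model <b>E4</b> as follows. This is an illustrative implementation under stated assumptions, not a reference one: the function names `ols` and `cochrane_orcutt` are hypothetical, $r$ is estimated by regressing $e_t$ on $e_{t-1}$ without an intercept, and the loop stops when $r$ changes by less than a tolerance instead of re-running a Durbin - Watson test at each pass. The simulated data (true $\rho = 0.7$, $\beta_0 = 1$, $\beta_1 = 2$) are also assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cochrane_orcutt(x, y, max_iter=50, tol=1e-6):
    """Iterated Cochrane-Orcutt for y_t = b0 + b1*x_t + eps_t with AR(1) errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta = ols(X, y)                              # initial fit of the original model
    r = 0.0
    for _ in range(max_iter):
        e = y - X @ beta                          # residuals on the original scale
        # Step 1: regress e_t on e_{t-1} (no intercept) to estimate rho.
        r_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
        # Step 2: quasi-difference the response and the predictor.
        y_star = y[1:] - r_new * y[:-1]
        x_star = x[1:] - r_new * x[:-1]
        # Step 3: OLS on the transformed model.
        beta_star = ols(np.column_stack([np.ones(n - 1), x_star]), y_star)
        # Step 5: only the intercept needs to be transformed back.
        beta = np.array([beta_star[0] / (1.0 - r_new), beta_star[1]])
        converged = abs(r_new - r) < tol
        r = r_new
        if converged:                             # Step 4 (simplified stopping rule)
            break
    return beta, r

# Simulated data with AR(1) errors (illustrative assumption).
rng = np.random.default_rng(7)
n_obs, rho = 1000, 0.7
x = rng.standard_normal(n_obs)
eps = np.empty(n_obs)
eps[0] = rng.standard_normal() / np.sqrt(1.0 - rho**2)
for t in range(1, n_obs):
    eps[t] = rho * eps[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + eps

beta_hat, r_hat = cochrane_orcutt(x, y)
print("estimated coefficients:", np.round(beta_hat, 3))
print("estimated rho:", round(r_hat, 3))
```

Because one observation is lost in the quasi-difference, the transformed regression is fitted on $m - 1$ observations.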

<br>
One caveat of the Cochrane - Orcutt approach is that it does not always work well: when the errors are positively autocorrelated, the estimated autocorrelation coefficient $\boldsymbol{r}$ tends to underestimate $\boldsymbol{\rho}$. When this bias is large, it can substantially reduce the effectiveness of the procedure.


## <font color='#28B463'>References</font>

<br>
<ul style="list-style-type:square">
    <li>
        PennState University -  Stat 501 - 
        <a href="https://bit.ly/2kozxN2">
        Lesson 14 : Time Series and Autocorrelation</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2sa3G6B">
        Cochrane Orcutt estimation</a>
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2xEofZR">
        Durbin Watson statistic</a>
    </li>    
</ul>