# VIOLATION OF HOMOSCEDASTICITY

<br>

## Introduction

<br>
The assumption of homoscedasticity (<b>A3</b>) states that:

<br>
<blockquote>
The conditional variances of the disturbance terms $\boldsymbol{\varepsilon_i}$ are identical for all observations (all population values $\boldsymbol{\mathbf{X}_i}$) and equal the same finite positive (unknown) constant $\boldsymbol{\sigma^2}$.
</blockquote>

<br>
<blockquote>
$
    \mathrm{Var} (\boldsymbol{\varepsilon_i} \mid \boldsymbol{\mathbf{X}_i})
    \ = \ \boldsymbol{\sigma^2} > 0 \quad \forall i \quad (i = 1, \dots, m)
$    
</blockquote>

<br>
A collection of random variables is said to be heteroscedastic (or to present heteroscedasticity) if there are sub-populations having different variabilities from others; in this context variability could be quantified by the variance or any other measure of statistical dispersion. Thus <b>heteroscedasticity</b> can be defined as the violation of homoscedasticity.
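
<br>
As a minimal illustration (a sketch on simulated data, not part of the original analysis), the snippet below draws one homoscedastic and one heteroscedastic set of disturbances and compares their variance in two sub-populations. The regressor `x`, the split at 5, and the standard-deviation function `0.5 + 0.5 * x` are assumptions chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000
x = rng.uniform(0.0, 10.0, size=m)            # hypothetical regressor

# Homoscedastic disturbances: the standard deviation is 2 for every observation
eps_homo = rng.normal(0.0, 2.0, size=m)
# Heteroscedastic disturbances: the standard deviation grows with x (assumed form)
eps_hetero = rng.normal(0.0, 1.0, size=m) * (0.5 + 0.5 * x)

low, high = x < 5.0, x >= 5.0                  # two sub-populations


def variance_ratio(eps):
    """Variance in the high-x group divided by variance in the low-x group."""
    return eps[high].var() / eps[low].var()


ratio_homo = variance_ratio(eps_homo)
ratio_hetero = variance_ratio(eps_hetero)
print(f"homoscedastic ratio:   {ratio_homo:.2f}")
print(f"heteroscedastic ratio: {ratio_hetero:.2f}")
```

Under homoscedasticity the ratio should stay close to 1, while under the assumed heteroscedastic design it is well above 1.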

## Consequences

<br>
Homoscedasticity is not required for the estimates to be <b>unbiased, consistent, and asymptotically normal</b>. 

<br>
Violations of the assumption of homoscedasticity do, however, affect our estimation:

<br>
<ul style="list-style-type:square">
    <li>
        the OLS estimator $\hat{\boldsymbol{\beta}}_{\boldsymbol{OLS}}$ will no longer have the lowest variance in the class of
        linear unbiased estimators (loss of efficiency)
    </li>
    <br>
    <li>
        as a direct consequence, OLS is no longer the Best Linear Unbiased Estimator of the population parameters
    </li>
    <br>
    <li>
        the usual OLS estimator of the covariance matrix $\mathrm{V}(\hat{\boldsymbol{\beta}}_{\boldsymbol{OLS}})$, and thus the
        standard errors as well, may be both biased and inconsistent
    </li>
    <br>
    <li>
        biased standard errors lead to biased inference, meaning confidence intervals and the results of hypothesis tests are no
        longer reliable        
    </li>
</ul>
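
<br>
The third point can be made explicit with a short derivation. The matrix notation is an assumption of this sketch and is not otherwise used in this notebook: $\mathbf{X}$ is the $m \times k$ regressor matrix and $\boldsymbol{\Omega}$ the conditional covariance matrix of the disturbances, taken to be diagonal (no autocorrelation).

$$
\mathrm{V}(\hat{\boldsymbol{\beta}}_{\boldsymbol{OLS}} \mid \mathbf{X})
\ = \ (\mathbf{X}'\mathbf{X})^{-1} \, \mathbf{X}' \boldsymbol{\Omega} \mathbf{X} \, (\mathbf{X}'\mathbf{X})^{-1},
\qquad
\boldsymbol{\Omega} = \mathrm{diag}(\sigma_1^2, \dots, \sigma_m^2)
$$

Only when $\boldsymbol{\Omega} = \sigma^2 \mathbf{I}$ does this collapse to $\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}$, the expression on which the conventional OLS standard errors rely. When the $\sigma_i^2$ differ across observations, the conventional formula estimates the wrong quantity.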


## Detection

<br>
There are several methods to test for the presence of heteroscedasticity. Each of the tests below consists of a test statistic (a mathematical expression yielding a numerical value as a function of the data), a hypothesis to be tested (the null hypothesis), an alternative hypothesis, and a statement about the distribution of the statistic under the null hypothesis.

<br>
<ul style="list-style-type:square">
    <li> 
        <a href="https://bit.ly/2s78aKU">Levene's test</a> 
    </li>
    <br> 
    <li> 
        <a href="https://bit.ly/2ILCrGl">Goldfeld – Quandt test</a>
    </li>
    <br>
    <li> 
        <a href="https://bit.ly/2klUkkx">Park test</a>
    </li>
    <br>
    <li> 
        <a href="https://bit.ly/2GMDrs3">Glejser test</a>
    </li>
    <br>
    <li> 
        <a href="https://bit.ly/2koRv2c">Brown – Forsythe test</a>
    </li>
    <br>
    <li> Harrison – McCabe test </li>
    <br>
    <li> 
        <a href="https://bit.ly/2IGYDFv">Breusch – Pagan test</a>
    </li>
    <br>
    <li> 
        <a href="https://bit.ly/2s6LvPU">White test</a>
    </li>
    <br>
    <li> Cook – Weisberg test </li>
</ul>
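
<br>
To make the structure of such a test concrete, here is a minimal NumPy sketch of the Breusch – Pagan idea in its studentized (Koenker) form: regress the squared OLS residuals on the regressors and use $m \cdot R^2$ of that auxiliary regression as the statistic. The simulated data, the helper names `ols` and `breusch_pagan_lm`, and the choice of a single regressor are assumptions made for illustration.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients and residuals; X must contain a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


def breusch_pagan_lm(X, y):
    """LM statistic: m * R^2 from regressing the squared residuals on X."""
    _, resid = ols(X, y)
    u2 = resid ** 2
    _, aux_resid = ols(X, u2)
    r2 = 1.0 - (aux_resid @ aux_resid) / np.sum((u2 - u2.mean()) ** 2)
    return len(y) * r2


rng = np.random.default_rng(0)
m = 500
x = rng.uniform(1.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])

# Same mean function, two different disturbance structures (assumed designs)
y_homo = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=m)
y_hetero = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=m) * x   # sd proportional to x

lm_homo = breusch_pagan_lm(X, y_homo)
lm_hetero = breusch_pagan_lm(X, y_hetero)

# Under the null, LM is asymptotically chi-squared with 1 degree of freedom here
# (one non-constant regressor); 3.841 is its 95% quantile.
CHI2_95_DF1 = 3.841
print(f"LM (homoscedastic data):   {lm_homo:.2f}")
print(f"LM (heteroscedastic data): {lm_hetero:.2f}, reject: {lm_hetero > CHI2_95_DF1}")
```

In practice a library routine is usually preferable; statsmodels, for example, provides `het_breuschpagan` in `statsmodels.stats.diagnostic`, which also returns p-values.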

## Residuals-vs-Fits plot

<br>
Heteroscedasticity can also be detected informally by looking at the <b>residuals-vs-fits plot</b> or, in the case of time series data, at the <b>residuals-vs-time plot</b>. To be thorough, we should also plot the residuals against each independent variable and check for a consistent spread there as well.

<br>
When the residuals roughly form a sort of "horizontal band" around the zero horizontal line, this suggests that the error terms have a constant variance regardless of the value of the regressors. 


In [1]:
# linear_regression appears to be a local helper module that accompanies this notebook
import linear_regression as lr
lr.plot_residuals_vs_fits(cmd='homoscedasticity_correct')


<br>
Be alert for evidence of residuals that grow larger either as a function of the predicted value or as a function of time. The assumption of homoscedasticity is not justified when the spread of the residuals in the residuals-vs-fits plot varies systematically. Two common cases are:

<br>
<ul style="list-style-type:square">
    <li>
         a <b>fanning</b> effect: the residuals have a certain variance for small values of the fitted variable, but this
         variance increases as the fitted values grow (the residuals are more spread out for large fitted values)
    </li>
    <br>
    <li>
        a <b>funneling</b> effect: the opposite of the fanning effect, with the spread of the residuals shrinking as the fitted values grow
    </li>
</ul>


In [None]:
lr.plot_residuals_vs_fits(cmd='homoscedasticity_violation')
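
<br>
A numerical counterpart to reading the plot is to compare the spread of the residuals across bins of the fitted values. The sketch below does this on simulated data with a fanning pattern; the helper `residual_spread_by_bin`, the number of bins, and the data-generating process are assumptions for illustration and are not part of the `linear_regression` module used above.

```python
import numpy as np


def residual_spread_by_bin(fitted, resid, n_bins=4):
    """Standard deviation of the residuals within equal-count bins of the fitted values."""
    order = np.argsort(fitted)                  # indices sorted by fitted value
    bins = np.array_split(order, n_bins)        # consecutive groups of indices
    return np.array([resid[idx].std(ddof=1) for idx in bins])


rng = np.random.default_rng(1)
m = 400
x = rng.uniform(1.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])
# Assumed fanning design: the disturbance standard deviation is proportional to x
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=m) * x

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

spread = residual_spread_by_bin(fitted, resid)
print("residual std by fitted-value bin (low to high):", np.round(spread, 2))
```

A spread that rises from the first bin to the last corresponds to fanning, a falling spread to funneling, and a roughly constant spread to the "horizontal band" of the homoscedastic case.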


## Correction

<br>
There are four common corrections for heteroscedasticity:

<br>
<ul style="list-style-type:square">
    <li>
         use <b>log-transformed data</b>; series that grow exponentially often appear, in levels, to have increasing
         variability, random volatility, or volatility clusters as the series rises over time. The variability in percentage
         terms may, however, be rather stable.
    </li>
    <br>
    <li>
        use a <b>different specification</b> for the model (different independent variables, or non-linear transformations of
        the latter)
    </li>
    <br>
    <li>
        apply a <b>weighted least squares</b> estimation method, in which OLS is applied to a transformed version of the
        original regression equation (each observation is weighted by the inverse of its error variance)
    </li>
    <br>
    <li>
        use <b>heteroscedasticity-consistent standard errors (HCSE)</b>; while still biased in finite samples, they improve
        upon the conventional OLS standard errors. HCSE is a consistent estimator of the standard errors in regression models
        with heteroscedasticity. This method corrects the inference for heteroscedasticity (when present) without altering the
        values of the coefficients, and when the data are actually homoscedastic it returns standard errors that are, in large
        samples, close to the conventional OLS ones.
    </li>
</ul>
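
<br>
The last correction can be sketched directly in NumPy. The snippet computes the conventional OLS standard errors and White's heteroscedasticity-consistent ones (the variant usually labelled HC0) from the same fit. The helper name `ols_with_standard_errors` and the simulated design, in which the disturbance standard deviation grows with the square of `x`, are assumptions for illustration.

```python
import numpy as np


def ols_with_standard_errors(X, y):
    """Return OLS coefficients with conventional and HC0 (White) standard errors."""
    m, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Conventional estimator: s^2 (X'X)^{-1}, valid under homoscedasticity
    s2 = (resid @ resid) / (m - k)
    se_ols = np.sqrt(np.diag(s2 * xtx_inv))

    # HC0 "sandwich": (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
    meat = (X * resid[:, None] ** 2).T @ X
    se_hc0 = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_ols, se_hc0


rng = np.random.default_rng(7)
m = 2000
x = rng.uniform(1.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])
# Assumed design: disturbance standard deviation proportional to x squared
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=m) * 0.2 * x ** 2

beta, se_ols, se_hc0 = ols_with_standard_errors(X, y)
print("coefficients:      ", np.round(beta, 3))
print("conventional s.e.: ", np.round(se_ols, 4))
print("HC0 s.e.:          ", np.round(se_hc0, 4))
```

The coefficients are identical in both cases; only the standard errors change. In this design the conventional formula understates the uncertainty of the slope. Library implementations exist as well, for example the `cov_type` argument of `fit` in statsmodels.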


## Considerations

<br>
Weighted least squares estimates of the coefficients will usually be nearly the same as the "ordinary" (unweighted) estimates. In cases where they differ substantially, the procedure can be iterated until estimated coefficients stabilize (often in no more than one or two iterations); this is called <b>iteratively reweighted least squares (IRLS)</b>.
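
<br>
A rough sketch of this iteration is given below. Weighted least squares is implemented as OLS on the equation rescaled by the square root of the weights, and the weights are re-estimated from the residuals at each pass. The variance model (log squared residuals linear in the regressors), the helper names `wls` and `iterated_wls`, and the simulated data are all assumptions of this example rather than a prescribed procedure.

```python
import numpy as np


def wls(X, y, weights):
    """Weighted least squares as OLS on the equation rescaled by sqrt(weights)."""
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


def iterated_wls(X, y, n_iter=5, tol=1e-6):
    """Hypothetical helper: re-estimate the weights until the coefficients stabilize."""
    beta = wls(X, y, np.ones(len(y)))            # unit weights reproduce OLS
    for _ in range(n_iter):
        resid = y - X @ beta
        # Assumed variance model: log(e_i^2) is linear in the columns of X
        gamma, *_ = np.linalg.lstsq(X, np.log(resid ** 2 + 1e-12), rcond=None)
        weights = 1.0 / np.exp(X @ gamma)        # inverse of the fitted variances
        new_beta = wls(X, y, weights)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


rng = np.random.default_rng(3)
m = 1000
x = rng.uniform(0.0, 10.0, size=m)
X = np.column_stack([np.ones(m), x])
# Assumed design: disturbance standard deviation exp(0.2 * x), true coefficients (1, 2)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=m) * np.exp(0.2 * x)

beta_ols = wls(X, y, np.ones(m))
beta_wls = iterated_wls(X, y)
print("OLS coefficients:", np.round(beta_ols, 3))
print("WLS coefficients:", np.round(beta_wls, 3))
```

Both sets of coefficients should land near the true values; the gain from weighting shows up mainly in their precision, not in their location.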


## <font color='#28B463'>References</font>

<br>
<ul style="list-style-type:square"> 
    <li>
        Penn State University - Stat 501 -
        <a href="https://bit.ly/2kozxN2">
        Lesson 13 : Weighted Least Squares and Robust Regression</a>        
    </li>
    <br>
    <li>
        Penn State University - Stat 501 -
        <a href="https://bit.ly/2KNOP9x">
        Lesson 4: SLR Model Assumptions</a>
    </li>
    <br>
    <li>
        The Minitab Blog - 
        <a href="https://bit.ly/2q8P4Tb">
        Why you need to check your Residual Plots for Regression Analysis : or, to err is human, to err randomly is
        statistically divine</a>        
    </li>
    <br>
    <li>
        StackExchange - Cross Validated - 
        <a href="https://bit.ly/2kmSoIv">
        What are the consequences of having non-constant variance in the error terms in linear regression?</a>        
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2schP43">
        Heteroscedasticity</a>        
    </li>
    <br>
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2klSMH9">
        Homoscedasticity</a>        
    </li>    
</ul>