# VIOLATION OF NO PERFECT MULTICOLLINEARITY

<br>

## Introduction

<br>
The assumption of no perfect multicollinearity (<b>A8</b>) states that : 

<br>
<blockquote>
The design matrix $\mathbf{X}$ must have full column rank $\boldsymbol{p}$ (so that $\mathbf{X}^{\top}\mathbf{X}$ is invertible); otherwise we have a condition known as perfect multicollinearity in the predictor variables.
</blockquote>

<br>
<blockquote>
Perfect multicollinearity can be triggered by having two or more perfectly correlated predictor variables (the same regressor mistakenly entered twice, one regressor being a linear transformation of another, etc.), but it can also happen if there are fewer observations than parameters to be estimated.
</blockquote>

<br>
Formally, we define <b>multicollinearity</b> as the situation in which two or more explanatory variables in a regression model are highly linearly related. <br>
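A set of variables is said to exhibit <b>perfect multicollinearity</b> if there exists at least one exact linear relationship among some of the variables, that is, if there are constants $\boldsymbol{\lambda_0}, \dots, \boldsymbol{\lambda_k}$, not all zero, such that for every observation $i$ :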

<br>
$
    \quad
    \boldsymbol{\lambda_0} + \boldsymbol{\lambda_1} \boldsymbol{\mathbf{X}_{1i}} 
    + \boldsymbol{\lambda_2} \boldsymbol{\mathbf{X}_{2i}} + \dots + \boldsymbol{\lambda_k} \boldsymbol{\mathbf{X}_{ki}} \ = \ 0
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
    [\textbf{E1}] 
$

<br>
Perfect multicollinearity is fairly common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly multicollinear variables often remain due to correlations inherent in the system being studied. 

<br>When dealing with non-perfect multicollinearity, the linear relationship is no longer exact and is now described by a modified version of <b>E1</b> which includes an error term $\boldsymbol{V_i}$ :

<br>
$
    \quad
    \boldsymbol{\lambda_0} + \boldsymbol{\lambda_1} \boldsymbol{\mathbf{X}_{1i}} 
    + \boldsymbol{\lambda_2} \boldsymbol{\mathbf{X}_{2i}} + \dots + \boldsymbol{\lambda_k} \boldsymbol{\mathbf{X}_{ki}} 
    + \boldsymbol{V_i} \ = \ 0
    \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad
    [\textbf{E2}] 
$

<br>
In statements regarding the assumptions underlying regression analysis, the phrase "no multicollinearity" is often used to mean the absence of perfect multicollinearity.


## Consequences

<br>
In the case of perfect multicollinearity, the design matrix $\mathbf{X}$ has less than full column rank, and therefore the matrix $\mathbf{X}^{\top}\mathbf{X}$ cannot be inverted. Under these circumstances the ordinary least squares estimator $\boldsymbol{\hat{\beta}_{OLS}}$ is not uniquely defined : the normal equations admit infinitely many solutions.

<br>
<blockquote>
In the presence of perfect multicollinearity, the parameter vector $\boldsymbol{\beta}$ is non-identifiable (identification problem), since the least squares problem has no unique solution.
</blockquote>
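
<br>
The rank deficiency and the identification problem can be illustrated numerically. The following is a minimal sketch with NumPy on simulated data; the variable names and the particular linear relationship ($X_2 = 2 X_1 + 1$) are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# x2 is an exact linear function of x1 and the intercept column
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + 1.0
X = np.column_stack([np.ones(n), x1, x2])

# The design matrix has 3 columns but only rank 2
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# The exact relationship 1 + 2*x1 - x2 = 0 gives a direction d with X @ d = 0,
# so any multiple of d can be added to beta without changing the fitted values.
d = np.array([1.0, 2.0, -1.0])
beta_a = np.array([1.0, 3.0, 0.0])
beta_b = beta_a + 5.0 * d

same_fit = np.allclose(X @ beta_a, X @ beta_b)
print("different coefficients, identical fitted values:", same_fit)
```

Because two distinct coefficient vectors reproduce exactly the same fitted values, the data cannot distinguish between them.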

<br>
In the case of non-perfect multicollinearity, the matrix $\mathbf{X}^{\top}\mathbf{X}$ actually has an inverse but a given computer algorithm may or may not be able to compute an approximate inverse. If it does so, the resulting matrix may be <b>highly sensitive to slight variations in the data, and may therefore be very inaccurate or very sample-dependent</b>. The design matrix $\mathbf{X}$ is said to be ill-conditioned.
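
<br>
As a rough illustration of ill-conditioning, the sketch below compares the condition number of a design matrix with unrelated regressors to one with nearly collinear regressors. The data are simulated and the noise scale (1e-4) is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)               # unrelated to x1
x_near = x1 + 1e-4 * rng.normal(size=n)    # almost a copy of x1

X_ok = np.column_stack([np.ones(n), x1, x_indep])
X_bad = np.column_stack([np.ones(n), x1, x_near])

# np.linalg.cond returns the ratio of the largest to the smallest singular value
cond_ok = np.linalg.cond(X_ok)
cond_bad = np.linalg.cond(X_bad)
print("condition number, unrelated regressors:", cond_ok)
print("condition number, nearly collinear regressors:", cond_bad)

# X_bad still has full column rank, so the inverse of X'X exists in principle
print("rank of X_bad:", np.linalg.matrix_rank(X_bad))
```

The nearly collinear matrix still has full column rank, yet its condition number is several orders of magnitude larger.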

<br>
Multicollinearity <b>does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors</b>. That is, a multiple regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

<br>
Another issue with multicollinearity is that <b>small changes to the input data can lead to large changes in the model</b>, even resulting in changes of sign of parameter estimates.
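
<br>
This sensitivity can be explored by simulation. The sketch below refits an OLS model many times after adding small random noise to the response; the true coefficients (1, 2, 3), the noise scales, and the helper name `ols` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Two nearly collinear regressors
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """Least squares coefficients via np.linalg.lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Refit after adding small noise to the response each time
n_rep = 200
betas = np.empty((n_rep, X.shape[1]))
for r in range(n_rep):
    y_pert = y + 0.1 * rng.normal(size=n)
    betas[r] = ols(X, y_pert)

# Individual slopes vary a lot, while their sum stays comparatively stable
spread_individual = betas[:, 1:].std(axis=0)
spread_sum = (betas[:, 1] + betas[:, 2]).std()
print("std of each slope across refits:", spread_individual)
print("std of the sum of slopes across refits:", spread_sum)
```

In this setup the individual slopes fluctuate far more than their sum, which is consistent with the idea that the bundle of predictors is well determined while the individual contributions are not.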

<br>
One of the features of multicollinearity is that the <b>standard errors of the affected coefficients tend to be large</b>. In that case, the test of the hypothesis that the coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory variable (<b>a type II error</b>).
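
<br>
The inflation of the standard errors can be made explicit. As a sketch, assuming a model with an intercept and homoskedastic errors (the symbols $\sigma^2$, $n$ and $\bar{X}_j$ are notation introduced here for illustration), the sampling variance of the $j^\text{th}$ OLS coefficient is usually written as :

<br>
$
    \quad
    \text{Var}(\hat{\beta}_j) \ = \ \dfrac{\sigma^2}{\sum_{i=1}^{n} (X_{ji} - \bar{X}_j)^2} \cdot \dfrac{1}{1 - R_j^2}
$

<br>
where $\sigma^2$ is the error variance and $R_j^2$ is the coefficient of determination of the regression of the $j^\text{th}$ regressor on all the other explanatory variables. As $R_j^2$ approaches 1, the second factor grows without bound, and so does the standard error.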


## Detection

<br>
Indicators that multicollinearity may be present in a model include the following :

<br>
<ul style="list-style-type:square">
    <li>
        large changes in the estimated regression coefficients when a predictor variable is added or deleted
    </li>
    <br>
    <li>
        insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the
        joint hypothesis that those coefficients are all zero (using an F-test)
    </li>
    <br>
    <li>
        if a multivariable regression finds an insignificant coefficient of a particular explanatory variable, yet a simple
        linear regression of the dependent variable on the same regressor shows the coefficient to be significantly different
        from zero, this situation indicates multicollinearity in the multivariable regression
    </li>   
    <br>
    <li>
        some authors have suggested formal detection measures for multicollinearity, the <b>tolerance</b> and the
        <b>variance inflation factor (VIF)</b> : <br><br>
        $
            \quad
            \textbf{tolerance} = 1 - \boldsymbol{{\mathbf{R}_j}^2} \qquad \textbf{VIF} = 1 \ / \ \textbf{tolerance}
        $ <br><br>
        where $\boldsymbol{{\mathbf{R}_j}^2}$ is the coefficient of determination of the regression of the $j^\text{th}$
        regressor on all the other explanatory variables. As a rule of thumb, a tolerance below 0.20 or 0.10 (equivalently,
        a VIF above 5 or 10, respectively) is taken to indicate a multicollinearity problem
    </li>   
    <br>
    <li>
        <b>condition number test</b> : the standard measure of ill-conditioning in a matrix is the condition number, which
        measures the potential sensitivity of the computed inverse matrix to small changes in the original matrix. A
        condition number above 30 suggests that the regression may have significant multicollinearity; the diagnosis is
        reinforced when, in addition, two or more of the variables related to the high condition number have high
        proportions of variance explained. One advantage of this method is that it shows which variables are causing the problem
    </li>   
    <br>
    <li>
        <b>perturbing the data</b> : multicollinearity can be detected by adding random noise to the data, re-running the
        regression many times, and observing how much the coefficients change
    </li>   
</ul>
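
<br>
The tolerance and VIF defined above can be computed directly from their definition. The following is a minimal sketch with NumPy on simulated data; the function name `vif` and the data-generating choices are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (predictors only, without an intercept column).

    Each column is regressed on an intercept plus all the other columns,
    and VIF_j = 1 / (1 - R_j^2) is computed from that auxiliary regression.
    """
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)   # tolerance is 1 - r2
    return out

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # strongly related to x1
x3 = rng.normal(size=n)              # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF per predictor:", vifs)
```

With these settings the first two predictors are flagged by the usual thresholds while the third is not. In practice a library routine may be preferable to hand-written code.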

## Correction

<br>
Before mentioning some of the remedies, it is worth noting that we could simply leave the model as it is, despite multicollinearity. When fitting the model to new data, multicollinearity does not invalidate the reliability of the model as a whole, provided that the explanatory variables follow the same pattern of multicollinearity in the new data as in the data on which the regression model is based.

<br>
Known methods are :

<br>
<ul style="list-style-type:square">
    <li>
        <b>drop one of the variables</b> : an explanatory variable may be dropped to produce a model with significant
        coefficients. However, besides the loss of information, omitting a relevant variable results in biased
        coefficient estimates for the remaining explanatory variables that are correlated with the dropped one
    </li>
    <br>
    <li>
        <b>obtain more data, if possible</b> : this is the preferred solution. More data can produce more precise parameter
        estimates (with lower standard errors), since the variance of a regression coefficient estimate decreases with
        the sample size and increases with the degree of multicollinearity, as measured by the variance inflation factor
    </li>
    <br>
    <li>
        <b>standardize the independent variables</b> : this may help reduce false flagging of a condition number above 30
    </li>
    <br>
    <li>
        it has also been suggested that the model could account for the effects of multicollinearity using the <b>Shapley
        value</b>, a game theory tool which assigns a value to each predictor by assessing its contribution across all
        possible combinations of predictors
    </li>
    <br>
    <li>
        <b>ridge regression</b> or <b>principal component regression</b> or <b>partial least squares regression</b> can be used
    </li>
</ul>
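
<br>
To give an idea of how ridge regression stabilizes the estimates, the sketch below uses the closed-form ridge solution on centered data. The function name `ridge`, the penalty value `alpha=1.0`, and the simulated data are assumptions chosen for illustration; in practice the predictors are usually standardized first and the penalty is tuned.

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Solve (Xc'Xc + alpha * I) coef = Xc'yc; alpha = 0 gives ordinary least squares
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

_, coef_ols = ridge(X, y, alpha=0.0)
_, coef_ridge = ridge(X, y, alpha=1.0)
print("OLS slopes:  ", coef_ols)
print("ridge slopes:", coef_ridge)
```

The penalty shrinks the coefficient vector, trading a small bias for a large reduction in variance along the poorly determined direction. The sum of the two ridge slopes should stay close to the true combined effect (5 in this simulation).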

## <font color='#28B463'>References</font>

<br>
<ul style="list-style-type:square">
    <li>
        Wikipedia - 
        <a href="https://bit.ly/2IKRHHG">Multicollinearity</a>        
    </li>    
</ul>