# Multicollinearity.


### 1. Definition.

According to [Wikipedia](https://en.wikipedia.org/wiki/Multicollinearity#:~:text=Multicollinearity%20refers%20to%20a%20situation,equal%20to%201%20or%20%E2%88%9A21.):
* **collinearity** is a linear association between two explanatory variables. 
* **multicollinearity** is a linear association between more than two explanatory variables within a dataset:
  + **perfect** multicollinearity occurs when one variable is an exact linear combination of the others (for two variables, their linear correlation is equal to 1 or -1).
  + **nearly perfect** multicollinearity arises when there is an approximate, but not exact, linear relationship between the variables.

### 2. When is multicollinearity a problem and why?

Let's examine the case of perfect multicollinearity. It means that an exact linear relationship exists among some of the variables in our dataset:

$\lambda_0 + \lambda_1 * X_{1i} + \lambda_2 * X_{2i} + ... + \lambda_k * X_{ki} = 0$

where $X_{ki}$ is the ith observation on the kth explanatory variable, and the $\lambda_{k}$ are constants that are not all zero.

For a multiple regression equation, where $X_{1}, ... X_{k}$ are the explanatory variables and $Y$ the target variable, we would have the following equation:

$Y_{i} = \beta_{0} + \beta_{1} * X_{1i} + ... + \beta_{k} * X_{ki} + \epsilon_{i}$

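Estimating $\beta$ with OLS (Ordinary Least Squares) leads to the normal equations and, when $X^T * X$ is invertible, to a closed-form solution:

$Y = X * \beta + \epsilon \Rightarrow X^T * Y = X^T * X * \hat{\beta} \Rightarrow \hat{\beta} = (X^T * X)^{-1} * X^T * Y$

With perfectly collinear variables, the columns of $X$ are linearly dependent, so the square matrix $X^TX$ no longer has full rank ($k + 1$, counting the intercept column), which makes it non-invertible.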
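The rank argument can be checked numerically. The sketch below uses synthetic data (the variables `x1`, `x2`, `x3` and the coefficient vectors are made up for illustration) and relies on the fact that $X^TX$ has the same rank as $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - 3.0 * x2  # exact linear combination: perfect multicollinearity

# Design matrix with an intercept column, so 4 columns in total.
X = np.column_stack([np.ones(n), x1, x2, x3])

# rank(X^T X) equals rank(X); computing it on X is numerically more robust.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Moving along d changes the coefficients but not the fitted values,
# because X @ d = 2*x1 - 3*x2 - x3 = 0.
d = np.array([0.0, 2.0, -3.0, -1.0])
beta_a = np.array([1.0, 1.0, 1.0, 0.0])
beta_b = beta_a + 10.0 * d
same_fit = np.allclose(X @ beta_a, X @ beta_b)
print("same fitted values:", same_fit)
```

Two different coefficient vectors produce the same predictions, which is another way of saying that the $\beta$ cannot be identified from the data.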

The problem persists with nearly perfect collinearity: the inverse exists mathematically, but the matrix is ill-conditioned, so a computer may not be able to compute or approximate it accurately. Any result will be very sensitive to slight changes in the data and to rounding effects. The estimates may be very **inaccurate and sample dependent.**
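To make the sample dependence concrete, here is a small sketch on synthetic data (the data-generating process and the helper `ols` are assumptions for illustration only). The same model is fitted on two halves of a sample in which `x2` is almost a copy of `x1`:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)  # almost, but not exactly, a copy of x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

# X^T X is invertible, but its condition number is very large.
cond = np.linalg.cond(X.T @ X)
print("condition number of X^T X:", cond)

def ols(X, y):
    """Least-squares coefficients (intercept first)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Fit the same model on two halves of the sample and compare.
half = n // 2
beta_first = ols(X[:half], y[:half])
beta_second = ols(X[half:], y[half:])
print("first half :", beta_first)
print("second half:", beta_second)

# The individual slopes are poorly determined, but their sum is not.
print("sum of slopes:", beta_first[1:].sum(), beta_second[1:].sum())
```

The individual slopes can differ a lot between the two halves, while their sum stays close to the true value of 5: the data tell us about the combined effect, not about how to split it between the two variables.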


In practice, this therefore causes problems when:
* we need to **identify significant variables** in a multivariable regression (the standard errors of the coefficients are inflated).
* we need to **interpret coefficients from a multivariable regression.**
* we need to **avoid overfitting.**
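The effect on standard errors can be illustrated with the textbook formula $Var(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$. This is a sketch on synthetic data, assuming a known noise level `sigma`; the helper `coef_std_errors` is a name introduced here, not part of any library:

```python
import numpy as np

def coef_std_errors(X, sigma=1.0):
    """Theoretical OLS standard errors: sqrt of the diagonal of sigma^2 (X^T X)^-1."""
    return sigma * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

std_errors = {}
for rho in (0.0, 0.9, 0.99):
    # x2 has (approximately) correlation rho with x1.
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * noise
    X = np.column_stack([np.ones(n), x1, x2])
    std_errors[rho] = coef_std_errors(X)[1]  # standard error of the x1 slope
    print(f"rho = {rho:4.2f}  std. error of x1 coefficient = {std_errors[rho]:.4f}")
```

The standard error of the `x1` coefficient grows as the correlation with `x2` increases, which makes it harder for a truly relevant variable to appear significant.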

### 3. How to detect multicollinearity?

A good way to detect multicollinearity, and to find which variables are affected, is to use the **VIF (Variance Inflation Factor)**.

In practice, this is an index that tells how much the variance of an estimated coefficient is increased by collinearity.
It is determined in the following way:

1. We run an OLS regression of $X_{1}$ on all the other $X_{i}$ variables:
$X_{1} = \alpha_{0} + \alpha_{2} * X_{2} + ... + \alpha_{k} * X_{k} + e$


2. We calculate the VIF as follows:
$VIF_{1} = 1 / (1 - R^2_{1})$, where $R^2_{1}$ is the coefficient of determination of the regression in step 1.


3. As a rule of thumb, **VIF > 5** indicates multicollinearity and **VIF > 10** indicates high multicollinearity. (Indeed, the higher $R^2_{1}$ is, the greater the proportion of the variance of $X_{1}$ that is predictable from the other $X_{k}$ variables.)
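The three steps above can be sketched directly with NumPy. The function `vif` and the synthetic variables below are illustrative assumptions, not the library routine used later in this section:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    # Step 1: auxiliary OLS regression of X_j on the other variables.
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ coef
    r2 = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
    # Step 2: VIF_j = 1 / (1 - R_j^2).
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.3 * rng.normal(size=n)  # strongly related to x1 and x2
x4 = rng.normal(size=n)                   # unrelated to the others
X = np.column_stack([x1, x2, x3, x4])

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3", "x4"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

Here `x3` should come out well above the rule-of-thumb thresholds, while the unrelated `x4` should stay close to 1.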

In practice with Python, using an example taken from: https://etav.github.io/python/vif_factor_python.html/

In [44]:
# Imports
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# House sales dataset used throughout this example
url = 'https://raw.githubusercontent.com/SushmithaPulagam/Fixing-Multicollinearity/master/House%20Sales.csv'
df = pd.read_csv(url)

def vif_scores(df):
    """Return a DataFrame with the VIF score of every column of df."""
    VIF_Scores = pd.DataFrame()
    VIF_Scores["Independent Features"] = df.columns
    VIF_Scores["VIF Scores"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return VIF_Scores

# Keep only the explanatory variables: the last column is dropped
# (assumed here to be the target variable; still to be confirmed)
df1 = df.iloc[:, :-1]
vif_scores(df1)


  Independent Features  VIF Scores
0      Interior(Sq Ft)   35.436502
1             # of Bed   30.207875
2            # of Bath   12.254030
3           # of Rooms   41.654966
4            Condo Fee    9.023152
5                  Tax   11.603921

### 4. How to cope with multicollinearity?

1. A first way to deal with multicollinearity is to iteratively **drop the variables** with the highest VIF. For instance, let's drop `# of Rooms` here and recalculate the VIF.

In [45]:
print("First drop")
df.pop("# of Rooms")
df1 = df.iloc[:,:-1]
vif_scores(df1)

First drop


  Independent Features  VIF Scores
0      Interior(Sq Ft)   27.000350
1             # of Bed   16.803246
2            # of Bath   12.253809
3            Condo Fee    8.582433
4                  Tax   10.961302


Next, we can drop the `Interior(Sq Ft)` variable.

In [46]:
print("Second drop")
df.pop("Interior(Sq Ft)")
df1 = df.iloc[:,:-1]
print(vif_scores(df1))

Second drop
  Independent Features  VIF Scores
0             # of Bed    7.515687
1            # of Bath   10.411790
2            Condo Fee    8.582383
3                  Tax   10.240541
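The manual drops above can be automated. The sketch below is one possible loop, written with NumPy on synthetic data; `vif_all`, `drop_high_vif` and the threshold of 5 are choices made for this illustration. It uses the fact that intercept-based VIFs are the diagonal of the inverse correlation matrix:

```python
import numpy as np

def vif_all(X):
    """VIF of every column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=5.0):
    """Drop the column with the highest VIF until all VIFs are <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_all(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        print(f"dropping {names[worst]} (VIF = {scores[worst]:.1f})")
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

rng = np.random.default_rng(5)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.1 * rng.normal(size=n)  # nearly a + b
d = rng.normal(size=n)
X = np.column_stack([a, b, c, d])

X_kept, kept = drop_high_vif(X, ["a", "b", "c", "d"])
print("kept:", kept)
```

Dropping one variable at a time matters: removing `c` is enough to bring the VIFs of `a` and `b` back down, so they do not need to be dropped as well.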


2. Another approach is to **combine** variables together.

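In [47]:
# Combine the two columns into one feature; pop() also removes them from df
total = df.pop("# of Bed") + df.pop("# of Bath")
# Insert "Total" as the first column, so that the last column is still the one dropped below
df.insert(0, "Total", total)
df1 = df.iloc[:,:-1]
vif_scores(df1)

The output recorded below lists only `Condo Fee` and `Tax`, because it comes from a run in which the combined `Total` column had not been stored in `df`:

  Independent Features  VIF Scores
0            Condo Fee    7.279790
1                  Tax    7.279790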


3. Other approaches would be to use **PCA**, which ends up with uncorrelated variables (but we may lose interpretability), or **PLS (Partial Least Squares) instead of OLS**, which replaces our variables with a smaller set of uncorrelated components.
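As a rough illustration of the PCA route, the sketch below computes principal components with a plain SVD in NumPy on synthetic data (the variables are made up, and the features are only centered, not standardized, which is a simplification):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# PCA via SVD of the centered data: the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # coordinates of each observation on the components

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)
print("correlation of x1 and x2:", corr_before[0, 1])
print("correlations between components:\n", np.round(corr_after, 6))
print("share of variance per component:", S**2 / np.sum(S**2))
```

The components are uncorrelated by construction, but each one mixes all the original variables, which is where the loss of interpretability comes from.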


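4. Finally, we can also use **regularization techniques** (for example ridge or lasso regression), which penalize large coefficients and stabilize the estimates.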
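As one example of regularization, ridge regression replaces $(X^TX)^{-1}$ with $(X^TX + \lambda I)^{-1}$, which stays well-conditioned even when the variables are nearly collinear. The sketch below is a minimal closed-form version on synthetic, centered data; the helper `ridge` and the value $\lambda = 1$ are assumptions for illustration, not a tuned model:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate of the slopes on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    penalty = lam * np.eye(X.shape[1])
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

beta_ols = ridge(X, y, lam=0.0)    # lam = 0 reduces to OLS
beta_ridge = ridge(X, y, lam=1.0)
print("OLS slopes  :", beta_ols)
print("ridge slopes:", beta_ridge)

# The penalty also improves the conditioning of the matrix being inverted.
Xc = X - X.mean(axis=0)
cond_ols = np.linalg.cond(Xc.T @ Xc)
cond_ridge = np.linalg.cond(Xc.T @ Xc + 1.0 * np.eye(2))
print("condition numbers:", cond_ols, cond_ridge)
```

The ridge slopes are shrunk towards each other and are biased, but their sum still tracks the combined effect of the two variables, and the estimate is far less sensitive to the sample.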

### References.

https://towardsdatascience.com/multicollinearity-in-data-science-c5f6c0fe6edf

https://en.wikipedia.org/wiki/Multicollinearity

https://en.wikipedia.org/wiki/Variance_inflation_factor

https://etav.github.io/python/vif_factor_python.html
