
Dealing with multicollinearity among variables

1. A bit of theory: [here](https://www.analyticsvidhya.com/blog/2021/02/multicollinearity-problem-detection-and-solution/) and [here](https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/)

2. We're going to compute the pairwise correlations between the variables, compute each variable's variance inflation factor (VIF), and compare the two

3. We're also going to try to fix the multicollinearity
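
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch with NumPy on synthetic data; the helper name `vif_with_intercept` and the generated columns are illustrative assumptions, not part of this notebook's dataset:

```python
import numpy as np

def vif_with_intercept(X):
    """Return the VIF of each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic example (assumption): x2 is built from x1, x3 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=500)
x3 = rng.normal(size=500)
example_vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(example_vifs)
```

In this sketch the two related columns should get clearly larger VIFs than the independent one, whose VIF should stay close to 1.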

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [69]:
import pandas as pd
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

In [111]:
# Load data
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Data/MulticollinearityExample.csv').iloc[:, 1:]

In [112]:
df.head()

,%Fat,Weight kg,Activity
0,25.3,52.163126,3508.44
1,29.3,61.801964,2773.54
2,37.7,93.440034,1738.97
3,32.8,59.874197,1665.29
4,24.6,50.348756,3982.95


In [113]:
# Take a look at the pairwise correlation of variables
df.corr()

,%Fat,Weight kg,Activity
%Fat,1.0,0.826715,-0.022527
Weight kg,0.826715,1.0,-0.107668
Activity,-0.022527,-0.107668,1.0


**Conclusion:** Variables '%Fat' and 'Weight kg' have a strong positive (linear) correlation.
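
With only two predictors, the R² of one regressed on the other (with an intercept) is simply the squared correlation, so the correlation alone implies a VIF. A quick check under that two-predictor assumption, which ignores 'Activity' because it is nearly uncorrelated with both variables (the variable names are illustrative):

```python
# Correlation between '%Fat' and 'Weight kg', copied from the df.corr() output
r = 0.826715

# For two predictors, R^2 equals r^2 and VIF = 1 / (1 - R^2)
r_squared = r ** 2
vif_two_predictors = 1.0 / (1.0 - r_squared)

print(f"R^2 = {r_squared:.3f}, implied VIF = {vif_two_predictors:.2f}")
```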

In [114]:
# Take a look at the VIFs
pd.Series([variance_inflation_factor(df, i) for i in range(df.shape[1])], index=df.columns)

%Fat         50.718318
Weight kg    43.985363
Activity      5.243990
dtype: float64

**Conclusion:** The VIFs for 'Weight kg' and '%Fat' are much greater than 5, a common rule-of-thumb threshold, so we might want to fix this.
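
One caveat worth checking: `variance_inflation_factor` uses the design matrix exactly as given and does not add an intercept, so a constant column normally has to be included explicitly. The sketch below compares both ways on synthetic data; the columns are made up to loosely resemble this dataset and are not loaded from the CSV:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic stand-in data (assumption): means far from zero, fat and weight related
rng = np.random.default_rng(42)
n = 300
fat = rng.normal(30, 5, n)
weight = 40 + 0.8 * fat + rng.normal(0, 3, n)
activity = rng.normal(2500, 800, n)
X = np.column_stack([fat, weight, activity])

# VIFs on the raw matrix, with no intercept column
vif_no_const = np.array([variance_inflation_factor(X, i) for i in range(X.shape[1])])

# VIFs with an explicit intercept; column 0 is the constant, so skip it
X_const = add_constant(X)
vif_const = np.array(
    [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
)

# VIFs on mean-centered data, again without an intercept column
X_centered = X - X.mean(axis=0)
vif_centered = np.array(
    [variance_inflation_factor(X_centered, i) for i in range(X_centered.shape[1])]
)

print("no constant:  ", vif_no_const)
print("with constant:", vif_const)
print("centered:     ", vif_centered)
```

In this sketch the intercept-based and centered VIFs should agree, while the raw no-intercept values for the two related columns come out much larger.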

In [115]:
# Center the variables: with_std=False subtracts each column's mean without scaling
ss = StandardScaler(with_std=False)
X = ss.fit_transform(df)
df = pd.DataFrame(data=X, columns=df.columns)
df.head()

,%Fat,Weight kg,Activity
0,-3.265217,-1.765066,946.450435
1,0.734783,7.873772,211.550435
2,9.134783,39.511842,-823.019565
3,4.234783,5.946005,-896.699565
4,-3.965217,-3.579436,1420.960435


In [116]:
pd.Series([variance_inflation_factor(df, i) for i in range(df.shape[1])], index=df.columns)

%Fat         3.204397
Weight kg    3.240334
Activity     1.026226
dtype: float64

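**Conclusion:** After centering, all VIFs are below 5. Centering itself does not change the correlation between '%Fat' and 'Weight kg'. The VIFs dropped because `variance_inflation_factor` does not add an intercept, so the values computed on the uncentered data were inflated by the nonzero column means. The centered values match what a model with an intercept would give, which suggests the multicollinearity here is moderate.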
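
A quick way to see what centering does and does not change is to compare the correlation matrix before and after. A small sketch on synthetic stand-in data (the generated `fat` and `weight` arrays are assumptions, not the notebook's CSV):

```python
import numpy as np

# Synthetic stand-in columns (assumption)
rng = np.random.default_rng(1)
fat = rng.normal(30, 5, 200)
weight = 40 + 0.8 * fat + rng.normal(0, 3, 200)
raw = np.column_stack([fat, weight])

# Subtract each column's mean, as StandardScaler(with_std=False) does
centered = raw - raw.mean(axis=0)

corr_raw = np.corrcoef(raw, rowvar=False)
corr_centered = np.corrcoef(centered, rowvar=False)

print(corr_raw)
print(corr_centered)
print("identical:", np.allclose(corr_raw, corr_centered))
```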

P.S. Multicollinearity doesn't affect how well the model fits the data. If we don't need to understand the role of each independent variable or its significance, we don't need to fix multicollinearity (from [here](https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/#comment-11523)).
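
To illustrate this point, the simulation sketch below fits the same linear model on many synthetic samples, once with independent predictors and once with strongly correlated ones. The function `simulate`, the true coefficients, and the noise level are all assumptions chosen for illustration:

```python
import numpy as np

def simulate(rho, n=100, trials=300, seed=0):
    """Fit y = b0 + b1*x1 + b2*x2 by least squares on many samples.

    Returns the b1 estimates and the R^2 values across trials.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    b1, r2 = [], []
    for _ in range(trials):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(size=n)  # true coefficients are both 1
        A = np.column_stack([np.ones(n), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        b1.append(coef[1])
        r2.append(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))
    return np.array(b1), np.array(r2)

b1_indep, r2_indep = simulate(rho=0.0)
b1_corr, r2_corr = simulate(rho=0.95)

print("std of b1 estimates:", b1_indep.std(), b1_corr.std())
print("mean R^2:           ", r2_indep.mean(), r2_corr.mean())
```

In this setup the coefficient estimates should vary noticeably more from sample to sample when the predictors are correlated, while the fit (R²) does not get worse.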