# Calculate Variance Inflation Factor (VIF)

How do you detect multicollinearity?

Multicollinearity is a statistical phenomenon that occurs when two or more predictor variables in a multiple regression model are highly correlated.
This can lead to unstable and inconsistent coefficients, making it difficult to interpret the model's results.
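
The instability can be illustrated with a small simulation. This is only a sketch on synthetic data: the function name `coefficient_spread`, the noise scale 0.1, and the sample sizes are assumptions chosen for illustration. It refits the same linear model on many simulated samples and compares how much the fitted coefficient of `x1` varies when `x2` is independent of `x1` versus nearly a copy of it.

```python
import numpy as np

def coefficient_spread(correlated, n_trials=200, n=100, seed=0):
    """Std. dev. of the fitted coefficient on x1 across repeated simulated samples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        x1 = rng.normal(size=n)
        noise = rng.normal(size=n)
        # Correlated case: x2 is x1 plus a little noise. Otherwise x2 is independent.
        x2 = x1 + 0.1 * noise if correlated else noise
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        # Ordinary least squares with an intercept column
        A = np.column_stack([np.ones(n), x1, x2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        estimates.append(coef[1])
    return float(np.std(estimates))

spread_independent = coefficient_spread(correlated=False)
spread_correlated = coefficient_spread(correlated=True)
print(spread_independent, spread_correlated)
```

In the correlated case the estimate of the `x1` coefficient should swing far more from sample to sample, even though the true coefficient is the same in both settings.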

To measure multicollinearity, you can use the **Variance Inflation Factor** (VIF).

VIF is defined as the ratio of the variance of an estimated regression coefficient to the variance that coefficient would have if its predictor were uncorrelated with the other predictors.

A high VIF value (common rules of thumb are VIF > 5 or VIF > 10) indicates that multicollinearity is present and may be a problem.

To calculate the VIF for a predictor variable, you fit a regression model that uses that variable as the response and all of the other predictor variables as inputs, and then calculate the VIF using the following formula:

VIF = 1 / (1 - R^2)

where R^2 is the coefficient of determination of that auxiliary regression.
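
For an ordinary least squares model with an intercept, the ratio in the definition above can be written out explicitly. The identity below is the standard textbook form. The symbols $\sigma^2$ (error variance), $n$ (number of observations), $s_j^2$ (sample variance of predictor $j$) and $R_j^2$ (the R^2 obtained when predictor $j$ is regressed on the other predictors) are notation introduced here, not taken from the text:

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{(n-1)\,s_j^2} \cdot \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

The first factor is the variance the coefficient would have if predictor $j$ were uncorrelated with the others ($R_j^2 = 0$, so VIF = 1). The thresholds 5 and 10 correspond to $R_j^2 = 0.8$ and $R_j^2 = 0.9$.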

You can repeat this process for each predictor variable and compare the VIF values to determine which predictor variables contribute to multicollinearity.
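
The procedure can be sketched directly with NumPy. This is a minimal illustration on synthetic data, not the statsmodels implementation: the helper name `vif_by_auxiliary_regression` and the data are assumptions. It also adds an intercept to each auxiliary regression, which statsmodels does not do on its own, so the numbers are not directly comparable unless the design matrix includes a constant column.

```python
import numpy as np

def vif_by_auxiliary_regression(X, j):
    """VIF of column j: regress X[:, j] on the other columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Intercept column so that R^2 is the usual centered R^2
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Synthetic predictors (hypothetical data for illustration)
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)             # independent of the others
X_demo = np.column_stack([x1, x2, x3])

vifs = [vif_by_auxiliary_regression(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

Here `x1` and `x2` should receive large VIFs because each is almost fully explained by the other, while `x3` should stay close to 1.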

You can then drop the predictor variables with a high VIF and recalculate the VIF for the remaining ones to see how their values have changed.
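
One way to sketch this drop-and-recalculate loop is shown below. The helpers `compute_vif` and `drop_high_vif`, the threshold of 5, and the synthetic data are all assumptions made for illustration. Removing one predictor at a time is a design choice here, since dropping a single variable can already lower the VIFs of the others.

```python
import numpy as np
import pandas as pd

def compute_vif(df):
    """Return one VIF per column, using auxiliary regressions with an intercept."""
    values = df.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(df.columns):
        y = values[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(values, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        out[name] = ss_tot / ss_res  # algebraically equal to 1 / (1 - R^2)
    return pd.Series(out)

def drop_high_vif(df, threshold=5.0):
    """Drop the column with the largest VIF until every VIF is at or below threshold."""
    df = df.copy()
    while df.shape[1] > 1:
        vif = compute_vif(df)
        worst = vif.idxmax()
        if vif[worst] <= threshold:
            break
        df = df.drop(columns=worst)
    return df

# Synthetic predictors (hypothetical data for illustration)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=500),  # nearly a copy of x1
    "x3": rng.normal(size=500),             # independent of the others
})

reduced = drop_high_vif(demo)
print(compute_vif(demo))
print(compute_vif(reduced))
```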

Below you can see how to calculate VIF with **statsmodels**.

In [None]:
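import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2, so an older version is required
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load the Boston housing features into a DataFrame
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)

# Compute the VIF of every predictor.
# Note: no constant column is added to X, so each auxiliary regression runs without an intercept.
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]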

print(vif)

'''
   Predictor        VIF
0       CRIM   2.100373
1         ZN   2.844013
2      INDUS  14.485758
3       CHAS   1.152952
4        NOX  73.894947
5         RM  77.948283
6        AGE  21.386850
7        DIS  14.699652
8        RAD  15.167725
9        TAX  61.227274
10   PTRATIO  85.029547
11         B  20.104943
12     LSTAT  11.102025
'''

In [None]:
# Check for new categories in test set with Deepchecks

In [None]:
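import pandas as pd
# Note: newer deepchecks releases may expose this check under a different name
# (it appears to have been renamed NewCategoryTrainTest), so adjust the import to your version.
from deepchecks.tabular.checks.train_test_validation import CategoryMismatchTrainTest

checker = CategoryMismatchTrainTest()

# Every test column contains at least one category that never occurs in the training data
X_train = pd.DataFrame([["A", "B", "C"], ["B", "B", "A"]], columns=["Col1", "Col2", "Col3"])
X_test = pd.DataFrame([["B", "C", "D"], ["D", "A", "B"]], columns=["Col1", "Col2", "Col3"])

checker.run(X_train, X_test)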
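
Conceptually, a check like this compares the set of values seen in each training column with the values in the matching test column. The plain-pandas sketch below illustrates that idea on the same toy data. The helper `new_categories` is hypothetical and is not how Deepchecks implements or reports the check.

```python
import pandas as pd

def new_categories(train, test):
    """Map each shared column to the sorted test values that never occur in train."""
    result = {}
    for col in train.columns:
        if col not in test.columns:
            continue
        unseen = set(test[col].dropna()) - set(train[col].dropna())
        if unseen:
            result[col] = sorted(unseen)
    return result

train_df = pd.DataFrame([["A", "B", "C"], ["B", "B", "A"]], columns=["Col1", "Col2", "Col3"])
test_df = pd.DataFrame([["B", "C", "D"], ["D", "A", "B"]], columns=["Col1", "Col2", "Col3"])

unseen_by_column = new_categories(train_df, test_df)
print(unseen_by_column)
# {'Col1': ['D'], 'Col2': ['A', 'C'], 'Col3': ['B', 'D']}
```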

In [None]:
# Get Permutation Importance with eli5

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris data and hold out a test split
iris = load_iris()
X = iris.data
target = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, target)

# Fit the model, then measure how much shuffling each feature hurts the test score
svc = SVC().fit(X_train, y_train)
perm = PermutationImportance(svc).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=["Feature_1", "Feature_2", "Feature_3", "Feature_4"])
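
If eli5 is not available, a similar measurement can be sketched with scikit-learn's own `permutation_importance` function from `sklearn.inspection`. This is a swapped-in alternative, not the eli5 API. The variable names, `n_repeats=10`, and the fixed `random_state` are assumptions chosen to make the run reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, random_state=0
)

model = SVC().fit(X_tr, y_tr)

# Shuffle each feature several times and record the drop in test accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

for name, mean, std in zip(
    iris_data.feature_names, result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A feature whose shuffling barely changes the score has an importance near zero. Larger values mean the model relies on that feature more heavily.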
