## Variance Inflation Factor Feature Selection

When the independent variables in a model are uncorrelated with each other, their relationships with the dependent variable are straightforward to interpret: we can read them off the size and sign of the coefficients.

However, when independent variables are correlated with each other, interpretation becomes harder. Each coefficient then reflects not only that variable's direct relationship with the dependent variable but also its relationships with the other independent variables.

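Estimating a linear regression involves inverting a matrix built from the independent variables, so it helps to recall when inversion is possible. A matrix can be inverted only if its determinant is nonzero. A zero determinant means that the columns (equivalently, the rows) of the matrix are linearly dependent, and the inversion formula would require dividing by zero.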

Multicollinearity is a situation where the design matrix contains redundant information: some columns can be expressed as a linear combination of other columns. This redundancy is the same linear dependence that drives a determinant to zero.

However, multicollinearity doesn't necessarily mean that the variables are exactly linearly dependent. It could be that they are just very highly correlated, with a correlation of 0.90 or more, for example. In this case, the determinant of the matrix might not be exactly zero, but it could be very close to zero, which can still cause problems in the analysis.
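The effect of exact versus near dependence on the determinant can be checked numerically. The sketch below is illustrative only: the synthetic columns, the random seed, the noise scale, and the helper name `corr_det` are all assumptions made for this example, not part of the analysis that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Exact linear dependence: x3 is a linear combination of x1 and x2.
x3_exact = 2.0 * x1 - 0.5 * x2
# Near dependence: the same combination plus a small amount of noise.
x3_near = x3_exact + rng.normal(scale=0.05, size=n)
# No built-in dependence: an unrelated random column.
x3_indep = rng.normal(size=n)


def corr_det(*columns):
    """Determinant of the correlation matrix of the given columns."""
    corr = np.corrcoef(np.column_stack(columns), rowvar=False)
    return np.linalg.det(corr)


det_exact = corr_det(x1, x2, x3_exact)
det_near = corr_det(x1, x2, x3_near)
det_indep = corr_det(x1, x2, x3_indep)

print(f"exact dependence: det = {det_exact:.2e}")
print(f"near dependence:  det = {det_near:.2e}")
print(f"unrelated column: det = {det_indep:.2e}")
```

With exact dependence the determinant is zero up to floating-point error, with near dependence it is small but clearly nonzero, and with an unrelated column it stays close to 1.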

The determinant of a $2 \times 2$ matrix $A$ is found as follows.

If $A$ is 
  $\begin{bmatrix}
    a & b \\
    c & d 
  \end{bmatrix}$ then the determinant is $ad - bc$.

The inverse of $A$ is then:

$ A^{-1} = \frac{1}{ad-bc} \begin{bmatrix}d & -b \\-c & a \end{bmatrix}$


So if $ad-bc$ is close to 0 because the columns are linearly dependent (or nearly so), every entry of the matrix is multiplied by $\frac{1}{ad-bc}$, which is a very large number. The entries of the inverse therefore become very large, and small changes in the data produce large changes in the result. This numerical instability is what causes problems later in the analysis.
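A small numerical sketch makes the instability concrete. It applies the closed-form $2 \times 2$ inverse to a matrix whose second column is almost exactly twice the first. The helper names `inverse_2x2` and `nearly_dependent` and the specific numbers are assumptions chosen for illustration.

```python
import numpy as np


def inverse_2x2(m):
    """Invert a 2x2 matrix with the closed-form formula from the text."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("ad - bc is zero, so the matrix has no inverse")
    return np.array([[d, -b], [-c, a]]) / det


def nearly_dependent(eps):
    """Second column is almost twice the first; ad - bc equals eps."""
    return np.array([[1.0, 2.0], [1.0, 2.0 + eps]])


# The largest entry of the inverse grows roughly like 1 / eps.
largest_entry = {}
for eps in (1e-1, 1e-4, 1e-8):
    inv = inverse_2x2(nearly_dependent(eps))
    largest_entry[eps] = np.abs(inv).max()
    print(f"eps = {eps:.0e}   largest |entry| of inverse = {largest_entry[eps]:.3g}")

# Instability: a tiny change in b moves the solution of A x = b a long way.
A_inv = inverse_2x2(nearly_dependent(1e-4))
x_original = A_inv @ np.array([1.0, 1.0])
x_perturbed = A_inv @ np.array([1.0, 1.001])
print("solution for b = [1, 1]:    ", x_original)
print("solution for b = [1, 1.001]:", x_perturbed)
```

Changing the second element of `b` by 0.001 moves the solution from roughly `[1, 0]` to roughly `[-19, 10]`. Regression coefficients estimated from nearly collinear predictors are sensitive in the same way.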

The inverse of the correlation matrix of the predictors is closely tied to the variance inflation factor (VIF) and can provide insight into multicollinearity in your data. Multicollinearity refers to a situation in which two or more predictor variables in a multiple regression model are highly correlated. 

In the context of multicollinearity, the diagonal elements of the inverse of the correlation matrix are of particular interest. These diagonal elements are the variance inflation factors (VIFs) for each predictor variable in a multiple regression. 

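The VIF for a predictor variable quantifies how much the variance of the estimated regression coefficient for that variable is increased due to multicollinearity. In other words, it compares the variance of that coefficient with the variance it would have if the predictor were uncorrelated with all the other predictors. Equivalently, for the $k$th predictor, $\text{VIF}_k = \frac{1}{1 - R_k^2}$, where $R_k^2$ is the $R^2$ obtained by regressing the $k$th predictor on the remaining predictors. 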
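The link between the inverse correlation matrix and regressions among the predictors can be verified directly. The sketch below computes the VIFs two ways: from the diagonal of the inverse correlation matrix, and as $1 / (1 - R^2)$ from regressing each predictor on the others with an intercept. The synthetic data, the seed, and the function names `vif_from_inverse` and `vif_from_regression` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 500, 4

# Synthetic predictors; column 3 is built to overlap with columns 0 and 1.
X = rng.normal(size=(n, p))
X[:, 3] = 0.7 * X[:, 0] - 0.6 * X[:, 1] + 0.5 * rng.normal(size=n)


def vif_from_inverse(X):
    """VIFs as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def vif_from_regression(X):
    """VIFs as 1 / (1 - R^2), regressing each column on all the others."""
    n_obs, n_cols = X.shape
    vifs = np.empty(n_cols)
    for k in range(n_cols):
        y = X[:, k]
        others = np.delete(X, k, axis=1)
        # Include an intercept so that R^2 is measured around the mean of y.
        design = np.column_stack([np.ones(n_obs), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        centered = y - y.mean()
        r_squared = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[k] = 1.0 / (1.0 - r_squared)
    return vifs


vif_inv = vif_from_inverse(X)
vif_reg = vif_from_regression(X)
print("from inverse correlation matrix:", np.round(vif_inv, 3))
print("from auxiliary regressions:     ", np.round(vif_reg, 3))
```

The two vectors agree up to floating-point error, and the column that was constructed from the others has the largest VIF.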

If the VIF is 1, the $k$th predictor is uncorrelated with the other predictors, and hence the variance of its estimated coefficient is not inflated at all. A larger VIF indicates stronger correlation with the other predictors and hence a higher level of multicollinearity. 
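For the simplest case of two predictors with correlation $r$, the correlation matrix is $\begin{bmatrix} 1 & r \\ r & 1 \end{bmatrix}$ and the VIF of each predictor works out to $1 / (1 - r^2)$. The short sketch below, whose function name `two_predictor_vif` and chosen values of $r$ are illustrative assumptions, shows how quickly the VIF grows as $r$ approaches 1.

```python
import numpy as np


def two_predictor_vif(r):
    """VIF of either predictor when the two predictors have correlation r."""
    corr = np.array([[1.0, r], [r, 1.0]])
    return np.diag(np.linalg.inv(corr))[0]


vif_by_r = {}
for r in (0.0, 0.5, 0.9, 0.99):
    vif_by_r[r] = two_predictor_vif(r)
    print(f"r = {r:<4}  VIF = {vif_by_r[r]:.2f}")
```

A correlation of 0 gives a VIF of exactly 1, while a correlation of 0.9 gives a VIF of about 5.26.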

A common rule of thumb is that if a VIF is greater than 5 (or sometimes 10), then the multicollinearity is high. In this case, you might consider dropping the variable from the model, combining it with another variable, or using techniques like ridge regression or principal component analysis that can handle multicollinearity. 

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor

import warnings
warnings.filterwarnings("ignore")

In [33]:
# make regression data with a continuous target
data = make_regression(
    n_features = 7, 
    n_samples = 10, 
    random_state = 101)

# make a dataframe to improve readability
df = pd.DataFrame(data[0])

# add the targets
df['Y'] = data[1]

# inspect the results
df.head()


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | Y |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.154846 | 1.025984 | 0.000366 | -1.136645 | -0.156598 | 0.649826 | -0.031579 | 18.688373 |
| 1 | -0.74179 | 1.035125 | 0.681209 | 0.230336 | -0.03116 | -1.005187 | 1.939932 | 152.496034 |
| 2 | 0.302665 | 0.190794 | 0.955057 | -0.933237 | 1.978757 | 0.683509 | 2.605967 | 394.150781 |
| 3 | 0.184502 | -1.159119 | -1.706086 | 1.693723 | -0.134841 | 0.166905 | 0.390528 | -132.350866 |
| 4 | 0.1968 | 1.901755 | -0.116773 | 0.484752 | 0.238127 | -0.993263 | 1.996652 | 110.151348 |


In [34]:
# collect independent variables
numeric_indpendent_variables = df.select_dtypes(include = np.number)

# drop the dependent variable
numeric_indpendent_variables = numeric_indpendent_variables.drop(['Y'],axis = 1)

# correlation matrix
numeric_indpendent_variables_cor = numeric_indpendent_variables.corr()

In [35]:
numeric_indpendent_variables_cor 

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -0.031308 | -0.103204 | -0.57313 | -0.180697 | 0.331469 | -0.063475 |
| 1 | -0.031308 | 1.0 | 0.006235 | 0.127747 | -0.075287 | -0.706634 | 0.286777 |
| 2 | -0.103204 | 0.006235 | 1.0 | -0.320236 | 0.381105 | 0.262267 | 0.467901 |
| 3 | -0.57313 | 0.127747 | -0.320236 | 1.0 | -0.239777 | -0.543723 | -0.095698 |
| 4 | -0.180697 | -0.075287 | 0.381105 | -0.239777 | 1.0 | 0.29556 | 0.463724 |
| 5 | 0.331469 | -0.706634 | 0.262267 | -0.543723 | 0.29556 | 1.0 | -0.234564 |
| 6 | -0.063475 | 0.286777 | 0.467901 | -0.095698 | 0.463724 | -0.234564 | 1.0 |


In [36]:
# invert the correlation matrix; the VIFs are the diagonal elements of the inverse
pd.DataFrame(np.linalg.inv(numeric_indpendent_variables_cor.values), 
             index = numeric_indpendent_variables_cor.index, 
             columns=numeric_indpendent_variables_cor.columns)

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 2.401268 | -0.871531 | 1.004592 | 1.130929 | 1.123348 | -1.592689 | -0.853978 |
| 1 | -0.871531 | 2.966811 | -0.802841 | 0.506185 | -0.75716 | 3.242603 | 0.62967 |
| 2 | 1.004592 | -0.802841 | 2.07942 | 0.451214 | 0.50498 | -1.644105 | -1.2556 |
| 3 | 1.130929 | 0.506185 | 0.451214 | 2.403352 | 0.288215 | 1.102583 | 0.070471 |
| 4 | 1.123348 | -0.75716 | 0.50498 | 0.288215 | 2.215063 | -1.862515 | -1.384316 |
| 5 | -1.592689 | 3.242603 | -1.644105 | 1.102583 | -1.862515 | 5.890488 | 2.089186 |
| 6 | -0.853978 | 0.62967 | -1.2556 | 0.070471 | -1.384316 | 2.089186 | 2.491448 |


In [37]:
# collect the VIFs (the diagonal of the inverse) as a Series indexed by column
vifs = pd.Series(np.linalg.inv(numeric_indpendent_variables_cor.values).diagonal(), 
                 index=numeric_indpendent_variables_cor.index)

vifs

0    2.401268
1    2.966811
2    2.079420
3    2.403352
4    2.215063
5    5.890488
6    2.491448
dtype: float64

In [38]:
def iterative_remove_VIF(df, threshold=5):
    """Drop the column with the largest VIF until every VIF is <= threshold.

    VIFs are read from the diagonal of the inverse correlation matrix.
    Returns a new DataFrame; the input DataFrame is not modified.
    """
    while df.shape[1] > 1:
        # Step 1: Compute the correlation matrix of the remaining columns
        correlation_matrix = df.corr()

        # Step 2: Compute the inverse of the correlation matrix
        try:
            inv_corr_matrix = np.linalg.inv(correlation_matrix.values)
        except np.linalg.LinAlgError:
            # Exactly singular matrix: the VIFs are undefined
            print("The correlation matrix is not invertible.")
            break

        # Step 3: The diagonal elements of the inverse are the VIFs
        vifs = np.diag(inv_corr_matrix)

        # Step 4: Drop the column with the largest VIF if it exceeds the threshold
        max_val_index = np.argmax(vifs)
        max_val = vifs[max_val_index]

        if max_val > threshold:
            df = df.drop(df.columns[max_val_index], axis=1)
        else:
            # Every remaining VIF is at or below the threshold
            break

    return df

In [39]:
iterative_remove_VIF(numeric_indpendent_variables)

|   | 0 | 1 | 2 | 3 | 4 | 6 |
|---|---|---|---|---|---|---|
| 0 | 2.154846 | 1.025984 | 0.000366 | -1.136645 | -0.156598 | -0.031579 |
| 1 | -0.74179 | 1.035125 | 0.681209 | 0.230336 | -0.03116 | 1.939932 |
| 2 | 0.302665 | 0.190794 | 0.955057 | -0.933237 | 1.978757 | 2.605967 |
| 3 | 0.184502 | -1.159119 | -1.706086 | 1.693723 | -0.134841 | 0.390528 |
| 4 | 0.1968 | 1.901755 | -0.116773 | 0.484752 | 0.238127 | 1.996652 |
| 5 | 1.02481 | -0.346419 | -0.755325 | -0.610259 | 0.147027 | -0.479448 |
| 6 | -0.848077 | 0.907969 | 0.628133 | 2.70685 | 0.503826 | 0.651118 |
| 7 | -0.943406 | 0.638787 | 0.07296 | 0.807706 | 0.329646 | -0.497104 |
| 8 | -0.376519 | -1.133817 | 1.862864 | -0.925874 | 0.610478 | 0.38603 |
| 9 | -0.758872 | 0.740122 | -2.018168 | 0.605965 | 0.528813 | -0.589001 |
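
To see the iterative procedure act on data where the redundancy is known in advance, the following sketch plants a nearly redundant column and records which column is dropped and at what VIF. It is a NumPy-only variant written for illustration, not the function above: the name `drop_high_vif`, the column names, the seed, and the noise scale are all assumptions.

```python
import numpy as np


def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the column with the largest VIF above the threshold.

    Returns the names that were kept and a list of (name, vif) pairs
    for the columns that were dropped, in the order they were removed.
    """
    X = np.asarray(X, dtype=float)
    names = list(names)
    dropped = []
    # Stop at one column: a single predictor has nothing to be collinear with.
    while X.shape[1] > 1:
        corr = np.corrcoef(X, rowvar=False)
        vifs = np.diag(np.linalg.inv(corr))
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        dropped.append((names.pop(worst), float(vifs[worst])))
        X = np.delete(X, worst, axis=1)
    return names, dropped


rng = np.random.default_rng(7)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
# Planted redundancy: this column is almost exactly a + b.
a_plus_b = a + b + rng.normal(scale=0.1, size=n)

X = np.column_stack([a, b, c, a_plus_b])
kept, dropped = drop_high_vif(X, ["a", "b", "c", "a_plus_b"])

print("kept:   ", kept)
print("dropped:", dropped)
```

A single removal is enough here: once one of the three entangled columns is gone, the remaining VIFs fall below the threshold, and the unrelated column `c` is never a candidate for removal.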
