In this tutorial we will learn how to handle multicollinear features. This can be done as a feature selection step in your machine learning pipeline.
When two or more independent variables are highly correlated with each other, we say that those features are multicollinear.
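
To make the definition concrete, here is a minimal sketch on synthetic data (not the cars dataset). The variables `x1`, `x2` and `x3` are hypothetical: `x2` is built to be almost a linear function of `x1`, while `x3` is unrelated noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical features: x2 is almost a linear function of x1, x3 is unrelated.
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

# Pairwise Pearson correlations; rows/columns follow the order x1, x2, x3.
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
```

The `x1`/`x2` entry should come out close to 1, which is the situation we call multicollinearity, while the entries involving `x3` should stay near 0.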

In [85]:
import pandas as pd
import numpy as np


In [86]:
cars_df=pd.read_csv('dataset/cleaned_cars.csv')
cars_df.head()

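| | mpg | cylinders | displacement | horsepower | weight | acceleration | origin | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | US | 49 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | US | 49 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | US | 49 |
| 3 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | US | 49 |
| 4 | 15.0 | 8 | 429.0 | 198 | 4341 | 10.0 | US | 49 |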


In [87]:
cars_df.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|---|
| count | 367.0 | 367.0 | 367.0 | 367.0 | 367.0 | 367.0 | 367.0 |
| mean | 23.556403 | 5.438692 | 191.592643 | 103.618529 | 2955.242507 | 15.543324 | 42.953678 |
| std | 7.773266 | 1.694068 | 102.017066 | 37.381309 | 831.03173 | 2.728949 | 3.698402 |
| min | 9.0 | 3.0 | 70.0 | 46.0 | 1613.0 | 8.0 | 37.0 |
| 25% | 17.55 | 4.0 | 105.0 | 75.0 | 2229.0 | 13.8 | 40.0 |
| 50% | 23.0 | 4.0 | 146.0 | 94.0 | 2789.0 | 15.5 | 43.0 |
| 75% | 29.0 | 8.0 | 260.0 | 121.0 | 3572.0 | 17.05 | 46.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 49.0 |


As we can see, the ranges of these features are very different, which means they are on different scales. Let's standardize the features using sklearn's `scale` function.

In [88]:
from sklearn import preprocessing

# Standardize every numeric feature (zero mean, unit variance) in one call.
feature_cols=['cylinders','displacement','horsepower','weight','acceleration','Age']
cars_df[feature_cols]=preprocessing.scale(cars_df[feature_cols].astype('float64'))


In [89]:
cars_df.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|---|
| count | 367.0 | 367.0 | 367.0 | 367.0 | 367.0 | 367.0 | 367.0 |
| mean | 23.556403 | -1.936084e-17 | -1.936084e-17 | 9.680419e-17 | -7.744335e-17 | 9.680419e-17 | 2.3233e-16 |
| std | 7.773266 | 1.001365 | 1.001365 | 1.001365 | 1.001365 | 1.001365 | 1.001365 |
| min | 9.0 | -1.441514 | -1.193512 | -1.543477 | -1.617357 | -2.76796 | -1.611995 |
| 25% | 17.55 | -0.8504125 | -0.8499642 | -0.7666291 | -0.8750977 | -0.6396984 | -0.7997267 |
| 50% | 23.0 | -0.8504125 | -0.447522 | -0.2576598 | -0.2003166 | -0.01589748 | 0.01254184 |
| 75% | 29.0 | 1.513992 | 0.6714636 | 0.4656124 | 0.743172 | 0.5528622 | 0.8248104 |
| max | 46.6 | 1.513992 | 2.585518 | 3.385489 | 2.632559 | 3.396661 | 1.637079 |
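
The scaled columns have a mean of essentially zero, but the `std` row reads 1.001365 rather than exactly 1. This is because `preprocessing.scale` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces this on a hypothetical column `x` of 367 synthetic values; it is an illustration under that assumption, not a re-run on the real data.

```python
import numpy as np

# Hypothetical column with 367 values, the same row count as the dataset above.
rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=35.0, size=367)

# z-score using the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

print(z.std(ddof=0))        # population std of the scaled column
print(z.std(ddof=1))        # sample std, the estimator describe() reports
print(np.sqrt(367 / 366))   # ratio between the two estimators for n = 367
```

The last two printed values should agree, and both should round to the 1.001365 seen in the table.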


In [90]:
from sklearn.model_selection import train_test_split

Our primary goal in this tutorial is to learn how to handle multicollinearity among features, so we leave out the **origin** variable because it is a categorical feature.
{: .notice--info}

In [91]:
X=cars_df.drop(['mpg','origin'],axis=1) 
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
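
Note that `train_test_split` draws a different random split on every call unless a seed is given, so the scores shown in this tutorial will vary from run to run. A minimal sketch of a reproducible split follows; `X_demo`, `y_demo` and the seed value 42 are arbitrary choices made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state yields the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Fixing the seed also makes it fairer to compare two models, because both are then evaluated on the same held-out rows.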

In [92]:
from sklearn.linear_model import LinearRegression

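# The features are already standardized, and the `normalize` argument has been
# removed from recent scikit-learn releases, so we use the default settings.
linear_model = LinearRegression().fit(x_train,y_train)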

In [93]:
print("Training score : ",linear_model.score(x_train,y_train))

Training score :  0.8003438238657309


In [94]:
y_pred = linear_model.predict(x_test)

In [95]:
from sklearn.metrics import r2_score

print("Testing_score :",r2_score(y_test,y_pred))

Testing_score : 0.8190012505093899


## What is Adjusted $R^2$ Score?

When we have multiple predictors/features, the **Adjusted $R^2$ score** is a better measure of how good our model is.

The Adjusted $R^2$ score is computed from the plain $R^2$ score (`r2_score`) and is a corrected goodness-of-fit measure for linear models: it adjusts $R^2$ for the number of predictors/features used in the regression analysis.

- The Adjusted $R^2$ score increases only when a newly added predictor/feature improves the model more than would be expected purely by chance.

- When the model has few predictors relative to the number of samples, and none of them are redundant, the Adjusted $R^2$ score stays very close to the plain $R^2$ score.
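
For reference, the textbook definition with $n$ samples and $p$ predictors is $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$. The sketch below implements exactly that formula; the function name `adjusted_r2_textbook` and the example numbers are hypothetical and chosen only to show how the penalty grows with the number of predictors.

```python
def adjusted_r2_textbook(r2, n_samples, n_features):
    """Textbook adjusted R^2: penalizes R^2 for the number of predictors."""
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Same R^2 and sample count, more predictors: the adjusted score drops.
few = adjusted_r2_textbook(0.80, n_samples=74, n_features=3)
many = adjusted_r2_textbook(0.80, n_samples=74, n_features=6)
print(few, many)
```

Both values fall below 0.80, and the one with six predictors is the lower of the two.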


In [96]:
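def adjusted_r2(r_square,labels,features):
    # labels: target values (n samples); features: 2-D feature matrix (p columns).
    # Note: this divides by (n - p); the textbook formula divides by (n - p - 1),
    # so the value returned here is slightly higher than the textbook one.
    adj_r_square = 1 - ((1- r_square)*(len(labels)-1))/(len(labels)- features.shape[1])
    return adj_r_square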

In [97]:
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))

Adjusted R2 score : 0.8056925189291979


In [98]:
feature_corr=X.corr()
feature_corr

| | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|
| cylinders | 1.0 | 0.951901 | 0.841093 | 0.895922 | -0.483725 | 0.330754 |
| displacement | 0.951901 | 1.0 | 0.891518 | 0.930437 | -0.521733 | 0.362976 |
| horsepower | 0.841093 | 0.891518 | 1.0 | 0.862606 | -0.673175 | 0.41011 |
| weight | 0.895922 | 0.930437 | 0.862606 | 1.0 | -0.397605 | 0.302727 |
| acceleration | -0.483725 | -0.521733 | -0.673175 | -0.397605 | 1.0 | -0.273762 |
| Age | 0.330754 | 0.362976 | 0.41011 | 0.302727 | -0.273762 | 1.0 |


Now let's explore the correlation matrix.
Many features are highly correlated with **displacement**: **cylinders**, **horsepower** and **weight** all have correlation coefficients between 0.89 and 0.95 with it. Coefficients this high indicate that these features are likely to be **collinear**.

Another way of saying this is that **cylinders**, **horsepower** and **weight** give us largely the same information as **displacement**, so we don't need all of them in our regression analysis.

Using this correlation matrix, let's say we want to find all features with an absolute correlation coefficient greater than 0.8. We can do that with the code below.

In [99]:
abs(feature_corr) > 0.8

| | cylinders | displacement | horsepower | weight | acceleration | Age |
|---|---|---|---|---|---|---|
| cylinders | True | True | True | True | False | False |
| displacement | True | True | True | True | False | False |
| horsepower | True | True | True | True | False | False |
| weight | True | True | True | True | False | False |
| acceleration | False | False | False | False | True | False |
| Age | False | False | False | False | False | True |
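
A boolean matrix like this gets hard to scan as the number of features grows, and every pair appears twice. As a possible alternative, the sketch below lists each highly correlated pair once by keeping only the upper triangle of the correlation matrix. The helper `correlated_pairs` and the toy DataFrame `toy` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, corr) rows with |corr| above threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the trivial diagonal (corr == 1) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "corr"]
    # NaN entries (masked cells, if any survive stacking) fail this comparison.
    return pairs[pairs["corr"].abs() > threshold].reset_index(drop=True)

# Hypothetical toy data: b tracks a closely, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})
result = correlated_pairs(toy)
print(result)
```

Applied to the tutorial's data, the call would be `correlated_pairs(X)`.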


In [100]:
trimmed_features_df = X.drop(['cylinders','horsepower','weight'],axis=1)

In [101]:
trimmed_features_corr=trimmed_features_df.corr()

In [102]:
trimmed_features_corr

| | displacement | acceleration | Age |
|---|---|---|---|
| displacement | 1.0 | -0.521733 | 0.362976 |
| acceleration | -0.521733 | 1.0 | -0.273762 |
| Age | 0.362976 | -0.273762 | 1.0 |


In [103]:
abs(trimmed_features_corr) > 0.8

| | displacement | acceleration | Age |
|---|---|---|---|
| displacement | True | False | False |
| acceleration | False | True | False |
| Age | False | False | True |


We can now verify that the correlation among the remaining independent features has been reduced: no pair exceeds 0.8 in absolute value.

## Variance Inflation Factor

Another way of selecting features that are not collinear is the **Variance Inflation Factor** (VIF). It quantifies the severity of multicollinearity in an ordinary least squares regression analysis.

For each independent variable, the VIF measures how much the variance of its estimated regression coefficient is inflated by its correlation with the other independent variables.
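
Concretely, the VIF of feature $i$ is $\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$, where $R_i^2$ is the $R^2$ obtained by regressing feature $i$ on all the other features. The sketch below computes this by hand with NumPy on synthetic data. The function `vif_manual` and the columns `x1`, `x2`, `x3` are hypothetical, and the regression here includes an intercept, which is an assumption of this sketch. To the best of our knowledge, statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the two should agree on standardized (zero-mean) features.

```python
import numpy as np

def vif_manual(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1.0 - residuals.var() / y.var()            # R^2 of the auxiliary fit
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.5, size=1000)  # strongly tied to x1
x3 = rng.normal(size=1000)                  # independent of both
X_toy = np.column_stack([x1, x2, x3])

for i, name in enumerate(["x1", "x2", "x3"]):
    print(name, round(vif_manual(X_toy, i), 2))
```

The values for `x1` and `x2` should come out well above 1, while `x3` should stay close to 1.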

In [104]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [105]:
vif=pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]

In [106]:
vif['Features'] = X.columns

In [107]:
vif.round(2)

| | VIF Factor | Features |
|---|---|---|
| 0 | 10.82 | cylinders |
| 1 | 19.13 | displacement |
| 2 | 8.98 | horsepower |
| 3 | 10.36 | weight |
| 4 | 2.5 | acceleration |
| 5 | 1.24 | Age |


A common rule of thumb for reading VIF values:

- VIF = 1: not correlated
- VIF between 1 and 5: moderately correlated
- VIF > 5: highly correlated


Looking at the VIF values, **cylinders**, **displacement**, **horsepower** and **weight** are all well above 5, with **displacement** the highest at 19.13. Let's drop **displacement** and **weight** from the features.

In [108]:
X = X.drop(['displacement','weight'], axis = 1)

Now we calculate the VIF again for the remaining features.

In [109]:
vif=pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values,i) for i in range(X.shape[1])]

In [110]:
vif['Features'] = X.columns
vif.round(2)

| | VIF Factor | Features |
|---|---|---|
| 0 | 3.57 | cylinders |
| 1 | 5.26 | horsepower |
| 2 | 1.91 | acceleration |
| 3 | 1.2 | Age |


The VIF values have dropped sharply: only **horsepower** (5.26) remains marginally above 5, so the collinearity among the features has been reduced.
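
Above, two features were dropped in a single step. A common alternative, sketched here as an illustration rather than as the method used in this tutorial, is to drop one feature at a time: remove the feature with the highest VIF, recompute, and repeat until every VIF is at or below a threshold. The function names, the threshold of 5 and the toy data are assumptions. The sketch uses the identity that the VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np

def vif_all(X):
    """VIF of every column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=5.0):
    """Drop the highest-VIF column repeatedly until all VIFs <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif_all(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Hypothetical toy data: a and b are nearly identical, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
X_toy = np.column_stack([a, b, c])

X_kept, kept = drop_high_vif(X_toy, ["a", "b", "c"])
print(kept)
```

Dropping one feature at a time matters because removing a single collinear feature lowers the VIF of the features it was correlated with, so a second removal may turn out to be unnecessary.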

In [111]:
X=cars_df.drop(['mpg','origin','displacement','weight'],axis=1)
Y=cars_df['mpg']
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)

In [112]:
from sklearn.linear_model import LinearRegression

# As before, the features are already standardized, so `normalize` is not needed.
linear_model = LinearRegression().fit(x_train,y_train)

In [113]:
print("Training score : ",linear_model.score(x_train,y_train))

Training score :  0.7537877265338784


In [114]:
y_pred = linear_model.predict(x_test)

In [115]:
from sklearn.metrics import r2_score

print("Testing_score :",r2_score(y_test,y_pred))

Testing_score : 0.7159725745358863


In [116]:
print("Adjusted R2 score :",adjusted_r2(r2_score(y_test,y_pred),y_test,x_test))

Adjusted R2 score : 0.7037999705874243
