# Detecting and Handling <font color=red>Multi-collinearity in Data</font>

<img src='Data/Reducing Complexity.png' width=500/>

If you perform regression analysis on data that is __multi-collinear__, that is, data in which __one or more input features__ can be represented as __linear combinations of other features__, then your model is __not__ going to be very __robust__: the fitted coefficients become unstable, so small changes in the data can change them substantially.

1) Detecting multi-collinearity using <font color=red>Correlation Matrix</font>

2) Detecting multi-collinearity using <font color=red>Variance Inflation Factor</font>
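
To make the definition above concrete, here is a minimal sketch on synthetic data (not the cars dataset; all names are hypothetical). It builds a third feature as an exact linear combination of two others and checks that the feature matrix loses rank and becomes badly conditioned, which is what makes the regression coefficients unstable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Two independent synthetic features (hypothetical data, not the cars dataset)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)

# A third feature that is an exact linear combination of the first two
x3 = 2.0 * x1 - 0.5 * x2

X_collinear = np.column_stack([x1, x2, x3])
X_independent = np.column_stack([x1, x2, rng.normal(size=100)])

# Three columns, but only two of them carry independent information
rank_collinear = np.linalg.matrix_rank(X_collinear)
rank_independent = np.linalg.matrix_rank(X_independent)

# The condition number becomes huge when columns are linearly dependent
cond_collinear = np.linalg.cond(X_collinear)
cond_independent = np.linalg.cond(X_independent)

print('rank (collinear)   =', rank_collinear)
print('rank (independent) =', rank_independent)
print('condition number (collinear)   = %.3g' % cond_collinear)
print('condition number (independent) = %.3g' % cond_independent)
```

Real data is rarely exactly collinear; features that are only nearly linear combinations of each other keep full rank but still produce a large condition number.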

In [1]:
import pandas as pd

In [2]:
automobile = pd.read_csv('Data/cars_processed.csv')
automobile.head(5)

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin,Age
0,18.0,8,307.0,130,3504,12.0,1,50
1,15.0,8,350.0,165,3693,11.5,1,50
2,16.0,8,304.0,150,3433,12.0,1,50
3,17.0,8,302.0,140,3449,10.5,1,50
4,15.0,8,429.0,198,4341,10.0,1,50


In [3]:
# Check the data distribution

automobile.describe()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin,Age
count,391.0,391.0,391.0,391.0,391.0,391.0,391.0,391.0
mean,23.459847,5.465473,194.095908,104.352941,2976.411765,15.552941,1.578005,44.005115
std,7.810128,1.703152,104.590541,38.471278,850.173193,2.752786,0.80602,3.675975
min,9.0,3.0,68.0,46.0,1613.0,8.0,1.0,38.0
25%,17.0,4.0,105.0,75.0,2224.5,13.8,1.0,41.0
50%,23.0,4.0,151.0,93.0,2800.0,15.5,1.0,44.0
75%,29.0,8.0,264.5,125.0,3616.5,17.05,2.0,47.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,3.0,50.0


Observations:
- Features are on very different scales, with widely varying means and standard deviations
- Let us bring them to the same scale

In [4]:
from sklearn import preprocessing

In [5]:
# Standardize each feature: subtract its mean and divide by its standard deviation

feature_columns = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Age']
automobile[feature_columns] = preprocessing.scale(automobile[feature_columns].astype('float64'))

In [6]:
# Notice the mean of 0 and std of 1
automobile.describe()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin,Age
count,391.0,391.0,391.0,391.0,391.0,391.0,391.0,391.0
mean,23.459847,-1.817245e-16,-1.272071e-16,-1.817245e-16,-5.4517350000000005e-17,5.4517350000000005e-17,1.578005,7.268979e-16
std,7.810128,1.001281,1.001281,1.001281,1.001281,1.001281,0.80602,1.001281
min,9.0,-1.449449,-1.20716,-1.518736,-1.605742,-2.74726,1.0,-1.635704
25%,17.0,-0.8615499,-0.8529458,-0.7639608,-0.885555,-0.6376039,1.0,-0.8185488
50%,23.0,-0.8615499,-0.412572,-0.2954798,-0.2077668,-0.01925649,1.0,-0.001393275
75%,29.0,1.490045,0.6740026,0.5373752,0.7538562,0.5445308,2.0,0.8157623
max,46.6,1.490045,2.497725,3.270181,2.54814,3.363468,3.0,1.632918
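
The scaled columns report a std of 1.001281 rather than exactly 1. This is expected: `preprocessing.scale` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces this by hand on a hypothetical synthetic column with the same number of rows (391); it does not use the cars data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 391  # same number of rows as the cars dataset

# Hypothetical raw feature on an arbitrary scale
raw = pd.Series(rng.normal(loc=3000.0, scale=850.0, size=n))

# Standardize by hand using the population standard deviation (ddof=0)
scaled = (raw - raw.mean()) / raw.std(ddof=0)

population_std = scaled.std(ddof=0)  # equals 1 by construction
sample_std = scaled.std(ddof=1)      # the quantity describe() reports
expected_sample_std = np.sqrt(n / (n - 1))

print('population std =', round(population_std, 6))
print('sample std     =', round(sample_std, 6))
print('sqrt(n/(n-1))  =', round(expected_sample_std, 6))
```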


In [7]:
automobile.shape

(391, 8)

In [8]:
from sklearn.model_selection import train_test_split

X = automobile.drop(['MPG', 'Origin'], axis=1)
Y = automobile['MPG']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [9]:
from sklearn.linear_model import LinearRegression

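# The features are already standardized, so no extra normalization is needed.
# (The `normalize` argument was removed from LinearRegression in scikit-learn 1.2.)
linear_model = LinearRegression().fit(x_train, y_train)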

In [10]:
print('Training score = ', linear_model.score(x_train, y_train))

Training score =  0.8096084748053762


In [11]:
y_pred = linear_model.predict(x_test)

In [12]:
from sklearn.metrics import r2_score

print('Testing score = ', r2_score(y_test, y_pred))

Testing score =  0.7982858497575498


In [13]:
def adjusted_r2(r_square, labels, features):
    # Penalize R^2 for the number of predictors:
    # n = number of samples, p = number of features
    n, p = len(labels), features.shape[1]
    return 1 - ((1 - r_square) * (n - 1)) / (n - p - 1)

In [14]:
print('Adjusted r2_score = ', adjusted_r2(r2_score(y_test, y_pred), y_test, x_test))

Adjusted r2_score =  0.7814763372373457


In [15]:
features_corr = X.corr()
features_corr

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Age
Cylinders,1.0,0.950713,0.842372,0.898344,-0.501583,0.341595
Displacement,0.950713,1.0,0.896888,0.933379,-0.541667,0.366835
Horsepower,0.842372,0.896888,1.0,0.864776,-0.687827,0.413578
Weight,0.898344,0.933379,0.864776,1.0,-0.416164,0.308031
Acceleration,-0.501583,-0.541667,-0.687827,-0.416164,1.0,-0.285421
Age,0.341595,0.366835,0.413578,0.308031,-0.285421,1.0


Notice:
- Cylinders, Horsepower and Weight are __highly correlated__ with Displacement (and with one another)
- So, let us remove Cylinders, Displacement and Weight, and keep Horsepower as the single representative of this group

In [16]:
abs(features_corr) > 0.8

Unnamed: 0,Cylinders,Displacement,Horsepower,Weight,Acceleration,Age
Cylinders,True,True,True,True,False,False
Displacement,True,True,True,True,False,False
Horsepower,True,True,True,True,False,False
Weight,True,True,True,True,False,False
Acceleration,False,False,False,False,True,False
Age,False,False,False,False,False,True
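
Reading a boolean matrix gets harder as the number of features grows. As a sketch, the highly correlated pairs can be listed directly. The correlation values below are copied from the `features_corr` output above; the 0.8 threshold is the same one used in the notebook, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

names = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Age']

# Correlation values copied from the features_corr output above
corr = pd.DataFrame([
    [1.0, 0.950713, 0.842372, 0.898344, -0.501583, 0.341595],
    [0.950713, 1.0, 0.896888, 0.933379, -0.541667, 0.366835],
    [0.842372, 0.896888, 1.0, 0.864776, -0.687827, 0.413578],
    [0.898344, 0.933379, 0.864776, 1.0, -0.416164, 0.308031],
    [-0.501583, -0.541667, -0.687827, -0.416164, 1.0, -0.285421],
    [0.341595, 0.366835, 0.413578, 0.308031, -0.285421, 1.0],
], index=names, columns=names)

threshold = 0.8

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (a feature against itself) is skipped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Masked-out entries are NaN and never pass the comparison
high_pairs = pairs[pairs.abs() > threshold]
high_pairs = high_pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(high_pairs)
```

All six pairs that pass the threshold involve only Cylinders, Displacement, Horsepower and Weight, which is why keeping a single feature from that group removes every flagged pair.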


In [17]:
trimmed_features_df = X.drop(['Cylinders', 'Displacement', 'Weight'], axis=1)

In [18]:
trimmed_features_df.corr()

Unnamed: 0,Horsepower,Acceleration,Age
Horsepower,1.0,-0.687827,0.413578
Acceleration,-0.687827,1.0,-0.285421
Age,0.413578,-0.285421,1.0


2) Detecting multi-collinearity using <font color=red>Variance Inflation Factor</font>

In [19]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [20]:
# Compute the VIF of each feature against all the remaining features

vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

In [21]:
vif['features'] = X.columns

In [22]:
vif.round(2)

Unnamed: 0,VIF Factor,features
0,10.62,Cylinders
1,19.57,Displacement
2,9.36,Horsepower
3,10.76,Weight
4,2.61,Acceleration
5,1.24,Age


Interpretation of VIF values:
    
- VIF <font color=red>1</font> indicates __Not correlated__
- VIF <font color=red>1 to 5</font> indicates __Moderately correlated__
- VIF <font color=red>> 5</font> indicates __Highly correlated__
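
For reference, the VIF of feature i is 1 / (1 - R_i²), where R_i² is the R² obtained by regressing feature i on all the other features. For standardized features this equals the i-th diagonal entry of the inverse of the correlation matrix. The sketch below uses that identity on the correlation values printed earlier in this notebook; `vif_from_corr` is a hypothetical helper, not part of statsmodels, and the results should agree with the VIF tables only up to the rounding of the printed correlations.

```python
import numpy as np
import pandas as pd

def vif_from_corr(corr):
    """VIF of each feature, read off the diagonal of the inverse correlation matrix."""
    inv = np.linalg.inv(corr.values)
    return pd.Series(np.diag(inv), index=corr.columns)

names = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Age']

# Correlation values copied from the features_corr output earlier in the notebook
corr = pd.DataFrame([
    [1.0, 0.950713, 0.842372, 0.898344, -0.501583, 0.341595],
    [0.950713, 1.0, 0.896888, 0.933379, -0.541667, 0.366835],
    [0.842372, 0.896888, 1.0, 0.864776, -0.687827, 0.413578],
    [0.898344, 0.933379, 0.864776, 1.0, -0.416164, 0.308031],
    [-0.501583, -0.541667, -0.687827, -0.416164, 1.0, -0.285421],
    [0.341595, 0.366835, 0.413578, 0.308031, -0.285421, 1.0],
], index=names, columns=names)

vif_all = vif_from_corr(corr)

# Equivalent view: VIF = 1 / (1 - R^2), so R^2 = 1 - 1 / VIF
r_squared = 1 - 1 / vif_all

# The same helper applied to a reduced feature set
kept = ['Cylinders', 'Acceleration', 'Age']
vif_kept = vif_from_corr(corr.loc[kept, kept])

print(vif_all.round(2))
print(r_squared.round(3))
print(vif_kept.round(2))
```

Read this way, a VIF of 5 corresponds to R² = 0.8: 80% of that feature's variance is already explained by the other features.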

In [23]:
# As 'Displacement' and 'Weight' are highly correlated with the other features, let us drop them

X = X.drop(['Displacement', 'Weight'], axis=1)

In [24]:
# Let us calculate the VIF again

vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

In [25]:
vif['features'] = X.columns

In [26]:
vif.round(2)

Unnamed: 0,VIF Factor,features
0,3.59,Cylinders
1,5.32,Horsepower
2,1.98,Acceleration
3,1.21,Age


In [27]:
# 'Horsepower' still has a VIF above 5, so drop it and recalculate
X = X.drop(['Horsepower'], axis=1)
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['features'] = X.columns
vif.round(2)

Unnamed: 0,VIF Factor,features
0,1.42,Cylinders
1,1.36,Acceleration
2,1.15,Age
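
The drop-and-recompute steps above can also be automated. The following is a minimal sketch on synthetic data, not the notebook's own procedure: `vif_table` and `drop_high_vif` are hypothetical helpers, the VIF is taken from the inverse correlation matrix rather than from statsmodels, and features are removed one at a time, so on real data it may keep a different set of features than the manual steps did.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    inv = np.linalg.inv(features.corr().values)
    return pd.Series(np.diag(inv), index=features.columns)

def drop_high_vif(features, threshold=5.0):
    """Repeatedly drop the feature with the largest VIF until all are below threshold."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vif = vif_table(kept)
        if vif.max() < threshold:
            break
        kept = kept.drop(columns=[vif.idxmax()])
    return kept

# Synthetic demo data: 'a_noisy_copy' is almost a duplicate of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'a_noisy_copy': a + 0.1 * rng.normal(size=200),
    'b': rng.normal(size=200),
})

reduced = drop_high_vif(demo)
print(list(reduced.columns))
print(vif_table(reduced).round(2))
```

Dropping one feature at a time is a deliberate choice here: removing a single feature can lower the VIF of the others enough that they no longer need to be removed.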


In [29]:
# All remaining VIF values are < 5, so we use these features to fit the model

X = automobile.drop(['Displacement', 'Weight', 'Horsepower', 'Origin', 'MPG'], axis=1)
Y = automobile['MPG']

In [30]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [31]:
# The features are already standardized; `normalize` was removed in scikit-learn 1.2
linear_model = LinearRegression().fit(x_train, y_train)

In [32]:
print('Training score = ', linear_model.score(x_train, y_train))

Training score =  0.7069668692085977


In [33]:
y_pred = linear_model.predict(x_test)

In [34]:
print('Testing score = ', r2_score(y_test, y_pred))

Testing score =  0.7576743032908574


In [35]:
print('Adjusted r2_score = ', adjusted_r2(r2_score(y_test, y_pred), y_test, x_test))

Adjusted r2_score =  0.7479812754224917
