## Multicollinearity In Linear Regression

##### Definition: Multicollinearity is a phenomenon in which one independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. In other words, one independent variable can be linearly predicted from one or multiple other independent variables with a substantial degree of certainty.

Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, create a correlation matrix and look for pairs of variables whose correlation coefficient has a large absolute value.
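As a minimal sketch of that check, the snippet below scans a correlation matrix for pairs above a cutoff. The data is synthetic, and the helper name `high_corr_pairs` and the 0.8 threshold are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): "b" is nearly a copy of "a", "c" is unrelated.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=n),
    "c": rng.normal(size=n),
})

def high_corr_pairs(frame, threshold=0.8):
    """Return (col1, col2, |r|) for every pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

pairs = high_corr_pairs(df)
print(pairs)
```

The threshold is a judgment call; what counts as "large" depends on the dataset and on how the coefficients will be used.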

##### Multicollinearity, on the other hand, is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model. To make matters worse, multicollinearity can emerge even when isolated pairs of variables are not collinear, because one variable can be a near-linear combination of several others.


Before performing a regression analysis, we check several things: whether the independent variables are correlated with each other, whether each feature we select is significant, and whether there are missing values and, if so, how to handle them.

### Understanding Conceptually
Imagine you went to a rock band’s concert. There are two singers, a drummer, a keyboard player, and two guitarists. You can easily tell the singers apart because one is male and the other is female, but you have trouble telling which guitarist is playing better. Both guitarists are playing in the same tone, at the same pitch, and at the same speed. Removing one of them would not be a problem, since both sound almost the same. The benefit of removing one guitarist is lower cost and fewer members in the team. In machine learning, the equivalent is fewer features for training, which leads to a less complex model.

Here the two guitarists are collinear. If one plays slowly, the other also plays slowly. If one plays faster, the other also plays faster. Two variables are collinear when they move together in this way: as one increases, the other increases too (or, for a negative correlation, decreases) in a nearly fixed proportion.

### The Problem with Multicollinearity
Multicollinearity undermines the statistical significance of an independent variable. It is important to point out that multicollinearity does not, by itself, reduce the model’s predictive accuracy: the model should still do a reasonably good job of predicting the target variable when multicollinearity is present. Now, I know what you are thinking: “If it does not affect the model’s ability to predict my target, why should I be concerned?” While multicollinearity should not have a major impact on the model’s accuracy, it inflates the variance of the estimated coefficients and reduces the quality of the interpretation of the independent variables. In other words, the effect each individual input appears to have on the model isn’t trustworthy, and your explanation of how the model turns the inputs into the output will not be reliable.

In [6]:
import pandas as pd
import statsmodels.api as sm


In [7]:
df_adv = pd.read_csv('Advertising.csv', index_col=0)  # use the first column as the index

In [8]:
df_adv.head()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [9]:
x=df_adv[['TV','radio','newspaper']]
y=df_adv['sales']


In [10]:
x=sm.add_constant(x)
x




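| | const | TV | radio | newspaper |
|---|---|---|---|---|
| 1 | 1.0 | 230.1 | 37.8 | 69.2 |
| 2 | 1.0 | 44.5 | 39.3 | 45.1 |
| 3 | 1.0 | 17.2 | 45.9 | 69.3 |
| 4 | 1.0 | 151.5 | 41.3 | 58.5 |
| 5 | 1.0 | 180.8 | 10.8 | 58.4 |
| ... | ... | ... | ... | ... |
| 196 | 1.0 | 38.2 | 3.7 | 13.8 |
| 197 | 1.0 | 94.2 | 4.9 | 8.1 |
| 198 | 1.0 | 177.0 | 9.3 | 6.4 |
| 199 | 1.0 | 283.6 | 42.0 | 66.2 |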


In [11]:
# Fit an OLS model with an intercept on TV, radio, and newspaper
# y is the response (sales); x holds the predictors plus the constant column
model=sm.OLS(y,x).fit()

In [12]:
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Tue, 28 Jul 2020 | Prob (F-statistic): | 1.58e-96 |
| Time: | 18:27:12 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 | 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 | 0.049 |
| radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 | 0.206 |
| newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 | 0.011 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |


In [13]:
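# We look at R2, adjusted R2, the coefficients, their standard errors, and the p-values.
# The standard errors are small relative to the coefficients, which gives no sign of multicollinearity.
# The p-values for TV and radio are below 0.05.
# R2 and adjusted R2 are high, which is good.
# Only newspaper has a negative coefficient and a high p-value (0.860), indicating we can do away with it.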

In [14]:
import matplotlib.pyplot as plt
x.iloc[:,1:].corr()

| | TV | radio | newspaper |
|---|---|---|---|
| TV | 1.0 | 0.054809 | 0.056648 |
| radio | 0.054809 | 1.0 | 0.354104 |
| newspaper | 0.056648 | 0.354104 | 1.0 |


In [15]:
# Here we see that the pairwise correlations are low: about 0.05 to 0.06 between TV and the other two, and 0.35 between radio and newspaper.
# This indicates almost no multicollinearity.

In [24]:
df_salary=pd.read_csv('Salary_Data.csv')
df_salary.head()
               

| | YearsExperience | Age | Salary |
|---|---|---|---|
| 0 | 1.1 | 21.0 | 39343 |
| 1 | 1.3 | 21.5 | 46205 |
| 2 | 1.5 | 21.7 | 37731 |
| 3 | 2.0 | 22.0 | 43525 |
| 4 | 2.2 | 22.2 | 39891 |


In [17]:
x=df_salary[['YearsExperience','Age']]
y=df_salary['Salary']

In [18]:
x=sm.add_constant(x)
x



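| | const | YearsExperience | Age |
|---|---|---|---|
| 0 | 1.0 | 1.1 | 21.0 |
| 1 | 1.0 | 1.3 | 21.5 |
| 2 | 1.0 | 1.5 | 21.7 |
| 3 | 1.0 | 2.0 | 22.0 |
| 4 | 1.0 | 2.2 | 22.2 |
| 5 | 1.0 | 2.9 | 23.0 |
| 6 | 1.0 | 3.0 | 23.0 |
| 7 | 1.0 | 3.2 | 23.3 |
| 8 | 1.0 | 3.2 | 23.3 |
| 9 | 1.0 | 3.7 | 23.6 |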


In [19]:
model=sm.OLS(y,x).fit()
model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | Salary | R-squared: | 0.96 |
| Model: | OLS | Adj. R-squared: | 0.957 |
| Method: | Least Squares | F-statistic: | 323.9 |
| Date: | Tue, 28 Jul 2020 | Prob (F-statistic): | 1.35e-19 |
| Time: | 18:27:14 | Log-Likelihood: | -300.35 |
| No. Observations: | 30 | AIC: | 606.7 |
| Df Residuals: | 27 | BIC: | 610.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6661.9872 | 2.28e+04 | -0.292 | 0.773 | -5.35e+04 | 4.02e+04 |
| YearsExperience | 6153.3533 | 2337.092 | 2.633 | 0.014 | 1358.037 | 1.09e+04 |
| Age | 1836.0136 | 1285.034 | 1.429 | 0.165 | -800.659 | 4472.686 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 2.695 | Durbin-Watson: | 1.711 |
| Prob(Omnibus): | 0.26 | Jarque-Bera (JB): | 1.975 |
| Skew: | 0.456 | Prob(JB): | 0.372 |
| Kurtosis: | 2.135 | Cond. No. | 626.0 |


In [20]:
# We see a high R2 value, which is good.
# The standard errors of YearsExperience and Age are large relative to their coefficients, and the p-value of Age (0.165) is greater than 0.05. This indicates there may be multicollinearity.
# To confirm multicollinearity, I will check the correlation between the two predictors.

In [21]:
x.iloc[:,1:].corr()

| | YearsExperience | Age |
|---|---|---|
| YearsExperience | 1.0 | 0.987258 |
| Age | 0.987258 | 1.0 |


### We see a correlation of 0.987 (about 99%), so keeping one of the two features is enough.
Which one to drop?

Looking at the p-values, Age has a p-value greater than 0.05 (0.165), so we will drop it.

Also, YearsExperience explains about 97% of the variance in Age (r² ≈ 0.975), so dropping Age is the best move.
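The arithmetic behind that statement can be sketched in a few lines. With only two predictors, the share of variance one explains in the other is the squared correlation, and the variance inflation factor follows as 1 / (1 - r²). The value of `r` is copied from the correlation matrix above; the variable names are illustrative.

```python
import math

r = 0.987258                   # correlation between YearsExperience and Age
r_squared = r ** 2             # share of variance in one predictor explained by the other
vif = 1.0 / (1.0 - r_squared)  # variance inflation factor in the two-predictor case
se_inflation = math.sqrt(vif)  # factor by which coefficient standard errors grow

print(f"r^2 = {r_squared:.4f}, VIF = {vif:.1f}, std err inflation = {se_inflation:.1f}x")
```

A VIF of this size means the coefficient standard errors are several times larger than they would be with uncorrelated predictors, which is consistent with the wide confidence intervals in the regression summary.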

### Years of experience and Age are highly correlated, so it is wise to drop one of the two features

In [25]:
# Dropping Age
df_salary.drop(['Age'],axis=1,inplace=True)
df_salary.head()

| | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343 |
| 1 | 1.3 | 46205 |
| 2 | 1.5 | 37731 |
| 3 | 2.0 | 43525 |
| 4 | 2.2 | 39891 |
