# Multicollinearity in Linear Regression

In [1]:
import pandas as pd
import statsmodels.api as sm

statsmodels is a Python module that provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data.

In [2]:
df_adv=pd.read_csv('Advertising.csv')
df_adv.head()

,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [3]:
df_adv.columns

Index(['Unnamed: 0', 'TV', 'radio', 'newspaper', 'sales'], dtype='object')

In [5]:
df_adv.drop(['Unnamed: 0'],axis=1,inplace=True)
df_adv.head()

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [7]:
x=df_adv[['TV','radio','newspaper']] # independent features
y=df_adv['sales'] # dependent variable (target)

In [8]:
x=sm.add_constant(x)
x

,const,TV,radio,newspaper
0,1.0,230.1,37.8,69.2
1,1.0,44.5,39.3,45.1
2,1.0,17.2,45.9,69.3
3,1.0,151.5,41.3,58.5
4,1.0,180.8,10.8,58.4
...,...,...,...,...
195,1.0,38.2,3.7,13.8
196,1.0,94.2,4.9,8.1
197,1.0,177.0,9.3,6.4
198,1.0,283.6,42.0,66.2


In [9]:
# fitting an ordinary least squares (OLS) model with an intercept on TV, radio, and newspaper
model=sm.OLS(y,x).fit()

In [10]:
model.summary()

Dep. Variable:,sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Tue, 03 Jan 2023",Prob (F-statistic):,1.58e-96
Time:,14:43:12,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
radio,0.1885,0.009,21.893,0.000,0.172,0.206
newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


Here `coef` shows the estimated coefficients (beta values) of the features in the linear regression. The standard errors are small relative to the TV and radio coefficients, which suggests that this dataset does not have a multicollinearity problem. The newspaper coefficient is slightly negative (-0.0010) and newspaper is the only feature with a high p-value (0.860), so newspaper spend has no statistically significant effect on sales once TV and radio are accounted for; investing in newspaper advertising is unlikely to increase sales.

So, to check for multicollinearity: first, look at the coefficients (they are only comparable when the features are on the same scale); second, check the R² and adjusted R² values (values near 1 indicate a good fit); third, check the standard errors, which become large when the independent features are related to one another, together with the p-values.
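The effect of related features on the standard errors can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not on the advertising dataset: the helper names `ols_std_errors` and `fit`, the true coefficients, and the noise levels are all assumptions chosen for illustration. It fits the same model twice, once with two independent predictors and once with a second predictor that is almost a copy of the first, and compares the coefficient standard errors.

```python
import numpy as np

def ols_std_errors(X, y):
    """Return OLS coefficients and standard errors for a design matrix X
    that already contains an intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)              # fixed seed for reproducibility
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)              # unrelated to x1
x2_collinear = x1 + 0.05 * rng.normal(size=n)    # almost a copy of x1
noise = rng.normal(size=n)

def fit(x2):
    """Fit y = 1 + 2*x1 + 3*x2 + noise (hypothetical model) and return (beta, se)."""
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + noise
    return ols_std_errors(X, y)

_, se_independent = fit(x2_independent)
_, se_collinear = fit(x2_collinear)

print("std err, independent predictors:", se_independent.round(3))
print("std err, collinear predictors:  ", se_collinear.round(3))
```

The intercept's standard error stays roughly the same in both fits, while the standard errors of the two slope coefficients grow many times larger in the collinear case, even though the noise level is identical.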

In [11]:
import matplotlib.pyplot as plt
x.corr()

,const,TV,radio,newspaper
const,,,,
TV,,1.0,0.054809,0.056648
radio,,0.054809,1.0,0.354104
newspaper,,0.056648,0.354104,1.0


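No pair of features has a correlation above 0.9 (the largest is 0.354, between radio and newspaper), so we can keep all of these features for model training. The `const` row and column are NaN because a constant column has zero variance.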
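The threshold check can also be automated instead of read off the matrix by eye. The sketch below is written without pandas so that the arithmetic behind `x.corr()` (Pearson correlation by default) is visible; `pearson` and `high_corr_pairs` are hypothetical helper names, and the 0.9 cut-off is the rule of thumb used in this notebook rather than a fixed standard. The sample holds only the first five rows of the advertising features shown earlier, which keeps the example self-contained but is far too little data for reliable correlations.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def high_corr_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair of features with |r| > threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs

# First five rows of the advertising features (illustration only).
sample = {
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
}

flagged = high_corr_pairs(sample, threshold=0.9)
print(flagged)
```

An empty list means no pair crosses the threshold; any pair that does appear is a candidate for dropping one of its two features.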

In [12]:
# A second example of a correlation check
df_salary=pd.read_csv('Salary_Data.csv')
df_salary.head()

,YearsExperience,Age,Salary
0,1.1,21.0,39343
1,1.3,21.5,46205
2,1.5,21.7,37731
3,2.0,22.0,43525
4,2.2,22.2,39891


In [13]:
x=df_salary[['YearsExperience','Age']] # independent features
y=df_salary['Salary'] # dependent variable (target)

In [14]:
x=sm.add_constant(x)
model=sm.OLS(y,x).fit()

In [15]:
model.summary()

Dep. Variable:,Salary,R-squared:,0.96
Model:,OLS,Adj. R-squared:,0.957
Method:,Least Squares,F-statistic:,323.9
Date:,"Tue, 03 Jan 2023",Prob (F-statistic):,1.35e-19
Time:,15:13:26,Log-Likelihood:,-300.35
No. Observations:,30,AIC:,606.7
Df Residuals:,27,BIC:,610.9
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-6661.9872,2.28e+04,-0.292,0.773,-5.35e+04,4.02e+04
YearsExperience,6153.3533,2337.092,2.633,0.014,1358.037,1.09e+04
Age,1836.0136,1285.034,1.429,0.165,-800.659,4472.686

Omnibus:,2.695,Durbin-Watson:,1.711
Prob(Omnibus):,0.26,Jarque-Bera (JB):,1.975
Skew:,0.456,Prob(JB):,0.372
Kurtosis:,2.135,Cond. No.,626.0


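The standard errors are very large relative to the coefficients, which points to a multicollinearity problem. The p-value for Age (0.165) is not significant even though R² is 0.96, which suggests that YearsExperience and Age may be correlated with each other, so we check the correlation matrix.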
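The link between a large standard error and a high p-value runs through the t statistic, which is the coefficient divided by its standard error: a larger standard error gives a smaller |t| and therefore a larger p-value. The short check below reproduces the `t` column from the values in the salary summary; `summary_rows` and `t_statistic` are names introduced here for illustration.

```python
# Coefficients and standard errors copied from the salary model summary.
summary_rows = {
    "YearsExperience": (6153.3533, 2337.092),
    "Age": (1836.0136, 1285.034),
}

def t_statistic(coef, std_err):
    """t statistic for the null hypothesis that the true coefficient is zero."""
    return coef / std_err

t_values = {name: t_statistic(coef, se) for name, (coef, se) in summary_rows.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
# YearsExperience: t = 2.633
# Age: t = 1.429
```

Both values match the summary table, so anything that inflates the standard error, such as correlated features, directly pushes a coefficient toward statistical insignificance.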

In [16]:
x.corr()

,const,YearsExperience,Age
const,,,
YearsExperience,,1.0,0.987258
Age,,0.987258,1.0


This shows that YearsExperience and Age have a correlation of 0.987 (about 99%), so the two features carry almost the same information and we can drop the Age feature.
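To put the 0.987 correlation in perspective, one common complementary diagnostic, not computed in this notebook, is the variance inflation factor (VIF), defined as 1 / (1 - R²) where R² comes from regressing one feature on the others. With exactly two features that R² is simply the squared pairwise correlation, so the VIF can be worked out by hand from the matrix. The variable names below are illustrative.

```python
# Correlation between YearsExperience and Age, taken from the matrix.
r = 0.987258

# With exactly two predictors, regressing one on the other gives R^2 = r^2.
vif = 1.0 / (1.0 - r ** 2)

# Coefficient variances scale with the VIF, so standard errors scale with its square root.
se_inflation = vif ** 0.5

print(f"VIF = {vif:.1f}")
print(f"standard errors inflated by a factor of about {se_inflation:.1f}")
# VIF = 39.5
# standard errors inflated by a factor of about 6.3
```

In other words, the standard errors in the salary model are roughly six times larger than they would be if the two features were uncorrelated, which is consistent with the large `std err` values in the summary.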