- Multicollinearity

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv('salary.csv')
df.head(10)

Unnamed: 0,rank,discipline,yrs.since.phd,yrs.service,sex,salary
0,Prof,B,19,18,Male,139750
1,Prof,B,20,16,Male,173200
2,AsstProf,B,4,3,Male,79750
3,Prof,B,45,39,Male,115000
4,Prof,B,40,41,Male,141500
5,AssocProf,B,6,6,Male,97000
6,Prof,B,30,23,Male,175000
7,Prof,B,45,45,Male,147765
8,Prof,B,21,20,Male,119250
9,Prof,B,18,18,Female,129000


Next, I will apply ordinal encoding to convert the dataset's categorical features to numerical features.

In [2]:
from sklearn.preprocessing import OrdinalEncoder
oe=OrdinalEncoder()
for i in df.columns:
    if df[i].dtype == 'object':
        df[i]=oe.fit_transform(df[i].values.reshape(-1,1))
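
With its default settings, `OrdinalEncoder` numbers the categories of each column in sorted order, so the codes for `rank` follow the alphabet rather than academic seniority. The plain-Python sketch below mimics that behaviour on the `rank` values of the first ten rows. `ordinal_encode` is a hypothetical helper written only for illustration; it is not part of scikit-learn.

```python
def ordinal_encode(values):
    """Map each distinct value to a float code, in sorted order of the values."""
    categories = sorted(set(values))
    mapping = {category: float(code) for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# `rank` column of the first ten rows shown in df.head(10)
ranks = ["Prof", "Prof", "AsstProf", "Prof", "Prof",
         "AssocProf", "Prof", "Prof", "Prof", "Prof"]

codes, mapping = ordinal_encode(ranks)
print(mapping)
print(codes)
```

If the seniority order matters for the model, `OrdinalEncoder` accepts a `categories` argument that lists the desired order for each column, for example `categories=[['AsstProf', 'AssocProf', 'Prof']]`.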

In [3]:
# all features in the dataset are now numerical
df.head(10)

Unnamed: 0,rank,discipline,yrs.since.phd,yrs.service,sex,salary
0,2.0,1.0,19,18,1.0,139750
1,2.0,1.0,20,16,1.0,173200
2,1.0,1.0,4,3,1.0,79750
3,2.0,1.0,45,39,1.0,115000
4,2.0,1.0,40,41,1.0,141500
5,0.0,1.0,6,6,1.0,97000
6,2.0,1.0,30,23,1.0,175000
7,2.0,1.0,45,45,1.0,147765
8,2.0,1.0,21,20,1.0,119250
9,2.0,1.0,18,18,0.0,129000


In [4]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Compute the VIF of every column in X by regressing it on the other columns
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

- VIF starts at 1 and has no upper limit.
- VIF = 1 means the independent variable is uncorrelated with the other variables.
- A VIF above 5 or 10 (two common rule-of-thumb cutoffs) indicates high multicollinearity between this independent variable and the others.
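
These properties follow from the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on the other features. As a minimal sketch, assuming only two predictors and an intercept in the regression, R_j^2 is simply the squared Pearson correlation between them. The helpers `pearson_r` and `vif_two_predictors` are hypothetical names, and the data are only the first ten rows shown earlier, so the result will not match the full-dataset VIF values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 equals r squared."""
    r_squared = pearson_r(x, y) ** 2
    return 1.0 / (1.0 - r_squared)


# First ten rows of the two experience columns from df.head(10)
yrs_since_phd = [19, 20, 4, 45, 40, 6, 30, 45, 21, 18]
yrs_service = [18, 16, 3, 39, 41, 6, 23, 45, 20, 18]

r = pearson_r(yrs_since_phd, yrs_service)
vif = vif_two_predictors(yrs_since_phd, yrs_service)
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

As the correlation approaches 1, the denominator approaches 0, which is why VIF has no upper limit.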

In [5]:
X = df.iloc[:,:-1]
calc_vif(X)

Unnamed: 0,variables,VIF
0,rank,6.53441
1,discipline,2.020112
2,yrs.since.phd,24.949895
3,yrs.service,16.236006
4,sex,5.502355


###### We can see here that ‘yrs.since.phd’ (VIF ≈ 24.9) and ‘yrs.service’ (VIF ≈ 16.2) have by far the highest VIF values, meaning each can be predicted well from the other independent variables in the dataset.

- Fixing Multicollinearity

In [6]:
X = df.drop(['yrs.since.phd','salary'],axis=1)
calc_vif(X)

Unnamed: 0,variables,VIF
0,rank,5.444033
1,discipline,2.004074
2,yrs.service,3.625131
3,sex,5.151251


###### We were able to drop the variable ‘yrs.since.phd’ from the dataset because most of its information is also captured by the ‘yrs.service’ variable. The VIF of ‘yrs.service’ fell from about 16.2 to 3.6, so the redundancy in our dataset has been reduced.
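
The manual step above (drop the feature with the largest VIF, then recompute) can be generalised into a loop. The following is a rough sketch rather than a recommended procedure: `drop_high_vif` is a hypothetical helper, the threshold of 10 is an assumption taken from the rule of thumb mentioned earlier, and the demo data are synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif(X, threshold=10.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())  # remove the most redundant feature
    return X


# Synthetic example: b is almost a copy of a, c is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
synthetic = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
reduced = drop_high_vif(synthetic)
print(list(reduced.columns))
```

Dropping one feature at a time matters because removing a single variable changes the VIF of every remaining one, as the drop in the ‘yrs.service’ value shows.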



In [7]:
X.head(10)

Unnamed: 0,rank,discipline,yrs.service,sex
0,2.0,1.0,18,1.0
1,2.0,1.0,16,1.0
2,1.0,1.0,3,1.0
3,2.0,1.0,39,1.0
4,2.0,1.0,41,1.0
5,0.0,1.0,6,1.0
6,2.0,1.0,23,1.0
7,2.0,1.0,45,1.0
8,2.0,1.0,20,1.0
9,2.0,1.0,18,0.0


- Key Point: When you care more about how much each individual feature, rather than a group of features, affects the target variable, removing multicollinearity may be a good option.