
# Multicollinearity, one hot encoding, dummy encoding

(by Tevfik Aytekin)

In [2]:
import numpy as np
import pandas as pd
#import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model


# X2 is exactly 2 * X1, so the two predictors are perfectly collinear
X1 = np.array([2, 4, 6, 8, 12, 14, 18, 21, 23])
X2 = np.array([4, 8, 12, 16, 24, 28, 36, 42, 46])
X = np.c_[X1, X2]
y = np.array([30, 15, 17, 39, 11, 122, 17, 52, 74])

In [3]:
# Fit ordinary least squares and inspect the intercept and the coefficients
lm = linear_model.LinearRegression()
model = lm.fit(X, y)
print(model.intercept_)
print(model.coef_)

16.99805919456574
[0.41484716 0.82969432]


In [4]:
# Problem: a small change in the data drastically changes the coefficients
# Changed 6 to 7 in X1
X1 = np.array([2, 4, 7, 8, 12, 14, 18, 21, 23])
X = np.c_[X1, X2]

In [None]:
# Refit on the perturbed data and compare with the coefficients printed above
model = lm.fit(X, y)
print(model.intercept_)
print(model.coef_)

### Explanation
When the predictors are highly correlated, the data points lie almost on a line in predictor space, so the base on which the response plane sits is not stable. A small change in this base has a large impact on the orientation of the response plane, and hence you can get very different coefficient values. This situation is explained very nicely [here](https://newonlinecourses.science.psu.edu/stat501/node/346/).
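
The instability can be checked numerically. The following is a minimal sketch, added here for illustration, that refits the model on the original and on the perturbed data and compares the coefficients. The helper name `fit_coefs` is hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([30, 15, 17, 39, 11, 122, 17, 52, 74])
X2 = np.array([4, 8, 12, 16, 24, 28, 36, 42, 46])
X1_orig = np.array([2, 4, 6, 8, 12, 14, 18, 21, 23])  # X2 == 2 * X1_orig
X1_pert = np.array([2, 4, 7, 8, 12, 14, 18, 21, 23])  # 6 changed to 7


def fit_coefs(x1, x2, target):
    """Hypothetical helper: fit OLS on two predictors and return the coefficients."""
    return LinearRegression().fit(np.c_[x1, x2], target).coef_


coef_orig = fit_coefs(X1_orig, X2, y)
coef_pert = fit_coefs(X1_pert, X2, y)

# The predictors stay almost perfectly correlated after the change ...
corr_pert = np.corrcoef(X1_pert, X2)[0, 1]
print("correlation after the change:", corr_pert)

# ... yet the fitted coefficients move a long way.
print("coefficients before:", coef_orig)
print("coefficients after: ", coef_pert)
print("largest absolute change:", np.abs(coef_pert - coef_orig).max())
```

After the change, the only information that separates the effect of `X1` from that of `X2` comes from the single modified point, which is why the coefficients are so sensitive to it.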

### Does it matter for machine learning?
From a statistical viewpoint, multicollinearity is a serious problem because it makes it difficult to interpret the effects of the predictors on the response variable. From a machine learning perspective, where the aim is mostly to reduce prediction error, one might think that multicollinearity is not a problem since it does not affect predictive power. There is some truth in this. However, if you do not remove highly correlated features, your model will not be reliable. With two highly correlated features you are learning a plane, while the data mostly fall along a line. So your predictions for points far away from that line will most likely be erroneous.
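
To make the reliability point concrete, here is a small sketch (an illustration added here, using the perturbed data from above) that compares the prediction for a point on the line X2 = 2 * X1 with the prediction for a point away from it. The two query points are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X1 = np.array([2, 4, 7, 8, 12, 14, 18, 21, 23])
X2 = np.array([4, 8, 12, 16, 24, 28, 36, 42, 46])
y = np.array([30, 15, 17, 39, 11, 122, 17, 52, 74])

two_feature_model = LinearRegression().fit(np.c_[X1, X2], y)

# Query points chosen for illustration: same X1, different X2.
on_line = np.array([[10, 20]])   # follows the pattern X2 = 2 * X1 seen in training
off_line = np.array([[10, 30]])  # far from anything seen in training

pred_on = two_feature_model.predict(on_line)[0]
pred_off = two_feature_model.predict(off_line)[0]
print("prediction on the line: ", pred_on)
print("prediction off the line:", pred_off)

# Dropping one of the correlated features leaves a single slope to estimate.
one_feature_model = LinearRegression().fit(X1.reshape(-1, 1), y)
pred_single = one_feature_model.predict(np.array([[10]]))[0]
print("single-feature prediction at X1 = 10:", pred_single)
```

The off-line prediction is driven by the coefficient of `X2`, which in this data set is pinned down by the one training point that deviates from the line, so it should not be trusted.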

### Dummy Encoding

Linear regression expects numerical input features. In real-life applications, datasets usually contain both numerical and categorical features. Dummy encoding is a technique for turning categorical features into numerical features so that linear regression can be applied.


### An example

Suppose that we want to predict house prices. Below is a simple dataset.

In [5]:
df = pd.DataFrame([[1521,"RL", 80000],
                  [1423,"RL", 70000],
                  [1123,"C", 30000],
                  [2187,"RM", 50000],
                  [2146,"FV", 40000],
                  [1749,"RH", 35000],
                  [2165,"RL", 90000],
                  [2871,"RH", 80000],
                  [1982,"C", 40000]],
                 columns = ["GrLivArea","MSzoning","SalePrice"])
df

Unnamed: 0,GrLivArea,MSzoning,SalePrice
0,1521,RL,80000
1,1423,RL,70000
2,1123,C,30000
3,2187,RM,50000
4,2146,FV,40000
5,1749,RH,35000
6,2165,RL,90000
7,2871,RH,80000
8,1982,C,40000


In [6]:
# Running the code below raises
# ValueError: could not convert string to float: 'RL'

X = df[['GrLivArea','MSzoning']]
y = df['SalePrice']
lm = linear_model.LinearRegression()
model = lm.fit(X, y)

ValueError: could not convert string to float: 'RL'

### One Hot Encoding

In [7]:
# To fix this error you need to encode the categorical feature. The get_dummies function in pandas
# produces a one hot encoding with its default parameters, as follows.
df_dummy = pd.get_dummies(df)
df_dummy

Unnamed: 0,GrLivArea,SalePrice,MSzoning_C,MSzoning_FV,MSzoning_RH,MSzoning_RL,MSzoning_RM
0,1521,80000,False,False,False,True,False
1,1423,70000,False,False,False,True,False
2,1123,30000,True,False,False,False,False
3,2187,50000,False,False,False,False,True
4,2146,40000,False,True,False,False,False
5,1749,35000,False,False,True,False,False
6,2165,90000,False,False,False,True,False
7,2871,80000,False,False,True,False,False
8,1982,40000,True,False,False,False,False


In [8]:
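# Fit on the features only: df_dummy still contains the target column SalePrice
model = lm.fit(df_dummy.drop(columns="SalePrice"), y)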

In [9]:
# With one hot encoding you run into multicollinearity, since one of the columns can be predicted
# from the others. In the above example, if you know the values of MSzoning_C, MSzoning_FV, MSzoning_RH, and MSzoning_RL
# you can determine the value of MSzoning_RM. To fix this, set drop_first=True
df_dummy = pd.get_dummies(df, drop_first=True)
df_dummy

Unnamed: 0,GrLivArea,SalePrice,MSzoning_FV,MSzoning_RH,MSzoning_RL,MSzoning_RM
0,1521,80000,0,0,1,0
1,1423,70000,0,0,1,0
2,1123,30000,0,0,0,0
3,2187,50000,0,0,0,1
4,2146,40000,1,0,0,0
5,1749,35000,0,1,0,0
6,2165,90000,0,0,1,0
7,2871,80000,0,1,0,0
8,1982,40000,0,0,0,0
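
The redundancy among the one hot columns can be verified directly. The sketch below (added for illustration; `design_rank` is a hypothetical helper) builds a design matrix from an intercept column plus the encoded zoning columns and checks its rank for both encodings. It assumes the model includes an intercept, which is the default for `LinearRegression`, and it leaves out `GrLivArea` to keep the check focused on the categorical feature.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1521, "RL", 80000],
                   [1423, "RL", 70000],
                   [1123, "C", 30000],
                   [2187, "RM", 50000],
                   [2146, "FV", 40000],
                   [1749, "RH", 35000],
                   [2165, "RL", 90000],
                   [2871, "RH", 80000],
                   [1982, "C", 40000]],
                  columns=["GrLivArea", "MSzoning", "SalePrice"])


def design_rank(encoded):
    """Hypothetical helper: (rank, number of columns) of the matrix [1 | encoded]."""
    design = np.column_stack([np.ones(len(encoded)), encoded.to_numpy(dtype=float)])
    return np.linalg.matrix_rank(design), design.shape[1]


one_hot = pd.get_dummies(df[["MSzoning"]])
dummy = pd.get_dummies(df[["MSzoning"]], drop_first=True)

# Every row has exactly one active one hot column, so the columns add up to the intercept.
row_sums = one_hot.to_numpy(dtype=float).sum(axis=1)
print("row sums of the one hot columns:", row_sums)

rank_one_hot, ncols_one_hot = design_rank(one_hot)
rank_dummy, ncols_dummy = design_rank(dummy)
print("one hot:    rank", rank_one_hot, "of", ncols_one_hot, "columns")
print("drop_first: rank", rank_dummy, "of", ncols_dummy, "columns")
```

A rank lower than the number of columns means that at least one column is a linear combination of the others, which is the situation `drop_first=True` avoids.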


### Ordinal Encoder

In [9]:
from sklearn.preprocessing import OrdinalEncoder

In [11]:
oe = OrdinalEncoder().set_output(transform="pandas")
oe.fit_transform(df)

Unnamed: 0,GrLivArea,MSzoning,SalePrice
0,2.0,3.0,5.0
1,1.0,3.0,4.0
2,0.0,0.0,0.0
3,7.0,4.0,3.0
4,5.0,1.0,2.0
5,3.0,2.0,1.0
6,6.0,3.0,6.0
7,8.0,2.0,5.0
8,4.0,0.0,2.0
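
The call above encodes every column, including the numeric `GrLivArea` and `SalePrice`, which are replaced by the rank of each value. A minimal sketch, under the assumption that only the categorical `MSzoning` column should be encoded, could look as follows. The name `df_ordinal` is introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame([[1521, "RL", 80000],
                   [1423, "RL", 70000],
                   [1123, "C", 30000],
                   [2187, "RM", 50000],
                   [2146, "FV", 40000],
                   [1749, "RH", 35000],
                   [2165, "RL", 90000],
                   [2871, "RH", 80000],
                   [1982, "C", 40000]],
                  columns=["GrLivArea", "MSzoning", "SalePrice"])

oe = OrdinalEncoder()

# Encode only the categorical column; the numeric columns are left untouched.
df_ordinal = df.copy()
df_ordinal["MSzoning"] = oe.fit_transform(df[["MSzoning"]]).ravel()

print(oe.categories_[0])  # learned categories, listed in the order of their codes
print(df_ordinal)
```

By default the codes follow the sorted order of the category labels. A linear model will treat them as a numeric scale, which implies an ordering that zoning classes may not actually have.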
