<h2>Categorical Variables and One Hot Encoding</h2>

In [60]:
import pandas as pd

In [61]:
df = pd.read_csv("homeprices.csv")
df

Unnamed: 0,state,area,price
0,new jersey,2600,550000
1,new jersey,3000,565000
2,new jersey,3200,610000
3,new jersey,3600,680000
4,new jersey,4000,725000
5,california,2600,850000
6,california,2800,900000
7,california,3300,925000
8,california,3600,985000
9,georgia,2600,350000


<h4 style='color:purple'>Linear regression on area alone, ignoring state, gives poor predictions, as shown below</h4>

In [80]:
from sklearn import linear_model
model = linear_model.LinearRegression()

In [81]:
model.fit(df[['area']],df.price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [82]:
model.predict([[2600]])  # predict expects a 2D input: one row, one feature (area)

array([ 580857.18730554])

<h4 style='color:purple'>Use label encoder to convert states into labels</h4>

In [65]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [96]:
df_le = df.copy()  # work on a copy so the original df keeps its state names
df_le.state = le.fit_transform(df_le.state)
df_le

Unnamed: 0,state,area,price
0,2,2600,550000
1,2,3000,565000
2,2,3200,610000
3,2,3600,680000
4,2,4000,725000
5,0,2600,850000
6,0,2800,900000
7,0,3300,925000
8,0,3600,985000
9,1,2600,350000


In [83]:
model.fit(df_le[['state','area']],df_le.price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [84]:
model.predict([[2,2600]])

array([ 410989.96217632])

Here the model predicts a price of ~410000 for a 2600 square ft home in new jersey, but the actual 
price is 550000, so label encoding does not work well. Assigning integers to the states 
implies an order, california < georgia < new jersey, and linear regression treats that order as 
meaningful. It is not: no state is greater than another.
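
The ordering problem can be made concrete with a small sketch in plain Python. It assumes, as the encoded table above suggests, that the label codes follow the alphabetical order of the state names. The `weight` value is a made-up number for illustration, not a coefficient fitted by the model.

```python
# Plain-Python sketch of how integer labels impose an order on the states.
states = ["new jersey", "california", "georgia"]

# Assumption: codes follow alphabetical order, matching the encoded table above.
codes = {state: code for code, state in enumerate(sorted(states))}
print(codes)

# A linear model adds weight * code to its prediction. The weight below is a
# made-up number for illustration only, not a fitted coefficient.
weight = -100000
contribution = {state: weight * code for state, code in codes.items()}

# Whatever the weight, georgia is forced to sit exactly halfway between
# california and new jersey, which the real prices do not support.
gap_ca_ga = contribution["georgia"] - contribution["california"]
gap_ga_nj = contribution["new jersey"] - contribution["georgia"]
print(gap_ca_ga == gap_ga_nj)
```

Because a single weight multiplies the code, the model cannot give each state its own independent price offset.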

**Here the state variable is categorical, so it needs special handling. We create one new column per category in our data frame, a technique called one hot encoding. These new columns are called dummy variables.**
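
As a rough sketch of the idea, one hot encoding can be written by hand in plain Python. The helper name `one_hot` is hypothetical and used only for illustration. The columns are assumed to be ordered alphabetically by state name.

```python
# Hand-rolled one hot encoding, for illustration only.
states = ["new jersey", "california", "georgia", "california"]

# One column per distinct category, here in alphabetical order (an assumption).
categories = sorted(set(states))

def one_hot(state):
    """Return a list with 1 in the column for `state` and 0 elsewhere."""
    return [int(state == category) for category in categories]

encoded = [one_hot(state) for state in states]
for state, row in zip(states, encoded):
    print(state, row)
```

Each row contains exactly one 1, so no state is ranked above another.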

<h2 style='color:purple'>Using pandas to create dummy variables</h2>

In [69]:
df = pd.read_csv("homeprices.csv")
dummies = pd.get_dummies(df.state)
dummies

Unnamed: 0,california,georgia,new jersey
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,1,0


In [70]:
df_dummies= pd.concat([df,dummies],axis='columns')
df_dummies

Unnamed: 0,state,area,price,california,georgia,new jersey
0,new jersey,2600,550000,0,0,1
1,new jersey,3000,565000,0,0,1
2,new jersey,3200,610000,0,0,1
3,new jersey,3600,680000,0,0,1
4,new jersey,4000,725000,0,0,1
5,california,2600,850000,1,0,0
6,california,2800,900000,1,0,0
7,california,3300,925000,1,0,0
8,california,3600,985000,1,0,0
9,georgia,2600,350000,0,1,0


In [71]:
df_dummies.drop('state',axis='columns',inplace=True)
df_dummies

Unnamed: 0,area,price,california,georgia,new jersey
0,2600,550000,0,0,1
1,3000,565000,0,0,1
2,3200,610000,0,0,1
3,3600,680000,0,0,1
4,4000,725000,0,0,1
5,2600,850000,1,0,0
6,2800,900000,1,0,0
7,3300,925000,1,0,0
8,3600,985000,1,0,0
9,2600,350000,0,1,0


<h3 style='color:purple'>Dummy Variable Trap</h3>

When you can derive one variable from the other variables, they are said to be multicollinear. Here,
if you know the values of california and georgia, you can infer the value of new jersey: it is 1 
exactly when california=0 and georgia=0. These state variables are therefore multicollinear. In this
situation linear regression may not work as expected, hence you need to drop one column. 

**NOTE: sklearn's LinearRegression still works even if you don't drop one of the 
    state columns. However, we should make a habit of handling the dummy variable
    trap ourselves, in case the library you are using does not handle it for you.**
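
The redundancy behind the dummy variable trap can be checked with a short plain-Python sketch. The state list mirrors the ten rows of the table above. The variable names are chosen for illustration only.

```python
# Show that the three dummy columns are redundant: each one can be
# derived from the other two.
states = ["new jersey"] * 5 + ["california"] * 4 + ["georgia"]
categories = sorted(set(states))  # ['california', 'georgia', 'new jersey']

dummies = [[int(state == category) for category in categories] for state in states]

# Every row sums to 1, so the three columns together duplicate the constant
# column that the intercept already provides.
row_sums = [sum(row) for row in dummies]
print(row_sums)

# new jersey is fully determined by the other two columns.
recovered_new_jersey = [1 - row[0] - row[1] for row in dummies]
print(recovered_new_jersey == [row[2] for row in dummies])

# Dropping one column (here the last, new jersey) removes the redundancy.
reduced = [row[:-1] for row in dummies]
```

After the drop, a row of all zeros stands for the dropped state, which then acts as the baseline for the other two.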

In [85]:
df_dummies.drop('new jersey',axis='columns',inplace=True)

In [87]:
df_dummies

Unnamed: 0,area,price,california,georgia
0,2600,550000,0,0
1,3000,565000,0,0
2,3200,610000,0,0
3,3600,680000,0,0
4,4000,725000,0,0
5,2600,850000,1,0
6,2800,900000,1,0
7,3300,925000,1,0
8,3600,985000,1,0
9,2600,350000,0,1


In [88]:
model.fit(df_dummies.drop('price',axis='columns'),df_dummies.price)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [90]:
model.predict([[2600,0,0]]) # 2600 sqr ft home in new jersey

array([ 545850.3547624])

In [92]:
model.predict([[2600,0,1]]) # 2600 sqr ft home in georgia

array([ 339459.79359276])

In [93]:
model.predict([[2600,1,0]]) # 2600 sqr ft home in california

array([ 859013.11545903])

<h2 style='color:purple'>Using sklearn OneHotEncoder</h2>

In [97]:
df_le

Unnamed: 0,state,area,price
0,2,2600,550000
1,2,3000,565000
2,2,3200,610000
3,2,3600,680000
4,2,4000,725000
5,0,2600,850000
6,0,2800,900000
7,0,3300,925000
8,0,3600,985000
9,1,2600,350000


In [105]:
X = df_le[['state','area']].values

In [101]:
y = df_le.price.values

In [106]:
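from sklearn.preprocessing import OneHotEncoder
# Note: categorical_features was removed in scikit-learn 0.22, so this line
# only runs on older versions; newer versions use ColumnTransformer instead.
ohe = OneHotEncoder(categorical_features=[0])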

In [108]:
X = ohe.fit_transform(X).toarray()

In [109]:
X = X[:,1:]  # drop the first dummy column (california) to avoid the dummy variable trap

In [110]:
X

array([[  0.00000000e+00,   1.00000000e+00,   2.60000000e+03],
       [  0.00000000e+00,   1.00000000e+00,   3.00000000e+03],
       [  0.00000000e+00,   1.00000000e+00,   3.20000000e+03],
       [  0.00000000e+00,   1.00000000e+00,   3.60000000e+03],
       [  0.00000000e+00,   1.00000000e+00,   4.00000000e+03],
       [  0.00000000e+00,   0.00000000e+00,   2.60000000e+03],
       [  0.00000000e+00,   0.00000000e+00,   2.80000000e+03],
       [  0.00000000e+00,   0.00000000e+00,   3.30000000e+03],
       [  0.00000000e+00,   0.00000000e+00,   3.60000000e+03],
       [  1.00000000e+00,   0.00000000e+00,   2.60000000e+03],
       [  1.00000000e+00,   0.00000000e+00,   2.90000000e+03],
       [  1.00000000e+00,   0.00000000e+00,   3.10000000e+03],
       [  1.00000000e+00,   0.00000000e+00,   3.60000000e+03]])

In [111]:
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [118]:
model.predict([[0,1,2600]]) # 2600 sqr ft home in new jersey

array([ 545850.35476241])

In [114]:
model.predict([[0,0,2600]]) # 2600 sqr ft home in california

array([ 859013.11545903])

In [115]:
model.predict([[1,0,2600]]) # 2600 sqr ft home in georgia

array([ 339459.79359276])