# Handling Categorical Data
All data passed to a Scikit-Learn estimator must be numeric. Let's select a mix of string and numeric columns and see what happens when we try to fit a model on them.

In [2]:
import pandas as pd
import numpy as np

housing = pd.read_csv('data/kaggle_housing.csv')
housing.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
h = housing[['LotShape', 'LandContour', 'Neighborhood', 'OverallQual', 'WoodDeckSF']]
h.head()

Unnamed: 0,LotShape,LandContour,Neighborhood,OverallQual,WoodDeckSF
0,Reg,Lvl,CollgCr,7,0
1,Reg,Lvl,Veenker,6,298
2,IR1,Lvl,CollgCr,7,0
3,IR1,Lvl,Crawfor,7,0
4,IR1,Lvl,NoRidge,8,192


In [4]:
h.isna().sum()

LotShape        0
LandContour     0
Neighborhood    0
OverallQual     0
WoodDeckSF      0
dtype: int64

In [5]:
X = h.values
y = housing['SalePrice'].values

In [6]:
X

array([['Reg', 'Lvl', 'CollgCr', 7, 0],
       ['Reg', 'Lvl', 'Veenker', 6, 298],
       ['IR1', 'Lvl', 'CollgCr', 7, 0],
       ..., 
       ['Reg', 'Lvl', 'Crawfor', 7, 0],
       ['Reg', 'Lvl', 'NAmes', 5, 366],
       ['Reg', 'Lvl', 'Edwards', 5, 736]], dtype=object)

In [7]:
y

array([208500, 181500, 223500, ..., 266500, 142125, 147500])

### Try to fit the model :(

In [13]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

ValueError: could not convert string to float: 'Edwards'
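
The error comes from the string columns that ended up in `X`. A quick way to spot them before fitting is to ask pandas for the non-numeric columns. This is a minimal sketch on a made-up three-row frame; `toy` and its values are illustrative, not taken from the housing file.

```python
import pandas as pd

# Hypothetical miniature of the housing subset (values are made up)
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg'],
    'Neighborhood': ['CollgCr', 'Veenker', 'Edwards'],
    'OverallQual': [7, 6, 5],
})

# Any column listed here would make LinearRegression.fit raise a ValueError
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```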

## This is the worst part of scikit-learn
Other languages, such as R, handle string columns internally. Scikit-learn has no standard way of handling string columns, although there are many workarounds.

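## This was addressed in scikit-learn version 0.20
A lot of work went into a new estimator, **`CategoricalEncoder`**, planned for version 0.20. In the actual 0.20 release, that work shipped as string support in `OneHotEncoder` and a new `OrdinalEncoder` rather than as a separate estimator.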
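
For reference, encoding a string column with scikit-learn itself can be sketched as follows. This assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string input; the single-column `toy` frame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single string column (values are made up)
toy = pd.DataFrame({'LotShape': ['Reg', 'IR1', 'Reg', 'IR2']})

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')

# The default result is a sparse matrix; convert it to a dense array for display
encoded = enc.fit_transform(toy[['LotShape']]).toarray()

print(enc.categories_[0])  # categories learned for the column, in sorted order
print(encoded)
```

Unlike `pd.get_dummies`, the fitted encoder remembers its categories, so the same columns can be produced later for new data.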

## Can use `pd.get_dummies` 
The pandas function `pd.get_dummies` will suffice for now. It converts every string column into 0/1 indicator columns, one per unique value, a technique known as **one-hot encoding**. We can transform the entire DataFrame this way: pandas leaves the numeric columns untouched and places the newly encoded columns after them.

In [9]:
h.head()

Unnamed: 0,LotShape,LandContour,Neighborhood,OverallQual,WoodDeckSF
0,Reg,Lvl,CollgCr,7,0
1,Reg,Lvl,Veenker,6,298
2,IR1,Lvl,CollgCr,7,0
3,IR1,Lvl,Crawfor,7,0
4,IR1,Lvl,NoRidge,8,192


In [10]:
h_dummies = pd.get_dummies(h)
# One-hot encoding expands the number of columns: for every unique value
# in each string column, it creates a new 0/1 indicator column.
# R does something similar in the background by default.
h_dummies.head()

Unnamed: 0,OverallQual,WoodDeckSF,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,7,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,6,298,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2,7,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,7,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,8,192,1,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
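
One version caveat: the 0/1 integers shown above are what older pandas releases produce. From pandas 2.0 onward, `pd.get_dummies` returns boolean `True`/`False` columns by default, and passing `dtype=int` restores the integer output. A minimal sketch on made-up data (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical two-column frame: one string column, one numeric column
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg'],
    'OverallQual': [7, 6, 7],
})

# dtype=int forces 0/1 indicators regardless of the pandas version
toy_dummies = pd.get_dummies(toy, dtype=int)

# The numeric column comes first, followed by the encoded columns
print(toy_dummies.columns.tolist())
```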


### Only the string columns were encoded
The numeric columns were left alone. The **`nunique`** method returns the number of unique values in each column, which tells you how wide your DataFrame will become after the encoding.

In [11]:
h.nunique()

LotShape          4
LandContour       4
Neighborhood     25
OverallQual      10
WoodDeckSF      274
dtype: int64

In [12]:
h_dummies.shape

(1460, 35)
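
The width can be predicted from the `nunique` counts: the three string columns contribute 4 + 4 + 25 = 33 indicator columns, and the two numeric columns bring the total to 35. The same check can be sketched on a made-up frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical frame with two string columns and one numeric column
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR2'],
    'LandContour': ['Lvl', 'Lvl', 'Bnk', 'Lvl'],
    'OverallQual': [7, 6, 7, 5],
})

# Each string column becomes one indicator column per unique value
string_cols = toy.select_dtypes(exclude='number')
n_numeric = toy.shape[1] - string_cols.shape[1]
expected_width = int(string_cols.nunique().sum()) + n_numeric

# Compare the prediction against what get_dummies actually produces
actual_width = pd.get_dummies(toy).shape[1]
print(expected_width, actual_width)
```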

In [15]:
newX = h_dummies.values
y = housing['SalePrice'].values

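## Still must fill in missing values
The five columns chosen above contain no missing values (see the `isna` counts), so imputation leaves this data unchanged. The step becomes necessary once columns with gaps are included.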

In [24]:
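# SimpleImputer (sklearn.impute, version 0.20+) replaces the older
# sklearn.preprocessing.Imputer, which was removed in version 0.22
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
newX_filled = imp.fit_transform(newX)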
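
To see what mean imputation does on data that actually has gaps, here is a minimal sketch on a made-up array. It assumes scikit-learn 0.20 or later, where the imputer is available as `SimpleImputer` in `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array with one missing value in each column
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])

# Each NaN is replaced by the mean of the observed values in its column:
# column 0 has mean (1 + 3) / 2 = 2, column 1 has mean (4 + 8) / 2 = 6
filled = SimpleImputer(strategy='mean').fit_transform(toy)
print(filled)
```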

In [27]:
lr2 = LinearRegression()
lr2.fit(newX_filled, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [28]:
y_pred_under = lr2.predict(newX_filled)

In [29]:
lr2.score(newX_filled, y)

0.72889104880912292
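
For a regressor, `score` returns the coefficient of determination R², here computed on the same data the model was trained on. The sketch below recomputes R² by hand on synthetic data to show what the number means; the data, coefficients, and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus noise (all values are made up)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X_toy, y_toy)
pred = model.predict(X_toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_toy, y_toy))
```

Because this is a training-set score, it tends to be optimistic, which is why the cross-validated score is the more useful figure.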

In [30]:
from sklearn.model_selection import KFold, cross_val_score
# Shuffle rows before splitting; without a fixed random_state the score varies slightly between runs
kf = KFold(n_splits=5, shuffle=True)
cross_val_score(lr2, newX_filled, y, cv=kf).mean()

0.70976306151069091
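
An alternative to encoding the whole DataFrame up front is to put the encoding inside a pipeline, so that each cross-validation fold fits the encoder on its own training rows. The sketch below assumes scikit-learn 0.20 or later (for `ColumnTransformer`) and uses synthetic data; the frame, the price formula, and all variable names are made up for illustration and are not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the housing subset (all values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'LotShape': rng.choice(['Reg', 'IR1', 'IR2'], size=60),
    'OverallQual': rng.integers(1, 11, size=60),
})
price = (20.0 * toy['OverallQual'].to_numpy()
         + 10.0 * (toy['LotShape'] == 'Reg').to_numpy()
         + rng.normal(scale=5.0, size=60))

# One-hot encode the string column; pass the numeric column through unchanged
pre = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['LotShape'])],
    remainder='passthrough',
)
pipe = make_pipeline(pre, LinearRegression())

# A fixed random_state makes the shuffled folds reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, toy, price, cv=kf)
print(scores.mean())
```

With `handle_unknown='ignore'`, a category that appears only in a test fold is encoded as all zeros instead of raising an error.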