## One-Hot Encoding

Most machine learning algorithms only accept numerical input. Categorical data, by contrast, takes one of a limited set of values. For example, a column storing car brands is categorical, because its values are labels such as Honda, Toyota, Ford, or None. These values must be encoded as numbers before being passed to a machine learning model in Python.

The most popular approach for handling categorical data is *one-hot encoding*. One-hot encoding creates new binary columns, each indicating the presence of one possible value from the original data. It works very well unless the categorical variable takes on a large number of values (as a rule of thumb, you generally won't use it for variables with more than 15 distinct values; it can also be a poor choice in some cases with fewer values, though that varies) [1].
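
As a minimal sketch of the idea, the toy example below one-hot encodes a single car-brand column with `pd.get_dummies`. The `cars` frame and its values are made up for illustration and are not part of the Iowa dataset; `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three distinct brands.
cars = pd.DataFrame({"brand": ["Honda", "Toyota", "Ford", "Honda"]})

# get_dummies creates one indicator column per distinct value.
encoded = pd.get_dummies(cars, columns=["brand"], dtype=int)

print(encoded.columns.tolist())
# ['brand_Ford', 'brand_Honda', 'brand_Toyota']
```

Each row has exactly one indicator set to 1, which is where the name "one-hot" comes from.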
 
In this notebook we use the home prices dataset from Iowa, USA.

(Inspired by a tutorial published at www.kaggle.com)

[1] https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding/notebook

In [1]:
import numpy as np
import pandas as pd
    
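# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute is its replacement.
from sklearn.impute import SimpleImputer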
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


df_train = pd.read_csv('data/iowa_train.csv')
df_test = pd.read_csv('data/iowa_test.csv')

# Drop houses where the target value for prediction is not defined (!)
df_train.dropna(axis=0, subset=['SalePrice'], inplace=True)

print("Train size={} Test size={}".format(df_train.shape[0], df_test.shape[0]))

Train size=1460 Test size=1459


In [2]:
# Drop columns that have missing values in the training set.
# Also drop Id (an identifier, not a predictor) and SalePrice (the target).
cols_with_missing = [col for col in df_train.columns 
                                 if df_train[col].isnull().any()]                                  

# Loading predictors and target (X/y) 
X_train = df_train.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
y_train = df_train.SalePrice
X_test  = df_test.drop(['Id'] + cols_with_missing, axis=1)

In [3]:
# Now just considering numeric columns and columns with low cardinality
# "cardinality" means the number of unique values in a column.
# We use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in X_train.columns if 
                                X_train[cname].nunique() < 10 and
                                X_train[cname].dtype == "object"]
numeric_cols = [cname for cname in X_train.columns if 
                                X_train[cname].dtype in ['int64', 'float64']]
cols = low_cardinality_cols + numeric_cols

X_train = X_train[cols]
X_test  = X_test[cols]
print("Number of columns used as predictors = {}".format(len(cols)))
X_train.dtypes.sample(10)

Number of columns used as predictors = 57


RoofStyle        object
HeatingQC        object
MSSubClass        int64
Fireplaces        int64
HouseStyle       object
ExterQual        object
MiscVal           int64
LandSlope        object
EnclosedPorch     int64
BsmtUnfSF         int64
dtype: object
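
To make the cardinality filter from the cell above concrete, here is a small sketch on a made-up frame. The column names, the values, and the threshold of 3 are all assumptions chosen for illustration; the notebook itself uses a threshold of 10 on the Iowa data.

```python
import pandas as pd

# Hypothetical toy frame; names and values are invented for illustration.
toy = pd.DataFrame({
    "color": ["red", "blue", "red", "red"],   # 2 unique values
    "city": ["A", "B", "C", "D"],             # 4 unique values
    "area": [8450, 9600, 11250, 9550],        # numeric column
})

max_cardinality = 3  # illustrative threshold only

# Treat every non-numeric column as a candidate categorical column.
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]

# Keep only the candidates with few distinct values.
low_card = [c for c in non_numeric if toy[c].nunique() < max_cardinality]

print(low_card)
# ['color']
```

Here `city` is excluded because one-hot encoding it would add as many columns as there are rows, which is the situation the cardinality threshold is meant to avoid.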

### Function to return MAE using cross_val_score (k-fold cross-validation)

In [4]:
def get_mae(X, y):
    # scikit-learn scorers follow a "higher is better" convention, so MAE is
    # returned as a negative number; multiply by -1 to get a positive MAE.
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()
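
The sign convention is easy to check on synthetic data. The sketch below is not part of the original notebook: the data, `random_state`, the explicit `cv=5`, and the smaller forest are assumptions chosen to keep the example fast and reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, invented purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X[:, 0] * 10.0 + rng.normal(size=120)

scores = cross_val_score(
    RandomForestRegressor(n_estimators=20, random_state=0),
    X, y,
    cv=5,  # one score per fold
    scoring="neg_mean_absolute_error",
)

# Every fold score is <= 0 because scikit-learn negates the error,
# so negating the mean gives an ordinary positive MAE.
mae = -scores.mean()
print(scores.shape, mae > 0)
```

Averaging over folds gives a steadier estimate than a single train/validation split, which is why `get_mae` reports the mean.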

### Trying some different configurations

In [5]:
# Model 1 - Not considering categorical columns
X_train_new = X_train.select_dtypes(exclude=['object'])

print('MAE of SalePrice when Dropping Categoricals = {:,.0f}'.format(get_mae(X_train_new, y_train)))

MAE of SalePrice when Dropping Categoricals = 18,465


In [6]:
# Model 2 - Using One-Hot Encoding (get_dummies)
X_train_new = pd.get_dummies(X_train)

print('MAE of SalePrice with One-Hot Encoding = {:,.0f}'.format(get_mae(X_train_new, y_train)))

MAE of SalePrice with One-Hot Encoding = 18,036


In [7]:
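# Model 3 - One-hot encode train and test separately, then align them
# so that the test set ends up with exactly the training columns
# (join='left' keeps the columns of the left frame, X_train_new).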

X_train_new = pd.get_dummies(X_train)
X_test_new = pd.get_dummies(X_test)
a,b = X_train_new.align(X_test_new, join='left', axis=1)
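
What `align` does here is easier to see on a tiny example. In the sketch below the two frames are made up: the test set lacks a category that appears in training and contains one that training never saw. Filling the resulting `NaN` values with 0 is an assumption about how one might handle the missing indicator columns, not something the original cell does.

```python
import pandas as pd

# Hypothetical toy data: test lacks "Ford" and has an unseen "Kia".
train = pd.DataFrame({"brand": ["Honda", "Toyota", "Ford"]})
test = pd.DataFrame({"brand": ["Honda", "Kia"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# join='left' keeps exactly the training columns, in the training order.
train_aligned, test_aligned = train_enc.align(test_enc, join="left", axis=1)

# Columns absent from the test set come back as NaN; 0 is a natural
# fill value for indicator columns (an assumption for this sketch).
test_aligned = test_aligned.fillna(0).astype(int)

print(test_aligned.columns.tolist())
# ['brand_Ford', 'brand_Honda', 'brand_Toyota']
```

The unseen `brand_Kia` column is dropped, and `brand_Ford` and `brand_Toyota` are added to the test frame, so a model fitted on the training columns can be applied to the test data without a shape mismatch.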
