In this notebook we will use different methods to deal with categorical data.

To build a model we need numbers. Categorical data cannot be used directly, because models do not understand values such as `"Never"`, `"Rarely"`, `"Most days"`, or `"Every day"`. We therefore need a way to convert these values into numbers a model can work with. Alternatively, we can simply drop the categorical columns from the data set.

There are three main approaches to dealing with categorical data.

## 1) Dropping categorical data

We can simply drop the categorical columns. This is a quick and common way to deal with the problem, but we may miss important patterns in the data.

## 2) Ordinal encoding

If the categorical data is ordinal, meaning that its values have a natural order or hierarchy (such as a frequency), we can assign an integer to each value: `0`, `1`, `2`, and `3` for `"Never"`, `"Rarely"`, `"Most days"`, and `"Every day"`.
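As a minimal sketch of this idea in plain Python, the position of each category in an explicitly ordered list can serve as its code. The names `frequency_order`, `frequency_to_code`, and `responses`, as well as the responses themselves, are made up for illustration.

```python
# Hypothetical ordered categories, from lowest to highest frequency.
frequency_order = ["Never", "Rarely", "Most days", "Every day"]

# Map each category to its position in the ordering: Never -> 0, ..., Every day -> 3.
frequency_to_code = {category: code for code, category in enumerate(frequency_order)}

# Made-up responses to encode.
responses = ["Rarely", "Every day", "Never", "Most days", "Rarely"]
encoded = [frequency_to_code[answer] for answer in responses]

print(encoded)  # [1, 3, 0, 2, 1]
```

Note that scikit-learn's `OrdinalEncoder`, used later in this notebook, assigns codes by sorting the category values unless an explicit order is passed through its `categories` parameter, so the resulting integers do not necessarily follow the meaningful order.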

## 3) One-hot encoding

If the categorical data is **nominal**, meaning that its values have no natural order, we can use one-hot encoding. It creates one column for each possible value and assigns a 1 if the entry has that value and a 0 otherwise.

For example, this table:

```
Index   Color
1       Red
2       Green
3       Blue
4       Blue

```

Would produce this table:

```
Index   Red     Green   Blue
1       1       0       0
2       0       1       0
3       0       0       1
4       0       0       1
```

This strategy generally does not work well when a variable takes a large number of distinct values, because each distinct value adds a column.
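The color example can be reproduced with a few lines of pandas. This is only a sketch: it uses `pandas.get_dummies` rather than the `OneHotEncoder` used later in this notebook, and the names `colors` and `one_hot` are chosen for illustration.

```python
import pandas as pd

# The example table from above, with Index as the row index.
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Blue"]}, index=[1, 2, 3, 4])

# One column per distinct value; dtype=int gives 1/0 instead of True/False.
one_hot = pd.get_dummies(colors["Color"], dtype=int)

# get_dummies orders the new columns alphabetically, so reorder them to match the table.
one_hot = one_hot[["Red", "Green", "Blue"]]

print(one_hot)
```

The result has one column per distinct value of `Color`, which is why the number of columns grows quickly for variables with many distinct values.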

In [289]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

In [267]:
train_data = pd.read_csv('../data/train.csv', index_col= 'Id')
test_data = pd.read_csv('../data/test.csv', index_col= 'Id')


In [268]:
y = train_data['SalePrice']
X = train_data.drop('SalePrice',axis=1)
nan_columns = X.columns[X.isna().any()].tolist() # names of all columns that contain at least one NaN value
X.drop(nan_columns, axis=1, inplace=True) 
X_test = test_data.drop(nan_columns, axis=1)

In [269]:
X_train, X_val, y_train, y_val = train_test_split(X,y, train_size= 0.8, test_size=0.2, random_state= 0)

In [270]:
def mae_score(X_train, X_val, y_train, y_val):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train,y_train) # fit model with data
    predictions = model.predict(X_val) # create predictions
    return mean_absolute_error(y_val, predictions) # estimate mae of predictions based on the known values

# Column drop method

In [271]:
drop_X_train = X_train.select_dtypes(exclude='object')
drop_X_valid = X_val.select_dtypes(exclude='object')
mae_score(drop_X_train, drop_X_valid, y_train, y_val)

17837.82570776256

The MAE for the drop method was `17837.82570776256`.
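The score reported here is the mean absolute error (MAE): the average of the absolute differences between the true and the predicted values. A small sketch with made-up numbers (not taken from the housing data) computes it by hand:

```python
# Made-up true values and predictions, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Average of the absolute differences between true and predicted values.
absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(absolute_errors) / len(absolute_errors)

print(mae)  # 0.5
```

`mean_absolute_error` from scikit-learn, which `mae_score` calls, computes the same quantity. A lower value means the predictions are closer to the known sale prices on average.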

# Ordinal encoding

Ordinal encoding only applies to categorical data, so we can set the numerical columns aside and select only the categorical ones. We can do this using the `dtypes` of the columns: in this data set, a column that holds categorical data has `dtype` `object`. We store these columns in a variable called `obj_columns`.

In [272]:
obj_columns = (X_train.dtypes==object)[X_train.dtypes==object]

Within `obj_columns` we have columns that we can encode safely and columns that would raise an error if we tried to encode them. Why would a column raise an error?

To encode columns we first need to fit the encoder on the training data with `OrdinalEncoder().fit()`. The problem is that the training data and the validation data are different: some values present in the validation data may not be present in the training data. If we fit on the training data and then try to transform the validation data, we get an error because the validation data contains values the encoder has never seen.
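The failure described above can be reproduced on toy data. The sketch below also shows an alternative that this notebook does not use: asking the encoder to map unseen values to a sentinel through `handle_unknown="use_encoded_value"` and `unknown_value=-1`, which requires scikit-learn 0.24 or later. The arrays `train` and `val` and their values are made up.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data: the validation column contains "Gravel", which never appears in training.
train = np.array([["Paved"], ["Dirt"], ["Paved"]], dtype=object)
val = np.array([["Dirt"], ["Gravel"]], dtype=object)

# Default behaviour: transforming an unseen category raises a ValueError.
strict_encoder = OrdinalEncoder().fit(train)
try:
    strict_encoder.transform(val)
except ValueError as error:
    print("transform failed:", error)

# Alternative (not used in this notebook): map unseen categories to -1.
tolerant_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant_encoder.fit(train)
encoded_val = tolerant_encoder.transform(val)

print(encoded_val)
```

Here `"Dirt"` keeps the code it was given during fitting, while `"Gravel"` is encoded as `-1`. Whether a sentinel value is acceptable depends on the model; the notebook instead removes the affected columns.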

To fix this we can use Python's `set()` and `issubset()` to check which columns contain values in the validation data that do not occur in the training data.

First we get the columns whose validation values all occur in the training data and store their names in `pos_columns`.
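A minimal sketch of the subset test in plain Python, using dictionaries of lists in place of DataFrame columns. The values and the names `train_values`, `val_values`, `safe_columns`, and `unsafe_columns` are made up for illustration.

```python
# Hypothetical values per column in the training and validation splits.
train_values = {"Street": ["Pave", "Grvl", "Pave"], "RoofMatl": ["CompShg", "Tar&Grv"]}
val_values = {"Street": ["Pave", "Grvl"], "RoofMatl": ["CompShg", "Metal"]}

# A column is safe to encode if every validation value also occurs in training.
safe_columns = [col for col in train_values
                if set(val_values[col]).issubset(set(train_values[col]))]

# Every other column contains at least one value the encoder would not have seen.
unsafe_columns = [col for col in train_values if col not in safe_columns]

print(safe_columns)    # ['Street']
print(unsafe_columns)  # ['RoofMatl']
```

`RoofMatl` fails the test because `"Metal"` appears only in the validation values.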

In [273]:
pos_columns = [col for col in obj_columns.index.tolist() if
            set(X_val[col]).issubset(set(X_train[col]))]

We can then subtract `pos_columns` from the full list of categorical columns to get the columns that did not pass the subset test. We store these column names in `neg_columns`.

In [274]:
neg_columns = list(set(obj_columns.index.tolist())-set(pos_columns))

Now that we have the names of the columns we can safely encode, we can fit the encoder on `pos_columns`.

In [275]:
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train[pos_columns])

Now we transform the training and validation data. We copy `X_train` and `X_val` so that we manipulate copies instead of the originals. These copies should only contain the columns we want to process, so the columns stored in `neg_columns` have to be removed.

In [276]:
ordinal_X_train = X_train.copy()
ordinal_X_val = X_val.copy()

We now drop the problematic columns stored in `neg_columns`.

In [277]:
ordinal_X_train.drop(neg_columns, axis='columns', inplace=True)
ordinal_X_val.drop(neg_columns, axis='columns',inplace=True)

In [278]:
ordinal_X_train[pos_columns] = ordinal_encoder.transform(ordinal_X_train[pos_columns])
ordinal_X_val[pos_columns] = ordinal_encoder.transform(ordinal_X_val[pos_columns])

In [279]:
mae_score(ordinal_X_train,ordinal_X_val,y_train,y_val)

17098.01649543379

The MAE for ordinal encoding was **17098.01649543379**.

# One-hot encoding

In [369]:
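pos_OH_columns = X_train[obj_columns.index].nunique()<10 # boolean mask: categorical columns with fewer than 10 unique values
pos_OH_columns= pos_OH_columns[pos_OH_columns].index.tolist() # names of the columns to one-hot encode
neg_OH_columns = list(set(obj_columns.index)-set(pos_OH_columns)) # names of the remaining categorical columns, which are not encoded
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) # in scikit-learn >= 1.2 the `sparse` parameter is named `sparse_output`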
OH_X_train = pd.DataFrame(OH_encoder.fit_transform(X_train[pos_OH_columns]))
OH_X_val = pd.DataFrame(OH_encoder.transform(X_val[pos_OH_columns]))
OH_X_train.index = X_train.index
OH_X_val.index = X_val.index
OH_X_train_final = X_train.drop(obj_columns.index, axis=1)
OH_X_val_final = X_val.drop(obj_columns.index, axis=1)

OH_X_train_final = pd.concat([OH_X_train_final,OH_X_train], axis=1)
OH_X_val_final = pd.concat([OH_X_val_final,OH_X_val], axis=1)

OH_X_train_final.columns = OH_X_train_final.columns.astype(str)
OH_X_val_final.columns = OH_X_val_final.columns.astype(str)

mae_score(OH_X_train_final,OH_X_val_final, y_train, y_val)




17525.345719178084

The MAE for one-hot encoding was **17525.345719178084**.

Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,112,113,114,115,116,117,118,119,120,121
619,20,11694,9,5,2007,2007,48,0,1774,1822,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
871,20,6600,5,5,1962,1962,0,0,894,894,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
93,30,13360,5,7,1921,2006,713,0,163,876,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
818,20,13265,8,5,2002,2002,1218,0,350,1568,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
303,20,13704,7,5,2001,2002,0,0,1541,1541,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
764,60,9430,8,5,1999,1999,1163,0,89,1252,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
836,20,9600,4,7,1950,1995,442,0,625,1067,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1217,90,8930,6,5,1978,1978,0,0,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
560,120,3196,7,5,2003,2004,0,0,1374,1374,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0

