## How to handle Categorical Variables in a dataset - 28/Jun/2020
> #### A <font color=red>*categorical variable*</font> takes only a limited number of distinct values across the rows of a column.
> - Proper handling of categorical feature columns can have a significant impact on model performance.
> #### There are multiple ways to handle a categorical feature column:
    * Dropping the column data
    * Label Encoding 
    * One-Hot Encoding 
    * Dummy Coding Scheme
    * Effect Coding Scheme
    * Bin Counting Scheme
    * Feature Hashing Scheme
#### We will compare the model performance obtained with the first three techniques in this notebook
#### Dataset reference - https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home
#### Notebook references - 
* https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63
* https://www.kaggle.com/alexisbcook/categorical-variables
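
Of the schemes listed above that this notebook does not evaluate, dummy coding is the closest relative of one-hot encoding. The following is only a minimal sketch on a made-up `toy` column (not the Melbourne data), assuming `pd.get_dummies` is an acceptable stand-in for both schemes:

```python
import pandas as pd

# Hypothetical toy column; the values mimic the 'Type' column of the dataset
toy = pd.DataFrame({'Type': ['h', 'u', 't', 'h']})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy['Type'], prefix='Type')

# Dummy coding: the first category is dropped and becomes the all-zeros baseline
dummy = pd.get_dummies(toy['Type'], prefix='Type', drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

Dummy coding produces one column fewer per feature because the dropped category is implied when all remaining indicators are zero.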


In [62]:
# importing the required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [3]:
data = pd.read_csv('./melb_data.csv')

## Step 1 - Data Preparation
* Identify the list of columns with missing values. For simplicity, we will drop the columns with missing data.
* Identify the list of categorical columns
* Identify the list of numerical columns
* Use the columns identified above for predictions
* Split the data into training and validation datasets

In [4]:
# exploring the dataset
data.head()

,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [39]:
# Copying the original dataset for further processing
data_copy = data.copy()

In [40]:
# Identifying the columns with missing entries using python list comprehension

empty_cols = [col for col in data.columns if data[col].isnull().any()]
print('List of empty columns ',empty_cols)

List of empty columns  ['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']
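
Before dropping columns outright, it can help to see how much each one is actually missing. A minimal sketch, assuming a small made-up frame `toy` in place of `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3],
    'Car': [1.0, np.nan, 2.0, 0.0],
    'BuildingArea': [np.nan, 79.0, np.nan, 142.0],
})

# Number of missing entries per column, keeping only columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()
print(missing_share)
```

In the `data.info()` output above, `Car` has 13518 non-null values out of 13580, while `BuildingArea` has only 7130, so dropping both treats a nearly complete column the same as a roughly half-empty one.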


In [42]:
# Dropping the columns which contain missing data
data_copy.drop(columns=empty_cols, inplace=True)

In [45]:
# Identifying the categorical columns
# Here, categorical columns are text (object) columns with low cardinality, i.e. fewer than 10 unique values

low_card_cols = [lowcol for lowcol in data_copy.columns if data_copy[lowcol].nunique() < 10 and data_copy[lowcol].dtype == 'object']
print('List of categorical columns ',low_card_cols)

List of categorical columns  ['Type', 'Method', 'Regionname']
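
To see why only some text columns pass a cardinality filter, the unique-value counts can be inspected directly. This is a sketch on a made-up frame `toy`; the threshold `max_unique` is a hypothetical value chosen to suit the toy data, not the notebook's cut-off of 10:

```python
import pandas as pd

# Hypothetical toy frame: one low-cardinality and one high-cardinality text column
toy = pd.DataFrame({
    'Type': ['h', 'u', 't', 'h'],
    'Address': ['85 Turner St', '25 Bloomburg St', '5 Charles St', '40 Federation La'],
    'Rooms': [2, 2, 3, 3],
})

# Unique-value count for every non-numeric column, smallest first
cardinality = toy.select_dtypes(exclude='number').nunique().sort_values()
print(cardinality)

# Keep only the columns below the (hypothetical) threshold
max_unique = 4
low_card = [col for col, n in cardinality.items() if n < max_unique]
print(low_card)
```

A column such as `Address`, where almost every row has its own value, would add one category per row and is filtered out.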


In [46]:
# Identifying the numerical columns

num_cols = [numcol for numcol in data_copy.columns if data_copy[numcol].dtype in ['float64', 'int64']]

print('List of numerical columns ',num_cols)

List of numerical columns  ['Rooms', 'Price', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude', 'Propertycount']


In [47]:
# Using the categorical columns and numerical columns only
my_cols = low_card_cols + num_cols
data_copy = data_copy[my_cols]

In [53]:
data_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Type           13580 non-null  object 
 1   Method         13580 non-null  object 
 2   Regionname     13580 non-null  object 
 3   Rooms          13580 non-null  int64  
 4   Price          13580 non-null  float64
 5   Distance       13580 non-null  float64
 6   Postcode       13580 non-null  float64
 7   Bedroom2       13580 non-null  float64
 8   Bathroom       13580 non-null  float64
 9   Landsize       13580 non-null  float64
 10  Lattitude      13580 non-null  float64
 11  Longtitude     13580 non-null  float64
 12  Propertycount  13580 non-null  float64
dtypes: float64(9), int64(1), object(3)
memory usage: 1.3+ MB


In [55]:
# Separating the features as x and the label as y
y = data_copy['Price']
x = data_copy.drop(columns='Price')


In [59]:
# Splitting the data into training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=0)

In [61]:
x_train.head()

,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,0.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,193.0,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,555.0,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,265.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,673.0,-37.7623,144.8272,4217.0


## Step 2 - Creating model using the different categorical data encoding techniques:
* Approach 1 - Dropping the categorical column data
* Approach 2 - Using Label Encoding
* Approach 3 - Using One-Hot Encoding

In [63]:
# Defining a function that trains a random forest and returns its MAE on the validation set

def score_dataset(x_train, x_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(x_train, y_train)
    preds = model.predict(x_valid)
    # scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is unchanged
    return mean_absolute_error(y_valid, preds)
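
The mean absolute error returned by this function is the average of the absolute differences between actual and predicted prices. A small sketch with made-up numbers (not taken from the dataset) that checks the hand computation against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

A lower MAE means the predictions are, on average, closer to the actual sale prices, in the same unit as `Price`.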

### Approach 1 - Identifying model performance by simply dropping the categorical columns

In [64]:
drop_x_train = x_train.drop(columns=low_card_cols)
drop_x_valid = x_valid.drop(columns=low_card_cols)

In [92]:
print('MAE for Approach 1: Dropping columns- ',score_dataset(drop_x_train,drop_x_valid,y_train,y_valid))

MAE for Approach 1: Dropping columns-  175703.48185157913


### Approach 2 - Identifying model performance using the Label encoder

In [66]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_x_train = x_train.copy()
label_x_valid = x_valid.copy()

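for col in low_card_cols:
    # Fit on the training data only, then reuse the same mapping for the validation data.
    # transform() raises an error if the validation data contains a category unseen in training.
    label_x_train[col] = label_encoder.fit_transform(x_train[col])
    label_x_valid[col] = label_encoder.transform(x_valid[col])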
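
A point worth checking with label encoding is that the training and validation data share one category-to-integer mapping. A minimal sketch on made-up values, showing how an encoder refitted on a validation split that lacks a category shifts the codes:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical splits; the validation split happens to contain no 'h'
train_vals = ['h', 't', 'u', 'h']
valid_vals = ['t', 'u', 'u']

# Mapping learned from the training values: h -> 0, t -> 1, u -> 2
encoder = LabelEncoder()
encoder.fit(train_vals)
shared = encoder.transform(valid_vals)

# Refitting on the validation values alone yields t -> 0, u -> 1
refit = LabelEncoder().fit_transform(valid_vals)

print(list(shared))
print(list(refit))
```

With the refitted encoder, the integer 0 means `h` in training but `t` in validation, so the model would read the validation rows incorrectly. scikit-learn documents `LabelEncoder` for target labels; `OrdinalEncoder` is its counterpart for feature columns and may be the better fit here.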


In [94]:
print('MAE for Approach 2: Label Encoding - ',score_dataset(label_x_train,label_x_valid,y_train,y_valid))

MAE for Approach 2: Label Encoding -  165936.40548390493


### Approach 3 - Identifying model performance using One Hot encoder
* Process the categorical columns using one-hot encoding
* Drop the original categorical columns
* Concatenate the newly created one-hot encoded columns into the training and validation datasets

In [68]:
# One-hot encoding is applied only to the categorical columns of the dataset;
# it creates one additional indicator column per category

from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn 1.2 and later name this argument `sparse_output` instead of `sparse`
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(x_train[low_card_cols]))
# Reuse the encoder fitted on the training data so both sets get the same columns
OH_cols_valid = pd.DataFrame(OH_encoder.transform(x_valid[low_card_cols]))

In [84]:
# pd.DataFrame assigns a fresh default index to the encoded arrays, so the original row indices have to be restored
OH_cols_train.index = x_train.index
OH_cols_valid.index = x_valid.index

In [85]:
OH_cols_train

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
12167,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6524,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8413,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2919,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6043,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13123,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3264,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9845,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10799,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [87]:
# dropping the categorical columns
num_OH_train = x_train.drop(columns=low_card_cols)
num_OH_valid = x_valid.drop(columns=low_card_cols)

In [88]:
num_OH_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10864 entries, 12167 to 2732
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Rooms          10864 non-null  int64  
 1   Distance       10864 non-null  float64
 2   Postcode       10864 non-null  float64
 3   Bedroom2       10864 non-null  float64
 4   Bathroom       10864 non-null  float64
 5   Landsize       10864 non-null  float64
 6   Lattitude      10864 non-null  float64
 7   Longtitude     10864 non-null  float64
 8   Propertycount  10864 non-null  float64
dtypes: float64(8), int64(1)
memory usage: 1.1 MB


In [90]:
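# concatenating OH encoded columns with the numerical feature columns

x_train_OH = pd.concat([num_OH_train, OH_cols_train], axis=1)
x_valid_OH = pd.concat([num_OH_valid, OH_cols_valid], axis=1)

# Newer scikit-learn versions require all column names to be strings
x_train_OH.columns = x_train_OH.columns.astype(str)
x_valid_OH.columns = x_valid_OH.columns.astype(str)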


In [93]:
print('MAE for Approach 3: One Hot Encoding ',score_dataset(x_train_OH,x_valid_OH,y_train,y_valid))

MAE for Approach 3: One Hot Encoding  166089.4893009678


## Conclusion
* MAE for Approach 1: Dropping columns-  175703.48185157913
* MAE for Approach 2: Label Encoding -  165936.40548390493
* MAE for Approach 3: One Hot Encoding  166089.4893009678

Approach 1 (dropping the categorical columns) performed worst, whereas label encoding and one-hot encoding performed similarly, with label encoding marginally ahead.
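
To put the three scores on a common scale, the reduction in MAE relative to Approach 1 can be computed from the values reported above. A short sketch of that arithmetic:

```python
# MAE values copied from the notebook outputs above
mae = {
    'Dropping columns': 175703.48185157913,
    'Label Encoding': 165936.40548390493,
    'One Hot Encoding': 166089.4893009678,
}

baseline = mae['Dropping columns']

# Percentage reduction in MAE compared with dropping the categorical columns
reduction = {name: 100 * (baseline - value) / baseline for name, value in mae.items()}

for name, pct in reduction.items():
    print(f'{name}: {pct:.2f}% lower MAE than dropping columns')
```

Both encodings reduce the error by roughly 5.5% relative to dropping the categorical columns, and the gap between the two encodings is small by comparison.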
