# Pandas GetDummies Vs SKLearn OneHotEncoder

In this notebook we will investigate why OneHotEncoder may be the better choice for Machine Learning models.

## Import Dataset

Import the simplified AirBnB dataset and identify the categorical variables that we should encode.

In [2]:
# Basics
import numpy as np
import pandas as pd

In [3]:
# Importing training data
df_train = pd.read_csv('airbnb_dataset_training.csv')
df_train

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood_group,neighbourhood,room_type,price
0,2539,1,9,Brooklyn,Kensington,Private room,149
1,2595,1,45,Manhattan,Midtown,Entire home/apt,225
2,3647,3,0,Manhattan,Midtown,Private room,150
3,3831,1,270,Brooklyn,Clinton Hill,Entire home/apt,89
4,5022,10,9,Manhattan,Murray Hill,Entire home/apt,80
5,5099,3,74,Manhattan,Murray Hill,Entire home/apt,200
6,5121,45,49,Brooklyn,Bedford-Stuyvesant,Private room,60
7,5178,2,430,Manhattan,Hell's Kitchen,Private room,79
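One way to identify the categorical variables is to list the non-numeric columns and count their distinct values. The sketch below uses a small hand-built frame (`sample`, a hypothetical name) holding three rows copied from the output above, so it runs without the CSV file.

```python
import pandas as pd

# Three rows copied from the training data shown above (illustrative only)
sample = pd.DataFrame({
    'minimum_nights': [1, 1, 3],
    'number_of_reviews': [9, 45, 0],
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan'],
    'neighbourhood': ['Kensington', 'Midtown', 'Midtown'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
    'price': [149, 225, 150],
})

# Non-numeric columns are the candidates for one hot encoding
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)

# Number of distinct categories in each candidate column
print(sample[categorical_cols].nunique())
```

On the full dataset, the same two calls can be applied to `df_train` directly.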


In [5]:
# Importing testing data
df_test = pd.read_csv('airbnb_dataset_testing.csv')
df_test

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood_group,neighbourhood,room_type,price
0,13808,1,112,Brooklyn,Bedford-Stuyvesant,Private room,80
1,16338,7,27,Brooklyn,Clinton Hill,Private room,55
2,16421,30,191,Manhattan,Hell's Kitchen,Private room,52
3,15220,2,289,Manhattan,Hell's Kitchen,Private room,69
4,5238,1,160,Manhattan,Chinatown,Entire home/apt,150
5,12937,3,248,Queens,Long Island City,Private room,130


## One Hot Encoding Features in Pandas' Get Dummies

### Transform Training Data

In [15]:
# Create one hot encoded columns for the 'neighbourhood_group' column
# of the training data
pd_train_encoded = pd.get_dummies(df_train,
                                columns = ['neighbourhood_group'],
                                prefix = ['ng'],
                                prefix_sep='_',
                                drop_first=False)
pd_train_encoded

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood,room_type,price,ng_Brooklyn,ng_Manhattan
0,2539,1,9,Kensington,Private room,149,1,0
1,2595,1,45,Midtown,Entire home/apt,225,0,1
2,3647,3,0,Midtown,Private room,150,0,1
3,3831,1,270,Clinton Hill,Entire home/apt,89,1,0
4,5022,10,9,Murray Hill,Entire home/apt,80,0,1
5,5099,3,74,Murray Hill,Entire home/apt,200,0,1
6,5121,45,49,Bedford-Stuyvesant,Private room,60,1,0
7,5178,2,430,Hell's Kitchen,Private room,79,0,1


### Transform Testing Data

In [7]:
# Perform the same action on the testing data, creating one hot encoded
# columns for the 'neighbourhood_group' column
pd_test_encoded = pd.get_dummies(df_test,
                                columns = ['neighbourhood_group'],
                                prefix = ['ng'],
                                prefix_sep='_',
                                drop_first=False)
pd_test_encoded

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood,room_type,ng_Brooklyn,ng_Manhattan,ng_Queens
0,13808,1,112,Bedford-Stuyvesant,Private room,1,0,0
1,16338,7,27,Clinton Hill,Private room,1,0,0
2,16421,30,191,Hell's Kitchen,Private room,0,1,0
3,15220,2,289,Hell's Kitchen,Private room,0,1,0
4,5238,1,160,Chinatown,Entire home/apt,0,1,0
5,12937,3,248,Long Island City,Private room,0,0,1


Notice that the testing set has one additional column, ng_Queens, due to the presence of **Queens** in the neighbourhood_group column.

A model that has been fitted to the training dataset will expect the same features at prediction time, but the encoded testing data does not deliver them.
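The mismatch can be checked programmatically by comparing the column sets of the two encoded frames. The sketch below is a minimal illustration on hand-built frames (`train` and `test` are hypothetical stand-ins for the CSV data). It also shows one possible manual workaround with get_dummies, reindexing the testing columns to the training columns, which is an assumption of this example and not a step from the notebook.

```python
import pandas as pd

# Tiny stand-ins for the training and testing data (illustrative only)
train = pd.DataFrame({'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan']})
test = pd.DataFrame({'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Queens']})

train_dummies = pd.get_dummies(train, columns=['neighbourhood_group'], prefix=['ng'])
test_dummies = pd.get_dummies(test, columns=['neighbourhood_group'], prefix=['ng'])

# Columns that exist in the encoded testing data but not in the training data
extra_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))
print(extra_in_test)

# Manual workaround: force the testing frame onto the training columns.
# Columns unseen in training are dropped; columns missing from testing are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

This workaround has to be remembered and repeated for every new dataset, which is part of the motivation for using a fitted encoder.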

## One Hot Encoding Features in OneHotEncoder

### Transform Training Data

For this demonstration, we will only be one hot encoding the neighbourhood_group column.

In [16]:
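# Initialize the OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
train_ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Specify the categorical columns
col_names = ['neighbourhood_group']
col_prefixes = ['ng']

# Fit the encoder on the training data and apply the one hot encoding
dummy_cols = train_ohe.fit_transform(df_train[col_names])
# Build prefixed column names from the fitted categories
# (get_feature_names was removed in scikit-learn 1.2)
dummy_names = [f'{prefix}_{category}'
               for prefix, categories in zip(col_prefixes, train_ohe.categories_)
               for category in categories]
dummy_cols = pd.DataFrame(dummy_cols, columns=dummy_names, dtype=int)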

#Combine the encoded columns with the non categorical columns
ohe_train_encoded = pd.concat([df_train,dummy_cols], axis = 1)
ohe_train_encoded = ohe_train_encoded.drop(col_names, axis = 1)
ohe_train_encoded

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood,room_type,price,ng_Brooklyn,ng_Manhattan
0,2539,1,9,Kensington,Private room,149,1,0
1,2595,1,45,Midtown,Entire home/apt,225,0,1
2,3647,3,0,Midtown,Private room,150,0,1
3,3831,1,270,Clinton Hill,Entire home/apt,89,1,0
4,5022,10,9,Murray Hill,Entire home/apt,80,0,1
5,5099,3,74,Murray Hill,Entire home/apt,200,0,1
6,5121,45,49,Bedford-Stuyvesant,Private room,60,1,0
7,5178,2,430,Hell's Kitchen,Private room,79,0,1


### Transform Testing Data

In [19]:
#Showing which variables will be used from the training code above.
train_ohe = train_ohe          #'use the trained OHE transformer from above'
col_names = col_names          #'keep column names the same as for training data'
col_prefixes = col_prefixes    #'keep column prefixes the same as for training data'
dummy_names = dummy_names      #'use the same dummy variable names as for training'

#Apply the transformation from the fitted ohe encoder
ohe_test_encoded = train_ohe.transform(df_test[col_names])
ohe_test_encoded = pd.DataFrame(ohe_test_encoded, columns = dummy_names, dtype = int)

#Combine the encoded columns with the non categorical columns
ohe_test_encoded = pd.concat([df_test,ohe_test_encoded], axis = 1)
ohe_test_encoded = ohe_test_encoded.drop(col_names, axis = 1)
ohe_test_encoded

Unnamed: 0,id,minimum_nights,number_of_reviews,neighbourhood,room_type,price,ng_Brooklyn,ng_Manhattan
0,13808,1,112,Bedford-Stuyvesant,Private room,80,1,0
1,16338,7,27,Clinton Hill,Private room,55,1,0
2,16421,30,191,Hell's Kitchen,Private room,52,0,1
3,15220,2,289,Hell's Kitchen,Private room,69,0,1
4,5238,1,160,Chinatown,Entire home/apt,150,0,1
5,12937,3,248,Long Island City,Private room,130,0,0


When the OneHotEncoder was created, the argument handle_unknown='ignore' was provided.

Notice that with handle_unknown='ignore', the additional neighbourhood group in the testing data, Queens, no longer produces an additional column. Instead, Queens is represented by all 0s in the neighbourhood_group dummy columns.

Because the encoded testing data now has the same columns as the encoded training data, a machine learning model fitted on the training data receives the number of features it expects.
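The all-zero behaviour can be confirmed on a minimal example. The sketch below uses hand-built frames (`train` and `test` are hypothetical names) and keeps the encoder's default sparse output, converting it with `.toarray()`, so it does not depend on the `sparse` / `sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-ins for the training and testing data (illustrative only)
train = pd.DataFrame({'neighbourhood_group': ['Brooklyn', 'Manhattan']})
test = pd.DataFrame({'neighbourhood_group': ['Manhattan', 'Queens']})

# Fit on the training data only, so Queens is never seen during fitting
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The default output is a sparse matrix; convert it to a dense array to inspect it
encoded = encoder.transform(test).toarray()
print(encoder.categories_)
print(encoded)

# Rows whose dummy columns are all 0 belong to categories unseen during fitting
unseen_rows = encoded.sum(axis=1) == 0
print(test[unseen_rows])
```

Checking for all-zero rows like this is a quick way to see how many testing records fall outside the categories learned from the training data.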

### Exercise - Experiment with the handle_unknown argument

Experiment with each of the handle_unknown options of the OneHotEncoder. Fit the OneHotEncoder again with each option and see what happens when you transform the testing data using that fitted encoder.
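A small helper along the following lines may make the experiment easier to repeat. It is only a sketch: `try_handle_unknown` is a hypothetical helper name, and `train` and `test` are tiny hand-built stand-ins that can be swapped for `df_train[col_names]` and `df_test[col_names]`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-ins for the training and testing data (illustrative only)
train = pd.DataFrame({'neighbourhood_group': ['Brooklyn', 'Manhattan']})
test = pd.DataFrame({'neighbourhood_group': ['Manhattan', 'Queens']})

def try_handle_unknown(setting):
    """Fit on train, transform test, and return the encoded array or the error raised."""
    encoder = OneHotEncoder(handle_unknown=setting)
    encoder.fit(train)
    try:
        return encoder.transform(test).toarray()
    except ValueError as err:
        # Return the exception instead of stopping, so both settings can be compared
        return err

for setting in ['ignore', 'error']:
    outcome = try_handle_unknown(setting)
    print(setting, '->', type(outcome).__name__)
    print(outcome)
```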

Write down what you find for:
- handle_unknown='ignore':
- handle_unknown='error':