# Data Science Quick Tip #005: Two Correct Ways to Perform One-Hot Encoding!
In this notebook, we demonstrate two correct ways to one-hot encode (OHE) your data, and we show another commonly used approach that is actually incorrect for data science work. To learn more, check out the corresponding blog post for this notebook.

## Project Setup

In [1]:
# Importing the required libraries
import pandas as pd
import joblib
from sklearn import preprocessing
from category_encoders import one_hot



In [2]:
# Creating a small DataFrame with fake animal data
animals_df = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'monkey', 'elephant', 'cat', 'bird']})
animals_df

Unnamed: 0,animal
0,cat
1,dog
2,bird
3,monkey
4,elephant
5,cat
6,bird


## The INCORRECT Way to Perform OHE for Data Science

In [3]:
# Performing one-hot encoding with Pandas' "get_dummies" function
pandas_dummies = pd.get_dummies(animals_df['animal'])
pandas_dummies

Unnamed: 0,bird,cat,dog,elephant,monkey
0,0,1,0,0,0
1,0,0,1,0,0
2,1,0,0,0,0
3,0,0,0,0,1
4,0,0,0,1,0
5,0,1,0,0,0
6,1,0,0,0,0
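
One likely reason this approach causes trouble: `get_dummies` builds its columns from whatever values happen to be in the data it is given, and it keeps no fitted state to reuse later. The sketch below is a minimal illustration under that assumption; `train_df` and `new_df` are made-up names for this example and are not part of the notebook above.

```python
import pandas as pd

# Hypothetical "training" data with all five animals
train_df = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'monkey', 'elephant', 'cat', 'bird']})

# Hypothetical "inference" data that only contains two of the animals
new_df = pd.DataFrame({'animal': ['dog', 'cat']})

# get_dummies derives its columns from the values present in each input
train_dummies = pd.get_dummies(train_df['animal'])
new_dummies = pd.get_dummies(new_df['animal'])

train_columns = list(train_dummies.columns)
new_columns = list(new_dummies.columns)

print(train_columns)  # five columns
print(new_columns)    # only two columns
```

A model trained on the five-column output would receive a two-column input here, so the feature layout no longer matches what the model saw during training.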


## Correct Way #1: Using Scikit-Learn's One Hot Encoder

In [4]:
# Instantiating the Scikit-Learn OHE object
sklearn_ohe = preprocessing.OneHotEncoder()

In [5]:
# Fitting the Scikit-Learn one-hot encoder to the animals DataFrame and transforming it
sklearn_dummies = sklearn_ohe.fit_transform(animals_df)

In [6]:
# Viewing the categories of the Scikit-Learn fitted transformer
sklearn_ohe.categories_

[array(['bird', 'cat', 'dog', 'elephant', 'monkey'], dtype=object)]

In [7]:
# Viewing the object type of the output from the Scikit-Learn transformation
sklearn_dummies

<7x5 sparse matrix of type '<class 'numpy.float64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [8]:
# Using the output dummies and transformer categories to produce a cleaner looking DataFrame
# (categories_ holds one array per input column, so we take the first and only entry)
sklearn_dummies_df = pd.DataFrame(data = sklearn_dummies.toarray(), 
                                  columns = sklearn_ohe.categories_[0])
sklearn_dummies_df

Unnamed: 0,bird,cat,dog,elephant,monkey
0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0,0.0
6,1.0,0.0,0.0,0.0,0.0


In [9]:
# Dumping the transformer to an external pickle file
joblib.dump(sklearn_ohe, 'sklearn_ohe.pkl')

['sklearn_ohe.pkl']

## Correct Way #2: Using Category Encoders' One Hot Encoder

In [10]:
# Instantiating the Category Encoders OHE object
ce_ohe = one_hot.OneHotEncoder(use_cat_names = True)

In [11]:
# Fitting the Category Encoders one-hot encoder to the animals DataFrame and transforming it
ce_dummies = ce_ohe.fit_transform(animals_df)
ce_dummies

Unnamed: 0,animal_cat,animal_dog,animal_bird,animal_monkey,animal_elephant
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,0,1
5,1,0,0,0,0
6,0,0,1,0,0


In [12]:
# Dumping the transformer to an external pickle file
joblib.dump(ce_ohe, 'ce_ohe.pkl')

['ce_ohe.pkl']

## Re-Importing Exported Pickles for Transformation Demo

In [13]:
# Creating a new animals DataFrame with a different row order and a different number of rows
new_animals_df = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'monkey', 'elephant', 'cat', 'bird', 'dog']})

In [14]:
# Importing the exported pickle files
imported_sklearn_ohe = joblib.load('sklearn_ohe.pkl')
imported_ce_ohe = joblib.load('ce_ohe.pkl')

In [15]:
# Running the new animals DataFrame through the Scikit-Learn imported transformer
loaded_sklearn_dummies = imported_sklearn_ohe.transform(new_animals_df)
loaded_sklearn_dummies_df = pd.DataFrame(data = loaded_sklearn_dummies.toarray(), 
                                         columns = imported_sklearn_ohe.categories_[0])
loaded_sklearn_dummies_df

Unnamed: 0,bird,cat,dog,elephant,monkey
0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0,0.0
6,1.0,0.0,0.0,0.0,0.0
7,0.0,0.0,1.0,0.0,0.0
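
The new data above only contains animals the encoder saw during fitting. A related question is what happens when a brand new category shows up. The sketch below is one possible way to handle that with Scikit-Learn's `handle_unknown` parameter; the `'zebra'` value and the variable names are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same fake animal data used for fitting earlier in the notebook
train_df = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'monkey', 'elephant', 'cat', 'bird']})

# Hypothetical new data containing a category the encoder never saw
unseen_df = pd.DataFrame({'animal': ['cat', 'zebra']})

# With the default settings, an unknown category raises a ValueError
strict_ohe = OneHotEncoder().fit(train_df)
try:
    strict_ohe.transform(unseen_df)
    strict_raised = False
except ValueError:
    strict_raised = True

# With handle_unknown='ignore', an unknown category becomes a row of all zeros
lenient_ohe = OneHotEncoder(handle_unknown='ignore').fit(train_df)
lenient_array = lenient_ohe.transform(unseen_df).toarray()
lenient_df = pd.DataFrame(lenient_array, columns=lenient_ohe.categories_[0])
print(lenient_df)
```

Either way, the output keeps the same five columns in the same order as the training data, which is the property that makes a fitted encoder safe to reuse at inference time.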


In [16]:
# Running the new animals DataFrame through the Category Encoders imported transformer
loaded_ce_dummies = imported_ce_ohe.transform(new_animals_df)
loaded_ce_dummies

Unnamed: 0,animal_cat,animal_dog,animal_bird,animal_monkey,animal_elephant
0,1,0,0,0,0
1,1,0,0,0,0
2,0,1,0,0,0
3,0,0,0,1,0
4,0,0,0,0,1
5,1,0,0,0,0
6,0,0,1,0,0
7,0,1,0,0,0
