# One Hot Encoding
One hot encoding is a common technique used to work with categorical features. 

There are multiple ways to facilitate this pre-processing step in Python, but it usually becomes much harder when your code needs to work on new data that might have missing or additional values. That's the case when you deploy a model to production: you can't always know which values will appear in the data you receive.

We'll walk through two ways of dealing with this problem. In both ways we will run one hot encoding on our training set and save a few attributes that we can reuse later on when we need to process new data.

If you deploy a model to production, the best way of saving those values is writing your own class and defining them as attributes that will be set at training, as an internal state. 

If working in a notebook (like here), it's fine to save them as simple variables.
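
As a rough sketch of the class-based option, here is one way such a class could look. `DummyEncoder`, `fit_transform`, `transform` and `columns_` are hypothetical names chosen for this illustration, and the class relies on pandas' `get_dummies` (covered below) plus `reindex` to align new data with the columns seen at training:

```python
import pandas as pd


class DummyEncoder:
    """Hypothetical helper: one hot encodes with pandas and remembers the training columns."""

    def __init__(self, cat_columns, prefix_sep="__"):
        self.cat_columns = cat_columns
        self.prefix_sep = prefix_sep
        self.columns_ = None  # internal state, set at training time

    def _encode(self, df):
        # dtype=int keeps 0/1 output regardless of the pandas version
        return pd.get_dummies(df, prefix_sep=self.prefix_sep,
                              columns=self.cat_columns, dtype=int)

    def fit_transform(self, df):
        encoded = self._encode(df)
        self.columns_ = list(encoded.columns)
        return encoded

    def transform(self, df):
        if self.columns_ is None:
            raise RuntimeError("Call fit_transform on the training data first")
        # reindex drops unseen columns, adds missing ones as 0 and restores the order
        return self._encode(df).reindex(columns=self.columns_, fill_value=0)


# Made-up data, only to exercise the class
train = pd.DataFrame({"city": ["Baltimore", "Washington"], "duration": [20, 10]})
new = pd.DataFrame({"city": ["Dover", "Baltimore"], "duration": [5, 15]})

encoder = DummyEncoder(cat_columns=["city"])
train_encoded = encoder.fit_transform(train)
new_encoded = encoder.transform(new)
print(new_encoded)
```

The unseen city `Dover` ends up as a row of zeros, and the new data gets exactly the columns that were stored at training time.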

### Creating a dataset
Let's make up a dataset containing journeys that happened in different cities in the Mid-Atlantic, using different modes of transportation.

We'll create a `DataFrame` that contains 2 categorical features: `city` and `transport`, as well as a numerical feature: `duration`.

In [3]:
import pandas as pd

In [4]:
df = pd.DataFrame(
    [['Baltimore', 'car', 20],
    ['Washington', 'car', 10],
    ['Pittsburgh', 'bus', 30]],
    columns=['city', 'transport', 'duration']
)
df

| | city | transport | duration |
|---|---|---|---|
| 0 | Baltimore | car | 20 |
| 1 | Washington | car | 10 |
| 2 | Pittsburgh | bus | 30 |


Now we'll create an 'unseen' test set. To make it difficult, we'll simulate the case where the test data has different values for the categorical features.

In [5]:
df_test = pd.DataFrame([
    ["Baltimore", "bike", 30], 
    ["Washington", "car", 40], 
    ["Dover", "bike", 10]],
    columns=["city", "transport", "duration"]
)

Here our column `city` does not have the value `Pittsburgh`; instead it has the new value `Dover`. Likewise, our column `transport` has no value `bus` but has the new value `bike`.

We'll now one hot encode in two ways: with `pandas`' `get_dummies` function and with `sklearn`'s `OneHotEncoder` class.

### pandas' `get_dummies`

First, we define the list of categorical features that we want to process:

In [29]:
cat_columns = ['city', 'transport']

We can quickly build dummy features with pandas by calling the `get_dummies` function - we'll create a new `DataFrame` with this:

In [14]:
df_processed = pd.get_dummies(df,
                              prefix_sep='__',
                              columns=cat_columns)
df_processed

| | duration | city__Baltimore | city__Pittsburgh | city__Washington | transport__bus | transport__car |
|---|---|---|---|---|---|---|
| 0 | 20 | 1 | 0 | 0 | 0 | 1 |
| 1 | 10 | 0 | 0 | 1 | 0 | 1 |
| 2 | 30 | 0 | 1 | 0 | 1 | 0 |


That's it for the training set part. We'll need to save a few things into variables to make sure that we build the exact same columns on the test set.

See how pandas created new columns with the format `column__value`? Let's build a list of those new columns and store it in a `cat_dummies` variable:

In [17]:
cat_dummies = [col for col in df_processed
               if "__" in col
               and col.split("__")[0] in cat_columns]
cat_dummies

['city__Baltimore',
 'city__Pittsburgh',
 'city__Washington',
 'transport__bus',
 'transport__car']

Let's also save the full list of columns so that we can enforce the column order later on:

In [9]:
processed_columns = list(df_processed.columns[:])
processed_columns

['duration',
 'city__Baltimore',
 'city__Pittsburgh',
 'city__Washington',
 'transport__bus',
 'transport__car']

### Processing our test data
Now let's ensure our test data has the same columns.

First, we'll call `get_dummies` on it:

In [18]:
df_test_processed = pd.get_dummies(df_test,
                                   prefix_sep="__",
                                   columns=cat_columns)
df_test_processed

| | duration | city__Baltimore | city__Dover | city__Washington | transport__bike | transport__car |
|---|---|---|---|---|---|---|
| 0 | 30 | 1 | 0 | 0 | 1 | 0 |
| 1 | 40 | 0 | 0 | 1 | 0 | 1 |
| 2 | 10 | 0 | 1 | 0 | 1 | 0 |


As expected, we have two new columns (`city__Dover` and `transport__bike`) and two missing columns (`city__Pittsburgh` and `transport__bus`).
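
If you'd rather check this programmatically than by eye, a set difference does the job. This is a small sketch; the two lists are simply copied from the outputs above:

```python
# Dummy columns built from the training set and from the test set
train_dummies = ["city__Baltimore", "city__Pittsburgh", "city__Washington",
                 "transport__bus", "transport__car"]
test_dummies = ["city__Baltimore", "city__Dover", "city__Washington",
                "transport__bike", "transport__car"]

# Columns only in the test set, and columns the test set lacks
extra = sorted(set(test_dummies) - set(train_dummies))
missing = sorted(set(train_dummies) - set(test_dummies))

print(f"Additional: {extra}")
print(f"Missing: {missing}")
```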

Good news! We can easily clean it up:

In [23]:
# Remove additional columns
for col in df_test_processed.columns:
    if (("__" in col) 
    and (col.split("__")[0] in cat_columns) 
    and col not in cat_dummies):
        print(f"Removing additional feature {col}")
        df_test_processed.drop(col, axis=1, inplace=True)
df_test_processed

Removing additional feature city__Dover
Removing additional feature transport__bike


| | duration | city__Baltimore | city__Washington | transport__car |
|---|---|---|---|---|
| 0 | 30 | 1 | 0 | 0 |
| 1 | 40 | 0 | 1 | 1 |
| 2 | 10 | 0 | 0 | 0 |


Now we need to add the missing columns. We can set all missing columns to a vector of 0s since those values did not appear in the test data:

In [28]:
for col in cat_dummies:
    if col not in df_test_processed.columns:
        print(f"Adding missing feature {col}")
        df_test_processed[col] = 0
df_test_processed

Adding missing feature city__Pittsburgh
Adding missing feature transport__bus


| | duration | city__Baltimore | city__Washington | transport__car | city__Pittsburgh | transport__bus |
|---|---|---|---|---|---|---|
| 0 | 30 | 1 | 0 | 0 | 0 | 0 |
| 1 | 40 | 0 | 1 | 1 | 0 | 0 |
| 2 | 10 | 0 | 0 | 0 | 0 | 0 |


That's it, we now have the same features.

Note that the column order isn't preserved. If you need to reorder the columns, reuse the list of processed columns we saved earlier:

In [None]:
df_test_processed = df_test_processed[processed_columns]
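
The three steps above (dropping additional columns, adding missing ones, reordering) can also be collapsed into a single `reindex` call. Here is a minimal, self-contained sketch using the same data; `train_encoded`, `test_encoded` and `test_aligned` are names made up for this example, and `dtype=int` is passed so the output is 0/1 whatever the pandas version. One caveat: `reindex` drops any column that is absent from the training columns, not only dummy columns.

```python
import pandas as pd

cat_columns = ["city", "transport"]
columns = ["city", "transport", "duration"]
df = pd.DataFrame(
    [["Baltimore", "car", 20], ["Washington", "car", 10], ["Pittsburgh", "bus", 30]],
    columns=columns,
)
df_test = pd.DataFrame(
    [["Baltimore", "bike", 30], ["Washington", "car", 40], ["Dover", "bike", 10]],
    columns=columns,
)

train_encoded = pd.get_dummies(df, prefix_sep="__", columns=cat_columns, dtype=int)
test_encoded = pd.get_dummies(df_test, prefix_sep="__", columns=cat_columns, dtype=int)

# Keep exactly the training columns, in the training order;
# columns missing from the test set are created and filled with 0
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```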

### Using `sklearn`'s one hot and label encoder

#### Processing our training data

We'll start by importing two things: `OneHotEncoder` to build one hot features and `LabelEncoder` to transform strings into integer labels (needed before `OneHotEncoder` in older scikit-learn releases, which only accepted integer input):

In [2]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

We start again from an initial dataframe and our list of categorical features.

First we create our `df_processed` DataFrame with non-categorical features:

In [15]:
# Keep only the non-categorical columns (duration stays numeric)
df_processed = df[[
    col for col in df.columns
    if col not in cat_columns
]].copy()

Now we encode every categorical feature separately, meaning we need as many encoders as categorical features. 

We loop over all categorical features and build a dictionary that will map a feature to its encoder:

In [16]:
# For each categorical column
# We fit a label encoder, transform our column and 
# add it to our new dataframe
label_encoders = {}
for col in cat_columns:
    print("Encoding {}".format(col))
    new_le = LabelEncoder()
    df_processed[col] = new_le.fit_transform(df[col])
    label_encoders[col] = new_le

Encoding city
Encoding transport


In [17]:
df_processed

| | duration | city | transport |
|---|---|---|---|
| 0 | 20 | 0 | 1 |
| 1 | 10 | 2 | 1 |
| 2 | 30 | 1 | 0 |


Now that we have proper integer labels, we need to one hot encode our categorical features.

Unfortunately, the one hot encoder does not accept a list of categorical features by name, only by index. We'll use `get_loc` to build the list of indexes:

In [18]:
cat_columns_idx = [df_processed.columns.get_loc(col)
                   for col in cat_columns]

We need to set `handle_unknown` to `ignore` so that the `OneHotEncoder` can handle our unseen data later on. The `OneHotEncoder` builds a numpy array from our data, replacing the original features with their one hot encoded versions. Unfortunately this makes it hard to rebuild the DataFrame with nice labels, but most algorithms work with numpy arrays, so we can stop there. Note that the `categorical_features` argument used below was removed in scikit-learn 0.22 and `sparse` was renamed to `sparse_output` in 1.2, so this cell needs an older release.

In [22]:
ohe = OneHotEncoder(categorical_features=cat_columns_idx,
                    sparse=False,
                    handle_unknown='ignore')
df_processed_np = ohe.fit_transform(df_processed)

df_processed_np



array([[ 1.,  0.,  0.,  0.,  1., 20.],
       [ 0.,  0.,  1.,  0.,  1., 10.],
       [ 0.,  1.,  0.,  1.,  0., 30.]])

#### Processing our unseen data
Now we need to apply the same transformation to our test data. First, we create a new dataframe with our non-categorical features:

In [25]:
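# .copy() gives us an independent DataFrame, so adding columns
# later does not trigger pandas' SettingWithCopyWarning
df_test_processed = df_test[[
    col for col in df_test.columns
    if col not in cat_columns
]].copy()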

df_test_processed

| | duration |
|---|---|
| 0 | 30 |
| 1 | 40 |
| 2 | 10 |


Now we need to reuse our `LabelEncoder`s so that the same values get the same integers. Unfortunately, since we have new unseen values in our test set, we cannot call `transform`: it raises an error on any value it did not see when it was fitted.
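
To see the failure for yourself, here is a small standalone sketch. The encoder `le` is fitted on the three training cities only; the `failed` flag is just there to make the outcome easy to check:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Baltimore", "Washington", "Pittsburgh"])

try:
    # "Dover" was not seen during fit
    le.transform(["Baltimore", "Dover"])
    failed = False
except ValueError as err:
    failed = True
    print(f"transform failed: {err}")
```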

Instead, we build a dictionary from the `classes_` attribute of each label encoder, mapping every known value to its integer label. If we then use `map` on our pandas `Series`, unseen values become `NaN` and the column is converted to float.

So we add a step that fills the `NaN`s with a large integer that matches no known label (`9999`) and converts the column back to `int`:

In [32]:
for col in cat_columns:
    print(f"Encoding {col}")
    label_map = {val: label for label, val in
                 enumerate(label_encoders[col].classes_)}
    
    print(label_map)
    
    df_test_processed[col] = df_test[col].map(label_map)
    
    #fillna and convert to int
    df_test_processed[col] = df_test_processed[col].fillna(9999).astype(int)
    
df_test_processed    

Encoding city
{'Baltimore': 0, 'Pittsburgh': 1, 'Washington': 2}
Encoding transport
{'bus': 0, 'car': 1}



| | duration | city | transport |
|---|---|---|---|
| 0 | 30 | 0 | 9999 |
| 1 | 40 | 2 | 1 |
| 2 | 10 | 9999 | 9999 |


Now we can finally apply our fitted `OneHotEncoder` "out-of-the-box" by using the transform method:

In [33]:
df_test_processed_np = ohe.transform(df_test_processed)

df_test_processed_np

array([[ 1.,  0.,  0.,  0.,  0., 30.],
       [ 0.,  0.,  1.,  0.,  1., 40.],
       [ 0.,  0.,  0.,  0.,  0., 10.]])
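
On recent scikit-learn releases (0.20 and later), `OneHotEncoder` accepts string columns directly, so the `LabelEncoder` step and the `9999` trick are no longer needed. The following is a sketch of that shorter route rather than a drop-in replacement for the cells above: it encodes only the categorical columns and appends `duration` by hand with numpy. The names `train_cat`, `test_cat`, `train_np` and `test_np` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_columns = ["city", "transport"]
columns = ["city", "transport", "duration"]
df = pd.DataFrame(
    [["Baltimore", "car", 20], ["Washington", "car", 10], ["Pittsburgh", "bus", 30]],
    columns=columns,
)
df_test = pd.DataFrame(
    [["Baltimore", "bike", 30], ["Washington", "car", 40], ["Dover", "bike", 10]],
    columns=columns,
)

# handle_unknown="ignore" encodes unseen categories as all zeros
ohe = OneHotEncoder(handle_unknown="ignore")

# The default output is a sparse matrix; .toarray() makes it dense
train_cat = ohe.fit_transform(df[cat_columns]).toarray()
test_cat = ohe.transform(df_test[cat_columns]).toarray()

# Append the numerical feature as the last column
train_np = np.hstack([train_cat, df[["duration"]].to_numpy()])
test_np = np.hstack([test_cat, df_test[["duration"]].to_numpy()])

print(ohe.categories_)
print(test_np)
```

The resulting arrays have the same layout as the ones above: one hot columns first, `duration` last.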