
## In this notebook:

We will see how to perform one hot encoding with:
- pandas
- Scikit-learn
- Feature-Engine

We will also discuss the advantages and limitations of each implementation, using the Titanic dataset.

In [2]:
# import libraries

import pandas as pd

# from sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# from feature engine
from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder

In [10]:
# import data
cols_to_use = ['sex', 'embarked', 'cabin', 'survived']

data = pd.read_csv('..\\titanic.csv', usecols=cols_to_use)
data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B5,S
1,1,male,C22,S
2,0,female,C22,S
3,0,male,C22,S
4,0,female,C22,S


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   survived  1309 non-null   int64 
 1   sex       1309 non-null   object
 2   cabin     295 non-null    object
 3   embarked  1307 non-null   object
dtypes: int64(1), object(3)
memory usage: 41.0+ KB


In [17]:
# let's use the first letter of the cabin variable
data['cabin'] = data['cabin'].str[0]
data['cabin'].head()

0    B
1    C
2    C
3    C
4    C
Name: cabin, dtype: object

In [18]:
# let's separate the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked', 'cabin']],  # predictors
    data['survived'],  # target
    test_size=0.3,  # proportion of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((916, 3), (393, 3))

In [20]:
## Exploring the Cardinality
X_train['sex'].unique()

array(['female', 'male'], dtype=object)

In [21]:
X_train['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [22]:
X_train['cabin'].unique()

array([nan, 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object)

## One hot encoding with pandas

### Advantages

- quick
- returns pandas dataframe
- returns feature names for the dummy variables

### Limitations

- it does not store the categories seen in the train data, so they cannot be propagated to the test data


-----

The pandas function get_dummies() creates as many binary variables as there are categories in the variable (or one fewer per variable when drop_first=True).

If the variable colour has 3 categories in the train data, it will create 3 dummy variables. However, if the variable colour has 5 categories in the test data, it will create 5 dummy variables. The train and test sets will therefore end up with a different number of features and will be incompatible with training and scoring using Scikit-learn.

In practice, we shouldn't use get_dummies() in our machine learning pipelines. It is, however, useful for quick data exploration. Let's look at this with examples.
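
Before moving to the Titanic data, the mismatch described above can be reproduced on a tiny example. This is a minimal sketch on made-up data (the `train` and `test` frames below are hypothetical, not the Titanic set), in which the test split lacks one category:

```python
import pandas as pd

# Toy data (hypothetical): category "T" appears in train only
train = pd.DataFrame({"cabin": ["A", "B", "T", "B"]})
test = pd.DataFrame({"cabin": ["B", "A", "A"]})

# get_dummies only sees the data it is given, so each call
# derives its columns independently
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

print(train_dummies.columns.tolist())
print(test_dummies.columns.tolist())

# Columns present in train but absent from test
missing_in_test = sorted(set(train_dummies.columns) - set(test_dummies.columns))
print(missing_in_test)
```

Because nothing is stored between the two calls, the column sets can only agree by coincidence.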

### into k dummy variables

In [25]:
# we can create dummy variables with the built-in
# pandas function get_dummies

temp = pd.get_dummies(X_train['sex'])
temp.head()

Unnamed: 0,female,male
501,1,0
588,1,0
402,1,0
1193,0,1
686,1,0


In [26]:
# for better visualisation let's put the dummies next
# to the original variable

pd.concat([X_train['sex'],
           pd.get_dummies(X_train['sex'])], axis=1).head()

Unnamed: 0,sex,female,male
501,female,1,0
588,female,1,0
402,female,1,0
1193,male,0,1
686,female,1,0


In [27]:
# and now for all variables together: train set

tmp = pd.get_dummies(X_train)

print(tmp.shape)

tmp.head()

(916, 13)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,1,0,0,0,1,0,0,0,0,0,0,0,0
588,1,0,0,0,1,0,0,0,0,0,0,0,0
402,1,0,1,0,0,0,0,0,0,0,0,0,0
1193,0,1,0,1,0,0,0,0,0,0,0,0,0
686,1,0,0,1,0,0,0,0,0,0,0,0,0


### into k-1 dummy variables

In [28]:
# and now for all variables together: train set

tmp = pd.get_dummies(X_train, drop_first=True)

print(tmp.shape)

tmp.head()

(916, 10)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,0,0,1,0,0,0,0,0,0,0
588,0,0,1,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0,0,0
1193,1,1,0,0,0,0,0,0,0,0
686,0,1,0,0,0,0,0,0,0,0


In [29]:
# and now for all variables together: test set

tmp = pd.get_dummies(X_test, drop_first=True)

print(tmp.shape)

tmp.head()

(393, 9)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,1,0,1,0,0,0,0,0,0
533,0,0,1,0,0,0,0,0,0
459,1,0,1,0,0,0,0,0,0
1150,1,0,1,0,0,0,0,0,0
393,1,0,1,0,0,0,0,0,0


- With drop_first=True, one dummy column per variable is dropped
- The train set now has 10 dummy variables instead of 13, while the test set has only 9


Notice the positives of pandas get_dummies:
- dataframe returned with feature names

**And the limitations:**

The train set contains 10 dummy features, whereas the test set contains only 9 (13 and 12 respectively if we keep all k dummies). This occurred because there was no category T in cabin in the test set.

This will cause problems when training and scoring models with Scikit-learn, because its estimators require the train and test sets to have the same features.
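
If get_dummies() has to be used anyway, one possible workaround is to align the test columns to the train columns by hand. The sketch below uses made-up data and `DataFrame.reindex`. It is only an illustration of the idea, not a recommendation over the encoders discussed later, and the variable names are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical): "T" is missing from test, "C" is new in test
train = pd.DataFrame({"cabin": ["A", "B", "T", "B"]})
test = pd.DataFrame({"cabin": ["B", "A", "C"]})

# dtype=int keeps the dummies as 0/1 integers
train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Force the test set to have exactly the train columns:
# columns missing from test are added and filled with 0,
# columns unseen in train (cabin_C) are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

Note that the row holding the unseen category `C` ends up with zeros in every column, so this approach silently discards information about new categories.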

### Bonus: get_dummies() can handle missing values

In [31]:
# we can add an additional dummy variable to indicate
# missing data

pd.get_dummies(X_train['embarked'], drop_first=True, dummy_na=True).head()

Unnamed: 0,Q,S,NaN
501,0,1,0
588,0,1,0
402,0,0,0
1193,1,0,0
686,1,0,0
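
To make the effect of `dummy_na` easier to see, here is a small sketch on a made-up series (hypothetical data, and without `drop_first` so that every category keeps its own column):

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical) with one missing value at position 2
embarked = pd.Series(["S", "C", np.nan, "Q"], name="embarked")

without_na = pd.get_dummies(embarked, dtype=int)
with_na = pd.get_dummies(embarked, dummy_na=True, dtype=int)

# dummy_na=True appends one extra indicator column for missing data
print(without_na.shape, with_na.shape)

# Without the indicator the missing row is all zeros;
# with it, the last column flags the missing value
print(without_na.iloc[2].sum(), with_na.iloc[2, -1])
```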


## One hot encoding with Scikit-learn

### Advantages

- quick
- Creates the same number of features in train and test set

### Limitations

- it returns a numpy array instead of a pandas dataframe
- the returned array carries no variable names, which is inconvenient for variable exploration (the names have to be retrieved separately from the encoder)

In [36]:
# we create and train the encoder

encoder = OneHotEncoder(categories='auto',
                       drop='first', # to return k-1 dummies; use drop=None to return k
                       sparse=False, # if True, returns a sparse matrix (renamed to sparse_output in scikit-learn 1.2)
                       handle_unknown='error') # raise an error on categories not seen during fit

encoder.fit(X_train.fillna('Missing'))

OneHotEncoder(drop='first', sparse=False)

In [37]:
# we can inspect the learned categories
encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Missing', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing', 'T'], dtype=object)]

In [38]:
# transform the train set

tmp = encoder.transform(X_train.fillna('Missing'))

pd.DataFrame(tmp).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [39]:
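# in recent releases of Scikit-learn
# we can retrieve the feature names as follows
# (get_feature_names() was deprecated in version 1.0 and removed in 1.2;
# newer versions provide get_feature_names_out() instead):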

encoder.get_feature_names()



array(['x0_male', 'x1_Missing', 'x1_Q', 'x1_S', 'x2_B', 'x2_C', 'x2_D',
       'x2_E', 'x2_F', 'x2_G', 'x2_Missing', 'x2_T'], dtype=object)

In [40]:
encoder.feature_names_in_

array(['sex', 'embarked', 'cabin'], dtype=object)

In [42]:
# we can go ahead and transform the test set
# and then reconstitute it back to a pandas dataframe
# and add the feature names derived by OHE

tmp = encoder.transform(X_test.fillna('Missing'))

tmp = pd.DataFrame(tmp)
tmp.columns = encoder.get_feature_names()

tmp.head()



Unnamed: 0,x0_male,x1_Missing,x1_Q,x1_S,x2_B,x2_C,x2_D,x2_E,x2_F,x2_G,x2_Missing,x2_T
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


We can see that train and test contain the same number of features.

More details about Scikit-learn's OneHotEncoder can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
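
The encoder above was created with `handle_unknown='error'`. As a small illustration of what that parameter controls, the sketch below compares it with `handle_unknown='ignore'` on made-up data (the frames and the category `Z` are hypothetical). It leaves `drop` at its default, since combining `drop` with `'ignore'` is not supported in every scikit-learn version:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "Z" is not seen during fit
train = pd.DataFrame({"cabin": ["A", "B", "T"]})
test = pd.DataFrame({"cabin": ["B", "Z"]})

# 'ignore': unseen categories are encoded as a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = lenient.transform(test).toarray()
print(encoded_test)

# 'error': unseen categories raise a ValueError at transform time
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError as err:
    raised = True
    print("transform failed:", err)
```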
    

## One hot encoding with Feature-Engine

### Advantages
- quick
- returns dataframe
- returns feature names
- allows selecting which features to encode

### Limitations
- Not sure yet.

In [43]:
ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    # we can select which variables to encode;
    # if not passed, all categorical variables are encoded
    variables=['sex', 'embarked'],
    drop_last=True)  # True returns k-1 dummies, False returns k


ohe_enc.fit(X_train.fillna('Missing'))

OneHotEncoder(drop_last=True, variables=['sex', 'embarked'])

In [44]:
tmp = ohe_enc.transform(X_train.fillna('Missing'))

tmp.head()

Unnamed: 0,cabin,sex_female,embarked_S,embarked_C,embarked_Q
501,Missing,1,1,0,0
588,Missing,1,1,0,0
402,Missing,1,0,1,0
1193,Missing,0,0,0,1
686,Missing,1,0,0,1


In [45]:
X_train.head()

Unnamed: 0,sex,embarked,cabin
501,female,S,
588,female,S,
402,female,C,
1193,male,Q,
686,female,Q,


Note how Feature-Engine returns the dummy variables with their names and drops the original variables that were encoded (cabin is left untouched because it was not selected), leaving the dataset ready for further exploration or building machine learning models.

In [46]:
tmp = ohe_enc.transform(X_test.fillna('Missing'))

tmp.head()

Unnamed: 0,cabin,sex_female,embarked_S,embarked_C,embarked_Q
1139,Missing,0,1,0,0
533,Missing,1,1,0,0
459,Missing,0,1,0,0
1150,Missing,0,1,0,0
393,Missing,0,1,0,0


In [47]:
# Feature-Engine's one hot encoder also selects
# all categorical variables automatically

ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    drop_last=True)  # True returns k-1 dummies, False returns k


ohe_enc.fit(X_train.fillna('Missing'))

OneHotEncoder(drop_last=True, variables=['sex', 'embarked', 'cabin'])

In [48]:
ohe_enc.variables

['sex', 'embarked', 'cabin']

In [49]:
tmp = ohe_enc.transform(X_train.fillna('Missing'))

tmp.head()

Unnamed: 0,sex_female,embarked_S,embarked_C,embarked_Q,cabin_Missing,cabin_E,cabin_C,cabin_D,cabin_B,cabin_A,cabin_F,cabin_T
501,1,1,0,0,1,0,0,0,0,0,0,0
588,1,1,0,0,1,0,0,0,0,0,0,0
402,1,0,1,0,1,0,0,0,0,0,0,0
1193,0,0,0,1,1,0,0,0,0,0,0,0
686,1,0,0,1,1,0,0,0,0,0,0,0


In [50]:
tmp = ohe_enc.transform(X_test.fillna('Missing'))

tmp.head()

Unnamed: 0,sex_female,embarked_S,embarked_C,embarked_Q,cabin_Missing,cabin_E,cabin_C,cabin_D,cabin_B,cabin_A,cabin_F,cabin_T
1139,0,1,0,0,1,0,0,0,0,0,0,0
533,1,1,0,0,1,0,0,0,0,0,0,0
459,0,1,0,0,1,0,0,0,0,0,0,0
1150,0,1,0,0,1,0,0,0,0,0,0,0
393,0,1,0,0,1,0,0,0,0,0,0,0


Note how this encoder returns a variable cabin_T for the test set as well, even though this category is not present in the test set. This allows integration with Scikit-learn pipelines and scoring of the test set by the trained model.

In fact, we can check that the sum of cabin_T is 0:

In [51]:
tmp['cabin_T'].sum()

0
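
The reason an all-zero cabin_T column appears in the test set is that the encoder learns the categories once, during fit, and reuses them in every transform. The sketch below illustrates that idea with plain pandas. `SimpleOneHot` is a hypothetical toy class written for this notebook, not Feature-Engine's implementation, and the data are made up:

```python
import pandas as pd


class SimpleOneHot:
    """Toy encoder (hypothetical): learns categories in fit, reuses them in transform."""

    def fit(self, X):
        # Store the categories seen in the training data, per column
        self.categories_ = {col: list(X[col].unique()) for col in X.columns}
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        # Always create one column per learned category,
        # whether or not it occurs in the data being transformed
        for col, cats in self.categories_.items():
            for cat in cats:
                out[f"{col}_{cat}"] = (X[col] == cat).astype(int)
        return out


# Toy data (hypothetical): "T" appears in train only
train = pd.DataFrame({"cabin": ["Missing", "C", "T"]})
test = pd.DataFrame({"cabin": ["C", "Missing"]})

enc = SimpleOneHot().fit(train)
test_encoded = enc.transform(test)

print(test_encoded.columns.tolist())
print(test_encoded["cabin_T"].sum())
```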