## One Hot Encoding of Frequent Categories

We learned in Section 3 that high cardinality and rare labels can leave some categories present only in the train set, which promotes over-fitting, or only in the test set, in which case our models would not know how to score those observations.

We also learned in the previous lecture on one hot encoding that, if categorical variables contain many labels, re-encoding them with dummy variables expands the feature space dramatically.

**To avoid these complications, we can create dummy variables only for the most frequent categories.**

This procedure is also called one hot encoding of top categories.
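
As a rough illustration of the idea, the sketch below encodes only the 2 most frequent labels of a small toy variable. The `colour` data is hypothetical and is not part of the House Prices dataset.

```python
import pandas as pd

# toy variable (hypothetical data, for illustration only)
colour = pd.Series(
    ["red", "red", "red", "blue", "blue", "green", "pink"], name="colour")

# the 2 most frequent categories: value_counts() sorts by frequency
top = colour.value_counts().head(2).index.tolist()

# one binary column per frequent category;
# observations with any other label get 0 in every column
dummies = pd.DataFrame(
    {"colour_" + label: (colour == label).astype(int) for label in top})

print(top)
print(dummies)
```

The rows for `green` and `pink` contain only zeros, so the encoded data no longer distinguishes between them.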

### Advantages of OHE of top categories

- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space
- Suitable for linear models


### Limitations

- Does not add any information that may make the variable more predictive
- Does not keep the information of the ignored labels


Often, categorical variables show a few dominating categories while the remaining labels add little information. Therefore, OHE of top categories is a simple and useful technique.

### Note

The number of top categories to encode is set by the user. It can be chosen arbitrarily or derived from data exploration.
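
One possible way to derive this number from the data is to count how many categories are needed to cover a given share of the observations. The function `n_categories_for_coverage` and the 70% threshold below are assumptions made for this sketch, not a rule from the course.

```python
import pandas as pd

def n_categories_for_coverage(series, coverage=0.8):
    """Smallest number of top categories whose cumulative share
    of observations reaches `coverage` (hypothetical helper)."""
    # relative frequencies, most frequent category first
    freq = series.value_counts(normalize=True)
    cumulative = freq.cumsum()
    # categories strictly below the threshold, plus the one that crosses it
    needed = int((cumulative < coverage).sum()) + 1
    # guard against floating point rounding when coverage is 1.0
    return min(needed, len(freq))

# toy variable: a appears 5 times, b 3 times, c twice, d once
toy = pd.Series(list("aaaaabbbccd"))

k = n_categories_for_coverage(toy, coverage=0.7)
print(k)
```

Here `a` and `b` together account for 8 of the 11 observations, so 2 categories are enough to pass the 70% threshold.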


## In this demo:

We will see how to perform one hot encoding with:
- pandas and NumPy
- Feature-Engine

We will also look at the advantages and limitations of these implementations, using the House Prices dataset.

In [1]:
# import libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from feature_engine.encoding import OneHotEncoder

In [2]:
# load data

cols = ['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']

data = pd.read_csv('..\\house_price.csv', usecols=cols)
data.head()

| | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500 |
| 1 | Veenker | MetalSd | MetalSd | 181500 |
| 2 | CollgCr | VinylSd | VinylSd | 223500 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000 |
| 4 | NoRidge | VinylSd | VinylSd | 250000 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Neighborhood  1460 non-null   object
 1   Exterior1st   1460 non-null   object
 2   Exterior2nd   1460 non-null   object
 3   SalePrice     1460 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 45.8+ KB


In [6]:
# check cardinality for each variable

for col in cols:
    print('Number of labels for {} : {}'.format(col, data[col].nunique()))

Number of labels for Neighborhood : 25
Number of labels for Exterior1st : 15
Number of labels for Exterior2nd : 16
Number of labels for SalePrice : 663


In [9]:
# let's explore the unique categories
data['Neighborhood'].unique()

array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)

In [10]:
# let's explore the unique categories
data['Exterior1st'].unique()

array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock'], dtype=object)

In [11]:
# let's explore the unique categories
data['Exterior2nd'].unique()

array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',
       'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',
       'AsphShn', 'Stone', 'Other', 'CBlock'], dtype=object)

In [7]:
# split the data

X_train, X_test, y_train, y_test = train_test_split(data.drop('SalePrice', axis = 1),
                                                   data['SalePrice'],
                                                   test_size=0.3,
                                                   random_state=0)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [12]:
# let's see how many variables we get when using pandas pd.get_dummies

pd.get_dummies(X_train, drop_first=True).shape

(1022, 53)

From the initial 3 categorical variables, we end up with 53 variables. 

These numbers are still not huge, and in practice we could work with them relatively easily. However, in real-life datasets, categorical variables can have very high cardinality, and with OHE we can end up with datasets with thousands of columns.


## OHE with pandas and NumPy


### Advantages

- quick
- returns pandas dataframe
- returns feature names for the dummy variables

### Limitations:

- it does not store the categories learned from the train set, so we have to carry them over to the test set ourselves
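
To make this limitation concrete, the sketch below applies `pd.get_dummies` separately to a toy train and test set (hypothetical data) and shows that the resulting columns differ. Re-indexing the test set on the train columns is one possible workaround, shown here only as an illustration.

```python
import pandas as pd

# toy data (hypothetical): label D appears only in the test set
train = pd.DataFrame({"city": ["A", "B", "C", "A"]})
test = pd.DataFrame({"city": ["A", "D"]})

# get_dummies only sees the data it is given
train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)

print(list(train_d.columns))
print(list(test_d.columns))

# align the test set to the train columns:
# dummies missing in test are filled with 0, unseen labels are dropped
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```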

In [17]:
# first, let's find the top 10 categories

X_train['Neighborhood'].value_counts().sort_values(ascending=False).head(10)

NAmes      151
CollgCr    105
OldTown     73
Edwards     71
Sawyer      61
Somerst     56
Gilbert     55
NWAmes      51
NridgHt     51
SawyerW     45
Name: Neighborhood, dtype: int64

In [20]:
# storing these categories in a list
top_cat = [name for name in X_train['Neighborhood'].value_counts().sort_values(ascending=False).head(10).index]
top_cat

['NAmes',
 'CollgCr',
 'OldTown',
 'Edwards',
 'Sawyer',
 'Somerst',
 'Gilbert',
 'NWAmes',
 'NridgHt',
 'SawyerW']

In [22]:
# now we need to create dummy variables for each of these labels

for label in top_cat:
    X_train['Neighborhood_'+label] = np.where(X_train['Neighborhood'] == label, 1, 0)
    
    X_test['Neighborhood_'+label] = np.where(X_test['Neighborhood'] == label, 1, 0)

In [23]:
X_test[['Neighborhood'] + ['Neighborhood_' + col for col in top_cat]].head()

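| | Neighborhood | Neighborhood_NAmes | Neighborhood_CollgCr | Neighborhood_OldTown | Neighborhood_Edwards | Neighborhood_Sawyer | Neighborhood_Somerst | Neighborhood_Gilbert | Neighborhood_NWAmes | Neighborhood_NridgHt | Neighborhood_SawyerW |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 529 | Crawfor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 491 | NAmes | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 459 | BrkSide | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 279 | ClearCr | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 655 | BrDale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |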


In [28]:
# now let's write a function that finds the top categories of any variable

def top_10_labels(X_train, col, how_many):
    
    return [name for name in X_train[col].value_counts().sort_values(ascending=False).head(how_many).index]


In [26]:
# function to generate label columns

def one_hot_encode(X_train, X_test, col, top_cat):
    for label in top_cat:
        # binary indicator for this label, built from the same list for both sets
        X_train[col + '_' + label] = np.where(X_train[col] == label, 1, 0)

        X_test[col + '_' + label] = np.where(X_test[col] == label, 1, 0)

In [29]:
# and now we run a loop over the remaining categorical variables

for variable in ['Exterior1st', 'Exterior2nd']:
    
    top_categories = top_10_labels(X_train, variable, how_many=10)
    
    one_hot_encode(X_train, X_test, variable, top_categories)
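
The functions above add the new columns to the dataframes in place. A possible variant, sketched below on toy data, works on copies and returns the encoded dataframes instead. The name `encode_top_categories` and the toy values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def encode_top_categories(train, test, col, how_many=10):
    """Return encoded copies of train and test, with one dummy per
    frequent label of train[col] (hypothetical helper)."""
    # the top categories are learned from the train set only
    top = train[col].value_counts().head(how_many).index
    train, test = train.copy(), test.copy()
    for label in top:
        name = col + '_' + label
        train[name] = np.where(train[col] == label, 1, 0)
        test[name] = np.where(test[col] == label, 1, 0)
    return train, test

# toy data (hypothetical): label w appears only in the test set
train = pd.DataFrame({"col": ["x", "x", "x", "y", "y", "z"]})
test = pd.DataFrame({"col": ["y", "w"]})

train_enc, test_enc = encode_top_categories(train, test, "col", how_many=2)
print(test_enc)
```

Because the function works on copies, the original `train` and `test` dataframes are left unchanged.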

In [30]:
# check the data
X_train.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,Neighborhood_NWAmes,Neighborhood_NridgHt,Neighborhood_SawyerW,Exterior1st_VinylSd,Exterior1st_HdBoard,Exterior1st_Wd Sdng,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_CemntBd,Exterior1st_BrkFace,Exterior1st_WdShing,Exterior1st_Stucco,Exterior1st_AsbShng,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_HdBoard,Exterior2nd_MetalSd,Exterior2nd_Plywood,Exterior2nd_CmentBd,Exterior2nd_Wd Shng,Exterior2nd_BrkFace,Exterior2nd_AsbShng,Exterior2nd_Stucco
64,CollgCr,VinylSd,VinylSd,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
682,ClearCr,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
960,BrkSide,Wd Sdng,Plywood,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1384,Edwards,WdShing,Wd Shng,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1100,SWISU,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [31]:
X_train.shape

(1022, 33)

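Note how we now have 30 dummy variables (10 for each of the 3 variables), instead of the 53 that we would have had if we had created dummies for all categories. The 3 original variables are still in the dataframe, which is why it has 33 columns.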
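
These counts can be checked by hand. The sketch below takes the cardinalities printed earlier in the notebook and assumes that every label also appears in the train set, which is consistent with the 53 columns returned by `pd.get_dummies`.

```python
# cardinality of each variable, as printed earlier in the notebook
cardinality = {"Neighborhood": 25, "Exterior1st": 15, "Exterior2nd": 16}

# full OHE with drop_first=True creates k - 1 dummies per variable
full_ohe = sum(k - 1 for k in cardinality.values())

# OHE of the top 10 categories creates at most 10 dummies per variable
top_ohe = sum(min(10, k) for k in cardinality.values())

print(full_ohe, top_ohe)
```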

## One hot encoding of top categories with Feature-Engine

### Advantages

- quick
- creates the same number of features in train and test set

### Limitations

- None to my knowledge

In [32]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],  # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [33]:
ohe_enc = OneHotEncoder(
    top_categories=10,  # you can change this value to select more or fewer categories
    # we can select which variables to encode
    variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'],
    drop_last=False)

ohe_enc.fit(X_train)

OneHotEncoder(top_categories=10,
              variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'])

In [34]:
ohe_enc.variables

['Neighborhood', 'Exterior1st', 'Exterior2nd']

In [36]:
# we can see the variable names along with the top 10 labels for each of them
ohe_enc.encoder_dict_

{'Neighborhood': ['NAmes',
  'CollgCr',
  'OldTown',
  'Edwards',
  'Sawyer',
  'Somerst',
  'Gilbert',
  'NWAmes',
  'NridgHt',
  'SawyerW'],
 'Exterior1st': ['VinylSd',
  'HdBoard',
  'Wd Sdng',
  'MetalSd',
  'Plywood',
  'CemntBd',
  'BrkFace',
  'WdShing',
  'Stucco',
  'AsbShng'],
 'Exterior2nd': ['VinylSd',
  'Wd Sdng',
  'HdBoard',
  'MetalSd',
  'Plywood',
  'CmentBd',
  'Wd Shng',
  'BrkFace',
  'AsbShng',
  'Stucco']}

In [37]:
X_train = ohe_enc.transform(X_train)
X_test = ohe_enc.transform(X_test)

# let's explore the result
X_train.head()

Unnamed: 0,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,Neighborhood_NWAmes,Neighborhood_NridgHt,Neighborhood_SawyerW,Exterior1st_VinylSd,Exterior1st_HdBoard,Exterior1st_Wd Sdng,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_CemntBd,Exterior1st_BrkFace,Exterior1st_WdShing,Exterior1st_Stucco,Exterior1st_AsbShng,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_HdBoard,Exterior2nd_MetalSd,Exterior2nd_Plywood,Exterior2nd_CmentBd,Exterior2nd_Wd Shng,Exterior2nd_BrkFace,Exterior2nd_AsbShng,Exterior2nd_Stucco
64,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
682,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
960,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1384,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1100,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [38]:
X_train.shape

(1022, 30)

**Note**

If the argument `variables` is left as None, then the encoder will automatically identify **all categorical variables**.

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.
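
A minimal sketch of such a re-cast with pandas is shown below. The column `building_class` and its values are hypothetical and only stand in for a numerical variable whose values are really category codes.

```python
import pandas as pd

# toy dataframe (hypothetical): integer codes that are really categories
df = pd.DataFrame({"building_class": [20, 60, 20, 70]})

# before the re-cast, pandas treats the column as numerical
was_numeric = pd.api.types.is_numeric_dtype(df["building_class"])

# re-cast as object so that the column is treated as categorical
df["building_class"] = df["building_class"].astype("object")

print(was_numeric, df["building_class"].dtype)
```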