In [1]:
import pandas as pd

## Generate the Dummy Dataset

Let's create a sample dataset with a categorical feature `Color` and a numerical feature `Value`.

In [2]:
data = {
    'Color': ['Red', 'Green', 'Blue', 'Purple', 'Green'],
    'Value': [10, 20, 15, 25, 30]
}

df = pd.DataFrame(data)

## One-Hot Encoding

We will use `get_dummies` to one-hot encode the `Color` column.

In [3]:
df_encoded = pd.get_dummies(df, columns=['Color'])

In [4]:
df_encoded

,Value,Color_Blue,Color_Green,Color_Purple,Color_Red
0,10,0,0,0,1
1,20,0,1,0,0
2,15,1,0,0,0
3,25,0,0,1,0
4,30,0,1,0,0


## Create the Holdout Dataset

Now let's create a holdout dataset whose `Color` feature does not contain exactly the same categories as the training dataset.

In [5]:
holdout_data = {
    'Color': ['Red', 'Blue', 'Yellow', 'Green'],
    'Value': [18, 22, 28, 35]
}

holdout_df = pd.DataFrame(holdout_data)

In [6]:
holdout_encoded = pd.get_dummies(holdout_df, columns=['Color'])

The one-hot-encoded holdout dataset looks like this:

In [7]:
holdout_encoded

,Value,Color_Blue,Color_Green,Color_Red,Color_Yellow
0,18,0,0,1,0
1,22,1,0,0,0
2,28,0,0,0,1
3,35,0,1,0,0


Notice that the encoded training dataset contains a `Color_Purple` column:

In [8]:
df_encoded.dtypes

Value           int64
Color_Blue      uint8
Color_Green     uint8
Color_Purple    uint8
Color_Red       uint8
dtype: object

Notice that the encoded holdout DataFrame does not contain `Color_Purple`, but it does contain `Color_Yellow`:

In [9]:
holdout_encoded.dtypes

Value           int64
Color_Blue      uint8
Color_Green     uint8
Color_Red       uint8
Color_Yellow    uint8
dtype: object
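Comparing the two `dtypes` listings by eye works for five columns but not for hundreds. As a minimal sketch, the mismatch can also be computed directly with `Index.difference`. The variable names `train_encoded`, `holdout_enc`, `missing_in_holdout`, and `extra_in_holdout` are illustrative and not part of the notebook above, and `dtype=int` is passed only so that the dummy columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

# Rebuild the two encoded frames from the notebook (illustrative names).
train_encoded = pd.get_dummies(
    pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Purple', 'Green'],
                  'Value': [10, 20, 15, 25, 30]}),
    columns=['Color'], dtype=int)
holdout_enc = pd.get_dummies(
    pd.DataFrame({'Color': ['Red', 'Blue', 'Yellow', 'Green'],
                  'Value': [18, 22, 28, 35]}),
    columns=['Color'], dtype=int)

# Columns the model was trained on but the holdout set lacks.
missing_in_holdout = train_encoded.columns.difference(holdout_enc.columns)
# Columns that exist only in the holdout set.
extra_in_holdout = holdout_enc.columns.difference(train_encoded.columns)

print(list(missing_in_holdout))  # ['Color_Purple']
print(list(extra_in_holdout))    # ['Color_Yellow']
```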

## Align the Holdout Dataset

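Now we'll use the `align` method to make the columns of the holdout dataset match the one-hot-encoded columns of the training dataset. Columns that are missing from the holdout dataset are added and filled with zeros (`fill_value=0`).

Because we pass `join="left"` with `axis=1`, the result keeps only the columns of the left DataFrame, which is the training dataset. `Color_Purple`, which was missing from the holdout dataset, is added, and any column that appears only in the holdout dataset (here `Color_Yellow`) is dropped. As a consequence, the holdout row whose color was `Yellow` ends up with zeros in every `Color_` column.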

In [10]:
df_encoded, holdout_encoded = df_encoded.align(holdout_encoded, fill_value=0, axis=1, join="left")

In [11]:
df_encoded

,Value,Color_Blue,Color_Green,Color_Purple,Color_Red
0,10,0,0,0,1
1,20,0,1,0,0
2,15,1,0,0,0
3,25,0,0,1,0
4,30,0,1,0,0


In [12]:
holdout_encoded

,Value,Color_Blue,Color_Green,Color_Purple,Color_Red
0,18,0,0,0,1
1,22,1,0,0,0
2,28,0,0,0,0
3,35,0,1,0,0
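For comparison, the same holdout result can be obtained with `DataFrame.reindex`, which is a different technique from the `align` call used above. This is only a sketch of an alternative: the names `train_enc`, `holdout_enc`, and `holdout_reindexed` are illustrative, and `dtype=int` is an assumption added so the dummy columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

# Rebuild the encoded frames from the notebook (illustrative names).
train_enc = pd.get_dummies(
    pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Purple', 'Green'],
                  'Value': [10, 20, 15, 25, 30]}),
    columns=['Color'], dtype=int)
holdout_enc = pd.get_dummies(
    pd.DataFrame({'Color': ['Red', 'Blue', 'Yellow', 'Green'],
                  'Value': [18, 22, 28, 35]}),
    columns=['Color'], dtype=int)

# Reindex the holdout columns to the training columns:
# missing columns are created and filled with 0, extra columns are dropped.
holdout_reindexed = holdout_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(holdout_reindexed.columns))
# ['Value', 'Color_Blue', 'Color_Green', 'Color_Purple', 'Color_Red']
```

Unlike `align`, `reindex` returns only the adjusted holdout DataFrame and leaves the training DataFrame untouched, which may be convenient when only one side needs to change.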


## Complete

Now that both DataFrames have the same columns in the same order, we can make predictions on the holdout dataset.
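Before predicting, it may be worth confirming that the columns really do match and checking which holdout rows carried a category the training data never saw. The following is a minimal sketch under the same data as above. The names `train_enc`, `holdout_enc`, `color_cols`, and `unseen_rows` are illustrative, and `dtype=int` is an assumption added so the dummy columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

# Rebuild and align the encoded frames as in the notebook (illustrative names).
train_enc = pd.get_dummies(
    pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Purple', 'Green'],
                  'Value': [10, 20, 15, 25, 30]}),
    columns=['Color'], dtype=int)
holdout_enc = pd.get_dummies(
    pd.DataFrame({'Color': ['Red', 'Blue', 'Yellow', 'Green'],
                  'Value': [18, 22, 28, 35]}),
    columns=['Color'], dtype=int)
train_enc, holdout_enc = train_enc.align(
    holdout_enc, join='left', axis=1, fill_value=0)

# Both frames should now expose identical columns in identical order.
assert list(train_enc.columns) == list(holdout_enc.columns)

# Rows whose original category was unseen in training have no dummy set.
color_cols = [c for c in train_enc.columns if c.startswith('Color_')]
unseen_rows = holdout_enc[holdout_enc[color_cols].sum(axis=1) == 0]

print(unseen_rows.index.tolist())  # [2]
```

Row 2 is the `Yellow` row. The model will treat it as having no color at all, so predictions for such rows deserve extra scrutiny.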