When we have a dataset like the one shown, we can build a simple model
that’s trained on all features except “f3”, with “f3” as the target. Thus, you will be
creating a model that predicts “f3” wherever it’s not known or not available. I can’t
say whether this kind of model is going to give you excellent performance, but it
might be able to handle those missing values in the test set or in live data. As with
everything else in machine learning, one can’t say without trying.
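
As a rough illustration of this idea, here is a minimal sketch. The toy data frame, the column names “f1” and “f2”, and the choice of a decision tree are all assumptions made for the example; any classifier could stand in, and nothing here is tuned.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy data: f3 is missing in the last two rows
df = pd.DataFrame({
    "f1": [0, 0, 1, 1, 0, 1],
    "f2": [0, 1, 0, 1, 0, 1],
    "f3": ["a", "a", "b", "b", None, None],
})

# rows where f3 is known are used for training,
# rows where it is missing are the ones we want to fill
known = df[df["f3"].notna()]
missing = df[df["f3"].isna()]

# train on all features except f3, with f3 as the target
imputer = DecisionTreeClassifier(random_state=0)
imputer.fit(known[["f1", "f2"]], known["f3"])

# fill the missing values with the model's predictions
df.loc[missing.index, "f3"] = imputer.predict(missing[["f1", "f2"]])
```

Whether the filled-in values actually help the final model is something you would still have to check with cross-validation.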

If you have a fixed test set, you can add your test data to the training data to learn
about the categories in a given feature. This is very similar to semi-supervised
learning, in which you use data that is not available for training to improve your
model. It will also take care of rare values that appear only a few times in the training
data but are abundant in the test data. Your model will be more robust.

Many people think that this idea overfits. It may or may not overfit. There is a
simple fix for that. **If you design your cross-validation in such a way that it
replicates the prediction process when you run your model on test data, then it’s
never going to overfit**. It means that the first step should be the separation of folds,
and in each fold, you should apply the same pre-processing that you want to apply
to test data. Suppose you want to concatenate training and test data, then in each
fold you must concatenate training and validation data and also make sure that your
validation dataset replicates the test set. In this specific case, you must design your
validation sets in such a way that they contain categories which are “unseen” in the
training set.

How this works can be understood easily by looking at figure 4 and the following
code.

In [2]:
import pandas as pd
from sklearn import preprocessing
# read training data
train = pd.read_csv("../cat_train.csv")
# read test data
test = pd.read_csv("../cat_test.csv")
# create a fake target column for test data
# since this column doesn't exist
test.loc[:, "target"] = -1
# concatenate both training and test data
data = pd.concat([train, test]).reset_index(drop=True)
# make a list of features we are interested in
# id and target is something we should not encode
features = [x for x in train.columns if x not in ["id", "target"]]
# loop over the features list
for feat in features:
    # create a new instance of LabelEncoder for each feature
    lbl_enc = preprocessing.LabelEncoder()
    # note the trick here
    # since it's categorical data, we fillna with a string
    # and we convert all the data to string type
    # so, no matter if it's int or float, it's converted to string
    # int/float but categorical!!!
    temp_col = data[feat].fillna("NONE").astype(str).values
    # we can use fit_transform here as we do not
    # have any extra test data that we need to
    # transform on separately
    data.loc[:, feat] = lbl_enc.fit_transform(temp_col)

In [3]:
# split the training and test data again
train = data[data.target != -1].reset_index(drop=True)
test = data[data.target == -1].reset_index(drop=True)
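
The same concatenation can be repeated inside cross-validation, as described earlier: separate the folds first, then apply the pre-processing within each fold. The following is only a sketch under assumptions: the toy data frame and the “city” column are made up for the example, and the model fitting step is left as a comment.

```python
import pandas as pd
from sklearn import model_selection, preprocessing

# hypothetical toy frame standing in for the training data
df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "d", "a", "e"],
    "target": [0, 1, 0, 1, 1, 0, 0, 1],
})


def as_str(col):
    # same treatment as above: fill NaN, then cast to string
    return col.fillna("NONE").astype(str).values


kf = model_selection.KFold(n_splits=4, shuffle=True, random_state=42)
fold_sizes = []
for train_idx, valid_idx in kf.split(df):
    # step 1: separate the fold first
    fold_train = df.iloc[train_idx].copy()
    fold_valid = df.iloc[valid_idx].copy()

    # step 2: repeat the test-time pre-processing inside the fold,
    # i.e. fit the encoder on training + validation values together
    lbl_enc = preprocessing.LabelEncoder()
    lbl_enc.fit(as_str(pd.concat([fold_train["city"], fold_valid["city"]])))

    fold_train["city"] = lbl_enc.transform(as_str(fold_train["city"]))
    fold_valid["city"] = lbl_enc.transform(as_str(fold_valid["city"]))

    # a model would be fit on fold_train and scored on fold_valid here
    fold_sizes.append((len(fold_train), len(fold_valid)))
```

Only the feature values of the validation rows are used when fitting the encoder, never their targets, which mirrors what happens with the real test set.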

This trick works when you have a problem where you already have the test dataset.
It must be noted that this trick will not work in a live setting. For example, let’s say
you are in a company that builds a real-time bidding (RTB) solution. RTB systems
bid on every user they see online to buy ad space. The features that can be used for
such a model may include pages viewed on a website. Let’s assume that the features
are the last five categories/pages visited by the user. In this case, if the website
introduces new categories, we will no longer be able to predict accurately. Our
model, in this case, will fail.

A situation like this can be avoided by using an **“unknown”** category.

In our cat-in-the-dat dataset, we already have unknowns in the ord_2 column.

In [5]:
df = pd.read_csv('../cat_train.csv')
df.ord_2.fillna("NONE").value_counts()

Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
NONE            18075
Name: ord_2, dtype: int64

We can treat “NONE” as unknown. So, if during live testing, we get new categories
that we have not seen before, we will mark them as “NONE”.
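
A minimal sketch of that marking step at prediction time could look like this. The set of known categories is copied from the value counts above; the live values, including “Tepid”, are invented for the example.

```python
import pandas as pd

# categories seen in training (from the ord_2 value counts above)
known = {"Freezing", "Warm", "Cold", "Boiling Hot", "Hot", "Lava Hot"}

# hypothetical live values: "Tepid" was never seen, None is missing
live = pd.Series(["Warm", "Tepid", None, "Cold"])

# keep values that are known; everything else,
# missing or unseen, becomes "NONE"
live_clean = live.where(live.isin(known), "NONE")
```

After this step, the live data only contains values the model was trained on.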

So, you can either assume that your test data will have the same categories as
training or you can introduce a rare or unknown category to training to take care of
new categories in test data.

Let’s see the value counts in the ord_4 column after filling NaN values:

In [6]:
df.ord_4.fillna("NONE").value_counts()

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
G        3404
V        3107
J        1950
L        1657
Name: ord_4, dtype: int64

We see that some values appear only a couple thousand times, and some appear
almost 40000 times. NaNs are also seen a lot.

We can now define our criteria for calling a value “rare”. Let’s say the requirement
for a value being rare in this column is a count of less than 2000. By that rule, J and
L can be marked as rare values. With pandas, it is quite easy to replace categories
based on a count threshold. Let’s take a look at how it’s done.

In [10]:
df.ord_4 = df.ord_4.fillna("NONE")

In [11]:
df.loc[df["ord_4"].value_counts()[df["ord_4"]].values < 2000, "ord_4"] = "RARE"

In [12]:
df.ord_4.value_counts()

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
RARE     3607
G        3404
V        3107
Name: ord_4, dtype: int64

We say that wherever the value count for a certain category is less than 2000, we
replace it with “RARE”. So, now, when it comes to test data, all the new, unseen
categories can be mapped to “RARE”, and all missing values can be mapped to
“NONE”.

This approach will also ensure that the model works in a live setting, even if you
have new categories.
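
To make the test-time mapping concrete, here is a small sketch. The helper names `fit_frequent` and `apply_rare`, the toy series, and the threshold of 2 are all assumptions for illustration. One design choice to note: this sketch always keeps “NONE” as its own category, even if missing values are themselves infrequent.

```python
import pandas as pd


def fit_frequent(train_col, min_count):
    # categories (after filling NaN) seen at least min_count times
    counts = train_col.fillna("NONE").value_counts()
    return set(counts[counts >= min_count].index)


def apply_rare(col, frequent):
    # missing values become "NONE"; anything that was not
    # frequent in training, including unseen values, becomes "RARE"
    col = col.fillna("NONE")
    keep = col.isin(frequent) | (col == "NONE")
    return col.where(keep, "RARE")


# hypothetical toy columns
train_col = pd.Series(["N", "N", "N", "P", "P", "J", None])
test_col = pd.Series(["N", "J", "Q", None])

# learn the frequent categories from training data only
frequent = fit_frequent(train_col, min_count=2)

train_out = apply_rare(train_col, frequent)
test_out = apply_rare(test_col, frequent)
```

Here “J” is rare in training and “Q” is never seen, so both end up as “RARE” in the test column, while the missing value becomes “NONE”.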