We observed that when a categorical variable has many labels, one-hot encoding it expands the feature space dramatically.

In [None]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('Mercedes.csv',usecols=['X1','X2','X3','X4','X5'])
data.head()

Unnamed: 0,X1,X2,X3,X4,X5
0,v,at,a,d,u
1,t,av,e,d,y
2,w,n,c,d,x
3,t,n,f,d,x
4,v,n,f,d,h


In [6]:
# let's have a look at how many labels each variable has
for col in data.columns:
    print(col, ':', len(data[col].unique()), 'Labels')

X1 : 27 Labels
X2 : 44 Labels
X3 : 7 Labels
X4 : 4 Labels
X5 : 29 Labels


In [5]:
# Let's apply one-hot encoding to this data
pd.get_dummies(data,drop_first=True).head()

Unnamed: 0,X1_aa,X1_ab,X1_b,X1_c,X1_d,X1_e,X1_f,X1_g,X1_h,X1_i,...,X5_o,X5_p,X5_q,X5_r,X5_s,X5_u,X5_v,X5_w,X5_x,X5_y
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
pd.get_dummies(data,drop_first=True).shape

(4209, 106)

We can see that from just 5 initial categorical variables, we end up with 106 new variables.
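The width reported by `.shape` can be cross-checked against the label counts printed earlier. The sketch below assumes only those printed counts plus a small made-up frame (`toy`, not the Mercedes file): with `drop_first=True`, each variable contributes one column fewer than its number of labels.

```python
import pandas as pd

# Label counts as printed earlier for X1..X5
label_counts = {'X1': 27, 'X2': 44, 'X3': 7, 'X4': 4, 'X5': 29}

# With drop_first=True, each variable yields (labels - 1) dummy columns
expected_columns = sum(n - 1 for n in label_counts.values())
print(expected_columns)

# Same rule checked on a small made-up frame (toy data, illustration only)
toy = pd.DataFrame({'A': ['x', 'y', 'z', 'x'], 'B': ['p', 'q', 'p', 'q']})
toy_expected = sum(toy[col].nunique() - 1 for col in toy.columns)
toy_actual = pd.get_dummies(toy, drop_first=True).shape[1]
print(toy_expected, toy_actual)
```

Without `drop_first=True`, the count would simply be the sum of the label counts.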

These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets and also other Kaggle or KDD datasets, it is not unusual to find several categorical variables with multiple labels. And if we use one hot encoding on them, we will end up with datasets with thousands of columns.

What can we do instead?

In the winning solution of the KDD 2009 cup, "Winning the KDD Cup Orange Challenge with Ensemble Selection" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one-hot encoding to the 10 most frequent labels of each variable. That is, they create one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a single new category, which in this case is dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

How can we do that in Python?

In [8]:
# let's find the top 10 most frequent categories for the variable X2
data['X2'].value_counts().sort_values(ascending=False).head(10)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
Name: X2, dtype: int64

In [11]:
# let's make a list with the most frequent categories of the variable
top_10 = [x for x in data['X2'].value_counts().sort_values(ascending=False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [15]:
# and now we make the 10 binary variables
for label in top_10:
    data[label] = np.where(data['X2']==label,1,0)
data[['X2']+top_10]

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
4204,as,1,0,0,0,0,0,0,0,0,0
4205,t,0,0,0,0,0,0,0,0,0,0
4206,r,0,0,0,0,0,1,0,0,0,0
4207,e,0,0,0,0,0,0,0,0,0,1
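
The claim that this approach is equivalent to grouping the remaining labels under one category and dropping it can be checked on a small example. This is a minimal sketch on a made-up Series (`s`, with the top 2 labels instead of the top 10), not on the Mercedes data; the category name `'other'` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values, illustration only)
s = pd.Series(['a', 'b', 'a', 'c', 'd', 'a', 'b', 'e'], name='X')
top_2 = s.value_counts().head(2).index.tolist()

# Route 1: one binary column per frequent label
manual = pd.DataFrame({label: np.where(s == label, 1, 0) for label in top_2})

# Route 2: group rare labels as 'other', one-hot encode, then drop 'other'
grouped = s.where(s.isin(top_2), 'other')
dummies = pd.get_dummies(grouped).drop(columns='other')[top_2].astype(int)

# Both routes should give the same 0/1 matrix
same = np.array_equal(manual.to_numpy(), dummies.to_numpy())
print(same)
```

Rows whose label is outside the top list are all zeros in both routes, which is how the dropped category remains implicitly represented.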


In [23]:
# Let's write a reusable function that performs the operation above
def one_hot_topx(df,var,top_x_labels):
    for label in top_x_labels:
        df[var+'_'+label] = np.where(df[var]==label,1,0)

In [25]:
# Let's read the data again and encode the 'X2' column

data = pd.read_csv('Mercedes.csv',usecols = ['X1','X2','X3','X4','X5'])
one_hot_topx(data,'X2',top_10)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,0,0,0,0,0,0,1,0,0,0


In [29]:
# find the most frequent categories for X3 (it has only 7 labels, so head(10) keeps all of them)
top_10 = [x for x in data['X3'].value_counts().sort_values(ascending=False).head(10).index]
one_hot_topx(data,'X3',top_10)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X2_as,X2_ae,X2_ai,X2_m,X2_ak,...,X2_s,X2_f,X2_e,X3_c,X3_f,X3_a,X3_d,X3_g,X3_e,X3_b
0,v,at,a,d,u,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,t,av,e,d,y,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,w,n,c,d,x,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,t,n,f,d,x,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,v,n,f,d,h,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [32]:
# find the 10 most frequent categories for X1
top_10 = [x for x in data['X1'].value_counts().sort_values(ascending=False).head(10).index]
one_hot_topx(data,'X1',top_10)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X2_as,X2_ae,X2_ai,X2_m,X2_ak,...,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
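
The cells above repeat the same two steps for each variable: find the most frequent labels, then create the binary columns. A possible way to combine them is sketched below. The helper name `one_hot_top` and its `top_x` parameter are hypothetical (not part of the original notebook), and the example runs on a small made-up frame rather than the Mercedes file. Unlike `one_hot_topx`, this version returns a copy and leaves the input untouched.

```python
import numpy as np
import pandas as pd

def one_hot_top(df, variables, top_x=10):
    """Return a copy of df with one binary column per top_x most frequent
    label of each listed variable (hypothetical helper, illustration only)."""
    out = df.copy()
    for var in variables:
        # value_counts() sorts by frequency, most frequent first
        top_labels = out[var].value_counts().head(top_x).index
        for label in top_labels:
            out[f'{var}_{label}'] = np.where(out[var] == label, 1, 0)
    return out

# Toy frame with made-up values
toy = pd.DataFrame({'X1': ['v', 't', 'w', 't', 'v', 't'],
                    'X2': ['n', 'n', 'as', 'as', 'n', 'at']})
encoded = one_hot_top(toy, ['X1', 'X2'], top_x=2)
print(encoded.columns.tolist())
```

When a variable has fewer labels than `top_x`, as X3 does here, every label simply gets its own column.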
