## One Hot Encoding - Variables with many categories

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("File/mercendesbenz.csv", usecols=['X1','X2','X3','X4','X5','X6'])
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [3]:
# Check the shape
df.shape

(4209, 6)

In [4]:
# Let's look at how many labels each variable has

for col in df.columns:
    print(col,': ',len(df[col].unique()),' labels')

X1 :  27  labels
X2 :  44  labels
X3 :  7  labels
X4 :  4  labels
X5 :  29  labels
X6 :  12  labels


In [5]:
# Let's examine how many columns we will obtain after one hot encoding these variables
pd.get_dummies(data=df, drop_first=True).shape

(4209, 117)

##### We can see that from just 6 initial categorical variables, we end up with 117 new variables.
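
The figure of 117 can be predicted without building the dummies: with `drop_first=True`, each variable contributes one column fewer than its number of labels, so the six variables give 27 + 44 + 7 + 4 + 29 + 12 - 6 = 117. Below is a minimal sketch of that check on a made-up toy frame (the `toy` data and its column names are assumptions for illustration, not the Mercedes-Benz file).

```python
import pandas as pd

# Toy frame with made-up values (not the Mercedes-Benz data)
toy = pd.DataFrame({
    "colour": ["red", "blue", "green", "red"],   # 3 distinct labels
    "size": ["S", "M", "S", "L"],                # 3 distinct labels
})

# With drop_first=True each variable contributes (number of labels - 1) columns
expected_cols = int((toy.nunique() - 1).sum())
actual_cols = pd.get_dummies(toy, drop_first=True).shape[1]

print(expected_cols, actual_cols)
```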

###### What can we do instead?

One solution is to pick the 10 most frequent labels of the variable (or more, depending on the dataset) and create one binary variable for each of those labels only. This is equivalent to grouping all the other labels under a new category, which in this case is dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

How can we do that in Python?

In [6]:
# Let's look at the 20 most frequent categories of the X2 variable

# value_counts() already sorts in descending order, so no extra sort_values() call is needed

df.X2.value_counts().head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
i       25
k       25
b       21
ao      20
z       19
ag      19
Name: X2, dtype: int64

In [7]:
# Let's make a list with the most frequent categories of the variable

top_10 = list(df.X2.value_counts().head(10).index)
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [8]:
# And now we make the 10 binary variables

for label in top_10:
    df[label] = np.where(df['X2']==label, 1, 0)

df[['X2']+top_10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


In [9]:
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,as,ae,ai,m,ak,r,n,s,f,e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,1,0,0,0
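
The claim that this is equivalent to grouping the rare labels into a category that is then dropped can be cross-checked. The sketch below uses a made-up toy column (an assumption, not the real X2 data): it blanks out the rare labels and lets `pd.get_dummies` encode what remains, relying on the fact that `get_dummies` skips missing values by default.

```python
import numpy as np
import pandas as pd

# Toy column with made-up values; "a" and "b" are the 2 most frequent labels
s = pd.Series(["a", "a", "b", "c", "a", "b", "d"], name="X")
top_2 = list(s.value_counts().head(2).index)

# Approach used in this notebook: one np.where column per frequent label
manual = pd.DataFrame({label: np.where(s == label, 1, 0) for label in top_2})

# Cross-check: blank out the rare labels, then one-hot encode what is left.
# get_dummies skips missing values by default (dummy_na=False).
masked = s.where(s.isin(top_2))
auto = pd.get_dummies(masked).astype(int)[top_2]

# True when both approaches produce the same 0/1 values
print(bool((manual.to_numpy() == auto.to_numpy()).all()))
```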


In [10]:
# Helper to create the dummy variables for the most frequent labels of any categorical variable

def one_hot_top_encoder(data, variable, top_x_labels):
    """ Create dummy variables, in place, for the most frequent labels of `variable`.
        The number of labels encoded is set by the length of `top_x_labels`."""
    for label in top_x_labels:
        data[variable+'_'+label] = np.where(data[variable]==label, 1, 0)

In [11]:
# Read the data again (optional)
df2 = pd.read_csv("File/mercendesbenz.csv", usecols=['X1','X2','X3','X4','X5','X6'])
df2.head(3)

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j


In [12]:
# Now apply the function
one_hot_top_encoder(df2, 'X2', top_10)
df2.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,1,0,0,0


In [13]:
# Find the 10 most frequent categories for X1
top_10 = [x for x in df2.X1.value_counts().head(10).index]

# Now apply the function to the X1 column to create dummy variables for its 10 most frequent labels
one_hot_top_encoder(df2, 'X1', top_10)
df2.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,...,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


#### One-hot encoding of the most frequent categories

#### Advantages
  * Straightforward to implement.
  * Does not require hours of variable exploration.
  * Does not massively expand the feature space (number of columns in the dataset).

#### Disadvantages
  * Does not add any information that may make the variable more predictive
  * Does not keep the information of the ignored labels
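
The last disadvantage can be seen directly: every observation whose label is outside the top list becomes an all-zero row, so different rare labels can no longer be told apart. A minimal sketch on a made-up toy column (the data and the `X_` prefix are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with made-up values: "x" and "y" are frequent, "p" and "q" are rare
s = pd.Series(["x", "x", "y", "y", "x", "p", "q"])
top_2 = list(s.value_counts().head(2).index)

encoded = pd.DataFrame({f"X_{label}": np.where(s == label, 1, 0) for label in top_2})

# Rows whose label was not encoded contain only zeros
all_zero = encoded.sum(axis=1) == 0

# The original labels behind those identical all-zero rows
print(s[all_zero].tolist())
```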

Because it is not unusual for categorical variables to have a few dominating categories, with the remaining labels being mostly noise, this simple and straightforward approach may be useful on many occasions.

It is worth noting that 10 is a totally arbitrary number. You could also choose the top 5 or the top 20.
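
One possible way to make the choice less arbitrary is to look at how many rows the top labels cover. In the X2 counts shown earlier, the 10 most frequent labels account for 3754 of the 4209 rows, roughly 89%. The sketch below computes that kind of cumulative coverage on a made-up toy column; the 90% threshold is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy column with made-up values (not the Mercedes-Benz data)
s = pd.Series(list("aaaaaabbbccd"))

# Share of rows covered by the k most frequent labels, for k = 1, 2, 3, ...
coverage = s.value_counts(normalize=True).cumsum()
coverage.index = range(1, len(coverage) + 1)  # index now means "top k"

# Smallest k whose labels cover at least 90% of the rows (threshold is arbitrary)
k = int((coverage >= 0.90).idxmax())
print(coverage.round(2).tolist(), k)
```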


In [14]:
# Make a function that converts every column in the dataset into dummy variables for its most frequent labels

def get_dummy(datasets, top_num_categories: int):
    # Snapshot the column names first, because the loop adds new columns to the DataFrame
    all_col = list(datasets.columns)
    for col in all_col:
        top_labels = list(datasets[col].value_counts().head(top_num_categories).index)
        for label in top_labels:
            datasets[col+'_'+label] = np.where(datasets[col] == label, 1, 0)
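
Note that `get_dummy` modifies the DataFrame it receives, and the string concatenation used for the new column names would fail on a column with non-string labels. As a sketch of one alternative, the hypothetical helper `top_dummies` below (not part of this notebook) returns a new DataFrame and skips numeric columns; the toy data is made up.

```python
import numpy as np
import pandas as pd

def top_dummies(data: pd.DataFrame, top_n: int) -> pd.DataFrame:
    """Return a new frame: `data` plus dummies for the top_n labels of each non-numeric column."""
    new_cols = {}
    for col in data.columns:
        if pd.api.types.is_numeric_dtype(data[col]):
            continue  # skip numeric columns instead of trying to encode them
        for label in data[col].value_counts().head(top_n).index:
            new_cols[f"{col}_{label}"] = np.where(data[col] == label, 1, 0)
    # Build all dummy columns first, then attach them in one step
    return pd.concat([data, pd.DataFrame(new_cols, index=data.index)], axis=1)

# Toy frame with made-up values: one text column and one numeric column
toy = pd.DataFrame({"X1": ["a", "b", "a", "b", "a", "c"], "n": [1, 2, 3, 4, 5, 6]})
out = top_dummies(toy, 2)

print(list(out.columns))   # columns of the returned frame
print(list(toy.columns))   # the input frame keeps its original columns
```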

In [15]:
# Load a fresh copy of the dataset
df3 = pd.read_csv("File/mercendesbenz.csv", usecols=['X1','X2','X3','X4','X5','X6'])
df3.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [16]:
get_dummy(df3, 10)

In [17]:
df3.head(10)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_aa,X1_s,X1_b,X1_l,...,X6_g,X6_j,X6_d,X6_i,X6_l,X6_a,X6_h,X6_k,X6_c,X6_b
0,v,at,a,d,u,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
5,b,e,c,d,g,h,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
6,r,e,f,d,f,h,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,l,as,f,d,f,j,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
8,s,as,e,d,f,i,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
9,b,aq,c,d,f,a,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
