## One Hot Encoding - variables with many categories

In [1]:
import os

In [2]:
import pandas as pd
import numpy as np
import  matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

In [3]:
# Load the Mercedes-Benz dataset for demonstration, keeping only the categorical variables
os.chdir('..')
# Pass path components separately so no backslash escape is needed
path = os.path.join(os.getcwd(), "Datasets", "mercedes.csv")
data  = pd.read_csv(path, usecols=['X1','X2','X3','X4','X5','X6'])
data

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d
...,...,...,...,...,...,...
4204,s,as,c,d,aa,d
4205,o,t,d,d,aa,h
4206,v,r,a,d,aa,g
4207,r,e,f,d,aa,l


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   X1      4209 non-null   object
 1   X2      4209 non-null   object
 2   X3      4209 non-null   object
 3   X4      4209 non-null   object
 4   X5      4209 non-null   object
 5   X6      4209 non-null   object
dtypes: object(6)
memory usage: 197.4+ KB


In [5]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [6]:
data.shape

(4209, 6)

In [7]:
#Let's have a look at how many labels each variable has
for col in data.columns:
    print(col, ': ', len(data[col].unique()), 'labels')

X1 :  27 labels
X2 :  44 labels
X3 :  7 labels
X4 :  4 labels
X5 :  29 labels
X6 :  12 labels


In [8]:
#Let's examine how many columns we will obtain after one hot encoding these variables
pd.get_dummies(data, drop_first=True).shape

(4209, 117)

We can see that from just 6 initial categorical variables, we end up with 117 new variables.<br/>
What can we do instead?

If a dataset has one or more features with 500 or more categories, one hot encoding each of them creates 499 or more columns per feature. As the number of columns grows, the model becomes more exposed to the curse of dimensionality, which may hurt its accuracy.
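
The column count reported above can be checked by hand. With `drop_first=True`, a variable with k labels contributes k - 1 dummy columns. A small sketch using the label counts printed earlier (the dictionary is simply copied from that output):

```python
# Label counts per variable, copied from the output printed earlier
label_counts = {"X1": 27, "X2": 44, "X3": 7, "X4": 4, "X5": 29, "X6": 12}

# With drop_first=True each variable with k labels yields k - 1 dummy columns
dummy_columns = {col: k - 1 for col, k in label_counts.items()}
total_columns = sum(dummy_columns.values())

print(dummy_columns)
print(total_columns)  # 117, the second value of the shape shown earlier
```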

One solution is to limit one hot encoding to the 10 most frequent labels of the variable, creating one binary variable for each of those labels only. This is equivalent to grouping all the other labels under a single new category, which is then dropped. The 10 dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

In [9]:
#let's look at the 20 most frequent categories for the variable X2
data.X2.value_counts().sort_values(ascending = False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
ag      19
z       19
Name: X2, dtype: int64

In [10]:
#Let's make a list with the most frequent categories of the variable
top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [11]:
#and now we make the 10 binary variables
for label in top_10:
    data[label] = np.where(data['X2']==label,1,0)
data[['X2']+top_10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


In [12]:
#get whole set of dummy variables, for all the categorical variables
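def one_hot_top_x(df, variable, top_x_labels):
    # add one binary column per label, modifying df in place
    for label in top_x_labels:
        # read from the df argument rather than the global `data`
        df[variable+'_'+label] = np.where(df[variable]==label,1,0)
    return None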
        
#read the data again
data = pd.read_csv(path, usecols=['X1','X2','X3','X4','X5','X6'])

#encode X2 into the 10 most frequent categories
one_hot_top_x(data, 'X2', top_10)

In [13]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,1,0,0,0


In [14]:
#find the 10 most frequent categories for X1
top_10 = [x for x in data.X1.value_counts().sort_values(ascending=False).head(10).index]

#now create the 10 most frequent dummy variables for X1
one_hot_top_x(data,'X1', top_10)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,...,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


## One Hot Encoding Variables

### Advantages
<ul>
    <li>Straightforward to implement</li>
    <li>Does not require hours of variable exploration</li>
    <li>Does not massively expand the feature space (number of columns in the dataset)</li>
</ul>

### Disadvantages
<ul>
    <li>Does not add any information that may make the variable more predictive</li>
    <li>Does not keep the information of the ignored labels</li>
</ul>


Categorical variables often have a few dominant categories while the remaining labels add mostly noise, so this simple approach can be useful on many occasions.
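
In practice the list of top labels is usually learned once and then reused on new observations. The sketch below illustrates that idea on toy data; the helper names `fit_top_labels` and `encode_top_labels` and the two small data frames are hypothetical, not part of the notebook above.

```python
import numpy as np
import pandas as pd

def fit_top_labels(series, k=10):
    """Return the k most frequent labels of a column."""
    return series.value_counts().head(k).index.tolist()

def encode_top_labels(df, variable, top_labels):
    """Return a copy of df with one binary column per label in top_labels."""
    out = df.copy()
    for label in top_labels:
        out[f"{variable}_{label}"] = np.where(out[variable] == label, 1, 0)
    return out

# Hypothetical toy data standing in for a training set and new observations
train = pd.DataFrame({"X2": ["as", "as", "as", "ae", "ae", "m"]})
new = pd.DataFrame({"X2": ["ae", "zz", "as"]})

top_2 = fit_top_labels(train["X2"], k=2)
new_encoded = encode_top_labels(new, "X2", top_2)
print(new_encoded)
```

Labels that are not in the learned list, such as the unseen `zz` here, get 0 in every dummy column, the same treatment as the infrequent labels.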

## Ordinal Number Encoding

<b>Ordinal categorical variables</b><br/>
Ordinal data is a categorical data type whose values have a natural order, but the distances between the categories are not known.

For example:
1. Student's grade in an exam (A,B,C or Fail).
2. Educational level, with the categories: Elementary school, High School, College graduate, PhD ranked from 1 to 4.

When the categorical variables are ordinal, the most straightforward approach is to replace the labels with ordinal numbers based on their rank.
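
A minimal sketch of this idea, using the educational levels from the example above. The data frame `df_edu` and the `education_rank` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data using the education levels listed above
df_edu = pd.DataFrame(
    {"education": ["High School", "PhD", "Elementary school",
                   "College graduate", "High School"]}
)

# Explicit rank for each label, from lowest to highest
education_rank = {
    "Elementary school": 1,
    "High School": 2,
    "College graduate": 3,
    "PhD": 4,
}

df_edu["education_ordinal"] = df_edu["education"].map(education_rank)
print(df_edu)
```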

In [15]:
import datetime

In [17]:
today_date = datetime.datetime.today()
df_date_list = [today_date - datetime.timedelta(x) for x in range(0,15)]
df = pd.DataFrame(df_date_list, columns=['day'])
df

Unnamed: 0,day
0,2022-01-14 09:56:22.761046
1,2022-01-13 09:56:22.761046
2,2022-01-12 09:56:22.761046
3,2022-01-11 09:56:22.761046
4,2022-01-10 09:56:22.761046
5,2022-01-09 09:56:22.761046
6,2022-01-08 09:56:22.761046
7,2022-01-07 09:56:22.761046
8,2022-01-06 09:56:22.761046
9,2022-01-05 09:56:22.761046


In [18]:
df['day_of_week']=df['day'].dt.day_of_week

In [19]:
df

Unnamed: 0,day,day_of_week
0,2022-01-14 09:56:22.761046,4
1,2022-01-13 09:56:22.761046,3
2,2022-01-12 09:56:22.761046,2
3,2022-01-11 09:56:22.761046,1
4,2022-01-10 09:56:22.761046,0
5,2022-01-09 09:56:22.761046,6
6,2022-01-08 09:56:22.761046,5
7,2022-01-07 09:56:22.761046,4
8,2022-01-06 09:56:22.761046,3
9,2022-01-05 09:56:22.761046,2


In [20]:
weekday_map = {
    # pandas dt.day_of_week counts from Monday=0 to Sunday=6
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}

In [21]:
df['weekday'] = df.day_of_week.map(weekday_map)

In [22]:
df

Unnamed: 0,day,day_of_week,weekday
0,2022-01-14 09:56:22.761046,4,Friday
1,2022-01-13 09:56:22.761046,3,Thursday
2,2022-01-12 09:56:22.761046,2,Wednesday
3,2022-01-11 09:56:22.761046,1,Tuesday
4,2022-01-10 09:56:22.761046,0,Monday
5,2022-01-09 09:56:22.761046,6,Sunday
6,2022-01-08 09:56:22.761046,5,Saturday
7,2022-01-07 09:56:22.761046,4,Friday
8,2022-01-06 09:56:22.761046,3,Thursday
9,2022-01-05 09:56:22.761046,2,Wednesday


## Count or Frequency Encoding

In [23]:
path = os.path.join(os.getcwd(), "Datasets", "mercedes.csv")
df = pd.read_csv(path, usecols=['X1','X2'])

In [24]:
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [25]:
len(df['X1'].unique())

27

In [26]:
len(df['X2'].unique())

44

In [27]:
for feature in df.columns: 
    print(f'{feature} has {len(df[feature].unique())} labels')

X1 has 27 labels
X2 has 44 labels


In [28]:
df.shape

(4209, 2)

In [29]:
# One Hot Encoding
pd.get_dummies(df).shape

(4209, 71)

In [30]:
# we make a dictionary that maps each label to its count
df_frequency_map = df.X2.value_counts().to_dict()

In [31]:
df_frequency_map

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'ag': 19,
 'z': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'ap': 11,
 'y': 11,
 'x': 10,
 'aw': 8,
 'at': 6,
 'h': 6,
 'al': 5,
 'an': 5,
 'q': 5,
 'av': 4,
 'ah': 4,
 'p': 4,
 'au': 3,
 'am': 1,
 'j': 1,
 'af': 1,
 'l': 1,
 'aa': 1,
 'c': 1,
 'o': 1,
 'ar': 1}

In [32]:
df.X2 = df.X2.map(df_frequency_map)
df.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137


This approach has some advantages and disadvantages:

### Advantages
<ol>
    <li>It is very simple to implement</li>
    <li>Does not increase the feature dimensional space</li>
</ol>

### Disadvantages
<ul>
    <li>If some of the labels have the same count, they will be replaced with the same number, and the information that distinguishes them is lost.</li>
    <li>Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.</li>
</ul>
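
The first disadvantage is visible in the frequency map above, where `k` and `i` both occur 25 times. A small sketch on made-up data shows the collision, along with the relative-frequency variant of the same encoding:

```python
import pandas as pd

# Hypothetical toy column in which 'k' and 'i' occur equally often
s = pd.Series(["as", "as", "as", "k", "k", "i", "i"], name="X2")

count_map = s.value_counts().to_dict()
count_encoded = s.map(count_map)

# 'k' and 'i' both become 2, so they can no longer be told apart
print(count_encoded.tolist())

# Relative frequency carries the same information on a 0 to 1 scale
frequency_encoded = s.map(s.value_counts(normalize=True).to_dict())
print(frequency_encoded.round(3).tolist())
```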

### Another Example

In [33]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None,index_col=None) 
train_set

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [34]:
train_set[1].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [35]:
len(train_set[1].unique())

9

In [36]:
columns = [1,3,5,6,7,8,9,13]
train_set = train_set[columns]
train_set.columns=['Employement','Degree','Marital Status','Designation','family_job','Race','Sex','Country']

In [37]:
train_set.head()

Unnamed: 0,Employement,Degree,Marital Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba


In [38]:
for feature in train_set.columns:
    print(f'{feature} : {len(train_set[feature].unique())} labels')

Employement : 9 labels
Degree : 16 labels
Marital Status : 7 labels
Designation : 15 labels
family_job : 6 labels
Race : 5 labels
Sex : 2 labels
Country : 42 labels


In [39]:
country_map = train_set['Country'].value_counts().to_dict()
country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Cambodia': 19,
 ' Trinadad&Tobago': 19,
 ' Laos': 18,
 ' Thailand': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Honduras': 13,
 ' Hungary': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [40]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the sliced frame
train_set = train_set.assign(Country=train_set['Country'].map(country_map))


In [41]:
train_set

Unnamed: 0,Employement,Degree,Marital Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
...,...,...,...,...,...,...,...,...
32556,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,29170
32557,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,29170
32558,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,29170
32559,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,29170


### Target Guided Ordinal Encoding

1. Order the labels according to the mean of the target for each label
2. Replace each label with its rank in that ordering (the Mean Encoding section that follows replaces it with the mean itself)
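
A minimal sketch of target guided ordinal encoding on toy data: the labels are ordered by the mean of the target and then replaced by their position in that ordering. The data frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target
toy = pd.DataFrame({
    "cabin": ["A", "A", "B", "B", "B", "C", "C"],
    "survived": [0, 0, 1, 1, 0, 1, 1],
})

# Order the labels by the mean of the target (lowest first)
target_mean = toy.groupby("cabin")["survived"].mean().sort_values()

# Replace each label by its position in that ordering
rank_map = {label: rank for rank, label in enumerate(target_mean.index)}
toy["cabin_ordinal"] = toy["cabin"].map(rank_map)

print(rank_map)
print(toy)
```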

In [42]:
path = os.path.join(os.getcwd(),"Datasets\\titanic.csv")
df = pd.read_csv(path, usecols=['Survived','Cabin'])
df


Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,
...,...,...
886,0,
887,1,B42
888,0,
889,1,C148


In [43]:
df.isnull().sum()

Survived      0
Cabin       687
dtype: int64

In [44]:
df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [45]:
df['Cabin'] = df['Cabin'].astype(str).str[0]

In [46]:
df

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M
...,...,...
886,0,M
887,1,B
888,0,M
889,1,C


In [47]:
df.Cabin.value_counts()

M    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Cabin, dtype: int64

In [48]:
df.groupby('Cabin')['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [49]:
df.groupby('Cabin')['Survived'].mean().sort_values().index

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [50]:
ordinal_labels = df.groupby('Cabin')['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [51]:
list(enumerate(ordinal_labels))

[(0, 'T'),
 (1, 'M'),
 (2, 'A'),
 (3, 'G'),
 (4, 'C'),
 (5, 'F'),
 (6, 'B'),
 (7, 'E'),
 (8, 'D')]

In [52]:
ordinal_labels2 = {k:i for i,k in enumerate(ordinal_labels)}
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [53]:
df['Cabin_ordinal_labels'] = df['Cabin'].map(ordinal_labels2)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1


### Mean Encoding

In [54]:
mean_ordinal = df.groupby('Cabin')['Survived'].mean().to_dict()
mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [55]:
df['Cabin_mean_ordinal'] = df['Cabin'].map(mean_ordinal)

In [56]:
df

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels,Cabin_mean_ordinal
0,0,M,1,0.299854
1,1,C,4,0.593220
2,1,M,1,0.299854
3,1,C,4,0.593220
4,0,M,1,0.299854
...,...,...,...,...
886,0,M,1,0.299854
887,1,B,6,0.744681
888,0,M,1,0.299854
889,1,C,4,0.593220


## Probability Ratio Encoding

In [57]:
path = os.path.join(os.getcwd(),"Datasets\\titanic.csv")
df = pd.read_csv(path, usecols=['Survived','Cabin'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [58]:
df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [59]:
df['Cabin'] = df['Cabin'].str[0]
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [60]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [61]:
prob_df = df.groupby('Cabin')['Survived'].mean()
prob_df = pd.DataFrame(prob_df)

In [62]:
prob_df

Cabin,Survived
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [63]:
prob_df['Died'] = 1 - prob_df['Survived']

In [64]:
prob_df

Cabin,Survived,Died
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25
F,0.615385,0.384615
G,0.5,0.5
M,0.299854,0.700146
T,0.0,1.0


In [65]:
prob_df['Probability Ratio'] = prob_df['Survived']/ prob_df['Died']
prob_df

Cabin,Survived,Died,Probability Ratio
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0
F,0.615385,0.384615,1.6
G,0.5,0.5,1.0
M,0.299854,0.700146,0.428274
T,0.0,1.0,0.0
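
The ratio is P(Survived = 1) / P(Survived = 0) per label, so a label whose observations all survived has a zero denominator. That does not happen in the table above, but the sketch below shows the effect on made-up data together with one possible guard. The small constant `eps` is an assumption for illustration, not part of the original method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: label 'B' has only positive outcomes
toy = pd.DataFrame({
    "cabin": ["A", "A", "A", "A", "B", "B"],
    "survived": [1, 0, 0, 0, 1, 1],
})

p_survived = toy.groupby("cabin")["survived"].mean()
p_died = 1 - p_survived

# Plain ratio: 'B' divides by zero and becomes inf
plain_ratio = p_survived / p_died

# One possible guard (an assumption): swap a zero denominator for a small constant
eps = 1e-6
guarded_ratio = p_survived / p_died.replace(0, eps)

print(plain_ratio)
print(guarded_ratio)
```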


In [66]:
probabilty_encoded = prob_df['Probability Ratio'].to_dict()
probabilty_encoded

{'A': 0.875,
 'B': 2.916666666666666,
 'C': 1.4583333333333333,
 'D': 3.125,
 'E': 3.0,
 'F': 1.6000000000000003,
 'G': 1.0,
 'M': 0.42827442827442824,
 'T': 0.0}

In [67]:
df['Cabin_encoded'] = df['Cabin'].map(probabilty_encoded)

In [68]:
df

Unnamed: 0,Survived,Cabin,Cabin_encoded
0,0,M,0.428274
1,1,C,1.458333
2,1,M,0.428274
3,1,C,1.458333
4,0,M,0.428274
...,...,...,...
886,0,M,0.428274
887,1,B,2.916667
888,0,M,0.428274
889,1,C,1.458333


# Handling Categorical Variables with sklearn

In [69]:
from sklearn import preprocessing

### LabelEncoder
- Encodes target labels with values between 0 and n_classes - 1. This transformer should be used to encode the target (dependent) variable, not the features (independent variables).

In [70]:
le = preprocessing.LabelEncoder()
le.fit([1,2,2,6])

LabelEncoder()

In [71]:
le.classes_

array([1, 2, 6])

In [72]:
le.transform([1,1,2,6])

array([0, 0, 1, 2], dtype=int64)

In [73]:
le.inverse_transform([0,0,1,2])

array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

In [74]:
le = preprocessing.LabelEncoder()
le.fit(['Barcelona','Real Madrid','Real Madrid','Real Madrid','Liverpool','Bayern Munich','Chelsea'])

LabelEncoder()

In [75]:
le.classes_

array(['Barcelona', 'Bayern Munich', 'Chelsea', 'Liverpool',
       'Real Madrid'], dtype='<U13')

In [76]:
le.transform(['Barcelona','Liverpool','Chelsea','Real Madrid'])

array([0, 3, 2, 4])

In [77]:
le.inverse_transform([0,3,2,4])

array(['Barcelona', 'Liverpool', 'Chelsea', 'Real Madrid'], dtype='<U13')

### OneHotEncoder
- This creates a binary column for each category and returns a sparse matrix or dense array
- By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
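
A short sketch of specifying the categories manually, on made-up data. The `sizes` list and both arrays are hypothetical; `categories` and `handle_unknown` are parameters of `OneHotEncoder`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category list, given explicitly instead of learned from the data
sizes = ["small", "medium", "large"]
enc = OneHotEncoder(categories=[sizes], handle_unknown="ignore")

X_train = np.array([["small"], ["large"], ["medium"]], dtype=object)
enc.fit(X_train)

# 'huge' was never declared, so handle_unknown='ignore' encodes it as all zeros
X_new = np.array([["medium"], ["huge"]], dtype=object)
encoded = enc.transform(X_new).toarray()

print(enc.categories_)
print(encoded)
```

The output columns follow the order of the supplied list rather than alphabetical order.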

In [78]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None,index_col=None) 

In [79]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       32561 non-null  int64 
 1   1       32561 non-null  object
 2   2       32561 non-null  int64 
 3   3       32561 non-null  object
 4   4       32561 non-null  int64 
 5   5       32561 non-null  object
 6   6       32561 non-null  object
 7   7       32561 non-null  object
 8   8       32561 non-null  object
 9   9       32561 non-null  object
 10  10      32561 non-null  int64 
 11  11      32561 non-null  int64 
 12  12      32561 non-null  int64 
 13  13      32561 non-null  object
 14  14      32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [80]:
train_set = train_set[[1,3,5,6,7,8,9,13,14]].copy()
train_set.head()

Unnamed: 0,1,3,5,6,7,8,9,13,14
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [81]:
train_set.nunique()

1      9
3     16
5      7
6     15
7      6
8      5
9      2
13    42
14     2
dtype: int64

In [82]:
train_set[8].value_counts()

 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: 8, dtype: int64

In [83]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore',drop='first')

In [84]:
enc.fit(train_set[[8]])

OneHotEncoder(drop='first', handle_unknown='ignore')

In [85]:
enc.categories_

[array([' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other',
        ' White'], dtype=object)]

In [86]:
enc.categories_[0][1:]

array([' Asian-Pac-Islander', ' Black', ' Other', ' White'], dtype=object)

In [87]:
pd.DataFrame(enc.transform(train_set[[8]]).toarray(),columns=enc.categories_[0][1:])

Unnamed: 0,Asian-Pac-Islander,Black,Other,White
0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,1.0
3,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0
...,...,...,...,...
32556,0.0,0.0,0.0,1.0
32557,0.0,0.0,0.0,1.0
32558,0.0,0.0,0.0,1.0
32559,0.0,0.0,0.0,1.0


### OrdinalEncoder
- Encode categorical features as an integer array.
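
By default the integers follow the sorted order of the labels, which may not match their natural rank. A minimal sketch, assuming a made-up `grade_order` list, of passing the order explicitly through the `categories` parameter:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered labels, from lowest to highest
grade_order = ["Fail", "C", "B", "A"]
enc = OrdinalEncoder(categories=[grade_order])

X = np.array([["B"], ["Fail"], ["A"], ["C"]], dtype=object)
encoded = enc.fit_transform(X)

# Each label is replaced by its position in grade_order
print(encoded.ravel())
```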

In [88]:
enc = preprocessing.OrdinalEncoder()
X = [['Male',1],['Female',3],['Female',2]]
enc.fit(X)

OrdinalEncoder()

In [89]:
enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

In [90]:
enc.transform([['Female',3],['Male',1]])

array([[0., 2.],
       [1., 0.]])

In [91]:
enc.inverse_transform([[0., 2.],
       [1., 0.]])

array([['Female', 3],
       ['Male', 1]], dtype=object)

In [92]:
train_set.columns =['Employment','Degree','Marital Status','Job','Family Status','Race','Sex','Country','Pay']

In [93]:
train_set.head()

Unnamed: 0,Employment,Degree,Marital Status,Job,Family Status,Race,Sex,Country,Pay
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [94]:
enc = preprocessing.OrdinalEncoder()
enc.fit(train_set[['Race']])

OrdinalEncoder()

In [95]:
enc.categories_

[array([' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other',
        ' White'], dtype=object)]

In [96]:
train_set['Race_encoded'] =enc.transform(train_set[['Race']])

In [97]:
train_set['Race_encoded'].value_counts()

4.0    27816
2.0     3124
1.0     1039
0.0      311
3.0      271
Name: Race_encoded, dtype: int64

# Handling Categorical Variables with category_encoders

In [98]:
import category_encoders as ce

### Binary Encoders
- Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

In [99]:
train_set['Race'].value_counts()

 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: Race, dtype: int64

In [100]:
encoder = ce.BinaryEncoder()

In [101]:
encoder.fit(train_set['Race'])

BinaryEncoder()

In [102]:
encoder.get_feature_names()

['Race_0', 'Race_1', 'Race_2']

In [103]:
encoder.transform(train_set['Race'])

Unnamed: 0,Race_0,Race_1,Race_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,1,0
4,0,1,0
...,...,...,...
32556,0,0,1
32557,0,0,1
32558,0,0,1
32559,0,0,1


In [104]:
train_set.head()

Unnamed: 0,Employment,Degree,Marital Status,Job,Family Status,Race,Sex,Country,Pay,Race_encoded
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K,4.0
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K,4.0
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K,4.0
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K,2.0
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K,2.0


In [105]:
for col in train_set.columns:
    print(f'{col}: {len(train_set[col].unique())} labels')

Employment: 9 labels
Degree: 16 labels
Marital Status: 7 labels
Job: 15 labels
Family Status: 6 labels
Race: 5 labels
Sex: 2 labels
Country: 42 labels
Pay: 2 labels
Race_encoded: 5 labels


In [106]:
encoder =  ce.BinaryEncoder()

In [107]:
encoder.fit(train_set['Country'])

BinaryEncoder()

In [108]:
encoder.get_feature_names()

['Country_0', 'Country_1', 'Country_2', 'Country_3', 'Country_4', 'Country_5']

In [109]:
encoder.transform(train_set['Country'])

Unnamed: 0,Country_0,Country_1,Country_2,Country_3,Country_4,Country_5
0,0,0,0,0,0,1
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,0,0,1,0
...,...,...,...,...,...,...
32556,0,0,0,0,0,1
32557,0,0,0,0,0,1
32558,0,0,0,0,0,1
32559,0,0,0,0,0,1


Here, the Country column has 42 unique labels. One hot encoding would have required 42 columns, but BinaryEncoder needs only 6, since 42 distinct values can be represented with 6 binary digits.
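
The idea can be sketched by hand with pandas. The function `binary_encode` below is a hypothetical helper, not the library's implementation: it assumes labels are numbered from 1 in order of first appearance, which is consistent with the outputs above, and the internals of `ce.BinaryEncoder` may differ.

```python
import pandas as pd

def binary_encode(series):
    """Write each label's ordinal number as binary digits, one column per digit."""
    # Assumption: ordinals start at 1, in order of first appearance
    labels = pd.unique(series)
    ordinal = {label: i for i, label in enumerate(labels, start=1)}
    n_bits = len(labels).bit_length()
    codes = series.map(ordinal)
    cols = {}
    for bit in range(n_bits):
        shift = n_bits - 1 - bit  # most significant digit first
        cols[f"{series.name}_{bit}"] = (codes // (2 ** shift)) % 2
    return pd.DataFrame(cols, index=series.index)

# Hypothetical toy column with 5 distinct labels
race = pd.Series(
    ["White", "White", "Black", "Asian", "Other", "Amer-Indian"], name="Race"
)
encoded = binary_encode(race)
print(encoded)

# Number of binary digits needed for 42 labels numbered 1 to 42
print((42).bit_length())
```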

### CountEncoder
- replaces the categorical values with their counts

In [110]:
train_set['Race'].value_counts()

 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: Race, dtype: int64

In [111]:
encoder = ce.CountEncoder()

In [112]:
encoder.fit(train_set['Race'])

CountEncoder(cols=['Race'], combine_min_nan_groups=True)

In [113]:
train_set['Race_countEncoder'] = encoder.transform(train_set['Race'])

In [114]:
train_set[['Race','Race_countEncoder']]

Unnamed: 0,Race,Race_countEncoder
0,White,27816
1,White,27816
2,White,27816
3,Black,3124
4,Black,3124
...,...,...
32556,White,27816
32557,White,27816
32558,White,27816
32559,White,27816


### HashingEncoder

In [115]:
encoder = ce.HashingEncoder()

In [116]:
encoder.fit(train_set['Sex'])

HashingEncoder(cols=['Sex'], max_process=6, max_sample=5426)

In [117]:
encoder.transform(train_set['Sex'])

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
32556,1,0,0,0,0,0,0,0
32557,0,1,0,0,0,0,0,0
32558,1,0,0,0,0,0,0,0
32559,0,1,0,0,0,0,0,0
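
The general idea behind hashing encoders, often called the hashing trick, is to hash each label and use the result to pick one of a fixed number of columns. The sketch below is a hand-written illustration using MD5 and 8 columns. `hash_encode` is a hypothetical helper, and the library's own hashing details are assumed rather than reproduced, so the column a label lands in may differ from the output above.

```python
import hashlib
import pandas as pd

def hash_encode(series, n_components=8):
    """Map each label to one of n_components columns via an MD5 hash."""
    rows = []
    for value in series:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components
        row = [0] * n_components
        row[bucket] = 1
        rows.append(row)
    columns = [f"col_{i}" for i in range(n_components)]
    return pd.DataFrame(rows, columns=columns, index=series.index)

# Hypothetical toy column
sex = pd.Series([" Male", " Male", " Female", " Male"], name="Sex")
hashed = hash_encode(sex)
print(hashed)
```

The number of output columns is fixed in advance and does not depend on the number of labels, so distinct labels can collide in the same column when there are many of them.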
