## Handle Categorical Features

### 1. One Hot Encoding

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv('titanic.csv',usecols=['Sex'])

In [None]:
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [None]:
pd.get_dummies(df).head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


In [None]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


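* **drop_first=True** drops the first category of each feature, leaving k-1 dummy columns for k categories. This avoids the dummy variable trap, where the full set of dummy columns is perfectly collinear because the columns always sum to 1.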
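
To make the dummy variable trap concrete, here is a minimal sketch on made-up data (the `toy` frame is a hypothetical stand-in for the `Sex` column, not the Titanic file). It passes `dtype=int` so the output is 0/1 as shown in this notebook; recent pandas versions may otherwise return booleans.

```python
import pandas as pd

# Toy data standing in for the 'Sex' column (hypothetical values).
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# dtype=int keeps the 0/1 display used in this notebook.
full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# With all k columns, every row sums to 1, so the columns are
# perfectly collinear with a model's intercept term.
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# With drop_first=True only k-1 columns remain; the dropped level
# ('female', first in sorted order) is represented by all zeros.
print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row with `Sex_male == 0` can only be `female`, so no information is lost by dropping the first column.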

In [None]:
df=pd.read_csv('titanic.csv',usecols=['Embarked'])

In [None]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [None]:
df.dropna(inplace=True)
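
Dropping the rows is one option. As an alternative sketch (not what this notebook does), `pd.get_dummies` accepts `dummy_na=True`, which keeps the rows and adds an indicator column for missing values. The `toy` frame below is hypothetical data that mimics the values of `Embarked`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, mimicking 'Embarked'.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", np.nan, "S"]})

# dummy_na=True adds one extra indicator column for missing values
# instead of requiring the rows to be dropped first.
with_na = pd.get_dummies(toy, dummy_na=True, dtype=int)
print(with_na)
```

Whether a separate "missing" indicator is preferable to dropping rows depends on how many values are missing and on the model.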

In [None]:
pd.get_dummies(df).head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [None]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


#### One-Hot Encoding with many categories in a feature

In [None]:
df=pd.read_csv('mercedes.csv',usecols=["X0","X1","X2","X3","X4","X5","X6"])

In [None]:
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [None]:
df.columns[0]

'X0'

In [None]:
for i in df.columns:
    print(len(df[i].unique()))

47
27
44
7
4
29
12
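
The counts above add up to 170, so plain one-hot encoding of these seven features would create 170 dummy columns. A small sketch of that calculation on made-up data (the `toy` frame is hypothetical, not the Mercedes file):

```python
import pandas as pd

# Hypothetical stand-in for two of the categorical columns.
toy = pd.DataFrame({
    "X0": ["k", "k", "az", "az"],
    "X1": ["v", "t", "w", "t"],
})

# Distinct values per column (nunique ignores NaN by default).
cardinality = toy.nunique()

# Full one-hot encoding creates one column per distinct value.
n_dummy_columns = int(cardinality.sum())
encoded = pd.get_dummies(toy, dtype=int)

print(cardinality.to_dict())
print(n_dummy_columns, encoded.shape[1])
```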


**KDD Cup Orange Challenge - Ensemble - Kaggle Competition**

* Here we keep only the 10 most frequent categories of the feature.

* One-hot encoding is applied only to those 10 categories.

* All remaining categories are encoded as 0 in every dummy column. (Sometimes they are dropped instead.)

In [None]:
df.X1.value_counts().head(10)  # value_counts() already sorts in descending order

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
Name: X1, dtype: int64

In [None]:
lst_10=df.X1.value_counts().head(10).index  # labels of the 10 most frequent categories
lst_10

Index(['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o'], dtype='object')

In [None]:
lst_10=list(lst_10)
lst_10

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

* For each of the 10 most frequent categories, a new column is created that is 1 where `X1` equals that category and 0 otherwise.

In [None]:
import numpy as np
# Add one 0/1 indicator column per top-10 category of X1.
for category in lst_10:
    df[category]=np.where(df['X1']==category,1,0)

In [None]:
# Include the original column so the encoding can be checked against it.
lst_10.append('X1')

In [None]:
df[lst_10]

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o,X1
0,0,0,0,0,1,0,0,0,0,0,v
1,0,0,0,0,0,0,0,0,0,0,t
2,0,0,0,0,0,0,0,0,0,0,w
3,0,0,0,0,0,0,0,0,0,0,t
4,0,0,0,0,1,0,0,0,0,0,v
...,...,...,...,...,...,...,...,...,...,...,...
4204,0,1,0,0,0,0,0,0,0,0,s
4205,0,0,0,0,0,0,0,0,0,1,o
4206,0,0,0,0,1,0,0,0,0,0,v
4207,0,0,0,0,0,1,0,0,0,0,r


* All the categories that are not among the 10 most frequent are encoded as 0 in every column, as in the rows with index 1, 2, and 3 (`X1` is `t` or `w`).
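
The same steps can be wrapped in a small helper so they can be reused for the other columns. This is only a sketch: `one_hot_top_k` is a hypothetical name, the `toy` frame is made-up data, and prefixing the new columns with the source column name is an assumption added here to avoid name clashes, not something the cells above do.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(df, column, k=10):
    """Return 0/1 indicator columns for the k most frequent values of `column`."""
    # value_counts() sorts by frequency, most frequent first.
    top = df[column].value_counts().head(k).index
    out = pd.DataFrame(index=df.index)
    for category in top:
        # Prefix with the column name so encodings of
        # different source columns cannot collide.
        out[f"{column}_{category}"] = np.where(df[column] == category, 1, 0)
    return out

# Hypothetical data: 'v' appears 3 times, 's' twice, 't' and 'w' once.
toy = pd.DataFrame({"X1": ["v", "t", "v", "s", "s", "v", "w"]})
encoded = one_hot_top_k(toy, "X1", k=2)
print(encoded)
```

Rows whose value is outside the top `k` (here `t` and `w`) come out as all zeros, matching the behaviour described above.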