### Handling Categorical Features

We will demonstrate two techniques:
    1) One Hot Encoding
    2) Probability Ratio Encoding
    
#### 1. One Hot Encoding

This is very easy to implement:
use the pandas function -->   pd.get_dummies(___)

Cons of One Hot Encoding: It creates many more features. For instance, if there are 100 unique Pincode values, it will create 100 features (99 when drop_first=True is used).
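To make the feature-count claim concrete, here is a minimal sketch on a made-up column. The `Pincode` values below are hypothetical and are not taken from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy column with 100 distinct pincode values (illustration only).
pincodes = pd.DataFrame({"Pincode": [f"P{i:03d}" for i in range(100)]})

full = pd.get_dummies(pincodes)                      # one indicator column per category
reduced = pd.get_dummies(pincodes, drop_first=True)  # first category's column dropped

print(full.shape[1], reduced.shape[1])  # 100 99
```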

In [1]:
import pandas as pd
df = pd.read_csv('titanic.csv', usecols = ['Sex'])
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [2]:
pd.get_dummies(df).head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1


In [3]:
pd.get_dummies(df, drop_first=True).head()

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1


In [4]:
df = pd.read_csv('titanic.csv', usecols = ['Embarked'])

In [5]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [6]:
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [7]:
pd.get_dummies(df, drop_first=False).head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


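From the 2nd and 3rd columns (Embarked_Q, Embarked_S) we can infer the 1st column (Embarked_C): when both are 0, the passenger embarked at C (assuming Embarked is not missing). So one column is redundant and can be dropped with drop_first=True.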
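A small sketch of this redundancy, using a toy `Embarked` column with no missing values (the toy rows are an assumption for illustration, not the Titanic data):

```python
import pandas as pd

# Toy values only; no NaN so that every row belongs to exactly one category.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(toy).astype(int)
reduced = pd.get_dummies(toy, drop_first=True).astype(int)

# The dropped column is 1 exactly where the remaining columns are all 0.
recovered_c = 1 - reduced["Embarked_Q"] - reduced["Embarked_S"]
print(recovered_c.tolist())         # [0, 1, 0, 0]
print(full["Embarked_C"].tolist())  # [0, 1, 0, 0]
```

Note that a row with a missing value is encoded as all zeros by default, so after drop_first=True it looks the same as the dropped category. If that matters, `pd.get_dummies` accepts `dummy_na=True` to add a separate indicator column for NaN.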

In [8]:
pd.get_dummies(df,drop_first=True).head()

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


#### One Hot Encoding with MANY categories in a feature

In [9]:
df = pd.read_csv('mercedez.csv', usecols = ['X0','X1','X2','X3','X4','X5','X6'])
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [10]:
# df['X0'].value_counts()
# df['X0'].unique()
# Output: array(['k', 'az', 't', 'al', 'o', 'w', 'j', 'h', 's', 'n', 'ay', 'f', 'x',
#       'y', 'aj', 'ak', 'am', 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e',
#       'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i', 'r', 'b', 'ax', 'bc',
#       'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab'],
#      dtype=object)
        
len(df['X0'].unique())

47

In [11]:
for i in df.columns:
    print(len(df[i].unique()))

47
27
44
7
4
29
12


Performing One Hot Encoding here would be tedious because it would create a lot of features (one per category).
In the "KDD Cup Orange challenge", they took the 10 most frequent categories of a feature and applied ONE HOT ENCODING only to those,
and they dropped the rest of the categories. AND, it worked well!!

In [12]:
df['X0'].value_counts()

z     360
ak    349
y     324
ay    313
t     306
x     300
o     269
f     227
n     195
w     182
j     181
az    175
aj    151
s     106
ap    103
h      75
d      73
al     67
v      36
af     35
m      34
ai     34
e      32
ba     27
at     25
a      21
ax     19
i      18
aq     18
am     18
u      17
l      16
aw     16
ad     14
au     11
k      11
b      11
as     10
r      10
bc      6
ao      4
c       3
aa      2
q       2
ab      1
g       1
ac      1
Name: X0, dtype: int64

We keep only the top 10 results, i.e. the most frequently occurring categories. For X1:

In [13]:
df.X1.value_counts().sort_values(ascending=False).head(10)

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
Name: X1, dtype: int64

### Let us see how to use it

In [14]:
lst_10 = df.X1.value_counts().sort_values(ascending=False).head(10).index ## Same code as above, with .index appended to get the category labels
lst_10 = list(lst_10)
lst_10

['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']

#### We will take these 10 categories and apply ONE HOT ENCODING

In [15]:
import numpy as np
for categories in lst_10:
    df[categories]= np.where(df['X1']==categories,1,0)

In [16]:
lst_10.append('X1')

In [17]:
df[lst_10]

## Each row has a 1 in the column that matches its X1 value
## t, w are not present in ['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']. So, those rows are all zeros; the original value is still visible under X1

Unnamed: 0,aa,s,b,l,v,r,i,a,c,o,X1
0,0,0,0,0,1,0,0,0,0,0,v
1,0,0,0,0,0,0,0,0,0,0,t
2,0,0,0,0,0,0,0,0,0,0,w
3,0,0,0,0,0,0,0,0,0,0,t
4,0,0,0,0,1,0,0,0,0,0,v
5,0,0,1,0,0,0,0,0,0,0,b
6,0,0,0,0,0,1,0,0,0,0,r
7,0,0,0,1,0,0,0,0,0,0,l
8,0,1,0,0,0,0,0,0,0,0,s
9,0,0,1,0,0,0,0,0,0,0,b
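The loop above can be wrapped in a small reusable helper. This is only a sketch: the function name `one_hot_top_k` and the prefixed column names are my own choices, not part of the notebook or of pandas.

```python
import pandas as pd

def one_hot_top_k(frame, column, k=10):
    """Return 0/1 indicator columns for the k most frequent values of `column`.

    Hypothetical helper: rows whose value is outside the top k get all zeros.
    """
    top = frame[column].value_counts().head(k).index  # value_counts sorts by frequency
    indicators = {
        f"{column}_{cat}": (frame[column] == cat).astype(int) for cat in top
    }
    return pd.DataFrame(indicators, index=frame.index)

# Toy data for illustration only.
toy = pd.DataFrame({"X1": ["v", "t", "v", "w", "v", "t"]})
encoded = one_hot_top_k(toy, "X1", k=2)
print(encoded.columns.tolist())  # ['X1_v', 'X1_t']
```

Prefixing the new columns with the feature name avoids clashes when a category label such as 'a' or 'c' is also the name of another column in the DataFrame.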


#### 2. Probability Ratio Encoding

Steps performed:

    - Compute the probability of Survived for each Cabin value (the categorical feature)
    - Compute the probability of Not Survived: 1 - pr(Survived)
    - Compute the ratio pr(Survived)/pr(Not Survived)
    - Build a dictionary that maps each cabin to its probability ratio
    - Replace the categorical feature with the mapped values
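Written as a formula, the steps above amount to the following, where $c$ is a cabin category (the notation is mine, not from a library):

$$
\text{ratio}(c) = \frac{P(\text{Survived}=1 \mid \text{Cabin}=c)}{1 - P(\text{Survived}=1 \mid \text{Cabin}=c)}
$$

For example, a category with a survival probability of 0.75 gets the value 0.75 / 0.25 = 3.0.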

In [26]:
import pandas as pd

In [27]:
df = pd.read_csv('titanic.csv', usecols = ['Survived','Cabin'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [28]:
# Replace NaN with 'Missing'; assign the result back instead of calling fillna(inplace=True) on a single column
df['Cabin'] = df['Cabin'].fillna('Missing')
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [29]:
df['Cabin'].unique()

array(['Missing', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62

In [30]:
df['Cabin']=df['Cabin'].astype(str).str[0] # Keep only the first letter of each cabin value
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [31]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [33]:
df.groupby(['Cabin'])['Survived'].mean() # Mean of Survived per Cabin, i.e. the probability of survival for each cabin letter

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [34]:
prob_df = df.groupby(['Cabin'])['Survived'].mean()

In [36]:
# Converting it to dataframe
prob_df = pd.DataFrame(prob_df)
prob_df

Unnamed: 0_level_0,Survived
Cabin,Unnamed: 1_level_1
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [39]:
prob_df['Died'] = 1- prob_df['Survived']
prob_df.head()

Unnamed: 0_level_0,Survived,Died
Cabin,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25


For each cabin, Survived and Died always sum to 1.

In [42]:
# Let us find the ratio

prob_df['Probability Ratio'] = prob_df['Survived']/prob_df['Died']
prob_df.head()

Unnamed: 0_level_0,Survived,Died,Probability Ratio
Cabin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0


In [50]:
# Converting it to dictionary using .to_dict() function

probability_encoded = prob_df['Probability Ratio'].to_dict()
probability_encoded # Display Dictionary

{'A': 0.875,
 'B': 2.916666666666666,
 'C': 1.4583333333333333,
 'D': 3.125,
 'E': 3.0,
 'F': 1.6000000000000003,
 'G': 1.0,
 'M': 0.42827442827442824,
 'T': 0.0}

In [51]:
# Map each cabin letter to its probability ratio
df['Cabin_encoded']=df['Cabin'].map(probability_encoded)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_encoded
0,0,M,0.428274
1,1,C,1.458333
2,1,M,0.428274
3,1,C,1.458333
4,0,M,0.428274
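The whole procedure can be collected into one function. The sketch below is an illustration under assumptions: the name `probability_ratio_map` is hypothetical, and replacing infinite ratios with NaN is just one possible choice for categories where every row has target 1 (where the division is by zero).

```python
import numpy as np
import pandas as pd

def probability_ratio_map(frame, feature, target):
    """Map each category of `feature` to P(target=1) / P(target=0).

    Hypothetical helper; expects a 0/1 target column.
    """
    p = frame.groupby(feature)[target].mean()  # P(target=1) per category
    ratio = p / (1 - p)                        # inf when p == 1
    # Assumption: treat an undefined ratio as missing rather than infinite.
    return ratio.replace(np.inf, np.nan).to_dict()

# Toy data for illustration only.
toy = pd.DataFrame({
    "Cabin":    ["A", "A", "A", "A", "B", "B", "T"],
    "Survived": [1,   1,   1,   0,   1,   1,   0],
})
mapping = probability_ratio_map(toy, "Cabin", "Survived")
toy["Cabin_encoded"] = toy["Cabin"].map(mapping)
print(mapping["A"], mapping["T"])  # 3.0 0.0
```

Because the encoding is computed from the target, it is usually safer to build the mapping on the training split only and then apply it to validation or test data.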
