## Categorical Encoding using Pandas

Pandas has its own built-in methods for categorical encoding. Whether you use sklearn, pandas or another library is largely a matter of preference, but since sklearn and pandas are both popular choices, we cover both for completeness.

In this video we'll look at:
1. Integer encoding
2. Ordinal encoding
3. One-hot encoding

## Load data

Note: We're using the same data as in 'Categorical Summary stats & visualisation'

In [1]:
import pandas as pd
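# Note: adult.data appears to have no header row, so this call uses the first record as column names
# (hence the 32560 rows reported further down). Pass header=None and names=... to keep that record.
df = pd.read_csv('adult.data')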

feature_pairs = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',  'target']
df.columns=feature_pairs
df[['age', 'sex']].head()
subset = df[['age', 'sex', 'education', 'occupation']].copy() ## Note: Only working with a subset of the features, so the changes are more visible

In [2]:
subset.head()

|   | age | sex | education | occupation |
|---|---|---|---|---|
| 0 | 50 | Male | Bachelors | Exec-managerial |
| 1 | 38 | Male | HS-grad | Handlers-cleaners |
| 2 | 53 | Male | 11th | Handlers-cleaners |
| 3 | 28 | Female | Bachelors | Prof-specialty |
| 4 | 37 | Female | Masters | Exec-managerial |
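
The values in this file carry a leading space (for example `' Male'`), which is why the lookups later in this notebook include it. As an alternative, here is a minimal sketch of reading a headerless, comma-plus-space file with explicit column names and `skipinitialspace=True`. It assumes the file has that layout, and it uses two made-up rows and a shortened column list rather than the real data. The rest of the notebook keeps the padded values, so this is not a drop-in replacement for the cell above.

```python
import io
import pandas as pd

# Two made-up rows in a comma-plus-space layout (not taken from adult.data).
sample = io.StringIO(
    "30, Private, 1000, Bachelors\n"
    "45, State-gov, 2000, HS-grad\n"
)
cols = ['age', 'workclass', 'fnlwgt', 'education']  # shortened list, for illustration only

# header=None keeps the first record as data; names= supplies the column labels;
# skipinitialspace=True strips the space that follows each comma.
demo = pd.read_csv(sample, header=None, names=cols, skipinitialspace=True)

print(len(demo))
print(demo['education'].unique())
```

With the padding stripped at load time, mappings such as `{'Male': 0}` would no longer need the leading space.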


## Integer Encoding

1. Using get_dummies

In [3]:
pd.get_dummies(subset, columns=['sex'], drop_first=True).head()

|   | age | education | occupation | sex_ Male |
|---|---|---|---|---|
| 0 | 50 | Bachelors | Exec-managerial | 1 |
| 1 | 38 | HS-grad | Handlers-cleaners | 1 |
| 2 | 53 | 11th | Handlers-cleaners | 1 |
| 3 | 28 | Bachelors | Prof-specialty | 0 |
| 4 | 37 | Masters | Exec-managerial | 0 |
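
In recent pandas releases (2.0 onwards) `get_dummies` returns boolean columns by default, so the same call may show `True`/`False` instead of `1`/`0`. A small sketch on a made-up two-row frame, passing `dtype=int` to get integer indicators:

```python
import pandas as pd

toy = pd.DataFrame({'age': [50, 28], 'sex': [' Male', ' Female']})  # made-up rows

# drop_first=True drops the first level in sorted order (' Female'),
# leaving a single indicator column for ' Male'.
encoded = pd.get_dummies(toy, columns=['sex'], drop_first=True, dtype=int)

print(encoded)
```

For a binary column this leaves one 0/1 column, which is why it works as an integer encoding here.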


2. Using replace

In [4]:
subset['sex'].unique()

array([' Male', ' Female'], dtype=object)

In [5]:
_ = subset['sex'].replace({" Male" : 0, " Female" : 1}) # returns a new Series; subset itself is unchanged

In [6]:
#subset['sex'].replace({" Male" : 0, " Female" : 1}, inplace=True) # modifies subset directly and returns None, so don't assign the result
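
`Series.map` is a closely related option. The sketch below, on a made-up Series, shows how the two differ when a value is missing from the mapping (the `' Unknown'` entry is hypothetical and not part of the dataset):

```python
import pandas as pd

sex = pd.Series([' Male', ' Female', ' Male', ' Unknown'])  # ' Unknown' is a made-up value
mapping = {' Male': 0, ' Female': 1}

mapped = sex.map(mapping)        # values missing from the mapping become NaN
replaced = sex.replace(mapping)  # values missing from the mapping are left as they are

print(mapped.tolist())
print(replaced.tolist())
```

To keep the result, assign it back, for example `subset['sex'] = subset['sex'].map(mapping)`.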

3. Using factorize

In [7]:
subset['sex'].factorize()

(array([0, 0, 0, ..., 1, 0, 1]), Index([' Male', ' Female'], dtype='object'))
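
`factorize` returns two things: the integer codes and the unique labels they index into. A short sketch on made-up values, showing that codes follow order of first appearance and can be mapped back to labels:

```python
import pandas as pd

sex = pd.Series([' Male', ' Male', ' Female', ' Male'])  # made-up values

codes, uniques = pd.factorize(sex)  # codes follow order of first appearance
decoded = uniques.take(codes)       # map the integer codes back to their labels

print(codes)
print(list(uniques))
print(list(decoded))
```

Keeping `uniques` around is what makes the encoding reversible.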

## Ordinal Encoding

In [8]:
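# The order of this list defines the ordinal ranking. Where a level sits (e.g. ' Prof-school') is a judgement call.
# Values not present in the list become NaN.
categories = pd.Categorical(
    subset['education'],
    categories=[' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th',
                ' HS-grad', ' Prof-school', ' Assoc-acdm', ' Assoc-voc',
                ' Some-college', ' Bachelors', ' Masters', ' Doctorate'],
    ordered=True)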

In [9]:
categories

[Bachelors, HS-grad, 11th, Bachelors, Masters, ..., Assoc-acdm, HS-grad, HS-grad, HS-grad, HS-grad]
Length: 32560
Categories (16, object): [Preschool < 1st-4th < 5th-6th < 7th-8th ... Some-college < Bachelors < Masters < Doctorate]

In [10]:
labels, unique = pd.factorize(categories, sort=True)

In [11]:
encoded = pd.DataFrame(index=df.index)
encoded['education'] = labels
encoded.head(7)

|   | education |
|---|---|
| 0 | 13 |
| 1 | 8 |
| 2 | 6 |
| 3 | 13 |
| 4 | 14 |
| 5 | 4 |
| 6 | 8 |
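
An alternative to calling `factorize` on the categorical is to read the position of each value in the ordered category list from the `.codes` attribute. A minimal sketch with hypothetical levels (`low`, `medium`, `high`) rather than the education column:

```python
import pandas as pd

order = ['low', 'medium', 'high']                        # hypothetical ordered levels
sizes = pd.Series(['medium', 'low', 'high', 'unknown'])  # 'unknown' is not in `order`

cat = pd.Categorical(sizes, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=sizes.index)  # position of each value in `order`

print(codes.tolist())
```

Values outside the declared categories get the code `-1`, which is worth checking for before feeding the column to a model.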


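## One-Hot Encoding
Also known as dummy encoding. With drop_first=True, the first level of the column (here the ' ?' placeholder, which sorts before ' Adm-clerical') is dropped, so k levels produce k-1 indicator columns.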

In [12]:
pd.get_dummies(subset, columns=['occupation'], drop_first=True).head()

|   | age | sex | education | occupation_ Adm-clerical | occupation_ Armed-Forces | occupation_ Craft-repair | occupation_ Exec-managerial | occupation_ Farming-fishing | occupation_ Handlers-cleaners | occupation_ Machine-op-inspct | occupation_ Other-service | occupation_ Priv-house-serv | occupation_ Prof-specialty | occupation_ Protective-serv | occupation_ Sales | occupation_ Tech-support | occupation_ Transport-moving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Male | Bachelors | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 38 | Male | HS-grad | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 53 | Male | 11th | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 28 | Female | Bachelors | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 37 | Female | Masters | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
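
One practical wrinkle with `get_dummies` is that it only creates columns for the levels it sees, so two batches of data can end up with different columns. A sketch of one way to keep the columns stable, assuming you know the full level list up front: declare the column as a `Categorical` with fixed categories before encoding. The level names and rows below are hypothetical.

```python
import pandas as pd

levels = ['Sales', 'Tech-support', 'Other-service']  # hypothetical fixed level list
batch = pd.DataFrame({'occupation': ['Sales', 'Sales', 'Other-service']})  # no 'Tech-support' rows

# Fixing the categories makes get_dummies emit one column per declared level,
# including levels that do not occur in this batch.
batch['occupation'] = pd.Categorical(batch['occupation'], categories=levels)
dummies = pd.get_dummies(batch, columns=['occupation'], dtype=int)

print(list(dummies.columns))
print(dummies['occupation_Tech-support'].tolist())
```

The unobserved level still gets a column, filled with zeros, so the encoded frame keeps the same shape from batch to batch.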
