## Encoding Categorical features

In this video we'll look at:
1. Ordinal / Label Encoding
2. One-hot Encoding
3. Target (Likelihood) Encoding
4. Binary Encoding

## Load data
Note: We're using the same data as in 'Categorical Summary stats & visualisation'.

In [163]:
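import pandas as pd

# adult.data has no header row, so this call consumes the first record as the header;
# pd.read_csv('adult.data', header=None, names=column_names) would keep that record.
df = pd.read_csv('adult.data')

# Column names for the Adult dataset, in file order
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']
df.columns = column_names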
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
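
The shape and first row shown in this notebook suggest that the headerless file loses one record to the header. A minimal sketch of reading it without that loss, using two rows copied from the output above as an inline sample instead of the real file. `skipinitialspace=True` is an optional extra (not used in the notebook) that strips the space after each comma:

```python
import io
import pandas as pd

# Two records in the same layout as adult.data (comma + space separated, no header line).
sample = io.StringIO(
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "38, Private, 215646, HS-grad, 9, Divorced, "
    "Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
)

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week',
                'native-country', 'target']

# header=None keeps the first line as data; names= supplies the column labels.
sample_df = pd.read_csv(sample, header=None, names=column_names,
                        skipinitialspace=True)

print(sample_df.shape)
print(sample_df['workclass'].tolist())
```

Note that the rest of this notebook relies on the leading space in each label (e.g. ' Bachelors'), so those cells would need adjusting if you adopt `skipinitialspace=True`.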


In [164]:
categorical_data = df.select_dtypes(include='object')
categorical_data.columns

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country', 'target'],
      dtype='object')

In [165]:
categorical_data['workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1297
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

In [166]:
categorical_data['workclass'].nunique()

9

In [167]:
categorical_data.nunique()

workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex                2
native-country    42
target             2
dtype: int64

In [168]:
df.shape

(32560, 15)

## Integer Encoding (sex)

* OrdinalEncoder in sklearn is used for encoding *both* nominal and ordinal features
* LabelEncoder in sklearn, despite its name, is intended *only* for use with the target variable

sex is a nominal categorical variable.

In [169]:
encoded = pd.DataFrame(index=df.index)

In [170]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
sex_enc = encoder.fit_transform(df[['sex']])
encoded['sex'] = sex_enc

In [171]:
encoded.head()

Unnamed: 0,sex
0,1.0
1,1.0
2,1.0
3,0.0
4,0.0
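
To see which label each integer stands for, the fitted encoder can be inspected. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Adult data) and assumes the default behaviour of sorting categories alphabetically:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with the same leading-space labels as the notebook
toy = pd.DataFrame({'sex': [' Male', ' Female', ' Female', ' Male']})

toy_encoder = OrdinalEncoder()
codes = toy_encoder.fit_transform(toy[['sex']])  # expects a 2-D input

# categories_ holds one array per input column;
# a label's position in that array is its integer code.
print(toy_encoder.categories_[0])
print(codes.ravel())

# inverse_transform maps the codes back to the original labels.
print(toy_encoder.inverse_transform(codes).ravel())
```

With alphabetical sorting, ' Female' maps to 0 and ' Male' to 1, which matches the values in the table above.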


### Label Encoding (target)

In [172]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
target = encoder.fit_transform(df['target'])
encoded['target'] = target

In [173]:
encoded.tail()

Unnamed: 0,sex,target
32555,0.0,0
32556,1.0,1
32557,0.0,0
32558,1.0,0
32559,0.0,1
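
One practical difference from OrdinalEncoder is the input shape: LabelEncoder takes a 1-D array of labels and returns integer codes. A small sketch on made-up labels (`toy_target` is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels in the same style as the notebook's target column
toy_target = pd.Series([' <=50K', ' >50K', ' <=50K'])

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(toy_target)  # 1-D in, 1-D integer array out

# classes_ lists the labels in code order (index 0 -> code 0, and so on).
print(label_encoder.classes_)
print(y)

# Recover the original labels from the codes.
print(label_encoder.inverse_transform(y))
```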


## Ordinal Encoding (Education)

Education is a categorical ordinal variable.

*Note: Education is already ordinal-encoded under the column 'education-num', so this isn't strictly necessary!*

In [194]:
df['education'].unique()

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

In [175]:
# Categories listed from lowest to highest; each label's position in the list becomes its code.
education_order = [
    ' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th',
    ' HS-grad', ' Prof-school', ' Assoc-acdm', ' Assoc-voc',
    ' Some-college', ' Bachelors', ' Masters', ' Doctorate',
]
encoder = OrdinalEncoder(categories=[education_order])
education_enc = encoder.fit_transform(df[['education']])
encoded['education'] = education_enc

In [176]:
encoded.tail()

Unnamed: 0,sex,target,education
32555,0.0,0,10.0
32556,1.0,1,8.0
32557,0.0,0,8.0
32558,1.0,0,8.0
32559,0.0,1,8.0
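
When the category order is written by hand, it can help to decide what should happen to labels that are not in the list. A minimal sketch, assuming scikit-learn 0.24 or later (for `handle_unknown='use_encoded_value'`) and a deliberately shortened, hypothetical ordering:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'education': [' HS-grad', ' Masters', ' 9th', ' Bachelors']})

# Hypothetical shortened ordering, lowest to highest, for illustration only.
order = [' 9th', ' HS-grad', ' Bachelors', ' Masters']

edu_encoder = OrdinalEncoder(
    categories=[order],
    handle_unknown='use_encoded_value',  # do not raise on unseen labels...
    unknown_value=-1,                    # ...encode them as -1 instead
)
codes = edu_encoder.fit_transform(toy[['education']])
print(codes.ravel())  # position of each label in `order`

# A label that was not in `order` gets the sentinel value.
unseen = pd.DataFrame({'education': [' Doctorate']})
print(edu_encoder.transform(unseen).ravel())
```

Without `handle_unknown`, the default is to raise an error on an unseen label, which is often the safer choice while the category list is still being written.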


## One-hot encoding 
Race, Relationship and Marital Status have modest cardinality (5 - 7 unique values each).
One-hot encoding these features is probably the best initial approach.

In [177]:
df['race'].nunique()

5

In [178]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first').fit(df[['race']])
race_encoded  = encoder.transform(df[['race']])
race_encoded

<32560x4 sparse matrix of type '<class 'numpy.float64'>'
	with 32249 stored elements in Compressed Sparse Row format>

In [179]:
race_encoded.toarray()

array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])

In [180]:
race_encoded[:, 0].toarray()

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

*Note: toarray() is used here just so that we can see the values. In general there's no need to convert these from a sparse matrix.*

In [181]:
# Add each one-hot column to the encoded frame as a flat 1-D array
for i in range(race_encoded.shape[1]):
    encoded[f'race_{i}'] = race_encoded[:, i].toarray().ravel()

In [182]:
encoded.tail()

Unnamed: 0,sex,target,education,race_0,race_1,race_2,race_3
32555,0.0,0,10.0,0.0,0.0,0.0,1.0
32556,1.0,1,8.0,0.0,0.0,0.0,1.0
32557,0.0,0,8.0,0.0,0.0,0.0,1.0
32558,1.0,0,8.0,0.0,0.0,0.0,1.0
32559,0.0,1,8.0,0.0,0.0,0.0,1.0


## Likelihood Encoding

Occupation and Native Country have higher cardinalities, so one-hot encoding them would add a lot of extra sparse columns. For some data and some models this might be appropriate, but it's a good opportunity to show you some more advanced encoding techniques, such as likelihood encoding (a.k.a. target encoding). For this we'll need an extension to sklearn called 'category_encoders'.

In [183]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'target'],
      dtype='object')

In [184]:
from category_encoders.target_encoder import TargetEncoder

# Smoothing balances each country's own target mean against the overall mean.
# This matters because the countries differ widely in frequency (e.g. most of the data is from the US).
# The parameter is numeric, so pass a float rather than a boolean.
encoder = TargetEncoder(smoothing=1.0)
country_enc = encoder.fit_transform(df['native-country'], encoded['target'])

In [185]:
df['native-country'].head()

0     United-States
1     United-States
2     United-States
3              Cuba
4     United-States
Name: native-country, dtype: object

In [186]:
encoded['native-country'] = country_enc
encoded.head()

Unnamed: 0,sex,target,education,race_0,race_1,race_2,race_3,native-country
0,1.0,0,13.0,0.0,0.0,0.0,1.0,0.245843
1,1.0,0,8.0,0.0,0.0,0.0,1.0,0.245843
2,1.0,0,6.0,0.0,1.0,0.0,0.0,0.245843
3,0.0,0,13.0,0.0,1.0,0.0,0.0,0.263158
4,0.0,0,14.0,0.0,0.0,0.0,1.0,0.245843
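
To make the idea behind target encoding concrete, here is a hand-rolled sketch in plain pandas. It uses simple additive smoothing with a hypothetical weight `m`; this is an illustrative simplification and not necessarily the weighting that category_encoders applies internally. The data is made up:

```python
import pandas as pd

# Hypothetical toy data: 'A' is common, 'C' appears only once
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'target':  [1,   0,   0,   0,   1,   1,   1],
})

m = 2.0                        # assumed smoothing weight (higher = pull harder toward the prior)
prior = toy['target'].mean()   # overall target rate

# Per-category target mean and row count
stats = toy.groupby('country')['target'].agg(['mean', 'count'])

# Weighted blend of the category mean and the prior:
# rare categories are pulled toward the prior more than common ones.
smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)

toy['country_te'] = toy['country'].map(smoothed)
print(smoothed.round(4))
```

Because the encoding is computed from the target, in a modelling workflow the statistics would normally be fitted on the training rows only and then applied to validation or test rows.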


## Binary Encoding

We'll show you 'binary encoding' on the 'Occupation' column.

In [187]:
from category_encoders.binary import BinaryEncoder
encoder = BinaryEncoder()
occ_enc = encoder.fit_transform(df['occupation'])
occ_enc.head()

Unnamed: 0,occupation_0,occupation_1,occupation_2,occupation_3,occupation_4
0,0,0,0,0,1
1,0,0,0,1,0
2,0,0,0,1,0
3,0,0,0,1,1
4,0,0,0,0,1
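
The pattern in the output above (first category seen becomes 1, the second becomes 2, and so on, each written in binary) can be sketched by hand. This is an illustration of the general technique on made-up data, assuming codes start at 1 in order of first appearance; the exact number of columns BinaryEncoder produces may differ by library version:

```python
import pandas as pd

# Hypothetical toy column with four distinct categories
toy = pd.Series(['Sales', 'Tech', 'Sales', 'Clerical', 'Farming', 'Tech'],
                name='occupation')

# Step 1: integer code per category, in order of first appearance, starting at 1
codes, uniques = pd.factorize(toy)
codes = codes + 1

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
bits = {
    f'occupation_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
binary = pd.DataFrame(bits, index=toy.index)
print(binary)
```

Four categories fit in three columns here, whereas one-hot encoding (without dropping a level) would need four; the saving grows with cardinality.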


In [188]:
all_encoded = encoded.join(occ_enc)
all_encoded

Unnamed: 0,sex,target,education,race_0,race_1,race_2,race_3,native-country,occupation_0,occupation_1,occupation_2,occupation_3,occupation_4
0,1.0,0,13.0,0.0,0.0,0.0,1.0,0.245843,0,0,0,0,1
1,1.0,0,8.0,0.0,0.0,0.0,1.0,0.245843,0,0,0,1,0
2,1.0,0,6.0,0.0,1.0,0.0,0.0,0.245843,0,0,0,1,0
3,0.0,0,13.0,0.0,1.0,0.0,0.0,0.263158,0,0,0,1,1
4,0.0,0,14.0,0.0,0.0,0.0,1.0,0.245843,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,0.0,0,10.0,0.0,0.0,0.0,1.0,0.245843,0,1,0,1,1
32556,1.0,1,8.0,0.0,0.0,0.0,1.0,0.245843,0,1,0,1,0
32557,0.0,0,8.0,0.0,0.0,0.0,1.0,0.245843,0,0,1,0,1
32558,1.0,0,8.0,0.0,0.0,0.0,1.0,0.245843,0,0,1,0,1
