## <center> Encoding data for ML models </center>

In this notebook, we look at two common encoders used to convert categorical variables to numeric data that can be used for training and testing machine learning models. When it comes to putting an ML model into production, a suitable encoder should be selected carefully. We compare two transformers, OneHotEncoder and get_dummies, and assess their suitability for model productionization.

Before we move on to the code, recall that there is another transformer called "LabelEncoder". This transformer encodes the target/dependent variable (y, not the input X) with values between 0 and n_classes-1. It is outside the scope of this notebook.


In [35]:
# Import required packages

import pandas as pd
import numpy as np
from sklearn.compose       import ColumnTransformer
from sklearn.pipeline      import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute        import SimpleImputer


In [36]:
# Read CSV file into a dataframe
df=pd.read_csv('dataset/train.csv')

In [37]:
df.head(5)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [38]:
df=df.dropna()  

In [39]:
df=df.drop('Loan_ID', axis=1)

In [40]:
X=df.drop('Loan_Status', axis=1)

# Numerical columns

In [41]:
numerical_cols=X.select_dtypes(include=np.number).columns.tolist()

In [42]:
ss=StandardScaler()
ssf=ss.fit(X[numerical_cols])
ssf.mean_

array([5.36423125e+03, 1.58109358e+03, 1.44735417e+02, 3.42050000e+02,
       8.54166667e-01])

In [43]:
df[numerical_cols].describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,480.0,480.0,480.0,480.0,480.0
mean,5364.23125,1581.093583,144.735417,342.05,0.854167
std,5668.251251,2617.692267,80.508164,65.212401,0.353307
min,150.0,0.0,9.0,36.0,0.0
25%,2898.75,0.0,100.0,360.0,1.0
50%,3859.0,1084.5,128.0,360.0,1.0
75%,5852.5,2253.25,170.0,360.0,1.0
max,81000.0,33837.0,600.0,480.0,1.0
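The mean row above matches `ssf.mean_`, but the std row should not be expected to match the scaler's `scale_` exactly: `describe()` reports the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). The sketch below illustrates the difference on a toy column; the column name and values are hypothetical and not taken from the loan data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values, not taken from the loan data)
toy = pd.DataFrame({"income": [1000.0, 2000.0, 3000.0, 6000.0]})

scaler = StandardScaler().fit(toy)

sample_std = toy["income"].std()            # pandas default: ddof=1
population_std = toy["income"].std(ddof=0)  # the value StandardScaler stores in scale_

# z-scores computed by hand with the population standard deviation
manual_z = (toy["income"].to_numpy() - toy["income"].mean()) / population_std
scaled = scaler.transform(toy).ravel()

print(sample_std, population_std, scaler.scale_[0])
print(np.allclose(manual_z, scaled))
```

With 480 rows the two standard deviations differ only slightly, so the effect on the scaled values is small.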


In [44]:
import pickle
import joblib
joblib.dump(ssf, 'standerdscaler.pickle')

['standerdscaler.pickle']

In [45]:
ssf_rev=joblib.load('standerdscaler.pickle')

In [46]:
# Create a test example
d=dict(zip(numerical_cols,np.array([4000.0, 1500.0, 120, 360, 1.0]).reshape(-1,1)))

In [47]:
x=np.array([1,2,3,4]).reshape(-1,1)
x.shape

(4, 1)

In [48]:
dd=pd.DataFrame(d)
dd

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
0,4000.0,1500.0,120.0,360.0,1.0


In [49]:
ssf_rev.transform(dd)

array([[-0.24093049, -0.03101136, -0.30756164,  0.27554157,  0.41319694]])

In [50]:
ssf.transform(dd)  # Same result as the reloaded scaler, so the dump/load round trip preserved the fit

array([[-0.24093049, -0.03101136, -0.30756164,  0.27554157,  0.41319694]])
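The round trip above can also be checked programmatically rather than by eye. The following is a minimal sketch on toy data, assuming a temporary directory and a hypothetical file name (`scaler.joblib`); it is not the notebook's own file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix (hypothetical values)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
fitted = StandardScaler().fit(X_train)

# Dump the fitted scaler and load it back from a temporary location
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# Both objects should transform a new row identically
X_new = np.array([[2.5, 30.0]])
same = np.allclose(fitted.transform(X_new), restored.transform(X_new))
print(same)
```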

# Categorical columns


# get_dummies()

- Converts categorical variables into dummy/indicator variables.
- There is no need to extract the categorical features before the transformer is applied.
- It automatically picks the categorical features (columns of object or category dtype).
- It returns a data frame.
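A small sketch of the points above, using a toy frame with hypothetical values: the numeric column passes through untouched while the two text columns are expanded. Passing `dtype=int` is an assumption made here to get 0/1 output, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy frame with one numeric and two text columns (hypothetical values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5849, 4583, 3000],
})

# dtype=int requests 0/1 columns; recent pandas versions default to booleans
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```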

In [51]:
# The list of categorical predictors (features)
categorical_cols=X.select_dtypes(include='object').columns.tolist()

In [52]:
D=pd.get_dummies(X[categorical_cols], drop_first=False)
D.head(7)

Unnamed: 0,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_0,Dependents_1,Dependents_2,Dependents_3+,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0
2,0,1,0,1,1,0,0,0,1,0,0,1,0,0,1
3,0,1,0,1,1,0,0,0,0,1,1,0,0,0,1
4,0,1,1,0,1,0,0,0,1,0,1,0,0,0,1
5,0,1,0,1,0,0,1,0,1,0,0,1,0,0,1
6,0,1,0,1,1,0,0,0,0,1,1,0,0,0,1
7,0,1,0,1,0,0,0,1,1,0,1,0,0,1,0


The levels of Dependents are coded as follows:

0 ==>  1 0 0 0

1 ==>  0 1 0 0

2 ==>  0 0 1 0

3+ ==> 0 0 0 1

Each level therefore has a unique code.

#### Does dropping the first level in get_dummies affect the uniqueness of the codes?

In [53]:
D=pd.get_dummies(X[categorical_cols], drop_first=True)
D.head(7)

Unnamed: 0,Gender_Male,Married_Yes,Dependents_1,Dependents_2,Dependents_3+,Education_Not Graduate,Self_Employed_Yes,Property_Area_Semiurban,Property_Area_Urban
1,1,1,1,0,0,0,0,0,0
2,1,1,0,0,0,0,1,0,1
3,1,1,0,0,0,1,0,0,1
4,1,0,0,0,0,0,0,0,1
5,1,1,0,1,0,0,1,0,1
6,1,1,0,0,0,1,0,0,1
7,1,1,0,0,1,0,0,1,0


Note that dropping the first category does not affect the uniqueness of the codes representing the levels of the categorical predictor Dependents: level 0 is now represented by all zeros in the three remaining Dependents columns. This is great. Let us now check how the transformer handles a previously unseen example.

In [54]:
# Test example
d=dict(zip(categorical_cols,np.array(['Female', 'Yes', '3+', 'Graduate', 'Yes', 'Urban']).reshape(-1,1)))
dd=pd.DataFrame(d)

In [55]:
D=pd.get_dummies(dd)
D

Unnamed: 0,Gender_Female,Married_Yes,Dependents_3+,Education_Graduate,Self_Employed_Yes,Property_Area_Urban
0,1,1,1,1,1,1


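There is an obvious problem here, which needs to be tackled. get_dummies only creates columns for the levels present in the data it is given, so the test example produces six columns instead of the columns generated for the training data. For example, the first column now holds the value 1 for the level "Female", whereas in the training data (with drop_first=True) the first column was the indicator for the level "Male". Additional processing of the unseen data is therefore needed so that its encoding remains consistent with that of the training data.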
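One possible form of that additional processing, shown here as a sketch rather than the only approach, is to store the training column layout and reindex the dummies of the new data against it. The toy frames and the variable names (`train_columns`, `aligned`) are hypothetical.

```python
import pandas as pd

# Toy training data and one unseen row (hypothetical values)
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Dependents": ["0", "1", "3+"],
})
new_row = pd.DataFrame({"Gender": ["Female"], "Dependents": ["3+"]})

train_dummies = pd.get_dummies(train, drop_first=True, dtype=int)
train_columns = train_dummies.columns  # would have to be saved alongside the model

# Encode without dropping, then force the training layout:
# columns missing from the new data are filled with 0, extra columns are discarded
new_dummies = pd.get_dummies(new_row, dtype=int)
aligned = new_dummies.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The drawback is that the column list becomes an extra artifact to persist and keep in sync with the model.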

##  OneHotEncoder

Unlike get_dummies, OneHotEncoder encodes every column it receives, so we have to select the categorical features before the transformer can be applied.

In [45]:
# Drop the first level (category) of each feature.
# Note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`.
OHE=OneHotEncoder(sparse=False, drop='first')
OHE_f=OHE.fit(X[categorical_cols])
OHE_f.categories_

[array(['Female', 'Male'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['0', '1', '2', '3+'], dtype=object),
 array(['Graduate', 'Not Graduate'], dtype=object),
 array(['No', 'Yes'], dtype=object),
 array(['Rural', 'Semiurban', 'Urban'], dtype=object)]

In [46]:
Xf=OHE_f.transform(X[categorical_cols])

In [47]:
X.shape, Xf.shape

((480, 11), (480, 9))

In [24]:
X.head(7)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban
5,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban
6,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban
7,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban


The data frame shows that the categorical feature Dependents has four levels [0, 1, 2, 3+]. Let us check how this feature is encoded for examples 1, 2, 5 and 7. With drop='first', the first level (0) is dropped and each of the remaining three levels gets its own column:

if dependents=1 ==>  code=1 0 0

if dependents=0 ==>  code=0 0 0

if dependents=2 ==>  code=0 1 0

if dependents=3+ ==> code=0 0 1

Each level still has a unique code: the dropped level 0 is represented by all zeros, and no other level shares that code. These codes are the third to fifth columns of the array printed next.



In [25]:
Xf[[0,1,4, 6]]

array([[1., 1., 1., 0., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0., 1., 0., 1.],
       [1., 1., 0., 1., 0., 0., 1., 0., 1.],
       [1., 1., 0., 0., 1., 0., 0., 1., 0.]])
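With drop='first', a feature with k levels is encoded in k-1 columns and the dropped level becomes the all-zeros row. The sketch below checks this on the four Dependents levels in isolation; calling `.toarray()` on the default sparse result is a choice made here to avoid the `sparse` / `sparse_output` keyword difference between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One feature holding the four Dependents levels
levels = np.array([["0"], ["1"], ["2"], ["3+"]], dtype=object)

# Fit with the first level dropped, then densify the sparse output
encoder = OneHotEncoder(drop="first").fit(levels)
codes = encoder.transform(levels).toarray()
print(codes)

# Count distinct rows to confirm that no two levels share a code
n_unique_rows = len(np.unique(codes, axis=0))
print(n_unique_rows)
```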

In [57]:
OHE=OneHotEncoder(sparse=False, drop=None) # Keep all levels (no category is dropped)
OHE_f=OHE.fit(X[categorical_cols])
Xf=OHE_f.transform(X[categorical_cols])
Xf[[0,1,4,6]]

array([[0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0.]])

Now each level of Dependents is encoded with four binary values, one per level. Keeping all levels increases the number of encoded features from 9 to 15.

0 ==>  1 0 0 0

1 ==>  0 1 0 0

2 ==>  0 0 1 0

3+ ==> 0 0 0 1

In [27]:
import pickle
import joblib
joblib.dump(OHE_f, 'OHE_f.pickle')

['OHE_f.pickle']

In [28]:
test_example=pd.Series({'Gender':'Male',
 'Married':'No',
 'Dependents':'3+',
 'Education':'Graduate',
 'Self_Employed':'No',
 'Property_Area': 'Rural'})

In [29]:
test_example

Gender               Male
Married                No
Dependents             3+
Education        Graduate
Self_Employed          No
Property_Area       Rural
dtype: object

In [30]:
OHE_f=joblib.load('OHE_f.pickle')

In [31]:
test_example.values.reshape(1,-1).shape

(1, 6)

In [32]:
OHE_f.transform(test_example.values.reshape(1,-1))

array([[0., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.]])

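The previously unseen test example is correctly encoded, with the same column layout as the training data:

Gender = Male ==> 0 1

Married = No ==> 1 0

Dependents = 3+ ==> 0 0 0 1

Education = Graduate ==> 1 0

Self_Employed = No ==> 1 0

Property_Area = Rural ==> 1 0 0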
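Reading the encoded array by position is error-prone. As a sketch, column labels can be rebuilt from the fitted `categories_` attribute and zipped with the encoded row; the two-feature toy data and the `feature_names` list below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training data with two categorical features (hypothetical values)
feature_names = ["Gender", "Dependents"]
X_train = np.array(
    [["Male", "0"], ["Female", "1"], ["Male", "2"], ["Female", "3+"]],
    dtype=object,
)
encoder = OneHotEncoder().fit(X_train)

# One label per output column, built from the fitted categories
labels = [
    f"{name}_{level}"
    for name, levels in zip(feature_names, encoder.categories_)
    for level in levels
]

# Encode a single row and pair each value with its column label
row = encoder.transform(np.array([["Male", "3+"]], dtype=object)).toarray()[0]
encoded = dict(zip(labels, row.astype(int).tolist()))
print(encoded)
```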

## Conclusion

In this notebook, we have looked at two encoders, get_dummies and OneHotEncoder, which are used to convert categorical variables to numeric data. Although the former maintains the uniqueness of the codes on the training data, additional processing is needed when dealing with unseen data to ensure consistency with the encoding of the training data.

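On the other hand, OneHotEncoder is a straightforward encoder that learns the categories when it is fitted and therefore guarantees that unseen data is encoded consistently with the training data. The outputs above show that this holds whether or not the first level is dropped: with drop='first' the dropped level is encoded as all zeros, which remains unique, while drop=None gives every level its own column.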