# Feature Encoding Techniques

Most machine learning models require numerical input, so categorical features (variables) must be converted into numerical ones before a model is fitted. The methods used for this conversion are called encoding techniques.

Categorical features are generally divided into 3 types:

## Binary: Either/or
Examples:
Yes, No
True, False
## Ordinal: Specific ordered Groups.
Examples:
low, medium, high
cold, hot, lava hot
## Nominal: Unordered Groups.
Examples:
cat, dog, tiger
pizza, burger, coke
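
The difference between ordinal and nominal features can also be expressed in pandas. The following is a minimal sketch, assuming the example values listed above; the variable names are illustrative only.

```python
import pandas as pd

# Ordinal feature: the order of the categories is meaningful, so it is declared explicitly.
temperature = pd.Categorical(
    ["hot", "cold", "lava hot", "cold"],
    categories=["cold", "hot", "lava hot"],  # lowest to highest
    ordered=True,
)

# Nominal feature: no order is implied between the categories.
animal = pd.Categorical(["cat", "dog", "tiger", "dog"])

print(temperature.codes)    # integer positions that follow the declared order
print(temperature.ordered)  # True
print(animal.ordered)       # False
```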




In [2]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

In [3]:
# Creating a small sample dataset
Data = pd.DataFrame({
    'Id': ['0', '1', '2', '3', '4', '5'],
    'Col_1': ['M', 'F', 'F', 'M', 'F', 'M'],
    'Col_2': ['Y', 'N', 'Y', 'N', 'Y', 'N'],
    'Col_3': ['Cricket', 'Football', 'Cricket', 'Football', 'Football', 'Cricket'],
    'Col_4': ['Red', 'Blue', 'Blue', 'Red', 'Red', 'Blue'],
})
print(Data)

  Id Col_1 Col_2     Col_3 Col_4
0  0     M     Y   Cricket   Red
1  1     F     N  Football  Blue
2  2     F     Y   Cricket  Blue
3  3     M     N  Football   Red
4  4     F     Y  Football   Red
5  5     M     N   Cricket  Blue


# Label Encoding
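In this encoding, each distinct category is replaced by an integer from 0 to (number of categories - 1). Col_3 has only two categories, so here the codes are just 0 and 1. Note that scikit-learn's LabelEncoder is primarily intended for target labels, although it is often applied to single feature columns as well.
The implementation is as follows: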

In [15]:
from sklearn.preprocessing import LabelEncoder  
LE = LabelEncoder()
Data['Col_3'] = LE.fit_transform(Data['Col_3'])
Data.head()

  Id Col_1 Col_2  Col_3 Col_4
0  0     M     Y      0   Red
1  1     F     N      1  Blue
2  2     F     Y      0  Blue
3  3     M     N      1   Red
4  4     F     Y      1   Red
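
To see which integer belongs to which category, the fitted encoder can be inspected. This is a small sketch on hypothetical sample values (the `sports` list is not part of the dataset above); `classes_` and `inverse_transform` are part of scikit-learn's LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

sports = ["Cricket", "Football", "Tennis", "Football"]  # hypothetical sample values

le = LabelEncoder()
codes = le.fit_transform(sports)

print(list(le.classes_))                  # categories in sorted order
print(list(codes))                        # position of each value within classes_
print(list(le.inverse_transform(codes)))  # back to the original strings
```

With three categories the codes run from 0 to 2, which shows that the result is not limited to 0 and 1.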


# One-Hot Encoding
Label encoding is straightforward, but it has a disadvantage: algorithms can misinterpret the integer codes as having some sort of hierarchy or order. A common alternative that avoids this problem is one-hot encoding, which creates one 0/1 column per category.

In [18]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# fitting the encoder and transforming the column (the sparse result is made dense)
encoded = enc.fit_transform(Data[['Col_4']]).toarray()
# converting the array to a dataframe; columns 0 and 1 follow enc.categories_ (Blue, Red)
encoded_colm = pd.DataFrame(encoded)
# concatenating the dataframes
Data = pd.concat([Data, encoded_colm], axis=1)
# removing the original column
Data = Data.drop(['Col_4'], axis=1)
Data.head(10)

  Id Col_1 Col_2     Col_3    0    1
0  0     M     Y   Cricket  0.0  1.0
1  1     F     N  Football  1.0  0.0
2  2     F     Y   Cricket  1.0  0.0
3  3     M     N  Football  0.0  1.0
4  4     F     Y  Football  0.0  1.0
5  5     M     N   Cricket  1.0  0.0
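
For quick experiments, pandas offers a shorter route to the same kind of result. The sketch below is an alternative to the OneHotEncoder cell above, not a replacement for it; it rebuilds only the Col_4 values so that it runs on its own.

```python
import pandas as pd

colours = pd.DataFrame({"Col_4": ["Red", "Blue", "Blue", "Red", "Red", "Blue"]})

# One indicator column per category, named <column>_<category>.
one_hot = pd.get_dummies(colours, columns=["Col_4"], dtype=int)
print(one_hot)
```

Unlike a fitted OneHotEncoder, `get_dummies` does not remember the categories it saw, so it is less convenient when the same encoding has to be applied to new data later.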


# Ordinal Encoding
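We can use the OrdinalEncoder class provided by scikit-learn to encode ordinal features. It maps each category to an integer-valued code. By default the categories are numbered in sorted (alphabetical) order, so the ordinal nature of a variable is preserved only if the intended order is passed through the `categories` parameter.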

In [19]:
from sklearn.preprocessing import OrdinalEncoder
ord1 = OrdinalEncoder()
# fitting the encoder and transforming the column in one step
# (the encoder expects 2-D input, hence the double brackets)
Data["Col_3"] = ord1.fit_transform(Data[["Col_3"]]).ravel()
Data.head(10)

  Id Col_1 Col_2  Col_3    0    1
0  0     M     Y    0.0  0.0  1.0
1  1     F     N    1.0  1.0  0.0
2  2     F     Y    0.0  1.0  0.0
3  3     M     N    1.0  0.0  1.0
4  4     F     Y    1.0  0.0  1.0
5  5     M     N    0.0  1.0  0.0
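
Col_3 has no natural order, so the cell above only demonstrates the mechanics. For a genuinely ordinal feature the order should be given explicitly. The following sketch assumes a hypothetical `Size` column with the levels low, medium and high.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"Size": ["medium", "low", "high", "low"]})  # hypothetical column

# Without `categories`, the encoder sorts alphabetically: high=0, low=1, medium=2.
# Passing the order explicitly yields low=0, medium=1, high=2.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
sizes["Size_encoded"] = encoder.fit_transform(sizes[["Size"]]).ravel()
print(sizes)
```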


# Binary Encoding
In this approach, each category is first converted to an integer, the integer is then written in binary, and each binary digit is placed in its own column.
For example, the integer 7 becomes 1 1 1.
This method is preferable when a feature has many categories. With 100 different categories, one-hot encoding creates 100 columns, whereas binary encoding needs only 7.

In [1]:
# Installing the required library
pip install category_encoders


Collecting category_encoders
  Using cached category_encoders-2.2.2-py2.py3-none-any.whl (80 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2
Note: you may need to restart the kernel to use updated packages.


In [5]:
from category_encoders import BinaryEncoder
encoder = BinaryEncoder(cols=['Col_2'])
# fitting the encoder and transforming the column
newdata = encoder.fit_transform(Data[['Col_2']])
# concatenating the dataframes
df = pd.concat([Data, newdata], axis=1)
# dropping the original column
df = df.drop(['Col_2'], axis=1)
df.head(10)

  Id Col_1     Col_3 Col_4  Col_2_0  Col_2_1
0  0     M   Cricket   Red        0        1
1  1     F  Football  Blue        1        0
2  2     F   Cricket  Blue        0        1
3  3     M  Football   Red        1        0
4  4     F  Football   Red        0        1
5  5     M   Cricket  Blue        1        0
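
The mechanism itself is easy to reproduce in plain Python. The function below is a simplified sketch and not the library's implementation; `binary_encode` is a hypothetical helper, and numbering the categories from 1 in order of first appearance is an assumption chosen so that the result matches the output above.

```python
def binary_encode(values):
    """Number categories from 1 in order of first appearance, then split each number into bits."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    index = {cat: i for i, cat in enumerate(categories, start=1)}
    width = len(categories).bit_length()      # bits needed for the largest index
    return [[int(bit) for bit in format(index[v], f"0{width}b")] for v in values]

# The Col_2 values from the dataset: Y -> 1 -> [0, 1], N -> 2 -> [1, 0]
print(binary_encode(["Y", "N", "Y", "N", "Y", "N"]))

# 100 hypothetical categories fit into 7 binary columns.
many = [f"cat_{i}" for i in range(100)]
print(len(binary_encode(many)[0]))
```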


# Hash Encoding
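Hashing converts a string of characters into a fixed-size hash value by applying a hash function. In hash encoding, the hash of each category decides which of a fixed number of output columns it falls into. The number of columns therefore does not grow with the number of categories, which keeps memory usage low. The trade-off is that hash values are not unique: different categories can collide in the same column.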
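
The idea can be sketched in a few lines of plain Python. This is an illustration only: `hash_encode` is a hypothetical helper, and `hashlib.md5` is used as a stand-in hash function, whereas scikit-learn's FeatureHasher uses a signed MurmurHash3.

```python
import hashlib

def hash_encode(value, n_features=8):
    """Return a row whose single active column is chosen by a hash of the value."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    column = int(digest, 16) % n_features  # bucket index in [0, n_features)
    row = [0] * n_features
    row[column] = 1
    return row

for sport in ["Cricket", "Football", "Cricket"]:
    print(sport, hash_encode(sport))
```

The same category always lands in the same column, and the row width stays at `n_features` no matter how many distinct categories appear.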

In [7]:
from sklearn.feature_extraction import FeatureHasher
# n_features is the number of output columns (hash buckets), not a number of bits
h = FeatureHasher(n_features=3, input_type='string')
# FeatureHasher expects each sample to be an iterable of strings, so every category
# is wrapped in a list; a bare string is hashed character by character by older
# scikit-learn versions and rejected by recent ones
hashed_Feature = h.fit_transform([[value] for value in Data['Col_3']])
hashed_Feature = hashed_Feature.toarray()
df = pd.concat([Data, pd.DataFrame(hashed_Feature)], axis=1)
df = df.drop(['Col_3'], axis=1)
# each row of the new columns 0-2 holds a single non-zero entry (+1 or -1),
# and rows with the same category get identical values
df.head(10)


# Mean/Target Encoding
Target (mean) encoding replaces each category with the mean of the target variable for that category. Because the encoded value carries information about the target, the technique is popular in competitive data science and hackathons. For the same reason it can leak the target into the features, so the encoder should be fitted on the training data only.
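
The core computation can be written by hand with pandas. The sketch below uses the Col_2 values and the Target values from this notebook and computes plain per-category means; library encoders such as TargetEncoder typically add smoothing on top, so their numbers will differ.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Col_2": ["Y", "N", "Y", "N", "Y", "N"],
    "Target": [0, 1, 1, 0, 0, 1],
})

# Mean of the target within each category.
category_means = df_demo.groupby("Col_2")["Target"].mean()

# Replace each category with its mean.
df_demo["Col_2_encoded"] = df_demo["Col_2"].map(category_means)
print(df_demo)
```

Here Y has targets 0, 1, 0 and N has targets 1, 0, 1, so the plain means are 1/3 and 2/3.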


In [9]:
# adding a Target column, since target encoding needs a target
Data['Target'] = [0, 1, 1, 0, 0, 1]
# importing TargetEncoder
from category_encoders import TargetEncoder
Targetenc = TargetEncoder()
# fitting the encoder and transforming the column
values = Targetenc.fit_transform(X=Data[['Col_2']], y=Data['Target'])
# the result is also named Col_2, so it is renamed before concatenating;
# otherwise the drop below would remove the encoded column as well
values = values.rename(columns={'Col_2': 'Col_2_encoded'})
df = pd.concat([Data, values], axis=1)
# dropping the original column
df = df.drop(['Col_2'], axis=1)
# the exact values in Col_2_encoded depend on the encoder's smoothing settings
df.head()
