#Encoding 

---

Encoding is a feature engineering technique used to handle categorical data. There are many encoding techniques to choose from. Before getting into them, we need to understand what categorical data is.

##Categorical data

As the name suggests, categorical data is data that falls into categories. It is usually represented as strings, and each categorical variable takes one of a finite set of values. Examples: gender, educational qualification, city, state, etc.

Numerical data is data that is represented in numbers.

Machine learning models cannot process textual data such as gender (male, female) directly, so we use encoding techniques to convert categorical data into numerical data. This makes mathematical and statistical analysis easier.
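As a minimal sketch of this distinction, the snippet below separates the categorical columns from the numerical ones in a small made-up DataFrame. The column names and values are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numerical column.
people = pd.DataFrame({
    'gender': ['male', 'female', 'female'],
    'city': ['Delhi', 'Mumbai', 'Delhi'],
    'age': [25, 32, 41],
})

# Columns that are not numeric are treated as categorical in this sketch.
categorical_cols = people.select_dtypes(exclude='number').columns.tolist()
numerical_cols = people.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Only the columns in `categorical_cols` would need one of the encoding techniques described here.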

---



##Types of Categorical Data

There are two kinds of categorical data:

* Ordinal Data
* Nominal Data

###Ordinal Data
In ordinal data, the categories **have an order**, or **rank**, and can be arranged according to it.
One thing to keep in mind is that the encoding should retain this order information.

Example:
Ranks of students, or the level of degree required for a person to be considered for a post.

###Nominal Data

In nominal data there is **no 'rank' or order**; the categories are independent of any ranking, so there is no need to arrange or order them.
Example: color of flowers, city where a person lives, etc.
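One possible way to express the difference in code is with pandas categoricals, which can be declared as ordered or unordered. This is only a sketch; the degree names and colors are made-up example values.

```python
import pandas as pd

# Ordinal: the category order is meaningful, so we declare it explicitly.
degree_order = ['High School', 'Bachelor', 'Master', 'PhD']
degrees = pd.Series(pd.Categorical(
    ['Master', 'High School', 'PhD', 'Bachelor'],
    categories=degree_order, ordered=True))

# Comparisons and min/max respect the declared order.
print(degrees.max())
print((degrees > 'Bachelor').tolist())

# Nominal: no order is declared, so only equality checks make sense.
colors = pd.Series(pd.Categorical(['Red', 'Blue', 'Yellow', 'Blue']))
print(colors.cat.ordered)
```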


---




##Types of encoding techniques

* One Hot Encoding
* Dummy Encoding
* Label Encoding 
* Ordinal Encoding
* Helmert Encoding
* Frequency Encoding 
* Weight of Evidence Encoding
* Mean Encoding
* Probability Ratio Encoding
* Hashing Encoding
* Backward Difference Encoding
* Leave One Out Encoding
* James-Stein Encoding
* M-estimator Encoding
* Thermometer Encoder


Some of the listed encoding techniques are fairly advanced and not commonly used. To use some of them, we need the **category_encoders** module.

---



Here we will focus on some of these techniques.

##One Hot Encoding

This is a popular encoding technique for handling categorical values. Each category is mapped to a vector containing **1** and **0**, where 1 denotes that the category is present and 0 that it is absent.
The number of new features depends on the number of categories. For example, if a variable has 5 categories, one-hot encoding produces 5 columns, one per category.
**N categories give N columns** (dropping one of them, as dummy encoding does, gives N-1).

This becomes inefficient when a variable has many categories, because the number of columns, and with it the dimensionality, grows with every additional category.
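To make the mapping concrete, here is a from-scratch sketch of one-hot encoding using only NumPy. It is not how pandas or Scikit-learn implement it internally; it just illustrates the idea of one column per category.

```python
import numpy as np

temperature = np.array(['Hot', 'Cold', 'Very Hot', 'Warm', 'Warm',
                        'Hot', 'Cold', 'Cold', 'Hot'])

# Sorted unique categories, plus the index of each value's category.
categories, codes = np.unique(temperature, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and the matrix has as many columns as there are distinct categories.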

pandas has a built-in function for this called **get_dummies**.

In [2]:
import pandas as pd
import numpy as np
datas = {'Temperature':['Hot','Cold','Very Hot','Warm','Warm','Hot','Cold','Cold','Hot'],
         'Color':['Red','Yellow','Red','Blue','Yellow','Blue','Yellow','Yellow','Yellow'],
         'Target':[1,0,1,0,1,1,1,0,0]}
df = pd.DataFrame(datas,columns = ['Temperature','Color','Target'])
df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,0
2,Very Hot,Red,1
3,Warm,Blue,0
4,Warm,Yellow,1
5,Hot,Blue,1
6,Cold,Yellow,1
7,Cold,Yellow,0
8,Hot,Yellow,0


We could also use Scikit-learn's **OneHotEncoder** class.

In [3]:
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder()


In [4]:
one_hots = one_hot.fit_transform(df.Temperature.values.reshape(-1,1)).toarray()

In [5]:
onehot_df = pd.DataFrame(one_hots, columns=["Temps_" + str(cat) for cat in one_hot.categories_[0]])
dfh = pd.concat([df,onehot_df],axis=1)
dfh

Unnamed: 0,Temperature,Color,Target,Temps_Cold,Temps_Hot,Temps_Very Hot,Temps_Warm
0,Hot,Red,1,0.0,1.0,0.0,0.0
1,Cold,Yellow,0,1.0,0.0,0.0,0.0
2,Very Hot,Red,1,0.0,0.0,1.0,0.0
3,Warm,Blue,0,0.0,0.0,0.0,1.0
4,Warm,Yellow,1,0.0,0.0,0.0,1.0
5,Hot,Blue,1,0.0,1.0,0.0,0.0
6,Cold,Yellow,1,1.0,0.0,0.0,0.0
7,Cold,Yellow,0,1.0,0.0,0.0,0.0
8,Hot,Yellow,0,0.0,1.0,0.0,0.0


In [6]:
data_fram = pd.get_dummies(df,prefix = ['Temps'],columns = ['Temperature'])
data_fram

Unnamed: 0,Color,Target,Temps_Cold,Temps_Hot,Temps_Very Hot,Temps_Warm
0,Red,1,0,1,0,0
1,Yellow,0,1,0,0,0
2,Red,1,0,0,1,0
3,Blue,0,0,0,0,1
4,Yellow,1,0,0,0,1
5,Blue,1,0,1,0,0
6,Yellow,1,1,0,0,0
7,Yellow,0,1,0,0,0
8,Yellow,0,0,1,0,0


In regression, we use N-1 columns (that is, we drop either the first or the last column). For classification tasks, we would use all N columns.

---



##Drawbacks of One Hot Encoding

One-hot encoding can lead to the Dummy Variable Trap: the value of one encoded column can be predicted exactly from the remaining columns, so the columns are highly correlated with each other.

The Dummy Variable Trap in turn leads to a problem called **multicollinearity**, which arises when features that are supposed to be independent actually depend on one another.
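The redundancy can be checked numerically. The sketch below assumes a linear model with an intercept column (an assumption added here for illustration) and compares the rank of the design matrix with and without one of the one-hot columns.

```python
import numpy as np

temperature = np.array(['Hot', 'Cold', 'Very Hot', 'Warm', 'Warm',
                        'Hot', 'Cold', 'Cold', 'Hot'])
_, codes = np.unique(temperature, return_inverse=True)
one_hot = np.eye(4)[codes]

# Assumed intercept column of ones, as used by a typical linear model.
intercept = np.ones((len(temperature), 1))

# All 4 one-hot columns: they sum to the intercept column, so one is redundant.
X_full = np.hstack([intercept, one_hot])
# Dropping the first one-hot column removes the redundancy.
X_drop = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)
print(X_full.shape[1], rank_full)
print(X_drop.shape[1], rank_drop)
```

A rank lower than the number of columns means at least one column is a linear combination of the others.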

This badly affects machine learning models such as **linear regression** and **logistic regression**. To detect it, we can check the **Variance Inflation Factor (VIF)**.

* VIF = 1: very little or no multicollinearity.
* 1 < VIF < 5: moderate levels of multicollinearity.
* VIF > 5: extreme levels of multicollinearity.
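As a rough sketch of how VIF can be computed, the function below uses the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The helper name `vif` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
        ss_res = residual @ residual
        ss_tot = ((y - y.mean()) ** 2).sum()
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res.
        result.append(ss_tot / ss_res)
    return np.array(result)

# Synthetic data: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The first two values should be far above 5, while the third should stay close to 1.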


##Label Encoding

Here, each category is assigned an integer value from 0 to N-1. The integers do not necessarily reflect any relation or order among the categories.

In [8]:
from sklearn.preprocessing import LabelEncoder
df['Temp_label_encoded'] = LabelEncoder().fit_transform(df.Temperature)
df

Unnamed: 0,Temperature,Color,Target,Temp_label_encoded
0,Hot,Red,1,1
1,Cold,Yellow,0,0
2,Very Hot,Red,1,2
3,Warm,Blue,0,3
4,Warm,Yellow,1,3
5,Hot,Blue,1,1
6,Cold,Yellow,1,0
7,Cold,Yellow,0,0
8,Hot,Yellow,0,1


When label encoding is performed, the temperatures are ranked in alphabetical order: Cold < Hot < Very Hot < Warm. A model may read this as a meaningful order even though it is not.
This problem is overcome by **One Hot Encoding**.
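The alphabetical ranking can be reproduced with a few lines of plain Python. This is a sketch of the idea rather than the library's actual code: sort the distinct labels, then number them from 0.

```python
temperature = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Warm',
               'Hot', 'Cold', 'Cold', 'Hot']

# Sort the distinct labels, then assign 0..N-1 in that order.
classes = sorted(set(temperature))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[t] for t in temperature]

print(mapping)
print(encoded)
```

'Warm' ends up with the largest code only because it sorts last alphabetically, not because it is the hottest.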

---



##When to consider Label Encoding or One Hot Encoding

We would apply One Hot Encoding if:

* The categorical variables aren't ordinal.
* The number of categories is small. With many categories, the added columns cause a dimensionality problem.

We would apply Label Encoding if:

* The categories have a specific rank or order.
* The number of categories is large.

---



##Dummy Encoding

It is similar to one-hot encoding. The only difference is the number of binary variables the categories are transformed into: dummy encoding uses N-1 binary variables, with the dropped category represented by all zeros.
One-hot encoding, in contrast, uses N binary variables.

In [10]:
import pandas as pd
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi']})

data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi


In [12]:
data_dummies=pd.get_dummies(data=data,drop_first=True)
data_dummies

Unnamed: 0,City_Chennai,City_Delhi,City_Hyderabad,City_Mumbai
0,0,1,0,0
1,0,0,0,1
2,0,0,1,0
3,1,0,0,0
4,0,0,0,0
5,0,1,0,0
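The short comparison below, a sketch reusing the same city values, shows the column counts of the two encodings and how the dropped category can still be recovered from the dummy columns.

```python
import pandas as pd

cities = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad',
                                'Chennai', 'Bangalore', 'Delhi']})

one_hot = pd.get_dummies(cities)                    # N columns
dummies = pd.get_dummies(cities, drop_first=True)   # N - 1 columns
print(one_hot.shape[1], dummies.shape[1])

# Rows whose dummy columns are all zero belong to the dropped category.
dropped_rows = cities.loc[dummies.sum(axis=1) == 0, 'City'].tolist()
print(dropped_rows)
```

With `drop_first=True`, pandas drops the first category in sorted order, which is Bangalore here.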


##Ordinal Encoding

It ensures that the variables are encoded according to their **rank**. It is used only for ordinal variables, since nominal variables have no order or ranking. It resembles Label Encoding, but Label Encoding does not take into account whether the variable is nominal or ordinal.

In [15]:
dict_temp = {
    'Very Hot':1,
    'Hot':2,
    'Warm':3,
    'Cold':4
}
df['Temps_ordinal'] = df.Temperature.map(dict_temp)
df

Unnamed: 0,Temperature,Color,Target,Temp_label_encoded,Temps_ordinal
0,Hot,Red,1,1,2
1,Cold,Yellow,0,0,4
2,Very Hot,Red,1,2,1
3,Warm,Blue,0,3,3
4,Warm,Yellow,1,3,3
5,Hot,Blue,1,1,2
6,Cold,Yellow,1,0,4
7,Cold,Yellow,0,0,4
8,Hot,Yellow,0,1,2
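An alternative to the manual dictionary is Scikit-learn's **OrdinalEncoder** with an explicit category order. The sketch below assumes the same order as the mapping above and adds 1 to the result, because the encoder numbers categories from 0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

temps = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Warm',
                                      'Hot', 'Cold', 'Cold', 'Hot']})

# Explicit order; without it the encoder falls back to sorted (alphabetical) order.
order = ['Very Hot', 'Hot', 'Warm', 'Cold']
encoder = OrdinalEncoder(categories=[order])

# fit_transform expects a 2D input and returns floats starting at 0.
codes = encoder.fit_transform(temps[['Temperature']]).ravel().astype(int) + 1
temps['Temps_ordinal'] = codes
print(temps)
```

Passing the order explicitly is what distinguishes this from plain label encoding.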


##Frequency Encoding

It uses the frequency of each category as its label. When the frequency is related to the target, this helps the model assign weight in direct or inverse proportion, depending on the nature of the data.

Steps used for frequency encoding:
* Select the categorical variable to transform.
* Group by the categorical variable and obtain the frequency (count) of each category.
* Join it back with the training dataset.

In [18]:
freq_enc = df.groupby('Temperature').size()/len(df)
df.loc[:,'Temp_freq_enc'] = df['Temperature'].map(freq_enc)
df

Unnamed: 0,Temperature,Color,Target,Temp_label_encoded,Temps_ordinal,Temp_freq_enc
0,Hot,Red,1,1,2,0.333333
1,Cold,Yellow,0,0,4,0.333333
2,Very Hot,Red,1,2,1,0.111111
3,Warm,Blue,0,3,3,0.222222
4,Warm,Yellow,1,3,3,0.222222
5,Hot,Blue,1,1,2,0.333333
6,Cold,Yellow,1,0,4,0.333333
7,Cold,Yellow,0,0,4,0.333333
8,Hot,Yellow,0,1,2,0.333333
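The same frequencies can also be obtained with `value_counts(normalize=True)`. The sketch below additionally shows one possible way to handle a category that was not seen during training; the 'Freezing' value and the choice to fill it with 0 are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Warm',
                                      'Hot', 'Cold', 'Cold', 'Hot']})

# Relative frequency of each category in the training data.
freq = train['Temperature'].value_counts(normalize=True)
train['Temp_freq_enc'] = train['Temperature'].map(freq)

# Hypothetical new data with an unseen category ('Freezing').
new = pd.DataFrame({'Temperature': ['Hot', 'Freezing']})
# Unseen categories map to NaN; here we assume a frequency of 0 for them.
new['Temp_freq_enc'] = new['Temperature'].map(freq).fillna(0.0)

print(train)
print(new)
```

Note that categories with equal frequency, such as Hot and Cold here, receive the same encoded value and can no longer be told apart.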


References and code samples:

https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/
