These notes cover how to encode categorical features for tree-based models. Most of the examples and plots are from this [article](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769).

Tree-based models such as random forests tend to perform worse with one-hot encoding. With label encoding, the order assigned to the levels of a feature does not need to be meaningful when using tree-based models.

![Snapshot of Performance](https://cdn-images-1.medium.com/max/1200/1*vWhYH9KaUeDhjNB6cOU_sw.png)
<center>(Snapshot of Performance)</center>

### What happens inside the algorithm?
Every tree-based algorithm contains a sub-algorithm that splits the instances into two bins based on one feature. This sub-algorithm considers all possible splits (over all features and all possible values of each feature) and picks the best one according to a criterion. Qualitatively speaking, the criterion helps the sub-algorithm select the split that minimizes the impurity of the bins.
If a continuous variable is chosen for the split, there are many candidate values on which the tree can split, and in most cases the tree can grow in both directions.
![](https://cdn-images-1.medium.com/max/1200/1*rJldhOQ8qofb-UsvEDwAdg.png)
<center>Dense Decision Tree (Model without One Hot Encoding)</center>

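The split search described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular library: it assumes Gini impurity as the criterion and midpoints between distinct values as candidate thresholds, and the helper names `gini` and `best_split` are hypothetical. On the toy data, the continuous feature offers three candidate thresholds while the 0/1 dummy offers only one.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustively search every feature and threshold for the split
    with the lowest weighted impurity of the two resulting bins."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, impurity)
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if impurity < best[2]:
                best = (j, t, impurity)
    return best

# Toy data: feature 0 is continuous, feature 1 is a 0/1 dummy variable.
X = np.array([[1.0, 0], [2.0, 1], [3.0, 0], [4.0, 1]])
y = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_split(X, y)
print(feature, threshold, impurity)
```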
Categorical variables are different: they are naturally disadvantaged here because they offer only a few options for splitting, which results in a very sparse decision tree.
The situation gets worse for variables with a small number of levels, and each feature produced by one-hot encoding has only two values.
![](https://cdn-images-1.medium.com/max/1200/1*waMbIQifR03o_1hHzNYgbw.png)
<center>Sparse Decision Tree (Model with One Hot Encoding)</center>

Because the data set becomes sparser after one-hot encoding, the trees generally tend to grow in one direction: the direction of the majority value in the dummy variables, which is zero.

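To make the "majority of zeroes" concrete, here is a small sketch on made-up data (the level names and the choice of `q = 10` are assumptions for illustration). Because every row has exactly one 1 among its `q` dummy columns, the share of zeros in the encoded matrix is `(q - 1) / q`.

```python
import pandas as pd

q = 10  # hypothetical number of levels
levels = [f"level_{i}" for i in range(q)]
feature = pd.Series(levels * 5, name="cat")  # 50 rows, 5 per level

# One 0/1 column per level.
dummies = pd.get_dummies(feature)

# Share of zero entries in the whole one-hot encoded matrix.
zero_fraction = float((dummies == 0).to_numpy().mean())
print(dummies.shape, zero_fraction, (q - 1) / q)
```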
![](https://cdn-images-1.medium.com/max/1600/1*jOMNT-nHwABGVchKX0Pi3Q.jpeg)

If we have a categorical variable with q levels, the tree has to choose from 2^(q-1) - 1 (that is, 2^q/2 - 1) possible splits. For a dummy variable, there is only one possible split, and this induces sparsity.
In conclusion:
1. The sparsity of dummy variables leads trees to grow in one direction.
2. The trees will be deeper.
    * A categorical variable with q levels offers 2^(q-1) - 1 possible splits at a single node, while each of its dummy variables offers only one, so more levels of the tree are needed to separate the same groups.
    
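The split count for a q-level categorical variable can be checked by brute force. The sketch below enumerates every way of dividing the levels into two non-empty groups and compares the count with 2^(q-1) - 1; the function name `binary_partitions` and the level names are hypothetical.

```python
from itertools import combinations

def binary_partitions(levels):
    """List every way to split the levels into two non-empty groups.
    The first level is pinned to the left group so that mirror-image
    splits ({a} | {b, c} and {b, c} | {a}) are counted only once."""
    first, rest = levels[0], levels[1:]
    partitions = []
    for k in range(len(rest)):  # k = how many other levels join `first`
        for extra in combinations(rest, k):
            left = {first, *extra}
            right = set(levels) - left
            partitions.append((left, right))
    return partitions

for q in range(2, 7):
    levels = [f"c{i}" for i in range(q)]
    n_splits = len(binary_partitions(levels))
    print(q, n_splits, 2 ** (q - 1) - 1)
```

A single dummy variable, by contrast, corresponds to the case q = 2 and has exactly one split.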
### Why is it harmful?


1. By one-hot encoding a feature, we induce sparsity in the data set, which is undesirable.
2. The splitting algorithm treats all the dummy variables as independent variables, which is not true. The gain in purity from splitting on a single dummy variable is also marginal. As a result, the tree is unlikely to select a dummy variable close to the root; it prefers a non-dummy variable even when the categorical variable is crucial.

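The second point can be illustrated with a small calculation. The sketch below uses made-up data and assumes Gini impurity as the criterion (the helper names `gini` and `gain` are hypothetical). The label depends on a grouping of two levels, so any single dummy variable can only peel off one level and recovers a fraction of the gain that a grouped split on the original categorical variable would achieve.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gain(y, mask):
    """Impurity decrease from splitting y into y[mask] and y[~mask]."""
    left, right = y[mask], y[~mask]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
    return gini(y) - weighted

# Hypothetical feature with four equally frequent levels; the label is 1
# for levels "a" and "b" and 0 for levels "c" and "d".
cat = np.repeat(np.array(["a", "b", "c", "d"]), 25)
y = np.isin(cat, ["a", "b"]).astype(int)

# One-hot encoding: each dummy can only separate one level from the rest.
dummy_gains = {level: gain(y, cat == level) for level in "abcd"}
# Categorical split: levels can be grouped, here {a, b} against {c, d}.
grouped_gain = gain(y, np.isin(cat, ["a", "b"]))
print(dummy_gains)
print(grouped_gain)
```

In this toy case every dummy yields a gain of 1/6, while the grouped split yields 1/2.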
### Example
Feature importance before and after one-hot encoding:

![](https://cdn-images-1.medium.com/max/1600/1*aMaOMQ0bIt9txo_YMcSiTQ.png)

![](https://cdn-images-1.medium.com/max/1600/1*VT1vwxH9k_Ra1B6JkMwSJw.png)

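```python
import pandas as pd
from sklearn.datasets import fetch_openml

# Download the mice protein expression data set from OpenML as pandas objects.
data_class = fetch_openml(name='miceprotein', version=4, as_frame=True)

# Combine the features and the class label into one DataFrame.
# Copying the feature frame keeps the numeric dtypes of the columns.
data = data_class.data.copy()
data['target'] = data_class.target
data.head()
```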

| | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | pELK_N | ... | BAD_N | BCL2_N | pS6_N | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.503644 | 0.747193 | 0.430175 | 2.81633 | 5.99015 | 0.21883 | 0.177565 | 2.37374 | 0.232224 | 1.75094 | ... | 0.122652 | NaN | 0.106305 | 0.108336 | 0.427099 | 0.114783 | 0.13179 | 0.128186 | 1.67565 | c-CS-m |
| 1 | 0.514617 | 0.689064 | 0.41177 | 2.78951 | 5.68504 | 0.211636 | 0.172817 | 2.29215 | 0.226972 | 1.59638 | ... | 0.116682 | NaN | 0.106592 | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.74361 | c-CS-m |
| 2 | 0.509183 | 0.730247 | 0.418309 | 2.6872 | 5.62206 | 0.209011 | 0.175722 | 2.28334 | 0.230247 | 1.56132 | ... | 0.118508 | NaN | 0.108303 | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.92643 | c-CS-m |
| 3 | 0.442107 | 0.617076 | 0.358626 | 2.46695 | 4.9795 | 0.222886 | 0.176463 | 2.1523 | 0.207004 | 1.59509 | ... | 0.132781 | NaN | 0.103184 | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.70056 | c-CS-m |
| 4 | 0.43494 | 0.61743 | 0.358802 | 2.36578 | 4.71868 | 0.213106 | 0.173627 | 2.13401 | 0.192158 | 1.50423 | ... | 0.129954 | NaN | 0.104784 | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.14838 | 1.83973 | c-CS-m |


```python
# Coding examples
# Label encoding: map each class name to an integer code.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(data['target'])
data['target'] = le.transform(data['target'])
```

```python
# One-hot encoding: replace `target` with one 0/1 column per class.
data = pd.get_dummies(data, columns=['target'])
```
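
The snippets above apply the two encodings to the `target` column. To compare feature importances as this section describes, the same encodings would be applied to a categorical feature instead. The following is a rough sketch on synthetic data, not the mice protein set and not the article's experiment: the column names `num` and `cat`, the rule that generates `y`, and the forest settings are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 500
# Synthetic data: one numeric feature and one categorical feature with 8 levels.
df = pd.DataFrame({
    "num": rng.normal(size=n),
    "cat": rng.choice(list("abcdefgh"), size=n),
})
# Hypothetical labelling rule that depends on both features.
y = ((df["num"] > 0) | df["cat"].isin(["a", "b"])).astype(int)

# Encoding 1: integer codes, a single column for the categorical feature.
X_label = df.copy()
X_label["cat"] = OrdinalEncoder().fit_transform(df[["cat"]]).ravel()

# Encoding 2: one-hot, one 0/1 column per level.
X_onehot = pd.get_dummies(df, columns=["cat"])

results = {}
for name, X in [("label", X_label), ("one-hot", X_onehot)]:
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    # Impurity-based importances, one value per input column.
    results[name] = pd.Series(forest.feature_importances_, index=X.columns)
    print(name, results[name].sort_values(ascending=False), sep="\n")
```

With one-hot encoding, the importance of the categorical feature is spread over its dummy columns, so the two rankings are not directly comparable column by column.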