## 1. Introduction
In this homework we will explore different ways of handling categorical variables in decision trees and weigh the pros
and cons of each approach. In particular, we will see how one-hot encoding a categorical variable yields poor performance when the
number of possible values it can assume is high. We will also see how decision trees can handle categorical variables
directly and naturally, without one-hot encoding them.
## 2. Dataset:
In this homework we will work with a synthetically created dataset with known interactions between variables$^{1}$.
It contains one real-valued variable $z$ and a categorical variable $c$ with 200 possible levels, both of which strongly correlate
with the observed variable $y$. Formally, they follow the relation:

$$ y = 
\begin{cases}
    1, &\text{if  } c \in C^+ \text{ or } z > 10 \\
    0, &\text{otherwise}
\end{cases}$$

where $C^+$ is the set of categories whose names begin with the letter **A** and $C^-$ is the set whose names begin with the letter **B**.
The dataset also contains 100 other real-valued variables $x_i,\ i \in \{0, \dots, 99\}$, with varying (but non-significant)
correlations with the observed variable; these serve as noise.
The dataset contains 10000 examples in total.

A small subset of the data looks like the following:






```python
import pandas as pd

# Make sure you've downloaded the file and the path is correct.
df = pd.read_pickle('categorical.pkl')
df.head(10).round(3)
```

| | y | c | z | x0 | x1 | x2 | x3 | x4 | x5 | x6 | ... | x90 | x91 | x92 | x93 | x94 | x95 | x96 | x97 | x98 | x99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adk | 3.249 | -8.121 | 4.225 | 3.939 | 2.177 | -4.563 | -1.281 | -18.063 | ... | 4.282 | -13.735 | -4.926 | -0.93 | -4.438 | -4.954 | 5.147 | 12.147 | 5.261 | 5.379 |
| 1 | 1 | Adg | 9.412 | -22.232 | 6.17 | -4.89 | 5.541 | 5.982 | 7.018 | -4.786 | ... | 7.122 | -10.226 | 4.726 | 6.512 | -10.393 | 6.112 | 9.457 | -12.911 | -0.384 | 9.922 |
| 2 | 1 | Acs | 9.74 | -2.892 | -1.393 | 6.831 | -0.24 | -23.42 | 7.991 | 18.532 | ... | -2.196 | 7.716 | -15.627 | -28.18 | 5.727 | 4.693 | -12.137 | 12.972 | -0.073 | -7.936 |
| 3 | 1 | Acb | 16.043 | 6.197 | -4.527 | -12.651 | -15.924 | 12.013 | 5.821 | 13.547 | ... | -11.758 | -13.935 | -0.04 | 8.221 | 1.481 | 11.609 | 9.12 | -1.903 | -8.438 | 15.006 |
| 4 | 1 | Aas | 13.717 | 6.48 | 7.84 | -5.849 | 16.306 | 0.506 | -14.651 | 4.347 | ... | -11.99 | 0.875 | 0.416 | -18.412 | -8.506 | 4.638 | -9.237 | -1.395 | 19.512 | -4.426 |
| 5 | 0 | Bca | 5.956 | 7.167 | 15.06 | 0.331 | -9.986 | -9.756 | -6.632 | 8.357 | ... | -1.747 | -12.302 | -11.957 | -6.538 | 11.635 | 7.446 | 7.618 | 3.729 | -4.476 | -3.037 |
| 6 | 0 | Bdk | 2.923 | 24.327 | -5.232 | 4.53 | -7.362 | 7.063 | 1.951 | 7.538 | ... | -0.56 | -7.271 | 0.213 | 1.155 | 2.093 | -6.612 | -1.755 | 12.442 | -5.582 | 1.631 |
| 7 | 0 | Bal | 6.572 | 10.886 | -4.413 | 8.222 | 7.348 | -2.873 | -1.676 | -6.342 | ... | 6.884 | -7.649 | 6.742 | 16.675 | -1.46 | -6.662 | -8.2 | -1.357 | 15.118 | -0.891 |
| 8 | 0 | Bcx | 1.101 | 7.103 | 2.293 | 10.635 | -10.746 | 1.309 | -20.503 | -5.371 | ... | -1.795 | -4.453 | -10.653 | -7.566 | 2.318 | -18.155 | -22.536 | 3.523 | 0.711 | -0.072 |
| 9 | 0 | Bcn | 6.202 | -30.672 | -7.411 | 4.389 | -13.742 | 1.3 | -10.432 | 10.468 | ... | -5.976 | 6.892 | -7.576 | 22.716 | -3.475 | -19.599 | 4.005 | 0.16 | -6.803 | 9.38 |



## 3. Handling categorical variables:
There are a number of ways to handle categorical variables. Most implementations either one-hot encode categorical variables
or handle them directly while building decision trees.

Handling categorical variables by one-hot encoding them is quite straightforward. We transform a categorical variable with $q$ levels
into $q$ *independent* boolean variables. Implementations like *scikit-learn* require categorical variables to be supplied as
one-hot encodings. We shall see in later sections that this is in fact a bad idea: it introduces unnecessary sparsity and weakens the
predictive power a categorical variable might have.
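
As a minimal sketch of what this transformation looks like, the snippet below one-hot encodes a small column with pandas' `get_dummies`. The toy DataFrame and its level names are hypothetical and only mimic the naming style of the homework dataset.

```python
import pandas as pd

# Hypothetical toy column; the level names only mimic the homework's naming style.
toy = pd.DataFrame({"c": ["Adk", "Bca", "Adg", "Bca", "Bdk"]})

# One indicator column per distinct level of c (q levels -> q columns).
one_hot = pd.get_dummies(toy["c"], prefix="c")

print(one_hot.columns.tolist())
# Every row has exactly one active indicator, so most entries are zero.
print(one_hot.astype(int).sum(axis=1).tolist())
```

With 200 levels, the same call would produce 200 mostly-zero columns, which is the sparsity referred to above.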

The second approach, which decision trees lend themselves to quite naturally, is to handle categorical variables directly. Say
a categorical variable has $q$ levels and we are building a binary decision tree. We need to split the set of all $q$ levels into 2 disjoint non-empty sets. How many ways can this be done?
$$\frac{2^q - 2}{2} = 2^{q-1} - 1$$

There are $2^q$ assignments since each level can belong to either of the two sets; we subtract $2$ to exclude the two assignments
that leave one set empty, and divide by $2$ to account for symmetry.
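
The count can be checked by brute force for small $q$. The sketch below enumerates every unordered split and compares the result with the formula; the helper names `count_binary_partitions` and `formula` are illustrative, not part of any library.

```python
from itertools import combinations

def count_binary_partitions(levels):
    """Brute-force count of ways to split `levels` into two disjoint non-empty sets."""
    levels = list(levels)
    seen = set()
    for size in range(1, len(levels)):
        for left in combinations(levels, size):
            right = tuple(l for l in levels if l not in left)
            # Store each split as an unordered pair so that (left, right)
            # and (right, left) are counted once.
            seen.add(frozenset([frozenset(left), frozenset(right)]))
    return len(seen)

def formula(q):
    """Closed-form count (2^q - 2) / 2."""
    return (2 ** q - 2) // 2

for q in range(2, 8):
    assert count_binary_partitions(range(q)) == formula(q)

# For q = 200 the count is far too large to enumerate.
print(formula(200))
```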

But does this mean that, when finding a split on a categorical variable, we need to exhaustively search every possible split to find
the one that best reduces the impurity? Well, it depends!
If the problem at hand is binary classification, there is an
efficient way of finding the best split: **order the levels by the fraction of positive (wlog) samples they contain and find the split as if the levels were ordered, just like the values of a real-valued variable**. This leaves only $q - 1$ candidate splits.
The proof is beyond the scope of this exercise; if you are interested, it can be found in [Breiman et al.](https://www.amazon.co.uk/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418)
For multi-class classification, however, no such simplification exists, though various [heuristics](https://link.springer.com/article/10.1023/A%3A1009869804967?no-access=true) have been proposed.
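
The ordering trick for binary classification can be sketched as follows. This is a minimal illustration that assumes Gini impurity and 0/1 labels; the function `best_categorical_split` and the toy data are hypothetical, not the reference solution for the homework.

```python
import numpy as np
import pandas as pd

def gini(y):
    """Gini impurity of a binary 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def best_categorical_split(c, y):
    """Return (left_levels, weighted_impurity) for the best binary split of c."""
    df = pd.DataFrame({"c": c, "y": y})
    labels = df["y"].to_numpy()
    # Order levels by their fraction of positive samples.
    order = df.groupby("c")["y"].mean().sort_values().index.tolist()
    best_levels, best_impurity = None, np.inf
    # Only the q - 1 prefix splits of the ordered levels are examined.
    for k in range(1, len(order)):
        left_mask = df["c"].isin(order[:k]).to_numpy()
        y_left, y_right = labels[left_mask], labels[~left_mask]
        impurity = (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / len(df)
        if impurity < best_impurity:
            best_levels, best_impurity = set(order[:k]), impurity
    return best_levels, best_impurity

# Hypothetical toy data: levels starting with "A" are mostly positive.
c = ["Aa"] * 3 + ["Ab"] * 3 + ["Ba"] * 3 + ["Bb"] * 3
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
left, impurity = best_categorical_split(c, y)
print(left, round(impurity, 3))
```

On this toy data the scan groups the two **B** levels on one side, which is the split one would hope to recover.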

One question that arises, however, is: do categorical variables with a large number of levels cause overfitting? The answer is: potentially!
Consider categorical variables $C_1$ and $C_2$ with $q_1$ and $q_2$ levels respectively, where $q_1 > q_2$. Which of these variables do you think might be favored during tree induction? Since $C_1$ provides more granularity in how the samples can be split, it is more likely to produce a bigger reduction in sample impurity and hence is more likely to be favored over $C_2$. Because a large number of levels allows for finer-grained splits, even ones that merely fit noise, it can potentially cause overfitting.
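
A small simulation can make this bias concrete. In the sketch below the labels are pure noise, so neither variable carries any signal, yet the variable with more levels still tends to achieve a larger impurity reduction. The sample size, level counts, and helper `best_gain` are assumptions chosen for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary 0/1 label array."""
    p = y.mean() if len(y) else 0.0
    return 2.0 * p * (1.0 - p)

def best_gain(codes, y):
    """Largest Gini reduction over binary splits of an integer-coded categorical."""
    levels = np.unique(codes)
    # Order levels by positive fraction, then scan the prefix splits.
    order = sorted(levels, key=lambda l: y[codes == l].mean())
    parent, best = gini(y), 0.0
    for k in range(1, len(order)):
        left = np.isin(codes, order[:k])
        child = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
        best = max(best, parent - child)
    return best

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)        # labels are pure noise
c_few = rng.integers(0, 2, size=n)    # hypothetical variable with 2 levels
c_many = rng.integers(0, 50, size=n)  # hypothetical variable with 50 levels

gain_few, gain_many = best_gain(c_few, y), best_gain(c_many, y)
print(f"2 levels: {gain_few:.4f}, 50 levels: {gain_many:.4f}")
```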

Another possible way of handling categorical variables is $q$-way splitting instead of 2-way splitting (in which case we no longer have a binary tree), where $q$ is the number of levels, i.e. the number of different values the categorical variable can take. What might be the problem with this approach?




## 4. Tree induction with and without one-hot encoding
In this section you will first create a baseline on the above dataset by building a tree on all features except the categorical feature $c$. You will then compare this baseline against:

1. a tree built by one-hot encoding the categorical variable $c$

2. a tree built without one-hot encoding the categorical variable $c$ (i.e. handling $c$ directly by finding the best split using the approach detailed above).

*Suggestion*: use 60% data to fully grow the tree, use 20% for pruning (using greedy pruning as mentioned in the lecture) and the rest 20% as a test set for performance reporting.
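
One way to produce the suggested 60/20/20 partition is sketched below with NumPy only. The function name `split_60_20_20`, the seed, and the stand-in DataFrame are assumptions for illustration; any shuffling scheme with the same proportions would do.

```python
import numpy as np
import pandas as pd

def split_60_20_20(df, seed=0):
    """Shuffle rows and return (train, prune, test) frames in a 60/20/20 ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_train = len(df) * 6 // 10
    n_prune = len(df) * 2 // 10
    train = df.iloc[idx[:n_train]]
    prune = df.iloc[idx[n_train:n_train + n_prune]]
    test = df.iloc[idx[n_train + n_prune:]]
    return train, prune, test

# Hypothetical stand-in for the homework DataFrame.
demo = pd.DataFrame({"y": np.arange(10) % 2, "z": np.linspace(0, 20, 10)})
train, prune, test = split_60_20_20(demo)
print(len(train), len(prune), len(test))
```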

Also make sure to print the list of features sorted by their feature importance after tree induction. Feature importance is simply *the total reduction in impurity contributed by splits on that feature while building the tree*.
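
If you use scikit-learn for the baseline, a sorted importance listing might look like the sketch below. It assumes `DecisionTreeClassifier` with the Gini criterion and uses hypothetical toy data in place of the homework file; note that scikit-learn reports importances normalized to sum to 1, and that the pruning step from the lecture is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: y depends only on z; x0 is noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"z": rng.uniform(0, 20, 500), "x0": rng.normal(0, 10, 500)})
y = (X["z"] > 10).astype(int)

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# feature_importances_ holds the normalized total impurity reduction per feature.
importance = pd.Series(tree.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```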

Why do you notice such a huge difference in performance? Also, how does the categorical feature $c$ (or its one-hot-encoded indicator variables) rank in terms of feature importance in both cases?


## 5. Analysis: Why did one-hot encoding give such poor performance?
One-hot encoding transforms a categorical variable into a number of independent binary variables. This
generates sparsity and greatly weakens any predictive power the categorical variable may have.

Binary variables have very few degrees of freedom compared to real-valued variables to begin with, since they can only be split in one way: all samples with value 0 for that variable in one bucket and the remaining samples in the other bucket.
If categorical variables were handled directly, however, samples could be split in $\frac{2^q - 2}{2}$ ways, as we saw above. Real-valued variables, on the other hand, owing to the total ordering of their values, can be split at any threshold.

To see why transforming a categorical variable into a number of independent binary variables weakens its predictive power, let's
consider the following scenario:

Say we have a categorical variable with 100 levels (which we one-hot encode to produce 100 independent boolean variables) and say the samples are distributed uniformly across the levels. This means that, *at best*, splitting the samples on one of the induced binary variables can reduce impurity by only about 1%. (Think about why that is the case!) This is insignificant, and hence such induced variables struggle to be picked high up in the tree during tree induction.
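
The 1% figure can be checked with a short calculation. The sketch below assumes Gini impurity, balanced classes, and 10000 samples spread uniformly over 100 levels, and takes the best case in which the isolated level is perfectly pure.

```python
def gini(pos, total):
    """Gini impurity for a node with `pos` positives out of `total` samples."""
    p = pos / total
    return 2.0 * p * (1.0 - p)

n, q = 10_000, 100   # assumed sample count and number of levels
per_level = n // q   # uniform distribution: 100 samples per level
pos = n // 2         # assumed balanced classes

parent = gini(pos, n)
# Best case for one indicator: the isolated level is pure (all positive).
left = gini(per_level, per_level)
right = gini(pos - per_level, n - per_level)
child = (per_level * left + (n - per_level) * right) / n

relative_reduction = (parent - child) / parent
print(f"relative impurity reduction: {relative_reduction:.4f}")
```

Even in this best case, the split separates only 1 in 100 samples, so the bulk of the data stays in a node that is almost as impure as before.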

The take-home message from this exercise is two-fold:
1. One-hot encoding a categorical variable with many levels hurts tree performance. Trees can handle categorical variables directly in a natural way, and that approach should be
favored instead.
2. A large number of levels in a categorical variable makes trees susceptible to overfitting and might require strong regularization.


#### References:
[1] [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)