# Decision Trees

import matplotlib.pyplot as plt
import numpy as np

## How is a Decision Tree fit?

* Which variables should the tree split on?
* How do we choose the threshold for each split?
* When do we stop growing the tree?

**The key idea is to split on the feature and threshold that give the child nodes the lowest "impurity", so that the tree reaches a decision as quickly as possible, with the smallest possible height.**

One way to measure impurity is the **Gini index** (another is entropy, which measures a very similar thing), given by the following formula:

$$ I_{G} = 1 - \sum_{i=1}^{c} p_{i}^{2} $$

where $c$ is the number of classes and $p_{i}$ is the proportion of samples in the node that belong to class $i$.  For example, let's say our X is <code>[[2],[3],[10],[19]]</code> and y is <code>[0, 0, 1, 1]</code>.  That is, the node has 4 samples: 2 samples are of class cancer and 2 samples are of class no cancer. The probability of each class is therefore $2/4 = 0.5$, and the squared probabilities that enter the formula are

$$p_{\text{cancer}}^{2}=(2/4)^2 = 0.25$$

$$p_{\text{no cancer}}^{2}=(2/4)^2 = 0.25$$

Thus the Gini index of this node is

$$ I_{G} = 1 - (0.25 + 0.25) = 0.5 $$
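The calculation above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `gini` is our own choice, and returning 0 for an empty node is an assumed convention, not something the formula itself defines.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum(p_i^2) of a 1-D collection of class labels."""
    y = np.asarray(y)
    if y.size == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    # counts[i] is the number of samples belonging to the i-th distinct class
    _, counts = np.unique(y, return_counts=True)
    p = counts / y.size
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # 0.5, the node from the example
print(gini([0, 0]))        # 0.0, a pure node
```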

Next we need to decide how best to split this node so that its children have the lowest Gini index (highest purity).

For example, if we split this sample with $x < 3$, we will get the left node X as <code>[[2]]</code> and y as <code>[0]</code>, and the right node X as <code>[[3],[10],[19]]</code> and y as <code>[0, 1, 1]</code>.  The weighted Gini of the children, where each child is weighted by its share of the samples, is

$$ \frac{1}{4} I_{G}^{\text{left}} + \frac{3}{4} I_{G}^{\text{right}} = \frac{1}{4} \left(1 - (1/1)^2\right) + \frac{3}{4} \left(1 - (1/3)^2 - (2/3)^2\right) \approx 0.33 $$

Hmm...but we know we can split better, right?  Let's try $x < 4$: we will get the left node X as <code>[[2],[3]]</code> and y as <code>[0, 0]</code>, and the right node X as <code>[[10],[19]]</code> and y as <code>[1, 1]</code>.  Both children are pure, so the weighted Gini is 0!

$$ \frac{2}{4} \left(1 - (2/2)^2\right) + \frac{2}{4} \left(1 - (2/2)^2\right) = 0 $$
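To check both splits numerically, here is a small sketch that computes the weighted Gini of the two children for a given threshold. The helper names `gini` and `weighted_gini` are hypothetical, and the code assumes a single feature (it flattens X), as in the toy example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a 1-D array of labels (0 for an empty node, by assumption)."""
    y = np.asarray(y)
    if y.size == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - np.sum((counts / y.size) ** 2)

def weighted_gini(X, y, threshold):
    """Weighted Gini of the children produced by the rule x < threshold."""
    x = np.asarray(X).ravel()  # assumes one feature only
    y = np.asarray(y)
    left = x < threshold       # boolean mask of samples sent to the left child
    n, n_left = y.size, left.sum()
    return (n_left / n) * gini(y[left]) + ((n - n_left) / n) * gini(y[~left])

X = [[2], [3], [10], [19]]
y = [0, 0, 1, 1]
print(round(weighted_gini(X, y, 3), 2))  # 0.33
print(weighted_gini(X, y, 4))            # 0.0
```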

Thus, in conclusion, splitting on $x<4$ is a much better split than $x<3$.  In practice, finding the best split is an exhaustive, greedy search: we iterate over every feature, treat every value of that feature as a candidate threshold, compute the weighted Gini index of the resulting children, and keep the split with the lowest value.
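The exhaustive search described above might look roughly like the following. This is a sketch under our own assumptions, not a definitive implementation: `find_best_split` is a hypothetical name, every observed value is used directly as a candidate threshold, and splits that leave one child empty are skipped.

```python
import numpy as np

def gini(y):
    """Gini impurity of a non-empty 1-D array of labels."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - np.sum((counts / y.size) ** 2)

def find_best_split(X, y):
    """Return (feature index, threshold, weighted Gini) of the best x < threshold split."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    best = (None, None, np.inf)
    for feature in range(n_features):
        # Try every observed value of this feature as a candidate threshold.
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] < threshold
            n_left = left.sum()
            if n_left == 0 or n_left == n_samples:
                continue  # assumed rule: ignore splits with an empty child
            score = (n_left * gini(y[left])
                     + (n_samples - n_left) * gini(y[~left])) / n_samples
            if score < best[2]:
                best = (feature, threshold, score)
    return best

feature, threshold, score = find_best_split([[2], [3], [10], [19]], [0, 0, 1, 1])
print(feature, threshold, score)
```

On the toy data this picks feature 0 with threshold 10, which produces the same two children as the $x < 4$ split discussed above.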

#### How do we find all thresholds for continuous values?

We can first sort the values.  Then we identify the critical values as the midpoints between all consecutive values.  For example, given X is <code>[[2],[3],[10],[19]]</code>, the critical values to compare are 2.5, 6.5 and 14.5.
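As a minimal sketch of this midpoint rule, assuming a single feature, `np.unique` can do the sorting (and drop duplicate values) in one step. The function name `candidate_thresholds` is hypothetical.

```python
import numpy as np

def candidate_thresholds(x):
    """Midpoints between consecutive sorted unique values of one feature."""
    values = np.unique(x)  # flattens, sorts, and removes duplicates
    return (values[:-1] + values[1:]) / 2.0

thresholds = candidate_thresholds([[2], [3], [10], [19]])
print(thresholds.tolist())  # [2.5, 6.5, 14.5]
```

Removing duplicates first means repeated values do not generate a threshold that separates nothing.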

The code can be implemented in several ways.  Examples are shown below:

## Let's implement!