# Supervised Learning
## Decision Trees

Author: Bingchen Wang

Last Updated: 19 Oct, 2022

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="./Supervised Learning.ipynb">Supervised Learning</a>
</nav>

---

%%html
<link rel='stylesheet' type='text/css' media='screen' href='../styles/custom.css'>

<section class = "section--outline">
    <div class = "outline--header">Outline </div>
    <div class = "outline--content">
        <b>Concepts:</b>
        <ul>
            <li> <a href = "#DTL">Decision tree learning</a>
            <li> <a href = "#RT">Regression trees</a>
            <li> <a href = "#CT">Classification trees</a>
                <ul>
                    <li> <a href = "#RA">Recursive Algorithm</a>
                </ul>
        </ul>
        <b>Implementation:</b>
        <ul>
            <li> <a href = "./Decision Trees/Sklearn Implementation.ipynb">Sklearn Implementation</a>
        </ul>
    </div>
</section>

<a name = "DTL"></a>
## Decision tree learning

<div class = "alert alert-block alert-info"><b>Key decisions:</b> 
<ol>
    <li> <b>(Splitting criterion)</b> How to choose which feature (and which value, for continuous features) to split on at each node? 
        <ul>
            <li> (regression) minimize sum of squares
            <li> (classification) minimize impurity
        </ul>
    <li> <b>(Stopping criterion)</b> When to stop splitting?
        <ul>
            <li> the tree is fully grown (cost reaches 0)
            <li> a certain number of terminal nodes is reached
            <li> the maximum tree depth is reached
            <li> the reduction in cost (information gain) from a further split falls below a certain threshold
            <li> the number of examples in a node falls below a certain threshold
        </ul>        
    <li> <b>(Optional pruning criterion)</b> How to prune the tree?
        <ul>
            <li> cost-complexity criterion
            <li> cross-validation
        </ul>
</ol>
</div>


### Model
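Consider a sample of $N$ examples $(x_i, y_i), \; i = 1, 2, \dots, N$. The feature space is partitioned into $M$ regions $R_1, R_2, \dots, R_M$, and the model predicts a constant in each region: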

$$
f(x) = \sum_{m=1}^M c_m I(x \in R_m)
$$
where $c_m$ is the predicted output in region $R_m$ and $I(\cdot)$ is the indicator function.
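
As an illustration of the piecewise-constant form, here is a minimal sketch for a one-dimensional input. The interval boundaries and the constants are made-up values, and representing each region $R_m$ as a half-open interval `[lo, hi)` is an assumption made only for this example.

```python
def predict(x, regions, constants):
    """Evaluate f(x) = sum_m c_m * I(x in R_m) for a scalar input x.

    Each region is assumed to be a half-open interval [lo, hi).
    """
    return sum(c * (lo <= x < hi) for (lo, hi), c in zip(regions, constants))


# Hypothetical partition of the real line into M = 3 regions
regions = [(float("-inf"), 2.0), (2.0, 5.0), (5.0, float("inf"))]
constants = [1.5, 3.0, 7.25]  # hypothetical c_1, c_2, c_3

for x in (1.0, 2.0, 100.0):
    print(x, predict(x, regions, constants))
```

Because the regions do not overlap, exactly one indicator is non-zero for any input, so the sum simply returns the constant of the region containing `x`.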

### Cost function
- sum of squares for regression trees
- misclassification rate, Gini index, cross-entropy for classification trees

### Modelling approach
- Finding the binary partition that minimizes the total cost over all possible trees is generally computationally infeasible. Hence proceed with a **greedy algorithm** that chooses the best split at one node at a time.
- Strategy:
    - First, grow a large tree $T_0$, stopping the splitting process only when some minimum node size is reached.
    - Then, prune the tree using **cost-complexity pruning**.
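
One greedy step can be sketched as follows for a regression tree: scan every feature and every candidate threshold, and keep the split with the smallest total sum of squares in the two children. The function names and the choice of midpoints between consecutive distinct values as candidate thresholds are assumptions made for this sketch, not something fixed by the notes above.

```python
def sum_of_squares(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(X, y):
    """Return (feature index, threshold, cost) of the best single split."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            cost = sum_of_squares(left) + sum_of_squares(right)
            if best is None or cost < best[2]:
                best = (j, s, cost)
    return best


# Toy data: the target depends only on the first feature
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 10.0], [4.0, 20.0]]
y = [1.0, 1.0, 5.0, 5.0]
print(best_split(X, y))
```

Growing the large tree $T_0$ amounts to applying this step recursively to each child until the minimum node size is reached.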

<a name = "RT"></a>
## Regression trees

### Notation
| Notation | Meaning |
| --- | --- |
| $T$ | a subtree obtained by pruning $T_0$|
|$\vert T \vert$| the number of terminal nodes of $T$|
|$\alpha$| the **tuning parameter** that governs the trade-off between tree size and goodness of fit| 

### The cost complexity criterion
$$
\begin{align}
N_m =& \#\{x_i \in R_m \} \\
\hat c_m =& \frac{1}{N_m} \sum_{x_i \in R_m} y_i \\
Q_m(T) =& \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2 \\
C_\alpha(T) =& \sum_{m=1}^{\vert T \vert} N_m Q_m(T) + \alpha \vert T\vert
\end{align}
$$
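
To see how $\alpha$ trades fit against size, the sketch below evaluates the criterion for two hypothetical subtrees. Each tree is represented only by the target values that fall into its terminal nodes, which is a simplification chosen for this example.

```python
def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|.

    `leaves` is a list with one list of target values per terminal node.
    """
    total = 0.0
    for ys in leaves:
        n_m = len(ys)
        c_m = sum(ys) / n_m                          # region mean
        q_m = sum((y - c_m) ** 2 for y in ys) / n_m  # within-node variance
        total += n_m * q_m
    return total + alpha * len(leaves)


# Hypothetical subtrees of the same T_0 (made-up target values)
big_tree = [[1.0, 1.0], [3.0], [5.0, 5.0]]   # |T| = 3, perfect fit
small_tree = [[1.0, 1.0, 3.0], [5.0, 5.0]]   # |T| = 2, two leaves merged

for alpha in (1.0, 4.0):
    print(alpha, cost_complexity(big_tree, alpha), cost_complexity(small_tree, alpha))
```

For small $\alpha$ the larger tree has the lower criterion value; once $\alpha$ exceeds the increase in the sum of squares caused by merging the two leaves, the smaller tree is preferred.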


<a name = "CT"></a>
## Classification trees

### Notation
| Notation | Meaning |
| --- | --- |
| $m$ | a terminal node|
|$k$| a class|

Proportion of observations that belong to class $k$ in region/node $m$:
$$
\hat p_{mk} =\frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)
$$

Output class for region $m$:
$$
k(m) = \arg\max_k \hat p_{mk}
$$
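
A small sketch of these two quantities for a single node, using made-up labels. The function names are illustrative, and ties in the arg max are broken in favour of the class encountered first, which is an arbitrary choice.

```python
from collections import Counter


def class_proportions(labels):
    """Return {k: p_hat_mk} for the labels that fall into one node."""
    n_m = len(labels)
    return {k: count / n_m for k, count in Counter(labels).items()}


def majority_class(labels):
    """Return k(m), the class with the largest proportion in the node."""
    p = class_proportions(labels)
    return max(p, key=p.get)


node_labels = ["cat", "dog", "cat", "cat"]  # hypothetical node contents
print(class_proportions(node_labels))
print(majority_class(node_labels))
```
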
### Node impurity measures for a terminal node/region
#### Misclassification error
$$
\frac{1}{N_m} \sum_{x_i \in R_m} I(y_i \neq k(m)) = 1 - \hat p_{mk(m)}
$$

#### Gini index
$$
\sum_{k = 1}^K \hat p_{mk} (1 - \hat p_{mk})
$$

#### Cross-entropy/deviance
$$
- \sum_{k = 1}^K \hat p_{mk} \log\hat p_{mk}
$$
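
The three measures can be compared on the same node with a short sketch. It assumes the natural logarithm for the cross-entropy and only sums over classes that actually occur in the node, which avoids evaluating $\log 0$.

```python
import math
from collections import Counter


def proportions(labels):
    """Class proportions p_hat_mk for the classes present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def misclassification_error(labels):
    return 1.0 - max(proportions(labels))


def gini_index(labels):
    return sum(p * (1.0 - p) for p in proportions(labels))


def cross_entropy(labels):
    return -sum(p * math.log(p) for p in proportions(labels))


balanced = [0, 1, 0, 1]  # hypothetical node with two equally frequent classes
pure = [1, 1, 1, 1]      # hypothetical node containing a single class

for node in (balanced, pure):
    print(misclassification_error(node), gini_index(node), cross_entropy(node))
```

All three measures are zero for a pure node and largest when the classes are equally frequent.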


### Information gain
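Denote the cross-entropy at a specific node $m$ as $H(m)$. Consider a split at a node $m^{\text{root}}$ into a left child $m^{\text{left}}$ and a right child $m^{\text{right}}$, where $w^{\text{left}}$ and $w^{\text{right}}$ are the fractions of the examples at $m^{\text{root}}$ that are sent to the left and right child, respectively. The information gain of the split is the resulting reduction in weighted cross-entropy:

$$
\text{Information gain} = H(m^{\text{root}}) - \left[w^{\text{left}} \cdot H(m^{\text{left}}) + w^{\text{right}} \cdot H(m^{\text{right}})\right]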
$$
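
The formula can be checked on a toy split with the sketch below. It assumes a base-2 logarithm, so the gain is measured in bits; the labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Cross-entropy H(m) of the labels at a node, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """H(parent) minus the size-weighted entropy of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))


parent = [1, 1, 0, 0]  # hypothetical labels at the node being split

# A split that separates the classes perfectly
print(information_gain(parent, [1, 1], [0, 0]))

# A split that leaves both children as mixed as the parent
print(information_gain(parent, [1, 0], [1, 0]))
```
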

<a name = "RA"></a>
<section class = "section--algorithm">
    <div class = "algorithm--header"> Recursive Algorithm</div>
    <div class = "algorithm--content">
        <ol>
            <li>Start with all examples at the root node. 
            <li>Calculate the information gain for every candidate feature, and pick the one with the highest information gain. 
            <li>Split the dataset according to the selected feature, creating a left subtree and a right subtree. 
            <li>Keep <b>repeating the above splitting process</b> on the subtrees until a stopping criterion is met (exhaust the left subtree first and then the right subtree).
        </ol>
        <div style = "text-align: center;">
            <img src="./images/DT_Recursive Algorithm.pdf" style="width:80%;" >
        </div>
    </div>
</section>
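
The steps above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not part of the notes: features are binary (0/1), the left branch receives the examples with value 1, and the stopping criteria are a pure node, a maximum depth, or no split with positive information gain. The nested-dictionary tree representation and the function names are likewise hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split(X, feature):
    """Indices of examples with feature value 1 (left) and 0 (right)."""
    left = [i for i, row in enumerate(X) if row[feature] == 1]
    right = [i for i, row in enumerate(X) if row[feature] == 0]
    return left, right


def information_gain(X, y, feature):
    left, right = split(X, feature)
    if not left or not right:
        return 0.0  # the split does not separate anything
    children = sum(len(idx) / len(y) * entropy([y[i] for i in idx])
                   for idx in (left, right))
    return entropy(y) - children


def build_tree(X, y, depth=0, max_depth=3):
    majority = Counter(y).most_common(1)[0][0]
    # Stopping criteria: pure node or maximum depth reached
    if len(set(y)) == 1 or depth == max_depth:
        return {"leaf": majority}
    gains = [information_gain(X, y, j) for j in range(len(X[0]))]
    best = max(range(len(gains)), key=gains.__getitem__)
    if gains[best] <= 0.0:  # no split improves on the current node
        return {"leaf": majority}
    left, right = split(X, best)
    # The left subtree is built completely before the right one
    return {
        "feature": best,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] == 1 else tree["right"]
    return tree["leaf"]


# Toy data: the label equals the first feature
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
tree = build_tree(X, y)
print(tree)
print(predict(tree, [1, 0]))
```

Because the algorithm is greedy, it only looks one split ahead; a feature that is informative only in combination with another one may never be selected.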