# Decision Trees and Random Forest fundamentals

Hypotheses: decision trees $f: X \rightarrow  Y$
- Each internal node tests an attribute $x_i$
- One branch for each possible attribute value $x_i = \nu$
- Each leaf assigns a class $y$
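
As a rough illustration of this hypothesis class, here is a minimal sketch of a tree as a data structure. The names `Leaf`, `Node`, `predict`, and `and_tree` are hypothetical and chosen only for this example; attributes are assumed to be categorical and stored in a dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Union


@dataclass
class Leaf:
    label: Hashable  # the class y assigned at this leaf


@dataclass
class Node:
    attribute: str  # name of the attribute x_i tested at this node
    # One branch (subtree) for each possible attribute value
    children: Dict[Hashable, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, Node]


def predict(tree, x):
    """Follow the branch matching x's attribute value until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.children[x[tree.attribute]]
    return tree.label


# Hypothetical tree representing the Boolean function y = a AND b
and_tree = Node("a", {0: Leaf(0), 1: Node("b", {0: Leaf(0), 1: Leaf(1)})})
print(predict(and_tree, {"a": 1, "b": 1}))  # 1
```

Each root-to-leaf path in `and_tree` corresponds to one or more rows of the truth table for `a AND b`.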

## What functions can be represented?

- Decision trees can represent any function of the input attributes
- For Boolean functions, path to leaf gives truth table row
- Could require exponentially many nodes

## Hypothesis space

![title](img/hypospace.png)

## Learning *simplest* decision tree is NP-hard

- Learning the simplest (smallest) decision tree is an NP-hard problem
- Resort to a greedy heuristic:
    - Start from empty decision tree
    - Split on **next best attribute**
    - Recurse

![title](img/dt-example.png)

![title](img/recursive-step.png)
![title](img/second-level.png)
![title](img/full-tree.png)


## Splitting: choosing a good attribute

Would we prefer to split on $X_1$ or $X_2$?



**Idea:** Use counts at leaves to define probability distributions, so we can measure uncertainty

## Measuring uncertainty

- Good split if we are more certain about classification after split
    - Deterministic good (all true or all false)
    - Uniform distribution bad
    - What about distributions in between?

## Entropy

Entropy $H(Y)$ for a random variable $Y$
$$
H(Y) = -\sum_{i=1}^k P(Y=y_i)\log_{2}P(Y=y_i)
$$

#### More uncertainty, more entropy!

## High, Low Entropy

- "High Entropy"
    - Y is from a uniform-like distribution
    - Flat histogram
    - Values sampled from it are less predictable

- "Low Entropy"
    - Y is from a varied (peaks and valleys) distribution
    - Histogram has many lows and highs
    - Values sampled from it are more predictable

#### Example:
$$
P(Y=T) = \frac{5}{6}
$$
$$
P(Y=F) = \frac{1}{6}
$$
$$
H(Y) = - \frac{5}{6} \log_{2}\frac{5}{6} - \frac{1}{6}\log_{2}\frac{1}{6} \approx 0.65
$$
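
The example can be checked numerically. This is a small sketch, assuming the distribution is passed as a list of probabilities; the function name `entropy` is chosen here for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(round(entropy([5 / 6, 1 / 6]), 2))  # 0.65
print(entropy([0.5, 0.5]))                # 1.0 (uniform over two values)
```

The uniform two-class distribution reaches the maximum of 1 bit, while the skewed example is more predictable and so has lower entropy.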

## Conditional Entropy

Conditional Entropy $H(Y|X)$ of a random variable $Y$ conditioned on a random variable $X$
$$
H(Y|X) = -\sum_{j=1}^v P(X=x_j)\sum_{i=1}^k P(Y=y_i|X=x_j)\log_{2}P(Y=y_i|X=x_j)
$$

#### Example:
$$
P(Y=T|X_1 = T) = \frac{4}{4} = 1
$$
$$
P(Y=T|X_1 = F) = \frac{1}{2}
$$
$$
P(Y=F|X_1 = T) = \frac{0}{4} = 0
$$
$$
P(Y=F|X_1 = F) = \frac{1}{2}
$$
$$
P(X_1=T) = \frac{4}{6}
$$
$$
P(X_1=F) = \frac{2}{6}
$$
$$
H(Y|X_1) = - \frac{4}{6} \Bigg(1\log_{2}1 + 0\log_{2}0 \Bigg) - 
\frac{2}{6} \Bigg(\frac{1}{2}\log_{2}\frac{1}{2} +\frac{1}{2}\log_{2}\frac{1}{2} \Bigg) = \frac{2}{6} \approx 0.33
$$
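
The same quantity can be estimated directly from data. The sketch below assumes six records reconstructed from the counts in the example (four with $X_1 = T$, all labelled $T$; two with $X_1 = F$, one of each label). The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy_of_labels(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(xs, ys):
    """H(Y|X): entropy of Y within each group X = x_j, weighted by P(X = x_j)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    return sum(len(g) / n * entropy_of_labels(g) for g in groups.values())


# Records assumed from the counts in the example above
x1 = ["T", "T", "T", "T", "F", "F"]
y = ["T", "T", "T", "T", "T", "F"]
print(round(conditional_entropy(x1, y), 2))  # 0.33
```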

## Information gain
- Decrease in entropy (uncertainty) after splitting
$$
IG(X) = H(Y) - H(Y|X)
$$

In our running example:
$$
IG(X_1) = H(Y) - H(Y|X_1) \approx 0.65 - 0.33 = 0.32
$$
$$
IG(X_1) > 0 \rightarrow \text{ we prefer the split! }
$$
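
A minimal sketch of information gain computed from samples, using the same six records assumed from the running example. `information_gain` is an illustrative name, not a library function.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X), estimated from paired samples of X and Y."""
    n = len(ys)
    h_cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(ys) - h_cond


x1 = ["T", "T", "T", "T", "F", "F"]
y = ["T", "T", "T", "T", "T", "F"]
print(round(information_gain(x1, y), 2))  # 0.32
```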

## Learning decision trees
- Start from empty decision tree
- Split on **next best attribute (feature)**
    - Use, for example, information gain to select attribute:
$$
    \arg\max_i IG(X_i) = \arg\max_i (H(Y) - H(Y|X_i))
$$
- Recurse

## When to stop?

#### Base Case 1
- Don't split a node if all matching records have the same output values 
    - e.g. when cylinders = 6 there are 8 bad  and 0 good

#### Base Case 2
- Don't split a node if data points are identical on remaining attributes
    - e.g. when acceleration = medium there is 1 good and 1 bad

#### Proposed Base Case 3
- If all attributes have small information gain then **don't recurse**
    - **This is not a good idea**
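
Putting the greedy procedure together with Base Cases 1 and 2 (and deliberately leaving out proposed Base Case 3), a tree learner might be sketched as follows. This assumes categorical attributes stored in dicts; all function names and the nested-dict tree layout are hypothetical choices for this example.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    # Base Case 1: every matching record has the same output value.
    if len(set(labels)) == 1:
        return majority
    # Base Case 2: records are identical on all remaining attributes.
    if all(r[a] == rows[0][a] for r in rows for a in attrs):
        return majority
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {"attr": best, "children": {}}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree["children"][v] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return tree


def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# y = a XOR b: every attribute has zero gain at the root, yet the tree is learnable
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [0, 1, 1, 0]
tree = build_tree(rows, labels, ["a", "b"])
print([predict(tree, r) for r in rows])  # [0, 1, 1, 0]
```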

## The problem with proposed case 3
y = a XOR b

|a|b|y|
|-|-|-|
|0|0|0|
|0|1|1|
|1|0|1|
|1|1|0|

![title](img/info-gain.png)

#### If we omit proposed case 3:

![title](img/result-tree.png)

Instead, perform **pruning** after building a tree
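
To see the XOR problem in numbers, the sketch below computes the information gain of each attribute at the root and then the gain of $b$ after splitting on $a$. Rows are assumed to be `(a, b)` tuples and the helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr_index):
    n = len(labels)
    remainder = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr_index] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]  # y = a XOR b

# At the root, neither attribute looks useful on its own
print(info_gain(rows, labels, 0), info_gain(rows, labels, 1))  # 0.0 0.0

# After splitting on a (here the branch a = 0), b becomes fully informative
left_rows = [r for r in rows if r[0] == 0]
left_labels = [y for r, y in zip(rows, labels) if r[0] == 0]
print(info_gain(left_rows, left_labels, 1))  # 1.0
```

A rule that stops whenever all gains are small would halt at the root and never discover the second-level split.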


## Decision trees will overfit

- Standard decision trees have no learning bias
    - Training set error is always zero!
        - (if there is no label noise)
    - Lots of variance
    - Must introduce some bias towards simpler trees

- Many strategies for picking simpler trees
    - Fixed depth
    - Minimum number of samples per leaf

- Random forests

## Real-Valued inputs
What should we do if some of the inputs are real-valued?

#### Threshold splits
- Binary tree: split on attribute X at value $t$
    - One branch: $X < t$
    - Other branch: $X \geq t$

- Requires small change
    - Allow repeated splits on same variable along a path

![title](img/threshold.png)

#### The set of possible thresholds
- Only a finite number of t's are important:
    - Sort data according to X into $\{ x_1, x_2,..., x_m \}$
    - Consider split points of the form $x_i + \frac{1}{2}(x_{i+1} - x_i)$
    - Moreover, only splits between examples of different classes matter!
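
The candidate set described above can be generated in a few lines. This is a sketch that assumes the values of X are distinct (ties with mixed labels would need extra care); `candidate_thresholds` is a hypothetical name.

```python
def candidate_thresholds(xs, ys):
    """Midpoints between consecutive sorted values whose class labels differ."""
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    thresholds = []
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if x0 != x1 and y0 != y1:
            thresholds.append(x0 + 0.5 * (x1 - x0))
    return thresholds


print(candidate_thresholds([3, 1, 4, 2], ["B", "A", "B", "A"]))  # [2.5]
```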

## Picking the best threshold

- Suppose X is real valued with threshold $t$
- Want $IG(Y|X:t)$, the information gain for Y when testing if X is greater than or less than $t$
- Define:
    - $H(Y|X:t) = p(X < t)H(Y|X<t) + p(X \geq t)H(Y|X \geq t)$
    - $IG(Y|X:t) = H(Y)-H(Y|X:t)$
    - $IG^*(Y|X) = \max_t IG(Y|X:t)$   
- Use: $IG^*(Y|X)$ for continuous variables
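
A minimal sketch of the search for $IG^*(Y|X)$, trying every midpoint between consecutive distinct values of X. The function name and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(xs, ys):
    """Return (t, IG(Y|X:t)) for the threshold t that maximizes information gain."""
    n = len(ys)
    h_y = entropy(ys)
    values = sorted(set(xs))
    best_t, best_gain = None, None
    for lo, hi in zip(values, values[1:]):
        t = lo + 0.5 * (hi - lo)
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        # H(Y|X:t) = p(X < t) H(Y|X < t) + p(X >= t) H(Y|X >= t)
        h_cond = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        gain = h_y - h_cond
        if best_gain is None or gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


t, gain = best_threshold([1.0, 2.0, 3.0, 4.0], ["bad", "bad", "good", "good"])
print(t, gain)  # 2.5 1.0
```

If X takes a single value there is nothing to split on, and this sketch returns `(None, None)`.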

## What to know about decision trees

- Decision trees are one of the most popular ML tools
    - Easy to understand, implement, and use
    - Computationally cheap (to solve heuristically)

- Information gain to select attributes
- Presented for classification, can be used for [regression](https://www.saedsayad.com/decision_tree_reg.htm) and density estimation too
- Decision trees will overfit!!!
    - Must use tricks to find "simple trees" e.g.
        - Fixed depth/Early stopping
        - Pruning
    - Or, use ensembles of different trees (random forests)

# Ensemble Learning

## Reduce Variance Without Increasing Bias

- Averaging reduces variance:
$$
Var(\bar{X}) = \frac{Var(X)}{N} \text{ (when predictions are independent) }
$$
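
The variance formula can be checked with a small simulation. This sketch assumes each "prediction" is an independent draw from a standard normal distribution, so $Var(X) = 1$; the function name, trial count, and seed are arbitrary choices.

```python
import random
import statistics


def variance_of_mean(n_models, trials=5000, seed=0):
    """Empirical variance of the average of n_models independent N(0, 1) draws."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n_models)) / n_models
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


for n in (1, 10):
    # Expect values near 1/n, up to sampling noise
    print(n, variance_of_mean(n))
```

With correlated predictions the reduction is smaller, which is why the independence condition matters.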

Average models to reduce model variance.

One problem:
- There is only one training set
- Where do multiple models come from?

## Bagging: Bootstrap Aggregation

- Take repeated **bootstrap samples** from training set D
- *Bootstrap sampling*: Given set D containing N training examples, create D' by drawing N examples
at random **with replacement** from D.

- Bagging:
    - Create k bootstrap samples $D_1, ..., D_k$
    - Train distinct classifiers on each $D_r$
    - Classify new instance by majority vote/average
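
The bagging steps above can be sketched as follows. The base learner here is a hypothetical one-dimensional stump that thresholds at the sample mean; it stands in for any classifier, and all names and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, train, k, seed=0):
    """Train one model on each of k bootstrap samples D_1, ..., D_k."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(k)]


def bagging_predict(models, x):
    """Classify x by majority vote over the ensemble."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


def train_stump(sample):
    """Hypothetical base learner: threshold a single feature at the sample mean."""
    t = sum(x for x, _ in sample) / len(sample)
    above = [y for x, y in sample if x >= t]
    below = [y for x, y in sample if x < t]
    hi = Counter(above).most_common(1)[0][0]
    lo = Counter(below).most_common(1)[0][0] if below else hi
    return lambda x: hi if x >= t else lo


data = [(0.0, "F"), (1.0, "F"), (2.0, "F"), (8.0, "T"), (9.0, "T"), (10.0, "T")]
models = bagging_fit(data, train_stump, k=25)
high = bagging_predict(models, 9.5)
low = bagging_predict(models, 0.5)
```

For regression, the vote in `bagging_predict` would be replaced by an average of the individual predictions.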


## Random Forests

- Ensemble method specifically designed for decision tree classifiers
- Introduce two sources of randomness: "Bagging" and "Random input vectors"
    - Bagging method: each tree is grown using a bootstrap sample of training data
    - Random vector method: **At each node**, best split is chosen from a random sample of
    m attributes instead of all attributes

## Random Forests Algorithm

1. For $b=1, ...,B:$
    - Draw a bootstrap sample **$Z^*$** of size N from the training data.
    - Grow a random-forest tree $T_b$ to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size $n_{min}$ is reached.
        - Select m variables at random from the p variables
        - Pick the best variable/split-point among the m
        - Split the node into two daughter nodes
2. Output the ensemble of trees $\{ T_b \}_1^B$

To make a prediction at a new point x:

#### Regression:
$$
\hat{f}_{rf}^B (x) = \frac{1}{B}\sum_{b=1}^B T_b(x)
$$
#### Classification:

Let $\hat{C}_b (x)$ be the class prediction of the b-th random-forest tree. Then 
$$
\hat{C}_{rf}^B (x) = \text{majority vote } \{\hat{C}_b (x)\}_1^B
$$
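
The algorithm and the classification rule can be combined into a compact sketch. It assumes real-valued features with threshold splits and uses weighted entropy to pick the best split, which is one possible criterion and not something the algorithm above prescribes. Stopping at pure nodes is an added simplification, and all names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(X, y, features):
    """Best (feature, threshold) among the given features, or None if none can split."""
    best, best_score = None, float("inf")
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = lo + 0.5 * (hi - lo)
            left = [lab for row, lab in zip(X, y) if row[f] < t]
            right = [lab for row, lab in zip(X, y) if row[f] >= t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, m, n_min, rng):
    majority = Counter(y).most_common(1)[0][0]
    # Stop at the minimum node size (or at a pure node, an added simplification)
    if len(y) <= n_min or len(set(y)) == 1:
        return majority
    features = rng.sample(range(len(X[0])), m)  # m of the p variables, redrawn per node
    split = best_split(X, y, features)
    if split is None:  # the sampled features are constant in this node
        return majority
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] < t]
    right = [i for i, row in enumerate(X) if row[f] >= t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, n_min, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, n_min, rng))


def tree_predict(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree


def random_forest(X, y, B, m, n_min=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample Z*
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, n_min, rng))
    return forest


def forest_predict(forest, x):
    """Majority vote over the class predictions of the B trees."""
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]


X = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (8.0, 9.0), (9.0, 8.0), (8.5, 8.5)]
y = ["F", "F", "F", "T", "T", "T"]
forest = random_forest(X, y, B=25, m=1)
pred_high = forest_predict(forest, (9.0, 9.0))
pred_low = forest_predict(forest, (1.0, 1.0))
```

For regression, `forest_predict` would instead return the mean of the tree outputs, matching the regression formula above.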

![title](img/rf-flow.png)