## Decision Tree Models

A decision tree is a supervised learning method, suitable for 
classification and regression, which constructs an
IF $\rightarrow$ THEN rule set (a disjunction of conjunctions) inferred 
from the features of the input data. 

### Advantages and Disadvantages

| Advantages | Disadvantages |
|:-|:-|
| Simple to interpret | Often overfit |
| Low prediction cost, $O(\log(n))$ in the number of training points | Sensitive to data bias |
| Handle numerical and categorical data | Unstable under small variations in the data |

Notes

- Decision tree algorithms are greedy, because searching all possible trees is computationally infeasible.
- Many methods exist to reduce the disadvantages, such as pruning and ensemble methods like random forest.

## CART algorithm

The CART (classification and regression trees) algorithm produces 
a classification tree if the targets are categorical and a regression 
tree if they are numeric. All resulting trees are binary.

In general the problem consists of $n$ training vectors
$\boldsymbol{x}_{i}\in\mathbb{R}^{d}$, $i=1,\ldots,n$, and a target vector $\boldsymbol{y}\in\mathbb{R}^{n}$, where the entries of $\boldsymbol{y}$
may be continuous values (for regression) or instances of $c$ classes in
$\{0,1,\ldots,c-1\}$ (for classification). 

The tree splits the data at node $m$, $D_{m}$, using an impurity measure 
$H$. For each candidate split $\theta = (j,t_{m})$, consisting of a 
feature index $j$ and a threshold $t_{m}$, the sets
\begin{align*}
    D_{l}(\theta) &= \{(\boldsymbol{x},y)\in D_{m}:x_{j} \leq t_{m}\}\\
    D_{r}(\theta) &= D_{m} \backslash D_{l}(\theta)
\end{align*}

are formed and the weighted impurity of the split, $G$, is calculated, where 
\begin{align*}
    G(D_{m},\theta) &= \frac{|D_{l}(\theta)|}{|D_{m}|}H(D_{l}(\theta))+\frac{|D_{r}(\theta)|}{|D_{m}|}H(D_{r}(\theta))
\end{align*}

Finally, the best split, $\Theta$, is the one that minimizes $G$ (equivalently, the one that maximizes the impurity reduction $H(D_{m}) - G(D_{m},\theta)$):

\begin{align*}
        \Theta &= \text{argmin}_{\theta}\,G(D_{m},\theta)
\end{align*}
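
The split search can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by DecisionTree.jl or ScikitLearn.jl: it assumes the Gini index as $H$, uses the observed feature values as candidate thresholds, and takes the best split to be the one with the lowest weighted impurity of the two children. The function names (`gini`, `weighted_impurity`, `best_split`) and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def weighted_impurity(rows, labels, j, t, impurity=gini):
    """Weighted child impurity for the candidate split theta = (j, t)."""
    left = [y for x, y in zip(rows, labels) if x[j] <= t]
    right = [y for x, y in zip(rows, labels) if x[j] > t]
    n = len(labels)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


def best_split(rows, labels, impurity=gini):
    """Exhaustive search over features and observed thresholds.

    Returns (impurity, feature index, threshold), or None if every
    feature is constant. The largest value of each feature is skipped
    because it would leave the right child empty.
    """
    best = None
    for j in range(len(rows[0])):
        for t in sorted({x[j] for x in rows})[:-1]:
            g = weighted_impurity(rows, labels, j, t, impurity)
            if best is None or g < best[0]:
                best = (g, j, t)
    return best


# Toy data (assumed): feature 0 separates the two classes, feature 1 does not.
rows = [[2.0, 1.0], [3.0, 1.5], [6.0, 0.5], [7.0, 2.0]]
labels = [0, 0, 1, 1]
g, j, t = best_split(rows, labels)
print(f"best split: feature {j} <= {t}, weighted impurity {g}")
```

On this toy data the search selects feature 0 with threshold 3.0, which leaves both children pure.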

### Classification 

In classification, $\boldsymbol{y}$ has entries in $\{0,1,\ldots,c-1\}$, and 
we define the proportion of observations at node $m$ with class $k$ by
\begin{align*}
    p_{mk} = \frac{1}{|D_{m}|}\sum_{\boldsymbol{x}_{i}\in D_{m}}\delta_{k,y_{i}}
\end{align*}

where $y_{i}$ is the $i$-th entry of $\boldsymbol{y}$ and $\delta$ is the Kronecker delta.

Then common impurity measures for classification include:

Entropy: $H(D_{m}) = -\sum_{k}p_{mk}\log_{b}(p_{mk})$.

Gini index: $H(D_{m}) = \sum_{k}p_{mk}(1-p_{mk})$.
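
As a quick check of the two measures, the following Python sketch computes the class proportions $p_{mk}$ for the labels at a node and evaluates both formulas. The function names and the example labels are assumptions made for illustration; the logarithm base defaults to 2, so entropy is measured in bits.

```python
import math
from collections import Counter


def class_proportions(labels):
    """p_mk for every class k that occurs in the node."""
    n = len(labels)
    return {k: count / n for k, count in Counter(labels).items()}


def entropy(labels, base=2):
    """Entropy impurity; classes with p_mk = 0 never appear, so log(0) is avoided."""
    return -sum(p * math.log(p, base) for p in class_proportions(labels).values())


def gini(labels):
    """Gini index impurity."""
    return sum(p * (1 - p) for p in class_proportions(labels).values())


mixed = [0, 0, 1, 1]  # evenly mixed two-class node
pure = [1, 1, 1]      # node containing a single class

print("mixed:", entropy(mixed), gini(mixed))
print("pure: ", abs(entropy(pure)), gini(pure))
```

An evenly mixed two-class node gives the largest values (entropy of 1 bit, Gini index of 0.5), while a pure node gives zero for both measures.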

### Regression 

In regression, $\boldsymbol{y}$ has continuous entries, and a measure
such as the mean squared error is used for impurity, that is

\begin{align*}
    H(D_{m}) = \frac{1}{|D_{m}|}\sum_{y_{i}\in D_{m}}(y_{i}-\bar{y}_{m})^{2}
\end{align*}

where $\bar{y}_{m} = \frac{1}{|D_{m}|}\sum_{y_{i}\in D_{m}}y_{i}$ is the mean target value at node $m$.
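
A small numerical example may help here. The Python sketch below evaluates the mean squared error impurity for a node and for one hand-picked split of it; the function name `mse_impurity` and the target values are assumptions chosen so the arithmetic is easy to follow.

```python
def mse_impurity(targets):
    """Mean squared deviation of the targets from their node mean."""
    n = len(targets)
    if n == 0:
        return 0.0
    mean = sum(targets) / n
    return sum((y - mean) ** 2 for y in targets) / n


# Assumed targets at one node: three similar values and one outlier.
targets = [1.0, 2.0, 3.0, 10.0]
left, right = targets[:3], targets[3:]

parent = mse_impurity(targets)
children = (len(left) / len(targets)) * mse_impurity(left) \
    + (len(right) / len(targets)) * mse_impurity(right)

print("parent impurity:", parent)
print("weighted child impurity:", children)
```

Separating the outlier from the other three values reduces the impurity from 12.5 to about 0.5, which is why a regression tree would favour this split.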

### Julia Implementations

#### Libraries

- DecisionTree.jl https://github.com/bensadeghi/DecisionTree.jl
- ScikitLearn.jl https://github.com/cstjean/ScikitLearn.jl

### References

[1] The Elements of Statistical Learning (Ch 9.2)

## Random Forest

A random forest is a modified bagging method which constructs a large 
set of *de-correlated* decision tree models and then averages over
them.

Assuming a forest size of $N$ and input data $D$ with $n$ observations and $p$ variables, for each 
$b\in\{1,2,\ldots,N\}$: 

1. Draw a bootstrap sample $B$ of size $n$ from the training data $D$ (sampling with replacement).
1. Grow a tree $T_{b}$ on $B$ by repeating the following for each terminal node until the minimum node size $n_{min}$ is reached:
    1. Select $m$ variables at random from the $p$ variables.
    1. Pick the best split among the $m$ variables.
    1. Split the node into two child nodes.

When this is done for all $b$, output the forest $\{T_{b}\}_{b=1}^{N}$. 
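
The procedure above can be sketched in plain Python for the classification case. This is an illustrative toy, not how DecisionTree.jl or ScikitLearn.jl implement it: it assumes the Gini index as the split criterion, integer class labels, majority-class leaves, and a tree stored as nested tuples `(j, t, left, right)`. All function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values()) if n else 0.0


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def grow_tree(rows, labels, m, n_min, rng):
    """Grow one tree; a leaf is a class label, an inner node is (j, t, left, right)."""
    if len(labels) <= n_min or len(set(labels)) == 1:
        return majority(labels)
    best = None  # (weighted impurity, feature index, threshold)
    for j in rng.sample(range(len(rows[0])), m):  # m of the p variables
        # Skip the largest value: it would leave the right child empty.
        for t in sorted({x[j] for x in rows})[:-1]:
            left = [y for x, y in zip(rows, labels) if x[j] <= t]
            right = [y for x, y in zip(rows, labels) if x[j] > t]
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or g < best[0]:
                best = (g, j, t)
    if best is None:  # the selected variables are constant in this node
        return majority(labels)
    _, j, t = best
    li = [i for i, x in enumerate(rows) if x[j] <= t]
    ri = [i for i, x in enumerate(rows) if x[j] > t]
    return (j, t,
            grow_tree([rows[i] for i in li], [labels[i] for i in li], m, n_min, rng),
            grow_tree([rows[i] for i in ri], [labels[i] for i in ri], m, n_min, rng))


def predict_tree(tree, x):
    """Follow the splits until a leaf (a non-tuple class label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def random_forest(rows, labels, n_trees, m, n_min=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: n draws with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, n_min, rng))
    return forest


# Toy data (assumed): p = 2 variables, m = 1 variable tried per split.
rows = [[0.0, 5.0], [1.0, 4.0], [2.0, 6.0], [10.0, 5.0], [11.0, 4.0], [12.0, 6.0]]
labels = [0, 0, 0, 1, 1, 1]
forest = random_forest(rows, labels, n_trees=25, m=1)
print(len(forest), "trees grown")
```

Drawing the $m$ variables afresh at every node, rather than once per tree, is what keeps the individual trees from all choosing the same dominant splits.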


To predict a new point $x$, we average the trees in regression and take a majority vote in classification:

\begin{align*}
    \hat{f}(x) &= \frac{1}{N}\sum_{b=1}^{N}T_{b}(x) \\
    \hat{C}(x) &= \text{argmax}_{c}\sum_{b=1}^{N}\delta_{c,T_{b}(x)}
\end{align*}

where $\hat{f}$ and $\hat{C}$ are the estimated function and classification, respectively.
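
The two aggregation rules can be illustrated with a short Python sketch, taking the regression prediction as the average over the trees and the classification as the class with the most votes. The trees here are hypothetical stand-ins (plain callables returning a fixed value) rather than fitted models, and the function names are assumptions.

```python
from collections import Counter


def forest_regress(trees, x):
    """Average of the per-tree predictions T_b(x)."""
    return sum(tree(x) for tree in trees) / len(trees)


def forest_classify(trees, x):
    """Class receiving the most votes among the per-tree predictions T_b(x)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for fitted trees.
reg_trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]
clf_trees = [lambda x: 0, lambda x: 1, lambda x: 1]

x_new = [0.3, 1.7]
print("regression estimate:", forest_regress(reg_trees, x_new))
print("predicted class:", forest_classify(clf_trees, x_new))
```

With these stand-ins the regression estimate is the mean of 1, 2 and 6, and class 1 wins the vote two to one.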

### Julia Implementations

#### Libraries

- DecisionTree.jl https://github.com/bensadeghi/DecisionTree.jl
- ScikitLearn.jl https://github.com/cstjean/ScikitLearn.jl

### References 

[1] The Elements of Statistical Learning (Ch 15.1)