## Decision Tree Models

A decision tree is a supervised learning method, suitable for 
classification and regression, which constructs an
IF $\rightarrow$ THEN ruleset (a disjunction of conjunctions) inferred 
from the features of the input data. 

### Advantages and Disadvantages

| Advantages | Disadvantages |
|:-|:-|
| Simple to interpret | Often overfit |
| Low prediction cost, $O(\log(n))$ for a balanced tree | Biased when some classes dominate the data |
| Handle numerical and categorical data | Unstable under small variations in the data |

Notes

- Decision tree algorithms are greedy because searching all possible trees is computationally intractable.
- Many methods exist to mitigate these disadvantages, such as pruning and ensemble methods like random forests.

## CART algorithm

The CART (classification and regression trees) algorithm produces 
a classification tree if the targets are categorical and a regression 
tree if they are numeric. All resulting trees are binary.

In general the problem consists of $n$ training samples
$\boldsymbol{x}_{i}\in\mathbb{R}^{d}$, $i=1,\ldots,n$, and a target vector $\boldsymbol{y}\in\mathbb{R}^{n}$, where the entries of $\boldsymbol{y}$
may be continuous values (for regression) or instances of $c$ classes in
$\{0,1,\ldots,c-1\}$. 

The tree splits the data at node $m$, $D_{m}$, using an impurity measure 
$H$. That is, for each 
candidate split $\theta = (j,t_{m})$, consisting of a feature index $j$ and a threshold $t_{m}$, the sets
$$
\begin{align*}
    D_{l}(\theta) &= \{(x,y)\in D_{m}:x_{j} \leq t_{m}\}\\
    D_{r}(\theta) &= D_{m} \backslash D_{l}(\theta)
\end{align*}
$$
are formed and the weighted impurity of the split, $G$, is calculated, where 
$$
G(D_{m},\theta) = \frac{|D_{l}(\theta)|}{|D_{m}|}H(D_{l}(\theta))+\frac{|D_{r}(\theta)|}{|D_{m}|}H(D_{r}(\theta))
$$
Finally the best split, $\Theta$, is the one that minimises this impurity (equivalently, maximises the gain $H(D_{m})-G(D_{m},\theta)$):
$$
\begin{align*}
        \Theta &= \text{ argmin}_{\theta}G(D_{m},\theta)
\end{align*}
$$
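The split search can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the Gini index as the impurity $H$, tries every observed feature value except the largest as a threshold, and keeps the split with the lowest weighted child impurity. The names `gini` and `best_split` are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(X, y, impurity=gini):
    """Return (j, t, score) for the split x_j <= t with the lowest weighted impurity.

    X is a list of feature rows and y the matching list of targets.
    """
    n = len(y)
    best = None
    for j in range(len(X[0])):
        # Every observed value except the largest is a candidate threshold,
        # so both child nodes are guaranteed to be non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = len(left) / n * impurity(left) + len(right) / n * impurity(right)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
y = [0, 0, 1, 1]
j, t, score = best_split(X, y)
print(j, t, score)  # 0 3.0 0.0
```

On this toy data the split $x_{0} \leq 3$ yields two pure children, so its weighted impurity is zero and it is selected.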
### Classification 

In classification $\boldsymbol{y}$ has entries in $\{0,1,\ldots,c-1\}$. 
We define the proportion of observations at node $m$ with class $k$ by
$$
\begin{align*}
    p_{mk} = \frac{1}{|D_{m}|}\sum_{\boldsymbol{x}_{i}\in 
    D_{m}}\delta_{k,\boldsymbol{y}_{i}}
\end{align*}
$$
where $\delta$ is the Kronecker delta. Common impurity measures for classification include:

- Entropy: $H(D_{m}) = -\sum_{k}p_{mk}\log_{b}(p_{mk})$.
- Gini index: $H(D_{m}) = \sum_{k}p_{mk}(1-p_{mk})$.
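As a quick numerical check, both measures can be computed directly from the class proportions $p_{mk}$. The sketch below assumes the node's data is given as a plain list of class labels; the function names are illustrative.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p_mk of each class present at the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels, base=2):
    """Entropy impurity; classes with zero count contribute nothing."""
    return sum(-p * math.log(p, base) for p in class_proportions(labels))

def gini(labels):
    """Gini index impurity."""
    return sum(p * (1 - p) for p in class_proportions(labels))

balanced = [0, 0, 1, 1]
skewed = [0, 0, 0, 1]
pure = [1, 1, 1]

print(entropy(balanced), gini(balanced))  # 1.0 0.5
print(gini(skewed))                       # 0.375
print(entropy(pure), gini(pure))          # 0.0 0.0
```

Both measures are zero for a pure node and largest when the classes are evenly mixed.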

### Regression 

In regression $\boldsymbol{y}$ has continuous entries and a measure
such as the mean squared error is used for impurity, that is
$$
\begin{align*}
    H(D_{m}) = \frac{1}{|D_{m}|}\sum_{\boldsymbol{y}_{i}\in D_{m}}(\boldsymbol{y}_{i}-\bar{y}_{m})^{2}
\end{align*}
$$
where $\bar{y}_{m} = \frac{1}{|D_{m}|}\sum_{\boldsymbol{y}_{i}\in D_{m}}\boldsymbol{y}_{i}$ is the mean target value at node $m$.
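A small worked example of the mean squared error impurity, assuming the node's targets are given as a plain list of numbers. The function name `mse_impurity` and the data are made up for illustration.

```python
def mse_impurity(targets):
    """Mean squared deviation of the targets from their node mean."""
    n = len(targets)
    mean = sum(targets) / n
    return sum((y - mean) ** 2 for y in targets) / n

node = [1.0, 2.0, 3.0, 6.0]
left, right = [1.0, 2.0, 3.0], [6.0]

parent = mse_impurity(node)  # mean 3.0, squared deviations 4, 1, 0, 9
weighted = (len(left) / len(node)) * mse_impurity(left) \
         + (len(right) / len(node)) * mse_impurity(right)

print(parent)    # 3.5
print(weighted)  # approximately 0.5
```

Separating the outlier `6.0` from the rest reduces the weighted impurity from 3.5 to about 0.5, which is why a regression tree would favour this split.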

### Julia Implementations

#### Libraries

- DecisionTree.jl https://github.com/bensadeghi/DecisionTree.jl
- ScikitLearn.jl https://github.com/cstjean/ScikitLearn.jl

### References

[1] Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning (Ch 9.2)

## ID3 and C4.5 Algorithms

### ID3

The Iterative Dichotomiser 3 (ID3) algorithm builds a classification 
tree using the information gain (from entropy) as a splitting 
measure. Trees are not necessarily binary.

The process is similar to CART but at each node, $m$, of the tree we
calculate the information gain, $IG$, for each attribute not
already used in the tree. That is
$$
\begin{align*}
    IG(D_{m},A) &= H(D_{m}) - \sum_{i}\frac{|D_{m,u_{i}}|}{|D_{m}|}H(D_{m,u_{i}}) \\
    H(X) &= -\sum_{i=1}^{C}p_{i}\log_{2}p_{i}
\end{align*}
$$
where $A$ is an attribute with values $u$, $D_{m}$ is the data at 
node $m$, $D_{m,u_{i}}$ is the subset of $D_{m}$ where $A=u_{i}$, and
$p_{i}$ is the proportion of elements of class $i$ in $X$. The data
is then split on the attribute with the highest information gain
(largest entropy reduction), forming $|u|$ branches at node $m$.
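The attribute selection step can be illustrated with a short Python sketch. It assumes each observation is a dictionary mapping attribute names to categorical values. The toy dataset and the function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the node minus the weighted entropy of its branches."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)  # one branch per attribute value
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["stay", "stay", "play", "play"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)
print(gains)           # {'outlook': 1.0, 'windy': 0.0}
print(best_attribute)  # outlook
```

Here `outlook` separates the classes perfectly (gain of 1 bit) while `windy` carries no information, so ID3 would split on `outlook` and create one branch per value.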

### C4.5

C4.5 is a successor to ID3; the major improvement is the use of the
normalised information gain (gain ratio). At each node, $m$, the information gain
is calculated as above but is normalised by the intrinsic information
of the split, $\text{Isplit}$. That is
$$
\begin{align*}
\text{Isplit}(D_{m},A) &= -\sum_{i}\frac{|D_{m,u_{i}}|}{|D_{m}|}\log_{2}\frac{|D_{m,u_{i}}|}{|D_{m}|}\\
IG_{norm}(D_{m},A) &= \frac{IG(D_{m},A)}{\text{Isplit}(D_{m},A)}
\end{align*}
$$

Then the attribute with the highest normalised information gain is 
used for splitting. This reduces the ID3 
algorithm's bias towards attributes with many values (high branching 
factors in splits).
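The effect of the normalisation can be seen on a toy example. The sketch below compares an identifier-like attribute (one distinct value per row) with a two-valued attribute, assuming each attribute is given as a plain list of values. The data and the function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Entropy (base 2) of the empirical distribution of a list of values."""
    n = len(items)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(items).values())

def information_gain(values, labels):
    groups = defaultdict(list)
    for v, label in zip(values, labels):
        groups[v].append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(values, labels):
    # Isplit is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # a single-valued attribute cannot split the node
    return information_gain(values, labels) / split_info

labels = ["stay", "stay", "play", "play"]
row_id = [1, 2, 3, 4]                         # unique value per row
outlook = ["sunny", "sunny", "rain", "rain"]

print(information_gain(row_id, labels), information_gain(outlook, labels))  # 1.0 1.0
print(gain_ratio(row_id, labels), gain_ratio(outlook, labels))              # 0.5 1.0
```

Both attributes have the same raw information gain, but the identifier-like attribute is penalised by its larger intrinsic information (2 bits against 1 bit), so the gain ratio prefers `outlook`.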

Additionally, C4.5 implements thresholding of continuous attributes. 
For each continuous attribute this creates a large list of 
binary pseudo-attributes representing all possible threshold splits of the
continuous attribute.
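One way to picture the pseudo-attributes is sketched below. It assumes candidate thresholds are placed at the midpoints between consecutive distinct values, which is a common convention and not necessarily the exact rule used by C4.5. The function names are illustrative.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def binarise(values, threshold):
    """Binary pseudo-attribute: is the value at or below the threshold?"""
    return [v <= threshold for v in values]

temperature = [70, 65, 70, 80]
thresholds = candidate_thresholds(temperature)
print(thresholds)                          # [67.5, 75.0]
print(binarise(temperature, thresholds[0]))  # [False, True, False, False]
```

Each threshold yields one binary pseudo-attribute, which can then be scored with the normalised information gain like any categorical attribute.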


### References

[1] C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993

[2] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106

## Random Forest

A random forest is a modified bagging method which constructs a large 
set of *de-correlated* decision tree models and then averages over
them. Random forests may also be used to rank the 
importance of features and inform feature selection.

Assuming a forest size of $N$ and training data $D$ with $n$ observations and $p$ variables, for each 
$b\in\{1,2,\ldots,N\}$: 

1. Draw a bootstrap sample $B$ of size $n$ from the training data $D$ (sampling with replacement).
1. Grow a tree $T_{b}$ on $B$ by repeating the following at each terminal node until the minimum node size $n_{min}$ is reached:
    1. Select $m$ variables at random from $p$ variables
    1. Pick the best split among the $m$ variables.
    1. Split the node into two child nodes.

When this is done for all $b$, output the forest $\{T_{b}\}_{b=1}^{N}$. 
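The procedure above can be sketched compactly in Python for classification. This is a minimal illustration under several assumptions that are not part of the text: the Gini index is used as the impurity, trees are stored as nested dictionaries, and a node also becomes a leaf when it is pure or when the sampled variables are constant. All function names are illustrative.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def grow_tree(X, y, m, n_min, rng):
    """Grow a binary tree; leaves are {"leaf": class}, other nodes hold a split."""
    majority = Counter(y).most_common(1)[0][0]
    if len(y) <= n_min or len(set(y)) == 1:
        return {"leaf": majority}
    best = None
    # Step 2.1: consider only m randomly chosen variables at this node.
    for j in rng.sample(range(len(X[0])), m):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            # Step 2.2: keep the split with the lowest weighted impurity.
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # the sampled variables are all constant at this node
        return {"leaf": majority}
    _, j, t, left, right = best
    # Step 2.3: split the node into two child nodes and recurse.
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([X[i] for i in left], [y[i] for i in left], m, n_min, rng),
        "right": grow_tree([X[i] for i in right], [y[i] for i in right], m, n_min, rng),
    }

def grow_forest(X, y, n_trees, m, n_min=1, seed=0):
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of size n, drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, n_min, rng))
    return forest

def predict_tree(tree, x):
    """Follow the splits from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

X = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
y = ["a", "a", "b", "b"]
forest = grow_forest(X, y, n_trees=5, m=1)
votes = [predict_tree(tree, [1.5, 4.5]) for tree in forest]
```

Each entry of `votes` is the class predicted by one tree. Individual trees can disagree because each sees a different bootstrap sample and a different random subset of variables.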


For prediction of new data, $x$, we take the average of the tree outputs in regression
and the majority vote in classification:

$$
\begin{align*}
    \hat{f}(x) &= \frac{1}{N}\sum_{b=1}^{N}T_{b}(x) \\
    \hat{C}(x) &= \text{ argmax}_{c}\sum_{b=1}^{N}\delta_{c,T_{b}(x)}
\end{align*}
$$

where $\hat{f}$ and $\hat{C}$ are the estimated function and classification respectively.
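Both aggregation rules are simple to express once the individual tree outputs $T_{b}(x)$ are available. The sketch below assumes they have already been collected into a list; the function names are illustrative.

```python
from collections import Counter

def forest_regress(tree_outputs):
    """Average of the tree outputs (regression)."""
    return sum(tree_outputs) / len(tree_outputs)

def forest_classify(tree_outputs):
    """Most frequent class among the tree outputs (classification)."""
    return Counter(tree_outputs).most_common(1)[0][0]

print(forest_regress([2.0, 3.0, 4.0]))   # 3.0
print(forest_classify(["a", "b", "a"]))  # a
```

With this implementation a tie between classes is resolved in favour of the class that appears first in the list, which is an arbitrary choice.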

### Out of Bag Sampling

Out-of-bag (OOB) sampling refers to the way a random forest tests 
predictions on the training set. For each observation 
$d_{i} = (x_{i},y_{i})$ the prediction for $x_{i}$ is obtained
by aggregating only the trees trained on bootstrap samples in which
$d_{i}$ was absent. The resulting error estimate is almost identical to that of 
$K$-fold cross validation, but is computed in-line during training.
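The bookkeeping behind an OOB prediction can be sketched as follows for regression. The example assumes that the indices of each tree's bootstrap sample were stored during training. The constant "trees" below are hypothetical stand-ins used only to make the arithmetic easy to follow.

```python
def oob_prediction(i, x_i, trees, bags):
    """Average the outputs of the trees whose bootstrap sample did not contain i."""
    outputs = [tree(x_i) for tree, bag in zip(trees, bags) if i not in bag]
    if not outputs:  # observation i appeared in every bootstrap sample
        return None
    return sum(outputs) / len(outputs)

# Hypothetical stand-ins for three fitted trees and their bootstrap indices.
trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]
bags = [[0, 0, 1], [1, 2, 2], [0, 1, 1]]

print(oob_prediction(0, None, trees, bags))  # 2.0  (only the second tree is OOB)
print(oob_prediction(2, None, trees, bags))  # 3.5  (first and third trees are OOB)
print(oob_prediction(1, None, trees, bags))  # None (in every bag)
```

With realistic forest sizes almost every observation is out of bag for some trees, since a bootstrap sample of size $n$ leaves out roughly a third of the observations on average.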

### Feature Importance

Decision trees naturally rank features during construction. That is,
at each split the improvement in the splitting criterion (gain) is 
maximised; by accumulating this improvement for each split variable, variable importances can
be generated.

To get a more robust estimate of feature importances this concept is
applied over random forests using OOB sampling. For each tree in the forest, the
prediction accuracy on its OOB samples is recorded. The values
of the $j$th feature are then randomly permuted in the OOB samples and
the resulting decrease in accuracy is recorded. Finally, this
decrease is averaged
over all trees in the forest; the average is the importance of feature $j$.
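The permutation step for a single tree can be sketched as follows. The example assumes a hypothetical predictor that only looks at feature 0 and treats the given rows as that tree's OOB samples. For a forest the same quantity would be computed per tree and then averaged. The function names are illustrative.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the target."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, j, rng):
    """Decrease in accuracy after randomly permuting the values of feature j."""
    baseline = accuracy(predict, X, y)
    column = [row[j] for row in X]
    rng.shuffle(column)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return baseline - accuracy(predict, X_perm, y)

# Hypothetical stand-in for a fitted tree: it only uses feature 0.
predict = lambda x: int(x[0] > 0.5)

X_oob = [[0.1, 7.0], [0.9, 3.0], [0.2, 5.0], [0.8, 1.0]]
y_oob = [0, 1, 0, 1]
rng = random.Random(0)

importance_0 = permutation_importance(predict, X_oob, y_oob, 0, rng)
importance_1 = permutation_importance(predict, X_oob, y_oob, 1, rng)
print(importance_1)  # 0.0, the predictor never reads feature 1
```

Permuting an unused feature cannot change the predictions, so its importance is exactly zero, while permuting feature 0 can only lower the accuracy here because the baseline is perfect.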

#### Example Feature Importance Plot

<img src="resources/titanic_import.png">

### Julia Implementations

#### Libraries

- DecisionTree.jl https://github.com/bensadeghi/DecisionTree.jl
- ScikitLearn.jl https://github.com/cstjean/ScikitLearn.jl

### References 

[1] Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning (Ch 15.1)