# DecisionTree.jl

https://github.com/bensadeghi/DecisionTree.jl

# Package Summary

DecisionTree.jl implements decision tree classification and
regression (regression is selected automatically when the targets are floats). It also includes post-pruning, random forests (parallelized bagging) and
adaptive boosting.

K-fold cross validation is built in for decision trees, forests and
adaptive boosting models via *nfoldCV_tree(...)*, *nfoldCV_forest(...)* and *nfoldCV_stumps(...)*.

Additionally, a ScikitLearn.jl API is included, so models such as
*DecisionTreeClassifier* can be used through ScikitLearn.jl.

The main models share a similar usage pattern, e.g. training with *model = build_tree(...)*, prediction with *apply_tree(model, ...)* and
class probabilities with *apply_tree_proba(model, ...)*.

Performance is an issue: in the benchmarks below the package is slower than the Python (sklearn) implementations.

# Details

| Test                      | Results                           |
| :- | :- |
| Package works             | Yes                               |
| Deprecation warnings      | None                              |
| Compatible with JuliaDB   | If tables are converted to arrays |
| Contains documentation    | No, but many examples             |
| Simplicity                | Good, like sklearn                |

# Functions

build_tree
    
    Arguments: labels::Vector, features::Matrix, weights=[0];
               rng=Base.GLOBAL_RNG

print_tree

    Arguments: leaf::Leaf, depth=-1, indent=0


apply_tree
    
    Arguments: tree::Node, features::Vector
    
apply_tree_proba

    Arguments: tree::LeafOrNode, features::Matrix, labels

prune_tree

    Arguments: tree::LeafOrNode, purity_thresh=1.0
    
build_stump

    Arguments: labels::Vector, features::Matrix, weights=[0];
               rng=Base.GLOBAL_RNG

build_forest

    Arguments: labels::Vector, features::Matrix, 
               nsubfeatures::Integer, ntrees::Integer, 
               partialsampling=0.7, maxdepth=-1; rng=Base.GLOBAL_RNG

apply_forest

    Arguments: forest::Ensemble, features::Vector
    
apply_forest_proba

    Arguments: forest::Ensemble, features::Matrix, labels

build_adaboost_stumps

    Arguments: labels::Vector, features::Matrix, niterations::Integer; 
               rng=Base.GLOBAL_RNG

apply_adaboost_stumps
    
    Arguments: stumps::Ensemble, coeffs::Vector{Float64}, 
               features::Vector
               
apply_adaboost_stumps_proba

    Arguments: stumps::Ensemble, coeffs::Vector{Float64},
               features::Vector, labels::Vector

confusion_matrix

    Arguments: actual::Vector, predicted::Vector

nfoldCV_tree
    
    Arguments: labels::Vector, features::Matrix, pruning_purity::Real, 
               nfolds::Integer

nfoldCV_forest

    Arguments: labels::Vector, features::Matrix, 
               nsubfeatures::Integer, ntrees::Integer, 
               nfolds::Integer, partialsampling=0.7

nfoldCV_stumps

    Arguments: labels::Vector, features::Matrix, niterations::Integer,
               nfolds::Integer
               
majority_vote

    Arguments: labels::Vector
    
R2

    Arguments: actual, predicted

mean_squared_error

    Arguments: actual, predicted
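
The two regression metrics at the end of this list can be illustrated in plain Julia. The following is a minimal sketch of the standard definitions of mean squared error and the coefficient of determination, not the package's own code; the names `mse_sketch` and `r2_sketch` and the sample vectors are made up for this illustration.

```julia
using Statistics  # provides mean

# Mean squared error: the average of the squared residuals.
mse_sketch(actual, predicted) = mean((actual .- predicted) .^ 2)

# Coefficient of determination: 1 - SS_res / SS_tot.
function r2_sketch(actual, predicted)
    ss_res = sum((actual .- predicted) .^ 2)        # residual sum of squares
    ss_tot = sum((actual .- mean(actual)) .^ 2)     # total sum of squares
    return 1 - ss_res / ss_tot
end

# Hypothetical regression targets and predictions.
actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

println("MSE: ", mse_sketch(actual, predicted))
println("R2:  ", r2_sketch(actual, predicted))
```

A perfect prediction gives an MSE of 0 and an R2 of 1, while predicting the mean of `actual` everywhere gives an R2 of 0.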

# Example Code

Taken from the examples in the GitHub repository.

## 1 Building a decision tree with pruning 

    using DecisionTree
    include("load_titanic.jl")
    X_train, T_train, X_test, T_test = load()

    # train full-tree classifier
    model = build_tree(T_train, X_train)
    # prune tree: merge leaves having >= 70% combined purity (default: 100%)
    model = prune_tree(model, 0.7)
    # apply learned model
    println(apply_tree(model, X_test))
    # get the probability of each label
    println(apply_tree_proba(model, X_test, [0,1]))

Any[1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0.0 1.0; 0.96 0.04; 1.0 0.0; 0.0 1.0; 1.0 0.0; 0.214286 0.785714; 0.214286 0.785714; 1.0 0.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 0.0246914 0.975309; 1.0 0.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 0.0; 0.0246914 0.975309; 1.0 0.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 0.0; 1.0 0.0; 0.0246914 0.975309; 0.96 0.04; 0.0 1.0; 0.0 1.0; 1.0 0.0; 1.0 0.0; 1.0 0.0; 0.0 1.0; 0.3 0.7; 0.0 1.0; 0.8 0.2; 1.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 0.0; 1.0 0.0; 0.0246914 0.975309; 0.0246914 0.975309; 1.0 0.0; 1.0 0.0; 0.96 0.04; 0.96 0.04; 0.0246914 0.975309; 1.0 0.0; 1.0 0.0; 1.0 0.0; 0.75 0.25; 0.0246914 0.975309; 1.0 0.0; 0.0 1.0; 0.965517 0.0344828; 0.0246914 0.975309; 0.0 1.0; 1.0 0.0; 1.0 0.0; 0.0 1.0; 0.0246914 0.975309; 1.0 0.0; 0.833333 0.1
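
The pruning comment in the code above ("merge leaves having >= 70% combined purity") can be made concrete with a small sketch. It assumes that two sibling leaves are merged when the majority label accounts for at least the threshold fraction of their combined training labels. This illustrates the idea only and is not DecisionTree.jl's implementation; `majority_sketch`, `combined_purity` and the label vectors are hypothetical.

```julia
# Most frequent label in a vector; on ties, the label that appears first wins.
function majority_sketch(labels::Vector)
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    best, best_count = labels[1], 0
    for l in labels                      # input order keeps ties deterministic
        if counts[l] > best_count
            best, best_count = l, counts[l]
        end
    end
    return best
end

# Fraction of the combined labels that agree with their majority label.
function combined_purity(left::Vector, right::Vector)
    all_labels = vcat(left, right)
    majority = majority_sketch(all_labels)
    return count(==(majority), all_labels) / length(all_labels)
end

left  = [1, 1, 1, 1]          # a pure leaf predicting 1
right = [0, 1, 1, 0, 1, 1]    # a mixed sibling leaf
p = combined_purity(left, right)

# With a 0.7 threshold these two leaves would be merged into one leaf.
println(p >= 0.7 ? "merge" : "keep split")
```

With the default threshold of 1.0, only siblings whose combined labels are all identical would be merged under this assumption.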

## 2 K-fold cross validation

    # run n-fold cross validation for pruned tree,
    # using 90% purity threshold pruning, and 3 CV folds
    accuracy = nfoldCV_tree(T_train, X_train, 0.9, 3)


    Fold 1
    Classes:  [0, 1]
    Matrix:   
    2×2 Array{Int64,2}:
     86  29
     33  63

    Accuracy: 0.7061611374407583
    Kappa:    0.40547173241228857

    Fold 2
    Classes:  [0, 1]
    Matrix:   
    2×2 Array{Int64,2}:
     87  40
     20  64

    Accuracy: 0.7156398104265402
    Kappa:    0.4296269598125787

    Fold 3
    Classes:  [0, 1]
    Matrix:   
    2×2 Array{Int64,2}:
     102  30
      27  52

    Accuracy: 0.7298578199052133
    Kappa:    0.42769450392576736

    Mean Accuracy: 0.7172195892575038

    3-element Array{Float64,1}:
     0.706161
     0.71564 
     0.729858
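
The accuracy and kappa figures can be cross-checked against the confusion matrices. The sketch below assumes that the matrix with rows `86 29` and `33 63` belongs to fold 1 and that "Kappa" means Cohen's kappa. The function names `accuracy_sketch` and `kappa_sketch` are made up for this illustration and are not part of the package.

```julia
using LinearAlgebra  # provides tr and dot

# Accuracy: share of observations on the diagonal of the confusion matrix.
accuracy_sketch(cm::AbstractMatrix) = tr(cm) / sum(cm)

# Cohen's kappa: observed agreement corrected for chance agreement,
# kappa = (po - pe) / (1 - pe).
function kappa_sketch(cm::AbstractMatrix)
    n = sum(cm)
    po = tr(cm) / n                       # observed agreement
    row_totals = vec(sum(cm, dims=2))
    col_totals = vec(sum(cm, dims=1))
    pe = dot(row_totals, col_totals) / n^2   # agreement expected by chance
    return (po - pe) / (1 - pe)
end

# Confusion matrix assumed to belong to fold 1.
cm = [86 29; 33 63]

println("Accuracy: ", accuracy_sketch(cm))
println("Kappa:    ", kappa_sketch(cm))
```

Both values agree with the figures reported for fold 1 (149 of 211 observations lie on the diagonal), which supports reading the three matrices as belonging to folds 1, 2 and 3 in order.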

## 3 Model visualisation

    # pretty print of the tree, to a depth of 5 nodes (optional)
    print_tree(model, 5)

    Feature 3, Threshold 1.0
    L-> Feature 2, Threshold 3.0
        L-> Feature 7, Threshold 29.0
            L-> Feature 7, Threshold 28.7125
                L-> Feature 4, Threshold 24.0
                    L-> 1 : 14/14
                    R-> 
                R-> 0 : 1/1
            R-> 1 : 79/81
        R-> Feature 7, Threshold 21.075
            L-> Feature 4, Threshold 37.0
                L-> Feature 6, Threshold 2.0
                    L-> 
                    R-> 1 : 6/6
                R-> 0 : 4/4
            R-> Feature 1, Threshold 375.0
                L-> Feature 1, Threshold 185.0
                    L-> 0 : 5/6
                    R-> 1 : 2/2
                R-> 0 : 12/12
    R-> Feature 4, Threshold 10.0
        L-> Feature 5, Threshold 3.0
            L-> 1 : 17/17
            R-> 0 : 10/10
        R-> Feature 2, Threshold 2.0
            L-> Feature 1, Threshold 391.0
                L-> Feature 7, Threshold 66.6
                    L-> 
                    R-> 0 : 13/13
                R-> Feature 4, Threshold 58.0
                    L-> 
                    R-> 
            R-> Feature 1, T

# Benchmarking

A simple benchmark: training a random forest classifier with progressively more trees in each forest, and training a decision tree
on progressively more data instances.

## Julia Code

    # I would strongly suggest not running past n=4
    n = 3
    Time = zeros(n)
    for i = 1:n
        # element 2 of the @timed result is the elapsed time in seconds
        Time[i] = (@timed build_forest(T_train, X_train, 2, 10^i))[2]
    end
    print(Time)

    [0.0549153, 0.384804, 3.94297]

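    # past 5 takes a while
    n = 3
    Time = zeros(n)
    for i in 1:n
        # expand_data is a helper that is not shown here; it returns extra
        # rows (x) and labels (t) that are appended to the training set
        x, t = expand_data(X_train, T_train, 10^i)
        X = vcat(X_train, x)
        T = vcat(T_train, t)
        Time[i] = (@timed build_tree(T, X))[2]
    end
    Time

    3-element Array{Float64,1}:
     0.0203748
     0.0234457
     0.0639324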

## Results


### Random Forest

| Forest size (trees) | Julia   | Python |
|:------:|:-------:|:-------:|
| 10     | 0.0466s | 0.024s |
| 100    | 0.389s  | 0.118s |
| 1000   | 3.7s    | 1.04s  |
| 10000  | 37.8s   | 11s    |
| 100000 | 465s    | 110s   |

### Decision Tree

| Data instances | Julia   | Python   |
|:------:|:-------:|:-------:|
| 654     | 0.0136s | 0.00258s |
| 834     | 0.0238s | 0.00265s |
| 2634    | 0.0805s | 0.00502s |
| 20634   | 0.621s  | 0.0196s  |
| 200634  | 7.73s   | 0.180s   |
| 2000634 | 132s    | 2.92s    |


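In these benchmarks the Julia implementation is consistently slower than the
corresponding sklearn package: roughly 2 to 4 times slower for random forests, and up to about 45 times slower for decision trees on the largest data set.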
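
The size of the gap can be read off the two tables by dividing the Julia timings by the Python timings. The sketch below does this arithmetic on values copied from the tables. It assumes that the first Python decision tree timing is 0.00258s, and the variable names are made up for this illustration.

```julia
# Timings in seconds, copied from the random forest table.
forest_julia  = [0.0466, 0.389, 3.7, 37.8, 465.0]
forest_python = [0.024, 0.118, 1.04, 11.0, 110.0]

# Timings in seconds, copied from the decision tree table
# (first Python value assumed to be 0.00258).
tree_julia  = [0.0136, 0.0238, 0.0805, 0.621, 7.73, 132.0]
tree_python = [0.00258, 0.00265, 0.00502, 0.0196, 0.180, 2.92]

# Slowdown factor of Julia relative to Python for each row.
forest_ratio = forest_julia ./ forest_python
tree_ratio   = tree_julia ./ tree_python

println("Random forest slowdown: ", round.(forest_ratio, digits=1))
println("Decision tree slowdown: ", round.(tree_ratio, digits=1))
```

The random forest ratios stay within a narrow band, whereas the decision tree ratios increase with every row of the table.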