## Decision Tree Sample

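The sample data for the decision tree examples comes from the Rdatasets collection, and the models are built with the DecisionTree.jl package:

- Rdatasets repository: https://github.com/vincentarelbundock/Rdatasets
- Rdatasets dataset index: https://vincentarelbundock.github.io/Rdatasets/datasets.html
- DecisionTree.jl: https://github.com/bensadeghi/DecisionTree.jl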

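An example of the kind of input a decision tree classifies, a feature vector of ten attributes mapped to one of two labels:

X_1 = (Alternate, Bar, IsFriday, IsHungry, IsPatron, Price, IsRain, IsRES??, Type, EstWait) -> (WillWait / WontWait)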

### Classification Example

#### Load RDatasets and DecisionTree packages

In [115]:
Pkg.add("DecisionTree")
Pkg.add("RDatasets")

[1m[36mINFO: [39m[22m[36mPackage DecisionTree is already installed
[39m[1m[36mINFO: [39m[22m[36mMETADATA is out-of-date — you may not have the latest version of DecisionTree
[39m[1m[36mINFO: [39m[22m[36mUse `Pkg.update()` to get the latest versions of your packages
[39m[1m[36mINFO: [39m[22m[36mPackage RDatasets is already installed
[39m[1m[36mINFO: [39m[22m[36mMETADATA is out-of-date — you may not have the latest version of RDatasets
[39m[1m[36mINFO: [39m[22m[36mUse `Pkg.update()` to get the latest versions of your packages
[39m

In [116]:
using DecisionTree
using RDatasets

### Separate Fisher's Iris dataset features and labels

In [117]:
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4]);
labels = convert(Array, iris[:, 5]);

### Pruned Tree Classifier

In [118]:
# train full-tree classifier
model = build_tree(labels, features)
# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for pruned tree,
# using 90% purity threshold pruning, and 3 CV folds
accuracy = nfoldCV_tree(labels, features, 0.9, 3)

Feature 3, Threshold 3.0
L-> setosa : 50/50
R-> Feature 4, Threshold 1.8
    L-> Feature 3, Threshold 5.0
        L-> versicolor : 47/48
        R-> Feature 4, Threshold 1.6
            L-> virginica : 3/3
            R-> Feature 1, Threshold 7.2
                L-> versicolor : 2/2
                R-> virginica : 1/1
    R-> Feature 3, Threshold 4.9
        L-> Feature 1, Threshold 6.0
            L-> versicolor : 1/1
            R-> virginica : 2/2
        R-> virginica : 43/43

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 16   0   0
  0  14   3
  0   2  15
Accuracy: 0.9
Kappa:    0.8499399759903962

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 18   0   0
  1  18   1
  0   1  11
Accuracy: 0.94
Kappa:    0.9082007343941247

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 16   0   0
  0  11   2
  0   1  20
Accuracy: 0.94
Kappa:    0.9077490774907748

Mean Accuracy: 0.9266666666666666


3-element Array{Float64,1}:
 0.9 
 0.94
 0.94
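
The pruning comment above mentions merging leaves by "combined purity". As a rough sketch of that idea, assuming purity means the share of the most common label among the samples of two sibling leaves, the check could look like the following. `combined_purity` is a hypothetical helper written for illustration, not a DecisionTree.jl function, and the label counts are made up to resemble leaves in the printed tree.

```julia
# Hypothetical helper (not part of DecisionTree.jl): share of the most common
# label among the samples held by two sibling leaves.
function combined_purity(left_labels, right_labels)
    all_labels = vcat(left_labels, right_labels)
    counts = Dict{eltype(all_labels),Int}()
    for label in all_labels
        counts[label] = get(counts, label, 0) + 1
    end
    return maximum(values(counts)) / length(all_labels)
end

# Illustrative case 1: 47 versicolor samples next to 1 virginica sample.
purity_a = combined_purity(fill("versicolor", 47), ["virginica"])
merge_a = purity_a >= 0.9   # 47/48 clears a 0.9 threshold

# Illustrative case 2: 2 versicolor samples next to 1 virginica sample.
purity_b = combined_purity(fill("versicolor", 2), ["virginica"])
merge_b = purity_b >= 0.9   # 2/3 does not clear a 0.9 threshold

println("case 1: purity = ", purity_a, ", merge = ", merge_a)
println("case 2: purity = ", purity_b, ", merge = ", merge_b)
```

Under this reading, a threshold of 1.0 (the default named in the comment) would only merge siblings that agree on every sample, while lower thresholds trade a few misclassified training samples for a smaller tree.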

### Random Forest Classifier

In [119]:
# train random forest classifier
# using 2 random features, 10 trees, 0.5 portion of samples per tree (optional), and a maximum tree depth of 6 (optional)
model = build_forest(labels, features, 2, 10, 0.5, 6)
# apply learned model
apply_forest(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_forest_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for forests
# using 2 random features, 10 trees, 3 folds and 0.5 of samples per tree (optional)
accuracy = nfoldCV_forest(labels, features, 2, 10, 3, 0.5)

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 15   0   0
  0  17   2
  0   0  16
Accuracy: 0.96
Kappa:    0.9399038461538461

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 15   0   0
  0  19   1
  0   4  11
Accuracy: 0.9
Kappa:    0.8470948012232417

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 20  0   0
  0  9   2
  0  0  19
Accuracy: 0.96
Kappa:    0.9375780274656679

Mean Accuracy: 0.94


3-element Array{Float64,1}:
 0.96
 0.9 
 0.96
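
The per-fold accuracy and kappa values can be checked by hand from the confusion matrices. The sketch below uses the standard definitions of accuracy and Cohen's kappa on the first random forest fold (`[15 0 0; 0 17 2; 0 0 16]`). `accuracy_and_kappa` is a name introduced here for illustration, and that the package computes its figures exactly this way is an assumption.

```julia
# Confusion matrix of the first random forest fold, copied from the output above.
M = [15 0 0; 0 17 2; 0 0 16]

# Illustrative helper: accuracy and Cohen's kappa from a square confusion matrix.
function accuracy_and_kappa(M)
    n = sum(M)
    k = size(M, 1)
    # observed agreement: diagonal entries over all samples
    observed = sum(M[i, i] for i in 1:k) / n
    # chance agreement: sum over classes of (row total * column total) / n^2
    expected = sum(sum(M[i, :]) * sum(M[:, i]) for i in 1:k) / n^2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
end

acc, kappa = accuracy_and_kappa(M)
println("Accuracy: ", acc)
println("Kappa:    ", kappa)
```

For this matrix the observed agreement is 48/50 and the chance agreement is 836/2500, which gives a kappa of roughly 0.9399, in line with the reported fold 1 values.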

### Adaptive-Boosted Decision Stumps Classifier

In [120]:
# train adaptive-boosted stumps, using 7 iterations
model, coeffs = build_adaboost_stumps(labels, features, 7);
# apply learned model
apply_adaboost_stumps(model, coeffs, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_adaboost_stumps_proba(model, coeffs, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for boosted stumps, using 7 iterations and 3 folds
accuracy = nfoldCV_stumps(labels, features, 7, 3)

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 11  0   0
  2  4  11
  0  0  22
Accuracy: 0.74
Kappa:    0.5841330774152271

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 16   4   0
  0  13   2
  0   2  13
Accuracy: 0.84
Kappa:    0.7604790419161678

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 18   1   0
  0  17   1
  0   1  12
Accuracy: 0.94
Kappa:    0.9089253187613843

Mean Accuracy: 0.84


3-element Array{Float64,1}:
 0.74
 0.84
 0.94
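
`build_adaboost_stumps` returns both a model and a vector of coefficients. In the usual AdaBoost formulation, each stump votes for a label and the votes are weighted by those coefficients. The sketch below illustrates that weighted vote in plain Julia. The three stumps, their thresholds, and the weights are all made up for illustration, and that DecisionTree.jl combines its stumps exactly this way is an assumption.

```julia
# Hypothetical stand-ins for trained stumps: each thresholds a single feature.
stump1(x) = x[3] < 3.0 ? "setosa" : "virginica"
stump2(x) = x[4] < 1.8 ? "versicolor" : "virginica"
stump3(x) = x[1] < 6.0 ? "versicolor" : "virginica"

stumps = [stump1, stump2, stump3]
weights = [0.9, 0.6, 0.4]   # made-up coefficients, one per stump

# Sum the weight behind each predicted label and return the heaviest label.
function weighted_vote(stumps, weights, x)
    scores = Dict{String,Float64}()
    for (stump, w) in zip(stumps, weights)
        label = stump(x)
        scores[label] = get(scores, label, 0.0) + w
    end
    best_label, best_score = "", -Inf
    for (label, score) in scores
        if score > best_score
            best_label, best_score = label, score
        end
    end
    return best_label, scores
end

label, scores = weighted_vote(stumps, weights, [5.9, 3.0, 5.1, 1.9])
println(label, " ", scores)
```

With these made-up stumps, two vote "virginica" (total weight 1.5) and one votes "versicolor" (weight 0.4), so the weighted vote picks "virginica". Normalizing the scores by their sum would give a rough per-label probability.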

### Regression Example

In [121]:
n, m = 10^3, 5 ;
features = randn(n, m);
weights = rand(-2:2, m);
labels = features * weights;

In [122]:
# train regression tree, using an averaging of 5 samples per leaf (optional)
model = build_tree(labels, features, 5)
# apply learned model
apply_tree(model, [-0.9,3.0,5.1,1.9,0.0])
# run n-fold cross validation, using 3 folds, averaging of 5 samples per leaf (optional)
# returns array of coefficients of determination (R^2)
r2 = nfoldCV_tree(labels, features, 3, 5)


Fold 1
Mean Squared Error:     1.243537824432842
Correlation Coeff:      0.9101673756983628
Coeff of Determination: 0.8261039847816669

Fold 2
Mean Squared Error:     1.3735365826404549
Correlation Coeff:      0.8934653787536725
Coeff of Determination: 0.7949879961253998

Fold 3
Mean Squared Error:     1.201211290711461
Correlation Coeff:      0.9155921072546311
Coeff of Determination: 0.8378920001145272

Mean Coeff of Determination: 0.8196613270071981


3-element Array{Float64,1}:
 0.826104
 0.794988
 0.837892
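
The three regression metrics in the fold reports can be reproduced with a few lines of plain Julia. This is a minimal sketch using the conventional definitions, with R^2 taken as 1 minus the ratio of residual to total sum of squares. The helper names and the toy vectors are illustrative, and that the package uses these exact formulas is an assumption.

```julia
# Illustrative helpers written without any package imports.
avg(v) = sum(v) / length(v)

mean_squared_error(actual, predicted) = avg((actual .- predicted) .^ 2)

# Pearson correlation between actual and predicted values
function correlation(actual, predicted)
    da = actual .- avg(actual)
    dp = predicted .- avg(predicted)
    return sum(da .* dp) / sqrt(sum(da .^ 2) * sum(dp .^ 2))
end

# R^2 = 1 - SS_residual / SS_total
function coeff_of_determination(actual, predicted)
    ss_residual = sum((actual .- predicted) .^ 2)
    ss_total = sum((actual .- avg(actual)) .^ 2)
    return 1 - ss_residual / ss_total
end

# Toy data, unrelated to the folds above
actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

mse_toy = mean_squared_error(actual, predicted)
r_toy   = correlation(actual, predicted)
r2_toy  = coeff_of_determination(actual, predicted)
println("MSE: ", mse_toy, "  r: ", r_toy, "  R^2: ", r2_toy)
```

Under this definition R^2 is not simply the square of the correlation coefficient, which matches the fold reports: in fold 1, 0.9102 squared is about 0.828, while the reported coefficient of determination is 0.8261.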

### Regression Random Forest

In [124]:
# train regression forest, using 2 random features, 10 trees,
# averaging of 5 samples per leaf (optional), 0.7 of samples per tree (optional)
model = build_forest(labels,features, 2, 10, 5, 0.7)
# apply learned model
apply_forest(model, [-0.9,3.0,5.1,1.9,0.0])
# run n-fold cross validation on regression forest
# using 2 random features, 10 trees, 3 folds, averaging of 5 samples/leaf (optional),
# and 0.7 portion of samples per tree (optional)
# returns array of coefficients of determination (R^2)
r2 = nfoldCV_forest(labels, features, 2, 10, 3, 5, 0.7)


Fold 1
Mean Squared Error:     0.7672772800670623
Correlation Coeff:      0.9537177427622131
Coeff of Determination: 0.8910805006065017

Fold 2
Mean Squared Error:     0.6486222262906012
Correlation Coeff:      0.9628494805545419
Coeff of Determination: 0.9107274573746335

Fold 3
Mean Squared Error:     0.7079725702223097
Correlation Coeff:      0.9565653120789999
Coeff of Determination: 0.8982096604688339

Mean Coeff of Determination: 0.9000058728166563


3-element Array{Float64,1}:
 0.891081
 0.910727
 0.89821 

In [125]:
show(model)

Ensemble of Decision Trees
Trees:      10
Avg Leaves: 234.5
Avg Depth:  15.9