## Decision Tree Sample

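This notebook demonstrates the DecisionTree.jl package on sample data from the Rdatasets collection. The data catalogue and both repositories are linked below.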

https://github.com/vincentarelbundock/Rdatasets

https://vincentarelbundock.github.io/Rdatasets/datasets.html

https://github.com/bensadeghi/DecisionTree.jl

### Classification Example

#### Load RDatasets and DecisionTree packages

In [33]:
Pkg.add("DecisionTree")
Pkg.add("RDatasets")

[1m[36mINFO: [39m[22m[36mPackage DecisionTree is already installed
[39m[1m[36mINFO: [39m[22m[36mMETADATA is out-of-date — you may not have the latest version of DecisionTree
[39m[1m[36mINFO: [39m[22m[36mUse `Pkg.update()` to get the latest versions of your packages
[39m[1m[36mINFO: [39m[22m[36mPackage RDatasets is already installed
[39m[1m[36mINFO: [39m[22m[36mMETADATA is out-of-date — you may not have the latest version of RDatasets
[39m[1m[36mINFO: [39m[22m[36mUse `Pkg.update()` to get the latest versions of your packages
[39m

In [34]:
using DecisionTree
using RDatasets

### Separate Fisher's Iris dataset features and labels

In [35]:
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4]);
labels = convert(Array, iris[:, 5]);

### Pruned Tree Classifier

In [36]:
# train full-tree classifier
model = build_tree(labels, features)
# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)
# pretty print of the tree, to a depth of 9 nodes (optional)
print_tree(model, 9)
# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for pruned tree,
# using 90% purity threshold pruning, and 3 CV folds
accuracy = nfoldCV_tree(labels, features, 0.9, 3)

Feature 3, Threshold 3.0
L-> setosa : 50/50
R-> Feature 4, Threshold 1.8
    L-> Feature 3, Threshold 5.0
        L-> versicolor : 47/48
        R-> Feature 4, Threshold 1.6
            L-> virginica : 3/3
            R-> Feature 1, Threshold 7.2
                L-> versicolor : 2/2
                R-> virginica : 1/1
    R-> Feature 3, Threshold 4.9
        L-> Feature 1, Threshold 6.0
            L-> versicolor : 1/1
            R-> virginica : 2/2
        R-> virginica : 43/43

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 14   0   0
  0  19   0
  0   2  15
Accuracy: 0.96
Kappa:    0.9393939393939393

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 18   0   0
  1  15   1
  0   1  14
Accuracy: 0.94
Kappa:    0.9096929560505719

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 18   0   0
  0  12   2
  0   0  18
Accuracy: 0.96
Kappa:    0.9393203883495145

Mean Accuracy: 0.9533333333333333


3-element Array{Float64,1}:
 0.96
 0.94
 0.96
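
The `prune_tree(model, 0.9)` call in the cell above is described as merging leaves with at least 90% combined purity. The following is a minimal sketch of that test, assuming "combined purity" means the share of the majority label among the pooled samples of two sibling leaves. The names `majority_purity` and `should_merge` are illustrative and are not part of DecisionTree.jl.

```julia
# Fraction of samples that carry the most common label.
function majority_purity(labels)
    best = 0
    for l in unique(labels)
        best = max(best, count(x -> x == l, labels))
    end
    return best / length(labels)
end

# Assumed pruning rule: merge two sibling leaves when the pooled purity
# reaches the threshold.
should_merge(left, right, threshold) =
    majority_purity(vcat(left, right)) >= threshold

# Sibling leaves from the printed tree: "versicolor : 2/2" and "virginica : 1/1".
left  = fill("versicolor", 2)
right = fill("virginica", 1)
println(majority_purity(vcat(left, right)))   # 2/3, below the 0.9 threshold
println(should_merge(left, right, 0.9))       # false

# Hypothetical pair: 47 versicolor samples next to a single virginica sample.
println(should_merge(fill("versicolor", 47), fill("virginica", 1), 0.9))  # true
```

Under this assumption the 2/2 and 1/1 leaves stay separate, which is consistent with the pruned tree printed above.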

### Random Forest Classifier

In [37]:
# train random forest classifier
# using 2 random features, 10 trees, 0.5 portion of samples per tree (optional), and a maximum tree depth of 9 (optional)
model = build_forest(labels, features, 2, 10, 0.5, 9)
# apply learned model
apply_forest(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
xy = apply_forest_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for forests
# using 2 random features, 10 trees, 3 folds and 0.5 of samples per tree (optional)
accuracy = nfoldCV_forest(labels, features, 2, 10, 3, 0.5)

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 13   0   0
  0  17   0
  0   4  16
Accuracy: 0.92
Kappa:    0.8790810157194682

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 20   0   0
  0  13   2
  0   0  15
Accuracy: 0.96
Kappa:    0.9393939393939393

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 17   0   0
  1  16   1
  0   2  13
Accuracy: 0.92
Kappa:    0.8795180722891568

Mean Accuracy: 0.9333333333333332


3-element Array{Float64,1}:
 0.92
 0.96
 0.92
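
Each fold report above lists an accuracy and a kappa value alongside a confusion matrix. As a sketch of how the two numbers relate to the matrix, the function below computes accuracy and Cohen's kappa from the standard definitions. `accuracy_and_kappa` is an illustrative name, not a DecisionTree.jl function, and the input is the first 3×3 matrix from the random forest run.

```julia
# Accuracy and Cohen's kappa from a square confusion matrix
# (rows and columns index the classes in the same order).
function accuracy_and_kappa(cm)
    k = size(cm, 1)
    n = sum(cm)
    # observed agreement: share of samples on the diagonal
    observed = sum(cm[i, i] for i in 1:k) / n
    # agreement expected by chance, from the row and column totals
    expected = sum(sum(cm[i, :]) * sum(cm[:, i]) for i in 1:k) / n^2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
end

cm = [13 0 0; 0 17 0; 0 4 16]
acc, kappa = accuracy_and_kappa(cm)
println("Accuracy: ", acc)     # 0.92
println("Kappa:    ", kappa)   # approximately 0.87908
```

These values agree with the `Accuracy: 0.92` and `Kappa: 0.8790810157194682` lines reported for Fold 1.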

In [44]:
show(xy)

[0.0, 0.0, 1.0]
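
The `[0.0, 0.0, 1.0]` result is consistent with all 10 trees voting for the third label, `"virginica"`. The sketch below shows one way such a probability vector can be derived, assuming each tree casts one vote and the probability of a label is its share of the votes. `vote_fractions` and the per-tree predictions are hypothetical, for illustration only.

```julia
# Share of votes each class label receives, in the order given by `classes`.
function vote_fractions(votes, classes)
    return [count(v -> v == c, votes) / length(votes) for c in classes]
end

classes = ["setosa", "versicolor", "virginica"]

# Hypothetical per-tree predictions from a 10-tree forest.
unanimous = fill("virginica", 10)
println(vote_fractions(unanimous, classes))   # [0.0, 0.0, 1.0]

mixed = vcat(fill("versicolor", 3), fill("virginica", 7))
println(vote_fractions(mixed, classes))       # [0.0, 0.3, 0.7]
```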

### Adaptive-Boosted Decision Stumps Classifier

In [39]:
# train adaptive-boosted stumps, using 7 iterations
model, coeffs = build_adaboost_stumps(labels, features, 7);
# apply learned model
apply_adaboost_stumps(model, coeffs, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_adaboost_stumps_proba(model, coeffs, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for boosted stumps, using 7 iterations and 3 folds
accuracy = nfoldCV_stumps(labels, features, 7, 3)

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 14   5   0
  0  12   5
  0   3  11
Accuracy: 0.74
Kappa:    0.6107784431137725

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 16   0   0
  0  14   2
  0   3  15
Accuracy: 0.9
Kappa:    0.8499399759903962

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:
3×3 Array{Int64,2}:
 15  0   0
  4  0  13
  0  0  18
Accuracy: 0.66
Kappa:    0.48702474351237174

Mean Accuracy: 0.7666666666666667


3-element Array{Float64,1}:
 0.74
 0.9 
 0.66
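
`build_adaboost_stumps` returns a set of stumps together with one coefficient per stump. A common way to combine them, sketched here as an assumption about what `apply_adaboost_stumps` does rather than a copy of its code, is a weighted vote: each stump's prediction counts with the weight of its coefficient, and the label with the largest total wins. `weighted_vote`, the predictions, and the coefficients below are all hypothetical.

```julia
# Return the label whose predictions carry the largest total weight.
function weighted_vote(predictions, coeffs)
    totals = Dict{String,Float64}()
    for (p, c) in zip(predictions, coeffs)
        totals[p] = get(totals, p, 0.0) + c
    end
    best_label, best_total = "", -Inf
    for (label, total) in totals
        if total > best_total
            best_label, best_total = label, total
        end
    end
    return best_label
end

# Hypothetical stump outputs and coefficients for one sample.
predictions = ["versicolor", "virginica", "versicolor", "virginica"]
coeffs      = [0.4, 0.9, 0.3, 0.2]
println(weighted_vote(predictions, coeffs))   # virginica (1.1 against 0.7)
```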

### Regression Example

In [40]:
n, m = 10^3, 5;
features = randn(n, m);
weights = rand(-2:2, m);
labels = features * weights;

In [41]:
# train regression tree, using an averaging of 10 samples per leaf (optional)
model = build_tree(labels, features, 10)
# apply learned model
apply_tree(model, [-0.9,3.0,5.1,1.9,0.0])
# run n-fold cross validation, using 3 folds, averaging of 5 samples per leaf (optional)
# returns array of coefficients of determination (R^2)
r2 = nfoldCV_tree(labels, features, 3, 5)


Fold 1
Mean Squared Error:     1.0406102600491711
Correlation Coeff:      0.9423620345949416
Coeff of Determination: 0.8876648618629531

Fold 2
Mean Squared Error:     1.0031735474668468
Correlation Coeff:      0.9479971774744004
Coeff of Determination: 0.8978302930596667

Fold 3
Mean Squared Error:     1.2709257571620163
Correlation Coeff:      0.9353166655932865
Coeff of Determination: 0.8735679181381519

Mean Coeff of Determination: 0.8863543576869238


3-element Array{Float64,1}:
 0.887665
 0.89783 
 0.873568
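
The regression fold reports above list a mean squared error, a correlation coefficient, and a coefficient of determination. The sketch below computes all three from their textbook definitions on a small made-up example. `regression_metrics` and `mean_of` are illustrative names, and the four data points are hypothetical rather than taken from the run above.

```julia
mean_of(v) = sum(v) / length(v)

# Mean squared error, Pearson correlation, and R^2 for paired vectors.
function regression_metrics(actual, predicted)
    residual = actual .- predicted
    mse = mean_of(residual .^ 2)
    a = actual .- mean_of(actual)          # centered targets
    p = predicted .- mean_of(predicted)    # centered predictions
    corr = sum(a .* p) / sqrt(sum(a .^ 2) * sum(p .^ 2))
    r2 = 1 - sum(residual .^ 2) / sum(a .^ 2)
    return mse, corr, r2
end

# Hypothetical targets and predictions.
actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
mse, corr, r2 = regression_metrics(actual, predicted)
println("Mean Squared Error:     ", mse)   # 0.25
println("Correlation Coeff:      ", corr)
println("Coeff of Determination: ", r2)
```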

### Regression Random Forest

In [42]:
# train regression forest, using 2 random features, 10 trees,
# averaging of 5 samples per leaf (optional), 0.7 of samples per tree (optional)
model = build_forest(labels, features, 2, 10, 5, 0.7)
# apply learned model
apply_forest(model, [-0.9,3.0,5.1,1.9,0.0])
# run n-fold cross validation on regression forest
# using 2 random features, 10 trees, 3 folds, averaging of 5 samples/leaf (optional),
# and 0.7 portion of samples per tree (optional)
# returns array of coefficients of determination (R^2)
r2 = nfoldCV_forest(labels, features, 2, 10, 3, 5, 0.7)


Fold 1
Mean Squared Error:     0.7492733472548869
Correlation Coeff:      0.9685374685976451
Coeff of Determination: 0.9230495573954945

Fold 2
Mean Squared Error:     0.922607236924649
Correlation Coeff:      0.9624210487415116
Coeff of Determination: 0.9037230033901935

Fold 3
Mean Squared Error:     0.8059821171089434
Correlation Coeff:      0.9694820169090681
Coeff of Determination: 0.9179757924556372

Mean Coeff of Determination: 0.9149161177471085


3-element Array{Float64,1}:
 0.92305 
 0.903723
 0.917976

In [46]:
show(model)

Ensemble of Decision Trees
Trees:      10
Avg Leaves: 231.5
Avg Depth:  16.0

In [49]:
show(r2)

[0.92305, 0.903723, 0.917976]