## Decision Tree Sample

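This notebook trains decision tree classifiers on Fisher's Iris data with DecisionTree.jl. The data is loaded through the RDatasets package, which draws on the Rdatasets collection.

- Rdatasets repository: https://github.com/vincentarelbundock/Rdatasets
- Rdatasets dataset index: https://vincentarelbundock.github.io/Rdatasets/datasets.html
- DecisionTree.jl: https://github.com/bensadeghi/DecisionTree.jl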

### Classification Example

#### Load RDatasets and DecisionTree packages

In [14]:
Pkg.add("DecisionTree")
Pkg.add("RDatasets")

[1m[36mINFO: [39m[22m[36mPackage DecisionTree is already installed
[39m[1m[36mINFO: [39m[22m[36mMETADATA is out-of-date — you may not have the latest version of DecisionTree
[39m[1m[36mINFO: [39m[22m[36mUse `Pkg.update()` to get the latest versions of your packages
[39m[1m[36mINFO: [39m[22m[36mPackage RDatasets is already installed
[39m[1m[36mINFO: [39m[22m[36mMETADATA is out-of-date — you may not have the latest version of RDatasets
[39m[1m[36mINFO: [39m[22m[36mUse `Pkg.update()` to get the latest versions of your packages
[39m

In [15]:
using DecisionTree
using RDatasets

#### Separate Fisher's Iris dataset features and labels

In [25]:
iris = dataset("datasets", "iris")
features = convert(Array, iris[:, 1:4]);
labels = convert(Array, iris[:, 5]);

#### Pruned Tree Classifier

In [17]:
# train full-tree classifier
model = build_tree(labels, features)
# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)
# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])
# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])
# run n-fold cross validation for pruned tree,
# using 90% purity threshold pruning, and 3 CV folds
accuracy = nfoldCV_tree(labels, features, 0.9, 3)

Feature 3, Threshold 3.0
L-> setosa : 50/50
R-> Feature 4, Threshold 1.8
    L-> Feature 3, Threshold 5.0
        L-> versicolor : 47/48
        R-> Feature 4, Threshold 1.6
            L-> virginica : 3/3
            R-> Feature 1, Threshold 7.2
                L-> versicolor : 2/2
                R-> virginica : 1/1
    R-> Feature 3, Threshold 4.9
        L-> Feature 1, Threshold 6.0
            L-> versicolor : 1/1
            R-> virginica : 2/2
        R-> virginica : 43/43

Fold 1
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 21   0   0
  0  11   1
  0   0  17

Accuracy: 0.98
Kappa:    0.9691548426896976

Fold 2
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 10   0   0
  1  20   1
  0   2  16

Accuracy: 0.92
Kappa:    0.8750000000000001

Fold 3
Classes:  String["setosa", "versicolor", "virginica"]
Matrix:   
3×3 Array{Int64,2}:
 19   0   0
  0  14   2
  0   3  12

Accuracy: 0.9
Kappa:    0.8491249245624621

Mean Accuracy: 0.9333333333333332


3-element Array{Float64,1}:
 0.98
 0.92
 0.9 
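The accuracy and kappa values reported for each fold can be recomputed from that fold's confusion matrix. The following is a minimal sketch in plain Julia, not DecisionTree.jl's own code; the helper names `accuracy_from_confusion` and `kappa_from_confusion` are hypothetical, and the matrix is the fold 1 matrix copied from the output above.

```julia
# Confusion matrix for fold 1, copied from the output above
# (rows and columns ordered setosa, versicolor, virginica).
M = [21  0  0;
      0 11  1;
      0  0 17]

# Hypothetical helper: accuracy is the fraction of samples on the diagonal.
function accuracy_from_confusion(M)
    k = size(M, 1)
    return sum(M[i, i] for i in 1:k) / sum(M)
end

# Hypothetical helper: Cohen's kappa, i.e. observed agreement corrected
# for the agreement expected by chance from the row and column totals.
function kappa_from_confusion(M)
    k = size(M, 1)
    n = sum(M)
    po = accuracy_from_confusion(M)
    pe = sum(sum(M[i, :]) * sum(M[:, i]) for i in 1:k) / n^2
    return (po - pe) / (1 - pe)
end

println(accuracy_from_confusion(M))
println(kappa_from_confusion(M))
```

For the fold 1 matrix this gives an accuracy of 0.98 (49 of 50 samples on the diagonal) and a kappa of about 0.969, in line with the values printed for that fold.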

### ScikitLearn.jl

DecisionTree.jl implements the ScikitLearn.jl interface, so its models can be used with ScikitLearn.jl's cross-validation, hyperparameter tuning, and pipeline tools.

The classifier example above can be rewritten as:

In [18]:
model = DecisionTreeClassifier(pruning_purity_threshold=0.9, maxdepth=6)

DecisionTree.DecisionTreeClassifier(Nullable{Float64}(0.9), 0, 6, MersenneTwister(UInt32[0x89c9dc53, 0x586d1466, 0xc9ab2764, 0xe0df138a], Base.dSFMT.DSFMT_state(Int32[859963715, 1072717950, -979364890, 1073304473, 1493797162, 1073511191, 602336344, 1073540208, 1122483103, 1073267778  …  -1042556811, 1073130256, -402223520, 1073265699, 2076029334, -1623647741, 1609144049, 1374115344, 382, 0]), [1.02356, 1.58291, 1.78005, 1.80772, 1.54791, 1.23128, 1.18049, 1.72941, 1.27951, 1.72066  …  1.87663, 1.77232, 1.85521, 1.78891, 1.56134, 1.91732, 1.29019, 1.46083, 1.41676, 1.54593], 52), #undef, #undef)
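The `pruning_purity_threshold=0.9` argument corresponds to the 90% purity threshold passed to `prune_tree` earlier. As a rough illustration of what "combined purity" means, the sketch below computes the share of the majority label among the samples of two sibling leaves. The helper `combined_purity` and the leaf contents are hypothetical; this is not the library's implementation.

```julia
# Hypothetical helper (not part of DecisionTree.jl): purity of the single
# leaf that would result from merging two sibling leaves.
function combined_purity(left_labels, right_labels)
    all_labels = vcat(left_labels, right_labels)
    counts = Dict{eltype(all_labels), Int}()
    for label in all_labels
        counts[label] = get(counts, label, 0) + 1
    end
    # Share of the most frequent label among all merged samples.
    return maximum(values(counts)) / length(all_labels)
end

# Made-up leaves: 47 versicolor samples next to a single virginica sample.
mostly_pure = combined_purity(fill("versicolor", 47), fill("virginica", 1))

# Made-up leaves: 2 versicolor samples next to a single virginica sample.
mixed = combined_purity(fill("versicolor", 2), fill("virginica", 1))

println(mostly_pure >= 0.9)  # merge candidates at a 0.9 threshold?
println(mixed >= 0.9)
```

With a threshold of 0.9, the first pair (47 of 48 samples agree) would qualify for merging, while the second pair (2 of 3 agree) would not.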

In [19]:
using ScikitLearn: fit!, predict
fit!(model, features, labels)

DecisionTree.DecisionTreeClassifier(Nullable{Float64}(0.9), 0, 6, MersenneTwister(UInt32[0x89c9dc53, 0x586d1466, 0xc9ab2764, 0xe0df138a], Base.dSFMT.DSFMT_state(Int32[859963715, 1072717950, -979364890, 1073304473, 1493797162, 1073511191, 602336344, 1073540208, 1122483103, 1073267778  …  -1042556811, 1073130256, -402223520, 1073265699, 2076029334, -1623647741, 1609144049, 1374115344, 382, 0]), [1.02356, 1.58291, 1.78005, 1.80772, 1.54791, 1.23128, 1.18049, 1.72941, 1.27951, 1.72066  …  1.87663, 1.77232, 1.85521, 1.78891, 1.56134, 1.91732, 1.29019, 1.46083, 1.41676, 1.54593], 52), Decision Tree
Leaves: 8
Depth:  5, String["setosa", "versicolor", "virginica"])

In [30]:
# show a summary of the fitted tree: number of leaves and depth
show(model.root)

Decision Tree
Leaves: 8
Depth:  5

In [31]:
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model.root, 5)

Feature 3, Threshold 3.0
L-> setosa : 50/50
R-> Feature 4, Threshold 1.8
    L-> Feature 3, Threshold 5.0
        L-> versicolor : 47/48
        R-> Feature 4, Threshold 1.6
            L-> virginica : 3/3
            R-> Feature 1, Threshold 7.2
                L-> versicolor : 2/2
                R-> virginica : 1/1
    R-> Feature 3, Threshold 4.9
        L-> Feature 1, Threshold 6.0
            L-> versicolor : 1/1
            R-> virginica : 2/2
        R-> virginica : 43/43
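The printed tree can be read as nested comparisons. The sketch below hand-codes the tree shown above as plain Julia, assuming a sample goes to the left branch (`L->`) when its feature value is below the threshold and to the right branch (`R->`) otherwise. The function `classify_by_hand` is hypothetical and only illustrates how to trace a sample; the thresholds are copied from the printout.

```julia
# Hand-written version of the printed tree (illustrative only).
# Assumption: left branch when the feature value is below the threshold.
function classify_by_hand(x)
    if x[3] < 3.0
        return "setosa"
    elseif x[4] < 1.8
        if x[3] < 5.0
            return "versicolor"
        elseif x[4] < 1.6
            return "virginica"
        elseif x[1] < 7.2
            return "versicolor"
        else
            return "virginica"
        end
    else
        if x[3] < 4.9
            return x[1] < 6.0 ? "versicolor" : "virginica"
        else
            return "virginica"
        end
    end
end

# The sample used throughout this notebook.
println(classify_by_hand([5.9, 3.0, 5.1, 1.9]))
```

For the sample `[5.9, 3.0, 5.1, 1.9]`, feature 3 (5.1) is not below 3.0, feature 4 (1.9) is not below 1.8, and feature 3 is not below 4.9, so the trace ends in the `virginica : 43/43` leaf.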


In [21]:
# apply learned model
predict(model, [5.9,3.0,5.1,1.9])

"virginica"

In [22]:
# get the probability of each label (the result is not displayed here
# because it is not the last expression in the cell)
predict_proba(model, [5.9,3.0,5.1,1.9])
println(get_classes(model)) # returns the ordering of the columns in predict_proba's output

String["setosa", "versicolor", "virginica"]


In [23]:
# run n-fold cross validation over 3 CV folds
# See ScikitLearn.jl for installation instructions
using ScikitLearn.CrossValidation: cross_val_score
accuracy = cross_val_score(model, features, labels, cv=3)

3-element Array{Float64,1}:
 0.980392
 0.901961
 0.958333
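To make the idea of n-fold cross-validation concrete, here is a minimal sketch of how row indices could be split into folds. The helper `kfold_indices` is hypothetical and uses plain contiguous folds; it is not how ScikitLearn.jl or DecisionTree.jl build their folds.

```julia
# Hypothetical helper: split the indices 1:n into k contiguous folds and
# return a (train, test) pair of index vectors for each fold.
function kfold_indices(n, k)
    folds = Vector{Tuple{Vector{Int}, Vector{Int}}}()
    for f in 1:k
        lo = div((f - 1) * n, k) + 1   # first test index of fold f
        hi = div(f * n, k)             # last test index of fold f
        test = collect(lo:hi)
        train = [i for i in 1:n if i < lo || i > hi]
        push!(folds, (train, test))
    end
    return folds
end

# The iris data has 150 rows; split them into 3 folds.
for (train, test) in kfold_indices(150, 3)
    println(length(train), " training rows, ", length(test), " test rows")
end
```

Each fold is used once as the test set while the model is trained on the remaining rows. Because the iris rows are grouped by species, contiguous folds like these would each hold out a single species, so a real split should shuffle or stratify the indices first.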