# Classical Machine Learning Pipeline

This section describes a classical machine learning pipeline.

We rely on [MLUtils](https://github.com/JuliaML/MLUtils.jl), among other things, to load the data we will play with: the `iris` dataset.

We partition the data into a set of *instances* `X` and the corresponding *labels* `y`. Each instance is one element of the Cartesian product of the domains of the *attributes*.
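
To make the notion of an instance concrete, here is a small sketch built by hand from the first two columns of the iris matrix and the attribute names printed in the loading cell of the next section. The names `X_small`, `y_small` and `attribute_names` are introduced only for this illustration.

```julia
# Two instances copied from the first two columns of the iris matrix.
# Each column is one instance; each row is one attribute.
attribute_names = ["Sepal length", "Sepal width", "Petal length", "Petal width"]
X_small = [5.1 4.9;
           3.5 3.0;
           1.4 1.4;
           0.2 0.2]
y_small = ["setosa", "setosa"]

# An instance is one value per attribute, i.e. one point of the Cartesian product.
first_instance = X_small[:, 1]
for (name, value) in zip(attribute_names, first_instance)
    println("$(name): $(value)")
end
```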

We want to find a relation between the instance space (i.e., many examples of iris flowers) and the label space (i.e., the species to which each flower belongs).

To do so, we train a (classification) decision tree, leveraging the `DecisionTree` library.

Later in the notebook, we will repeat the process using the `Sole.jl` library and more-than-propositional logic.

## Data Loading and Description

In [70]:
using MLUtils

X, y = load_iris()

([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], ["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"], ["Sepal length", "Sepal width", "Petal length", "Petal width"])

In [71]:
size(X)

(4, 150)

In [72]:
size(y)

(150,)

In [73]:
using Statistics  # provides mean and std

mean(X, dims = 2)

4×1 Matrix{Float64}:
 5.843333333333335
 3.057333333333334
 3.7580000000000027
 1.199333333333334

In [74]:
std(X, dims = 2)

4×1 Matrix{Float64}:
 0.8280661279778629
 0.435866284936698
 1.7652982332594664
 0.7622376689603465

In [75]:
minimum(X, dims = 2)

4×1 Matrix{Float64}:
 4.3
 2.0
 1.0
 0.1

In [76]:
maximum(X, dims = 2)

4×1 Matrix{Float64}:
 7.9
 4.4
 6.9
 2.5

In [77]:
for class in unique(y)
    println("$(class) - $(count(yi -> yi == class, y))")
end

setosa - 50
versicolor - 50
virginica - 50


## Data Preprocessing

In the limited scenario of this exercise, there is not much room for complex preprocessing of our data. For example, we are not dealing with unbalanced classes, missing data, or complex encodings.

In the cells below, we shuffle the instances and partition them into a training and a testing bucket, so that both buckets contain a mix of all the classes.

With this distinction, we can train a model on the training data and use the testing data to simulate a real-world scenario, obtaining a more reliable estimate of the model's performance.
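
Shuffling followed by a plain split does not, by itself, guarantee identical class proportions in the two buckets. As a minimal sketch of how a class-balanced (stratified) split could be written in plain Julia, consider the following. `stratified_split` is a hypothetical helper written for this illustration, not an MLUtils function, and the toy labels only mirror the 50/50/50 class balance of iris.

```julia
using Random

# Hypothetical helper: split observation indices so that every class
# contributes (approximately) the same fraction `at` to the training bucket.
function stratified_split(rng::AbstractRNG, y::AbstractVector; at::Float64 = 0.8)
    train_idx, test_idx = Int[], Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))  # indices of this class, shuffled
        n_train = round(Int, at * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    # shuffle again so that the classes are interleaved in each bucket
    return shuffle(rng, train_idx), shuffle(rng, test_idx)
end

# Toy labels: 50 instances per class.
y_toy = repeat(["setosa", "versicolor", "virginica"], inner = 50)
train_idx, test_idx = stratified_split(MersenneTwister(1605), y_toy; at = 0.8)
```

With a column-per-instance matrix such as `X`, the buckets would then be obtained by indexing columns, e.g. `X[:, train_idx]` and `y[train_idx]`.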

In [None]:
using Random
Random.seed!(1605)

TaskLocalRNG()

In [99]:
# take a look at the next cell; do you see why we need to shuffle our instances here?
Xs, ys = shuffleobs((X, y))

([6.9 5.8 … 6.5 5.1; 3.1 2.7 … 3.0 2.5; 4.9 3.9 … 5.8 3.0; 1.5 1.2 … 2.2 1.1], ["versicolor", "versicolor", "versicolor", "virginica", "setosa", "virginica", "setosa", "versicolor", "setosa", "virginica"  …  "virginica", "setosa", "setosa", "versicolor", "virginica", "setosa", "versicolor", "versicolor", "virginica", "versicolor"])

In [100]:
training_data, testing_data = splitobs((Xs, ys); at = 0.8)

(([6.9 5.8 … 6.0 4.9; 3.1 2.7 … 2.7 3.0; 4.9 3.9 … 5.1 1.4; 1.5 1.2 … 1.6 0.2], ["versicolor", "versicolor", "versicolor", "virginica", "setosa", "virginica", "setosa", "versicolor", "setosa", "virginica"  …  "virginica", "versicolor", "setosa", "virginica", "setosa", "setosa", "setosa", "setosa", "versicolor", "setosa"]), ([6.1 7.7 … 6.5 5.1; 2.9 2.8 … 3.0 2.5; 4.7 6.7 … 5.8 3.0; 1.4 2.0 … 2.2 1.1], ["versicolor", "virginica", "virginica", "versicolor", "versicolor", "versicolor", "versicolor", "versicolor", "virginica", "versicolor"  …  "virginica", "setosa", "setosa", "versicolor", "virginica", "setosa", "versicolor", "versicolor", "virginica", "versicolor"]))

In [101]:
X_train, y_train = training_data
X_test, y_test  = testing_data

([6.1 7.7 … 6.5 5.1; 2.9 2.8 … 3.0 2.5; 4.7 6.7 … 5.8 3.0; 1.4 2.0 … 2.2 1.1], ["versicolor", "virginica", "virginica", "versicolor", "versicolor", "versicolor", "versicolor", "versicolor", "virginica", "versicolor"  …  "virginica", "setosa", "setosa", "versicolor", "virginica", "setosa", "versicolor", "versicolor", "virginica", "versicolor"])

In [102]:
size(X_train)

(4, 120)

In [103]:
size(X_train')

(120, 4)

In [104]:
size(y_train)

(120,)

In [None]:
using DecisionTree

model = DecisionTreeClassifier(
    max_depth = 5,
    min_samples_leaf = 1,
    min_samples_split = 2
)

fit!(model, X_train', y_train)

DecisionTreeClassifier
max_depth:                5
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["setosa", "versicolor", "virginica"]
root:                     Decision Tree
Leaves: 7
Depth:  4

When printing a decision tree with `DecisionTree.print_tree`, the fraction $M / N$ to the right of a leaf $l$
encodes the fact that $N$ training instances satisfy all the conditions from the root of the tree to $l$ and, among those, $M$ are correctly classified.

In [106]:
print_tree(model) # use print_tree(model, N) to limit the depth of the printing

Feature 3 < 2.6 ?
├─ setosa : 42/42
└─ Feature 4 < 1.75 ?
    ├─ Feature 3 < 5.05 ?
        ├─ versicolor : 35/35
        └─ Feature 1 < 6.05 ?
            ├─ versicolor : 1/1
            └─ virginica : 3/3
    └─ Feature 3 < 4.85 ?
        ├─ Feature 2 < 3.1 ?
            ├─ virginica : 2/2
            └─ versicolor : 1/1
        └─ virginica : 36/36
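
The printed tree can be read as a cascade of if-else tests, where the first branch under each test is taken when the condition holds. The sketch below transcribes the tree above by hand. `classify_iris` is a hypothetical name, and the thresholds are copied from the printed output, which may be rounded with respect to the values stored in the model.

```julia
# Hand-written transcription of the printed tree (not generated by DecisionTree).
# x[1] = sepal length, x[2] = sepal width, x[3] = petal length, x[4] = petal width.
function classify_iris(x::AbstractVector{<:Real})
    if x[3] < 2.6
        return "setosa"
    elseif x[4] < 1.75
        if x[3] < 5.05
            return "versicolor"
        else
            return x[1] < 6.05 ? "versicolor" : "virginica"
        end
    else
        if x[3] < 4.85
            return x[2] < 3.1 ? "virginica" : "versicolor"
        else
            return "virginica"
        end
    end
end

classify_iris([5.1, 3.5, 1.4, 0.2])  # first instance of the dataset
```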


In [108]:
using Statistics

y_pred = predict(model, X_test')
y_pred[1:5]

5-element Vector{String}:
 "versicolor"
 "virginica"
 "virginica"
 "versicolor"
 "versicolor"

## Exercise: Write Your Confusion Matrix

It is common practice to summarize the performance of a model in a *confusion matrix*,
containing the true positives and negatives found by our model on the testing data, as well as
the false positives and negatives.

In the case of binary classification, a confusion matrix is shaped as below.
$$
\begin{array}{c|c|c}
\text{Actual / Predicted} & \text{Positive} & \text{Negative} \\ \hline
\text{Positive} & TP & FN \\
\text{Negative} & FP & TN
\end{array}
$$

Among many others, three important measures can be obtained from the matrix above: accuracy, precision and recall.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
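
As a worked example, the three measures can be computed directly from the four cells of a binary confusion matrix. The counts below are made up for illustration and do not come from the iris experiment.

```julia
# Hypothetical counts for a binary classifier evaluated on 100 instances.
TP, FN = 40, 10   # actual positives: predicted positive / predicted negative
FP, TN = 5, 45    # actual negatives: predicted positive / predicted negative

accuracy_value  = (TP + TN) / (TP + FP + TN + FN)  # fraction of correct predictions
precision_value = TP / (TP + FP)                   # how many predicted positives are real
recall_value    = TP / (TP + FN)                   # how many real positives are found

println("Accuracy:  ", accuracy_value)
println("Precision: ", precision_value)
println("Recall:    ", recall_value)
```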

In [109]:
y_test[1:5]

5-element Vector{String}:
 "versicolor"
 "virginica"
 "virginica"
 "versicolor"
 "versicolor"

In [163]:
"""
Return a confusion matrix where rows encode the true labels, and columns encode the predicted
values.
"""
function confusion_matrix(y_true, y_pred; labels=["versicolor", "virginica", "setosa"])
    if length(y_true) != length(y_pred)
        throw(ArgumentError("Length mismatch ($(length(y_true)), $(length(y_pred)))"))
    end

    nof_labels = length(labels)

    string_to_idx = Dict{String, Int}()
    for (i, label) in enumerate(labels)
        string_to_idx[label] = i
    end

    cmatrix = zeros(Int, nof_labels, nof_labels)

    for (true_value, predicted_value) in zip(y_true, y_pred)
        cmatrix[string_to_idx[true_value], string_to_idx[predicted_value]] += 1
    end

    return cmatrix
end

confusion_matrix

In [164]:
m = confusion_matrix(y_test, y_pred)

3×3 Matrix{Int64}:
 13  0  0
  2  7  0
  0  0  8

In [165]:
using LinearAlgebra

"""
Return the accuracy of `m`.
"""
function accuracy(m::Matrix{Int})
    # in an efficient implementation of the confusion matrix, these sums would have been precomputed
    return sum(diag(m)) / sum(m)
end

accuracy

In [169]:
"""
Return the average precision of all the classes embodied within `m`.
"""
function precision(m::Matrix{Int})
    nrows = size(m, 1)
    precisions = zeros(Float64, nrows)

    for i in 1:nrows
        true_positives = m[i,i]
        # column i collects every instance predicted as class i
        false_positives = sum(m[:, i]) - true_positives

        precisions[i] = true_positives / (false_positives + true_positives)
    end

    return sum(precisions) / nrows
end

precision

In [170]:
"""
Return the average recall of all the classes embodied within `m`.
"""
function recall(m::Matrix{Int})
    nrows = size(m, 1)
    recalls = zeros(Float64, nrows)

    for i in 1:nrows
        true_positives = m[i,i]
        # row i collects every instance whose true label is class i
        false_negatives = sum(m[i, :]) - true_positives

        recalls[i] = true_positives / (false_negatives + true_positives)
    end

    return sum(recalls) / nrows
end

recall

We can aggregate the (macro-averaged) precision and recall into a single value via a harmonic mean.
In the jargon, this new measure is called the *F1 score*.

In [176]:
"""
Return the F1 score, with respect to the given `precision` and `recall`.
"""
function f1score(precision::Float64, recall::Float64)
    return (2 * precision * recall) / (precision + recall)
end

f1score

In [179]:
_accuracy = accuracy(m)
_precision = precision(m)
_recall = recall(m)
_f1score = f1score(_precision, _recall)

println("Accuracy: $(_accuracy)")
println("Precision: $(_precision)")
println("Recall: $(_recall)")
println("F1 Score: $(_f1score)")

Accuracy: 0.9333333333333333
Precision: 0.9555555555555556
Recall: 0.9259259259259259
F1 Score: 0.9405074365704288


## Hyperparameter Tuning

The arguments of `DecisionTreeClassifier(...)` are called *hyperparameters*: they are not learned from the data, but they steer how the learning algorithm builds a specific model (i.e., the if-else cascade we call decision tree).

Which combination of hyperparameters should we provide?

In this rather lightweight example, we can systematically try many combinations and keep the one that achieves the highest performance.

This technique goes under the name of *grid search*.
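
The idea can be sketched independently of any learning library. In the snippet below, `toy_score` is a hypothetical stand-in for "train a model with these hyperparameters and evaluate it", and `grid_search` is a helper written for this illustration; neither belongs to `DecisionTree`.

```julia
# Hypothetical scoring function: it peaks at depth = 5 and leaf = 2.
toy_score(depth, leaf) = 1.0 - 0.01 * (depth - 5)^2 - 0.02 * (leaf - 2)^2

# Try every combination in the Cartesian product of the grids and
# return the combination with the highest score, together with that score.
function grid_search(score, grids...)
    best_score, best_params = -Inf, nothing
    for params in Iterators.product(grids...)
        s = score(params...)
        if s > best_score
            best_score, best_params = s, params
        end
    end
    return best_params, best_score
end

toy_best_params, toy_best_score = grid_search(toy_score, [3, 5, 7], [1, 2, 5])
```

The number of combinations grows multiplicatively with each grid (here 3 × 3 = 9), which is why an exhaustive search is only practical for small grids and cheap models.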

In [None]:
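max_depths = [3, 5, 7]
min_samples_leaf = [1, 2, 5]
min_samples_split = [2, 4]

best_score = 0.0
best_params = nothing
best_model = nothing

for (_max_depth, _min_samples_leaf, _min_samples_split) in Iterators.product(
    max_depths, min_samples_leaf, min_samples_split)

    # build a candidate tree with the current combination of hyperparameters
    candidate = DecisionTreeClassifier(
        max_depth = _max_depth,
        min_samples_leaf = _min_samples_leaf,
        min_samples_split = _min_samples_split
    )

    fit!(candidate, X_train', y_train)

    # NOTE: for brevity we score on the testing data; a separate validation
    # split would give a less optimistic estimate
    score = accuracy(confusion_matrix(y_test, predict(candidate, X_test')))

    # keep the best combination seen so far
    if score > best_score
        global best_score = score
        global best_params = (_max_depth, _min_samples_leaf, _min_samples_split)
        global best_model = candidate
    end
end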




# Learning with Sole.jl


In [92]:
# TODO: see Day1-Appetizer.ipynb