# Classical Machine Learning Pipeline

This section describes a classical machine learning pipeline.

We use [MLUtils](https://github.com/JuliaML/MLUtils.jl), among other things, to load the data we will work with: the `iris` dataset.

We partition the data into a set of *instances* `X` and the corresponding *labels* `y`. Each instance is one element of the Cartesian product of the domains of the *attributes*.

We want to find a relation between the instance space (i.e., many examples of iris flowers) and the label space (i.e., the species to which each flower belongs).

To do so, we train a (classification) decision tree using the `DecisionTree` library.

Later in the notebook, we repeat the process using the `Sole.jl` library and more-than-propositional logic.

## Data Loading and Description

In [223]:
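using MLUtils   # provides load_iris, shuffleobs and splitobs
import MLJBase  # used later, qualified, as MLJBase.load_iris()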

X, y, attributes = load_iris()

([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], ["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"], ["Sepal length", "Sepal width", "Petal length", "Petal width"])

In [224]:
attributes

4-element Vector{String}:
 "Sepal length"
 "Sepal width"
 "Petal length"
 "Petal width"

In [225]:
size(X)

(4, 150)

In [226]:
size(y)

(150,)

In [227]:
using Statistics  # provides mean and std

mean(X, dims = 2)

4×1 Matrix{Float64}:
 5.843333333333335
 3.057333333333334
 3.7580000000000027
 1.199333333333334

In [228]:
std(X, dims = 2)

4×1 Matrix{Float64}:
 0.8280661279778629
 0.435866284936698
 1.7652982332594664
 0.7622376689603465

In [229]:
minimum(X, dims = 2)

4×1 Matrix{Float64}:
 4.3
 2.0
 1.0
 0.1

In [230]:
maximum(X, dims = 2)

4×1 Matrix{Float64}:
 7.9
 4.4
 6.9
 2.5

In [231]:
for class in unique(y)
    println("$(class) - $(count(yi -> yi == class, y))")
end

setosa - 50
versicolor - 50
virginica - 50


## Data Preprocessing

In the limited scenario of this exercise, there is not much room for complex preprocessing of our data. For example, we are not dealing with unbalanced classes, missing data, or complex encodings.

In the cells below, we partition the data into a training and a testing bucket, shuffling first so that both buckets keep a roughly balanced mix of classes.

With this split, we can train a model on the training data only and use the testing data to simulate a real-world scenario with unseen instances, obtaining a more reliable estimate of performance.
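A shuffle followed by a single cut balances the classes only approximately. If exact per-class proportions matter, a stratified split is one option. The sketch below uses only the standard library; `stratified_split` is a hypothetical helper (it is not part of MLUtils), it assumes instances are stored as columns, and the toy data is random.

```julia
using Random

# Hypothetical helper (not part of MLUtils): split column-wise instances so that
# each class contributes the same fraction `at` of its instances to the training set.
function stratified_split(rng::AbstractRNG, X::AbstractMatrix, y::AbstractVector; at = 0.8)
    train_idx, test_idx = Int[], Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))   # indices of this class, shuffled
        ntrain = round(Int, at * length(idx))
        append!(train_idx, idx[1:ntrain])
        append!(test_idx, idx[ntrain+1:end])
    end
    return (X[:, train_idx], y[train_idx]), (X[:, test_idx], y[test_idx])
end

# Toy data: 4 attributes, 10 instances for each of 3 classes.
rng = MersenneTwister(1605)
X_toy = rand(rng, 4, 30)
y_toy = repeat(["setosa", "versicolor", "virginica"], inner = 10)

(Xtr, ytr), (Xte, yte) = stratified_split(rng, X_toy, y_toy; at = 0.8)
```

With this toy data each class contributes 8 instances to the training set and 2 to the testing set. The returned buckets are grouped by class, so shuffle them again if the order of instances matters.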

In [232]:
using Random
Random.seed!(1605)

TaskLocalRNG()

In [233]:
# take a look at the next cell; do you see why we need to shuffle our instances here?
Xs, ys = shuffleobs((X, y))

([6.0 6.1 … 5.9 5.7; 3.4 3.0 … 3.2 3.0; 4.5 4.9 … 4.8 4.2; 1.6 1.8 … 1.8 1.2], ["versicolor", "virginica", "virginica", "virginica", "virginica", "setosa", "versicolor", "setosa", "versicolor", "setosa"  …  "setosa", "versicolor", "virginica", "virginica", "versicolor", "versicolor", "setosa", "setosa", "versicolor", "versicolor"])

In [234]:
training_data, testing_data = splitobs((Xs, ys); at = 0.8)

(([6.0 6.1 … 5.1 6.7; 3.4 3.0 … 3.5 3.1; 4.5 4.9 … 1.4 5.6; 1.6 1.8 … 0.3 2.4], ["versicolor", "virginica", "virginica", "virginica", "virginica", "setosa", "versicolor", "setosa", "versicolor", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "versicolor", "versicolor", "virginica", "setosa", "virginica"]), ([6.1 4.4 … 5.9 5.7; 3.0 2.9 … 3.2 3.0; 4.6 1.4 … 4.8 4.2; 1.4 0.2 … 1.8 1.2], ["versicolor", "setosa", "virginica", "versicolor", "virginica", "versicolor", "virginica", "virginica", "virginica", "setosa"  …  "setosa", "versicolor", "virginica", "virginica", "versicolor", "versicolor", "setosa", "setosa", "versicolor", "versicolor"]))

In [235]:
X_train, y_train = training_data
X_test, y_test  = testing_data

([6.1 4.4 … 5.9 5.7; 3.0 2.9 … 3.2 3.0; 4.6 1.4 … 4.8 4.2; 1.4 0.2 … 1.8 1.2], ["versicolor", "setosa", "virginica", "versicolor", "virginica", "versicolor", "virginica", "virginica", "virginica", "setosa"  …  "setosa", "versicolor", "virginica", "virginica", "versicolor", "versicolor", "setosa", "setosa", "versicolor", "versicolor"])

In [236]:
size(X_train)

(4, 120)

In [237]:
size(X_train')

(120, 4)

In [238]:
size(y_train)

(120,)

In [239]:
using DecisionTree

model = DecisionTreeClassifier(
    max_depth = 5,
    min_samples_leaf = 1,
    min_samples_split = 2
)

fit!(model, X_train', y_train)

DecisionTreeClassifier
max_depth:                5
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["setosa", "versicolor", "virginica"]
root:                     Decision Tree
Leaves: 7
Depth:  5

When printing a decision tree with `DecisionTree.print_tree`, the $N/M$ to the right of a leaf $l$
means that $M$ training instances satisfy all the conditions on the path from the root of the tree to $l$ and, among those, $N$ are correctly classified.

In [240]:
print_tree(model) # use print_tree(model, N) to limit the depth of the printing

Feature 4 < 0.8 ?
├─ setosa : 41/41
└─ Feature 4 < 1.75 ?
    ├─ Feature 3 < 4.95 ?
        ├─ Feature 4 < 1.65 ?
            ├─ versicolor : 38/38
            └─ virginica : 1/1
        └─ Feature 4 < 1.55 ?
            ├─ virginica : 3/3
            └─ Feature 1 < 6.95 ?
                ├─ versicolor : 2/2
                └─ virginica : 1/1
    └─ virginica : 34/34
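To make the "if-else cascade" reading of the printed tree concrete, here is a hand-written translation of it into plain Julia. This is only an illustrative sketch: `classify_by_hand` is a hypothetical name, and it assumes that the first branch printed under a test is the one taken when the test is true.

```julia
# Hand-written translation of the printed tree (illustrative only).
# `x` holds one instance: (sepal length, sepal width, petal length, petal width).
function classify_by_hand(x::AbstractVector{<:Real})
    if x[4] < 0.8
        return "setosa"
    elseif x[4] < 1.75
        if x[3] < 4.95
            return x[4] < 1.65 ? "versicolor" : "virginica"
        elseif x[4] < 1.55
            return "virginica"
        else
            return x[1] < 6.95 ? "versicolor" : "virginica"
        end
    else
        return "virginica"
    end
end

classify_by_hand([5.1, 3.5, 1.4, 0.2])   # first instance of the dataset, a setosa
```

Each `return` corresponds to one of the seven leaves, and the nesting depth of the conditions matches the depth of the tree.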


In [241]:
using Statistics

y_pred = predict(model, X_test')
y_pred[1:5]

5-element Vector{String}:
 "versicolor"
 "setosa"
 "virginica"
 "versicolor"
 "virginica"

## Exercise: Write Your Confusion Matrix

It is a common practice to summarize the performance of a model in a *confusion matrix*,
containing the true positives and negatives found by our model on the testing data, as well as 
the false positives and negatives.

In the case of binary classification, a confusion matrix is shaped as follows.
$$
\begin{array}{c|c|c}
\text{Actual / Predicted} & \text{Positive} & \text{Negative} \\ \hline
\text{Positive} & TP & FN \\
\text{Negative} & FP & TN
\end{array}
$$

Among the many measures that can be derived from the matrix above, three are particularly important: accuracy, precision, and recall.
In the binary classification scenario, they are defined as follows.

$$\text{Accuracy} = \frac{TP + TN }{TP + FP + TN +FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$

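In the multi-class scenario, as in our case, we can compute precision and recall individually for each class, treating that class as "positive" and all the others as "negative".
To obtain a single scalar, we can average the per-class results (the so-called macro average).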
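As a quick check of the binary definitions, the sketch below computes the three measures from the four cells of a confusion matrix. The helper names and the counts are made up for illustration.

```julia
# Binary-classification measures from the four cells of the confusion matrix.
# Helper names and counts are illustrative assumptions, not part of any library.
binary_accuracy(tp, fp, tn, fn) = (tp + tn) / (tp + fp + tn + fn)
binary_precision(tp, fp)        = tp / (tp + fp)
binary_recall(tp, fn)           = tp / (tp + fn)

tp, fn, fp, tn = 8, 2, 1, 9

acc  = binary_accuracy(tp, fp, tn, fn)   # 17 / 20
prec = binary_precision(tp, fp)          # 8 / 9
rec  = binary_recall(tp, fn)             # 8 / 10
```

Precision looks down a column of the matrix (everything predicted positive), while recall looks along a row (everything actually positive).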

In [242]:
y_test[1:5]

5-element Vector{String}:
 "versicolor"
 "setosa"
 "virginica"
 "versicolor"
 "virginica"

In [243]:
"""
Return a confusion matrix where rows encode the true labels, and columns encode the predicted
values.
"""
function confusion_matrix(y_true, y_pred; labels=["versicolor", "virginica", "setosa"])
    if length(y_true) != length(y_pred)
        throw(ArgumentError("Length mismatch ($(length(y_true)), $(length(y_pred)))"))
    end

    nof_labels = length(labels)

    string_to_idx = Dict{String, Int}()
    for (i, label) in enumerate(labels)
        string_to_idx[label] = i
    end

    cmatrix = zeros(Int, nof_labels, nof_labels)

    for (true_value, predicted_value) in zip(y_true, y_pred)
        cmatrix[string_to_idx[true_value], string_to_idx[predicted_value]] += 1
    end

    return cmatrix
end

confusion_matrix

In [244]:
cmatrix = confusion_matrix(y_test, y_pred)

3×3 Matrix{Int64}:
 9   1  0
 0  11  0
 0   0  9

In [245]:
using LinearAlgebra: diag

"""
Return the accuracy encoded in the confusion matrix `cmatrix`.
"""
function accuracy(cmatrix::Matrix{Int})
    # correct predictions lie on the diagonal; divide them by the total number of instances
    return sum(diag(cmatrix)) / sum(cmatrix)
end

accuracy

In [246]:
"""
Return the macro-averaged precision over all the classes in `cmatrix`,
where rows encode the true labels and columns the predicted ones.
"""
function precision(cmatrix::Matrix{Int})
    nrows = size(cmatrix, 1)
    precisions = zeros(Float64, nrows)

    for i in 1:nrows
        true_positives = cmatrix[i, i]
        # column i holds every instance predicted as class i
        false_positives = sum(cmatrix[:, i]) - true_positives

        precisions[i] = true_positives / (false_positives + true_positives)
    end

    return sum(precisions) / nrows
end

precision

In [247]:
"""
Return the macro-averaged recall over all the classes in `cmatrix`,
where rows encode the true labels and columns the predicted ones.
"""
function recall(cmatrix::Matrix{Int})
    nrows = size(cmatrix, 1)
    recalls = zeros(Float64, nrows)

    for i in 1:nrows
        true_positives = cmatrix[i, i]
        # row i holds every instance whose true label is class i
        false_negatives = sum(cmatrix[i, :]) - true_positives

        recalls[i] = true_positives / (false_negatives + true_positives)
    end

    return sum(recalls) / nrows
end

recall

We can aggregate the (macro-averaged) precision and recall into a single value via their harmonic mean.
This measure is known as the *F1 score*.

In [248]:
"""
Compute the F1 score.
"""
function f1score(cmatrix::Matrix{Int})
    return f1score(precision(cmatrix), recall(cmatrix))
end
function f1score(precision::Float64, recall::Float64)
    return (2 * precision * recall) / (precision + recall)
end

f1score (generic function with 2 methods)

In [249]:
_accuracy = accuracy(cmatrix)
_precision = precision(cmatrix)
_recall = recall(cmatrix)
_f1score = f1score(_precision, _recall)

println("Accuracy: $(round(_accuracy; digits = 4))")
println("Precision: $(round(_precision; digits = 4))")
println("Recall: $(round(_recall; digits = 4))")
println("F1 Score: $(round(_f1score; digits = 4))")

Accuracy: 0.9667
Precision: 0.9722
Recall: 0.9667
F1 Score: 0.9694


## Hyperparameter Tuning

The arguments of `DecisionTreeClassifier(...)` are called *hyperparameters*: they are not learned from the data, but control how the learning algorithm builds a specific model (i.e., the if-else cascade we call a decision tree).

Which combination of hyperparameters should we provide?

In this rather lightweight example, we can systematically try many combinations and keep the one that achieves the best score.

This technique goes under the name of *grid search*.
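The grid is simply the Cartesian product of the candidate values of each hyperparameter. The short sketch below enumerates it with `Iterators.product` from Julia's standard library, mirroring the candidate values used in this section.

```julia
# Candidate values for each hyperparameter.
max_depths        = [3, 5, 7]
min_samples_leaf  = [1, 2, 5]
min_samples_split = [2, 4]

# Every combination of the three lists, as an array of tuples.
grid = collect(Iterators.product(max_depths, min_samples_leaf, min_samples_split))

length(grid)   # 3 * 3 * 2 = 18 combinations
first(grid)    # (3, 1, 2); the first list varies fastest
```

The number of combinations is the product of the list lengths, so the cost of a grid search grows quickly as hyperparameters or candidate values are added.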

In [250]:
max_depths = [3, 5, 7]
min_samples_leaf = [1, 2, 5]
min_samples_split = [2, 4]

best_score = 0.0
best_params = nothing
best_model = nothing

for (_max_depth, _min_samples_leaf, _min_samples_split) in Iterators.product(
    max_depths, min_samples_leaf, min_samples_split)
    
    model = DecisionTreeClassifier(
        max_depth = _max_depth,
        min_samples_leaf = _min_samples_leaf,
        min_samples_split = _min_samples_split
    )

    fit!(model, X_train', y_train)
   
    y_pred = predict(model, X_test')

    cm = confusion_matrix(y_test, y_pred)

    score = f1score(cm)

    if score > best_score
        best_score = score
        best_params = ((_max_depth, _min_samples_leaf, _min_samples_split))
        best_model = model
    end
end


In [251]:
println("Best parameterization: $(best_params)")
println("Corresponding F1 score: $(best_score)")

Best parameterization: (3, 1, 2)
Corresponding F1 score: 0.9694364851957975


# Learning with Sole.jl


## Tabular Datasets and Logisets

Symbolic AI treats tabular datasets, such as the iris dataset, as sets of propositional interpretations, on which formulas of propositional logic are interpreted.

Look at the classical tabular dataset $\mathcal{I}$ below. We denote instances by $I_j$ and *variables*$^{[1]}$ by $V_i$.

$$
\begin{array}{c|ccc}
 & V_1 & V_2 & V_3 \\ \hline
I_1 & 1.2 & [1,2,3] & \text{A} \\
I_2 & 1.3 & [9,7,6] & \text{B} \\
I_3 & 0.8 & [2,8,2] & \text{C} \\
I_4 & 1.1 & [1,3,7] & \text{B} \\
I_5 & 1.2 & [4,3,3] & \text{B} \\
\end{array}
$$

We can change our point of view on the table above from a statistical to a logical one; the resulting structure is called a *logiset*.

This change of perspective requires the definition of a propositional alphabet $\mathcal{P}$, whose letters are conditions on the variables.

Consider $\mathcal{P} = \{p, q, r\}$, with: 

$$p \coloneqq \text{max}(V_1) \geq 1$$
$$q \coloneqq \text{sum}(V_2) < 13$$
$$r \coloneqq V_3 = \text{B}$$

We indicate the truth constant with $\top$ (top), and the falsehood with $\bot$ (bot).

The resulting logiset $\mathcal{I}_\mathcal{P}$ is this one:

$$
\begin{array}{c|ccc}
 & p & q & r \\ \hline
I_1 & \top & \top & \bot \\
I_2 & \top & \bot & \top \\
I_3 & \bot & \top & \bot \\
I_4 & \top & \top & \top \\
I_5 & \top & \top & \top \\
\end{array}
$$

$^{[1]}$ We use the term "variable" to indicate, in general, a column of the tabular dataset: it could encode a raw attribute or a *feature* (a processed attribute).
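The construction of $\mathcal{I}_\mathcal{P}$ can be reproduced in a few lines of plain Julia. This sketch does not use SoleData; the names `instances`, `alphabet`, and `logiset` are illustrative assumptions, and each letter is modeled as a Boolean function of an instance.

```julia
# The toy dataset from the table above, one named tuple per instance.
instances = [
    (V1 = 1.2, V2 = [1, 2, 3], V3 = "A"),
    (V1 = 1.3, V2 = [9, 7, 6], V3 = "B"),
    (V1 = 0.8, V2 = [2, 8, 2], V3 = "C"),
    (V1 = 1.1, V2 = [1, 3, 7], V3 = "B"),
    (V1 = 1.2, V2 = [4, 3, 3], V3 = "B"),
]

# The alphabet P = {p, q, r}: each letter is a Boolean test on an instance.
alphabet = (
    p = I -> maximum(I.V1) >= 1,
    q = I -> sum(I.V2) < 13,
    r = I -> I.V3 == "B",
)
letters = [alphabet.p, alphabet.q, alphabet.r]

# Rows are instances, columns are p, q, r (true stands for ⊤, false for ⊥).
logiset = [letter(I) for I in instances, letter in letters]
```

The resulting 5×3 Boolean matrix matches the logiset table row by row.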


In [252]:
using SoleData

In [254]:
X = PropositionalLogiset(MLJBase.load_iris())

PropositionalLogiset (6.17 KBs)
├ # instances:                  150
├ # features:                   5
└ Table: (sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5.0, 5.9, 6.0, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6.0, 5.7, 5.5, 5.5, 5.8, 6.0, 5.4, 6.0, 6.7, 6.3, 5.6, 5.5, 5.5, 6.1, 5.8, 5.0, 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3