# An Example of a Classical Machine Learning Pipeline

This section describes a classical machine learning pipeline.

The dataset, called `iris`, is loaded through the [RDatasets](https://vincentarelbundock.github.io/Rdatasets/) library, which provides datasets originally distributed within the software environment of the `R` language.

We partition the data into a set of *instances* `X` and the corresponding *labels* `y`. Each instance is one element of the Cartesian product of the domains of the *attributes*.

We want to find a relation between the instance space (i.e., many examples of iris flowers) and the label space (i.e., the species to which each flower belongs).

We are going to train a (classification) decision tree, leveraging the `DecisionTree` library.

Later in the notebook, we will repeat the process using the `Sole.jl` library and more-than-propositional logic.

## Data Loading and Description

In [32]:
using RDatasets

iris = dataset("datasets", "iris")
first(iris, 5)

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [33]:
iris[1, 2]

3.5

In [34]:
iris[1, :SepalWidth]

3.5

In [35]:
iris[1:3, :SepalLength]

3-element Vector{Float64}:
 5.1
 4.9
 4.7

In [36]:
describe(iris)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | 0 | Float64 |
| 2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | 0 | Float64 |
| 3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | 0 | Float64 |
| 4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | 0 | Float64 |
| 5 | Species | | setosa | | virginica | 0 | CategoricalValue{String, UInt8} |


## Data Preprocessing

In the limited scenario of this exercise, there is not much room for complex preprocessing of our data. For example, we are not dealing with unbalanced classes, missing data, or complex encodings.

In the cell below, we simply separate the attributes (`X`) from the target column (`y`), which encodes the class we want to learn to predict.

In [37]:
X = Matrix(iris[:, 1:4])  # the four numeric attributes
y = iris[:, :Species]     # the target column (class labels)

println("The attributes of the first three instances are:")
X[1:3, :]

The attributes of the first three instances are:


3×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2

Classes are encoded as `CategoricalValue`s for efficiency. Instead of repeating one string (e.g., "setosa") many times, each class is essentially stored as a small integer (here, a `UInt8`) that is mapped to a string value.
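As a rough illustration of the idea (not of the actual internals of `CategoricalArrays`), the sketch below stores each label as a small integer code pointing into a pool of distinct strings. The names `pool`, `codes`, and `decoded` are made up for this example.

```julia
# Toy labels; in `iris` each class name is repeated 50 times.
labels = ["setosa", "versicolor", "setosa", "virginica", "versicolor"]

# Pool of distinct class names, in order of first appearance.
pool = unique(labels)

# One small integer per label: the position of its class name in the pool.
codes = UInt8[findfirst(==(l), pool) for l in labels]

# Decoding maps every code back to its string.
decoded = pool[codes]
```

Only the three strings in `pool` are stored in full; every other occurrence costs a single byte.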

In [38]:
iris[:, :Species][[1,51,101]]

3-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "versicolor"
 "virginica"

## Training and Testing

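We further partition the data into a training and a testing bucket. Ideally, both buckets preserve the class proportions of the whole dataset; since the rows of `iris` are sorted by species, this requires shuffling (or stratifying) the rows before splitting.

With this distinction, we can train a model on the training data and use the testing data to simulate a real-world scenario, obtaining a reliable estimate of the model's performance.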

In [39]:
using Random
Random.seed!(1605) # for reproducibility

TaskLocalRNG()

In [51]:
using MLUtils

# nrow(iris) is length(y), or first(size(X))
# note: splitobs does not shuffle by default, so the first 80% of the rows go to training
train_idx, test_idx = splitobs(nrow(iris); at=0.8)

X_train, y_train = X[train_idx, :], y[train_idx];
X_test, y_test  = X[test_idx, :], y[test_idx];
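Because the rows of `iris` are ordered by species, an in-order 80/20 split leaves only one class in the testing bucket. One possible way to keep the classes balanced is a shuffled, per-class (stratified) split. The sketch below uses only the `Random` standard library; `stratified_split` is a hypothetical helper written for this example, not a function of `MLUtils`, and the toy labels merely mimic the row order of `iris`.

```julia
using Random

# Hypothetical helper: shuffle the indices of each class separately and send
# a fraction `at` of them to training, the rest to testing.
function stratified_split(labels; at = 0.8, rng = Random.default_rng())
    train_idx, test_idx = Int[], Int[]
    for class in unique(labels)
        idx = shuffle(rng, findall(==(class), labels))
        n_train = round(Int, at * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    # Shuffle again so the buckets are not grouped by class.
    return shuffle(rng, train_idx), shuffle(rng, test_idx)
end

# Toy labels sorted by class, 50 per class, like the rows of `iris`.
toy_labels = repeat(["setosa", "versicolor", "virginica"], inner = 50)
tr, te = stratified_split(toy_labels; at = 0.8, rng = MersenneTwister(1605))
```

With these toy labels, each class contributes 40 indices to `tr` and 10 to `te`. The same indices could then be used as `X[tr, :]` and `y[tr]`.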

In [54]:
using DecisionTree

model = DecisionTreeClassifier(
    max_depth = 5,
    min_samples_leaf = 1,
    min_samples_split = 2
)

fit!(model, X_train, y_train)

DecisionTreeClassifier
max_depth:                5
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["setosa", "versicolor", "virginica"]
root:                     Decision Tree
Leaves: 7
Depth:  4

When printing a decision tree with `DecisionTree.print_tree`, the $M / N$ to the right of a leaf $l$
encodes the fact that $N$ training instances satisfy all the conditions on the path from the root of the tree to $l$ and, among those, $M$ are correctly classified.

In [None]:
print_tree(model) # use print_tree(model, N) to limit the depth of the printing

Feature 3 < 2.45 ?
├─ setosa : 50/50
└─ Feature 3 < 4.95 ?
    ├─ Feature 1 < 4.95 ?
        ├─ Feature 4 < 1.35 ?
            ├─ versicolor : 1/1
            └─ virginica : 1/1
        └─ versicolor : 47/47
    └─ Feature 4 < 1.75 ?
        ├─ Feature 4 < 1.55 ?
            ├─ virginica : 1/1
            └─ versicolor : 2/2
        └─ virginica : 18/18
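The printed tree can be read as a cascade of nested conditions. As a hand-written sketch, the function below transcribes the output above; the name `classify_by_hand` is hypothetical, and the transcription assumes that the first child printed under each test is the branch taken when the test holds.

```julia
# Manual transcription of the printed tree; x holds the four attributes
# in column order (SepalLength, SepalWidth, PetalLength, PetalWidth).
function classify_by_hand(x)
    if x[3] < 2.45
        return "setosa"
    elseif x[3] < 4.95
        if x[1] < 4.95
            return x[4] < 1.35 ? "versicolor" : "virginica"
        else
            return "versicolor"
        end
    else
        if x[4] < 1.75
            return x[4] < 1.55 ? "virginica" : "versicolor"
        else
            return "virginica"
        end
    end
end

# First row of the dataset: petal length 1.4 < 2.45, so the first leaf is reached.
classify_by_hand([5.1, 3.5, 1.4, 0.2])
```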


In [64]:
using Statistics

y_pred = predict(model, X_test)
y_pred[1:5]

5-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "virginica"
 "versicolor"
 "virginica"
 "versicolor"
 "virginica"

## Exercise: Write Your Confusion Matrix

It is common practice to summarize the performance of a model in a *confusion matrix*,
which contains the true positives and true negatives found by our model on the testing data, as well as
the false positives and false negatives.

In the case of binary classification, a confusion matrix is shaped as below.
$$
\begin{array}{c|c|c}
\text{Actual / Predicted} & \text{Positive} & \text{Negative} \\ \hline
\text{Positive} & TP & FN \\
\text{Negative} & FP & TN
\end{array}
$$

Three important measures, among many others, can be derived from the matrix above: accuracy, precision and recall.

$$\text{Accuracy} = \frac{TP + TN }{TP + FP + TN +FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
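As a small worked example of these measures, the snippet below plugs in made-up counts for a binary problem. The numbers are purely illustrative and do not come from the model trained above; the variable names `acc`, `prec`, and `rec` are also just choices for this sketch.

```julia
# Hypothetical counts, chosen only for illustration.
TP, FN, FP, TN = 40, 10, 5, 45

acc  = (TP + TN) / (TP + FP + TN + FN)  # 85 / 100
prec = TP / (TP + FP)                   # 40 / 45
rec  = TP / (TP + FN)                   # 40 / 50
```

Here precision is higher than recall: few of the predicted positives are wrong, but a fifth of the actual positives are missed.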

In [None]:
function confusion_matrix(y_true, y_pred)

    # return _confusion_matrix, accuracy, precision, recall 
end

In [None]:
# use the confusion_matrix defined above to print the performances of the trained model

## Hyperparameter Tuning

The arguments of `DecisionTreeClassifier(...)` are called *hyperparameters*: they are the meta-parameters that control how the learning algorithm builds a specific model (i.e., the if-else cascade we call a decision tree).

Which combination of hyperparameters should we provide?

In this rather lightweight example, we can systematically try many combinations and keep the one that achieves the best performance.

This technique goes under the name of *grid search*.
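A minimal sketch of the enumeration step, using `Iterators.product` from Julia's standard library, could look as follows. The grid values, the helper `grid_search`, and the scoring function `toy_score` are all assumptions made for illustration; in practice the score would come from training a model with the given hyperparameters and evaluating it on held-out data.

```julia
# Hypothetical grid: the candidate values are illustrative, not recommendations.
grid = (
    max_depth = [2, 3, 5],
    min_samples_leaf = [1, 5],
    min_samples_split = [2, 10],
)

# Placeholder score (assumption): stands in for "train a model, then evaluate it".
toy_score(p) = -abs(p.max_depth - 3) - 0.1 * p.min_samples_leaf - 0.01 * p.min_samples_split

# Try every combination in the grid and keep the one with the highest score.
function grid_search(score, grid::NamedTuple)
    best_params, best_score = nothing, -Inf
    for combo in Iterators.product(values(grid)...)
        params = NamedTuple{keys(grid)}(combo)
        s = score(params)
        if s > best_score
            best_params, best_score = params, s
        end
    end
    return best_params, best_score
end

best_params, best_score = grid_search(toy_score, grid)
```

The number of combinations is the product of the lengths of the candidate lists (here 3 × 2 × 2 = 12), which is why an exhaustive search is only practical for small grids.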

In [None]:
# TODO implement the grid search leveraging Iterators

# Learning with Sole.jl


In [None]:
# TODO: see Day1-Appetizer.ipynb