# Classical Machine Learning Pipeline

This section describes a classical machine learning pipeline.

The go-to tool for implementing machine learning pipelines in Julia is the [MLJ](https://juliaai.github.io/MLJ.jl/stable/) package.


We are going to work with the `iris` dataset, trying to discover a relation between the *attribute* values of each instance, `X`, and the corresponding *labels*, `y`.

In other words, we want to relate the attribute values of an iris flower to the species to which that flower belongs.

To do so, we train a (classification) decision tree with the `DecisionTree` library, which can be integrated within an `MLJ` pipeline.

Later in the notebook, we will repeat the process using the `Sole.jl` library and more-than-propositional logic.

In [1]:
using Pkg
Pkg.activate("..")
Pkg.instantiate()
Pkg.update()

  Activating project at `~/.julia/dev/logic-and-machine-learning`
Precompiling project...
    932.6 ms  ✓ TableShowUtils
   1680.3 ms  ✓ BenchmarkTools
   1302.3 ms  ✓ QueryOperators
   2741.6 ms  ✓ TimeseriesFeatures
   1179.1 ms  ✓ Query
   1201.9 ms  ✓ TimeseriesFeatures → StatsBaseExt
   5353.5 ms  ✓ Catch22
   7564.9 ms  ✓ MultiData
  10731.6 ms  ✓ SoleData
  61150.8 ms  ✓ Plots
   2968.4 ms  ✓ Plots → FileIOExt
  11 dependencies successfully precompiled in 66 seconds. 379 already precompiled.
  1 dependency had output during precompilation:
┌ SoleData
└  
    Updating registry at `~/.julia/registries/General`
┌ Info: The General registry is installed via git. Consider reinstalling it via
│ the newer faster direct from

In [2]:
# for reproducibility purposes
using Random
Random.seed!(1605)

TaskLocalRNG()

## Data Loading and Description

In [3]:
using MLJ
using RDatasets


data = RDatasets.dataset("datasets", "iris");

In [4]:
schema(data)

┌─────────────┬───────────────┬─────────────────────────────────┐
│ names       │ scitypes      │ types                           │
├─────────────┼───────────────┼─────────────────────────────────┤
│ SepalLength │ Continuous    │ Float64                         │
│ SepalWidth  │ Continuous    │ Float64                         │
│ PetalLength │ Continuous    │ Float64                         │
│ PetalWidth  │ Continuous    │ Float64                         │
│ Species     │ Multiclass{3} │ CategoricalValue{String, UInt8} │
└─────────────┴───────────────┴─────────────────────────────────┘


In [5]:
data

 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼───────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
   6 │         5.4         3.9          1.7         0.4  setosa
   7 │         4.6         3.4          1.4         0.3  setosa
   8 │         5.0         3.4          1.5         0.2  setosa
   9 │         4.4         2.9          1.4         0.2  setosa
  10 │         4.9         3.1          1.5         0.1  setosa


In [6]:
y, X = unpack(data, ==(:Species))

(CategoricalArrays.CategoricalValue{String, UInt8}["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"], 150×4 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth
     │ Float64      Float64     Float64      Float64
─────┼──────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2
   2 │         4.9         3.0          1.4         0.2
   3 │         4.7         3.2          1.3         0.2
   4 │         4.6         3.1          1.5         0.2
   5 │         5.0         3.6          1.4         0.2
   6 │         5.4         3.9          1.7         0.4
   7 │         4.6         3.4          1.4         0.3
   8 │         5.0         3.4          1.5 

In [7]:
# categorical vectors are lighter than raw vectors; can you guess why?
typeof(y)

CategoricalVector{String, UInt8, String, CategoricalValue{String, UInt8}, Union{}} (alias for CategoricalArrays.CategoricalArray{String, 1, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}})

In [8]:
typeof(X)

DataFrame

In [9]:
# check whether the classes are balanced by counting the instances of each one
for class in unique(y)
    println("$(class) - $(count(yi -> yi == class, y))")
end

setosa - 50
versicolor - 50
virginica - 50


## Data Preprocessing

In the limited scenario of this exercise, there is little room for complex preprocessing of our data: for example, we are not dealing with unbalanced classes, missing data, or complex encodings.

The usual workflow at this point is to partition the data into a training set and a test set, keeping the class proportions balanced in both.

With this split, we can train a model on the training data and use the test data to simulate a real-world scenario, obtaining a reliable estimate of the model's performance.

Actually, MLJ makes our work *much* easier, and even provides a more sophisticated training strategy, as we will see later.
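
As a rough illustration of the train/test partitioning described above, here is a minimal sketch that relies only on Julia's standard library. The helper `stratified_split` and the toy `labels` vector are hypothetical names introduced for this example; they are not part of MLJ, which has its own partitioning utilities.

```julia
using Random

# Hypothetical helper: split the indices of `labels` into a training and a
# testing bucket, so that each class keeps roughly the same proportion in both.
function stratified_split(labels::AbstractVector, train_fraction::Real;
                          rng=Random.default_rng())
    train, test = Int[], Int[]
    for class in unique(labels)
        # indices of the instances belonging to the current class, shuffled
        idx = shuffle(rng, findall(==(class), labels))
        n_train = round(Int, train_fraction * length(idx))
        append!(train, idx[1:n_train])
        append!(test, idx[n_train+1:end])
    end
    return sort(train), sort(test)
end

# toy labels mimicking three balanced classes of 50 instances each
labels = repeat(["setosa", "versicolor", "virginica"], inner=50)
train, test = stratified_split(labels, 0.8; rng=MersenneTwister(1605))

println(length(train), " training instances, ", length(test), " testing instances")
```

Splitting class by class is what keeps the buckets balanced: a plain random split of all 150 indices could, by chance, leave one class under-represented in the testing bucket.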

## Model Training

We will integrate an external model, coming from the `DecisionTree` package, into the MLJ workflow.

In the next lessons, we will do something similar with another model called `ModalDecisionTree`.

In [73]:
try
    DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
catch
    println("The DecisionTreeClassifier symbol has already been imported.")
end

import MLJDecisionTreeInterface ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /home/mauro/.julia/packages/MLJModels/BfLy4/src/loading.jl:159


MLJDecisionTreeInterface.DecisionTreeClassifier

In [74]:
model = MLJDecisionTreeInterface.DecisionTreeClassifier(
    max_depth=5, 
    min_samples_leaf=1, 
    min_samples_split=2
)

DecisionTreeClassifier(
  max_depth = 5, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = 0, 
  post_prune = false, 
  merge_purity_threshold = 1.0, 
  display_depth = 5, 
  feature_importance = :impurity, 
  rng = TaskLocalRNG())

A *machine* is a binding between a model and the data it works with.

It also keeps track of other information we might want to inspect, such as the specific parameters learned by the model.

In the cell below, we bind the decision tree model to all the instances we have available. This is not a smart idea, but we will return to the topic in a moment.

In [87]:
mach = machine(model, X, y)

untrained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 5, …)
  args: 
    1:	Source @095 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @096 ⏎ AbstractVector{Multiclass{3}}


In [88]:
fit!(mach)

┌ Info: Training machine(DecisionTreeClassifier(max_depth = 5, …), …).
└ @ MLJBase /home/mauro/.julia/packages/MLJBase/yVJvJ/src/machines.jl:499


trained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 5, …)
  args: 
    1:	Source @095 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @096 ⏎ AbstractVector{Multiclass{3}}


In [89]:
y_predict_probabilities = MLJ.predict(mach, X)
y_predict = mode.(y_predict_probabilities)

150-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [97]:
fitted_params(mach).tree

PetalLength < 2.45
├─ setosa (50/50)
└─ PetalWidth < 1.75
   ├─ PetalLength < 4.95
   │  ├─ PetalWidth < 1.65
   │  │  ├─ versicolor (47/47)
   │  │  └─ virginica (1/1)
   │  └─ PetalWidth < 1.55
   │     ├─ virginica (3/3)
   │     └─ PetalLength < 5.45
   │        ├─ versicolor (2/2)
   │        └─ virginica (1/1)
   └─ PetalLength < 4.85
      ├─ SepalWidth < 3.1
      │  ├─ virginica (2/2)
      │  └─ versicolor (1/1)
      └─ virginica (43/43)


## Confusion Matrix and Overfitting 

It is a common practice to summarize the performance of a model in a *confusion matrix*,
containing the true positives and negatives found by our model on the testing data, as well as 
the false positives and negatives.

In the case of binary classification, a confusion matrix is shaped as follows, with the predicted classes on the rows and the actual classes on the columns.
$$
\begin{array}{c|c|c}
\text{Predicted / Actual} & \text{Positive} & \text{Negative} \\ \hline
\text{Positive} & TP & FP \\
\text{Negative} & FN & TN
\end{array}
$$

Many measures can be derived from the matrix above; three important ones are accuracy, precision, and recall.
In the binary classification scenario, they are defined as follows.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$

In the multi-class scenario, as in our case, we can compute precision and recall individually for each class.
To obtain a single scalar, we can average the per-class results.
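
To make the multi-class case concrete, the sketch below computes accuracy, per-class precision and recall, and their averages from a 3×3 confusion matrix in plain Julia. The counts are illustrative, the variable names are ours, and the matrix is assumed to have predictions on the rows and ground truth on the columns; this is not a replacement for the measures that MLJ provides.

```julia
# Illustrative counts: rows are predictions, columns are ground truth.
cm = [50  0  0;
       0 47  1;
       0  3 49]

classes = ["setosa", "versicolor", "virginica"]

# correct predictions lie on the diagonal
correct = [cm[i, i] for i in axes(cm, 1)]

# accuracy: correct predictions over all predictions
accuracy_value = sum(correct) / sum(cm)

# precision of a class: correct predictions over everything predicted as that class (row total)
precision_per_class = correct ./ vec(sum(cm, dims=2))

# recall of a class: correct predictions over everything that truly is that class (column total)
recall_per_class = correct ./ vec(sum(cm, dims=1))

# averaging the per-class values yields a single scalar for each measure
macro_precision = sum(precision_per_class) / length(classes)
macro_recall = sum(recall_per_class) / length(classes)

for (c, p, r) in zip(classes, precision_per_class, recall_per_class)
    println(rpad(c, 12), "precision = ", round(p, digits=3), "  recall = ", round(r, digits=3))
end
println("accuracy = ", round(accuracy_value, digits=3))
```

For each class, the "positive" label of the binary formulas is that class and the "negative" label is every other class, which is why precision reads along a row and recall along a column.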

In [95]:
cm = confusion_matrix(y_predict, y)

           ┌────────────────────────────────┐
           │          Ground Truth          │
┌──────────┼──────────┬──────────┬──────────┤
│Predicted │  setosa  │versicol… │virginica │
├──────────┼──────────┼──────────┼──────────┤
│  setosa  │    50    │    0     │    0     │
├──────────┼──────────┼──────────┼──────────┤
│versicol… │    0     │    50    │    0     │
├──────────┼──────────┼──────────┼──────────┤
│virginica │    0     │    0     │    50    │
└──────────┴──────────┴──────────┴──────────┘


In [None]:
# wow! our model is so good!
accuracy(cm)

1.0

How awful! The model we just trained is bad, for sure.

Can you tell why? If so, can you also provide a graphical sketch of the problem?

## Model Evaluation

In [119]:
acc = evaluate!(
    mach,
    resampling=StratifiedCV(; nfolds = 5, shuffle=true),
    measures=[accuracy]
)

PerformanceEvaluation object with these fields:
  model, tag, measure, operation,
  measurement, uncertainty_radius_95, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Tag: DecisionTreeClassifier-499
Extract:
┌────────────┬──────────────┬─────────────┐
│ measure    │ operation    │ measurement │
├────────────┼──────────────┼─────────────┤
│ Accuracy() │ predict_mode │ 0.953       │
└────────────┴──────────────┴─────────────┘
┌─────────────────────────────────┬─────────┐
│ per_fold                        │ 1.96*SE │
├─────────────────────────────────┼─────────┤
│ [0.967, 0.933, 0.967, 1.0, 0.9] │ 0.0372  │
└─────────────────────────────────┴─────────┘


## Training with Hyperparameters Tuning

The arguments of `DecisionTreeClassifier(...)` are called *hyperparameters*: they are the meta-parameters that steer how the learning algorithm builds a specific model (here, the if-else cascade we call a decision tree).

Which combination of hyperparameters should we provide?

In this rather lightweight example, we can systematically try many combinations and keep the one that achieves the best performance.

This technique goes under the name of *grid search*.
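
The idea can be sketched in a few lines of plain Julia. The candidate values below reuse the bounds of the ranges defined in this notebook, while `toy_score` is a hypothetical stand-in for "train the model with cross-validation and return its mean accuracy"; in the real pipeline that work is delegated to MLJ.

```julia
# candidate values for each hyperparameter
max_depths = 2:10
min_samples_leafs = 1:5
min_samples_splits = 2:10

# Hypothetical scoring function, used only to make the sketch runnable:
# it peaks at (4, 2, 6) and decreases as we move away from that point.
toy_score(max_depth, min_samples_leaf, min_samples_split) =
    -abs(max_depth - 4) - abs(min_samples_leaf - 2) - abs(min_samples_split - 6)

# the grid is the Cartesian product of the candidate values
grid = vec(collect(Iterators.product(max_depths, min_samples_leafs, min_samples_splits)))

# exhaustive search: score every combination and keep the best one
best = argmax(combo -> toy_score(combo...), grid)

println("combinations tried: ", length(grid))
println("best combination:   ", best)
```

With 9, 5, and 9 candidate values, the grid holds 9 · 5 · 9 = 405 combinations, the same number of models reported in the tuning log further down. The cost grows multiplicatively with every added hyperparameter, which is why an exhaustive grid is only practical on small examples like this one.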

In [125]:
max_depth_range = range(Int, :max_depth, lower=2, upper=10)
min_samples_leaf_range = range(Int, :min_samples_leaf, lower=1, upper=5)
min_samples_split_range = range(Int, :min_samples_split, lower=2, upper=10)

NumericRange(2 ≤ min_samples_split ≤ 10; origin=6.0, unit=4.0)

In [134]:
tuned_tree = TunedModel(
    model = MLJDecisionTreeInterface.DecisionTreeClassifier(),
    resampling = StratifiedCV(nfolds = 5, shuffle = true),
    range = [max_depth_range, min_samples_leaf_range, min_samples_split_range],
    measure = accuracy,
    tuning = Grid()
)

ProbabilisticTunedModel(
  model = DecisionTreeClassifier(
        max_depth = -1, 
        min_samples_leaf = 1, 
        min_samples_split = 2, 
        min_purity_increase = 0.0, 
        n_subfeatures = 0, 
        post_prune = false, 
        merge_purity_threshold = 1.0, 
        display_depth = 5, 
        feature_importance = :impurity, 
        rng = TaskLocalRNG()), 
  tuning = Grid(
        goal = nothing, 
        resolution = 10, 
        shuffle = true, 
        rng = TaskLocalRNG()), 
  resampling = StratifiedCV(
        nfolds = 5, 
        shuffle = true, 
        rng = TaskLocalRNG()), 
  measure = Accuracy(), 
  weights = nothing, 
  class_weights = nothing, 
  operation = nothing, 
  range = MLJBase.NumericRange{Int64, MLJBase.Bounded, Symbol}[NumericRange(2 ≤ max_depth ≤ 10; origin=6.0, unit=4.0), NumericRange(1 ≤ min_samples_leaf ≤ 5; origin=3.0, unit=2.0), NumericRange(2 ≤ min_samples_split ≤ 10; origin=6.0, unit=4.0)], 
  selection_heuristic = MLJTuning.NaiveSel

In [None]:
# find the best model, exploring different hyperparameterizations with cross validation
mach = machine(tuned_tree, X, y)
fit!(mach)

┌ Info: Training machine(ProbabilisticTunedModel(model = DecisionTreeClassifier(max_depth = -1, …), …), …).
└ @ MLJBase /home/mauro/.julia/packages/MLJBase/yVJvJ/src/machines.jl:499
┌ Info: Attempting to evaluate 405 models.
└ @ MLJTuning /home/mauro/.julia/packages/MLJTuning/xiLEY/src/tuned_models.jl:762


trained Machine; does not cache data
  model: ProbabilisticTunedModel(model = DecisionTreeClassifier(max_depth = -1, …), …)
  args: 
    1:	Source @010 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @599 ⏎ AbstractVector{Multiclass{3}}


In [141]:
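# recompute the predictions with the tuned machine; otherwise `y_predict`
# would still hold the predictions of the first, untuned tree
y_predict = predict_mode(mach, X)
confusion_matrix(y_predict, y)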

           ┌────────────────────────────────┐
           │          Ground Truth          │
┌──────────┼──────────┬──────────┬──────────┤
│Predicted │  setosa  │versicol… │virginica │
├──────────┼──────────┼──────────┼──────────┤
│  setosa  │    50    │    0     │    0     │
├──────────┼──────────┼──────────┼──────────┤
│versicol… │    0     │    47    │    1     │
├──────────┼──────────┼──────────┼──────────┤
│virginica │    0     │    3     │    49    │
└──────────┴──────────┴──────────┴──────────┘


# Learning with Sole.jl


## Tabular Datasets and Logisets

Symbolic AI treats tabular datasets, such as the iris dataset, as sets of propositional interpretations, on which formulas of propositional logic are evaluated.

Consider the classical tabular dataset $\mathcal{I}$ below. We denote instances by $I_1, \dots, I_5$ and *variables*$^{[1]}$ by $V_1, V_2, V_3$.

$$
\begin{array}{c|ccc}
 & V_1 & V_2 & V_3 \\ \hline
I_1 & 1.2 & [1,2,3] & \text{A} \\
I_2 & 1.3 & [9,7,6] & \text{B} \\
I_3 & 0.8 & [2,8,2] & \text{C} \\
I_4 & 1.1 & [1,3,7] & \text{B} \\
I_5 & 1.2 & [4,3,3] & \text{B} \\
\end{array}
$$

We can shift our point of view on the table above from a statistical one to a logical one; the resulting structure is called a *logiset*.

This shift requires the definition of a propositional alphabet $\mathcal{P}$.

Consider $\mathcal{P} = \{p, q, r\}$, with: 

$$p \coloneqq \text{max}(V_1) \geq 1$$
$$q \coloneqq \text{sum}(V_2) < 13$$
$$r \coloneqq V_3 = \text{B}$$

We indicate the truth constant with $\top$ (top), and the falsehood with $\bot$ (bot).

The resulting (propositional) logiset $\mathcal{I}_\mathcal{P}$ is this one:

$$
\begin{array}{c|ccc}
 & p & q & r \\ \hline
I_1 & \top & \top & \bot \\
I_2 & \top & \bot & \top \\
I_3 & \bot & \top & \bot \\
I_4 & \top & \top & \top \\
I_5 & \top & \top & \top \\
\end{array}
$$

$^{[1]}$ We use the term "variable" to indicate, in general, a column of the tabular dataset: it could encode a raw attribute or a *feature* (a processed attribute).
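
The construction of the logiset from the toy table can be reproduced in plain Julia, as in the sketch below. This is only an illustration of the idea: the names `instances`, `alphabet`, and `logiset` are ours, and the code does not use the `SoleData` API.

```julia
# the toy dataset from the table above, one named tuple per instance
instances = [
    (V1 = 1.2, V2 = [1, 2, 3], V3 = "A"),
    (V1 = 1.3, V2 = [9, 7, 6], V3 = "B"),
    (V1 = 0.8, V2 = [2, 8, 2], V3 = "C"),
    (V1 = 1.1, V2 = [1, 3, 7], V3 = "B"),
    (V1 = 1.2, V2 = [4, 3, 3], V3 = "B"),
]

# the propositional alphabet: each letter is a Boolean test on an instance
alphabet = (
    p = I -> maximum(I.V1) >= 1,
    q = I -> sum(I.V2) < 13,
    r = I -> I.V3 == "B",
)

# evaluate every letter on every instance: one row of truth values per instance
logiset = [map(letter -> letter(I), alphabet) for I in instances]

for (i, row) in enumerate(logiset)
    println("I_", i, ": ", row)
end
```

Each printed row corresponds to one row of $\mathcal{I}_\mathcal{P}$, with `true` standing for $\top$ and `false` for $\bot$.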


In [30]:
using SoleData

In [31]:
X = PropositionalLogiset(MLJBase.load_iris())

PropositionalLogiset (6.17 KBs)
├ # instances:                  150
├ # features:                   5
└ Table: (sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, 7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5.0, 5.9, 6.0, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6.0, 5.7, 5.5, 5.5, 5.8, 6.0, 5.4, 6.0, 6.7, 6.3, 5.6, 5.5, 5.5, 6.1, 5.8, 5.0, 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3