# Symbolic Machine Learning

The standard way to implement a machine learning pipeline in Julia is the
[MLJ.jl](https://juliaai.github.io/MLJ.jl/stable/) package.

We are going to work with the `iris` dataset, trying to discover the relation
between the attribute values of an iris flower and the species to which that
flower belongs.

More generally, we want to find the relation between the values of the
*attributes* of each instance (`X`) and the corresponding *labels* (`y`). 

In order to do so, we are going to train a (classification) decision tree,
leveraging the `DecisionTree` package, which can be easily integrated within an
`MLJ` pipeline.

Later in this notebook, we will repeat this process leveraging the `Sole.jl`
library, which will allow us to explicitly model the problem through the lens of
logic.

In [1]:
using Pkg
Pkg.activate("..")
Pkg.instantiate()
Pkg.update()

[32m[1m  Activating[22m[39m project at `~/.julia/dev/logic-and-machine-learning`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General`
┌ Info: The General registry is installed via git. Consider reinstalling it via
│ the newer faster direct from tarball format by running:
│   pkg> registry rm General; registry add General
│ 
└ @ Pkg.Registry /home/mauro/.julia/juliaup/julia-1.11.8+0.x64.linux.gnu/share/julia/stdlib/v1.11/Pkg/src/Registry/Registry.jl:478
[32m[1m    Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General`
[32m[1m   Installed[22m[39m Revise ─ v3.13.1
[32m[1m  No Changes[22m[39m to `~/.julia/dev/logic-and-machine-learning/Project.toml`
[32m[1m    Updating[22m[39m `~/.julia/dev/logic-and-machine-learning/Manifest.toml`
  [90m[295af30f] [39m[93m↑ Revise v3.13.0 ⇒ v3.13.1[39m
[92m[1mPrecompiling[22m[39m project...
  11325.7 ms[32m  ✓ [39m[90mRevise[39m
   1164.4 ms[32m  ✓ [39m[90mRevise → DistributedExt[39m

In [2]:
# for reproducibility purposes
using Random
Random.seed!(1605)

TaskLocalRNG()
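
Fixing the seed matters because shuffling and resampling draw random numbers. The sketch below illustrates the idea with two explicit generators instead of the global one, so that running it does not disturb the seed set above; the names `rng_a`, `rng_b`, `same_draws` and `same_shuffle` are illustrative.

```julia
using Random

# Two independent generators seeded with the same value (1605, as above)
rng_a = Xoshiro(1605)
rng_b = Xoshiro(1605)

# Identical seeds produce identical sequences of draws
draws_a = rand(rng_a, 3)
draws_b = rand(rng_b, 3)
same_draws = draws_a == draws_b

# The same holds for any operation driven by the generator, such as shuffling
same_shuffle = shuffle(Xoshiro(1605), 1:10) == shuffle(Xoshiro(1605), 1:10)
```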

## Learning with MLJ.jl

### Data Loading and Description

In [3]:
using MLJ
using RDatasets # used to load the iris dataset


data = RDatasets.dataset("datasets", "iris");

In [4]:
schema(data)

┌─────────────┬───────────────┬─────────────────────────────────┐
│ names       │ scitypes      │ types                           │
├─────────────┼───────────────┼─────────────────────────────────┤
│ SepalLength │ Continuous    │ Float64                         │
│ SepalWidth  │ Continuous    │ Float64                         │
│ PetalLength │ Continuous    │ Float64                         │
│ PetalWidth  │ Continuous    │ Float64                         │
│ Species     │ Multiclass{3} │ CategoricalValue{String, UInt8} │
└─────────────┴───────────────┴─────────────────────────────────┘


In [5]:
data

 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
   6 │         5.4         3.9          1.7         0.4  setosa
   7 │         4.6         3.4          1.4         0.3  setosa
   8 │         5.0         3.4          1.5         0.2  setosa
   9 │         4.4         2.9          1.4         0.2  setosa
  10 │         4.9         3.1          1.5         0.1  setosa


In [6]:
y, X = unpack(data, ==(:Species))

(CategoricalArrays.CategoricalValue{String, UInt8}["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"], 150×4 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth 
     │ Float64      Float64     Float64      Float64    
─────┼──────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2
   2 │         4.9         3.0          1.4         0.2
   3 │         4.7         3.2          1.3         0.2
   4 │         4.6         3.1          1.5         0.2
   5 │         5.0         3.6          1.4         0.2
   6 │         5.4         3.9          1.7         0.4
   7 │         4.6         3.4          1.4         0.3
   8 │         5.0         3.4          1.5 

In [7]:
# categorical vectors are lighter than raw vectors; can you guess why?
typeof(y)

CategoricalVector{String, UInt8, String, CategoricalValue{String, UInt8}, Union{}} (alias for CategoricalArrays.CategoricalArray{String, 1, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}})

In [8]:
typeof(X)

DataFrame

In [9]:
# check that the classes are balanced by counting the instances of each one
for class in unique(y)
    println("$(class) - $(count(yi -> yi == class, y))")
end

setosa - 50
versicolor - 50
virginica - 50


### Data Preprocessing

In the limited scenario of this exercise, there is little room for complex
preprocessing of our data. For example, we are not dealing with unbalanced
classes, missing data, or complex encodings.

The usual workflow, at this point, is to partition the data into a training set
and a test set, keeping the classes equally represented in both.

With this distinction, we can train a model on the training set and use the
test set to simulate a real-world scenario, obtaining a reliable estimate of
the model's performance.
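
A minimal sketch of such a class-preserving (stratified) split, written in plain Julia to show the idea. The helper `stratified_split` and the toy labels `y_toy` are hypothetical names introduced here for illustration; they are not part of MLJ.

```julia
using Random

# Hypothetical helper (not an MLJ function): split the indices of `y` so that
# each class contributes the same fraction of its instances to the training set.
function stratified_split(y::AbstractVector, fraction::Real; rng = Random.default_rng())
    train, test = Int[], Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))    # shuffled positions of this class
        ntrain = round(Int, fraction * length(idx))  # how many go to the training set
        append!(train, idx[1:ntrain])
        append!(test, idx[ntrain+1:end])
    end
    return sort(train), sort(test)
end

# Toy labels: three classes with 10 instances each
y_toy = repeat(["a", "b", "c"], inner = 10)
train, test = stratified_split(y_toy, 0.7; rng = Xoshiro(42))
```

With a 0.7 fraction, each class places 7 of its 10 instances in `train` and the remaining 3 in `test`, so both parts stay balanced.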

MLJ makes our work *much* easier, even providing us with a more sophisticated
training strategy, as we will see later.

### Model Training

We will integrate an external model, coming from the `DecisionTree` package,
into the MLJ workflow.

In the next lessons, we will be doing something similar with another model
called `ModalDecisionTree`.

In [10]:
try
    DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
catch
    println("The DecisionTreeClassifier symbol has already been imported.")
end

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /home/mauro/.julia/packages/MLJModels/BfLy4/src/loading.jl:159


import MLJDecisionTreeInterface ✔


MLJDecisionTreeInterface.DecisionTreeClassifier

In [11]:
model = MLJDecisionTreeInterface.DecisionTreeClassifier(
    max_depth=5, 
    min_samples_leaf=1, 
    min_samples_split=2
)

DecisionTreeClassifier(
  max_depth = 5, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = 0, 
  post_prune = false, 
  merge_purity_threshold = 1.0, 
  display_depth = 5, 
  feature_importance = :impurity, 
  rng = TaskLocalRNG())

A machine is a binding between a model and the data it works with.

It also keeps track of other information we might want to inspect, such as the
specific parameters learned by a model.

In the cell below, we bind the decision tree model to all the instances we have
available. This is not a good idea, but we will return to this topic in a moment.

In [12]:
mach = machine(model, X, y)

untrained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 5, …)
  args: 
    1:	Source @428 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @050 ⏎ AbstractVector{Multiclass{3}}


In [13]:
fit!(mach)

┌ Info: Training machine(DecisionTreeClassifier(max_depth = 5, …), …).
└ @ MLJBase /home/mauro/.julia/packages/MLJBase/yVJvJ/src/machines.jl:499


trained Machine; caches model-specific representations of data
  model: DecisionTreeClassifier(max_depth = 5, …)
  args: 
    1:	Source @428 ⏎ Table{AbstractVector{Continuous}}
    2:	Source @050 ⏎ AbstractVector{Multiclass{3}}


In [14]:
y_predict_probabilities = MLJ.predict(mach, X)
y_predict = mode.(y_predict_probabilities)

150-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [15]:
fitted_params(mach).tree

PetalLength < 2.45
├─ setosa (50/50)
└─ PetalWidth < 1.75
   ├─ PetalLength < 4.95
   │  ├─ PetalWidth < 1.65
   │  │  ├─ versicolor (47/47)
   │  │  └─ virginica (1/1)
   │  └─ PetalWidth < 1.55
   │     ├─ virginica (3/3)
   │     └─ SepalLength < 6.95
   │        ├─ versicolor (2/2)
   │        └─ virginica (1/1)
   └─ PetalLength < 4.85
      ├─ SepalWidth < 3.1
      │  ├─ virginica (2/2)
      │  └─ versicolor (1/1)
      └─ virginica (43/43)
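
The printed tree can be read as a cascade of if-else tests. The function below is a hand transcription of the tree shown above, assuming that the first child of each node is the branch taken when the test is true; `classify` is a hypothetical name, not something produced by DecisionTree or MLJ.

```julia
# Hand-written transcription of the printed tree (illustrative only).
# Assumption: for a node `feature < threshold`, the first child is the "true" branch.
function classify(SepalLength, SepalWidth, PetalLength, PetalWidth)
    if PetalLength < 2.45
        return "setosa"
    elseif PetalWidth < 1.75
        if PetalLength < 4.95
            return PetalWidth < 1.65 ? "versicolor" : "virginica"
        elseif PetalWidth < 1.55
            return "virginica"
        else
            return SepalLength < 6.95 ? "versicolor" : "virginica"
        end
    elseif PetalLength < 4.85
        return SepalWidth < 3.1 ? "virginica" : "versicolor"
    else
        return "virginica"
    end
end

# First row of the dataset: SepalLength 5.1, SepalWidth 3.5, PetalLength 1.4, PetalWidth 0.2
first_prediction = classify(5.1, 3.5, 1.4, 0.2)
```

Every root-to-leaf path is a conjunction of threshold conditions ending in a class label.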


### Confusion Matrix and Overfitting 

It is common practice to summarize the performance of a model using a
*confusion matrix*, containing the true positives and negatives found by our
model on the test data, as well as the false positives and negatives.

In the case of binary classification, with predictions on the rows and ground
truth on the columns, a confusion matrix is shaped as follows.
$$
\begin{array}{c|c|c}
\text{Predicted} \,\backslash\, \text{Ground truth} & \text{Positive} & \text{Negative} \\ \hline
\text{Positive} & TP & FP \\
\text{Negative} & FN & TN
\end{array}
$$

Among the many measures that can be derived from the matrix above, three
important ones are accuracy, precision, and recall.
In the binary classification scenario, they are defined as follows.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN +FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$

In the multi-class scenario, as in our case, we can compute precision and recall
individually for each class. To obtain a single scalar, we can average the
per-class values (macro-averaging).
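
As a sketch, these measures can be computed directly from a matrix of counts. The code assumes that rows are predictions and columns are ground truth; the helper names (`accuracy_from`, `precision_of`, `recall_of`) and the counts in `counts` are made up for illustration and do not come from MLJ or from the iris experiments.

```julia
using LinearAlgebra  # for `diag`

# Convention assumed here: counts[i, j] is the number of instances predicted
# as class i whose ground truth is class j (rows = predicted, columns = truth).
accuracy_from(counts) = sum(diag(counts)) / sum(counts)
precision_of(counts, k) = counts[k, k] / sum(counts[k, :])  # correct / all predicted as k
recall_of(counts, k) = counts[k, k] / sum(counts[:, k])     # correct / all truly k

# Illustrative (made-up) counts for a 3-class problem
counts = [10 1 0;
           0 8 2;
           0 1 8]

acc = accuracy_from(counts)
precisions = [precision_of(counts, k) for k in 1:3]
recalls = [recall_of(counts, k) for k in 1:3]

# Macro-averaging: the unweighted mean of the per-class values
macro_precision = sum(precisions) / length(precisions)
macro_recall = sum(recalls) / length(recalls)
```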

In [16]:
cm = confusion_matrix(y_predict, y)

           ┌────────────────────────────────┐
           │          Ground Truth          │
┌──────────┼──────────┬──────────┬──────────┤
│Predicted │  setosa  │versicol… │virginica │
├──────────┼──────────┼──────────┼──────────┤
│  setosa  │    50    │    0     │    0     │
├──────────┼──────────┼──────────┼──────────┤
│versicol… │    0     │    50    │    0     │
├──────────┼──────────┼──────────┼──────────┤
│virginica │    0     │    0     │    50    │
└──────────┴──────────┴──────────┴──────────┘


In [17]:
# wow! our model is so good!
accuracy(cm)

1.0

How awful! The model we just trained is bad, for sure.

Can you tell why?
Answer (decode from [base64encode](https://www.base64encode.org/)): `VGhlIGNvZGUgaXMgbm90IGdlbmVyYWxpemluZyEKVGhlIHNwbGl0cyBpbiB0aGUgdHJlZXMganVzdCBiZWNvbWUgYSBzdHJhdGVneSBmb3IgbWVtb3JpemluZyAoYW5kIGNvbXByZXNzaW5nKSB0aGUgZ2l2ZW4gZGF0YS4KUmVtZW1iZXI6IGFuIGludGVsbGlnZW50IGJlaGF2aW91ciBhbHdheXMgc3RlbXMgZnJvbSBnZW5lcmFsaXphdGlvbiBjYXBhYmlsaXRpZXMu`

### Model Evaluation

Imagine projecting the data points onto a two-dimensional plane: can you sketch
what happens during the inference process of the tree trained above?

Let us obtain a more reliable assessment by evaluating the model on data it has
not seen during training.

In [18]:
(X_train, X_test), (y_train, y_test) = partition((X, y), 0.7, rng=121, shuffle=true, multi=true);

In [19]:
mach = machine(model, X_train, y_train)
fit!(mach)
y_predict_probabilities = MLJ.predict(mach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)

┌ Info: Training machine(DecisionTreeClassifier(max_depth = 5, …), …).
└ @ MLJBase /home/mauro/.julia/packages/MLJBase/yVJvJ/src/machines.jl:499


           ┌────────────────────────────────┐
           │          Ground Truth          │
┌──────────┼──────────┬──────────┬──────────┤
│Predicted │  setosa  │versicol… │virginica │
├──────────┼──────────┼──────────┼──────────┤
│  setosa  │    14    │    0     │    0     │
├──────────┼──────────┼──────────┼──────────┤
│versicol… │    0     │    13    │    1     │
├──────────┼──────────┼──────────┼──────────┤
│virginica │    0     │    2     │    15    │
└──────────┴──────────┴──────────┴──────────┘
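
As a worked example, reading the matrix above with predictions on the rows and ground truth on the columns, the measures defined earlier evaluate as follows on the 45 test instances (values rounded to three decimals).

$$\text{Accuracy} = \frac{14 + 13 + 15}{45} = \frac{42}{45} \approx 0.933$$
$$\text{Precision}_{\text{versicolor}} = \frac{13}{13 + 1} \approx 0.929 \qquad \text{Recall}_{\text{versicolor}} = \frac{13}{13 + 2} \approx 0.867$$
$$\text{Precision}_{\text{virginica}} = \frac{15}{15 + 2} \approx 0.882 \qquad \text{Recall}_{\text{virginica}} = \frac{15}{15 + 1} \approx 0.938$$

For setosa, both precision and recall equal $14/14 = 1$.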


We can repeat the process above on multiple *folds* to assess the overall
quality of a machine learning training strategy. This technique is commonly
called *cross-validation*.

In the following cell, the data bound to the machine (here, the training
portion) is shuffled and divided into training and test sets in five different
ways; each time, a decision tree is learned on one part and tested on the
remaining, held-out part.
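
The fold construction can be sketched in plain Julia as follows. This is a simplified, non-stratified version meant only to show how the index sets are formed; `kfold_indices` is a hypothetical helper, not the implementation behind `StratifiedCV`.

```julia
using Random

# Hypothetical helper: split the indices 1:n into k (train, test) pairs.
# Each index appears in exactly one test fold.
function kfold_indices(n::Int, k::Int; rng = Random.default_rng())
    perm = shuffle(rng, 1:n)
    folds = [perm[i:k:n] for i in 1:k]  # fold i takes every k-th shuffled index
    splits = Tuple{Vector{Int}, Vector{Int}}[]
    for i in 1:k
        test = sort(folds[i])
        train = sort(reduce(vcat, [folds[j] for j in 1:k if j != i]))
        push!(splits, (train, test))
    end
    return splits
end

# Five folds over ten instances: each test fold holds two indices
splits = kfold_indices(10, 5; rng = Xoshiro(1))
```

Averaging a measure over the `k` test folds gives the kind of aggregate that a `per_fold` report summarises.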

In [20]:
acc = evaluate!(
    mach,
    resampling=StratifiedCV(; nfolds = 5, shuffle=true),    # cross validation
    measures=[accuracy]
)



PerformanceEvaluation object with these fields:
  model, tag, measure, operation,
  measurement, uncertainty_radius_95, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Tag: DecisionTreeClassifier-216
Extract:
┌────────────┬──────────────┬─────────────┐
│ measure    │ operation    │ measurement │
├────────────┼──────────────┼─────────────┤
│ Accuracy() │ predict_mode │ 0.943       │
└────────────┴──────────────┴─────────────┘
┌───────────────────────────────────┬─────────┐
│ per_fold                          │ 1.96*SE │
├───────────────────────────────────┼─────────┤
│ [0.952, 0.857, 1.0, 0.952, 0.952] │ 0.0511  │
└───────────────────────────────────┴─────────┘


### Training with Hyperparameter Tuning

The arguments of `DecisionTreeClassifier(...)` are called *hyperparameters*:
they are settings that control how the learning algorithm builds a specific
model (i.e., the if-else cascade we call a decision tree), rather than values
learned from the data.

Which combination of hyperparameters should we provide?

In this rather lightweight example, we can systematically try many combinations
and keep the one that achieves the highest performance.

This technique goes under the name of *grid search*.

In [21]:
max_depth_range = range(Int, :max_depth, lower=2, upper=10)
min_samples_leaf_range = range(Int, :min_samples_leaf, lower=1, upper=5)
min_samples_split_range = range(Int, :min_samples_split, lower=2, upper=10);
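
A grid is the Cartesian product of the candidate values of each hyperparameter. The sketch below enumerates it in plain Julia, assuming that every integer within each of the three ranges above is tried; the variable names are illustrative and this is not how MLJ builds its grid internally.

```julia
# Candidate values, assumed to be every integer within each range above
max_depths = 2:10   # 9 values
min_leaves = 1:5    # 5 values
min_splits = 2:10   # 9 values

# Every (max_depth, min_samples_leaf, min_samples_split) combination
grid = collect(Iterators.product(max_depths, min_leaves, min_splits))
n_combinations = length(grid)  # 9 * 5 * 9
```

Under this assumption the grid holds 9 × 5 × 9 = 405 combinations, which is consistent with the number of models reported in the tuning log further down.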

In [22]:
tuned_tree = TunedModel(
    model = MLJDecisionTreeInterface.DecisionTreeClassifier(),
    resampling = StratifiedCV(nfolds = 5, shuffle = true),
    range = [max_depth_range, min_samples_leaf_range, min_samples_split_range],
    measure = accuracy,
    tuning = Grid()
)

ProbabilisticTunedModel(
  model = DecisionTreeClassifier(
        max_depth = -1, 
        min_samples_leaf = 1, 
        min_samples_split = 2, 
        min_purity_increase = 0.0, 
        n_subfeatures = 0, 
        post_prune = false, 
        merge_purity_threshold = 1.0, 
        display_depth = 5, 
        feature_importance = :impurity, 
        rng = TaskLocalRNG()), 
  tuning = Grid(
        goal = nothing, 
        resolution = 10, 
        shuffle = true, 
        rng = TaskLocalRNG()), 
  resampling = StratifiedCV(
        nfolds = 5, 
        shuffle = true, 
        rng = TaskLocalRNG()), 
  measure = Accuracy(), 
  weights = nothing, 
  class_weights = nothing, 
  operation = nothing, 
  range = MLJBase.NumericRange{Int64, MLJBase.Bounded, Symbol}[NumericRange(2 ≤ max_depth ≤ 10; origin=6.0, unit=4.0), NumericRange(1 ≤ min_samples_leaf ≤ 5; origin=3.0, unit=2.0), NumericRange(2 ≤ min_samples_split ≤ 10; origin=6.0, unit=4.0)], 
  selection_heuristic = MLJTuning.NaiveSel

In [27]:
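# find the best model, exploring different hyperparameter combinations with cross-validation;
# note: this machine is bound to all of (X, y), so X_test overlaps the data used for tuning
# and the matrix below is optimistic; bind to (X_train, y_train) for an unbiased estimate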
mach = machine(tuned_tree, X, y)
fit!(mach)
y_predict_probabilities = MLJ.predict(mach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)

┌ Info: Training machine(ProbabilisticTunedModel(model = DecisionTreeClassifier(max_depth = -1, …), …), …).
└ @ MLJBase /home/mauro/.julia/packages/MLJBase/yVJvJ/src/machines.jl:499
┌ Info: Attempting to evaluate 405 models.
└ @ MLJTuning /home/mauro/.julia/packages/MLJTuning/xiLEY/src/tuned_models.jl:762


           ┌────────────────────────────────┐
           │          Ground Truth          │
┌──────────┼──────────┬──────────┬──────────┤
│Predicted │  setosa  │versicol… │virginica │
├──────────┼──────────┼──────────┼──────────┤
│  setosa  │    14    │    0     │    0     │
├──────────┼──────────┼──────────┼──────────┤
│versicol… │    0     │    13    │    0     │
├──────────┼──────────┼──────────┼──────────┤
│virginica │    0     │    2     │    16    │
└──────────┴──────────┴──────────┴──────────┘


## Learning with Sole.jl

### Tabular Datasets and Logisets

Symbolic AI treats tabular datasets, such as the iris dataset, as sets of
propositional interpretations, onto which formulas of propositional logic are
interpreted.

Look at the (classical) tabular dataset $\mathcal{I}$ below. We denote instances
by $I_i$ and *variables*$^{[1]}$ by $V_j$.

$$
\begin{array}{c|ccc}
 & V_1 & V_2 & V_3 \\ \hline
I_1 & 1.2 & [1,2,3] & \text{A} \\
I_2 & 1.3 & [9,7,6] & \text{B} \\
I_3 & 0.8 & [2,8,2] & \text{C} \\
I_4 & 1.1 & [1,3,7] & \text{B} \\
I_5 & 1.2 & [4,3,3] & \text{B} \\
\end{array}
$$

We can change the point of view on the table above from a statistical to a
logical one, called a *logiset*.

This requires the definition of a propositional alphabet $\mathcal{P}$.

Consider $\mathcal{P} = \{p, q, r\}$, with: 

$$p \coloneqq V_1 \geq 1$$
$$q \coloneqq \text{sum}(V_2) < 13$$
$$r \coloneqq V_3 = \text{B}$$

We denote the truth constant with $\top$ (top), and the false constant with
$\bot$ (bot).

The resulting (propositional) logiset $\mathcal{I}_\mathcal{P}$ is:

$$
\begin{array}{c|ccc}
 & p & q & r \\ \hline
I_1 & \top & \top & \bot \\
I_2 & \top & \bot & \top \\
I_3 & \bot & \top & \bot \\
I_4 & \top & \top & \top \\
I_5 & \top & \top & \top \\
\end{array}
$$
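
The construction above can be reproduced in a few lines of plain Julia. This is only a sketch of the idea and does not use the SoleData API; `instances` and `logiset` are illustrative names, while `p`, `q` and `r` follow the definitions given above.

```julia
# The toy tabular dataset, one named tuple per instance I_1, ..., I_5
instances = [
    (V1 = 1.2, V2 = [1, 2, 3], V3 = "A"),
    (V1 = 1.3, V2 = [9, 7, 6], V3 = "B"),
    (V1 = 0.8, V2 = [2, 8, 2], V3 = "C"),
    (V1 = 1.1, V2 = [1, 3, 7], V3 = "B"),
    (V1 = 1.2, V2 = [4, 3, 3], V3 = "B"),
]

# The propositional letters, each one a Boolean test on an instance
p(I) = I.V1 >= 1
q(I) = sum(I.V2) < 13
r(I) = I.V3 == "B"

# 5×3 Boolean matrix: row i holds the truth values of (p, q, r) on I_i
logiset = [letter(I) for I in instances, letter in (p, q, r)]

# A formula is then evaluated instance by instance, e.g. p ∧ q
p_and_q = [p(I) && q(I) for I in instances]
```

Here `true` plays the role of $\top$ and `false` the role of $\bot$.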

$^{[1]}$ We use the term "variable" to denote, in general, a column of the
tabular dataset: this corresponds to a raw attribute or a *feature* (a processed
attribute).

In [28]:
using MLJBase
using SoleData

In [49]:
X_logiset = PropositionalLogiset(data);
X_logiset.tabulardataset

 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     Cat…
─────┼──────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  setosa
   2 │         4.9         3.0          1.4         0.2  setosa
   3 │         4.7         3.2          1.3         0.2  setosa
   4 │         4.6         3.1          1.5         0.2  setosa
   5 │         5.0         3.6          1.4         0.2  setosa
   6 │         5.4         3.9          1.7         0.4  setosa
   7 │         4.6         3.4          1.4         0.3  setosa
   8 │         5.0         3.4          1.5         0.2  setosa
   9 │         4.4         2.9          1.4         0.2  setosa
  10 │         4.9         3.1          1.5         0.1  setosa


In [None]:
phi = parseformula(
    "sepal_length > 5.8 ∧ sepal_width < 3.0 ∨ target == \"setosa\"";
    atom_parser = a->Atom(parsecondition(SoleData.ScalarCondition, a; featuretype = SoleData.VariableValue)),
    # TODO: this should prevent the warning below, but the dispatch is caught by SoleLogics
    # featvaltype = Real,
    # featuretype = SoleData.VarFeature
)

MethodError: MethodError: no method matching parseformula(::Type{SyntaxTree}, ::String, ::Nothing; atom_parser::var"#59#60", featuretype::Nothing, featvaltype::Nothing)
This error has been manually thrown, explicitly, so the method may exist but be intentionally marked as unimplemented.

Closest candidates are:
  parseformula(::Type{<:SyntaxTree}, ::AbstractString, ::Union{Nothing, AbstractVector}; function_notation, atom_parser, additional_whitespaces, opening_parenthesis, closing_parenthesis, arg_delim) got unsupported keyword arguments "featuretype", "featvaltype"
   @ SoleLogics ~/.julia/packages/SoleLogics/zRJKP/src/utils/parse.jl:115
  parseformula(::Type{<:Formula}, ::AbstractString, ::Any...; kwargs...)
   @ SoleLogics ~/.julia/packages/SoleLogics/zRJKP/src/types/parse.jl:13
  parseformula(::Type{<:SyntaxTree}, ::AbstractString, !Matched::SoleLogics.AbstractLogic; kwargs...)
   @ SoleLogics ~/.julia/packages/SoleLogics/zRJKP/src/utils/parse.jl:521
  ...


In [51]:
ScalarCondition{Real, VariableValue}

ScalarCondition{Real, VariableValue, M} where M<:(ScalarMetaCondition{VariableValue})

### From DecisionTree.jl to SoleModels.jl

### Extracting logical rules using SolePostHoc.jl

In [None]:
using SolePostHoc

# lumen(model_name)
# batrees(model_name)
# rulecosiplus(model_name)