# Symbolic Machine Learning

The standard way to implement a machine learning pipeline in Julia is the
[MLJ.jl](https://juliaai.github.io/MLJ.jl/stable/) package.

We are going to work with the `iris` dataset, trying to discover the relation
between the attribute values of an iris flower and the species to which the
flower belongs.

More generally, we want to find the relation between the values of the
*attributes* of each instance (`X`) and the corresponding *labels* (`y`). 

In order to do so, we are going to train a (classification) decision tree,
leveraging the `DecisionTree` package, which can be easily integrated within an
`MLJ` pipeline.

Later in this notebook, we will repeat this process leveraging the `Sole.jl`
library, which will allow us to explicitly model the problem through the lens of
logic.

In [None]:
using Pkg
Pkg.activate("..")
Pkg.instantiate()
Pkg.update()

In [None]:
# for reproducibility purposes
using Random
Random.seed!(1605)

## Learning with MLJ.jl

### Data Loading and Description

In [None]:
using MLJ
using RDatasets # used to load the iris dataset


data = RDatasets.dataset("datasets", "iris");

In [None]:
schema(data)

In [None]:
data

In [None]:
y, X = unpack(data, ==(:Species))

In [None]:
# categorical vectors are lighter than raw vectors; can you guess why?
typeof(y)

In [None]:
typeof(X)

In [None]:
# check whether the classes are balanced by counting the instances of each class
for class in unique(y)
    println("$(class) - $(count(yi -> yi == class, y))")
end

### Data Preprocessing

In the limited scenario of this exercise, there is not much room for complex
preprocessing of our data. For example, we are not dealing with unbalanced
classes, missing data, or complex encodings.

The usual workflow, at this point, is to partition the data into a training set
and a test set, preserving the class proportions in both.

With this distinction, we can train a model on the training data and use the
test data to simulate a real-world scenario, obtaining a reliable estimate of
the model's performance.
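To make the idea concrete, here is a minimal plain-Julia sketch of a stratified holdout split. The helper `stratified_split` is hypothetical (it is not part of MLJ), and the labels are synthetic, reproducing only the class counts of iris.

```julia
using Random

# Hypothetical helper (not part of MLJ): stratified holdout split over row indices.
function stratified_split(labels::AbstractVector, fraction::Real; rng = Random.default_rng())
    train, test = Int[], Int[]
    for class in unique(labels)
        idx = shuffle(rng, findall(==(class), labels))  # rows of this class, shuffled
        ntrain = round(Int, fraction * length(idx))     # rows of this class kept for training
        append!(train, idx[1:ntrain])
        append!(test, idx[ntrain+1:end])
    end
    return sort(train), sort(test)
end

# Synthetic labels with the same class counts as iris (50 per class)
toy_labels = repeat(["setosa", "versicolor", "virginica"], inner = 50)
train_idx, test_idx = stratified_split(toy_labels, 0.7; rng = MersenneTwister(1))
println("train: ", length(train_idx), " rows, test: ", length(test_idx), " rows")
```

Splitting class by class is what keeps the proportions identical in the two sets.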

MLJ makes our work *much* easier, even providing us with a more sophisticated
training strategy, as we will see later.

### Model Training

We will integrate an external model, coming from the `DecisionTree` package,
into the MLJ workflow.

In the next lessons, we will be doing something similar with another model
called `ModalDecisionTree`.

In [None]:
try
    DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
catch
    println("The DecisionTreeClassifier symbol has already been imported.")
end

In [None]:
model = MLJDecisionTreeInterface.DecisionTreeClassifier(
    max_depth=5, 
    min_samples_leaf=1, 
    min_samples_split=2
)

A machine is a binding between a model and the data it works with.

It also keeps track of other information we might want to inspect, such as the
specific parameters learned by a model.

In the cell below, we bind the decision tree model to all the instances we have
available. This is not a good idea, but we will return to the topic in a moment.

In [None]:
mach = machine(model, X, y)

In [None]:
fit!(mach)

In [None]:
y_predict_probabilities = MLJ.predict(mach, X)
y_predict = mode.(y_predict_probabilities)

In [None]:
fitted_params(mach).tree

### Confusion Matrix and Overfitting 

It is common practice to summarize the performance of a model using a
*confusion matrix*, containing the true positives and negatives found by our
model on the test data, as well as the false positives and negatives.

In the case of binary classification, a confusion matrix is shaped as follows.
$$
\begin{array}{c|c|c}
\text{Predicted / Ground truth} & \text{Positive} & \text{Negative} \\ \hline
\text{Positive} & TP & FP \\
\text{Negative} & FN & TN
\end{array}
$$

Many measures can be derived from the matrix above; three important ones are
accuracy, precision, and recall.
In the binary classification scenario, they are defined as follows.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN +FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$

In the multi-class scenario, as in our case, we can compute precision and recall
individually for each class. To obtain a single scalar, we can average the
per-class results.
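As a sketch of the per-class computation, the plain-Julia snippet below builds a small confusion matrix by hand and averages the per-class scores. The labels are invented for illustration, and the `toy_` names are ours; rows index predictions and columns index the ground truth.

```julia
class_names = ["setosa", "versicolor", "virginica"]

# Hypothetical ground truth and predictions, for illustration only
toy_truth = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
toy_pred  = ["setosa", "setosa", "versicolor", "virginica",  "virginica", "virginica"]

# toy_cm[i, j] counts instances predicted as class i whose true class is j
toy_cm = [count((toy_pred .== p) .& (toy_truth .== t)) for p in class_names, t in class_names]

n = length(class_names)
toy_accuracy   = sum(toy_cm[i, i] for i in 1:n) / sum(toy_cm)
toy_precisions = [toy_cm[i, i] / sum(toy_cm[i, :]) for i in 1:n]  # TP / (TP + FP), over a row
toy_recalls    = [toy_cm[i, i] / sum(toy_cm[:, i]) for i in 1:n]  # TP / (TP + FN), over a column

# Macro-averaging: the unweighted mean of the per-class scores
macro_precision = sum(toy_precisions) / n
macro_recall    = sum(toy_recalls) / n
```

The unweighted mean treats every class equally, which is reasonable here because the iris classes are balanced.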

In [None]:
cm = confusion_matrix(y_predict, y)

In [None]:
# wow! our model is so good!
accuracy(cm)

How awful! The model we just trained is bad, for sure.

Can you tell why?

Answer (decode from [base64encode](https://www.base64encode.org/)): `VGhlIGNvZGUgaXMgbm90IGdlbmVyYWxpemluZyEKVGhlIHNwbGl0cyBpbiB0aGUgdHJlZXMganVzdCBiZWNvbWUgYSBzdHJhdGVneSBmb3IgbWVtb3JpemluZyAoYW5kIGNvbXByZXNzaW5nKSB0aGUgZ2l2ZW4gZGF0YS4KUmVtZW1iZXI6IGFuIGludGVsbGlnZW50IGJlaGF2aW91ciBhbHdheXMgc3RlbXMgZnJvbSBnZW5lcmFsaXphdGlvbiBjYXBhYmlsaXRpZXMu`

### Model Evaluation

Imagine projecting the data points on a bidimensional plane: can you provide a graphical sketch 
of what is happening during the inference process of the tree trained above? 

Let us obtain a more reliable model.

In [None]:
(X_train, X_test), (y_train, y_test) = partition((X, y), 0.7, rng=121, shuffle=true, multi=true);

In [None]:
mach = machine(model, X_train, y_train)
fit!(mach)
y_predict_probabilities = MLJ.predict(mach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)

We can iterate the process above on multiple *folds*, to assess the overall quality of a 
machine learning training strategy. This technique is commonly called *cross-validation*.

In the following, the iris dataset will be shuffled and divided into training and test sets in
different ways, and each time a decision tree will be learned and tested over a different
portion of the data.
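The fold mechanics can be sketched in plain Julia as follows. The helper `kfold_indices` is hypothetical, and it performs a plain (non-stratified) split, which is a simplification of the stratified strategy used in this notebook.

```julia
using Random

# Hypothetical helper: split the row indices 1:n into k shuffled folds of near-equal size.
function kfold_indices(n::Integer, k::Integer; rng = Random.default_rng())
    perm = shuffle(rng, 1:n)
    return [perm[i:k:n] for i in 1:k]  # fold i takes every k-th shuffled row
end

toy_folds = kfold_indices(150, 5; rng = MersenneTwister(42))
for (i, test_rows) in enumerate(toy_folds)
    train_rows = setdiff(1:150, test_rows)
    # a model would be fitted on train_rows and scored on test_rows here
    println("fold $i: $(length(train_rows)) training rows, $(length(test_rows)) test rows")
end
```

Each row is used for testing exactly once, so averaging the per-fold scores uses all the data without ever testing on a row seen during training.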

In [None]:
acc = evaluate!(
    mach,
    resampling=StratifiedCV(; nfolds = 5, shuffle=true),    # cross validation
    measures=[accuracy]
)

### Training with Hyperparameter Tuning

The arguments of `DecisionTreeClassifier(...)` are called *hyperparameters*,
as they are the meta-parameters that drive the creation of a specific algorithm
(i.e., the if-else cascade we call decision tree).

Which combination of hyperparameters should we provide?

In this rather lightweight example, we can systematically try many combinations
and keep the one that achieves the highest performance.

This technique goes under the name of *grid search*.
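Stripped of any library, a grid search is an exhaustive loop over the Cartesian product of candidate values. The sketch below enumerates that product in plain Julia; `toy_score` is a made-up stand-in for a cross-validated accuracy, not a real measurement.

```julia
# Candidate values for three hyperparameters of a decision tree
max_depths         = 2:10
min_samples_leafs  = 1:5
min_samples_splits = 2:10

# Every combination of the three candidate lists
candidates = collect(Iterators.product(max_depths, min_samples_leafs, min_samples_splits))
println(length(candidates), " candidate combinations")

# Hypothetical scoring function: a real one would train and cross-validate a model.
toy_score((depth, leaf, msplit)) = -abs(depth - 4) - abs(leaf - 2) - abs(msplit - 3)

# Keep the combination with the highest score
best = argmax(toy_score, candidates)
println("best combination: ", best)
```

The number of combinations is the product of the list lengths, which is why grid search only suits small search spaces.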

In [None]:
max_depth_range = range(Int, :max_depth, lower=2, upper=10)
min_samples_leaf_range = range(Int, :min_samples_leaf, lower=1, upper=5)
min_samples_split_range = range(Int, :min_samples_split, lower=2, upper=10);

In [None]:
tuned_tree = TunedModel(
    model = MLJDecisionTreeInterface.DecisionTreeClassifier(),
    resampling = StratifiedCV(nfolds = 5, shuffle = true),
    range = [max_depth_range, min_samples_leaf_range, min_samples_split_range],
    measure = accuracy,
    tuning = Grid()
)

In [None]:
# find the best model, exploring different hyperparameterizations with cross validation;
# tuning uses the training split only, so that the test split stays unseen
mach = machine(tuned_tree, X_train, y_train)
fit!(mach)
y_predict_probabilities = MLJ.predict(mach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)

## Learning with Sole.jl

### Tabular Datasets and Logisets

Symbolic AI treats tabular datasets, such as the iris dataset, as sets of
propositional interpretations, onto which formulas of propositional logic are
interpreted.

Look at the (classical) tabular dataset $\mathcal{I}$ below. We denote instances
with $I_j$ and *variables*$^{[1]}$ with $V_i$.

$$
\begin{array}{c|ccc}
 & V_1 & V_2 & V_3 \\ \hline
I_1 & 1.2 & [1,2,3] & \text{A} \\
I_2 & 1.3 & [9,7,6] & \text{B} \\
I_3 & 0.8 & [2,8,2] & \text{C} \\
I_4 & 1.1 & [1,3,7] & \text{B} \\
I_5 & 1.2 & [4,3,3] & \text{B} \\
\end{array}
$$

We can change the point of view on the table above from a statistical to a
logical one, called a *logiset*.

This requires the definition of a propositional alphabet $\mathcal{P}$.

Consider $\mathcal{P} = \{p, q, r\}$, with: 

$$p \coloneqq V_1 \geq 1$$
$$q \coloneqq \text{sum}(V_2) < 13$$
$$r \coloneqq V_3 = \text{B}$$

We denote the truth constant with $\top$ (top), and the false constant with
$\bot$ (bot).

The resulting (propositional) logiset $\mathcal{I}_\mathcal{P}$ is:

$$
\begin{array}{c|ccc}
 & p & q & r \\ \hline
I_1 & \top & \top & \bot \\
I_2 & \top & \bot & \top \\
I_3 & \bot & \top & \bot \\
I_4 & \top & \top & \top \\
I_5 & \top & \top & \top \\
\end{array}
$$

$^{[1]}$ We use the term "variable" to denote, in general, a column of the
tabular dataset: this corresponds to a raw attribute or a *feature* (a processed
attribute).
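The construction above can be reproduced in a few lines of plain Julia. This is only an illustrative sketch with our own `toy_` names; it does not use the SoleData types introduced next.

```julia
# The tabular dataset from the example above, one NamedTuple per instance
toy_table = [
    (V1 = 1.2, V2 = [1, 2, 3], V3 = "A"),
    (V1 = 1.3, V2 = [9, 7, 6], V3 = "B"),
    (V1 = 0.8, V2 = [2, 8, 2], V3 = "C"),
    (V1 = 1.1, V2 = [1, 3, 7], V3 = "B"),
    (V1 = 1.2, V2 = [4, 3, 3], V3 = "B"),
]

# The propositional alphabet: each letter is a Boolean test on an instance
toy_alphabet = (
    p = inst -> inst.V1 >= 1,
    q = inst -> sum(inst.V2) < 13,
    r = inst -> inst.V3 == "B",
)

# One row of truth values per instance: the propositional logiset
toy_logiset = [map(letter -> letter(inst), toy_alphabet) for inst in toy_table]
foreach(println, toy_logiset)
```

Each printed row matches the corresponding row of $\mathcal{I}_\mathcal{P}$, with `true` for $\top$ and `false` for $\bot$.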

In [None]:
using MLJBase
using SoleData

In [None]:
X_logiset = PropositionalLogiset(data);
X_logiset.tabulardataset == data

In [None]:
phi = parseformula(
    "SepalLength > 5.8 ∧ SepalWidth < 3.0 ∨ Species == \"setosa\"";
    atom_parser = a->Atom(parsecondition(SoleData.ScalarCondition, a; featuretype = SoleData.VariableValue)),
    # TODO: this should prevent the warning below, but the dispatch is caught by SoleLogics
    # featvaltype = Real,
    # featuretype = SoleData.VarFeature
)

In [None]:
# check(phi, SoleLogics.LogicalInstance(X_logiset, 1))
check(phi, X_logiset, 1)
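As a sanity check, the same formula can be evaluated by hand in plain Julia. This sketch assumes that ∧ binds tighter than ∨, as in Julia's own operator precedence; the instance values are those of the first iris flower, typed in manually.

```julia
# First instance of the iris dataset (values typed by hand)
toy_instance = (SepalLength = 5.1, SepalWidth = 3.5, Species = "setosa")

# (SepalLength > 5.8 ∧ SepalWidth < 3.0) ∨ Species == "setosa"
holds = (toy_instance.SepalLength > 5.8 && toy_instance.SepalWidth < 3.0) ||
        toy_instance.Species == "setosa"
println(holds)
```

The conjunction fails on this instance, but the second disjunct holds, so the whole formula holds.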

### From DecisionTree.jl to SoleModels.jl

If we manage to make an existing model compliant with the interface of the `SoleModels` package, then we can play with it from a logical standpoint.
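To get an intuition of the logical standpoint: every root-to-leaf path of a decision tree reads as a rule whose antecedent is the conjunction of the tests along the path. The plain-Julia sketch below shows this on a toy tree; `ToyNode`, `paths_to_rules`, and the thresholds are hypothetical and unrelated to the SoleModels API.

```julia
# Hypothetical toy tree (thresholds are illustrative, not the fitted ones).
# An internal node tests `feature < threshold`; a leaf is just a class label.
struct ToyNode
    feature::String
    threshold::Float64
    left::Union{ToyNode, String}   # branch taken when the test holds
    right::Union{ToyNode, String}  # branch taken otherwise
end

toy_tree = ToyNode("PetalLength", 2.45, "setosa",
                   ToyNode("PetalWidth", 1.75, "versicolor", "virginica"))

# Collect one "antecedent → label" rule per root-to-leaf path
function paths_to_rules(node, conditions = String[])
    node isa String && return [join(conditions, " ∧ ") * "  →  " * node]
    holds_test  = "$(node.feature) < $(node.threshold)"
    fails_test  = "$(node.feature) ≥ $(node.threshold)"
    return vcat(paths_to_rules(node.left,  [conditions; holds_test]),
                paths_to_rules(node.right, [conditions; fails_test]))
end

toy_rules = paths_to_rules(toy_tree)
foreach(println, toy_rules)
```

A tree with three leaves yields three rules, one per leaf.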

In [None]:
using SoleModels

In [None]:
mach = machine(model, X_train, y_train)
fit!(mach)

# \:seedling:
🌱 = fitted_params(mach).tree

In [None]:
# we encode the model in such a way that it can be investigated via SoleModels
# \:evergreen_tree:
🌲 = solemodel(🌱)
printmodel(🌲)

In [None]:
# these are all the logical rules encoded by the tree
listrules(🌲)

In [None]:
metricstable(🌲)

In [None]:
# show all the testing instances to the tree, and compare the metrics
# with the testing samples
apply!(🌲, X_test, y_test);

In [None]:
# we can visualize how our model behaved at testing time
metricstable(
    🌲;
    normalize = true,
    metrics_kwargs = (;
        additional_metrics = (;
            height = r->SoleLogics.height(antecedent(r))
        )
    )
)

In [None]:
# join some rules for the same class into a single, sufficient and necessary
# condition for the same class
metricstable(joinrules(🌲; min_ncovered = 1, normalize = true))

Here, we are just scratching the surface of the Sole framework, limiting ourselves to
pretty-printing.

In the next lessons, we will enhance the machine learning pipeline we introduced today,
with spatial reasoning considerations.

Below is a little spoiler about a fancy machine learning model, which is general enough
to deal with more-than-propositional logics.

In [None]:
using ModalDecisionTrees

mdt_model = ModalDecisionTree()
# train on the training split, then predict on the unseen test split
mach = machine(mdt_model, X_train, y_train)
fit!(mach)
y_pred = predict_mode(mach, X_test)
cm = confusion_matrix(y_pred, y_test)

### Extracting logical rules using SolePostHoc.jl

In [None]:
using SolePostHoc

lumen(🌲)
# batrees(model_name)
# rulecosiplus(model_name)