In [None]:
using Pkg
Pkg.activate("..")
Pkg.instantiate()
Pkg.status()

In [None]:
using Random

Random.seed!(1234)

### Many-Expert Decision Trees

`ManyExpertDecisionTrees.jl` is still in development and has not been released
yet!

In [None]:
using ManyExpertDecisionTrees

"Many-Expert Decision Trees" sounds like a very general name...

Let's start with some motivation (and only one expert):
- we want to move from "hard" to "soft" decisions
- we want a better treatment of uncertainty

How will we achieve that?
- by evaluating all the branches of our tree and choosing the one(s) with the
highest values: instead of committing to a single crisp decision at each step,
we take into account the contribution of every node
- at each node, a feature test is no longer "true" or "false"; rather, we
assign it a value between 0 and 1, and combine these values using a t-norm
- at the end, we do not constrain the model to always return a (single) class:
it can also say "I do not know which class, but (as far as the model can tell)
it is surely one of these classes"
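
To make the three points above concrete, here is a minimal plain-Julia sketch. It does not use the `ManyExpertDecisionTrees.jl` data structures: the `Leaf` and `Node` types, the `softpredict` function, the use of `max` to merge leaves with the same label, and the tolerance `tol` are all assumptions made for illustration.

```julia
# Hypothetical toy tree, NOT the ManyExpertDecisionTrees.jl implementation.
struct Leaf
    label::Symbol
end

struct Node
    feature::Int                 # index of the tested feature
    mu_left::Function            # degree to which a value goes left
    mu_right::Function           # degree to which a value goes right
    left::Union{Leaf,Node}
    right::Union{Leaf,Node}
end

# Product t-norm: combines the degrees collected along a path
tnorm(a, b) = a * b

# Visit every branch, carrying the degree accumulated so far
function scores!(acc::Dict{Symbol,Float64}, leaf::Leaf, x, degree)
    acc[leaf.label] = max(get(acc, leaf.label, 0.0), degree)
    return acc
end

function scores!(acc::Dict{Symbol,Float64}, node::Node, x, degree)
    v = x[node.feature]
    scores!(acc, node.left, x, tnorm(degree, node.mu_left(v)))
    scores!(acc, node.right, x, tnorm(degree, node.mu_right(v)))
    return acc
end

# Return every class whose score is within `tol` of the best one
function softpredict(tree, x; tol=0.05)
    acc = scores!(Dict{Symbol,Float64}(), tree, x, 1.0)
    best = maximum(values(acc))
    return sort([c for (c, s) in acc if best - s <= tol])
end

gauss(mu, sigma) = v -> exp(-(v - mu)^2 / (2 * sigma^2))

tree = Node(1, gauss(2.0, 1.0), gauss(6.0, 1.0), Leaf(:a), Leaf(:b))

softpredict(tree, [2.0])   # clearly on the left side: a single class
softpredict(tree, [4.0])   # halfway between the two: both classes
```

When the returned vector has more than one element, the toy model is effectively saying "one of these classes, but I cannot tell which".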

Let's load, once again, the "iris" dataset.

In [None]:
using RDatasets # used to load the iris dataset

data = RDatasets.dataset("datasets", "iris");

In [None]:
using MLJ

y, X = unpack(data, ==(:Species));

And let's split our data into training and test sets.

In [None]:
(X_train, X_test), (y_train, y_test) = partition(
    (X, y),
    0.8,
    rng=13,
    shuffle=true,
    multi=true
);

Our approach works in the following way:
- we will further divide our training dataset into n+1 slices, where n is the
number of experts (in this first example, just one, so we'll have 2 slices)
- then, we will learn a classical (crisp) decision tree, using the first slice
of the training set
- finally, we will use each of the other `n` slices to train some parameters
characterising a "soft" version of the learnt decision tree
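
The slicing step above can be sketched with plain indices. The helper `slice_indices` and its `dt_fraction` keyword are hypothetical names; the round-robin sharing of the remaining samples among experts is an assumption, since the notebook only shows the single-expert case.

```julia
using Random

# Hypothetical helper: shuffle `n_samples` indices and split them into
# `n_experts + 1` slices. The first slice (for the crisp tree) receives a
# fraction `dt_fraction`; the rest is shared evenly among the experts.
function slice_indices(n_samples, n_experts; dt_fraction=0.4, rng=Random.default_rng())
    idx = shuffle(rng, 1:n_samples)
    n_dt = round(Int, dt_fraction * n_samples)
    dt_slice = idx[1:n_dt]
    rest = idx[n_dt+1:end]
    # Distribute the remaining indices round-robin among the experts
    expert_slices = [rest[k:n_experts:end] for k in 1:n_experts]
    return dt_slice, expert_slices
end

# 120 training samples (80% of iris), two experts
dt_slice, expert_slices = slice_indices(120, 2; rng=MersenneTwister(42))
length(dt_slice), length.(expert_slices)
```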

In [None]:
(X_train_dt, X_train_exp), (y_train_dt, y_train_exp) = partition(
    (X_train, y_train),
    0.4,
    rng=42,
    shuffle=true,
    multi=true
);

Let's build a classical (crisp) decision tree on the first slice.

(Remember: we already shuffled our instances when splitting into train/test)

In [None]:
using DecisionTree

# Build a standard decision tree (explicitly)
dt = build_tree(y_train_dt, Matrix(X_train_dt))

In [None]:
# Prune tree: merge leaves having >= 90% combined purity
dt = prune_tree(dt, 0.9)

In [None]:
print_tree(dt)

In [None]:
y_pred = apply_tree(dt, Matrix(X_test))

In [None]:
cm = confusion_matrix(y_test, y_pred)

In [None]:
accuracy(cm)

Let's try to soften this decision tree!

First, we need to define a new structure (we need to add more information about
each node of our decision tree); namely, a `ManyExpertDecisionTree`.

Not only that: we also need to provide a `ManyExpertAlgebra` (more on that in a
minute!), which specifies the fuzzy logic to use for each expert.

In [None]:
using SoleLogics.ManyValuedLogics

mxa = ManyExpertAlgebra(ProductLogic)
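
As a reminder of what choosing a fuzzy logic means in practice, the sketch below writes out the product t-norm next to the two other standard ones (Gödel and Łukasiewicz) as plain Julia functions. These are the textbook definitions, not calls into `SoleLogics`; the function names are chosen here for illustration.

```julia
# Textbook t-norms on [0, 1], written out by hand for illustration
godel(a, b) = min(a, b)
lukasiewicz(a, b) = max(0.0, a + b - 1.0)
product(a, b) = a * b

a, b = 0.7, 0.6
godel(a, b), lukasiewicz(a, b), product(a, b)

# On crisp values (0.0 and 1.0) all three behave like the classical "and"
all(t(1.0, 1.0) == 1.0 && t(1.0, 0.0) == 0.0 for t in (godel, lukasiewicz, product))
```

For any pair of values in [0, 1], the Łukasiewicz result is never larger than the product, which in turn is never larger than the Gödel minimum, so the choice of logic controls how quickly degrees decay along a path.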

The idea is to soften the original decision tree by treating each node as a
"membership" in a "fuzzy set" (note that if we use only `true` and `false`, we
obtain the original split).

Hence, we will leverage membership functions, associating one with each node
for each expert: the parameters we learn are the parameters of the chosen
function; in our case, we will only use Gaussian functions.

For membership functions, we will leverage the `FuzzyLogic.jl` package.

Watch out! Even if it is called `FuzzyLogic.jl`, this package offers classical
tools (like membership functions) to work with fuzzy sets and systems, and it is
NOT a package to manipulate mathematical fuzzy logic.

Moreover, since we already have `FuzzyLogic` as a type in our namespace, we
provide an alias to load the package, as follows.

In [None]:
using ManyExpertDecisionTrees: FL   # This is an alias for `FuzzyLogic.jl`
using Plots

hot = FL.GaussianMF(35.0, 5.0)  # temp>25
plot(hot, -10, 50)

In [None]:
cold = FL.GaussianMF(10.0, 7.5) # temp≤25
plot(cold, -10, 50)

In [None]:
hot(32)

In [None]:
cold(32)

In [None]:
hot(12)

In [None]:
cold(12)

In [None]:
hot(25)

In [None]:
cold(25)
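
The values above can be cross-checked by hand. The sketch below assumes that a Gaussian membership function with parameters `(mu, sigma)` computes `exp(-(x - mu)^2 / (2 * sigma^2))`; the names `gaussian`, `hot_sketch` and `cold_sketch` are introduced here and are not part of `FuzzyLogic.jl`.

```julia
# Assumed formula for a Gaussian membership function (mean, then spread)
gaussian(mu, sigma) = x -> exp(-(x - mu)^2 / (2 * sigma^2))

hot_sketch = gaussian(35.0, 5.0)
cold_sketch = gaussian(10.0, 7.5)

for t in (12, 25, 32)
    h = round(hot_sketch(t), digits=3)
    c = round(cold_sketch(t), digits=3)
    println("t = $t: hot = $h, cold = $c")
end
```

Under this formula the two curves cross exactly at 25, where both equal `exp(-2)` (about 0.135): this is the soft counterpart of the crisp threshold `temp > 25`.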

To soften the decision tree, we use the `manify` function, specifying:
- the original decision tree
- the portion of the training set to use
- the kind of membership function to use for each expert (one argument per
expert)

In [None]:
# One membership function type per expert (here, a single expert)
medt = manify(dt, X_train_exp, FL.GaussianMF)

In [None]:
y_pred_mxa = map(eachrow(X_test)) do row
    result = ManyExpertDecisionTrees.apply(
        medt,
        mxa,
        Vector{Float64}(row)
    )
    return length(result) != 1 ? :vague : first(result)
end

In [None]:
n_total = length(y_test)

n_correct = count(i -> y_pred_mxa[i] == y_test[i], 1:n_total)
(n_correct / n_total) * 100

In [None]:
n_vague = count(==(:vague), y_pred_mxa)
(n_vague / n_total) * 100


In [None]:
n_wrong = n_total - n_correct - n_vague
(n_wrong / n_total) * 100
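
The three percentages computed above always add up to 100, so they can be bundled into one small helper. This is only a convenience sketch: `outcome_percentages` and the toy vectors are hypothetical names, not part of any package.

```julia
# Hypothetical helper: percentages of correct, vague and wrong predictions
function outcome_percentages(y_pred, y_true; vague_label=:vague)
    n_total = length(y_true)
    n_vague = count(==(vague_label), y_pred)
    n_correct = count(i -> y_pred[i] == y_true[i], 1:n_total)
    n_wrong = n_total - n_correct - n_vague
    return (
        correct = 100 * n_correct / n_total,
        vague = 100 * n_vague / n_total,
        wrong = 100 * n_wrong / n_total,
    )
end

# Made-up labels, just to exercise the function
toy_true = ["setosa", "setosa", "virginica", "versicolor"]
toy_pred = Any["setosa", :vague, "virginica", "virginica"]
outcome_percentages(toy_pred, toy_true)
```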

Wow, this was lucky! This improved performance!

Probably, the heuristic didn't choose the "best" attribute at each step: with
softening, we can make up for it!

Let's compare different many-expert algebras...
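
Here, "different algebras" means every way of picking one, two, or three experts from the three available logics, where order does not matter and repeats are allowed. The plain-Julia sketch below enumerates these multisets on string labels to show how many cases that gives; `multisets` is a hypothetical helper written for illustration, not the `Combinatorics.jl` function.

```julia
# Hypothetical helper: all multisets of size `k` drawn from `items`
function multisets(items, k, start=1)
    k == 0 && return [eltype(items)[]]
    out = Vector{Vector{eltype(items)}}()
    for j in start:length(items)
        # Only pick items from position `j` onwards, so order never matters
        for tail in multisets(items, k - 1, j)
            push!(out, vcat(items[j], tail))
        end
    end
    return out
end

labels = ["G", "L", "P"]   # Gödel, Łukasiewicz, Product
combos = vcat((multisets(labels, k) for k in 1:length(labels))...)
join.(combos)              # 3 + 6 + 10 = 19 combinations
```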

In [None]:
using Combinatorics

allexperts = (GodelLogic, LukasiewiczLogic, ProductLogic);

# Compute all possible expert combinations (with replacement)
expertcomb = begin
    c = Vector{Vector{FuzzyLogic}}()
    for i in 1:length(allexperts)
        append!(c, collect(Combinatorics.with_replacement_combinations(allexperts, i)))
    end
    c
end;

# This is useful to read results later 
expertcombreadable = map(expertcomb) do experts
    result = ""
    for expert in experts
        if (expert === GodelLogic)
            result *= "G"
        end
        if (expert === LukasiewiczLogic)
            result *= "L"
        end
        if (expert === ProductLogic)
            result *= "P"
        end
    end

    return result
end;

correct = [[0.0, 0.0] for _ in 1:length(expertcomb)];
wrong = [[0.0, 0.0] for _ in 1:length(expertcomb)];
vague = [[0.0, 0.0] for _ in 1:length(expertcomb)];

n_runs = 10

for i in 1:n_runs
    # Partition set into training and validation
    X_train, y_train, X_test, y_test = begin
        train, test = partition(eachindex(y), 0.8, shuffle=true, rng=i)
        X_train, y_train = X[train, :], y[train]
        X_test, y_test = X[test, :], y[test]
        X_train, y_train, X_test, y_test
    end

    # Build a standard decision tree
    dt = build_tree(y_train, Matrix(X_train))
    dt = prune_tree(dt, 0.9)

    # For each expert combination, build a ManyExpertDecisionTree 
    Threads.@threads for k in eachindex(expertcomb)
        mf_experts = ntuple(_ -> FL.GaussianMF, length(expertcomb[k]))
        MXA = ManyExpertAlgebra(expertcomb[k]...)

        medt = manify(dt, X_train, mf_experts...)

        y_pred = map(eachrow(X_test)) do row
            result = ManyExpertDecisionTrees.apply(
                medt,
                MXA,
                Vector{Float64}(row)
            )
            return length(result) != 1 ? :vague : first(result)
        end

        # Compute statistics
        n_total = length(y_test)

        n_vague = count(==(:vague), y_pred)
        pvague = (n_vague / n_total) * 100

        n_correct = count(i -> y_pred[i] == y_test[i], 1:n_total)
        pcorrect = (n_correct / n_total) * 100

        n_wrong = n_total - n_correct - n_vague
        pwrong = (n_wrong / n_total) * 100

        deltacorrect = (pcorrect - correct[k][1])
        correct[k][1] += deltacorrect / i
        correct[k][2] += deltacorrect * (pcorrect - correct[k][1])

        deltawrong = (pwrong - wrong[k][1])
        wrong[k][1] += deltawrong / i
        wrong[k][2] += deltawrong * (pwrong - wrong[k][1])

        deltavague = (pvague - vague[k][1])
        vague[k][1] += deltavague / i
        vague[k][2] += deltavague * (pvague - vague[k][1])

    end
end

# Process results: extract means and compute standard deviations (sample std)
correct_mean = [x[1] for x in correct]
correct_std = [sqrt(x[2] / (n_runs - 1)) for x in correct]

wrong_mean = [x[1] for x in wrong]
wrong_std = [sqrt(x[2] / (n_runs - 1)) for x in wrong]

vague_mean = [x[1] for x in vague]
vague_std = [sqrt(x[2] / (n_runs - 1)) for x in vague]

df = DataFrame(
    experts=expertcombreadable,
    correct_mean=correct_mean,
    correct_std=correct_std,
    wrong_mean=wrong_mean,
    wrong_std=wrong_std,
    vague_mean=vague_mean,
    vague_std=vague_std
)
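
The `delta...` updates in the loop above follow Welford's online algorithm: each two-element vector stores the running mean and the running sum of squared deviations, so no per-run values need to be kept. A minimal standalone sketch, with made-up percentages and a hypothetical `welford_update!` helper, shows that it matches `mean` and `std`:

```julia
using Statistics

# `state` holds [running mean, sum of squared deviations from the mean]
function welford_update!(state, x, i)
    delta = x - state[1]
    state[1] += delta / i                  # update the mean with the i-th sample
    state[2] += delta * (x - state[1])     # uses both the old and the new mean
    return state
end

samples = [92.0, 96.7, 90.0, 93.3, 100.0]  # made-up percentages
state = [0.0, 0.0]
for (i, x) in enumerate(samples)
    welford_update!(state, x, i)
end

running_mean = state[1]
running_std = sqrt(state[2] / (length(samples) - 1))   # sample standard deviation

running_mean ≈ mean(samples) && running_std ≈ std(samples)
```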

**Exercise**: play some more with the iris dataset, trying different
combinations of experts. Which one works best?

**Exercise**: put into practice what you learned using the following dataset!

In [None]:
using CSV
using DataFrames

data = DataFrame(CSV.File("../datasets/penguins.csv"))

We need a bit of data preprocessing...

(We will see more about it tomorrow!)

In [None]:
using Impute

data_nomissing = Impute.filter(data; dims=:rows);

In [None]:
schema(data_nomissing)

In [None]:
data_drop_cols = select!(data_nomissing, Not(:island, :sex))