In [None]:
using Pkg
Pkg.activate("..")
Pkg.instantiate()
Pkg.update()

In [None]:
using Random

Random.seed!(1235)

## Identifying academically vulnerable learners in first-year science programmes 
[
    Identifying academically vulnerable learners in first-year science
    programmes at a South African higher-education institution
](
    https://sacj.cs.uct.ac.za/index.php/sacj/article/view/832
)

In [None]:
using ARFFFiles
using DataFrames

data = ARFFFiles.load(
    DataFrame, 
    joinpath("..", "datasets", "academically-vulnerable-learners.arff")
)

describe(data)

Oh no! Some attributes have a lot of missing values!

In [None]:
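# Collect (attribute name, number of missing values) for every attribute
# that has at least one missing value.
attributes_with_missings = Vector{Tuple{String, Int}}()

for attribute_name in names(data)
    n_missings = count(ismissing, data[:, attribute_name])

    if n_missings > 0
        push!(attributes_with_missings, (attribute_name, n_missings))
    end
end

# Sort in ascending order of missing values (fewest first).
attxmiss = sort!(attributes_with_missings, by = x -> x[2], rev = false)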
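
The same counting idea can be tried on a tiny table before running it on the real dataset. The sketch below uses only Base Julia; the table `toy` and its column names are made up for illustration and have nothing to do with the ARFF file.

```julia
# Hypothetical toy table stored as a NamedTuple of columns.
toy = (
    a = [1, missing, 3, missing],
    b = ["x", "y", "z", "w"],
    c = [missing, missing, missing, 1.0],
)

# One (name, number of missing values) tuple per column.
toy_counts = [(String(name), count(ismissing, col)) for (name, col) in pairs(toy)]

# Keep only the columns with at least one missing value, fewest first.
toy_with_missings = sort!(filter(t -> t[2] > 0, toy_counts), by = last)
# toy_with_missings == [("a", 2), ("c", 3)]
```

Column `b` has no missing values, so it does not appear in the result.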

We have some preprocessing to do!

Let's start by dropping the columns with the most missing values: we keep only the ten attributes with the fewest.

In [None]:
# we want to drop these features
colstodrop = [feature for (feature, nmisses) in attxmiss[11:end]]

In [None]:
select!(data, Not(colstodrop));    # drop all listed columns in place; remember the bang!
describe(data)

We still have some missing values!

Let's remove rows with missing values.

In [None]:
using Impute

data_nomissing = Impute.filter(data; dims=:rows)
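
Row-wise filtering simply discards every row that contains at least one `missing`. Here is a minimal sketch of that idea in Base Julia, on made-up rows; it illustrates the principle and is not how `Impute.filter` is implemented.

```julia
# Hypothetical rows, each one a NamedTuple.
toy_rows = [
    (a = 1,       b = "x"),
    (a = missing, b = "y"),
    (a = 3,       b = missing),
    (a = 4,       b = "w"),
]

# Keep only the rows in which no value is missing.
complete_rows = filter(row -> !any(ismissing, values(row)), toy_rows)
# Two rows survive: (a = 1, b = "x") and (a = 4, b = "w").
```

The price is that every dropped row is lost for training, which is why we removed the worst columns first.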

In [None]:
using MLJ

schema(data_nomissing)

Let's see which kind of models we could use...

In [None]:
y, X = unpack(data_nomissing, ==(Symbol("Risk Status")))

models(matching(X,y))

Too bad! Most models don't work with categorical values out of the box...

This includes the `DecisionTreeClassifier` from `DecisionTree.jl`!

Hence, we first need to encode these values as numerical values.

One possibility is to convert the scientific type (scitype) of the associated
features from `Multiclass` to `Continuous` or `OrderedFactor`.

In [None]:
data_preprocessed = coerce(data_nomissing, "Risk Status"=>OrderedFactor)
data_preprocessed = coerce(data_preprocessed, Multiclass=>Continuous)

schema(data_preprocessed)
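
To get a feeling for what such an encoding does, here is a small Base Julia sketch. It assumes that each category is replaced by its position in the sorted list of levels; the values are made up, and the exact codes produced by `coerce` may differ.

```julia
# Hypothetical categorical column.
raw = ["low", "high", "medium", "low"]

# Levels in sorted (alphabetical) order: ["high", "low", "medium"].
toy_levels = sort(unique(raw))

# Map each level to its position, stored as a floating-point number.
level_code = Dict(level => Float64(i) for (i, level) in enumerate(toy_levels))

encoded = [level_code[v] for v in raw]
# encoded == [2.0, 1.0, 3.0, 2.0]
```

Note that the numbers impose an order ("high" < "low" < "medium" here) that may mean nothing for the original categories. A decision tree can still split on such a feature, but the thresholds it finds should be read with this in mind.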

Let's have a look at the data...

In [None]:
y, X = unpack(data_preprocessed, ==(Symbol("Risk Status")))

Great! We can now use a `DecisionTreeClassifier` like in our example!

In [None]:
models(matching(X,y))

Let's first hold out a random 20% of our dataset as a test set: we will use it
later to evaluate our model, and train on the remaining 80%.

In [None]:
y, X = unpack(data_preprocessed, ==(Symbol("Risk Status")))

(X_train, X_test), (y_train, y_test) = partition(
    (X, y),
    0.8,
    rng=13,
    shuffle=true,
    multi=true
);
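
Under the hood, a shuffled split boils down to permuting the row indices and cutting the permutation in two. The sketch below shows this with the `Random` standard library on ten made-up row indices; the rounding rule and the resulting indices are assumptions and need not match what `partition` produces.

```julia
using Random

n_rows = 10                                       # hypothetical number of rows
shuffled = shuffle(MersenneTwister(13), 1:n_rows) # random permutation of 1:10

n_train = floor(Int, 0.8 * n_rows)                # 8 rows for training
train_idx = shuffled[1:n_train]
test_idx  = shuffled[n_train+1:end]               # remaining 2 rows for testing
```

Every row lands in exactly one of the two sets, so the test set stays unseen during training.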

Let's try to work following the pipeline we learned this week!

In [None]:
try
    DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
catch
    println("The DecisionTreeClassifier symbol has already been imported.")
end

In [None]:
model = MLJDecisionTreeInterface.DecisionTreeClassifier()

In [None]:
mach = machine(model, X_train, y_train)

In [None]:
fit!(mach)

In [None]:
🌱 = fitted_params(mach).tree   # \:seedling:

Let's evaluate performance!

In [None]:
y_predict_probabilities = predict(mach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)
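
The classifier returns a probability distribution for each instance, and taking the mode turns it into a single class. A minimal sketch with plain dictionaries follows; the class names and probabilities are made up, and `most_probable` is a hypothetical helper, not an MLJ function.

```julia
# Hypothetical class probabilities for two test instances.
toy_probabilities = [
    Dict("at risk" => 0.7, "safe" => 0.3),
    Dict("at risk" => 0.2, "safe" => 0.8),
]

# Return the class with the highest probability.
function most_probable(distribution)
    classes = collect(keys(distribution))
    return classes[argmax([distribution[c] for c in classes])]
end

point_predictions = most_probable.(toy_probabilities)
# point_predictions == ["at risk", "safe"]
```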

In [None]:
accuracy(cm)
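
Accuracy is just the share of the confusion matrix that sits on its main diagonal. The sketch below rebuilds both by hand on made-up labels, assuming rows index the predicted class and columns the true class.

```julia
# Hypothetical labels, for illustration only.
toy_truth     = ["at risk", "safe", "safe", "at risk", "safe"]
toy_predicted = ["at risk", "safe", "at risk", "at risk", "safe"]

toy_classes = sort(unique(toy_truth))            # ["at risk", "safe"]
class_index = Dict(c => i for (i, c) in enumerate(toy_classes))

# Rows: predicted class; columns: true class (assumed convention).
toy_cm = zeros(Int, length(toy_classes), length(toy_classes))
for (p, t) in zip(toy_predicted, toy_truth)
    toy_cm[class_index[p], class_index[t]] += 1
end

# Correct predictions lie on the main diagonal.
toy_accuracy = sum(toy_cm[i, i] for i in eachindex(toy_classes)) / sum(toy_cm)
# toy_cm == [2 1; 0 2] and toy_accuracy == 0.8
```

The single off-diagonal entry is the "safe" learner that was wrongly flagged as "at risk".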

Let's extract logical rules!

In [None]:
using SoleModels

🌲 = solemodel(🌱)  # \:evergreen_tree:

In [None]:
listrules(🌲)
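
Each rule corresponds to one path from the root of the tree to a leaf: the conditions met along the path form the antecedent, and the leaf's class is the consequent. Here is a self-contained sketch of that idea; the types `Leaf` and `Split`, the function `path_rules`, and the toy tree are all hypothetical and unrelated to the SoleModels data structures.

```julia
# Hypothetical minimal tree types.
struct Leaf
    label::String
end

struct Split
    feature::String
    threshold::Float64
    left::Union{Leaf, Split}    # branch taken when feature < threshold
    right::Union{Leaf, Split}   # branch taken otherwise
end

# Collect one (antecedent, class) pair per root-to-leaf path.
function path_rules(node, conditions = String[])
    node isa Leaf && return [(join(conditions, " ∧ "), node.label)]
    left  = path_rules(node.left,  [conditions; "$(node.feature) < $(node.threshold)"])
    right = path_rules(node.right, [conditions; "$(node.feature) ≥ $(node.threshold)"])
    return [left; right]
end

toy_tree = Split("mark", 50.0,
    Leaf("at risk"),
    Split("attendance", 0.8, Leaf("at risk"), Leaf("safe")))

toy_rules = path_rules(toy_tree)
# 3 rules, e.g. ("mark ≥ 50.0 ∧ attendance ≥ 0.8", "safe")
```

A tree with three leaves yields exactly three rules, and every instance satisfies exactly one of them.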

Let's evaluate each formula (or logical rule) separately.

In [None]:
apply!(🌲, X_test, y_test);
metricstable(
    🌲;
    normalize = true,
    metrics_kwargs = (;
        additional_metrics = (;
            height = r->SoleLogics.height(antecedent(r))
        )
    )
)
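
Two common ways to score a single rule are its coverage (how many instances satisfy the antecedent) and its confidence (how many of the covered instances really belong to the predicted class). The sketch below computes both by hand for one made-up rule on made-up data; the definitions are the usual ones and are assumed, not taken from the `metricstable` source.

```julia
# Hypothetical test data.
toy_marks  = [40.0, 45.0, 60.0, 70.0, 30.0]
toy_labels = ["at risk", "safe", "safe", "safe", "at risk"]

# Hypothetical rule: mark < 50  →  "at risk"
covered = toy_marks .< 50.0

# Share of instances that satisfy the antecedent.
coverage = count(covered) / length(toy_marks)

# Share of covered instances whose true class matches the consequent.
confidence = count(toy_labels[covered] .== "at risk") / count(covered)
# coverage == 0.6 and confidence ≈ 2/3
```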

Let's summarize our model by joining the rules associated with the same class!

In [None]:
metricstable(joinrules(🌲; min_ncovered = 1, normalize = true))
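
Joining rules with the same class amounts to taking the disjunction of their antecedents, which leaves one rule per class. A minimal string-based sketch on made-up rules follows; the real `joinrules` works on logical formulas, not strings.

```julia
# Hypothetical (antecedent, class) rules.
toy_rule_list = [
    ("mark < 50.0", "at risk"),
    ("mark ≥ 50.0 ∧ attendance < 0.8", "at risk"),
    ("mark ≥ 50.0 ∧ attendance ≥ 0.8", "safe"),
]

# One joined antecedent per class, built as a disjunction.
joined = Dict{String, String}()
for (antecedent, label) in toy_rule_list
    joined[label] = haskey(joined, label) ?
        "$(joined[label]) ∨ ($antecedent)" : "($antecedent)"
end
# joined["at risk"] == "(mark < 50.0) ∨ (mark ≥ 50.0 ∧ attendance < 0.8)"
```

Three rules collapse into two, one for each class.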

Let's now try to learn a random forest.

In [None]:
try
    RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree
catch
    println("The RandomForestClassifier symbol has already been imported.")
end

In [None]:
forest = MLJDecisionTreeInterface.RandomForestClassifier(n_trees=10)

In [None]:
forestmach = machine(forest, X_train, y_train)

In [None]:
MLJ.fit!(forestmach, verbosity=0)
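
A random forest classifies an instance by letting each tree vote and combining the votes. The sketch below shows plain majority voting on made-up votes from three trees; the way the fitted forest turns votes into probabilities is assumed, not checked.

```julia
# Hypothetical votes of three trees for one instance.
toy_votes = ["at risk", "safe", "at risk"]

# Count the votes for each class.
vote_counts = Dict{String, Int}()
for v in toy_votes
    vote_counts[v] = get(vote_counts, v, 0) + 1
end

# The class with the most votes wins.
winner = first(sort(collect(vote_counts), by = last, rev = true))[1]
# winner == "at risk"
```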

In [None]:
🌱🌱🌱 = fitted_params(forestmach).forest   # \:seedling:

Let's evaluate its performance.

In [None]:
y_predict_probabilities = MLJ.predict(forestmach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)

In [None]:
accuracy(cm)

Let's extract logical rules!

In [None]:
🌲🌲🌲 = solemodel(🌱🌱🌱)  # \:evergreen_tree:

In [None]:
listrules(🌲🌲🌲)

This probably caused an `OutOfMemoryError()` (on my machine, it sure did!)

To appreciate this last part, let's play with a smaller model.

In [None]:
forest = MLJDecisionTreeInterface.RandomForestClassifier(max_depth=3, n_trees=3)
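
Bounding depth and number of trees keeps the number of rules small. A rough back-of-the-envelope count, assuming binary trees in which `max_depth=3` allows at most 2^3 leaves; how the rules of different trees get combined is an assumption too, so both extremes are shown.

```julia
# Upper bound on the leaves (and thus path rules) of a binary tree.
max_leaves(depth) = 2^depth

rules_per_tree = max_leaves(3)               # at most 8 rules per tree

# If the rules of the 3 trees are simply listed one after the other:
rules_listed = 3 * rules_per_tree            # at most 24

# If one path per tree is combined into a single rule:
rules_combined = rules_per_tree^3            # at most 512
```

Without a depth limit, and with ten trees instead of three, the second count grows very quickly.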

In [None]:
forestmach = machine(forest, X_train, y_train)

In [None]:
MLJ.fit!(forestmach, verbosity=0)

In [None]:
🌱🌱 = fitted_params(forestmach).forest # \:seedling:

Let's evaluate its performance.

In [None]:
y_predict_probabilities = MLJ.predict(forestmach, X_test)
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)

In [None]:
accuracy(cm)

Let's extract logical rules!

In [None]:
🌲🌲 = solemodel(🌱🌱)  # \:evergreen_tree:

In [None]:
listrules(🌲🌲)

Let's summarize our model by joining the rules associated with the same class!

In [None]:
metricstable(joinrules(🌲🌲; min_ncovered = 1, normalize = true))