# Real-World Applications

In [None]:
using Pkg
Pkg.activate("..")
Pkg.instantiate()
Pkg.update()

## Identifying academically vulnerable learners in first-year science programmes 
[
    Identifying academically vulnerable learners in first-year science
    programmes at a South African higher-education institution
](https://sacj.cs.uct.ac.za/index.php/sacj/article/view/832).

In [None]:
using ARFFFiles
using DataFrames

data = ARFFFiles.load(
    DataFrame, 
    joinpath("..", "dataset", "academically-vulnerable-learners.arff")
)

describe(data)

Oh no! Some attributes have a lot of missing values! Let's count them per attribute.

In [None]:
# Collect (attribute name, number of missing values) pairs
attributes_with_missings = Vector{Tuple{String, Int}}()

for attribute_name in names(data)
    n_missings = count(ismissing, data[!, attribute_name])

    if n_missings > 0
        push!(attributes_with_missings, (attribute_name, n_missings))
    end
end

# Attributes with the most missing values come first
sort!(attributes_with_missings, by = last, rev = true)
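Raw counts are hard to judge without the number of rows, so it can help to look at the share of missing values instead. Here is a minimal sketch on a made-up `toy` table (the column names and values are invented for illustration, they are not from the dataset):

```julia
using DataFrames
using Statistics

# Hypothetical toy table, only used to illustrate the idea
toy = DataFrame(
    a = [1, missing, 3, missing],
    b = ["x", "y", "z", "w"],
    c = [missing, 2.0, 3.0, 4.0],
)

# Share of missing values per column: mean of a Boolean test over the column
missing_share = [(name, mean(ismissing, toy[!, name])) for name in names(toy)]

# Keep only the columns that actually contain missing values, worst first
filter!(pair -> pair[2] > 0, missing_share)
sort!(missing_share, by = last, rev = true)
```

On the toy table, half of `a` and a quarter of `c` are missing, while `b` is complete and is filtered out.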

Some preprocessing is required. Let's remove the columns that contain missing values.

In [None]:
using Impute

data_nomissing = Impute.filter(data; dims=:cols)

describe(data_nomissing)
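To make clear what dropping columns does, here is a rough equivalent written with plain `DataFrames` on a made-up `toy` table (names and values are hypothetical). It is only a sketch of the idea, not what `Impute` does internally:

```julia
using DataFrames

# Hypothetical toy table: `score` has one missing value
toy = DataFrame(
    id = 1:3,
    score = [0.5, missing, 0.9],
    group = ["a", "b", "c"],
)

# Names of the columns without any missing value
complete_columns = [name for name in names(toy) if !any(ismissing, toy[!, name])]

# Keep only those columns; all rows are preserved
toy_nomissing = select(toy, complete_columns)
```

Note the trade-off: we keep every row but lose the whole `score` feature, even though only one of its values was missing.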

In [None]:
using MLJ

schema(data_nomissing)

Let's see which kind of models we could use...

In [None]:
y, X = unpack(data_nomissing, ==(Symbol("Risk Status")))

models(matching(X,y))

Too bad! Most models don't work with categorical values out of the box...

This includes the `DecisionTreeClassifier` from `DecisionTree.jl`!

Hence, we first need to encode these values as numerical values.

One possibility is to coerce the scientific type of the associated features from
`Multiclass` to `Continuous` or `OrderedFactor`.

In [None]:
data_preprocessed = coerce(data_nomissing, "Risk Status"=>OrderedFactor)
data_preprocessed = coerce(data_preprocessed, Multiclass=>Continuous)

schema(data_preprocessed)
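What does turning a category into a number mean in practice? The sketch below uses `CategoricalArrays` directly on made-up values, and assumes the coercion replaces each category by the position of its level (its level code). By default, levels are sorted, so the resulting numbers follow alphabetical order rather than any meaningful order:

```julia
using CategoricalArrays

# Hypothetical categorical feature
grades = categorical(["low", "high", "medium", "low"])

# Default levels are sorted: ["high", "low", "medium"]
default_levels = levels(grades)

# Each value is replaced by the position of its level, as a float
codes = float.(levelcode.(grades))

# Declaring a meaningful order changes the codes accordingly
levels!(grades, ["low", "medium", "high"])
ordered_codes = float.(levelcode.(grades))
```

With the default levels, "high" gets the smallest code. After reordering, the codes follow low < medium < high. A decision tree splits on thresholds, so the order of the codes matters.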

Let's have a look at the data...

In [None]:
y, X = unpack(data_preprocessed, ==(Symbol("Risk Status")))

Great! We can now use a `DecisionTreeClassifier` like in our example!

In [None]:
models(matching(X,y))

Let's first set aside a random sample of our dataset as a test set: we will use
it later to evaluate our model on data it has not seen during training.

In [None]:
using Random  # provides `shuffle`

data_shuffled = shuffle(data_preprocessed)  # Let's first shuffle our data
y, X = unpack(data_shuffled, ==(Symbol("Risk Status")))
X_train, y_train = X[1:600, :], y[1:600]     # first 600 rows for training
X_test, y_test = X[601:800, :], y[601:800];  # next 200 rows for testing
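The split above hard-codes the row ranges. A small sketch of the same idea with a fraction and a seeded random number generator makes the split reproducible and independent of the table size. The values `n_rows = 800`, `train_fraction = 0.75` and the seed are assumptions for illustration (MLJ also provides a `partition` helper for this purpose):

```julia
using Random

rng = MersenneTwister(42)   # fixed seed so the split can be reproduced
n_rows = 800                # hypothetical number of rows
train_fraction = 0.75       # hypothetical share of rows used for training

# Random permutation of the row indices
perm = randperm(rng, n_rows)

# First part of the permutation for training, the rest for testing
n_train = floor(Int, train_fraction * n_rows)
train_idx = perm[1:n_train]
test_idx = perm[n_train+1:end]
```

The two index vectors can then be used to select rows, for instance `X[train_idx, :]` and `y[train_idx]`.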

Let's try to work following the pipeline we learned this week!

In [None]:
try
    DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
catch
    println("The DecisionTreeClassifier symbol has already been imported.")
end

In [None]:
model = MLJDecisionTreeInterface.DecisionTreeClassifier()

In [None]:
mach = machine(model, X_train, y_train)

In [None]:
fit!(mach)

In [None]:
fitted_params(mach).tree

In [None]:
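# `predict` returns one probability distribution over the classes per row
y_predict_probabilities = predict(mach, X_test)
# Keep the most probable class for each row
y_predict = mode.(y_predict_probabilities)
cm = confusion_matrix(y_predict, y_test)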

In [None]:
accuracy(cm)
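To see what these two numbers summarise, here is a hand-made version on invented labels (the class names "At Risk" and "Safe" and the predictions are hypothetical). The layout with predictions in rows and ground truth in columns is an assumption about how the matrix above is displayed:

```julia
using Statistics

# Hypothetical ground truth and predictions
y_true = ["At Risk", "Safe", "Safe", "At Risk", "Safe"]
y_pred = ["At Risk", "Safe", "At Risk", "At Risk", "Safe"]

# Accuracy: share of predictions that match the ground truth
acc = mean(y_pred .== y_true)

# Confusion counts: rows are predicted classes, columns are true classes
classes = sort(unique(y_true))
counts = [count((y_pred .== p) .& (y_true .== t)) for p in classes, t in classes]

# Accuracy is also the diagonal of the matrix divided by the total
acc_from_counts = sum(counts[i, i] for i in eachindex(classes)) / sum(counts)
```

Here one "Safe" learner is wrongly flagged as "At Risk", so 4 out of 5 predictions are correct. Accuracy alone hides which kind of error was made, which is why the confusion matrix is worth reading too.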