# Fourth Approach: Dimensionality reduction using PCA and re-sampling with SMOTE

In this fourth approach, we will combine the techniques used in the second and third approaches. We will preprocess the data by applying **Principal Component Analysis (PCA)** to reduce the dimensionality of the dataset, followed by the **Synthetic Minority Over-sampling Technique (SMOTE)** to address class imbalance. First, we will load the data and apply PCA to all features, choosing the number of components from the explained variance ratio. This retains most of the data's variance while reducing the number of dimensions.
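The selection by explained variance ratio can be illustrated with a minimal sketch that uses only Julia's standard library. It is not the PCA code this notebook actually runs (that happens inside `modelCrossValidation`, whose implementation is not shown here); the function name `components_for_variance` and the toy matrix are assumptions made for illustration. The threshold of 0.95 mirrors the `pcaComponents=0.95` argument passed in the cross-validation calls.

```julia
using LinearAlgebra, Statistics

# Number of principal components needed to reach a target explained-variance ratio.
# `X` holds one sample per row; `threshold` is the fraction of variance to retain.
# (Hypothetical helper, not part of the notebook's own files.)
function components_for_variance(X::AbstractMatrix{<:Real}, threshold::Real)
    Xc = X .- mean(X, dims=1)          # center each feature
    s = svdvals(Xc)                    # singular values, in decreasing order
    ratios = s .^ 2 ./ sum(s .^ 2)     # explained variance ratio per component
    cumulative = cumsum(ratios)
    # A small tolerance guards against round-off when the threshold is close to 1
    n = findfirst(>=(threshold - 1e-12), cumulative)
    return n, ratios
end

# Toy data: the third feature is an exact multiple of the first,
# so it does not add a new direction of variance
X = [1.0 2.0 2.0;
     2.0 1.0 4.0;
     3.0 4.0 6.0;
     4.0 3.0 8.0]

n, ratios = components_for_variance(X, 0.95)
println("Components kept: ", n)
println("Explained variance ratios: ", round.(ratios, digits=3))
```

Because one feature in the toy data is redundant, two components are enough to pass the 95% threshold. On the real dataset the number of components depends on how correlated the features are.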

After the PCA transformation, we will apply SMOTE to generate synthetic samples for the minority class, balancing the dataset before training the models. This combination of dimensionality reduction and oversampling is aimed at improving model performance, particularly on imbalanced datasets.
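As a rough sketch of how SMOTE builds synthetic samples, the code below interpolates between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` is a hypothetical simplification written for this explanation; the `smote` helper used later in the notebook comes from the project's own files and may differ in its details.

```julia
using Random, LinearAlgebra

# Generate `n_new` synthetic samples from the rows of `minority` (one sample per row).
# Each synthetic point lies on the segment between a sample and one of its
# k nearest neighbours within the minority class.
function smote_sketch(minority::AbstractMatrix{<:Real}, n_new::Integer, k::Integer;
                      rng::AbstractRNG=Random.default_rng())
    n, d = size(minority)
    @assert n > k "need more minority samples than neighbours"
    synthetic = Matrix{Float64}(undef, n_new, d)
    for s in 1:n_new
        i = rand(rng, 1:n)
        # Euclidean distance from sample i to every minority sample
        dists = [norm(minority[i, :] .- minority[j, :]) for j in 1:n]
        # Position 1 is the closest point (normally the sample itself), so skip it
        neighbours = partialsortperm(dists, 2:k+1)
        j = rand(rng, neighbours)
        gap = rand(rng)                # random position along the segment, in [0, 1)
        synthetic[s, :] = minority[i, :] .+ gap .* (minority[j, :] .- minority[i, :])
    end
    return synthetic
end

# Toy minority class: the four corners of the unit square
minority = [0.0 0.0; 1.0 0.0; 0.0 1.0; 1.0 1.0]
synthetic = smote_sketch(minority, 6, 2; rng=MersenneTwister(42))
println(size(synthetic))
```

Every synthetic point is a convex combination of two existing minority samples, so it stays inside the region the minority class already occupies.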

To be able to compare the results with the rest of the approaches, we will use the same configuration of hyperparameters for the models:

- **ANN**:
  - Hidden layers: 1, number of neurons in the hidden layer: $[16, 32, 64]$.
  - Hidden layers: 2, number of neurons in the hidden layers: $[(16, 16), (32, 16), (32, 32), (64, 32), (64, 64)]$.
- **Decision Tree**:
  - Maximum depth of the tree $\in \{3, 5, 10, 15, 20, \text{None}\}$
- **SVM**:
  - Kernel $\in \{\text{linear}, \text{poly}, \text{rbf}, \text{sigmoid}\}$
  - C $\in \{0.1, 1, 10\}$
- **KNN**:
  - $k \in \{1, 3, 5, 7, 9, 11, 13, 15\}$

After training the models, we will train an ensemble model with the three best models. The methods used to combine the models will be:

- **Majority voting**
- **Weighted voting**
- **Naive Bayes**
- **Stacking** (using a logistic regression as the meta-model)
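
The first two combination methods can be sketched in a few lines. This is an illustration only: the function names, the toy predictions and the weights `[0.2, 0.2, 0.7]` are assumptions and are not taken from the notebook's ensemble code. In practice the weights would typically come from a validation metric of each model.

```julia
# Sum the weight behind each label and return the label with the highest total.
# Ties are resolved in favour of the label that appears first in `labels`.
function weighted_winner(labels::AbstractVector, weights::AbstractVector{<:Real})
    scores = Dict{eltype(labels),Float64}()
    for (label, w) in zip(labels, weights)
        scores[label] = get(scores, label, 0.0) + w
    end
    best = first(labels)
    for label in labels
        if scores[label] > scores[best]
            best = label
        end
    end
    return best
end

# Weighted voting over the predictions of several models (one vector of labels per model)
function weighted_vote(predictions::AbstractVector{<:AbstractVector},
                       weights::AbstractVector{<:Real})
    n_samples = length(first(predictions))
    return [weighted_winner([p[i] for p in predictions], weights) for i in 1:n_samples]
end

# Majority voting is the special case in which every model has the same weight
majority_vote(predictions) = weighted_vote(predictions, ones(length(predictions)))

# Hypothetical predictions of three models for three samples
svm_pred = ["Graduate", "Dropout", "Enrolled"]
knn_pred = ["Graduate", "Enrolled", "Enrolled"]
dt_pred  = ["Dropout", "Dropout", "Graduate"]
predictions = [svm_pred, knn_pred, dt_pred]

println(majority_vote(predictions))
println(weighted_vote(predictions, [0.2, 0.2, 0.7]))
```

With equal weights the label chosen by two of the three models wins. With the weights above, the third model outweighs the other two combined, so its label wins whenever they disagree with it.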


**Index**

- [Data loading](#Data-loading)
- [PCA Transformation](#PCA-Transformation)
- [Individual models](#Individual-models)
  - [ANN](#ANN)
  - [Decision Tree](#Decision-Tree)
  - [Support Vector Machine](#Support-Vector-Machine)
  - [K-Nearest Neighbors](#K-Nearest-Neighbors)
- [Ensemble model](#Ensemble-model)
  - [Majority voting](#Majority-voting)
  - [Weighted voting](#Weighted-voting)
  - [Naive Bayes](#Naive-Bayes)
  - [Stacking](#Stacking)


## Data loading

In [1]:
using DataFrames
using CSV
using Serialization
using Random  # needed for Random.seed!

In [2]:
# Load custom functions from provided files
include("preprocessing.jl")
include("metrics.jl")
include("training.jl")
include("plotting.jl")

generateComparisonTable (generic function with 2 methods)

In [3]:
# Set the random seed for reproducibility
Random.seed!(42)

# Load the dataset
dataset_path = "dataset.csv"
data = CSV.read(dataset_path, DataFrame)
data[1:5, :]

Row,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Displaced,Educational special needs,Debtor,Tuition fees up to date,Gender,Scholarship holder,Age at enrollment,International,Curricular units 1st sem (credited),Curricular units 1st sem (enrolled),Curricular units 1st sem (evaluations),Curricular units 1st sem (approved),Curricular units 1st sem (grade),Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Float64,Int64,Int64,Int64,Int64,Int64,Float64,Int64,Float64,Float64,Float64,String15
1,1,8,5,2,1,1,1,13,10,6,10,1,0,0,1,1,0,20,0,0,0,0,0,0.0,0,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
2,1,6,1,11,1,1,1,1,3,4,4,1,0,0,0,1,0,19,0,0,6,6,6,14.0,0,0,6,6,6,13.6667,0,13.9,-0.3,0.79,Graduate
3,1,1,5,5,1,1,1,22,27,10,10,1,0,0,0,1,0,19,0,0,6,0,0,0.0,0,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
4,1,8,2,15,1,1,1,23,27,6,4,1,0,0,1,0,0,20,0,0,6,8,6,13.4286,0,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
5,2,12,1,3,0,1,1,22,28,10,10,0,0,0,1,0,0,45,0,0,6,9,5,12.3333,0,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [4]:
# Separate features and target
target_column = :Target
inputs = select(data, Not(target_column))
targets = data[!, target_column];

In [5]:
inputs = Float32.(Matrix(inputs))

# Define the categories and their mapping
label_mapping = Dict("Dropout" => 0, "Graduate" => 1, "Enrolled" => 2)

# Encode the targets
targets_label_encoded = [label_mapping[label] for label in targets]

println("Encoded targets: ", targets_label_encoded[1:5])

# To decode later, create a reverse mapping
reverse_mapping = Dict(v => k for (k, v) in label_mapping)
decoded_targets = [reverse_mapping[code] for code in targets_label_encoded]

println("Decoded targets: ", decoded_targets[1:5])

Encoded targets: [0, 1, 0, 1, 1]
Decoded targets: ["Dropout", "Graduate", "Dropout", "Graduate", "Graduate"]


## PCA Transformation

In [6]:
inputs = Float32.(Matrix(inputs))

# Define the number of folds for cross-validation and obtain the indices
Random.seed!(42)
k = 5
N = size(inputs, 1)
fold_indices = crossValidation(targets, k)
metrics_to_save = [:accuracy, :precision, :recall, :f1_score];

In [7]:
target_column = :Target
println("\nClass Distribution:")
println(combine(groupby(data, target_column), nrow => :Count))


Class Distribution:
3×2 DataFrame
 Row │ Target    Count
     │ String15  Int64
─────┼─────────────────
   1 │ Dropout    1421
   2 │ Graduate   2209
   3 │ Enrolled    794


## SMOTE Experiments


In [8]:
smote_percentages = [
  Dict("Enrolled" => 200),
  Dict("Enrolled" => 300),
  Dict("Enrolled" => 200, "Dropout" => 200),
  Dict("Enrolled" => 300, "Dropout" => 200),
  Dict("Enrolled" => 200, "Graduate" => 50)
]
k = 5

open("warnings.log", "w") do file
  redirect_stderr(file) do # redirect warnings associated with joblib
    for (i, smote_percentage) in enumerate(smote_percentages)
      println("\nSmote percentages: ", smote_percentage)
      balanced_inputs, balanced_targets = smote(inputs, targets, smote_percentage, k)
      println("Number of instances: ", size(balanced_targets)[1])
      println("Elements of class Dropout: ", sum(balanced_targets .== "Dropout"))
      println("Elements of class Graduate: ", sum(balanced_targets .== "Graduate"))
      println("Elements of class Enrolled: ", sum(balanced_targets .== "Enrolled"))
    end
  end
end


Smote percentages: Dict("Enrolled" => 200)
Number of instances: 5218
Elements of class Dropout: 1421
Elements of class Graduate: 2209
Elements of class Enrolled: 1588

Smote percentages: Dict("Enrolled" => 300)
Number of instances: 6012
Elements of class Dropout: 1421
Elements of class Graduate: 2209
Elements of class Enrolled: 2382

Smote percentages: Dict("Enrolled" => 200, "Dropout" => 200)
Number of instances: 6639
Elements of class Dropout: 2842
Elements of class Graduate: 2209
Elements of class Enrolled: 1588

Smote percentages: Dict("Enrolled" => 300, "Dropout" => 200)
Number of instances: 7433
Elements of class Dropout: 2842
Elements of class Graduate: 2209
Elements of class Enrolled: 2382

Smote percentages: Dict("Enrolled" => 200, "Graduate" => 50)
Number of instances: 4113
Elements of class Dropout: 1421
Elements of class Graduate: 1104
Elements of class Enrolled: 1588
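
The counts printed above are consistent with reading each percentage as the target size of a class relative to its original size: 200 doubles the class, 300 triples it, and 50 halves it (so "Graduate" => 50 reduces that class instead of oversampling it). The sketch below reproduces the arithmetic. The rounding rule (rounding down) is inferred from 2209 becoming 1104, and `expected_counts` is a hypothetical helper, not a function from the notebook's files.

```julia
# Original class sizes, taken from the class distribution of the dataset
original_counts = Dict("Dropout" => 1421, "Graduate" => 2209, "Enrolled" => 794)

# Assumption: a percentage p rescales a class to floor(count * p / 100);
# classes that are not listed keep their original size (p = 100)
function expected_counts(counts::Dict{String,Int}, percentages::Dict{String,Int})
    return Dict(label => fld(n * get(percentages, label, 100), 100) for (label, n) in counts)
end

result = expected_counts(original_counts, Dict("Enrolled" => 200, "Graduate" => 50))
println(result, "  total = ", sum(values(result)))
```

For this configuration the sketch gives 1421, 1104 and 1588 instances (4113 in total), matching the last experiment in the log.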


In [None]:
# Best configurations
topology = [64, 32]
topology_scikit_ann = [64]
max_depth = 5
n_neighbors = 5
kernel = "linear"
C = 10

# ANN
hyperparameters_ann = Dict(
  "topology" => topology,
  "learningRate" => 0.01,
  "maxEpochs" => 100,
  "repetitionsTraining" => 10,
  "validationRatio" => 0.15,
  "maxEpochsVal" => 10,
  "minLoss" => 0.0001
)

# scikitANN
hyperparameters_scikit_ann = Dict(
  :hidden_layer_sizes => topology_scikit_ann,
  :learning_rate_init => 0.01,
  :max_iter => 100,
  :early_stopping => true,
  :tol => 0,
  :validation_fraction => 0.15,
  :n_iter_no_change => 10,
  :epsilon => 0.0001,
  :repetitionsTraining => 10
)

# DT
hyperparameters_dt = Dict(
  :max_depth => max_depth,
  :criterion => "gini",
  :min_samples_split => 2,
)

# SVM
hyperparameters_svm = Dict(
  :kernel => kernel,
  :C => C,
  :gamma => "auto",
  :probability => true,
)

# KNN
hyperparameters_knn = Dict(
  :n_neighbors => n_neighbors,
  :weights => "uniform",
  :metric => "euclidean",
)

# Define the hyperparameters for smote
k = 5
smote_percentages = [
  Dict("Enrolled" => 200),
  Dict("Enrolled" => 300),
  Dict("Enrolled" => 200, "Dropout" => 200),
  Dict("Enrolled" => 300, "Dropout" => 200),
  Dict("Enrolled" => 200, "Graduate" => 50),
  Dict{String,Int}()
];

In [None]:
Random.seed!(42)

general_results_ann = []
class_results_ann = []
general_results_scikit_ann = []
class_results_scikit_ann = []
general_results_dt = []
class_results_dt = []
general_results_svm = []
class_results_svm = []
general_results_knn = []
class_results_knn = []

open("warnings.log", "w") do file
  redirect_stderr(file) do # redirect warnings associated with joblib
    for smote_percentage in smote_percentages
      println("\nSmote percentage: ", smote_percentage)

      # ANN
      println("ANN")
      gr, cr = modelCrossValidation(
        :ANN,
        hyperparameters_ann,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applyPCA=true,
        pcaComponents=0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_ann, gr)
      push!(class_results_ann, cr)

      # Scikit ANN
      println("scikitANN")
      gr, cr = modelCrossValidation(
        :scikit_ANN,
        hyperparameters_scikit_ann,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applyPCA=true,
        pcaComponents=0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )

      push!(general_results_scikit_ann, gr)
      push!(class_results_scikit_ann, cr)

      # DT
      println("DT")
      gr, cr = modelCrossValidation(
        :DT,
        hyperparameters_dt,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applyPCA=true,
        pcaComponents=0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_dt, gr)
      push!(class_results_dt, cr)

      # SVM
      println("SVM")
      gr, cr = modelCrossValidation(
        :SVC,
        hyperparameters_svm,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applyPCA=true,
        pcaComponents=0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_svm, gr)
      push!(class_results_svm, cr)

      # KNN
      println("KNN")
      gr, cr = modelCrossValidation(
        :KNN,
        hyperparameters_knn,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applyPCA=true,
        pcaComponents=0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_knn, gr)
      push!(class_results_knn, cr)
    end
  end
end


Smote percentage: Dict("Enrolled" => 200)
ANN
Mean accuracy: 0.50329 ± 0.03332
	Class 1: 0.40221 ± 0.0872
	Class 2: 0.46878 ± 0.01671
	Class 3: 0.78019 ± 0.01686
Mean precision: 0.18837 ± 0.03023
	Class 1: 0.23631 ± 0.03359
	Class 2: 0.19763 ± 0.06564
	Class 3: 0.07684 ± 0.04322
Mean recall: 0.32559 ± 0.04436
	Class 1: 0.59687 ± 0.1737
	Class 2: 0.24289 ± 0.18126
	Class 3: 0.07023 ± 0.03771
Mean f1_score: 0.19841 ± 0.03879
	Class 1: 0.31123 ± 0.08233
	Class 2: 0.18018 ± 0.12199
	Class 3: 0.04727 ± 0.0204
scikitANN
Mean accuracy: 0.82531 ± 0.00668
	Class 1: 0.82938 ± 0.01385
	Class 2: 0.82743 ± 0.01753
	Class 3: 0.8206 ± 0.01156
Mean precision: 0.70965 ± 0.01797
	Class 1: 0.75658 ± 0.01411
	Class 2: 0.72655 ± 0.0294
	Class 3: 0.51406 ± 0.08018
Mean recall: 0.7387 ± 0.011
	Class 1: 0.86209 ± 0.05551
	Class 2: 0.7516 ± 0.10567
	Class 3: 0.26931 ± 0.08956
Mean f1_score: 0.70963 ± 0.01292
	Class 1: 0.80476 ± 0.02995
	Class 2: 0.72746 ± 0.07328
	Class 3: 0.33862 ± 0.08672
DT
Mean accuracy: 

### Save the results and analysis

In [16]:
# Save the results
results_folder = "results/"
mkpath(results_folder)  # creates the folder only if it is missing

filename = results_folder * "4_smote_results.jl"

parameters = Dict("Enrolled" => [200, 300, 200, 300, 200, 100], "Dropout" => [100, 100, 200, 200, 100, 100], "Graduate" => [100, 100, 100, 100, 50, 100])

# Create a dictionary with the results of ANN, DT, SVM, and KNN
obj = Dict(
  :ANN => Dict(
    "num_trained_models" => length(general_results_ann),
    "parameters" => parameters,
    "general_results" => general_results_ann,
    "class_results" => class_results_ann
  ),
  :scikit_ANN => Dict(
    "num_trained_models" => length(general_results_scikit_ann),
    "parameters" => parameters,
    "general_results" => general_results_scikit_ann,
    "class_results" => class_results_scikit_ann
  ),
  :DT => Dict(
    "num_trained_models" => length(general_results_dt),
    "parameters" => parameters,
    "general_results" => general_results_dt,
    "class_results" => class_results_dt
  ),
  :SVM => Dict(
    "num_trained_models" => length(general_results_svm),
    "parameters" => parameters,
    "general_results" => general_results_svm,
    "class_results" => class_results_svm
  ),
  :KNN => Dict(
    "num_trained_models" => length(general_results_knn),
    "parameters" => parameters,
    "general_results" => general_results_knn,
    "class_results" => class_results_knn
  )
)

# Save the results
open(filename, "w") do file
  serialize(file, obj)
end

In [17]:
results_folder = "results/"
filename = results_folder * "4_smote_results.jl"

# Load the results
loaded_obj = open(filename, "r") do file
  deserialize(file)
end;

In [18]:
# Generate tables for each algorithm sorted by f1 score
generateAlgorithmTables(loaded_obj, sort_by=:F1_Score, rev=true, output_dir="./tables/Approach4/smote/")


Comparison of Hyperparameter Configurations for DT (Sorted by F1_Score):
┌────────────────────────────────────────────┬──────────┬───────────┬──────────┬──────────┐
│                              Configuration │ Accuracy │ Precision │   Recall │ F1-Score │
├────────────────────────────────────────────┼──────────┼───────────┼──────────┼──────────┤
│ Enrolled: 100, Graduate: 100, Dropout: 100 │ 0.798931 │  0.710279 │ 0.703955 │ 0.695744 │
│ Enrolled: 300, Graduate: 100, Dropout: 100 │ 0.783759 │  0.693434 │ 0.680226 │ 0.676134 │
│ Enrolled: 200, Graduate: 100, Dropout: 100 │ 0.791587 │  0.680377 │ 0.703955 │ 0.672764 │
│ Enrolled: 200, Graduate: 100, Dropout: 200 │ 0.769236 │  0.677612 │ 0.667797 │ 0.649467 │
│ Enrolled: 300, Graduate: 100, Dropout: 200 │ 0.766547 │  0.670212 │ 0.661017 │ 0.643035 │
│  Enrolled: 200, Graduate: 50, Dropout: 100 │  0.68154 │  0.611539 │ 0.529944 │ 0.539169 │
└────────────────────────────────────────────┴──────────┴───────────┴──────────┴──────────┘

## ANN

We start with our own implementation of Artificial Neural Networks. To make the evaluation more robust, we will train each architecture 10 times on each fold of the cross-validation.

We trained 8 models: 3 with one hidden layer and 5 with two hidden layers. The topologies used for the hidden layers are:

- **One hidden layer**:
  - 16 neurons
  - 32 neurons
  - 64 neurons
- **Two hidden layers**:
  - (16, 16) neurons
  - (32, 16) neurons
  - (32, 32) neurons
  - (64, 32) neurons
  - (64, 64) neurons


In [19]:
# Set the random seed for reproducibility
Random.seed!(42)

topologies = [[64, 32], [16], [32], [64], [16, 16], [32, 16], [32, 32], [64, 64]]

smote_percentage = Dict("Enrolled" => 300, "Graduate" => 100, "Dropout" => 100)

general_results_ann = []
class_results_ann = []

for topology in topologies
  hyperparameters = Dict(
    "topology" => topology,
    "learningRate" => 0.01,
    "maxEpochs" => 100,
    "repetitionsTraining" => 10,
    "validationRatio" => 0.15,
    "maxEpochsVal" => 10,
    "minLoss" => 0.0001
  )

  println("Training ANN with topology: ", topology)

  gr, cr = modelCrossValidation(
    :ANN,
    hyperparameters,
    inputs,
    targets,
    fold_indices;
    metricsToSave=metrics_to_save,
    normalizationType=:zeroMean,
    applyPCA=true,
    pcaComponents=0.95,
    applySmote=true,
    smotePercentages=smote_percentage,
    smoteNeighbors=k,
    verbose=false
  )

  push!(general_results_ann, gr)
  push!(class_results_ann, cr)
end

Training ANN with topology: [64, 32]
Mean accuracy: 0.65977 ± 0.05269
	Class 1: 0.63897 ± 0.04789
	Class 2: 0.62719 ± 0.08745
	Class 3: 0.78764 ± 0.02467
Mean precision: 0.45517 ± 0.06839
	Class 1: 0.41495 ± 0.07752
	Class 2: 0.61874 ± 0.07673
	Class 3: 0.07217 ± 0.04485
Mean recall: 0.5269 ± 0.05381
	Class 1: 0.42557 ± 0.15466
	Class 2: 0.76927 ± 0.09785
	Class 3: 0.0341 ± 0.03568
Mean f1_score: 0.46031 ± 0.07918
	Class 1: 0.37994 ± 0.12718
	Class 2: 0.66231 ± 0.08904
	Class 3: 0.04225 ± 0.03938
Training ANN with topology: [16]
Mean accuracy: 0.63504 ± 0.05098
	Class 1: 0.58392 ± 0.08432
	Class 2: 0.62649 ± 0.04824
	Class 3: 0.75031 ± 0.01248
Mean precision: 0.45757 ± 0.04323
	Class 1: 0.40101 ± 0.08343
	Class 2: 0.62315 ± 0.0512
	Class 3: 0.09812 ± 0.02221
Mean recall: 0.48036 ± 0.0681
	Class 1: 0.46347 ± 0.0851
	Class 2: 0.64305 ± 0.14884
	Class 3: 0.05796 ± 0.02214
Mean f1_score: 0.45506 ± 0.05797
	Class 1: 0.4148 ± 0.06289
	Class 2: 0.61984 ± 0.08681
	Class 3: 0.0687 ± 0.02134
Tra

## ScikitLearn ANN

We will use the `MLPClassifier` from ScikitLearn to train the ANN models, with the same hyperparameters as in the previous ANN implementation.

We trained 8 models: 3 with one hidden layer and 5 with two hidden layers. The topologies used for the hidden layers are:

- **One hidden layer**:
  - 16 neurons
  - 32 neurons
  - 64 neurons
- **Two hidden layers**:
  - (16, 16) neurons
  - (32, 16) neurons
  - (32, 32) neurons
  - (64, 32) neurons
  - (64, 64) neurons

In [17]:
# Set the random seed for reproducibility
Random.seed!(42)

topologies = [[16], [32], [64], [16, 16], [32, 16], [32, 32], [64, 32], [64, 64]]

general_results_scikit_ann = []
class_results_scikit_ann = []

for topology in topologies
  hyperparameters = Dict(
    :hidden_layer_sizes => topology,
    :learning_rate_init => 0.01,
    :max_iter => 100,
    :early_stopping => true,
    :tol => 0,
    :validation_fraction => 0.15,
    :n_iter_no_change => 10,
    :epsilon => 0.0001,
    :repetitionsTraining => 10
  )

  println("Training ANN with topology: ", topology)

  gr, cr = modelCrossValidation(
    :scikit_ANN,
    hyperparameters,
    inputs,
    targets,
    fold_indices;
    metricsToSave=metrics_to_save,
    normalizationType=:zeroMean,
    applyPCA=true,
    pcaComponents=0.95,
    applySmote=true,
    smotePercentages=smote_percentage,
    smoteNeighbors=k,
    verbose=false
  )

  push!(general_results_scikit_ann, gr)
  push!(class_results_scikit_ann, cr)
end

Training ANN with topology: [16]
Mean accuracy: 0.75216 ± 0.02319
	Class 1: 0.78176 ± 0.03157
	Class 2: 0.70719 ± 0.0751
	Class 3: 0.73961 ± 0.0565
Mean precision: 0.72335 ± 0.01168
	Class 1: 0.76048 ± 0.08484
	Class 2: 0.49098 ± 0.1725
	Class 3: 0.68027 ± 0.20461
Mean recall: 0.61428 ± 0.03788
	Class 1: 0.60217 ± 0.05112
	Class 2: 0.67842 ± 0.06936
	Class 3: 0.60225 ± 0.06697
Mean f1_score: 0.64065 ± 0.03297
	Class 1: 0.65984 ± 0.03476
	Class 2: 0.53153 ± 0.09351
	Class 3: 0.60854 ± 0.09976
Training ANN with topology: [32]
Mean accuracy: 0.76077 ± 0.01534
	Class 1: 0.78775 ± 0.03401
	Class 2: 0.71821 ± 0.06984
	Class 3: 0.74596 ± 0.03613
Mean precision: 0.7264 ± 0.01239
	Class 1: 0.75992 ± 0.07409
	Class 2: 0.50448 ± 0.18979
	Class 3: 0.67788 ± 0.18411
Mean recall: 0.62596 ± 0.0275
	Class 1: 0.59994 ± 0.05341
	Class 2: 0.66048 ± 0.06378
	Class 3: 0.63215 ± 0.04713
Mean f1_score: 0.65152 ± 0.02196
	Class 1: 0.65748 ± 0.02786
	Class 2: 0.53508 ± 0.10111
	Class 3: 0.62837 ± 0.08874
Training ANN with topology: [64]
Mean accuracy: 0.76486 ± 0.00688
	Class 1: 0.77469 ± 0.02775
	Class 2: 0.73657 ± 0.0451
	Class 3: 0.75897 ± 0.03552
Mean precision: 0.71618 ± 0.01143
	Class 1: 0.74495 ± 0.07911
	Class 2: 0.51116 ± 0.16486
	Class 3: 0.66479 ± 0.18691
Mean recall: 0.63511 ± 0.01307
	Class 1: 0.62757 ± 0.05998
	Class 2: 0.63835 ± 0.02899
	Class 3: 0.63038 ± 0.01824
Mean f1_score: 0.65724 ± 0.00989
	Class 1: 0.668 ± 0.04966
	Class 2: 0.53911 ± 0.0902
	Class 3: 0.62696 ± 0.10236
Training ANN with topology: [16, 16]
Mean accuracy: 0.75418 ± 0.02539
	Class 1: 0.76089 ± 0.0557
	Class 2: 0.73787 ± 0.03385
	Class 3: 0.73794 ± 0.05118
Mean precision: 0.71801 ± 0.00895
	Class 1: 0.70362 ± 0.14261
	Class 2: 0.54732 ± 0.12888
	Class 3: 0.66829 ± 0.20858
Mean recall: 0.61835 ± 0.04307
	Class 1: 0.65319 ± 0.04056
	Class 2: 0.61436 ± 0.0214
	Class 3: 0.60767 ± 0.07612
Mean f1_score: 0.64267 ± 0.03751
	Class 1: 0.65265 ± 0.09094
	Class 2: 0.54361 ± 0.06292
	Class 3: 0.60484 ± 0.10349
Training ANN with topology: [32, 16]
Mean accuracy: 0.758



0.76177 ± 0.01935
	Class 1: 0.79173 ± 0.03008
	Class 2: 0.71638 ± 0.0695
	Class 3: 0.75006 ± 0.04349
Mean precision: 0.72551 ± 0.0108
	Class 1: 0.78191 ± 0.05543
	Class 2: 0.48599 ± 0.17368
	Class 3: 0.67607 ± 0.19202
Mean recall: 0.62909 ± 0.03279
	Class 1: 0.62367 ± 0.04922
	Class 2: 0.6525 ± 0.05636
	Class 3: 0.62348 ± 0.05625
Mean f1_score: 0.65379 ± 0.02656
	Class 1: 0.68393 ± 0.03194
	Class 2: 0.52257 ± 0.09139
	Class 3: 0.62376 ± 0.09856
Training ANN with topology: [64, 32]
Mean accuracy: 0.75965 ± 0.01033
	Class 1: 0.78418 ± 0.04086
	Class 2: 0.73084 ± 0.06108
	Class 3: 0.74034 ± 0.03589
Mean precision: 0.71859 ± 0.01083
	Class 1: 0.74021 ± 0.09591
	Class 2: 0.54293 ± 0.16919
	Class 3: 0.63612 ± 0.20661
Mean recall: 0.62768 ± 0.0193
	Class 1: 0.63518 ± 0.03328
	Class 2: 0.64665 ± 0.04331
	Class 3: 0.61514 ± 0.03226
Mean f1_score: 0.65193 ± 0.01485
	Class 1: 0.67246 ± 0.05289
	Class 2: 0.55779 ± 0.09673
	Class 3: 0.59632 ± 0.10356
Training ANN with topology: [64, 64]
Mean accuracy: 0.75824 ± 0.02448
	Class 1: 0.77921 ± 0.02841
	Class 2: 0.73094 ± 0.06898
	Class 3: 0.73929 ± 0.05156
Mean precision: 0.71985 ± 0.01846
	Class 1: 0.71991 ± 0.08259
	Class 2: 0.54764 ± 0.16469
	Class 3: 0.66039 ± 0.20999
Mean recall: 0.62472 ± 0.03985
	Class 1: 0.62754 ± 0.04751
	Class 2: 0.64454 ± 0.04674
	Class 3: 0.61555 ± 0.05833
Mean f1_score: 0.64889 ± 0.03398
	Class 1: 0.65369 ± 0.03327
	Class 2: 0.55636 ± 0.09558
	Class 3: 0.60764 ± 0.10709


### Decision Tree

The Decision Tree model will be trained with the following hyperparameters:

- Maximum depth of the tree $\in \{3, 5, 10, 15, 20, \text{nothing}\}$


In [18]:
max_depths = [3, 5, 10, 15, 20, nothing]

general_results_dt = []
class_results_dt = []

for max_depth in max_depths
  hyperparameters = Dict(
    :max_depth => max_depth,
    :criterion => "gini",
    :min_samples_split => 2,
  )

  println("Training DT model with max_depth: ", max_depth)

  gr, cr = modelCrossValidation(
    :DT,
    hyperparameters,
    inputs,
    targets,
    fold_indices;
    metricsToSave=metrics_to_save,
    normalizationType=:zeroMean,
    applyPCA=true,
    pcaComponents=0.95,
    applySmote=true,
    smotePercentages=smote_percentage,
    smoteNeighbors=k,
    verbose=false
  )

  push!(general_results_dt, gr)
  push!(class_results_dt, cr)
end

Training DT model with max_depth: 3
Mean accuracy: 0.68289 ± 0.0509
	Class 1: 0.71131 ± 0.0564
	Class 2: 0.64725 ± 0.15231
	Class 3: 0.67724 ± 0.08727
Mean precision: 0.63093 ± 0.02223
	Class 1: 0.6256 ± 0.2113
	Class 2: 0.4554 ± 0.3221
	Class 3: 0.61786 ± 0.18871
Mean recall: 0.51789 ± 0.09049
	Class 1: 0.59017 ± 0.21087
	Class 2: 0.51024 ± 0.12778
	Class 3: 0.408 ± 0.13039
Mean f1_score: 0.5344 ± 0.07349
	Class 1: 0.56803 ± 0.15457
	Class 2: 0.41391 ± 0.14491
	Class 3: 0.4846 ± 0.14356
Training DT model with max_depth: 5
Mean accuracy: 0.69532 ± 0.03691
	Class 1: 0.7145 ± 0.06762
	Class 2: 0.66554 ± 0.09342
	Class 3: 0.68468 ± 0.09599
Mean precision: 0.65031 ± 0.03044
	Class 1: 0.55628 ± 0.26307
	Class 2: 0.61799 ± 0.22934
	Class 3: 0.56997 ± 0.29323
Mean recall: 0.53236 ± 0.05737
	Class 1: 0.5526 ± 0.08853
	Class 2: 0.58244 ± 0.11997
	Class 3: 0.45602 ± 0.10096
Mean f1_score: 0.55564 ± 0.05432
	Class 1: 0.52045 ± 0.14676
	Class 2: 0.565 ± 0.1555
	Class 3: 0.45218 ± 0.11016
Training 

### Support Vector Machine

The SVM model will be trained with all the possible combinations of the following hyperparameters:

- Kernel $\in \{\text{linear}, \text{poly}, \text{rbf}, \text{sigmoid}\}$
- C $\in \{0.1, 1, 10\}$
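
The twelve combinations can also be generated from the two value lists instead of being typed out by hand. The sketch below is one way to do it; the variable names `kernels` and `C_grid` are chosen for this illustration.

```julia
kernels = ["linear", "poly", "rbf", "sigmoid"]
C_grid = [0.1, 1.0, 10.0]

# The outer loop runs over the kernels and the inner loop over C,
# so all C values of one kernel stay together
kernel_C = [(kern, C) for kern in kernels for C in C_grid]

println(length(kernel_C), " combinations, first: ", first(kernel_C))
```

Generating the grid keeps the list in sync with the value sets if either of them changes.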


In [19]:
kernel_C = [
  ("linear", 0.1),
  ("linear", 1.0),
  ("linear", 10.0),
  ("poly", 0.1),
  ("poly", 1.0),
  ("poly", 10.0),
  ("rbf", 0.1),
  ("rbf", 1.0),
  ("rbf", 10.0),
  ("sigmoid", 0.1),
  ("sigmoid", 1.0),
  ("sigmoid", 10.0)
]

general_results_svm = []
class_results_svm = []

for (kernel, C) in kernel_C
  hyperparameters = Dict(
    :kernel => kernel,
    :C => C,
    :gamma => "auto",
    :probability => true,
  )

  println("Training SVM model with kernel: ", kernel, " and C: ", C)

  gr, cr = modelCrossValidation(
    :SVC,
    hyperparameters,
    inputs,
    targets,
    fold_indices;
    metricsToSave=metrics_to_save,
    normalizationType=:zeroMean,
    applyPCA=true,
    pcaComponents=0.95,
    applySmote=true,
    smotePercentages=smote_percentage,
    smoteNeighbors=k,
    verbose=false
  )

  push!(general_results_svm, gr)
  push!(class_results_svm, cr)
end

Training SVM model with kernel: linear and C: 0.1
Mean accuracy: 0.77448 ± 0.01921
	Class 1: 0.78683 ± 0.02724
	Class 2: 0.74802 ± 0.06558
	Class 3: 0.7708 ± 0.04208
Mean precision: 0.71598 ± 0.01267
	Class 1: 0.7093 ± 0.1815
	Class 2: 0.59796 ± 0.25739
	Class 3: 0.62383 ± 0.23252
Mean recall: 0.65282 ± 0.03295
	Class 1: 0.66226 ± 0.05655
	Class 2: 0.6722 ± 0.07187
	Class 3: 0.58959 ± 0.03386
Mean f1_score: 0.67091 ± 0.02527
	Class 1: 0.67382 ± 0.11197
	Class 2: 0.61198 ± 0.17808
	Class 3: 0.58759 ± 0.12233
Training SVM model with kernel: linear and C: 1.0
Mean accuracy: 0.77743 ± 0.02491
	Class 1: 0.79542 ± 0.0258
	Class 2: 0.74757 ± 0.07241
	Class 3: 0.77126 ± 0.03852
Mean precision: 0.71798 ± 0.01118
	Class 1: 0.69394 ± 0.17251
	Class 2: 0.51551 ± 0.24773
	Class 3: 0.72964 ± 0.1881
Mean recall: 0.65712 ± 0.04162
	Class 1: 0.64698 ± 0.04412
	Class 2: 0.64493 ± 0.08425
	Class 3: 0.63965 ± 0.08549
Mean f1_score: 0.6745 ± 0.03272
	Class 1: 0.65991 ± 0.10319
	Class 2: 0.55455 ± 0.17735
	

### K-Nearest Neighbors

The KNN model will be trained with the following hyperparameters:

- $k \in \{1, 3, 5, 7, 9, 11, 13, 15\}$


In [20]:
n_neighbors = [1, 3, 5, 7, 9, 11, 13, 15]

general_results_knn = []
class_results_knn = []

open("warnings.log", "w") do file
  redirect_stderr(file) do # redirect warnings associated with joblib
    for n in n_neighbors
      hyperparameters = Dict(
        :n_neighbors => n,
        :weights => "uniform",
        :metric => "euclidean",
      )

      println("Training KNN model with n_neighbors: ", n)

      gr, cr = modelCrossValidation(
        :KNN,
        hyperparameters,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applyPCA=true,
        pcaComponents=0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )

      push!(general_results_knn, gr)
      push!(class_results_knn, cr)
    end
  end
end

Training KNN model with n_neighbors: 1
Mean accuracy: 0.73751 ± 0.01565
	Class 1: 0.7545 ± 0.04164
	Class 2: 0.71773 ± 0.05964
	Class 3: 0.73033 ± 0.05876
Mean precision: 0.66871 ± 0.01447
	Class 1: 0.67884 ± 0.18015
	Class 2: 0.56378 ± 0.26448
	Class 3: 0.55609 ± 0.23756
Mean recall: 0.60128 ± 0.02615
	Class 1: 0.59629 ± 0.06084
	Class 2: 0.59107 ± 0.08374
	Class 3: 0.5657 ± 0.02985
Mean f1_score: 0.62198 ± 0.02152
	Class 1: 0.62215 ± 0.10402
	Class 2: 0.55846 ± 0.18406
	Class 3: 0.54109 ± 0.14531
Training KNN model with n_neighbors: 3
Mean accuracy: 0.73626 ± 0.01877
	Class 1: 0.75473 ± 0.03991
	Class 2: 0.72835 ± 0.07282
	Class 3: 0.70592 ± 0.04884
Mean precision: 0.67211 ± 0.01515
	Class 1: 0.6805 ± 0.20443
	Class 2: 0.64106 ± 0.22487
	Class 3: 0.47295 ± 0.24338
Mean recall: 0.5945 ± 0.03068
	Class 1: 0.59318 ± 0.06569
	Class 2: 0.59466 ± 0.06098
	Class 3: 0.55313 ± 0.03988
Mean f1_score: 0.61785 ± 0.02564
	Class 1: 0.61957 ± 0.1261
	Class 2: 0.60096 ± 0.15506
	Class 3: 0.48472 ± 0

### Save the results

To compare the results of the models without running the training again, we will save the results in a dictionary with the following structure:

```julia
{
    :model: {
        'num_trained_models': int,
        'parameters': Dict{String, Any},
        'general_results': [
            {
                'accuracy': AbstractVector{Float64},
                'precision': AbstractVector{Float64},
                'recall': AbstractVector{Float64},
                'f1_score': AbstractVector{Float64},
            },
            ... # One element for each trained model
        ],
        'class_results': [
            [
                {
                    'accuracy': AbstractVector{Float64},
                    'precision': AbstractVector{Float64},
                    'recall': AbstractVector{Float64},
                    'f1_score': AbstractVector{Float64},
                },
                ... # One element for each class
            ],
            ... # One element for each trained model
        ]
    }
}
```

The results of all approaches will be available in the `results` folder. The file with the results of the individual models of this fourth approach is `4_individual_results.jl`.
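
As a minimal sketch of how a file with this structure could be read back and used, the code below serializes a small made-up object, loads it again, and picks the configuration with the highest mean F1 score. The metric keys are assumed to be symbols such as `:f1_score` (in line with `metrics_to_save`), and all values are invented for illustration.

```julia
using Serialization, Statistics

# Toy object following the structure described above (values are made up)
toy = Dict(
    :KNN => Dict(
        "num_trained_models" => 2,
        "parameters" => Dict("n_neighbors" => [1, 3]),
        "general_results" => [
            Dict(:f1_score => [0.60, 0.64]),
            Dict(:f1_score => [0.70, 0.66]),
        ],
    ),
)

# Round-trip through a temporary file, in the same way the results files are written and read
path = tempname()
open(path, "w") do file
    serialize(file, toy)
end
loaded = open(path, "r") do file
    deserialize(file)
end

# Mean F1 score per configuration, then the index of the best one
f1_means = [mean(r[:f1_score]) for r in loaded[:KNN]["general_results"]]
best = argmax(f1_means)
println("Best n_neighbors: ", loaded[:KNN]["parameters"]["n_neighbors"][best])
```

Since the parameter lists and the result lists share the same order, the index of the best result also selects the matching hyperparameter value.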


In [25]:
results_folder = "results/"
if !isdir(results_folder)
  mkdir(results_folder)
end

filename = results_folder * "4_individual_results.jl"

# Separate the kernel and C values of the hyperparameter list for SVM
kernels = [item[1] for item in kernel_C]
C_values = [item[2] for item in kernel_C]

# Create a dictionary with the results of ANN, DT, SVM, and KNN
obj = Dict(
  :ANN => Dict(
    "num_trained_models" => length(general_results_ann),
    "parameters" => Dict(
      "topology" => topologies
    ),
    "general_results" => general_results_ann,
    "class_results" => class_results_ann
  ),
  :scikit_ANN => Dict(
    "num_trained_models" => length(general_results_scikit_ann),
    "parameters" => Dict(
      "hidden_layer_sizes" => topologies
    ),
    "general_results" => general_results_scikit_ann,
    "class_results" => class_results_scikit_ann
  ),
  :DT => Dict(
    "num_trained_models" => length(general_results_dt),
    "parameters" => Dict(
      "max_depth" => max_depths
    ),
    "general_results" => general_results_dt,
    "class_results" => class_results_dt
  ),
  :SVM => Dict(
    "num_trained_models" => length(general_results_svm),
    "parameters" => Dict(
      "kernel" => kernels,
      "C" => C_values
    ),
    "general_results" => general_results_svm,
    "class_results" => class_results_svm
  ),
  :KNN => Dict(
    "num_trained_models" => length(general_results_knn),
    "parameters" => Dict(
      "n_neighbors" => n_neighbors
    ),
    "general_results" => general_results_knn,
    "class_results" => class_results_knn
  )
)

# Save the results
open(filename, "w") do file
  serialize(file, obj)
end

## Base model plots

In [None]:
results_folder = "results/"
filename = results_folder * "4_individual_results.jl"

# Load the results
loaded_obj = open(filename, "r") do file
  deserialize(file)
end

model_names, metrics, metric_means, metric_stds, metric_maxes = aggregateMetrics(loaded_obj)

# Plot metrics for each algorithm
plotMetricsAlgorithm(loaded_obj, output_dir="./plots/Approach4", ylim=(0.6, 0.9))

In [None]:
generateAlgorithmTables(loaded_obj, sort_by=:F1_Score, rev=true, output_dir="./tables/Approach4/")


Comparison of Hyperparameter Configurations for DT (Sorted by F1_Score):
┌────────────────────┬──────────┬───────────┬──────────┬──────────┐
│      Configuration │ Accuracy │ Precision │   Recall │ F1-Score │
├────────────────────┼──────────┼───────────┼──────────┼──────────┤
│      max_depth: 10 │ 0.738521 │  0.644719 │  0.60904 │    0.622 │
│ max_depth: nothing │ 0.721066 │   0.61858 │ 0.586441 │ 0.598321 │
│       max_depth: 3 │ 0.732337 │  0.653709 │ 0.613559 │ 0.595028 │
│       max_depth: 5 │ 0.720757 │  0.686536 │ 0.577401 │ 0.589566 │
│      max_depth: 15 │  0.71221 │  0.626006 │ 0.566102 │ 0.586023 │
│      max_depth: 20 │ 0.711139 │  0.625714 │ 0.563986 │  0.58397 │
└────────────────────┴──────────┴───────────┴──────────┴──────────┘
Results for DT saved to ./tables/Approach4/.

Comparison of Hyperparameter Configurations for KNN (Sorted by F1_Score):
┌─────────────────┬──────────┬───────────┬──────────┬──────────┐
│   Configuration

In [None]:
plotCombinedMetrics(model_names, metrics, metric_means, metric_stds, output_dir="./plots/Approach4/", show=true, ylim=(0.6, 0.9))

In [None]:
generateComparisonTable(model_names, metrics, metric_maxes; sort_by=:accuracy, rev=true)

## Ensemble models

After training the individual models, we will build ensembles from the three best models. The methods used to combine them will be:

- **Majority voting**
- **Weighted voting**
- **Naive Bayes**
- **Stacking** (using a logistic regression as the meta-model)
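
To make the difference between the first two methods concrete, here is a minimal sketch in plain Julia. It is not the implementation used in this notebook: `majority_vote` and `weighted_vote` are hypothetical helpers, and the probability matrix is invented so that the two rules disagree. Naive Bayes combination and stacking are not covered by this sketch.

```julia
# Hypothetical class probabilities from three base models for one sample
# (rows: models, columns: classes 1..3). Values are illustrative only.
probs = [0.90 0.05 0.05;   # model 1, very confident in class 1
         0.40 0.45 0.15;   # model 2, slightly prefers class 2
         0.35 0.50 0.15]   # model 3, prefers class 2

# Majority (hard) voting: each model votes for its most probable class.
function majority_vote(probs::AbstractMatrix)
    votes = [argmax(row) for row in eachrow(probs)]
    counts = [count(==(c), votes) for c in 1:size(probs, 2)]
    return argmax(counts)   # ties resolve to the lowest class index
end

# Weighted (soft) voting: weighted average of the class probabilities.
function weighted_vote(probs::AbstractMatrix, weights::AbstractVector)
    combined = (probs' * weights) ./ sum(weights)
    return argmax(combined)
end

hard = majority_vote(probs)
soft = weighted_vote(probs, [0.5, 0.2, 0.3])
println("Majority voting picks class ", hard, ", weighted voting picks class ", soft)
```

In this toy case two of the three models vote for class 2, so majority voting selects it, while the weighted average is dominated by the confident, heavily weighted first model and selects class 1.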


In [None]:
# Select the best models
estimators = [:ANN, :SVC, :KNN]
hyperparameters = Vector{Dict}([
  Dict(
    :hidden_layer_sizes => (32,),  # one hidden layer with 32 neurons; the trailing comma makes this a tuple
    :learning_rate_init => 0.01,
    :max_iter => 100,
    :early_stopping => true,
    :tol => 0,
    :validation_fraction => 0.15,
    :n_iter_no_change => 10,
    :epsilon => 0.0001
  ),
  Dict(
    :kernel => "rbf",
    :C => 1.0,
    :gamma => "auto",
    :probability => true,
  ),
  Dict(
    :n_neighbors => 11,
    :weights => "uniform",
    :metric => "euclidean",
  )])

# Define ensembles
ensembles = [
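  Dict(
    :type => :Voting,
    # No hyperparameters: relies on the default voting mode
    # (hard, i.e. majority, voting in scikit-learn's VotingClassifier)
    :hyperparameters => Dict(
    )
  ),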
  Dict(
    :type => :Voting,
    :hyperparameters => Dict(
      :voting => "soft",
      :weights => [0.5, 0.2, 0.3]
    )
  ),
  Dict(
    :type => :Stacking,
    :hyperparameters => Dict(
      :final_estimator => LogisticRegression()
    )
  )
]

In [None]:
for (index, ensemble) in enumerate(ensembles)
    println("Training ensemble ", ensemble[:type])
    metrics, class_results = trainClassEnsemble(
        estimators,
        hyperparameters,
        (inputs, targets_label_encoded),
        fold_indices;
        ensembleType = ensemble[:type],
        ensembleHyperParameters = ensemble[:hyperparameters],
        metricsToSave = metrics_to_save,
        repetitionsTraining = 5,
        applyPCA = true,
        pcaComponents = 0.95,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
    )
    ensemble[:results] = metrics
    ensemble[:class_results] = class_results
    println("------------------------------------")
end
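
The call above passes `pcaComponents = 0.95`. Assuming this value is interpreted as a target fraction of explained variance (as a float `n_components` is in scikit-learn's PCA), the number of retained components could be determined as in the following sketch. `components_for_variance` is a hypothetical helper and the data is a toy example; it does not reproduce the internals of `trainClassEnsemble`.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: smallest number of principal components whose
# cumulative explained variance ratio reaches `threshold`.
function components_for_variance(X::AbstractMatrix, threshold::Real)
    Xc = X .- mean(X, dims=1)            # center each feature (column)
    s = svdvals(Xc)                      # singular values, sorted in descending order
    ratios = s .^ 2 ./ sum(s .^ 2)       # explained variance ratio per component
    return findfirst(>=(threshold), cumsum(ratios))
end

# Toy data: the first feature carries almost all of the variance.
X = [ 1.0  0.0;
     -1.0  0.0;
      0.0  0.1;
      0.0 -0.1]

n95 = components_for_variance(X, 0.95)
n999 = components_for_variance(X, 0.999)
println("Components for 95% variance: ", n95, ", for 99.9% variance: ", n999)
```

Here the first component explains about 99% of the variance, so a 0.95 threshold keeps a single component, while a stricter 0.999 threshold keeps both.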

In [None]:
results_folder = "results/"
if !isdir(results_folder)
  mkdir(results_folder)
end

filename = results_folder * "4_ensemble_results.jl"

# Create a dictionary with the results of the hard voting, soft voting, and stacking ensembles
obj = Dict(
  :Voting_Hard => Dict(
    "general_results" => ensembles[1][:results],
    "class_results" => ensembles[1][:class_results]
  ),
  :Voting_Soft => Dict(
    "general_results" => ensembles[2][:results],
    "class_results" => ensembles[2][:class_results]
  ),
  :Stacking => Dict(
    "general_results" => ensembles[3][:results],
    "class_results" => ensembles[3][:class_results]
  )
)

# Save the results
open(filename, "w") do file
  serialize(file, obj)
end

In [None]:
filename = results_folder * "4_ensemble_results.jl"

# Load the results
loaded_obj = open(filename, "r") do file
  deserialize(file)
end

model_names, metrics, metric_means, metric_stds, metric_means_class, metric_stds_class, metric_maxes, metric_maxes_class = aggregateMetrics(loaded_obj, ensemble=true)

In [None]:
plotCombinedMetrics(model_names, metrics, metric_maxes_class, metric_stds, output_dir="./plots/Approach4/ensembles", show=true)

In [None]:
generateComparisonTable(model_names, metrics, metric_maxes_class; sort_by=:f1_score, rev=true)