## Third Approach: Re-sampling with SMOTE

In this approach, we use the **SMOTE (Synthetic Minority Over-sampling Technique)** algorithm to balance the dataset. SMOTE generates synthetic samples for a minority class by interpolating between existing samples of that class and their nearest neighbors. First, we oversample the minority class **Enrolled** and train a model on the balanced dataset. We then run further experiments, such as oversampling the other minority class, **Dropout**, and undersampling the majority class, **Graduate**.

The SMOTE algorithm is described in the following paper: [SMOTE: Synthetic Minority Over-sampling Technique](https://www.jair.org/index.php/jair/article/view/10302/24590). The authors present the algorithm and show that oversampling the minority class with synthetic samples outperforms traditional oversampling, which simply duplicates minority class samples.

Below is the pseudo-code for a two-class problem:

```text
Algorithm SMOTE(T, N, k)
Input:
    T = Number of minority class samples
    N = Amount of oversampling in percent (SMOTE percentage)
    k = Number of nearest neighbors

Output:
    (N / 100) * T synthetic minority class samples

(* If N is less than 100%, randomize the minority class samples, as only a random percentage of them will be SMOTEd. *)
If N < 100 then
    Randomize the T minority class samples
    T = (N / 100) * T
    N = 100
End if
N = (int)(N / 100)  (* The amount of SMOTE is assumed to be in integral multiples of 100. *)
k = Number of nearest neighbors
numattrs = Number of attributes
Sample[][]: Array for original minority class samples
newindex: Counter for number of synthetic samples, initialized to 0
Synthetic[][]: Array for synthetic samples

For i = 1 to T
    Compute k nearest neighbors for sample i, and save the indices in nnarray
    Populate(N, i, nnarray)
End for

Function Populate(N, i, nnarray):
While N > 0
    Choose a random number between 1 and k, call it nn. This step selects one of the k nearest neighbors of sample i.
    For each attribute (attr) from 1 to numattrs:
        Compute the difference: dif = Sample[nnarray[nn]][attr] - Sample[i][attr]
        Compute a random gap: gap = random number between 0 and 1
        Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
    End for
    Increment newindex
    Decrement N
End while

Return synthetic samples
End of pseudo-code.
```
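The pseudo-code can be sketched in Julia roughly as follows. This is a minimal illustration, not the `smote` function used later in this notebook (that one is loaded from the project files and may behave differently). The name `smote_sketch`, the brute-force neighbor search, and the restriction to percentages that are multiples of 100 are all assumptions made here for brevity.

```julia
using LinearAlgebra  # norm
using Random

"""
Generate synthetic samples for one minority class (hypothetical sketch).
`samples` holds one minority sample per row; `n_percent` must be a multiple
of 100, so the N < 100 branch of the pseudo-code is omitted.
"""
function smote_sketch(samples::AbstractMatrix{<:Real}, n_percent::Integer, k::Integer;
                      rng::AbstractRNG=Random.default_rng())
    t, numattrs = size(samples)
    @assert n_percent >= 100 && n_percent % 100 == 0 "sketch handles multiples of 100 only"
    @assert k < t "need more samples than neighbors"
    per_sample = n_percent ÷ 100                  # synthetic samples per original sample
    synthetic = Matrix{Float64}(undef, per_sample * t, numattrs)
    newindex = 0
    for i in 1:t
        # Brute-force Euclidean distances from sample i to every minority sample
        dists = [norm(samples[j, :] .- samples[i, :]) for j in 1:t]
        dists[i] = Inf                            # exclude the sample itself
        nnarray = partialsortperm(dists, 1:k)     # indices of the k nearest neighbors
        for _ in 1:per_sample
            nn = rand(rng, nnarray)               # pick one of the k neighbors at random
            for attr in 1:numattrs
                dif = samples[nn, attr] - samples[i, attr]
                gap = rand(rng)                   # random number in [0, 1)
                synthetic[newindex + 1, attr] = samples[i, attr] + gap * dif
            end
            newindex += 1
        end
    end
    return synthetic
end
```

Calling `smote_sketch(X, 200, 5)` on a matrix `X` with `T` rows returns `2T` synthetic rows, each lying between a sample and one of its neighbors, attribute by attribute.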


### Description of the used models

To be able to compare the results with the first and second approaches, we will use the same configuration of hyperparameters for the models:

- **ANN**:
  - Hidden layers: 1, number of neurons in the hidden layer: $[16, 32, 64]$.
  - Hidden layers: 2, number of neurons in the hidden layers $[(16, 16), (32, 16), (32, 32), (64, 32), (64, 64)]$.
- **Decision Tree**:
  - Maximum depth of the tree $\in \{3, 5, 10, 15, 20, \text{None}\}$
- **SVM**:
  - Kernel $\in \{\text{linear}, \text{poly}, \text{rbf}, \text{sigmoid}\}$
  - C $\in \{0.1, 1, 10\}$
- **KNN\***:
  - $k \in \{1, 3, 5, 7, 9, 11, 13, 15\}$

After training the models, we will build an ensemble from the three best models. The methods used to combine them are:

- **Majority voting**
- **Weighted voting**
- **Naive Bayes**
- **Stacking** (using a logistic regression as the meta-model)
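To make the first two combination methods concrete, here is a small sketch of majority and weighted voting. The predictions, the weights, and the helper names `weighted_vote` and `majority_vote` are hypothetical and only serve as illustration; the notebook's own ensemble code may be organized differently.

```julia
# Hypothetical predictions: one row per model, one column per sample
predictions = ["Graduate" "Enrolled" "Dropout";
               "Graduate" "Dropout"  "Dropout";
               "Dropout"  "Dropout"  "Enrolled"]
weights = [0.6, 0.2, 0.2]   # hypothetical weights, e.g. derived from validation scores

"""Return the label with the largest total weight among the model votes."""
function weighted_vote(votes::AbstractVector{<:AbstractString}, w::AbstractVector{<:Real})
    totals = Dict{String,Float64}()
    for (label, wi) in zip(votes, w)
        totals[label] = get(totals, label, 0.0) + wi
    end
    best_label, best_total = "", -Inf
    for label in sort(collect(keys(totals)))   # sorted keys make tie-breaking deterministic
        if totals[label] > best_total
            best_label, best_total = label, totals[label]
        end
    end
    return best_label
end

# Majority voting is weighted voting with equal weights
majority_vote(votes) = weighted_vote(votes, ones(length(votes)))

majority = [majority_vote(predictions[:, j]) for j in 1:size(predictions, 2)]
weighted = [weighted_vote(predictions[:, j], weights) for j in 1:size(predictions, 2)]
```

For the second sample the two schemes disagree: two of the three models vote `Dropout`, so majority voting returns `Dropout`, while the first model's weight of 0.6 outweighs the other two combined and weighted voting returns `Enrolled`.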


- Imbalanced-learn (review): https://www.sciencedirect.com/science/article/pii/S0957417416307175?casa_token=lyglFt_Ye0YAAAAA:Apv_dixqX-GQm04rHLrN6wBhIRJHhxCFlqUS5WXXbuD-iJCO9FUBZ9VLAxgRDwUTKdpPTGgHIA
  - It covers SMOTE (an oversampling technique), hybrid re-sampling techniques, and undersampling.
  - It also notes that applying PCA and other dimensionality-reduction techniques can reduce the negative effect of class imbalance.
  - It also mentions that ensemble techniques are used in these situations, but I suspect they will not be applicable to our case (they use AdaBoost and the like).
  - Finally, it mentions that error-weighting techniques can be used. That is fine if we use a neural network, since only the cost function has to change, but implementing it in an SVM or a Decision Tree is another matter. It is possible, and the ways to do it are explained, but it would mean changing the code drastically.
- Learning from imbalanced data (I have not read it, but I think it also describes resampling techniques): https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5128907


**Index**

- [Data loading](#Data-loading)
- [SMOTE experiments](#SMOTE-experiments)
- [Individual models](#Individual-models)
  - [ANN](#ANN)
  - [Decision Tree](#Decision-Tree)
  - [Support Vector Machine](#Support-Vector-Machine)
  - [K-Nearest Neighbors](#K-Nearest-Neighbors)
- [Ensemble model](#Ensemble-model)
  - [Majority voting](#Majority-voting)
  - [Weighted voting](#Weighted-voting)
  - [Naive Bayes](#Naive-Bayes)
  - [Stacking](#Stacking)


## Data loading


In [4]:
using DataFrames
using CSV
using Random
using Serialization

In [75]:
# Load custom functions from provided files
include("preprocessing.jl")
include("metrics.jl")
include("training.jl")
include("plotting.jl")



generateComparisonTablePerClass (generic function with 1 method)

In [76]:
# Set the random seed for reproducibility
Random.seed!(42)

# Load the dataset
dataset_path = "dataset.csv"
data = CSV.read(dataset_path, DataFrame);

# Separate features and target
target_column = :Target
inputs = select(data, Not(target_column))
targets = data[!, target_column];

In [77]:
inputs = Float32.(Matrix(inputs))

# Define the number of folds for cross-validation and obtain the indices
Random.seed!(42)
k = 5
N = size(inputs, 1)
fold_indices = crossValidation(targets, k)
metrics_to_save = [:accuracy, :precision, :recall, :f1_score];
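The `crossValidation` function comes from the included project files and is not shown here. Since it receives the targets, it presumably assigns folds in a stratified way. Under that assumption, a stratified fold assignment could look like the following sketch; the name `stratified_folds_sketch` and its exact behavior are hypothetical.

```julia
using Random

"""
Assign each sample to one of `k` folds so that every class is spread as
evenly as possible across the folds. Hypothetical helper, not the project's code.
"""
function stratified_folds_sketch(targets::AbstractVector, k::Integer;
                                 rng::AbstractRNG=Random.default_rng())
    folds = zeros(Int, length(targets))
    for class in unique(targets)
        idx = shuffle(rng, findall(==(class), targets))  # shuffled positions of this class
        for (pos, sample_idx) in enumerate(idx)
            folds[sample_idx] = mod1(pos, k)              # deal out folds 1, 2, ..., k, 1, ...
        end
    end
    return folds
end
```

Stratification matters here because the smallest class would otherwise be unevenly represented in some folds, which makes per-class metrics noisier.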

## SMOTE experiments

In the first approach, some metrics revealed problems caused by the class imbalance of the dataset. For example, the ANN model reaches a high mean accuracy but a low mean F1-score, because the precision and recall of the class `Enrolled` are very low.

<div style="display: flex; justify-content: center;">
<img src="plots/Approach1/ANN/accuracy_performance_bar.png" width="600"/>
<img src="plots/Approach1/ANN/f1_score_performance_bar.png" width="600"/>
</div>
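The gap between accuracy and F1-score for a minority class can be reproduced with a small worked example. The counts below are made up for illustration and are not taken from the experiments.

```julia
# Hypothetical one-vs-rest counts for a minority class on 1000 test samples
tp, fn = 20, 160     # minority samples predicted correctly / missed
fp, tn = 40, 780     # other samples wrongly flagged / correctly rejected

acc  = (tp + tn) / (tp + tn + fp + fn)   # dominated by the many true negatives
prec = tp / (tp + fp)
rec  = tp / (tp + fn)
f1   = 2 * prec * rec / (prec + rec)

println("accuracy  = ", round(acc, digits=3))
println("precision = ", round(prec, digits=3))
println("recall    = ", round(rec, digits=3))
println("f1_score  = ", round(f1, digits=3))
```

With these counts the accuracy is 0.8 while the F1-score is only about 0.167: the large number of true negatives keeps the accuracy high even though most minority samples are missed.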

To address this problem, we will use the SMOTE algorithm to balance the dataset. We will conduct 5 experiments:

- Oversampling the minority class `Enrolled` at 200%.
- Oversampling the minority class `Enrolled` at 300%.
- Oversampling the minority class `Dropout` at 200% and oversampling the minority class `Enrolled` at 200%.
- Oversampling the minority class `Dropout` at 200% and oversampling the minority class `Enrolled` at 300%.
- Oversampling the minority class `Enrolled` at 200% and undersampling the majority class `Graduate` at 50%.

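We fix the number of nearest neighbors at 5. In these experiments a percentage denotes the size of the resampled class relative to its original size: 200% doubles the class (794 `Enrolled` samples become 1588) and 50% halves it. This differs from the pseudo-code above, where N counts only the synthetic samples.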
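The class sizes that each configuration should produce can be worked out in advance. The sketch below uses the class counts of this dataset (1421 `Dropout`, 2209 `Graduate`, 794 `Enrolled`) and assumes that each percentage is the resulting class size relative to the original, rounded down. The helper `resampled_counts` is hypothetical and is not part of the project files.

```julia
# Class counts of the original dataset
original_counts = Dict("Dropout" => 1421, "Graduate" => 2209, "Enrolled" => 794)

"""Expected class sizes when a percentage is the resulting size relative to the original."""
function resampled_counts(counts::Dict{String,Int}, percentages::Dict{String,Int})
    result = copy(counts)
    for (class, pct) in percentages
        result[class] = (counts[class] * pct) ÷ 100   # integer division rounds down
    end
    return result
end

configs = [
    Dict("Enrolled" => 200),
    Dict("Enrolled" => 300),
    Dict("Enrolled" => 200, "Dropout" => 200),
    Dict("Enrolled" => 300, "Dropout" => 200),
    Dict("Enrolled" => 200, "Graduate" => 50),
]

for cfg in configs
    expected = resampled_counts(original_counts, cfg)
    println(cfg, " => total ", sum(values(expected)))
end
```

None of the five configurations yields perfectly equal classes. The last one comes closest, with 1421, 1104, and 1588 samples.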

To avoid retraining all the models multiple times, we will perform the experiments only with the base models: ANN, Decision Tree, SVM, and KNN, using the best hyperparameters identified in the first approach. Subsequently, we will train the models in the same manner as in the first approach, but with the balanced dataset that yielded the best results.


In [5]:
target_column = :Target
println("\nClass Distribution:")
println(combine(groupby(data, target_column), nrow => :Count))


Class Distribution:
3×2 DataFrame
 Row │ Target    Count
     │ String15  Int64
─────┼─────────────────
   1 │ Dropout    1421
   2 │ Graduate   2209
   3 │ Enrolled    794


In [79]:
smote_percentages = [
  Dict("Enrolled" => 200),
  Dict("Enrolled" => 300),
  Dict("Enrolled" => 200, "Dropout" => 200),
  Dict("Enrolled" => 300, "Dropout" => 200),
  Dict("Enrolled" => 200, "Graduate" => 50)
]
k = 5

open("warnings.log", "w") do file
  redirect_stderr(file) do # redirect warnings associated with joblib
    for (i, smote_percentage) in enumerate(smote_percentages)
      println("\nSmote percentages: ", smote_percentage)
      balanced_inputs, balanced_targets = smote(inputs, targets, smote_percentage, k)
      println("Number of instances: ", size(balanced_targets)[1])
      println("Elements of class Dropout: ", sum(balanced_targets .== "Dropout"))
      println("Elements of class Graduate: ", sum(balanced_targets .== "Graduate"))
      println("Elements of class Enrolled: ", sum(balanced_targets .== "Enrolled"))
    end
  end
end


Smote percentages: Dict("Enrolled" => 200)
Number of instances: 5218
Elements of class Dropout: 1421
Elements of class Graduate: 2209
Elements of class Enrolled: 1588

Smote percentages: Dict("Enrolled" => 300)
Number of instances: 6012
Elements of class Dropout: 1421
Elements of class Graduate: 2209
Elements of class Enrolled: 2382

Smote percentages: Dict("Enrolled" => 200, "Dropout" => 200)
Number of instances: 6639
Elements of class Dropout: 2842
Elements of class Graduate: 2209
Elements of class Enrolled: 1588

Smote percentages: Dict("Enrolled" => 300, "Dropout" => 200)
Number of instances: 7433
Elements of class Dropout: 2842
Elements of class Graduate: 2209
Elements of class Enrolled: 2382

Smote percentages: Dict("Enrolled" => 200, "Graduate" => 50)
Number of instances: 4113
Elements of class Dropout: 1421
Elements of class Graduate: 1104
Elements of class Enrolled: 1588


In [85]:
# Best configurations
topology = [64, 23]
max_depth = 5
n_neighbors = 5
kernel = "linear"
C = 10

# ANN
hyperparameters_ann = Dict(
  "topology" => topology,
  "learningRate" => 0.01,
  "maxEpochs" => 100,
  "repetitionsTraining" => 10,
  "validationRatio" => 0.15,
  "maxEpochsVal" => 10,
  "minLoss" => 0.0001
)

# DT
hyperparameters_dt = Dict(
  :max_depth => max_depth,
  :criterion => "gini",
  :min_samples_split => 2,
)

# SVM
hyperparameters_svm = Dict(
  :kernel => kernel,
  :C => C,
  :gamma => "auto",
  :probability => true,
)

# KNN
hyperparameters_knn = Dict(
  :n_neighbors => n_neighbors,
  :weights => "uniform",
  :metric => "euclidean",
)

# Define the hyperparameters for smote
k = 5
smote_percentages = [
  Dict("Enrolled" => 200),
  Dict("Enrolled" => 300),
  Dict("Enrolled" => 200, "Dropout" => 200),
  Dict("Enrolled" => 300, "Dropout" => 200),
  Dict("Enrolled" => 200, "Graduate" => 50)
];

In [86]:
general_results_ann = []
class_results_ann = []
general_results_dt = []
class_results_dt = []
general_results_svm = []
class_results_svm = []
general_results_knn = []
class_results_knn = []

open("warnings.log", "w") do file
  redirect_stderr(file) do # redirect warnings associated with joblib
    for smote_percentage in smote_percentages
      println("\nSmote percentage: ", smote_percentage)
      println("ANN")
      # ANN
      gr, cr = modelCrossValidation(
        :ANN,
        hyperparameters_ann,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_ann, gr)
      push!(class_results_ann, cr)

      println("DT")
      # DT
      gr, cr = modelCrossValidation(
        :DT,
        hyperparameters_dt,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_dt, gr)
      push!(class_results_dt, cr)

      println("SVM")
      # SVM
      gr, cr = modelCrossValidation(
        :SVC,
        hyperparameters_svm,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_svm, gr)
      push!(class_results_svm, cr)

      println("KNN")
      # KNN
      gr, cr = modelCrossValidation(
        :KNN,
        hyperparameters_knn,
        inputs,
        targets,
        fold_indices;
        metricsToSave=metrics_to_save,
        normalizationType=:zeroMean,
        applySmote=true,
        smotePercentages=smote_percentage,
        smoteNeighbors=k,
        verbose=false
      )
      push!(general_results_knn, gr)
      push!(class_results_knn, cr)
    end
  end
end


Smote percentage: Dict("Enrolled" => 200)
ANN
Mean accuracy: 0.62157 ± 0.18943
	Class 1: 0.55493 ± 0.26749
	Class 2: 0.60242 ± 0.20243
	Class 3: 0.79377 ± 0.02978
Mean precision: 0.3934 ± 0.28605
	Class 1: 0.45109 ± 0.27874
	Class 2: 0.42331 ± 0.32407
	Class 3: 0.20693 ± 0.20248
Mean recall: 0.47556 ± 0.2422
	Class 1: 0.71097 ± 0.06816
	Class 2: 0.45308 ± 0.42357
	Class 3: 0.1171 ± 0.08753
Mean f1_score: 0.39636 ± 0.2883
	Class 1: 0.51687 ± 0.2247
	Class 2: 0.41224 ± 0.3902
	Class 3: 0.13653 ± 0.12426
DT
Mean accuracy: 0.81371 ± 0.00671
	Class 1: 0.80967 ± 0.0291
	Class 2: 0.82912 ± 0.02316
	Class 3: 0.81149 ± 0.01161
Mean precision: 0.68317 ± 0.02682
	Class 1: 0.73601 ± 0.02983
	Class 2: 0.75679 ± 0.02088
	Class 3: 0.41052 ± 0.13996
Mean recall: 0.72513 ± 0.01032
	Class 1: 0.86051 ± 0.10921
	Class 2: 0.82361 ± 0.11098
	Class 3: 0.09443 ± 0.02465
Mean f1_score: 0.67921 ± 0.01092
	Class 1: 0.78948 ± 0.03855
	Class 2: 0.78476 ± 0.04019
	Class 3: 0.15234 ± 0.03839
SVM
Mean accuracy: 0.83