# Cross-validation

With the code developed so far, it is possible to train an ANN and provide an estimate of the results it would offer in its real execution (with unseen patterns, represented by a test set). However, in this last aspect there are two factors to consider, as a consequence of the non-deterministic nature of the process we are following:

- The partitioning of the set of patterns into training/test is random (hold out), and is therefore overly dependent on good or bad luck in choosing training and test patterns.
- ANN training is not deterministic, as the initialisation of the weights is random. As before, it is too dependent on good or bad luck to start the training at a good or bad starting point.

For these two reasons, the test result of a single training is not significant when assessing the goodness of fit of the model in the presence of unseen patterns. To solve this problem, the experiment is repeated several times and the results are averaged. This can be implemented in a simple way by means of a loop; however, it is necessary to do this in an orderly way as there are two different sources of randomness.

Firstly, to minimise the randomness due to the partitioning of the data set, it is necessary to have a method that ensures that each pattern is used for training at least once and for testing at least once. The most commonly used method is cross-validation. In this method, the data set is split into k disjoint subsets and k experiments are performed. In the i-th experiment, the i-th subset is held out for testing and the remaining k-1 subsets are used for training; this is known as k-fold cross-validation. A common value is k=10, which gives a 10-fold cross-validation. Finally, the test value for the chosen metric is the average of the values obtained in the k experiments.
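As a minimal sketch of the idea (the toy sizes `N = 10` and `k = 5` and the variable names are illustrative assumptions, not part of the assignment), the loop below holds out one fold per experiment and checks that every pattern ends up in a test set exactly once:

```julia
using Random

# Hypothetical toy setup: 10 patterns split into k = 5 folds.
N, k = 10, 5
foldOf = shuffle(mod1.(1:N, k))      # fold assigned to each pattern (values 1..k)

timesTested = zeros(Int, N)          # how often each pattern lands in a test set
for fold in 1:k
    testMask  = foldOf .== fold      # patterns held out in this experiment
    trainMask = .!testMask           # the remaining k-1 folds
    timesTested[testMask] .+= 1
    println("fold $fold: test size = $(count(testMask)), train size = $(count(trainMask))")
end
println(all(timesTested .== 1))      # every pattern is tested exactly once
```

Each pattern is also used for training in the other k-1 experiments, which is what removes the dependence on a single lucky or unlucky split.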

A widely used variant of cross-validation is stratified cross-validation. In this case, each subset is created in such a way as to keep the proportion of patterns of each class the same (or similar) as in the original dataset. This is particularly used when the data set is imbalanced.

It is usual to save not only the mean, but also the k values, in order to subsequently perform a paired hypothesis test with another model. To do this, it is necessary that both models have been trained using the same training and test sets.
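A rough sketch of why the k individual values are worth keeping: with per-fold results of two models on the same folds, the paired differences can be turned into a test statistic. The accuracy values below are made up for illustration only, and the sketch stops at the t statistic; obtaining a p-value would additionally require a Student t distribution from a statistics package.

```julia
using Statistics

# Hypothetical per-fold accuracies of two models evaluated on the SAME 5 folds.
accA = [0.90, 0.85, 0.88, 0.92, 0.87]
accB = [0.88, 0.84, 0.85, 0.90, 0.86]

d = accA .- accB                               # paired differences, one per fold
tStat = mean(d) / (std(d) / sqrt(length(d)))   # paired t statistic (k-1 degrees of freedom)
println("mean difference = ", mean(d), ", t = ", tStat)
```

The pairing is only valid because fold i of one model is compared against fold i of the other, which is why both models must share the same partition.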

Cross-validation is often considered a slightly pessimistic way of evaluating a model, i.e. the test results are slightly worse than those that would be obtained by training with all available data. In a hold out experiment, as mentioned above, some of the data are set aside for testing. This means that the model is trained with less data than is available, and that, by chance, the data set aside for testing may be of great importance (especially if there is little data). Because the model is trained with less data, possibly missing "important" patterns, hold out is considered a pessimistic assessment. In the same way, cross-validation also sets data aside for testing, so it does not train on all available data and is therefore also pessimistic. However, it guarantees that every pattern is used at least once in training and once in testing, which minimises the impact of chance in the split, so it is considered only slightly pessimistic.

Implementing this is as simple as splitting the data set and running a loop with k iterations in which, at each iteration, a model is trained and evaluated with the corresponding sets. However, if the model is not deterministic, the result obtained at a given iteration will not be meaningful, since it again depends on chance. In this case, a second loop must be nested inside each iteration, in which the model is trained repeatedly; the average of these results is then taken as the result of that iteration. The number of trainings must be high for the average to be really significant: at least 50.
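The two nested loops can be sketched as follows. Here `noisyEvaluation` is a hypothetical stand-in for "train a model on the training folds and return its test metric"; it simply adds random noise to a fixed value to mimic random weight initialisation, so the numbers it produces carry no meaning.

```julia
using Random, Statistics

Random.seed!(42)   # arbitrary seed, only to make the sketch repeatable

# Hypothetical stand-in for training a non-deterministic model on one fold.
noisyEvaluation(fold::Int) = 0.9 + 0.02 * randn()

k, repetitions = 10, 50
foldResults = zeros(k)
for fold in 1:k
    runs = [noisyEvaluation(fold) for _ in 1:repetitions]  # inner loop: repeated trainings
    foldResults[fold] = mean(runs)                          # one averaged value per fold
end
println("test estimate: ", mean(foldResults), " ± ", std(foldResults))
```

Only the k averaged values are kept for the final mean and standard deviation; the individual repetitions serve to stabilise each fold's result.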

### Question

If this second loop is performed with a deterministic model, what will be the standard deviation of the test results obtained? Is there a difference between performing this second loop and averaging the results, or doing a single training?

`If the second loop is performed with a deterministic model, the standard deviation of the test results obtained would be zero. This is because, by definition, a deterministic model will always produce the same result given the same training and test data, which means that each repeated training will yield identical test results. In conclusion, with a deterministic model, there is no difference between performing repeated training and averaging the results or doing a single training.`

In this way, it is possible to evaluate a model together with its hyperparameters in solving a problem. A very common situation is to compare several models (or the same model with different hyperparameters), for which this scheme has to be applied with an important caveat: the sets used in the cross-validation must be the same for each model. Since the distribution of patterns in different sets is random, having the same subsets in different runs is achieved by setting the random seed at the beginning of the program to be executed. Setting the random seed not only allows the same subsets to be generated, but is also important in order to be able to repeat the results in different runs.
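A small sketch of the effect of fixing the seed (the seed value `1234` and the variable names are arbitrary choices for illustration): resetting the generator to the same seed before shuffling reproduces exactly the same fold assignment.

```julia
using Random

Random.seed!(1234)                      # the seed value itself is arbitrary
foldsRunA = shuffle(repeat(1:3, 4))     # fold assignment in a first "run"

Random.seed!(1234)                      # same seed again
foldsRunB = shuffle(repeat(1:3, 4))     # fold assignment in a second "run"

println(foldsRunA == foldsRunB)         # prints true
```

In practice it is usually simpler to generate the index vector once and pass that same vector to every model being compared.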

It is also important to bear in mind that this methodology allows estimating the real performance of a model (although slightly pessimistic). The final model that would be used in production would be the result of training it with all the available patterns, since, as seen in the theory class, and very generally speaking, the more patterns you train with, the better the model will be.

In this assignment, you are asked to:

1. Develop a function called `crossvalidation` that receives a value `N` (equal to the number of patterns), and a value `k` (number of subsets into which the dataset is to be split), and returns a vector of length N, where each element indicates in which subset that pattern should be included.

    To do this function, one possibility is to perform the following steps:
    
    - Create a vector with k sorted elements, from 1 to k.
    - Create a new vector with repetitions of the previous vector until its length is greater than or equal to N. The functions `repeat` and `ceil` can be used for this purpose.
    - Take the first N values of this vector.
    - Shuffle this vector (using the function `shuffle!`) and return it. To use this function, the module `Random` should be loaded.
    
    No loop function should be used in the developed function.

In [2]:
using Random

function crossvalidation(N::Int64, k::Int64)
    subsets = collect(1:k)
    
    num_repeats = ceil(Int, N / k)
    extended_subsets = repeat(subsets, num_repeats)
    
    selected_subsets = extended_subsets[1:N]
    
    shuffle!(selected_subsets)
    
    return selected_subsets
end


crossvalidation (generic function with 1 method)

In [3]:
# Example execution
N = 12  # Number of patterns
k = 4   # Number of subsets

println("Running crossvalidation(N=$N, k=$k)...")
result = crossvalidation(N, k)

println("Result of crossvalidation:")
println(result)


Running crossvalidation(N=12, k=4)...
Result of crossvalidation:
[4, 4, 3, 1, 2, 3, 3, 1, 2, 4, 2, 1]


2. Create a new function called `crossvalidation`, which in this case receives as first argument `targets` of type `AbstractArray{Bool,2}` with the desired outputs, and as second argument a value `k` (number of subsets into which the dataset will be split), and returns a vector of length N (equal to the number of rows of `targets`), where each element indicates in which subset that pattern must be included. This partition must also be stratified. To do this, the following steps can be followed:

    - Create a vector of indices, with as many values as rows in the `targets` matrix.
    - Write a loop that iterates over the classes (columns in the `targets` matrix), and does the following:
        - Take the number of elements belonging to that class. This can be done by making a call to the `sum` function applied to the corresponding column.
        - Make a call to the `crossvalidation` function developed earlier passing as parameters this number of elements and the value of k.
        - Update the index vector positions indicated by the corresponding column of the `targets` matrix with the values of the vector resulting from the call to the `crossvalidation` function.
        
        ### Question
        
        Could you perform these 3 operations in a single line of code?
        
        ```index_vector[findall(x -> x == true, targets[:, class])] .= crossvalidation(sum(targets[:, class]), k) ```
        `sum(targets[:,class]) computes the count of patterns belonging to that class`
        `crossvalidation call generates the indices for the class`
        `index_vector[findall(x -> x == true, targets[:, class])] updates the positions in index_vector that correspond to the indices of the current class.`
    - Return the vector of indices.
    
    As can be seen in this explanation, a loop iterating over all classes can be used in this function. However, you need to make sure that each class has at least k patterns. A usual value is k=10. Therefore, it is important to make sure that you have at least 10 patterns of each class.
        
    ### Question
    
    What would happen if any class has a number of patterns less than k? What would be the consequences for calculating metrics?
    
    ```Answer here```
    
    > If, for whatever reason, it is impossible to ensure that you have at least 10 patterns of each class, one possibility would be to lower the value of k. In this case, consult with the teacher to assess this option, and what impact it might have on the final result of the trained models.
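To make the constraint on k concrete, the sketch below uses a hypothetical helper `assignFolds`, which mirrors the steps described for `crossvalidation(N, k)`, and applies it to a class with only 3 patterns while k = 10:

```julia
using Random

# Hypothetical helper mirroring the steps described for `crossvalidation(N, k)`.
assignFolds(N::Int, k::Int) = shuffle!(repeat(1:k, ceil(Int, N / k))[1:N])

folds = assignFolds(3, 10)           # a class with only 3 patterns, k = 10
println(sort(folds))                 # only folds 1, 2 and 3 receive a pattern of this class

emptyFolds = setdiff(1:10, folds)    # folds whose test set has no pattern of this class
println(emptyFolds)
```

In the folds listed in `emptyFolds`, the test set contains no pattern of that class, so any per-class metric whose denominator counts the positive test patterns of that class (recall, for instance) cannot be computed meaningfully there.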

In [4]:
function crossvalidation(targets::AbstractArray{Bool, 2}, k::Int64)
    @assert k>1
    N = size(targets, 1)  # Number of patterns (rows)
    num_classes = size(targets, 2)  # Number of classes (columns)
    
    index_vector = zeros(Int, N)  # Create a vector to store the subset indices
    
    for class in 1:num_classes
        index_vector[findall(x -> x == true, targets[:, class])] .= crossvalidation(sum(targets[:, class]), k)
    end
    
    return index_vector
end


crossvalidation (generic function with 2 methods)

In [5]:
import Pkg; Pkg.add("StatsBase")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`


##### TEST FUNCTION

In [6]:
using StatsBase
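# Note: rand(Bool, ...) does not produce a valid one-hot matrix. A row may belong
# to several classes or to none; rows with no class keep the initial index 0.
function create_synthetic_data(n_samples::Int, n_classes::Int)
    targets = rand(Bool, n_samples, n_classes) 
    return targets
end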

n_samples = 100  
n_classes = 3    

targets = create_synthetic_data(n_samples, n_classes)

println("Targets (first 10 rows):")
println(targets[1:10, :])

k = 10  
indices = crossvalidation(targets, k)

println("\nIndices for cross-validation (first 10):")
println(indices[1:10])



Targets (first 10 rows):
Bool[1 1 0; 1 0 0; 1 0 0; 1 0 1; 0 1 0; 0 0 1; 1 0 1; 0 0 0; 1 0 1; 0 1 0]

Indices for cross-validation (first 10):
[5, 10, 2, 2, 6, 1, 1, 0, 9, 6]


3. Develop a final function called `crossvalidation`, in this case with the first parameter `targets` of type `AbstractArray{<:Any,1}` (i.e. a vector whose elements can be of any type) and the same second argument `k`, that also performs stratified cross-validation.

    In this case, the steps to follow in this function are not specified. However, they are similar to the previous one. A simple way to do it would be to call the function `oneHotEncoding` passing the vector `targets` as an argument.
    
      ### Question
      
      Could you develop this function without calling oneHotEncoding?
      
      ```Answer here```
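One possible direction, given only as a sketch under assumptions: a boolean mask per class can be built directly from the label vector, so no one-hot matrix is needed. The names `assignFolds` and `stratifiedFolds` are hypothetical and are not the functions requested by the assignment.

```julia
using Random

# Hypothetical helper mirroring the steps described for `crossvalidation(N, k)`.
assignFolds(N::Int, k::Int) = shuffle!(repeat(1:k, ceil(Int, N / k))[1:N])

# Hypothetical stratified assignment working directly on a label vector.
function stratifiedFolds(labels::AbstractVector, k::Int)
    folds = zeros(Int, length(labels))
    for class in unique(labels)
        mask = labels .== class                    # patterns belonging to this class
        folds[mask] .= assignFolds(sum(mask), k)   # spread them over the k folds
    end
    return folds
end

labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
folds = stratifiedFolds(labels, 2)
println(folds)
```

With 4 patterns per class and k = 2, each fold receives 2 patterns of each class, so the class proportions of the full vector are preserved in every fold.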

In [7]:
function crossvalidation(targets::AbstractArray{<:Any, 1}, k::Int64)
    unique_classes = unique(targets) 
    num_classes = length(unique_classes) 
    N = length(targets)
    
    binary_matrix = zeros(Bool, N, num_classes)  
    
    for (i, target) in enumerate(targets)
        binary_matrix[i, findfirst(x -> x == target, unique_classes)] = true
    end

    index_vector = zeros(Int, N) 
    
    for class in 1:num_classes
        class_indices = findall(x -> x == true, binary_matrix[:, class])  
        class_count = length(class_indices) 
        
        if class_count >= k
            index_vector[class_indices] .= crossvalidation(class_count, k)
        else
            println("Warning: Class $class has only $class_count instances, which is less than k ($k).")
        end
    end
    
    return index_vector
end

crossvalidation (generic function with 3 methods)

In [8]:
targets = ["cat", "dog", "horse", "horse", "cat",
           "horse", "cat", "dog", "horse", "cat"]
k = 3

println("Targets:")
println(targets)

indices = crossvalidation(targets, k)

println("\nIndices for cross-validation:")
println(indices)

println("\nDistribution of indices:")
println(countmap(indices))

Targets:
["cat", "dog", "horse", "horse", "cat", "horse", "cat", "dog", "horse", "cat"]

Indices for cross-validation:
[1, 0, 1, 2, 3, 1, 2, 0, 3, 1]

Distribution of indices:
Dict(0 => 2, 2 => 2, 3 => 2, 1 => 4)


4. Integrate these functions into the code developed so far and define two functions to train ANNs following the stratified cross-validation strategy. To do this:

- First, it is necessary to set the random seed to ensure that the experiments are repeatable. This can be done with the `seed!` function of the `Random` module.
- Once the data is loaded and encoded, generate an index vector by calling the `crossvalidation` function.
- Create a function called `trainClassANN`, which receives as parameters the topology, the training set and the indices used for cross-validation. Optionally, it can receive the rest of the parameters used in previous assignments. Inside this function, the following steps may be followed:
    - Create a vector with k elements, which will contain the test results of the cross-validation process with the selected metric. If more than one metric is to be used, create one vector per metric.
    - Make a loop with k iterations (k folds) where, within each iteration, 4 matrices are created from the desired input and output matrices by means of the index vector resulting from the previous function. Namely, the desired inputs and outputs for training and test. As always, do this process of creating new matrices without loops.
    - Within this loop, add a call to generate the model with the training set, and test with the corresponding test set according to the value of k. This can be done by calling the `trainClassANN` function developed in previous assignments, passing as parameters the corresponding sets.
    - As indicated in the previous assignment, the training of ANNs is not deterministic, so that, for each iteration k of the cross-validation, it will be necessary to train several ANNs and return the average of the test results (with the selected metric or metrics) in order to have the test value corresponding to this k.
    - Furthermore, in the case of training ANNs, the training set can be split into training and validation if the ratio of patterns to be used for the validation set is greater than 0. To do this, use the `holdOut` function developed in a previous assignment.
    - Once the model has been trained (several times) on each fold, take the result and fill in the vector(s) created earlier (one for each metric).
    - Finally, provide the result of averaging the values of these vectors for each metric together with their standard deviations.
    - As a result of this call, at least the test value in the selected metric(s) should be returned. If the model is not deterministic (as is the case for the ANNs), it will be the average of the results of several trainings.
- Once this function is done, develop a second one, of the same name, that accepts the desired outputs as a vector instead of a matrix, as in a previous assignment, and whose only operation is to call the newly developed function.
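The step that builds the four matrices without loops can be sketched with boolean masks. The toy data below (6 patterns, 2 features, 2 classes) and the fixed value `numFold = 2` are illustrative assumptions:

```julia
# Hypothetical toy data: 6 patterns, 2 features, one-hot targets for 2 classes.
inputs  = Float32[1 2; 3 4; 5 6; 7 8; 9 10; 11 12]
targets = Bool[1 0; 0 1; 1 0; 0 1; 1 0; 0 1]
kFoldIndices = [1, 1, 2, 2, 3, 3]

numFold = 2
testMask = kFoldIndices .== numFold      # rows held out for testing in this fold

trainingInputs,  testInputs  = inputs[.!testMask, :],  inputs[testMask, :]
trainingTargets, testTargets = targets[.!testMask, :], targets[testMask, :]

println(size(trainingInputs), " ", size(testInputs))   # prints (4, 2) (2, 2)
```

Note the trailing `, :` in every indexing expression: it selects whole rows, whereas indexing a matrix with a single mask or index vector would use linear indexing instead.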

> **Remarks**:
> - Although we have only seen how to train ANNs, in the next assignment we will use other models contained in another library (Scikit-Learn). The idea is to use the same code used for cross-validation with this global loop, changing only the line in which the model is generated.
> - Note that other Machine Learning models are deterministic, so they do not need the inner loop (whenever they are trained with the same data they return the same outputs), but only the loop for each fold.
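As a sketch of that idea, the fold loop can take the model as a function argument, so that swapping models changes only one call. Both `crossValidate` and the toy deterministic "model" `majorityScore` are hypothetical names introduced here for illustration; they are not part of the assignment or of any library.

```julia
using Statistics

# Hypothetical generic loop: `fitAndScore` trains on the training fold and
# returns a single test metric for the held-out fold.
function crossValidate(fitAndScore::Function, inputs, labels, kFoldIndices)
    scores = map(1:maximum(kFoldIndices)) do fold
        test = kFoldIndices .== fold
        fitAndScore(inputs[.!test, :], labels[.!test], inputs[test, :], labels[test])
    end
    return mean(scores), std(scores)
end

# Deterministic toy "model": always predict the majority training label.
function majorityScore(trainInputs, trainLabels, testInputs, testLabels)
    majority = 2 * count(trainLabels) >= length(trainLabels)
    return mean(testLabels .== majority)   # accuracy on the held-out fold
end

inputs = Float64.(reshape(1:16, 8, 2))
labels = Bool[1, 1, 1, 0, 1, 1, 0, 1]
kFoldIndices = [1, 2, 1, 2, 1, 2, 1, 2]
avg, sd = crossValidate(majorityScore, inputs, labels, kFoldIndices)
println("accuracy: $avg ± $sd")
```

Because the toy model is deterministic, no inner repetition loop is needed; a non-deterministic model would average several trainings inside `fitAndScore`.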

In [9]:
using Statistics
function trainClassANN(topology::AbstractArray{<:Int,1}, 
    trainingDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,2}}, 
    kFoldIndices::Array{Int64,1}; 
    transferFunctions::AbstractArray{<:Function,1}=fill(σ, length(topology)), 
    maxEpochs::Int=1000, minLoss::Real=0.0, learningRate::Real=0.01, repetitionsTraining::Int=1, 
    validationRatio::Real=0.0, maxEpochsVal::Int=20)
    
    numFolds = maximum(kFoldIndices); 
    inputs, targets = trainingDataset;

    numMetrics = 7;
    metrics = Matrix{Float64}(undef, numFolds, numMetrics);


    for numFold in 1:numFolds
            
        trainingInputs  =       inputs[kFoldIndices.!=  numFold,:]; 
        testInputs      =       inputs[kFoldIndices.==  numFold,:];
        
        trainingTargets =       targets[kFoldIndices.!= numFold,:];
        testTargets     =       targets[kFoldIndices.== numFold,:];

                                
      
        # One row per training repetition, one column per metric
        metricsFold = Matrix{Float64}(undef, repetitionsTraining, numMetrics);

        for numTraining in 1:repetitionsTraining

            if validationRatio>0
                # Split the training fold into training and validation subsets
                (trainingIndices, validationIndices) = holdOut(size(trainingInputs,1),
                    validationRatio*size(trainingInputs,1)/size(inputs,1));

                # Index rows (patterns) and keep all columns; indexing the matrix
                # with the index vector alone would use linear indexing
                ann, = trainClassANN(topology,
                                      (trainingInputs[trainingIndices,:], trainingTargets[trainingIndices,:]),
                                      validationDataset=(trainingInputs[validationIndices,:], trainingTargets[validationIndices,:]),
                                      testDataset=(testInputs, testTargets),  
                                      transferFunctions=transferFunctions, maxEpochs=maxEpochs, minLoss=minLoss,
                                      learningRate=learningRate, maxEpochsVal=maxEpochsVal)
                    
            else

                ann, = trainClassANN(topology, (trainingInputs, trainingTargets),
                                       testDataset=(testInputs, testTargets), 
                                       transferFunctions=transferFunctions, maxEpochs=maxEpochs, minLoss=minLoss,
                                       learningRate=learningRate, maxEpochsVal=maxEpochsVal)
            end;

            metricsIter = collect(confusionMatrix(ann(testInputs')', testTargets)[1:7])
            metricsFold[numTraining, :] .= metricsIter      
        end;
        metrics[numFold, :] .= mean(metricsFold, dims=1)[1, :]
    end;
    metricsAvg = mean(metrics, dims=1)
    metricsStd = std(metrics, dims=1)
    return (metricsAvg[1, :], metricsStd[1, :])
end

trainClassANN (generic function with 1 method)

In [10]:
using CSV
using DataFrames
using Random
using Statistics

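# Note: iris.data has no header row. Without `header=false`, CSV.read takes the
# first sample as column names, which is why only 149 rows are loaded.
iris_df = CSV.read("./data/iris/iris.data", DataFrame)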


| Row | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | String15 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |


In [11]:
include("./functionsLibrary.jl")

trainClassANN (generic function with 4 methods)

In [12]:

inputs = Matrix(iris_df[:, 1:4])  # Take first 4 columns as input features
labels = iris_df[:, 5]      # Take the fifth column (species label) as the target

# One-hot encoding of labels (Iris Setosa, Versicolour, and Virginica)
classes = unique(labels)
targets = oneHotEncoding(labels, classes)


149×3 Matrix{Bool}:
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 1  0  0
 ⋮     
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1
 0  0  1

In [13]:
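# Define topology: two hidden layers with 4 and 5 neurons
# (the vector lists hidden layers only, not the input or output layer)
topology = [4, 5]

# Define other parameters. They only take effect if they are passed to
# trainClassANN as keyword arguments; otherwise its default values are used.
learning_rate = 0.01
max_epochs = 100
repetitions_training = 3
validation_ratio = 0.2  # 20% of training data used for validation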


0.2

In [14]:
Pkg.add("Flux")
using Flux

(metricsAvg, metricsStd) = trainClassANN(topology, (inputs, targets), crossvalidation(targets,10))


[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.10/Manifest.toml`
│   The input will be converted, but any earlier layers may be very slow.
│   layer = Dense(4 => 4, σ)
│   summary(x) = 4×134 adjoint(::Matrix{Float64}) with eltype Float64
└ @ Flux /home/djove/.julia/packages/Flux/htpCe/src/layers/stateless.jl:59


Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.09289962>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.10137406>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.07237235>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.11441504>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.08565611>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.08221638>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.07716699>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss 0.07182079>minLoss 0.0
numEpochsValidation 1<maxEpochsVal 20
Stop criteria:
numEpoch 1000<maxEpochs 1000
trainingLoss

([0.9595238095238097, 0.040476190476190464, 1.0, 1.0, 1.0, 1.0, 1.0], [0.046939597087686476, 0.046939597087686476, 0.0, 0.0, 0.0, 0.0, 0.0])

In [15]:
println("Average metrics across folds: ", metricsAvg)
println("Standard deviation of metrics across folds: ", metricsStd)


Average metrics across folds: [0.9595238095238097, 0.040476190476190464, 1.0, 1.0, 1.0, 1.0, 1.0]
Standard deviation of metrics across folds: [0.046939597087686476, 0.046939597087686476, 0.0, 0.0, 0.0, 0.0, 0.0]
