# Data validation methods

This notebook presents several methods for splitting data into training and validation sets. Validation is one stage of model development: to check that the model behaves as expected, we evaluate it on data it has never seen before, the validation set. The hold-out, k-fold, and leave-one-out methods covered here each produce such training and validation splits.

In [16]:
import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo 


First, we import the necessary libraries and load the data. For this example, we use the Iris and Wine Quality datasets, both available from the UCI (University of California, Irvine) Machine Learning Repository. The main difference between them is the number of samples and features.

In [17]:
# Load the Wine Quality dataset
wine_data = fetch_ucirepo(id=186)
wine = pd.DataFrame(wine_data.data.features, columns=wine_data.data.features.columns)
wine['target'] = wine_data.data.targets.values.ravel()

# Load the Iris dataset
iris_data = fetch_ucirepo(id=53)
iris = pd.DataFrame(iris_data.data.features, columns=iris_data.data.features.columns)
iris['target'] = iris_data.data.targets.values.ravel()

# Features are every column except the last; the target is the last column
X_iris = iris.iloc[:, :-1].values
y_iris = iris.iloc[:, -1].values

X_wine = wine.iloc[:, :-1].values
y_wine = wine.iloc[:, -1].values

# Show the first rows of each dataset
print('Iris dataset:')
print(iris.head())
print('\nWine Quality dataset:')
print(wine.head())


Iris dataset:
   sepal length  sepal width  petal length  petal width       target
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Wine Quality dataset:
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  \


One benefit of these methods is that the same dataset can serve both training and validation at different stages of model development. However, the same sample must never be used for training and validation at the same time. Duplicate rows would break this rule, because one copy could land in the training set and the other in the validation set, so we find and drop duplicates before splitting the data.

In [18]:

# Remove duplicates from the Iris dataset
iris = iris.drop_duplicates()
X_iris = iris.iloc[:, :-1].values
y_iris = iris.iloc[:, -1].values

print('Iris dataset:')
print("Shape: ", X_iris.shape, y_iris.shape)


# Remove duplicates from the Wine Quality dataset
wine = wine.drop_duplicates()
X_wine = wine.iloc[:, :-1].values
y_wine = wine.iloc[:, -1].values

print('Wine Quality dataset:')
print("Shape: ", X_wine.shape, y_wine.shape)

Iris dataset:
Shape:  (147, 4) (147,)
Wine Quality dataset:
Shape:  (5318, 11) (5318,)
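
To see why duplicates matter, one can count the feature rows that two splits have in common. The sketch below is only an illustration: `count_shared_rows` is a hypothetical helper, the toy array is made up, and it uses `np.unique` (which sorts the rows) rather than the `drop_duplicates` call used in this notebook.

```python
import numpy as np

def count_shared_rows(X_train, X_test):
    """Count distinct feature rows that occur in both splits."""
    train_rows = {tuple(row) for row in np.asarray(X_train).tolist()}
    test_rows = {tuple(row) for row in np.asarray(X_test).tolist()}
    return len(train_rows & test_rows)

# Toy data (made up for illustration): rows 0 and 3 are identical.
X_toy = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.1, 3.5]])

# Ordered 50/50 split with the duplicate still present.
shared_before = count_shared_rows(X_toy[:2], X_toy[2:])

# Same split after removing duplicate rows (np.unique also sorts the rows).
X_unique = np.unique(X_toy, axis=0)
shared_after = count_shared_rows(X_unique[:2], X_unique[2:])

print("shared rows before:", shared_before, "after:", shared_after)
```

A non-zero count means the validation set contains rows the model has already seen during training.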


To explore each method, we split both datasets into training and validation sets using the hold-out, k-fold, and leave-one-out methods. Hold-out is the simplest: it splits the data once into a training set and a validation set according to a fraction (`test_size`).

K-fold cross-validation is more involved. It splits the data into k folds and then trains the model k times, each time using a different fold as the validation set and the remaining folds as the training set. This method is useful when the dataset is small, because it lets every sample take part in training across the k runs.

Leave-one-out cross-validation is a special case of k-fold cross-validation in which the number of folds equals the number of samples, so each validation set contains exactly one sample. It is likewise useful for small datasets, but on large datasets it becomes computationally expensive because the model must be trained once per sample.
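
The cost difference between the three methods can be made concrete by counting training runs. The following is a minimal sketch: `count_model_fits` is a hypothetical helper, the sample counts are the shapes printed after duplicate removal, and the fold counts (5 and 10) are the ones this notebook uses for k-fold.

```python
def count_model_fits(n_samples, method, k=None):
    """Return how many times a model is trained under each validation method."""
    if method == "holdout":
        return 1              # one split, one training run
    if method == "kfold":
        if k is None:
            raise ValueError("k is required for k-fold")
        return k              # one training run per fold
    if method == "loo":
        return n_samples      # one training run per sample
    raise ValueError(f"unknown method: {method}")

# (dataset name, samples after duplicate removal, folds used for k-fold)
datasets = [("Iris", 147, 5), ("Wine Quality", 5318, 10)]

for name, n, k in datasets:
    fits = {m: count_model_fits(n, m, k) for m in ("holdout", "kfold", "loo")}
    print(name, fits)
```

Under these assumptions, leave-one-out needs 147 training runs on Iris but 5318 on Wine Quality, each on nearly the full dataset.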

In [19]:
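def hold_out(X, y, test_size):
    """Split X and y once; the last `test_size` fraction of rows is the test set."""
    # Rows are taken in their stored order (no shuffling).
    split_index = int(len(X) * (1 - test_size))
    X_train = X[:split_index]
    X_test = X[split_index:]
    y_train = y[:split_index]
    y_test = y[split_index:]
    return X_train, X_test, y_train, y_test

def k_fold_cross_validation(X, y, K):
    """Yield K (X_train, X_test, y_train, y_test) splits, one per fold."""
    # Integer division: if len(X) is not a multiple of K, the leftover
    # samples at the end always stay in the training set.
    fold_size = len(X) // K
    indices = list(range(len(X)))

    for i in range(K):
        # Fold i is the test set; everything else is the training set.
        test_indices = indices[i * fold_size:(i + 1) * fold_size]
        train_indices = indices[:i * fold_size] + indices[(i + 1) * fold_size:]

        X_train = [X[idx] for idx in train_indices]
        X_test = [X[idx] for idx in test_indices]
        y_train = [y[idx] for idx in train_indices]
        y_test = [y[idx] for idx in test_indices]

        yield X_train, X_test, y_train, y_test

def leave_one_out(X, y):
    """Yield one split per sample, with that single sample as the test set."""
    indices = list(range(len(X)))

    for i in indices:
        # Train on every sample except sample i.
        X_train = [X[idx] for idx in indices if idx != i]
        X_test = [X[i]]
        y_train = [y[idx] for idx in indices if idx != i]
        y_test = [y[i]]

        yield X_train, X_test, y_train, y_test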
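
Because `k_fold_cross_validation` computes `fold_size` with integer division, any samples left over when the dataset size is not a multiple of `K` never appear in a validation fold (for example, 147 samples with `K=5` gives folds of 29, covering only 145 samples). One possible alternative, sketched here under the assumption that slightly unequal folds are acceptable, is to split the indices with `np.array_split`. The name `k_fold_indices` is hypothetical and not part of this notebook.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = np.arange(n_samples)
    # array_split allows unequal folds: the first n_samples % k folds
    # receive one extra sample each.
    folds = np.array_split(indices, k)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# 147 samples, 5 folds: two folds of 30 and three folds of 29.
fold_sizes = [len(test_idx) for _, test_idx in k_fold_indices(147, 5)]
print(fold_sizes)

# With k equal to the number of samples, every fold holds one sample,
# which is the leave-one-out case.
loo_sizes = [len(test_idx) for _, test_idx in k_fold_indices(6, 6)]
print(loo_sizes)
```

The index arrays can then be used to select rows, for example `X[train_idx]` and `X[test_idx]` when `X` is a NumPy array.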



We run the hold-out method with a common choice of parameters: 30% of the data for validation and the remaining 70% for training. The k-fold and leave-one-out methods are run afterwards.

In [20]:
# Hold-Out (we use test_size=0.3 for a 70%/30% split)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = hold_out(X_iris, y_iris, test_size=0.3)
X_train_wine, X_test_wine, y_train_wine, y_test_wine = hold_out(X_wine, y_wine, test_size=0.3)

print("Iris Hold-Out:")
print("X_train shape:", X_train_iris.shape, "X_test shape:", X_test_iris.shape)
print("y_train shape:", y_train_iris.shape, "y_test shape:", y_test_iris.shape)

print("\nWine Hold-Out:")
print("X_train shape:", X_train_wine.shape, "X_test shape:", X_test_wine.shape)
print("y_train shape:", y_train_wine.shape, "y_test shape:", y_test_wine.shape)


Iris Hold-Out:
X_train shape: (102, 4) X_test shape: (45, 4)
y_train shape: (102,) y_test shape: (45,)

Wine Hold-Out:
X_train shape: (3722, 11) X_test shape: (1596, 11)
y_train shape: (3722,) y_test shape: (1596,)
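
Note that `hold_out` slices the rows in their stored order. If the rows happen to be grouped by class, as the `head()` output suggests may be the case for Iris, an ordered split can leave whole classes out of the validation set. A minimal sketch of a shuffled variant follows. `hold_out_shuffled` and its `seed` parameter are hypothetical names that are not part of this notebook, and the toy data is made up for illustration.

```python
import numpy as np

def hold_out_shuffled(X, y, test_size, seed=0):
    """Hold-out split that shuffles the row order before slicing."""
    X = np.asarray(X)
    y = np.asarray(y)
    # A seeded generator keeps the split reproducible between runs.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    split_index = int(len(X) * (1 - test_size))
    train_idx, test_idx = order[:split_index], order[split_index:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data sorted by label, mimicking a file grouped by class.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["a"] * 10 + ["b"] * 10)

X_tr, X_te, y_tr, y_te = hold_out_shuffled(X_toy, y_toy, test_size=0.25, seed=0)
print("X_train shape:", X_tr.shape, "X_test shape:", X_te.shape)
print("labels in test set:", sorted(set(y_te.tolist())))
```

The same seed always yields the same split, which keeps results comparable between runs.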


In the same way, we run the k-fold method, using 5 folds for the Iris dataset and 10 folds for the Wine Quality dataset.

In [21]:
import numpy as np

def mostrar_k_fold_cross_validation(X, y, K, dataset_name):
    """Print the train/test shapes of every fold produced by k_fold_cross_validation."""
    print(f"\nK-Fold Cross Validation - {dataset_name} Dataset")
    for i, (X_train, X_test, y_train, y_test) in enumerate(k_fold_cross_validation(X, y, K), 1):
        # The generator yields lists; convert to arrays to read their shapes
        X_train = np.array(X_train)
        X_test = np.array(X_test)
        y_train = np.array(y_train)
        y_test = np.array(y_test)
        print(f"\nFold {i}:")
        print(f"  - X_train size: {X_train.shape}")
        print(f"  - X_test size: {X_test.shape}")
        print(f"  - y_train size: {y_train.shape}")
        print(f"  - y_test size: {y_test.shape}")

# Show K-Fold Cross Validation for the Iris dataset
mostrar_k_fold_cross_validation(X_iris, y_iris, K=5, dataset_name="Iris")

# Show K-Fold Cross Validation for the Wine Quality dataset
mostrar_k_fold_cross_validation(X_wine, y_wine, K=10, dataset_name="Wine")



K-Fold Cross Validation - Iris Dataset

Fold 1:
  - X_train size: (118, 4)
  - X_test size: (29, 4)
  - y_train size: (118,)
  - y_test size: (29,)

Fold 2:
  - X_train size: (118, 4)
  - X_test size: (29, 4)
  - y_train size: (118,)
  - y_test size: (29,)

Fold 3:
  - X_train size: (118, 4)
  - X_test size: (29, 4)
  - y_train size: (118,)
  - y_test size: (29,)

Fold 4:
  - X_train size: (118, 4)
  - X_test size: (29, 4)
  - y_train size: (118,)
  - y_test size: (29,)

Fold 5:
  - X_train size: (118, 4)
  - X_test size: (29, 4)
  - y_train size: (118,)
  - y_test size: (29,)

K-Fold Cross Validation - Wine Dataset

Fold 1:
  - X_train size: (4787, 11)
  - X_test size: (531, 11)
  - y_train size: (4787,)
  - y_test size: (531,)

Fold 2:
  - X_train size: (4787, 11)
  - X_test size: (531, 11)
  - y_train size: (4787,)
  - y_test size: (531,)

Fold

Finally, we apply the leave-one-out method to the Iris dataset only. The Wine Quality dataset is too large for it: with 5318 samples, leave-one-out would require 5318 training runs, which is computationally expensive.

In [22]:
# Leave-One-Out for the Iris dataset

print("Leave-One-Out Validation - Iris Dataset")
for i, (X_train, X_test, y_train, y_test) in enumerate(leave_one_out(X_iris, y_iris), 1):
    X_train = np.array(X_train)
    X_test = np.array(X_test)
    y_train = np.array(y_train)
    y_test = np.array(y_test)
    print(f"\nIteration {i}:")
    print("X_train shape:", X_train.shape, "X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape, "y_test shape:", y_test.shape)

Leave-One-Out Validation - Iris Dataset

Iteration 1:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 2:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 3:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 4:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 5:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 6:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 7:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 8:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 9:
X_train shape: (146, 4) X_test shape: (1, 4)
y_train shape: (146,) y_test shape: (1,)

Iteration 10:
X_train shape: (146, 4) X_test shape: (1, 4)
