# Introduction to K-Fold Cross-Validation

In this notebook, we will explore K-Fold Cross-Validation, a widely used technique for evaluating machine learning models. Instead of relying on a single train/test split, it evaluates the model on several different splits of the same data, which gives a more reliable estimate of model accuracy.

### Key Concepts:
- **K-Fold Cross-Validation**: How it works and why it's used.
- **Performance Metrics**: Evaluating models based on multiple training/test splits.

### Step 1. Import Necessary Libraries

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier  # Example model
from sklearn.metrics import accuracy_score  # For evaluating performance

### Step 2. Loading Data and Setting Up

In [25]:
pd.set_option('expand_frame_repr', False)

# Load the Iris dataset
iris = load_iris()

# Combine the data into a DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add the target variable to the DataFrame
iris_df['target'] = iris.target

In [26]:
X = iris_df.drop('target', axis=1) # Features
y = iris_df['target'] # Target Variable

In [27]:
# Initialize the model (K-Nearest Neighbors in this case)
model = KNeighborsClassifier(n_neighbors=3)

### Step 3. Splitting the Data into Train (80%) and Test (20%)

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
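
As a quick sanity check on the 80/20 split, the sketch below rebuilds it in a self-contained way and prints the resulting shapes. The names `X_iris`, `y_iris`, `X_tr`, and `X_te` are illustrative and are not part of the notebook; the split parameters mirror the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and target as pandas objects (150 rows, 4 features)
X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)

# Same parameters as the notebook cell: 20% held out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

print("Training set shape:", X_tr.shape)
print("Test set shape:", X_te.shape)

# The split is not stratified, so class counts in the test set may be uneven
print(y_te.value_counts().sort_index())
```

With 150 rows, this leaves 120 rows for training and cross-validation and 30 rows for the final test.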

### Step 4. Applying the Resampling Procedure of Cross-Validation

**What is K-Fold Cross-Validation?**

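K-Fold Cross-Validation splits the data into K subsets (folds) of roughly equal size. The model is trained on K-1 folds and evaluated on the remaining fold. This process is repeated K times, so that each fold serves as the evaluation set exactly once. Averaging the K scores gives a more stable estimate of the model's accuracy than a single train/validation split, because the result does not depend on one particular partition. In this notebook the folds are drawn from the training set only, so the test set from Step 3 stays untouched.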
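
To make the fold mechanics concrete, here is a minimal sketch on a toy array of six samples. The array `X_toy` and the counter `test_counts` are illustrative names, not part of the notebook. Shuffling is left off so that the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: six samples with two features, used only to show the index pattern
X_toy = np.arange(12).reshape(6, 2)

# Without shuffling, the folds are contiguous blocks of indices
kf_toy = KFold(n_splits=3)

# Count how many times each sample lands in the held-out fold
test_counts = np.zeros(len(X_toy), dtype=int)
for fold_num, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    print(f"Fold {fold_num}: train={train_idx}, test={test_idx}")
    test_counts[test_idx] += 1

# Each sample should be held out exactly once across the K folds
print("Times each sample was held out:", test_counts)
```

Every index appears in exactly one held-out fold and in the training portion of the other two.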

In this notebook, we will use 3-Fold Cross-Validation to keep the output short, although 5 or 10 folds are more common in practice.
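
The choice of K changes how much data each fold trains and validates on. The sketch below tabulates the fold sizes for K = 3, 5, and 10, assuming the 120 training rows produced by the 80/20 split in Step 3. The names `n_samples`, `placeholder`, and `fold_sizes` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 120  # assumed size of the training split (80% of 150 Iris rows)
placeholder = np.arange(n_samples).reshape(-1, 1)  # KFold only needs the length

fold_sizes = {}
for k_value in (3, 5, 10):
    kf_k = KFold(n_splits=k_value, shuffle=True, random_state=42)
    # Record the size of the held-out fold in every split
    sizes = [len(val_idx) for _, val_idx in kf_k.split(placeholder)]
    fold_sizes[k_value] = sizes
    print(
        f"K={k_value:>2}: validation size per fold = {sizes[0]}, "
        f"training size per fold = {n_samples - sizes[0]}"
    )
```

A larger K trains each model on more data but validates it on fewer samples and requires more model fits.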


In [31]:
color_red, color_reset = "\033[31m", "\033[0m"

k = 3
kf = KFold(n_splits=k, shuffle=True, random_state=42)
accuracy_list = []

# Iterate through the k splits
for fold_num, (train_index, test_index) in enumerate(kf.split(X_train)):
    print(f"Fold {fold_num+1}:\n")
    print(f"{color_red}Train Index:{color_reset} \n{train_index}\n")
    print(f"{color_red}Test Index:{color_reset} \n{test_index}\n")
    # Split the data into training and testing sets for this fold
    X_train1, X_test1 = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train1, y_test1 = y_train.iloc[train_index], y_train.iloc[test_index]
    
    # Train the model on this fold's training data
    model.fit(X_train1, y_train1)
    
    # Evaluate the model on this fold's validation data
    y_pred = model.predict(X_test1)
    fold_accuracy = accuracy_score(y_test1, y_pred)
    accuracy_list.append(fold_accuracy)
    print(f"Fold {fold_num+1} Accuracy: {fold_accuracy:.2f}\n")

average_fold_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Average cross-validation accuracy on training data: {average_fold_accuracy:.2f}')
    

Fold 1:

Train Index:
[  1   2   3   5   6   7   8  13  14  16  17  19  20  21  23  25  27  28
  29  32  33  34  35  37  38  39  41  43  46  48  49  50  51  52  54  57
  58  59  60  61  63  66  67  68  69  71  72  74  75  77  79  80  81  82
  83  84  85  86  87  90  92  93  94  95  98  99 100 101 102 103 105 106
 108 111 112 113 115 116 117 119]

Test Index:
[  0   4   9  10  11  12  15  18  22  24  26  30  31  36  40  42  44  45
  47  53  55  56  62  64  65  70  73  76  78  88  89  91  96  97 104 107
 109 110 114 118]

Fold 1 Accuracy: 0.95

Fold 2:

Train Index:
[  0   1   2   4   9  10  11  12  14  15  18  20  21  22  23  24  26  29
  30  31  32  36  37  40  41  42  44  45  47  48  51  52  53  55  56  57
  58  59  60  61  62  63  64  65  70  71  73  74  75  76  78  79  81  82
  86  87  88  89  91  92  93  96  97  99 101 102 103 104 105 106 107 108
 109 110 112 114 115 116 118 119]

Test Index:
[  3   5   6   7   8  13  16  17  19  25  27  28  
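
The manual loop can also be expressed with scikit-learn's `cross_val_score`, which clones the estimator, fits it on each fold's training portion, and returns one score per fold. The sketch below is self-contained and uses illustrative names (`X_tr`, `y_tr`, `knn`, `cv`, `cv_scores`) rather than the notebook's variables. Passing the same `KFold` settings as `cv` should reproduce the splits used in the loop.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the training split with the same parameters as Step 3
X_iris, y_iris = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Same fold configuration and model as the manual loop
cv = KFold(n_splits=3, shuffle=True, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)

# One accuracy value per fold; the estimator is cloned for each fit
cv_scores = cross_val_score(knn, X_tr, y_tr, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", cv_scores.round(2))
print(f"Mean accuracy: {cv_scores.mean():.2f}")
```

The manual loop remains useful when you want to inspect the indices or predictions of each fold, while `cross_val_score` is shorter when only the scores matter.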

### Step 5. Final Model Training:

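The cross-validation loop leaves `model` fitted on the training portion of the last fold only. After cross-validation, we therefore refit the model on the entire training data (`X_train` and `y_train`).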


In [32]:
model.fit(X_train, y_train)



### Step 6. Evaluation on the Test Set:

Finally, we evaluate the trained model on the test set (X_test and y_test), which was never seen during cross-validation.
This test accuracy reflects the model's ability to generalize to new, unseen data.

In [33]:
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f'Test accuracy: {test_accuracy:.2f}')

Test accuracy: 1.00


### Interpretation of the Results:
- **Cross-Validation Accuracy (0.95):**

    - Cross-validation accuracy is the average accuracy over the K held-out folds. Because each fold is scored by a model that was not trained on it, it estimates how well the model generalizes using only the **training data**.
    - An accuracy of 0.95 (95%) indicates that the model classifies most held-out samples correctly, whichever part of the training set is withheld.

- **Test Accuracy (1.00):**

    - Test accuracy represents the model's performance on the held-out test set. This test set was never seen by the model during training or cross-validation, so this is the final measure of how well the model generalizes to completely unseen data.
    - A test accuracy of 1.00 (100%) means the model correctly classified all the test set samples.
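
To put these two numbers in perspective, it helps to translate them back into sample counts. The sketch below assumes 40 validation samples per fold (120 training rows split into 3 folds) and 30 test rows (20% of 150). The variable names are illustrative.

```python
n_val_fold = 40  # validation samples per fold: 120 training rows / 3 folds
n_test = 30      # held-out test rows: 20% of the 150 Iris rows

# Fold 1 reported an accuracy of 0.95; convert it to a count of correct predictions
fold1_correct = round(0.95 * n_val_fold)
fold1_errors = n_val_fold - fold1_correct

# Smallest possible change in accuracy caused by a single sample
step_val = 1 / n_val_fold
step_test = 1 / n_test

print(f"Fold 1: {fold1_correct} of {n_val_fold} correct ({fold1_errors} errors)")
print(f"One validation sample shifts fold accuracy by {step_val:.3f}")
print(f"One test sample shifts test accuracy by {step_test:.3f}")
```

With evaluation sets this small, the gap between 0.95 and 1.00 corresponds to only one or two samples.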

### Conclusion:
The model performs very well, with a cross-validation accuracy of 95% and a test accuracy of 100%. This suggests that it has learned the structure of the Iris dataset and is likely to generalize to similar data. However, the perfect test score should be read with caution: the test set contains only 30 samples and Iris is a small, well-separated dataset, so comparable results are much harder to achieve on more complex, real-world data.