# K-Fold Cross-Validation with the Iris Dataset

In this notebook, we'll explore **K-fold cross-validation** and apply it to the classic **Iris dataset** using `scikit-learn`. K-fold cross-validation is an essential technique for evaluating the performance of a machine learning model by splitting the data into training and testing sets multiple times.

---

## What is K-Fold Cross-Validation?

K-fold cross-validation is a method used to:
- Assess the generalization performance of a model.
- Obtain a more stable performance estimate than a single train/test split by averaging the model’s performance across multiple splits, which makes overfitting easier to detect.
  
In K-fold cross-validation, the data is divided into `K` subsets (or "folds"). The model is trained on `K-1` folds and evaluated on the remaining fold, repeating this process `K` times so that each fold serves as the test set exactly once. The final evaluation score is the average performance over all `K` folds.
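
The splitting procedure described above can be sketched with plain NumPy. This is a minimal illustration, not how `scikit-learn` implements it; the helper name `k_fold_indices` and its `seed` parameter are hypothetical and exist only for this example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle the sample indices once
    folds = np.array_split(indices, k)     # K nearly equal chunks
    for i in range(k):
        test_idx = folds[i]                # fold i is held out for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={np.sort(train_idx)}, test={np.sort(test_idx)}")
```

With 10 samples and 5 folds, each test fold holds 2 samples and every sample appears in exactly one test fold.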

---

## Step 1: Import Libraries

Let's start by importing the necessary libraries.


```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np
```


## Step 2: Load the Iris Dataset

We'll use the Iris dataset, a popular dataset in machine learning. It consists of 150 samples of iris flowers, each described by four features: `sepal length`, `sepal width`, `petal length`, and `petal width`. The target variable represents three species of iris: *Setosa*, *Versicolour*, and *Virginica*.


```python
data = load_iris()
X = data.data  # Features
y = data.target  # Target variable
```
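
As a quick sanity check, the description above can be confirmed by inspecting the loaded arrays. The sketch below reloads the data so that it runs on its own; the variable name `class_counts` is introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

print(X.shape)                 # (number of samples, number of features)
print(data.feature_names)      # names of the four measurements
print(data.target_names)       # names of the three species

class_counts = np.bincount(y)  # number of samples per class label
print(class_counts)
```

The samples are stored in species order (all of class 0, then class 1, then class 2), which becomes relevant when choosing how to split the data.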

## Step 3: Set Up K-Fold Cross-Validation

We'll use `KFold` from `scikit-learn` to split our data into `K=5` folds. This means the data will be split into 5 subsets, and the model will be trained on 4 folds and evaluated on the remaining 1 fold, repeating this process 5 times.

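- `n_splits=5` specifies the number of folds.
- `shuffle=True` shuffles the sample order before splitting. This matters for Iris because the samples are stored in species order, so unshuffled folds would each contain only one or two species.
- `random_state=42` fixes the shuffle so that the splits are reproducible.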


```python
# Define the number of folds
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
```
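
To see why shuffling matters here, the sketch below compares the class make-up of the first test fold with and without shuffling. The names `kf_demo` and `counts` are introduced only for this illustration. Note that `random_state` is passed only when `shuffle=True`, since `KFold` rejects a fixed seed when shuffling is disabled.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

counts = {}
for shuffle in (False, True):
    kf_demo = KFold(n_splits=5, shuffle=shuffle,
                    random_state=42 if shuffle else None)
    _, test_idx = next(kf_demo.split(X))  # first (train, test) pair only
    counts[shuffle] = np.bincount(y[test_idx], minlength=3)
    print(f"shuffle={shuffle}: class counts in first test fold = {counts[shuffle]}")
```

Without shuffling, the first test fold is simply the first 30 rows, all of which belong to one species, so a model evaluated on it would be tested on a single class.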


## Example of K-Fold Splits

In **K-fold cross-validation**, each fold serves as a test set exactly once, while the remaining folds serve as the training set. 

Since we have **150 samples** in the Iris dataset (50 samples for each of the three species) and we set `k=5` folds, here’s how the data is divided:

- **Test samples per fold**: With 5 folds, each fold holds $\frac{150}{5} = 30$ samples, which serve as the test set.
- **Training samples**: The remaining samples (150 - 30 = 120) will be used for training in each fold.

For each fold:
- **Training set**: 120 samples (roughly 40 from each species on average; `KFold` shuffles but does not stratify, so the exact counts vary).
- **Test set**: 30 samples (roughly 10 from each species on average).

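Here’s a table to illustrate how data might be split across 5 folds. The values are illustrative 1-based sample numbers, not actual output; `KFold` itself returns 0-based array indices.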

| Fold Number | Training Data Indexes (120 Samples) | Test Data Indexes (30 Samples) |
|-------------|-------------------------------------|--------------------------------|
| Fold 1      | Example: 1, 2, 3, ..., 150          | Example: 21, 44, ..., 130      |
| Fold 2      | Example: 1, 3, 4, ..., 149          | Example: 2, 5, ..., 78         |
| Fold 3      | Example: 1, 2, 5, ..., 148          | Example: 3, 9, ..., 143        |
| Fold 4      | Example: 1, 2, 4, ..., 147          | Example: 10, 12, ..., 135      |
| Fold 5      | Example: 2, 4, 5, ..., 150          | Example: 1, 7, ..., 119        |

This structure ensures each sample is used for testing exactly once and for training in the other four folds, providing a more comprehensive evaluation of the model’s performance than a single split.
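
The fold sizes and the "tested exactly once" property can be checked directly. The sketch below rebuilds the same splitter under the name `kf_check` so that it runs on its own. It also shows `StratifiedKFold`, which is not used in this notebook but is the `scikit-learn` splitter to reach for when exact per-class balance in every fold is wanted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

kf_check = KFold(n_splits=5, shuffle=True, random_state=42)
times_tested = np.zeros(len(X), dtype=int)  # how often each sample is in a test fold
for fold, (train_idx, test_idx) in enumerate(kf_check.split(X), start=1):
    times_tested[test_idx] += 1
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"test class counts={np.bincount(y[test_idx], minlength=3)}")

print("Each sample tested exactly once:", bool(np.all(times_tested == 1)))

# Stratified variant: every test fold keeps the 1:1:1 species ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_counts = [np.bincount(y[test_idx], minlength=3)
                     for _, test_idx in skf.split(X, y)]
print(stratified_counts)
```

With plain `KFold`, the per-species counts in each test fold hover around 10 but are not fixed; with `StratifiedKFold` on this balanced dataset, each test fold contains exactly 10 samples per species.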

---



## Step 4: Train and Evaluate the Model with K-Fold Cross-Validation

We'll use the **K-Nearest Neighbors (KNN)** algorithm with 3 neighbors (`n_neighbors=3`, not to be confused with the `k=5` folds) to classify the iris flowers. For each fold, we'll:
1. Split the data into training and test sets.
2. Train the model on the training set.
3. Predict on the test set.
4. Calculate and store the accuracy score.

Finally, we'll calculate the **average accuracy** across all folds.


```python
k_neighbors = 3  # number of neighbors used by KNN
accuracies = []  # one accuracy score per fold

for train_index, test_index in kf.split(X):
    # Select this fold's training and test rows
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit a fresh classifier on the training folds only
    knn = KNeighborsClassifier(n_neighbors=k_neighbors)
    knn.fit(X_train, y_train)

    # Score the predictions on the held-out fold
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

average_accuracy = np.mean(accuracies)
print(f"Accuracies for each fold: {accuracies}")
print(f"Average Accuracy across all folds: {average_accuracy:.2f}")
```


```text
Accuracies for each fold: [1.0, 0.9666666666666667, 0.9666666666666667, 0.9333333333333333, 0.9666666666666667]
Average Accuracy across all folds: 0.97
```
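
The manual loop can also be expressed with `cross_val_score`, a `scikit-learn` helper that runs the fit/predict/score cycle for each fold. The sketch below is an alternative formulation, not the approach used in this notebook. It rebuilds the splitter and classifier so that it runs on its own, and with the same splitter settings it should yield the same per-fold accuracies as the loop.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)

# One accuracy value per fold, computed on each held-out fold in turn
scores = cross_val_score(knn, X, y, cv=kf, scoring="accuracy")
print(f"Accuracies for each fold: {scores}")
print(f"Average accuracy: {scores.mean():.2f}")
```

The explicit loop remains useful when per-fold predictions or fitted models need to be inspected; `cross_val_score` is more compact when only the scores matter.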


### Analyzing the Results

The output shows:
- The **accuracy for each fold** individually, ranging from about 0.93 to 1.00.
- The **average accuracy** across all folds (0.97), giving an overall performance measure for the model.
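
Because each test fold holds 30 samples, the per-fold accuracies can be translated back into misclassification counts, and the spread across folds can be summarized with a standard deviation. The short calculation below uses the accuracies copied from the output above; `test_size` and `errors_per_fold` are names introduced only for this example.

```python
import numpy as np

# Per-fold accuracies copied from the notebook output
accuracies = [1.0, 0.9666666666666667, 0.9666666666666667,
              0.9333333333333333, 0.9666666666666667]
test_size = 30  # 150 samples / 5 folds

# Each misclassified sample lowers a fold's accuracy by 1/30
errors_per_fold = [round((1 - acc) * test_size) for acc in accuracies]
print("Misclassified samples per fold:", errors_per_fold)  # [0, 1, 1, 2, 1]

print(f"Mean accuracy: {np.mean(accuracies):.3f}")
print(f"Std across folds: {np.std(accuracies):.3f}")
```

In total, 5 of the 150 samples are misclassified across the five folds, which matches the average accuracy of about 0.97.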

---

## Advantages of K-Fold Cross-Validation

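1. **More Reliable Estimate**: Averaging over `K` different test sets makes the score less dependent on one particular train/test split than a single hold-out evaluation.
2. **Efficient Use of Data**: Every sample is used for training in `K-1` folds and for testing in exactly one, so no data is set aside permanently.
3. **Versatility**: Applies to any estimator and to both small and large datasets, although it requires `K` model fits, which can be costly for large datasets or slow models.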
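
To illustrate the first point, the sketch below evaluates the same KNN model on several single hold-out splits that differ only in their random seed. The seeds and the 80/20 split ratio are arbitrary choices for this example, not part of the notebook's workflow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(10):  # ten different single train/test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    holdout_scores.append(knn.score(X_test, y_test))  # accuracy on the hold-out set

print(f"Hold-out accuracy: min={min(holdout_scores):.3f}, "
      f"max={max(holdout_scores):.3f}")
```

Any gap between the minimum and maximum shows how much a single split's score depends on which samples happen to land in the test set, which is the variability that averaging over folds is meant to smooth out.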
