# Self-study try-it activity 7.2: Implementing k-fold cross-validation in Python

##### This notebook is divided into two parts:

- Part one guides you through a manual implementation of k-fold cross-validation.

- Part two demonstrates how to achieve the same using built-in tools from `scikit-learn`.

In [None]:
import numpy as np
import pandas as pd
import copy
import matplotlib.pyplot as plt
import time

from sklearn.datasets import load_diabetes, load_iris
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold, cross_val_score
from sklearn import preprocessing

from sklearn.linear_model import LogisticRegression



## Part one: Implementing a manual approach to k-fold cross-validation

### k-fold cross-validation with kernel ridge regression (KRR)

In this section, you will implement k-fold cross-validation using a kernel ridge regression (KRR) model. You are provided with the pre-defined `fit_and_predict_KRR` function to generate model predictions.

The primary goal of this exercise is to practise implementing k-fold cross-validation. The section is structured into five tasks:

1. Create a function to split the data into k folds.

2. Develop a performance metric function using RMSE.

3. Write a function to run cross-validation on the k-folds.

4. Run cross-validation for k={2, ..., 100}, measuring the execution time of each run.

5. Answer the questions in a markdown cell.

**Note:** All required Python packages have already been imported for you in this notebook. 


## Data

The `load_diabetes` data set from `scikit-learn` is a regression data set that contains medical data from 442 diabetes patients. It includes ten baseline clinical features, such as age, sex, body mass index (BMI), average blood pressure and six blood serum measurements. The target variable represents a quantitative measure of disease progression one year after the baseline. This data set is commonly used for regression tasks, particularly for predicting diabetes progression based on clinical input variables.

You can find more details about the data set here: https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset.

In [None]:
#Do not edit this cell

X_, y_ = load_diabetes(return_X_y=True)

#Standardise the data to help you fit the data (this will be covered later in the programme)

scaler = preprocessing.StandardScaler().fit(X_)
X = scaler.transform(X_)[:300, :]

scaler_y = preprocessing.StandardScaler().fit(y_.reshape(-1, 1))
y = scaler_y.transform(y_.reshape(-1, 1))[:300, :]
print(y.shape)

#To ensure the data stays in the correct order, you will work with data frames

columns = [f'x_{i}' for i in range(X.shape[1])] + ['y']
x_columns = [f'x_{i}' for i in range(X.shape[1])]
data = pd.DataFrame(data=np.concatenate([X, y.reshape(-1, 1)], axis=1), columns=columns)
print(data.head())

### Kernel ridge regression (KRR)

KRR is a regression method that extends simple linear regression to handle non-linear data by using kernels. It applies ridge regularisation to manage model complexity and prevent overfitting. In this case, the model uses the RBF (radial basis function) kernel with `gamma=0.2` and a regularisation strength of `alpha=0.1`. The kernel implicitly maps the input data into a higher-dimensional space where a linear model can be fitted more effectively to capture non-linear patterns.
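To make the kernel less abstract, here is a minimal sketch of the RBF kernel, $k(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$, computed by hand and compared with `sklearn.metrics.pairwise.rbf_kernel`. The two points in `A` are made-up values for illustration only, and `gamma = 0.2` is taken from the `fit_and_predict_KRR` code cell below.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points (made-up values, not taken from the diabetes data)
A = np.array([[0.0, 0.0], [1.0, 2.0]])
gamma = 0.2  # same value as in fit_and_predict_KRR

# RBF kernel by hand: k(a, b) = exp(-gamma * ||a - b||^2)
sq_dists = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
K_manual = np.exp(-gamma * sq_dists)

# The same kernel matrix computed by scikit-learn
K_sklearn = rbf_kernel(A, A, gamma=gamma)

print(K_manual)
print(np.allclose(K_manual, K_sklearn))
```

Each point has a kernel value of 1 with itself, and the value shrinks towards 0 as two points move further apart. A larger `gamma` makes that decay faster.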


In [None]:
#Do not edit

def fit_and_predict_KRR(train, validate):
    """Fit a kernel ridge regression model on the training data and predict the y values of the validation X data.
    :param train: pandas dataframe containing the training data
    :param validate: pandas dataframe containing the validation data
    :return: predictions at the validation X points, as a 1-D numpy array of length M (one value per validation row)"""
    X_train = train[x_columns].to_numpy()
    y_train = train['y'].to_numpy()
    X_val = validate[x_columns].to_numpy()

    KRR = KernelRidge(alpha=0.1, kernel='rbf', gamma=0.2, degree=100)  #degree is only used by the polynomial kernel, so it has no effect here
    KRR.fit(X_train, y_train)
    return KRR.predict(X_val)




### To-Do: Create a function to split data into k folds

- Fill in the gaps to create a function that splits the data into k folds.

- Assign the list of fold sizes to `len_folds`.

Hint: Use `np.array_split()`.
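As a small illustration of the hint, the sketch below shows how `np.array_split()` divides ten items into three parts that are as even as possible. The sizes `n_samples = 10` and `k = 3` are arbitrary values chosen for the example.

```python
import numpy as np

n_samples, k = 10, 3  # illustrative sizes only

# array_split allows uneven parts; the earlier parts receive the extra items
parts = np.array_split(np.arange(n_samples), k)
fold_sizes = [len(p) for p in parts]

print(fold_sizes)  # [4, 3, 3]
```

Unlike `np.split()`, which raises an error when the array cannot be divided equally, `np.array_split()` accepts any number of parts.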


In [None]:

def k_folds(data, k):
    """function that returns a list of k folds of the data"""

    ############################
    #Create a list of how long each fold should be. The folds should be as even as possible in size, but some
    #need an extra data point if the total number of data points isn't divisible by k.
    len_folds = [int(sum(x)) for x in np.array_split(np.ones(len(data)), k)]
    ############################

    folds = []
    for i in range(k):
        data_ss = data.sample(n=len_folds[i], random_state=20)
        data = data.drop(data_ss.index)
        folds.append(data_ss)

    return folds

### To-Do: Develop a performance metric function using RMSE

Define the function to compute the RMSE between the predicted and actual target (y) values. Both inputs must be NumPy arrays of the same length, and the function returns a single float. RMSE is calculated as the square root of the mean of the squared differences between the predicted values and the true values.
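A short worked example may help to check an implementation. The numbers below are made up for illustration. The second part of the sketch shows a shape pitfall: subtracting a column vector of shape (3, 1) from a 1-D array of shape (3,) broadcasts to a 3x3 matrix instead of giving three differences.

```python
import numpy as np

prediction = np.array([2.0, 4.0, 6.0])  # made-up predictions
true = np.array([1.0, 4.0, 8.0])        # made-up targets

# Differences are 1, 0 and -2, so the squared errors are 1, 0 and 4
errors = prediction - true
rmse_value = np.sqrt(np.mean(errors ** 2))  # sqrt(5 / 3)
print(rmse_value)

# Shape pitfall: (3,) minus (3, 1) broadcasts to (3, 3)
column = true.reshape(-1, 1)
print((prediction - column).shape)  # (3, 3)
```

Flattening both inputs, for example with `np.ravel()`, before subtracting is one way to avoid the pitfall.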


In [None]:
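def rmse(prediction, true):
    """Return the RMSE between two equally sized arrays as a float."""
    #ravel() stops an (M, 1) array and an (M,) array from broadcasting to (M, M)
    return float(np.sqrt(np.mean(np.square(np.ravel(prediction) - np.ravel(true)))))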



### To-Do: Write a function to run cross-validation on the k-folds

Complete the function `cross_validation(folds)` to run the cross-validation on the KRR model. This function returns the average RMSE for the KRR model.

Hint: To create the training sets, use the concatenation on `(folds[:i]+folds[(i+1):])`.
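The hint can be tried on a toy example first. The sketch below uses a made-up six-row data frame split into three folds (`toy`, `folds_demo` and `splits` are illustrative names, not part of the exercise) and records which rows end up in the validation and training sets on each pass.

```python
import pandas as pd

# Made-up data frame with six rows, split into three folds of two rows each
toy = pd.DataFrame({"x_0": range(6), "y": range(6)})
folds_demo = [toy.iloc[0:2], toy.iloc[2:4], toy.iloc[4:6]]

splits = []
for i, validate_demo in enumerate(folds_demo):
    # List concatenation drops fold i; pd.concat stacks the remaining folds
    train_demo = pd.concat(folds_demo[:i] + folds_demo[i + 1:])
    splits.append((list(validate_demo.index), list(train_demo.index)))
    print(i, splits[-1])
```

Each row appears in exactly one validation set, and never in the training set of the same pass.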



In [None]:
def cross_validation(folds):
    folds = copy.copy(folds) #This creates a shallow copy of the list; the data frames inside it are still shared

    rmses_KRR = []  #This is a list to collect the RMSEs for each fold for the KRR

    for i, fold in enumerate(folds):

        ############################
        #Write code to create the training and validation sets as data frames

        train = pd.concat(folds[:i]+folds[(i+1):])
        validate = fold.copy() #Copy the fold so that adding a column does not modify the original data frame

        ############################

        ############################
        #Use the fit_and_predict_KRR function to create a new column in the validation set for the predictions
        #of the KRR model, with the heading 'KRR_predictions'.

        validate['KRR_predictions'] = fit_and_predict_KRR(train, validate)

        ############################

        ############################
        #Calculate the RMSE and append it to rmses_KRR

        rmses_KRR.append(rmse(validate['KRR_predictions'].to_numpy(), validate['y'].to_numpy()))

        ############################

    RMSE_KRR = np.mean(rmses_KRR) # calculate the average RMSEs for kernel ridge regression

    return RMSE_KRR



For k = 100, calculate the RMSE for the KRR model and print the solution.

Hint: Use `cross_validation(copy.copy(folds))` and assign it to RMSE_KRR.



In [None]:
folds =  k_folds(data, 100)
print(len(folds))
RMSE_KRR = cross_validation(copy.copy(folds))
print(RMSE_KRR)

### To-Do: Run cross-validation for different values of k

- For k={2, ..., 100}, split the data into k folds and run the cross-validation. 

- Save the results from each run in a list. Then, create a plot with k values on the x-axis and RMSE on the y-axis.

- Use the time function (see the example below) to measure how long the cross-validation takes for each value of k. Plot the time against the value of k.

**Try the following upper limits for k (the variable `K` in the code below): 5, 7, 10, 50 and 100**

In [None]:
#Here is a time function example

start = time.time()
print('hello')
end = time.time()
print(end - start)

In [None]:
K = 5  #Largest value of k to try
ks = list(range(2, K + 1))  #range() excludes its end point, so add 1 to include K
KRRs = []
times = []

for k in ks:
    start = time.time()
    folds = k_folds(data, k)
    RMSE_KRR = cross_validation(folds)
    end = time.time()
    KRRs.append(RMSE_KRR)
    times.append(end - start)

#Plot the average RMSE against k
plt.figure()
plt.plot(ks, KRRs, label='KRR')
plt.legend()
plt.title('RMSEs')
plt.ylabel('RMSE')
plt.xlabel('k')

#Plot the run time against k
plt.figure()
plt.plot(ks, times)
plt.title('Time to compute')
plt.ylabel('time (s)')
plt.xlabel('k')
plt.show()

### To-Do: Answer the question

Answer the following question in a markdown cell:

Repeat the above code for k values of 5, 10, 50 and 100 and note the RMSEs and the times taken. Looking at the plots you made, what are the benefits and drawbacks of increasing k?
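When thinking about this question, it can help to work out how the split sizes change with k. The sketch below is plain arithmetic, assuming the 300 rows kept in this notebook's `data` frame; `summary` is an illustrative name.

```python
n_samples = 300  # number of rows kept in this notebook's data frame

summary = {}
for k in (5, 10, 50, 100):
    val_size = n_samples // k          # rows held out in each fold
    train_size = n_samples - val_size  # rows used to fit each model
    summary[k] = (train_size, val_size)
    print(f"k={k}: {k} model fits, {train_size} training rows, {val_size} validation rows per fold")
```

All four values of k divide 300 exactly, so every fold has the same size here.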



### Answer:

In [None]:
# Write your answer here:
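# As k increases, each model is trained on a larger share of the data ((k-1)/k of it), so you get more consistent results, which is good for reproducibility. However, this comes at the cost of computational time, because the model has to be fitted k times.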

## Part two: Using `sklearn` for k-fold cross-validation 

### To-Do: Using `sklearn` for k-fold cross-validation with a `LogisticRegression` classifier

Here, the `Iris` data set, a toy data set provided by `sklearn`, is used for a classification task. The accuracy score used comes from cross-validation.
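Before handing a `KFold` object to `cross_val_score`, it can be useful to see what it produces. The sketch below, which uses a made-up array of ten samples (`X_demo`), shows that `KFold.split()` yields pairs of index arrays, one for training and one for validation, rather than the data itself.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # ten made-up samples with two features
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

val_sizes = []
seen = []
for train_idx, val_idx in kf_demo.split(X_demo):
    # Each pass gives row indices for the training and validation sets
    val_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())
    print("train:", train_idx, "validate:", val_idx)
```

Across the five passes, every sample index appears in exactly one validation set, which mirrors the manual `k_folds` approach from part one.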

In [None]:


#Load the Iris data set (classification)
X, y = load_iris(return_X_y=True)

#Define a logistic regression model
model = LogisticRegression(max_iter=200, random_state=42)

#Define five-fold cross-validation with shuffle for randomness
kf = KFold(n_splits=5, shuffle=True, random_state=42)

#Evaluate the model using cross-validation (default scoring is accuracy for classification)
scores = cross_val_score(model, X, y, cv=kf)

print("Cross-Validation accuracy scores:", scores)
print("Average CV accuracy score:", np.mean(scores))



### To-Do: Using `sklearn` for k-fold cross-validation for regression


- Use the `load_diabetes` data set and the following model: `model = KernelRidge(alpha=0.1, kernel='rbf', gamma=1.0)`.
- Apply k-fold cross-validation with five splits and compute the average cross-validation score.

Note: You can use any of the regression metrics, such as MAE, MSE, $R^2$ or RMSE.
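As one possible way to follow the note above, `cross_val_score` accepts a `scoring` argument. The sketch below assumes a scikit-learn version that supports the `'neg_root_mean_squared_error'` scorer (0.22 or later). Scorers follow a "higher is better" convention, so the RMSE is returned negated and the sign is flipped afterwards. The names `X_reg`, `y_reg`, `model_reg` and `kf_reg` are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold, cross_val_score

X_reg, y_reg = load_diabetes(return_X_y=True)
model_reg = KernelRidge(alpha=0.1, kernel='rbf', gamma=1.0)
kf_reg = KFold(n_splits=5, shuffle=True, random_state=42)

# The scorer returns negative RMSE values, one per fold
neg_rmse = cross_val_score(model_reg, X_reg, y_reg, cv=kf_reg,
                           scoring='neg_root_mean_squared_error')
rmse_scores = -neg_rmse  # flip the sign to recover the RMSE

print("Cross-validation RMSE scores:", rmse_scores)
print("Average CV RMSE:", rmse_scores.mean())
```

Other scorer names, such as `'neg_mean_absolute_error'` or `'r2'`, can be passed in the same way.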

In [None]:


#Load the Diabetes data set (regression)
X, y = load_diabetes(return_X_y=True)

#Define the KRR model with RBF kernel
model = KernelRidge(alpha=0.1, kernel='rbf', gamma=1.0)

#Define five-fold cross-validation with shuffle for randomness
kf = KFold(n_splits=5, shuffle=True, random_state=42)

#Evaluate the model using cross-validation (default scoring is R² for regression)
scores = cross_val_score(model, X, y, cv=kf)

print("Cross-Validation R² scores:", scores)
print("Average CV R² score:", np.mean(scores))
