## 6.3 K-Fold Cross-Validation
This notebook shows how to split a dataset into k folds and how to run k-fold cross-validation with scikit-learn. It is partly adapted from [Satish Gunjal's Kaggle Tutorial](https://www.kaggle.com/code/satishgunjal/tutorial-k-fold-cross-validation#Model-Tuning-using-KFold-), freely distributed on Kaggle under the Apache licence.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn import linear_model, tree, ensemble

In [2]:
# Open the dataset

import kagglehub
import os

# Download the latest version of the dataset
path = kagglehub.dataset_download("yasserh/housing-prices-dataset")
print("Path to dataset files:", path)  # Path to the downloaded folder

filenames = os.listdir(path)
print(filenames)  # Shows the content of the folder

filepath = os.path.join(path, "Housing.csv")
print(filepath)

Path to dataset files: /home/cgraiff/.cache/kagglehub/datasets/yasserh/housing-prices-dataset/versions/1
['Housing.csv']
/home/cgraiff/.cache/kagglehub/datasets/yasserh/housing-prices-dataset/versions/1/Housing.csv


In [3]:
df = pd.read_csv(filepath)
df.head()

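| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |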


In [None]:
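# Keep only the numeric columns of the dataframe
numerical_cols = df.select_dtypes(include=[np.number])
# Target variable: the house price
y = numerical_cols["price"]
# Input features: five numeric attributes of each house
X = numerical_cols[["area", "bedrooms", "stories", "bathrooms", "parking"]]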

print("Shape of input data: {} and shape of target variable: {}".format(X.shape, y.shape))

X.head()

Shape of input data: (545, 5) and shape of target variable: (545,)


| | area | bedrooms | stories | bathrooms | parking |
|---|---|---|---|---|---|
| 0 | 7420 | 4 | 3 | 2 | 2 |
| 1 | 8960 | 4 | 4 | 4 | 3 |
| 2 | 9960 | 3 | 2 | 2 | 2 |
| 3 | 7500 | 4 | 2 | 2 | 3 |
| 4 | 7420 | 4 | 2 | 1 | 2 |


In [None]:
# Choose the number of splits and initialize k-fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# split() generates the indices that divide the data into training and test sets.
for fold, (train_index, test_index) in enumerate(kf.split(X, y), start=1):
    print(f'Fold:{fold}, Train set: {len(train_index)}, Test set:{len(test_index)}')

Fold:1, Train set: 436, Test set:109
Fold:2, Train set: 436, Test set:109
Fold:3, Train set: 436, Test set:109
Fold:4, Train set: 436, Test set:109
Fold:5, Train set: 436, Test set:109
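
As a quick check on what `split()` returns, the sketch below confirms that the five test folds are disjoint from their training sets and together cover every row exactly once. It uses a placeholder array of 545 rows (the size of the dataset above) instead of the housing data, so that it runs on its own; the names `X_demo`, `kf_demo` and `seen` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 545  # same number of rows as the housing data above
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder feature matrix

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []  # test indices collected across all folds
for train_index, test_index in kf_demo.split(X_demo):
    # Within a fold, the training and test indices never overlap
    assert np.intersect1d(train_index, test_index).size == 0
    seen.append(test_index)

all_test = np.concatenate(seen)
# Every row is used for testing exactly once across the 5 folds
print(np.array_equal(np.sort(all_test), np.arange(n_samples)))
```

Because 545 is divisible by 5, every test fold has the same size (109 rows), which matches the output printed above.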


In [13]:
def rmse(score):
    # Convert a (negated) MSE score into an RMSE and print it
    value = np.sqrt(abs(score))
    print(f'rmse= {value:.2f}')

We will now use the `cross_val_score` function from sklearn.
For each of the K folds, this function:

- trains the model on the other K−1 folds

- evaluates it on the held-out fold

It then returns an array of scores, one for each fold; in this case, we choose the MSE as the metric.
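
The steps above can be sketched as an explicit loop. This is a minimal illustration on synthetic data (the arrays `X_demo` and `y_demo` are made up for the example and are not the housing dataset). It is meant to mirror what `cross_val_score` computes with `scoring="neg_mean_squared_error"`, not to reproduce its internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for the housing features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_index, test_index in kf_demo.split(X_demo):
    # Train a fresh model on the K-1 training folds
    model = LinearRegression()
    model.fit(X_demo[train_index], y_demo[train_index])
    # Evaluate on the held-out fold and store the negated MSE
    predictions = model.predict(X_demo[test_index])
    manual_scores.append(-mean_squared_error(y_demo[test_index], predictions))
manual_scores = np.array(manual_scores)

# The library call should give the same five scores for the same splitter
library_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error"
)
print(np.allclose(manual_scores, library_scores))
```

The two arrays agree because `kf_demo` has a fixed `random_state`, so both the loop and `cross_val_score` see the same five splits.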

In [17]:
score = cross_val_score(linear_model.LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold: {score}')
rmse(score.mean())

Scores for each fold: [-2.29272155e+12 -1.72786589e+12 -1.09418238e+12 -1.19537597e+12
 -1.61932184e+12]
rmse= 1259322.64


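> Important: scikit-learn scorers follow the convention that higher values are better, but for the MSE, lower is better. The `neg_mean_squared_error` scorer therefore makes `cross_val_score` return the **negative** MSE, which is why the `rmse` helper takes the absolute value before the square root.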
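
To report the error in the units of the target, the negated scores can also be converted fold by fold. The sketch below reuses the five scores printed above (copied by hand, so rounded to the displayed precision) and compares the mean of the per-fold RMSEs with the square root of the mean MSE, which is the quantity that `rmse(score.mean())` prints. The variable names are illustrative.

```python
import numpy as np

# Negated MSE per fold, copied from the output above (rounded as displayed)
scores = np.array([-2.29272155e12, -1.72786589e12, -1.09418238e12,
                   -1.19537597e12, -1.61932184e12])

mse_per_fold = -scores                 # flip the sign to recover the MSE
rmse_per_fold = np.sqrt(mse_per_fold)  # RMSE of each fold, in price units

print("RMSE per fold:", np.round(rmse_per_fold, 2))
print(f"Mean of per-fold RMSE: {rmse_per_fold.mean():.2f}")
print(f"Square root of mean MSE: {np.sqrt(mse_per_fold.mean()):.2f}")
```

The two summaries differ slightly because the square root is not linear: the mean of the per-fold RMSEs is never larger than the square root of the mean MSE. Recent scikit-learn versions also accept `scoring="neg_root_mean_squared_error"`, which returns the negated per-fold RMSE directly.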