<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Cross Validation

Why do we need cross-validation? 

* A single train/test split can mislead: if the test set isn't representative of the overall dataset, the score depends on which rows happened to land in it.


We're going to split our data into 5 groups or "folds". 

* Fold 1 will be the test set.
* Folds 2, 3, 4 and 5 will be used to fit our model.
* Compute a metric on the test fold: either $R^2$ (a score, higher is better) or MSE (an error, lower is better).

Then we do this 5 times in total, each time changing the test set, so that folds 1, 2, 3, 4, and 5 each get their turn at being the test set, with the model fitted on the other four folds.

This is 5-fold cross validation, although the number of folds you choose is up to you. In general, we call this k-fold cross validation (k-fold CV).
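
The procedure described above can be sketched as an explicit loop. This is only an illustration: the data is synthetic (the `X_demo`, `y_demo` and other `*_demo` names are made up for this sketch, not part of the tips dataset), and the final comparison assumes `KFold` produces the same splits on repeated calls when given an integer `random_state`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=100)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    # Fit on four folds, score (R^2) on the held-out fold
    fold_model = LinearRegression()
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# cross_val_score runs the same loop for us
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo)

print(len(fold_scores))                      # one score per fold
print(np.allclose(fold_scores, auto_scores)) # manual loop vs. helper
```

Writing the loop out by hand is rarely necessary, but it makes clear that every row is used for testing exactly once and for fitting k-1 times.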

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score, KFold


In [2]:
tips_df = pd.read_csv('../../Data/tips.csv')

X = tips_df['tip'].values.reshape(-1,1)  # equivalent: tips_df[['tip']].values
y = tips_df['total_bill'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

model = LinearRegression()
model.fit(X_train, y_train)  # fit on the training split only
model.score(X_test, y_test)  # the R^2 value on the held-out test split

0.38686327140864907
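
To see why a single score like this can be misleading, here is a minimal sketch that repeats the train/test split with different `random_state` values and records how much $R^2$ moves. It uses synthetic data rather than the tips dataset, and the names `X_demo`, `y_demo` and `split_scores` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 10, size=(60, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 4.0, size=60)

split_scores = []
for seed in range(10):
    # Same data, same model; only the rows chosen for the test set change
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    split_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

print(min(split_scores), max(split_scores))
```

The gap between the lowest and highest score comes purely from which rows ended up in the test set, which is the problem cross validation is meant to average out.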

In [3]:
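kf = KFold(n_splits=6, shuffle=True, random_state=42)  # note: 6 folds here, not the 5 described above
model = LinearRegression()
cv_results = cross_val_score(model, X, y, cv=kf)  # default scoring is the estimator's .score(), i.e. R^2 for LinearRegression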

In [4]:
print(np.mean(cv_results), np.std(cv_results))

0.42640467656975733 0.17243184651433097
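
The text above mentions MSE as an alternative to $R^2$. A hedged sketch of how that might look with `cross_val_score`: scikit-learn exposes MSE through the `"neg_mean_squared_error"` scorer, which returns negated values so that higher is always better. The data and the `*_demo` names below are synthetic assumptions, not the tips dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(2)
X_demo = rng.uniform(1, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=100)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# The scorer returns -MSE for each fold
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=kf_demo, scoring="neg_mean_squared_error",
)
mse = -neg_mse  # flip the sign to recover ordinary MSE

print(mse.mean(), mse.std())
```

Unlike $R^2$, MSE is in the squared units of the target, so it is mainly useful for comparing models on the same data.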


In [5]:
print(np.quantile(cv_results, [0.025, 0.975])) # 2.5th and 97.5th percentiles of the fold scores: a rough spread from only 6 values, not a formal confidence interval

[0.11023129 0.56851096]
