<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Cross Validation

Why do we need cross-validation? 

* A single test set might not be representative of the overall dataset, so a score based on it alone can be misleading.


We're going to split our data into 5 groups or "folds". 

* Fold 1 will be a test set. 
* Folds 2,3,4,5 will be used to fit our model
* Compute a score on the test fold: either $R^2$ or MSE

Then we do this 5 times in total, each time changing the test set, so that folds 1, 2, 3, 4, and 5 each get their turn at being the test set, with the data fitted on the other folds.

This is 5-fold cross validation, although the number of folds you choose is up to you. In general, we call this k-fold cross validation (k-fold CV).
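The procedure described above can be sketched as an explicit loop. This is a minimal illustration, assuming synthetic data (`X_demo`, `y_demo` and `kf_demo` are hypothetical names, not the tips dataset used later) and scoring each fold with $R^2$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=100)

# Split the rows into 5 folds; shuffle so fold membership is random
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    # Fit on the 4 training folds...
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # ...and score (R^2) on the fold that was held out
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(fold_scores)           # one R^2 value per fold
print(np.mean(fold_scores))  # the cross-validated R^2
```

Each row ends up in the test set exactly once, and in the training set for the other four rounds.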

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score, KFold


In [3]:
tips_df = pd.read_csv('../../Data/tips.csv')
tips_df

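| Unnamed: 0 | total_bill | tip | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | No | Sat | Dinner | 2 |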


In [4]:
X = tips_df['tip'].values.reshape(-1,1)  # equivalent to: tips_df[['tip']].values
y = tips_df['total_bill'].values

# Select rows for testing

Select rows based on random_state 21, and score the model

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test) # The R^2 value for this model

0.3868632714086492

# Repeat: Selecting different Rows for testing

Select rows based on random_state 1, and score the model

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test) # The R^2 value for this model

0.4236394953603122

# Conclusion - There's a problem

As we've seen above, the $R^2$ score of a model can vary quite a bit (0.387 versus 0.424 here) depending on how we split our data into training and test sets.
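To see how wide this variation can be, the same experiment can be repeated over several seeds. The sketch below assumes synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for the tips columns) and simply collects the test-set $R^2$ for ten different values of `random_state`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, fairly noisy data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 6.0, size=200)

scores = []
for seed in range(10):
    # Same data, same model; only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # R^2 on this particular test set

print(min(scores), max(scores))  # spread caused by the split alone
```

The gap between the lowest and highest score comes entirely from which rows happened to land in the test set.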

The problem can be exaggerated if the data has been sorted before we read it in, for example by target value in Excel. If the rows were then split in order, without shuffling, the training set would contain the smaller values and the test set would contain the larger values.
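The effect of sorted data can be sketched as follows. Note that scikit-learn's `train_test_split` shuffles by default, so the issue shows up when shuffling is off, as it is by default in `KFold`. The data here is synthetic and the variable names are hypothetical; this is an illustration under those assumptions, not a result from the tips dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y_raw = 2.0 * x + rng.normal(0, 5.0, size=500)

# Simulate a file that was sorted by target value before being saved
order = np.argsort(y_raw)
X_sorted = x[order].reshape(-1, 1)
y_sorted = y_raw[order]

# Folds taken in file order: each test fold covers only a narrow band of targets
unshuffled = cross_val_score(
    LinearRegression(), X_sorted, y_sorted, cv=KFold(n_splits=5)
)

# Folds drawn at random: each test fold covers the full range of targets
shuffled = cross_val_score(
    LinearRegression(), X_sorted, y_sorted,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

print("unshuffled mean R^2:", unshuffled.mean())
print("shuffled mean R^2:  ", shuffled.mean())
```

Setting `shuffle=True` is usually the safer choice when the row order of a file is unknown.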

Even if the data is in random order, a single split can still be unrepresentative purely by chance, which biases the score we report.


# Solution - Cross Validation!

The solution to our problem is to perform cross validation. 

There are a few different approaches to cross validation. The type we use here is called "k-fold cross validation".

In cross validation, the data is split into k folds. For the purposes of this example, we'll set k=5 (5 folds).
* Imagine that we have 1000 rows of data (1000 observations).
* We split the data into 5 folds. Each time, the test set will be $\frac{1000}{5} = 200$ rows and the training set will be the remaining 800 rows.
* Using cross validation, we then train and test the model 5 times, with each fold taking its turn as the test set.

![K Folds](../../Images/kfolds.png)
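The fold sizes in the 1000-row example can be checked directly. This is a small sketch assuming a placeholder array of 1000 row numbers (`rows` and `kf_demo` are hypothetical names); it only inspects the indices that `KFold` produces and fits no model:

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 1000
rows = np.arange(n_rows).reshape(-1, 1)  # placeholder data: just row numbers

kf_demo = KFold(n_splits=5)
times_tested = np.zeros(n_rows, dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf_demo.split(rows), start=1):
    times_tested[test_idx] += 1  # count how often each row is in a test set
    print(f"Fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows")

# Every row is used for testing exactly once across the 5 folds
print(np.all(times_tested == 1))
```

Each round uses 800 rows for training and 200 for testing, and no row is ever tested twice.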

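The advantage of k-fold cross validation is that every row of the dataset is used for both training and testing: each row is in the test set exactly once and in the training set for the other folds. Using this method, we can then calculate the score as an average across all the individual folds.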

In [8]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()
cv_results = cross_val_score(model, X, y, cv=kf) 
cv_results # R^2 is the default score for a regression model


array([0.51345454, 0.25264114, 0.61200554, 0.47696091, 0.26079405])
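If MSE is preferred over $R^2$, `cross_val_score` accepts a `scoring` argument. The sketch below assumes synthetic data (`X_demo`, `y_demo` and `kf_demo` are hypothetical names). scikit-learn reports this metric as a negative number so that higher is always better, which is why the sign is flipped afterwards:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=100)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Ask for (negated) mean squared error instead of the default R^2
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=kf_demo, scoring="neg_mean_squared_error",
)

mse_per_fold = -neg_mse  # flip the sign to get ordinary MSE values
print(mse_per_fold)
print(mse_per_fold.mean())
```

Unlike $R^2$, MSE is in squared units of the target, so lower values mean a better fit.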

Calculate the mean and standard deviation of $R^2$ 

In [17]:
print(np.mean(cv_results), np.std(cv_results))

0.42640467656975733 0.17243184651433097


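We can also calculate the range that covers the middle 95% of the fold scores (the 2.5th and 97.5th percentiles). With only 5 scores this is a rough measure of spread rather than a formal confidence interval.

In [18]:
print(np.quantile(cv_results, [0.025, 0.975])) # The 2.5th and 97.5th percentiles of the fold R^2 scores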

[0.11023129 0.56851096]


And finally the min and max $R^2$ across the 5 folds.

In [25]:
print(np.min(cv_results), np.max(cv_results))

0.0766752872832972 0.573307727387445


# Conclusion

Discuss: 
* what are your thoughts on Cross Validation in general?
* what are your thoughts on Cross Validation as applied to this particular dataset? 