# Cross Validation

Cross validation is a technique for evaluating the performance of a machine learning model. It is a resampling procedure that is especially useful when only a limited amount of data is available. The procedure has a single parameter, k, which is the number of groups that the data sample is split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it replaces k in the name of the method; for example, k=10 gives 10-fold cross-validation.

Types of Cross Validation:
1. K-Fold Cross Validation
2. Stratified K-Fold Cross Validation
3. Leave One Out Cross Validation
4. Time Series Cross Validation
5. Group Cross Validation
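
As a rough orientation, each of these types has a corresponding splitter class in `sklearn.model_selection`. The sketch below runs them on a small made-up array (the toy `X`, `y`, and `groups` are assumptions for illustration, not data from this notebook) and counts how many train/test splits each one produces.

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold,
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    TimeSeriesSplit,
)

# Toy data (hypothetical): 12 rows, 2 balanced classes, 4 groups of 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

splitters = {
    "K-Fold": KFold(n_splits=4),
    "Stratified K-Fold": StratifiedKFold(n_splits=4),
    "Leave One Out": LeaveOneOut(),
    "Time Series": TimeSeriesSplit(n_splits=4),
    "Group": GroupKFold(n_splits=4),
}

n_splits = {}
for name, splitter in splitters.items():
    # Only the group-aware splitter needs the groups array.
    kwargs = {"groups": groups} if name == "Group" else {}
    n_splits[name] = sum(1 for _ in splitter.split(X, y, **kwargs))
    print(f"{name}: {n_splits[name]} train/test splits")
```

Leave One Out produces one split per row, while the other splitters produce the number of splits requested through `n_splits`.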

## K-Fold Cross Validation

In K-Fold cross validation, the data is divided into k subsets. The holdout method is then repeated k times: each time, one of the k subsets serves as the test (validation) set and the other k-1 subsets are combined to form the training set. The error estimate is averaged over all k trials to give an overall measure of the model's effectiveness.
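
The repeated holdout described above can be written out as an explicit loop. This is a minimal sketch, assuming the iris dataset, a `GaussianNB` classifier, and accuracy (rather than error) as the per-fold metric; the `shuffle=True` and `random_state=0` settings are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Assumed settings: 5 folds, shuffled with a fixed seed for reproducibility.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train a fresh model on the k-1 training folds...
    model = GaussianNB()
    model.fit(X[train_idx], y[train_idx])
    # ...and score it on the single held-out fold.
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print("Score for each fold:", fold_scores)
print("Mean score:", fold_scores.mean())
```

A new model is created inside the loop so that no fitted state carries over from one fold to the next.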


### Summary

In summary, the k-fold cross-validation technique involves splitting the training dataset into k subsets called folds. The value of k can be at most the number of rows in the training dataset; setting k equal to the number of rows gives leave-one-out cross validation. The model is trained on k-1 of the folds and evaluated on the remaining fold. This process is repeated k times, with each of the k folds used exactly once as the validation data. The k results can then be averaged to produce a single estimate of model effectiveness.
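
To make the partitioning itself concrete, here is a small NumPy-only sketch. The helper `k_fold_indices` is hypothetical (it is not part of scikit-learn); it rejects any k larger than the number of rows, and the loop afterwards counts how often each row lands in a validation fold.

```python
import numpy as np


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k folds (hypothetical helper)."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_rows), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_rows=10, k=3))

# Count how many times each row is used for validation.
val_counts = np.zeros(10, dtype=int)
for train_idx, val_idx in splits:
    val_counts[val_idx] += 1
    print(len(train_idx), "training rows,", len(val_idx), "validation rows")
```

Because the folds partition the shuffled indices, every row is counted as validation data exactly once across the k rounds.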

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

In [35]:
%%time
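iris = load_iris()  # Bunch object with .data (features) and .target (labels)

nb = GaussianNB()
# cv=5 runs 5-fold cross validation; for a classifier, an integer cv uses stratified folds
scores = cross_val_score(nb, iris.data, iris.target, cv=5, scoring='accuracy')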

# print the scores to evaluate the model
print('score for each fold', scores)
print('Mean score:', (scores.mean()))
print('Standard deviation:', (scores.std()))

score for each fold [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
Mean score: 0.9533333333333334
Standard deviation: 0.02666666666666666
CPU times: total: 62.5 ms
Wall time: 68 ms


In [36]:
tips = sns.load_dataset('tips')
tips.head()

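|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |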


In [37]:
# Numeric features, and the categorical target to predict
X = tips[['total_bill', 'tip', 'size']]
y = tips['sex']

nb = GaussianNB()
# cv=10 runs 10-fold cross validation
scores = cross_val_score(nb, X, y, cv=10, scoring='accuracy')

# print the scores to evaluate the model
print('score for each fold', scores)
print('Mean score:', (scores.mean()))
print('Standard deviation:', (scores.std()))

score for each fold [0.56       0.48       0.6        0.48       0.83333333 0.58333333
 0.625      0.75       0.70833333 0.33333333]
Mean score: 0.5953333333333333
Standard deviation: 0.1381581219714088
