# K-Fold Cross Validation

The basic idea is to split the data into a training dataset and a test dataset, train the model on the training dataset, and evaluate its performance on the test dataset. A single train/test split still has limitations: the score depends on which samples happen to land in the test set, and you can end up overfitting to that particular split.

K-Fold Cross Validation divides the data into K buckets (folds). Each bucket takes a turn as the test dataset while the model is trained on the remaining K-1 buckets combined and scored against the held-out bucket. Finally, take the average of the K scores (r-squared for a regression model, accuracy for a classifier like the one used below).
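The bucket rotation described above can be sketched in plain Python. This is only an illustration of the index bookkeeping: `k_fold_indices` is a hypothetical helper, not a scikit-learn function, and it uses contiguous, unshuffled folds, whereas in practice the data is usually shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    # Spread any remainder so the first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        # The current bucket is the test set; everything else is training data
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Each sample appears in exactly one test fold and in K-1 training sets
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, "test:", test_idx, "train:", train_idx)
```

A model would be fit on `train_idx` and scored on `test_idx` once per fold, and the K scores averaged.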

Import the Iris dataset:

In [6]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()

A single train/test split is made easy with the train_test_split function in the model_selection module:

In [7]:
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)   

0.9666666666666667

It does really well! Over 96% of the time, the model correctly predicts the species of an Iris it has never seen before, based only on its measurements.

It's a small dataset, though: with 40% of 150 samples held out, this score rests on just 60 test samples, so it's possible that we got lucky with this particular split or are overfitting to it.

Now let's try K-Fold cross validation with a K of 5:

In [8]:
# Give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 5 folds:
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


The model is even better than we thought! Averaged over the 5 folds, it correctly predicts the species of an Iris 98% of the time.
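To make the mechanics concrete, here is a rough sketch of what a 5-fold run does, written as an explicit loop. It assumes stratified, unshuffled folds, which is what scikit-learn uses when `cv` is an integer and the estimator is a classifier; names such as `model`, `manual_scores` and `library_scores` are only illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
model = svm.SVC(kernel='linear', C=1)

manual_scores = []
# Each iteration holds out one fold for testing and trains on the other four
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(iris.data, iris.target):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(iris.data[train_idx], iris.target[train_idx])
    manual_scores.append(fold_model.score(iris.data[test_idx], iris.target[test_idx]))
manual_scores = np.array(manual_scores)

# The library call should agree with the hand-written loop
library_scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(manual_scores, manual_scores.mean())
print(np.allclose(manual_scores, library_scores))
```

Writing the loop out is mainly useful when you need something per fold that `cross_val_score` does not return, such as the fitted models or the predictions themselves.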

Can we do better? Let's try a different kernel (poly):

In [9]:
# build SVC model using different kernel
clf = svm.SVC(kernel='poly', C=1)

# cross_val_score the model
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


The more complex polynomial kernel produced exactly the same per-fold scores, and the same 98% mean accuracy, as the simple linear kernel. The extra complexity buys nothing here.

Compare that with a single train/test split:

In [10]:
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)   

0.9

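That's lower than the 0.967 the linear kernel scored on the same split, even though 5-fold cross validation scored the two kernels identically. A single train/test split can exaggerate or hide differences between models, which is why the cross-validated score is the more trustworthy one.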
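One way to see how much a single-split score depends on which rows land in the test set is to repeat the split with several seeds. The sketch below is an illustration only: the seed range of 0 to 9 and the names `split_scores`, `X_tr` and `X_te` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = []
for seed in range(10):
    # Same 60/40 split as before, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    clf_seed = svm.SVC(kernel='poly', C=1).fit(X_tr, y_tr)
    split_scores.append(clf_seed.score(X_te, y_te))
split_scores = np.array(split_scores)

# The spread between min and max shows how noisy one split can be
print("min: ", split_scores.min())
print("max: ", split_scores.max())
print("mean:", split_scores.mean())
```

If the minimum and maximum differ noticeably, any one of those splits taken alone would give a misleading picture of the model.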