# Cross Validation
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeated the labels of the samples it had just seen would achieve a perfect score yet fail to predict anything useful on unseen data. This situation is called overfitting. To avoid it, it is common practice in a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test, y_test`. The word “experiment” is not meant to denote academic use only: even in commercial settings, machine learning usually starts out experimentally.

In a typical cross-validation workflow for model training, the best parameters can be determined by grid search techniques.
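As a minimal sketch of the memorization problem described above, the snippet below fits a fully grown decision tree to a training split of the iris data and scores it on both the training samples and the held-out samples. The choice of `DecisionTreeClassifier`, the 40% split, and the variable names are assumptions made for illustration; they are not part of the original experiment.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load iris and hold out 40% of the samples (illustrative choice)
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# An unpruned tree can effectively memorize the training labels
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Score on the data the model has seen versus data it has not seen
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)
print(f"train accuracy: {train_score:.3f}")
print(f"test accuracy:  {test_score:.3f}")
```

The training score says little about generalization; only the held-out score estimates how the model behaves on unseen data.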

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
import os
import sys

import plotly.graph_objects as go

# Make the parent directory importable so the local `erudition` helpers can be found
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from erudition.learning.helpers.plots.plotly_render import render, scatter

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

In [7]:
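# Hold out 40% of the samples as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

# Fit a linear-kernel SVM on the training split and score it on the held-out split
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)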

0.9666666666666667

In [8]:
from sklearn.model_selection import cross_val_score

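clf1 = svm.SVC(kernel='linear', C=1)
# 5-fold cross-validation: fit on 4 folds, score on the remaining fold, 5 times.
# Note: this call cross-validates on the 60-sample test split (X_test, y_test),
# which is what produced the output below. In practice, cross-validation is
# normally run on the training split or on the full dataset instead.
scores = cross_val_score(clf1, X_test, y_test, cv=5)
scores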

array([1.        , 1.        , 1.        , 0.90909091, 0.90909091])

In [9]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.96 (+/- 0.09)
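To make explicit what `cross_val_score` computes, here is a rough sketch that runs the same kind of 5-fold loop by hand and compares the result with the library call. It assumes the full iris dataset rather than the test split used above, and the names `folds`, `manual_scores`, and `reference_scores` are illustrative only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# For a classifier and an integer cv, cross_val_score uses
# unshuffled stratified folds, so we do the same here
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    # Fit a fresh copy on the training folds, score on the held-out fold
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should reproduce the hand-written loop
reference_scores = cross_val_score(clf, X, y, cv=5)
print(manual_scores)
print(reference_scores)
```

Each entry is the accuracy on one held-out fold; the mean and standard deviation of these entries are what the accuracy summary reports.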


In [10]:
# 10 candidate values for C, log-spaced between 1e-10 and 1
Cs = np.logspace(-10, 0, 10)

In [11]:
Cs

array([1.00000000e-10, 1.29154967e-09, 1.66810054e-08, 2.15443469e-07,
       2.78255940e-06, 3.59381366e-05, 4.64158883e-04, 5.99484250e-03,
       7.74263683e-02, 1.00000000e+00])

In [25]:
scores_mean = []
scores_std = []

digits = datasets.load_digits()

X = digits.data
y = digits.target

# Cross-validate a linear SVM for each candidate C and record mean and spread
for c in Cs:
    print(c)
    clf1 = svm.SVC(kernel='linear', C=c)
    scores = cross_val_score(clf1, X, y, cv=5)
    scores_mean.append(np.mean(scores))
    scores_std.append(np.std(scores))


1e-10
1.2915496650148826e-09
1.6681005372000592e-08
2.1544346900318867e-07
2.782559402207126e-06
3.5938136638046256e-05
0.0004641588833612782
0.005994842503189421
0.07742636826811278
1.0


In [32]:
plot_mean = scatter(Cs, scores_mean, 'C Scores Mean', mode='lines')
plot_plus_std = scatter(Cs, np.array(scores_mean)+np.array(scores_std), '+std', mode='lines')
plot_minus_std = scatter(Cs, np.array(scores_mean)-np.array(scores_std), '-std', mode='lines')
fig = go.Figure(data=[plot_mean, plot_plus_std, plot_minus_std])
fig.update_layout(xaxis_type="log", yaxis_type="log")
render(fig, 'Cross-validation with an SVM on the Digits dataset.', width=1400, height=800, x_axis_title='C Parameter', y_axis_title='CV Score')
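The manual sweep over `Cs` above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same cross-validation for every candidate and keeps track of the best one. This is a sketch under the same settings as the loop (linear `SVC`, digits data, 5 folds); the variable names `search`, `mean_scores`, and `std_scores` are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_digits(return_X_y=True)
Cs = np.logspace(-10, 0, 10)

# Cross-validate every candidate C with 5 folds, then refit on all data
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid={'C': Cs}, cv=5)
search.fit(X, y)

# Best candidate and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# Per-candidate mean and standard deviation, in the same order as Cs
mean_scores = search.cv_results_['mean_test_score']
std_scores = search.cv_results_['std_test_score']
```

The `mean_scores` and `std_scores` arrays play the same role as `scores_mean` and `scores_std` in the loop, so they could be plotted in the same way.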