### Understanding validation

Reference: https://scikit-learn.org/stable/modules/cross_validation.html

In [2]:
# load the required libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [3]:
# load the dataset
iris = datasets.load_iris()

In [4]:
# check data structure
iris.data.shape, iris.target.shape

((150, 4), (150,))

Now sample a training set while holding out 40% of the data for testing (evaluating) the classifier:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

In [6]:
X_train.shape, y_train.shape

((90, 4), (90,))

In [7]:
X_test.shape, y_test.shape

((60, 4), (60,))

In [8]:
# Train a linear-kernel SVM classifier on the training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

In [9]:
# Test the classifier on testing data
clf.score(X_test, y_test)

0.9666666666666667
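
The score above comes from a single random split, so it depends on which samples happen to land in the test set. As a rough illustration (the seed values and the names `holdout_scores`, `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices for this sketch, not part of the original notebook), repeating the hold-out split with a few different seeds shows how much the estimate can move:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Repeat the 60/40 hold-out split with several seeds (arbitrary values)
holdout_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    # Same classifier settings as in the cell above
    model = svm.SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

# One accuracy value per seed; the values generally differ between splits
print(holdout_scores)
```

Averaging over several splits in a structured way is what cross-validation does.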

The simplest way to use cross-validation is to call the `cross_val_score` helper function on the estimator and the dataset.

In [10]:
from sklearn.model_selection import cross_val_score

In [11]:
clf = svm.SVC(kernel='linear', C=1)

In [12]:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

In [13]:
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])
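
To make clear what `cross_val_score` does with `cv=5`, the following sketch spells out the loop by hand. It assumes that, for a classifier and an integer `cv`, scikit-learn uses unshuffled `StratifiedKFold` splits; the names `skf`, `fold_clf`, `manual_scores` and `reference_scores` are only for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Five stratified folds: each fold keeps the class proportions of the full set
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(iris.data, iris.target):
    # Fit a fresh, unfitted copy of the estimator on four folds
    fold_clf = clone(clf)
    fold_clf.fit(iris.data[train_idx], iris.target[train_idx])
    # Score it on the one fold that was held out
    manual_scores.append(
        fold_clf.score(iris.data[test_idx], iris.target[test_idx]))
manual_scores = np.array(manual_scores)

# Compare with the helper function
reference_scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(manual_scores)
print(reference_scores)
```

Each sample is used for testing exactly once and for training four times, so every score comes from data the corresponding model has not seen.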

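The mean score and a rough spread of the score estimate are given below. The spread is two standard deviations of the fold scores, which approximates a 95% interval only if the fold scores are roughly normally distributed; with five folds it is a heuristic rather than a formal confidence interval.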

In [14]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)
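
The summary above can be unpacked a little further. This sketch re-enters the five fold scores from the output shown earlier and computes the mean, the standard deviation that `scores.std()` returns (NumPy's default `ddof=0`), and, as an additional quantity not used in the original notebook, the standard error of the mean based on the sample standard deviation (`ddof=1`). The names `mean_score`, `spread` and `sem` are illustrative.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above
scores = np.array([0.96666667, 1.0, 0.96666667, 0.96666667, 1.0])

mean_score = scores.mean()

# scores.std() uses ddof=0 by default (population standard deviation)
spread = scores.std()

# Standard error of the mean, using the sample standard deviation (ddof=1)
sem = scores.std(ddof=1) / np.sqrt(len(scores))

print("mean: %0.3f" % mean_score)
print("2 * std (ddof=0): %0.3f" % (2 * spread))
print("standard error of the mean: %0.3f" % sem)
```

Note that the fold scores are not independent, since the training sets overlap, so either number is best read as a rough indication of variability.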
