### 3.1. Cross-validation: evaluating estimator performance

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply repeats the labels of the samples it has just seen would get a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice in a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test, y_test`.

![Grid search workflow](https://scikit-learn.org/stable/_images/grid_search_workflow.png)
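
As a rough illustration of the overfitting point above, the sketch below scores one classifier twice: once on the data it was trained on and once on a held-out split. The choice of a 1-nearest-neighbour model and the names `memorizer`, `train_acc` and `test_acc` are assumptions made for this example, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# A 1-nearest-neighbour classifier effectively memorises the training labels:
# the nearest neighbour of every training sample is the sample itself.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = memorizer.score(X_train, y_train)  # scored on data already seen
test_acc = memorizer.score(X_test, y_test)     # scored on held-out data
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The training score says little about how the model generalises; only the held-out score is a meaningful estimate.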

In [5]:
# import libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

In [6]:
X, y = datasets.load_iris(return_X_y=True)
print('Number of columns:', X.shape[1])
print('Number of rows:', X.shape[0])

Number of columns: 4
Number of rows: 150


- We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=30)
X_train.shape, y_train.shape


((90, 4), (90,))

In [8]:
X_test.shape, y_test.shape

((60, 4), (60,))

In [9]:
# Fit an RBF-kernel SVM on the training split, then report accuracy on the held-out test split
clf = svm.SVC(kernel='rbf', C=3).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9666666666666667

#### 3.1.1. Computing cross-validated metrics

In [10]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)
score = cross_val_score(clf, X, y, cv=5)
print(score)

print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
print("%0.2f accuracy with a standard deviation of %0.2f" % (score.mean(), score.std()))

# Accuracy in percentage
print("Accuracy:%0.2f " % (score.mean()*100))
print('Standard Deviation:%0.2f' % (score.std()*100))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)
0.98 accuracy with a standard deviation of 0.02
Accuracy:98.00 
Standard Deviation:1.63


In [11]:
from sklearn import metrics
score = cross_val_score(clf, X, y, cv=3, scoring='f1_micro')
print(score)

[1.   1.   0.98]


In [16]:
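from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=6, test_size=0.3, random_state=0)
# Note: the ShuffleSplit object `cv` is defined above but cv=5 is passed here,
# so the scores shown are from plain 5-fold CV; pass cv=cv to use the shuffle splits.
cross_val_score(clf, X, y, cv=5)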

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])
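
To evaluate on the random splits defined by a `ShuffleSplit` object, the object itself has to be passed as the `cv` argument. A minimal sketch, reusing the same estimator settings and split parameters as the cell above (the name `shuffle_scores` is introduced here for illustration):

```python
from sklearn import datasets, svm
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Six independent random splits, each holding out 30% of the samples for testing
cv = ShuffleSplit(n_splits=6, test_size=0.3, random_state=0)

# Passing the splitter object (rather than an integer) makes cross_val_score use it
shuffle_scores = cross_val_score(clf, X, y, cv=cv)
print(shuffle_scores)
```

This should yield one score per split, so six values here rather than the five produced by `cv=5`.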

In [17]:
from sklearn import preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
# Fit the scaler on the training split only, so no test-set statistics leak into training
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
# Apply the same (training-derived) scaling to the test split before scoring
X_test_transformed = scaler.transform(X_test)
clf.score(X_test_transformed, y_test)

0.9333333333333333
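
The same fit-on-train, apply-to-test pattern can be carried into cross-validation by wrapping the scaler and the classifier in a pipeline, so that the scaler is refit on the training part of every fold. The following is a sketch under that assumption, using `make_pipeline` from `sklearn.pipeline`; the names `pipe` and `pipe_scores` are illustrative.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)

# Scaling and classification are chained, so each CV fold fits the scaler
# on its own training portion before the SVM is trained
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

pipe_scores = cross_val_score(pipe, X, y, cv=5)
print(pipe_scores)
print("mean accuracy: %0.2f" % pipe_scores.mean())
```

Scaling the full dataset once before calling `cross_val_score` would instead let statistics from each test fold influence training.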