## Review: Why use training and testing data?

* Gives an estimate of performance on an independent dataset;
* Serves as a check on overfitting.

## Splitting data between training and test data

* [sklearn cross validation](https://scikit-learn.org/stable/modules/cross_validation.html)

In [10]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

In [11]:
features_train, features_test, labels_train, labels_test = \
train_test_split(iris.data, iris.target, test_size=0.4, \
                 random_state=0)
features_train.shape, features_test.shape, \
labels_train.shape, labels_test.shape

((90, 4), (60, 4), (90,), (60,))

In [12]:
clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)

print(clf.score(features_test, labels_test))

0.9666666666666667
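To make the overfitting check concrete, a minimal sketch (reusing the same split and classifier as above; the names `train_accuracy` and `test_accuracy` are illustrative) compares accuracy on the data the model was fit on with accuracy on the held-out data. A training score far above the test score would suggest overfitting.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)

# Accuracy on data the model has already seen vs. data it has never seen.
train_accuracy = clf.score(features_train, labels_train)
test_accuracy = clf.score(features_test, labels_test)
print(train_accuracy, test_accuracy)
```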


## Training, transforms, predicting

train/test split (training_features, test_features) > PCA (`pca.fit`, `pca.transform`) > SVM (`svc.fit`, `svc.predict`)

**train**
```
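# learn the principal components from the training data only
pca.fit(training_features)
training_features_pca = pca.transform(training_features)
# the classifier is fit on the transformed features and their labels
svc.fit(training_features_pca, training_labels)
```

**test** (no `pca.fit`, use the same PCs as train data)
```
# project the test data onto the PCs learned from the training data
test_features_pca = pca.transform(test_features)
svc.predict(test_features_pca)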
```
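The sequence above can be sketched end to end on the iris data. This is only an illustration under assumptions not in the notes: `n_components=2` is an arbitrary choice, and the label variables `training_labels` and `test_labels` are named here for the example.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
training_features, test_features, training_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

pca = PCA(n_components=2)  # assumption: 2 components, chosen arbitrarily
svc = SVC(kernel="linear", C=1.)

# train: fit the PCA on the training data only, then fit the SVM on the projection
pca.fit(training_features)
training_features_pca = pca.transform(training_features)
svc.fit(training_features_pca, training_labels)

# test: reuse the fitted PCA (no second pca.fit), then predict
test_features_pca = pca.transform(test_features)
predictions = svc.predict(test_features_pca)
print(svc.score(test_features_pca, test_labels))
```

Fitting the PCA on the test data as well would leak information about the test set into the transform, which defeats the purpose of holding it out.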

## K-fold cross validation

* Problem with a single train/test split: every data point held out for testing is lost for training, and vice versa, so the two sets compete for the same data;
* Partition the data set into k bins of equal size;
* Run k separate learning experiments (e.g. k=10). In each one:
    * Pick one of the k bins as the testing data;
    * Put the remaining k - 1 bins together into the training set.
    
Pick testing set > train > test on testing set

Average the test results from those k experiments.
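The procedure can be written out by hand as a rough sketch. It assumes the iris data and the linear `SVC` used earlier, and it shuffles the indices first (an addition for this example, since iris is stored sorted by class).

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
k = 10

# Shuffle the sample indices, then cut them into k bins of equal size (150 / 10 = 15).
rng = np.random.RandomState(0)
indices = rng.permutation(len(iris.data))
bins = np.array_split(indices, k)

scores = []
for i in range(k):
    # One bin is the testing data, the other k - 1 bins form the training set.
    test_idx = bins[i]
    train_idx = np.concatenate([bins[j] for j in range(k) if j != i])
    clf = SVC(kernel="linear", C=1.)
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    scores.append(clf.score(iris.data[test_idx], iris.target[test_idx]))

# Average the test results from the k experiments.
mean_score = np.mean(scores)
print(mean_score)
```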

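**train/test**: min training time, min run time (one fit, one evaluation);

**10-fold c.v.**: max accuracy of the performance estimate (every data point is used for both training and testing), at the cost of ten fits and ten evaluations.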

* [K-fold on sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html);
* `n_splits` sets the number of folds to split the data into;
* `split` yields one pair of index arrays per fold: the first holds the indices of the data points to use as the training set, the second the indices of the data points to use as the test set;
* Folds are taken in order by default. Pass `shuffle=True` to shuffle the data before splitting, which matters when the data is sorted by class.
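A short sketch of why shuffling matters, assuming the iris data (which is stored sorted by class, 50 samples each) and a hypothetical helper `fold_scores` defined only for this example. With 3 unshuffled folds, each test fold contains exactly the one class that is missing from its training set, so every prediction is wrong.

```python
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC

iris = datasets.load_iris()

def fold_scores(kf):
    """Fit a linear SVC on each training fold and score it on the matching test fold."""
    scores = []
    for train_indices, test_indices in kf.split(iris.data):
        clf = SVC(kernel="linear", C=1.)
        clf.fit(iris.data[train_indices], iris.target[train_indices])
        scores.append(clf.score(iris.data[test_indices], iris.target[test_indices]))
    return scores

# Unshuffled: folds are the index ranges 0-49, 50-99, 100-149, i.e. one class each.
unshuffled_scores = fold_scores(KFold(n_splits=3))
# Shuffled: each fold gets a mix of all three classes.
shuffled_scores = fold_scores(KFold(n_splits=3, shuffle=True, random_state=0))

print(unshuffled_scores)
print(shuffled_scores)
```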

## GridSearchCV

Systematically evaluates every combination of the given parameter values, using cross-validation, to find the combination with the best performance.

* [GridSearchCV from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [1]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

In [3]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='scale')
clf = GridSearchCV(svr, parameters, cv=5)
clf.fit(iris.data, iris.target)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [5]:
clf.best_params_

{'C': 1, 'kernel': 'linear'}
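One possible way to combine the grid search with the train/test split from earlier: run the search on the training data only and keep the test set for a final check. This is a sketch under that assumption, not something the notes prescribe. With the default `refit=True`, the fitted `GridSearchCV` object scores and predicts with the best estimator it found.

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(svm.SVC(gamma='scale'), parameters, cv=5)

# The 5-fold cross-validation happens inside the training data only.
clf.fit(features_train, labels_train)

print(clf.best_params_)   # best combination found on the training folds
print(clf.best_score_)    # mean cross-validated accuracy of that combination
test_score = clf.score(features_test, labels_test)  # accuracy on the held-out test set
print(test_score)
```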