# Cross-Validation

To evaluate our supervised models, we have more options than a single call to the train_test_split function.

Cross-validation is a method of evaluating generalization performance that
is more stable and thorough than a single split into a training and a test set. In cross-validation,
the data is instead split repeatedly and multiple models are trained. For example, when performing five-fold cross-validation,
the data is first partitioned into five parts of (approximately) equal size, called folds.
Next, a sequence of models is trained. The first model uses the first fold as
the test set and the remaining folds as the training set: it is
built using the data in folds 2–5, and its accuracy is then evaluated on fold 1. This is repeated for each of the other folds in turn.
For each of these five splits of the data into training and test sets, we compute the
accuracy, which gives five accuracy values in total.
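
The procedure described above can be sketched by hand. This is only a minimal illustration, not how scikit-learn implements it internally; the shuffling step, the fixed seed, and the names `folds` and `manual_scores` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Shuffle the sample indices (an assumption of this sketch), then cut them into five folds.
rng = np.random.RandomState(0)
indices = rng.permutation(len(X))
folds = np.array_split(indices, 5)

manual_scores = []
for i, test_idx in enumerate(folds):
    # Every fold except fold i forms the training set.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Accuracy on the held-out fold.
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

print("Accuracy per fold:", np.round(manual_scores, 3))
print("Mean accuracy: {:.2f}".format(np.mean(manual_scores)))
```

Each sample ends up in the test set exactly once, so every data point contributes to the evaluation.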

Cross-validation is implemented in scikit-learn using the cross_val_score function
from the model_selection module. The parameters of the cross_val_score
function are the model we want to evaluate, the training data, and the labels.
By default, recent versions of scikit-learn perform five-fold cross-validation.

Let’s evaluate a LogisticRegression classifier on the iris dataset:

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores: {}".format(scores))

Cross-validation scores: [0.96666667 1.         0.93333333 0.96666667 1.        ]


In [7]:
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Average cross-validation score: 0.97


Using the mean cross-validation score, we can conclude that we expect the model to be
around 97% accurate on average. Looking at all five scores produced by the five-fold
cross-validation, we can also see that the accuracy varies noticeably
between folds, ranging from 93% to 100%.

Cross-validation is especially useful when we don't have enough data to set aside a decent test set. The range of the accuracies returned also gives us a rough idea of the worst-case and best-case accuracy we might expect from the model.
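
As a small illustration of reading the spread of the scores, the sketch below summarizes the five fold accuracies printed earlier. The values are copied by hand from that output; with only five folds, the minimum and maximum are a rough indication rather than true bounds.

```python
import numpy as np

# Fold accuracies copied from the five-fold run shown earlier.
scores = np.array([0.96666667, 1.0, 0.93333333, 0.96666667, 1.0])

print("Mean: {:.3f}".format(scores.mean()))
print("Standard deviation: {:.3f}".format(scores.std()))
print("Worst fold: {:.3f}".format(scores.min()))
print("Best fold: {:.3f}".format(scores.max()))
```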

### Stratified k-Fold Cross-Validation

The labels of our iris dataset are sorted by class:

In [9]:
from sklearn.datasets import load_iris
iris = load_iris()
print("Iris labels:\n{}".format(iris.target))

Iris labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


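If we apply standard k-fold cross-validation, which splits the data in order, the first fifth of the data (the first 30 samples) becomes the first test set, and it contains only examples of class 0. The other folds are similarly unrepresentative of the whole dataset. This can lead to misleading accuracies; in the extreme case of three folds, each test fold would consist entirely of a class that never appears in the corresponding training set, and the accuracy would be zero.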

Luckily for us, scikit-learn automatically uses stratified k-fold cross-validation for classification. This means it chooses the examples for each fold so that the class proportions in the fold match those of the whole dataset. For regression, however, scikit-learn uses only standard k-fold cross-validation, since stratifying continuous regression targets is awkward and does not make much sense.
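
To make the difference concrete, the following sketch counts how many samples of each class land in every test fold, once for KFold and once for StratifiedKFold. The helper name `class_counts_per_test_fold` is hypothetical and only used for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts_per_test_fold(splitter):
    """Return the number of samples per class in each test fold."""
    return [np.bincount(y[test_idx], minlength=3).tolist()
            for _, test_idx in splitter.split(X, y)]

plain_counts = class_counts_per_test_fold(KFold(n_splits=5))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", stratified_counts)
```

With the sorted iris labels, the unshuffled KFold test folds are dominated by one or two classes, while every stratified test fold contains the three classes in equal numbers.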

### More control over cross-validation

We can adjust the number of folds that are used in
cross_val_score using the cv parameter. However, scikit-learn allows for much
finer control over what happens during the splitting of the data by providing a cross-validation
splitter as the cv parameter. For most use cases, the defaults of k-fold cross-validation
for regression and stratified k-fold for classification work well, but there
are some cases where you might want to use a different strategy.
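
As a minimal sketch of the simplest case, passing an integer as cv changes only the number of folds. The variable name `scores_10` is an assumption for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# An integer cv sets the number of folds; for a classifier, stratified k-fold is used.
scores_10 = cross_val_score(logreg, iris.data, iris.target, cv=10)
print("Number of folds:", len(scores_10))
print("Mean accuracy: {:.2f}".format(scores_10.mean()))
```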

For example, we may not want scikit-learn to automatically use stratified k-fold cross-validation, and prefer the standard k-fold instead. In that case, we import the KFold splitter class from the model_selection module and instantiate it with the number of folds we want to use:

In [10]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)

Then, we can pass the kfold splitter object as the cv parameter to cross_val_score:

In [11]:
print("Cross-validation scores:\n{}".format(cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores:
[1.         1.         0.86666667 0.93333333 0.83333333]


We can randomize the order of the data points before they are split by setting the shuffle parameter of KFold to True. Fixing random_state makes the shuffling reproducible.

In [13]:
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print("Cross-validation scores:\n{}".format(cross_val_score(logreg, iris.data, iris.target, cv=kfold)))

Cross-validation scores:
[0.98 0.96 0.96]
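
To see why shuffling matters for sorted labels, the sketch below compares three-fold KFold on iris with and without shuffling. Without shuffling, each test fold contains exactly one class, and that class is missing from the training data, so the model cannot predict it. The variable names are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# Three unshuffled folds line up exactly with the three sorted classes of iris.
scores_unshuffled = cross_val_score(
    logreg, iris.data, iris.target, cv=KFold(n_splits=3))

# Shuffling first mixes the classes across the folds.
scores_shuffled = cross_val_score(
    logreg, iris.data, iris.target,
    cv=KFold(n_splits=3, shuffle=True, random_state=0))

print("Without shuffling:", scores_unshuffled)
print("With shuffling:   ", scores_shuffled)
```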


### Leave-one-out cross-validation

You can think of leave-one-out cross-validation as k-fold cross-validation where each fold is a single sample.
For each split, you pick a single data point to be the test set. This can be very
time consuming, particularly for large datasets, because one model is trained per sample, but it can give better estimates on small datasets.

In [14]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()

scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)

print("Number of cv iterations: ", len(scores))
print("Mean accuracy: {:.2f}".format(scores.mean()))

Number of cv iterations:  150
Mean accuracy: 0.97
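
The relationship to k-fold can be checked directly. The sketch below uses a toy array of six samples (the name `X_toy` and its values are placeholders) and compares the test sets produced by LeaveOneOut with those of KFold when the number of folds equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Six toy samples; the values themselves do not matter for splitting.
X_toy = np.arange(12).reshape(6, 2)

loo_test_sets = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]
kfold_test_sets = [test.tolist()
                   for _, test in KFold(n_splits=len(X_toy)).split(X_toy)]

print("LeaveOneOut test sets:", loo_test_sets)
print("KFold test sets:      ", kfold_test_sets)
```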


### Shuffle-split cross-validation

In shuffle-split cross-validation, each split samples a specified number of points for the
training set and the test set. This splitting is repeated a specified number of times.

One important thing to note is that the sizes of the training set and the test set need not add up to the whole dataset, as they do in the other techniques. For example, we can use only 50% of the data for training and 20% for testing at each iteration, and leave the remaining 30% unused. This is especially helpful for very large datasets where we want to save some time.

In [15]:
from sklearn.model_selection import ShuffleSplit
shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)

scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)

print("Cross-validation scores:\n{}".format(scores))

Cross-validation scores:
[0.94666667 0.96       0.96       0.96       0.97333333 0.94666667
 0.96       0.97333333 0.97333333 0.97333333]


The stratified version is in StratifiedShuffleSplit, which is better suited for classification tasks.
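
The following sketch illustrates both points: a ShuffleSplit that uses only part of the data in each iteration, and StratifiedShuffleSplit preserving the class proportions. The 50%/20% sizes follow the example in the text; the variable names and the fixed random_state are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

# 50% for training, 20% for testing; the remaining 30% is unused in each split.
partial_split = ShuffleSplit(train_size=0.5, test_size=0.2,
                             n_splits=3, random_state=0)
split_sizes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in partial_split.split(X)]
print("(train, test) sizes per split:", split_sizes)

# The stratified variant keeps the class proportions in each test set.
stratified_split = StratifiedShuffleSplit(train_size=0.5, test_size=0.2,
                                          n_splits=3, random_state=0)
test_class_counts = [np.bincount(y[test_idx], minlength=3).tolist()
                     for _, test_idx in stratified_split.split(X, y)]
print("Class counts in each stratified test set:", test_class_counts)
```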

### Cross-validation with groups

Another very common setting for cross-validation is when there are groups in the
data that are highly related.

For example, suppose you want to build a model that makes predictions from images of people. The problem is that, more likely than not, you will have many images of only a few people. Therefore, when you split the data for validation, some pictures of a person will be in the training set while others will be in the test set. The model may then report a misleadingly high accuracy, because it may simply have memorized the features of that person. What you really want is to test the model on images of people it has never seen before.

Another common application is in medicine, where you
might have multiple samples from the same patient, but are interested in generalizing
to new patients. Similarly, in speech recognition, you might have multiple recordings
of the same speaker in your dataset, but are interested in recognizing speech of new
speakers.

To achieve this, we can use GroupKFold, which takes an array of groups as an argument
that we can use to indicate which person is in the image. Each entry of the groups array indicates
the group that the data point at the same index belongs to, so the groups array has the same length as the dataset.

In [16]:
from sklearn.model_selection import GroupKFold

from sklearn.datasets import make_blobs
# Create synthetic dataset of size 12
X, y = make_blobs(n_samples=12, random_state=0)

# Assume the first three samples belong to the same group, then the next four, etc.
groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]

scores = cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print("Cross-validation scores:\n{}".format(scores))

Cross-validation scores:
[0.75       0.6        0.66666667]




Ignore any error or warning message printed when running this cell; this is just synthetic data used for illustration purposes.
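
To check that GroupKFold really keeps each group on one side of every split, the sketch below reuses the group labels from the example above and lists which groups end up in the training and test sets. The feature array `X_placeholder` is an assumption; its values are irrelevant to the splitting.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Same group labels as in the example above; the feature values are placeholders.
groups = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3])
X_placeholder = np.zeros((len(groups), 2))

shared_groups = []
splitter = GroupKFold(n_splits=3)
for train_idx, test_idx in splitter.split(X_placeholder, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    print("Test groups:", sorted(test_groups),
          "| train groups:", sorted(train_groups))
    # Groups present on both sides of the split (expected to be empty).
    shared_groups.append(train_groups & test_groups)
```

No group appears in both the training and the test set of the same split, which is the property we need when generalizing to unseen people, patients, or speakers.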

There are even more cross-validation techniques implemented in scikit-learn but the ones mentioned above are the most commonly used ones.