# Model evaluation

So far our model evaluation was relatively simplistic, using a split into training and test data as shown in Figure TODO.

![:scale 100%](images/train_test_split_new.png)

This is a common scheme, but has several limitations that we'll address now.
The first issue is model selection. As we discussed before, many models have hyper-parameters that we need to specify, such as ``n_neighbors`` in the ``KNeighborsClassifier``.
We also might want to choose between different families of models or different algorithms. For simplicity, let's focus on the case of a single hyper-parameter, but there's no difference between that and selecting a different algorithm in principle.
A natural way to find a good parameter to use would be to try different values, i.e. fit a model for each value of the hyper-parameter on the training set, and evaluate it on the test set.
It seems that the hyper-parameters performing best on the test set should be a good choice. That's true, however, with a caveat that's illustrated in figure TODO.
![:scale 80%](images/overfitting_validation_set_3.png)


The figure assumes we know the "real" generalization performance of each hyper-parameter setting. This quantity is only knowable in theory, but it is the quantity that we're actually interested in. What we have is the performance on the test set, which can be understood as a noisy version of the true generalization ability of the model.
Now if we pick the best-performing hyper-parameters, as indicated by the red dot, we get a good estimate of what the optimum hyper-parameter value should be. In other words, on the x-axis, the red dot is close to the maximum of the idealized generalization performance (the orange line).
However, because we took the maximum of a noisy value, the actual accuracy at this point, i.e. the y-axis value of the red dot, is overly optimistic. While the test-set accuracy of any single, fixed hyper-parameter setting is an unbiased estimate of generalization performance, taking the maximum over all hyper-parameter settings results in an optimistic bias.
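
The optimistic bias can be made concrete with a small simulation. This is only a sketch under assumptions that are not from the text: every candidate setting has the same true accuracy of 0.9, the test set has 38 samples, and the test-set score of each candidate is an independent binomial draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumptions for this toy simulation (not taken from the text):
true_accuracy = 0.90     # every candidate generalizes equally well
n_test = 38              # test-set size, e.g. 25% of 150 samples
n_candidates = 7         # number of hyper-parameter settings we try
n_repetitions = 10_000   # how often we repeat the whole experiment

# Each entry is the test-set accuracy of one candidate in one repetition:
# the fraction of the n_test test points that were classified correctly.
test_scores = rng.binomial(n_test, true_accuracy,
                           size=(n_repetitions, n_candidates)) / n_test

# A single, fixed candidate is unbiased: on average its score matches the truth.
mean_single = test_scores[:, 0].mean()
# Reporting the best score among all candidates is biased upwards.
mean_best = test_scores.max(axis=1).mean()

print(f"true accuracy:               {true_accuracy:.3f}")
print(f"mean score of one candidate: {mean_single:.3f}")
print(f"mean of the best score:      {mean_best:.3f}")
```

In this setup, the average of the best score lies clearly above the true accuracy, even though no candidate is actually better than any other.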

```{note} Bias in validation and multiple testing
The issue described above of using the test set both for finding the optimum hyper-parameters and for estimating the accuracy of these parameters can be linked to the concept of multiple testing errors in statistics.
The underlying idea is that if I try out many different things (say many hypotheses or many hyper-parameters), then at some point, I'll get a 'good' result by accident. In a process involving randomness and uncertainty, trying 100 times and getting a good result is not the same as trying once and getting a good result. Luckily the fix for our issue is much easier than the fix for multiple hypothesis testing.
```

So while we now know a good setting for our hyper-parameter, we have no way of estimating how well the model with this hyper-parameter actually performs. If we want to use this model in production, that's a pretty big issue.
However, it has a pretty simple solution: using an additional hold-out set, as shown in Figure todo.

## The Threefold-split

![:scale 100%](images/train_test_validation_split.png)

This use of three separate sets, a training set for model building, a validation set for model selection, and a test set for final model evaluation, is probably the most commonly used method for model selection and evaluation, and (apart from some modifications that we'll discuss below), it's a best practice that you should use whenever possible. The result is illustrated in Figure todo.
![:scale 80%](images/overfitting_validation_set_4.png)

We now use the validation set to find the value (i.e. x-axis location) of the optimum hyper-parameter, and the test set to find the corresponding y-value. Because we haven't used the test set for estimating the best hyper-parameter, the test set provides an unbiased estimate of generalization performance when following this method.
One aspect of this is critically important, though: the test set only provides an unbiased result if we use it once. If we apply this selection procedure several times, compare several test-set results, and pick the best-performing model, we end up in the situation we started with, and our estimate will be biased. Therefore, **you should be really careful about when you use the test set**, and ideally set it aside and **use it only once, after you have decided on the final model you want to use**.

[Preventing Overfitting in cross-validation - Ng 1997](http://robotics.stanford.edu/~ang/papers/cv-final.pdf)

## Implementing the threefold split
Before we go into more detail on the tools for model evaluation in scikit-learn, let's do the procedure *by hand* first, to clarify the process.
Here is a simple implementation of the three-fold split strategy for selecting the value of ``n_neighbors`` in ``KNeighborsClassifier`` on the TODO iris dataset.
We split our dataset into three parts, by first taking 25% as the test set, and then taking 25% of the remainder as the validation set (so about 19% of the original data).
Then we define the candidate values of the parameter we want to adjust. This often requires knowledge of the algorithm and potentially the dataset.
Here, we're using the odd numbers from 1 to 13. For ``n_neighbors``, odd numbers are often used to avoid ties. The upper end of the range is picked somewhat arbitrarily, but we certainly wouldn't want to use more than 50, the number of samples in each class in this dataset.

Then, we build a model for each value of ``n_neighbors`` in our list, evaluate it on the validation set, and store the result. Finally, we find the value that gave us the best result.
We have now essentially computed the red dot in figure TODO, and we only need to evaluate the purple dot.
The easiest way to do this would be to just predict on the test set with the model we already have, which is certainly possible. In practice, we often prefer rebuilding the model using the training data together with the validation data.
After we have found the best value of ``n_neighbors``, the validation set is no longer needed for selection, and so we can use it as additional data to build our model with. So we train a new model with the best value of ``n_neighbors`` as determined by the validation set, using both the original training set and the validation set as training data, and evaluate this model on the test set.
This provides us with an unbiased estimate of how well this final model will generalize.

The step of retraining a model using both the training and validation set is optional, and if model training is very expensive, or we can assume the training dataset is large enough for our model, we might skip this step. Here, we have very little data, however, and we should use as much as we can get. We'll also see more reasons to use this retraining technique later.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the data and split it into three parts
X, y = load_iris(return_X_y=True)
# first split off the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=23)
# then split off the validation set
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=23)

# create a list to hold validation scores for each
# hyper-parameter setting
val_scores = []
# Specify a list of values we want to try for n_neighbors
# This might require some knowledge of the dataset and the model
# or potentially some trial-and-error
neighbors = np.arange(1, 15, 2)
# for each potential value of n_neighbors
for i in neighbors:
    # build a model
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    # score validation set accuracy
    val_scores.append(knn.score(X_val, y_val))
# using max tells us the best score
print(f"best validation score: {np.max(val_scores):.3}")
# with argmax we can find the best value of n_neighbors used
best_n_neighbors = neighbors[np.argmax(val_scores)]
print("best n_neighbors:", best_n_neighbors)
# Now retrain on the combined training and validation data
# and evaluate on the test set
knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)
knn.fit(X_trainval, y_trainval)
print(f"test-set score: {knn.score(X_test, y_test):.3f}")
```

```
best validation score: 1.0
best n_neighbors: 3
test-set score: 0.974
```


Using a three-fold split into training, validation and test set is essential whenever you're evaluating more than one model, which is always.
However, there's another aspect we might want to improve, which is the reliance on the particular splits.
If we change the random state in the splitting, we might end up with different results. Ideally we want the parameters we pick and our assessment of generalization ability not to be affected by the initial splitting of the data. In fact, a large variation in outcomes depending on the data split is probably a bad sign and means our model is not very robust.
In some cases, when the dataset is particularly small, such as the iris dataset that we're using here, this can be hard to overcome, but there's still one tool
that's invaluable to have in your toolbox to make model evaluation more robust: cross-validation.
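
To get a feeling for how much the outcome depends on the split, we can repeat the three-fold split procedure for several random states. The following is a minimal sketch; the helper name `threefold_split_search` and the choice of ten seeds are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
neighbors = np.arange(1, 15, 2)


def threefold_split_search(random_state):
    """Run the three-fold split procedure once (hypothetical helper)."""
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, random_state=random_state)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, random_state=random_state)
    # validation accuracy for each candidate value of n_neighbors
    val_scores = [
        KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
        for k in neighbors
    ]
    best_k = neighbors[np.argmax(val_scores)]
    # retrain on training + validation data, evaluate once on the test set
    knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
    return best_k, knn.score(X_test, y_test)


results = [threefold_split_search(seed) for seed in range(10)]
for seed, (best_k, test_score) in enumerate(results):
    print(f"random_state={seed}: best n_neighbors={best_k}, "
          f"test-set score={test_score:.3f}")
```

If the selected ``n_neighbors`` or the test-set score changes noticeably from seed to seed, that is exactly the dependence on the split described above.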

## K-Fold Cross-validation



To recap the implementation of the three-fold split above: for each
number of neighbors that we want to try, we build a model on the
training set and evaluate it on the validation set.
We then pick the setting with the best validation-set score, which in
the run above is three neighbors.
We then retrain the model with this parameter and evaluate it on the test set.
The retraining step is optional. We could also just use the best model
found during the search, but retraining lets us make better use of all
the data.

Still, depending on the test-set size, we might be using only 70% or 80%
of the data for building the model, and our results depend on how
exactly we split the dataset.
So how can we make this more robust?

## Cross-validation
.center[
![:scale 80%](images/cross_validation_new.png)
]


The answer is of course cross-validation. In cross-validation, you split
your data into multiple folds, usually 5 or 10, and build multiple models.
You start by using the first fold as the test data and the remaining
folds as the training data. You build your model on the training data
and evaluate it on the test fold. Then you repeat this with each of the
other folds as the test data in turn.
For each of the splits of the data, you get a model evaluation and a
score. In the end, you can aggregate the scores, for example by taking
the mean.

What are the pros and cons of this?
Each data point is in a test set exactly once, which makes the
evaluation more stable, and each model is trained on a larger share of
the data, so we make better use of it.
The downside is that it takes 5 or 10 times longer, since we build one
model per fold.
Does that solve all problems? No, it replaces only one of the two splits,
usually the inner one between training and validation data.
.smaller[
pro: more stable, more data

con: slower
]
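
The procedure can be written out as an explicit loop. This is a minimal sketch on the iris data; the values `n_neighbors=5` and `random_state=0` are arbitrary choices for the illustration. It also checks that the loop gives the same scores as scikit-learn's `cross_val_score` helper when both use the same folds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# shuffle because iris is sorted by class; random_state fixes the folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    # build a model on four folds ...
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # ... and evaluate it on the held-out fold
    manual_scores.append(knn.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# the helper function runs the same loop for us
helper_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=kfold)

print("manual loop:    ", manual_scores.round(3))
print("cross_val_score:", helper_scores.round(3))
print(f"mean accuracy:   {manual_scores.mean():.3f}")
```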

## Cross-validation + test set

![:scale 105%](images/grid_search_cross_validation_new.png)



Here is what the workflow looks like when we use five-fold
cross-validation together with a test-set split for adjusting parameters.
We start out by splitting off the test data, and then we perform
cross-validation on the training set.
Once we have found the right setting of the parameters, we retrain on the
whole training set and evaluate on the test set.

## Grid-Search with Cross-Validation

.smaller[
```python
from sklearn.model_selection import cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y)

cross_val_scores = []

for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(knn, X_train, y_train, cv=10)
    cross_val_scores.append(np.mean(scores))
    
print(f"best cross-validation score: {np.max(cross_val_scores):.3}")
best_n_neighbors = neighbors[np.argmax(cross_val_scores)]
print(f"best n_neighbors: {best_n_neighbors}")

knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)
knn.fit(X_train, y_train)
print(f"test-set score: {knn.score(X_test, y_test):.3f}")
```

```
best cross-validation score: 0.967
best n_neighbors: 9
test-set score: 0.965
```
]



Here is an implementation of this for k-nearest neighbors.

We split the data, then we iterate over all parameter values, and for
each of them we run cross-validation.

We had seven different values of n_neighbors, and we are running 10-fold
cross-validation. How many models do we train in total?
10 * 7 + 1 = 71, where the one is the final model.

![:scale 80%](images/gridsearch_workflow.png)



Here is a conceptual overview of this way of tuning parameters. We start
off with the dataset and a candidate set of parameters we want to try,
labeled parameter grid, for example the number of neighbors.

We split the dataset into training and test set. We use cross-validation
and the parameter grid to find the best parameters.
We then use the best parameters and the whole training set to build a
final model, and finally evaluate it on the test set.


Because this is such a common pattern, there is a helper class for this
in scikit-learn, called GridSearchCV, which does most of these steps
for you.

## GridSearchCV
.smaller[
```python
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)


param_grid = {'n_neighbors': np.arange(1, 15, 2)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=10,
                   return_train_score=True)
grid.fit(X_train, y_train)
print(f"best mean cross-validation score: {grid.best_score_:.3f}")
print(f"best parameters: {grid.best_params_}")
print(f"test-set score: {grid.score(X_test, y_test):.3f}")
```

```
best mean cross-validation score: 0.967
best parameters: {'n_neighbors': 9}
test-set score: 0.993
```
]


Here is an example.
We still need to split our data into training and test set.
We declare the parameters we want to search over as a dictionary. In
this example the parameter is just n_neighbors and the values we want
to try out are a range. The keys of the dictionary are the parameter
names and the values are the parameter settings we want to try. If you
specify multiple parameters, all possible combinations are tried. This
is where the name grid-search comes from - it’s an exhaustive search
over all possible parameter combinations that you specify.
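
To see what "all possible combinations" means, we can enumerate a grid with scikit-learn's `ParameterGrid`. The two-parameter grid below is a made-up example: `weights` is a real `KNeighborsClassifier` parameter, but it is not part of the search in this section.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical grid with two parameters: 3 values x 2 values
param_grid = {'n_neighbors': [1, 3, 5], 'weights': ['uniform', 'distance']}

candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)
print("number of candidates:", len(candidates))  # 3 * 2 = 6

# with 10-fold cross-validation, each candidate is fit once per fold,
# plus one final fit of the best candidate on the whole training set
n_folds = 10
n_fits = len(candidates) * n_folds + 1
print("number of model fits:", n_fits)  # 6 * 10 + 1 = 61
```

The number of candidates is the product of the number of values per parameter, so grids grow quickly as you add parameters.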

GridSearchCV is a class, and it behaves just like any other model in
scikit-learn, with fit, predict and score methods.
It’s what we call a meta-estimator: you give it one estimator,
here the KNeighborsClassifier, and from that GridSearchCV constructs a
new estimator that does the parameter search for you.
You also specify the parameters you want to search and the
cross-validation strategy.
GridSearchCV then does all the other things we talked about: it runs the
cross-validation and parameter selection, and retrains a model on the
whole training set with the best parameter settings that were found.
We can check out the best cross-validation score and the best parameter
setting with the best_score_ and best_params_ attributes.
And finally we can compute the accuracy on the test set, simply by
using the score method! That will use the retrained model under the hood.


## GridSearchCV Results
.tiny[
```python
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
results.columns
```
```
Index(['mean_fit_time', 'mean_score_time', 'mean_test_score',
       'mean_train_score', 'param_n_neighbors', 'params', 'rank_test_score',
       'split0_test_score', 'split0_train_score', 'split1_test_score',
       'split1_train_score', 'split2_test_score', 'split2_train_score',
       'split3_test_score', 'split3_train_score', 'split4_test_score',
       'split4_train_score', 'split5_test_score', 'split5_train_score',
       'split6_test_score', 'split6_train_score', 'split7_test_score',
       'split7_train_score', 'split8_test_score', 'split8_train_score',
       'split9_test_score', 'split9_train_score', 'std_fit_time',
       'std_score_time', 'std_test_score', 'std_train_score'],
      dtype='object')
```

```python
results.params
```
```
0     {'n_neighbors': 1}
1     {'n_neighbors': 3}
2     {'n_neighbors': 5}
3     {'n_neighbors': 7}
4     {'n_neighbors': 9}
5    {'n_neighbors': 11}
6    {'n_neighbors': 13}
Name: params, dtype: object
```
]



GridSearchCV also computes a lot of interesting statistics for you, which
are stored in the cv_results_ attribute. That attribute is a dictionary,
but it’s easiest to convert it to a pandas dataframe to look at it.
Here you can see the columns. There’s the mean fit time, the mean score
time, mean test scores, mean training scores, standard deviations, and
scores for each individual split of the data.
And there is one row for each setting of the parameters we tried out.
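
If you only want a quick look at the most important columns, you can also read them straight from the dictionary. The following is a sketch that reruns a small search on iris; the grid and `random_state=0` are assumptions made for this illustration, so the numbers need not match the outputs shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': np.arange(1, 15, 2)},
                    cv=10, return_train_score=True)
grid.fit(X_train, y_train)

res = grid.cv_results_  # a dict of arrays, one entry per parameter setting
for params, train, test, std, rank in zip(res['params'],
                                          res['mean_train_score'],
                                          res['mean_test_score'],
                                          res['std_test_score'],
                                          res['rank_test_score']):
    print(f"n_neighbors={int(params['n_neighbors']):2d}  "
          f"train={train:.3f}  test={test:.3f} +/- {std:.3f}  rank={rank}")
```

Note that "test" in these column names refers to the held-out fold within cross-validation, not to the final test set.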

## n_neighbors Search Results

![:scale 70%](images/grid_search_n_neighbors.png)


We can use this for example to plot the results of cross-validation over
the different parameters.
Here are the mean training score and mean test score together with one
standard deviation.

## Nested Cross-Validation

- Replace outer split by CV loop
- Doesn’t yield single model
(inner loop might have different best parameter settings)
- Takes a long time, not that useful in practice



We could additionally replace the outer split of the data by
cross-validation. That would yield what’s known as nested
cross-validation.
This is sometimes interesting when comparing different models, but it will
not actually yield one final model. It will yield one model for each loop
of the outer fold, which might have different settings of the parameters.
Also, this takes a really long time to train, by an additional factor
of 5 or 10, so this is not used very commonly in practice.
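
For completeness, here is a minimal sketch of nested cross-validation. It passes a `GridSearchCV` object as the estimator to `cross_validate`, so the inner search is rerun inside each outer fold. The fold counts and the grid are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# inner loop: parameter search with 5-fold cross-validation
inner_search = GridSearchCV(KNeighborsClassifier(),
                            param_grid={'n_neighbors': np.arange(1, 15, 2)},
                            cv=5)
# outer loop: 5-fold cross-validation around the whole search
outer = cross_validate(inner_search, X, y, cv=5, return_estimator=True)

outer_scores = outer['test_score']
# each outer fold has its own fitted search, possibly with different parameters
best_per_fold = [int(est.best_params_['n_neighbors']) for est in outer['estimator']]

print("outer-fold scores:", outer_scores.round(3))
print("best n_neighbors per outer fold:", best_per_fold)
```

The list of best parameters per outer fold shows why this procedure evaluates the search as a whole rather than producing a single final model.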

But let’s dive into the cross-validation a bit more.

## Cross-Validation Strategies



So far I mentioned k-fold cross-validation, where k is usually 5 or 10,
but there are many other strategies.

One of the most common ones is stratified k-fold cross-validation.

.center[
![:scale 90%](images/kfold_cv.png)
]

.center[
![:scale 90%](images/stratified_cv.png)
]
.smallest[
Stratified:
Ensure relative class frequencies in each fold reflect relative class
frequencies on the whole dataset.]



The idea behind stratified k-fold cross-validation is that you want the
test set to be as representative of the dataset as possible.
StratifiedKFold preserves the class frequencies in each fold to be the
same as in the overall dataset.
Here is an example of a dataset with three classes that are ordered. If
you apply standard three-fold cross-validation to this, the first third
of the data would be in the first fold, the second in the second fold
and the third in the third fold. Because this data is sorted, that would
be particularly bad, since the classes in a test fold would be missing
or under-represented in the corresponding training data. If you use
stratified cross-validation, it makes sure that each fold has exactly
1/3 of the data from each class.

This is also helpful if your data is very imbalanced. If some of the
classes are very rare, it could otherwise happen that a class is not
present at all in a particular fold.
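
The iris data happens to be sorted by class, so it can serve as a concrete version of this example. The sketch below counts how many samples of each class end up in each test fold; the helper name `class_counts_per_test_fold` is made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # 3 classes, 50 samples each, sorted by class


def class_counts_per_test_fold(cv):
    """Count the samples of each class in every test fold (hypothetical helper)."""
    return [np.bincount(y[test_idx], minlength=3) for _, test_idx in cv.split(X, y)]


kfold_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print("KFold test folds (samples per class):")
for counts in kfold_counts:
    print("  ", counts)
print("StratifiedKFold test folds (samples per class):")
for counts in stratified_counts:
    print("  ", counts)
```

Without shuffling, each KFold test fold contains a single class that is absent from its training data, while the stratified folds contain all three classes in nearly equal numbers.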

## Importance of Stratification
.smaller[
```python
y.value_counts()
```
```
0    60
1    40
```
```python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.dummy import DummyClassifier

dc = DummyClassifier(strategy='most_frequent')
skf = StratifiedKFold(n_splits=5, shuffle=True)
res = cross_val_score(dc, X, y, cv=skf)
np.mean(res), res.std()
```
```
(0.6, 0.0)
```
```python
kf = KFold(n_splits=5, shuffle=True)
res = cross_val_score(dc, X, y, cv=kf)
np.mean(res), res.std()
```
```
(0.6, 0.063)
```
]

## Repeated KFold and LeaveOneOut

- LeaveOneOut : KFold(n_splits=n_samples)

High variance, takes a long time 

.tiny[(see [Raschka](https://arxiv.org/pdf/1811.12808.pdf) for a review and [Varoquaux](https://hal.inria.fr/hal-01545002/file/paper.pdf) for empirical evaluation)]

- Better: ShuffleSplit (aka Monte Carlo) 

Repeatedly sample a random test set (a sample can appear in several test sets)

- Even Better: RepeatedKFold. 

Apply KFold or StratifiedKFold multiple times with shuffled data.



If you want even better estimates of the generalization performance,
you could try to increase the number of folds, with the extreme
of creating one fold per sample. That’s called “LeaveOneOut
cross-validation”. However, because the test-set is so small every time,
and the training sets all have very large overlap, this method has very
high variance.
A better way to get a robust estimate is to run 5-fold or 10-fold
cross-validation multiple times, while shuffling the dataset.
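
Both ideas can be checked on a tiny toy dataset. This sketch assumes six made-up samples and arbitrary settings (`n_splits=3`, `n_repeats=2`); it verifies that LeaveOneOut produces the same test sets as KFold with one fold per sample, and shows the splits generated by RepeatedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, RepeatedKFold

X = np.arange(12).reshape(6, 2)  # toy data: 6 samples, 2 features

# LeaveOneOut is KFold with as many folds as samples
loo_test = [test.tolist() for _, test in LeaveOneOut().split(X)]
kfold_test = [test.tolist() for _, test in KFold(n_splits=len(X)).split(X)]
print("LeaveOneOut test sets:", loo_test)
print("same as KFold(n_splits=n_samples):", loo_test == kfold_test)

# RepeatedKFold runs KFold several times with different shuffles
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
repeated_test = [test.tolist() for _, test in rkf.split(X)]
print("number of splits:", len(repeated_test))  # 3 folds * 2 repeats = 6
for i, test in enumerate(repeated_test):
    print(f"repeat {i // 3}, fold {i % 3}: test indices {test}")
```

Within each repetition every sample is in a test fold exactly once, so the repetitions give several complete k-fold estimates that can be averaged.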

.center[
![:scale 100%](images/shuffle_split_cv.png)
]
.smaller[Number of iterations and test set size independent]


Other interesting variants are shuffle split and stratified shuffle
split. In shuffle split, we repeatedly sample disjoint training and test
sets randomly.
You only have to specify the number of iterations, the training set size
and the test set size. This also allows you to run many iterations with
reasonably large test-sets.
It’s also great if you have a very large training set and you want to
subsample it to get quicker results.
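
Here is a small sketch of these properties on made-up data with 100 samples. The sizes (`train_size=0.5`, `test_size=0.25`) and the number of iterations are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(200).reshape(100, 2)  # toy data: 100 samples

# number of iterations, training size and test size are chosen independently;
# here 25% of the data is used for neither training nor testing in each split
ss = ShuffleSplit(n_splits=20, train_size=0.5, test_size=0.25, random_state=0)
splits = list(ss.split(X))

for train_idx, test_idx in splits[:3]:
    overlap = np.intersect1d(train_idx, test_idx)
    print(f"train size: {len(train_idx)}, test size: {len(test_idx)}, "
          f"shared samples: {len(overlap)}")

# across iterations, the same sample can be drawn into several test sets
test_counts = np.bincount(np.concatenate([test for _, test in splits]),
                          minlength=len(X))
print("times each sample was in a test set (min, max):",
      test_counts.min(), test_counts.max())
```

Within one iteration the training and test sets never overlap, but unlike in k-fold cross-validation there is no guarantee that every sample is tested the same number of times.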

.center[
![:scale 100%](images/repeated_stratified_kfold.png)
]
.smaller[
Potentially less variance than StratifiedShuffleSplit.

Five times five fold or at most ten times ten fold is sufficient.
]

## Defaults in scikit-learn

- 5-fold in 0.22 (used to be 3 fold)
- For classification cross-validation is stratified
- train_test_split has stratify option:
train_test_split(X, y, stratify=y)

- No shuffle by default!



By default, all cross-validation strategies are five-fold.
If you do cross-validation for classification, it will be stratified
by default.
Because of how the interface is designed, that’s not true for
train_test_split. If you want a stratified train_test_split, which
is always a good idea, you should use stratify=y.
Another thing that’s important to keep in mind is that by default
scikit-learn doesn’t shuffle! So if you run cross-validation twice
with the default parameters, it will yield exactly the same results.
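
These defaults are easy to check. The sketch below assumes scikit-learn 0.22 or later (where five folds is the default) and uses iris with an arbitrary `random_state=0` for the train/test split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier()

# default cv: five folds, stratified for classifiers, no shuffling
first_run = cross_val_score(knn, X, y)
second_run = cross_val_score(knn, X, y)
print("number of folds:", len(first_run))
print("identical results on both runs:", np.array_equal(first_run, second_run))

# train_test_split shuffles, but only stratifies if we ask for it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
print("class counts in training set:", np.bincount(y_train))
print("class counts in test set:    ", np.bincount(y_test))
```

With `stratify=y`, the class counts in the training and test sets stay as close to the overall class proportions as the set sizes allow.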

## Cross-Validation with non-iid data

## Grouped Data
### Assume we have data (medical, product, user...) from 5 cities
- New York, San Francisco, Los Angeles, Chicago, Houston.

We can assume data within a city is more correlated than data between cities.

### Usage Scenarios
- Assume all future users will be in one of these cities: i.i.d.
- Assume we want to generalize to predict for a new city: not i.i.d.



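For example, if you have shipped a product in four cities, you might
want to know how it would do if you shipped it in another one.
In contrast, if your groups are states and you have data from all of
them, no new state will start to exist, so you only need to generalize
within the groups you already know.

A similar situation arises with multiple measurements per patient,
or with geospatial data.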

![:scale 100%](images/group_kfold.png)



A somewhat more complicated approach is group k-fold.
This is actually for data that doesn’t fulfill our IID assumption and
has correlations between samples.
The idea is that there are several groups in the data that each contain
highly correlated samples.
You could think about patient data where you have multiple samples for
each patient, then the groups would be which patient a measurement was
taken from.
If you want to know how well your model generalizes to new patients,
you need to ensure that the measurements from each patient are either
all in the training set, or all in the test set.
And that’s what GroupKFold does.
In this example, there are four groups, and we want three folds. The
data is divided such that each group is contained in exactly one fold.
There are several other cross-validation methods in scikit-learn that
use these groups.
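
The following sketch shows GroupKFold on made-up data: twelve measurements from four hypothetical patients, split into three folds. The group sizes and labels are invented for the illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# toy data: 12 measurements taken from 4 hypothetical patients
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3])  # patient id per sample

group_splits = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    train_groups = sorted(set(groups[train_idx].tolist()))
    test_groups = sorted(set(groups[test_idx].tolist()))
    group_splits.append((train_groups, test_groups))
    print(f"patients in training set: {train_groups}, "
          f"patients in test set: {test_groups}")
```

No patient appears on both sides of a split, so the score reflects how well the model generalizes to patients it has not seen.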

## Correlations in time (and/or space)

![:scale 70%](images/time_series1.png)



It is not necessarily obvious that there is a time component in your
data, but data collection usually happens over time!

![:scale 70%](images/time_series2.png)

![:scale 70%](images/time_series3.png)

![:scale 100%](images/time_series_walk_forward_cv.png)


Another common case of data that’s not independent is time
series. Usually today’s stock price is correlated with yesterday’s and
tomorrow’s. If you randomly split a time series, this makes predictions
deceptively simple. In applications, you usually have data up to some
point, and then try to make predictions for the future, in other words,
you’re trying to make a forecast.
The TimeSeriesSplit in scikit-learn simulates that, by taking increasing
chunks of data from the past and making predictions on the next
chunk. This is quite different from the other ways to do cross-validation,
in that the training sets are all overlapping, but it’s more appropriate
for time series.
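
To make the walk-forward idea concrete, here is a small sketch on a made-up series of 12 time-ordered observations (the array `X` is an illustrative assumption). It prints the indices that `TimeSeriesSplit` assigns to each training and test chunk:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series: 12 observations, already ordered in time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Training data always lies strictly before the test chunk.
    print("train:", train_idx, "test:", test_idx)
```

Each training set contains the previous one, and the test chunk always comes after all of the training indices, so the model is never evaluated on data from its own past.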

![:scale 100%](images/time_series_cv.png)




## Using Cross-Validation Generators

.tiny[
```python
from sklearn.model_selection import (KFold, StratifiedKFold, ShuffleSplit,
                                     RepeatedStratifiedKFold, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier
kfold = KFold(n_splits=5)
skfold = StratifiedKFold(n_splits=5, shuffle=True)
ss = ShuffleSplit(n_splits=20, train_size=.4, test_size=.3)
rs = RepeatedStratifiedKFold(n_splits=5, n_repeats=10)

print("KFold:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=kfold))

print("StratifiedKFold:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=skfold))

print("ShuffleSplit:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=ss))

print("RepeatedStratifiedKFold:")
print(cross_val_score(KNeighborsClassifier(), X, y, cv=rs))
```

```

KFold:
[0.93 0.96 0.96 0.98 0.96]
StratifiedKFold:
[0.98 0.96 0.96 0.97 0.96]
ShuffleSplit:
[0.98 0.96 0.96 0.98 0.94 0.96 0.95 0.98 0.97 0.92 0.94 0.97 0.95 0.92
 0.98 0.98 0.97 0.94 0.97 0.95]
RepeatedStratifiedKFold:
[0.99 0.96 0.97 0.97 0.95 0.98 0.97 0.98 0.97 0.96 0.97 0.99 0.94 0.96
 0.96 0.98 0.97 0.96 0.96 0.97 0.97 0.96 0.96 0.96 0.98 0.96 0.97 0.97
 0.97 0.96 0.96 0.95 0.96 0.99 0.98 0.93 0.96 0.98 0.98 0.96 0.96 0.95
 0.97 0.97 0.96 0.97 0.97 0.97 0.96 0.96]
```
]


OK, so how do we use these cross-validation generators? We can simply
pass the object to the cv parameter of the cross_val_score function,
instead of passing a number. Then that generator will be used.
Here are some examples for the k-neighbors classifier.
We instantiate a KFold object with the number of splits equal to 5,
and then pass it to cross_val_score.
We can do the same with StratifiedKFold, and we can also shuffle if we
like, or we can use ShuffleSplit or RepeatedStratifiedKFold.
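
The difference between passing a number and passing an object can be checked directly. The sketch below assumes the iris dataset purely for illustration. For a classifier, an integer `cv` is shorthand for an unshuffled `StratifiedKFold`, so both calls should give the same scores. It also shows that a splitter is just something that yields train and test index arrays:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset choice (an assumption, not the data used on the slide).
X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier()

# Passing a number vs. passing the equivalent splitter object.
scores_int = cross_val_score(knn, X, y, cv=5)
scores_obj = cross_val_score(knn, X, y, cv=StratifiedKFold(n_splits=5))
print(np.allclose(scores_int, scores_obj))

# A splitter yields (train indices, test indices) pairs.
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y)]
print(fold_sizes)
```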

## cross_validate function
.smaller[
```python
import pandas as pd
from sklearn.model_selection import cross_validate
res = cross_validate(KNeighborsClassifier(), X, y, return_train_score=True,
                     scoring=["accuracy", "roc_auc"])
res_df = pd.DataFrame(res)
```

```
fit_time  score_time  test_accuracy  test_roc_auc  train_accuracy  train_roc_auc
0.000839    0.010204       0.965217      0.996609        0.980176       0.997654
0.000870    0.014424       0.956522      0.983689        0.975771       0.998650
0.000603    0.009298       0.982301      0.999329        0.971491       0.996977
0.000698    0.006670       0.955752      0.984071        0.978070       0.997820
0.000611    0.006559       0.964602      0.994634        0.978070       0.998026
```
]




## Questions ?

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import sklearn
sklearn.set_config(print_changed_only=True)

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

In [None]:
cross_val_score(KNeighborsClassifier(),
                X_train, y_train, cv=5)

In [None]:
from sklearn.model_selection import KFold, RepeatedStratifiedKFold

In [None]:
cross_val_score(KNeighborsClassifier(),
                X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))

In [None]:
cross_val_score(KNeighborsClassifier(),
                X_train, y_train,
                cv=RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42))

## Grid Searches


Grid search with built-in cross-validation

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

Define parameter grid:

In [None]:
import numpy as np

param_grid = {'C': 10. ** np.arange(-3, 3),
              'gamma' : 10. ** np.arange(-5, 0)}

np.set_printoptions(suppress=True)
print(param_grid)

In [None]:
grid_search = GridSearchCV(SVC(), param_grid, verbose=3)

A GridSearchCV object behaves just like a normal classifier.

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.predict(X_test)

In [None]:
grid_search.score(X_test, y_test)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_estimator_

In [None]:
# We extract just the scores

scores = grid_search.cv_results_['mean_test_score']
scores = np.array(scores).reshape(6, 5)

plt.matshow(scores)
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(5), param_grid['gamma'])
plt.yticks(np.arange(6), param_grid['C']);

## Exercises
Use GridSearchCV to adjust n_neighbors of KNeighborsClassifier.



## Model complexity

![:scale 75%](images/knn_model_complexity.png)


We can look at this in more detail by comparing training and test set
scores for the different numbers of neighbors.
Here, I did a random 75%/25% split again. This is a very noisy plot as
the dataset is very small and I only did a random split, but you can
see a trend here.
You can see that for a single neighbor, the training score is 1 so perfect
accuracy, but the test score is only 70%.  If we increase the number of
neighbors we consider, the training score goes down, but the test score
goes up, with an optimum at 19 and 21, but then both go down again.

This is very typical behavior, which I sketched in a schematic for you.


Here is a cartoon version of how this chart looks in general, though
it's horizontally flipped relative to the one we saw for KNN.
This chart has accuracy on the y axis, and the abstract concept of model
complexity on the x axis.
If we make our machine learning models more complex, we will get better
training set accuracy, as the model will be able to capture more of the
variations in the data.


But if we look at the generalization performance, we get a different
story. If the model complexity is too low, the model will not be able
to capture the main trends, and a more complex model means better
generalization.
However, if we make the model too complex, generalization performance
drops again, because we basically learn to memorize the dataset.



## Overfitting and Underfitting

![:scale 80%](images/overfitting_underfitting_cartoon_full.png)


If we use too simple a model, this is often called underfitting, while
if we use too complex a model, this is called overfitting. And somewhere
in the middle is a sweet spot.
Most models have some way to tune model complexity, and we’ll see many
of them in the next couple of weeks.
So going back to nearest neighbors, which parameter values correspond to high
model complexity and which to low model complexity? A high n_neighbors means
low complexity!
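
The link between n_neighbors and model complexity can be checked with a short loop. This is a rough sketch that assumes the breast cancer dataset and a single 75%/25% split with a fixed seed. Neither is taken from the plot above, so the numbers will differ from it:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset and split (assumptions, not the data behind the plot).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

train_scores, test_scores = {}, {}
for n_neighbors in range(1, 30, 2):
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    train_scores[n_neighbors] = knn.score(X_train, y_train)
    test_scores[n_neighbors] = knn.score(X_test, y_test)
    print(f"n_neighbors={n_neighbors:2d}  "
          f"train={train_scores[n_neighbors]:.3f}  "
          f"test={test_scores[n_neighbors]:.3f}")
```

With a single neighbor every training point is its own nearest neighbor, so the training score is perfect. That is the most complex, most memorizing end of the range.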

In [None]:
# %load solutions/grid_search_k_neighbors.py

## Not using Pipelines vs feature selection

In [None]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
X_test = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))
y_test = rnd.normal(size=(100,))

In [None]:
from sklearn.feature_selection import SelectPercentile, f_regression

select = SelectPercentile(score_func=f_regression,
                          percentile=5)
select.fit(X, y)
X_selected = select.transform(X)
print(X_selected.shape)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
np.mean(cross_val_score(Ridge(), X_selected, y))

In [None]:
ridge = Ridge().fit(X_selected, y)
X_test_selected = select.transform(X_test)
ridge.score(X_test_selected, y_test)

## Back to house price?

In [None]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
X, y = df, target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
ridge = Ridge().fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
ridge.score(X_test_scaled, y_test)

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

In [None]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
X_test = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))
y_test = rnd.normal(size=(100,))

In [None]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("select", select),
                 ("ridge", Ridge())])
np.mean(cross_val_score(pipe, X, y))

In [None]:
from sklearn.linear_model import Ridge
X, y = df, target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


In [None]:
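from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
# steps is a list of (name, estimator) tuples; each estimator must be an instance
pipe = Pipeline([("scaler", StandardScaler()),
                 ("regressor", KNeighborsRegressor())])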

In [None]:
from sklearn.model_selection import GridSearchCV

knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {'kneighborsregressor__n_neighbors': range(1, 10)}
grid = GridSearchCV(knn_pipe, param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))

In [None]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, random_state=0)

from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    Ridge())

In [None]:
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid,
                    n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)

In [None]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

param_grid = {'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
              'regressor': [Ridge(), Lasso()],
              'regressor__alpha': np.logspace(-3, 3, 7)}


grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)

In [35]:

from sklearn.tree import DecisionTreeRegressor
pipe = Pipeline([('scaler', StandardScaler()), ('regressor', Ridge())])

param_grid = [{'regressor': [DecisionTreeRegressor()],
               'regressor__max_depth': [2, 3, 4],
               'scaler': ['passthrough']},
              {'regressor': [Ridge()],
               'regressor__alpha': [0.1, 1],
               'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough']}
             ]
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)

0.36901969445308325