Basic: nested cross-validation
==============================

In this notebook we will briefly illustrate how to use Optunity for
nested cross-validation.

Nested cross-validation is used to reliably estimate generalization
performance of a learning pipeline (which may involve preprocessing,
tuning, model selection, ...). Before starting this tutorial, we
recommend making sure you are familiar with basic cross-validation in
Optunity.
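
As a quick refresher, a single-level cross-validation in Optunity is a
decorated function that receives train/test splits and returns a score.
The sketch below is our own toy example on random data (``toy_x``,
``toy_y`` and ``toy_svm_auc`` are illustrative names, not part of this
tutorial):

.. code:: python

    import numpy as np
    import optunity
    import optunity.metrics
    import sklearn.svm

    # toy data: 100 random instances with balanced labels
    toy_x = np.random.randn(100, 5)
    toy_y = [True] * 50 + [False] * 50

    # calling the decorated function runs 3-fold CV and returns the mean score
    @optunity.cross_validated(x=toy_x, y=toy_y, num_folds=3)
    def toy_svm_auc(x_train, y_train, x_test, y_test):
        model = sklearn.svm.SVC().fit(x_train, y_train)
        decision_values = model.decision_function(x_test)
        return optunity.metrics.roc_auc(y_test, decision_values)

    print(toy_svm_auc())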

We will use an SVM from scikit-learn to illustrate the key concepts on
its handwritten digits data set.

.. code:: python

    import optunity
    import optunity.cross_validation
    import optunity.metrics
    import numpy as np
    import sklearn.svm

We load the digits data set and will construct models to distinguish
digits 6 and 8.

.. code:: python

    from sklearn.datasets import load_digits

    digits = load_digits()
    n = digits.data.shape[0]

    positive_digit = 6
    negative_digit = 8

    positive_idx = [i for i in range(n) if digits.target[i] == positive_digit]
    negative_idx = [i for i in range(n) if digits.target[i] == negative_digit]

    # add some noise to the data to make it a little challenging
    original_data = digits.data[positive_idx + negative_idx, ...]
    data = original_data + 5 * np.random.randn(original_data.shape[0], original_data.shape[1])
    labels = [True] * len(positive_idx) + [False] * len(negative_idx)

The basic nested cross-validation scheme involves two cross-validation
routines:

- outer cross-validation: to estimate the generalization performance of
  the learning pipeline. We will use 5 folds.

- inner cross-validation: to use while optimizing hyperparameters. We
  will use twice-iterated 10-fold cross-validation.

Here, we have to stratify the data based on the labels, to ensure that
both labels are represented in every training and testing split. To do
this, we use the ``strata_by_labels`` utility function, illustrated in
the snippet below.
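
The snippet below is our own toy illustration of the utility; we assume
``strata_by_labels`` returns one list of instance indices per unique
label, which ``cross_validated`` then spreads evenly across the folds:

.. code:: python

    # toy labels (our own example): three True and three False instances
    toy_labels = [True, True, False, True, False, False]

    # group instance indices per label; each group becomes one stratum
    strata = optunity.cross_validation.strata_by_labels(toy_labels)
    print(strata)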

We will use an SVM with an RBF kernel and optimize :math:`\gamma` on an
exponential grid :math:`10^{-5} < \gamma < 10^{1}` and :math:`C` on a
linear grid :math:`0 < C < 10`.
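
To see why we tune ``logGamma`` rather than ``gamma`` directly, note
that a box constraint in log space samples :math:`\gamma` uniformly on
an exponential scale (a small illustration of our own):

.. code:: python

    # the box [-5, 1] for logGamma spans six orders of magnitude for gamma
    for logGamma in [-5, -3, -1, 1]:
        print('logGamma = %2d  ->  gamma = %g' % (logGamma, 10 ** logGamma))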

.. code:: python

    # outer cross-validation to estimate performance of the whole pipeline
    @optunity.cross_validated(x=data, y=labels, num_folds=5,
                              strata=optunity.cross_validation.strata_by_labels(labels))
    def nested_cv(x_train, y_train, x_test, y_test):

        # inner cross-validation to estimate performance of a set of hyperparameters
        @optunity.cross_validated(x=x_train, y=y_train, num_folds=10, num_iter=2,
                                  strata=optunity.cross_validation.strata_by_labels(y_train))
        def inner_cv(x_train, y_train, x_test, y_test, C, logGamma):
            # note that the x_train, ... variables in this function are not the same
            # as within nested_cv!
            model = sklearn.svm.SVC(C=C, gamma=10 ** logGamma).fit(x_train, y_train)
            predictions = model.decision_function(x_test)
            return optunity.metrics.roc_auc(y_test, predictions)

        # tune hyperparameters using the inner cross-validation as objective
        hpars, info, _ = optunity.maximize(inner_cv, num_evals=100,
                                           C=[0, 10], logGamma=[-5, 1])

        print('')
        print('Hyperparameters: ' + str(hpars))
        print('Cross-validated AUROC after tuning: %1.3f' % info.optimum)

        # retrain on the full training fold with the tuned hyperparameters
        model = sklearn.svm.SVC(C=hpars['C'], gamma=10 ** hpars['logGamma']).fit(x_train, y_train)
        predictions = model.decision_function(x_test)
        return optunity.metrics.roc_auc(y_test, predictions)

    auc = nested_cv()

    print('')
    print('Nested AUROC: %1.3f' % auc)

.. parsed-literal::

    Hyperparameters: {'logGamma': -3.8679410473451057, 'C': 0.6162109374999996}
    Cross-validated AUROC after tuning: 1.000
    Hyperparameters: {'logGamma': -4.535231399331072, 'C': 0.4839113474508706}
    Cross-validated AUROC after tuning: 0.999
    Hyperparameters: {'logGamma': -4.0821875, 'C': 1.5395986549905802}
    Cross-validated AUROC after tuning: 1.000
    Hyperparameters: {'logGamma': -3.078125, 'C': 6.015625}
    Cross-validated AUROC after tuning: 1.000
    Hyperparameters: {'logGamma': -4.630859375, 'C': 3.173828125}
    Cross-validated AUROC after tuning: 1.000

    Nested AUROC: 0.999

If we want to explicitly retain statistics from the inner
cross-validation procedure, such as the ones we printed above, we can do
so by returning tuples in the outer cross-validation and using the
``identity`` aggregator.
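
Before the full example, here is a minimal sketch of the aggregator
mechanism in isolation (``toy_x``, ``toy_y`` and ``fold_sizes`` are our
own illustrative names): with ``identity``, the per-fold return values
are kept as a list instead of being averaged.

.. code:: python

    toy_x = np.random.randn(30, 2)
    toy_y = [True] * 15 + [False] * 15

    @optunity.cross_validated(x=toy_x, y=toy_y, num_folds=3,
                              aggregator=optunity.cross_validation.identity)
    def fold_sizes(x_train, y_train, x_test, y_test):
        return (len(x_train), len(x_test))

    print(fold_sizes())   # one (train size, test size) tuple per fold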

.. code:: python

    # outer cross-validation to estimate performance of the whole pipeline,
    # this time retaining inner statistics via the identity aggregator
    @optunity.cross_validated(x=data, y=labels, num_folds=5,
                              strata=optunity.cross_validation.strata_by_labels(labels),
                              aggregator=optunity.cross_validation.identity)
    def nested_cv(x_train, y_train, x_test, y_test):

        # inner cross-validation to estimate performance of a set of hyperparameters
        @optunity.cross_validated(x=x_train, y=y_train, num_folds=10, num_iter=2,
                                  strata=optunity.cross_validation.strata_by_labels(y_train))
        def inner_cv(x_train, y_train, x_test, y_test, C, logGamma):
            # note that the x_train, ... variables in this function are not the same
            # as within nested_cv!
            model = sklearn.svm.SVC(C=C, gamma=10 ** logGamma).fit(x_train, y_train)
            predictions = model.decision_function(x_test)
            return optunity.metrics.roc_auc(y_test, predictions)

        hpars, info, _ = optunity.maximize(inner_cv, num_evals=100,
                                           C=[0, 10], logGamma=[-5, 1])

        model = sklearn.svm.SVC(C=hpars['C'], gamma=10 ** hpars['logGamma']).fit(x_train, y_train)
        predictions = model.decision_function(x_test)

        # return AUROC, optimized hyperparameters and AUROC during hyperparameter search
        return optunity.metrics.roc_auc(y_test, predictions), hpars, info.optimum

    nested_cv_result = nested_cv()

We can then process the results like this:

.. code:: python

    aucs, hpars, optima = zip(*nested_cv_result)

    print("AUCs: " + str(aucs))
    print('')
    print("hpars: " + "\n".join(map(str, hpars)))
    print('')
    print("optima: " + str(optima))

    mean_auc = sum(aucs) / len(aucs)
    print('')
    print("Mean AUC %1.3f" % mean_auc)

.. parsed-literal::

    AUCs: (0.9992063492063492, 1.0, 1.0, 0.9976190476190476, 0.9984126984126984)

    hpars: {'logGamma': -3.5753515625, 'C': 3.9048828125000004}
    {'logGamma': -2.6765234375, 'C': 6.9193359375000005}
    {'logGamma': -3.0538671875, 'C': 2.2935546875}
    {'logGamma': -3.593515625, 'C': 4.4136718749999995}
    {'logGamma': -3.337747403818736, 'C': 4.367953383541078}

    optima: (0.9995032051282051, 0.9985177917320774, 0.9994871794871795, 0.9995238095238095, 0.9995032051282051)

    Mean AUC 0.999
