## $\S$ 7.10.2. The Wrong and Right Way to Do Cross-Validation

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

1. Screen the predictors: find a subset of "good" predictors that show fairly strong (univariate) correlation with the class labels.
1. Using just this subset of predictors, build a multivariate classifier.
1. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation?

Consider a scenario with $N=50$ samples in two equal-sized classes, and $p=5000$ quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%.

We carried out the above recipe,
1. choosing in step (1) the 100 predictors having highest correlation with the class labels, and then
1. using a 1-nearest-neighbor (1NN) classifier, based on just these 100 predictors, in step (2).

Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.
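The flawed recipe can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for the reported simulation: the helper names (`simulate`, `top_correlated`, `one_nn_predict`, `cv_error_wrong`) are hypothetical, "highest correlation" is read as the largest signed correlation, five folds are assumed, and only 10 simulations are run to keep it quick.

```python
import numpy as np

def simulate(rng, n=50, p=5000):
    """Pure-noise data: Gaussian predictors independent of balanced 0/1 labels."""
    X = rng.standard_normal((n, p))
    y = np.repeat([0, 1], n // 2)
    return X, y

def top_correlated(X, y, n_keep=100):
    """Indices of the n_keep columns of X most positively correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[-n_keep:]

def one_nn_predict(X_train, y_train, X_test):
    """Label each test row with the label of its nearest training row."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def cv_error_wrong(X, y, rng, n_folds=5):
    """Cross-validate only the classifier; screening is done once, up front."""
    keep = top_correlated(X, y)  # the flaw: screening sees every sample
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    n_wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        pred = one_nn_predict(X[np.ix_(train, keep)], y[train],
                              X[np.ix_(test, keep)])
        n_wrong += np.sum(pred != y[test])
    return n_wrong / len(y)

rng = np.random.default_rng(0)  # arbitrary seed
wrong_errors = [cv_error_wrong(*simulate(rng), rng) for _ in range(10)]
print(f"mean CV error, screening outside CV: {np.mean(wrong_errors):.2f}")
```

Under these assumptions the printed estimate should fall far below 50%, even though the labels are unrelated to every predictor.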

### What happened?

The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of _all of the samples_. Leaving samples out _after_ the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors "have already seen" the left out samples.

Figure 7.10 (top panel) illustrates the problem. We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples. We see that the correlations average about 0.28, rather than the value of 0 one might expect for predictors that are independent of the labels.
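The leakage described here can be checked numerically. The sketch below is an illustration only: the `correlations` helper and the seed are assumptions, and "largest correlation" is read as largest signed correlation. It screens on all 50 samples and then recomputes the correlations on a random subset of 10.

```python
import numpy as np

def correlations(X, y):
    """Pearson correlation of each column of X with the label vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

rng = np.random.default_rng(1)  # arbitrary seed
X = rng.standard_normal((50, 5000))  # predictors independent of the labels
y = np.repeat([0, 1], 25)

# Screen on ALL samples, as in the flawed recipe.
keep = np.argsort(correlations(X, y))[-100:]

# Correlations of the pre-selected predictors over one held-out-sized subset.
subset = rng.choice(50, size=10, replace=False)
subset_corr = correlations(X[np.ix_(subset, keep)], y[subset])
print(f"mean correlation on the 10 samples: {subset_corr.mean():.2f}")
```

The mean should come out clearly positive rather than near zero, because the 10 samples helped choose the predictors.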

Here is the correct way to carry out cross-validation in this example:
1. Divide the samples into $K$ cross-validation folds (groups) at random.
1. For each fold $k=1,2,\ldots,K$:
    1. Find a subset of "good" predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold $k$.
    1. Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold $k$.
    1. Use the classifier to predict the class labels for the samples in fold $k$.

The error estimates from step 2(c) are then accumulated over all $K$ folds, to produce the cross-validation estimate of prediction error. The lower panel of Figure 7.10 shows the correlations of class labels with the 100 predictors chosen in step 2(a) of the correct procedure, over the samples in a typical fold $k$. We see that they average about zero, as they should.
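The correct procedure can be sketched as follows. As before, this is a hedged illustration rather than the authors' code: the helper names are hypothetical, selection uses the largest signed correlation, $K=5$ is assumed, and 10 simulations stand in for a longer run.

```python
import numpy as np

def top_correlated(X, y, n_keep=100):
    """Indices of the n_keep columns of X most positively correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[-n_keep:]

def one_nn_predict(X_train, y_train, X_test):
    """Label each test row with the label of its nearest training row."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def cv_error_right(X, y, rng, n_folds=5):
    """Cross-validate the whole pipeline: screening is repeated in every fold."""
    folds = np.array_split(rng.permutation(len(y)), n_folds)  # step 1
    n_wrong = 0
    for test in folds:                                        # step 2
        train = np.setdiff1d(np.arange(len(y)), test)
        keep = top_correlated(X[train], y[train])             # 2(a): training samples only
        pred = one_nn_predict(X[np.ix_(train, keep)], y[train],
                              X[np.ix_(test, keep)])          # 2(b) and 2(c)
        n_wrong += np.sum(pred != y[test])
    return n_wrong / len(y)                                   # accumulated over folds

rng = np.random.default_rng(0)  # arbitrary seed
right_errors = []
for _ in range(10):
    X = rng.standard_normal((50, 5000))  # predictors independent of the labels
    y = np.repeat([0, 1], 25)
    right_errors.append(cv_error_right(X, y, rng))
print(f"mean CV error, screening inside CV: {np.mean(right_errors):.2f}")
```

Because the held-out samples never influence which predictors are kept, the estimate should sit near the true error rate of 50%.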

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be "left out" before any selection or filtering steps are applied.

> There is one qualification: Initial _unsupervised_ screening steps can be done before samples are left out.

For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross-validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.
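A quick check of this claim, as a sketch under the same simulated setting (the seed and variable names are assumptions): predictors chosen by variance alone should show no systematic correlation with the labels.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
X = rng.standard_normal((50, 5000))  # predictors independent of the labels
y = np.repeat([0, 1], 25)

# Unsupervised screen: keep the 1000 highest-variance predictors.
# The class labels y are not used here.
high_var = np.argsort(X.var(axis=0))[-1000:]

# Correlation of each retained predictor with the labels, over all samples.
Xc = X[:, high_var] - X[:, high_var].mean(axis=0)
yc = y - y.mean()
corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
print(f"mean correlation with labels: {corr.mean():+.3f}")
```

The mean correlation should be close to zero, in contrast to what happens when predictors are screened by their correlation with the labels.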

### Even in published papers!

While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top-ranked journals. With the large numbers of predictors that are so common in genomics and other areas, the potential consequences of this error have also increased dramatically; see Ambroise and McLachlan (2002) for a detailed discussion of this issue.