Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

<center><h1>Cross-Validation</h1></center>

## Cross-Validation: What Is It?

Cross-validation is a procedure that randomly splits data into parts, uses some parts to _train_ an ML algorithm, and uses the remaining parts to _test_ how well the algorithm performs on new data, that is, data it "hasn't seen" while learning.

## Cross-Validation: Why Do It?

An important ML goal is to train models that _generalize_ well, that perform well on _future data_.   Most ML methods are prone to _over-fitting_ the data used to train them.   

This over-fitting has been referred to by some as "learning the data," rather than learning _patterns_ in the data. Models that are over-fit will perform poorly on new data.
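
To make "learning the data" concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic data (the label is 1 when x > 0.5, then flipped 20% of the time), the sample sizes, and the choice of a 1-nearest-neighbor classifier as the "memorizing" learner.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """1-NN: return the label of the single closest training point."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def error_rate(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the 1-NN rule misclassifies."""
    wrong = sum(nearest_neighbor_predict(train_x, train_y, x) != y
                for x, y in zip(xs, ys))
    return wrong / len(xs)

def make_data(n, rng, flip=0.2):
    """Hypothetical data: label is 1 when x > 0.5, flipped with probability `flip`."""
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < flip) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(2000, rng)

# Each training point is its own nearest neighbor, so the noise is memorized.
train_error = error_rate(train_x, train_y, train_x, train_y)
test_error = error_rate(train_x, train_y, test_x, test_y)
print(f"training error: {train_error:.3f}")
print(f"test error:     {test_error:.3f}")
```

The training error is zero because the learner has simply stored the training labels, noise included. The test error is much higher, and in this setup it should even exceed the 20% label-noise rate that a rule using only the x > 0.5 pattern would achieve.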

## ML Validation Flavors

(See [James et al. 2017, Chapter 5](http://www-bcf.usc.edu/~gareth/ISL/))

* __Validation Set Method__: randomly divide the available data into a _training set_ and a _test set_. But,
    * test performance measures may be highly variable, since they depend on which observations happen to land in a test set with small N
    * may overestimate the test error rate, since the model is trained on only part of the data (again, a smaller N)
    * Not really a _cross_ validation technique
* __"Leave One Out"__ cross-validation: Each observation is used as validation data with training on all the other observations.
    * test performance measures tend to be highly variable
* __K-fold__ cross-validation: data is randomly split into K "folds" of similar size. Each fold is used in turn as test data after training on the observations _not_ in that fold.
    * Perhaps more biased test performance measures than Leave One Out, but possibly less variable ones.
    * A very frequently used "compromise."
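
The three flavors above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not a reference implementation: the data are synthetic (a noisy straight line), the learner is simple least squares, and the helper names (`fit_line`, `k_fold_indices`, `cross_validate`) are made up for this example. In practice a library routine such as scikit-learn's `KFold` or `cross_val_score` would normally be used instead.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on (xs, ys)."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of similar size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, rng):
    """Average test MSE over k folds; each fold is held out exactly once."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k, rng):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(fold_errors) / k

# Hypothetical data: y = 2 + 0.5x plus Gaussian noise with variance 1.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Validation set method: one random 50/50 split.
order = list(range(len(xs)))
rng.shuffle(order)
train, test = order[:30], order[30:]
model = fit_line([xs[i] for i in train], [ys[i] for i in train])
validation_set_mse = mse(model, [xs[i] for i in test], [ys[i] for i in test])

five_fold_mse = cross_validate(xs, ys, k=5, rng=rng)
loocv_mse = cross_validate(xs, ys, k=len(xs), rng=rng)  # K = N is Leave One Out

print(f"validation set: {validation_set_mse:.3f}")
print(f"5-fold CV:      {five_fold_mse:.3f}")
print(f"LOOCV:          {loocv_mse:.3f}")
```

Re-running with different seeds is a quick way to see how much each estimate moves around from split to split. Note that Leave One Out is just the K = N special case of the same loop.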
    

<h2>Bias-Variance Trade-off</h2>

(Based on [James et al. 2017, Chapter 2](http://www-bcf.usc.edu/~gareth/ISL/))

For a regression-type ML "learner," the expected MSE for some new data point x<sub>0</sub> is  

\begin{align*}
\large
E\left[\left(y_0 - \hat{g}(x_0)\right)^2\right] = \mathrm{Var}(\hat{g}(x_0)) + \left[\mathrm{Bias}(\hat{g}(x_0))\right]^2 + \mathrm{Var}(\epsilon)
\end{align*}

Where:
 
Var($\hat{g}(x_0)$) refers to the amount by which $\hat{g}(x_0)$ would change when training is done using different data sets.  Different training datasets will result in different $\hat{g}(x_0)$.  That is, the model will vary with different datasets.

_Bias_ refers to error in approximating the "true" g(x<sub>i</sub>), e.g. the error in approximating a nonlinear relationship with a linear one.

Var($\epsilon$) is "noise" that's independent of the model and the training data, often called the _irreducible error_. (From the Big Bang?) It's the lower limit on the expected test MSE.
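
The decomposition can be checked numerically with a small simulation. The sketch below is only an illustration under assumed settings: a "true" function g(x) = x², Gaussian noise with standard deviation 0.3, a straight-line learner (so there is bias from approximating a nonlinear relationship with a linear one), and the new point x<sub>0</sub> = 1. It repeatedly draws a fresh training set, refits, and predicts at x<sub>0</sub>.

```python
import random
import statistics

def true_g(x):
    """Assumed 'true' regression function (nonlinear)."""
    return x ** 2

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

SIGMA = 0.3  # standard deviation of epsilon (assumed)
X0 = 1.0     # the new data point x_0 (assumed)
rng = random.Random(42)

predictions, squared_errors = [], []
for _ in range(5000):
    # A fresh training set each time: this is what Var(g_hat(x_0)) averages over.
    xs = [rng.random() for _ in range(30)]
    ys = [true_g(x) + rng.gauss(0, SIGMA) for x in xs]
    a, b = fit_line(xs, ys)
    g_hat = a + b * X0
    y0 = true_g(X0) + rng.gauss(0, SIGMA)  # a new noisy response at x_0
    predictions.append(g_hat)
    squared_errors.append((y0 - g_hat) ** 2)

variance = statistics.pvariance(predictions)
bias = statistics.fmean(predictions) - true_g(X0)
noise = SIGMA ** 2
expected_mse = statistics.fmean(squared_errors)

print(f"simulated E[(y0 - g_hat(x0))^2]: {expected_mse:.4f}")
print(f"Var + Bias^2 + Var(eps):         {variance + bias ** 2 + noise:.4f}")
```

The two printed numbers should agree up to Monte Carlo error. The bias here is negative because a straight line fitted on [0, 1] under-predicts x² at the right-hand edge.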

## Best Possible Fit in Test Data

To minimize _expected_ error when predicting new (or test) data, an ML algorithm needs to achieve _both_ low bias and low variance at the same time.

Heuristics:

* Using more "flexible" methods (i.e., more complicated models) will tend to result in _increased_ variance and _decreased_ bias.
* At some point, increases in flexibility will have little additional effect on bias, but variance will _continue_ to _increase_.

## Regression Models for Three Different Datasets

Examples of MSE, bias, and variance for regression models trained on three data sets from [James et al. (2017, Chapter 5)](http://www-bcf.usc.edu/~gareth/ISL/):

<img src="./images/bias-variance-james-2017.png" alt="bias-variance trade-off examples" width="1500"/>


<h1>There's No Free Lunch!</h1> 

* David Wolpert (1996): "...averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points."  (Goodfellow et al., 2016, p. 113.)
    * The bottom line: _No machine learning algorithm is universally better than any other one._
    * Wolpert's [NASA presentation](http://no-free-lunch.org/coev.pdf)
* _Two_ [No Free Lunch Theorems](http://no-free-lunch.org/)
    * Wolpert (1996): in a "noise-free scenario where the loss function is the misclassification rate, if one is interested in off-training-set error, then there are no a priori distinctions between learning algorithms."
    
* Bottom Line:  There is no _best_ algorithm.  (Also, no _best regularization_.)
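
A toy enumeration gives the flavor of the result. It is an illustration, not a proof. The setup is assumed for the example: five input points, binary labels, no noise, three points used for training and two held out as off-training-set points, and three made-up learners. The error of each learner is averaged over _every_ possible target function.

```python
from itertools import product

TRAIN_POINTS = (0, 1, 2)
TEST_POINTS = (3, 4)  # off-training-set points

def majority_learner(train_labels):
    """Predict the most common training label everywhere."""
    return int(sum(train_labels) * 2 > len(train_labels))

def minority_learner(train_labels):
    """Deliberately perverse: predict the least common training label."""
    return 1 - majority_learner(train_labels)

def always_zero_learner(train_labels):
    """Ignore the training data entirely."""
    return 0

def average_off_training_set_error(learner):
    """Mean misclassification rate over all 2^5 possible target functions."""
    errors = []
    n_points = len(TRAIN_POINTS) + len(TEST_POINTS)
    for target in product((0, 1), repeat=n_points):
        prediction = learner([target[i] for i in TRAIN_POINTS])
        wrong = sum(prediction != target[i] for i in TEST_POINTS)
        errors.append(wrong / len(TEST_POINTS))
    return sum(errors) / len(errors)

for learner in (majority_learner, minority_learner, always_zero_learner):
    print(learner.__name__, average_off_training_set_error(learner))
```

All three learners come out at exactly 0.5. When every target function is considered equally likely, the training labels say nothing about the held-out labels, so the sensible-looking learner does no better than the perverse one. Real problems are not uniform over all possible targets, which is why some algorithms do win on particular kinds of data.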

<h2>Bias, Variance, and Model Fit Bullseyes!</h2>

[Bias, variance, algorithm overfitting, and algorithm underfitting](https://medium.com/@martinezbielosdaniel/bias-variance-tradeoff-overfitting-and-underfitting-c63799cb4851) are in general related.  Here's a rather famous "Bullseye" representation:


<img src="./images/bias-variance-bullseyes.jpeg" alt="bias-variance bullseyes" width="400"/>