Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

## Regularization

"No Free Lunch" implies that ML algorithms will need to be "tweaked" in order to perform well on specific applications.  _Regularization_ methods are used to do this.

According to Goodfellow et al. (2016), "_Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error._" (p. 117.)

Regularization may involve accepting some additional bias in return for a reduction in variance.


## Regularization Methods

There are many.  Common examples:

* Increasing or decreasing the number of observations or the dimensionality of the data.
* Penalizing the _cost function_ being minimized.
* Using "shrinkage estimators."
* Employing ensembles.
* Avoiding "over-training."
* Data augmentation.
* Early stopping.
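
As one illustration of penalizing the cost function, here is a minimal sketch comparing ordinary least squares with ridge (L2-penalized) regression in `scikit-learn`. The data are synthetic, and the coefficient values and `alpha=10.0` are arbitrary choices made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: 50 observations, 5 predictors (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_coef + rng.normal(scale=0.5, size=50)

# Unpenalized least squares versus ridge, which adds alpha * ||coef||^2
# to the squared-error cost being minimized.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
```

The ridge coefficients are shrunk toward zero relative to the least squares estimates. That shrinkage is the added bias that is accepted in the hope of reducing variance.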

## Identification

### Identification

In many statistical applications, parameters need to be adequately _identified_ in order to get "good" (consistent) estimates of them.

An example that may be familiar is the contrast coding of categorical predictor variables used in regression models. When the model includes an intercept, you can have at most k-1 regression coefficient estimates for a k-category discrete predictor variable.

When you use regression for ML predictive applications, you're not primarily interested in the estimates of coefficients, but you _are_ interested in your algorithm learning patterns in a way that generalizes well to new data.

Try running the patient satisfaction k-fold CV regression examples using all three patient categories as 0/1 coded predictor variables to see what happens.
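
The patient satisfaction data are not reproduced here, so the following is only a sketch of the underlying issue using a made-up three-category variable. It checks the rank of a design matrix that contains an intercept plus all three 0/1 indicator columns, and compares it with a matrix that drops one indicator.

```python
import numpy as np

# Hypothetical category codes (0, 1, 2) for eight observations.
categories = np.array([0, 1, 2, 0, 1, 2, 0, 1])

# One 0/1 indicator column per category.
dummies_all = np.eye(3)[categories]
intercept = np.ones((len(categories), 1))

# Intercept plus all k = 3 indicators: 4 columns.
X_full = np.hstack([intercept, dummies_all])
# Intercept plus k - 1 = 2 indicators: 3 columns.
X_reduced = np.hstack([intercept, dummies_all[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("columns:", X_full.shape[1], "rank:", rank_full)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

The three indicator columns sum to the intercept column, so the full matrix has more columns than its rank and the coefficients are not uniquely determined. Dropping one indicator restores full column rank.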

In ML circles, dummy coding may be referred to as _one-hot encoding_. In `scikit-learn`, see [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder).
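
A brief sketch of `OneHotEncoder` follows. The category labels are hypothetical stand-ins, not the actual patient categories. The `drop="first"` option omits one indicator column per variable, giving k-1 columns for k categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels for a three-category predictor (one column).
patient_type = np.array(
    [["medical"], ["surgical"], ["psychiatric"], ["medical"]]
)

# All k indicator columns.
enc_all = OneHotEncoder()
full = enc_all.fit_transform(patient_type).toarray()

# k - 1 indicator columns: the first category (in sorted order) is dropped.
enc_drop = OneHotEncoder(drop="first")
reduced = enc_drop.fit_transform(patient_type).toarray()

print(enc_all.categories_[0])
print("all categories:", full.shape)
print("drop first:    ", reduced.shape)
```

The encoder returns a sparse matrix by default, which is why `.toarray()` is called before inspecting the shapes.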


### Data Reduction and Standardizing Transformations

The dimensionality of the data can sometimes be reduced so as to improve algorithm performance on new data.  _Principal Components Analysis_ (PCA) is frequently used for this.  PCA is an SVD-based method that can be used to project data into lower-dimensional spaces, albeit with some loss of information.

In `scikit-learn`, PCA methods can be found at

[sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
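
Here is a minimal sketch of projecting data onto fewer dimensions with `PCA`. The data are synthetic: six observed variables generated from two underlying factors plus a little noise, an assumption made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 observed columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# Project the 6-dimensional data onto the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of total variance retained by the 2 components.
explained = pca.explained_variance_ratio_.sum()
print("reduced shape:", X_reduced.shape)
print("variance retained:", explained)
```

The share of variance not retained is one way to quantify the loss of information from the projection.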

A variety of transformation methods, such as rescaling to mean = 0 and SD = 1, can be used to improve ML algorithm performance.  See [Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html).
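
As a small sketch of rescaling to mean 0 and SD 1, the following uses `StandardScaler` on a made-up two-column array whose columns are on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up predictors on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
Z = scaler.fit_transform(X)

print("column means:", Z.mean(axis=0))
print("column SDs:  ", Z.std(axis=0))
```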

## "Information From the Future!" Data Leakage

When creating validation data sets, and when using cross-validation, it's important not to train your algorithms on any information that couldn't be known when applying them to newly available data.  "Test" data used for validation or cross-validation serve as a proxy for data that "haven't yet been seen" by an algorithm.

Here are two simple examples in which data leakage would be an issue:

* Predicting whether hospital inpatients have had a myocardial infarction using whether they have been admitted to a cardiac unit;
* Predicting whether insured parties are late on making payments on their policy using the number of follow-up reminders that their payments are late.

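One thing to note about both of these examples is that "time's arrow" is reversed: the predictor is a consequence of the outcome being predicted, so it wouldn't be known at the time a prediction is needed.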


[Data Leakage](https://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf)

## A Common Cause of Leakage

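A common cause of leakage is rescaling or otherwise transforming training and test data together, in a way that uses estimates based on the combined data.

Avoid any transformation or rescaling of data that involves estimates (for example, means or standard deviations) computed using both training and test data. Compute such estimates from the training data alone, and then apply them unchanged to the test data.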
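
One way to keep rescaling inside each training fold is to wrap the scaler and the model in a `scikit-learn` pipeline and cross-validate the pipeline as a whole. The sketch below assumes synthetic data, a ridge regression model, and 5-fold CV, all chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=100)

# The scaler is part of the model, so within each fold its means and SDs
# are estimated from that fold's training rows only, then applied to the
# held-out rows.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("R^2 per fold:", scores)
```

By contrast, calling `StandardScaler().fit_transform(X)` on the full data before splitting would let the held-out rows influence the means and SDs used in training, which is the leakage described above.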