# Objective

To review the notion of overfitting and introduce validation as a technique for mitigating it.


## Preliminaries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Overfitting

Overfitting can occur when we do not have enough data to support the complexity of the model. As the number of data points increases, this kind of overfitting diminishes.

Even when the number of data points is fixed, overfitting can also occur because the target function is highly complex or because the data are very noisy.
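The effect of sample size can be checked on synthetic data. The sketch below is only an illustration under assumed settings: the sine target, the noise level of 0.3, the polynomial degree of 10 and the seed are all hypothetical choices, not taken from the Advertising example. It fits the same model to training sets of increasing size and reports the gap between out-of-sample and training error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # arbitrary seed, fixed for reproducibility


def sample(n_points, noise_sd):
    """Draw noisy observations of an assumed target, sin(pi * x), on [-1, 1]."""
    x = rng.uniform(-1, 1, n_points)
    y = np.sin(np.pi * x) + rng.normal(0, noise_sd, n_points)
    return x, y


def overfit_gap(n_points, noise_sd=0.3, degree=10):
    """Out-of-sample MSE minus training MSE for one least-squares polynomial fit."""
    x_train, y_train = sample(n_points, noise_sd)
    x_fresh, y_fresh = sample(5000, noise_sd)  # large fresh sample stands in for E_out
    model = Polynomial.fit(x_train, y_train, degree, domain=[-1, 1])
    e_train = np.mean((model(x_train) - y_train) ** 2)
    e_out = np.mean((model(x_fresh) - y_fresh) ** 2)
    return float(e_out - e_train)


gaps = {n: overfit_gap(n) for n in (15, 50, 500)}
for n, gap in gaps.items():
    print(f"N = {n:>3}: E_out - E_train = {gap:.3f}")
```

Changing `noise_sd` or `degree` in `overfit_gap` is a quick way to explore the other two drivers mentioned above.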

Overfitting can be summarized as (Abu-Mostafa et al. 2012):

$$E_{out} = E_{\text{training}} + \text{overfit penalty}$$

When we focus on reducing the value of $E_{\text{training}}$ through complicated model architectures, $E_{out}$ starts to deviate from $E_{\text{training}}$ in the regime of 'finite data'. This is bad because, ultimately, we care only about $E_{out}$.
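A small experiment makes the overfit penalty visible. This is a minimal sketch on assumed synthetic data (a noisy sine curve with 20 training points; none of these settings come from the source). It fits polynomials of increasing degree to one fixed training set and compares the training error with the error on a large fresh sample, which stands in for $E_{out}$.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # arbitrary seed


def noisy_sine(n_points, noise_sd=0.3):
    """Assumed data-generating process, used for illustration only."""
    x = rng.uniform(-1, 1, n_points)
    return x, np.sin(np.pi * x) + rng.normal(0, noise_sd, n_points)


def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = noisy_sine(20)    # the 'finite data' regime
x_fresh, y_fresh = noisy_sine(5000)  # proxy for out-of-sample data

report = {}
for degree in (1, 3, 6, 12):
    model = Polynomial.fit(x_train, y_train, degree, domain=[-1, 1])
    report[degree] = (mse(model, x_train, y_train), mse(model, x_fresh, y_fresh))
    e_train, e_out = report[degree]
    print(f"degree {degree:>2}: E_training = {e_train:.3f}, "
          f"E_out = {e_out:.3f}, penalty = {e_out - e_train:.3f}")
```

Because each higher-degree model contains the lower-degree ones, the training error can only go down as the degree grows; the out-of-sample error has no such guarantee.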


# Validation sets

An important way to 'apply brakes' on complex models is to use a validation set to estimate $E_{out}$ and to use that estimate as a check, or stopping point, on further increases in model complexity.

Though similar in spirit to the test set, the validation set must be distinct from it: using the test set repeatedly to make modeling decisions invalidates the Hoeffding bound that links $E_{out}$ and $E_{\text{test}}$.


The best use of validation sets is when we want to select the model we expect to have the lowest $E_{out}$ from a set of $M$ candidate models.

By choosing a validation set of size $K$ extracted at random from the training data, we can bound the $E_{out}$ of the selected model as:

$$E_{out} \le E_{val} + O\left(\sqrt{\dfrac{\ln M}{K}}\right)$$

The trade-offs here are:
- Larger validation sets give a better estimate of $E_{out}$, but they leave that much less training data with which to build a good model
- As we increase the number of candidate models $M$ evaluated on the validation set, $E_{out}$ can deviate further from $E_{val}$
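To get a feel for both trade-offs, the penalty term can be evaluated numerically. The sketch below uses one concrete Hoeffding-style form of the $O(\cdot)$ term, $\sqrt{\ln(2M/\delta) / (2K)}$, which holds with probability at least $1 - \delta$ under the assumption that each pointwise error lies in $[0, 1]$ (for example, classification error). The function name and the default $\delta = 0.05$ are illustrative choices, and the numbers are not a guarantee for the unbounded squared error used later in this notebook.

```python
import math


def validation_penalty(n_models, n_validation, delta=0.05):
    """Hoeffding-style slack between E_val and E_out for the best of n_models.

    Assumes pointwise errors bounded in [0, 1] and candidate models that were
    fixed before the validation set was looked at.
    """
    return math.sqrt(math.log(2 * n_models / delta) / (2 * n_validation))


for n_validation in (25, 100, 400):
    row = ", ".join(f"M={m}: {validation_penalty(m, n_validation):.3f}"
                    for m in (1, 10, 100))
    print(f"K = {n_validation:>3} -> {row}")
```

The penalty grows only logarithmically in $M$ but shrinks like $1/\sqrt{K}$, so quadrupling the validation set halves the slack.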

Since we expect a model with a lower $E_{val}$ to also have a lower $E_{out}$, we can use $E_{val}$ to compare models, hyperparameters, or any other model-related decision. We accept that the $E_{val}$ of the selected model is an optimistically biased estimate of its $E_{out}$ and use it only to compare models. Once we have the best model, we merge the validation set back into the training data and train the final model.
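The bias introduced by selection can be illustrated with a simulation. This is a deliberately extreme sketch under assumed conditions: every candidate model has the same true error of 0.5 and the candidates' validation errors are independent, so picking the 'best' one only picks up noise. The values of `n_models`, `n_validation` and `n_trials` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

n_models = 50        # M candidate models
n_validation = 100   # K validation points
true_error = 0.5     # assumed E_out, identical for every candidate
n_trials = 2000      # independent repetitions of the selection experiment

# Each entry is the fraction of validation points one candidate gets wrong
val_errors = rng.binomial(n_validation, true_error,
                          size=(n_trials, n_models)) / n_validation

# E_val of the winning candidate in each trial
selected = val_errors.min(axis=1)

print(f"mean E_val of a single, pre-chosen model: {val_errors[:, 0].mean():.3f}")
print(f"mean E_val of the selected model:         {selected.mean():.3f}")
print(f"true E_out of every model:                {true_error:.3f}")
```

For a single model fixed in advance, $E_{val}$ is unbiased; the optimism appears only once we take the minimum over candidates. This is why the final error is reported on a test set that played no part in selection.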

# Example: **More Advertising**

## Model

Here we tune the degree of the polynomial features.

In [2]:
advertising_df = pd.read_csv("https://raw.githubusercontent.com/nguyen-toan/ISLR/master/dataset/Advertising.csv",
                             index_col=0)

In [3]:
poly_degrees = [1, 2, 3, 4, 5]
training_errors, validation_errors = [], []

In [4]:
%%time
# The splits do not depend on the polynomial degree, so make them once
advertising_X, advertising_y = (advertising_df.drop('Sales', axis=1),
                                advertising_df.Sales)

# Hold out 20% of the data as the test set
advertising_Xtrain, advertising_Xtest, advertising_ytrain, advertising_ytest = train_test_split(
    advertising_X, advertising_y, test_size=0.2, random_state=20130810)

# Carve a validation set (20%) out of the remaining training data
advertising_Xtrain, advertising_Xvalid, advertising_ytrain, advertising_yvalid = train_test_split(
    advertising_Xtrain, advertising_ytrain, test_size=0.2, random_state=20130810)

for poly_degree in poly_degrees:
    polynomial_features = PolynomialFeatures(degree=poly_degree)

    # Fit the feature expansion on the training data only
    advertising_Xtrain_poly = polynomial_features.fit_transform(advertising_Xtrain)
    advertising_Xvalid_poly = polynomial_features.transform(advertising_Xvalid)

    # PolynomialFeatures already adds a bias column, hence fit_intercept=False
    learner_lm = LinearRegression(fit_intercept=False)
    learner_lm.fit(advertising_Xtrain_poly, advertising_ytrain)

    training_errors.append(mean_squared_error(advertising_ytrain,
                                              learner_lm.predict(advertising_Xtrain_poly)))

    validation_errors.append(mean_squared_error(advertising_yvalid,
                                                learner_lm.predict(advertising_Xvalid_poly)))

CPU times: user 36.5 ms, sys: 5.84 ms, total: 42.3 ms
Wall time: 62.9 ms


In [5]:
pd.DataFrame({'poly_degree': poly_degrees,
              'training_errors': training_errors,
              'validation_errors': validation_errors})

|   | poly_degree | training_errors | validation_errors |
|---|---|---|---|
| 0 | 1 | 2.299952 | 3.242601 |
| 1 | 2 | 0.173942 | 0.510628 |
| 2 | 3 | 0.099855 | 0.461736 |
| 3 | 4 | 0.066527 | 1.026251 |
| 4 | 5 | 0.049705 | 2.855421 |


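Overfitting begins beyond polynomial degree 3: the validation error is lowest at degree 3 (0.461736) and rises for degrees 4 and 5, even though the training error keeps falling.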
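The same choice can be made programmatically rather than by reading the table. The sketch below copies the errors reported above into plain lists so that it runs on its own; in the notebook, the `poly_degrees`, `training_errors` and `validation_errors` lists built earlier could be used directly. The name `best_degree` is introduced here for illustration.

```python
# Errors copied from the results table above
poly_degrees = [1, 2, 3, 4, 5]
training_errors = [2.299952, 0.173942, 0.099855, 0.066527, 0.049705]
validation_errors = [3.242601, 0.510628, 0.461736, 1.026251, 2.855421]

# Pick the degree with the lowest validation error
best_error, best_degree = min(zip(validation_errors, poly_degrees))
print(f"best degree: {best_degree} (validation MSE {best_error})")

# Gap between validation and training error at each degree
for degree, e_train, e_val in zip(poly_degrees, training_errors, validation_errors):
    print(f"degree {degree}: E_val - E_training = {e_val - e_train:.3f}")
```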

Retrain the best model (degree 3) on the training and validation data combined.

In [6]:
advertising_X, advertising_y = (advertising_df.drop('Sales', axis=1),
                                advertising_df.Sales)

advertising_Xtrain, advertising_Xtest, advertising_ytrain, advertising_ytest = train_test_split(advertising_X,
                                                                                                advertising_y,
                                                                                                test_size=0.2,
                                                                                                random_state=20130810)
polynomial_features = PolynomialFeatures(degree=3)

advertising_Xtrain_poly = polynomial_features.fit_transform(advertising_Xtrain)
advertising_Xtest_poly = polynomial_features.transform(advertising_Xtest)

learner_lm = LinearRegression(fit_intercept=False)

learner_lm.fit(advertising_Xtrain_poly, advertising_ytrain)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=False)

Estimate of $E_{out}$ on the held-out test set:

In [7]:
mean_squared_error(advertising_ytest,
                   learner_lm.predict(advertising_Xtest_poly))

0.7582771808490137

# Summary

Validation is a powerful technique that helps us avoid overfitting with complicated model architectures. As long as the number of models explored is reasonable, the validation error is a very good estimate of the out-of-sample error and hence a powerful model selection tool.