<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Agenda" data-toc-modified-id="Agenda-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Agenda</a></span></li><li><span><a href="#Motivation" data-toc-modified-id="Motivation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Motivation</a></span></li><li><span><a href="#The-Bias-Variance-Tradeoff" data-toc-modified-id="The-Bias-Variance-Tradeoff-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The Bias-Variance Tradeoff</a></span><ul class="toc-item"><li><span><a href="#A-Model-Example" data-toc-modified-id="A-Model-Example-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>A Model Example</a></span><ul class="toc-item"><li><span><a href="#Model-A" data-toc-modified-id="Model-A-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Model A</a></span></li><li><span><a href="#Model-B" data-toc-modified-id="Model-B-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Model B</a></span></li></ul></li><li><span><a href="#High-Bias-vs-High-Variance" data-toc-modified-id="High-Bias-vs-High-Variance-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>High-Bias vs High-Variance</a></span><ul class="toc-item"><li><span><a href="#Bias" data-toc-modified-id="Bias-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Bias</a></span></li><li><span><a href="#Variance" data-toc-modified-id="Variance-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Variance</a></span></li></ul></li><li><span><a href="#Let's-take-a-look-at-our-familiar-King-County-housing-data." 
data-toc-modified-id="Let's-take-a-look-at-our-familiar-King-County-housing-data.-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Let's take a look at our familiar King County housing data.</a></span></li><li><span><a href="#🧠-Knowledge-Check" data-toc-modified-id="🧠-Knowledge-Check-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>🧠 Knowledge Check</a></span></li></ul></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Train-Test Split</a></span><ul class="toc-item"><li><span><a href="#How-do-we-know-if-our-model-is-overfitting-or-underfitting?" data-toc-modified-id="How-do-we-know-if-our-model-is-overfitting-or-underfitting?-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>How do we know if our model is overfitting or underfitting?</a></span></li><li><span><a href="#Examples" data-toc-modified-id="Examples-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Examples</a></span></li><li><span><a href="#Should-you-ever-fit-on-your-test-set?" 
data-toc-modified-id="Should-you-ever-fit-on-your-test-set?-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Should you ever fit on your test set?</a></span></li><li><span><a href="#Now-check-performance-on-test-data" data-toc-modified-id="Now-check-performance-on-test-data-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Now check performance on test data</a></span></li><li><span><a href="#Knowledge-check" data-toc-modified-id="Knowledge-check-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Knowledge check</a></span></li><li><span><a href="#Same-procedure-with-a-polynomial-model" data-toc-modified-id="Same-procedure-with-a-polynomial-model-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Same procedure with a polynomial model</a></span></li><li><span><a href="#Exercise" data-toc-modified-id="Exercise-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Exercise</a></span></li></ul></li><li><span><a href="#Kfolds:-Even-More-Rigorous-Validation" data-toc-modified-id="Kfolds:-Even-More-Rigorous-Validation-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Kfolds: Even More Rigorous Validation</a></span></li></ul></div>

# Model Validation: The Bias-Variance Tradeoff and the Train-Test Split

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold

## Agenda

Students will be able to (SWBAT):

- **Explain** the bias-variance tradeoff and the correlative notions of underfit and overfit models
- **Describe** a train-test split and **explain** its purpose in the context of predictive statistics / machine learning
- **Explain** the algorithm of cross-validation

## Motivation

At this point, we have seen different ways to create models from our data through different linear regression techniques. That's great, but just like a student practicing for a big end-of-the-year test, we want to make sure our _model_ is ready to predict on data it hasn't seen yet. We want to know if the model we made is ready to make predictions for data in the "wild".

Usually, when our model is ready to be used in the "real world", we refer to this as putting our model into **production** or **deploying** our model. The data it will use to make predictions will be data it has never seen before. Similar to parents sending a teenager out into the world to make it on their own, we want to make sure our model is ready for the world of new data!

But you might be thinking, how do I make sure the model I've been cultivating with my own data is ready? This is where we use ***model validation*** techniques to ensure our model can generalize to data it hasn't directly seen before. Going back to our analogy of a teenager ready to leave home, we want our teenager (model) to be informed so it's not naive, but also flexible enough to adjust to new situations.

We'll go over how to ensure our model is ready, but first we have to discuss how our model can make errors in the context of the **bias-variance tradeoff**.

## The Bias-Variance Tradeoff

### A Model Example

We typically describe a model by how _complex_ it is in making predictions.

Let's take a look at this data with just one feature and a target:

<!--TODO: Replace with a dataset and code -->
![](https://camo.githubusercontent.com/36a1cb13983f39fc58ecdfffb415b2258e73ba1c/68747470733a2f2f6769746875622e636f6d2f6c6561726e2d636f2d73747564656e74732f6473632d322d32342d30372d626961732d76617269616e63652d74726164652d6f66662d6f6e6c696e652d64732d73702d3030302f7261772f6d61737465722f696e6465785f66696c65732f696e6465785f375f312e706e67)

We can probably picture how a good model would fit this data. Let's look at a couple of models and discuss how they're making mistakes.

#### Model A

<!--TODO: Replace with code to implement simple model -->

![](https://camo.githubusercontent.com/aca63456aa3ee6756c493cb988f748b4105d088f/68747470733a2f2f6769746875622e636f6d2f6c6561726e2d636f2d73747564656e74732f6473632d322d32342d30372d626961732d76617269616e63652d74726164652d6f66662d6f6e6c696e652d64732d73702d3030302f7261772f6d61737465722f696e6465785f66696c65732f696e6465785f31315f302e706e67)

What do we observe here? How would you describe where the model is failing?

#### Model B

<!--TODO: Replace with code to implement complex (overfitting) model -->

![](https://camo.githubusercontent.com/8941dcea3ace95d7353e7397579b8e7fb5b40b18/68747470733a2f2f6769746875622e636f6d2f6c6561726e2d636f2d73747564656e74732f6473632d322d32342d30372d626961732d76617269616e63652d74726164652d6f66662d6f6e6c696e652d64732d73702d3030302f7261772f6d61737465722f696e6465785f66696c65732f696e6465785f31345f302e706e67)

What do we observe here? How would you describe where the model is failing?
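As a rough stand-in for the two pictures above, here is a minimal sketch on synthetic data. The quadratic signal, the noise level, and the polynomial degree are all assumptions chosen for illustration; they are not the data behind the images. Model A is a straight line and Model B is a high-degree polynomial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Hypothetical data: one feature, a curved (quadratic) signal plus noise
x = np.sort(rng.uniform(-3, 3, size=30))
y_demo = x ** 2 + rng.normal(scale=1.0, size=x.shape)
X_demo = x.reshape(-1, 1)

# Model A: a straight line, too rigid for a curved signal
model_a = LinearRegression().fit(X_demo, y_demo)

# Model B: a degree-12 polynomial, flexible enough to chase the noise
model_b = make_pipeline(PolynomialFeatures(12), LinearRegression()).fit(X_demo, y_demo)

mse_a = mean_squared_error(y_demo, model_a.predict(X_demo))
mse_b = mean_squared_error(y_demo, model_b.predict(X_demo))
print(f'Model A training MSE: {mse_a:.3f}')
print(f'Model B training MSE: {mse_b:.3f}')
```

Model B will have the lower training error here, but that alone does not make it the better model: the training error cannot tell us how much of that fit is noise.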

### High-Bias vs High-Variance

We can break the model's mistakes (the error) into three parts:

- Error inherent in the data (noise): **irreducible error**
- Error from not capturing patterns in the data (too simple): **bias**
- Error from picking up patterns in the training data that don't generalize well (too complex): **variance**

We can summarize this in an equation for the _mean squared error_ (MSE):

$MSE = Bias(\hat{y})^2 + Var(\hat{y}) + \sigma^2$
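One way to make this decomposition concrete is to simulate it. The sketch below repeatedly draws a fresh training set from a known function, refits a polynomial, and looks at the predictions at a single point. Everything in it is an assumption for illustration: the true function, the noise level `sigma`, the evaluation point `x0`, and the degrees compared.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0              # noise standard deviation; irreducible error is sigma**2
x0 = 2.0                 # the point at which we evaluate predictions
n_train, n_reps = 30, 2000


def true_f(x):
    """Hypothetical true signal."""
    return x ** 2


def simulate(degree):
    """Estimate bias^2, variance, and MSE at x0 for a polynomial of this degree."""
    preds = np.empty(n_reps)
    sq_errors = np.empty(n_reps)
    for i in range(n_reps):
        # Draw a fresh training set and refit the model
        x = rng.uniform(-3, 3, size=n_train)
        y = true_f(x) + rng.normal(scale=sigma, size=n_train)
        coefs = np.polyfit(x, y, deg=degree)
        preds[i] = np.polyval(coefs, x0)
        # Compare against a fresh noisy observation at x0
        y_new = true_f(x0) + rng.normal(scale=sigma)
        sq_errors[i] = (y_new - preds[i]) ** 2
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance, sq_errors.mean()


results = {}
for degree in (1, 2, 6):
    bias_sq, variance, mse = simulate(degree)
    results[degree] = (bias_sq, variance, mse)
    print(f'degree {degree}: bias^2={bias_sq:.3f}, variance={variance:.3f}, '
          f'bias^2+variance+sigma^2={bias_sq + variance + sigma ** 2:.3f}, '
          f'simulated MSE={mse:.3f}')
```

In a run like this, the straight line should show the largest squared bias, the highest-degree fit should show the largest variance, and the simulated MSE should land near the sum of the three terms, up to simulation noise.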

#### Bias

**High-bias** algorithms tend to be less complex, with simple or rigid underlying structure.

+ They train models that are consistent, but inaccurate on average.
+ These include linear or parametric algorithms such as regression and naive Bayes.
+ The following sorts of difficulties could lead to high bias:
      - We did not include the correct predictors;
      - We did not take interactions into account;
      - We missed a non-linear (polynomial) relationship. 
      
High-bias models are generally **underfit**: the models have not picked up enough of the signal in the data. Even though they may be consistent, they don't perform particularly well even on the data they were fit on, so they will be consistently inaccurate.

#### Variance

On the other hand, **high-variance** algorithms tend to be more complex, with flexible underlying structure.

+ They train models that are accurate on average, but inconsistent.
+ These include non-linear or non-parametric algorithms such as decision trees and nearest-neighbor models.
+ The following sorts of difficulties could lead to high variance:
      - We included an unreasonably large number of predictors;
      - We created new features by squaring and cubing each feature.

High-variance models are **overfit**: the models have picked up on the noise as well as the signal in the data. Even though they may perform well on the initial data, their accuracy on new data will be inconsistent.

While we build our models, we have to keep this relationship in mind. If we build complex models, we risk overfitting: their predictions will vary greatly when introduced to new data. If our models are too simple, the predictions as a whole will be inaccurate.

The goal is to build a model with enough complexity to be accurate, but not so much that it becomes erratic.

![optimal](img/optimal_bias_variance.png)
Image source: http://scott.fortmann-roe.com/docs/BiasVariance.html

### Let's take a look at our familiar King County housing data. 

In [None]:
np.random.seed(42)
df = pd.read_csv('data/king_county.csv', index_col='id')
df = df.iloc[:, :12]
df.head()

In [None]:
df.shape

In [None]:
np.random.seed(42)

# Let's generate random subsets of our data

# Date is not in the correct format so we are dropping it for now.

sample_point = df.drop('price', axis=1).sample(1)
point_preds = []

r_2 = []
simple_rmse = []

for i in range(100):
    
    df_sample = df.sample(5000, replace=True)
    y = df_sample.price
    X = df_sample.drop('price', axis=1)
    
    lr = LinearRegression()
    lr.fit(X, y)
    
    y_hat = lr.predict(X)
    simple_rmse.append(np.sqrt(mean_squared_error(y, y_hat)))
    r_2.append(lr.score(X, y))
    
    y_hat_point = lr.predict(sample_point)
    
    point_preds.append(y_hat_point)

In [None]:
print(f'simple mean RMSE: {np.mean(simple_rmse)}')
print(f'simple variance of point predictions: {np.var(point_preds)}')

In [None]:
df = pd.read_csv('data/king_county.csv', index_col='id')

pf = PolynomialFeatures(2)

df_poly = pd.DataFrame(pf.fit_transform(df.drop('price', axis=1)))
df_poly.index = df.index
df_poly['price'] = df['price']

cols = list(df_poly)

# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('price')))

df_poly = df_poly.loc[:, cols]

df_poly.head(10)

In [None]:
np.random.seed(42)

sample_point = df_poly.drop('price', axis=1).sample(1)

r_2 = []
point_preds_comp = []
complex_rmse = []
for i in range(100):
    
    df_sample = df_poly.sample(1000, replace=True)
    y = df_sample.price
    X = df_sample.drop('price', axis=1)
    
    lr = LinearRegression()
    lr.fit(X, y)
    y_hat = lr.predict(X)
    complex_rmse.append(np.sqrt(mean_squared_error(y, y_hat)))
    r_2.append(lr.score(X, y))
    
    y_hat_point = lr.predict(sample_point)
    
    point_preds_comp.append(y_hat_point)

In [None]:
print(f'simple mean RMSE: {np.mean(simple_rmse)}')
print(f'complex mean RMSE: {np.mean(complex_rmse)}')

print(f'simple variance of point predictions: {np.var(point_preds)}')
print(f'complex variance of point predictions: {np.var(point_preds_comp)}')

### 🧠 Knowledge Check

![which_model](img/which_model_is_better_2.png)

## Train-Test Split

It is hard to know whether your model is too simple or too complex by evaluating it only on the data it was trained on.

We can hold out part of our sample as a test set and use it to monitor our prediction error.

This allows us to evaluate whether our model has the right balance of bias and variance.

<img src='img/testtrainsplit.png' width =550 />

* **training set**: a subset to train a model.
* **test set**: a subset to test the trained model.

In [None]:
df = pd.read_csv('data/king_county.csv', index_col='id')

y = df.price
X = df[['bedrooms', 'sqft_living']]

# With train_size unset, test_size=None falls back to scikit-learn's default of 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=None,
                                                    random_state=42)

print(X_train.shape)
print(X_test.shape)

print(X_train.shape[0] == y_train.shape[0])
print(X_test.shape[0] == y_test.shape[0])

### How do we know if our model is overfitting or underfitting?

If our model is not performing well on the training data, we are probably underfitting.

To know if our model is overfitting, we need to test it on unseen data and measure its performance there.

If the model performs significantly worse on the unseen data than on the training data, it is probably overfitting.

<img src='https://developers.google.com/machine-learning/crash-course/images/WorkflowWithTestSet.svg' width=500/>
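To see this diagnostic in action, here is a minimal sketch that compares train and test $R^2$ for models of increasing complexity. The data are synthetic (a quadratic signal with noise), and the degrees and the 50/50 split are assumptions made only to keep the training set small enough for overfitting to show.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Hypothetical data: one feature, quadratic signal plus noise
X_demo = rng.uniform(-3, 3, size=(60, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=1.0, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.5,
                                          random_state=42)

scores = {}
for degree in (1, 2, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)                      # fit on training data only
    scores[degree] = (model.score(X_tr, y_tr),  # R^2 on data the model saw
                      model.score(X_te, y_te))  # R^2 on held-out data
    print(f'degree {degree:>2}: train R^2 = {scores[degree][0]:.3f}, '
          f'test R^2 = {scores[degree][1]:.3f}')
```

A low score on both sets points to underfitting; a high training score paired with a clearly lower test score points to overfitting.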

### Examples

Consider the following scenarios and describe them according to bias and variance. There are four possibilities:

- a. The model has low bias and high variance.
- b. The model has high bias and low variance.
- c. The model has both low bias and low variance.
- d. The model has both high bias and high variance.

**Scenario 1**: The model has a low RMSE on training and a low RMSE on test.
<details>
    <summary> Answer
    </summary>
    c. The model has both low bias and low variance.
    </details>

**Scenario 2**: The model has a high $R^2$ on the training set, but a low $R^2$ on the test.
<details>
    <summary> Answer
    </summary>
    a. The model has low bias and high variance.
    </details>

**Scenario 3**: The model performs well on data it is fit on and well on data it has not seen.
<details>
    <summary> Answer
    </summary>
    c. The model has both low bias and low variance.
    </details>
  

**Scenario 4**: The model has a low $R^2$ on training but high on the test set.
<details>
    <summary> Answer
    </summary>
    d. The model has both high bias and high variance.
    </details>

**Scenario 5**: The model leaves out many of the meaningful predictors, but is consistent across samples.
<details>
    <summary> Answer
    </summary>
    b. The model has high bias and low variance.
    </details>

**Scenario 6**: The model is highly sensitive to random noise in the training set.
<details>
    <summary> Answer
    </summary>
    a. The model has low bias and high variance.
    </details>

### Should you ever fit on your test set?  

![no](https://media.giphy.com/media/d10dMmzqCYqQ0/giphy.gif)

**Never fit on test data.** If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set.

Let's go back to our KC housing data without the polynomial transformation.

In [None]:
df = pd.read_csv('data/king_county.csv', index_col='id')

# Date  is not in the correct format so we are dropping it for now.
df.head()

Now, we create a train-test split via `train_test_split` from the `sklearn.model_selection` module.

In [None]:
np.random.seed(42)

y = df.price
X = df[['bedrooms', 'sqft_living']]

# Here is the convention for a traditional train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [None]:
# Instantiate your linear regression object
lr = LinearRegression()

In [None]:
# fit the model on the training set
lr.fit(X_train, y_train)

In [None]:
# Check the R^2 of the training data
lr.score(X_train, y_train)

In [None]:
lr.coef_

An $R^2$ of .506 reflects a model that explains about half of the variance in the target (`price`) on the training data.
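For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below computes it by hand and checks it against `LinearRegression.score`. The data are synthetic and the coefficients are arbitrary assumptions; this is not the King County result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical data: two features and a noisy linear target
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=2.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f'manual R^2: {r2_manual:.4f}')
print(f'lr.score:   {model.score(X_demo, y_demo):.4f}')
```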

### Now check performance on test data

Next, we test how well the model performs on the unseen test data. Remember, we do not fit the model again: its parameters were already estimated from the training set.

In [None]:
lr.score(X_test, y_test)

### Knowledge check

How would you describe the bias of the model based on the above training $R^2$?

The difference between the train and test scores is low.

What does that indicate about variance?

### Same procedure with a polynomial model

In [None]:
df = pd.read_csv('data/king_county.csv', index_col='id')
df.head()

In [None]:
# Degree-4 polynomial features: a deliberately complex model
poly_4 = PolynomialFeatures(4)

X_poly = pd.DataFrame(
            poly_4.fit_transform(df.drop('price', axis=1))
                      )

y = df.price
X_poly.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,
                                                    test_size=0.2,
                                                    random_state=42)
lr_poly = LinearRegression()

# Always fit on the training set
lr_poly.fit(X_train, y_train)

lr_poly.score(X_train, y_train)

In [None]:
lr_poly.score(X_test, y_test)

### Exercise

[This post about scaling and data leakage](https://datascience.stackexchange.com/questions/38395/standardscaler-before-and-after-splitting-data) explains that if you are going to scale your data, you should only train your scaler on the training data to prevent data leakage.  

Perform the same train-test split as shown above for the simple model, but now scale your data appropriately.  

Because ordinary least squares is unaffected by linear rescaling of the features, the $R^2$ for both train and test should be the same as before.
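Here is a minimal sketch of the leakage-safe scaling pattern. It uses synthetic stand-in data rather than the King County file: the two made-up features and their coefficients are assumptions for illustration only. The scaler learns its mean and standard deviation from the training rows and is then reused, unchanged, on the test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical features on very different scales
X_demo = np.column_stack([rng.integers(1, 6, size=500),
                          rng.normal(2000, 500, size=500)])
y_demo = (50_000 * X_demo[:, 0] + 200 * X_demo[:, 1]
          + rng.normal(scale=50_000, size=500))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics; do not refit

unscaled = LinearRegression().fit(X_tr, y_tr)
scaled = LinearRegression().fit(X_tr_scaled, y_tr)

print(f'unscaled test R^2: {unscaled.score(X_te, y_te):.4f}')
print(f'scaled test R^2:   {scaled.score(X_te_scaled, y_te):.4f}')
```

The two test scores should agree up to floating-point error, which is a useful sanity check that the scaling was applied consistently.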

In [None]:
np.random.seed(42)

y = df.price
X = df[['bedrooms', 'sqft_living']]

# Train test split with random_state=42 and test_size=0.2

# Scale appropriately

# fit and score the model 


## Kfolds: Even More Rigorous Validation  

For a more rigorous validation, we turn to K-fold cross-validation.

![kfolds](img/k_folds.png)

[image via sklearn](https://scikit-learn.org/stable/modules/cross_validation.html)

In this process, we split the dataset into train and test as usual, then we further split the training set into k equally sized groups (folds).

`KFold` holds out one fold, trains on the remaining folds, then calculates a score on the held-out fold. It repeats this process until each fold has served as the held-out set once.
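To make the rotation concrete, this small sketch prints the row indices `KFold` assigns to each fold for a toy array of ten rows. The array and the names `X_toy`, `kf_toy`, and `held_out` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy rows, just to show which indices land in each fold
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5)  # no shuffling by default: folds are consecutive blocks
held_out = []
for fold, (train_ind, test_ind) in enumerate(kf_toy.split(X_toy)):
    print(f'fold {fold}: train={train_ind}, held out={test_ind}')
    held_out.extend(test_ind.tolist())

# Every row is held out exactly once across the five folds
print(sorted(held_out))
```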

We tune our parameters on the training set using k-many folds, then validate on the test data. This allows us to build our model and check whether it is overfit without touching the test set, which keeps our final test score an honest estimate of performance on unseen data.

In [None]:
X = df.drop('price', axis=1)
y = df.price

In [None]:
kf = KFold(n_splits=5)

train_r2 = []
test_r2 = []
for train_ind, test_ind in kf.split(X, y):
    
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_test, y_test = X.iloc[test_ind], y.iloc[test_ind]
    
    lr.fit(X_train, y_train)
    train_r2.append(lr.score(X_train, y_train))
    test_r2.append(lr.score(X_test, y_test))

In [None]:
# Mean train r_2
np.mean(train_r2)

In [None]:
# Mean test r_2
np.mean(test_r2)

In [None]:
# Test out our polynomial model
poly_2 = PolynomialFeatures(2)

df_poly = pd.DataFrame(
            poly_2.fit_transform(df.drop('price', axis=1))
                      )

X = df_poly
y = df.price

In [None]:
kf = KFold(n_splits=5)

train_r2 = []
test_r2 = []
for train_ind, test_ind in kf.split(X, y):
    
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_test, y_test = X.iloc[test_ind], y.iloc[test_ind]
    
    lr.fit(X_train, y_train)
    train_r2.append(lr.score(X_train, y_train))
    test_r2.append(lr.score(X_test, y_test))

In [None]:
# Mean train r_2
np.mean(train_r2)

In [None]:
# Mean test r_2
np.mean(test_r2)

Once we have an acceptable model, we train our model on the entire training set, and score on the test to validate.
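Put together, the whole workflow might look like the sketch below: split off a test set, cross-validate on the training portion only, then refit on all of the training data and score once on the test set. The data are synthetic and the coefficients arbitrary, so treat this as an outline under those assumptions. It uses scikit-learn's `cross_val_score` helper in place of a hand-written loop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: three features and a noisy linear target
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

# 1. Set the test data aside before any model selection
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          random_state=42)

# 2. Cross-validate on the training portion only
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X_tr, y_tr, cv=kf_demo)  # R^2 per fold
print(f'mean CV R^2: {cv_scores.mean():.3f}')

# 3. Refit on the entire training set and score once on the test set
final_model = LinearRegression().fit(X_tr, y_tr)
test_score = final_model.score(X_te, y_te)
print(f'test R^2:    {test_score:.3f}')
```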

