In [None]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'

# Carryover setup from last lecture
import seaborn as sns
tips = sns.load_dataset('tips')

from sklearn.linear_model import LinearRegression

import util

import warnings
warnings.simplefilter('ignore')

# Lecture 23 – Cross-Validation

## DSC 80, Spring 2023

### Agenda

- Generalization.
- Train-test split.
- Hyperparameters.
- Cross-validation.

## Generalization

Recall, last time, we drew two samples from the same data generating process, and fit polynomials of degree 1, 3, and 25 on each sample.

In [None]:
np.random.seed(23) # For reproducibility.

def sample_dgp(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return pd.DataFrame({'x': x, 'y': y})

sample_1 = sample_dgp()
sample_2 = sample_dgp()

When trained on sample 1, the degree 25 polynomial had the lowest RMSE on sample 1.

In [None]:
# Look at the definition of train_and_plot in util.py if you're curious as to how the plotting works.
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_1, degs=[1, 3, 25])
fig.update_layout(title='Trained on Sample 1, Performance on Sample 1')

But, when trained on sample 1, the degree 3 polynomial had the lowest RMSE on sample 2.

In [None]:
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25])
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')

If we train polynomials of degree 1, 3, and 25 on each sample, we see that the degree 25 polynomials vary more than the degree 1 and 3 polynomials do.

In [None]:
util.plot_multiple_models(sample_1, sample_2, degs=[1, 3, 25])

### Bias and variance

The training data we have access to is a sample from the DGP. We are concerned with our model's ability to **generalize** and work well on **different datasets** drawn from the same DGP.

Suppose we **fit** a model $H$ (e.g. a degree 3 polynomial) on **several different datasets** from a DGP. There are three sources of error that arise:

- ⭐️ **Bias**: **The expected deviation between a predicted value and an actual value**.
    - In other words, **for a given $x_i$, how far is $H(x_i)$ from the true $y_i$, on average?**
    - Low bias is good! ✅
    - High bias is a sign of **underfitting**, i.e. that our model is too **basic** to capture the relationship between our features and response.

- ⭐️ **Model variance ("variance")**: **The variance of a model's predictions**.
    - In other words, **for a given $x_i$, what is the variance of $H(x_i)$ across all datasets**?
    - Low model variance is good! ✅
    - High model variance is a sign of **overfitting**, i.e. that our model is too **complicated** and is prone to fitting to the noise in our training data.

- **Observation variance**: The variance due to the random noise in the process we are trying to model (e.g. measurement error). _This is a property of the DGP, so we can't reduce it by changing our model!_
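
To make these definitions concrete, here is a minimal simulation sketch. It assumes the same DGP used in this lecture ($y = x^3$ plus normal noise with standard deviation 3), draws many fresh samples, fits polynomials of degree 1, 3, and 25 to each, and estimates the average squared bias and average model variance of the predictions. Bias is measured against the noise-free values $x_i^3$. The seed, the 200 repetitions, and the names `fit_and_predict` and `summary` are illustrative choices and are not part of `util.py`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)   # Seed chosen arbitrarily for this sketch.
x_grid = np.linspace(-2, 3, 100)
true_y = x_grid ** 3              # Noise-free values of the DGP.

def fit_and_predict(deg):
    """Draw a fresh sample from the DGP, fit a degree-`deg` polynomial, and predict on x_grid."""
    y = true_y + rng.normal(0, 3, size=len(x_grid))
    pl = Pipeline([('poly', PolynomialFeatures(deg)), ('lin-reg', LinearRegression())])
    pl.fit(x_grid.reshape(-1, 1), y)
    return pl.predict(x_grid.reshape(-1, 1))

summary = {}
for deg in [1, 3, 25]:
    # Each row holds one fitted model's predictions at every x_i.
    preds = np.array([fit_and_predict(deg) for _ in range(200)])
    avg_sq_bias = np.mean((preds.mean(axis=0) - true_y) ** 2)  # Average of (mean prediction - truth)^2.
    avg_variance = np.mean(preds.var(axis=0))                  # Average variance of H(x_i) across samples.
    summary[deg] = (avg_sq_bias, avg_variance)
    print(f'Degree {deg}: avg. squared bias = {avg_sq_bias:.3f}, avg. model variance = {avg_variance:.3f}')
```

Under these assumptions, one would expect the degree 1 fits to show the largest squared bias (underfitting) and the degree 25 fits to show the largest model variance (overfitting).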

Here, suppose:
- The <span style='color:#c6283f'><b>red bulls-eye</b></span> represents your **true weight and height** 🧍.
- The <span style='color:#080c6f'><b>dark blue darts</b></span> represent **predictions of your weight and height** made by models that were fit on different datasets drawn from the same DGP.
<br>

<center><img src="imgs/image_5.png" width="40%"></center>

We'd like our models to be in the top left, but in practice that's hard to achieve!

## Train-test split

### Avoiding overfitting

- We won't know whether our model has **overfit** to our sample (training data) unless we get to see how well it performs on a new sample from the same DGP.

- 💡**Idea**: **Split** our sample into a **training set** and **test set**.

- Use **only** the training set to fit the model (i.e. find $w^*$).

- Use the test set to evaluate the model's error (RMSE, $R^2$).

- The test set is like a new sample of data from the same DGP as the training data!
    - _Similar_ to bootstrapping (but not quite the same, because there is no resampling involved).
    - **If our sample is not representative of the DGP, this method has limited effectiveness!**

<center><img src="imgs/train-test.png" width='50%'></center>

### Train-test split 🚆

`sklearn.model_selection.train_test_split` implements a train-test split for us! 🙏🏼 

If `X` is an array/DataFrame of features and `y` is an array/Series of responses,

```py
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```

randomly splits the features and responses into training and test sets, such that the test set contains 25% of the rows in the full dataset.
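
To see roughly what a random split involves, here is a hand-rolled sketch. It is not how `sklearn` implements `train_test_split`, only an illustration of the idea: shuffle the row positions, then set aside a fraction for the test set. The function name `manual_train_test_split` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def manual_train_test_split(X, y, test_size=0.25, seed=0):
    """Illustrative only: randomly assign rows of X and y to training and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))              # Random ordering of row positions.
    n_test = int(np.ceil(test_size * len(X)))       # Number of rows to hold out.
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

# Toy data, made up for this sketch.
X_demo = pd.DataFrame({'feature': np.arange(10)})
y_demo = pd.Series(np.arange(10) * 2)

X_tr, X_te, y_tr, y_te = manual_train_test_split(X_demo, y_demo)
print('Training rows:', len(X_tr), '| Test rows:', len(X_te))
```

Every row lands in exactly one of the two sets, which is what lets the test set act like unseen data.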

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Read the documentation!
train_test_split?

Let's perform a train/test split on our `tips` dataset.

In [None]:
X = tips.drop('tip', axis=1)
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # We don't have to choose 0.25.

Before proceeding, let's check the sizes of `X_train` and `X_test`.

In [None]:
print('Rows in X_train:', X_train.shape[0])
display(X_train.head())
print('Rows in X_test:', X_test.shape[0])
display(X_test.head())

In [None]:
X_train.shape[0] / tips.shape[0]

### Example train-test split

Steps:
1. Fit a model on the training set.
2. Evaluate the model on the test set.

In [None]:
tips.head()

In [None]:
X = tips[['total_bill', 'size']] # For this example, we'll use just the already-quantitative columns in tips.
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # random_state is like np.random.seed.

Here, we'll use a stand-alone `LinearRegression` model without a `Pipeline`, but this process would work the same if we were using a `Pipeline`.

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

Let's check our model's performance on the **training** set first.

In [None]:
from sklearn.metrics import mean_squared_error # Built-in RMSE/MSE function.

In [None]:
pred_train = lr.predict(X_train)
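# Take the square root of the MSE ourselves; the `squared=False` argument is deprecated in newer versions of sklearn.
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))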
rmse_train

And the **test** set:

In [None]:
pred_test = lr.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test)) # RMSE = sqrt(MSE); avoids the deprecated `squared` argument.
rmse_test

Since `rmse_train` and `rmse_test` are similar, it **doesn't seem like our model is overfitting** to the training data. If `rmse_test` were much larger than `rmse_train`, that would be evidence that our model is unable to **generalize well**.
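
Since this comparison rests entirely on RMSE, it may help to see the metric computed by hand. The sketch below uses only `numpy`; the helper name `rmse` and the three example values are made up for illustration.

```python
import numpy as np

def rmse(actual, pred):
    """Root mean squared error: the square root of the average squared residual."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.sqrt(np.mean((actual - pred) ** 2))

# Residuals are 1, 0, and -2, so the MSE is (1 + 0 + 4) / 3 = 5/3.
example_rmse = rmse([3.0, 2.0, 4.0], [2.0, 2.0, 6.0])
print(example_rmse)
```

Because RMSE is in the same units as the response (here, dollars of tip), the gap between training and test RMSE can be read directly as a difference in typical prediction error.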

## Hyperparameters

### Example: Polynomial regression

We recently looked at an example of **polynomial regression**.

In [None]:
fig = util.train_and_plot(train_sample=sample_1, test_sample=sample_2, degs=[1, 3, 25])
fig.update_layout(title='Trained on Sample 1, Performance on Sample 2')

When building these models:
- We **got to choose** the degree of the polynomials (i.e. we chose 1, 3, and 25).
- We didn't get to choose the exact formulas for the three polynomials – their formulas were **learned from data**.

### Parameters vs. hyperparameters

- A **parameter** defines the relationship between variables in a model. 
    - **We learn parameters from data**.
    - For instance, suppose we fit a degree 3 polynomial to data, and end up with
    
    $$H(x) = 1 - 2x + 13x^2 - 4x^3$$
    
    - 1, -2, 13, and -4 are parameters.

- A **hyperparameter** is a parameter that we get to choose **before our model is fit to the data**.
    - Think of hyperparameters as knobs 🎛 – **we get to pick and tune them!**
    - **Polynomial degree** was a hyperparameter in the previous example, and we tried three different values – 1, 3, and 25.

- **Question:** How do we choose the "right" hyperparameter(s)?
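
The distinction can be seen directly in code. In the sketch below, the degree is something we pass in (a hyperparameter), while the intercept and coefficients are read off the model after fitting (parameters). The data is generated without noise from the example polynomial above, purely for illustration; the names `x_demo`, `y_demo`, and `pl_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free data from H(x) = 1 - 2x + 13x^2 - 4x^3 (made up for this sketch).
x_demo = np.linspace(-2, 3, 50).reshape(-1, 1)
y_demo = (1 - 2 * x_demo + 13 * x_demo ** 2 - 4 * x_demo ** 3).ravel()

degree = 3  # Hyperparameter: chosen by us, before fitting.

pl_demo = Pipeline([
    # include_bias=False so that the intercept is learned by LinearRegression alone.
    ('poly', PolynomialFeatures(degree, include_bias=False)),
    ('lin-reg', LinearRegression()),
])
pl_demo.fit(x_demo, y_demo)

# Parameters: learned from the data.
learned_intercept = pl_demo.named_steps['lin-reg'].intercept_
learned_coefs = pl_demo.named_steps['lin-reg'].coef_
print('Intercept:', round(learned_intercept, 4))
print('Coefficients of x, x^2, x^3:', np.round(learned_coefs, 4))
```

With noisy data the learned values would only approximate 1, -2, 13, and -4, but the roles stay the same: we set `degree`, and the model learns everything else.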

### Training error vs. test error

- We know that a model's performance on a **test set** is a good estimate of its ability to generalize to unseen data.

- We want to find the hyperparameter that leads to the best **test set performance**.

- Idea:
    1. Come up with a **list** of hyperparameters to try.
    2. For each hyperparameter, train the model on the training set and compute its performance on the test set.
    3. Pick the hyperparameter with the best performance on the test set.

### Training error vs. test error

- Let's try this strategy on sample 1 from our earlier example. 

- We'll try to fit a polynomial model on the dataset; we'll choose the polynomial's degree from the list [1, 2, ..., 25].

In [None]:
px.scatter(sample_1, x='x', y='y', title='Sample 1', template=TEMPLATE)

First, we perform a train-test split.

In [None]:
X = sample_1[['x']]
y = sample_1['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

Then, we'll implement the logic from the previous slide.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

In [None]:
train_errs = []
test_errs = []

for d in range(1, 26):
    pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
    pl.fit(X_train, y_train)
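    # RMSE = sqrt(MSE); this avoids the `squared=False` argument, which is deprecated in newer versions of sklearn.
    train_errs.append(np.sqrt(mean_squared_error(y_train, pl.predict(X_train))))
    test_errs.append(np.sqrt(mean_squared_error(y_test, pl.predict(X_test))))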

Let's look at the plots of training error vs. degree and test error vs. degree.

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Scatter(x=np.arange(1, 26), y=train_errs, name='Training Error')
)

fig.add_trace(
    go.Scatter(x=np.arange(1, 26), y=test_errs, name='Test Error', line={'color': 'orange'})
)

fig.update_layout(showlegend=True, xaxis_title='Degree', yaxis_title='RMSE')

- Training error appears to decrease as polynomial degree increases.

- Test error appears to decrease until a "valley", and then increases again.

- Here, we'd choose a degree of 3, since that degree has the **lowest test error**.

### Training error vs. test error

The pattern we saw in the previous example is true more generally.

<center><img src='imgs/tt-errors.png' width=50%></center>

We pick the hyperparameter(s) at the "valley" of test error.

Note that training error **tends** to underestimate test error, but it doesn't have to – i.e., it is possible for test error to be lower than training error (say, if the test set is "easier" to predict than the training set).

### Conducting train-test splits

- Recall, <span style='color: blue'><b>training data</b></span> is used to fit our model, and <span style='color: orange'><b>test data</b></span> is used to evaluate our model.

<center><img src='imgs/train-test-first.png' width=40%></center>


- **Question:** _How_ should we split?
    - `sklearn`'s `train_test_split` splits **randomly**, which usually works well.
    - However, if there is some element of **time** in the training data (say, when predicting the future price of a stock), a better split is "past" and "future".

- **Question:** How _large_ should the split be, e.g. 90%-10% vs. 75%-25%?
    - There's a tradeoff – a larger training set should lead to a "better" model, while a larger test set should lead to a better estimate of our model's ability to generalize.
    - There's no "right" choice, but we usually pick a split somewhere in the range above.
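
For data with a time component, a "past vs. future" split can be sketched as follows. The `prices` DataFrame, its column names, and the helper `chronological_split` are all hypothetical; the point is only that the test set consists of the most recent rows rather than a random subset.

```python
import numpy as np
import pandas as pd

# Made-up daily prices, for illustration only.
prices = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=10, freq='D'),
    'price': np.arange(100, 110),
})

def chronological_split(df, time_col, test_size=0.2):
    """Illustrative only: use the earliest rows for training and the latest rows for testing."""
    ordered = df.sort_values(time_col)
    n_test = int(np.ceil(test_size * len(ordered)))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

past, future = chronological_split(prices, 'date')
print('Train through:', past['date'].max().date(), '| Test from:', future['date'].min().date())
```

Splitting this way keeps the model from training on information that would not have been available when its predictions were needed.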

### But wait...

- With our current strategy, we are choosing the hyperparameter that creates the model that **performs best on the test set**.

- As such, we are **overfitting to the test set** – the best hyperparameter for the test set might not be the best hyperparameter for a totally unseen dataset!

- It seems like we need **another** split.

## Cross-validation

### A single validation set

<center><img src='imgs/train-test-val.png' width=40%></center>

1. Split the data into three sets: <span style='color: blue'><b>training</b></span>, <span style='color: green'><b>validation</b></span>, and <span style='color: orange'><b>test</b></span>.

2. For each hyperparameter choice, <span style='color: blue'><b>train</b></span> the model only on the <span style='color: blue'><b>training set</b></span>, and <span style='color: green'><b>evaluate</b></span> the model's performance on the <span style='color: green'><b>validation set</b></span>.

3. Find the hyperparameter with the best <span style='color: green'><b>validation</b></span> performance.

4. Retrain the final model on the <span style='color: blue'><b>training</b></span> and <span style='color: green'><b>validation</b></span> sets, and report its performance on the <span style='color: orange'><b>test set</b></span>.

**Issue:** This strategy is too dependent on the <span style='color: green'><b>validation</b></span> set, which may be small and/or not a representative sample of the data.
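
One way to produce the three sets is to call `train_test_split` twice, as sketched below. The 60%-20%-20% proportions, the toy arrays, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with 100 rows, made up for this sketch.
X_all = np.arange(100).reshape(-1, 1)
y_all = np.arange(100)

# First, set aside 20% of all rows as the test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1
)

# Then, split the remaining 80% into training (75% of it) and validation (25% of it).
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Note that with only 20 rows in the validation set, the validation error can change a lot depending on which rows happen to land there.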

### $k$-fold cross-validation

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).

<center><img src='imgs/k-fold.png' width=40%></center>

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

$k$-fold cross-validation (or simply "cross-validation") is **the** technique we will use for finding hyperparameters.

### Creating folds in `sklearn`

`sklearn` has a `KFold` class that splits data into training and validation folds.

In [None]:
from sklearn.model_selection import KFold

Let's use a simple dataset for illustration.

In [None]:
data = np.arange(10, 70, 10)
data

Let's instantiate a `KFold` object with $k=3$.

In [None]:
kfold = KFold(3, shuffle=True, random_state=1)
kfold

Finally, let's use `kfold` to `split` `data`:

In [None]:
for train, val in kfold.split(data):
    print(f'train: {data[train]}, validation: {data[val]}')

Note that each value in `data` is used for validation exactly once and for training exactly twice. Also note that because we set `shuffle=True`, the groups are not simply `[10, 20]`, `[30, 40]`, and `[50, 60]`.

### $k$-fold cross-validation

First, **shuffle** the dataset randomly and **split** it into $k$ disjoint groups. Then:

- For each hyperparameter:
    - For each unique group:
        - Let the unique group be the "validation set".
        - Let all other groups be the "training set".
        - Train a model using the selected hyperparameter on the training set.
        - Evaluate the model on the validation set.
    - Compute the **average** validation score (e.g. RMSE) for the particular hyperparameter.
- Choose the hyperparameter with the best average validation score.
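
The procedure above can be sketched by hand with `KFold`. This is a minimal illustration that assumes the same kind of DGP as earlier in the lecture; the data, the candidate degrees `[1, 3, 5]`, and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# A fresh sample resembling the lecture's DGP (seed chosen arbitrarily).
rng = np.random.default_rng(23)
x_cv = np.linspace(-2, 3, 100).reshape(-1, 1)
y_cv = x_cv.ravel() ** 3 + rng.normal(0, 3, size=100)

kfold_demo = KFold(5, shuffle=True, random_state=1)
avg_val_rmse = {}

for d in [1, 3, 5]:                                      # For each hyperparameter:
    fold_rmses = []
    for train_idx, val_idx in kfold_demo.split(x_cv):    # For each unique group:
        pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
        pl.fit(x_cv[train_idx], y_cv[train_idx])         # Train on all other groups.
        resid = y_cv[val_idx] - pl.predict(x_cv[val_idx])
        fold_rmses.append(np.sqrt(np.mean(resid ** 2)))  # Evaluate on the validation group.
    avg_val_rmse[d] = np.mean(fold_rmses)                # Average validation RMSE.

# Choose the hyperparameter with the best (lowest) average validation RMSE.
best_degree = min(avg_val_rmse, key=avg_val_rmse.get)
print(avg_val_rmse, '-> best degree:', best_degree)
```

Because `random_state` is set to an integer, the same five folds are reused for every degree, so the degrees are compared on identical splits.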

### $k$-fold cross-validation in `sklearn`

While you could manually use `KFold` to perform cross-validation, the `cross_val_score` function in `sklearn` implements $k$-fold cross-validation for us! 

```py
cross_val_score(estimator, X_train, y_train, cv)
```

Specifically, it takes in:
- A `Pipeline` or estimator **that has not already been `fit`**.
- Training data.
- A value of $k$ (through the `cv` argument).
- (Optionally) A `scoring` metric.

and performs $k$-fold cross-validation, returning the values of the scoring metric on each fold.

In [None]:
from sklearn.model_selection import cross_val_score

### $k$-fold cross-validation in `sklearn`

- Let's perform $k$-fold cross validation in order to help us pick a degree for polynomial regression from the list [1, 2, ..., 25].

- We'll use $k=5$ since it's a common choice (and the default in `sklearn`).


- For the sake of this example, we'll suppose `sample_1` is our "training + validation data", i.e. that our test data is in some other dataset.
    - If this were not true, we'd first need to split `sample_1` into separate training and test sets.

In [None]:
errs_df = pd.DataFrame()

for d in range(1, 26):
    pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
    
    # The `scoring` argument is used to specify that we want to compute the RMSE; 
    # the default is R^2. It's called "neg" RMSE because, 
    # by default, sklearn likes to "maximize" scores, and maximizing -RMSE is the same
    # as minimizing RMSE.
    errs = cross_val_score(pl, sample_1[['x']], sample_1['y'], 
                           cv=5, scoring='neg_root_mean_squared_error')
    errs_df[f'Deg {d}'] = -errs # Negate to turn positive (sklearn computed negative RMSE).
    
errs_df.index = [f'Fold {i}' for i in range(1, 6)]
errs_df.index.name = 'Validation Fold'

Next class, we'll look at how to implement this procedure without needing to `for`-loop over values of `d`.

### $k$-fold cross-validation in `sklearn`

Note that for each choice of degree (our hyperparameter), we have **five** RMSEs, one for each "fold" of the data. This means that in total, $25 \cdot 5 = 125$ models were trained/fit to data!

In [None]:
errs_df

We should choose the degree with the lowest **average** validation RMSE.

In [None]:
errs_df.mean().idxmin()

Note that if we didn't perform $k$-fold cross-validation, but instead just used a single validation set, we may have ended up with a different result:

In [None]:
errs_df.idxmin(axis=1)

***Note***: You may notice that the RMSEs in Folds 1 and 5 are significantly higher than in other folds. Can you think of reasons why, and how we might fix this?

In [None]:
px.scatter(sample_1, x='x', y='y', title='Sample 1', template=TEMPLATE)

### Another example: Tips

We can also use $k$-fold cross-validation to determine which subset of features to use in a linear model that predicts tips (though, as you'll see, the code is not pretty).

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

As we should always do, we'll perform a train-test split on `tips` and will only use the training data for cross-validation.

In [None]:
X = tips.drop('tip', axis=1)
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# A dictionary that maps names to Pipeline objects.
pipes = {
    'total_bill only': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
    'total_bill + size': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill', 'size'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
    'total_bill + size + OHE smoker': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill', 'size']),
             ('ohe', OneHotEncoder(), ['smoker'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
    'total_bill + size + OHE all': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill', 'size']),
             ('ohe', OneHotEncoder(), ['smoker', 'sex', 'time', 'day'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
}

In [None]:
pipe_df = pd.DataFrame()

for pipe in pipes:
    errs = cross_val_score(pipes[pipe], X_train, y_train,
                           cv=5, scoring='neg_root_mean_squared_error')
    pipe_df[pipe] = -errs
    
pipe_df.index = [f'Fold {i}' for i in range(1, 6)]
pipe_df.index.name = 'Validation Fold'

In [None]:
pipe_df

In [None]:
pipe_df.mean()

In [None]:
pipe_df.mean().idxmin()

Even though the third model has the lowest average validation RMSE, its average validation RMSE is very close to that of the other, simpler models, and as a result we'd likely use the simplest model in practice.

### Summary: Generalization

1. Split the data into two sets: <span style='color: blue'><b>training</b></span> and <span style='color: orange'><b>test</b></span>.

2. Use only the <span style='color: blue'><b>training</b></span> data when designing, training, and tuning the model.
    - Use <span style='color: green'><b>$k$-fold cross-validation</b></span> to choose hyperparameters and estimate the model's ability to generalize.
    - Do not ❌ look at the <span style='color: orange'><b>test</b></span> data in this step!
    
3. Commit to your final model and train it using the entire <span style='color: blue'><b>training</b></span> set.

4. Test the model using the <span style='color: orange'><b>test</b></span> data. If the performance (e.g. RMSE) is not acceptable, return to step 2.

5. Finally, train on **all available data** and ship the model to production! 🛳

🚨 This is the process you should **always** use! 🚨 
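
Here is a compact sketch of the five steps, using polynomial regression on synthetic data that resembles the lecture's DGP. The data, the candidate degrees 1 through 5, and the helper `make_poly_pipeline` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (seed chosen arbitrarily).
rng = np.random.default_rng(23)
x_w = np.linspace(-2, 3, 200).reshape(-1, 1)
y_w = x_w.ravel() ** 3 + rng.normal(0, 3, size=200)

def make_poly_pipeline(d):
    """Hypothetical helper: an unfit polynomial regression pipeline of degree d."""
    return Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])

# Step 1: Split into training and test sets.
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    x_w, y_w, test_size=0.2, random_state=1
)

# Step 2: Use only the training data, with 5-fold cross-validation, to choose the degree.
cv_rmse = {}
for d in range(1, 6):
    scores = cross_val_score(make_poly_pipeline(d), X_train_w, y_train_w,
                             cv=5, scoring='neg_root_mean_squared_error')
    cv_rmse[d] = -scores.mean()
best_d = min(cv_rmse, key=cv_rmse.get)

# Step 3: Train the chosen model on the entire training set.
final_model = make_poly_pipeline(best_d).fit(X_train_w, y_train_w)

# Step 4: Evaluate once on the test set.
test_rmse = np.sqrt(np.mean((y_test_w - final_model.predict(X_test_w)) ** 2))
print('Chosen degree:', best_d, '| Test RMSE:', round(test_rmse, 3))

# Step 5: Retrain on all available data before deploying.
production_model = make_poly_pipeline(best_d).fit(x_w, y_w)
```

The test set is touched exactly once, in step 4, after the degree has already been chosen.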

### Discussion Question 🤔

- Suppose you have a training dataset with 1000 rows.
- You want to decide between 20 hyperparameters for a particular model.
- To do so, you perform 10-fold cross-validation.
- **How many times is the first row in the training dataset (`X.iloc[0]`) used for training a model?**

## Summary, next time

### Summary

- A model's training error tends to decrease as model complexity increases, while its test error tends to decrease until it reaches a "sweet spot" and then increase again.
- A hyperparameter is a configuration that we choose before training a model; an important task in machine learning is selecting "good" hyperparameters.
- In order to quantify a model's ability to generalize to unseen data, use **$k$-fold cross-validation**.
    - In particular, $k$-fold CV is used to select hyperparameters.

### Next time

- Example: Decision trees 🌲.
- Tuning multiple hyperparameters at once.
- Multicollinearity (time permitting).