In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Carryover setup from last lecture
from sklearn.linear_model import LinearRegression
tips = sns.load_dataset('tips')

plt.style.use('seaborn-white')
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)

import warnings
warnings.simplefilter('ignore')

# Lecture 24 – Cross-Validation

## DSC 80, Spring 2022

### Announcements

- Lab 8 is due **today at 11:59PM**.
- Project 4 is due **Thursday, May 26th at 11:59PM**.
    - IMPORTANT: See [this post](https://campuswire.com/c/G325FA25B/feed/1511) for more public tests for Question 5.
- Remaining assignments: Lab 9 and Project 5.

### Agenda

- Train-test split.
- Hyperparameters.
- Cross-validation.
- Example: Decision trees 🌲.

## Train-test split

### Avoiding overfitting

- We won't know whether our model has **overfit** to our sample (training data) unless we get to see how well it performs on a new sample from the same data generating process (DGP).

- **Idea 💡:** **Split** our sample into a **training set** and **test set**.

- Use **only** the training set to fit the model (i.e. find $w^*$).

- Use the test set to evaluate the model's error (RMSE, $R^2$).

- This is like "generating" new data from the same DGP!
    - _Similar_ to bootstrapping (but not quite the same, because there is no resampling involved).
    - **If our sample is not representative of the DGP, this method has limited effectiveness!**

<center><img src="imgs/train-test.png" width='50%'></center>

### Train-test split 🚆

`sklearn.model_selection.train_test_split` implements a train-test split for us! 🙏🏼 

If `X` is an array/DataFrame of features and `y` is an array/Series of responses,

```py
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```

randomly splits the features and responses into training and test sets, such that the test set contains 25% of the full dataset.
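
To make the idea of a random split concrete, here is a minimal sketch of what such a split does conceptually: shuffle the row positions, then slice off a fraction for testing. The helper `manual_train_test_split` and the data are made up for illustration; this is not `sklearn`'s actual implementation.

```python
import numpy as np

def manual_train_test_split(X, y, test_size=0.25, seed=0):
    """Hypothetical helper: shuffle row positions, then slice off a test set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    shuffled = rng.permutation(n)            # random ordering of row positions
    n_test = int(np.ceil(test_size * n))     # number of rows held out for testing
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(40).reshape(20, 2)   # 20 made-up rows with 2 features each
y = np.arange(20)                  # 20 made-up responses

X_train, X_test, y_train, y_test = manual_train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape)
```

Note that the same shuffled positions are used for both `X` and `y`, so each row of features stays paired with its response.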

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Read the documentation!
train_test_split?

Let's perform a train/test split on our `tips` dataset.

In [None]:
X = tips.drop('tip', axis=1)
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Don't have to choose 0.25

Before proceeding, let's check the sizes of `X_train` and `X_test`.

In [None]:
print('Rows in X_train:', X_train.shape[0])
display(X_train.head())
print('Rows in X_test:', X_test.shape[0])
display(X_test.head())

In [None]:
X_train.shape[0] / tips.shape[0]

### Example prediction pipeline

Steps:
1. Fit a model on the training set.
2. Evaluate the model on the test set.

In [None]:
tips.head()

In [None]:
X = tips[['total_bill', 'size']] # For this example, we'll use just the already-quantitative columns in tips
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Here, we'll use a stand-alone `LinearRegression` model without a `Pipeline`, but this process would work the same if we were using a `Pipeline`.

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

Let's check our model's performance on the **training** set first.

In [None]:
from sklearn.metrics import mean_squared_error # built-in RMSE/MSE function

In [None]:
pred_train = lr.predict(X_train)
rmse_train = mean_squared_error(y_train, pred_train, squared=False)
rmse_train

And the **test** set:

In [None]:
pred_test = lr.predict(X_test)
rmse_test = mean_squared_error(y_test, pred_test, squared=False)
rmse_test

Since `rmse_train` and `rmse_test` are similar, it **doesn't seem like our model is overfitting** to the training data. If `rmse_test` were much larger than `rmse_train`, it would be evidence that our model is unable to **generalize well**.
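
For reference, RMSE is simple enough to compute by hand. The sketch below uses a hypothetical helper named `rmse` and made-up numbers. Newer scikit-learn releases deprecate the `squared` argument of `mean_squared_error` in favor of a separate `root_mean_squared_error` function, so a small helper like this can also serve as a version-independent fallback.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: the square root of the average squared residual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Made-up values: the residuals are 0, 0, and -2
example_rmse = rmse([1, 2, 3], [1, 2, 5])
print(example_rmse)
```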

## Hyperparameters

### Example: Polynomial regression

Recall, last class we looked at an example of **polynomial regression**.

<center><img src='imgs/poly-reg.png' width=85%></center>
    
When building these models:
- We **got to choose** the degree of the polynomials (i.e. we chose 1, 3, and 15).
- We didn't get to choose the exact formulas for the three polynomials – their formulas were **learned from data**.

### Parameters vs. hyperparameters

- A **parameter** defines the relationship between variables in a model. 
    - **We learn parameters from data**.
    - For instance, suppose we fit a degree 3 polynomial to data, and end up with
    
    $$H(x) = 1 - 2x + 13x^2 - 4x^3$$
    
    - 1, -2, 13, and -4 are parameters.

- A **hyperparameter** is a parameter that we get to choose **before our model is fit to the data**.
    - Think of hyperparameters as knobs 🎛 – **we get to pick and tune them!**
    - **Polynomial degree** was a hyperparameter in the previous example, and we tried three different values – 1, 3, and 15.

- **Question:** How do we choose the "right" hyperparameter(s)?
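
The distinction can be seen in a few lines of code. In this sketch, which uses made-up, noise-free data generated from the cubic $H(x)$ above, the degree is something we pass in (a hyperparameter), while the coefficients come back from the fitting procedure (parameters). It uses `np.polyfit` rather than the `sklearn` pipeline from lecture, purely to keep the example short.

```python
import numpy as np

# Made-up, noise-free data generated from H(x) = 1 - 2x + 13x^2 - 4x^3
x = np.linspace(-2, 2, 20)
y = 1 - 2 * x + 13 * x**2 - 4 * x**3

degree = 3                          # hyperparameter: chosen by us before fitting
coefs = np.polyfit(x, y, degree)    # parameters: learned from the data

# np.polyfit lists coefficients from the highest power down to the constant term
print(np.round(coefs, 6))
```

Changing `degree` changes which family of functions we search over; the data then determines the best coefficients within that family.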

### Training error vs. test error

- We know that a model's performance on a **test set** is a good estimate of its ability to generalize to unseen data.
- We want to find the hyperparameter that leads to the best **test set performance**.

- Idea:
    1. Come up with a **list** of hyperparameters to try.
    2. For each hyperparameter, train the model on the training set and compute its performance on the test set.
    3. Pick the hyperparameter with the best performance on the test set.

### Training error vs. test error

- Let's try this strategy on the dataset ("Sample 1") from last class. 
- We'll try to fit a polynomial model on the dataset; we'll choose the polynomial's degree from the list [1, 2, ..., 15].

In [None]:
sample_1 = pd.read_csv('data/sample-1.csv')
sample_1.head()

In [None]:
plt.scatter(sample_1['x'], sample_1['y']);

First, we perform a train-test split.

In [None]:
X = sample_1[['x']]
y = sample_1['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100) # random_state is like np.random.seed

Now, we'll implement the logic from the previous slide.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

In [None]:
train_errs = []
test_errs = []

for d in range(1, 16):
    pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
    pl.fit(X_train, y_train)
    train_errs.append(mean_squared_error(y_train, pl.predict(X_train), squared=False))
    test_errs.append(mean_squared_error(y_test, pl.predict(X_test), squared=False))

Let's look at both the training RMSEs and test RMSEs we computed.

In [None]:
plt.plot(range(1, 16), train_errs, label='training RMSE')
plt.plot(range(1, 16), test_errs, label='test RMSE')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE')
plt.legend();

**Observations:**
- Training error appears to decrease as polynomial degree increases.
- Testing error appears to decrease until an "elbow", and then increases again.

Here, we'd choose a degree of 3, since that degree has the **lowest test error**.

### Training error vs. test error

The pattern we saw in the previous example is true more generally.

<center><img src='imgs/tt-errors.png' width=50%></center>

We pick the hyperparameter(s) at the "valley" of test error.

Note that training error **tends** to underestimate test error, but it doesn't have to – i.e., it is possible for test error to be lower than training error (say, if the test set is "easier" to predict than the training set).

### Conducting train-test splits

- Recall, <span style='color: blue'><b>training data</b></span> is used to fit our model, and <span style='color: orange'><b>test data</b></span> is used to evaluate our model.

<center><img src='imgs/train-test-first.png' width=40%></center>


- **Question:** _How_ should we split?
    - `sklearn`'s `train_test_split` splits **randomly**, which usually works well.
    - However, if there is some element of **time** in the training data (say, when predicting the future price of a stock), a better split is "past" and "future".

- **Question:** How _large_ should the split be, e.g. 90%-10% vs. 75%-25%?
    - There's a tradeoff – a larger training set should lead to a "better" model, while a larger test set should lead to a better estimate of our model's ability to generalize.
    - There's no "right" choice, but we usually choose a split somewhere in the range above.
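
As a rough sketch of the "past" vs. "future" idea, a time-based split can be done by sorting on the time column and slicing, rather than shuffling. The DataFrame, its column names, and the 80% cutoff below are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily prices; the column names are hypothetical
prices = pd.DataFrame({
    'date': pd.date_range('2022-01-01', periods=10, freq='D'),
    'price': np.arange(100.0, 110.0),
})

prices = prices.sort_values('date')     # make sure rows are in time order
cutoff = int(len(prices) * 0.8)         # the first 80% of rows form the "past"

train = prices.iloc[:cutoff]            # past: used to fit the model
test = prices.iloc[cutoff:]             # future: used to evaluate the model

print(train['date'].max(), '<', test['date'].min())
```

Every training row comes strictly before every test row, so the model is never evaluated on data older than what it was fit on.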

### But wait...

- With our current strategy, we are choosing the hyperparameter that creates the model that **performs best on the test set**.

- As such, we are **overfitting to the test set** – the best hyperparameter for the test set might not be the best hyperparameter for a totally unseen dataset!

- It seems like we need **another** split.

## Cross-validation

### A single validation set

<center><img src='imgs/train-test-val.png' width=40%></center>

1. Split the data into three sets: <span style='color: blue'><b>training</b></span>, <span style='color: green'><b>validation</b></span>, and <span style='color: orange'><b>test</b></span>.

2. For each hyperparameter choice, <span style='color: blue'><b>train</b></span> the model only on the <span style='color: blue'><b>training set</b></span>, and <span style='color: green'><b>evaluate</b></span> the model's performance on the <span style='color: green'><b>validation set</b></span>.

3. Find the hyperparameter with the best <span style='color: green'><b>validation</b></span> performance.

4. Retrain the final model on the <span style='color: blue'><b>training</b></span> and <span style='color: green'><b>validation</b></span> sets, and report its performance on the <span style='color: orange'><b>test set</b></span>.

**Issue:** This strategy is too dependent on the <span style='color: green'><b>validation</b></span> set, which may be small and/or not a representative sample of the data.
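
One possible way to create the three sets is to call `train_test_split` twice: once to hold out the test set, and once more to carve a validation set out of what remains. The data and the 60%/20%/20% proportions below are assumptions made for the sake of example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 made-up rows
y = np.arange(100)

# First, hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Then, hold out 25% of the remaining rows (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```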

### $k$-fold cross-validation

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some integer greater than 1 (5 in the following example).

<center><img src='imgs/k-fold.png' width=40%></center>

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

### $k$-fold cross-validation

First, **shuffle** the dataset randomly and **split** it into $k$ disjoint groups. Then:

- For each hyperparameter:
    - For each unique group:
        - Let the unique group be the "validation set".
        - Let all other groups be the "training set".
        - Train a model using the selected hyperparameter on the training set.
        - Evaluate the model on the validation set.
    - Compute the **average** validation score (e.g. RMSE) for the particular hyperparameter.
- Choose the hyperparameter with the best average validation score.
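
The fold bookkeeping in the procedure above can be sketched without any library support. The helper `k_fold_indices` below is hypothetical and only handles the "shuffle, split, and rotate the validation group" part; fitting and scoring a model would happen inside the loop.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Hypothetical helper: yield (train_idx, val_idx) pairs for k folds."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(n), k)   # shuffle, then cut into k disjoint groups
    for i in range(k):
        val_idx = groups[i]                                        # group i is the validation set
        train_idx = np.concatenate(groups[:i] + groups[i + 1:])    # all other groups are the training set
        yield train_idx, val_idx

val_counts = np.zeros(10, dtype=int)
train_counts = np.zeros(10, dtype=int)
for train_idx, val_idx in k_fold_indices(n=10, k=5):
    val_counts[val_idx] += 1
    train_counts[train_idx] += 1

print(val_counts)     # how many times each row was used for validation
print(train_counts)   # how many times each row was used for training
```

With $k = 5$, each row lands in the validation set once and in the training set $k - 1 = 4$ times.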

### Creating folds in `sklearn`

`sklearn` has a `KFold` class that splits data into training and validation folds.

In [None]:
from sklearn.model_selection import KFold

Let's use a simple dataset for illustration.

In [None]:
data = np.arange(10, 70, 10)
data

Let's instantiate a `KFold` object with $k=3$.

In [None]:
kfold = KFold(3, shuffle=True, random_state=1)
kfold

Finally, let's use `kfold` to `split` `data`:

In [None]:
for train, val in kfold.split(data):
    print(f'train: {data[train]}, validation: {data[val]}')

Note that each value in `data` is used for validation exactly once and for training exactly twice. Also note that because we set `shuffle=True`, the groups are not simply `[10, 20]`, `[30, 40]`, and `[50, 60]`.

### "Manual" $k$-fold cross-validation in `sklearn`

- Let's use `KFold` to perform $k$-fold cross validation in order to help us pick a degree for polynomial regression from the list [1, 2, ..., 15]. 
- We'll use $k=5$ (a common choice, and the default in `sklearn`).
- For the sake of example, we'll suppose `sample_1` is our "training + validation data", i.e. that our test data is in some other dataset.
    - If this were not true, we'd first need to split `sample_1` into train and test.

In [None]:
plt.scatter(sample_1['x'], sample_1['y']);

In [None]:
kfold = KFold(5, shuffle=True, random_state=1)

In [None]:
errs_df = pd.DataFrame()

for d in range(1, 16):
    errs = []
    for train, val in kfold.split(sample_1):
        # Separate the data into a training set and validation set
        data_train, data_val = sample_1.iloc[train], sample_1.iloc[val]
        
        # Fit the model on the training set
        pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
        pl.fit(data_train[['x']], data_train['y'])
        
        # Compute the model's validation error
        val_err = mean_squared_error(data_val['y'], pl.predict(data_val[['x']]), squared=False)
        errs.append(val_err)

    errs_df[f'Deg {d}'] = errs
    
errs_df.index = [f'Fold {i}' for i in range(1, 6)]

In [None]:
errs_df

Note that for each choice of degree (our hyperparameter), we have **five** RMSEs, one for each "fold" of the data.

We should choose the degree with the lowest **average** validation RMSE.

In [None]:
errs_df.mean()

In [None]:
errs_df.mean().idxmin()

Note that if we had relied on a single validation set instead of averaging over $k$ folds, we might have picked a different degree, depending on which fold served as the validation set:

In [None]:
for fold in errs_df.index:
    print(errs_df.loc[fold].idxmin())

### "Semi-automatic" $k$-fold cross validation in `sklearn`

The `cross_val_score` function in `sklearn` implements a few of the previous steps in one. 

```py
cross_val_score(estimator, data, target, cv=k)
```

Specifically, it takes in:
- A `Pipeline` or estimator **that has not already been `fit`**
- Training data
- A value of $k$ (through the `cv` argument)
- (Optionally) A `scoring` metric

and performs $k$-fold cross-validation, returning the values of the scoring metric on each fold.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
errs_df_auto = pd.DataFrame()

for d in range(1, 16):
    pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
    
    # The `scoring` argument is used to specify that we want to compute the RMSE; the default is R^2
    # It is called "neg" RMSE because by default sklearn likes to "maximize" scores
    errs = cross_val_score(pl, sample_1[['x']], sample_1['y'], cv=5, scoring='neg_root_mean_squared_error')
    errs_df_auto[f'Deg {d}'] = -errs # Negate to turn positive (sklearn computed negative RMSE)
    
errs_df_auto.index = [f'Fold {i}' for i in range(1, 6)]

In [None]:
errs_df_auto

In [None]:
errs_df_auto.mean().idxmin()

That was considerably easier! Next class, we'll look at how to streamline this procedure even more (no loop necessary).

***Note:*** You may notice that the RMSEs in the above table, particularly in Folds 1 and 5, are much higher than they were in the manual method. Can you think of reasons why, and how we might fix this? (_Hint: Go back to the "manual" method and switch `shuffle` to `False`. What do you notice?_)
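
When exploring the hint, it may help to know that the `cv` argument accepts not only an integer but also a splitter object such as `KFold`, which gives direct control over shuffling. The sketch below uses made-up data (not `sample_1`) and a plain `LinearRegression`, so its numbers are not comparable to the tables above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data, sorted by x
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
X = x.reshape(-1, 1)
y = 3 * x + rng.normal(0, 1, size=50)

# Passing a KFold object (instead of an integer) lets us decide whether folds are shuffled
shuffled_folds = KFold(5, shuffle=True, random_state=1)
neg_rmses = cross_val_score(LinearRegression(), X, y, cv=shuffled_folds,
                            scoring='neg_root_mean_squared_error')
rmses = -neg_rmses   # negate to recover positive RMSEs

print(rmses)
```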

### Summary: Generalization

1. Split the data into two sets: <span style='color: blue'><b>training</b></span> and <span style='color: orange'><b>test</b></span>.

2. Use only the <span style='color: blue'><b>training</b></span> data when designing, training, and tuning the model.
    - Use <span style='color: green'><b>cross-validation</b></span> to choose hyperparameters and estimate the model's ability to generalize.
    - Do not ❌ look at the <span style='color: orange'><b>test</b></span> data in this step!
    
3. Commit to your final model and train it using the entire <span style='color: blue'><b>training</b></span> set.

4. Evaluate the model using the <span style='color: orange'><b>test</b></span> data. If the performance (e.g. RMSE) is not acceptable, return to step 2.

5. Finally, train on **all available data** and ship the model to production! 🛳

🚨 This is the process you should **always** use! 🚨 

### Discussion Question 🤔

- Suppose you have a training dataset with 1000 rows.
- You want to decide between 20 hyperparameters for a particular model.
- To do so, you perform 10-fold cross-validation.
- **How many times is the first row in the training dataset (`X.iloc[0]`) used for training a model?**

## Example: Decision trees 🌲

<center><img src='imgs/taxonomy.png' width=50%></center>

Decision trees can be used for both regression and classification. We will start by discussing their use in **classification**.

### Example: Predicting diabetes

In [None]:
diabetes = pd.read_csv('data/diabetes.csv')
diabetes.head()

In [None]:
diabetes['Outcome'].value_counts()

For illustration, we'll use `'Glucose'` and `'BMI'` to predict whether or not a patient has diabetes (the response variable is in the `'Outcome'` column).

### Building a decision tree

Let's build a decision tree and interpret the results. But first, a train-test split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(diabetes[['Glucose', 'BMI']], 
                                                    diabetes['Outcome'],
                                                    random_state=1)

The relevant class is `DecisionTreeClassifier`, from `sklearn.tree`.

In [None]:
from sklearn.tree import DecisionTreeClassifier

Note that we `fit` it the same way we `fit` earlier estimators.

_You may wonder what `max_depth=2` does – more on this soon!_

In [None]:
dt = DecisionTreeClassifier(max_depth=2)

In [None]:
dt.fit(X_train, y_train)

### Visualizing decision trees

Our fit decision tree is like a "flowchart", made up of a series of questions.

<span style='color: orange'><b>Class 0 (orange) is "no diabetes"</b></span>; <span style='color: blue'><b>Class 1 (blue) is "diabetes"</b></span>.

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize=(10, 5))
plot_tree(dt, feature_names=X_train.columns, class_names=['no', 'yes'], 
               filled=True, rounded=True, fontsize=15, impurity=False);

- To **classify a new data point**, we start at the top and answer the first question (i.e. "Glucose <= 129.5").
- If the answer is "Yes", we move to the left branch, otherwise we move to the right branch.
- We repeat this process until we end up at a leaf node, at which point we predict the most common class in that node.
    - Note that each node has a `value` attribute, which describes the number of **training** individuals of each class that fell in that node.
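
The "flowchart" traversal described above can be sketched by walking a fitted tree by hand. The data, the name `dt_toy`, and the helper `classify` are made up for illustration; `children_left`, `children_right`, `feature`, `threshold`, and `value` are attributes of the fitted tree's `tree_` object. This is a sketch of the idea, not how `sklearn` implements `predict` internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: two features and binary labels
X = np.array([[80, 20], [95, 31], [110, 24], [120, 35],
              [135, 22], [140, 38], [160, 27], [175, 41]], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

dt_toy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree = dt_toy.tree_

def classify(point):
    """Follow the questions from the root down to a leaf, then predict the leaf's most common class."""
    node = 0                                    # start at the root
    while tree.children_left[node] != -1:       # -1 marks a leaf (no children)
        if point[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]     # answer is "yes": go left
        else:
            node = tree.children_right[node]    # answer is "no": go right
    return dt_toy.classes_[np.argmax(tree.value[node][0])]

manual_preds = np.array([classify(row) for row in X])
print(manual_preds)
print(dt_toy.predict(X))
```

The two printed arrays should agree, since both follow the same sequence of questions for each point.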

In [None]:
# Note that the left child of the root (the `Glucose <= 129.5` branch) has a `value` of [304, 78]
y_train.loc[X_train[X_train['Glucose'] <= 129.5].index].value_counts()

### Evaluating classifiers

The most common evaluation metric in classification is **accuracy**:

$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$

In [None]:
(dt.predict(X_train) == y_train).mean()

The `score` method of a classifier computes accuracy by default.

In [None]:
# Training accuracy – same number as above
dt.score(X_train, y_train)

In [None]:
# Testing accuracy
dt.score(X_test, y_test)

### Some questions...

- How did `sklearn` decide what questions to ask?

- Can we ask more questions (i.e. build a **deeper** tree)?

The answers will come next class!

## Summary, next time

### Summary

- A model's training error tends to decrease as model complexity increases, while its test error tends to decrease until it reaches a "sweet spot" and then increase again.
- A hyperparameter is a configuration that we choose before training a model; an important task in machine learning is selecting "good" hyperparameters.
- In order to quantify a model's ability to generalize to unseen data, use **$k$-fold cross-validation**.
    - In particular, $k$-fold CV is used to select hyperparameters.
- Decision trees can be used for classification and regression.