In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

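# matplotlib 3.6 renamed 'seaborn-white' to 'seaborn-v0_8-white'; use whichever name this version has
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')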
plt.rc('figure', dpi=100, figsize=(7, 5))
plt.rc('font', size=12)

import warnings
warnings.simplefilter('ignore')

# Lecture 23 – Pipelines and Evaluation

## DSC 80, Spring 2022

For fun: read [this article from CBS 8 San Diego](https://www.cbs8.com/article/news/investigations/san-diego-pays-113-million-in-overtime/509-cccb0373-b602-448d-90d7-a40b0f30a2a6) 💰🚒🚔.

> Last year, the city paid \\$52.93 million in overtime to the fire department. San Diego Police Department received the second highest out of any other department with \\$36.6 million.

### Announcements

- Discussion 7 is due for extra credit **tomorrow at 11:59PM**.
    - See [this post](https://campuswire.com/c/G325FA25B/feed/1473) for a clarification.
- Lab 8 is due on **Monday, May 23rd at 11:59PM**.
- Project 4 is due **Thursday, May 26th at 11:59PM**.
    - See [this post](https://campuswire.com/c/G325FA25B/feed/1511) for more public tests for Question 5.
- Lab 7 grades are released.
    - See [this post](https://campuswire.com/c/G325FA25B/feed/1515) for more details.

### Agenda

- Models in `sklearn`.
- Pipelines.
- Model evaluation 🧪.

## Models in `sklearn`

### Example: Predicting `'tip'` from `'total_bill'` and `'size'`

In [None]:
tips = sns.load_dataset('tips')
tips.head()

First, we instantiate and fit. By calling `fit`, we are saying "minimize mean squared error and find $w^*$".

In [None]:
lr = LinearRegression()

# Note that there are two arguments to fit – X and y!
# (It is not necessary to write X= and y=)
lr.fit(X=tips[['total_bill', 'size']], y=tips['tip'])

After fitting, the `predict` method is available. Note that the argument to `predict` can be any 2D array with two columns.

In [None]:
# Predicted tip from a table of 3 that spends $25 
lr.predict([[25, 3]])

In [None]:
# Predicted tip from a table of 14 that spends $1000 – probably not accurate!
lr.predict([[1000, 14]])

We can access the intercepts and slopes individually. This model is of the form

$$\text{predicted tip} = w_0^* + w_1^* \cdot \text{total bill} + w_2^* \cdot \text{table size}$$

so we should expect three parameters total.

In [None]:
lr.intercept_

In [None]:
lr.coef_
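
To see how `intercept_` and `coef_` fit together, here is a minimal sketch that rebuilds a model's predictions by hand. It uses a small synthetic dataset rather than `tips` (an assumption, so that it runs on its own), and the names `X_demo`, `y_demo`, `lr_demo`, and `manual_preds` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (total_bill, size) -> tip; this is NOT the real tips data.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(5, 50, size=50),    # plays the role of 'total_bill'
    rng.integers(1, 7, size=50),    # plays the role of 'size'
])
y_demo = 1 + 0.1 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 0.5, size=50)

lr_demo = LinearRegression().fit(X_demo, y_demo)

# predicted value = w0 + w1 * (first feature) + w2 * (second feature)
manual_preds = lr_demo.intercept_ + X_demo @ lr_demo.coef_

# One intercept plus two slopes gives three parameters total.
n_params = 1 + len(lr_demo.coef_)
print(n_params, np.allclose(manual_preds, lr_demo.predict(X_demo)))
```

The hand-computed predictions should agree with `predict`, since `predict` for a linear model is just this dot product plus the intercept.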

If we want to compute the RMSE of our model, we need to find its predictions on every row in the training data (`tips`).

In [None]:
all_preds = lr.predict(tips[['total_bill', 'size']])

In [None]:
np.sqrt(np.mean((all_preds - tips['tip']) ** 2))

It turns out that fit `LinearRegression` objects also have a `score` method:

In [None]:
lr.score(tips[['total_bill', 'size']], tips['tip'])

That doesn't look like the RMSE... what is it? 🤔

### Aside: $R^2$

- $R^2$, or the **coefficient of determination**, is a measure of the **quality of a linear fit**.
- There are a few equivalent ways of computing it, assuming your model **is linear and has an intercept term**:

$$R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$$

$$R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$$

- In the simple linear regression case, it is the square of the correlation coefficient, $r$.
- **Key idea:** $R^2$ ranges from 0 to 1. **The closer it is to 1, the better the linear fit is.**
    - $R^2$ has no units of measurement, unlike RMSE.
- Interpretation: $R^2$ is the **proportion of variance in $y$ that the linear model explains**.

### Calculating $R^2$

Recall, `all_preds` contains the predicted `'tip'` for every data point in `tips`.

In [None]:
tips.head()

In [None]:
all_preds[:5]

**Method 1:** $R^2 = \frac{\text{var}(\text{predicted $y$ values})}{\text{var}(\text{actual $y$ values})}$


In [None]:
np.var(all_preds) / np.var(tips['tip'])

**Method 2:** $R^2 = \left[ \text{correlation}(\text{predicted $y$ values}, \text{actual $y$ values}) \right]^2$

Note: By correlation here, we are referring to $r$.

In [None]:
# np.corrcoef returns a 2x2 correlation matrix; r is the off-diagonal entry
np.corrcoef(all_preds, tips['tip'])[0, 1] ** 2

**Method 3:** `lr.score`

In [None]:
lr.score(tips[['total_bill', 'size']], tips['tip'])

All three methods provide the same result!
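
As a cross-check, the sketch below computes $R^2$ four ways on synthetic data (an assumption; `x_demo` and `y_demo` are made up and are not `tips`). In addition to the three methods above, it includes the residual-based definition $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, which is how the scikit-learn documentation defines the value returned by `score`. The agreement between methods relies on the model being linear with an intercept and on evaluating it on the data it was fit on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two features; not the real tips data.
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, size=(80, 2))
y_demo = 3 + 2 * x_demo[:, 0] - x_demo[:, 1] + rng.normal(0, 2, size=80)

model = LinearRegression().fit(x_demo, y_demo)
preds = model.predict(x_demo)

# Method 1: ratio of variances.
r2_var_ratio = np.var(preds) / np.var(y_demo)

# Method 2: squared correlation (corrcoef returns a 2x2 matrix, so index into it).
r2_corr = np.corrcoef(preds, y_demo)[0, 1] ** 2

# Method 3: the built-in score method.
r2_score_method = model.score(x_demo, y_demo)

# Method 4: one minus (residual sum of squares / total sum of squares).
r2_residual = 1 - np.sum((y_demo - preds) ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)

print(r2_var_ratio, r2_corr, r2_score_method, r2_residual)
```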

### `LinearRegression` summary

|Property|Example|Description|
|---|---|---|
|Initialize model parameters| `lr = LinearRegression()` | Create (empty) linear regression model|
|Fit the model to the data | `lr.fit(data, responses)` | Determines regression coefficients|
|Use model for prediction |`lr.predict(newdata)`| Use the fitted regression model to make predictions|
|Evaluate the model| `lr.score(data, responses)` | Calculate the $R^2$ of the LR model|
|Access model attributes| `lr.coef_` | Access the regression coefficients|

***Note:*** Once `fit`, estimators like `LinearRegression` are just transformers (`predict` <-> `transform`).

## Pipelines

<center><img src="imgs/image_0.png" width="50%"></center>

<br>

So far, we've used transformers for feature engineering and models for prediction. We can combine these steps into a single `Pipeline`.

### `Pipeline`s in `sklearn`

- A `Pipeline` object is instantiated using a **list** containing transformer(s) and a model (estimator).
```py
pl = Pipeline([feat_trans1, feat_trans2, ..., mdl])
```
- Once a `Pipeline` is instantiated, you can fit **all** steps (transformers and model) using `fit`.
```py
pl.fit(data, responses)
```
- To make predictions using **raw (untransformed) data**, use `pl.predict`.

### Creating a `Pipeline`

- To instantiate a `Pipeline`, we must provide a list with zero or more transformers followed by a single model.
    - All "steps" must have `fit` methods, and all but the last must have `transform` methods.
- The list we provide `Pipeline` with must be a list of **tuples**, where
    - The first element is a "name" (that we choose) for the step.
    - The second element is a transformer or estimator instance.

Let's build a `Pipeline` that:
- One-hot-encodes the categorical features in `tips`.
- Fits a regression model on the one-hot-encoded data.

In [None]:
tips_cat = tips[['sex', 'smoker', 'day', 'time']]
tips_cat.head()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [None]:
pl = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression())
])

Now that `pl` is instantiated, we `fit` it the same way we would fit the individual steps.

In [None]:
pl.fit(tips_cat, tips['tip'])

Now, to make predictions using **raw data**, all we need to do is use `pl.predict`:

In [None]:
pl.predict([['Male', 'Yes', 'Sat', 'Lunch']])

In [None]:
pl.predict(tips_cat.iloc[:5])

`pl` performs **both** feature transformation and prediction with just a single call to `predict`!
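
To make that concrete, here is a small sketch comparing a `Pipeline` against the same two steps done by hand. The data is a tiny made-up DataFrame (an assumption, not `tips`), and `df_demo`, `pl_demo`, and the other names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset with two categorical features.
df_demo = pd.DataFrame({
    'day': ['Sat', 'Sun', 'Sat', 'Fri', 'Sun', 'Fri'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner', 'Dinner'],
})
y_demo = np.array([3.0, 3.5, 2.0, 1.5, 4.0, 2.5])

# Manual version: fit the encoder, transform, then fit and use the model.
ohe = OneHotEncoder()
X_encoded = ohe.fit_transform(df_demo)
lr_manual = LinearRegression().fit(X_encoded, y_demo)
manual_preds = lr_manual.predict(ohe.transform(df_demo))

# Pipeline version: one fit call, one predict call, raw data in.
pl_demo = Pipeline([
    ('one-hot', OneHotEncoder()),
    ('lin-reg', LinearRegression()),
])
pl_demo.fit(df_demo, y_demo)
pipeline_preds = pl_demo.predict(df_demo)

print(np.allclose(manual_preds, pipeline_preds))
```

Both routes run the same computations, so the predictions should match; the `Pipeline` just saves us from passing intermediate results around ourselves.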

We can access individual "steps" of a `Pipeline` through the `named_steps` attribute:

In [None]:
pl.named_steps

In [None]:
pl.named_steps['one-hot'].transform(tips_cat).toarray()

In [None]:
pl.named_steps['lin-reg'].coef_

### More sophisticated `Pipeline`s

- In the previous example, we one-hot-encoded every input column. **What if we want to perform different transformations on different columns?**
- **Solution:** Use a `ColumnTransformer`.
    - Instantiate a `ColumnTransformer` using a list of tuples, where:
        - The first element is a "name" we choose for the transformer.
        - The second element is a transformer instance (e.g. `OneHotEncoder()`).
        - The third element is a **list of relevant column names**.
    - `ColumnTransformer` is extremely useful, but it was only added to `sklearn` in 2018!

<center><img src='imgs/image_3.png' width=50%></center>

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

Let's perform different transformations on the quantitative and categorical features of `tips` (so, we will not transform `'tip'`).

In [None]:
tips_features = tips.drop('tip', axis=1)
tips_features.head()

- To the **quantitative features (`'total_bill'` and `'size'`)**, we will apply the `StandardScaler` transformer.
- To the **categorical features**, we will apply the `OneHotEncoder` transformer.

In [None]:
preproc = ColumnTransformer(
    transformers = [
        ('quant', StandardScaler(), ['total_bill', 'size']),
        ('cat', OneHotEncoder(), ['sex', 'smoker', 'day', 'time'])
    ]
)

Now, let's create a `Pipeline` using `preproc` as a transformer, and `fit` it:

In [None]:
pl = Pipeline([
    ('preprocessor', preproc), 
    ('lin-reg', LinearRegression())
])

In [None]:
pl.fit(tips_features, tips['tip'])

Prediction is as easy as calling `predict`:

In [None]:
tips_features.head()

In [None]:
pl.predict(tips_features.head())

`pl` also has a `score` method, the same way a fit `LinearRegression` instance does:

In [None]:
pl.score(tips_features, tips['tip'])

Recall, we can access the individual "steps" in `pl` using the `named_steps` attribute:

In [None]:
pl.named_steps['preprocessor'].transform(tips_features)

**Note:** `ColumnTransformer` has a `remainder` argument that you can use to specify what to do with columns that aren't being transformed (`'drop'`, the default, or `'passthrough'`).
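
Here is a minimal sketch of the difference between the two `remainder` options, using a made-up DataFrame (the `'party_id'` column is hypothetical and only there to have something left over).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'size': [1, 2, 3, 4],
    'party_id': [101, 102, 103, 104],   # hypothetical extra column
})

# Only 'total_bill' is explicitly transformed.
scale_bill = [('quant', StandardScaler(), ['total_bill'])]

# remainder='drop' discards the untouched columns.
dropped = ColumnTransformer(scale_bill, remainder='drop').fit_transform(df_demo)

# remainder='passthrough' appends the untouched columns, unchanged, after the transformed ones.
passed = ColumnTransformer(scale_bill, remainder='passthrough').fit_transform(df_demo)

print(dropped.shape)  # (4, 1)
print(passed.shape)   # (4, 3)
```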

## Model evaluation 🧪

### Motivation

- You and Billy are studying for an upcoming exam. You both decide to test your understanding by taking a **practice exam**.
    - Your logic: If you do well on the practice exam, you should do well on the real exam.

- You each take the practice exam once and look at the solutions afterwards.

- **Your strategy:** Memorize the answers to all practice exam questions, e.g. "Question 1: A; Question 2: C; Question 3: A."

- **Billy's strategy:** Learn high-level concepts from the solutions, e.g. "data are NMAR if the likelihood of missingness depends on the missing values themselves."

- Who will do better on the **practice exam**? Who will probably do better on the **real exam**? 🧐

### Evaluating the quality of a model

- So far, we've computed the RMSE (and $R^2$) of our fit regression models on the **data that we used to fit them**, i.e. the **training data**.
- We've said that Model A is **better** than Model B if Model A's RMSE is **lower** than Model B's RMSE.
    - Remember, our **training data** is a sample from the data generating process.
    - Just because a model fits the training data well doesn't mean it will **generalize** and work well on **similar, unseen samples**!

### Example: Overfitting and underfitting

Let's collect two samples $\{(x_i, y_i)\}$ from the same **data generating process**.

In [None]:
np.random.seed(23) # For reproducibility

def sample_dgp(n=100):
    x = np.linspace(-2, 3, n)
    y = x ** 3 + (np.random.normal(0, 3, size=n))
    return x.reshape(-1, 1), y

x1, y1 = sample_dgp()
x2, y2 = sample_dgp()

For now, let's just look at Sample 1. The relationship between $x$ and $y$ is roughly **cubic**; that is, $y \approx x^3$ (remember, in reality, you won't get to see the DGP).

In [None]:
plt.scatter(x1, y1);

Let's fit three **polynomial** models on Sample 1:
- Degree 1.
- Degree 3.
- Degree 15.

The `PolynomialFeatures` transformer will be helpful here.
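
As a quick sanity check of what `PolynomialFeatures` produces, the sketch below compares its output on a single made-up column against powers computed by hand. With one input column $x$ and the default `include_bias=True`, degree 3 gives the columns $1, x, x^2, x^3$.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[1.0], [2.0], [3.0]])
poly_features = PolynomialFeatures(3).fit_transform(x_demo)

# Build the same columns by hand: x^0, x^1, x^2, x^3.
expected = np.column_stack([x_demo[:, 0] ** p for p in range(4)])

print(poly_features)
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```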

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
d3 = PolynomialFeatures(3) # Degree 3: creates the columns 1, x, x^2, x^3
d3.fit_transform([[1], [2], [3], [4], [5]]) # fit_transform fits and transforms the same input

In [None]:
degs = [1, 3, 15]

def fit_polys(x, y):
    # Create all three Pipelines
    pls = [
        Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
        for d in degs
    ]

    # Fit all three Pipelines
    for pl in pls:
        pl.fit(x, y)

    # Make all three sets of predictions
    preds = [pl.predict(x) for pl in pls]
    return preds
    
preds_1 = fit_polys(x1, y1)

Below, we look at our three models' predictions on Sample 1 (which they were trained on).

In [None]:
plt.subplots(1, 3, figsize=(15, 5), dpi=100)
plt.suptitle('Performance on Sample 1\n')

for i in range(1, 4):
    plt.subplot(1, 3, i)
    plt.scatter(x1, y1, label='actual')
    plt.plot(x1, preds_1[i-1], color='orange', label='predictions', linewidth=5)
    rmse_d = np.sqrt(np.mean((preds_1[i-1] - y1) ** 2))
    plt.title(f'Degree {degs[i-1]}, RMSE: {np.round(rmse_d, 2)}')
plt.legend(loc=(1.04, 0))
plt.show()

The degree 15 polynomial has the lowest RMSE on Sample 1.

How do things look in Sample 2?

In [None]:
plt.subplots(1, 3, figsize=(15, 5), dpi=100)
plt.suptitle('Performance on Sample 2\n')

for i in range(1, 4):
    plt.subplot(1, 3, i)
    plt.scatter(x2, y2, label='actual')
    plt.plot(x1, preds_1[i-1], color='orange', label='predictions', linewidth=5)
    rmse_d = np.sqrt(np.mean((preds_1[i-1] - y2) ** 2))
    plt.title(f'Degree {degs[i-1]}, RMSE: {np.round(rmse_d, 2)}')
plt.legend(loc=(1.04, 0))
plt.show()

- The degree 3 polynomial has the lowest RMSE on Sample 2. 
- Note that **we didn't get to see Sample 2 when fitting our models**! 
- As such, it seems that the degree 3 polynomial **generalizes better** to unseen data than the degree 15 polynomial does.

What if we fit a degree 1, degree 3, and degree 15 polynomial **on Sample 2** as well?

In [None]:
preds_2 = fit_polys(x2, y2)

plt.subplots(1, 3, figsize=(15, 5), dpi=100)
plt.suptitle('Models Fit on Samples 1 and 2\n')

for i in range(1, 4):
    plt.subplot(1, 3, i)
#     plt.scatter(x2, y2, label='actual')
    plt.plot(x1, preds_1[i-1], color='orange', label='Fit on Sample 1', linewidth=5)
    plt.plot(x2, preds_2[i-1], color='purple', label='Fit on Sample 2', linewidth=5)
    plt.title(f'Degree {degs[i-1]}')
plt.legend(loc=(1.04, 0))
plt.show()

**Key idea:** The degree 15 polynomial seems to **vary more** than the degree 3 and 1 polynomials do.

### Bias and variance

The training data we have access to is a sample from the DGP. We are concerned with our model's performance **across different datasets** from the same DGP.

Suppose we **fit** a model $H$ (e.g. a degree 3 polynomial) on **several different datasets** from a DGP. There are three sources of error that arise:

- ⭐️ **Bias**: **The expected deviation between a predicted value and an actual value**.
    - In other words, **for a given $x_i$, how far is $H(x_i)$ from the true $y_i$, on average?**
    - Low bias is good! ✅
    - High bias is a sign of **underfitting**, i.e. that our model is too **basic** to capture the relationship between our features and response.

- ⭐️ **Model variance ("variance")**: **The variance of a model's predictions**.
    - In other words, **for a given $x_i$, what is the variance of $H(x_i)$ across all datasets**?
    - Low model variance is good! ✅
    - High model variance is a sign of **overfitting**, i.e. that our model is too **complicated** and is prone to fitting to the noise in our training data.

- **Observation variance**: The variance due to the random noise in the process we are trying to model (e.g. measurement error). _We can't control this without collecting more data!_
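
Because we are simulating the DGP ourselves, we can estimate bias and model variance directly. The sketch below is one rough way to do it, under stated assumptions: it draws 200 datasets (an arbitrary choice) from the same cubic DGP used in the example above, fits polynomials of degree 1, 3, and 15 to each, and measures bias against the noiseless signal $x^3$ rather than against noisy observations. The names `x_grid`, `signal`, and `results` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(23)
x_grid = np.linspace(-2, 3, 100).reshape(-1, 1)
signal = x_grid.ravel() ** 3      # noiseless part of the DGP; known only because we simulate
n_datasets = 200                  # arbitrary number of simulated datasets

# Draw all datasets up front so every degree sees the same samples.
noise = rng.normal(0, 3, size=(n_datasets, len(x_grid)))

results = {}
for d in [1, 3, 15]:
    preds = np.empty((n_datasets, len(x_grid)))
    for k in range(n_datasets):
        y_k = signal + noise[k]   # one sample from the DGP
        pl_d = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
        preds[k] = pl_d.fit(x_grid, y_k).predict(x_grid)

    # Squared bias: how far the average prediction is from the signal, averaged over x.
    bias_sq = np.mean((preds.mean(axis=0) - signal) ** 2)
    # Model variance: how much predictions at each x vary across datasets, averaged over x.
    variance = np.mean(preds.var(axis=0))
    results[d] = (bias_sq, variance)
    print(f'Degree {d}: squared bias = {bias_sq:.3f}, model variance = {variance:.3f}')
```

In a run like this, we would expect the degree 1 model to show the largest squared bias and the degree 15 model to show the largest model variance.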

Here, suppose 
- the <span style='color:red'>red bulls-eye</span> represents your **true weight and height** 🧍, and 
- the <span style='color:blue'>blue darts</span> represent **predictions of your weight and height** using different models that were fit on the same DGP. 
<br>

<center><img src="imgs/image_5.png" width="40%"></center>

### Avoiding overfitting

- We won't know whether our model has **overfit** to our sample (training data) unless we get to see how well it performs on a new sample from the same DGP.

- **Idea 💡:** **Split** our sample into a **training set** and **test set**.

- Use **only** the training set to fit the model (i.e. find $w^*$).

- Use the test set to evaluate the model's error (RMSE, $R^2$).

- This is like "generating" new data from the same DGP!
    - _Similar_ to bootstrapping (but not quite the same, because there is no resampling involved).
    - **If our sample is not representative of the DGP, this method has limited effectiveness!**

<center><img src="imgs/train-test.png" width='50%'></center>

### Train-test split 🚆

`sklearn.model_selection.train_test_split` implements a train-test split for us! 🙏🏼 

If `X` is an array/DataFrame of features and `y` is an array/Series of responses,

```py
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```

randomly splits the features and responses into training and test sets, such that the test set contains 25% of the full dataset.
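
Since the split is random, rerunning a cell gives a different split each time. A minimal sketch of how to make it reproducible, using the `random_state` parameter of `train_test_split` on made-up arrays (`X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 2 features; row i is [2i, 2i + 1] and its response is i.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=23)
print(X_tr.shape, X_te.shape)   # (75, 2) (25, 2)

# Using the same random_state again reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.25, random_state=23)
print(np.array_equal(X_tr, X_tr2))

# Features and responses are shuffled together, so rows stay matched up.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```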

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Read the documentation!
train_test_split?

Let's perform a train/test split on our `tips` dataset.

In [None]:
X = tips.drop('tip', axis=1)
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # Don't have to choose 0.25

Before proceeding, let's check the sizes of `X_train` and `X_test`.

In [None]:
print('Rows in X_train:', X_train.shape[0])
display(X_train.head())
print('Rows in X_test:', X_test.shape[0])
display(X_test.head())

In [None]:
X_train.shape[0] / tips.shape[0]

### Example prediction pipeline

Steps:
- Fit a model on the training set.
- Evaluate the model on the test set.

In [None]:
X = tips[['total_bill', 'size']] # For this example, we'll use just the already-quantitative columns in tips
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Here, we'll use a stand-alone `LinearRegression` model without a `Pipeline`, but this process would work the same if we were using a `Pipeline`.

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

Let's check our model's performance on the **training** set first.

In [None]:
pred_train = lr.predict(X_train)
rmse_train = np.sqrt(np.mean((pred_train - y_train) ** 2))
rmse_train

And the **test** set:

In [None]:
pred_test = lr.predict(X_test)
rmse_test = np.sqrt(np.mean((pred_test - y_test) ** 2))
rmse_test

Since `rmse_train` and `rmse_test` are similar, it **doesn't seem like our model is overfitting** to the training data. If `rmse_test` were much larger than `rmse_train`, that would be evidence that our model is unable to **generalize well**.
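
The same train/test comparison works with a `Pipeline`. Below is a sketch that ties this back to the polynomial example, under stated assumptions: the data is simulated from a cubic DGP with $x$ drawn uniformly at random (so this is not the `tips` data), and the helper `rmse` and the names `x_sim`, `y_sim`, and `errors` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def rmse(actual, pred):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((actual - pred) ** 2))

# Simulated data: y is roughly x^3 plus noise.
rng = np.random.default_rng(23)
x_sim = rng.uniform(-2, 3, size=(200, 1))
y_sim = x_sim.ravel() ** 3 + rng.normal(0, 3, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x_sim, y_sim, test_size=0.2, random_state=23
)

errors = {}
for d in [1, 3, 15]:
    pl_d = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
    pl_d.fit(x_train, y_train)          # Fit on the training set only
    errors[d] = (
        rmse(y_train, pl_d.predict(x_train)),
        rmse(y_test, pl_d.predict(x_test)),
    )
    print(f'Degree {d}: train RMSE = {errors[d][0]:.2f}, test RMSE = {errors[d][1]:.2f}')
```

The thing to look for is the gap between the two numbers for each degree: a training RMSE that is much lower than the test RMSE suggests overfitting, while two similarly large values suggest underfitting.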

## Summary, next time

### Summary

- $R^2$ is a measure of a model's "goodness-of-fit". For linear models, it ranges between 0 (poor) and 1 (great).
- `Pipeline`s in `sklearn` combine one or more transformers with a single model (estimator), allowing us to perform feature engineering and prediction through a single object.
- We want to build models that **generalize well** to unseen data.
    - Models that have high **bias** are too simple to represent complex relationships in data, and **underfit**.
    - Models that have high **variance** are overly complex for the relationships in the data, and vary a lot when fit on different datasets. Such models **overfit** to the training data.
- To guard against overfitting, we should perform a **train-test split**, in which we train our model on only a subset of the data available to us and use the remainder for evaluation.