In [None]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'

import warnings
warnings.simplefilter('ignore')

# Lecture 25 – Grid Search, Multicollinearity, Examples

## DSC 80, Spring 2023

### Agenda

- Recap: Decision trees 🌲 and grid search.
- Multicollinearity (including how it arises in one hot encoding).
- Example: Modeling with text features.

## Recap: Decision trees 🌲 and grid search

### Example: Predicting diabetes

In [None]:
diabetes = pd.read_csv('data/diabetes.csv')
diabetes.head()

In [None]:
fig = (
    diabetes.assign(Outcome=diabetes['Outcome'].astype(str))
            .plot(kind='scatter', x='Glucose', y='BMI', color='Outcome', 
                  color_discrete_map={'0': 'orange', '1': 'blue'},
                  title='Relationship between Glucose, BMI, and Diabetes',
                  template=TEMPLATE)
)
fig

Recall, we started with a relatively simple decision tree.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [None]:
X_train, X_test, y_train, y_test = train_test_split(diabetes[['Glucose', 'BMI']], 
                                                    diabetes['Outcome'],
                                                    random_state=1)

In [None]:
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15, 5))
# Passing a plain list of feature names works across scikit-learn versions.
plot_tree(dt, feature_names=list(X_train.columns), class_names=['no db', 'yes db'], 
          filled=True, rounded=True, fontsize=15, impurity=False);

### Goal

Create a `DecisionTreeClassifier` that 

- will **generalize well** to unseen data, by

- finding the **combination of hyperparameters (`max_depth`, `min_samples_split`, `criterion`)** that maximizes average validation accuracy.

### Grid search

`GridSearchCV` takes in:
- an **un-`fit`** instance of an estimator, and
- a **dictionary** of hyperparameter values to try,

and performs $k$-fold cross-validation to find the **combination of hyperparameters with the best average validation performance**.

In [None]:
from sklearn.model_selection import GridSearchCV

The following dictionary contains the values we're considering for each hyperparameter. (We're using `GridSearchCV` with 3 hyperparameters, but we could use it with even just a single hyperparameter.)

In [None]:
hyperparameters = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None], 
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],
    'criterion': ['gini', 'entropy']
}

Note that there are 140 **combinations** of hyperparameters we need to try. We need to find the **best combination** of hyperparameters, not the best value for each hyperparameter individually.

In [None]:
len(hyperparameters['max_depth']) * \
len(hyperparameters['min_samples_split']) * \
len(hyperparameters['criterion'])
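The same count can be reached by enumerating the grid explicitly, which makes it clearer that each candidate is one (`max_depth`, `min_samples_split`, `criterion`) triple. This is only a sketch using `itertools`; the names `combos`, `n_folds`, and `n_fits` are illustrative, and the fold count of 5 is taken from the `cv=5` used in this example.

```python
from itertools import product

hyperparameters = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],
    'criterion': ['gini', 'entropy'],
}

# One dictionary per combination of hyperparameter values.
combos = [dict(zip(hyperparameters, values))
          for values in product(*hyperparameters.values())]

n_folds = 5                      # Matches cv=5 in this example.
n_fits = len(combos) * n_folds   # Models trained during cross-validation.

print(len(combos), n_fits)
print(combos[0])
```

Adding one more value to any single list multiplies the work, which is why the number of fits grows so quickly with the size of the grid.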

`GridSearchCV` needs to be instantiated and `fit`.

In [None]:
searcher = GridSearchCV(DecisionTreeClassifier(), hyperparameters, cv=5)

In [None]:
searcher.fit(X_train, y_train)

After being `fit`, the `best_params_` attribute provides us with the best combination of hyperparameters to use.

In [None]:
searcher.best_params_

All of the intermediate results – validation accuracies for each fold, mean validation accuracies, etc. – are stored in the `cv_results_` attribute:

In [None]:
searcher.cv_results_['mean_test_score'] # Array of length 140.

In [None]:
# Rows correspond to folds, columns correspond to hyperparameter combinations.
pd.DataFrame(np.vstack([searcher.cv_results_[f'split{i}_test_score'] for i in range(5)]))

Note that the above DataFrame tells us that 5 * 140 = 700 models were trained in total!

Now that we've found the best combination of hyperparameters, we should fit a decision tree instance using those hyperparameters on our entire training set.

In [None]:
searcher.best_params_

In [None]:
final_tree = DecisionTreeClassifier(**searcher.best_params_)
final_tree

In [None]:
final_tree.fit(X_train, y_train)

In [None]:
# Training accuracy.
final_tree.score(X_train, y_train)

In [None]:
# Testing accuracy.
final_tree.score(X_test, y_test)

Remember, `searcher` itself is a model object (we had to `fit` it). After performing $k$-fold cross-validation, behind the scenes, `searcher` is trained on the entire training set using the optimal combination of hyperparameters.

In other words, `searcher` makes the same predictions that `final_tree` does!

In [None]:
searcher.score(X_train, y_train)

In [None]:
searcher.score(X_test, y_test)

### Choosing possible hyperparameter values

- A full grid search can take a **long time**.
    - In our previous example, we tried 140 combinations of hyperparameters.
    - Since we performed 5-fold cross-validation, we trained 700 decision trees under the hood.

- **Question**: How do we pick the possible hyperparameter values to try?

- **Answer**: Trial and error.
    - If the "best" choice of a hyperparameter was at an extreme, try increasing the range.
    - For instance, if you try `max_depth`s from 32 to 128, and 32 was the best, try including `max_depth`s under 32.
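The "extreme" check described above can be sketched as a small helper. `best_on_edge` is a hypothetical function written for illustration, not part of `sklearn`, and it assumes the candidate values are all numeric (so a grid containing `None` would need separate handling).

```python
def best_on_edge(candidates, best):
    """Return 'low' or 'high' if `best` is the smallest or largest candidate, else None."""
    ordered = sorted(candidates)
    if best == ordered[0]:
        return 'low'
    if best == ordered[-1]:
        return 'high'
    return None

depths = [32, 64, 128]
edge = best_on_edge(depths, best=32)

if edge == 'low':
    # The best depth was the smallest one tried, so extend the range downward.
    depths = [4, 8, 16] + depths

print(edge, depths)
```

In practice, `best` would come from something like `searcher.best_params_['max_depth']`, and the search would be re-run on the widened list.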

### Key takeaways

- Decision trees are trained by finding the best questions to ask using the features in the training data. A good question is one that isolates classes as much as possible.
- Decision trees have a tendency to overfit to training data. One way to mitigate this is by restricting the maximum depth of the tree.
- To efficiently find hyperparameters through cross-validation, use `GridSearchCV`.
    - Specify which values to try for each hyperparameter, and `GridSearchCV` will try all **unique combinations of hyperparameters** and return the combination with the best average validation performance.
    - `GridSearchCV` is not the only solution – see [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) if you're curious.
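For the curious, here is a minimal sketch of `RandomizedSearchCV` on the same grid. It uses synthetic data from `make_classification` rather than the diabetes dataset, and the choices `n_iter=20` and `random_state=0` are arbitrary assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

param_distributions = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],
    'criterion': ['gini', 'entropy'],
}

# Instead of all 140 combinations, sample only n_iter of them at random.
random_searcher = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,
)
random_searcher.fit(X_demo, y_demo)

n_tried = len(random_searcher.cv_results_['params'])
print(n_tried, random_searcher.best_params_)
```

The trade-off is that only 20 × 5 = 100 trees are trained during cross-validation instead of 700, at the cost of possibly missing the single best combination in the grid.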

## Multicollinearity

### Heights and weights

We have a dataset containing the weights and heights of 25,000 18-year-olds, taken from [here](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights).

In [None]:
people = pd.read_csv('data/SOCR-HeightWeight.csv').drop('Index', axis=1)
people.head()

In [None]:
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)', 
            title='Weight vs. Height for 25,000 18 Year Olds', template=TEMPLATE)

### Motivating example

Suppose we fit a simple linear regression model that uses **height in inches** to predict **weight in pounds**.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)}$$

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(people[['Height (Inches)']], 
                                                            people['Weight (Pounds)'], 
                                                            random_state=1)

In [None]:
lr_one_feat = LinearRegression()
lr_one_feat.fit(X_train_1, y_train_1)

$w_0^*$ and $w_1^*$ are shown below, along with the model's **testing** RMSE.

In [None]:
lr_one_feat.intercept_, lr_one_feat.coef_

In [None]:
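# Take the square root ourselves; the `squared=False` argument of
# `mean_squared_error` is deprecated in newer scikit-learn releases.
rmse_one_feat = np.sqrt(mean_squared_error(y_test_1, 
                                           lr_one_feat.predict(X_test_1)))
rmse_one_feat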

Now, suppose we fit another regression model, that uses **height in inches** AND **height in centimeters** to predict weight.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)} + w_2 \cdot \text{height (cm)}$$

In [None]:
people['Height (cm)'] = people['Height (Inches)'] * 2.54 # 1 inch = 2.54 cm.

In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(people[['Height (Inches)', 'Height (cm)']], 
                                                            people['Weight (Pounds)'], 
                                                            random_state=1)

In [None]:
lr_two_feat = LinearRegression()
lr_two_feat.fit(X_train_2, y_train_2)

What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's testing RMSE?

In [None]:
lr_two_feat.intercept_, lr_two_feat.coef_

In [None]:
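# Take the square root ourselves; the `squared=False` argument of
# `mean_squared_error` is deprecated in newer scikit-learn releases.
rmse_two_feat = np.sqrt(mean_squared_error(y_test_2, 
                                           lr_two_feat.predict(X_test_2)))
rmse_two_feat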

**Observation**: The intercept is the same as before (roughly -81.17), as is the testing RMSE. However, the coefficients on `'Height (Inches)'` and `'Height (cm)'` are massive in size!

What's going on?

### Redundant features

Let's use simpler numbers for illustration. Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.

$$\text{predicted weight (pounds)} = -80 + 3 \cdot \text{height (inches)}$$

In the second model, we have:

$$\begin{align*}\text{predicted weight (pounds)} &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \text{height (cm)} \\ &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \big( 2.54 \cdot \text{height (inches)} \big) \\ &= w_0^* + \left(w_1^* + 2.54 \cdot w_2^* \right) \cdot \text{height (inches)} \end{align*}$$

In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.

**So, as long as $w_1^* + 2.54 \cdot w_2^* = 3$ in the second model, the second model's training predictions will be the same as the first, and hence they will also minimize RMSE.**

### Infinitely many parameter choices

**Issue**: There are an infinite number of $w_1^*$ and $w_2^*$ values that satisfy $w_1^* + 2.54 \cdot w_2^* = 3$!

$$\text{predicted weight} = -80 - 10 \cdot \text{height (inches)} + \frac{13}{2.54} \cdot \text{height (cm)}$$

$$\text{predicted weight} = -80 + 10 \cdot \text{height (inches)} - \frac{7}{2.54} \cdot \text{height (cm)}$$

- Both prediction rules look very different, but actually make the same predictions.


- `lr_two_feat.coef_` could return either set of coefficients, or any other of the infinitely many options. 


- But neither set of coefficients **has any meaning!**

In [None]:
(-80 - 10 * people.iloc[:, 0] + (13 / 2.54) * people.iloc[:, 2]).head()

In [None]:
(-80 + 10 * people.iloc[:, 0] - (7 / 2.54) * people.iloc[:, 2]).head()
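To see that this holds for the whole family of solutions and not just these two, here is a small sketch using made-up heights instead of the `people` DataFrame. For any value of $w_2$, setting $w_1 = 3 - 2.54 \cdot w_2$ should reproduce the one-feature predictions. The variable names and the particular $w_2$ values are illustrative.

```python
import numpy as np

# Made-up heights, standing in for the 'Height (Inches)' column.
inches = np.array([60.0, 65.0, 70.0, 75.0])
cm = 2.54 * inches

# Predictions from the simplified one-feature model.
one_feature = -80 + 3 * inches

# The two rules written out above.
rule_a = -80 - 10 * inches + (13 / 2.54) * cm
rule_b = -80 + 10 * inches - (7 / 2.54) * cm

# Any w2 works, as long as w1 + 2.54 * w2 = 3.
family_matches = []
for w2 in [-100.0, 0.0, 5.0, 1e4]:
    w1 = 3 - 2.54 * w2
    preds = -80 + w1 * inches + w2 * cm
    family_matches.append(bool(np.allclose(preds, one_feature)))

print(np.allclose(rule_a, one_feature), np.allclose(rule_b, one_feature))
print(family_matches)
```

Note that the coefficients can be made arbitrarily large (for example, $w_2 = 10{,}000$) without changing a single prediction, which is consistent with the massive coefficients observed earlier.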

### Multicollinearity

- Multicollinearity occurs when features in a regression model are **highly correlated** with one another.
    - In other words, multicollinearity occurs when **a feature can be predicted using a linear combination of other features, fairly accurately**.

- When multicollinearity is present in the features, the **coefficients in the model** are uninterpretable – they have no meaning.
    - A "slope" represents "the rate of change of $y$ with respect to a feature", when all other features are held constant – but if there's multicollinearity, you can't hold other features constant.

- **Note: Multicollinearity doesn't impact a model's predictions!**
    - It doesn't impact a model's ability to generalize to unseen data.
    - If features are multicollinear in the training data, they will probably be multicollinear in the test data too.

- **Solutions**:
    - Manually remove highly correlated features.
    - Use a dimensionality reduction technique (such as PCA) to automatically reduce dimensions.
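As a rough sketch of the first solution, one way to find candidates for manual removal is to flag pairs of features whose correlation is very close to ±1. The data below is synthetic, the cutoff of 0.95 is an arbitrary assumption, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
inches = rng.normal(68, 2, size=200)

# Three synthetic features: two redundant heights and one unrelated column.
features = np.column_stack([
    inches,
    2.54 * inches,
    rng.normal(0, 1, size=200),
])
names = ['Height (Inches)', 'Height (cm)', 'unrelated']

# Pairwise correlations between columns.
corr = np.corrcoef(features, rowvar=False)

threshold = 0.95  # Arbitrary cutoff chosen for this sketch.
flagged = [(names[i], names[j])
           for i in range(len(names))
           for j in range(i + 1, len(names))
           if abs(corr[i, j]) > threshold]

print(flagged)
```

Pairwise correlation only catches redundancy between two features at a time; it would not detect a feature that is a linear combination of several others, as happens with one hot encoding.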

### One hot encoding and multicollinearity

When we one hot encode categorical features, we create several **redundant** columns.

In [None]:
tips = px.data.tips()
tips_features = tips.drop('tip', axis=1)
tips_features.head()

Aside: You can use `pd.get_dummies` in EDA, but **don't** use it for modeling (instead, use `OneHotEncoder`, which works with `Pipeline`s).

In [None]:
X = pd.get_dummies(tips_features)
X.head()

Remember that under the hood, `LinearRegression()` creates a **design matrix** that has a column of all ones (for the intercept term). Let's add that column above for demonstration.

In [None]:
X['all_ones'] = 1
X.head()

Now, many of the above columns **can be written as linear combinations of other columns**!
- We don't need `'sex_Male'` – its value is just `'all_ones'` - `'sex_Female'`.
- We don't need `'smoker_Yes'` – its value is just `'all_ones'` - `'smoker_No'`.
- We don't need `'time_Lunch'` – its value is just `'all_ones'` - `'time_Dinner'`.
- We don't need `'day_Thur'` – its value is just `'all_ones'` - (`'day_Fri'` + `'day_Sat'` + `'day_Sun'`).

Note that if we get rid of the four redundant columns above, the **rank** of our design matrix – that is, the number of linearly independent columns it has – does not change (and so the "predictive power" of our features doesn't change either).

In [None]:
np.linalg.matrix_rank(X)

In [None]:
np.linalg.matrix_rank(X.drop(columns=['sex_Male', 'smoker_Yes', 'time_Lunch', 'day_Thur']))

However, without the redundant columns, there is only a single unique set of optimal parameters $w^*$, and the multicollinearity is no more.
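The same effect can be seen on a tiny hand-built example. This sketch one hot encodes a made-up categorical feature with three levels using plain `numpy` (not the `tips` data), adds a column of ones, and compares ranks before and after dropping one encoded column.

```python
import numpy as np

# A made-up categorical feature with three levels, coded 0, 1, 2.
levels = np.array([0, 1, 2, 0, 1, 2])

# Manual one hot encoding: row i of the identity matrix encodes level i.
one_hot = np.eye(3)[levels]

# Design matrix: a column of ones followed by the three encoded columns.
X_full = np.column_stack([np.ones(len(levels)), one_hot])

# Drop the last encoded column; it equals all_ones minus the other two.
X_dropped = X_full[:, :-1]

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)
print(X_dropped.shape[1], rank_dropped)
```

Here the full matrix has four columns but rank 3, while the reduced matrix has three columns and rank 3, so its columns are linearly independent.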

**Aside**: Most one hot encoding techniques (including `OneHotEncoder`) have a built-in `drop` argument, which allows you to specify that you'd like to drop **one column per categorical feature**.

In [None]:
pd.get_dummies(tips_features, drop_first=True)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe = OneHotEncoder(drop='first')
ohe.fit_transform(tips_features[['sex', 'smoker', 'day', 'time']]).toarray()

The above array only has $(2-1) + (2-1) + (4-1) + (2-1) = 6$ columns, rather than $2 + 2 + 4 + 2 = 10$, since we dropped 1 per categorical column in `tips_features`.

### Key takeaways

- Multicollinearity is present in a linear model when one feature can be accurately predicted using one or more other features.
    - In other words, it is present when a feature is **redundant**.
- Multicollinearity doesn't pose an issue for prediction; it doesn't hinder a model's ability to generalize. Instead, it renders the **coefficients** of a linear model meaningless.
- Multicollinearity is present when performing one hot encoding; a solution is to **drop one of the one hot encoded columns for each original categorical feature**.

## Example: Modeling using text features

### Example: Predicting reviews

We have a dataset containing Amazon reviews and ratings for patio, lawn, and gardening products. (Aside: [Here](https://cseweb.ucsd.edu/~jmcauley/datasets.html) is a good source for such data.)

In [None]:
reviews = pd.read_json('data/reviews.json', lines=True)
reviews.head()

**Goal**: Use a review's `'summary'` to predict its `'overall'` rating.

Note that there are five possible `'overall'` rating values – 1, 2, 3, 4, 5 – not just two. As such, this is an instance of **multiclass classification**.

In [None]:
reviews['overall'].value_counts(normalize=True)

**Question**: What is the worst possible accuracy we should expect from a ratings classifier, given the above distribution?

### Aside: `CountVectorizer`

Entries in the `'summary'` column are not currently quantitative! We can use the bag of words encoding to create quantitative features out of each `'summary'`.

Instead of performing a bag of words encoding manually as we did before, we can rely on `sklearn`'s `CountVectorizer`. (There is also a `TfidfVectorizer`.)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
example_corp = ['hey hey hey my name is billy', 
                'hey billy how is your dog billy']

In [None]:
count_vec = CountVectorizer()
count_vec.fit(example_corp)

`count_vec` learned a **vocabulary** from the corpus we `fit` it on.

In [None]:
count_vec.vocabulary_

In [None]:
count_vec.transform(example_corp).toarray()

Note that the values in `count_vec.vocabulary_` correspond to the positions of the columns in `count_vec.transform(example_corp).toarray()`, i.e. `'billy'` is the first column and `'your'` is the last column.

In [None]:
example_corp

In [None]:
pd.DataFrame(count_vec.transform(example_corp).toarray(),
             columns=pd.Series(count_vec.vocabulary_).sort_values().index)
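For comparison, here is a rough manual version of the same encoding using only the standard library. It mimics `CountVectorizer`'s default behavior of lowercasing text, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically; the helper `tokenize` is a hypothetical name, and this is a sketch rather than a reimplementation of `sklearn`'s internals.

```python
import re
from collections import Counter

example_corp = ['hey hey hey my name is billy',
                'hey billy how is your dog billy']

def tokenize(doc):
    """Lowercase a document and keep tokens made of 2+ word characters."""
    return re.findall(r'\b\w\w+\b', doc.lower())

# Alphabetically sorted vocabulary; a word's position is its column index.
vocabulary = sorted({tok for doc in example_corp for tok in tokenize(doc)})

# One row per document, one count per vocabulary word.
counts = [[Counter(tokenize(doc))[word] for word in vocabulary]
          for doc in example_corp]

print(vocabulary)
for row in counts:
    print(row)
```

The first row has a 3 in the `'hey'` column and the second row has a 2 in the `'billy'` column, matching the repeated words in each sentence.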

### Creating an initial `Pipeline`

Let's build a `Pipeline` that takes in summaries and overall ratings and:

- Uses `CountVectorizer` to quantitatively encode summaries.

- Fits a `RandomForestClassifier` to the data.
    - A "random forest" is a combination (or **ensemble**) of decision trees, each fit on a different **bootstrapped** resample of the training data.
    - It makes predictions by aggregating the results of the individual trees (in the case of classification, by taking the **most common prediction**).
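The "most common prediction" idea can be sketched in a few lines. The function `majority_vote` and the list of votes are made up for illustration. Note that `sklearn`'s `RandomForestClassifier` actually combines the trees' predicted class probabilities rather than counting hard votes, so this is a simplification.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from 7 trees for a single review.
votes = [5, 5, 4, 5, 1, 4, 5]
forest_prediction = majority_vote(votes)

print(forest_prediction)
```

In this sketch, ties are broken in favor of whichever label appears first in the list.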

But first, a train-test split (like **always**).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

In [None]:
X = reviews['summary']
y = reviews['overall']
X_train, X_test, y_train, y_test = train_test_split(X, y)

To start, we'll create a random forest with 7 trees (`n_estimators`) each of which has a maximum depth of 8 (`max_depth`).

In [None]:
pl = Pipeline([
    ('cv', CountVectorizer()), 
    ('clf', RandomForestClassifier(max_depth=8, n_estimators=7)) # Uses 7 separate decision trees!
])

In [None]:
pl.fit(X_train, y_train)

In [None]:
# Training accuracy.
pl.score(X_train, y_train)

In [None]:
# Testing accuracy.
pl.score(X_test, y_test)

The accuracy of our random forest is just above 50%, on both the training and testing sets. We'd get the same performance by predicting a rating of 5 every time!

In [None]:
# Distribution of true ys in the training set: 53% are 5.
y_train.value_counts(normalize=True)

In [None]:
# Distribution of predicted ys in the training set: 99.8% are 5.
# It turns out we essentially are predicting 5 every time!
pd.Series(pl.predict(X_train)).value_counts(normalize=True)
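The comparison against "predicting 5 every time" is a majority-class baseline, which can be sketched as follows. The ratings below are made up (they are not the `reviews` data), and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Made-up ratings in which half of the reviews are 5 stars.
toy_ratings = [5, 5, 4, 5, 3, 5, 1, 4, 5, 2]
label, accuracy = majority_baseline_accuracy(toy_ratings)

print(label, accuracy)
```

A classifier is only doing useful work if its accuracy is meaningfully above this baseline.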

In [None]:
len(pl.named_steps['cv'].vocabulary_) # We have many features, but we are not asking many questions!

### Choosing tree depth via `GridSearchCV`

We arbitrarily chose `max_depth=8` before, but it seems like that isn't working well. Let's perform a grid search to find the `max_depth` with the best generalization performance.

In [None]:
# Note that we've used the key clf__max_depth, not max_depth
# because max_depth is a hyperparameter of clf, not of pl.

hyperparameters = {
    'clf__max_depth': np.arange(2, 500, 20)
}

Note that while `pl` has already been `fit`, we can still give it to `GridSearchCV`, which will repeatedly re-`fit` it during cross-validation.

In [None]:
# Takes 10+ seconds to run – how many trees are being trained?
grids = GridSearchCV(pl, param_grid=hyperparameters, return_train_score=True)
grids.fit(X_train, y_train)
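One way to answer the question in the comment above is to count directly. This sketch assumes `GridSearchCV`'s defaults of 5-fold cross-validation (no `cv` argument was passed) and a final refit on the full training set; the variable names are illustrative.

```python
depths = list(range(2, 500, 20))   # Same values as np.arange(2, 500, 20).
n_folds = 5                        # Assumed default, since cv was not specified.
trees_per_forest = 7               # n_estimators in the pipeline's random forest.

# Each (depth, fold) pair fits one pipeline, and each pipeline trains 7 trees.
cv_fits = len(depths) * n_folds
cv_trees = cv_fits * trees_per_forest

# The best pipeline is then refit once on the entire training set.
total_trees = cv_trees + trees_per_forest

print(len(depths), cv_fits, cv_trees, total_trees)
```

Setting `return_train_score=True` adds extra scoring work but does not change the number of trees trained.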

In [None]:
grids.best_params_

Recall, `fit` `GridSearchCV` objects are estimators on their own as well. This means we can compute the training and testing accuracies of the "best" random forest directly:

In [None]:
# Training accuracy.
grids.score(X_train, y_train)

In [None]:
# Testing accuracy.
grids.score(X_test, y_test)

Still not much better on the testing set! 🤷

### Training and validation accuracy vs. depth

Below, we plot how training and validation accuracy varied with tree depth. Note that the $y$-axis here is accuracy, and that larger accuracies are better (unlike with RMSE, where smaller was better).

In [None]:
index = grids.param_grid['clf__max_depth']
train = grids.cv_results_['mean_train_score']
valid = grids.cv_results_['mean_test_score']

In [None]:
pd.DataFrame({'train': train, 'valid': valid}, index=index).plot().update_layout(
    xaxis_title='max_depth', yaxis_title='Accuracy'
)

Unsurprisingly, training accuracy kept increasing, while validation accuracy leveled off around a depth of ~100.

## Summary, next time

### Summary

- See the grid search and multicollinearity sections for more specific "key takeaways".
- The `CountVectorizer` transformer can be used to perform the bag of words encoding.

### Next time

Metrics for measuring the performance of classifiers other than accuracy.