In [None]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = 'plotly'
TEMPLATE = 'seaborn'

import warnings
warnings.simplefilter('ignore')

# Lecture 24 – Decision Trees, Grid Search, Multicollinearity

## DSC 80, Spring 2023

### Agenda

- Cross-validation.
- Example: Decision trees 🌲.
- Grid search.
- Multicollinearity.

## Cross-validation

### Recap

- Suppose we've decided to fit a polynomial regressor on a dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, but we're unsure of what degree of polynomial to use (1, 2, 3, ..., 25). 
    - Note that polynomial degree is a **hyperparameter** – it is something we can control _before_ our model is fit to the training data.

- Or, suppose we have a dataset of restaurant tips, bills, table sizes, days of the week, and so on, and want to decide which features to use in a linear model that predicts tips.

- Remember, more complicated models (that is, models with more features) don't necessarily **generalize** well to **unseen data**!

- **Goal**: Find the best hyperparameter, or best choice of features, so that our fit model **generalizes well to unseen data**.

### $k$-fold cross-validation

Instead of relying on a single validation set, we can create $k$ validation sets, where $k$ is some positive integer (5 in the example below).

<center><img src='imgs/k-fold.png' width=40%></center>

Since each data point is used for training $k-1$ times and validation once, the (averaged) validation performance should be a good metric of a model's ability to generalize to unseen data.

$k$-fold cross-validation (or simply "cross-validation") is **the** technique we will use for finding hyperparameters.

### $k$-fold cross-validation

First, **shuffle** the dataset randomly and **split** it into $k$ disjoint groups. Then:

- For each hyperparameter:
    - For each unique group:
        - Let the unique group be the "validation set".
        - Let all other groups be the "training set".
        - Train a model using the selected hyperparameter on the training set.
        - Evaluate the model on the validation set.
    - Compute the **average** validation score (e.g. RMSE) for the particular hyperparameter.
- Choose the hyperparameter with the best average validation score.
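
The procedure above can be written out by hand. The following is a minimal sketch, not the implementation `sklearn` uses internally. It assumes a synthetic dataset (a noisy cubic) and illustrative names (`X_demo`, `y_demo`, `avg_rmse`) that are not part of the lecture's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy cubic.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(100, 1))
y_demo = X_demo[:, 0] ** 3 - 2 * X_demo[:, 0] + rng.normal(0, 1, size=100)

# Shuffle the dataset and split it into k=5 disjoint groups.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

avg_rmse = {}
for degree in range(1, 6):                            # For each hyperparameter:
    fold_rmses = []
    for train_idx, val_idx in kfold.split(X_demo):    # For each unique group:
        model = Pipeline([
            ('poly', PolynomialFeatures(degree)),
            ('lin-reg', LinearRegression()),
        ])
        # Train on the other k-1 groups, then evaluate on the held-out group.
        model.fit(X_demo[train_idx], y_demo[train_idx])
        preds = model.predict(X_demo[val_idx])
        fold_rmses.append(np.sqrt(np.mean((y_demo[val_idx] - preds) ** 2)))
    # Average validation RMSE for this hyperparameter.
    avg_rmse[degree] = np.mean(fold_rmses)

# Choose the hyperparameter with the best (lowest) average validation RMSE.
best_degree = min(avg_rmse, key=avg_rmse.get)
```

Because `KFold` is given an integer `random_state`, every degree is evaluated on the same five folds, which keeps the comparison between hyperparameters fair.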

As a reminder, here's what "sample 1" looks like.

In [None]:
sample_1 = pd.read_csv(os.path.join('data', 'sample-1.csv'))
px.scatter(x=sample_1['x'], y=sample_1['y'], template=TEMPLATE)

### $k$-fold cross-validation in `sklearn`

- Let's perform $k$-fold cross-validation in order to help us pick a degree for polynomial regression from the list [1, 2, ..., 25].

- We'll use $k=5$ since it's a common choice (and the default in `sklearn`).


- For the sake of this example, we'll suppose `sample_1` is our "training + validation data", i.e. that our test data is in some other dataset.
    - If this were not true, we'd first need to split `sample_1` into separate training and test sets.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [None]:
errs_df = pd.DataFrame()

for d in range(1, 26):
    pl = Pipeline([('poly', PolynomialFeatures(d)), ('lin-reg', LinearRegression())])
    
    # The `scoring` argument is used to specify that we want to compute the RMSE; 
    # the default is R^2. It's called "neg" RMSE because, 
    # by default, sklearn likes to "maximize" scores, and maximizing -RMSE is the same
    # as minimizing RMSE.
    errs = cross_val_score(pl, sample_1[['x']], sample_1['y'], 
                           cv=5, scoring='neg_root_mean_squared_error')
    errs_df[f'Deg {d}'] = -errs # Negate to turn positive (sklearn computed negative RMSE).
    
errs_df.index = [f'Fold {i}' for i in range(1, 6)]
errs_df.index.name = 'Validation Fold'

Soon, we'll look at how to implement this procedure without needing to `for`-loop over values of `d`.
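
As a preview, here is one way the loop could be avoided: a sketch that hands the whole `Pipeline` to `GridSearchCV`. It assumes synthetic data in place of `sample_1` and a smaller range of degrees; the names `X_demo`, `y_demo`, and `degree_search` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for `sample_1` (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(100, 1))
y_demo = X_demo[:, 0] ** 3 - 2 * X_demo[:, 0] + rng.normal(0, 1, size=100)

pl = Pipeline([('poly', PolynomialFeatures()), ('lin-reg', LinearRegression())])

# Hyperparameters of a Pipeline step are addressed as '<step name>__<parameter>'.
degree_search = GridSearchCV(
    pl,
    param_grid={'poly__degree': list(range(1, 11))},
    cv=5,
    scoring='neg_root_mean_squared_error',
)
degree_search.fit(X_demo, y_demo)

best_degree = degree_search.best_params_['poly__degree']
# Negate to recover positive RMSEs; there is one average per degree tried.
avg_val_rmse = -degree_search.cv_results_['mean_test_score']
```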

### $k$-fold cross-validation in `sklearn`

Note that for each choice of degree (our hyperparameter), we have **five** RMSEs, one for each "fold" of the data. This means that in total, 125 models were trained/fit to data!

In [None]:
errs_df

We should choose the degree with the lowest **average** validation RMSE.

In [None]:
errs_df.mean().idxmin()

Note that if we didn't perform $k$-fold cross-validation, but instead just used a single validation set, we may have ended up with a different result:

In [None]:
errs_df.idxmin(axis=1)

***Note***: You may notice that the RMSEs in Folds 1 and 5 are significantly higher than in other folds. Can you think of reasons why, and how we might fix this?

### Another example: Tips

We can also use $k$-fold cross-validation to determine which subset of features to use in a linear model that predicts tips (though, as you'll see, the code is not pretty).

In [None]:
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

As we should always do, we'll perform a train-test split on `tips` and will only use the training data for cross-validation.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = tips.drop('tip', axis=1)
y = tips['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# A dictionary that maps names to Pipeline objects.
pipes = {
    'total_bill only': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
    'total_bill + size': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill', 'size'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
    'total_bill + size + OHE smoker': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill', 'size']),
             ('ohe', OneHotEncoder(), ['smoker'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
    'total_bill + size + OHE all': Pipeline([
        ('trans', ColumnTransformer(
            [('keep', FunctionTransformer(lambda x: x), ['total_bill', 'size']),
             ('ohe', OneHotEncoder(), ['smoker', 'sex', 'time', 'day'])], 
            remainder='drop')), 
        ('lin-reg', LinearRegression())
    ]),
}

In [None]:
pipe_df = pd.DataFrame()

for pipe in pipes:
    errs = cross_val_score(pipes[pipe], X_train, y_train,
                           cv=5, scoring='neg_root_mean_squared_error')
    pipe_df[pipe] = -errs
    
pipe_df.index = [f'Fold {i}' for i in range(1, 6)]
pipe_df.index.name = 'Validation Fold'

In [None]:
pipe_df

In [None]:
pipe_df.mean()

In [None]:
pipe_df.mean().idxmin()

Even though the third model has the lowest average validation RMSE, its average validation RMSE is very close to that of the other, simpler models, and as a result we'd likely use the simplest model in practice.

### Summary: Generalization

1. Split the data into two sets: <span style='color: blue'><b>training</b></span> and <span style='color: orange'><b>test</b></span>.

2. Use only the <span style='color: blue'><b>training</b></span> data when designing, training, and tuning the model.
    - Use <span style='color: green'><b>$k$-fold cross-validation</b></span> to choose hyperparameters and estimate the model's ability to generalize.
    - Do not ❌ look at the <span style='color: orange'><b>test</b></span> data in this step!
    
3. Commit to your final model and train it using the entire <span style='color: blue'><b>training</b></span> set.

4. Test the model using the <span style='color: orange'><b>test</b></span> data. If the performance (e.g. RMSE) is not acceptable, return to step 2.

5. Finally, train on **all available data** and ship the model to production! 🛳

🚨 This is the process you should **always** use! 🚨 

### Discussion Question 🤔

- Suppose you have a training dataset with 1000 rows.
- You want to decide between 20 hyperparameters for a particular model.
- To do so, you perform 10-fold cross-validation.
- **How many times is the first row in the training dataset (`X.iloc[0]`) used for training a model?**

## Example: Decision trees 🌲

<center><img src='imgs/taxonomy.png' width=50%></center>

Decision trees can be used for both regression and classification. We will start by discussing their use in **classification**.

### Example: Predicting diabetes

In [None]:
diabetes = pd.read_csv('data/diabetes.csv')
diabetes.head()

In [None]:
# 0 means no diabetes, 1 means yes diabetes.
diabetes['Outcome'].value_counts()

- `'Glucose'` is measured in mg/dL (milligrams per deciliter).

- `'BMI'` is calculated as $\text{BMI} = \frac{\text{weight (kg)}}{\left[ \text{height (m)} \right]^2}$.

- Let's use `'Glucose'` and `'BMI'` to predict whether or not a patient has diabetes (`'Outcome'`).

### Exploring the dataset

<span style='color: orange'><b>Class 0 (orange) is "no diabetes"</b></span> and <span style='color: blue'><b>class 1 (blue) is "diabetes"</b></span>.

In [None]:
fig = (
    diabetes.assign(Outcome=diabetes['Outcome'].astype(str))
            .plot(kind='scatter', x='Glucose', y='BMI', color='Outcome', 
                  color_discrete_map={'0': 'orange', '1': 'blue'},
                  title='Relationship between Glucose, BMI, and Diabetes',
                  template=TEMPLATE)
)
fig

### Building a decision tree

Let's build a decision tree and interpret the results. But first, a train-test split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(diabetes[['Glucose', 'BMI']], 
                                                    diabetes['Outcome'],
                                                    random_state=1)

The relevant class is `DecisionTreeClassifier`, from `sklearn.tree`.

In [None]:
from sklearn.tree import DecisionTreeClassifier

Note that we `fit` it the same way we `fit` earlier estimators.

_You may wonder what `max_depth=2` does – more on this soon!_

In [None]:
dt = DecisionTreeClassifier(max_depth=2)

In [None]:
dt.fit(X_train, y_train)

### Visualizing decision trees

Our fit decision tree is like a "flowchart", made up of a series of questions.

As before, <span style='color: orange'><b>orange is "no diabetes"</b></span> and <span style='color: blue'><b>blue is "diabetes"</b></span>.

In [None]:
from sklearn.tree import plot_tree

In [None]:
plt.figure(figsize=(15, 5))
# Passing a list (rather than a pandas Index) works across sklearn versions.
plot_tree(dt, feature_names=list(X_train.columns), class_names=['no db', 'yes db'], 
          filled=True, rounded=True, fontsize=15, impurity=False);

- To **classify a new data point**, we start at the top and answer the first question (i.e. "Glucose <= 129.5").
- If the answer is "**Yes**", we move to the **left** branch, otherwise we move to the right branch.
- We repeat this process until we end up at a leaf node, at which point we predict the most common class in that node.
    - Note that each node has a `value` attribute, which describes the number of **training** individuals of each class that fell in that node.

In [None]:
# Note that the left child of the root node has a `value` of [304, 78].
y_train.loc[X_train[X_train['Glucose'] <= 129.5].index].value_counts()

### Evaluating classifiers

The most common evaluation metric in classification is **accuracy**:

$$\text{accuracy} = \frac{\text{# data points classified correctly}}{\text{# data points}}$$

In [None]:
(dt.predict(X_train) == y_train).mean()

The `score` method of a classifier computes accuracy by default (just like the `score` method of a regressor computes $R^2$ by default). We want our classifiers to have **high accuracy**.

In [None]:
# Training accuracy – same number as above
dt.score(X_train, y_train)

In [None]:
# Testing accuracy
dt.score(X_test, y_test)

### Some questions...

- How did `sklearn` **learn** what questions to ask?

- Can we get it to ask more questions (i.e. build a **deeper** tree)?

### Training a decision tree

When we ask a question, we are effectively **splitting** a node into two children – the "yes" child and the "no" child.

Suppose the distribution within a node looks like this (colors represent classes):

<center>🟠🟠🟠🔵🔵🔵🔵🔵🔵🔵</center>

Question A **splits** the node like this:
- "Yes": 🟠🟠🔵🔵🔵
- "No": 🟠🔵🔵🔵🔵

Question B **splits** the node like this:
- "Yes": 🔵🔵🔵🔵🔵🔵
- "No": 🔵🟠🟠🟠

**Which question is "better"?**

Question B, because there is "less uncertainty" in the resulting nodes after splitting by Question B than there is after splitting by Question A. There are two common techniques for quantifying "uncertainty":
- Gini impurity (the default in `sklearn`).
- Entropy (from information theory).

Not the focus of our course, but read more!
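
To make "less uncertainty" concrete, here is a small sketch that computes both measures for the two questions above. The helper names (`gini`, `entropy`, `weighted_impurity`) are illustrative and are not `sklearn` functions; each child node is written as `[# orange, # blue]`.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given the number of points in each class."""
    p = np.array(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (in bits) of a node, given the number of points in each class."""
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]  # Treat 0 * log(0) as 0.
    return -np.sum(p * np.log2(p))

def weighted_impurity(children, impurity):
    """Average impurity of the child nodes, weighted by their sizes."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)

# Each child is [# orange, # blue], read off the splits above.
question_a = [[2, 3], [1, 4]]
question_b = [[0, 6], [3, 1]]

gini_a = weighted_impurity(question_a, gini)        # 0.40
gini_b = weighted_impurity(question_b, gini)        # 0.15
entropy_a = weighted_impurity(question_a, entropy)
entropy_b = weighted_impurity(question_b, entropy)
```

Under both measures, Question B leaves less weighted impurity in its children than Question A, which matches the intuition that Question B is the better split.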

### Tree depth

Decision trees are trained by **recursively** picking the best split until:
- all "leaf nodes" only contain training examples from a single class (good), or
- it is impossible to split leaf nodes any further (not good).

By default, there is no "maximum depth" for a decision tree. As such, without restriction, decision trees tend to be very deep.

In [None]:
dt_no_max = DecisionTreeClassifier()
dt_no_max.fit(X_train, y_train)

A decision tree fit on our training data has a depth of around 20! (It is so deep that `tree.plot_tree` errors when trying to plot it.)

In [None]:
dt_no_max.tree_.max_depth

At first, this tree seems "better" than our tree of depth 2, since its training accuracy is much higher:

In [None]:
dt_no_max.score(X_train, y_train)

In [None]:
# Depth 2 tree.
dt.score(X_train, y_train)

But recall, we truly care about **test set performance**, and this decision tree has **worse accuracy on the test set than our depth 2 tree**.

In [None]:
dt_no_max.score(X_test, y_test)

In [None]:
# Depth 2 tree.
dt.score(X_test, y_test)

### Decision trees and overfitting

- Decision trees have a tendency to overfit. **Why is that?**

- Unlike linear classification techniques (like logistic regression or linear SVMs), **decision trees are non-linear**.
    - They are also "non-parametric" – there are no $w^*$s to learn.

- While being trained, decision trees ask enough questions to effectively **memorize** the correct response values in the training set. However, the relationships they learn are often overfit to the noise in the training set, and don't generalize well.

In [None]:
fig

- A decision tree whose depth is not restricted will achieve 100% accuracy on any training set, as long as there are no "overlapping values" in the training set.
    - Two values overlap when they have the same features $x$ but different response values $y$ (e.g. if two patients have the same glucose levels and BMI, but one has diabetes and one doesn't).

- **One solution**: Make the decision tree "less complex" by limiting the maximum depth.

Since `sklearn.tree`'s `plot_tree` can't visualize extremely large decision trees, let's create and visualize some smaller decision trees.

In [None]:
trees = {}
for d in [2, 4, 8]:
    trees[d] = DecisionTreeClassifier(max_depth=d, random_state=1)
    trees[d].fit(X_train, y_train)
    
    plt.figure(figsize=(15, 5), dpi=100)
    # Passing a list (rather than a pandas Index) works across sklearn versions.
    plot_tree(trees[d], feature_names=list(X_train.columns), class_names=['no db', 'yes db'], 
               filled=True, rounded=True, impurity=False)
    
    plt.show()

As tree depth increases, complexity increases, and our trees are more prone to overfitting.
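
One way to see this numerically is to compare training and test accuracy as the maximum depth grows. The sketch below assumes synthetic, deliberately noisy data from `make_classification` (not the diabetes dataset), so the exact numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-feature data (an assumption; not the diabetes dataset).
# `flip_y=0.2` randomly reassigns the labels of about 20% of the points.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

accuracies = {}
for depth in [2, 4, 8, None]:  # None means "no maximum depth".
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth.
    accuracies[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in accuracies.items():
    print(f'max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}')
```

With noisy labels like these, the unrestricted tree memorizes the training set, while its test accuracy is typically no better than that of a much shallower tree.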

**Question**: What is the "right" maximum depth to choose?

### Hyperparameters for decision trees

- `max_depth` is a hyperparameter for `DecisionTreeClassifier`.

- There are many more hyperparameters we can tweak; look at [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for examples.
    - `min_samples_split`: The minimum number of samples required to split an internal node.
    - `criterion`: The function to measure the quality of a split (`'gini'` or `'entropy'`).

- To ensure that our model generalizes well to unseen data, we need an efficient technique for trying different combinations of hyperparameters!

## Grid search

### Grid search

`GridSearchCV` takes in:
- an **un-`fit`** instance of an estimator, and
- a **dictionary** of hyperparameter values to try,

and performs $k$-fold cross-validation to find the **combination of hyperparameters with the best average validation performance**.

In [None]:
from sklearn.model_selection import GridSearchCV

The following dictionary contains the values we're considering for each hyperparameter. (We're using `GridSearchCV` with 3 hyperparameters, but we could use it with even just a single hyperparameter.)

In [None]:
hyperparameters = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None], 
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],
    'criterion': ['gini', 'entropy']
}

Note that there are 140 **combinations** of hyperparameters we need to try. We need to find the **best combination** of hyperparameters, not the best value for each hyperparameter individually.

In [None]:
len(hyperparameters['max_depth']) * \
len(hyperparameters['min_samples_split']) * \
len(hyperparameters['criterion'])

`GridSearchCV` needs to be instantiated and `fit`.

In [None]:
searcher = GridSearchCV(DecisionTreeClassifier(), hyperparameters, cv=5)

In [None]:
searcher.fit(X_train, y_train)

After being `fit`, the `best_params_` attribute provides us with the best combination of hyperparameters to use.

In [None]:
searcher.best_params_

All of the intermediate results – validation accuracies for each fold, mean validation accuracies, etc. – are stored in the `cv_results_` attribute:

In [None]:
searcher.cv_results_['mean_test_score'] # Array of length 140.

In [None]:
# Rows correspond to folds, columns correspond to hyperparameter combinations.
pd.DataFrame(np.vstack([searcher.cv_results_[f'split{i}_test_score'] for i in range(5)]))

Note that the above DataFrame tells us that 5 * 140 = 700 models were trained in total!

Now that we've found the best combination of hyperparameters, we should fit a decision tree instance using those hyperparameters on our entire training set.

In [None]:
searcher.best_params_

In [None]:
final_tree = DecisionTreeClassifier(**searcher.best_params_)
final_tree

In [None]:
final_tree.fit(X_train, y_train)

In [None]:
# Training accuracy.
final_tree.score(X_train, y_train)

In [None]:
# Testing accuracy.
final_tree.score(X_test, y_test)

Remember, `searcher` itself is a model object (we had to `fit` it). After performing $k$-fold cross-validation, behind the scenes, `searcher` is trained on the entire training set using the optimal combination of hyperparameters.

In other words, `searcher` makes the same predictions that `final_tree` does!

In [None]:
searcher.score(X_train, y_train)

In [None]:
searcher.score(X_test, y_test)
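
The refit model can also be accessed directly through the `best_estimator_` attribute. The sketch below checks the "same predictions" claim on synthetic data; the names `demo_searcher` and `manual_tree` are illustrative, and `random_state` is fixed in both trees (an assumption added here) so that any tie-breaking during fitting is identical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

demo_searcher = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [2, 4, 8], 'min_samples_split': [2, 10]},
    cv=5,
)
demo_searcher.fit(X_demo, y_demo)

# With the default `refit=True`, the winning model is refit on all of the data
# passed to `fit` and stored in `best_estimator_`.
refit_tree = demo_searcher.best_estimator_

# Fit a fresh tree by hand with the same hyperparameters, on the same data.
manual_tree = DecisionTreeClassifier(random_state=0, **demo_searcher.best_params_)
manual_tree.fit(X_demo, y_demo)

same_predictions = (demo_searcher.predict(X_demo) == manual_tree.predict(X_demo)).all()
```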

### Choosing possible hyperparameter values

- A full grid search can take a **long time**.
    - In our previous example, we tried 140 combinations of hyperparameters.
    - Since we performed 5-fold cross-validation, we trained 700 decision trees under the hood.

- **Question**: How do we pick the possible hyperparameter values to try?

- **Answer**: Trial and error.
    - If the "best" choice of a hyperparameter was at an extreme, try increasing the range.
    - For instance, if you try `max_depth`s from 32 to 128, and 32 was the best, try including `max_depth`s under 32.

### Key takeaways

- Decision trees are trained by finding the best questions to ask using the features in the training data. A good question is one that isolates classes as much as possible.
- Decision trees have a tendency to overfit to training data. One way to mitigate this is by restricting the maximum depth of the tree.
- To efficiently find hyperparameters through cross-validation, use `GridSearchCV`.
    - Specify which values to try for each hyperparameter, and `GridSearchCV` will try all **unique combinations of hyperparameters** and return the combination with the best average validation performance.
    - `GridSearchCV` is not the only solution – see [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) if you're curious.
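
For the curious, here is a rough sketch of what `RandomizedSearchCV` usage might look like. It assumes synthetic data and reuses the grid of values from earlier in the lecture under the illustrative name `demo_grid`; instead of trying all 140 combinations, it samples `n_iter=20` of them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

demo_grid = {
    'max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None],
    'min_samples_split': [2, 5, 10, 20, 50, 100, 200],
    'criterion': ['gini', 'entropy'],
}

random_searcher = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=demo_grid,
    n_iter=20,       # Number of combinations to sample, out of 140 possible.
    cv=5,
    random_state=0,  # Makes the sampled combinations reproducible.
)
random_searcher.fit(X_demo, y_demo)

n_tried = len(random_searcher.cv_results_['params'])
best_combination = random_searcher.best_params_
```

The trade-off is that the best combination found is only the best among those sampled, so it may differ from what a full grid search would return.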

## Multicollinearity

### Heights and weights

We have a dataset containing the weights and heights of 25,000 18-year-olds, taken from [here](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights).

In [None]:
people = pd.read_csv('data/SOCR-HeightWeight.csv').drop('Index', axis=1)
people.head()

In [None]:
people.plot(kind='scatter', x='Height (Inches)', y='Weight (Pounds)', 
            title='Weight vs. Height for 25,000 18 Year Olds', template=TEMPLATE)

### Motivating example

Suppose we fit a simple linear regression model that uses **height in inches** to predict **weight in pounds**.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)}$$

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(people[['Height (Inches)']], 
                                                            people['Weight (Pounds)'], 
                                                            random_state=1)

In [None]:
lr_one_feat = LinearRegression()
lr_one_feat.fit(X_train_1, y_train_1)

$w_0^*$ and $w_1^*$ are shown below, along with the model's **testing** RMSE.

In [None]:
lr_one_feat.intercept_, lr_one_feat.coef_

In [None]:
# `squared=False` is deprecated in newer versions of sklearn, so we take the square root ourselves.
rmse_one_feat = np.sqrt(mean_squared_error(y_test_1, 
                                           lr_one_feat.predict(X_test_1)))
rmse_one_feat

Now, suppose we fit another regression model, that uses **height in inches** AND **height in centimeters** to predict weight.

$$\text{predicted weight (pounds)} = w_0 + w_1 \cdot \text{height (inches)} + w_2 \cdot \text{height (cm)}$$

In [None]:
people['Height (cm)'] = people['Height (Inches)'] * 2.54 # 1 inch = 2.54 cm

In [None]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(people[['Height (Inches)', 'Height (cm)']], 
                                                            people['Weight (Pounds)'], 
                                                            random_state=1)

In [None]:
lr_two_feat = LinearRegression()
lr_two_feat.fit(X_train_2, y_train_2)

What are $w_0^*$, $w_1^*$, $w_2^*$, and the model's testing RMSE?

In [None]:
lr_two_feat.intercept_, lr_two_feat.coef_

In [None]:
# `squared=False` is deprecated in newer versions of sklearn, so we take the square root ourselves.
rmse_two_feat = np.sqrt(mean_squared_error(y_test_2, 
                                           lr_two_feat.predict(X_test_2)))
rmse_two_feat

**Observation**: The intercept is the same as before (roughly -81.17), as is the testing RMSE. However, the coefficients on `'Height (Inches)'` and `'Height (cm)'` are massive in size!

What's going on?

### Redundant features

Let's use simpler numbers for illustration. Suppose in the first model, $w_0^* = -80$ and $w_1^* = 3$.

$$\text{predicted weight (pounds)} = -80 + 3 \cdot \text{height (inches)}$$

In the second model, we have:

$$\begin{align*}\text{predicted weight (pounds)} &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \text{height (cm)} \\ &= w_0^* + w_1^* \cdot \text{height (inches)} + w_2^* \cdot \big( 2.54 \cdot \text{height (inches)} \big) \\ &= w_0^* + \left(w_1^* + 2.54 \cdot w_2^* \right) \cdot \text{height (inches)} \end{align*}$$

In the first model, we already found the "best" intercept ($-80$) and slope ($3$) in a linear model that uses height in inches to predict weight.

**So, as long as $w_1^* + 2.54 \cdot w_2^* = 3$ in the second model, the second model's training predictions will be the same as the first, and hence they will also minimize RMSE.**

### Infinitely many parameter choices

**Issue**: There are an infinite number of $w_1^*$ and $w_2^*$ that satisfy $w_1^* + 2.54 \cdot w_2^* = 3$!

$$\text{predicted weight} = -80 - 10 \cdot \text{height (inches)} + \frac{13}{2.54} \cdot \text{height (cm)}$$

$$\text{predicted weight} = -80 + 10 \cdot \text{height (inches)} - \frac{7}{2.54} \cdot \text{height (cm)}$$

- Both prediction rules look very different, but actually make the same predictions.


- `lr.coef_` could return either set of coefficients, or any other of the infinitely many options. 


- But neither set of coefficients **has any meaning!**

In [None]:
(-80 - 10 * people.iloc[:, 0] + (13 / 2.54) * people.iloc[:, 2]).head()

In [None]:
(-80 + 10 * people.iloc[:, 0] - (7 / 2.54) * people.iloc[:, 2]).head()

### Multicollinearity

- Multicollinearity occurs when features in a regression model are **highly correlated** with one another.
    - In other words, multicollinearity occurs when **a feature can be predicted using a linear combination of other features, fairly accurately**.

- When multicollinearity is present in the features, the **coefficients in the model** are uninterpretable – they have no meaning.
    - A "slope" represents "the rate of change of $y$ with respect to a feature", when all other features are held constant – but if there's multicollinearity, you can't hold other features constant.

- **Note: Multicollinearity doesn't impact a model's predictions!**
    - It doesn't impact a model's ability to generalize to unseen data.
    - If features are multicollinear in the training data, they will probably be multicollinear in the test data too.

- **Solutions**:
    - Manually remove highly correlated features.
    - Use a dimensionality reduction technique (such as PCA) to automatically reduce dimensions.
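
As a sketch of the manual approach, redundant features can be spotted by looking at pairwise correlations or at the rank of the design matrix. The example below assumes synthetic heights (not the SOCR dataset) and illustrative names (`heights`, `design`, `reduced`).

```python
import numpy as np
import pandas as pd

# Synthetic heights (an assumption; not the SOCR dataset).
rng = np.random.default_rng(0)
heights = pd.DataFrame({'Height (Inches)': rng.normal(68, 2, size=500)})
heights['Height (cm)'] = heights['Height (Inches)'] * 2.54

# A correlation of (essentially) 1 flags the pair of features as redundant.
corr = heights.corr().loc['Height (Inches)', 'Height (cm)']

# The design matrix (with an intercept column) has 3 columns but only rank 2,
# which is why the least squares coefficients are not unique.
design = np.column_stack([np.ones(len(heights)), heights.to_numpy()])
rank = np.linalg.matrix_rank(design)

# Manual fix: keep just one of the redundant columns.
reduced = heights.drop(columns='Height (cm)')
```

In real data, correlations are rarely exactly 1, so a threshold for "highly correlated" is a judgment call rather than a fixed rule.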

### Key takeaways

- Multicollinearity is present in a linear model when one feature can be accurately predicted using one or more other features.
    - In other words, it is present when a feature is **redundant**.
- Multicollinearity doesn't pose an issue for prediction; it doesn't hinder a model's ability to generalize. Instead, it renders the **coefficients** of a linear model meaningless.

## Summary, next time

### Summary

See the individual sections for more specific "key takeaways".

### Next time
- Multicollinearity and one hot encoding.
- Using text features in a predictive model.
- Metrics for measuring the performance of classifiers other than accuracy.