# Exercise: Model Learning

This exercise covers:
- feature selection and elimination
- hyperparameter tuning
- cross-validation

## Set Up

### Extract

1\. Import `rpg-characters.csv` into a pandas DataFrame.

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
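A minimal sketch of the loading step. So that it runs without the real file, it reads a tiny in-memory CSV; the three rows and the column names (`strength`, `modifier1`, `score`) are hypothetical stand-ins, not the actual contents of `rpg-characters.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for rpg-characters.csv; the real columns may differ.
csv_text = """strength,modifier1,score
10,1,55.0
14,2,71.5
8,0,42.0
"""

# With the real file this would be: df = pd.read_csv("rpg-characters.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # rows, columns
print(df.dtypes)  # check which columns are numeric
```

Checking `df.dtypes` early helps when choosing a numeric dependent variable in the next step.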

2\. Choose a dependent variable (the target). It can be any column you like. Then separate it from the independent variables (the features).

In [30]:
# dependent/independent variable, target/features
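One possible way to separate the target from the features, sketched on a hypothetical miniature DataFrame. The choice of `score` as the target and the `select_dtypes` filter (which keeps numeric columns only) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real file has more rows and columns.
df = pd.DataFrame({
    "strength": [10, 14, 8, 12],
    "modifier1": [1, 2, 0, 1],
    "score": [55.0, 71.5, 42.0, 60.0],
})

target = "score"  # assumed choice of dependent variable

y = df[target]                       # dependent variable (target)
X = df.drop(columns=[target])        # independent variables (features)
X = X.select_dtypes(include="number")  # linear models need numeric inputs

print(X.columns.tolist())
print(y.name)
```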

3\. Split the data with `train_test_split`, holding out 25% of the dataset as the testing set. Any `random_state` works, but keep it fixed so the split is reproducible.

In [31]:
# train_test_split
from sklearn.model_selection import train_test_split
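A short sketch of the split, using randomly generated data in place of the real features and target. The column names and `random_state=42` are arbitrary assumptions; any fixed integer gives a reproducible split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["strength", "height", "weight"])
y = pd.Series(rng.normal(size=100), name="score")

# test_size=0.25 holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```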

### Lasso (Least Absolute Shrinkage and Selection Operator)

[Lasso Linear Model](https://scikit-learn.org/stable/modules/linear_model.html#lasso)

[LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html)

1\. Use the `LassoCV` model.

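`alphas` is the grid of regularization strengths that cross-validation searches. Try a general-purpose, heavy, light, or manually chosen grid and compare the results.

```py
import numpy as np
from sklearn.linear_model import LassoCV

# General-purpose regression: alphas=np.logspace(-4, 1, 100)  (50 to 100 points)
# Heavy regularization (sparse model): alphas=np.logspace(0, 3, 50)
# Light regularization: alphas=np.logspace(-6, 0, 50)
# Manual tuning: alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10]

lasso = LassoCV(
    alphas=np.logspace(-4, 1, 100),  # general-purpose grid
    max_iter=100000,  # generous iteration cap so coordinate descent can converge
)
```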

`LassoCV` fits a single dependent variable. It does not accept several dependent variables at once.

```
Dependent variables: weight, height, score ❌
Single dependent variable: score ✅
Single dependent variable: strength ✅
Single dependent variable: modifier1 ✅
```

After `fit`, features whose coefficient is exactly `0` have been removed from the model, and features with a nonzero coefficient have been selected. `fit` also picks the "best" alpha by cross-validation, available as `lasso.alpha_`.

In [32]:
# LassoCV
from sklearn.linear_model import LassoCV
from sklearn.metrics import median_absolute_error

# removed features
# print(f"Removed Features: {(lasso.coef_ == 0).sum()}") or print(f"Removed Features: {np.sum(lasso.coef_ == 0)}")

# selected features
# print(f"Selected Features: {X_train.columns[lasso.coef_ != 0]}")
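The fitting and feature-selection step might look like the sketch below. It uses `make_regression` to generate synthetic data in place of the RPG dataset, so the feature names and the number of informative features are assumptions, not properties of the real file.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 8 features, only 3 of which actually drive the target.
X_arr, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

lasso = LassoCV(alphas=np.logspace(-4, 1, 100), max_iter=100000)
lasso.fit(X_train, y_train)

# Coefficients shrunk exactly to zero mark the removed features.
removed = int(np.sum(lasso.coef_ == 0))
selected = list(X_train.columns[lasso.coef_ != 0])

print(f"Best alpha: {lasso.alpha_}")
print(f"Removed Features: {removed}")
print(f"Selected Features: {selected}")
```

Because the Lasso penalty depends on the scale of each feature, results on real data can change if the features are standardized first.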

2\. Use the `predict` method on the test set. Print the R<sup>2</sup> from [.score](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV.score) and the median absolute error from the [median_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html) function.

In [33]:
# lasso.predict
# lasso.score
# medae = median_absolute_error
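A sketch of the evaluation step on synthetic data (again an assumption standing in for the real dataset). It also compares `lasso.score` with `r2_score` to show that, for a regressor, `.score` returns R<sup>2</sup>.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

lasso = LassoCV(alphas=np.logspace(-4, 1, 100), max_iter=100000)
lasso.fit(X_train, y_train)

y_pred = lasso.predict(X_test)                  # predictions on unseen rows
r2 = lasso.score(X_test, y_test)                # R^2 on the test set
medae = median_absolute_error(y_test, y_pred)   # median of |actual - predicted|

print(f"R^2: {r2:.4f}")
print(f"Median absolute error: {medae:.4f}")
print(f"Matches r2_score: {np.isclose(r2, r2_score(y_test, y_pred))}")
```

The variable is named `medae` here to avoid confusion with the mean absolute error, which is often abbreviated MAE.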

3\. Build a scatter plot: predicted versus actual.

In [34]:
# predicted versus actual values
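A possible shape for this plot. The arrays `y_test` and `y_pred` below are randomly generated stand-ins for the real test targets and the output of `lasso.predict(X_test)`; the axis labels and the dashed reference line are choices made for this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the actual values and the model's predictions.
rng = np.random.default_rng(0)
y_test = rng.normal(50, 10, size=50)
y_pred = y_test + rng.normal(0, 3, size=50)

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.6)

# Points on the dashed diagonal would be perfect predictions.
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="gray")

ax.set_xlabel("Actual values")
ax.set_ylabel("Predicted values")
ax.set_title("Predicted versus actual")
```

The closer the points sit to the diagonal, the better the predictions.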

4\. Build a scatter plot: residuals versus predicted values.

In [35]:
# residuals versus predicted values
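A sketch of the residual plot, with residuals defined as actual minus predicted. As before, `y_test` and `y_pred` are randomly generated stand-ins rather than real model output.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the actual values and the model's predictions.
rng = np.random.default_rng(0)
y_test = rng.normal(50, 10, size=50)
y_pred = y_test + rng.normal(0, 3, size=50)

residuals = y_test - y_pred  # positive means the model under-predicted

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, alpha=0.6)
ax.axhline(0, linestyle="--", color="gray")  # zero-error reference line

ax.set_xlabel("Predicted values")
ax.set_ylabel("Residuals (actual - predicted)")
ax.set_title("Residuals versus predicted values")
```

For a well-specified linear model the points should scatter evenly around the zero line; a curve or funnel shape suggests the model is missing structure in the data.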

### LinearRegression

Fit a `LinearRegression` model on the same dataset.

- `fit`
- `predict`
- Print the R<sup>2</sup> score.
- Print the `median_absolute_error`.

In [36]:
from sklearn.linear_model import LinearRegression
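The four steps above can be sketched as follows, using synthetic data from `make_regression` as a stand-in for the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

linreg = LinearRegression()  # ordinary least squares, no regularization
linreg.fit(X_train, y_train)

y_pred = linreg.predict(X_test)
r2 = linreg.score(X_test, y_test)
medae = median_absolute_error(y_test, y_pred)

print(f"R^2: {r2:.4f}")
print(f"Median absolute error: {medae:.4f}")
```

Unlike Lasso, ordinary least squares keeps every feature, so none of the coefficients are forced to zero.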

### Ridge regression

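Ridge regression adds an L2 penalty that shrinks coefficients toward zero without setting them exactly to zero. It does not detect correlated independent variables, but it stabilizes the coefficient estimates when the independent variables are highly correlated (multicollinearity).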

[Ridge Linear Model](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification)

[RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)

Use the `RidgeCV` model on the same dataset.

`alphas` is the grid of regularization strengths that cross-validation searches. Try a general-purpose, heavy, light, or manually chosen grid and compare the results. After `fit`, the "best" alpha is available as `ridge.alpha_`.

```py
import numpy as np
from sklearn.linear_model import RidgeCV

# General-purpose regression: alphas=np.logspace(-4, 1, 100)  (50 to 100 points)
# Heavy regularization (strong shrinkage): alphas=np.logspace(0, 3, 50)
# Light regularization: alphas=np.logspace(-6, 0, 50)
# Manual tuning: alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10]

ridge = RidgeCV(
    alphas=np.logspace(-4, 1, 100),  # general-purpose grid
)
```

Unlike `LassoCV`, `RidgeCV` _does_ accept multiple dependent variables: `y` can be a 2D array with one column per target.

In [37]:
from sklearn.linear_model import RidgeCV
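A sketch of `RidgeCV` with one target and then with two targets at once. The data comes from `make_regression` with `n_targets=2`, which is an assumption standing in for two numeric columns of the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features and 2 dependent variables.
X, y = make_regression(
    n_samples=200, n_features=6, n_targets=2, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
alphas = np.logspace(-4, 1, 100)

# Single dependent variable: use the first target column only.
ridge = RidgeCV(alphas=alphas)
ridge.fit(X_train, y_train[:, 0])
y_pred = ridge.predict(X_test)
print(f"Best alpha: {ridge.alpha_}")
print(f"R^2: {ridge.score(X_test, y_test[:, 0]):.4f}")
print(f"Median absolute error: {median_absolute_error(y_test[:, 0], y_pred):.4f}")

# Multiple dependent variables: pass the 2D target array directly.
ridge_multi = RidgeCV(alphas=alphas)
ridge_multi.fit(X_train, y_train)
print(ridge_multi.coef_.shape)  # (n_targets, n_features)
```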

## Reflect

Pick 3 to 5 dependent variables and fit all three models (`LassoCV`, `LinearRegression`, and `RidgeCV`) on each. Which model gets the best R<sup>2</sup> score? Is it the same model for every dependent variable?

Now focus on the dependent variable `score` (dtype `float64`). It is strongly correlated with the independent variables. Which model should you pick: `LassoCV`, `LinearRegression`, or `RidgeCV`?
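One way to organize the comparison is a loop over the three models, sketched here on synthetic data. The dataset, the alpha grid, and the `results` dictionary are illustrative assumptions; on the real data, rerun the loop once per dependent variable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one choice of dependent variable.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
alphas = np.logspace(-4, 1, 100)

models = {
    "LassoCV": LassoCV(alphas=alphas, max_iter=100000),
    "LinearRegression": LinearRegression(),
    "RidgeCV": RidgeCV(alphas=alphas),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "r2": model.score(X_test, y_test),
        "medae": median_absolute_error(y_test, y_pred),
    }
    print(f"{name}: R^2={results[name]['r2']:.4f}, MedAE={results[name]['medae']:.4f}")
```

When the scores are nearly identical, a simpler model or one with fewer selected features can be a reasonable tiebreaker.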