![](../img/330-banner.png)

Lecture 10: Regression Evaluation Metrics
--------------

UBC 2022-23 W2

Instructor: Amir Abdi
 - Office Hours: Mondays 5-6 (or 5-7 if student turn-out was high)

<br><br><br>

## Announcements
- hw4, Feb 10, 11:59pm
- Midterm coming up soon on Feb 15

## Reiterating some answers from last session

- When you calculate precision, recall, or f1 score, only the positive label is evaluated by default, and **the positive class is assumed to be labeled 1**. 
  - In scikit-learn's metric functions (e.g., `precision_score`, `recall_score`, `f1_score`), this is configurable through the `pos_label` parameter.
- When dealing with multi-class classification, metrics are defined per class (and can then be averaged across classes).

<img src=https://miro.medium.com/v2/resize:fit:1400/format:webp/1*a5jVpe1Q5DbGBFR3rakyig.png width="400"> <img src=https://miro.medium.com/v2/resize:fit:1400/format:webp/1*FPeagrSxOePdyF7zj0u4uw.png width="400">
<img src=https://miro.medium.com/v2/resize:fit:1400/format:webp/1*FB6XcD1Icb7sI8vff1BZuw.png width="400">

[image src](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2)

    - e.g. recall for cat = 4/(4+1+1) = 4/6 ≈ 0.67
- AUC can be interpreted as evaluating the **ranking** of positive examples.
  - What's the probability that a **randomly picked positive sample** has a higher score according to the classifier than a **randomly picked negative sample**. 
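
The ranking interpretation of AUC can be checked directly on a toy example. The sketch below is only an illustration: the helper `pairwise_auc` and the toy labels and scores are made up for this note, and ties are counted as half a win, which is one common convention.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive example
    gets the higher score. Ties count as half a win."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives and two positives
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
print(auc)  # 3 of the 4 (positive, negative) pairs are ranked correctly -> 0.75
```

Only the ordering of the scores matters here; rescaling them monotonically leaves the value unchanged.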

<br><br><br><br><br><br><br><br><br>

## Learning outcomes 

From this lecture, students are expected to be able to:

- Carry out feature transformations on a somewhat complicated dataset. 
- Visualize transformed features as a dataframe. 
- Use `Ridge` and `RidgeCV`.
- Explain how `alpha` hyperparameter of `Ridge` relates to the fundamental tradeoff. 
- Examine coefficients of transformed features.  
- Appropriately select a scoring metric given a regression problem.
- Interpret and communicate the meanings of different scoring metrics on regression problems.
    - MSE, RMSE, $R^2$, MAPE
- Apply log-transform on the target values in a regression problem with `TransformedTargetRegressor`.

# Housing Prices - End2End example

In this lecture, we'll be using [Kaggle House Prices dataset](https://www.kaggle.com/c/home-data-for-ml-course/). As usual, to run this notebook you'll need to download the data. For this dataset, train and test have already been separated. We'll be working with the train portion in this lecture. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import (
    ColumnTransformer,
    TransformedTargetRegressor,
    make_column_transformer,
)
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

%matplotlib inline

## Dataset

In [None]:
df = pd.read_csv("../data/housing-kaggle/train.csv")
train_df, test_df = train_test_split(df, test_size=0.10, random_state=123)
train_df.head()

- Supervised machine learning problem: predicting **housing price** given features
  - Here, the target is `SalePrice`, which is continuous. 
  - it's a **regression problem** (as opposed to classification).

In [None]:
train_df.shape

## train and test `X` and `y`

In [None]:
X_train = train_df.drop(columns=["SalePrice"])
y_train = train_df["SalePrice"]

X_test = test_df.drop(columns=["SalePrice"])
y_test = test_df["SalePrice"]

## EDA

### Pandas `describe`

In [None]:
train_df.describe()

### Pandas `info`

In [None]:
train_df.info()

### `pandas_profiling` library

We do not have `pandas_profiling` in our course environment. You will have to install it in the environment on your own if you want to run the code below. 

```pip install -U pandas-profiling[notebook]```

The above worked for me. If this doesn't work for you, check the instructions [here](https://pypi.org/project/pandas-profiling/). 

In [None]:
from pandas_profiling import ProfileReport

In [None]:
profile = ProfileReport(train_df, title="Pandas Profiling Report")  # , minimal=True)

In [None]:
# profile

In [None]:
# profile.to_widgets()

<br><br><br><br>
**Warning on automated EDA tools**

- Do not blindly trust all the info given to you by automated tools. 
- How does pandas profiling figure out the data type?
    - You can look at the Python data type and say floats are numeric, strings are categorical.
    - However, in doing so you would miss out on various subtleties such as **some of the string features being ordinal** rather than truly categorical.
    - Also, it might think **free text data** is **categorical** (which is not right).

- In addition to tools such as above, it's important to go through data description to understand the data.
- The data description for our dataset is available [here](https://www.kaggle.com/c/home-data-for-ml-course/data?select=data_description.txt).     

### Feature types 

- We have mixed feature types and a bunch of missing values. 
- Now, let's identify feature types and transformations. 

- Let's get the numeric-looking columns. 

In [None]:
numeric_looking_columns = X_train.select_dtypes(include=np.number).columns.tolist()
print(numeric_looking_columns)

Not all numeric looking columns are necessarily numeric. 

In [None]:
train_df["MSSubClass"].unique()

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

Also, month sold is more of a categorical feature than a numeric feature. 

In [None]:
train_df["MoSold"].unique() # Month Sold

In [None]:
# ID often doesn't help
drop_features = ["Id"]

# Looking at all columns, these are the actual numeric features
numeric_features = [
    "BedroomAbvGr",
    "KitchenAbvGr", 
    "LotFrontage",
    "LotArea",
    "OverallQual",
    "OverallCond",
    "YearBuilt",
    "YearRemodAdd",
    "MasVnrArea",
    "BsmtFinSF1",
    "BsmtFinSF2",
    "BsmtUnfSF",
    "TotalBsmtSF",
    "1stFlrSF",
    "2ndFlrSF",
    "LowQualFinSF",
    "GrLivArea",
    "BsmtFullBath",
    "BsmtHalfBath",
    "FullBath",
    "HalfBath",
    "TotRmsAbvGrd",
    "Fireplaces",
    "GarageYrBlt",
    "GarageCars",
    "GarageArea",
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "3SsnPorch",
    "ScreenPorch",
    "PoolArea",
    "MiscVal",
    "YrSold",
]

```{note}
I've not looked at all the features carefully. It might be appropriate to apply some other encoding on some of the numeric features above. 
```

In [None]:
set(numeric_looking_columns) - set(numeric_features) - set(drop_features)

We'll treat the above numeric-looking features as **categorical features**. 

- There are a bunch of ordinal features in this dataset. 
- Ordinal features with the same scale 
    - Poor (Po), Fair (Fa), Typical (TA), Good (Gd), Excellent (Ex)
    - These we'll be calling `ordinal_features_reg`.
- Ordinal features with different scales
    - These we'll be calling `ordinal_features_oth`.

In [None]:
ordinal_features_reg = [
    "ExterQual",
    "ExterCond",
    "BsmtQual",
    "BsmtCond",
    "HeatingQC",
    "KitchenQual",
    "FireplaceQu",
    "GarageQual",
    "GarageCond",
    "PoolQC",
]
ordering = [
    "Po",
    "Fa",
    "TA",
    "Gd",
    "Ex",
]  # if N/A it will just impute something, per below
ordering_ordinal_reg = [ordering] * len(ordinal_features_reg)
ordering_ordinal_reg

We'll pass the above as categories in our `OrdinalEncoder`. 
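
To see what passing explicit categories does, here is a small sketch on a made-up dataframe. The column names are borrowed from the dataset, but the rows are hypothetical. Each column receives its own ordering list, and the encoded value is the position of the category in that list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical rows) with two quality columns on the same scale
toy = pd.DataFrame(
    {
        "ExterQual": ["TA", "Gd", "Po", "Ex"],
        "KitchenQual": ["Fa", "TA", "Ex", "Gd"],
    }
)

quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]  # worst to best

# One ordering list per column, in the same order as the columns
encoder = OrdinalEncoder(categories=[quality_order] * toy.shape[1])
encoded = encoder.fit_transform(toy)
print(encoded)  # "Po" -> 0, "Fa" -> 1, "TA" -> 2, "Gd" -> 3, "Ex" -> 4
```

Without the `categories` argument the encoder would sort the strings alphabetically, which would not respect the quality ordering.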

- There are a bunch more ordinal features using different scales.
  - These we'll be calling `ordinal_features_oth`. 
  - We are encoding them separately. 

In [None]:
ordinal_features_oth = [
    "BsmtExposure",
    "BsmtFinType1",
    "BsmtFinType2",
    "Functional",
    "Fence",
]
ordering_ordinal_oth = [
    ['NA', 'No', 'Mn', 'Av', 'Gd'],
    ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    ['NA', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    ['Sal','Sev','Maj2','Maj1','Mod','Min2','Min1','Typ'],    
    ['NA', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']
]

The remaining features are categorical features.

In [None]:
categorical_features = list(
    set(X_train.columns)
    - set(numeric_features)
    - set(ordinal_features_reg)
    - set(ordinal_features_oth)    
    - set(drop_features)
)
categorical_features

- We could also **engineer custom features**. 
  - e.g. **price per square foot** instead of **price** is decent feature engineering given our expert knowledge of housing market ;-)

## Applying feature transformations

- Since we have mixed feature types, let's use `ColumnTransformer` (using the `make_column_transformer` interface) to apply different transformations on different features types.  

In [None]:
from sklearn.compose import make_column_transformer

# Create 4 pipelines, one for each feature type
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

ordinal_transformer_reg = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_reg),
)

ordinal_transformer_oth = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(categories=ordering_ordinal_oth),
)

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore", sparse=False),
)

# Create the preprocessor
preprocessor = make_column_transformer(
    ("drop", drop_features),
    (numeric_transformer, numeric_features),
    (ordinal_transformer_reg, ordinal_features_reg),
    (ordinal_transformer_oth, ordinal_features_oth),    
    (categorical_transformer, categorical_features),
)

## Examining the preprocessed data

In [None]:
preprocessor.fit(X_train) # Calling fit to examine all the transformers.
preprocessor

In [None]:
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
new_columns = numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns

In [None]:
# Create new DataFrame by transforming the original DataFrame with the `preprocessor`
X_train_enc = pd.DataFrame(
    preprocessor.transform(X_train), 
    index=X_train.index, 
    columns=new_columns
)
X_train_enc.head()

In [None]:
X_train.shape

In [None]:
X_train_enc.shape

We went from 80 features to 263 features!! 

### Other possible preprocessing?  

- There is a lot of room for improvement.
- We're just using `SimpleImputer`.
    - In reality we'd want to go through this more carefully.
    - We may also want to drop some columns that are almost entirely missing.    
- We could also check for **outliers**, and do other exploratory data analysis (EDA).
- But for now this is good enough ...    
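
As a rough illustration of spotting columns that are almost entirely missing, one could compute the fraction of missing values per column. This is only a sketch: the toy rows and the 0.7 threshold are hypothetical choices, not something prescribed in this lecture.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows); column names borrowed from the dataset
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
        "FireplaceQu": [np.nan, "TA", "Gd", np.nan],
    }
)

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()
print(missing_fraction)

# Hypothetical threshold: flag columns with more than 70% missing values
threshold = 0.7
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()
print(mostly_missing)  # ['PoolQC']
```

Flagged columns could then be added to `drop_features`, although whether that helps is something to check with cross-validation.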

<br><br><br><br>

## Model building 

### Apply `Ridge` Regression

- Recall that we are going to use `Ridge()` instead of `LinearRegression()` in this course. 
    - It has a hyperparameter `alpha` which controls **regularization** and, in turn, controls the **fundamental bias/variance tradeoff**. 

In [None]:
X_train.shape

In [None]:
lr_pipe = make_pipeline(preprocessor, Ridge(alpha=1.0))
pd.DataFrame(cross_validate(lr_pipe, X_train, y_train, cv=10, return_train_score=True))

- Quite a bit of **variance** in the test scores. 
- Performing **very poorly** in fold 8. Not sure why. 

### Tuning `alpha` hyperparameter of `Ridge`
- Recall that `Ridge` has a hyperparameter `alpha` that controls the fundamental tradeoff.
- This is like `C` in `LogisticRegression` but, annoyingly, `alpha` is the inverse of `C`.
- That is, large `C` is like small `alpha` and vice versa.
- Smaller `alpha`: lower training error (overfitting)

In [None]:
alphas = 10.0 ** np.arange(-5, 4, 1)
alphas

In [None]:
train_scores = []
cv_scores = []
for alpha in alphas:
    lr = make_pipeline(preprocessor, Ridge(alpha=alpha))
    results = cross_validate(lr, X_train, y_train, return_train_score=True)
    train_scores.append(np.mean(results["train_score"]))
    cv_scores.append(np.mean(results["test_score"]))

In [None]:
plt.semilogx(alphas, train_scores, label="train")
plt.semilogx(alphas, cv_scores, label="cv")
plt.legend()
plt.xlabel("alpha")
plt.ylabel("score");

In [None]:
best_alpha = alphas[np.argmax(cv_scores)]
best_alpha

- It seems alpha=100 is the best choice here.

- **Larger `alpha`** leads to **smaller coefficients**. (Why?)
- Smaller coefficients mean the predictions are less sensitive to changes in the data.
- Hence less chance of overfitting (seeing big dependencies when you shouldn't).

------------
**[Optional / Bonus] + Reminder**

Larger alpha leads to smaller coefficients... here is why:


**Objective Function of Ridge Regression:**  
  
$$J(w) = ||y - Xw||^2_2 + \alpha ||w||^2_2$$

So, as you make the hyperparameter **`alpha` bigger**, large coefficients are penalized more heavily, and the $w$ that **minimizes $J(w)$** ends up with a smaller $||w||^2_2$.
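
The shrinking effect can be observed numerically. The sketch below makes some simplifying assumptions: it uses synthetic data, ignores the intercept (which `Ridge` fits separately and does not penalize), and solves the objective in closed form as $w = (X^TX + \alpha I)^{-1}X^Ty$ instead of calling `sklearn`. The helper `ridge_weights` is made up for this illustration.

```python
import numpy as np

# Synthetic data (hypothetical): 50 examples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)


def ridge_weights(X, y, alpha):
    """Minimizer of ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


alpha_grid = 10.0 ** np.arange(-5, 4, 1)
norms = [np.linalg.norm(ridge_weights(X, y, a)) for a in alpha_grid]

for a, n in zip(alpha_grid, norms):
    print(f"alpha={a:10.5f}  ||w||={n:.4f}")
```

The printed norms decrease as `alpha` grows, approaching zero for very large `alpha`.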

**[End of Optional / Bonus]**

--------------

### `RidgeCV`

BTW, because it's so common to want to tune `alpha` with `Ridge`, sklearn provides a class called `RidgeCV`, which automatically tunes `alpha` based on cross-validation.

In [None]:
# -------- New code (nothing conceptually new: Ridge regression combined with hyperparameter tuning for alpha) --------
ridgecv_pipe = make_pipeline(preprocessor, RidgeCV(alphas=alphas, cv=10))
# ----------------------------------------------------------------------------------------------------------------

ridgecv_pipe.fit(X_train, y_train);

In [None]:
best_alpha = ridgecv_pipe.named_steps['ridgecv'].alpha_
best_alpha

Make some predictions on the `X_test` set

In [None]:
lr_tuned = make_pipeline(preprocessor, Ridge(alpha=best_alpha))
lr_tuned.fit(X_train, y_train)
lr_preds = lr_tuned.predict(X_test)
lr_preds[:10]

In [None]:
lr_preds.max(), lr_preds.min()

### Examine the coefficients 

Get the feature names of the transformed data. 

In [None]:
ohe_columns = list(
    preprocessor.named_transformers_["pipeline-4"]
    .named_steps["onehotencoder"]
    .get_feature_names_out(categorical_features)
)
new_columns = numeric_features + ordinal_features_reg + ordinal_features_oth + ohe_columns

In [None]:
df = pd.DataFrame(
    data={
        "features": new_columns,
        "coefficients": lr_tuned.named_steps["ridge"].coef_,
    }
)

<br><br><br><br><br>
Print features and their coefficients, sorted from biggest to smallest coefficient.

In [None]:
df.sort_values("coefficients",ascending=False)

So according to this model:

- As `OverallQual` feature gets bigger the housing price will get bigger.
- Presence of `Neighborhood_Edwards` will result in smaller house value. 

<br><br><br><br>

# Regression Metrics (score functions)

In [None]:
y_train.values[:4]

In [None]:
lr_tuned.predict(X_train)[:4]

- **How good** are the predictions on average across all samples?
- What's the **performance** of the model?

The above two questions are equivalent.
<br><br><br><br>

A number of popular scoring functions for regression:

- mean squared error (MSE)
- $R^2$
- root mean squared error (RMSE)
- mean absolute percentage error (MAPE)
- symmetric mean absolute percentage error (sMAPE) [**optional**]

See [sklearn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) for more details. 

## Mean squared error (MSE)

A common metric is mean squared error:

$$MSE(y, \hat{y}) = \frac{1}{n}  \sum_{i=1}^n (y_i - \hat{y_i})^2$$

In [None]:
y_test_pred = lr_tuned.predict(X_test)

In [None]:
np.mean((y_test_pred - y_test) ** 2)

**Perfect predictions: MSE = 0**

This is also implemented in sklearn:

In [None]:
from sklearn.metrics import mean_squared_error

# ------------ new function -------------
mean_squared_error(y_test, y_test_pred)

- MSE looks huge and unreasonable. There is an error of ~\$1 Billion!
- Is this score good or bad?

- In regression **our target has units**. 
  - The target is in **dollars**, the mean squared error is in **$dollars^2$** 
- The score also depends on the **scale** of the targets (imagine cents vs. dollars). 
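
A small numeric sketch of the units issue, using made-up prices: expressing the same predictions in cents instead of dollars multiplies MSE by $100^2$ but RMSE only by $100$. The helper `mse` and the toy numbers are hypothetical.

```python
import numpy as np


def mse(true, pred):
    """Mean squared error."""
    return float(np.mean((pred - true) ** 2))


# Toy prices in dollars (hypothetical)
y_true_dollars = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred_dollars = np.array([210_000.0, 140_000.0, 300_000.0])

# The same prices expressed in cents
y_true_cents = 100 * y_true_dollars
y_pred_cents = 100 * y_pred_dollars

mse_dollars = mse(y_true_dollars, y_pred_dollars)
mse_cents = mse(y_true_cents, y_pred_cents)
rmse_dollars = np.sqrt(mse_dollars)
rmse_cents = np.sqrt(mse_cents)

print(mse_cents / mse_dollars)    # MSE grows by a factor of 100**2
print(rmse_cents / rmse_dollars)  # RMSE grows by a factor of 100
```

The model is equally good in both cases, which is why a raw MSE value is hard to judge without knowing the scale of the target.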

## Root mean squared error or RMSE

- Root of MSE :)
- **Perfect predictions: RMSE = 0**

$$RMSE(y, \hat{y}) = \sqrt{\frac{1}{n}  \sum_{i=1}^n (y_i - \hat{y_i})^2}$$


- The **unit** of **MSE** is in $dollars^2$.
- The **unit** of **RMSE** is $dollars$.


In [None]:
np.sqrt(mean_squared_error(y_test, y_test_pred))

Error of \$28,000 makes more sense.
<br><br><br><br><br>
Can we dig deeper?

In [None]:
plt.scatter(y_test, y_test_pred, alpha=0.3)
grid = np.linspace(y_train.min(), y_train.max(), 1000)
plt.plot(grid, grid, "--k")
plt.xlabel("true price")
plt.ylabel("predicted price");
plt.title("Test Data");

- Here we can see a few cases where our prediction is way off.
- Is there something weird about those houses, perhaps? **Outliers**? 
- Points under the line mean we're under-predicting, points over the line mean we're over-predicting.

## $R^2$

A common score is the $R^2$

- This is the score that `sklearn` uses by default when you call `score()` in regression:
- It is easy to compare performance of models with $R^2$
  - Example:
    - $R^2(model_A)=0.4$
    - $R^2(model_B)=0.8$
    - **model B explains twice as much of the variance in $y$ as model A**
- It represents the proportion of "variance of y" that has been **explained** by the "independent variables" in the model.

$$MSE(y, \hat{y}) = \frac{1}{n}  \sum_{i=1}^n (y_i - \hat{y_i})^2$$

$$R^2(y, \hat{y}) = 1 - \frac{\text{SSR}}{\text{SST}}  = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y_i})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

SSR: Sum of Squared Residuals  
SST: Total Sum of Squares
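
As a quick worked example of the right-hand side of the formula above, here is $R^2$ computed by hand on four made-up target values (the numbers are hypothetical and unrelated to the housing data).

```python
import numpy as np

# Toy targets and predictions (hypothetical)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ssr = np.sum((y - y_hat) ** 2)     # squared residuals: 0.25 + 0.25 + 0 + 0.25 = 0.75
sst = np.sum((y - y.mean()) ** 2)  # squared deviations from the mean (6): 9 + 1 + 1 + 9 = 20
r2 = 1 - ssr / sst

print(ssr, sst, r2)  # 0.75 20.0 0.9625
```

The denominator uses only the true targets, so the two arguments play different roles in the formula.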

Key points:
- **Perfect Predictions: $R^2 = 1$**
- **Negative values are very bad: "worse than DummyRegressor"**
- Question for you: What's the $R^2$ score of dummy model which always predicts the **mean**? (assume score on the train set)
  - Answer: ????

**Warning**: MSE is **commutative** but $R^2$ is not **commutative**

<br><br><br>
Wait a minute... what was **commutative**? A function of two arguments is commutative if swapping the arguments does not change the result.

In [None]:
print(mean_squared_error(y_test, y_test_pred))
print(mean_squared_error(y_test_pred, y_test))

In [None]:
print(r2_score(y_test, y_test_pred))
print(r2_score(y_test_pred, y_test))

- [Read more about R2](https://en.wikipedia.org/wiki/Coefficient_of_determination) if interested.

## Mean Absolute Percentage Error (MAPE)

We got an RMSE of ~\$30,000 before. 
- Question: Is \$30,000 good with respect to the scale of the target data? (imagine Vancouver vs. Regina, Saskatchewan)

- For a house worth \$600k, it seems reasonable! That's 5% error.
- For a house worth \$60k, that is terrible. It's 50% error.

We have both of these cases in our dataset.

Solution: Let's talk **percentages**!

In [None]:
plt.hist(y_train, bins=100);

How about looking at percent error? 

In [None]:
percent_errors = (y_test_pred - y_test) / y_test * 100.0
percent_errors

These are both positive (predict too high) and negative (predict too low).

We can look at the absolute percent error:

In [None]:
np.abs(percent_errors)

And, like MSE, we can take the average over examples. This is called mean absolute percent error (MAPE).

In [None]:
def my_mape(true, pred):
    return np.mean(np.abs((pred - true) / true))

In [None]:
my_mape(y_test, y_test_pred)

Let's use `sklearn` to calculate MAPE. 

In [None]:
from sklearn.metrics import mean_absolute_percentage_error

In [None]:
mean_absolute_percentage_error(y_test, y_test_pred)

- Ok, this is quite interpretable.
- On average, we have around **10% error**.

## Symmetric mean absolute percentage error (sMAPE)

[Optional / Bonus]

- When calculating percentage error, shall we put the "actual" in denominator or the "prediction"? Why?

Learn here: https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error
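
To make the question concrete, here is a small sketch comparing the ordinary absolute percentage error with a symmetric variant. Note that several definitions of sMAPE are in use; the one below, which divides by the average of the absolute true and predicted values, is an assumption chosen for illustration. The helper names are made up.

```python
def abs_pct_error(true, pred):
    """Absolute percentage error with the true value in the denominator."""
    return abs(pred - true) / abs(true)


def sym_abs_pct_error(true, pred):
    """Symmetric variant: divide by the average of |true| and |pred|."""
    return abs(pred - true) / ((abs(true) + abs(pred)) / 2)


# The same absolute error of 50, once over-predicting and once under-predicting
over = abs_pct_error(true=100, pred=150)       # 50 / 100 = 0.5
under = abs_pct_error(true=150, pred=100)      # 50 / 150 = 0.333...
sym_over = sym_abs_pct_error(true=100, pred=150)   # 50 / 125 = 0.4
sym_under = sym_abs_pct_error(true=150, pred=100)  # 50 / 125 = 0.4

print(over, under, sym_over, sym_under)
```

With the true value in the denominator, the two cases get different percentages; the symmetric version treats them the same.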
<br><br><br><br><br><br>

# Transforming the targets

- In Linear Regression models, we often minimize Least Squares Error (or similar, depending on the solvers/optimizers used)
- `.fit()` minimizes Least Squares Error, **not MAPE (percentage error)**. 

**When minimizing MSE, the expensive houses will dominate because their errors, measured in dollars, tend to be the biggest**

------------
Imagine you only have 2 samples in the test set:

Model A

- Example 1: Truth: \$50k, Prediction: \$100k
- Example 2: Truth: \$500k, Prediction: \$550k
- RMSE: \$50k
- MAPE: 55%


Model B

- Example 1: Truth: \$50k, Prediction: \$60k
- Example 2: Truth: \$500k, Prediction: \$600k
- RMSE: \$71k
- MAPE: 20%
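
The numbers for the two models can be recomputed with a few lines of NumPy. The helpers `rmse` and `mape` are defined here only for this check; prices are in thousands of dollars.

```python
import numpy as np


def rmse(true, pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))


def mape(true, pred):
    """Mean absolute percentage error, as a fraction."""
    return float(np.mean(np.abs((pred - true) / true)))


truth = np.array([50.0, 500.0])    # true prices, in $k
pred_a = np.array([100.0, 550.0])  # Model A: off by 50k on both houses
pred_b = np.array([60.0, 600.0])   # Model B: off by 20% on both houses

rmse_a, mape_a = rmse(truth, pred_a), mape(truth, pred_a)
rmse_b, mape_b = rmse(truth, pred_b), mape(truth, pred_b)

print(f"Model A: RMSE={rmse_a:.1f}k  MAPE={mape_a:.0%}")  # RMSE=50.0k  MAPE=55%
print(f"Model B: RMSE={rmse_b:.1f}k  MAPE={mape_b:.0%}")  # RMSE=71.1k  MAPE=20%
```

Model A wins on RMSE while Model B wins on MAPE, so the choice of metric changes which model looks better.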

------------

- How can we get `.fit()` to think about MAPE?
- A common practice which tends to work is **log transforming** the targets.
- That is, transform $y\rightarrow \log(y)$.
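
To get a feeling for why this helps, the sketch below reuses the two-sample example from above and measures squared error on $\log(y)$ instead of $y$. A difference of logs is the log of a ratio, so errors in log space behave like relative errors. The helper `rmse` is made up for this illustration.

```python
import numpy as np


def rmse(true, pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))


truth = np.array([50.0, 500.0])    # true prices, in $k
pred_a = np.array([100.0, 550.0])  # Model A: off by 50k on both houses
pred_b = np.array([60.0, 600.0])   # Model B: off by 20% on both houses

# Squared error on the original scale prefers Model A ...
rmse_a = rmse(truth, pred_a)
rmse_b = rmse(truth, pred_b)

# ... but squared error on the log scale prefers Model B, like MAPE does
log_rmse_a = rmse(np.log(truth), np.log(pred_a))
log_rmse_b = rmse(np.log(truth), np.log(pred_b))

print(rmse_a < rmse_b)          # True
print(log_rmse_b < log_rmse_a)  # True
```

So a model fit with least squares on log-transformed targets is pushed towards small relative errors rather than small dollar errors.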

In [None]:
plt.hist(y_train, bins=100);

In [None]:
plt.hist(np.log10(y_train), bins=100);

We can incorporate this in our pipeline using `sklearn`. 

In [None]:
from sklearn.compose import TransformedTargetRegressor

In [None]:
# transformer for log transforming the target
best_alpha = 100
ttr = TransformedTargetRegressor(
    Ridge(alpha=best_alpha), 
    func=np.log1p,  # natural logarithm of (1 + x)
    inverse_func=np.expm1
) 
ttr_pipe = make_pipeline(preprocessor, ttr)

In [None]:
ttr_pipe.fit(X_train, y_train); # y_train automatically transformed

In [None]:
ttr_pipe.predict(X_train)  # predictions automatically un-transformed

<br><br><br><br>

Now evaluating our model on Test set:

In [None]:
y_test_pred = ttr_pipe.predict(X_test)

In [None]:
mean_absolute_percentage_error(y_test, y_test_pred)

In [None]:
np.sqrt(mean_squared_error(y_test, y_test_pred))

- We reduced MAPE from ~10% to ~8% with this trick! 
- We reduced RMSE from 28k to 22k with this trick! 

----------
**[Reminder from last session]**  
**[Study on your own]**

# Different scoring functions with `cross_validate`

- Let's try using MSE instead of the default $R^2$ score. 

In [None]:
pd.DataFrame(
    cross_validate(
        lr_tuned,
        X_train,
        y_train,
        return_train_score=True,
        scoring=["neg_mean_squared_error", "neg_mean_absolute_percentage_error"]
    )
)

In [None]:
def mape(true, pred):
    return mean_absolute_percentage_error(true, pred)

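# make a scorer function that we can pass into cross-validation
# MAPE is an error (lower is better), so greater_is_better=False; sklearn then reports the negated value
mape_scorer = make_scorer(mape, greater_is_better=False)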

pd.DataFrame(
    cross_validate(
        lr_tuned, X_train, y_train, return_train_score=True, scoring=mape_scorer
    )
)

In [None]:
import sklearn
sklearn.metrics.get_scorer_names()

In [None]:
scoring = {
    "r2": "r2",
    "neg_mape": "neg_mean_absolute_percentage_error", 
    "neg_rmse": "neg_root_mean_squared_error",
    "neg_mse": "neg_mean_squared_error",
}

pd.DataFrame(
    cross_validate(lr_tuned, X_train, y_train, return_train_score=True, scoring=scoring)
).T

In [None]:
mean_absolute_percentage_error(y_test, lr_tuned.predict(X_test))


**[End of Study on your own]**

---------------

## Using regression metrics with `scikit-learn`

- In `sklearn` you will notice that there are negated versions of the error metrics above (e.g., `neg_mean_squared_error`, `neg_root_mean_squared_error`). 
  - The reason is that `sklearn` scorers follow the convention that a score is something to **maximize**: the **higher the better**. Errors are therefore negated.
- If you define your own scorer with `make_scorer` and your metric is an error (lower is better), set the `greater_is_better` parameter to `False`; the scorer will then return the negated metric.
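
Here is a minimal sketch of the sign convention, assuming a toy dataset and a `DummyRegressor` so that the numbers are easy to follow. A scorer built with `greater_is_better=False` returns the negated metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_percentage_error

# Toy data (hypothetical); the features are irrelevant for a dummy model
X = np.zeros((4, 1))
y = np.array([100.0, 200.0, 300.0, 400.0])

model = DummyRegressor(strategy="mean").fit(X, y)  # always predicts 250

# The metric itself: lower is better
raw_mape = mean_absolute_percentage_error(y, model.predict(X))

# The scorer: sklearn flips the sign so that higher is better
neg_mape_scorer = make_scorer(mean_absolute_percentage_error, greater_is_better=False)
score = neg_mape_scorer(model, X, y)

print(raw_mape, score)  # same magnitude, opposite sign
```

This is why results from `cross_validate` with `neg_*` scoring strings are negative numbers: multiply by -1 to read them as errors.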

## ❓❓ Questions for you

### (iClicker) Exercise 10.1 

**iClicker cloud join link: https://join.iclicker.com/EMMJ**

**Select all of the following statements which are TRUE.**

- (A) Outliers in the Training set adversely impact the performance of the model.
- (B) The `alpha` hyperparameter of `Ridge` has a similar interpretation to the `C` hyperparameter of `LogisticRegression`; higher `alpha` means a more complex model. 
- (C) In regression, one should use MAPE instead of MSE when relative (percentage) error matters more than absolute error.
- (D) A lower RMSE value indicates a better model.
- (E) We can still use precision and recall for regression problems but now we have other metrics we can use as well.

## Summary

- House prices dataset target is price, which is numeric -> regression rather than classification
- There are corresponding versions of all the tools we used:
    - `DummyClassifier` -> `DummyRegressor`
    - `LogisticRegression` -> `Ridge`
- `Ridge` hyperparameter `alpha` is like `LogisticRegression` hyperparameter `C`, but opposite meaning
- We'll avoid `LinearRegression` in this course.

- Scoring metrics
    - $R^2$ is the default `.score()`; it is unitless, 0 is bad, 1 is best
    - MSE (mean squared error) is in units of target squared, hard to interpret; 0 is best
    - RMSE (root mean squared error) is in the same units as the target; 0 is best
    - MAPE (mean absolute percent error) is unitless; 0 is best, 1 is bad