# 📝 Exercise M7.03

As with the classification metrics exercise, we will evaluate the regression
metrics within a cross-validation framework to get familiar with the syntax.

We will use the Ames house prices dataset.

In [1]:
import pandas as pd
import numpy as np

ames_housing = pd.read_csv("../datasets/house_prices.csv")
data = ames_housing.drop(columns="SalePrice")
target = ames_housing["SalePrice"]
# Keep only the numerical features
data = data.select_dtypes(np.number)
# Express the target in thousands of dollars (k$)
target /= 1000

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>
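
To make the preprocessing in the first cell concrete, here is a minimal sketch on a tiny hand-made frame. The column names and values are illustrative assumptions, not taken from the real file. It shows that `select_dtypes(np.number)` drops the string column and that dividing by 1000 expresses prices in k$.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: names and values are made up for illustration.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250],
        "Neighborhood": ["A", "B", "A"],
        "YearBuilt": [2003, 1976, 2001],
        "SalePrice": [208500, 181500, 223500],
    }
)

# Separate features and target, then keep only the numerical features.
toy_data = toy.drop(columns="SalePrice").select_dtypes(np.number)
# Rescale the target to thousands of dollars without modifying `toy`.
toy_target = toy["SalePrice"] / 1000

print(toy_data.columns.tolist())
print(toy_target.tolist())
```

Rescaling the target only changes the unit of the error metrics (k$ instead of $); scale-free scores such as $R^2$ are unaffected.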

The first step will be to create a linear regression model.

In [2]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

Then, use `cross_val_score` to estimate the generalization performance of
the model. Use a `KFold` cross-validation with 10 folds. Make the use of the
$R^2$ score explicit by setting the `scoring` parameter (even though it is
the default score for regressors).

In [10]:
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=10, shuffle=True, random_state=0)

cv_scores = cross_val_score(
    model, data, target, cv=cv, n_jobs=4, scoring='r2')

print(f"R2 scores of the regression model: "
      f"{cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

R2 scores of the regression model: 0.772 ± 0.141
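
As a reminder of what the `'r2'` scorer reports, the sketch below computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand on a few made-up values (an assumption for illustration, unrelated to the Ames data) and compares it with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left after using the predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean target.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"manual: {r2_manual:.3f}, scikit-learn: {r2_sklearn:.3f}")
```

A value of 1 means perfect predictions, while 0 corresponds to a model that always predicts the mean of the target.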


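Then, instead of using the $R^2$ score, use the mean absolute error. You need
to refer to the documentation for the `scoring` parameter. Note that
scikit-learn scorers follow a "higher is better" convention, so errors are
exposed as negated scores (hence the `neg_` prefix) and the sign has to be
flipped to recover the error.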

In [11]:
cv_scores = cross_val_score(
    model, data, target, cv=cv, n_jobs=4, scoring='neg_mean_absolute_error')

print(f"Mean absolute error of the regression model: "
      f"{-cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

Mean absolute error of the regression model: 22.041 ± 2.429
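
The sign flip in the cell above can be checked directly. The sketch below uses synthetic data from `make_regression` (an assumption chosen to keep the example self-contained) and compares the `neg_mean_absolute_error` scorer with the `mean_absolute_error` metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Synthetic regression problem, used only for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
reg = LinearRegression().fit(X, y)

# The metric function returns the error itself (lower is better).
mae = mean_absolute_error(y, reg.predict(X))
# The scorer returns the negated error (higher is better).
neg_mae = get_scorer("neg_mean_absolute_error")(reg, X, y)

print(f"metric: {mae:.3f}, scorer: {neg_mae:.3f}")
```

This convention lets tools that maximize a score, such as grid search, treat errors and scores uniformly.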


Finally, use the `cross_validate` function and compute multiple scores/errors
at once by passing a list of scorers to the `scoring` parameter. You can
compute the $R^2$ score and the mean absolute error for instance.

In [21]:
from sklearn.model_selection import cross_validate

scoring = [
    'explained_variance',
    'max_error',
    'neg_mean_absolute_error',
    'neg_mean_squared_error',
    'neg_root_mean_squared_error',
    'neg_median_absolute_error',
    'r2',
    'neg_mean_absolute_percentage_error',
]

cv_scores = cross_validate(
    model, data, target, cv=cv, scoring=scoring
)
for score in scoring:
    test_scores = cv_scores[f"test_{score}"]
    # scikit-learn negates error metrics so that higher is always better.
    # This also applies to `max_error`, although its name has no `neg_` prefix.
    sign = -1 if score.startswith("neg_") or score == "max_error" else 1
    print(f"Score {score} of the regression model: "
          f"{sign * test_scores.mean():.3f} ± {test_scores.std():.3f}")

Score explained_variance of the regression model: 0.774 ± 0.139
Score max_error of the regression model: 265.302 ± 166.209
Score neg_mean_absolute_error of the regression model: 22.041 ± 2.429
Score neg_mean_squared_error of the regression model: 1463.919 ± 1037.629
Score neg_root_mean_squared_error of the regression model: 36.575 ± 11.233
Score neg_median_absolute_error of the regression model: 15.730 ± 1.680
Score r2 of the regression model: 0.772 ± 0.141
Score neg_mean_absolute_percentage_error of the regression model: 0.132 ± 0.018
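
Since `cross_validate` returns a dictionary of arrays with one entry per fold, it can be convenient to load it into a pandas DataFrame. The sketch below is one possible way to do so on synthetic data; the dataset, the 5-fold split, and the renamed column are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring=["r2", "neg_mean_absolute_error"],
)
# One row per fold; columns include fit_time, score_time and test_<scorer>.
results = pd.DataFrame(results)

# Flip the sign of the negated error so that it reads as an error again.
results["test_mean_absolute_error"] = -results.pop("test_neg_mean_absolute_error")

# Note: pandas `std` uses ddof=1, unlike NumPy's default ddof=0.
print(results[["test_r2", "test_mean_absolute_error"]].agg(["mean", "std"]))
```

Keeping the per-fold values in a DataFrame also makes it easy to inspect individual folds when the standard deviation is large.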
