# Regression Metrics

Follow _Introduction to Machine Learning_ [Chapter 5](https://github.com/amueller/introduction_to_ml_with_python/blob/master/05-model-evaluation-and-improvement.ipynb) **Section 5.3.4 Regression Metrics** (p.306)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import mglearn

## Regression metrics

R-squared `sklearn.metrics.r2_score`
>It represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance.

See [Coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) for more information.

$R^2 = \frac{\sum(y - \bar{y})^2 - \sum(y - \hat{y})^2}{\sum(y - \bar{y})^2} = 1 -  \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$

Mean-squared error `sklearn.metrics.mean_squared_error`

$\frac{1}{N} \sum (y - \hat{y})^2$

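Root mean-squared error `sklearn.metrics.root_mean_squared_error` (scikit-learn 1.4 and later; older releases used `mean_squared_error(squared=False)`, which has since been deprecated)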

$\sqrt{\frac{1}{N} \sum (y - \hat{y})^2}$

Mean absolute error `sklearn.metrics.mean_absolute_error`


$\frac{1}{N} \sum |y - \hat{y}|$
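As a quick check of the four formulas above, here is a minimal sketch on a hand-sized example. The arrays `y_true` and `y_hat` are made-up values chosen so the arithmetic can be verified by hand; only NumPy is used, so nothing here depends on the scikit-learn version.

```python
import numpy as np

# Made-up targets and predictions, small enough to check by hand
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([3.0, 4.0, 6.0, 9.0])

errors = y_true - y_hat                          # [-1, 0, 0, -1]

mse_value = np.mean(errors ** 2)                 # (1 + 0 + 0 + 1) / 4
rmse_value = np.sqrt(mse_value)                  # square root of the MSE
mae_value = np.mean(np.abs(errors))              # (1 + 0 + 0 + 1) / 4

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_value = 1 - ss_res / ss_tot

print(mse_value, rmse_value, mae_value, r2_value)
```

Note that MSE is in squared target units, while RMSE and MAE are in the same units as the target.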

In [None]:
y = 10*np.random.rand(50)+2
y

In [None]:
y_pred = y + 2*np.random.rand(50)
y_pred[-1] = 1.6 * y[-1]  # a slight outlier
y_pred

### Predicted vs actual plot

Plot predicted against actual values together with a line of unity (predicted = actual) to assess bias in the predictions.

In [None]:
plt.scatter(y_pred, y)
plt.plot([0, 13], [0, 13])
plt.xlabel('predicted')
plt.ylabel('actual')
plt.grid(True)

### Residual plot

Plot the residuals (actual - predicted) against the predicted values to assess how the errors are distributed and whether they depend on the magnitude of the prediction.

In [None]:
plt.scatter(y_pred, y-y_pred)
plt.xlabel('predicted')
plt.ylabel('actual - predicted')
plt.grid(True)
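
The two things the residual plot is meant to reveal can also be summarised numerically. The following is a minimal sketch, not part of the original notebook: it regenerates similar data with its own seed (the names `y_true`, `y_hat`, and the seed `0` are assumptions), then reports the mean residual as a rough bias indicator and the correlation between residuals and predictions as a rough indicator of dependence on magnitude.

```python
import numpy as np

# Synthetic data in the spirit of the cells above (seed chosen arbitrarily)
rng = np.random.default_rng(0)
y_true = 10 * rng.random(50) + 2
y_hat = y_true + 2 * rng.random(50)   # added error is always in [0, 2): over-prediction

residuals = y_true - y_hat            # actual - predicted, as on the plot's y axis

# A mean residual far from zero suggests systematic bias
mean_residual = residuals.mean()

# A correlation far from zero suggests the error depends on the predicted magnitude
corr = np.corrcoef(y_hat, residuals)[0, 1]

print(f"mean residual: {mean_residual:.3f}")
print(f"correlation(predicted, residual): {corr:.3f}")
```

Because the simulated error is never negative, the mean residual here is below zero, which matches the points sitting below the zero line in the plot.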

In [None]:
# Calculate R2 manually
1 - sum((y-y_pred)**2)/sum((y-y.mean())**2)

In [None]:
from sklearn.metrics import r2_score
r2_score(y, y_pred)

In [None]:
# Calculate mse manually
sum((y-y_pred)**2) / len(y)

In [None]:
from sklearn.metrics import mean_squared_error as mse
mse(y, y_pred)

In [None]:
# Calculate RMSE manually
np.sqrt(sum((y-y_pred)**2) / len(y))

In [None]:
# Take the square root of the MSE; the squared=False argument is deprecated in recent scikit-learn
np.sqrt(mse(y, y_pred))

In [None]:
# Calculate mae manually
sum(np.abs(y-y_pred)) / len(y)

In [None]:
from sklearn.metrics import mean_absolute_error as mae
mae(y, y_pred)

## Negative $R^2$?

In [None]:
x = np.linspace(0, 10, 50)
x

In [None]:
np.random.seed(345)
y = -0.1*x**2+np.random.randn(50)+8
y

In [None]:
plt.plot(x, y, '.');
plt.xlabel('feature')
plt.ylabel('target')

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
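# Deliberately omit the intercept, forcing the fitted line through the origin
model = LinearRegression(fit_intercept=False)
# .score() returns R^2, here evaluated on the training data
model.fit(x[:, None], y).score(x[:, None], y)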

In [None]:
xline = np.linspace(0, 10, 20)
y_pred = model.predict(xline[:,None])

plt.plot(x, y, '.');
plt.plot(xline, y_pred, label='model');
plt.hlines(y=y.mean(), xmin=0, xmax=10, label='mean')
plt.legend()
plt.xlabel('feature')
plt.ylabel('target')

Why do we get a negative $R^2$?

$R^2$ is negative whenever the model performs worse than the constant-mean predictor, i.e. a model that always predicts $\bar{y}$.

$R^2 = 1 -  \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$

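For a *bad* model, the residual sum of squares (the numerator of the fraction) is larger than the total sum of squares around the mean (the denominator), so the fraction exceeds 1 and $R^2$ becomes negative.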
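
A tiny worked example makes the sign of $R^2$ concrete. This is a sketch with made-up numbers; the helper `r2` is a hypothetical function that simply implements the formula above with NumPy.

```python
import numpy as np

def r2(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot, as in the formula above."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0])

# Constant-mean predictor: SS_res equals SS_tot (both 2), so R^2 is 0
y_mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2(y_true, y_mean_pred)

# Predictions in reverse order: SS_res = 4 + 0 + 4 = 8, SS_tot = 2, so R^2 = 1 - 4 = -3
y_bad_pred = y_true[::-1]
r2_bad = r2(y_true, y_bad_pred)

print(r2_mean, r2_bad)
```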

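For ordinary least-squares linear regression with a fitted intercept, this cannot happen on the training data, because the fit is then at least as good as the constant-mean predictor. It can happen when no intercept is fitted, as in the example above, or when the model is scored on data it was not trained on.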
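
The effect of the intercept can be checked directly. The sketch below is an illustration under stated assumptions rather than the notebook's own method: it uses `np.linalg.lstsq` in place of `LinearRegression` so that it depends only on NumPy (both solve the same least-squares problem), and it draws noise with `np.random.default_rng(345)`, so the numbers will differ from the cells above.

```python
import numpy as np

def r2(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Same functional form as the example above, with a different random generator
rng = np.random.default_rng(345)
x = np.linspace(0, 10, 50)
y = -0.1 * x**2 + rng.standard_normal(50) + 8

# Least squares without an intercept: the design matrix is the feature alone
A_no = x[:, None]
coef_no, *_ = np.linalg.lstsq(A_no, y, rcond=None)
r2_no = r2(y, A_no @ coef_no)

# Least squares with an intercept: prepend a column of ones
A_with = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(A_with, y, rcond=None)
r2_with = r2(y, A_with @ coef_with)

print(f"R^2 without intercept: {r2_no:.3f}")
print(f"R^2 with intercept:    {r2_with:.3f}")
```

Without the intercept, the line is forced through the origin while the data start near 8 and decrease, so the fit is worse than predicting the mean. With the intercept, the constant-mean predictor is one of the candidate fits, so the training $R^2$ cannot fall below zero.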

See:

https://stackoverflow.com/questions/30507245/negative-r2-on-training-data-for-linear-regression

https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative