In [6]:
import pandas as pd
import numpy as np

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [8]:

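data_url = "http://lib.stat.cmu.edu/datasets/boston"
# The first 22 lines of the file are a text description, so skip them.
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines: 11 values on the first, 3 on the second.
# Features = 11 values + first 2 values of the second line; the last value is MEDV.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]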

In [9]:
display(data.shape, target.shape)

(506, 13)

(506,)

In [14]:
df = pd.DataFrame(data).copy()
df['target'] = target

In [16]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(data, target)

In [18]:
pred = lr.predict(data)

In [20]:
pred_df = pd.DataFrame({
    'ground truth': target,
    'prediction': pred
})

pred_df

Unnamed: 0,ground truth,prediction
0,24.0,30.003843
1,21.6,25.025562
2,34.7,30.567597
3,33.4,28.607036
4,36.2,27.943524
...,...,...
501,22.4,23.533341
502,20.6,22.375719
503,23.9,27.627426
504,22.0,26.127967


In [21]:
pred_df['difference'] = pred_df['prediction'] - pred_df['ground truth']
pred_df

Unnamed: 0,ground truth,prediction,difference
0,24.0,30.003843,6.003843
1,21.6,25.025562,3.425562
2,34.7,30.567597,-4.132403
3,33.4,28.607036,-4.792964
4,36.2,27.943524,-8.256476
...,...,...,...
501,22.4,23.533341,1.133341
502,20.6,22.375719,1.775719
503,23.9,27.627426,3.727426
504,22.0,26.127967,4.127967


In [22]:
pred_df['difference'].sum() / pred_df.shape[0]

1.2357264969740872e-15

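The mean signed error is practically zero because each difference can be positive (we predicted too much) or negative (we predicted too little), and when summed the positive and negative values cancel each other out. For a least-squares linear regression with an intercept, the residuals on the training data sum to zero by construction, so this value reflects only floating-point rounding and says nothing about the quality of the model.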
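The cancellation can be reproduced on a toy example. The sketch below uses made-up numbers (not taken from the Boston data) and plain NumPy; the `toy_*` names are introduced only for this illustration. It fits a least-squares line with an intercept: the signed errors average to zero up to rounding, while the absolute errors do not.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Least-squares line with an intercept: y ≈ slope * x + intercept
slope, intercept = np.polyfit(toy_x, toy_y, deg=1)
toy_pred = slope * toy_x + intercept

signed = toy_pred - toy_y
mean_signed_error = signed.mean()        # positive and negative errors cancel
mean_abs_error = np.abs(signed).mean()   # magnitudes do not cancel

print(mean_signed_error)  # effectively zero (floating-point rounding only)
print(mean_abs_error)     # about 0.8
```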

# MAE

In [23]:
pred_df['abs'] = abs(pred_df['prediction'] - pred_df['ground truth'])
pred_df

Unnamed: 0,ground truth,prediction,difference,abs
0,24.0,30.003843,6.003843,6.003843
1,21.6,25.025562,3.425562,3.425562
2,34.7,30.567597,-4.132403,4.132403
3,33.4,28.607036,-4.792964,4.792964
4,36.2,27.943524,-8.256476,8.256476
...,...,...,...,...
501,22.4,23.533341,1.133341,1.133341
502,20.6,22.375719,1.775719,1.775719
503,23.9,27.627426,3.727426,3.727426
504,22.0,26.127967,4.127967,4.127967


In [24]:
pred_df['abs'].mean()

3.270862810900316


$$MAE = \frac{1}{n}\sum_i^n{|y - y_{pred}|}$$

In [25]:
from sklearn.metrics import mean_absolute_error

In [26]:
mean_absolute_error(pred_df['ground truth'], pred_df['prediction'])

3.270862810900316

# MSE

In [27]:
pred_df['squared'] = (pred_df['prediction'] - pred_df['ground truth']) ** 2
pred_df

Unnamed: 0,ground truth,prediction,difference,abs,squared
0,24.0,30.003843,6.003843,6.003843,36.046135
1,21.6,25.025562,3.425562,3.425562,11.734478
2,34.7,30.567597,-4.132403,4.132403,17.076757
3,33.4,28.607036,-4.792964,4.792964,22.972499
4,36.2,27.943524,-8.256476,8.256476,68.169392
...,...,...,...,...,...
501,22.4,23.533341,1.133341,1.133341,1.284461
502,20.6,22.375719,1.775719,1.775719,3.153178
503,23.9,27.627426,3.727426,3.727426,13.893705
504,22.0,26.127967,4.127967,4.127967,17.040110


In [28]:
pred_df['squared'].mean()

21.894831181729202

$$MSE = \frac{1}{n}\sum_i^n{(y - y_{pred})^2}$$

In [29]:
from sklearn.metrics import mean_squared_error
mean_squared_error(pred_df['ground truth'], pred_df['prediction'])

21.894831181729202

The deviations are squared, so if the target is measured in dollars, MSE is expressed in dollars squared, which complicates interpretation. This is easily corrected by taking the square root of the MSE, which gives a new metric: RMSE.
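A small sketch of the unit argument, using hypothetical prices and NumPy only (the `mse` helper and variable names are introduced here for illustration): expressing the same targets and predictions in dollars instead of thousands of dollars multiplies MSE by 1000² but the square root of MSE by only 1000, so the latter stays in the units of the target.

```python
import numpy as np

# Hypothetical prices in thousands of dollars (illustrative values only)
price_true_k = np.array([24.0, 21.6, 34.7])
price_pred_k = np.array([30.0, 25.0, 30.6])

def mse(y, y_hat):
    """Mean of squared differences."""
    return np.mean((y - y_hat) ** 2)

mse_k = mse(price_true_k, price_pred_k)                  # (thousand dollars)^2
mse_usd = mse(price_true_k * 1000, price_pred_k * 1000)  # dollars^2

rmse_k = np.sqrt(mse_k)      # thousand dollars
rmse_usd = np.sqrt(mse_usd)  # dollars

print(mse_usd / mse_k)    # about 1,000,000: MSE scales with the square of the unit
print(rmse_usd / rmse_k)  # about 1,000: the root scales like the target itself
```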

# RMSE


$$RMSE = \sqrt{\frac{1}{n}\sum_i^n{(y - y_{pred})^2}}$$

In [30]:
np.sqrt(pred_df['squared'].mean())

4.679191295697281

In [31]:
np.sqrt(mean_squared_error(pred_df['ground truth'], pred_df['prediction']))

4.679191295697281

# $R^2$

The first three metrics have an ideal value of 0, but they have no threshold above which a model can be called bad, since their scale varies greatly from task to task.

In [32]:
y_true = [0, 3, 6, 2, 9]
y_pred = [1, 5, 9, 6, 14]

In [33]:
df = pd.DataFrame({
    'y_true': y_true,
    'y_pred': y_pred
})

df['difference'] = df['y_pred'] - df['y_true']
df

Unnamed: 0,y_true,y_pred,difference
0,0,1,1
1,3,5,2
2,6,9,3
3,2,6,4
4,9,14,5


In [34]:
mean_squared_error(y_true, y_pred)

11.0

In [40]:
y_true = [0, 300, 6000, 2000, 9900]
y_pred = [1, 302, 6003, 2004, 9905]

In [41]:
df = pd.DataFrame({
    'y_true': y_true,
    'y_pred': y_pred
})

df['difference'] = df['y_pred'] - df['y_true']
df

Unnamed: 0,y_true,y_pred,difference
0,0,1,1
1,300,302,2
2,6000,6003,3
3,2000,2004,4
4,9900,9905,5


In [42]:
mean_squared_error(y_true, y_pred)

11.0

The two MSE values are identical because the absolute differences between predictions and true values are the same. Relative to the scale of its targets, however, the second model's errors are far smaller, so it is clearly the better model even though MSE cannot tell the two apart.

$$R^2 = 1 - \frac{\frac{1}{n}\sum^{n}_{i}{(y - y_{pred})^2}}{\frac{1}{n}\sum^{n}_{i}{(y - \bar{y})^2}}$$

where $\bar{y}=\frac{1}{n}\sum^{n}_{i}y_{i}$ is the sample mean of the target.
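Applied to the two toy examples above, this formula separates the cases that MSE could not. The sketch below uses NumPy only; `r2_from_formula` is a helper name introduced here for illustration, not a library function.

```python
import numpy as np

def r2_from_formula(y_true, y_pred):
    """R^2 = 1 - MSE(model) / variance of the target (both divided by n)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse_model = np.mean((y_true - y_pred) ** 2)
    variance = np.mean((y_true - y_true.mean()) ** 2)
    return 1 - mse_model / variance

# Same absolute errors (1, 2, 3, 4, 5), very different target scales
r2_small = r2_from_formula([0, 3, 6, 2, 9], [1, 5, 9, 6, 14])
r2_large = r2_from_formula([0, 300, 6000, 2000, 9900],
                           [1, 302, 6003, 2004, 9905])

print(r2_small)  # about -0.1: MSE 11 exceeds the target variance 10
print(r2_large)  # very close to 1: MSE 11 is tiny next to the target variance
```

Both cases have an MSE of 11, but dividing by the variance of the target puts the error on a common scale.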

In [43]:
pred_df['constant'] = pred_df['ground truth'].mean()
pred_df

Unnamed: 0,ground truth,prediction,difference,abs,squared,constant
0,24.0,30.003843,6.003843,6.003843,36.046135,22.532806
1,21.6,25.025562,3.425562,3.425562,11.734478,22.532806
2,34.7,30.567597,-4.132403,4.132403,17.076757,22.532806
3,33.4,28.607036,-4.792964,4.792964,22.972499,22.532806
4,36.2,27.943524,-8.256476,8.256476,68.169392,22.532806
...,...,...,...,...,...,...
501,22.4,23.533341,1.133341,1.133341,1.284461,22.532806
502,20.6,22.375719,1.775719,1.775719,3.153178,22.532806
503,23.9,27.627426,3.727426,3.727426,13.893705,22.532806
504,22.0,26.127967,4.127967,4.127967,17.040110,22.532806


In [44]:
mse_const = mean_squared_error(pred_df['ground truth'], pred_df['constant'])
mse_const

84.41955615616556

In [45]:
mse_model = mean_squared_error(pred_df['ground truth'], pred_df['prediction'])
mse_model

21.894831181729202

The model's MSE is lower than that of the constant prediction, meaning the model explains part of the variance of the target variable that the sample mean alone cannot.  
Now we can calculate the coefficient of determination: divide the MSE of our model by the variance of the target variable (which equals the MSE of the constant mean prediction), and subtract the result from 1:

In [46]:
1 - mse_model / mse_const

0.7406426641094095

In [47]:
from sklearn.metrics import r2_score

r2_score(pred_df['ground truth'], pred_df['prediction'])

0.7406426641094095

# Negative $R^2$

Sometimes a model is so bad that its predictions are worse than simply predicting the mean of the target. In that case its MSE exceeds the variance of the target, and the coefficient of determination is negative.

In [48]:
y_true = [0, 1, 2, 3, 4]
y_pred = [10, -20, -30, 40, -50]


df = pd.DataFrame({
    'y_true': y_true,
    'y_pred': y_pred
})


df['difference'] = df['y_pred'] - df['y_true']
df

Unnamed: 0,y_true,y_pred,difference
0,0,10,10
1,1,-20,-21
2,2,-30,-32
3,3,40,37
4,4,-50,-54


In [49]:
variance = ((df['y_true'] - df['y_true'].mean()) ** 2).mean()
print(f'Sample variance {variance}')

Sample variance 2.0


In [50]:
mse = ((df['y_true'] - df['y_pred']) ** 2).mean()
print(f'MSE {mse}')

MSE 1170.0


In [51]:
mse / variance

585.0

In [52]:
r2_score(y_true, y_pred)

-584.0
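The two numbers above are linked by the formula for $R^2$: one minus the ratio of MSE to variance, so $1 - 585 = -584$. The NumPy sketch below marks the reference points on this scale; `r2_manual` is a hypothetical helper name used only here. Always predicting the mean gives exactly 0, perfect predictions give 1, and anything worse than the mean is negative.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 computed directly as 1 - MSE / variance of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse_model = np.mean((y_true - y_pred) ** 2)
    variance = np.mean((y_true - y_true.mean()) ** 2)
    return 1 - mse_model / variance

targets = np.array([0, 1, 2, 3, 4], dtype=float)

# Constant prediction equal to the mean of the target
r2_mean_baseline = r2_manual(targets, np.full_like(targets, targets.mean()))
# Predictions identical to the target
r2_perfect = r2_manual(targets, targets)
# The poor predictions from the cell above
r2_bad = r2_manual(targets, [10, -20, -30, 40, -50])

print(r2_mean_baseline, r2_perfect, r2_bad)  # 0.0 1.0 -584.0
```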

<table>

<tr>
<td>
Metric
</td>

<td>
Formula
</td>

<td>
Range of values
</td>

<td>
Ideal value
</td>
</tr>

<tr>
<td>
MAE (mean absolute error)
</td>

<td>
$$MAE = \frac{1}{n}\sum_i^n{|y - y_{pred}|}$$

</td>

<td>
[0, +$\infty$)
</td>

<td>
0
</td>
</tr>

<tr>
<td>
MSE (mean squared error)

</td>

<td>
$$MSE = \frac{1}{n}\sum_i^n{(y - y_{pred})^2}$$

</td>

<td>
[0, +$\infty$)
</td>

<td>
0
</td>
</tr>

<tr>
<td>
RMSE (root mean squared error)

</td>

<td>
$$RMSE = \sqrt{\frac{1}{n}\sum_i^n{(y - y_{pred})^2}}$$

</td>

<td>
[0, +$\infty$)
</td>

<td>
0
</td>
</tr>

<tr>
<td>
Coefficient of determination $R^{2}$

</td>

<td>
$$R^2 = 1 - \frac{\frac{1}{n}\sum^{n}_{i}{(y - y_{pred})^2}}{\frac{1}{n}\sum^{n}_{i}{(y - \bar{y})^2}}$$

</td>

<td>
(-$\infty$, 1]

</td>

<td>
1
</td>
</tr>
</table>
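The four formulas in the table can be collected into one helper. This is a minimal NumPy sketch under the definitions given in the table; `regression_report` is a name made up for this example, and in practice the scikit-learn functions used earlier in the notebook are the more common choice.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 as defined in the table.

    Assumes the target is not constant; otherwise its variance is zero
    and R^2 is undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    r2 = 1 - mse / np.mean((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# First toy example from the R^2 section
report = regression_report([0, 3, 6, 2, 9], [1, 5, 9, 6, 14])
print(report)  # MAE 3.0, MSE 11.0, RMSE about 3.317, R2 about -0.1
```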