### 4.2.2 Regression model evaluation metrics

The regression metrics we're going to cover are:

1. R^2 (pronounced r-squared) or coefficient of determination
2. Mean absolute error (MAE)
3. Mean squared error (MSE)

In [1]:
# Imports
import pandas as pd
import numpy as np

# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])

housing_df["target"] = housing["target"]

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
np.random.seed(42)

X = housing_df.drop("target", axis=1)
y = housing_df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)


In [3]:
# For regressors, score() returns R^2 (the coefficient of determination)
model.score(X_test, y_test)

0.806652667101436

In [4]:
housing_df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [5]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: target, Length: 4128, dtype: float64

In [6]:
y_test.mean()

np.float64(2.0550030959302323)

In [7]:
from sklearn.metrics import r2_score

# Fill an array with y_test mean
y_test_mean = np.full(len(y_test), y_test.mean())

In [8]:
y_test_mean[:10]

array([2.0550031, 2.0550031, 2.0550031, 2.0550031, 2.0550031, 2.0550031,
       2.0550031, 2.0550031, 2.0550031, 2.0550031])

In [9]:
r2_score(y_true=y_test,
         y_pred=y_test_mean)

0.0

In [10]:
r2_score(y_true=y_test,
         y_pred=y_test)

1.0
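The two results above follow from how R^2 is defined: it compares the model's squared error to the squared error of always predicting the mean of the targets, `R^2 = 1 - SS_res / SS_tot`. The sketch below computes this by hand with NumPy. The helper name `r2_by_hand` and the toy values are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1 - ss_res / ss_tot

# Toy targets (made up for illustration)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_mean = r2_by_hand(y_true, np.full(len(y_true), y_true.mean()))  # always predict the mean
r2_perfect = r2_by_hand(y_true, y_true)                            # perfect predictions
r2_bad = r2_by_hand(y_true, y_true[::-1])                          # worse than the mean
print(r2_mean, r2_perfect, r2_bad)
```

Predicting the mean gives 0, perfect predictions give 1, and predictions worse than the mean baseline give a negative value, so R^2 is not bounded below by 0.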

**Mean absolute error (MAE)**

MAE is the average of the absolute differences between predictions and actual values.
It gives you an idea of how wrong your model's predictions are, in the same units as the target.
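As a small illustration of the definition, here is a minimal sketch on toy values (the numbers are made up for illustration). Taking the absolute value stops positive and negative errors from cancelling each other out.

```python
import numpy as np

# Toy values (made up for illustration)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true        # signed errors: [-0.5, 0.0, 2.0, 1.0]
mean_error = errors.mean()      # signed errors partly cancel
mae = np.abs(errors).mean()     # (0.5 + 0.0 + 2.0 + 1.0) / 4
print(mean_error, mae)
```

The signed mean understates how far off the predictions are, which is why MAE averages the absolute errors instead.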

In [11]:
# MAE
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)
mae = mean_absolute_error(y_test, y_preds)
mae

np.float64(0.32656738464147306)

In [12]:
df = pd.DataFrame(data={"actual values": y_test,
                         "predicted values": y_preds})
df["differences"] = df["predicted values"]-df["actual values"]
df.head(10)

,actual values,predicted values,differences
20046,0.477,0.4939,0.0169
3024,0.458,0.75494,0.29694
15663,5.00001,4.928596,-0.071414
20484,2.186,2.54024,0.35424
9814,2.78,2.33176,-0.44824
13311,1.587,1.66022,0.07322
7113,1.982,2.3431,0.3611
7668,1.575,1.66311,0.08811
18246,3.4,2.47489,-0.92511
5723,4.466,4.834478,0.368478


In [13]:
# MAE calculated by hand from the differences
np.abs(df["differences"]).mean()

np.float64(0.32656738464147306)

In [14]:
y_preds

array([0.4939   , 0.75494  , 4.9285964, ..., 4.8363785, 0.71782  ,
       1.67781  ])

In [15]:
y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: target, Length: 4128, dtype: float64

**Mean squared error (MSE)**

MSE is the mean of the squared errors between actual and predicted values. Because each error is squared, large errors count for more than they do in MAE.
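A minimal sketch of the calculation on toy values (made up for illustration) shows how squaring changes the weight of each error:

```python
import numpy as np

# Toy values (made up for illustration)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

squared_errors = (y_pred - y_true) ** 2   # [0.25, 0.0, 4.0, 1.0]
mse = squared_errors.mean()               # (0.25 + 0.0 + 4.0 + 1.0) / 4
print(mse)
```

The single error of 2.0 contributes 4.0 of the 5.25 total, so it dominates the result.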

In [16]:
# Mean squared error
from sklearn.metrics import mean_squared_error

y_preds = model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
mse

np.float64(0.25336408094921037)

In [17]:
df["squared_differences"] = np.square(df["differences"])
df.head()

,actual values,predicted values,differences,squared_differences
20046,0.477,0.4939,0.0169,0.000286
3024,0.458,0.75494,0.29694,0.088173
15663,5.00001,4.928596,-0.071414,0.0051
20484,2.186,2.54024,0.35424,0.125486
9814,2.78,2.33176,-0.44824,0.200919


In [18]:
# Calculate MSE by hand
squared = np.square(df["differences"])
squared.mean()

np.float64(0.25336408094921037)

In [19]:
df.iloc[0]

actual values          0.477000
predicted values       0.493900
differences            0.016900
squared_differences    0.000286
Name: 20046, dtype: float64

In [20]:
df_large_error = df.copy()
# Assign with a single .loc call (row label, column name) instead of chained
# indexing, so the value is reliably written to df_large_error
df_large_error.loc[df_large_error.index[0], "squared_differences"] = 16
df_large_error.head()

,actual values,predicted values,differences,squared_differences
20046,0.477,0.4939,0.0169,16.0
3024,0.458,0.75494,0.29694,0.088173
15663,5.00001,4.928596,-0.071414,0.0051
20484,2.186,2.54024,0.35424,0.125486
9814,2.78,2.33176,-0.44824,0.200919


In [21]:
# Calculate MSE with large error
df_large_error["squared_differences"].mean()

np.float64(0.25723998075298943)

In [22]:
# Overwrite rows 1-99 with 20 (this sets every column, including squared_differences)
df_large_error.iloc[1:100] = 20
df_large_error

,actual values,predicted values,differences,squared_differences
20046,0.47700,0.493900,0.016900,16.000000
3024,20.00000,20.000000,20.000000,20.000000
15663,20.00000,20.000000,20.000000,20.000000
20484,20.00000,20.000000,20.000000,20.000000
9814,20.00000,20.000000,20.000000,20.000000
...,...,...,...,...
15362,2.63300,2.219830,-0.413170,0.170709
16623,2.66800,1.947760,-0.720240,0.518746
18086,5.00001,4.836378,-0.163632,0.026775
2144,0.72300,0.717820,-0.005180,0.000027


In [23]:
df_large_error["squared_differences"].mean()

np.float64(0.7333102979585939)
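The cells above show the MSE rising as large errors are introduced. To see why MSE reacts more strongly than MAE, the sketch below compares both metrics on a toy error array with and without one large error. The helper names `mae` and `mse` and the error values are made up for illustration.

```python
import numpy as np

def mae(errors):
    """Hypothetical helper: mean of the absolute errors."""
    return np.mean(np.abs(errors))

def mse(errors):
    """Hypothetical helper: mean of the squared errors."""
    return np.mean(np.square(errors))

# Toy errors (made up for illustration)
small_errors = np.array([0.5, -0.5, 0.5, -0.5])
with_outlier = np.array([0.5, -0.5, 0.5, -4.0])  # one error is much larger

# How much does each metric grow when the large error is added?
mae_ratio = mae(with_outlier) / mae(small_errors)
mse_ratio = mse(with_outlier) / mse(small_errors)
print(mae_ratio, mse_ratio)
```

In this toy case the MSE grows several times more than the MAE, which is one reason to prefer MSE when large errors are especially costly and MAE when all errors should be weighted equally.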