# Training and Evaluating Models

## Importing the Preprocessing Pipeline

The preprocessing pipeline developed in previous notebooks ([e2e050](e2e050_pipelines.ipynb), [e2e051](e2e051_custom_transformers.ipynb), [e2e060](e2e060_spatial_clustering.ipynb)) is imported from the shared module [`utils/housing_preprocessing.py`](utils/housing_preprocessing.py).

In [None]:
from utils.housing_preprocessing import get_preprocessing_pipeline
preprocessing = get_preprocessing_pipeline(n_clusters=10)  # Default value for initial exploration

In [None]:
preprocessing

## Training and Evaluating Models

### Data Loading

The data loading with stratified train/test split is imported from [`utils/load_california.py`](utils/load_california.py), as developed in [e2e025](e2e025_train_test.ipynb).

In [None]:
from utils.load_california import load_housing_data
X_train, X_test, y_train, y_test = load_housing_data()

### Defining a Complete Pipeline with a Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(X_train, y_train)

In [None]:
y_pred = lin_reg.predict(X_test) # Make predictions with the test data

Now we can compare some of the predicted results with their actual labels:

In [None]:
print("Actual values:", list(y_test.iloc[:10]))
print("Predictions:", list(y_pred[:10].round(-2)))

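We can also look at the relative error of these same predictions:

In [None]:
# Compare only the first 10 predictions shown above, not the whole test set
error_ratios = y_pred[:10].round(-2) / y_test.iloc[:10] - 1
print(", ".join([f"{100 * ratio:.1f}%" for ratio in error_ratios]))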

To summarize performance over the whole test set, we use the root mean squared error (RMSE), the metric we established earlier:

In [None]:
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_test, y_pred)

An RMSE of $68,812 against a median house value of $206,856 (about a third of the median) is too large for the model to be very useful.
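As a reminder of what this metric computes, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch with NumPy follows; the prices are made up for illustration and do not come from the housing dataset, and the helper name `rmse` is hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values (illustrative only)
actual = [200_000, 150_000, 300_000, 250_000]
predicted = [210_000, 140_000, 330_000, 250_000]

example_rmse = rmse(actual, predicted)
print(f"RMSE: {example_rmse:,.0f}")
```

Because the residuals are squared before averaging, the single 30,000 miss weighs more heavily on the result than the two 10,000 misses combined.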

### Trying Another Model: Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(X_train, y_train)

In [None]:
y_pred = tree_reg.predict(X_test)
root_mean_squared_error(y_test, y_pred)

We can see that the error measured on the test set is still too high to consider that we have a useful model (a house price estimate with an error of $68,000 is probably much worse than what a real estate professional would subjectively estimate).

### *Data Leakage*

However, we have just made a common mistake: by using the test set to compare models (Linear Regression vs. Decision Tree), we have committed **Model Selection Leakage**, a form of [data leakage](e2e025_train_test.ipynb#Data-Leakage).

Even though we never trained on test data directly, the test set performance influenced our model choice. This means:

- The test error is now an **optimistically biased** estimate of true generalization performance
- We have effectively "overfit to the test set"
- The test set can no longer serve as an unbiased proxy for unseen data

**The solution**: Keep the test set in a "vault" and use it **only once**, at the very end. For comparing models and tuning hyperparameters, we need a separate **validation set**.
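The optimistic bias can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not part of the housing workflow: it uses a toy binary classification task in which every candidate "model" guesses at random, so each one has a true accuracy of 50%. All names below (`y_test_sim`, `guesses`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_test = 200     # size of the simulated test set
n_models = 50    # number of candidates compared on that test set

# Random labels and random guessers: no candidate is truly better than 50%.
y_test_sim = rng.integers(0, 2, size=n_test)
guesses = rng.integers(0, 2, size=(n_models, n_test))

# Accuracy of every candidate on the same test set, then pick the winner.
test_accuracies = (guesses == y_test_sim).mean(axis=1)
best_test_accuracy = test_accuracies.max()

# On fresh data the winner is still just a random guesser.
y_new = rng.integers(0, 2, size=100_000)
new_guesses = rng.integers(0, 2, size=y_new.size)
fresh_accuracy = (new_guesses == y_new).mean()

print(f"Best accuracy on the test set: {best_test_accuracy:.3f}")
print(f"Accuracy on fresh data:        {fresh_accuracy:.3f}")
```

The winner's test score is above 50% only because it was selected for being the luckiest on that particular test set; the more candidates are compared, the larger this gap tends to become.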

## Validation Set


To avoid improperly using the test set during model development, the dataset is usually split into three parts:

- **Training**: To fit the model's **parameters**.
- **Validation**: To tune **hyperparameters** and **compare different models**. This set allows measuring intermediate performance without compromising the test set.
- **Test**: To evaluate the final model only once. It should remain untouched throughout the entire development process.

The validation set is used to compare models and tune hyperparameters. The test set is reserved for evaluating the final performance of the chosen model that will be deployed in production.
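One common way to obtain the three parts is to call `train_test_split` twice. The sketch below assumes synthetic stand-in data and a 60/20/20 split; the proportions and variable names are illustrative choices, not the ones used by `load_housing_data`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only; not the housing dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Step 1: set aside 20% as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training (75%) and validation (25%),
# which corresponds to 60% and 20% of the full dataset.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))  # 600 200 200
```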

## *Cross-validation*

So far we have split the data into two sets: training and test. However, in many cases, performance will vary depending on the sampling we have done. If we stop fixing the sampling seed (`random_state` parameter of the `train_test_split` function), we will get different results (although in this case they don't vary much for either model).

A more data-efficient approach is **cross-validation**: instead of carving a single fixed validation set out of the training set, the training set is divided into *k* subsets (*folds*). The model is then trained *k* times, each time holding out a different fold as the validation set and using the other *k-1* folds for training. The result is an array with *k* scores.

For example, the following code performs training with 10 different samplings. The results will be similar to what we could obtain by running the code 10 times without fixing the random sampling seed. The trade-off is obvious: the computational cost is also multiplied by 10.

In cross-validation the **validation set** is not fixed: it changes in each iteration, while still serving to compare models and **tune hyperparameters**. The test set remains reserved for evaluating the final performance of the chosen model once it has been trained.

[<img src="./img/cross-validation.png" width="500">](https://www.researchgate.net/figure/Train-test-cross-validation-split-methodology-used-in-this-paper-The-first-operation_fig2_340567535)

[<img src="./img/cross-validation2.png" width="700">](https://www.statology.org/validation-set-vs-test-set/)

In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(
    estimator=tree_reg,
    X=X_train,
    y=y_train,
    scoring="neg_root_mean_squared_error",
    cv=10,  # 10-fold cross-validation
)

print(tree_rmses)
pd.Series(tree_rmses).describe()

The `scoring` parameter of the `cross_val_score` function expects a **utility function** (higher is better) rather than a **cost function** (lower is better), so the `neg_root_mean_squared_error` scorer returns the negative of the RMSE. That is why we flip the sign of the output to recover the RMSE values. `cross_val_score` itself only computes the scores; the convention matters for model selection tools such as `GridSearchCV`, which pick the candidate with the highest score, so maximizing the negative RMSE is equivalent to minimizing the RMSE.
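To make both the fold mechanics and the sign convention concrete, the following sketch reproduces by hand what `cross_val_score` computes. It assumes synthetic data and a plain `LinearRegression` instead of the housing pipeline, and uses 5 folds to keep it short; these are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=200)

# A fixed seed makes the splitter produce the same folds on every call.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: fit on k-1 folds, compute RMSE on the held-out fold.
manual_rmses = []
for train_idx, val_idx in cv.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    mse = mean_squared_error(y_demo[val_idx], model.predict(X_demo[val_idx]))
    manual_rmses.append(np.sqrt(mse))

# Same folds through cross_val_score: the scores come back negated.
neg_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=cv,
)

print(np.round(manual_rmses, 4))
print(np.round(-neg_scores, 4))
```

Both printed arrays should match, which shows that negating the scores recovers the ordinary per-fold RMSE.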

There are multiple string identifiers for evaluation metrics that can be used in `scoring`, which can be found in the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

We obtain a mean RMSE of $67,431 **on validation** with a standard deviation of $3,623. This gives a more detailed picture of the model's performance, and one that depends less on any particular sampling.

Now we have an evaluation metric for our decision tree model that we can compare with others without touching the test set.

## Comparing Another Model (*Random Forest*)

*Random Forest* is an ***ensemble learning*** model: it trains multiple decision trees, each on a random (bootstrap) sample of the training data and optionally restricting each split to a random subset of features, and averages their predictions.
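The averaging idea can be sketched by hand. The example below is a simplified illustration on synthetic data, not a description of how `RandomForestRegressor` is implemented internally: it only fits a few trees on bootstrap resamples and averages their predictions. All variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy sine curve (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeRegressor(random_state=i).fit(X_demo[idx], y_demo[idx]))

X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 50)
ensemble_pred = per_tree.mean(axis=0)                   # average over the trees

# Compare against the noise-free target used to generate the data
target = np.sin(X_new[:, 0])
mse_single_avg = np.mean((per_tree - target) ** 2)
mse_ensemble = np.mean((ensemble_pred - target) ** 2)
print(f"Average single-tree MSE: {mse_single_avg:.4f}")
print(f"Ensemble MSE:            {mse_ensemble:.4f}")
```

Averaging cannot do worse than the average individual tree in terms of squared error, and it usually does noticeably better because the trees' individual errors partly cancel out.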

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing, RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, X_train, y_train, scoring="neg_root_mean_squared_error", cv=10)
pd.Series(forest_rmses).describe()

Random Forest improves on the single decision tree on the validation set: a mean RMSE of $47,328 versus $67,431.

Although it is still a high error, it is the best model we have so far. Assuming we stick with it, we could finally train the chosen model on the entire training set and evaluate its performance on the test set we kept separate before putting it into production.

In [None]:
y_pred = forest_reg.fit(X_train, y_train).predict(X_test)
forest_rmse = root_mean_squared_error(y_test, y_pred)
forest_rmse