# Phase 2: Model Training and Experimentation

This notebook focuses on training and optimizing our model. Our objectives are:
1. Experiment tracking with MLflow
2. Cross-validation
3. Hyperparameter optimization
4. Model evaluation and selection
5. Model versioning and registration

In [1]:
# ignore this: required for run_notebooks.sh
%pip install --upgrade pip --quiet
%pip install mlflow optuna --quiet

/Users/nic/git/AmesHousingPredictor/.venv/bin/python: No module named pip
Note: you may need to restart the kernel to use updated packages.
/Users/nic/git/AmesHousingPredictor/.venv/bin/python: No module named pip
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
import optuna
import xgboost as xgb

print("Loading data...")
import ames_notebooks
from app.data_ingestion.read_data import DataReader

reader = DataReader()
train_data, test_data = reader.load_train_test()
print("Train shape:", train_data.shape)
print("Test shape:", test_data.shape)

2025-11-21 11:48:34.457 | DEBUG    | app.config.settings:<module>:29 - loaded settings: {
    "DATA_DIRECTORY": "data",
    "RAW_DATA_DIRECTORY": "data/raw",
    "PROCESSED_DATA_DIRECTORY": "data/processed",
    "KAGGLE_COMPETITION": "house-prices-advanced-regression-techniques",
    "KAGGLE_DOWNLOAD_PATH": "data/house-prices-advanced-regression-techniques.zip",
    "PROD_MODEL_NAME": "prod",
    "LOG_LEVEL": "INFO",
    "LOG_FILE": "logs/app.log",
    "MLFLOW_EXPERIMENT_NAME": "ames-housing-pricing-experiment",
    "MLFLOW_TRACKING_URI": "http://127.0.0.1:8500"
}


Loading data...
Train shape: (1460, 80)
Test shape: (1459, 79)


In [3]:
# get preprocessing pipeline
from app.pipelines.preprocessing import get_fitted_pipelines
feature_preprocessor, target_transformer = get_fitted_pipelines(train_data)

In [4]:
X = train_data.drop('SalePrice', axis=1)
y = train_data['SalePrice']

# apply pipelines/transformations
X_processed = feature_preprocessor.transform(X)
y_processed = target_transformer.transform(y)

# Split data into train and validation sets
# use the _val suffix to avoid confusion with the separate test dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_processed, y_processed, 
    test_size=0.2, 
    random_state=42
)

# Report the resulting shapes (preprocessing was already applied above)
print("Applying preprocessing pipeline...")
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)

Applying preprocessing pipeline...
X_train shape: (1168, 241)
X_val shape: (292, 241)
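
The row counts above follow from the 80/20 split of the 1460 training rows. As a quick sanity check, the sizes can be reproduced by hand. This is only a sketch: the helper name `split_sizes` is hypothetical, and it assumes that `train_test_split` rounds the held-out share up and gives the remainder to training when only `test_size` is set.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_val) for a fractional hold-out split.

    Assumption: the held-out count is ceil(test_size * n_samples) and the
    remaining rows are used for training.
    """
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val


n_train, n_val = split_sizes(1460, 0.2)
print("train rows:", n_train)
print("validation rows:", n_val)
```

The result agrees with the `X_train` and `X_val` shapes printed by the cell.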


## Baseline Model Development

Let's start with a simple baseline model using XGBoost with default parameters. This will give us a reference point for further improvements.

In [6]:
from app.pipelines.training import XGBModelTrainer

model = xgb.XGBRegressor(random_state=42, n_jobs=-1)

trainer = XGBModelTrainer()
baseline_model = trainer.train(model, "xgboost-baseline", X_train, y_train, X_val, y_val)

print("\nTraining Metrics:")
for metric, value in trainer.train_metrics.items():
    print(f"{metric}: {value:.4f}")

print("\nValidation Metrics:")
for metric, value in trainer.val_metrics.items():
    print(f"{metric}: {value:.4f}")

2025/11/21 11:48:41 INFO mlflow.tracking.fluent: Experiment with name 'ames-housing-pricing-experiment211125114838' does not exist. Creating a new experiment.


[0]	validation_0-rmse:0.33887
[99]	validation_0-rmse:0.14973


  self.get_booster().save_model(fname)
  self.get_booster().load_model(fname)
Successfully registered model 'xgboost-baseline'.
2025/11/21 11:48:59 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: xgboost-baseline, version 1
Created version '1' of model 'xgboost-baseline'.
2025-11-21 11:49:00.487 | INFO     | app.pipelines.training:train:64 -
Training Metrics:
2025-11-21 11:49:00.487 | INFO     | app.pipelines.training:train:66 - train_rmse: 0.0078
2025-11-21 11:49:00.488 | INFO     | app.pipelines.training:train:66 - train_mae: 0.0054
2025-11-21 11:49:00.488 | INFO     | app.pipelines.training:train:66 - train_r2: 0.9996
2025-11-21 11:49:00.489 | INFO     | app.pipelines.training

🏃 View run train-run-xgboost-baseline at: http://127.0.0.1:8500/#/experiments/4/runs/894fb9fcb34648709f91617b04acfabf
🧪 View experiment at: http://127.0.0.1:8500/#/experiments/4

Training Metrics:
train_rmse: 0.0078
train_mae: 0.0054
train_r2: 0.9996

Validation Metrics:
val_rmse: 0.1497
val_mae: 0.1038
val_r2: 0.8799
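
Because `y` was passed through `target_transformer` before training, the metrics above are on the transformed target scale, not in dollars. The project's `XGBModelTrainer` is not shown in this notebook, so the following is only a sketch of how the three numbers are commonly defined. The function name `regression_metrics` and the toy values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R^2 using their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Toy values for illustration only; they are not taken from the dataset.
toy_metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```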


## Hyperparameter Optimization with Optuna

Now that we have a baseline model, let's use Optuna to find better hyperparameters for our XGBoost model. We'll define an objective function that Optuna will optimize using cross-validation scores.
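
To make the idea of a cross-validation score concrete, here is a dependency-free sketch of a 5-fold mean RMSE. It uses contiguous, unshuffled folds and a stand-in "model" that always predicts the training mean. Both helper names and the toy target values are hypothetical, and the sketch is not how scikit-learn or XGBoost implement this internally.

```python
import math


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cv_rmse_mean_predictor(y, k=5):
    """Mean RMSE over k folds for a model that predicts the training mean."""
    fold_rmse = []
    for train_idx, val_idx in kfold_indices(len(y), k):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        squared_errors = [(y[i] - prediction) ** 2 for i in val_idx]
        fold_rmse.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmse) / k


# Toy target values for illustration only.
y_toy = [11.5, 12.0, 12.3, 11.9, 12.8, 12.1, 11.7, 12.4, 12.6, 12.2]
print("mean CV RMSE:", cv_rmse_mean_predictor(y_toy, k=5))
```

Each sample is held out exactly once, so the averaged score is less sensitive to one particular split than a single validation set would be.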

In [7]:
# objective function for Optuna
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 1.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 1.0, log=True),
        'random_state': 42
    }
    
    # get model with suggested parameters
    model = xgb.XGBRegressor(**params)
    
    # cross-validation
    cv_scores = cross_val_score(
        model, 
        X_train, 
        y_train, 
        cv=5, 
        scoring='neg_root_mean_squared_error',
        n_jobs=-1
    )
    
    # cross_val_score returns negated RMSE; flip the sign so the study minimizes mean RMSE
    return -cv_scores.mean()

# Create and run Optuna study
study = optuna.create_study(direction='minimize')
optuna.logging.set_verbosity(optuna.logging.WARNING)
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("Best trial:")
trial = study.best_trial

print("  Value: ", trial.value)
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

[I 2025-11-21 11:49:01,216] A new study created in memory with name: no-name-8c0817a7-2283-426e-9535-7df996b411c6



Best trial:
  Value:  0.12440085207991047
  Params: 
    max_depth: 4
    learning_rate: 0.04342459396051902
    n_estimators: 252
    min_child_weight: 7
    subsample: 0.6680163611417497
    colsample_bytree: 0.656687543807701
    reg_alpha: 1.0168201851932558e-07
    reg_lambda: 0.9412045882345911
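
Several ranges in the search space (`learning_rate`, `reg_alpha`, `reg_lambda`) are declared with `log=True`, which Optuna documents as sampling from the log domain. The sketch below mimics that idea with the standard library only. It is not Optuna's implementation, and the helper name `sample_log_uniform` is hypothetical.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
draws = [sample_log_uniform(rng, 1e-8, 1.0) for _ in range(10_000)]

# On a log scale, 1e-4 is the midpoint of [1e-8, 1.0], so roughly half of
# the draws are expected to fall below it.
share_small = sum(d < 1e-4 for d in draws) / len(draws)
print(f"share of draws below 1e-4: {share_small:.2f}")
```

A plain uniform draw on the same interval would almost never produce a value below `1e-4`, which is why log-scale sampling suits parameters that span several orders of magnitude.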


In [8]:
best_params = study.best_params
best_params['random_state'] = 42
model = xgb.XGBRegressor(**best_params)

optimized_model = trainer.train(model, "xgboost-optimized", X_train, y_train, X_val, y_val)

print("\nTraining Metrics:")
for metric, value in trainer.train_metrics.items():
    print(f"{metric}: {value:.4f}")

print("\nValidation Metrics:")
for metric, value in trainer.val_metrics.items():
    print(f"{metric}: {value:.4f}")

[0]	validation_0-rmse:0.41905
[100]	validation_0-rmse:0.14442
[200]	validation_0-rmse:0.13685
[251]	validation_0-rmse:0.13639


  self.get_booster().save_model(fname)
  self.get_booster().load_model(fname)
Successfully registered model 'xgboost-optimized'.
2025/11/21 11:49:48 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: xgboost-optimized, version 1
Created version '1' of model 'xgboost-optimized'.
2025-11-21 11:49:49.140 | INFO     | app.pipelines.training:train:64 -
Training Metrics:
2025-11-21 11:49:49.142 | INFO     | app.pipelines.training:train:66 - train_rmse: 0.0723
2025-11-21 11:49:49.143 | INFO     | app.pipelines.training:train:66 - train_mae: 0.0518
2025-11-21 11:49:49.144 | INFO     | app.pipelines.training:train:66 - train_r2: 0.9657
2025-11-21 11:49:49.144 | INFO     | app.pipelines.training

🏃 View run train-run-xgboost-optimized at: http://127.0.0.1:8500/#/experiments/4/runs/2017799efa5d456895dceb41b39fa408
🧪 View experiment at: http://127.0.0.1:8500/#/experiments/4

Training Metrics:
train_rmse: 0.0723
train_mae: 0.0518
train_r2: 0.9657

Validation Metrics:
val_rmse: 0.1364
val_mae: 0.0897
val_r2: 0.9003


## Model Comparison

Let's compare the performance of our baseline and optimized models to see the improvement from hyperparameter optimization.

In [9]:
from app.pipelines.training import evaluate_model

# compare models on validation set
baseline_metrics = evaluate_model(baseline_model, X_val, y_val)
optimized_metrics = evaluate_model(optimized_model, X_val, y_val)

print("Baseline Model Metrics:")
for metric, value in baseline_metrics.items():
    print(f"{metric}: {value:.4f}")

print("\nOptimized Model Metrics:")
for metric, value in optimized_metrics.items():
    print(f"{metric}: {value:.4f}")

improvement = (baseline_metrics['rmse'] - optimized_metrics['rmse']) / baseline_metrics['rmse'] * 100
print(f"\nRMSE Improvement: {improvement:.2f}%")

Baseline Model Metrics:
rmse: 0.1497
mae: 0.1038
r2: 0.8799

Optimized Model Metrics:
rmse: 0.1324
mae: 0.0882
r2: 0.9060

RMSE Improvement: 11.54%
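
The improvement figure is the relative reduction in validation RMSE. A small worked example, using the rounded values printed above, shows the arithmetic. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage reduction of an error metric relative to the baseline."""
    return (baseline - candidate) / baseline * 100


# Rounded validation RMSE values as printed by the comparison cell.
print(f"{relative_improvement(0.1497, 0.1324):.2f}%")  # prints 11.56%
```

The result differs slightly from the 11.54% reported by the notebook, presumably because the notebook computes the ratio from unrounded metric values.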


In [10]:
from app.pipelines.preprocessing import get_fitted_pipelines
feature_preprocessor, target_transformer = get_fitted_pipelines(train_data)

from app.inference.predict import AmesPredictor
predictor = AmesPredictor(feature_engineer=feature_preprocessor, model_name="xgboost-optimized")
predictor.model

2025-11-21 00:03:52.484 | INFO     | app.inference.predict:__init__:45 - mlflow tracking uri set to http://127.0.0.1:5001



  self.get_booster().load_model(fname)
2025-11-21 00:04:21.704 | INFO     | app.inference.predict:get_model:24 - loaded model with id m-c783c8021046478789333ce39d5d8005


mlflow.pyfunc.loaded_model:
  artifact_path: mlflow-artifacts:/1/models/m-c783c8021046478789333ce39d5d8005/artifacts
  flavor: mlflow.xgboost
  run_id: c8d35a215ca445c5b9d3f3c332e93d1c

In [11]:
# Example: predict on a single row from the test set (keeps original columns)
example_row = test_data.iloc[[0]]  # DataFrame with one row
scaled_prediction = predictor.predict(example_row)
prediction = target_transformer.inverse_transform(scaled_prediction)
print("Example prediction (SalePrice):", prediction)

Example prediction (SalePrice): [119894.87]
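
The model predicts on the transformed target scale, so `inverse_transform` is needed to get back to dollars. The definition of `target_transformer` is not shown in this notebook. Purely as an illustration, and assuming a `log1p` transform (an assumption, not something the source confirms), the round trip looks like this. Both helper names are hypothetical.

```python
import math


def to_model_scale(price: float) -> float:
    """Forward transform (assumed here to be log1p; not confirmed by the source)."""
    return math.log1p(price)


def to_price_scale(value: float) -> float:
    """Inverse transform matching the assumed forward transform."""
    return math.expm1(value)


scaled = to_model_scale(119894.87)   # value the model would work with
restored = to_price_scale(scaled)    # back on the SalePrice scale
print(f"scaled: {scaled:.4f}")
print(f"restored: {restored:.2f}")
```

Whatever the actual transform is, predictions must be inverted with the same fitted transformer that was applied to `y` during training.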
