# Optuna XGBoost Optimization Analysis

This notebook analyzes the results of the hyperparameter optimization campaign for the California Housing dataset using XGBoost.

In [None]:
import optuna
import mlflow
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Note: the Optuna function is plot_param_importances (plural)
from optuna.visualization.matplotlib import (
    plot_optimization_history,
    plot_param_importances,
    plot_parallel_coordinate,
)

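import sys
import os
from pathlib import Path

# Make the project's modules (e.g. data_loader) importable
sys.path.append("../src")

# Output directory, Optuna storage URL, and MLflow tracking URI
OUTPUT_DIR = "../outputs"
DB_PATH = f"sqlite:///{OUTPUT_DIR}/optuna_study.db"
# Path.as_uri() avoids the extra slash that "file:///" + an absolute POSIX path produces
MLFLOW_URI = Path(OUTPUT_DIR, "mlruns").resolve().as_uri()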

## 1. Load Optimization Study

In [None]:
study = optuna.load_study(
    study_name="xgboost-housing-optimization", 
    storage=DB_PATH
)

print(f"Number of trials: {len(study.trials)}")
print(f"Best trial value (neg MSE): {study.best_value}")
print(f"Best params: {study.best_params}")

## 2. Optimization History
Visualizing the improvement of the objective value over trials.

In [None]:
plot_optimization_history(study)
plt.title("Optimization History")
plt.show()
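
The best-value line in the history plot can also be reproduced numerically, which helps when you want to see exactly at which trial the last improvement happened. The following is a minimal sketch: `running_best` is a hypothetical helper (not part of Optuna) and the example values are illustrative, not results from this study.

```python
from typing import Iterable, List


def running_best(values: Iterable[float], direction: str = "minimize") -> List[float]:
    """Return the best objective value seen up to and including each trial."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction!r}")
    pick = min if direction == "minimize" else max
    best: List[float] = []
    for value in values:
        # The first value is the best so far by definition; afterwards compare.
        best.append(value if not best else pick(best[-1], value))
    return best


# Illustrative values only, not results from this study.
example_values = [0.52, 0.47, 0.49, 0.31, 0.35]
print(running_best(example_values, "minimize"))
```

With the loaded study, the input could be built as `[t.value for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]`, with `study.direction.name.lower()` as the direction.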

## 3. Hyperparameter Importance
Which hyperparameters had the most impact on model performance?

In [None]:
plot_param_importances(study)
plt.title("Hyperparameter Importance")
plt.show()
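
A ranked table with cumulative shares can make the importance plot easier to read, for example to see how few parameters account for most of the total. This is a sketch under stated assumptions: `rank_importances` is a hypothetical helper, and the scores in the example are made up for illustration.

```python
from typing import Dict, List, Tuple


def rank_importances(importances: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Sort parameters by importance; return (name, share, cumulative share)."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive number")
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    rows: List[Tuple[str, float, float]] = []
    cumulative = 0.0
    for name, score in ranked:
        share = score / total  # normalize so that shares sum to 1
        cumulative += share
        rows.append((name, share, cumulative))
    return rows


# Hypothetical scores for illustration; real ones would come from the study.
example = {"max_depth": 0.25, "learning_rate": 0.5, "subsample": 0.125, "n_estimators": 0.125}
for name, share, cumulative in rank_importances(example):
    print(f"{name:<15} {share:6.1%} {cumulative:6.1%}")
```

For this study, the input dictionary could be obtained with `optuna.importance.get_param_importances(study)`, which maps each parameter name to its importance score.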

## 4. Parallel Coordinates Plot
Visualizing high-dimensional relationships between hyperparameters and the objective.

In [None]:
plot_parallel_coordinate(study)
plt.title("Parallel Coordinates")
plt.show()
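
What the parallel coordinates plot shows visually can be cross-checked by summarizing the parameter ranges covered by the best trials. The sketch below assumes numeric parameters only; `top_trial_ranges` is a hypothetical helper and the trial data is invented for illustration.

```python
from typing import Dict, List, Tuple

Trial = Tuple[Dict[str, float], float]  # (params, objective value)


def top_trial_ranges(
    trials: List[Trial], fraction: float = 0.2, direction: str = "minimize"
) -> Dict[str, Tuple[float, float]]:
    """Return {param: (min, max)} over the best `fraction` of trials."""
    if not trials:
        raise ValueError("no trials given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Best trials first: ascending for minimize, descending for maximize.
    ordered = sorted(trials, key=lambda t: t[1], reverse=(direction == "maximize"))
    keep = ordered[: max(1, int(len(ordered) * fraction))]
    ranges: Dict[str, Tuple[float, float]] = {}
    for params, _ in keep:
        for name, value in params.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


# Invented trials for illustration, not results from this study.
example_trials = [
    ({"max_depth": 3, "learning_rate": 0.3}, 0.60),
    ({"max_depth": 8, "learning_rate": 0.05}, 0.45),
    ({"max_depth": 7, "learning_rate": 0.03}, 0.44),
    ({"max_depth": 5, "learning_rate": 0.1}, 0.50),
    ({"max_depth": 10, "learning_rate": 0.2}, 0.55),
]
print(top_trial_ranges(example_trials, fraction=0.4))
```

With the loaded study, the input could be built as `[(t.params, t.value) for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]`. Narrow ranges among the top trials hint at which regions the search favored.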

## 5. Baseline Comparison
We compare the best tuned model against a baseline XGBoost model with default parameters.

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from data_loader import load_and_split_data
import numpy as np

X_train, X_test, y_train, y_test = load_and_split_data()

# Baseline Model
baseline = XGBRegressor(random_state=42, objective="reg:squarederror")
baseline.fit(X_train, y_train)
base_preds = baseline.predict(X_test)
base_rmse = np.sqrt(mean_squared_error(y_test, base_preds))

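# Tuned model (using tracked best params). Fixed settings are merged last so they
# override same-named keys in best_params instead of raising a duplicate-keyword TypeError.
tuned_params = {**study.best_params, "random_state": 42, "objective": "reg:squarederror"}
tuned = XGBRegressor(**tuned_params)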
tuned.fit(X_train, y_train)
tuned_preds = tuned.predict(X_test)
tuned_rmse = np.sqrt(mean_squared_error(y_test, tuned_preds))

print(f"Baseline RMSE: {base_rmse:.4f}")
print(f"Tuned RMSE:    {tuned_rmse:.4f}")
print(f"Improvement:   {(base_rmse - tuned_rmse)/base_rmse * 100:.2f}%")
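
A single-split RMSE difference can partly reflect noise in the test set. One rough way to gauge this is a paired bootstrap over test rows, sketched below in pure Python. The helper names (`rmse`, `bootstrap_rmse_gap`) are hypothetical, the data is synthetic, and the percentile interval is only an approximation.

```python
import math
import random
from typing import List, Sequence, Tuple


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def bootstrap_rmse_gap(
    y_true: Sequence[float],
    base_pred: Sequence[float],
    tuned_pred: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 42,
) -> Tuple[float, float]:
    """Approximate 95% percentile interval for (baseline RMSE - tuned RMSE)."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps: List[float] = []
    for _ in range(n_resamples):
        # Resample row indices with replacement; both models see the same rows.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        base = rmse(truth, [base_pred[i] for i in idx])
        tuned = rmse(truth, [tuned_pred[i] for i in idx])
        gaps.append(base - tuned)
    gaps.sort()
    lower = gaps[int(0.025 * (n_resamples - 1))]
    upper = gaps[int(0.975 * (n_resamples - 1))]
    return lower, upper


# Synthetic illustration only: the "tuned" predictions are built to be closer.
y = [float(i) for i in range(50)]
base = [v + 1.0 for v in y]
tuned = [v + 0.5 for v in y]
low, high = bootstrap_rmse_gap(y, base, tuned, n_resamples=200)
print(f"95% interval for RMSE gap: [{low:.3f}, {high:.3f}]")
```

In this notebook the call might look like `bootstrap_rmse_gap(list(y_test), list(base_preds), list(tuned_preds))`; converting to lists first avoids label-based indexing if `y_test` is a pandas Series. An interval that stays above zero suggests the improvement is not just sampling noise. The pure-Python loop is slow on large test sets, so a smaller `n_resamples` may be preferable.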

## Insights
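- **Optimization History**: Shows whether the best value keeps improving once the TPE sampler moves from its initial startup trials to model-guided sampling of promising regions. A long plateau suggests that additional trials would add little.
- **Importance**: `learning_rate` and `max_depth` are often the most influential XGBoost hyperparameters, but the ranking should be confirmed against the importance plot for this study.
- **Parallel Coordinates**: Helps identify whether certain combinations (e.g., high depth + low learning rate) cluster among the best trials.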