# Hyperparameter Tuning Deep Dive

This notebook explores the hyperparameter tuning capabilities of the `ML_Engine` library using the sample dataset `adult_census_sample.csv`. We will cover:

1.  **Setup**: Loading the dataset and splitting it into training and test sets.
2.  **Basic Training**: Training a model with default parameters.
3.  **Manual Tuning**: Training a model with manually specified parameters.
4.  **Automated Tuning with Optuna**: Using `train_model` with `tuning_method='optuna'`.
5.  **Inspecting & Visualizing Optuna Results**: How to analyze the output of a tuning run.

## 1. Setup & Data Loading

In [28]:
import pandas as pd
import numpy as np
import os
import yaml
from sklearn.model_selection import train_test_split
from ML_Engine.models import training, configs
from ML_Engine.utils.logger import get_logger

logger = get_logger(__name__)

# --- WORKAROUND: Manually load model configs to prevent empty results ---
try:
    base_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
    config_path = os.path.join(base_dir, 'configs', 'model_defaults.yml')
    
    if os.path.exists(config_path) and not configs._MODEL_CONFIGS:
        print(f"Manually loading model configs from: {config_path}")
        with open(config_path, 'r') as f:
            yaml_configs = yaml.safe_load(f)
        
        for problem_type, models in yaml_configs.items():
            for model_name, config in models.items():
                if model_name in configs.MODEL_CLASS_MAP:
                    config['class'] = configs.MODEL_CLASS_MAP[model_name]
        
        configs._MODEL_CONFIGS = yaml_configs
        print("Model configs loaded successfully.")
except Exception as e:
    print(f"Warning: Could not manually load model configs: {e}")
# -----------------------------------------------------------------

# Load the real dataset
data_path = os.path.join('dataset', 'adult_census_sample.csv')
full_df = pd.read_csv(data_path)

# Define target and features to drop
target_col = 'income'
drop_cols = ['income']  # Keep all other features

# Prepare X and y
X = full_df.drop(columns=drop_cols)
y = full_df[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
logger.info(f"Dataset loaded and split. X_train shape: {X_train.shape}")

2026-02-10 22:06:12,398 - __main__ - INFO - Dataset loaded and split. X_train shape: (4000, 14)
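The workaround in the setup cell implies a two-level layout for `model_defaults.yml`: problem type, then model name, then a settings dictionary. The sketch below replays the class-attachment step on a plain dictionary. The parameter values, the `UnknownModel` entry, and the `PlaceholderModel` class are hypothetical, and the real `configs.MODEL_CLASS_MAP` may hold different entries.

```python
# Hypothetical stand-in for an estimator class; the real map would hold
# classes such as lightgbm.LGBMClassifier.
class PlaceholderModel:
    pass


# Assumed shape of the parsed YAML: problem type -> model name -> settings.
yaml_configs = {
    "Classification": {
        "LGBMClassifier": {"n_estimators": 100},  # illustrative value
        "UnknownModel": {"alpha": 1.0},           # has no class in the map
    }
}
model_class_map = {"LGBMClassifier": PlaceholderModel}


def attach_classes(configs_by_type, class_map):
    """Add a 'class' entry to every model config that has a known class.

    Returns the (problem_type, model_name) pairs that could not be mapped.
    """
    unmapped = []
    for problem_type, models in configs_by_type.items():
        for model_name, config in models.items():
            if model_name in class_map:
                config["class"] = class_map[model_name]
            else:
                unmapped.append((problem_type, model_name))
    return unmapped


unmapped = attach_classes(yaml_configs, model_class_map)
print(unmapped)
```

Returning the unmapped names is a design choice of this sketch: the loop in the setup cell skips unknown models silently, which can make a missing entry hard to spot.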


## 2. Basic Training (No Tuning)

This is the simplest case, where we train a model using the default parameters defined in `model_defaults.yml`.

In [29]:
model, info = training.train_model(
    model_name='LGBMClassifier',
    X_train=X_train,
    y_train=y_train,
    problem_type='Classification'
)

logger.info(f"Trained LGBMClassifier with default parameters. Score on test set: {model.score(X_test, y_test):.4f}")

[LightGBM] [Info] Number of positive: 960, number of negative: 3040
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000192 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 566
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.240000 -> initscore=-1.152680
[LightGBM] [Info] Start training from score -1.152680
2026-02-10 22:06:12,539 - __main__ - INFO - Trained LGBMClassifier with default parameters. Score on test set: 0.8700
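The LightGBM log above reports `pavg=0.240000 -> initscore=-1.152680`. For a binary objective, that starting score is the log-odds of the positive class rate, so it can be checked from the class counts in the same log. The sketch below also derives the majority-class rate of the training set, a useful reference point when reading the 0.8700 test score.

```python
import math

positives, negatives = 960, 3040  # class counts from the LightGBM log

p_avg = positives / (positives + negatives)
init_score = math.log(p_avg / (1 - p_avg))  # log-odds of the positive rate
majority_rate = max(p_avg, 1 - p_avg)       # accuracy of always predicting the majority class

print(f"pavg={p_avg:.6f} initscore={init_score:.6f} majority rate={majority_rate:.2f}")
```

Because the split is stratified, the test set should have nearly the same class balance, so a model that always predicts the majority class would score about 0.76.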


## 3. Manual Tuning

To override the default parameters, pass a `model_params` dictionary to `train_model`.

In [30]:
manual_params = {
    'n_estimators': 250,
    'learning_rate': 0.05,
    'max_depth': 15
}

model, info = training.train_model(
    model_name='LGBMClassifier',
    X_train=X_train,
    y_train=y_train,
    problem_type='Classification',
    model_params=manual_params
)

logger.info(f"Trained LGBMClassifier with manual parameters. Score on test set: {model.score(X_test, y_test):.4f}")

[LightGBM] [Info] Number of positive: 960, number of negative: 3040
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000188 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 566
[LightGBM] [Info] Number of data points in the train set: 4000, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.240000 -> initscore=-1.152680
[LightGBM] [Info] Start training from score -1.152680
2026-02-10 22:06:12,748 - __main__ - INFO - Trained LGBMClassifier with manual parameters. Score on test set: 0.8710
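The default and manual runs differ by 0.0010 in test accuracy. With `test_size=0.2` and 4,000 training rows, the test set holds 1,000 rows, so that gap is a single prediction. A rough binomial standard error, sketched below under the assumption that test predictions are independent, gives a sense of how much noise to expect from one split.

```python
import math

n_test = 1000                             # 4000 training rows at an 80/20 split
acc_default, acc_manual = 0.8700, 0.8710  # scores from the two log lines


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated on n samples."""
    return math.sqrt(acc * (1 - acc) / n)


se = accuracy_std_error(acc_default, n_test)
gap = acc_manual - acc_default
print(f"standard error ~ {se:.4f}, observed gap = {gap:.4f}")
```

The gap is roughly a tenth of one standard error, so this single split does not show that the manual parameters are better. Repeated splits or cross-validation would give a firmer comparison.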


## 4. Automated Tuning with Optuna

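Instead of choosing values by hand, we define a search space in `tuning_spaces.yml` and let Optuna search for the best combination of hyperparameters. Here it runs 20 trials (`n_trials=20`), each scored by cross-validated accuracy on the training set (`cv=3`, `tuning_metric='Accuracy'`).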

In [31]:
# Define path to tuning config
tuning_config_path = os.path.join(base_dir, 'configs', 'tuning_spaces.yml')

# Run Optuna tuning; fall back to the default model on a zero-variance error
try:
    tuned_model, tuning_info = training.train_model(
        model_name='RandomForestClassifier',
        X_train=X_train,
        y_train=y_train,
        problem_type='Classification',
        tuning_method='optuna',
        tuning_config_path=tuning_config_path if os.path.exists(tuning_config_path) else None,
        tuning_metric='Accuracy',
        n_trials=20, # Let Optuna try 20 different combinations
        cv=3
    )

except RuntimeError as e:
    if "zero total variance" in str(e):
        print("Optuna tuning failed with zero variance error. Using default model.")
        # Fallback to default model
        tuned_model, _ = training.train_model(
            model_name='RandomForestClassifier',
            X_train=X_train,
            y_train=y_train,
            problem_type='Classification'
        )
        tuning_info = {}
    else:
        raise


logger.info(f"Trained RandomForestClassifier with Optuna tuning. Score on test set: {tuned_model.score(X_test, y_test):.4f}")

[I 2026-02-10 22:06:12,788] A new study created in memory with name: no-name-56f4339d-80f7-4416-b3e9-a069a6653c0f
[I 2026-02-10 22:06:16,297] Trial 0 finished with value: 0.8495007934892268 and parameters: {'n_estimators': 144, 'max_depth': 48, 'min_samples_split': 15, 'min_samples_leaf': 6, 'criterion': 'gini'}. Best is trial 0 with value: 0.8495007934892268.
[I 2026-02-10 22:06:18,724] Trial 1 finished with value: 0.847001667958219 and parameters: {'n_estimators': 64, 'max_depth': 44, 'min_samples_split': 13, 'min_samples_leaf': 8, 'criterion': 'entropy'}. Best is trial 0 with value: 0.8495007934892268.
[I 2026-02-10 22:06:22,127] Trial 2 finished with value: 0.8524998565983326 and parameters: {'n_estimators': 258, 'max_depth': 14, 'min_samples_split': 5, 'min_samples_leaf': 2, 'criterion': 'entropy'}. Best is trial 2 with value: 0.8524998565983326.
[I 2026-02-10 22:06:24,935] Trial 3 finished with value: 0.8517492941451256 and parameters: {'n_estimators': 158, 'max_depth': 18, 'min_

2026-02-10 22:06:42,907 - __main__ - INFO - Trained RandomForestClassifier with Optuna tuning. Score on test set: 0.8610
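The trial log shows Optuna sampling `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, and `criterion`. The layout of `tuning_spaces.yml` is not shown in this notebook, so the sketch below assumes a hypothetical one (a `type` plus bounds or choices per parameter) and illustrates how such a space could be translated into Optuna's `trial.suggest_int`, `trial.suggest_float`, and `trial.suggest_categorical` calls. The ranges are made up, and `RecordingTrial` is a stand-in that lets the function run without Optuna. Inside a real objective function, the `trial` argument would be an `optuna.trial.Trial`.

```python
# Hypothetical search space; the real tuning_spaces.yml may use another schema.
SPACE = {
    "n_estimators": {"type": "int", "low": 50, "high": 300},
    "max_depth": {"type": "int", "low": 3, "high": 50},
    "criterion": {"type": "categorical", "choices": ["gini", "entropy"]},
}


def suggest_params(trial, space):
    """Ask the trial for one value per parameter in the space."""
    params = {}
    for name, spec in space.items():
        if spec["type"] == "int":
            params[name] = trial.suggest_int(name, spec["low"], spec["high"])
        elif spec["type"] == "float":
            params[name] = trial.suggest_float(
                name, spec["low"], spec["high"], log=spec.get("log", False)
            )
        elif spec["type"] == "categorical":
            params[name] = trial.suggest_categorical(name, spec["choices"])
        else:
            raise ValueError(f"Unknown parameter type for {name!r}: {spec['type']}")
    return params


class RecordingTrial:
    """Stand-in for optuna.trial.Trial that always picks the first option."""

    def suggest_int(self, name, low, high):
        return low

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


sampled = suggest_params(RecordingTrial(), SPACE)
print(sampled)
```

Keeping the space in data rather than code means new parameters can be tuned by editing the configuration file alone, which is presumably the motivation for `tuning_spaces.yml`.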


## 5. Inspecting & Visualizing Optuna Results

The `tuning_info` dictionary returned by `train_model` holds the full Optuna `study` object under the `'study'` key. We can use it to read the best parameters and visualize the optimization process.

In [32]:
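# tuning_info is empty if the fallback branch in section 4 ran, so fail with a clear message
study = tuning_info.get('study')
if study is None:
    raise RuntimeError("No Optuna study available: tuning fell back to the default model.")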

print(f"Best cross-validated accuracy: {study.best_value:.4f}")
print(f"Best parameters found: {study.best_params}")

Best cross-validated accuracy: 0.8538
Best parameters found: {'n_estimators': 200, 'max_depth': 47, 'min_samples_split': 3, 'min_samples_leaf': 2, 'criterion': 'entropy'}


In [33]:
from optuna.visualization import plot_optimization_history, plot_param_importances
import plotly.io as pio

# Plot how the accuracy improved over the trials
fig1 = plot_optimization_history(study)
fig1.show()
# Save the plot (static image export with plotly requires the `kaleido` package)
output_dir = os.path.join('outputs', '03_Hyperparameter_Tuning_Deep_Dive')
os.makedirs(output_dir, exist_ok=True)
plot_path = os.path.join(output_dir, 'optimization_history.png')
pio.write_image(fig1, plot_path, width=1000, height=600)
print(f"Plot saved to: {plot_path}")



Plot saved to: outputs\03_Hyperparameter_Tuning_Deep_Dive\optimization_history.png
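`plot_optimization_history` draws each trial's objective value together with the best value seen so far. That running best can be reproduced by hand. The sketch below uses the four trial values that survive in the truncated log from section 4, rounded to four decimals.

```python
from itertools import accumulate

# Cross-validated accuracy of trials 0-3, copied from the Optuna log.
trial_values = [0.8495, 0.8470, 0.8525, 0.8517]

# For a maximization study, the best-so-far curve is a running maximum.
running_best = list(accumulate(trial_values, max))
print(running_best)
```

The curve only moves when a trial beats every earlier one, which here happens at trial 2. This matches the "Best is trial 2" message in the log.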


In [34]:
# Plot which hyperparameters were most important
fig2 = plot_param_importances(study)
fig2.show()
# Save the plot
output_dir = os.path.join('outputs', '03_Hyperparameter_Tuning_Deep_Dive')
os.makedirs(output_dir, exist_ok=True)
plot_path = os.path.join(output_dir, 'param_importances.png')
pio.write_image(fig2, plot_path, width=1000, height=600)
print(f"Plot saved to: {plot_path}")



Plot saved to: outputs\03_Hyperparameter_Tuning_Deep_Dive\param_importances.png
