# Modeling and Fine Tuning

> Owner: Daniel Soukup - Created: 2025.11.01

In this notebook, we load the processed data and fit our models.

## Data loading

Let's load our processed data and create feature/target dataframes for both train and test.

In [0]:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
processed_learn = dataiku.Dataset("processed_learn")
processed_learn_df = processed_learn.get_dataframe()

processed_test = dataiku.Dataset("processed_test")
processed_test_df = processed_test.get_dataframe()

processed_learn_df.shape, processed_test_df.shape

In [0]:
processed_learn_df.head()

In [0]:
TARGET = 'income'

In [0]:
X_train, y_train = processed_learn_df.drop(columns=TARGET), processed_learn_df[TARGET]
X_test, y_test = processed_test_df.drop(columns=TARGET), processed_test_df[TARGET]

Recall that 8% of the processed samples fall into the target class 1 (high income), so a dummy classifier that always predicts 0 would be 92% accurate.

In [0]:
y_train.mean()
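
The relationship between the positive rate and the dummy-classifier accuracy can be checked with a few lines. This is a minimal sketch on made-up labels that only mimic the 8% positive rate quoted above; `majority_baseline_accuracy` and `toy_labels` are illustrative names, not part of the recipe.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Made-up labels: 8 positives out of 100, mirroring the 8% rate in the text.
toy_labels = [1] * 8 + [0] * 92
baseline_accuracy = majority_baseline_accuracy(toy_labels)
print(baseline_accuracy)
```

Any model we tune should therefore be judged against this baseline rather than against 50%.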

**Important Note:** To avoid overfitting, we won't use the test set for any optimization; it is reserved for the final evaluation of the optimized model.

## Modeling

Our current approach focuses on optimizing an XGBoost binary classifier, using Optuna to search the hyperparameter space efficiently. We also aim to address the class imbalance during training by:
- using stratified splitting for cross-validation
- switching the evaluation metric from accuracy to AUC (still affected by class imbalance, but less so than accuracy)
- experimenting with class-balancing methods, such as class weights and upsampling.
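
Two of these points can be made concrete on toy data. The sketch below first shows that a constant predictor, which reaches 92% accuracy on 8%-positive labels, only gets an AUC of 0.5. It then computes the negative-to-positive ratio, which the XGBoost documentation suggests as a typical value for its `scale_pos_weight` parameter. Using `scale_pos_weight` for the class-weight experiment is an assumption here, and the labels and helper names are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def negative_to_positive_ratio(labels):
    """Ratio of negative to positive samples, a common class-weight heuristic."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return n_neg / n_pos


# Made-up labels with an 8% positive rate.
toy_labels = [1] * 8 + [0] * 92

# A model that scores every sample identically cannot rank anything.
constant_auc = roc_auc(toy_labels, [0.0] * len(toy_labels))

# Candidate value for a class-weight experiment (assumption: scale_pos_weight).
class_weight = negative_to_positive_ratio(toy_labels)
print(constant_auc, class_weight)
```

Whether re-weighting actually improves the cross-validated AUC on this dataset is something to verify experimentally.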

### Fit Baseline

In [0]:
from typing import Dict, Any
import xgboost as xgb
from xgboost import XGBClassifier

def cross_val_score_xgb(param: Dict[str, Any]) -> pd.DataFrame:
    """
    Fit model with cross validation using the provided params.

    Return the per-round CV results; with eval_metric "auc" the
    out-of-fold score is in the "test-auc-mean" column.
    """
    dtrain = xgb.DMatrix(X_train, label=y_train)

    results = xgb.cv(
        params=param,
        dtrain=dtrain,
        num_boost_round=param.get("n_estimators", 10), # default 10
        nfold=3,
        seed=42,
        verbose_eval=False,
        stratified=param.get("stratified_cv", False), # default False
    )

    return results

# we won't change these
BASE_PARAMS = {
    "verbosity": 0,
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "stratified_cv": True
}

param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    }
)

In [0]:
results = cross_val_score_xgb(param)
results.tail(3)

### Optimize Hyperparameters

Next, we'll look to optimize the model hyperparameters. As we do this, our experiments will be tracked using MLflow.

In [0]:
import dataiku

project = dataiku.api_client().get_default_project()
managed_folder = project.get_managed_folder('lV6oqreY')

In [0]:
def objective(trial):
    """
    Capture a single param combination and model fitting,
    evaluated using cross-validation.
    """
    param = BASE_PARAMS.copy()
    param.update({
        "subsample": trial.suggest_float("subsample", 0.2, 1.0), # default 1 - all rows
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0), # default 1 - all columns
        "n_estimators": trial.suggest_int("n_estimators", 10, 100, step=10), # XGBClassifier default 100
        "max_depth": trial.suggest_int("max_depth", 2, 20, step=2) # XGBoost default 6
    })
    
    with mlflow_handle.start_run(run_name="trial", nested=True):
        result = cross_val_score_xgb(param)
        best_score = result["test-auc-mean"].values[-1]
        
        # logging
        mlflow_handle.log_params(param)
        mlflow_handle.log_metrics(
            {
                'best_score': best_score
            }
        )
        
        return best_score
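
To make the search space concrete without needing Optuna, MLflow or the training data, here is a plain random search over the same ranges. It is a stand-in only: Optuna's default TPE sampler uses earlier trials to propose new candidates, which random search does not. The scoring function `toy_score` is entirely hypothetical and replaces the cross-validated AUC.

```python
import random


def sample_params(rng):
    """Draw one candidate from the same ranges as the objective above."""
    return {
        "subsample": rng.uniform(0.2, 1.0),
        "colsample_bytree": rng.uniform(0.2, 1.0),
        "n_estimators": rng.randrange(10, 101, 10),  # 10, 20, ..., 100
        "max_depth": rng.randrange(2, 21, 2),  # 2, 4, ..., 20
    }


def toy_score(params):
    """Hypothetical stand-in for the CV score; not derived from a real model."""
    return (
        1.0
        - abs(params["colsample_bytree"] - 0.4)
        - 0.01 * abs(params["max_depth"] - 4)
    )


def random_search(n_trials, seed=0):
    """Evaluate n_trials random candidates and return the best (value, params)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = sample_params(rng)
        trials.append((toy_score(params), params))
    return max(trials, key=lambda trial: trial[0])


best_value, best_params = random_search(n_trials=20)
print(best_value, best_params)
```

The integer parameters only take values on the stepped grid, which is why the heatmaps later in the notebook have a small, fixed set of rows and columns for `n_estimators` and `max_depth`.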

In [0]:
import optuna

N_TRIALS = 20

with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
    
    mlflow_handle.set_experiment("xgboost_hp_tuning")
     
    with mlflow_handle.start_run(run_name="study", nested=True) as study_run:
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=N_TRIALS, timeout=600)
        
        # logging
        best_params = study.best_trial.params
        mlflow_handle.log_params(best_params)
        mlflow_handle.log_metrics(
            {
                'best_score': study.best_trial.value
            }
        )
        
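        # refit best model on the full training set using only the tuned params
        # (BASE_PARAMS are not passed to XGBClassifier here)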
        print("Fitting best model...")
        model = XGBClassifier(**study.best_trial.params)
        model = model.fit(X_train, y_train)
        
        # logging - disabled
#         mlflow_handle.xgboost.log_model(model)

Let's see the best results:

In [0]:
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

### Tuning Analysis

Let's see how the HP choices impacted performance:

In [0]:
study_df = study.trials_dataframe()
study_df.head()

We will look at different projections of the HP space and the best observed values:

In [0]:
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()

pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations"
)
fig.show()

The best-performing models were found in the higher range of boosting rounds and with lower max depth (the latter helps avoid overfitting when the number of estimators is high).
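
Each heatmap cell in this section holds the best value among all trials that share the two plotted hyperparameters, with the remaining hyperparameters projected out. The following plain-Python sketch mirrors what `pd.pivot_table(..., aggfunc='max')` computes. The trial values are made up for illustration and are not results from the study.

```python
def max_projection(trials, row_key, col_key):
    """Best value per (row_key, col_key) pair, maximizing over all other params."""
    best = {}
    for params, value in trials:
        cell = (params[row_key], params[col_key])
        if cell not in best or value > best[cell]:
            best[cell] = value
    return best


# Made-up trials as (params, value) pairs; NOT actual study output.
toy_trials = [
    ({"max_depth": 2, "n_estimators": 100, "colsample_bytree": 0.3}, 0.94),
    ({"max_depth": 2, "n_estimators": 100, "colsample_bytree": 0.9}, 0.92),
    ({"max_depth": 8, "n_estimators": 100, "colsample_bytree": 0.3}, 0.91),
    ({"max_depth": 8, "n_estimators": 20, "colsample_bytree": 0.5}, 0.89),
]

grid = max_projection(toy_trials, "max_depth", "n_estimators")
print(grid)
```

Because only the maximum is kept, a cell can look good thanks to a single lucky trial, so with 20 trials these plots are indicative rather than conclusive.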

In [0]:
pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations",
)
fig.update_xaxes(
    scaleanchor="x",
)
fig.show()

In our experiments, the high scores also corresponded to smaller column samples (the fraction of columns each estimator uses) and lower max depth. The small column sample again helps avoid overfitting.

In [0]:
pivot = pd.pivot_table(study_df, index="params_n_estimators", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations",
)
fig.update_xaxes(
    scaleanchor="x",
)
fig.show()

This is confirmed here again: a low column sample rate, combined with low max depth, allows us to train longer and get better results. Since the best results were observed at the edge of the specified search range, a good next step would be to extend the range further, potentially with a larger step size for boosting rounds.
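
A quick way to spot this situation programmatically is to flag best parameters that sit at, or very close to, a bound of the search range. The sketch below copies the ranges from the objective above; the helper name, the 5% tolerance and the example best parameters are all hypothetical.

```python
# Search ranges as (low, high), copied from the objective function.
SEARCH_RANGES = {
    "subsample": (0.2, 1.0),
    "colsample_bytree": (0.2, 1.0),
    "n_estimators": (10, 100),
    "max_depth": (2, 20),
}


def params_near_bounds(best_params, search_ranges, tolerance=0.05):
    """Flag params within `tolerance` (as a fraction of the range) of a bound."""
    flagged = {}
    for name, value in best_params.items():
        low, high = search_ranges[name]
        margin = tolerance * (high - low)
        if value <= low + margin:
            flagged[name] = "lower"
        elif value >= high - margin:
            flagged[name] = "upper"
    return flagged


# Made-up best parameters for illustration; not the actual study result.
toy_best = {
    "subsample": 0.7,
    "colsample_bytree": 0.22,
    "n_estimators": 100,
    "max_depth": 2,
}

flagged = params_near_bounds(toy_best, SEARCH_RANGES)
print(flagged)
```

In the notebook, the same check could be run on `study.best_trial.params`. A flagged parameter suggests the optimum may lie outside the current range, although some bounds (such as 1.0 for the sampling rates) cannot be extended.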

### Predict

We compute both the predicted class and the predicted probability of the positive class:

In [0]:
predictions_learn_df = pd.DataFrame(
    {
        TARGET: y_train,
        'pred': model.predict(X_train),
        'pred_proba': model.predict_proba(X_train)[:, 1]
    }
)

predictions_test_df = pd.DataFrame(
    {
        TARGET: y_test,
        'pred': model.predict(X_test),
        'pred_proba': model.predict_proba(X_test)[:, 1]
    }
)
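
Storing `pred_proba` alongside `pred` keeps the decision threshold adjustable downstream, which matters with only 8% positives. The sketch below assumes that `pred` corresponds to a 0.5 cut-off on the positive-class probability. This is the usual behaviour for a binary classifier, but it is worth verifying for the installed XGBoost version. The probabilities are made up.

```python
def classify(probabilities, threshold=0.5):
    """Label 1 where the positive-class probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up positive-class probabilities for four samples.
toy_proba = [0.03, 0.41, 0.55, 0.92]

default_pred = classify(toy_proba)  # assumed equivalent of `pred`
lenient_pred = classify(toy_proba, threshold=0.3)  # flags more positives
print(default_pred, lenient_pred)
```

Lowering the threshold trades precision for recall on the high-income class; the right operating point depends on the evaluation criteria used in the next step.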

## Save predictions

Finally, we save the results to their own datasets, which can be used for evaluation:

In [0]:
# Write recipe outputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn.write_with_schema(predictions_learn_df)

predictions_test = dataiku.Dataset("predictions_test")
predictions_test.write_with_schema(predictions_test_df)

In [0]:
study_data = dataiku.Dataset("xgboost_study")
study_data.write_with_schema(study_df)