# Modeling and Fine Tuning

**Owner: Daniel Soukup - Created: 2025.11.01**

In this notebook, we load the processed data and fit our models. We focus on tuning XGBoost classifiers across several hyperparameters that trade off bias and variance, while addressing the class imbalance discussed during EDA. We chose to compare variations of this single model type in depth rather than survey many model families.

**NOTE:** Due to randomness in model fitting and tuning, rerunning the notebook may change the outputs (such as the top predictors) and make them inconsistent with the current markdown commentary.

## Data loading

Let's load our processed data and create feature/target dataframes for both train and test.

In [0]:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
processed_learn = dataiku.Dataset("processed_learn")
processed_learn_df = processed_learn.get_dataframe()

processed_test = dataiku.Dataset("processed_test")
processed_test_df = processed_test.get_dataframe()

processed_learn_df.shape, processed_test_df.shape

In [0]:
processed_learn_df.head()

Some special characters in the column names (such as `<` and `>`) can cause issues with our model, so we replace them here.

In [0]:
processed_learn_df.columns = [col.replace("<", "less").replace(">", "more") for col in processed_learn_df.columns]
processed_test_df.columns = [col.replace("<", "less").replace(">", "more") for col in processed_test_df.columns]
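
XGBoost has historically rejected feature names containing `[`, `]`, or `<`, so a slightly more general clean-up can be useful if new columns appear. The following is a minimal sketch only; `sanitize_feature_name` and the `REPLACEMENTS` mapping are hypothetical names and choices, not part of this notebook's pipeline.

```python
import re

# Characters replaced with readable tokens; the mapping itself is an
# illustrative assumption, not something prescribed by XGBoost.
REPLACEMENTS = {"<": "less", ">": "more", "[": "(", "]": ")"}

def sanitize_feature_name(name: str) -> str:
    """Replace characters that can trip up feature-name validation."""
    pattern = re.compile("|".join(re.escape(ch) for ch in REPLACEMENTS))
    return pattern.sub(lambda match: REPLACEMENTS[match.group(0)], name)

# Made-up column names for illustration only.
example_columns = ["age_<18", "wage_[0,10]", "hours_>40", "sex_Male"]
clean_columns = [sanitize_feature_name(col) for col in example_columns]
print(clean_columns)
```

Applying the same function to both the train and test column lists keeps the two frames aligned.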

In [0]:
TARGET = 'income'

In [0]:
X_train, y_train = processed_learn_df.drop(columns=TARGET), processed_learn_df[TARGET]
X_test, y_test = processed_test_df.drop(columns=TARGET), processed_test_df[TARGET]

Recall that 8% of the processed samples fall into the target class 1 (high income), so a dummy classifier that always predicts 0 would be 92% accurate.

In [0]:
y_train.mean()
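
To make the accuracy figure above concrete, here is a small sketch on synthetic labels (not the notebook's data). It shows that an always-0 classifier reaches 92% accuracy while finding none of the positives. All array names are illustrative assumptions.

```python
import numpy as np

# Synthetic labels with the same 8% positive rate as quoted above
# (illustrative data, not the notebook's dataset).
toy_labels = np.array([1] * 8 + [0] * 92)

# A dummy classifier that always predicts the majority class 0.
dummy_pred = np.zeros_like(toy_labels)

accuracy = (dummy_pred == toy_labels).mean()
true_positives = ((dummy_pred == 1) & (toy_labels == 1)).sum()
recall = true_positives / (toy_labels == 1).sum()

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

This is why accuracy alone is a poor guide here: the score looks strong while the class of interest is missed entirely.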

**Important Note:** We won't use the test set for any optimization, to avoid overfitting to it. We reserve the test set for the final evaluation of the optimized model, as an unbiased estimate of its performance on completely unseen data.

## Modeling

Our current approach focuses on optimizing XGBoost binary classifiers, using Optuna to search the hyperparameter space efficiently. We also aim to address the class imbalance during training by:
- using stratified splitting for cross-validation,
- changing the evaluation metric from accuracy to AUC-PR (to select the hyperparameters that best balance precision and recall),
- experimenting with class-balancing methods, such as class weights tuned together with the other hyperparameters.
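
To illustrate what an area-under-the-precision-recall-curve metric rewards, the sketch below computes average precision, a common step-wise estimate of that area, with plain NumPy. The function name and toy arrays are assumptions for illustration, and XGBoost's `aucpr` may use a different interpolation, so its values need not match exactly.

```python
import numpy as np

def average_precision(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Step-wise area under the precision-recall curve (ties not handled)."""
    order = np.argsort(-scores)            # rank samples by descending score
    hits = y_true[order]
    cum_tp = np.cumsum(hits)               # true positives at each cut-off
    precision = cum_tp / np.arange(1, len(hits) + 1)
    # Average the precision at the rank of every positive sample.
    return float((precision * hits).sum() / hits.sum())

toy_y = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0])
good_scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.1, 0.2, 0.1, 0.3, 0.2])
flat_scores = np.linspace(0.9, 0.0, 10)   # ranking unrelated to the labels

ap_good = average_precision(toy_y, good_scores)
ap_flat = average_precision(toy_y, flat_scores)
print(ap_good, ap_flat)
```

A ranking that puts both positives first scores 1.0, while a ranking unrelated to the labels scores much lower, even though both would look similar under accuracy with a high threshold.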

### Experiment Tracking

We will be recording our models and detailed metrics under two main experiments that will capture multiple runs.

In [0]:
project = dataiku.api_client().get_default_project()
managed_folder = project.get_managed_folder('lV6oqreY')

TUNING_XP = "xgboost_hp_tuning"
BASELINE_XP = "baseline_xp"

### Fit Baseline

As mentioned, we'll be using sample weights to adjust for the class imbalance:

In [0]:
from sklearn.utils.class_weight import compute_sample_weight
from typing import Optional

def get_sample_weights(multiplier: Optional[int]) -> Optional[np.ndarray]:
    """
    Weight minority-class samples higher so they contribute more to the training loss.

    Returns None (no weighting) when no multiplier is given.
    """
    if multiplier:
        return compute_sample_weight({0: 1, 1: multiplier}, y_train)
    else:
        return None

weights = get_sample_weights(10)
weights
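
As a rough illustration of why these weights matter, the sketch below computes a weighted binary log loss by hand for a model that ignores the positive class. The helper name, the toy arrays, and the multiplier of 10 are assumptions chosen for the example; XGBoost applies the weights inside its own objective rather than through this function.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, sample_weights):
    """Weighted average of the per-sample binary log loss."""
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return float(np.average(per_sample, weights=sample_weights))

toy_y = np.array([0, 0, 0, 1])
toy_p = np.array([0.1, 0.1, 0.1, 0.1])     # model that ignores the positive

toy_multiplier = 10                        # hypothetical weight for class 1
toy_weights = np.where(toy_y == 1, toy_multiplier, 1)

unweighted = weighted_log_loss(toy_y, toy_p, np.ones_like(toy_y))
weighted = weighted_log_loss(toy_y, toy_p, toy_weights)
print(unweighted, weighted)
```

The weighted loss is noticeably larger, so missing the positive sample is penalized more heavily during training.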

In order to compare model variations, we need to split the train set into train and validation. For this, we set up our cross-validation helper and define base parameters for our model:

In [0]:
from typing import Dict, Any
import xgboost as xgb

def cross_val_score_xgb(param: Dict[str, Any]) -> pd.DataFrame:
    """
    Fit model with 3-fold cross validation using the provided params.

    Return the per-round CV results; the last row holds the avg out-of-fold
    metric (as specified in the provided params) after the final boosting round.
    """
    dtrain = xgb.DMatrix(
        X_train,
        label=y_train,
        weight=get_sample_weights(param.get('multiplier')),
    )

    results = xgb.cv(
        params=param,
        dtrain=dtrain,
        num_boost_round=param.get("n_estimators", 10), # fall back to the xgb.cv default of 10
        nfold=3,
        seed=42,
        verbose_eval=False,
        stratified=param.get("stratified_cv", False), # fall back to the xgb.cv default of False
    )

    return results

# we won't change these
BASE_PARAMS = {
    "verbosity": 0,
    "objective": "binary:logistic",
    "eval_metric": "aucpr", # adjusted for the imbalance
    "stratified_cv": True # adjusted for the imbalance
}

param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "multiplier": 1,
    }
)
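
The `stratified` flag passed to `xgb.cv` asks for folds that preserve the class ratio. A minimal NumPy sketch of that idea follows; `stratified_fold_ids` is a hypothetical helper written for illustration and is not how XGBoost implements it internally.

```python
import numpy as np

def stratified_fold_ids(y: np.ndarray, n_folds: int, seed: int = 42) -> np.ndarray:
    """Assign each sample a fold id so every fold keeps the class ratio."""
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        # Deal the shuffled indices of this class round-robin across folds.
        fold_ids[idx] = np.arange(len(idx)) % n_folds
    return fold_ids

toy_y = np.array([1] * 24 + [0] * 276)     # 8% positives, illustrative only
toy_folds = stratified_fold_ids(toy_y, n_folds=3)

fold_rates = [toy_y[toy_folds == fold].mean() for fold in range(3)]
print(fold_rates)
```

With a rare positive class, unstratified folds could by chance contain very few positives, which would make the per-fold AUC-PR estimates noisier.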

Let's test our function and log the run:

In [0]:
def run_cv_with_logging(param: dict) -> pd.DataFrame:
    """
    Log the CV run with MLflow to BASELINE_XP:base_run.

    Models are not saved unless autologged.
    """
    
    with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
        mlflow_handle.set_experiment(BASELINE_XP)

        with mlflow_handle.start_run(run_name="base_run", nested=True):
            result = cross_val_score_xgb(param)

            best_score = result["test-aucpr-mean"].values[-1]

            # logging
            mlflow_handle.log_params(param)
            mlflow_handle.log_metrics(
                {
                    'best_score': best_score
                }
            )
            
    return result

results = run_cv_with_logging(param)
results.tail(3)

Let's try with a larger multiplier:

In [0]:
param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "multiplier": 10,
    }
)

results = run_cv_with_logging(param)
results.tail(3)

We can see that the multiplier has a large effect on the AUC-PR score.

We can also test XGBoost's `scale_pos_weight` parameter, which is recommended for balancing classes. A typical value based on the XGBoost recommendations is `sum(negative instances) / sum(positive instances)`. It applies a single weight to the whole positive class rather than to individual samples, so we expect results similar to the sample-weight approach.

In [0]:
scale_pos_weight = (1 - y_train).sum()/y_train.sum()
scale_pos_weight
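
As a worked example of the ratio above, the sketch below uses synthetic labels with the rounded 8% positive rate; the notebook's actual value will differ slightly because the true rate is not exactly 8%. The `toy_` names are illustrative assumptions.

```python
import numpy as np

toy_y = np.array([1] * 8 + [0] * 92)       # illustrative 8% positive rate

n_pos = toy_y.sum()
n_neg = (1 - toy_y).sum()
toy_scale_pos_weight = n_neg / n_pos
print(toy_scale_pos_weight)                # 92 / 8 = 11.5

# With this weight, both classes carry the same total weight in the loss.
total_pos_weight = n_pos * toy_scale_pos_weight
total_neg_weight = n_neg * 1.0
print(total_pos_weight, total_neg_weight)
```

In other words, this choice rebalances the two classes so that each contributes equally in aggregate.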

In [0]:
param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "scale_pos_weight": scale_pos_weight
    }
)

results = run_cv_with_logging(param)
results.tail(3)

Interestingly, we don't see as much of a difference here. We keep `scale_pos_weight` in the base parameters for now and leave a closer comparison for future work.

In [0]:
BASE_PARAMS.update({"scale_pos_weight": scale_pos_weight})

### Optimize Hyperparameters - Main Run

Next, we'll look to optimize the model hyperparameters more systematically. 

The function below defines the HP space to explore (parameters and their ranges), focusing on five parameters known to strongly affect model performance and regularization:

In [0]:
def objective(trial) -> float:
    """
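    Capture a single param combination and model fitting,
    evaluated using cross-validation.

    Relies on the global `mlflow_handle` set by the enclosing
    `setup_mlflow` context, so it must be called inside that block.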
    """
    param = BASE_PARAMS.copy()
    param.update({
        "subsample": trial.suggest_float("subsample", 0.2, 1.0), # default 1 - all rows
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0), # default 1 - all columns
        "n_estimators": trial.suggest_int("n_estimators", 10, 100, step=10), # default 100
        "max_depth": trial.suggest_int("max_depth", 2, 20, step=2), # default 3
        "multiplier": trial.suggest_int("multiplier", 1, 50)
    })
    
    with mlflow_handle.start_run(run_name="trial", nested=True):
        result = cross_val_score_xgb(param)
        best_score = result["test-aucpr-mean"].values[-1]
        
        # logging
        mlflow_handle.log_params(param)
        mlflow_handle.log_metrics(
            {
                'best_score': best_score
            }
        )
        
        return best_score
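
To give a sense of scale for this search space, the short calculation below counts only the discrete combinations implied by the ranges in `objective`. It is a back-of-the-envelope sketch; the variable names are illustrative, and the trial budget of 40 is taken from the study configuration used in this notebook.

```python
# Discrete part of the search space, mirroring the ranges in `objective`.
n_estimators_values = list(range(10, 101, 10))   # 10, 20, ..., 100
max_depth_values = list(range(2, 21, 2))         # 2, 4, ..., 20
multiplier_values = list(range(1, 51))           # 1, 2, ..., 50

n_discrete = (
    len(n_estimators_values) * len(max_depth_values) * len(multiplier_values)
)
print(n_discrete)

# Even before the two continuous parameters (subsample, colsample_bytree),
# a 40-trial study touches well under 1% of the discrete combinations.
coverage = 40 / n_discrete
print(f"{coverage:.2%}")
```

An exhaustive grid is therefore impractical at this budget, which is the motivation for a sampler-based search such as Optuna's.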

Finally, we are ready to run our study, currently capped at 40 trials or 10 minutes (`timeout=600`), whichever comes first:

In [0]:
from xgboost import XGBClassifier
import optuna

N_TRIALS = 40

with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
    
    mlflow_handle.set_experiment(TUNING_XP)
     
    with mlflow_handle.start_run(run_name="study", nested=True) as study_run:
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=N_TRIALS, timeout=600)
        
        # logging
        best_params = study.best_trial.params
        mlflow_handle.log_metrics(
            {
                'best_score': study.best_trial.value
            }
        )
        
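        # refit best model with the same settings that were used during CV
        refit_params = dict(study.best_trial.params)
        multiplier = refit_params.pop("multiplier", None)  # sample-weight knob, not an XGBoost parameter
        model = XGBClassifier(
            objective=BASE_PARAMS["objective"],
            eval_metric=BASE_PARAMS["eval_metric"],
            scale_pos_weight=BASE_PARAMS.get("scale_pos_weight", 1),
            **refit_params,
        )
        model = model.fit(X_train, y_train, sample_weight=get_sample_weights(multiplier))
        mlflow_handle.log_params({"multiplier": multiplier})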
        
        # log best params & model
        mlflow_handle.log_params(model.get_xgb_params())
        mlflow_handle.xgboost.log_model(
            model,
            "xgboost_model",
            input_example=X_train.head(10),
            pip_requirements=['xgboost==2.1.1']
        )

Let's see the best results:

In [0]:
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

### Tuning Analysis

Let's see how the HP choices impacted performance:

In [0]:
study_df = study.trials_dataframe()
study_df.head()

We will look at different projections of the HP space and the best observed values:

In [0]:
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()

pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations"
)
fig.show()

The best-performing models were found in the mid-to-higher range of boosting rounds with lower max depth (the latter helps avoid overfitting when the number of estimators is high).
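
The pivot behind the heatmap above can be sketched on a toy table. The data below is made up and only mirrors the column naming of `study.trials_dataframe()`; it shows that `aggfunc="max"` keeps the best score per cell and that untried combinations appear as `NaN`.

```python
import pandas as pd

# Toy stand-in for the study results; values are made up.
toy_study_df = pd.DataFrame(
    {
        "params_max_depth": [2, 2, 4, 4, 4],
        "params_n_estimators": [50, 50, 50, 100, 100],
        "value": [0.55, 0.60, 0.58, 0.62, 0.57],
    }
)

toy_pivot = pd.pivot_table(
    toy_study_df,
    index="params_max_depth",
    columns="params_n_estimators",
    values="value",
    aggfunc="max",   # keep the best score per cell, not the average
)
print(toy_pivot)
```

Because each cell shows a maximum, a single lucky trial can dominate it, so these heatmaps are best read as indicative rather than conclusive.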

In [0]:
pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations",

)
fig.update_xaxes(
    scaleanchor="x",
  )
fig.show()

In our experiments, high scores also corresponded to smaller column samples (the fraction of columns each estimator uses) unless max depth was significantly lowered. A small column sample again helps avoid overfitting, although the pattern is less clear.

In [0]:
pivot = pd.pivot_table(study_df, index="params_n_estimators", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations",

)
fig.update_xaxes(
    scaleanchor="x",
  )
fig.show()

While the patterns are not entirely clear here, we can see that combining many boosting rounds with a high column sample leads to lower scores (the bottom right corner, likely overfitting again).

Given that some of the best results were observed at the edge of the specified search range, a good next step would be to extend the range, potentially with a larger step size for boosting rounds.

Finally, we look at the multiplier effect:

In [0]:
pd.pivot_table(study_df, index="params_multiplier", values="value", aggfunc='mean').T

On average, the higher the multiplier, the better the AUC-PR score, which is also shown in the heatmaps below. We appear to get the most benefit with weights above roughly 40.
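
With a limited number of trials spread over 50 possible multiplier values, many per-value means rest on a single trial. One way to smooth this, sketched here on made-up data, is to bin the multiplier before averaging. The bin edges and the `toy_` names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the study results; values are made up for illustration.
toy_study_df = pd.DataFrame(
    {
        "params_multiplier": [3, 8, 15, 22, 37, 44, 49],
        "value": [0.50, 0.52, 0.55, 0.56, 0.58, 0.61, 0.60],
    }
)

# Group the multiplier into ranges so trials with nearby values are averaged together.
bins = [0, 10, 20, 30, 40, 50]
toy_study_df["multiplier_bin"] = pd.cut(toy_study_df["params_multiplier"], bins=bins)

binned_means = toy_study_df.groupby("multiplier_bin", observed=True)["value"].mean()
print(binned_means)
```

The same binning could be applied to `study_df` to check whether the upward trend holds once neighbouring multiplier values are pooled.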

In [0]:
pivot = pd.pivot_table(study_df, index="params_multiplier", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations"
)
fig.show()

This pattern is visible in the heatmap above and in the one below as well.

In [0]:
pivot = pd.pivot_table(study_df, index="params_multiplier", columns="params_max_depth", values="value", aggfunc='max')
fig = px.imshow(
    pivot,
    color_continuous_scale="blues",
    title="Best values across combinations"
)
fig.show()

### Predict

We compute and save both the predicted class and the predicted probability:

In [0]:
predictions_learn_df = pd.DataFrame(
    {
        TARGET: y_train,
        'pred': model.predict(X_train),
        'pred_proba': model.predict_proba(X_train)[:, 1]
    }
)

predictions_test_df = pd.DataFrame(
    {
        TARGET: y_test,
        'pred': model.predict(X_test),
        'pred_proba': model.predict_proba(X_test)[:, 1]
    }
)
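
Keeping the probabilities alongside the predicted class lets a later evaluation step choose a different decision threshold. The sketch below, on made-up probabilities, contrasts the usual 0.5 cut-off for a binary classifier with a lower, hypothetical threshold; none of the names refer to objects in this notebook.

```python
import numpy as np

toy_proba = np.array([0.05, 0.30, 0.45, 0.55, 0.90])   # made-up probabilities

# Usual decision rule for a binary classifier: class 1 above 0.5.
pred_default = (toy_proba > 0.5).astype(int)

# A lower, hypothetical threshold trades precision for recall.
toy_threshold = 0.25
pred_lenient = (toy_proba > toy_threshold).astype(int)

print(pred_default, pred_lenient)
```

On an imbalanced problem like this one, the threshold is often worth tuning on validation data rather than leaving at 0.5.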

## Interpretation 

Finally, let's look at the feature importances for our model (top 20):

In [0]:
fig = pd.DataFrame(
    {
        'importance': model.feature_importances_,
    },
    index=model.feature_names_in_
).sort_values('importance').tail(20).plot(kind="bar", backend='plotly')

fig.show()

Observations:
- the sex dummy variable shows a strong effect on the model (top-5 importance across multiple runs), which aligns with the well-known pay gap between genders,
- employment indicators such as class of worker and capital gains and losses naturally showed up as well,
- education fields (high school and college) ranked high in the list,
- as did age, although not as prominently.

All these findings align with our expectations and EDA. Our model picked up on the gender bias in our data (there are many more high-earning males than females in the dataset), which can be addressed in future model iterations - please see the slides for more info.

In [0]:
# gender imbalance
# share of males within each income class (mean of the 0/1 dummy)
processed_learn_df.groupby(TARGET)["sex_Male"].mean()

79% of high income earners were male, as opposed to 46% of low income earners. This statistical disparity is a strong signal for the model to pick up on and use for classification.

## Save predictions

We finally save the results to their own datasets which can be used for evaluation:

In [0]:
# Write recipe outputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn.write_with_schema(predictions_learn_df)

predictions_test = dataiku.Dataset("predictions_test")
predictions_test.write_with_schema(predictions_test_df)

In [0]:
study_data = dataiku.Dataset("xgboost_study")
study_data.write_with_schema(study_df)