# Modeling and Fine Tuning

> Owner: Daniel Soukup - Created: 2025.11.01

In this notebook, we load the processed data and fit our models.

## Data loading

Let's load our processed data and create feature/target dataframes for both train and test.

In [0]:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
processed_learn = dataiku.Dataset("processed_learn")
processed_learn_df = processed_learn.get_dataframe()

processed_test = dataiku.Dataset("processed_test")
processed_test_df = processed_test.get_dataframe()

processed_learn_df.shape, processed_test_df.shape

In [0]:
processed_learn_df.head()

In [0]:
TARGET = 'income'

In [0]:
X_train, y_train = processed_learn_df.drop(columns=TARGET), processed_learn_df[TARGET]
X_test, y_test = processed_test_df.drop(columns=TARGET), processed_test_df[TARGET]

Recall that 8% of the processed samples fall into target class 1 (high income), so a dummy classifier that always predicts 0 would be 92% accurate.

In [0]:
y_train.mean()
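
To make the baseline concrete, here is a minimal sketch on synthetic labels (not the project data) with the same 8% positive rate. It shows that an all-zero predictor reaches 92% accuracy while finding none of the positive cases, so accuracy alone is a weak yardstick here.

```python
# Synthetic labels for illustration only: 8 positives out of 100 samples.
y_true = [1] * 8 + [0] * 92

# A dummy classifier that always predicts the majority class (0).
y_pred = [0] * len(y_true)

# Accuracy: share of samples where the prediction matches the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual positives that the classifier finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints "accuracy=0.92, recall=0.00"
```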

**Important Note:** To avoid overfitting, we won't use the test set for any optimization; we reserve it for the final evaluation of the optimized model.

## Modeling

Our current approach focuses on optimizing an XGBoost binary classifier. We use Optuna to search the hyperparameter space efficiently. We also aim to address the class imbalance during training.
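
The cells in this notebook do not show how the class imbalance is handled. One common option, given here only as an assumption and not as the method used in this project, is XGBoost's `scale_pos_weight` parameter, typically set to the ratio of negative to positive samples. The helper name `compute_scale_pos_weight` and the synthetic labels are hypothetical.

```python
from typing import Sequence


def compute_scale_pos_weight(labels: Sequence[int]) -> float:
    """Return the negative-to-positive ratio of binary labels (hypothetical helper)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("labels contain no positive samples")
    return n_neg / n_pos


# Synthetic labels for illustration only: 8% positives, as in the text above.
labels = [1] * 8 + [0] * 92
weight = compute_scale_pos_weight(labels)

# Assumption: this dict would be merged into the XGBoost params before training.
imbalance_params = {"scale_pos_weight": weight}
print(imbalance_params)
# prints "{'scale_pos_weight': 11.5}"
```

With the real data, the same ratio could be computed from `y_train` and added to the fixed parameters, or exposed to the hyperparameter search as one more value to tune.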

### Fit Baseline

In [0]:
from typing import Dict, Any
import xgboost as xgb
from xgboost import XGBClassifier

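def cross_val_score_xgb(param: Dict[str, Any]) -> pd.DataFrame:
    """
    Fit model with cross validation using the provided params.

    Return the per-round evaluation history (mean and std of the
    eval metric on the train and test folds) as a dataframe.
    """
    dtrain = xgb.DMatrix(X_train, label=y_train)

    # xgb.cv does not read these keys from params, so pass them explicitly
    param = param.copy()
    stratified = param.pop("stratified_cv", False)
    num_boost_round = param.pop("n_estimators", 10)  # xgb.cv defaults to 10 rounds

    results = xgb.cv(
        params=param,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        nfold=3,
        seed=42,
        verbose_eval=False,
        stratified=stratified,
    )

    return results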

# Fixed parameters shared by all trials; we won't change these
BASE_PARAMS = {
    "verbosity": 0,
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "stratified_cv": True
}

param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 50,
    "max_depth": 2,
    }
)

cross_val_score_xgb(param)

### Optimize Hyperparameters

Next, we'll look to optimize the model hyperparameters.

In [0]:
def objective(trial):
    """
    Capture a single param combination and model fitting,
    evaluated using cross-validation.
    """
    param = BASE_PARAMS.copy()
    param.update({
        "subsample": trial.suggest_float("subsample", 0.2, 1.0), # default 1 - all rows
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0), # default 1 - all columns
        "n_estimators": trial.suggest_int("n_estimators", 10, 200, step=10), # default 100
        "max_depth": trial.suggest_int("max_depth", 2, 20, step=2) # default 3
    })
    
    result = cross_val_score_xgb(param)
    
    trial.set_user_attr("n_estimators", len(result))
    
    best_score = result["test-auc-mean"].values[-1]
    
    return best_score

In [0]:
import optuna

study = optuna.create_study(direction="maximize")

study.optimize(objective, n_trials=3, timeout=600)

Let's see the best results:

In [0]:
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

In [0]:
import dataiku

project = dataiku.api_client().get_default_project()
managed_folder = project.get_managed_folder('lV6oqreY')

with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:

    # Note: if you don't call this (i.e. when no experiment is specified), the default one is used
    mlflow_handle.set_experiment("test")

    with mlflow_handle.start_run(run_name="my_run"):
        # ...your MLflow code...
        mlflow_handle.log_param("a", 1)
        mlflow_handle.log_metric("b", 2)

        # This uses the regular MLflow APIs

### Refit Best Model

Now that we have found the best parameters, we refit on the whole training set:

In [0]:
model = XGBClassifier(**study.best_trial.params)
model = model.fit(X_train, y_train)

In [0]:
model

In [0]:
model.score(X_train, y_train)

This gives a quick sense of model accuracy, but the number is quite unreliable: it is measured on the training data, and it is only slightly better than our dummy all-0 prediction.
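
Metrics that look at the positive class are usually more informative than accuracy on imbalanced data. The sketch below computes precision, recall, and ROC AUC from scratch on a tiny synthetic example (the labels and scores are made up). The pairwise AUC is quadratic in the number of samples, so on real data a library routine such as `sklearn.metrics.roc_auc_score` would be the practical choice.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]
y_pred = [int(s > 0.5) for s in scores]

# Confusion counts for the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, scores)

print(f"precision={precision:.2f}, recall={recall:.2f}, auc={auc:.3f}")
# prints "precision=0.50, recall=0.50, auc=0.875"
```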

### Predict

We compute both the predicted class and the predicted probability of the positive class:

In [0]:
predictions_learn_df = pd.DataFrame(
    {
        TARGET: y_train,
        'pred': model.predict(X_train),
        'pred_proba': model.predict_proba(X_train)[:, 1]
    }
)

predictions_test_df = pd.DataFrame(
    {
        TARGET: y_test,
        'pred': model.predict(X_test),
        'pred_proba': model.predict_proba(X_test)[:, 1]
    }
)
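
Keeping `pred_proba` alongside `pred` leaves room to revisit the decision threshold later. The sketch below assumes the usual 0.5 cut-off for the default class prediction and shows how a lower threshold flags more samples as positive. The helper `to_class` and the probabilities are hypothetical, and any threshold tuning should use the training data only, in line with the note on the test set above.

```python
from typing import List, Sequence


def to_class(probas: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a sample 1 when its positive-class probability exceeds the threshold."""
    return [int(p > threshold) for p in probas]


# Made-up positive-class probabilities, for illustration only.
pred_proba = [0.05, 0.35, 0.55, 0.90]

default_pred = to_class(pred_proba)      # assumed default cut-off of 0.5
lower_pred = to_class(pred_proba, 0.3)   # a lower cut-off trades precision for recall

print(default_pred)  # prints "[0, 0, 1, 1]"
print(lower_pred)    # prints "[0, 1, 1, 1]"
```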

## Save predictions

Finally, we save the results to their own datasets, which can be used for evaluation:

In [0]:
# Write recipe outputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn.write_with_schema(predictions_learn_df)

predictions_test = dataiku.Dataset("predictions_test")
predictions_test.write_with_schema(predictions_test_df)