# XGBoost Training – Adult Dataset

This notebook trains an XGBoost classifier on the prepared Adult dataset using:

- The modelling-ready dataset: `adult_model_ready.csv`.
- A preprocessing pipeline (one-hot encoding for categorical features).
- Hyperparameter tuning with `RandomizedSearchCV`.

At the end, the best model is evaluated on the held-out test set and persisted to disk.

## 1. Imports and configuration

In [20]:
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

RANDOM_STATE = 42

## 2. Load modelling-ready dataset

We load `adult_model_ready.csv` generated by the preprocessing notebook. This file:

- Has no missing values in key socio-economic fields.
- Contains the original `income` target.
- Preserves the original split (`train` / `test`) in a `split` column.

In [21]:
PROJECT_ROOT = Path('..').resolve()
model_path = PROJECT_ROOT / 'data' / 'processed' / 'adult' / 'adult_model_ready.csv'

df = pd.read_csv(model_path)
print(f'Loaded dataset with shape: {df.shape}')
print('Income distribution:\n', df['income'].value_counts())

Loaded dataset with shape: (45194, 16)
Income distribution:
 income
<=50K    33988
>50K     11206
Name: count, dtype: int64


## 3. Define features, target, and train/test split

- Target: `income` → converted to binary (1 if `>50K`, 0 otherwise).
- Features: all columns except `income` and `split`.
- Train/Test:
  - Training on rows where `split == "train"`.
  - Testing on rows where `split == "test"`.

In [22]:
split_col = df['split']
y = (df['income'] == '>50K').astype(int)

train_mask = split_col == 'train'
test_mask = split_col == 'test'

X_train = df.loc[train_mask].drop(columns=['income', 'split'])
y_train = y.loc[train_mask]
X_test = df.loc[test_mask].drop(columns=['income', 'split'])
y_test = y.loc[test_mask]

print(f'Training features shape: {X_train.shape}')
print(f'Test features shape: {X_test.shape}')

Training features shape: (30139, 14)
Test features shape: (15055, 14)
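
Because the partition relies on the labels stored in the `split` column rather than on random sampling, it can be worth confirming that every row lands in exactly one partition and that the positive rate is comparable in both. A minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration; only the column names mirror the notebook):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    'age': [25, 40, 31, 58, 47, 36],
    'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K', '<=50K'],
    'split': ['train', 'train', 'train', 'test', 'test', 'train'],
})

y_toy = (toy['income'] == '>50K').astype(int)
train_mask_toy = toy['split'] == 'train'
test_mask_toy = toy['split'] == 'test'

# Every row should belong to exactly one of the two partitions.
rows_covered = int((train_mask_toy ^ test_mask_toy).sum())
assert rows_covered == len(toy), 'unexpected value in the split column'

# Positive-class rate per partition, to spot a badly skewed split.
positive_rate = y_toy.groupby(toy['split']).mean()
print(positive_rate)
```

On the real data, the same checks would run against `df`, `train_mask`, and `test_mask`.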


## 4. Preprocessing pipeline

We build a preprocessing step that:

- One-hot encodes all categorical columns.
- Passes numeric columns through unchanged.

This preprocessing is wrapped together with the XGBoost model in a single `Pipeline`.

In [23]:
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
numeric_cols = X_train.select_dtypes(exclude=['object']).columns.tolist()

print('Categorical columns:', categorical_cols)
print('Numeric columns    :', numeric_cols)

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocess = ColumnTransformer(
    transformers=[
        ('categorical', categorical_transformer, categorical_cols),
        ('numeric', 'passthrough', numeric_cols),
    ]
)

model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.9,
    colsample_bytree=0.9,
    n_jobs=-1,
    random_state=RANDOM_STATE,
)

pipeline = Pipeline(
    steps=[
        ('preprocess', preprocess),
        ('model', model),
    ]
)

Categorical columns: ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
Numeric columns    : ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']


## 5. Hyperparameter optimization (RandomizedSearchCV)

We use `RandomizedSearchCV` with stratified 5-fold cross-validation to sample 40 configurations (`n_iter=40`) from a grid of 25,920 possible combinations over:

- Number of trees (`n_estimators`).
- Depth of trees (`max_depth`).
- Learning rate (`learning_rate`).
- Row and column subsampling ratios (`subsample`, `colsample_bytree`).
- Regularization (`min_child_weight`, `gamma`, `reg_lambda`).

Each candidate is scored by its mean ROC AUC across the folds, and the best configuration is refit on the full training data (`refit=True`).

In [24]:
param_distributions = {
    'model__n_estimators': [200, 300, 400, 600],
    'model__max_depth': [3, 4, 5, 6, 8],
    'model__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'model__subsample': [0.6, 0.8, 1.0],
    'model__colsample_bytree': [0.6, 0.8, 1.0],
    'model__min_child_weight': [1, 5, 10],
    'model__gamma': [0, 0.1, 0.3],
    'model__reg_lambda': [1, 3, 5, 10],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=40,
    scoring='roc_auc',
    n_jobs=-1,
    cv=cv,
    verbose=0,
    random_state=RANDOM_STATE,
    refit=True,
)

random_search.fit(X_train, y_train)

print('Best parameters:', random_search.best_params_)
print('Best CV ROC AUC:', random_search.best_score_)

Best parameters: {'model__subsample': 1.0, 'model__reg_lambda': 3, 'model__n_estimators': 600, 'model__min_child_weight': 1, 'model__max_depth': 6, 'model__learning_rate': 0.05, 'model__gamma': 0.3, 'model__colsample_bytree': 1.0}
Best CV ROC AUC: 0.9278956824532658
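
`best_params_` reports only the winner. The fitted search object also exposes `cv_results_`, which helps judge how close the runner-up configurations were and how much the score varies across folds. To stay self-contained, the sketch below substitutes synthetic data and a `LogisticRegression` for the notebook's pipeline (both are stand-ins, as is the `C` grid); in the notebook, the same inspection would apply to `random_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data and a lightweight stand-in estimator (assumptions).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search_demo = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={'C': [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

# One row per sampled configuration, best-ranked first.
results = (
    pd.DataFrame(search_demo.cv_results_)
    .sort_values('rank_test_score')
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
)
print(results.head())
```

A small gap between the top few rows, relative to `std_test_score`, would suggest that several configurations perform about equally well.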


## 6. Evaluation on held-out test set

We evaluate the best model on the original test split using:

- Classification report (precision, recall, F1).
- ROC AUC.
- Confusion matrix.

In [25]:
best_model = random_search.best_estimator_

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print('\nClassification report (test set):\n', classification_report(y_test, y_pred, digits=4))
test_roc_auc = roc_auc_score(y_test, y_proba)
print(f'Test ROC AUC: {test_roc_auc:.4f}')

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:\n', cm)


Classification report (test set):
               precision    recall  f1-score   support

           0     0.8957    0.9390    0.9168     11355
           1     0.7801    0.6643    0.7176      3700

    accuracy                         0.8715     15055
   macro avg     0.8379    0.8016    0.8172     15055
weighted avg     0.8673    0.8715    0.8678     15055

Test ROC AUC: 0.9282
Confusion matrix:
 [[10662   693]
 [ 1242  2458]]
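
The figures in the classification report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes in scikit-learn's convention. As a cross-check, the sketch below recomputes the class-1 metrics and the overall accuracy from the four counts printed above:

```python
# Counts copied from the confusion matrix above:
# rows are true classes, columns are predicted classes.
tn, fp = 10662, 693
fn, tp = 1242, 2458

precision = tp / (tp + fp)   # share of predicted >50K that are correct
recall = tp / (tp + fn)      # share of actual >50K that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.4f} recall={recall:.4f} '
      f'f1={f1:.4f} accuracy={accuracy:.4f}')
# precision=0.7801 recall=0.6643 f1=0.7176 accuracy=0.8715
```

The values match the report. The recall of about 0.66 means that roughly one in three `>50K` individuals in the test split is predicted as `<=50K` at the default decision threshold.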


## 7. Save the best model

Finally, we persist the tuned pipeline (preprocessing + XGBoost model) using `joblib` so it can be reused later for inference or further analysis.

In [26]:
models_dir = PROJECT_ROOT / 'models'
models_dir.mkdir(parents=True, exist_ok=True)

model_file = models_dir / 'xgb_adult_model.joblib'
dump(best_model, model_file)
print(f'Saved model to: {model_file}')

Saved model to: /Users/villafuertech/Documents/Academic/University/Septimo_Semestre/Trusthworthy_ML/Projects/3_Project/fairness-project/models/xgb_adult_model.joblib
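
To reuse the persisted pipeline, `joblib.load` restores the whole object, so raw rows with the original column names can be passed straight to `predict_proba`. The round trip is sketched below on a small stand-in pipeline written to a temporary directory (the data, the `LogisticRegression`, and the file name `demo_model.joblib` are all illustrative); for the real model, the path to load would be `model_file`.

```python
import tempfile
from pathlib import Path

import pandas as pd
from joblib import dump, load
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in data and model (assumptions); the notebook saves an XGBoost pipeline.
X_demo = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Female'],
                       'age': [25, 52, 47, 33]})
y_demo = [0, 1, 1, 0]

pipe_demo = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['sex']),
        ('numeric', 'passthrough', ['age']),
    ])),
    ('model', LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# Save to a temporary location and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / 'demo_model.joblib'
    dump(pipe_demo, demo_file)
    restored = load(demo_file)

# The restored pipeline should reproduce the original predictions exactly.
proba_before = pipe_demo.predict_proba(X_demo)[:, 1]
proba_after = restored.predict_proba(X_demo)[:, 1]
print((proba_before == proba_after).all())
```

Since joblib files are pickle-based, they are best loaded with the same library versions used to save them, and only from trusted sources.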
