# XGBoost API Demo

This notebook gives a **minimal, self-contained demonstration** of:

- The **native XGBoost Python API**, meaning `XGBClassifier` used directly (`fit` / `predict` / `predict_proba` / `feature_importances_`).
- The **wrapper layer** used in the Employee Attrition project:
  - A scikit-learn `Pipeline` that combines preprocessing and XGBoost.
  - Accessing the trained model and feature names from inside the pipeline.
  - Simple threshold tuning on top of `predict_proba`.

The goal is to mirror the way XGBoost is used in the main project notebook, but on a small sample dataset so the structure of the API is clear.


In [1]:
# Core imports for this API demo
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from xgboost import XGBClassifier

## 1. Load a sample binary classification dataset

For this API demo we use scikit-learn's built-in **breast cancer** dataset.  
To mimic the employee attrition setting (mixed numeric + categorical features),  
we create a simple **categorical feature** by binning one of the numeric columns.


In [2]:
# Load dataset as a pandas DataFrame
data = load_breast_cancer(as_frame=True)
X_raw = data.frame.drop(columns=["target"])
y = data.target  # binary labels (0 = malignant, 1 = benign)

# Create a simple categorical feature by binning 'mean radius'
X = X_raw.copy()
X["radius_group"] = pd.cut(
    X["mean radius"],
    bins=[X["mean radius"].min() - 1, 12, 18, X["mean radius"].max() + 1],
    labels=["small", "medium", "large"]
)

# Train/test split (stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | radius_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | medium |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | large |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | large |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | small |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | large |
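
The binning step in the cell above relies on how `pd.cut` treats bin edges. The sketch below uses a handful of hypothetical radius values and fixed outer bounds (`0` and `30`, chosen only for illustration) to show that intervals are right-inclusive by default, so a value of exactly 12 falls in `small` and a value of exactly 18 falls in `medium`.

```python
import pandas as pd

# Hypothetical radius values chosen to sit on and around the bin edges
radii = pd.Series([11.42, 12.0, 17.99, 18.0, 20.57])

# Same inner edges as the notebook cell; outer bounds are illustrative
groups = pd.cut(radii, bins=[0, 12, 18, 30], labels=["small", "medium", "large"])

# Default intervals are right-inclusive: (0, 12], (12, 18], (18, 30]
print(groups.tolist())  # ['small', 'small', 'medium', 'medium', 'large']
```

This matches row 0 of the preview, where a mean radius of 17.99 is labelled `medium`.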


## 2. Native XGBoost API (no wrapper layer)

In this section we use **XGBClassifier directly** on numeric features only (no wrapper, no pipeline).  
The steps are:

1. Drop the synthetic `radius_group` feature so we only keep numeric columns.  
2. Configure an `XGBClassifier` with key hyperparameters.  
3. Call `fit()` to train the model on the training set.  
4. Use `predict()` for hard labels and `predict_proba()` for probabilities.  
5. Evaluate performance with Accuracy, F1-score, ROC-AUC, and a classification report.

This cell demonstrates the **core XGBoost workflow** used in the main project in its simplest form: clean numeric data in → train model → get predictions.


In [3]:
# For the native XGBoost demo, drop the synthetic categorical feature
X_train_num = X_train.drop(columns=["radius_group"])
X_test_num = X_test.drop(columns=["radius_group"])

# Configure a native XGBoost classifier
xgb_native = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    n_jobs=-1
)

# Train the model
xgb_native.fit(X_train_num, y_train)

# Predictions and predicted probabilities
y_pred = xgb_native.predict(X_test_num)
y_proba = xgb_native.predict_proba(X_test_num)[:, 1]

print("Native XGBoost – numeric only")
print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC  : {roc_auc_score(y_test, y_proba):.3f}")
print()
print(classification_report(y_test, y_pred))

Native XGBoost – numeric only
Accuracy : 0.947
F1-score : 0.959
ROC-AUC  : 0.994

              precision    recall  f1-score   support

           0       0.95      0.90      0.93        42
           1       0.95      0.97      0.96        72

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114
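
To make the reported numbers easier to read, here is a small worked example that recomputes them by hand. The confusion counts are not printed by the notebook; they are inferred from the supports and per-class recalls in the report above, so treat them as an assumption. Class 1 is taken as the positive class, which is the default for `f1_score`.

```python
# Confusion counts inferred from the classification report (an assumption,
# not notebook output). Class 1 is the positive class.
tp, fn = 70, 2   # 72 true class-1 samples
tn, fp = 38, 4   # 42 true class-0 samples

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.3f}")   # 0.947
print(f"Precision: {precision:.3f}")  # 0.946
print(f"Recall   : {recall:.3f}")     # 0.972
print(f"F1-score : {f1:.3f}")         # 0.959
```

Under these counts the model misclassifies 6 of the 114 test samples: 4 false positives and 2 false negatives.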



Here we inspect which numeric features matter most to the **native XGBoost model**.  
We take `xgb_native.feature_importances_`, wrap it in a `pandas.Series` with the column names as the index, sort the values in descending order, and display the **top 10 most important features**.  
This helps us see which variables XGBoost relies on most when predicting the binary labels.


In [4]:
# Top feature importances from the native XGBoost model
native_importances = pd.Series(
    xgb_native.feature_importances_,
    index=X_train_num.columns
).sort_values(ascending=False)

native_importances.head(10)

worst perimeter         0.225325
worst radius            0.186433
mean concave points     0.115287
worst concave points    0.084676
worst area              0.055128
worst compactness       0.033726
worst concavity         0.026196
texture error           0.025838
concavity error         0.023321
mean concavity          0.023273
dtype: float32

## 3. Wrapper layer with scikit-learn Pipeline

In many tabular ML problems, we don’t want to call XGBoost directly on raw data.  
Instead, we wrap data preprocessing and the model into a single **scikit-learn `Pipeline`**:

- We first **identify categorical and numeric columns**.
- A `ColumnTransformer` applies **one-hot encoding** to categoricals and **standard scaling** to numerics.
- An `XGBClassifier` is then trained on this transformed feature space.
- The `Pipeline` object exposes a **single interface** with `fit`, `predict`, and `predict_proba`, and it automatically applies the same preprocessing to any new data.

This pattern makes the code cleaner, reduces the risk of data leakage (the encoder and scaler are fitted on the training data only), and allows us to treat “preprocessing + XGBoost” as one reusable model component.
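
The leakage point can be checked directly. The sketch below is a minimal illustration on synthetic data, with `LogisticRegression` standing in for `XGBClassifier` to keep it lightweight; the names `X_demo`, `demo_pipe`, and the step name `"scale"` are hypothetical. It shows that the scaler inside a fitted pipeline holds statistics computed from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 1))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# LogisticRegression is a stand-in for XGBClassifier in this sketch
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
demo_pipe.fit(X_tr, y_tr)

# The scaler's mean was learned from the training rows only
scaler_mean = demo_pipe.named_steps["scale"].mean_[0]
print(np.isclose(scaler_mean, X_tr[:, 0].mean()))  # True

# At prediction time the same fitted scaler is reused on the test rows
test_pred = demo_pipe.predict(X_te)
```

Calling `demo_pipe.predict(X_te)` transforms the test rows with the training statistics, so no information from the test split reaches the preprocessing step.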



In [5]:
# Identify categorical and numeric columns
categorical_cols = ["radius_group"]
numeric_cols = [c for c in X.columns if c not in categorical_cols]

# Preprocessing: one-hot encode categoricals, standardize numerics
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ]
)

# XGBoost classifier (same configuration as before)
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    n_jobs=-1
)

# Combined preprocessing + model pipeline
xgb_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", xgb_model),
])

# Train the full pipeline on raw X (numeric + categorical)
xgb_pipeline.fit(X_train, y_train)

# Evaluate on the test set
y_pred_pipe = xgb_pipeline.predict(X_test)
y_proba_pipe = xgb_pipeline.predict_proba(X_test)[:, 1]

print("XGBoost + preprocessing pipeline")
print(f"Accuracy : {accuracy_score(y_test, y_pred_pipe):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred_pipe):.3f}")
print(f"ROC-AUC  : {roc_auc_score(y_test, y_proba_pipe):.3f}")
print()
print(classification_report(y_test, y_pred_pipe))

XGBoost + preprocessing pipeline
Accuracy : 0.956
F1-score : 0.966
ROC-AUC  : 0.995

              precision    recall  f1-score   support

           0       0.95      0.93      0.94        42
           1       0.96      0.97      0.97        72

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114



## 4. Accessing the trained XGBoost model and feature importances

The pipeline wrapper keeps preprocessing and the classifier together,  
but we can still **reach inside** the pipeline to:

- Get the trained XGBoost model.  
- Reconstruct the transformed **feature names** (including one-hot encoded columns).  
- View feature importances in a human-readable way.


In [6]:
# Access the fitted preprocessor and XGBoost model from the pipeline
fitted_preprocessor = xgb_pipeline.named_steps["preprocess"]
fitted_xgb = xgb_pipeline.named_steps["clf"]

# Get feature names created by the one-hot encoder
ohe = fitted_preprocessor.named_transformers_["cat"]
encoded_cat_names = list(ohe.get_feature_names_out(categorical_cols))

# Final list of all features seen by XGBoost, in the ColumnTransformer's
# output order ("cat" columns first, then "num" columns)
all_feature_names = encoded_cat_names + numeric_cols

# Feature importances from the pipeline's XGBoost model
pipeline_importances = pd.Series(
    fitted_xgb.feature_importances_,
    index=all_feature_names
).sort_values(ascending=False)

pipeline_importances.head(10)

worst perimeter         0.266257
worst radius            0.139660
mean concave points     0.120133
worst concave points    0.076687
worst compactness       0.055995
worst area              0.043156
concavity error         0.039118
worst concavity         0.028683
mean area               0.025770
mean texture            0.018867
dtype: float32

## 5. Example: threshold tuning on top of `predict_proba`

We can also go **one step beyond** the default 0.5 threshold:  

- We keep the trained pipeline as-is.  
- We vary the decision threshold on the predicted probabilities.  
- For each threshold we compute accuracy and F1-score, and choose a value that fits our goal (e.g., higher recall or higher F1).

In [7]:
thresholds = np.linspace(0.1, 0.9, 9)

print("Threshold tuning for the XGBoost pipeline:")
for t in thresholds:
    y_pred_t = (y_proba_pipe >= t).astype(int)
    acc = accuracy_score(y_test, y_pred_t)
    f1 = f1_score(y_test, y_pred_t)
    print(f"Threshold {t:.1f} -> Accuracy={acc:.3f}, F1={f1:.3f}")

Threshold tuning for the XGBoost pipeline:
Threshold 0.1 -> Accuracy=0.956, F1=0.966
Threshold 0.2 -> Accuracy=0.965, F1=0.973
Threshold 0.3 -> Accuracy=0.956, F1=0.966
Threshold 0.4 -> Accuracy=0.956, F1=0.966
Threshold 0.5 -> Accuracy=0.956, F1=0.966
Threshold 0.6 -> Accuracy=0.947, F1=0.958
Threshold 0.7 -> Accuracy=0.939, F1=0.951
Threshold 0.8 -> Accuracy=0.956, F1=0.965
Threshold 0.9 -> Accuracy=0.947, F1=0.957
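
The selection step can also be automated. The sketch below uses hypothetical labels and probabilities (`y_val`, `proba_val`) and assumes a separate validation split, since choosing the threshold on the test set would make the reported test metrics optimistic. It records precision and recall alongside F1 so the trade-off is visible.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical validation labels and predicted probabilities
y_val = np.array([0, 0, 1, 1, 1, 0, 1, 1])
proba_val = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.90, 0.45])

results = []
for t in [0.3, 0.5, 0.7]:
    pred = (proba_val >= t).astype(int)
    results.append({
        "threshold": t,
        "precision": precision_score(y_val, pred),
        "recall": recall_score(y_val, pred),
        "f1": f1_score(y_val, pred),
    })

# Pick the threshold with the highest F1 on the validation data
best = max(results, key=lambda r: r["f1"])
print(best["threshold"])  # 0.3
```

Swapping the `key` to `r["recall"]` would select for recall instead, at the cost of more false positives.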


## 6. Summary

This API demo notebook showed:

- How the **native XGBoost classifier** is configured and used (`fit`, `predict`, `predict_proba`, `feature_importances_`).  
- How a **scikit-learn Pipeline** wraps preprocessing + XGBoost into a single object that works directly on DataFrames.  
- How to access the **inner model and feature names** from the pipeline.  
- How to perform simple **threshold tuning** using `predict_proba`.

These are the same patterns used in the main Employee Attrition project notebook, just demonstrated on a small, self-contained dataset.
