
# 🚢 CatBoost in 10 Minutes: Titanic Survival — Training, Explainability, & What‑Ifs

This mini‑tutorial introduces **gradient boosting** (trees) and **CatBoost** with a real, mixed‑type dataset (*Titanic*).
You’ll train a model, evaluate it, interpret predictions with **SHAP**, and try an **interactive what‑if simulator**.

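**Why CatBoost?** It handles **categorical features** natively, accepts **missing values** in numeric features, and needs minimal preprocessing. Its predicted probabilities are a reasonable starting point, though calibration is not guaranteed and is worth checking.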


In [None]:

# ▶️ Install (run once)
!pip -q install catboost seaborn scikit-learn ipywidgets



## 0) What is Gradient Boosting (GBDT)? *30‑second mental model*

- We fit **many small decision trees** sequentially.  
- Each new tree focuses on **errors** of the current model, improving step‑by‑step.  
- CatBoost = a GBDT library that encodes **categorical features** with ordered target statistics and uses ordered boosting to curb **overfitting**.
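
The mental model above can be sketched in plain Python. This is a toy illustration only, not how CatBoost is implemented: it assumes squared-error loss, a single numeric feature, and depth-1 trees ("stumps"). The names `fit_stump` and `boost` and the toy data are made up for this example.

```python
# Toy gradient boosting for squared error, using depth-1 trees ("stumps").
# fit_stump, boost, and the data below are illustrative, not CatBoost internals.

def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    trees, history = [], []
    for _ in range(n_trees):
        # For squared error, the residual is the negative gradient of the loss
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, history

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1, 4, 4, 4, 4]
predict, mse_history = boost(xs, ys)
print(f"MSE after 1 tree: {mse_history[0]:.3f}, after 50 trees: {mse_history[-1]:.3f}")
```

The training error shrinks a little with every tree, and `learning_rate` controls how much of each correction is applied. That is the same trade-off the `iterations` and `learning_rate` knobs control in CatBoost.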



## 1) Imports & dataset (mixed numeric/categorical + NaNs)

We use Seaborn’s **Titanic** dataset. The target is `survived` (renamed `y`).  
We keep a small but diverse set of features (class, sex, age, fare, family, etc.).


In [None]:

import pandas as pd, numpy as np, seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report, RocCurveDisplay
from catboost import CatBoostClassifier, Pool
import ipywidgets as W
from IPython.display import display, HTML
import matplotlib.pyplot as plt

# Load Titanic
df = sns.load_dataset("titanic")
cols = ['survived','pclass','sex','age','sibsp','parch','fare','embarked','class','who','adult_male','alone']
df = df[cols].rename(columns={'survived':'y'})

# Split
X, y = df.drop(columns='y'), df['y']
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# Categorical columns by dtype
cat_cols = X.select_dtypes(include=['object','category','bool']).columns.tolist()

def make_cats_safe(df, cat_cols):
    """Ensure categoricals are plain strings with 'NA' instead of missing."""
    df = df.copy()
    for c in cat_cols:
        df[c] = df[c].astype('object')
        df[c] = df[c].where(df[c].notna(), 'NA').astype(str)
    return df

Xtr = make_cats_safe(Xtr, cat_cols)
Xte = make_cats_safe(Xte, cat_cols)

print("Categoricals:", cat_cols)
print("Residual NaNs in categoricals (train,test):",
      Xtr[cat_cols].isna().sum().sum(), Xte[cat_cols].isna().sum().sum())



## 2) Model: CatBoostClassifier (GBDT for classification)

Key knobs (kept modest for speed):
- `iterations` (# trees), `learning_rate` (step size), `depth` (tree depth)  
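- `eval_metric='AUC'` because it measures **ranking quality** independently of the decision threshold, which is useful when classes are imbalanced (about 38% of passengers survived). It says nothing about calibration.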
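
To make the "AUC measures ranking" point concrete, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs the scores order correctly. The helper `pairwise_auc` and the labels and scores are hypothetical, made up for illustration. On real data the result should agree with `roc_auc_score`, which is much faster.

```python
def pairwise_auc(y_true, scores):
    """AUC = P(random positive scores higher than random negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores, made up for illustration
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc_original = pairwise_auc(y_true, scores)
# Cubing the scores changes their values (and any calibration) but not their order
auc_squashed = pairwise_auc(y_true, [s ** 3 for s in scores])
print(auc_original, auc_squashed)
```

Both numbers are identical: any order-preserving transform of the scores leaves AUC unchanged, which is why a high AUC alone does not show that the probabilities are well calibrated.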


In [None]:

params = dict(
    loss_function='Logloss',
    eval_metric='AUC',
    iterations=600,
    learning_rate=0.05,
    depth=6,
    random_seed=42,
    verbose=False
)

model = CatBoostClassifier(**params)

train_pool = Pool(Xtr, ytr, cat_features=cat_cols)
valid_pool = Pool(Xte, yte, cat_features=cat_cols)

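# Caveat: the test split doubles as the eval_set. CatBoost keeps the best iteration
# on the eval_set by default (use_best_model), so the metrics below are somewhat optimistic.
model.fit(train_pool, eval_set=valid_pool)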

proba = model.predict_proba(valid_pool)[:,1]
pred  = (proba >= 0.5).astype(int)

print(f"AUC:      {roc_auc_score(yte, proba):.3f}")
print(f"Accuracy: {accuracy_score(yte, pred):.3f}")
print(classification_report(yte, pred, digits=3))



## 3) ROC curve — quick visual of ranking quality


In [None]:

RocCurveDisplay.from_predictions(yte, proba)
plt.title("CatBoost ROC: Titanic")
plt.show()



## 4) Why did the model predict that? (SHAP)

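CatBoost can return **SHAP values** (`type='ShapValues'`): per-feature contributions that explain an individual prediction. For a `Logloss` model they are in log-odds units, and together with a base value they sum to the model's raw score.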
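
As a minimal sketch of that additivity, assuming a `Logloss` model whose raw score is in log-odds: summing one row of the SHAP matrix (contributions plus the base value in the last position) and applying the sigmoid should recover the predicted probability. The helper `shap_row_to_probability` and the example row are hypothetical, made up for illustration.

```python
import math

def shap_row_to_probability(shap_row):
    """shap_row: per-feature contributions followed by the base value (last entry)."""
    raw_score = sum(shap_row)                    # contributions + base = log-odds
    return 1.0 / (1.0 + math.exp(-raw_score))    # sigmoid maps log-odds to probability

# Made-up row for illustration: three feature contributions, then the base value
example_row = [1.2, -0.4, 0.3, -0.5]
p_example = shap_row_to_probability(example_row)
print(f"{p_example:.3f}")
```

Once `sv` is computed in the cell below, `shap_row_to_probability(sv[row_idx])` should come out very close to `p_row`. This is a quick sanity check that the explanation and the prediction refer to the same model output.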


In [None]:

# SHAP values for validation set (last column is expected value / base)
sv = model.get_feature_importance(valid_pool, type='ShapValues')

row_idx = 0  # try different rows
shap_vals = sv[row_idx, :-1]
base = sv[row_idx, -1]

row = Xte.iloc[row_idx]
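# Double brackets keep a one-row DataFrame with the original dtypes
# (a transposed Series would turn every column into object dtype).
p_row = model.predict_proba(Xte.iloc[[row_idx]])[0, 1]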

contrib = pd.DataFrame({
    'feature': Xtr.columns,
    'value': [row[c] for c in Xtr.columns],
    'shap': shap_vals
}).sort_values('shap', ascending=False)

display(HTML(f"<h4>Passenger preview: predicted survival p = {p_row:.3f}</h4>"))
display(row.to_frame().T)
# Only 11 features, so show the top and bottom 5 to avoid overlapping tables
display(contrib.head(5).style.format({'shap':'{:.3f}'}).set_caption("Most positive contributions ↑"))
display(contrib.tail(5).style.format({'shap':'{:.3f}'}).set_caption("Most negative contributions ↓"))



## 5) What‑If Survival Simulator (interactive)

Adjust inputs and see the **probability** update instantly.  
This mirrors the training schema & preprocessing (categoricals ⇒ strings with `'NA'`).
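
Several Titanic columns are derived from others, so a what-if row is only realistic if those derived fields stay consistent with the raw inputs. The sketch below assumes the Seaborn dataset builds them as follows: `who` is `'child'` under age 16 regardless of sex, `adult_male` is `who == 'man'`, and `alone` means no siblings, spouses, parents, or children aboard. The helper `derive_fields` is hypothetical, and the rules are worth checking against the data before relying on them.

```python
def derive_fields(sex, age, sibsp, parch):
    """Rebuild who / adult_male / alone from raw inputs; age=None means unknown."""
    # Assumed rule: an unknown age is treated as adult
    is_child = age is not None and age < 16
    who = 'child' if is_child else ('woman' if sex == 'female' else 'man')
    return {
        'who': who,
        'adult_male': who == 'man',
        'alone': (sibsp + parch) == 0,
    }

print(derive_fields('female', 8, 1, 2))
print(derive_fields('male', None, 0, 0))
```

A quick way to test the assumed rules is to apply `derive_fields` to each row of the raw data and compare the result with the stored `who`, `adult_male`, and `alone` columns.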


In [None]:

pclass_to_class = {1:'First', 2:'Second', 3:'Third'}

def predict_live(pclass, sex, age, sibsp, parch, fare, embarked, alone):
    age_known = age >= 0  # the slider uses -1 to mean "age unknown"
    # Mirror the dataset's derived columns: under 16 is 'child' regardless of sex;
    # an unknown age is treated as adult.
    is_child = age_known and age < 16
    who = 'child' if is_child else ('woman' if sex == 'female' else 'man')
    row = {
        'pclass': pclass,
        'sex': sex,
        'age': float(age) if age_known else np.nan,  # NaN keeps the column numeric
        'sibsp': sibsp,
        'parch': parch,
        'fare': float(fare),
        'embarked': embarked,
        'class': pclass_to_class[pclass],
        'who': who,
        'adult_male': who == 'man',
        'alone': bool(alone)
    }
    row_df = make_cats_safe(pd.DataFrame([row]), cat_cols)
    p = model.predict_proba(row_df)[0, 1]
    html = f"""
    <div style='padding:8px;border-left:6px solid #0a7; background:#eefaf5; width:max-content'>
      <b>Predicted survival probability:</b> {p:.3f}
    </div>"""
    display(HTML(html))

W.interact(
    predict_live,
    pclass=W.IntSlider(value=2, min=1, max=3, step=1, description='pclass'),
    sex=W.Dropdown(options=['male','female'], value='female', description='sex'),
    age=W.FloatSlider(value=28, min=-1, max=80, step=1, description='age (-1=NA)'),
    sibsp=W.IntSlider(value=0, min=0, max=6, step=1, description='sib/spouse'),
    parch=W.IntSlider(value=0, min=0, max=6, step=1, description='par/child'),
    fare=W.FloatLogSlider(value=32, base=10, min=0, max=3, step=0.01, description='fare'),
    embarked=W.Dropdown(options=sorted(X['embarked'].dropna().unique()), value='S', description='embarked'),
    alone=W.Checkbox(value=True, description='alone')
);



## 6) Takeaways

- **GBDT** = many small trees learned sequentially; each tree corrects prior errors.  
- **CatBoost** shines on **categoricals** & **messy real‑world data** with minimal prep.  
- You trained, evaluated (AUC/ROC), **explained** with SHAP, and ran **what‑if** scenarios.

**Next steps:** try monotonic constraints (e.g., higher `fare` ⇒ non‑decreasing survival), add text features, or tune `iterations/learning_rate/depth`.
