# Diabetes Prediction: Kaggle Playground Series S5E12

**TL;DR (Too Long; Didn't Read):** This notebook predicts whether a person has diabetes based on health metrics. We use an ensemble of three gradient boosting models. The key insight: only the last 22,000 training samples match the test distribution.

## How This Notebook Was Created

This entire project was built using [Claude Code](https://claude.com/claude-code), Anthropic's command-line interface for Claude. Here's what that process looked like:

1. We asked Claude to find a good Kaggle competition
2. Claude searched active competitions and recommended this one (tabular data, good for gradient boosting)
3. We downloaded the data using the Kaggle CLI (Command Line Interface)
4. Claude read the competition discussion forums (via Chrome extension) and extracted key insights
5. Claude wrote Python scripts, ran experiments, and iterated based on results
6. We submitted predictions directly from the terminal

Everything below is the actual approach we used. No cherry-picking, no hiding failures.

---

## What Is This Competition About?

Kaggle hosts machine learning competitions. This one is part of the "Playground Series" - monthly competitions with synthetic (artificially generated) datasets. They're designed for learning and practice.

**The Task:** Given health measurements for a person, predict whether they have been diagnosed with diabetes.

**The Data:**
- 700,000 training examples (people with known diabetes status)
- 300,000 test examples (people we need to predict)
- 24 features (measurements about each person)

**The Metric:** AUC-ROC (Area Under the Receiver Operating Characteristic Curve). This measures how well we can rank people by their likelihood of having diabetes. 1.0 = perfect, 0.5 = random guessing.
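
To make the "ranking" interpretation concrete, here is a minimal sketch using scikit-learn's `roc_auc_score` on made-up labels and scores (not competition data). AUC equals the fraction of positive/negative pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = [0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.8, 0.9]  # every positive outranks every negative
mixed_scores = [0.1, 0.4, 0.35, 0.8]   # one positive ranks below one negative

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)

print(f"Perfect ranking:  {auc_perfect:.2f}")  # 1.00
print(f"One swapped pair: {auc_mixed:.2f}")    # 0.75 (3 of 4 pairs ordered correctly)
```

Only the order of the scores matters, so rescaling predictions does not change the AUC.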

## The Critical Insight: Distribution Shift

From reading the competition discussion forums, we learned something important:

> The first ~678,000 training samples have a **different statistical distribution** than the test set. Only the **last ~22,000 training samples** match the test distribution.

What does this mean in plain English?

Imagine you're training to predict house prices in San Francisco, but 97% of your training data is actually from rural Kansas. If you validate your model on that Kansas data, you'll get misleading results. You need to validate on San Francisco data.

That's exactly what's happening here. Whether deliberately or accidentally, the competition organizers produced a training set in which most rows don't represent the test set. If we use all 700,000 samples for cross-validation, we get AUC scores around 0.727, but the same models score only ~0.696 on the leaderboard.

**The fix:** Use only the last 22,000 samples for validation. This gives us scores that actually predict leaderboard performance.
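
One quick way to check a claim like this yourself is to compare feature means between the tail of the training set and everything before it. The sketch below is an assumption-laden illustration, not the method used in the forum posts: `head_tail_shift` is a hypothetical helper, and the data is synthetic, with a shift deliberately planted in one column.

```python
import numpy as np
import pandas as pd


def head_tail_shift(df: pd.DataFrame, tail_size: int) -> pd.Series:
    """Difference in column means (tail minus head), in units of the overall std."""
    head, tail = df.iloc[:-tail_size], df.iloc[-tail_size:]
    overall_std = df.std(ddof=0).replace(0, np.nan)  # avoid dividing by zero
    diff = (tail.mean() - head.mean()) / overall_std
    return diff.sort_values(key=np.abs, ascending=False)


# Synthetic stand-in: "bmi" shifts in the last 200 rows, "age" does not
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(50, 10, size=1000),
    "bmi": np.concatenate([rng.normal(25, 3, size=800), rng.normal(29, 3, size=200)]),
})

shift = head_tail_shift(toy, tail_size=200)
print(shift)  # the shifted column appears first
```

On the real data this would be called on the numeric columns of `train` with `tail_size=22000`. Large values flag columns whose tail looks different from the rest.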

## Setup

First, let's install the packages we need and load the data.

In [None]:
# Install required packages (run once)
# !pip install lightgbm xgboost catboost scikit-learn pandas numpy

In [None]:
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import catboost as cb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import Ridge

print(f"LightGBM: {lgb.__version__}")
print(f"XGBoost: {xgb.__version__}")
print(f"CatBoost: {cb.__version__}")

In [None]:
# Load data
# If running on Kaggle, the data is at /kaggle/input/playground-series-s5e12/
# If running locally, adjust the path

DATA_PATH = "/kaggle/input/playground-series-s5e12/"
# DATA_PATH = "../data/playground-series-s5e12/"  # For local runs

train = pd.read_csv(f"{DATA_PATH}train.csv")
test = pd.read_csv(f"{DATA_PATH}test.csv")
sample_sub = pd.read_csv(f"{DATA_PATH}sample_submission.csv")

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

## Understanding the Features

Let's look at what information we have about each person.

In [None]:
# Show first few rows
train.head()

In [None]:
# Feature types
print("Columns and their types:")
print(train.dtypes)

**Feature Categories:**

1. **Numerical (15 features):** Things we can measure with numbers
   - `age` - How old the person is
   - `bmi` - Body Mass Index (weight relative to height)
   - `systolic_bp`, `diastolic_bp` - Blood pressure measurements
   - `cholesterol_total`, `hdl_cholesterol`, `ldl_cholesterol` - Cholesterol levels in the blood
   - `triglycerides` - Another type of blood fat
   - Various lifestyle metrics (alcohol, activity, sleep, screen time, diet)

2. **Categorical (6 features):** Categories/labels
   - `gender`, `ethnicity`, `education_level`, `income_level`
   - `smoking_status`, `employment_status`

3. **Binary (3 features):** Yes/No flags
   - `family_history_diabetes` - Does diabetes run in the family?
   - `hypertension_history` - History of high blood pressure?
   - `cardiovascular_history` - History of heart disease?

4. **Target:** `diagnosed_diabetes` - What we're predicting (0 = no, 1 = yes)

## Preprocessing

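Most machine learning models need numeric inputs, not text, so we convert categorical features (like "Male"/"Female") into numbers (like 0/1).

We use label encoding: each category gets a unique integer. The integers imply an arbitrary order, which tree-based models generally tolerate because they split on thresholds rather than treating the values as quantities.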
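
As a small illustration of what the encoder does, the sketch below runs `LabelEncoder` on a made-up column. The category values are assumptions chosen for the example and may not match the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration; the real columns may contain other categories
smoking = pd.Series(["Never", "Current", "Former", "Never"])

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(smoking)

print(list(toy_encoder.classes_))  # ['Current', 'Former', 'Never']
print(list(encoded))               # [2, 0, 1, 2]
```

Categories are numbered in sorted order, so the mapping depends only on which values are present, not on the order in which they appear.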

In [None]:
# Configuration
SEED = 42  # For reproducibility
VAL_SIZE = 22000  # Last 22K samples for validation

# Identify columns
target_col = "diagnosed_diabetes"
id_col = "id"
feature_cols = [c for c in train.columns if c not in [target_col, id_col]]
cat_cols = train[feature_cols].select_dtypes(include=["object"]).columns.tolist()

print(f"Categorical columns: {cat_cols}")

In [None]:
# Label encode categorical features
for col in cat_cols:
    le = LabelEncoder()
    # Fit on both train and test to handle all possible values
    combined = pd.concat([train[col], test[col]], axis=0).astype(str)
    le.fit(combined)
    train[col] = le.transform(train[col].astype(str))
    test[col] = le.transform(test[col].astype(str))

# Prepare data
X = train[feature_cols]
y = train[target_col]
X_test = test[feature_cols]
test_ids = test[id_col]

print(f"Features shape: {X.shape}")
print(f"Target distribution: {y.value_counts(normalize=True).to_dict()}")

## Validation Strategy

This is the most important part. We split the data so the last 22,000 samples become our validation set.

In [None]:
# Split: last 22K for validation
X_train = X.iloc[:-VAL_SIZE]
y_train = y.iloc[:-VAL_SIZE]
X_val = X.iloc[-VAL_SIZE:]
y_val = y.iloc[-VAL_SIZE:]

print(f"Training set: {len(X_train)} samples")
print(f"Validation set: {len(X_val)} samples")
print("\nTarget rates:")
print(f"  Training: {y_train.mean():.4f}")
print(f"  Validation: {y_val.mean():.4f}")

## Model Training

We train three different gradient boosting models and combine them:

1. **LightGBM (Light Gradient Boosting Machine)** - Fast, memory-efficient
2. **XGBoost (eXtreme Gradient Boosting)** - The classic choice
3. **CatBoost (Categorical Boosting)** - Handles categories well

### What is Gradient Boosting?

Gradient boosting builds many simple decision trees, where each new tree tries to fix the mistakes of the previous trees. It's like having a team of experts where each one specializes in the cases the others got wrong.
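
The idea can be sketched in a few lines. This is a simplified toy for regression with squared error, where "mistakes" are just residuals. It is not how LightGBM, XGBoost, or CatBoost are implemented; they use log loss for classification and many additional optimizations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, invented for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=500)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant guess
mse_start = np.mean((y_toy - prediction) ** 2)

for _ in range(100):
    residual = y_toy - prediction  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residual)      # the new tree learns those mistakes
    prediction += learning_rate * tree.predict(X_toy)  # take a small corrective step

mse_end = np.mean((y_toy - prediction) ** 2)
print(f"Training MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

The `learning_rate` here plays the same role as in the parameter dictionaries used for the three models: smaller steps need more trees but tend to generalize better.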

In [None]:
# LightGBM
print("Training LightGBM...")

lgb_params = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.03,
    "num_leaves": 63,
    "max_depth": 8,
    "min_child_samples": 50,
    "subsample": 0.8,
    "subsample_freq": 1,  # LightGBM ignores subsample unless subsample_freq > 0
    "colsample_bytree": 0.8,
    "n_estimators": 2000,
    "random_state": SEED,
    "n_jobs": -1,
    "verbose": -1,
}

lgb_model = lgb.LGBMClassifier(**lgb_params)
lgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(200)],
)

lgb_val_pred = lgb_model.predict_proba(X_val)[:, 1]
lgb_test_pred = lgb_model.predict_proba(X_test)[:, 1]
print(f"LightGBM Val AUC: {roc_auc_score(y_val, lgb_val_pred):.5f}")

In [None]:
# XGBoost
print("Training XGBoost...")

# With enable_categorical=True, XGBoost expects categorical columns
# to use the pandas "category" dtype
X_train_xgb = X_train.copy()
X_val_xgb = X_val.copy()
X_test_xgb = X_test.copy()

for col in cat_cols:
    X_train_xgb[col] = X_train_xgb[col].astype("category")
    X_val_xgb[col] = X_val_xgb[col].astype("category")
    X_test_xgb[col] = X_test_xgb[col].astype("category")

xgb_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "learning_rate": 0.03,
    "max_depth": 8,
    "min_child_weight": 50,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "n_estimators": 2000,
    "early_stopping_rounds": 100,  # Match the other two models (needs XGBoost >= 1.6)
    "random_state": SEED,
    "enable_categorical": True,
    "tree_method": "hist",
}

xgb_model = xgb.XGBClassifier(**xgb_params)
xgb_model.fit(X_train_xgb, y_train, eval_set=[(X_val_xgb, y_val)], verbose=200)

xgb_val_pred = xgb_model.predict_proba(X_val_xgb)[:, 1]
xgb_test_pred = xgb_model.predict_proba(X_test_xgb)[:, 1]
print(f"XGBoost Val AUC: {roc_auc_score(y_val, xgb_val_pred):.5f}")

In [None]:
# CatBoost
print("Training CatBoost...")

cat_indices = [X_train.columns.get_loc(c) for c in cat_cols]

cb_params = {
    "objective": "Logloss",
    "eval_metric": "AUC",
    "learning_rate": 0.03,
    "depth": 8,
    "iterations": 2000,
    "random_seed": SEED,
    "verbose": 200,
    "early_stopping_rounds": 100,
}

cb_model = cb.CatBoostClassifier(**cb_params)
cb_model.fit(X_train, y_train, eval_set=(X_val, y_val), cat_features=cat_indices)

cb_val_pred = cb_model.predict_proba(X_val)[:, 1]
cb_test_pred = cb_model.predict_proba(X_test)[:, 1]
print(f"CatBoost Val AUC: {roc_auc_score(y_val, cb_val_pred):.5f}")

## Ensembling

Combining multiple models usually works better than any single model. We fit a Ridge regression on the three models' validation predictions and use its coefficients, clipped to be non-negative and normalized, as blending weights.

### Why does ensembling work?

Different models make different mistakes. By averaging their predictions, errors tend to cancel out. It's like asking three doctors for a diagnosis instead of one.
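
A toy simulation shows the effect. It assumes three hypothetical models whose errors are fully independent, which is the best case; real gradient boosting models trained on the same data make correlated errors, so the gain is usually smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 1, size=10_000)

# Three hypothetical models: same underlying signal, independent noise
model_preds = [truth + rng.normal(0, 0.2, size=truth.size) for _ in range(3)]
average_pred = np.mean(model_preds, axis=0)

single_mse = [np.mean((p - truth) ** 2) for p in model_preds]
ensemble_mse = np.mean((average_pred - truth) ** 2)

print("Single-model MSE:", [round(float(m), 4) for m in single_mse])
print("Averaged MSE:    ", round(float(ensemble_mse), 4))
```

With independent noise, averaging three models cuts the error variance to roughly a third of a single model's.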

In [None]:
# Find optimal weights using Ridge regression
val_preds = np.column_stack([lgb_val_pred, xgb_val_pred, cb_val_pred])
test_preds = np.column_stack([lgb_test_pred, xgb_test_pred, cb_test_pred])

ridge = Ridge(alpha=1.0)
ridge.fit(val_preds, y_val)

weights = np.maximum(ridge.coef_, 0)  # Keep non-negative
if weights.sum() == 0:  # Every coefficient was clipped: fall back to a plain average
    weights = np.ones_like(weights)
weights = weights / weights.sum()  # Normalize to sum to 1

print("Optimal weights:")
print(f"  LightGBM: {weights[0]:.4f}")
print(f"  XGBoost:  {weights[1]:.4f}")
print(f"  CatBoost: {weights[2]:.4f}")

In [None]:
# Create ensemble predictions
ensemble_val = val_preds @ weights
ensemble_test = test_preds @ weights

# Compare results
print("\nResults (Validation on last 22K samples):")
print(f"  LightGBM:  {roc_auc_score(y_val, lgb_val_pred):.5f}")
print(f"  XGBoost:   {roc_auc_score(y_val, xgb_val_pred):.5f}")
print(f"  CatBoost:  {roc_auc_score(y_val, cb_val_pred):.5f}")
print(f"  Ensemble:  {roc_auc_score(y_val, ensemble_val):.5f}")

## Create Submission

In [None]:
submission = pd.DataFrame({
    "id": test_ids,
    "diagnosed_diabetes": ensemble_test,
})

submission.to_csv("submission.csv", index=False)
print("Submission saved!")
submission.head()

## What We Learned

1. **Distribution shift matters.** Using the wrong validation set gave us misleading scores (0.727 vs 0.696 on leaderboard).

2. **Read the discussions.** The insight about the last 22K samples came from community members analyzing the data.

3. **Ensembles help.** Combining three gradient boosting models improved over any single model.

4. **Simple approaches work.** We didn't need complex feature engineering. The key was getting the validation strategy right.

## What We'd Try Next

- Add neural network models (TabNet, FT-Transformer) for more diversity
- Use the original dataset (the one the synthetic data was generated from) to build target-encoding features
- Weighted sample refit (give higher weight to last 22K samples)
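
The weighted refit idea could be sketched as follows. `tail_sample_weights` is a hypothetical helper, and the weight of 5.0 is an arbitrary placeholder rather than a tuned value.

```python
import numpy as np


def tail_sample_weights(n_rows: int, tail_size: int, tail_weight: float) -> np.ndarray:
    """Weight 1.0 for early rows and `tail_weight` for the last `tail_size` rows."""
    weights = np.ones(n_rows, dtype=float)
    if tail_size > 0:  # weights[-0:] would select every row
        weights[-tail_size:] = tail_weight
    return weights


# Toy sizes; for this notebook it would be n_rows=len(X), tail_size=VAL_SIZE
w = tail_sample_weights(n_rows=10, tail_size=3, tail_weight=5.0)
print(w)  # [1. 1. 1. 1. 1. 1. 1. 5. 5. 5.]

# The scikit-learn style fit methods of all three libraries accept sample_weight, e.g.
# lgb_model.fit(X, y, sample_weight=tail_sample_weights(len(X), VAL_SIZE, 5.0))
```

Refitting on all rows leaves no held-out data that matches the test distribution, so the number of boosting rounds would have to be fixed in advance, for example from the early-stopped runs above.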

---

*This notebook was created with Claude Code. The full project is at: https://github.com/bedwards/caggle*