# 03: CatBoost Training & Tuning

> **Objective:** To train and tune a CatBoost classifier for overqualification prediction, using validation feedback and (optionally) grid search over key hyperparameters to approach leaderboard-level accuracy.

This notebook covers:
1. [**Data preparation**](#data-preparation): load, clean, split, and define categorical indices  
2. [**Baseline training**](#baseline-training): single train/val split and full fit  
3. [**Cross-validation**](#cross-validation): stratified K-fold and mean accuracy  
4. [**Hyperparameter tuning**](#hyperparameter-tuning): grid search over `depth`, `learning_rate`, `l2_leaf_reg`  
5. [**Final model**](#final-model): retrain on full training set with chosen parameters

### 🧠 Context

The hackathon evaluation metric was **accuracy** on a held-out test set (Public/Private leaderboard). We use a **train/validation split** and **stratified K-fold cross-validation** to estimate generalization and to detect overfitting before submitting. CatBoost handles categorical features natively and supports **early stopping** on a validation set.
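
To make "stratified" concrete, here is a minimal standard-library sketch of how samples can be assigned to folds so that each fold keeps roughly the same class ratio as the full dataset. The function name `stratified_folds` and the toy 60/40 labels are illustrative assumptions; the notebook itself delegates fold construction to `src.evaluate.run_validation`, whose implementation is not shown here.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Assign sample indices to folds, spreading each class evenly (hypothetical helper)."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices out round-robin.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy labels (illustrative only): 60 negatives, 40 positives.
toy_labels = [0] * 60 + [1] * 40
toy_folds = stratified_folds(toy_labels, n_folds=5)

for k, fold in enumerate(toy_folds):
    counts = Counter(toy_labels[i] for i in fold)
    print(f"fold {k}: size={len(fold)} positives={counts[1]}")
```

Each toy fold ends up with 20 samples, 8 of them positive, which preserves the overall 40% positive rate. A plain (unstratified) shuffle would only match that rate on average.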

---
### 🧰 Imports

In [1]:
import sys
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

sys.path.insert(0, str(Path().resolve().parent))

from src.config import VAL_SIZE, RANDOM_STATE, N_FOLDS
from src.data import load_train, split_X_y, get_train_val_split
from src.preprocess import clean
from src.features import add_features, get_categorical_feature_names
from src.model import build_model
from src.evaluate import run_validation
from src.hyperparameter_tuning import grid_search_cv

### 📥 Data Preparation <a id="data-preparation"></a>

In [2]:
df = load_train()
df = clean(df)
df = add_features(df)

X, y = split_X_y(df, target_col="overqualified")
y = y.astype(int)

cat_names = [c for c in get_categorical_feature_names() if c in X.columns]
cat_indices = [i for i, c in enumerate(X.columns) if c in cat_names]

print("Feature matrix shape:", X.shape)
print("Categorical feature indices:", cat_indices[:5], "...")
print("Target distribution:", y.value_counts().to_dict())

Feature matrix shape: (7709, 23)
Categorical feature indices: [0, 1, 2, 3, 4] ...
Target distribution: {0: 4745, 1: 2964}
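
Because the metric is accuracy and the classes are imbalanced, it helps to know what a trivial "always predict the majority class" model would score. The sketch below derives that reference point from the target distribution printed above; the variable names are illustrative.

```python
# Class counts copied from the "Target distribution" output above.
class_counts = {0: 4745, 1: 2964}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
majority_accuracy = class_counts[majority_class] / total

print("Total samples:", total)
print("Majority class:", majority_class)
print("Majority-class accuracy:", round(majority_accuracy, 4))
```

Any validation or CV accuracy reported in this notebook can be read against this roughly 0.6155 floor.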


### 🏃 Baseline Training <a id="baseline-training"></a>

Single train/validation split; CatBoost with early stopping on the validation set.

In [3]:
train_df, val_df = get_train_val_split(df, val_size=VAL_SIZE, random_state=RANDOM_STATE)
X_train, y_train = split_X_y(train_df, target_col="overqualified")
X_val, y_val = split_X_y(val_df, target_col="overqualified")
y_train = y_train.astype(int)
y_val = y_val.astype(int)

model = build_model(iterations=500, learning_rate=0.05, depth=6, early_stopping_rounds=20)
model.fit(X_train, y_train, cat_features=cat_indices, eval_set=(X_val, y_val))

val_pred = model.predict(X_val)
print("Validation accuracy:", round(accuracy_score(y_val, val_pred), 4))

0:	learn: 0.6246149	test: 0.6115435	best: 0.6115435 (0)	total: 64.7ms	remaining: 32.3s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.6640726329
bestIteration = 9

Shrink model to first 10 iterations.
Validation accuracy: 0.6641
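
The log above shows training halting after 20 iterations without improvement and the model being shrunk back to its best iteration. The sketch below illustrates that patience rule in plain Python. It is a simplified illustration of the idea, not CatBoost's internal implementation, and the validation scores are made-up values chosen only to exercise the logic.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric.

    Training "stops" once `patience` iterations have passed since the best
    score so far; the model would then be truncated to `best_index + 1` trees.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    # Never triggered: all iterations were used.
    return best_index, len(scores) - 1


# Hypothetical validation accuracies per iteration (not from the real run).
val_scores = [0.61, 0.63, 0.66, 0.65, 0.64, 0.66, 0.65]
best_idx, stop_idx = best_iteration_with_patience(val_scores, patience=3)
print("best iteration:", best_idx, "stopped at:", stop_idx)
```

In the real run above the same pattern appears with a patience of 20: the best validation score occurs at iteration 9 and the model keeps its first 10 iterations.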


### 📊 Cross-Validation <a id="cross-validation"></a>

Stratified 5-fold CV to estimate mean accuracy and its spread (standard deviation) across folds.

In [4]:
cv_results = run_validation(
    X, y,
    model=None,
    n_folds=N_FOLDS,
    random_state=RANDOM_STATE,
    cat_indices=cat_indices,
    iterations=500,
    learning_rate=0.05,
    depth=6,
    early_stopping_rounds=20,
)
print(f"CV accuracy: {cv_results['mean_accuracy']:.4f} ± {cv_results['std_accuracy']:.4f}")
print("Fold accuracies:", [round(a, 4) for a in cv_results["fold_accuracies"]])

CV accuracy: 0.6640 ± 0.0053
Fold accuracies: [0.6719, 0.6608, 0.6608, 0.6686, 0.658]
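
The summary line can be re-derived from the per-fold accuracies. The sketch below recomputes the mean and both standard-deviation variants from the rounded fold values printed above. How `run_validation` computes `std_accuracy` is not shown in this notebook, so treating it as a population standard deviation is an inference from the numbers, not a statement about the source code.

```python
from statistics import mean, pstdev, stdev

# Fold accuracies copied from the output above (already rounded to 4 decimals).
fold_accuracies = [0.6719, 0.6608, 0.6608, 0.6686, 0.658]

mean_accuracy = mean(fold_accuracies)
population_std = pstdev(fold_accuracies)  # divides by n
sample_std = stdev(fold_accuracies)       # divides by n - 1

print(f"mean={mean_accuracy:.4f}")
print(f"population std={population_std:.4f}")
print(f"sample std={sample_std:.4f}")
```

The reported 0.0053 matches the population form (divide by n). The sample form (divide by n - 1) gives about 0.0059, a slightly more conservative spread estimate with only five folds.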


### 🔧 Hyperparameter Tuning <a id="hyperparameter-tuning"></a>

Optional grid search over `depth`, `learning_rate`, and `l2_leaf_reg`. (Small grid to keep runtime reasonable.)

In [5]:
param_grid = {
    "depth": [5, 6, 7],
    "learning_rate": [0.03, 0.05, 0.08],
    "l2_leaf_reg": [2.0, 3.0, 5.0],
}

# Reduce grid or n_folds for faster run (e.g. 2 folds, 1 value per param for a quick test)
tuning_results = grid_search_cv(
    X, y,
    param_grid=param_grid,
    cat_indices=cat_indices,
    n_folds=3,
    random_state=RANDOM_STATE,
    early_stopping_rounds=15,
)

best = max(tuning_results, key=lambda x: x["mean_accuracy"])
print("Best params:", best["params"])
print("Best CV accuracy:", round(best["mean_accuracy"], 4))

Best params: {'depth': 7, 'learning_rate': 0.08, 'l2_leaf_reg': 5.0}
Best CV accuracy: 0.6771
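
To gauge the cost of the search, the sketch below expands the same `param_grid` into explicit parameter combinations and counts the model fits implied by 3-fold CV. It shows the general grid-expansion technique with `itertools.product`; whether `grid_search_cv` enumerates the grid this way internally is an assumption.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "depth": [5, 6, 7],
    "learning_rate": [0.03, 0.05, 0.08],
    "l2_leaf_reg": [2.0, 3.0, 5.0],
}
n_folds = 3

# Cartesian product of all value lists, one dict per combination.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(combinations) * n_folds
print("Combinations:", len(combinations))
print("Model fits with CV:", n_fits)
print("First combination:", combinations[0])
```

With three values per parameter the grid has 27 combinations, so 3-fold CV requires 81 fits. Adding one more value to any parameter raises that to 108, which is why the grid is kept small.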


### ✅ Final Model <a id="final-model"></a>

Retrain on the **full** training set with the chosen hyperparameters (or defaults). The same logic is run by `python3 -m src.train` from the project root, which also saves the model and artifacts for `predict.py`.

In [6]:
# Use best params from tuning if available; otherwise defaults
try:
    final_params = best["params"]
except NameError:
    final_params = {"depth": 6, "learning_rate": 0.05, "l2_leaf_reg": 3.0}

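# No eval_set is passed to fit() below, so early_stopping_rounds has no effect
# here and all 500 iterations run (as the training log shows).
final_model = build_model(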
    iterations=500,
    early_stopping_rounds=20,
    random_seed=RANDOM_STATE,
    **final_params,
)
final_model.fit(X, y, cat_features=cat_indices)

print("Final model trained on full training set.")
print("To save and generate submission: run from terminal: python3 -m src.train && python3 -m src.predict")

0:	learn: 0.6488520	total: 9.3ms	remaining: 4.64s
100:	learn: 0.7314827	total: 1.2s	remaining: 4.73s
200:	learn: 0.7666364	total: 2.36s	remaining: 3.51s
300:	learn: 0.8002335	total: 3.44s	remaining: 2.27s
400:	learn: 0.8264366	total: 4.47s	remaining: 1.1s
499:	learn: 0.8516020	total: 5.5s	remaining: 0us
Final model trained on full training set.
To save and generate submission: run from terminal: python3 -m src.train && python3 -m src.predict


---
## 📝 Summary

We trained a CatBoost classifier with early stopping and optional grid search. Validation and CV accuracy guide parameter choice; the production pipeline in `src/train.py` retrains on the full training data and saves the model for submission generation.

**Next step:** `04_evaluation_interpretability.ipynb`, which covers feature importance and model interpretability.