# CatBoost-Boosted Neural Net (tuned)


## Import libraries

Load all required packages for data processing (pandas, numpy), machine learning (sklearn), gradient boosting (CatBoost), neural networks (TensorFlow/Keras), and evaluation metrics. Set random seeds for reproducibility.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix, f1_score, balanced_accuracy_score
from catboost import CatBoostClassifier

RANDOM_SEED = 42
tf.random.set_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

## Load dataset and prepare features

Read the filtered NFL dataset, convert play_type labels ('run'→0, 'pass'→1), remove the play_id column if present, separate features (X_raw) from target (y), and identify categorical columns for special handling by CatBoost.

In [2]:
df = pd.read_csv('../dataset/nfl_filtered.csv')
df['play_type'] = df['play_type'].map({'run': 0, 'pass': 1}).astype(int)

if 'play_id' in df.columns:
    df = df.drop(columns=['play_id'])

X_raw = df.drop(columns=['play_type'])
y = df['play_type']
categorical_cols = X_raw.select_dtypes(include=['object']).columns.tolist()
print(f'Rows: {len(df)}, features: {X_raw.shape[1]}')
print(f'Categorical cols: {categorical_cols}')

Rows: 318668, features: 19
Categorical cols: ['posteam', 'defteam', 'posteam_type', 'game_half', 'side_of_field']
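
One caveat with the label mapping: `Series.map` yields NaN for any value missing from the dictionary, and the following `astype(int)` then fails with a less helpful error. The sketch below is a guarded variant; `encode_play_type` is a hypothetical helper, not part of the notebook, and it assumes the only valid labels are 'run' and 'pass'.

```python
import pandas as pd

# Hypothetical helper (not in the notebook): map labels and fail loudly
# if any value falls outside the expected vocabulary.
def encode_play_type(series, mapping=None):
    mapping = mapping or {"run": 0, "pass": 1}
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unexpected play_type values: {unmapped}")
    return encoded.astype(int)

toy = pd.Series(["run", "pass", "pass", "run"])
encoded_toy = encode_play_type(toy)
print(encoded_toy.tolist())
```

On the filtered dataset this check should be a no-op, but it names the offending labels if the filter ever changes.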


## Split data into train/validation/test sets (70/15/15)

Perform stratified split to maintain class balance: first split 70% train and 30% temp, then split temp equally into 15% validation and 15% test. Map categorical column names to their integer indices for CatBoost's categorical feature handling.

In [3]:
X_train_raw, X_temp_raw, y_train, y_temp = train_test_split(
    X_raw, y, test_size=0.3, random_state=RANDOM_SEED, stratify=y
)
X_val_raw, X_test_raw, y_val, y_test = train_test_split(
    X_temp_raw, y_temp, test_size=0.5, random_state=RANDOM_SEED, stratify=y_temp
)

cat_feature_indices = [X_train_raw.columns.get_loc(col) for col in categorical_cols]
print('Splits ->', X_train_raw.shape, X_val_raw.shape, X_test_raw.shape)

Splits -> (223067, 19) (47800, 19) (47801, 19)
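
To see what `stratify` buys, the following sketch repeats the same two-step split on toy data and compares the positive rate in each part. The data are synthetic (about 58% positives, loosely matching the pass share in the test set), and all variable names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for the real data: 1,000 rows, roughly 58% positives.
y_toy = (rng.random(1000) < 0.58).astype(int)
X_toy = rng.normal(size=(1000, 3))

# Same 70/30 then 50/50 scheme as the cell above
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp
)

rates = {name: float(part.mean()) for name, part in
         [("train", y_tr), ("val", y_va), ("test", y_te)]}
print(rates)
```

The three rates should agree to within rounding, which is the property the stratified split is meant to guarantee.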


## Hyperparameter search for CatBoost

Loop through a grid of CatBoost configurations (tree depth, learning rate, iterations, and regularization parameters). Train each configuration on the training set with early stopping on validation AUC, and keep the model with the highest validation AUC as the base learner. The grid below holds only the configuration retained after the depth comparison.

Best train AUC by depth:

- 4: 0.8074
- 6: 0.8084
- 8: 0.8088 (highest, so depth 8 is kept)

In [None]:
cat_param_grid = [
    {"depth": 8, "learning_rate": 0.04, "iterations": 2000, "l2_leaf_reg": 7, "bagging_temperature": 1.2, "random_strength": 2.0}
] 

best_cat_model = None
best_params = None
best_val_auc = -np.inf

for i, params in enumerate(cat_param_grid, 1):
    print(f"Training CatBoost config {i}/{len(cat_param_grid)}: {params}")
    model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="AUC",
        od_type="Iter",
        od_wait=80,
        random_seed=RANDOM_SEED,
        verbose=100,
        **params,
    )
    model.fit(
        X_train_raw,
        y_train,
        cat_features=cat_feature_indices,
        eval_set=(X_val_raw, y_val),
        use_best_model=True,
    )
    val_proba = model.predict_proba(X_val_raw)[:, 1]
    val_auc = roc_auc_score(y_val, val_proba)
    print(f" -> Validation AUC: {val_auc:.4f}\n")
    if val_auc > best_val_auc:
        best_val_auc = val_auc
        best_params = params
        best_cat_model = model

print("Best CatBoost params:", best_params)
print(f"Best validation AUC: {best_val_auc:.4f}")

cat_val_proba = best_cat_model.predict_proba(X_val_raw)[:, 1]
cat_test_proba = best_cat_model.predict_proba(X_test_raw)[:, 1]

cat_val_auc = roc_auc_score(y_val, cat_val_proba)
cat_test_auc = roc_auc_score(y_test, cat_test_proba)
print(f"CatBoost (best) Validation AUC: {cat_val_auc:.4f}")
print(f"CatBoost (best) Test AUC: {cat_test_auc:.4f}")

Training CatBoost config 1/1: {'depth': 8, 'learning_rate': 0.04, 'iterations': 2000, 'l2_leaf_reg': 7, 'bagging_temperature': 1.2, 'random_strength': 2.0}


0:	test: 0.7688212	best: 0.7688212 (0)	total: 138ms	remaining: 4m 35s
100:	test: 0.8006828	best: 0.8006828 (100)	total: 6.14s	remaining: 1m 55s
200:	test: 0.8031612	best: 0.8031612 (200)	total: 11.7s	remaining: 1m 44s
300:	test: 0.8045287	best: 0.8045287 (300)	total: 17.2s	remaining: 1m 37s
400:	test: 0.8066016	best: 0.8066016 (400)	total: 23.4s	remaining: 1m 33s
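
The grid in this cell contains a single configuration. If the depth comparison were to be rerun inside the same loop, one possible way to build the list is scikit-learn's `ParameterGrid`. The search space below is an assumption: only the depths listed above are varied, and every other value is copied from the retained configuration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: only depth varies; the remaining values
# are copied from the single retained configuration.
search_space = {
    "depth": [4, 6, 8],
    "learning_rate": [0.04],
    "iterations": [2000],
    "l2_leaf_reg": [7],
    "bagging_temperature": [1.2],
    "random_strength": [2.0],
}

# Each element is a plain dict, so it can be passed as **params
candidate_grid = list(ParameterGrid(search_space))
for params in candidate_grid:
    print(params)
```

Because each entry is a dict with the same keys as the original grid, `candidate_grid` could be substituted for `cat_param_grid` without changing the loop.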


## Generate out-of-fold (OOF) CatBoost probabilities for stacking

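Use 5-fold cross-validation on the training set: for each fold, train CatBoost on 4 folds and predict on the held-out fold, so that no training row is scored by a model fitted on that row. Then train a final CatBoost model on the full training set to generate validation and test probabilities. This keeps the neural network from seeing predictions made on data CatBoost was trained on. One caveat: early stopping in each fold is evaluated on the held-out fold itself, and the final model early-stops on the validation set, so the out-of-fold and validation probabilities may be mildly optimistic; the test probabilities are not affected.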

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
oof_proba = np.zeros(len(y_train))

for fold, (tr_idx, val_idx) in enumerate(kf.split(X_train_raw), 1):
    X_tr, X_val_fold = X_train_raw.iloc[tr_idx], X_train_raw.iloc[val_idx]
    y_tr, y_val_fold = y_train.iloc[tr_idx], y_train.iloc[val_idx]

    fold_model = CatBoostClassifier(
        loss_function="Logloss",
        eval_metric="AUC",
        od_type="Iter",
        od_wait=80,
        random_seed=RANDOM_SEED + fold,
        verbose=0,
        **best_params,
    )
    fold_model.fit(
        X_tr,
        y_tr,
        cat_features=cat_feature_indices,
        eval_set=(X_val_fold, y_val_fold),
        use_best_model=True,
        verbose=0,
    )
    oof_proba[val_idx] = fold_model.predict_proba(X_val_fold)[:, 1]

# Final CatBoost trained on full training split (still hold out val/test for evaluation)
cat_model_full = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="AUC",
    od_type="Iter",
    od_wait=80,
    random_seed=RANDOM_SEED,
    verbose=100,
    **best_params,
)
cat_model_full.fit(
    X_train_raw,
    y_train,
    cat_features=cat_feature_indices,
    eval_set=(X_val_raw, y_val),
    use_best_model=True,
)

proba_train = oof_proba
proba_val = cat_model_full.predict_proba(X_val_raw)[:, 1]
proba_test = cat_model_full.predict_proba(X_test_raw)[:, 1]

print("OOF stacking ready -> train/val/test probabilities computed.")

0:	test: 0.7688212	best: 0.7688212 (0)	total: 177ms	remaining: 5m 53s
100:	test: 0.8004798	best: 0.8004798 (100)	total: 17.6s	remaining: 5m 31s
200:	test: 0.8029605	best: 0.8029666 (199)	total: 34.1s	remaining: 5m 5s
300:	test: 0.8044308	best: 0.8044308 (300)	total: 51.3s	remaining: 4m 49s
400:	test: 0.8065616	best: 0.8065626 (399)	total: 1m 10s	remaining: 4m 39s
500:	test: 0.8074426	best: 0.8074492 (499)	total: 1m 28s	remaining: 4m 25s
600:	test: 0.8079215	best: 0.8079215 (600)	total: 1m 47s	remaining: 4m 10s
700:	test: 0.8082499	best: 0.8082547 (
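
The out-of-fold mechanics are easier to inspect on a small problem. This sketch swaps CatBoost for scikit-learn's `LogisticRegression` on synthetic data (both substitutions are assumptions made to keep it fast) and initializes the array with NaN so that any row left without a prediction would be visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the training split
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.full(len(y_toy), np.nan)  # NaN marks rows not yet predicted

for tr_idx, held_idx in kf.split(X_toy):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[tr_idx], y_toy[tr_idx])
    # Each row is scored only by a model that never saw it during fitting
    oof[held_idx] = model.predict_proba(X_toy[held_idx])[:, 1]

print("rows without a prediction:", int(np.isnan(oof).sum()))
```

Every row lands in exactly one held-out fold, so the array ends up fully populated with predictions from models that were not fitted on the corresponding row.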

## Encode categorical features for neural network

Convert categorical string columns to integer codes using LabelEncoder (fit on training set only to prevent leakage). Append the CatBoost probability as an additional numeric feature. Separate features into categorical (for embedding layers) and numerical (for dense layers + CatBoost meta-feature).

In [None]:
label_encoders = {}
X_train_enc = X_train_raw.copy()
X_val_enc = X_val_raw.copy()
X_test_enc = X_test_raw.copy()

for col in categorical_cols:
    le = LabelEncoder()
    le.fit(X_train_enc[col].astype(str))
    label_encoders[col] = le
    X_train_enc[col] = le.transform(X_train_enc[col].astype(str))
    X_val_enc[col] = le.transform(X_val_enc[col].astype(str))
    X_test_enc[col] = le.transform(X_test_enc[col].astype(str))

# Append CatBoost probabilities as a numeric meta-feature (OOF for train)
X_train_enc['cat_proba'] = proba_train
X_val_enc['cat_proba'] = proba_val
X_test_enc['cat_proba'] = proba_test

categorical_for_nn = categorical_cols
numerical_for_nn = [c for c in X_train_enc.columns if c not in categorical_for_nn]
print('NN categorical:', categorical_for_nn)
print('NN numerical:', numerical_for_nn)

NN categorical: ['posteam', 'defteam', 'posteam_type', 'game_half', 'side_of_field']
NN numerical: ['yardline_100', 'qtr', 'down', 'ydstogo', 'goal_to_go', 'score_differential', 'drive', 'posteam_timeouts_remaining', 'defteam_timeouts_remaining', 'shotgun', 'no_huddle', 'quarter_seconds_remaining', 'half_seconds_remaining', 'game_seconds_remaining', 'cat_proba']
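
`LabelEncoder.transform` raises a `ValueError` when it meets a label that was absent during `fit`, which can happen if a category appears only in the validation or test split. A minimal sketch of one workaround follows, assuming index 0 is reserved for unknown labels; `build_vocab` and `encode_with_unknown` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

# Hypothetical helpers (not in the notebook): index 0 is reserved for labels
# that were not present when the vocabulary was built.
def build_vocab(train_values):
    return {label: i + 1 for i, label in enumerate(sorted(set(train_values)))}

def encode_with_unknown(values, vocab):
    return np.array([vocab.get(v, 0) for v in values], dtype=np.int64)

vocab = build_vocab(["home", "away", "home"])
codes = encode_with_unknown(["away", "home", "neutral"], vocab)
print(codes)
```

With this scheme the embedding layer would need `input_dim=len(vocab) + 1` to make room for the reserved index.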


## Prepare input tensors for the neural network

Extract categorical features as separate arrays (one per column) for the embedding inputs. Standardize numerical features using StandardScaler (fit on training, transform on validation and test). Combine all inputs into lists ready for the multi-input Keras model.

In [None]:
X_train_cat = [X_train_enc[col].values for col in categorical_for_nn]
X_val_cat = [X_val_enc[col].values for col in categorical_for_nn]
X_test_cat = [X_test_enc[col].values for col in categorical_for_nn]

scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train_enc[numerical_for_nn])
X_val_num = scaler.transform(X_val_enc[numerical_for_nn])
X_test_num = scaler.transform(X_test_enc[numerical_for_nn])

train_inputs = X_train_cat + [X_train_num]
val_inputs = X_val_cat + [X_val_num]
test_inputs = X_test_cat + [X_test_num]

print('Prepared NN inputs:')
print('  categorical tensors:', len(categorical_for_nn))
print('  numeric shape:', X_train_num.shape)

Prepared NN inputs:
  categorical tensors: 5
  numeric shape: (223067, 15)
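
The point of fitting the scaler on the training split only is that validation and test rows are shifted and scaled with training statistics. The sketch below checks this on synthetic arrays (the data and names are illustrative) by reproducing `transform` by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
val = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler().fit(train)
val_scaled = scaler.transform(val)

# Same computation by hand, using the training mean and population std (ddof=0)
manual = (val - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(val_scaled, manual))
```

As a consequence, the validation columns will generally not have exactly zero mean and unit variance, and that is expected.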


## Build the CatBoosted MLP architecture

Create a multi-input neural network: each categorical feature gets its own embedding layer (converting integer codes to dense vectors), then all embeddings are concatenated with the standardized numeric features (including CatBoost probability). Pass through two hidden layers with ReLU activation, dropout for regularization, and L2 weight decay. Output layer uses sigmoid for binary classification. Compile with Adam optimizer, binary cross-entropy loss, and track both accuracy and AUC.

In [None]:
embedding_dims = {
    'posteam': 8,
    'defteam': 8,
    'posteam_type': 2,
    'game_half': 2,
    'side_of_field': 4
}

cat_inputs = []
cat_embeds = []
for col in categorical_for_nn:
    vocab = len(label_encoders[col].classes_)
    dim = embedding_dims.get(col, 4)
    inp = keras.Input(shape=(1,), name=f'{col}_input')
    emb = keras.layers.Embedding(input_dim=vocab, output_dim=dim, name=f'{col}_emb')(inp)
    cat_inputs.append(inp)
    cat_embeds.append(keras.layers.Flatten()(emb))

cat_concat = keras.layers.Concatenate(name='cat_concat')(cat_embeds)
num_input = keras.Input(shape=(X_train_num.shape[1],), name='num_input')

combined = keras.layers.Concatenate(name='features')([cat_concat, num_input])
reg = keras.regularizers.l2(1e-4)

x = keras.layers.Dense(128, activation='relu', kernel_regularizer=reg)(combined)
x = keras.layers.Dropout(0.35)(x)
x = keras.layers.Dense(64, activation='relu', kernel_regularizer=reg)(x)
x = keras.layers.Dropout(0.25)(x)
out = keras.layers.Dense(1, activation='sigmoid')(x)

catnn_model = keras.Model(inputs=cat_inputs + [num_input], outputs=out, name='CatBoosted_MLP')
catnn_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0007),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC(name='auc')]
)
catnn_model.summary()
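
To make the shapes concrete, here is a NumPy sketch of what the embedding and concatenation steps compute for a small batch. The embedding widths are copied from `embedding_dims`, the 15 numeric columns match the printed numeric shape, and the vocabulary sizes and random tables are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Embedding widths copied from embedding_dims; vocabulary sizes are made up.
embedding_dims = {"posteam": 8, "defteam": 8, "posteam_type": 2,
                  "game_half": 2, "side_of_field": 4}
toy_vocab = {"posteam": 32, "defteam": 32, "posteam_type": 2,
             "game_half": 3, "side_of_field": 33}
n_numeric = 15  # numeric columns, including cat_proba

batch = 4
pieces = []
for col, dim in embedding_dims.items():
    table = rng.normal(size=(toy_vocab[col], dim))       # stand-in for learned weights
    codes = rng.integers(0, toy_vocab[col], size=batch)  # integer category codes
    pieces.append(table[codes])                          # row lookup, shape (batch, dim)

numeric = rng.normal(size=(batch, n_numeric))
combined = np.concatenate(pieces + [numeric], axis=1)

# Weights plus biases of the first Dense(128) layer
first_dense_params = combined.shape[1] * 128 + 128
print(combined.shape, first_dense_params)
```

The concatenated width is 8 + 8 + 2 + 2 + 4 + 15 = 39, so the first hidden layer should report 39 × 128 + 128 = 5120 parameters in the model summary.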

## Train the neural network with early stopping

Fit the model using callbacks: EarlyStopping monitors validation AUC and restores the best weights if performance plateaus for 6 epochs, while ReduceLROnPlateau lowers the learning rate if validation loss stops improving. Train for up to 60 epochs with batch size 256.
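
The patience rule can be illustrated without Keras. The function below is a simplified sketch of the logic for a metric to maximize, ignoring options such as `min_delta`; `best_epoch_with_patience` and the validation scores are hypothetical.

```python
def best_epoch_with_patience(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a metric to maximize."""
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Improvement: remember this epoch and reset the counter
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch

# Made-up validation AUC values for illustration
toy_val_auc = [0.80, 0.81, 0.82, 0.819, 0.818, 0.8195, 0.817]
stop_epoch, best_epoch = best_epoch_with_patience(toy_val_auc, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the model kept after stopping corresponds to `best_epoch`, not to the epoch at which training halted.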

In [None]:
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_auc', mode='max', patience=6, restore_best_weights=True, verbose=1),
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=4, min_lr=1e-6, verbose=1)
]
history = catnn_model.fit(
    train_inputs,
    y_train.values,
    validation_data=(val_inputs, y_val.values),
    epochs=60,
    batch_size=256,
    callbacks=callbacks,
    verbose=1
)

Epoch 1/60
872/872 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.7259 - auc: 0.7920 - loss: 0.5515 - val_accuracy: 0.7348 - val_auc: 0.8080 - val_loss: 0.5314 - learning_rate: 7.0000e-04
Epoch 2/60
872/872 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7327 - auc: 0.8019 - loss: 0.5372 - val_accuracy: 0.7346 - val_auc: 0.8086 - val_loss: 0.5283 - learning_rate: 7.0000e-04
Epoch 3/60
872/872 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7330 - auc: 

## Tune decision threshold on validation set

Define a function to scan classification thresholds from 0.1 to 0.9 and find the one that maximizes a chosen metric (accuracy by default). Generate validation predictions from both CatBoost and the CatBoosted MLP, then find the optimal threshold for each model that maximizes validation accuracy.

In [None]:
def find_best_threshold(y_true, proba, metric_fn=accuracy_score, thresholds=np.linspace(0.1, 0.9, 17)):
    best_t, best_score = 0.5, -np.inf
    for t in thresholds:
        preds = (proba >= t).astype(int)
        score = metric_fn(y_true, preds)
        if score > best_score:
            best_score = score
            best_t = t
    return best_t, best_score

# Validation predictions
catnn_val_proba = catnn_model.predict(val_inputs, verbose=0).flatten()

best_t_acc_cat, best_acc_cat = find_best_threshold(y_val, cat_val_proba, accuracy_score)
best_t_acc_catnn, best_acc_catnn = find_best_threshold(y_val, catnn_val_proba, accuracy_score)

print(f"Best val accuracy threshold (CatBoost): {best_t_acc_cat:.2f} -> acc {best_acc_cat:.4f}")
print(f"Best val accuracy threshold (CatBoosted MLP): {best_t_acc_catnn:.2f} -> acc {best_acc_catnn:.4f}")

Best val accuracy threshold (CatBoost): 0.50 -> acc 0.7344
Best val accuracy threshold (CatBoosted MLP): 0.50 -> acc 0.7346

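
The same scan works with any metric that takes `(y_true, y_pred)`. The toy example below re-implements the loop under the name `scan_thresholds` (a hypothetical stand-in so that the block is self-contained) and runs it with both accuracy and balanced accuracy on six made-up probabilities.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def scan_thresholds(y_true, proba, metric_fn, thresholds=np.linspace(0.1, 0.9, 17)):
    # Same scan as find_best_threshold: keep the first threshold with the top score
    best_t, best_score = 0.5, -np.inf
    for t in thresholds:
        score = metric_fn(y_true, (proba >= t).astype(int))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up labels and probabilities for illustration
y_toy = np.array([0, 0, 0, 1, 1, 1])
proba_toy = np.array([0.05, 0.22, 0.31, 0.33, 0.60, 0.90])

t_acc, acc = scan_thresholds(y_toy, proba_toy, accuracy_score)
t_bal, bal = scan_thresholds(y_toy, proba_toy, balanced_accuracy_score)
print(f"accuracy: t={t_acc:.2f}, score={acc:.3f}")
print(f"balanced accuracy: t={t_bal:.2f}, score={bal:.3f}")
```

Note that the grid moves in steps of 0.05, so a cut that would separate the classes perfectly (here, anywhere between 0.31 and 0.33) can be missed; a finer grid is one option if that matters.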

## Evaluate both models on test set with tuned thresholds

Apply the optimal thresholds (found on validation) to the test set predictions. Calculate test accuracy and AUC for both the standalone CatBoost model and the CatBoosted MLP. Display classification report and confusion matrix for the hybrid model to show precision, recall, and where predictions are correct/incorrect.

In [None]:
# CatBoost test metrics with tuned threshold
cat_test_preds = (cat_test_proba >= best_t_acc_cat).astype(int)
cat_test_acc = accuracy_score(y_test, cat_test_preds)
cat_test_auc = roc_auc_score(y_test, cat_test_proba)

# CatBoosted MLP test metrics with tuned threshold
catnn_test_proba = catnn_model.predict(test_inputs, verbose=0).flatten()
catnn_test_preds = (catnn_test_proba >= best_t_acc_catnn).astype(int)
catnn_test_acc = accuracy_score(y_test, catnn_test_preds)
catnn_test_auc = roc_auc_score(y_test, catnn_test_proba)

print(f"CatBoost -> Test Accuracy @ {best_t_acc_cat:.2f}: {cat_test_acc:.4f} | AUC: {cat_test_auc:.4f}")
print(f"CatBoosted MLP -> Test Accuracy @ {best_t_acc_catnn:.2f}: {catnn_test_acc:.4f} | AUC: {catnn_test_auc:.4f}\n")

print("CatBoosted MLP Classification Report:")
print(classification_report(y_test, catnn_test_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, catnn_test_preds))

CatBoost -> Test Accuracy @ 0.50: 0.7321 | AUC: 0.8063
CatBoosted MLP -> Test Accuracy @ 0.50: 0.7331 | AUC: 0.8068

CatBoosted MLP Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.70      0.69     19876
           1       0.78      0.75      0.77     27925

    accuracy                           0.73     47801
   macro avg       0.73      0.73      0.73     47801
weighted avg       0.74      0.73      0.73     47801

Confusion Matrix:
 [[13989  5887]
 [ 6871 21054]]
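
The figures in the classification report can be recomputed from the confusion matrix alone. The following sketch copies the matrix printed above and uses scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class (0 = run, 1 = pass)
cm = np.array([[13989, 5887],
               [6871, 21054]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

for label, name in enumerate(["run", "pass"]):
    print(f"{name}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the model misclassifies 5887 runs as passes and 6871 passes as runs, and the row sums (19876 and 27925) match the support column of the report.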
