
In [5]:
!pip install optuna --quiet

## Approach Summary: Playground Series - S5E8 Kaggle Competition
The approach is deliberately simple: a single XGBoost classifier whose hyperparameters are tuned with Optuna.

### 1. Data Preparation
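- Load the training, test, and original datasets from CSV files, and append the original dataset to the training set.
- Label-encode the categorical features with an encoder fitted on the combined train and test values, so both sets share one mapping.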

### 2. Hyperparameter Tuning
- Use **Optuna** to optimize key XGBoost parameters (`max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `n_estimators`, `reg_alpha`, `reg_lambda`).
- Evaluate each parameter set using **stratified 5-fold cross-validation** with the ROC-AUC metric.
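
The evaluation scheme above can be sketched on synthetic data. This is a minimal illustration rather than the notebook's tuning code: `X_demo` and `y_demo` are made-up data, and `LogisticRegression` is a stand-in for the XGBoost model so the example runs anywhere.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary data (illustrative only, not the competition data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stratification keeps the positive rate of each validation fold close to the overall rate
fold_pos_rates = [y_demo[val_idx].mean() for _, val_idx in cv.split(X_demo, y_demo)]

# Stand-in estimator; the notebook passes an XGBClassifier here instead
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)
mean_auc = scores.mean()  # the value an Optuna objective would return
print(fold_pos_rates, mean_auc)
```

Returning the mean of the five fold scores gives Optuna a single number to maximize per trial.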

### 3. Model Training
- Train the final **XGBClassifier** model on the entire training data with the best Optuna parameters.

### 4. Prediction and Submission
- Predict probabilities for the test set using `predict_proba`.
- Create the submission file with columns: `id` and `y`, where `y` is the probability for the positive class.
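
As a small sketch of the submission step, assuming a `predict_proba`-style array whose columns follow the class order `[0, 1]`: the values in `proba` and `ids` below are hypothetical and only illustrate the shape of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical predict_proba output: one row per sample, columns ordered as classes [0, 1]
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
ids = [750000, 750001, 750002]

# Column 1 holds the probability of the positive class
submission_demo = pd.DataFrame({"id": ids, "y": proba[:, 1]})

# index=False keeps the row index out of the file, leaving only `id` and `y`
csv_text = submission_demo.to_csv(index=False)
print(csv_text)
```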

In [6]:
# Author: Aaron Isom
# Kaggle Playground-Series-S5e8 - Binary Classification with a Bank Dataset
import pandas as pd
import numpy as np
import optuna
import warnings

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

warnings.filterwarnings('ignore')
tune = False  # True: run the Optuna search; False: reuse the saved best parameters

In [7]:
# Optuna objective for XGBoost
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "n_estimators": trial.suggest_int("n_estimators", 100, 10000, step=100),
        # log=True requires a strictly positive lower bound; 1e-3 is an assumed floor
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }
    
    model = XGBClassifier(**params, objective='binary:logistic', eval_metric='auc', random_state=42, device='cuda', n_jobs=-1,
                          enable_categorical=True, tree_method='hist')
    
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    return cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean()
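
The `log=True` option on `suggest_float` samples on a log scale, which is only defined for a strictly positive lower bound. The NumPy sketch below illustrates the idea; `sample_log_uniform` is a hypothetical helper written for this example, not Optuna's implementation.

```python
import numpy as np

def sample_log_uniform(low, high, size, seed=0):
    """Draw values whose logarithm is uniform on [log(low), log(high))."""
    if low <= 0:
        # log(0) is undefined, so a zero lower bound cannot be sampled on a log scale
        raise ValueError("low must be > 0 for log-scale sampling")
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

samples = sample_log_uniform(1e-3, 10.0, size=10_000)

# Each decade gets equal weight: about half the draws fall below 0.1,
# the midpoint of [1e-3, 10] on a log scale
frac_below_0_1 = (samples < 0.1).mean()

try:
    sample_log_uniform(0.0, 10.0, size=1)
    zero_low_rejected = False
except ValueError:
    zero_low_rejected = True
```

Log-scale sampling suits regularization strengths and learning rates, where the difference between 0.001 and 0.01 matters as much as the difference between 1 and 10.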

In [8]:
# Load data
train = pd.read_csv('/kaggle/input/playground-series-s5e8/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s5e8/test.csv')
original = pd.read_csv('/kaggle/input/bank-marketing-dataset-full/bank-full.csv', delimiter=";")
submission = pd.read_csv('/kaggle/input/playground-series-s5e8/sample_submission.csv')

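# Map the original dataset's yes/no target to 1/0 to match the competition data
original['y'] = original['y'].map({'yes': 1, 'no': 0})

# Append the original data; it has no id column, so those ids are NaN and are dropped below
train = pd.concat([train, original], axis=0, ignore_index=True)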

# Features for training (drop id and target)
X = train.drop(['id', 'y'], axis=1)
y = train['y']

# Features for test set (drop only id)
X_test = test.drop(['id'], axis=1)

# Categorical (object-dtype) columns
cat_cols = X.select_dtypes(include='object').columns

# Label-encode each categorical column, fitting on train and test values together
# so both sets share one mapping and no test value is unseen
for col in cat_cols:
    le = LabelEncoder()
    le.fit(pd.concat([X[col], X_test[col]]).astype(str))
    X[col] = le.transform(X[col].astype(str))
    X_test[col] = le.transform(X_test[col].astype(str))
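
A toy example of why the encoder is fitted on train and test values together. The column values below are made up for illustration; `LabelEncoder` raises a `ValueError` when asked to transform a label it did not see during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_col = pd.Series(["admin", "technician", "admin"])
test_col = pd.Series(["technician", "student"])  # "student" never appears in train

# Fitting on train only: transforming the unseen test label fails
le_train_only = LabelEncoder().fit(train_col)
try:
    le_train_only.transform(test_col)
    unseen_ok = True
except ValueError:
    unseen_ok = False

# Fitting on the union: every value gets a code (classes are sorted alphabetically)
le_shared = LabelEncoder().fit(pd.concat([train_col, test_col]))
train_codes = le_shared.transform(train_col)
test_codes = le_shared.transform(test_col)
print(list(le_shared.classes_), train_codes, test_codes)
```

Fitting on the test values uses only the set of category names, not the target, so it does not leak label information.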

In [9]:
if tune:
    # Optuna Study
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=100, timeout=5400, show_progress_bar=True)
    best_params = study.best_trial.params
    print('Best Parameters:', best_params)
    print('Best Trial:', study.best_trial)

else:
    # Best parameters saved from an earlier tuning run
    best_params = {
        'max_depth': 8,
        'learning_rate': 0.013438247465442936,
        'subsample': 0.8008903067253942,
        'colsample_bytree': 0.5816817925051649,
        'n_estimators': 6500,
        'reg_alpha': 0.026068275170423927,
        'reg_lambda': 0.0013608054178647067,
    }
    
# Final XGBoost model
final_model = XGBClassifier(**best_params, objective='binary:logistic', eval_metric='auc', random_state=42, device='cuda',  n_jobs=-1, 
                          enable_categorical=True, tree_method='hist')

final_model.fit(X, y)

In [10]:
# Final submission
preds = final_model.predict_proba(X_test)[:, 1]
submission['y'] = preds
submission.to_csv('submission.csv', index=False)
display(submission)
print('Submission file saved.')

| | id | y |
|---|---|---|
| 0 | 750000 | 0.001581 |
| 1 | 750001 | 0.040677 |
| 2 | 750002 | 0.000146 |
| 3 | 750003 | 0.000034 |
| 4 | 750004 | 0.013161 |
| ... | ... | ... |
| 249995 | 999995 | 0.000069 |
| 249996 | 999996 | 0.046899 |
| 249997 | 999997 | 0.774435 |
| 249998 | 999998 | 0.000525 |


Submission file saved.
