# XGBoost Optuna Example  
In this notebook, we build a basic example of XGBoost hyperparameter tuning with Optuna on Kaggle's Playground Series S4E10: Loan Approval Prediction. We evaluate both the base model and the tuned model on held-out data (X_test) using the ROC AUC metric.

## Imports and data loading

In [1]:
# Imports and settings
import pandas as pd
import numpy as np

# settings
import warnings
warnings.filterwarnings('ignore')
random_state = 2024
np.random.seed(random_state)  # seed NumPy's global RNG (assigning to np.seed has no effect)

# Metrics and model selection
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score

# Encoding
from sklearn.preprocessing import OrdinalEncoder

# Models
from xgboost import XGBClassifier

In [2]:
# Optuna settings
import optuna
import logging

# Configure Optuna's log level
logging.getLogger("optuna").setLevel(logging.INFO)  # Set to WARNING to reduce messages

In [3]:
#train_data = pd.read_csv('/kaggle/input/credit-risk-dataset/credit_risk_dataset.csv')

In [4]:
# Load data
train_data = pd.read_csv('/kaggle/input/playground-series-s4e10/train.csv').drop('id',axis=1)

test_data = pd.read_csv('/kaggle/input/playground-series-s4e10/test.csv')
ids = test_data['id'].astype(np.int32)
test_data.drop('id', axis=1, inplace=True)

print('Data loaded successfully!')

Data loaded successfully!


## Determine train and test splits

In [5]:
# Determine the dependent (target) variable
target_variable = "loan_status"

# Determine the numerical variables
num_cols = [col for col in train_data.columns if train_data[col].dtype in ['int64', 'float64'] 
            and col != target_variable]

# Determine the categorical variables
cat_cols = [col for col in train_data.columns if train_data[col].dtype == 'object']

In [6]:
y = train_data['loan_status'].reset_index(drop=True)
X = train_data.drop('loan_status', axis=1).reset_index(drop=True)

In [7]:
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_oe = pd.DataFrame(oe.fit_transform(X[cat_cols]), columns = cat_cols).fillna(0).astype(int)

test_oe = pd.DataFrame(oe.transform(test_data[cat_cols]), columns = cat_cols).fillna(0).astype(int)
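
The encoder above is configured with `handle_unknown='use_encoded_value'` and `unknown_value=-1`, so a category that appears only in the test set is mapped to -1 instead of raising an error. A minimal sketch of that behavior on toy data (the `grade` column and its values are made up for illustration and are not taken from the competition data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames: "D" never appears in the training data
train_demo = pd.DataFrame({"grade": ["A", "B", "C", "A"]})
test_demo = pd.DataFrame({"grade": ["B", "D"]})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_demo)  # learns the sorted categories A=0, B=1, C=2

encoded = enc.transform(test_demo).ravel().astype(int).tolist()
print(encoded)  # [1, -1]
```

Known categories keep the integer code learned during `fit`, while the unseen value gets the sentinel -1.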

In [8]:
X_oe_concated = pd.concat([X_oe, X[num_cols]], axis= 1)
test_oe_concated = pd.concat([test_oe, test_data[num_cols]], axis= 1)

In [9]:
X, test_data = X_oe_concated.copy(),test_oe_concated.copy()

for col in cat_cols:
    X[col] = X[col].astype("category")
    test_data[col] = test_data[col].astype("category")

In [10]:
# Simple train_test split to evaluate the optimization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=random_state)

# Base model and Hyperparameter Tuning

## Base XGBoost model

In [11]:
# XGB base params
base_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'enable_categorical': True,
    'n_estimators': 1000,
    'eta': 0.3,
    'verbosity':0,
}

In [12]:
# Base XGBoost model
base_xgb = XGBClassifier(**base_params).fit(X_train, y_train)
base_xgb

In [13]:
# To evaluate the model performance we use cross_val_score
score = cross_val_score(base_xgb, X_train, y_train, scoring="roc_auc", cv=5).mean()
print(f'Base ROC_AUC : {score}')

Base ROC_AUC : 0.9428497614842855
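
The score above is the mean of five per-fold ROC AUC values: `cross_val_score` clones the estimator, fits it on four folds, and scores it on the remaining one. The sketch below shows the per-fold array before averaging. It uses synthetic data and a `LogisticRegression` as stand-ins (both are assumptions chosen to keep the example fast, not what the notebook trains):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for X_train, y_train)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# One ROC AUC value per fold; the notebook reports only the mean
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, scoring="roc_auc", cv=5
)

print(fold_scores)         # array with 5 entries
print(fold_scores.mean())  # the single number used as the CV score
print(fold_scores.std())   # spread across folds
```

Looking at the spread across folds helps judge whether a small difference between two mean scores is meaningful.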


## Optuna Objective XGBoost  
First, we define the search space. To keep the search small, several hyperparameters are commented out, so only `max_depth`, `n_estimators`, `eta`, and `reg_lambda` are tuned.

In [14]:
# Optuna Objective XGB
def objective(trial):
    param = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'enable_categorical': True,
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'n_estimators': trial.suggest_int('n_estimators', 500, 1500),
        #'subsample': trial.suggest_float('subsample', 0.7, 0.9),
        #'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.9),
        #'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'eta': trial.suggest_float('eta', 0.01, 0.1),
        #'reg_alpha': trial.suggest_float("reg_alpha", 1e-6, 1e-1),
        'reg_lambda': trial.suggest_float("reg_lambda", 1, 20),
        #'gamma': trial.suggest_float("gamma", 1e-8, 1e-1),
    }

    clf = XGBClassifier(**param)
    # Objective value: mean ROC AUC over 5-fold cross-validation
    score = cross_val_score(clf, X_train, y_train, scoring="roc_auc", cv=5).mean()
    return score

In [15]:
%%time
# Run the optimization
study = optuna.create_study(
    study_name='XGBoost_optimization', 
    direction='maximize')

study.optimize(objective, n_trials=30, timeout=3600)

[I 2024-10-23 11:50:05,183] A new study created in memory with name: XGBoost_optimization
[I 2024-10-23 11:50:18,745] Trial 0 finished with value: 0.9547103775443435 and parameters: {'max_depth': 4, 'n_estimators': 1106, 'eta': 0.09841469842426465, 'reg_lambda': 7.85061641795577}. Best is trial 0 with value: 0.9547103775443435.
[I 2024-10-23 11:50:59,882] Trial 1 finished with value: 0.952971948019053 and parameters: {'max_depth': 10, 'n_estimators': 1145, 'eta': 0.03369067240957974, 'reg_lambda': 18.9263800243201}. Best is trial 0 with value: 0.9547103775443435.
[I 2024-10-23 11:51:10,167] Trial 2 finished with value: 0.9554704924751457 and parameters: {'max_depth': 5, 'n_estimators': 670, 'eta': 0.0729616303536814, 'reg_lambda': 5.808618704598761}. Best is trial 2 with value: 0.9554704924751457.
[I 2024-10-23 11:51:24,812] Trial 3 finished with value: 0.9555880719862472 and parameters: {'max_depth': 4, 'n_estimators': 1153, 'eta': 0.0458340597764879, 'reg_lambda': 4.914137339290766}.

CPU times: user 31min 5s, sys: 10.7 s, total: 31min 15s
Wall time: 8min 4s


In [16]:
#  Results
print('N trials: ', len(study.trials))
print('Best trial:')
trial = study.best_trial 

print('  Value: {}'.format(trial.value))
print('  Params: ')
for key, value in trial.params.items():
    print('    {}: {}'.format(key, value))

# Optuna params
optuna_params = trial.params

N trials:  30
Best trial:
  Value: 0.9562592785427324
  Params: 
    max_depth: 3
    n_estimators: 876
    eta: 0.09972966165848221
    reg_lambda: 7.111021085628628


Now we update the base parameters with the Optuna parameters and construct the XGBoost model.

In [17]:
base_params.update(optuna_params)

xgb_model = XGBClassifier(**base_params)
xgb_model
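
The `update` call above overwrites the keys that Optuna tuned and leaves the rest of the base parameters untouched. A small sketch of the merge, using trimmed copies of the dictionaries with the tuned values rounded from the output above (the names `base`, `tuned`, and `merged` are illustrative):

```python
# Trimmed copies of the notebook's dictionaries
base = {"objective": "binary:logistic", "max_depth": 6, "n_estimators": 1000, "eta": 0.3}
tuned = {"max_depth": 3, "n_estimators": 876, "eta": 0.0997, "reg_lambda": 7.11}

# Non-mutating alternative to base.update(tuned): later keys win
merged = {**base, **tuned}

print(merged["max_depth"])   # 3  (tuned value replaces the base value)
print(merged["objective"])   # binary:logistic  (untuned key is kept)
print(merged["reg_lambda"])  # 7.11  (new key is added)
print(base["max_depth"])     # 6  (the original dict is unchanged)
```

Because `update` mutates `base_params` in place, rerunning the base-model cell after this point would build it with the tuned values; the dictionary-unpacking form avoids that side effect.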

## Evaluate the performance of both models  

On the held-out split (X_test), the tuned model reaches a higher ROC AUC than the base model: about 0.960 versus 0.948.

In [18]:
xgb_model.fit(X_train, y_train)
opt_preds = xgb_model.predict_proba(X_test)[:,1]
score = roc_auc_score(y_test, opt_preds)
print(f'Optimization ROC_AUC score : {score}')

Optimization ROC_AUC score : 0.9603179836612082


In [19]:
#base_xgb.fit(X_train, y_train) # this was already fitted
base_preds = base_xgb.predict_proba(X_test)[:,1]
score = roc_auc_score(y_test, base_preds)
print(f'Base ROC_AUC score : {score}')

Base ROC_AUC score : 0.9482751462309656
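
The ROC AUC values above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal pure-Python sketch of that pairwise definition on toy labels and scores (the function name `roc_auc_pairwise` and the data are illustrative; `roc_auc_score` computes the same quantity much more efficiently):

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_demo = roc_auc_pairwise(labels, scores)
print(auc_demo)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, which gives 0.75; a score of 0.5 would correspond to random ranking.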


## Optuna's Tuning Visualization  
The optimization history plot, the slice plot, and the hyperparameter importance plot help us better understand the results.

In [20]:
#!pip install plotly

In [21]:
fig_1 = optuna.visualization.plot_optimization_history(study)
fig_2 = optuna.visualization.plot_slice(study)
fig_3 = optuna.visualization.plot_param_importances(study)

fig_1.show()
fig_2.show()
fig_3.show()

# Final prediction for submission

In [22]:
# Now we fit the optimized model on the entire training data to make the final prediction
xgb_model.fit(X, y)
preds = xgb_model.predict_proba(test_data)[:,1]

In [23]:
predictions = preds
predictions

array([0.9976725 , 0.02069541, 0.53832364, ..., 0.01236033, 0.21029922,
       0.95791847], dtype=float32)

In [24]:
submission = pd.DataFrame()
submission["id"] = ids
submission["loan_status"] = predictions
submission.to_csv("submission.csv",header=True, index=False)
submission.head()

| | id | loan_status |
|---|---|---|
| 0 | 58645 | 0.997672 |
| 1 | 58646 | 0.020695 |
| 2 | 58647 | 0.538324 |
| 3 | 58648 | 0.00955 |
| 4 | 58649 | 0.033415 |
