I trained the XGBoost models on data prepared as in [this notebook](https://github.com/carljvh6/Kaggle/blob/main/Steel_plate_defects/1_Data_exploration_preprocessing.ipynb). I first trained a model with the default parameters, then used Optuna to optimise the hyperparameters. This approach gave the [best results](https://www.kaggle.com/code/carljvh/xgb-baseline-b?scriptVersionId=172166730) I ultimately got for this competition (note that this was a submission made after the competition deadline).

I used the scikit-learn API, noting the following:
* It was the easiest way to implement a multilabel predictor, since the scikit-learn API handles that for you. With the native API and DMatrix, as far as I could tell, you would need to set up a separate model for each binary prediction, i.e. for every potential label that you would like to predict.
* I only used the CPU for training with the scikit-learn models. GPU training (for example through the native XGBoost API with DMatrix datasets) would be significantly faster, which is of particular importance during hyperparameter tuning. Note that recent XGBoost versions (2.0 and later) also accept `device="cuda"` on the scikit-learn wrapper, so the GPU is not limited to the native API.
* I used cross-validation during training to validate the models, and also validated them on a held-out test set `X_t` that I separated during preprocessing. These data are labelled and can be used for local validation of the models. There is another test dataset, from the test.csv file, that contains unlabelled data. It is only evaluated once submitted to Kaggle, where the score is then calculated.

# Setup

In [1]:
from pathlib import Path
import os

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle:
    !pip install -Uqq fastai
    path = Path('/kaggle/input/playground-series-s4e3')
else:
    import zipfile,kaggle
    path = Path('playground-series-s4e3')
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict, cross_validate, cross_val_score
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, make_scorer, classification_report
import xgboost as xgb
from xgboost import XGBClassifier
import optuna

playground-series-s4e3.zip: Skipping, found more recently modified local copy (use --force to force download)


In [2]:
train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')
target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains", "Dirtiness", "Bumps", "Other_Faults"]
targets_df = train_df[target_classes]
X_train, X_test, y_train, y_test = train_test_split(train_df.drop(target_classes + ['id'], axis=1), 
                                                    targets_df, test_size=0.1, random_state=40)

categorical = ['TypeOfSteel_A300', 'TypeOfSteel_A400', 'Outside_Global_Index']
numerical = list(set(train_df.columns) - set(categorical) - set(target_classes))
numerical.remove('id')

X = pd.get_dummies(X_train, columns=categorical)
y = y_train
X_t = pd.get_dummies(X_test, columns=categorical)
X_t.insert(len(X_t.columns)-1, 'Outside_Global_Index_0.7', 0)
y_t = y_test
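
The manual `insert` above is presumably needed because one category value never appears in the held-out split, so `get_dummies` produces one column fewer there. A more general way to line up the columns, sketched here on a made-up toy frame (the `size` and `grade` columns are hypothetical), is to reindex the held-out matrix against the training columns:

```python
import pandas as pd

# Toy data: the value "c" appears in training but never in the held-out split.
train_demo = pd.DataFrame({"size": [1.0, 2.0, 3.0], "grade": ["a", "b", "c"]})
test_demo = pd.DataFrame({"size": [4.0, 5.0], "grade": ["a", "b"]})

train_enc = pd.get_dummies(train_demo, columns=["grade"])
test_enc = pd.get_dummies(test_demo, columns=["grade"])

# Add any dummy column missing from the held-out split (filled with 0) and
# put the columns in the same order as the training matrix.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

In this notebook the equivalent call would be `X_t = X_t.reindex(columns=X.columns, fill_value=0)`, which should give the same result as the `insert` without hard-coding the column name or position.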

# Baseline

In [3]:
%%time
xgb_sk_base = XGBClassifier(objective='binary:logistic')

def auc_score(estimator, X, y):
    y_prob = estimator.predict_proba(X)
    return roc_auc_score(y, y_prob, multi_class="ovr")

scores = cross_val_score(xgb_sk_base, X, y, scoring=auc_score, cv=5)
print(f'ROC AUC during cross validation: {scores.mean()}')

ROC AUC during cross validation: 0.8695190595359483
CPU times: user 1min 50s, sys: 1.04 s, total: 1min 51s
Wall time: 7.33 s
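
One detail worth knowing about `cv=5` here: to my understanding, scikit-learn only switches to stratified folds for binary or multiclass targets, so with a multilabel target matrix like `y` the folds come from a plain `KFold`. The sketch below checks this with `check_cv`, the helper scikit-learn uses to resolve an integer `cv`; the tiny arrays are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_single = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # one binary target
y_multi = np.column_stack([y_single, 1 - y_single])   # two-column indicator matrix

# Resolve cv=5 the way scikit-learn does for a classifier.
cv_single = check_cv(5, y_single, classifier=True)
cv_multi = check_cv(5, y_multi, classifier=True)

print(type(cv_single).__name__, type(cv_multi).__name__)
```

If stratification matters for the rarer defect classes, the splitter has to be chosen explicitly, since the default will not do it for multilabel targets.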


In [4]:
xgb_sk_base.fit(X, y)
preds = xgb_sk_base.predict_proba(X_t)
score = roc_auc_score(y_t, preds, multi_class='ovr')
print(f'ROC AUC on the test set: {score}')

ROC AUC on the test set: 0.8744220473564709
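
To make the metric concrete: when `roc_auc_score` receives a two-dimensional indicator matrix, as it does in `auc_score` above, it computes one binary ROC AUC per label and, with the default `average="macro"`, returns their unweighted mean. The small hand-made arrays below are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four samples, two labels: true indicators and predicted probabilities.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_prob = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]])

# One binary AUC per label column, then the unweighted (macro) mean.
per_label = [roc_auc_score(y_true[:, j], y_prob[:, j]) for j in range(y_true.shape[1])]
macro = float(np.mean(per_label))

# Passing the full matrices gives the same macro average.
direct = roc_auc_score(y_true, y_prob)
with_ovr = roc_auc_score(y_true, y_prob, multi_class="ovr")

print(per_label, macro, direct, with_ovr)
```

In this multilabel setting the `multi_class="ovr"` argument makes no difference to the result; it only matters for single-column multiclass targets.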


# Optuna for optimisation of hyperparameters

With the default parameters I got a ROC AUC of 0.8744 on the test set and 0.8695 in 5-fold cross-validation. We will now see if we can improve on that using Optuna for Bayesian optimisation.

See [here](https://xgboost.readthedocs.io/en/stable/parameter.html) for the documentation containing all the hyperparameters with their default settings.

See [this](https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html) for the XGBoost docs on optimising hyperparameters.

## Trial 1

In [6]:
%%time
def objective(trial):
    clf = XGBClassifier(
        objective="binary:logistic",
        verbosity=0,
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.5, log=True), #alias eta
        min_split_loss=trial.suggest_int("min_split_loss", 0, 30), #alias gamma
        reg_lambda=trial.suggest_int("reg_lambda", 1, 30),
        max_depth=trial.suggest_int("max_depth", 1, 10),
        subsample=trial.suggest_float("subsample", 0.05, 1.0),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.05, 1.0),
        min_child_weight=trial.suggest_int("min_child_weight", 1, 20),
        tree_method='hist',
        n_estimators=100,
        
        # predictor='gpu_predictor'
    )

    score = cross_val_score(clf, X, y, scoring=auc_score, cv=5).mean()

    # score = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

[I 2024-05-18 18:11:12,490] A new study created in memory with name: no-name-52d3f095-6422-4c6d-9475-9e3a638734c2
[I 2024-05-18 18:11:15,864] Trial 0 finished with value: 0.8411229492772497 and parameters: {'learning_rate': 0.007238756199821027, 'min_split_loss': 24, 'reg_lambda': 25, 'max_depth': 10, 'subsample': 0.5069477618138633, 'colsample_bytree': 0.16520559215517258, 'min_child_weight': 18}. Best is trial 0 with value: 0.8411229492772497.
[I 2024-05-18 18:11:20,666] Trial 1 finished with value: 0.8289227418347395 and parameters: {'learning_rate': 0.0018209400392888574, 'min_split_loss': 30, 'reg_lambda': 14, 'max_depth': 6, 'subsample': 0.5767450282681376, 'colsample_bytree': 0.8932312400905561, 'min_child_weight': 8}. Best is trial 0 with value: 0.8411229492772497.
[I 2024-05-18 18:11:24,759] Trial 2 finished with value: 0.8449900854727715 and parameters: {'learning_rate': 0.0019895196957494753, 'min_split_loss': 10, 'reg_lambda': 24, 'max_depth': 9, 'subsample': 0.644006024762

CPU times: user 1h 9min 59s, sys: 48 s, total: 1h 10min 47s
Wall time: 4min 26s


Best is trial 41 with value: 0.8838653787221867
Trial 41 finished with value: 0.8838653787221867 and parameters: {'learning_rate': 0.07840922749254706, 'min_split_loss': 2, 'reg_lambda': 15, 'max_depth': 5, 'subsample': 0.653997769164365, 'colsample_bytree': 0.8263332660331102, 'min_child_weight': 1

___

Previous results
Best hyperparameters: {'learning_rate': 0.10258040309793745, 'min_split_loss': 3, 'reg_lambda': 23, 'max_depth': 9, 'subsample': 0.7588835041823632, 'colsample_bytree': 0.9101347208809124, 'min_child_weight': 14}

Best is trial 25 with value: 0.8829777101244968.

CPU times: user 26min 55s, sys: 16.4 s, total: 27min 12s
Wall time: 3min 24s
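
A note on `log=True` in the `learning_rate` range above: it asks Optuna to treat the range on a logarithmic scale, so small values get as much attention as large ones. The sketch below uses plain Python rather than Optuna to show what a log-uniform draw over the same bounds looks like. It is only an illustration of the scale, since the TPE sampler adapts its proposals after the first trials instead of drawing independently.

```python
import math
import random

low, high = 1e-3, 0.5  # same bounds as the learning_rate range in trial 1
rng = random.Random(0)

# Log-uniform draw: sample uniformly in log space, then exponentiate.
samples = [math.exp(rng.uniform(math.log(low), math.log(high))) for _ in range(10_000)]

# The midpoint of the range in log space is the geometric mean of the bounds.
geometric_mid = math.sqrt(low * high)
share_below = sum(s < geometric_mid for s in samples) / len(samples)

print(f"geometric midpoint: {geometric_mid:.4f}, share below it: {share_below:.2f}")
```

Roughly half of the draws land below about 0.022, whereas a linear scale over the same bounds would put only around 4% of them there.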

### Validation of trial 1

In [7]:
best_params = {'learning_rate': 0.07840922749254706, 'min_split_loss': 2, 'reg_lambda': 15, 'max_depth': 5, 'subsample': 0.653997769164365, 'colsample_bytree': 0.8263332660331102, 'min_child_weight': 1}
xgb_tuned = XGBClassifier(**best_params, n_estimators=1000)
xgb_tuned.fit(X, y)

score = auc_score(xgb_tuned, X_t, y_t)
print(f'Results on test set using the best parameters from this trial {score}')

Results on test set using the best parameters from this trial 0.8867965676748469


## Trial 2

In [7]:
%%time
def objective(trial):
    clf = XGBClassifier(
        objective="binary:logistic",
        verbosity=0,
        learning_rate=trial.suggest_float("learning_rate", 0.05, 0.5, log=True), #alias eta
        min_split_loss=trial.suggest_int("min_split_loss", 0, 6), #alias gamma
        reg_lambda=trial.suggest_int("reg_lambda", 10, 35),
        max_depth=trial.suggest_int("max_depth", 5, 15),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.5, 1.0),
        min_child_weight=trial.suggest_int("min_child_weight", 5, 25),
        tree_method='hist',
        n_estimators=100,
        
        # predictor='gpu_predictor'
    )
    score = cross_val_score(clf, X, y, scoring=auc_score, cv=5).mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best hyperparameters:', study.best_params)

[I 2024-05-18 18:15:38,572] A new study created in memory with name: no-name-30636938-88b6-4ad2-acf6-9edfd43f0585
[I 2024-05-18 18:15:43,199] Trial 0 finished with value: 0.8809891247920334 and parameters: {'learning_rate': 0.2474259677660816, 'min_split_loss': 4, 'reg_lambda': 18, 'max_depth': 15, 'subsample': 0.5538526531287781, 'colsample_bytree': 0.5337437765059053, 'min_child_weight': 7}. Best is trial 0 with value: 0.8809891247920334.
[I 2024-05-18 18:15:46,850] Trial 1 finished with value: 0.880370186078949 and parameters: {'learning_rate': 0.3626451451841311, 'min_split_loss': 4, 'reg_lambda': 35, 'max_depth': 6, 'subsample': 0.6699058849021429, 'colsample_bytree': 0.74259075784517, 'min_child_weight': 8}. Best is trial 0 with value: 0.8809891247920334.
[I 2024-05-18 18:15:49,885] Trial 2 finished with value: 0.8789332764029419 and parameters: {'learning_rate': 0.4785173837689242, 'min_split_loss': 5, 'reg_lambda': 32, 'max_depth': 11, 'subsample': 0.6514875975285264, 'colsampl

Best hyperparameters: {'learning_rate': 0.06668974735495849, 'min_split_loss': 3, 'reg_lambda': 14, 'max_depth': 12, 'subsample': 0.8724848294655697, 'colsample_bytree': 0.6047213191041573, 'min_child_weight': 16}
CPU times: user 1h 14min 27s, sys: 46.4 s, total: 1h 15min 13s
Wall time: 4min 42s


Best is trial 18 with value: 0.8843528889321093
___

Previous results

I got a ROC AUC of 0.8902636942892138 on the test set, a clear improvement on the 0.8744 we got with the default parameters.

### Validation of trial 2

In [9]:
best_params = {'learning_rate': 0.06668974735495849, 'min_split_loss': 3, 'reg_lambda': 14, 'max_depth': 12, 'subsample': 0.8724848294655697, 'colsample_bytree': 0.6047213191041573, 'min_child_weight': 16}
xgb_tuned = XGBClassifier(**best_params, n_estimators=1000)
xgb_tuned.fit(X, y)

score = auc_score(xgb_tuned, X_t, y_t)
print(f'Results on test set using the best parameters from this trial {score}')

Results on test set using the best parameters from this trial 0.8918639818452758


## Trial 3: brute force

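Here I widen the search to three more hyperparameters (`colsample_bylevel`, `colsample_bynode` and `reg_alpha`), raise `n_estimators` from 100 to 200, and run 500 trials instead of 50 to see what happens. Cross-validation uses 3 folds instead of 5.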

In [10]:
%%time
def objective(trial):
    clf = XGBClassifier(
        objective="binary:logistic",
        verbosity=0,
        learning_rate=trial.suggest_float("learning_rate", 0.05, 0.3, log=True), #alias eta
        min_split_loss=trial.suggest_int("min_split_loss", 0, 6), #alias gamma
        reg_lambda=trial.suggest_int("reg_lambda", 10, 35),
        max_depth=trial.suggest_int("max_depth", 3, 15),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.3, 0.8),
        min_child_weight=trial.suggest_int("min_child_weight", 5, 20),
        tree_method='hist',
        n_estimators=200,

        colsample_bylevel=trial.suggest_float("colsample_bylevel", 0.1, 1.0),
        colsample_bynode=trial.suggest_float("colsample_bynode", 0.1, 1.0),
        reg_alpha=trial.suggest_int("reg_alpha", 0, 10)
        
        # predictor='gpu_predictor'
    )
    score = cross_val_score(clf, X, y, scoring=auc_score, cv=3).mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=500)
print('Best hyperparameters:', study.best_params)

[I 2024-05-18 18:20:27,461] A new study created in memory with name: no-name-1a679e7a-3382-4f72-a95f-b80fa3003fb3
[I 2024-05-18 18:20:32,030] Trial 0 finished with value: 0.8772856883075079 and parameters: {'learning_rate': 0.053156052754817225, 'min_split_loss': 0, 'reg_lambda': 31, 'max_depth': 5, 'subsample': 0.897396261817292, 'colsample_bytree': 0.5888034951252654, 'min_child_weight': 14, 'colsample_bylevel': 0.4875938662599666, 'colsample_bynode': 0.29907147323647565, 'reg_alpha': 10}. Best is trial 0 with value: 0.8772856883075079.
[I 2024-05-18 18:20:34,458] Trial 1 finished with value: 0.875327336177126 and parameters: {'learning_rate': 0.14746047317267075, 'min_split_loss': 3, 'reg_lambda': 11, 'max_depth': 12, 'subsample': 0.9884069469245311, 'colsample_bytree': 0.5841892355546068, 'min_child_weight': 11, 'colsample_bylevel': 0.1837798563021386, 'colsample_bynode': 0.41169743163348993, 'reg_alpha': 3}. Best is trial 0 with value: 0.8772856883075079.
[I 2024-05-18 18:20:36,86

Best hyperparameters: {'learning_rate': 0.053686089568784456, 'min_split_loss': 0, 'reg_lambda': 14, 'max_depth': 6, 'subsample': 0.9827683050023375, 'colsample_bytree': 0.3656127562437365, 'min_child_weight': 5, 'colsample_bylevel': 0.813776637511822, 'colsample_bynode': 0.9658322536757605, 'reg_alpha': 4}
CPU times: user 11h 43min 18s, sys: 6min 45s, total: 11h 50min 3s
Wall time: 44min 27s


Best is trial 447 with value: 0.8852777829291546.
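
As a back-of-the-envelope check on the cost of this trial, the sketch below compares the number of boosting rounds requested in trial 1 and trial 3 (trials × folds × `n_estimators`) with the wall times reported in the outputs. The helper `round_budget` is a hypothetical name for this example, and the comparison is rough: rounds differ in cost with tree depth, sampling and fold size.

```python
def round_budget(n_trials, cv_folds, n_estimators):
    """Total boosting rounds requested across a whole study."""
    return n_trials * cv_folds * n_estimators

trial_1 = round_budget(n_trials=50, cv_folds=5, n_estimators=100)
trial_3 = round_budget(n_trials=500, cv_folds=3, n_estimators=200)

# Wall times from the cell outputs, converted to seconds.
wall_1 = 4 * 60 + 26    # 4min 26s
wall_3 = 44 * 60 + 27   # 44min 27s

budget_ratio = trial_3 / trial_1
wall_ratio = wall_3 / wall_1
print(trial_1, trial_3, budget_ratio, round(wall_ratio, 1))
```

Trial 3 requests about twelve times as many boosting rounds and took about ten times as long, so the run time scales roughly with the requested budget.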

I had run this trial a few times earlier in another notebook, and these were the results I got:

Best is trial 413 with value: 0.885647937243489
Best hyperparameters: {'learning_rate': 0.06968316190044742, 'min_split_loss': 1, 'reg_lambda': 24, 'max_depth': 5, 'subsample': 0.9906953118234637, 'colsample_bytree': 0.3258025562215848, 'min_child_weight': 5, 'colsample_bylevel': 0.7473381589535361, 'colsample_bynode': 0.896729936375899, 'reg_alpha': 0}
CPU times: user 5h 19min 3s, sys: 3min 21s, total: 5h 22min 25s
Wall time: 40min 18s

Trial 100 finished with value: 0.8855198659438193 and parameters: {'learning_rate': 0.06543786100681456, 'min_split_loss': 1, 'reg_lambda': 24, 'max_depth': 6, 'subsample': 0.9998978435460225, 'colsample_bytree': 0.3255760701926972, 'min_child_weight': 5, 'colsample_bylevel': 0.9391003933370177, 'colsample_bynode': 0.9979759672213127, 'reg_alpha': 1}. Best is trial 100 with value: 0.8855198659438193.

Trial 228 finished with value: 0.8856404469468021 and parameters: {'learning_rate': 0.06783207140063613, 'min_split_loss': 1, 'reg_lambda': 24, 'max_depth': 6, 'subsample': 0.990662871547234, 'colsample_bytree': 0.31151856870493166, 'min_child_weight': 5, 'colsample_bylevel': 0.7809688716778227, 'colsample_bynode': 0.931803688219953, 'reg_alpha': 1}. Best is trial 228 with value: 0.8856404469468021.

This final set was the one I used in [this notebook](https://www.kaggle.com/code/carljvh/xgb-baseline-b?scriptVersionId=172166730), which scored 0.88307 (private) and 0.89099 (public) when submitted to Kaggle. Position: 823 of 2199.

### Validation of trial 3 on the test set

In [10]:
best_params = {'learning_rate': 0.053686089568784456, 'min_split_loss': 0, 'reg_lambda': 14, 'max_depth': 6, 'subsample': 0.9827683050023375, 'colsample_bytree': 0.3656127562437365, 'min_child_weight': 5, 'colsample_bylevel': 0.813776637511822, 'colsample_bynode': 0.9658322536757605, 'reg_alpha': 4}
xgb_tuned = XGBClassifier(**best_params, n_estimators=1000)
xgb_tuned.fit(X, y)

score = auc_score(xgb_tuned, X_t, y_t)
print(f'Results on test set using the best parameters from this trial {score}')

Results on test set using the best parameters from this trial 0.8874028580609241
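
To put the runs side by side, the snippet below collects the cross-validation and held-out scores printed in this notebook and reports the gain over the baseline. The numbers are copied from the cell outputs above; the dictionary names are just for this summary. Keep in mind that trial 3 used 3-fold cross-validation while the others used 5-fold, and that the tuned models were refitted with `n_estimators=1000` for the held-out check.

```python
# ROC AUC scores copied from the cell outputs in this notebook.
cv_auc = {
    "baseline": 0.8695190595359483,
    "trial 1": 0.8838653787221867,
    "trial 2": 0.8843528889321093,
    "trial 3": 0.8852777829291546,  # 3-fold CV, the others are 5-fold
}
holdout_auc = {
    "baseline": 0.8744220473564709,
    "trial 1": 0.8867965676748469,
    "trial 2": 0.8918639818452758,
    "trial 3": 0.8874028580609241,
}

baseline = holdout_auc["baseline"]
for name, score in holdout_auc.items():
    print(f"{name:9s} cv={cv_auc[name]:.4f} holdout={score:.4f} gain={score - baseline:+.4f}")

best_cv = max(cv_auc, key=cv_auc.get)
best_holdout = max(holdout_auc, key=holdout_auc.get)
print(best_cv, best_holdout)
```

The run with the best cross-validation score (trial 3) is not the one with the best held-out score (trial 2), and the differences between the tuned runs are small compared with the gain over the baseline.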
