<a href="https://colab.research.google.com/github/antbartash/product_failure/blob/main/xgboost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading libraries and data <br>
Since XGBoost handles missing values natively, we don't need to impute them ourselves. Decision trees are also insensitive to feature scaling, and KNN imputation is not needed, so StandardScaler won't be applied.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
from scipy.stats import uniform

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
X_train_dummies = pd.read_csv('drive/MyDrive/product_failure/data/X_train_dummies.csv', index_col=0)
X_test_dummies = pd.read_csv('drive/MyDrive/product_failure/data/X_test_dummies.csv', index_col=0)
y_train = pd.read_csv('drive/MyDrive/product_failure/data/y_train.csv', index_col=0)
y_test = pd.read_csv('drive/MyDrive/product_failure/data/y_test.csv', index_col=0)

X_train_dummies = X_train_dummies.astype(np.float32)
X_test_dummies = X_test_dummies.astype(np.float32)
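Because missing values are left for XGBoost to handle, it can be worth confirming that they survive loading and the cast to `float32` as `NaN`. The snippet below is a minimal sketch on a made-up two-row CSV (the values and the `id` column name are illustrative assumptions, not the project's files):

```python
import io

import numpy as np
import pandas as pd

# Tiny illustrative CSV: the first row has an empty measurement_3 field.
csv_text = "id,loading,measurement_3\n1630,107.53,\n18030,128.99,17.242\n"

# Same loading pattern as above: first column as index, then cast to float32.
df = pd.read_csv(io.StringIO(csv_text), index_col=0).astype(np.float32)

# Empty fields are read as NaN and remain NaN after the cast.
n_missing = int(df.isna().sum().sum())
print(df.dtypes)
print("missing values:", n_missing)
```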

Check the first 5 observations and the data shapes to make sure that the data was read correctly.

In [5]:
X_train_dummies.head()

Unnamed: 0,loading,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,...,"measurement_2_grouped_(-0.001, 2.0]","measurement_2_grouped_(2.0, 3.0]","measurement_2_grouped_(3.0, 4.0]","measurement_2_grouped_(4.0, 5.0]","measurement_2_grouped_(5.0, 6.0]","measurement_2_grouped_(6.0, 7.0]","measurement_2_grouped_(7.0, 8.0]","measurement_2_grouped_(8.0, 9.0]","measurement_2_grouped_(9.0, 11.0]","measurement_2_grouped_(11.0, 24.0]"
1630,107.529999,16.0,2.0,4.0,,11.714,16.813999,17.49,11.654,19.361,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18030,128.990005,3.0,9.0,5.0,17.242001,11.003,18.827,18.099001,11.6,18.962999,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
26078,128.330002,4.0,16.0,5.0,16.094,12.303,15.482,17.219,13.163,19.337999,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
19823,125.209999,11.0,9.0,6.0,16.677,12.402,17.490999,16.756001,10.988,19.947001,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
15788,106.120003,5.0,6.0,6.0,16.962999,11.773,15.789,17.518999,11.808,18.009001,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


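XGBoost does not accept feature names that contain `[`, `]` or `<`, so the closing square brackets in the interval labels are replaced with curly braces. The `, ` separators are replaced with underscores as well, which also removes commas and spaces from the names.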

In [6]:
X_train_dummies.columns = X_train_dummies.columns.str.replace(']', '}', 
                                                              regex=False)
X_test_dummies.columns = X_test_dummies.columns.str.replace(']', '}',
                                                            regex=False)

X_train_dummies.columns = X_train_dummies.columns.str.replace(', ', '_',
                                                              regex=False)
X_test_dummies.columns = X_test_dummies.columns.str.replace(', ', '_',
                                                            regex=False)
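The replacements above only cover the characters that occur in this dataset. As a hedged sketch, a small helper could handle all three characters that XGBoost is known to reject (`[`, `]`, `<`) in one pass. The function name `sanitize_feature_names` and the chosen substitutes (`{`, `}`, `lt_`) are assumptions made for illustration:

```python
def sanitize_feature_names(columns):
    """Return column names without ', ' separators and without '[', ']' or '<'."""
    # Map each rejected character to a harmless substitute (illustrative choice).
    table = str.maketrans({'[': '{', ']': '}', '<': 'lt_'})
    cleaned = []
    for name in columns:
        name = name.replace(', ', '_')   # "(3.0, 4.0]" -> "(3.0_4.0]"
        name = name.translate(table)     # "(3.0_4.0]"  -> "(3.0_4.0}"
        cleaned.append(name)
    return cleaned


example = ['loading', 'measurement_2_grouped_(-0.001, 2.0]']
cleaned_example = sanitize_feature_names(example)
print(cleaned_example)
```

With a DataFrame, the same helper could be applied as `X.columns = sanitize_feature_names(X.columns)`, using identical code for the train and test sets so that their column names stay aligned.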

In [7]:
X_train_dummies.columns

Index(['loading', 'measurement_0', 'measurement_1', 'measurement_2',
       'measurement_3', 'measurement_4', 'measurement_5', 'measurement_6',
       'measurement_7', 'measurement_8', 'measurement_9', 'measurement_10',
       'measurement_11', 'measurement_12', 'measurement_13', 'measurement_14',
       'measurement_15', 'measurement_16', 'measurement_17', 'product_code_A',
       'product_code_B', 'product_code_C', 'product_code_D', 'product_code_E',
       'attribute_0_material_5', 'attribute_0_material_7',
       'attribute_1_material_5', 'attribute_1_material_6',
       'attribute_1_material_8', 'attribute_2_5', 'attribute_2_6',
       'attribute_2_8', 'attribute_2_9', 'attribute_3_5', 'attribute_3_6',
       'attribute_3_8', 'attribute_3_9', 'measurement_0_grouped_(-0.001_3.0}',
       'measurement_0_grouped_(3.0_4.0}', 'measurement_0_grouped_(4.0_5.0}',
       'measurement_0_grouped_(5.0_6.0}', 'measurement_0_grouped_(6.0_7.0}',
       'measurement_0_grouped_(7.0_8.0}', 'measure

In [8]:
print("X_train.shape: ", X_train_dummies.shape)
print("y_train.shape: ", y_train.shape)
print("X_test.shape: ", X_test_dummies.shape)
print("y_test.shape:", y_test.shape)

X_train.shape:  (19927, 64)
y_train.shape:  (19927, 1)
X_test.shape:  (6643, 64)
y_test.shape: (6643, 1)


In [9]:
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

print("y_train.shape: ", y_train.shape)
print("y_test.shape:", y_test.shape)

y_train.shape:  (19927,)
y_test.shape: (6643,)


# Baseline model <br>
Build a baseline XGBoost model and evaluate its performance on the train and test sets. AUC is used as the performance metric.

In [10]:
model_baseline = XGBClassifier(objective='binary:logistic',
                               eval_metric='auc', seed=42,
                               use_label_encoder=False)
model_baseline.fit(X_train_dummies, y_train)

print("Train set AUC: {}".format(
    roc_auc_score(y_train, model_baseline.predict(X_train_dummies))))
print("Test set AUC: {}".format(
    roc_auc_score(y_test, model_baseline.predict(X_test_dummies))))

Train set AUC: 0.5022410945977825
Test set AUC: 0.5005181248839538


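AUC values this close to 0.5 suggest that the model may underfit the dataset. Note that these scores are computed from the hard class labels returned by `predict` rather than from predicted probabilities, which tends to understate AUC. We can try to tune the model's parameters to improve its quality.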
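AUC is a ranking metric, so it is usually computed from continuous scores such as the positive-class column of `predict_proba`. The sketch below uses made-up labels and scores (not the project's data) to show how much information is lost when the scores are thresholded into labels first:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative imbalanced data: 6 negatives followed by 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Made-up positive-class probabilities; all of them are below 0.5.
proba = np.array([0.05, 0.10, 0.12, 0.20, 0.08, 0.15,
                  0.30, 0.25, 0.18, 0.40])

# Hard labels at the default 0.5 threshold: every sample becomes class 0.
labels = (proba >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, labels)  # constant predictions
auc_from_proba = roc_auc_score(y_true, proba)    # uses the full ranking

print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```

For a fitted classifier, the analogous call would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`.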

# Tuning parameter values <br>
Since XGBoost supports missing values and tree models do not require scaled features, there's no need to use StandardScaler, KNNImputer or pipelines. XGBoost models have many parameters, and some of them may have little influence on model quality. Because of that, RandomizedSearchCV will be used at the start of the parameter tuning process. Early stopping will also be used, with X_test_dummies as the evaluation set.
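To make the idea of early stopping concrete, here is a library-free sketch of the stopping rule: training halts once the validation score has not improved for a given number of rounds. The function name `early_stopping_round` and the score sequence are hypothetical, and this is not XGBoost's internal implementation:

```python
def early_stopping_round(scores, patience):
    """Return (stop_index, best_index) for validation scores where higher is better.

    Training stops at the first round that is `patience` rounds past the best
    round without any improvement; otherwise it runs to the last round.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = i                    # new best validation score
        elif i - best_index >= patience:
            return i, best_index              # no improvement for `patience` rounds
    return len(scores) - 1, best_index


# Made-up validation AUC per boosting round.
val_auc = [0.55, 0.57, 0.58, 0.579, 0.578, 0.577, 0.60]
stop_at, best_at = early_stopping_round(val_auc, patience=3)
print("stopped at round", stop_at, "best round", best_at)
```

In this example the later improvement to 0.60 is never reached, which illustrates why the patience value (here `early_stopping_rounds=50`) should be large relative to the noise in the validation metric.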

Round 1

In [15]:
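# NOTE: eval_set is an argument of fit(), not of the constructor. It must be
# passed as a list of (X, y) tuples, e.g.
#   clf.fit(X_train_dummies, y_train, eval_set=[(X_test_dummies, y_test)])
# otherwise early stopping has no evaluation data to monitor.
xgbclf = XGBClassifier(n_estimators=1000, objective='binary:logistic',
                       eval_metric='auc', seed=42,
                       early_stopping_rounds=50,
                       tree_method='gpu_hist', use_label_encoder=False)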

In [12]:
distr = {
    'learning_rate': uniform(0.01, 1),
    'gamma': [0, 0.1, 0.2],
    'max_depth': [10, 6, 3],
    'reg_lambda': uniform(0, 3),
    'scale_pos_weight': [0.25, 1],
    'subsample': [1, 0.9, 0.75],
    'colsample_bytree': [1, 0.9, 0.8]
}

clf = RandomizedSearchCV(xgbclf, distr, n_iter=70, cv=5, scoring='roc_auc',
                         random_state=42, verbose=2)

clf.fit(X_train_dummies, y_train)

print('Best score: ', clf.best_score_)
print('Best params: ', clf.best_params_)

Fitting 5 folds for each of 70 candidates, totalling 350 fits
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.1934347898661638, max_depth=10, reg_lambda=1.790550473839461, scale_pos_weight=1, subsample=0.75; total time=   9.4s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.1934347898661638, max_depth=10, reg_lambda=1.790550473839461, scale_pos_weight=1, subsample=0.75; total time=   9.3s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.1934347898661638, max_depth=10, reg_lambda=1.790550473839461, scale_pos_weight=1, subsample=0.75; total time=   9.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.1934347898661638, max_depth=10, reg_lambda=1.790550473839461, scale_pos_weight=1, subsample=0.75; total time=   9.2s
[CV] END colsample_bytree=0.8, gamma=0, learning_rate=0.1934347898661638, max_depth=10, reg_lambda=1.790550473839461, scale_pos_weight=1, subsample=0.75; total time=   9.5s
[CV] END colsample_bytree=0.8, gamma=0.2, learning_rate=0.469248891965867
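One detail of the search space is easy to miss: `scipy.stats.uniform(loc, scale)` describes the interval from `loc` to `loc + scale`, so `uniform(0.01, 1)` samples learning rates from [0.01, 1.01] rather than [0.01, 1]. The short check below illustrates this, and shows one possible way to cap the range at 1 (the variable names are illustrative):

```python
from scipy.stats import uniform

# As used in the search space: loc=0.01, scale=1.
lr_dist = uniform(0.01, 1)
lr_low, lr_high = lr_dist.ppf(0), lr_dist.ppf(1)   # lower and upper bounds

# To sample from [0.01, 1.0] instead, scale must be (upper - lower).
lr_capped = uniform(0.01, 0.99)
capped_high = lr_capped.ppf(1)

print("uniform(0.01, 1) bounds:", lr_low, lr_high)
print("uniform(0.01, 0.99) upper bound:", capped_high)
```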

As a result of the first round (randomized search), the following parameter values were selected:
* learning_rate = 0.0253
* max_depth = 3
* reg_lambda = 1.509
* gamma = 0.2
* scale_pos_weight = 0.25
* colsample_bytree = 0.9
* subsample = 1
<br> <br>
In the second round, grid search will be used to find the optimal parameter values. Since the lowest tested value of max_depth was selected in the first round, lower values of max_depth will be tested in the second round. Values of learning_rate and reg_lambda will be set close to the values received in the first round.

Round 2

In [18]:
grid = {
    'learning_rate': [0.01, 0.02, 0.025, 0.03, 0.04],
    'gamma': [0.2, 0.5, 1],
    'max_depth': [6, 5, 4, 3, 2],
    'reg_lambda': [1.4, 1.5, 1.6],
    'scale_pos_weight': [0.25],
    'subsample': [1],
    'colsample_bytree': [0.9]
}

clf = GridSearchCV(xgbclf, grid, cv=5, scoring='roc_auc',
                   verbose=1)

clf.fit(X_train_dummies, y_train)

print('Best score: ', clf.best_score_)
print('Best params: ', clf.best_params_)

Fitting 5 folds for each of 225 candidates, totalling 1125 fits
Best score:  0.5816683092783697
Best params:  {'colsample_bytree': 0.9, 'gamma': 0.2, 'learning_rate': 0.01, 'max_depth': 2, 'reg_lambda': 1.5, 'scale_pos_weight': 0.25, 'subsample': 1}
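The number of fits reported by GridSearchCV follows directly from the grid: the candidate count is the product of the list lengths, multiplied by the number of CV folds. A small check using only the standard library (the variable names are illustrative):

```python
from math import prod

# Round 2 grid, copied from the cell above.
grid = {
    'learning_rate': [0.01, 0.02, 0.025, 0.03, 0.04],
    'gamma': [0.2, 0.5, 1],
    'max_depth': [6, 5, 4, 3, 2],
    'reg_lambda': [1.4, 1.5, 1.6],
    'scale_pos_weight': [0.25],
    'subsample': [1],
    'colsample_bytree': [0.9],
}

cv_folds = 5
n_candidates = prod(len(values) for values in grid.values())  # 5 * 3 * 5 * 3
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

Because the count grows multiplicatively, adding even one more value to a parameter list noticeably increases the search time, which is why the later rounds fix most parameters to a single value.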


As a result of the second round of grid search, reg_lambda=1.5, gamma=0.2, max_depth=2 and learning_rate=0.01 were selected. <br>
So the tree depth should be set to 2 or 1 (max_depth=1 was not tested during the first 3 rounds). The optimal value for reg_lambda is 1.5. The optimal value of gamma appears to lie between 0.2 and 0.4. Taking that into account, the 4th round of grid search will be performed.

Round 4

In [20]:
grid = {
    'n_estimators': [1000],
    'learning_rate': [0.01, 0.001, 0.0001, 1e-5, 1e-6],
    'gamma': [0.2, 0.3, 0.4],
    'max_depth': [2, 1],
    'reg_lambda': [1.5],
    'scale_pos_weight': [0.25],
    'subsample': [1],
    'colsample_bytree': [0.9]
}

clf = GridSearchCV(xgbclf, grid, cv=5, scoring='roc_auc',
                   verbose=1)

clf.fit(X_train_dummies, y_train)

print('Best score: ', clf.best_score_)
print('Best params: ', clf.best_params_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best score:  0.5837430724489516
Best params:  {'colsample_bytree': 0.9, 'gamma': 0.2, 'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 1000, 'reg_lambda': 1.5, 'scale_pos_weight': 0.25, 'subsample': 1}


The 4th round of grid search shows that the optimal value for gamma is 0.2. In the 5th round, combinations of different numbers of estimators and learning rates will be tested.

Round 5

In [23]:
grid = {
    'n_estimators': [5000, 4000, 3000, 2000, 1000],
    'learning_rate': [0.01, 0.001, 0.0001, 1e-5, 1e-6],
    'gamma': [0.2],
    'max_depth': [2, 1],
    'reg_lambda': [1.5],
    'scale_pos_weight': [0.25],
    'subsample': [1],
    'colsample_bytree': [0.9]
}

clf = GridSearchCV(xgbclf, grid, cv=5, scoring='roc_auc',
                   verbose=1)

clf.fit(X_train_dummies, y_train)

print('Best score: ', clf.best_score_)
print('Best params: ', clf.best_params_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best score:  0.5852756591895497
Best params:  {'colsample_bytree': 0.9, 'gamma': 0.2, 'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 4000, 'reg_lambda': 1.5, 'scale_pos_weight': 0.25, 'subsample': 1}
