# Model Optimization

In this notebook, an XGBoost model is tuned on the full set of features selected in the dimensionality reduction step. Tuning targets the F1 score because the data is imbalanced and we want good predictions for both the minority and the majority class. F1 balances precision and recall on the churn class, so a model cannot score well by simply predicting the majority class.
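
For reference, F1 is the harmonic mean of precision and recall on the positive (churn) class. The symbols TP, FP and FN (true positives, false positives, false negatives) are notation introduced for this note and do not appear in the notebook's code:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

True negatives do not appear in the formula, so a classifier that always predicts the majority class has TP = 0 and therefore F1 = 0, however high its accuracy.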

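Hyperparameters are tuned with Bayesian optimization, an informed search strategy: it builds a probabilistic model of the results of past trials and uses it to choose the next configuration to try, instead of sampling blindly. Here the quantity being optimized is the validation F1 score, which is maximized. The Optuna framework is well suited to this; its default sampler, TPE (Tree-structured Parzen Estimator), implements this kind of search.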

# Preparing environment

In [1]:
import pandas as pd
import xgboost as xgb
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
import sys
sys.path.append('../ecommerce_customer_churn_prevention')
from utils import paths
import warnings
warnings.filterwarnings('ignore')

# Importing the data

In [2]:
X = pd.read_csv(paths.data_processed_dir('df_selected_features.csv'))
y = pd.read_csv(paths.data_processed_dir('df_processed.csv'))['Churn']
X.head()

Unnamed: 0,CashbackAmount_Tenure_Ratio,Tenure,Complain_PreferedOrderCat,Complain,PreferedOrderCat_MaritalStatus,Complain_MaritalStatus,NumberOfAddress,WarehouseToHome_Tenure_Ratio,SatisfactionScore_Tenure_Ratio,PreferredLoginDevice_PreferredPaymentMode,CityTier_PreferredPaymentMode,PreferedOrderCat,OrderCount_Tenure_Ratio,DaySinceLastOrder,PreferredLoginDevice_CityTier,CouponUsed,SatisfactionScore_NumberOfDeviceRegistered,Gender_PreferedOrderCat,SatisfactionScore,Gender_MaritalStatus,MaritalStatus,HourSpendOnApp_Tenure_Ratio,CouponUsed_Tenure_Ratio,SatisfactionScore_CashbackAmount,Gender_Complain,SatisfactionScore_OrderCount,PreferredPaymentMode
0,31.986,4.0,1_Laptop & Accessory,1,Laptop & Accessory_Single,1_Single,9,1.2,0.4,Mobile Phone_Debit Card,3_Debit Card,Laptop & Accessory,0.2,5.0,Mobile Phone_3,1.0,6,Female_Laptop & Accessory,2,Female_Single,Single,0.6,0.2,319.86,Female_1,2.0,Debit Card
1,,,1_Mobile Phone,1,Mobile Phone_Single,1_Single,7,,,Mobile Phone_UPI,1_UPI,Mobile Phone,,0.0,Mobile Phone_1,0.0,12,Male_Mobile Phone,3,Male_Single,Single,,,362.7,Male_1,3.0,UPI
2,,,1_Mobile Phone,1,Mobile Phone_Single,1_Single,6,,,Mobile Phone_Debit Card,1_Debit Card,Mobile Phone,,3.0,Mobile Phone_1,0.0,12,Male_Mobile Phone,3,Male_Single,Single,,,360.84,Male_1,3.0,Debit Card
3,134.07,0.0,0_Laptop & Accessory,0,Laptop & Accessory_Single,0_Single,8,15.0,5.0,Mobile Phone_Debit Card,3_Debit Card,Laptop & Accessory,1.0,3.0,Mobile Phone_3,0.0,20,Male_Laptop & Accessory,5,Male_Single,Single,2.0,0.0,670.35,Male_0,5.0,Debit Card
4,129.6,0.0,0_Mobile Phone,0,Mobile Phone_Single,0_Single,3,12.0,5.0,Mobile Phone_Credit Card,1_Credit Card,Mobile Phone,1.0,3.0,Mobile Phone_1,1.0,15,Male_Mobile Phone,5,Male_Single,Single,,1.0,648.0,Male_0,5.0,Credit Card


In [3]:
# Cast the original features to categorical, as specified in the data dictionary

cat_features = ['PreferredPaymentMode', 'PreferedOrderCat', 'MaritalStatus', 'Complain']

X[cat_features] = X[cat_features].astype('category')

# Cast the new features (the remaining object-dtype columns) to categorical
new_feat_cat = X.select_dtypes('object').columns.tolist()
X[new_feat_cat] = X[new_feat_cat].astype('category')

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 27 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   CashbackAmount_Tenure_Ratio                 5366 non-null   float64 
 1   Tenure                                      5366 non-null   float64 
 2   Complain_PreferedOrderCat                   5630 non-null   category
 3   Complain                                    5630 non-null   category
 4   PreferedOrderCat_MaritalStatus              5630 non-null   category
 5   Complain_MaritalStatus                      5630 non-null   category
 6   NumberOfAddress                             5630 non-null   int64   
 7   WarehouseToHome_Tenure_Ratio                5115 non-null   float64 
 8   SatisfactionScore_Tenure_Ratio              5366 non-null   float64 
 9   PreferredLoginDevice_PreferredPaymentMode   5630 non-null   category
 10  

# Dividing data into train and test

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Defining function to optimize

In [5]:
def objective(trial):
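    # Fixed settings first, then the hyperparameters suggested by Optuna.
    # 'use_label_encoder' is omitted: it belongs to the XGBClassifier wrapper, not to xgb.train.
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',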
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1.0, 10.0),
        'lambda': trial.suggest_float('lambda', 0.0, 10.0),
        'alpha': trial.suggest_float('alpha', 0.0, 10.0),
    }
    
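    # Train/validation split, drawn from the training set only so the test set stays unseen during tuning.
    # The fixed seed scores every trial on the same validation rows.
    X_train_f, X_val, y_train_f, y_val = train_test_split(X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)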
    
    # Train XGBoost
    dtrain = xgb.DMatrix(X_train_f, label=y_train_f, enable_categorical=True)
    dval = xgb.DMatrix(X_val, label=y_val, enable_categorical=True)
    model = xgb.train(params, dtrain, evals=[(dval, "validation")], verbose_eval=False, num_boost_round=100, early_stopping_rounds=10)
    
    # Predictions
    preds = model.predict(dval)
    preds = (preds > 0.5).astype(int)
    
    # Evaluate
    return f1_score(y_val, preds)

# Create and optimize the study

In [6]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

[I 2024-12-28 17:34:45,011] A new study created in memory with name: no-name-9c0b500b-c9a3-4f28-a99b-a0576e736459
[I 2024-12-28 17:34:45,390] Trial 0 finished with value: 0.7582608695652174 and parameters: {'learning_rate': 0.042233458384064684, 'max_depth': 8, 'min_child_weight': 1, 'subsample': 0.6265430679604262, 'colsample_bytree': 0.6252098959184998, 'gamma': 2.615790387202734, 'scale_pos_weight': 7.611939611097444, 'lambda': 4.097452090346288, 'alpha': 5.221593216246133}. Best is trial 0 with value: 0.7582608695652174.
[I 2024-12-28 17:34:45,514] Trial 1 finished with value: 0.6868044515103339 and parameters: {'learning_rate': 0.10411291694598152, 'max_depth': 3, 'min_child_weight': 8, 'subsample': 0.7643545317396281, 'colsample_bytree': 0.9128544786530477, 'gamma': 1.5301722998186418, 'scale_pos_weight': 8.50266211859486, 'lambda': 8.630697246501617, 'alpha': 9.525992052891109}. Best is trial 0 with value: 0.7582608695652174.
[I 2024-12-28 17:34:45,624] Trial 2 finished with val

In [7]:
# Best parameters
print("Best Parameters:", study.best_params)

Best Parameters: {'learning_rate': 0.18091369944679841, 'max_depth': 14, 'min_child_weight': 3, 'subsample': 0.8703506298699917, 'colsample_bytree': 0.7223654033799202, 'gamma': 0.23555286036098583, 'scale_pos_weight': 2.012695102219507, 'lambda': 2.703985849646468, 'alpha': 0.4279320792882001}


# Training the final model and testing on test data

In [15]:
# The final model is trained on the full training set (train + validation rows) and then evaluated on the held-out test set

best_params = study.best_params
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    enable_categorical=True,
    **best_params,
)
xgb_model.fit(X_train, y_train, verbose=0)
y_pred = xgb_model.predict(X_test)

In [16]:
# Testing the model

print(f'F1 Score for model optimized with Optuna and all the selected categories: {f1_score(y_test, y_pred)}')
print(f'Confusion matrix: \n{confusion_matrix(y_test, y_pred)}')

F1 Score for model optimized with Optuna and all the selected categories: 0.9402597402597402
Confusion matrix: 
[[922  14]
 [  9 181]]


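The model performs well on the test set: an F1 score of 0.94, with 14 false positives and 9 false negatives out of 1,126 customers (scikit-learn's confusion matrix puts true classes in rows and predicted classes in columns). This suggests that the model recognizes patterns that separate customers who churn from those who don't. The score is a fair measure of generalization only if the test rows played no part in hyperparameter tuning.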
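
As a quick check, the reported F1 score can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes those four counts and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix from the test-set output above
# Layout: [[TN, FP], [FN, TP]] (rows = true class, columns = predicted class)
cm = [[922, 14],
      [9, 181]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: precision=0.9282 recall=0.9526 f1=0.9403
```

Recall is slightly higher than precision, so on this test set the model misses fewer churners than it raises false alarms.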

In [17]:
# Saving the model
xgb_model.save_model(paths.models_dir('xgb_model_optuna_full.json'))