# Model Optimization

In this case, a model of the data with the total of the selected features in the dimensionality reduction will be optimized, but with the first 18 features that led the model to have F1 score above 0.9, having the goal to check if these reduced features will improve or at least have the same performance of the previous model and reduce overfitting. Optimization of the model will be focused on F1 score since our data is imbalanced, and we want to have a good prediction from both minority and majority class.

The hyperparameter tuning will be made using bayesian optimization, which is a kind of "informed optimization" using a probabilistic approach in order to reach the minimum of the loss function score. The framework "optuna" is ideal for this bayesian optimization.

# Preparing environment

In [4]:
import pandas as pd
import xgboost as xgb
import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
import sys
sys.path.append('../ecommerce_customer_churn_prevention')
from utils import paths
import warnings
warnings.filterwarnings('ignore')

# Importing the data

In [5]:
X = pd.read_csv(paths.data_processed_dir('df_red_selected_features.csv'))
y = pd.read_csv(paths.data_processed_dir('df_processed.csv'))['Churn']
X.head()

Unnamed: 0,CashbackAmount_Tenure_Ratio,Tenure,Complain_PreferedOrderCat,Complain,PreferedOrderCat_MaritalStatus,Complain_MaritalStatus,NumberOfAddress,WarehouseToHome_Tenure_Ratio,SatisfactionScore_Tenure_Ratio,PreferredLoginDevice_PreferredPaymentMode,CityTier_PreferredPaymentMode,PreferedOrderCat,OrderCount_Tenure_Ratio,DaySinceLastOrder,PreferredLoginDevice_CityTier,CouponUsed,SatisfactionScore_NumberOfDeviceRegistered,Gender_PreferedOrderCat
0,31.986,4.0,1_Laptop & Accessory,1,Laptop & Accessory_Single,1_Single,9,1.2,0.4,Mobile Phone_Debit Card,3_Debit Card,Laptop & Accessory,0.2,5.0,Mobile Phone_3,1.0,6,Female_Laptop & Accessory
1,,,1_Mobile Phone,1,Mobile Phone_Single,1_Single,7,,,Mobile Phone_UPI,1_UPI,Mobile Phone,,0.0,Mobile Phone_1,0.0,12,Male_Mobile Phone
2,,,1_Mobile Phone,1,Mobile Phone_Single,1_Single,6,,,Mobile Phone_Debit Card,1_Debit Card,Mobile Phone,,3.0,Mobile Phone_1,0.0,12,Male_Mobile Phone
3,134.07,0.0,0_Laptop & Accessory,0,Laptop & Accessory_Single,0_Single,8,15.0,5.0,Mobile Phone_Debit Card,3_Debit Card,Laptop & Accessory,1.0,3.0,Mobile Phone_3,0.0,20,Male_Laptop & Accessory
4,129.6,0.0,0_Mobile Phone,0,Mobile Phone_Single,0_Single,3,12.0,5.0,Mobile Phone_Credit Card,1_Credit Card,Mobile Phone,1.0,3.0,Mobile Phone_1,1.0,15,Male_Mobile Phone


In [6]:
# Converting the features to categorical like the data dictionary

cat_features = ['PreferedOrderCat', 'Complain']

X[cat_features] = X[cat_features].astype('category')

# Converting the new features to categorical
new_feat_cat = [col for col in X.select_dtypes('object').columns]
X[new_feat_cat] = X[new_feat_cat].astype('category')

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 18 columns):
 #   Column                                      Non-Null Count  Dtype   
---  ------                                      --------------  -----   
 0   CashbackAmount_Tenure_Ratio                 5366 non-null   float64 
 1   Tenure                                      5366 non-null   float64 
 2   Complain_PreferedOrderCat                   5630 non-null   category
 3   Complain                                    5630 non-null   category
 4   PreferedOrderCat_MaritalStatus              5630 non-null   category
 5   Complain_MaritalStatus                      5630 non-null   category
 6   NumberOfAddress                             5630 non-null   int64   
 7   WarehouseToHome_Tenure_Ratio                5115 non-null   float64 
 8   SatisfactionScore_Tenure_Ratio              5366 non-null   float64 
 9   PreferredLoginDevice_PreferredPaymentMode   5630 non-null   category
 10  

# Dividing data into train and test

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Defining function to optimize

In [8]:
def objective(trial):
    # Suggest hyperparameters
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'use_label_encoder': False,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1.0, 10.0),
        'lambda': trial.suggest_float('lambda', 0.0, 10.0),
        'alpha': trial.suggest_float('alpha', 0.0, 10.0),
    }
    
    # Train/Validation split
    X_train_f, X_val, y_train_f, y_val = train_test_split(X, y, test_size=0.25, stratify=y)
    
    # Train XGBoost
    dtrain = xgb.DMatrix(X_train_f, label=y_train_f, enable_categorical=True)
    dval = xgb.DMatrix(X_val, label=y_val, enable_categorical=True)
    model = xgb.train(params, dtrain, evals=[(dval, "validation")], verbose_eval=False, num_boost_round=100, early_stopping_rounds=10)
    
    # Predictions
    preds = model.predict(dval)
    preds = (preds > 0.5).astype(int)
    
    # Evaluate
    return f1_score(y_val, preds)

# Create and optimize the study

In [9]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

[I 2024-12-28 18:01:19,984] A new study created in memory with name: no-name-b658de54-dd8d-48b2-aede-b6b65ca6a6dd
[I 2024-12-28 18:01:20,244] Trial 0 finished with value: 0.8028933092224232 and parameters: {'learning_rate': 0.2945697983942519, 'max_depth': 6, 'min_child_weight': 2, 'subsample': 0.7365039832052973, 'colsample_bytree': 0.5545923493259708, 'gamma': 0.31977490032228495, 'scale_pos_weight': 7.614220751242778, 'lambda': 9.315182247777875, 'alpha': 3.057711780578012}. Best is trial 0 with value: 0.8028933092224232.
[I 2024-12-28 18:01:20,396] Trial 1 finished with value: 0.8823529411764706 and parameters: {'learning_rate': 0.2713085420143745, 'max_depth': 10, 'min_child_weight': 9, 'subsample': 0.7982827586218127, 'colsample_bytree': 0.7999789742683275, 'gamma': 0.9028231294918715, 'scale_pos_weight': 9.66853029711783, 'lambda': 5.651205336766863, 'alpha': 0.22609399903703364}. Best is trial 1 with value: 0.8823529411764706.
[I 2024-12-28 18:01:20,508] Trial 2 finished with v

In [10]:
# Best parameters
print("Best Parameters:", study.best_params)

Best Parameters: {'learning_rate': 0.11059207827904094, 'max_depth': 9, 'min_child_weight': 9, 'subsample': 0.9653540652555519, 'colsample_bytree': 0.9357324803376276, 'gamma': 0.787619584367361, 'scale_pos_weight': 4.223837916590103, 'lambda': 5.246875260274468, 'alpha': 0.017323948443704662}


# Training the final model and testing on test data

In [11]:
# Training will be made with both train and val data, then tested on test data

best_params = study.best_params
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', use_label_encoder=False, enable_categorical=True, **best_params)
xgb_model.fit(X_train, y_train, verbose=0)
y_pred = xgb_model.predict(X_test)

In [12]:
# Testing the model

print(f'F1 Score for model optimized with Optuna and all the selected categories: {f1_score(y_test, y_pred)}')
print(f'Confusion matrix: \n{confusion_matrix(y_test, y_pred)}')

F1 Score for model optimized with Optuna and all the selected categories: 0.8985507246376812
Confusion matrix: 
[[898  38]
 [  4 186]]


The model has lower F1 Score than the previous model. Nontheless, 0.8985 F1 score with 4 false positives and 38 false negatives indicates a better performance on this model for predicting the customers who will really churn, since the complexity of the data was reduced and the patterns of the churn customers were clearer.

The decision of work with any of the models is on hands of the stakeholders, since they decide how to retain the customers and their strategy from this analysis, which may involves discounts and some other promotions for the churning customers predicted.

In [13]:
# Saving the model
xgb_model.save_model(paths.models_dir('xgb_model_optuna_red.json'))