<h1 align="center">Predicting Conversion in Digital Marketing Dataset</h1>

<h2>Summary and Goal</h2>

The dataset contains user-level marketing data for campaigns. The goal is to predict which users will <b>not</b> convert. The conversion rate in this dataset is high (about 88%), so identifying the minority class, the non-converters, should be more useful. To increase campaign efficiency, these users could be added to an exclusion audience, or experiments could be run with different marketing strategies to find ways to raise their conversion rate.

<h2>Assumptions</h2>
<ul>
    <li>Customer engagement features hold each user's historical lifetime values. These include WebsiteVisits, PagesPerVisit, TimeOnSite, SocialShares, EmailOpens, and EmailClicks.
    <li>Marketing-specific metrics also hold each user's historical lifetime values. These include AdSpend, ClickThroughRate, and ConversionRate.
    <li>The target variable (Conversion) records the outcome of a new campaign.
</ul>

<h2>Data Source</h2>

Kaggle: https://www.kaggle.com/datasets/rabieelkharoua/predict-conversion-in-digital-marketing-dataset/data

In [30]:
import pandas as pd

In [31]:
df = pd.read_csv('digital_marketing_campaign_dataset.csv')

In [32]:
df.head()

Unnamed: 0,CustomerID,Age,Gender,Income,CampaignChannel,CampaignType,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,EmailOpens,EmailClicks,PreviousPurchases,LoyaltyPoints,AdvertisingPlatform,AdvertisingTool,Conversion
0,8000,56,Female,136912,Social Media,Awareness,6497.870068,0.043919,0.088031,0,2.399017,7.396803,19,6,9,4,688,IsConfid,ToolConfid,1
1,8001,69,Male,41760,Email,Retention,3898.668606,0.155725,0.182725,42,2.917138,5.352549,5,2,7,2,3459,IsConfid,ToolConfid,1
2,8002,46,Female,88456,PPC,Awareness,1546.429596,0.27749,0.076423,2,8.223619,13.794901,0,11,2,8,2337,IsConfid,ToolConfid,1
3,8003,32,Female,44085,PPC,Conversion,539.525936,0.137611,0.088004,47,4.540939,14.688363,89,2,2,0,2463,IsConfid,ToolConfid,1
4,8004,60,Female,83964,PPC,Conversion,1678.043573,0.252851,0.10994,0,2.046847,13.99337,6,6,6,8,4345,IsConfid,ToolConfid,1


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CustomerID           8000 non-null   int64  
 1   Age                  8000 non-null   int64  
 2   Gender               8000 non-null   object 
 3   Income               8000 non-null   int64  
 4   CampaignChannel      8000 non-null   object 
 5   CampaignType         8000 non-null   object 
 6   AdSpend              8000 non-null   float64
 7   ClickThroughRate     8000 non-null   float64
 8   ConversionRate       8000 non-null   float64
 9   WebsiteVisits        8000 non-null   int64  
 10  PagesPerVisit        8000 non-null   float64
 11  TimeOnSite           8000 non-null   float64
 12  SocialShares         8000 non-null   int64  
 13  EmailOpens           8000 non-null   int64  
 14  EmailClicks          8000 non-null   int64  
 15  PreviousPurchases    8000 non-null   i

In [34]:
df.describe()

Unnamed: 0,CustomerID,Age,Income,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,EmailOpens,EmailClicks,PreviousPurchases,LoyaltyPoints,Conversion
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,11999.5,43.6255,84664.19675,5000.94483,0.154829,0.104389,24.751625,5.549299,7.727718,49.79975,9.476875,4.467375,4.4855,2490.2685,0.8765
std,2309.54541,14.902785,37580.387945,2838.038153,0.084007,0.054878,14.312269,2.607358,4.228218,28.901165,5.711111,2.856564,2.888093,1429.527162,0.329031
min,8000.0,18.0,20014.0,100.054813,0.010005,0.010018,0.0,1.000428,0.501669,0.0,0.0,0.0,0.0,0.0,0.0
25%,9999.75,31.0,51744.5,2523.221165,0.082635,0.05641,13.0,3.302479,4.06834,25.0,5.0,2.0,2.0,1254.75,1.0
50%,11999.5,43.0,84926.5,5013.440044,0.154505,0.104046,25.0,5.534257,7.682956,50.0,9.0,4.0,4.0,2497.0,1.0
75%,13999.25,56.0,116815.75,7407.989369,0.228207,0.152077,37.0,7.835756,11.481468,75.0,14.0,7.0,7.0,3702.25,1.0
max,15999.0,69.0,149986.0,9997.914781,0.299968,0.199995,49.0,9.999055,14.995311,99.0,19.0,9.0,9.0,4999.0,1.0


In [35]:
df['CustomerID'].nunique()

8000

In [36]:
print(df['AdvertisingPlatform'].value_counts())
print(df['AdvertisingTool'].value_counts())

AdvertisingPlatform
IsConfid    8000
Name: count, dtype: int64
AdvertisingTool
ToolConfid    8000
Name: count, dtype: int64


In [37]:
# Drop features that are either IDs or constant across all rows
df_model = df.drop(columns=['CustomerID', 'AdvertisingPlatform', 'AdvertisingTool'])

In [38]:
# Drop rows where WebsiteVisits is 0 but PagesPerVisit or TimeOnSite > 0.
# These rows are logically inconsistent: with no visits there can be no pages viewed or time on site.
rows_before = len(df_model)

# Build the boolean mask once and reuse it for both counting and filtering
inconsistent_mask = (df_model['WebsiteVisits'] == 0) & (
    (df_model['PagesPerVisit'] > 0) | (df_model['TimeOnSite'] > 0)
)
print(f"Rows before filtering: {rows_before}")
print(f"Rows with inconsistent data (WebsiteVisits=0 but PagesPerVisit>0 or TimeOnSite>0): {inconsistent_mask.sum()}")

# Keep only the consistent rows; copy so that columns added later do not write into a slice of df
df_model = df_model[~inconsistent_mask].copy()

rows_after = len(df_model)
print(f"Rows after filtering: {rows_after}")
print(f"Rows dropped: {rows_before - rows_after}")


Rows before filtering: 8000
Rows with inconsistent data (WebsiteVisits=0 but PagesPerVisit>0 or TimeOnSite>0): 149
Rows after filtering: 7851
Rows dropped: 149


In [39]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7851 entries, 1 to 7999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                7851 non-null   int64  
 1   Gender             7851 non-null   object 
 2   Income             7851 non-null   int64  
 3   CampaignChannel    7851 non-null   object 
 4   CampaignType       7851 non-null   object 
 5   AdSpend            7851 non-null   float64
 6   ClickThroughRate   7851 non-null   float64
 7   ConversionRate     7851 non-null   float64
 8   WebsiteVisits      7851 non-null   int64  
 9   PagesPerVisit      7851 non-null   float64
 10  TimeOnSite         7851 non-null   float64
 11  SocialShares       7851 non-null   int64  
 12  EmailOpens         7851 non-null   int64  
 13  EmailClicks        7851 non-null   int64  
 14  PreviousPurchases  7851 non-null   int64  
 15  LoyaltyPoints      7851 non-null   int64  
 16  Conversion         7851 non-n

In [40]:
df_model.describe()

Unnamed: 0,Age,Income,AdSpend,ClickThroughRate,ConversionRate,WebsiteVisits,PagesPerVisit,TimeOnSite,SocialShares,EmailOpens,EmailClicks,PreviousPurchases,LoyaltyPoints,Conversion
count,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0,7851.0
mean,43.620431,84650.945612,5001.503823,0.15451,0.104387,25.221373,5.550547,7.726058,49.88753,9.484779,4.470259,4.486435,2489.470641,0.877977
std,14.915118,37593.190579,2837.227423,0.083965,0.054926,14.03139,2.609545,4.225544,28.889682,5.710298,2.857682,2.885794,1430.015617,0.327333
min,18.0,20014.0,100.054813,0.010005,0.010018,1.0,1.001882,0.501669,0.0,0.0,0.0,0.0,0.0,0.0
25%,31.0,51630.5,2521.206517,0.082243,0.056325,13.0,3.302288,4.070556,25.0,5.0,2.0,2.0,1254.5,1.0
50%,43.0,84913.0,5011.919049,0.153996,0.104086,25.0,5.534259,7.686031,50.0,9.0,4.0,4.0,2496.0,1.0
75%,56.5,116783.0,7403.056308,0.227744,0.152128,37.0,7.84237,11.466218,75.0,14.0,7.0,7.0,3703.5,1.0
max,69.0,149986.0,9997.914781,0.299968,0.199995,49.0,9.999055,14.995311,99.0,19.0,9.0,9.0,4999.0,1.0


In [41]:
# Convert Age column to categorical age_group
# Age 18-24 --> age_group 1
# Age 25-34 --> age_group 2
# Age 35-44 --> age_group 3
# Age 45-54 --> age_group 4
# Age 55-64 --> age_group 5
# Age 65+ --> age_group 6

def assign_age_group(age):
    if 18 <= age <= 24:
        return 1
    elif 25 <= age <= 34:
        return 2
    elif 35 <= age <= 44:
        return 3
    elif 45 <= age <= 54:
        return 4
    elif 55 <= age <= 64:
        return 5
    elif age >= 65:
        return 6
    else:
        return None

# Apply the function to create age_group column
df_model['age_group'] = df_model['Age'].apply(assign_age_group)

# Drop the original Age column and keep age_group as categorical
df_model = df_model.drop(columns=['Age'])

# Convert age_group to object type (categorical)
df_model['age_group'] = df_model['age_group'].astype('object')

print("Age column converted to age_group")
print(f"\nage_group value counts:")
print(df_model['age_group'].value_counts().sort_index())
print(f"\ndf_model columns: {df_model.columns.tolist()}")


Age column converted to age_group

age_group value counts:
age_group
1    1017
2    1501
3    1544
4    1540
5    1509
6     740
Name: count, dtype: int64

df_model columns: ['Gender', 'Income', 'CampaignChannel', 'CampaignType', 'AdSpend', 'ClickThroughRate', 'ConversionRate', 'WebsiteVisits', 'PagesPerVisit', 'TimeOnSite', 'SocialShares', 'EmailOpens', 'EmailClicks', 'PreviousPurchases', 'LoyaltyPoints', 'Conversion', 'age_group']
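
The same binning can also be expressed with `pd.cut`, which avoids a Python-level `apply`. This is a minimal sketch on a made-up `ages` series (not the dataset), assuming the same six brackets as in the cell above. The bin edges are right-inclusive, so `(17, 24]` covers ages 18 to 24.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages that hit every bracket boundary (illustration only)
ages = pd.Series([18, 24, 25, 34, 35, 44, 45, 54, 55, 64, 65, 69])

# Right-inclusive bins: (17, 24], (24, 34], (34, 44], (44, 54], (54, 64], (64, inf)
bins = [17, 24, 34, 44, 54, 64, np.inf]
labels = [1, 2, 3, 4, 5, 6]

# pd.cut returns a categorical; cast to object to match the dtype used in the notebook
age_group = pd.cut(ages, bins=bins, labels=labels).astype('object')
print(age_group.tolist())
```

Ages outside every bin (below 18 here) would come back as NaN, which plays the same role as the `None` branch in `assign_age_group`.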


In [42]:
# Import necessary libraries for machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, make_scorer
from sklearn.feature_selection import mutual_info_classif
import numpy as np

# Import external boosting libraries (if available)
try:
    from catboost import CatBoostClassifier
    CATBOOST_AVAILABLE = True
except ImportError:
    CATBOOST_AVAILABLE = False
    print("Note: CatBoost not available. Install with: pip install catboost")

try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("Note: XGBoost not available. Install with: pip install xgboost")

try:
    from lightgbm import LGBMClassifier
    LIGHTGBM_AVAILABLE = True
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("Note: LightGBM not available. Install with: pip install lightgbm")

# Create custom scorer for F1 score targeting class 0 (non-converters)
f1_score_class0 = make_scorer(f1_score, pos_label=0)
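
To make the effect of `pos_label=0` concrete, here is a small sketch on hand-made labels (not taken from the dataset). By default `f1_score` treats class 1 as the positive class; passing `pos_label=0` scores the non-converters instead.

```python
from sklearn.metrics import f1_score

# Toy labels for illustration only: 1 = converted, 0 = did not convert
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

f1_class1 = f1_score(y_true, y_pred)               # default: pos_label=1
f1_class0 = f1_score(y_true, y_pred, pos_label=0)  # non-converters as the positive class

print(f"F1 (class 1): {f1_class1:.4f}")
print(f"F1 (class 0): {f1_class0:.4f}")
```

The two scores differ because each one counts true positives, false positives, and false negatives relative to its own positive class.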


In [43]:
# Prepare data: separate features and target
X = df_model.drop(columns=['Conversion'])
y = df_model['Conversion']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nTarget percentage:")
print(y.value_counts(normalize=True) * 100)


Features shape: (7851, 16)
Target shape: (7851,)

Target distribution:
Conversion
1    6893
0     958
Name: count, dtype: int64

Target percentage:
Conversion
1    87.797733
0    12.202267
Name: proportion, dtype: float64
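
With roughly 88% converters, accuracy alone is a weak yardstick. As a quick check using the class counts printed above, a model that always predicts conversion reaches about 87.8% accuracy while identifying no non-converters at all, which is why the class-0 F1 score is used for model selection in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the target distribution printed above
y_true = np.array([1] * 6893 + [0] * 958)

# Trivial baseline: predict "converts" (1) for every user
y_pred = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_f1_class0 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

print(f"Baseline accuracy:     {baseline_accuracy:.4f}")
print(f"Baseline F1 (class 0): {baseline_f1_class0:.4f}")
```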


In [44]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")


Categorical columns: ['Gender', 'CampaignChannel', 'CampaignType', 'age_group']
Numerical columns: ['Income', 'AdSpend', 'ClickThroughRate', 'ConversionRate', 'WebsiteVisits', 'PagesPerVisit', 'TimeOnSite', 'SocialShares', 'EmailOpens', 'EmailClicks', 'PreviousPurchases', 'LoyaltyPoints']


In [45]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")


Training set size: 6280
Test set size: 1571
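
The split above is purely random, so the class ratio can drift slightly between the training and test sets. One option, sketched here on synthetic labels rather than the real data, is to pass `stratify=y` so that both sets keep nearly the same share of non-converters. The names `X_demo` and `y_demo` are illustrative, and adopting this in the notebook would change the downstream numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target, using the class counts of the filtered dataset
y_demo = np.array([1] * 6893 + [0] * 958)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions nearly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Non-converter share (train): {1 - y_tr.mean():.4f}")
print(f"Non-converter share (test):  {1 - y_te.mean():.4f}")
```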


In [46]:
# Create preprocessing pipeline
# One-hot encode categorical variables and standardize numerical variables
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)
    ],
    remainder='passthrough'
)

# Apply preprocessing
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"Processed training features shape: {X_train_processed.shape}")
print(f"Processed test features shape: {X_test_processed.shape}")

# Oversample the minority class (Conversion = 0) to address class imbalance
from imblearn.over_sampling import SMOTE

print("\n" + "="*70)
print("OVERSAMPLING MINORITY CLASS (Conversion = 0) WITH SMOTE")
print("="*70)

print(f"\nOriginal class distribution in training set:")
print(y_train.value_counts())
print(f"\nClass percentages:")
print(y_train.value_counts(normalize=True) * 100)

# Apply SMOTE to oversample the minority class
smote = SMOTE(random_state=42, sampling_strategy='auto')
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train_processed, y_train)

print(f"\nAfter SMOTE oversampling:")
print(f"Training set size: {X_train_oversampled.shape[0]} (was {X_train_processed.shape[0]})")
print(f"\nOversampled class distribution:")
print(pd.Series(y_train_oversampled).value_counts())
print(f"\nOversampled class percentages:")
print(pd.Series(y_train_oversampled).value_counts(normalize=True) * 100)


Processed training features shape: (6280, 25)
Processed test features shape: (1571, 25)

OVERSAMPLING MINORITY CLASS (Conversion = 0) WITH SMOTE

Original class distribution in training set:
Conversion
1    5516
0     764
Name: count, dtype: int64

Class percentages:
Conversion
1    87.834395
0    12.165605
Name: proportion, dtype: float64

After SMOTE oversampling:
Training set size: 11032 (was 6280)

Oversampled class distribution:
Conversion
1    5516
0    5516
Name: count, dtype: int64

Oversampled class percentages:
Conversion
1    50.0
0    50.0
Name: proportion, dtype: float64
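
Because SMOTE is applied to the whole training set before `GridSearchCV` runs, each validation fold contains synthetic points interpolated from rows in the other folds, which can make cross-validated scores optimistic. A common alternative is to place SMOTE inside an `imblearn` pipeline so that it is refit on the training portion of each fold only. The sketch below uses synthetic data from `make_classification` and a single logistic regression; `X_demo`, `y_demo`, and `pipe` are illustrative names, and this is not the notebook's actual pipeline. The import is guarded in the same way as the optional boosting libraries.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    IMBLEARN_AVAILABLE = True
except ImportError:
    IMBLEARN_AVAILABLE = False
    print("Note: imbalanced-learn not available. Install with: pip install imbalanced-learn")

# Synthetic imbalanced data (about 12% class 0), for illustration only
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.12, 0.88], random_state=42
)

best_cv_f1_class0 = None
if IMBLEARN_AVAILABLE:
    # The sampler runs only during fit, so validation folds are never oversampled
    pipe = ImbPipeline(steps=[
        ('smote', SMOTE(random_state=42)),
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(
        pipe,
        param_grid={'clf__C': [0.1, 1, 10]},
        scoring=make_scorer(f1_score, pos_label=0),
        cv=5,
    )
    search.fit(X_demo, y_demo)
    best_cv_f1_class0 = search.best_score_
    print(f"Best CV F1 (class 0): {best_cv_f1_class0:.4f}")
```

With this arrangement the cross-validated score is measured on untouched validation rows, so it should track test performance more closely than a score computed on already oversampled data.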


In [47]:
# Define models and their hyperparameter grids for optimization
# We'll use GridSearchCV with the class-0 (non-converter) F1 score as the scoring metric
# Note: Models are trained on oversampled data (X_train_oversampled, y_train_oversampled)
# but evaluated on original data for fair comparison

param_grids = {
    'Logistic Regression': {
        'C': [0.1, 1, 10],  # Reduced to 3 key values
        'penalty': ['l1', 'l2'],
        'class_weight': [None, 'balanced']
        # Total: 3 * 2 * 2 = 12 combinations
    },
    'Decision Tree': {
        'max_depth': [5, 10, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': ['sqrt', None]
        # Total: 4 * 3 * 3 * 2 = 72 combinations
    },
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [10, 20, None],  # Reduced to 3 values
        'min_samples_split': [2, 5],  # Reduced to 2 values
        'min_samples_leaf': [1, 2],  # Reduced to 2 values
        'max_features': ['sqrt', None]  # Reduced to 2 values
        # Total: 3 * 3 * 2 * 2 * 2 = 72 combinations (down from 324)
    },
    'Gradient Boosting': {
        'n_estimators': [100, 200],  # Early stopping will handle this
        'max_depth': [3, 5, 7],
        'learning_rate': [0.1, 0.2],  # Reduced to 2 values
        'min_samples_split': [2, 5],  # Reduced to 2 values
        'n_iter_no_change': [5, 10],  # Early stopping
        'subsample': [0.8, 1.0],  # Regularization
        'max_features': ['sqrt', None]  # Regularization
        # Total: 2 * 3 * 2 * 2 * 2 * 2 * 2 = 192 combinations
    },
    'AdaBoost': {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.1, 0.5, 1.0],
        'algorithm': ['SAMME', 'SAMME.R']
        # Total: 3 * 3 * 2 = 18 combinations
    }
}

# Add external boosting libraries if available
# Note: early_stopping_rounds removed from param_grids as it requires eval_set which GridSearchCV doesn't provide
if CATBOOST_AVAILABLE:
    param_grids['CatBoost'] = {
        'iterations': [100, 200],
        'depth': [3, 5, 7],
        'learning_rate': [0.1, 0.2],
        'l2_leaf_reg': [1, 3]  # Regularization
        # Total: 2 * 3 * 2 * 2 = 24 combinations
    }

if XGBOOST_AVAILABLE:
    param_grids['XGBoost'] = {
        'n_estimators': [100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.1, 0.2],
        'subsample': [0.8, 1.0],  # Regularization
        'colsample_bytree': [0.8, 1.0]  # Regularization
        # Total: 2 * 3 * 2 * 2 * 2 = 48 combinations
    }

if LIGHTGBM_AVAILABLE:
    param_grids['LightGBM'] = {
        'n_estimators': [100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.1, 0.2],
        'subsample': [0.8, 1.0],  # Regularization
        'colsample_bytree': [0.8, 1.0]  # Regularization
        # Total: 2 * 3 * 2 * 2 * 2 = 48 combinations
    }

# Base models
# Note: Gradient Boosting gets early stopping from n_iter_no_change in its param_grid;
# CatBoost takes early_stopping_rounds as a fixed constructor argument below
base_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42, warm_start=False),
    'AdaBoost': AdaBoostClassifier(random_state=42)
}

# Add external boosting libraries if available
if CATBOOST_AVAILABLE:
    base_models['CatBoost'] = CatBoostClassifier(
        random_state=42, 
        verbose=False, 
        allow_writing_files=False,
        early_stopping_rounds=10  # Set as fixed parameter instead of in grid
    )

if XGBOOST_AVAILABLE:
    base_models['XGBoost'] = XGBClassifier(
        random_state=42, 
        eval_metric='logloss', 
        use_label_encoder=False,
        tree_method='hist'  # Faster and more memory efficient
    )

if LIGHTGBM_AVAILABLE:
    base_models['LightGBM'] = LGBMClassifier(random_state=42, verbose=-1)

# Dictionary to store results
results = {}

# Optimize and evaluate each model
for name, base_model in base_models.items():
    print(f"\n{'='*60}")
    print(f"Optimizing {name}...")
    print(f"{'='*60}")
    
    # Perform grid search with F1 score for class 0 (non-converters) as the scoring metric
    grid_search = GridSearchCV(
        estimator=base_model,
        param_grid=param_grids[name],
        scoring=f1_score_class0,  # Use custom scorer targeting class 0
        cv=5,
        n_jobs=-1,
        verbose=1
    )
    
    # Use oversampled training data for model training
    grid_search.fit(X_train_oversampled, y_train_oversampled)
    
    # Get the best model
    best_model = grid_search.best_estimator_
    
    print(f"\nBest parameters for {name}:")
    print(grid_search.best_params_)
    print(f"Best CV F1 Score (Class 0): {grid_search.best_score_:.4f}")
    
    # Make predictions (both class predictions and probability predictions)
    y_train_pred = best_model.predict(X_train_processed)
    y_test_pred = best_model.predict(X_test_processed)
    y_train_proba = best_model.predict_proba(X_train_processed)[:, 1]
    y_test_proba = best_model.predict_proba(X_test_processed)[:, 1]
    
    # Calculate metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    # Calculate precision, recall, and F1 for class 0 (non-converters)
    train_precision_class0 = precision_score(y_train, y_train_pred, pos_label=0, zero_division=0)
    test_precision_class0 = precision_score(y_test, y_test_pred, pos_label=0, zero_division=0)
    train_recall_class0 = recall_score(y_train, y_train_pred, pos_label=0, zero_division=0)
    test_recall_class0 = recall_score(y_test, y_test_pred, pos_label=0, zero_division=0)
    train_f1_class0 = f1_score(y_train, y_train_pred, pos_label=0, zero_division=0)
    test_f1_class0 = f1_score(y_test, y_test_pred, pos_label=0, zero_division=0)
    
    # Also calculate metrics for class 1 for comparison
    train_precision_class1 = precision_score(y_train, y_train_pred, pos_label=1, zero_division=0)
    test_precision_class1 = precision_score(y_test, y_test_pred, pos_label=1, zero_division=0)
    train_recall_class1 = recall_score(y_train, y_train_pred, pos_label=1, zero_division=0)
    test_recall_class1 = recall_score(y_test, y_test_pred, pos_label=1, zero_division=0)
    train_f1_class1 = f1_score(y_train, y_train_pred, pos_label=1, zero_division=0)
    test_f1_class1 = f1_score(y_test, y_test_pred, pos_label=1, zero_division=0)
    
    train_roc_auc = roc_auc_score(y_train, y_train_proba)
    test_roc_auc = roc_auc_score(y_test, y_test_proba)
    
    # Store results
    results[name] = {
        'model': best_model,
        'best_params': grid_search.best_params_,
        'cv_f1_score_class0': grid_search.best_score_,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        'train_precision_class0': train_precision_class0,
        'test_precision_class0': test_precision_class0,
        'train_recall_class0': train_recall_class0,
        'test_recall_class0': test_recall_class0,
        'train_f1_class0': train_f1_class0,
        'test_f1_class0': test_f1_class0,
        'train_precision_class1': train_precision_class1,
        'test_precision_class1': test_precision_class1,
        'train_recall_class1': train_recall_class1,
        'test_recall_class1': test_recall_class1,
        'train_f1_class1': train_f1_class1,
        'test_f1_class1': test_f1_class1,
        'train_roc_auc': train_roc_auc,
        'test_roc_auc': test_roc_auc
    }
    
    # Print results - focusing on class 0 (non-converters)
    print(f"\n{'='*60}")
    print(f"Performance Metrics for Class 0 (Non-Converters):")
    print(f"{'='*60}")
    print(f"Train Precision (Class 0): {train_precision_class0:.4f}")
    print(f"Test Precision (Class 0): {test_precision_class0:.4f}")
    print(f"Train Recall (Class 0): {train_recall_class0:.4f}")
    print(f"Test Recall (Class 0): {test_recall_class0:.4f}")
    print(f"Train F1 Score (Class 0): {train_f1_class0:.4f}")
    print(f"Test F1 Score (Class 0): {test_f1_class0:.4f}")
    print(f"\n{'='*60}")
    print(f"Other Metrics:")
    print(f"{'='*60}")
    print(f"Train Accuracy: {train_accuracy:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Train F1 Score (Class 1): {train_f1_class1:.4f}")
    print(f"Test F1 Score (Class 1): {test_f1_class1:.4f}")
    print(f"Train ROC-AUC: {train_roc_auc:.4f}")
    print(f"Test ROC-AUC: {test_roc_auc:.4f}")
    
    # Print confusion matrix for test set
    print(f"\nTest Confusion Matrix:")
    print(confusion_matrix(y_test, y_test_pred))



Optimizing Logistic Regression...
Fitting 5 folds for each of 12 candidates, totalling 60 fits


30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py", line 1218, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\User


Best parameters for Logistic Regression:
{'C': 10, 'class_weight': None, 'penalty': 'l2'}
Best CV F1 Score (Class 0): 0.7437

Performance Metrics for Class 0 (Non-Converters):
Train Precision (Class 0): 0.2750
Test Precision (Class 0): 0.2790
Train Recall (Class 0): 0.7199
Test Recall (Class 0): 0.7320
Train F1 Score (Class 0): 0.3980
Test F1 Score (Class 0): 0.4040

Other Metrics:
Train Accuracy: 0.7350
Test Accuracy: 0.7333
Train F1 Score (Class 1): 0.8301
Test F1 Score (Class 1): 0.8282
Train ROC-AUC: 0.7949
Test ROC-AUC: 0.7994

Test Confusion Matrix:
[[ 142   52]
 [ 367 1010]]

Optimizing Decision Tree...
Fitting 5 folds for each of 72 candidates, totalling 360 fits

Best parameters for Decision Tree:
{'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best CV F1 Score (Class 0): 0.8542

Performance Metrics for Class 0 (Non-Converters):
Train Precision (Class 0): 1.0000
Test Precision (Class 0): 0.2784
Train Recall (Class 0): 1.0000
Test Recal

45 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 1358, in wrapper
    estimator._validate_params()
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 471, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\


Best parameters for AdaBoost:
{'algorithm': 'SAMME', 'learning_rate': 1.0, 'n_estimators': 100}
Best CV F1 Score (Class 0): 0.8822

Performance Metrics for Class 0 (Non-Converters):
Train Precision (Class 0): 0.6596
Test Precision (Class 0): 0.6644
Train Recall (Class 0): 0.5301
Test Recall (Class 0): 0.5103
Train F1 Score (Class 0): 0.5878
Test F1 Score (Class 0): 0.5773

Other Metrics:
Train Accuracy: 0.9096
Test Accuracy: 0.9077
Train F1 Score (Class 1): 0.9492
Test F1 Score (Class 1): 0.9482
Train ROC-AUC: 0.8338
Test ROC-AUC: 0.8181

Test Confusion Matrix:
[[  99   95]
 [  50 1327]]

Optimizing CatBoost...
Fitting 5 folds for each of 24 candidates, totalling 120 fits

Best parameters for CatBoost:
{'depth': 7, 'iterations': 200, 'l2_leaf_reg': 1, 'learning_rate': 0.2}
Best CV F1 Score (Class 0): 0.8890

Performance Metrics for Class 0 (Non-Converters):
Train Precision (Class 0): 1.0000
Test Precision (Class 0): 0.8452
Train Recall (Class 0): 0.9948
Test Recall (Class 0): 0.3660
T

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



Best parameters for XGBoost:
{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.8}
Best CV F1 Score (Class 0): 0.9038

Performance Metrics for Class 0 (Non-Converters):
Train Precision (Class 0): 0.9984
Test Precision (Class 0): 0.9091
Train Recall (Class 0): 0.8259
Test Recall (Class 0): 0.4124
Train F1 Score (Class 0): 0.9040
Test F1 Score (Class 0): 0.5674

Other Metrics:
Train Accuracy: 0.9787
Test Accuracy: 0.9223
Train F1 Score (Class 1): 0.9880
Test F1 Score (Class 1): 0.9573
Train ROC-AUC: 0.9979
Test ROC-AUC: 0.8001

Test Confusion Matrix:
[[  80  114]
 [   8 1369]]

Optimizing LightGBM...
Fitting 5 folds for each of 48 candidates, totalling 240 fits

Best parameters for LightGBM:
{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.8}
Best CV F1 Score (Class 0): 0.8951

Performance Metrics for Class 0 (Non-Converters):
Train Precision (Class 0): 0.9862
Test Precision (Class 0): 0.
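
The 30 failed Logistic Regression fits reported above are the `penalty='l1'` candidates: the traceback stops in `_check_solver`, and scikit-learn's default `lbfgs` solver does not support an L1 penalty. One way to keep both penalties in the search is to pick a solver that supports them, such as `liblinear`. The sketch below runs on synthetic data, so its scores are not comparable to the results above; `error_score='raise'` makes any invalid combination fail loudly instead of being scored as NaN. The 45 failed AdaBoost fits stop in parameter validation and account for exactly half of that grid, which suggests that one of the two `algorithm` values is not accepted by the installed scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.3, 0.7], random_state=42
)

# Same C and penalty values as the notebook's Logistic Regression grid
param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}

search = GridSearchCV(
    # liblinear supports both L1 and L2 penalties for binary problems
    LogisticRegression(solver='liblinear', max_iter=1000, random_state=42),
    param_grid=param_grid,
    scoring=make_scorer(f1_score, pos_label=0),
    cv=5,
    error_score='raise',
)
search.fit(X_demo, y_demo)

# Count candidates whose mean score is NaN, i.e. candidates with failed fits
n_failed = int(np.isnan(search.cv_results_['mean_test_score']).sum())
print(f"Candidates evaluated: {len(search.cv_results_['params'])}, failed: {n_failed}")
```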



In [48]:
# Create a comparison DataFrame - focusing on Class 0 (non-converters)
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'CV F1 (Class 0)': [results[m]['cv_f1_score_class0'] for m in results.keys()],
    'Test Precision (Class 0)': [results[m]['test_precision_class0'] for m in results.keys()],
    'Test Recall (Class 0)': [results[m]['test_recall_class0'] for m in results.keys()],
    'Test F1 (Class 0)': [results[m]['test_f1_class0'] for m in results.keys()],
    'Test Accuracy': [results[m]['test_accuracy'] for m in results.keys()],
    'Test F1 (Class 1)': [results[m]['test_f1_class1'] for m in results.keys()],
    'Test ROC-AUC': [results[m]['test_roc_auc'] for m in results.keys()]
})

# Sort by Test F1 Score for Class 0 (best first) - this is our primary metric
comparison_df = comparison_df.sort_values('Test F1 (Class 0)', ascending=False)

print("\n" + "="*100)
print("MODEL COMPARISON SUMMARY (Sorted by Test F1 Score for Class 0 - Non-Converters)")
print("="*100)
print(comparison_df.to_string(index=False))
print("\n" + "="*100)



MODEL COMPARISON SUMMARY (Sorted by Test F1 Score for Class 0 - Non-Converters)
              Model  CV F1 (Class 0)  Test Precision (Class 0)  Test Recall (Class 0)  Test F1 (Class 0)  Test Accuracy  Test F1 (Class 1)  Test ROC-AUC
           AdaBoost         0.882152                  0.664430               0.510309           0.577259       0.907702           0.948196      0.818102
           LightGBM         0.895059                  0.833333               0.438144           0.574324       0.919796           0.955727      0.806074
            XGBoost         0.903821                  0.909091               0.412371           0.567376       0.922342           0.957343      0.800130
  Gradient Boosting         0.906938                  0.880952               0.381443           0.532374       0.917250           0.954609      0.799467
           CatBoost         0.888999                  0.845238               0.365979           0.510791       0.913431           0.952514      0.795750
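
The class-0 figures in this table can be recomputed by hand from a confusion matrix. As a worked check, the sketch below takes the AdaBoost test confusion matrix printed earlier (rows are true classes 0 and 1, columns are predicted classes 0 and 1) and derives precision, recall, F1, and accuracy.

```python
import numpy as np

# AdaBoost test confusion matrix from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[99, 95],
               [50, 1327]])

tp0 = cm[0, 0]  # non-converters correctly flagged
fn0 = cm[0, 1]  # non-converters predicted as converters
fp0 = cm[1, 0]  # converters wrongly flagged as non-converters

precision0 = tp0 / (tp0 + fp0)
recall0 = tp0 / (tp0 + fn0)
f1_0 = 2 * precision0 * recall0 / (precision0 + recall0)
accuracy = np.trace(cm) / cm.sum()

print(f"Precision (class 0): {precision0:.6f}")
print(f"Recall (class 0):    {recall0:.6f}")
print(f"F1 (class 0):        {f1_0:.6f}")
print(f"Accuracy:            {accuracy:.6f}")
```

Read this way, AdaBoost flags 149 users as non-converters, of whom 99 really are, and it misses 95 of the 194 true non-converters in the test set.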
 

In [49]:
# Compare models trained with and without oversampling
print("\n" + "="*70)
print("COMPARING MODELS: WITH vs WITHOUT OVERSAMPLING")
print("="*70)

# Store results from oversampled models (already trained above)
results_oversampled = results.copy()

# Now train models WITHOUT oversampling for comparison
print("\nTraining models WITHOUT oversampling...")
results_no_oversampling = {}

for name, base_model in base_models.items():
    print(f"\n{'='*60}")
    print(f"Training {name} WITHOUT oversampling...")
    print(f"{'='*60}")
    
    # Perform grid search with F1 score for class 0 (non-converters) as the scoring metric
    # Train on original (non-oversampled) data
    grid_search = GridSearchCV(
        estimator=base_model,
        param_grid=param_grids[name],
        scoring=f1_score_class0,  # Use custom scorer targeting class 0
        cv=5,
        n_jobs=-1,
        verbose=0  # Set to 0 to reduce output
    )
    
    # Use original training data (no oversampling)
    grid_search.fit(X_train_processed, y_train)
    
    # Get the best model
    best_model = grid_search.best_estimator_
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV F1 Score (Class 0): {grid_search.best_score_:.4f}")
    
    # Make predictions
    y_train_pred = best_model.predict(X_train_processed)
    y_test_pred = best_model.predict(X_test_processed)
    
    # Calculate metrics for class 0
    train_f1_class0 = f1_score(y_train, y_train_pred, pos_label=0, zero_division=0)
    test_f1_class0 = f1_score(y_test, y_test_pred, pos_label=0, zero_division=0)
    train_precision_class0 = precision_score(y_train, y_train_pred, pos_label=0, zero_division=0)
    test_precision_class0 = precision_score(y_test, y_test_pred, pos_label=0, zero_division=0)
    train_recall_class0 = recall_score(y_train, y_train_pred, pos_label=0, zero_division=0)
    test_recall_class0 = recall_score(y_test, y_test_pred, pos_label=0, zero_division=0)
    
    # Store results (including the model object)
    results_no_oversampling[name] = {
        'model': best_model,
        'best_params': grid_search.best_params_,
        'cv_f1_score_class0': grid_search.best_score_,
        'train_f1_class0': train_f1_class0,
        'test_f1_class0': test_f1_class0,
        'train_precision_class0': train_precision_class0,
        'test_precision_class0': test_precision_class0,
        'train_recall_class0': train_recall_class0,
        'test_recall_class0': test_recall_class0,
        'train_accuracy': accuracy_score(y_train, y_train_pred),
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'train_f1_class1': f1_score(y_train, y_train_pred, pos_label=1, zero_division=0),
        'test_f1_class1': f1_score(y_test, y_test_pred, pos_label=1, zero_division=0),
        'test_roc_auc': roc_auc_score(y_test, best_model.predict_proba(X_test_processed)[:, 1])
    }

# Create comparison DataFrame
comparison_oversampling = pd.DataFrame({
    'Model': list(results_oversampled.keys()),
    'With Oversampling - Test F1 (Class 0)': [results_oversampled[m]['test_f1_class0'] for m in results_oversampled.keys()],
    'With Oversampling - Test Precision (Class 0)': [results_oversampled[m]['test_precision_class0'] for m in results_oversampled.keys()],
    'With Oversampling - Test Recall (Class 0)': [results_oversampled[m]['test_recall_class0'] for m in results_oversampled.keys()],
    'Without Oversampling - Test F1 (Class 0)': [results_no_oversampling[m]['test_f1_class0'] for m in results_no_oversampling.keys()],
    'Without Oversampling - Test Precision (Class 0)': [results_no_oversampling[m]['test_precision_class0'] for m in results_no_oversampling.keys()],
    'Without Oversampling - Test Recall (Class 0)': [results_no_oversampling[m]['test_recall_class0'] for m in results_no_oversampling.keys()],
    'F1 Improvement': [results_oversampled[m]['test_f1_class0'] - results_no_oversampling[m]['test_f1_class0'] for m in results_oversampled.keys()],
    'Recall Improvement': [results_oversampled[m]['test_recall_class0'] - results_no_oversampling[m]['test_recall_class0'] for m in results_oversampled.keys()],
    'Precision Change': [results_oversampled[m]['test_precision_class0'] - results_no_oversampling[m]['test_precision_class0'] for m in results_oversampled.keys()]
})

print("\n" + "="*100)
print("COMPARISON: WITH OVERSAMPLING vs WITHOUT OVERSAMPLING")
print("="*100)
print(comparison_oversampling.to_string(index=False))
print("\n" + "="*100)

# Determine which approach is better for each model
print("\n" + "="*70)
print("SUMMARY: Which approach performs better?")
print("="*70)
for model_name in results_oversampled.keys():
    f1_with = results_oversampled[model_name]['test_f1_class0']
    f1_without = results_no_oversampling[model_name]['test_f1_class0']
    improvement = f1_with - f1_without
    
    if improvement > 0:
        print(f"{model_name}: Oversampling is BETTER (F1 improvement: +{improvement:.4f})")
    elif improvement < 0:
        print(f"{model_name}: No oversampling is BETTER (F1 difference: {improvement:.4f})")
    else:
        print(f"{model_name}: Both approaches perform equally")

# Find best overall approach
best_f1_with = max([results_oversampled[m]['test_f1_class0'] for m in results_oversampled.keys()])
best_f1_without = max([results_no_oversampling[m]['test_f1_class0'] for m in results_no_oversampling.keys()])

print(f"\nBest Test F1 Score WITH oversampling: {best_f1_with:.4f}")
print(f"Best Test F1 Score WITHOUT oversampling: {best_f1_without:.4f}")

if best_f1_with > best_f1_without:
    print(f"\n✓ OVERALL WINNER: Oversampling (improvement: +{best_f1_with - best_f1_without:.4f})")
elif best_f1_without > best_f1_with:
    print(f"\n✓ OVERALL WINNER: No oversampling (improvement: +{best_f1_without - best_f1_with:.4f})")
else:
    print("\nBoth approaches perform equally")

# Calculate best individual model for ensemble comparison
best_f1_individual = max(best_f1_with, best_f1_without)
if best_f1_with >= best_f1_without:
    best_individual_name = max(results_oversampled.keys(), key=lambda x: results_oversampled[x]['test_f1_class0'])
    best_individual_results = results_oversampled[best_individual_name]
else:
    best_individual_name = max(results_no_oversampling.keys(), key=lambda x: results_no_oversampling[x]['test_f1_class0'])
    best_individual_results = results_no_oversampling[best_individual_name]



COMPARING MODELS: WITH vs WITHOUT OVERSAMPLING

Training models WITHOUT oversampling...

Training Logistic Regression WITHOUT oversampling...


30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py", line 1218, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\User

Best parameters: {'C': 10, 'class_weight': 'balanced', 'penalty': 'l2'}
Best CV F1 Score (Class 0): 0.3886

Training Decision Tree WITHOUT oversampling...
Best parameters: {'max_depth': 10, 'max_features': None, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best CV F1 Score (Class 0): 0.3700

Training Random Forest WITHOUT oversampling...
Best parameters: {'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best CV F1 Score (Class 0): 0.3768

Training Gradient Boosting WITHOUT oversampling...
Best parameters: {'learning_rate': 0.2, 'max_depth': 3, 'max_features': None, 'min_samples_split': 5, 'n_estimators': 100, 'n_iter_no_change': 10, 'subsample': 1.0}
Best CV F1 Score (Class 0): 0.5163

Training AdaBoost WITHOUT oversampling...


45 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 1358, in wrapper
    estimator._validate_params()
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 471, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\bnoha\AppData\Local\Programs\Python\Python310\lib\

Best parameters: {'algorithm': 'SAMME', 'learning_rate': 1.0, 'n_estimators': 200}
Best CV F1 Score (Class 0): 0.3814

Training CatBoost WITHOUT oversampling...
Best parameters: {'depth': 3, 'iterations': 100, 'l2_leaf_reg': 1, 'learning_rate': 0.2}
Best CV F1 Score (Class 0): 0.5771

Training XGBoost WITHOUT oversampling...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best CV F1 Score (Class 0): 0.5670

Training LightGBM WITHOUT oversampling...
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best CV F1 Score (Class 0): 0.5583

COMPARISON: WITH OVERSAMPLING vs WITHOUT OVERSAMPLING
              Model  With Oversampling - Test F1 (Class 0)  With Oversampling - Test Precision (Class 0)  With Oversampling - Test Recall (Class 0)  Without Oversampling - Test F1 (Class 0)  Without Oversampling - Test Precision (Class 0)  Without Oversampling - Test Recall (Class 0)  F1 Improvement  Recall Improvement  Precision Change
Logistic Regression                               0.403983                                      0.278978                                   0.731959                                  0.396648                                         0.272031               



In [50]:
# Create ensemble methods using top-performing models
print("\n" + "="*70)
print("TESTING ENSEMBLE METHODS")
print("="*70)

# Identify top 3-5 models based on Test F1 Score for Class 0
# Sort all models (both oversampled and non-oversampled) by performance
all_model_scores = []

# Add models from oversampled results
for name in results_oversampled.keys():
    all_model_scores.append({
        'name': name,
        'f1_score': results_oversampled[name]['test_f1_class0'],
        'model': results_oversampled[name]['model'],
        'source': 'oversampled'
    })

# Add models from non-oversampled results
for name in results_no_oversampling.keys():
    all_model_scores.append({
        'name': name,
        'f1_score': results_no_oversampling[name]['test_f1_class0'],
        'model': results_no_oversampling[name]['model'],
        'source': 'no_oversampling'
    })

# Sort by F1 score and get top models
all_model_scores.sort(key=lambda x: x['f1_score'], reverse=True)

print("\nTop performing models (sorted by Test F1 Score for Class 0):")
for i, model_info in enumerate(all_model_scores[:5], 1):
    print(f"{i}. {model_info['name']} ({model_info['source']}): {model_info['f1_score']:.4f}")

# Select top 3-5 models for ensemble (diverse set)
# Try to get models from different algorithms
top_models_for_ensemble = []
seen_types = set()

for model_info in all_model_scores:
    model_name = model_info['name']
    # Get diverse models: keep only the best-scoring entry per algorithm, because
    # VotingClassifier and StackingClassifier require unique estimator names
    if model_name not in seen_types:
        top_models_for_ensemble.append(model_info)
        seen_types.add(model_name)
    if len(top_models_for_ensemble) >= 5:
        break

print(f"\nSelected {len(top_models_for_ensemble)} models for ensemble:")
for i, model_info in enumerate(top_models_for_ensemble, 1):
    print(f"{i}. {model_info['name']} ({model_info['source']}): F1={model_info['f1_score']:.4f}")

# Prepare models for ensemble - retrain with best hyperparameters
from sklearn.base import clone

ensemble_models = []
for model_info in top_models_for_ensemble:
    model_name = model_info['name']
    source = model_info['source']
    
    # Get the best hyperparameters from the appropriate results dictionary
    if source == 'oversampled':
        best_params = results_oversampled[model_name]['best_params']
    else:
        best_params = results_no_oversampling[model_name]['best_params']
    
    # Get the base model and set hyperparameters
    if model_name in base_models:
        base_model = base_models[model_name]
        # Clone and set parameters
        model_for_ensemble = clone(base_model)
        model_for_ensemble.set_params(**best_params)
    else:
        # For external libraries, clone the trained model
        model_for_ensemble = clone(model_info['model'])
    
    ensemble_models.append((model_name, model_for_ensemble))

print(f"\nPrepared {len(ensemble_models)} models for ensemble with their best hyperparameters")



TESTING ENSEMBLE METHODS

Top performing models (sorted by Test F1 Score for Class 0):
1. LightGBM (no_oversampling): 0.6284
2. XGBoost (no_oversampling): 0.6127
3. CatBoost (no_oversampling): 0.6014
4. Gradient Boosting (no_oversampling): 0.5773
5. AdaBoost (oversampled): 0.5773

Selected 5 models for ensemble:
1. LightGBM (no_oversampling): F1=0.6284
2. XGBoost (no_oversampling): F1=0.6127
3. CatBoost (no_oversampling): F1=0.6014
4. Gradient Boosting (no_oversampling): F1=0.5773
5. AdaBoost (oversampled): F1=0.5773

Prepared 5 models for ensemble with their best hyperparameters


In [51]:
# Train ensemble methods
# Determine which training data to use based on best approach
use_oversampling_ensemble = best_f1_with >= best_f1_without
training_data_X = X_train_oversampled if use_oversampling_ensemble else X_train_processed
training_data_y = y_train_oversampled if use_oversampling_ensemble else y_train

print(f"\nUsing {'oversampled' if use_oversampling_ensemble else 'original'} training data for ensembles")
print(f"Training data shape: {training_data_X.shape}")

# 1. Voting Classifier (Hard Voting)
print("\n" + "="*70)
print("1. VOTING CLASSIFIER (Hard Voting)")
print("="*70)

voting_hard = VotingClassifier(
    estimators=ensemble_models,
    voting='hard',
    n_jobs=-1
)

voting_hard.fit(training_data_X, training_data_y)

# Make predictions
y_train_pred_voting_hard = voting_hard.predict(X_train_processed)
y_test_pred_voting_hard = voting_hard.predict(X_test_processed)

# Calculate metrics
train_f1_voting_hard = f1_score(y_train, y_train_pred_voting_hard, pos_label=0, zero_division=0)
test_f1_voting_hard = f1_score(y_test, y_test_pred_voting_hard, pos_label=0, zero_division=0)
test_precision_voting_hard = precision_score(y_test, y_test_pred_voting_hard, pos_label=0, zero_division=0)
test_recall_voting_hard = recall_score(y_test, y_test_pred_voting_hard, pos_label=0, zero_division=0)

print(f"Train F1 (Class 0): {train_f1_voting_hard:.4f}")
print(f"Test F1 (Class 0): {test_f1_voting_hard:.4f}")
print(f"Test Precision (Class 0): {test_precision_voting_hard:.4f}")
print(f"Test Recall (Class 0): {test_recall_voting_hard:.4f}")

# 2. Voting Classifier (Soft Voting) - if models support predict_proba
print("\n" + "="*70)
print("2. VOTING CLASSIFIER (Soft Voting)")
print("="*70)

try:
    voting_soft = VotingClassifier(
        estimators=ensemble_models,
        voting='soft',
        n_jobs=-1
    )
    
    voting_soft.fit(training_data_X, training_data_y)
    
    # Make predictions
    y_train_pred_voting_soft = voting_soft.predict(X_train_processed)
    y_test_pred_voting_soft = voting_soft.predict(X_test_processed)
    
    # Calculate metrics
    train_f1_voting_soft = f1_score(y_train, y_train_pred_voting_soft, pos_label=0, zero_division=0)
    test_f1_voting_soft = f1_score(y_test, y_test_pred_voting_soft, pos_label=0, zero_division=0)
    test_precision_voting_soft = precision_score(y_test, y_test_pred_voting_soft, pos_label=0, zero_division=0)
    test_recall_voting_soft = recall_score(y_test, y_test_pred_voting_soft, pos_label=0, zero_division=0)
    
    print(f"Train F1 (Class 0): {train_f1_voting_soft:.4f}")
    print(f"Test F1 (Class 0): {test_f1_voting_soft:.4f}")
    print(f"Test Precision (Class 0): {test_precision_voting_soft:.4f}")
    print(f"Test Recall (Class 0): {test_recall_voting_soft:.4f}")
    voting_soft_available = True
except Exception as e:
    print(f"Soft voting failed: {e}")
    voting_soft_available = False
    train_f1_voting_soft = 0
    test_f1_voting_soft = 0



Using original training data for ensembles
Training data shape: (6280, 25)

1. VOTING CLASSIFIER (Hard Voting)




Train F1 (Class 0): 0.7433
Test F1 (Class 0): 0.5950
Test Precision (Class 0): 0.9765
Test Recall (Class 0): 0.4278

2. VOTING CLASSIFIER (Soft Voting)




Train F1 (Class 0): 0.7504
Test F1 (Class 0): 0.6175
Test Precision (Class 0): 0.9670
Test Recall (Class 0): 0.4536


In [52]:
# 3. Stacking Classifier
print("\n" + "="*70)
print("3. STACKING CLASSIFIER")
print("="*70)

# Use Logistic Regression as the meta-learner
from sklearn.linear_model import LogisticRegression

stacking = StackingClassifier(
    estimators=ensemble_models,
    final_estimator=LogisticRegression(random_state=42, max_iter=1000),
    cv=5,
    n_jobs=-1
)

stacking.fit(training_data_X, training_data_y)

# Make predictions
y_train_pred_stacking = stacking.predict(X_train_processed)
y_test_pred_stacking = stacking.predict(X_test_processed)

# Calculate metrics
train_f1_stacking = f1_score(y_train, y_train_pred_stacking, pos_label=0, zero_division=0)
test_f1_stacking = f1_score(y_test, y_test_pred_stacking, pos_label=0, zero_division=0)
test_precision_stacking = precision_score(y_test, y_test_pred_stacking, pos_label=0, zero_division=0)
test_recall_stacking = recall_score(y_test, y_test_pred_stacking, pos_label=0, zero_division=0)

print(f"Train F1 (Class 0): {train_f1_stacking:.4f}")
print(f"Test F1 (Class 0): {test_f1_stacking:.4f}")
print(f"Test Precision (Class 0): {test_precision_stacking:.4f}")
print(f"Test Recall (Class 0): {test_recall_stacking:.4f}")

# Compare ensemble methods with best individual model
print("\n" + "="*70)
print("ENSEMBLE COMPARISON WITH BEST INDIVIDUAL MODEL")
print("="*70)

best_individual_f1 = best_f1_individual

print(f"\nBest Individual Model: {best_individual_name}")
print(f"  Test F1 (Class 0): {best_individual_f1:.4f}")

print(f"\nEnsemble Methods:")
print(f"  Voting (Hard) - Test F1 (Class 0): {test_f1_voting_hard:.4f} (improvement: {test_f1_voting_hard - best_individual_f1:+.4f})")
if voting_soft_available:
    print(f"  Voting (Soft) - Test F1 (Class 0): {test_f1_voting_soft:.4f} (improvement: {test_f1_voting_soft - best_individual_f1:+.4f})")
print(f"  Stacking - Test F1 (Class 0): {test_f1_stacking:.4f} (improvement: {test_f1_stacking - best_individual_f1:+.4f})")

# Create comparison DataFrame
ensemble_comparison = pd.DataFrame({
    'Method': ['Best Individual Model', 'Voting (Hard)', 'Voting (Soft)', 'Stacking'],
    'Test F1 (Class 0)': [
        best_individual_f1,
        test_f1_voting_hard,
        test_f1_voting_soft if voting_soft_available else None,
        test_f1_stacking
    ],
    'Test Precision (Class 0)': [
        best_individual_results['test_precision_class0'],
        test_precision_voting_hard,
        test_precision_voting_soft if voting_soft_available else None,
        test_precision_stacking
    ],
    'Test Recall (Class 0)': [
        best_individual_results['test_recall_class0'],
        test_recall_voting_hard,
        test_recall_voting_soft if voting_soft_available else None,
        test_recall_stacking
    ]
})

# Remove rows with None values
ensemble_comparison = ensemble_comparison.dropna()

print("\n" + "="*70)
print("DETAILED ENSEMBLE COMPARISON")
print("="*70)
print(ensemble_comparison.to_string(index=False))

# Find best overall method
best_ensemble_f1 = max([test_f1_voting_hard, test_f1_stacking] + ([test_f1_voting_soft] if voting_soft_available else []))
best_overall_f1 = max(best_individual_f1, best_ensemble_f1)

print(f"\n{'='*70}")
print("BEST OVERALL METHOD")
print(f"{'='*70}")
if best_overall_f1 > best_individual_f1:
    if best_ensemble_f1 == test_f1_stacking:
        print(f"✓ Stacking Classifier is the BEST (F1: {best_overall_f1:.4f})")
        print(f"  Improvement over best individual: +{best_overall_f1 - best_individual_f1:.4f}")
    elif best_ensemble_f1 == test_f1_voting_soft and voting_soft_available:
        print(f"✓ Voting (Soft) is the BEST (F1: {best_overall_f1:.4f})")
        print(f"  Improvement over best individual: +{best_overall_f1 - best_individual_f1:.4f}")
    else:
        print(f"✓ Voting (Hard) is the BEST (F1: {best_overall_f1:.4f})")
        print(f"  Improvement over best individual: +{best_overall_f1 - best_individual_f1:.4f}")
else:
    print(f"✓ Best Individual Model ({best_individual_name}) remains the BEST (F1: {best_overall_f1:.4f})")
    print(f"  Ensembles did not improve performance")



3. STACKING CLASSIFIER




Train F1 (Class 0): 0.7942
Test F1 (Class 0): 0.7165
Test Precision (Class 0): 0.9055
Test Recall (Class 0): 0.5928

ENSEMBLE COMPARISON WITH BEST INDIVIDUAL MODEL

Best Individual Model: LightGBM
  Test F1 (Class 0): 0.6284

Ensemble Methods:
  Voting (Hard) - Test F1 (Class 0): 0.5950 (improvement: -0.0334)
  Voting (Soft) - Test F1 (Class 0): 0.6175 (improvement: -0.0108)
  Stacking - Test F1 (Class 0): 0.7165 (improvement: +0.0881)

DETAILED ENSEMBLE COMPARISON
               Method  Test F1 (Class 0)  Test Precision (Class 0)  Test Recall (Class 0)
Best Individual Model           0.628378                  0.911765               0.479381
        Voting (Hard)           0.594982                  0.976471               0.427835
        Voting (Soft)           0.617544                  0.967033               0.453608
             Stacking           0.716511                  0.905512               0.592784

BEST OVERALL METHOD
✓ Stacking Classifier is the BEST (F1: 0.7165)
  Improvemen

In [53]:
# Identify the best overall model (including ensembles) based on Test F1 Score for Class 0 (non-converters)
# Use the comparison from the ensemble cell above to determine the best overall method

print(f"\n{'='*70}")
print(f"BEST OVERALL MODEL (Selected by Test F1 Score for Class 0 - Non-Converters)")
print(f"{'='*70}")

# Check if best overall is an ensemble or individual model
if best_overall_f1 > best_individual_f1:
    # Best is an ensemble method
    if best_ensemble_f1 == test_f1_stacking:
        best_method_name = "Stacking Classifier"
        best_method_type = "Ensemble"
        best_model_obj = stacking
        best_method_f1 = test_f1_stacking
        best_method_precision = test_precision_stacking
        best_method_recall = test_recall_stacking
        
        # Get predictions for classification report
        y_test_pred_best = stacking.predict(X_test_processed)
        y_test_proba_best = stacking.predict_proba(X_test_processed)[:, 1]
        
        print(f"\n✓ BEST METHOD: {best_method_name} (Ensemble)")
        print(f"  Test F1 Score (Class 0): {best_method_f1:.4f}")
        print(f"  Improvement over best individual model ({best_individual_name}): +{best_method_f1 - best_individual_f1:.4f}")
        
        print(f"\n{'='*70}")
        print(f"Ensemble Details:")
        print(f"{'='*70}")
        print(f"  Base Models: {', '.join([name for name, _ in ensemble_models])}")
        print(f"  Meta-learner: Logistic Regression")
        print(f"  Training Approach: {'WITH Oversampling (SMOTE)' if use_oversampling_ensemble else 'WITHOUT Oversampling'}")
        
    elif best_ensemble_f1 == test_f1_voting_soft and voting_soft_available:
        best_method_name = "Voting Classifier (Soft)"
        best_method_type = "Ensemble"
        best_model_obj = voting_soft
        best_method_f1 = test_f1_voting_soft
        best_method_precision = test_precision_voting_soft
        best_method_recall = test_recall_voting_soft
        
        # Get predictions for classification report
        y_test_pred_best = voting_soft.predict(X_test_processed)
        y_test_proba_best = voting_soft.predict_proba(X_test_processed)[:, 1]
        
        print(f"\n✓ BEST METHOD: {best_method_name} (Ensemble)")
        print(f"  Test F1 Score (Class 0): {best_method_f1:.4f}")
        print(f"  Improvement over best individual model ({best_individual_name}): +{best_method_f1 - best_individual_f1:.4f}")
        
        print(f"\n{'='*70}")
        print(f"Ensemble Details:")
        print(f"{'='*70}")
        print(f"  Base Models: {', '.join([name for name, _ in ensemble_models])}")
        print(f"  Voting Method: Soft (probability-weighted)")
        print(f"  Training Approach: {'WITH Oversampling (SMOTE)' if use_oversampling_ensemble else 'WITHOUT Oversampling'}")
        
    else:
        best_method_name = "Voting Classifier (Hard)"
        best_method_type = "Ensemble"
        best_model_obj = voting_hard
        best_method_f1 = test_f1_voting_hard
        best_method_precision = test_precision_voting_hard
        best_method_recall = test_recall_voting_hard
        
        # Get predictions for classification report
        y_test_pred_best = voting_hard.predict(X_test_processed)
        # Hard voting exposes no predict_proba; fall back to the hard labels,
        # so the ROC-AUC below reflects a single operating point only
        y_test_proba_best = y_test_pred_best
        
        print(f"\n✓ BEST METHOD: {best_method_name} (Ensemble)")
        print(f"  Test F1 Score (Class 0): {best_method_f1:.4f}")
        print(f"  Improvement over best individual model ({best_individual_name}): +{best_method_f1 - best_individual_f1:.4f}")
        
        print(f"\n{'='*70}")
        print(f"Ensemble Details:")
        print(f"{'='*70}")
        print(f"  Base Models: {', '.join([name for name, _ in ensemble_models])}")
        print(f"  Voting Method: Hard (majority vote)")
        print(f"  Training Approach: {'WITH Oversampling (SMOTE)' if use_oversampling_ensemble else 'WITHOUT Oversampling'}")
    
    print(f"\n{'='*70}")
    print(f"Test Set Performance for Class 0 (Non-Converters):")
    print(f"{'='*70}")
    print(f"  Test Precision (Class 0): {best_method_precision:.4f}")
    print(f"  Test Recall (Class 0): {best_method_recall:.4f}")
    print(f"  Test F1 Score (Class 0): {best_method_f1:.4f}")
    
    # Calculate additional metrics
    test_accuracy_best = accuracy_score(y_test, y_test_pred_best)
    test_f1_class1_best = f1_score(y_test, y_test_pred_best, pos_label=1, zero_division=0)
    test_roc_auc_best = roc_auc_score(y_test, y_test_proba_best)
    
    print(f"\nOther Test Set Metrics:")
    print(f"  Test Accuracy: {test_accuracy_best:.4f}")
    print(f"  Test F1 Score (Class 1): {test_f1_class1_best:.4f}")
    print(f"  Test ROC-AUC: {test_roc_auc_best:.4f}")
    
    # Show comparison with best individual model
    print(f"\n{'='*70}")
    print(f"Comparison with Best Individual Model ({best_individual_name}):")
    print(f"{'='*70}")
    print(f"  Best Individual Model - Test F1 (Class 0): {best_individual_f1:.4f}")
    print(f"  Best Ensemble Method - Test F1 (Class 0): {best_method_f1:.4f}")
    print(f"  Improvement: +{best_method_f1 - best_individual_f1:.4f}")
    
else:
    # Best is an individual model
    best_f1_with = max([results_oversampled[m]['test_f1_class0'] for m in results_oversampled.keys()])
    best_f1_without = max([results_no_oversampling[m]['test_f1_class0'] for m in results_no_oversampling.keys()])
    
    # Determine which approach is better and get the best model
    if best_f1_with >= best_f1_without:
        use_oversampling = True
        best_f1_score = best_f1_with
        best_model_name = max(results_oversampled.keys(), key=lambda x: results_oversampled[x]['test_f1_class0'])
        best_results = results_oversampled[best_model_name]
    else:
        use_oversampling = False
        best_f1_score = best_f1_without
        best_model_name = max(results_no_oversampling.keys(), key=lambda x: results_no_oversampling[x]['test_f1_class0'])
        best_results = results_no_oversampling[best_model_name]
    
    best_model_obj = best_results['model']
    
    print(f"\n✓ BEST METHOD: {best_model_name} (Individual Model)")
    print(f"  Training Approach: {'WITH Oversampling (SMOTE)' if use_oversampling else 'WITHOUT Oversampling'}")
    print(f"  Test F1 Score (Class 0): {best_f1_score:.4f}")
    
    print(f"\nBest Hyperparameters:")
    for param, value in best_results['best_params'].items():
        print(f"  {param}: {value}")
    
    print(f"\nCross-Validation F1 Score (Class 0): {best_results['cv_f1_score_class0']:.4f}")
    
    print(f"\n{'='*70}")
    print(f"Test Set Performance for Class 0 (Non-Converters):")
    print(f"{'='*70}")
    print(f"  Test Precision (Class 0): {best_results['test_precision_class0']:.4f}")
    print(f"  Test Recall (Class 0): {best_results['test_recall_class0']:.4f}")
    print(f"  Test F1 Score (Class 0): {best_results['test_f1_class0']:.4f}")
    
    print(f"\nOther Test Set Metrics:")
    print(f"  Test Accuracy: {best_results['test_accuracy']:.4f}")
    print(f"  Test F1 Score (Class 1): {best_results['test_f1_class1']:.4f}")
    print(f"  Test ROC-AUC: {best_results['test_roc_auc']:.4f}")
    
    # Show comparison with the other approach
    print(f"\n{'='*70}")
    print(f"Comparison with {'WITHOUT' if use_oversampling else 'WITH'} Oversampling:")
    print(f"{'='*70}")
    if use_oversampling:
        other_results = results_no_oversampling[best_model_name]
        print(f"  Without Oversampling - Test F1 (Class 0): {other_results['test_f1_class0']:.4f}")
        print(f"  Improvement from Oversampling: +{best_f1_score - other_results['test_f1_class0']:.4f}")
    else:
        other_results = results_oversampled[best_model_name]
        print(f"  With Oversampling - Test F1 (Class 0): {other_results['test_f1_class0']:.4f}")
        print(f"  Improvement without Oversampling: +{best_f1_score - other_results['test_f1_class0']:.4f}")
    
    # Get predictions for classification report
    y_test_pred_best = best_model_obj.predict(X_test_processed)
    
    # Show comparison with ensemble methods
    print(f"\n{'='*70}")
    print(f"Comparison with Ensemble Methods:")
    print(f"{'='*70}")
    print(f"  Best Individual Model - Test F1 (Class 0): {best_f1_score:.4f}")
    print(f"  Voting (Hard) - Test F1 (Class 0): {test_f1_voting_hard:.4f}")
    if voting_soft_available:
        print(f"  Voting (Soft) - Test F1 (Class 0): {test_f1_voting_soft:.4f}")
    print(f"  Stacking - Test F1 (Class 0): {test_f1_stacking:.4f}")
    print(f"  Note: Individual model performed better than all ensemble methods")

# Print detailed classification report for best model
print(f"\n{'='*70}")
if best_overall_f1 > best_individual_f1:
    print(f"Detailed Classification Report for {best_method_name}")
else:
    print(f"Detailed Classification Report for {best_model_name}")
print(f"{'='*70}")
print(classification_report(y_test, y_test_pred_best))



BEST OVERALL MODEL (Selected by Test F1 Score for Class 0 - Non-Converters)

✓ BEST METHOD: Stacking Classifier (Ensemble)
  Test F1 Score (Class 0): 0.7165
  Improvement over best individual model (LightGBM): +0.0881

Ensemble Details:
  Base Models: LightGBM, XGBoost, CatBoost, Gradient Boosting, AdaBoost
  Meta-learner: Logistic Regression
  Training Approach: WITHOUT Oversampling

Test Set Performance for Class 0 (Non-Converters):
  Test Precision (Class 0): 0.9055
  Test Recall (Class 0): 0.5928
  Test F1 Score (Class 0): 0.7165

Other Test Set Metrics:
  Test Accuracy: 0.9421
  Test F1 Score (Class 1): 0.9677
  Test ROC-AUC: 0.8117

Comparison with Best Individual Model (LightGBM):
  Best Individual Model - Test F1 (Class 0): 0.6284
  Best Ensemble Method - Test F1 (Class 0): 0.7165
  Improvement: +0.0881

Detailed Classification Report for Stacking Classifier




              precision    recall  f1-score   support

           0       0.91      0.59      0.72       194
           1       0.95      0.99      0.97      1377

    accuracy                           0.94      1571
   macro avg       0.93      0.79      0.84      1571
weighted avg       0.94      0.94      0.94      1571



<h2>Conclusion and Next Steps</h2>

A stacking ensemble of LightGBM, XGBoost, CatBoost, Gradient Boosting, and AdaBoost, with Logistic Regression as the meta-learner and trained without oversampling, performed best.

Our initial goal was to predict non-converters, and the stacking model does this with a test F1 score of 0.72 (precision 0.91, recall 0.59). This is likely the more valuable use case, as the historical conversion rate is already high at 88%. Users predicted not to convert could be added to an exclusion audience, which may increase campaign efficiency. Alternatively, experiments with different marketing strategies could test how to raise the conversion rate for users who are predicted not to convert under the existing strategy.
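
The class 0 figures can be reproduced from confusion counts. The counts below are not printed anywhere in the notebook; they are back-solved from the classification report (support of 194 and 1,377, precision 0.9055, recall 0.5928), so treat them as an inferred assumption rather than logged output.

```python
# Confusion counts for class 0 (non-converters) on the 1,571-user test set.
# ASSUMPTION: inferred from the reported support, precision, and recall.
tp = 115   # non-converters correctly flagged
fn = 79    # non-converters missed (194 - 115)
fp = 12    # converters wrongly flagged as non-converters
tn = 1365  # converters correctly left alone (1377 - 12)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
# prints: precision=0.9055 recall=0.5928 f1=0.7165 accuracy=0.9421
```

Read this way, an exclusion audience built from these test-set predictions would wrongly exclude roughly 12 converters while still leaving about 79 non-converters in the campaign.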

The F1 score for predicting converting users was also high at 0.97, so the model performs well on the majority class too. With 88% of users converting, part of that score reflects the class imbalance.
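
To put the class 1 score in context, it helps to compare it with a trivial baseline that predicts "converts" for every user. This is a small sketch that uses only the class supports from the classification report; the baseline itself is an illustration and was not run in the notebook.

```python
# Class supports taken from the test-set classification report.
n_non_converters = 194
n_converters = 1377

# Baseline: predict class 1 (converts) for every user.
precision_1 = n_converters / (n_converters + n_non_converters)
recall_1 = 1.0  # every true converter is predicted as a converter
baseline_f1_class1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# The same baseline never predicts class 0, so its class 0 F1 is 0.
baseline_f1_class0 = 0.0

print(f"baseline F1 (class 1): {baseline_f1_class1:.4f}")
print(f"baseline F1 (class 0): {baseline_f1_class0:.4f}")
```

The baseline already reaches a class 1 F1 of about 0.93, so the gain on the majority class is modest, while the class 0 F1 of 0.72 is where the model adds most of its value.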

To further improve the model, particularly for the non-converting user case, these strategies could be tested:
<ul>
<li>Adding aggregation features such as gender x age group, total page visits, and time spent per page
<li>Removing low-importance features, identified by mutual information
<li>More robust hyperparameter tuning with Optuna
</ul>
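
The aggregation features listed above could be sketched as follows. This is a minimal, dependency-free illustration, not code from the notebook: the helper names `age_group` and `add_aggregation_features`, the new column names, and the age band edges are all hypothetical. The two sample rows reuse values shown by `df.head()`.

```python
def age_group(age):
    """Bucket age into coarse bands (band edges are an illustrative assumption)."""
    if age < 30:
        return "under-30"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60+"


def add_aggregation_features(row):
    """Return a copy of one user row with three derived features added."""
    out = dict(row)
    # Interaction of gender and age band, e.g. "Female_45-59".
    out["GenderAgeGroup"] = f"{row['Gender']}_{age_group(row['Age'])}"
    # Approximate total pages viewed across all visits.
    out["TotalPageVisits"] = row["WebsiteVisits"] * row["PagesPerVisit"]
    # Time on site per page viewed; guard against division by zero.
    pages = row["PagesPerVisit"]
    out["TimePerPage"] = row["TimeOnSite"] / pages if pages > 0 else 0.0
    return out


# First two rows from df.head(), restricted to the columns used here.
rows = [
    {"Age": 56, "Gender": "Female", "WebsiteVisits": 0,
     "PagesPerVisit": 2.399017, "TimeOnSite": 7.396803},
    {"Age": 69, "Gender": "Male", "WebsiteVisits": 42,
     "PagesPerVisit": 2.917138, "TimeOnSite": 5.352549},
]
features = [add_aggregation_features(r) for r in rows]

for f in features:
    print(f["GenderAgeGroup"], round(f["TotalPageVisits"], 2), round(f["TimePerPage"], 2))
```

In the notebook itself the same logic would more naturally be written as vectorized column operations on `df`, applied before the train/test split and preprocessing so that both sets receive identical features.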