#### **04_MODEL_COMPARISON.IPYNB**
##### Comprehensive ML Model Training & Comparison


The purpose of this notebook:
1. Train and compare multiple ML models
2. Handle imbalanced data (SMOTE, class weights)
3. Select features (importance-based)
4. Tune hyperparameters (RandomizedSearchCV)
5. Evaluate models (F1, Precision, Recall, ROC-AUC, PR-AUC)
6. Select the final production model

Challenge: Severe class imbalance (1.26% conversion, roughly 78:1)
Solution: SMOTE plus imbalance-aware metrics (F1, PR-AUC)
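
To illustrate why PR-AUC is listed next to ROC-AUC: for scores that carry no signal, ROC-AUC stays near 0.5 regardless of class balance, while PR-AUC drops to roughly the positive rate. The sketch below uses randomly generated labels and scores with a similar imbalance. The data is synthetic and the variable names are illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(42)

# Synthetic labels with about 1.3% positives, and scores with no signal at all.
y_true = (rng.random(48000) < 0.013).astype(int)
y_score = rng.random(48000)

# PR-AUC: area under the precision-recall curve (trapezoidal rule).
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

# ROC-AUC for the same uninformative scores.
roc_auc = roc_auc_score(y_true, y_score)

print(f"Positive rate: {y_true.mean():.4f}")
print(f"ROC-AUC:       {roc_auc:.4f}")
print(f"PR-AUC:        {pr_auc:.4f}")
```

A model's PR-AUC is therefore best read against the positive rate rather than against 0.5.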



In [4]:
# IMPORTS
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             roc_curve, precision_recall_curve, auc, f1_score,
                             precision_score, recall_score, accuracy_score)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
import warnings
warnings.filterwarnings('ignore')

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Seed for reproducibility
np.random.seed(42)

print("📦 All libraries imported successfully!")


📦 All libraries imported successfully!


In [5]:
print("\n" + "="*70)
print("📊 DATA LOADING & PREPARATION")
print("="*70)

# Load featured data
df = pd.read_csv('../data/marketing_analytics_featured.csv')

print(f"\n✅ Data loaded: {df.shape}")
print(f"   Features: {df.shape[1]}")
print(f"   Samples: {df.shape[0]}")

# Target distribution check
print("\n🎯 TARGET DISTRIBUTION:")
print(df['Conversion'].value_counts())
print(f"\nConversion Rate: {df['Conversion'].mean() * 100:.2f}%")
print(f"Imbalance Ratio: {(df['Conversion']==0).sum() / (df['Conversion']==1).sum():.2f}:1")



📊 DATA LOADING & PREPARATION

✅ Data loaded: (48000, 37)
   Features: 37
   Samples: 48000

🎯 TARGET DISTRIBUTION:
Conversion
0    47393
1      607
Name: count, dtype: int64

Conversion Rate: 1.26%
Imbalance Ratio: 78.08:1
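
Both summary figures follow directly from the class counts. The arithmetic check below uses only the counts printed above (variable names are illustrative). It also shows why plain accuracy is a weak yardstick here: a model that always predicts "no conversion" already reaches about 98.7% accuracy.

```python
# Class counts taken from the value_counts() output above.
n_negative = 47393
n_positive = 607
n_total = n_negative + n_positive

conversion_rate = n_positive / n_total      # share of converting customers
imbalance_ratio = n_negative / n_positive   # negatives per positive
majority_accuracy = n_negative / n_total    # accuracy of always predicting 0

print(f"Conversion rate:   {conversion_rate * 100:.2f}%")
print(f"Imbalance ratio:   {imbalance_ratio:.2f}:1")
print(f"Majority baseline: {majority_accuracy * 100:.2f}% accuracy")
```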


#### **FEATURE SELECTION - PREPARE X, y**
1. Drop non-feature columns:
   - CustomerID (identifier, not predictive)
   - Conversion (the target, kept separately as y)

2. One-hot encode categorical columns:
   - CampaignChannel, CampaignType, Gender
   - AdvertisingPlatform, AdvertisingTool
   - Age_Group, Income_Tier, Loyalty_Tier
   - Channel_Performance

3. Keep all engineered features:
   - ROI_Proxy, CPA_Proxy, Site_Engagement, etc.


In [6]:
print("\n" + "="*70)
print("🔧 FEATURE PREPARATION")
print("="*70)

# Columns to drop
drop_cols = ['CustomerID', 'Conversion']

# Identify categorical columns for encoding
categorical_cols = ['CampaignChannel', 'CampaignType', 'Gender',
                    'AdvertisingPlatform', 'AdvertisingTool',
                    'Age_Group', 'Income_Tier', 'Loyalty_Tier',
                    'Channel_Performance']

# Separate features and target
X = df.drop(drop_cols, axis=1)
y = df['Conversion']



🔧 FEATURE PREPARATION


In [7]:
# One-Hot Encoding for categorical variables
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

print(f"\n✅ Feature matrix prepared:")
print(f"   Original features: {X.shape[1]}")
print(f"   After encoding: {X_encoded.shape[1]}")
print(f"   Target: {y.name}")

# Check for any remaining non-numeric columns.
# NOTE: pandas >= 2.0 returns bool dummy columns, and bool is not a subtype of
# np.number, so this check also drops every one-hot column created above.
# Pass dtype=int to pd.get_dummies if the encoded columns should be kept.
non_numeric = X_encoded.select_dtypes(exclude=[np.number]).columns.tolist()
if non_numeric:
    print(f"\n⚠️ Non-numeric columns found: {non_numeric}")
    X_encoded = X_encoded.drop(non_numeric, axis=1)
    print(f"   Dropped. Final features: {X_encoded.shape[1]}")



✅ Feature matrix prepared:
   Original features: 35
   After encoding: 57
   Target: Conversion

⚠️ Non-numeric columns found: ['CampaignChannel_Display', 'CampaignChannel_Email', 'CampaignChannel_PPC', 'CampaignChannel_Referral', 'CampaignChannel_SEO', 'CampaignChannel_Social Media', 'CampaignType_Consideration', 'CampaignType_Conversion', 'CampaignType_Retention', 'Gender_Male', 'AdvertisingPlatform_Google', 'AdvertisingPlatform_Instagram', 'AdvertisingPlatform_LinkedIn', 'AdvertisingPlatform_TikTok', 'AdvertisingPlatform_Twitter', 'AdvertisingPlatform_YouTube', 'AdvertisingTool_Hootsuite', 'AdvertisingTool_HubSpot', 'AdvertisingTool_MailChimp', 'AdvertisingTool_Meta Ads Manager', 'AdvertisingTool_SEMrush', 'Age_Group_MiddleAge', 'Age_Group_Senior', 'Age_Group_YoungAdult', 'Income_Tier_Low', 'Income_Tier_Medium', 'Income_Tier_VeryHigh', 'Loyalty_Tier_Gold', 'Loyalty_Tier_Silver', 'Channel_Performance_Low', 'Channel_Performance_Medium']
   Dropped. Final features: 26
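
The warning above lists all 31 one-hot columns. The likely cause is that recent pandas versions (2.0 and later) return boolean dummy columns from `pd.get_dummies`, and `select_dtypes(exclude=[np.number])` does not treat `bool` as numeric. If the intent was to keep the encoded columns, one option is to request integer dummies with `dtype=int`. A minimal sketch on a toy frame (the `toy` data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real feature matrix (values are made up).
toy = pd.DataFrame({
    "AdSpend": [100.0, 250.0, 80.0, 120.0],
    "Gender": ["Female", "Male", "Male", "Female"],
    "CampaignType": ["Awareness", "Conversion", "Retention", "Awareness"],
})

# dtype=int makes the dummy columns 0/1 integers, which np.number recognizes.
toy_encoded = pd.get_dummies(
    toy, columns=["Gender", "CampaignType"], drop_first=True, dtype=int
)

# The same check as in the cell above now finds nothing to drop.
non_numeric_demo = toy_encoded.select_dtypes(exclude=[np.number]).columns.tolist()
print(f"Encoded columns:     {toy_encoded.columns.tolist()}")
print(f"Non-numeric columns: {non_numeric_demo}")
```

The results recorded in this notebook come from the 26-feature matrix, so they would change if the encoded columns were kept.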


#### **TRAIN-TEST SPLIT (STRATIFIED)**

In [8]:
print("\n" + "="*70)
print("✂️ TRAIN-TEST SPLIT")
print("="*70)

# Stratified split (preserve class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y,
    test_size=0.20,
    random_state=42,
    stratify=y  # CRITICAL for imbalanced data!
)

print(f"\n✅ Split completed:")
print(f"   Train: {X_train.shape[0]} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"   Test:  {X_test.shape[0]} samples ({X_test.shape[0]/len(df)*100:.1f}%)")

print(f"\n📊 Class distribution preserved:")
print(f"   Train - Class 0: {(y_train==0).sum()} | Class 1: {(y_train==1).sum()}")
print(f"   Test  - Class 0: {(y_test==0).sum()} | Class 1: {(y_test==1).sum()}")



✂️ TRAIN-TEST SPLIT

✅ Split completed:
   Train: 38400 samples (80.0%)
   Test:  9600 samples (20.0%)

📊 Class distribution preserved:
   Train - Class 0: 37914 | Class 1: 486
   Test  - Class 0: 9479 | Class 1: 121
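
To see what `stratify=y` does, the sketch below repeats the split on a synthetic label vector with the same class counts (47,393 negatives and 607 positives) and compares the positive rate in each part. The feature matrix is a placeholder and the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset above.
y_demo = np.array([0] * 47393 + [1] * 607)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# With stratification, both parts keep roughly the same positive rate.
print(f"Overall positive rate: {y_demo.mean():.4f}")
print(f"Train positive rate:   {y_tr.mean():.4f} ({y_tr.sum()} positives)")
print(f"Test positive rate:    {y_te.mean():.4f} ({y_te.sum()} positives)")
```

Without `stratify`, the number of positives in a 9,600-row test set would vary from one random split to the next, which makes metrics on such a rare class noisier.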


#### **BASELINE MODEL (Logistic Regression)**
Purpose: a simple reference model to compare the other models against.

- Model: Logistic Regression (linear, interpretable)
- Imbalance handling: class_weight='balanced'
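
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. Applied to the training-set class counts reported by the split above, it works out as follows (a sketch; variable names are illustrative):

```python
import numpy as np

# Training-set class counts from the stratified split (class 0, class 1).
class_counts = np.array([37914, 486])
n_samples = class_counts.sum()
n_classes = len(class_counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
weights = n_samples / (n_classes * class_counts)

print(f"Weight for class 0: {weights[0]:.4f}")
print(f"Weight for class 1: {weights[1]:.4f}")
print(f"Relative weight of a positive sample: {weights[1] / weights[0]:.1f}x")
```

Each converting customer therefore counts about as much in the loss as 78 non-converting ones, which mirrors the imbalance ratio.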

In [9]:
print("\n" + "="*70)
print("🎯 BASELINE MODEL: LOGISTIC REGRESSION")
print("="*70)

# Scale features (important for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with balanced class weights
lr_model = LogisticRegression(
    class_weight='balanced',  # Handle imbalance
    max_iter=1000,
    random_state=42
)

lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)
y_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]


🎯 BASELINE MODEL: LOGISTIC REGRESSION


In [10]:
# Evaluation
print("\n📊 BASELINE MODEL RESULTS:")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, digits=4))

# Key metrics
f1_lr = f1_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr, zero_division=0)
recall_lr = recall_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_proba_lr)

print(f"\n🎯 KEY METRICS:")
print(f"   F1-Score:  {f1_lr:.4f}")
print(f"   Precision: {precision_lr:.4f}")
print(f"   Recall:    {recall_lr:.4f}")
print(f"   ROC-AUC:   {roc_auc_lr:.4f}")



📊 BASELINE MODEL RESULTS:

Classification Report:
              precision    recall  f1-score   support

           0     1.0000    0.9984    0.9992      9479
           1     0.8897    1.0000    0.9416       121

    accuracy                         0.9984      9600
   macro avg     0.9449    0.9992    0.9704      9600
weighted avg     0.9986    0.9984    0.9985      9600


🎯 KEY METRICS:
   F1-Score:  0.9416
   Precision: 0.8897
   Recall:    1.0000
   ROC-AUC:   1.0000


In [11]:
# Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(f"\n📋 CONFUSION MATRIX:")
print(f"   TN: {cm_lr[0,0]} | FP: {cm_lr[0,1]}")
print(f"   FN: {cm_lr[1,0]} | TP: {cm_lr[1,1]}")


📋 CONFUSION MATRIX:
   TN: 9464 | FP: 15
   FN: 0 | TP: 121
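
The key metrics reported earlier can be recomputed by hand from these four cells, which is a useful sanity check on how precision, recall, and F1 relate. A short sketch using only the counts above (variable names are illustrative):

```python
# Confusion-matrix cells from the output above.
tn, fp, fn, tp = 9464, 15, 0, 121

precision = tp / (tp + fp)   # of the predicted positives, the share that is correct
recall = tp / (tp + fn)      # of the actual positives, the share that is found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

These values agree with the classification report: the 15 false positives are what pull precision below 1, while zero false negatives give perfect recall.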
