# ML Classification Pipeline - Example Usage

Demonstrates customer churn prediction with imbalanced data handling and comprehensive evaluation.

## What We'll Build:
1. **Data Preparation** - Generate and split synthetic customer data
2. **Model Training** - XGBoost classifier with SMOTE
3. **Evaluation** - ROC curves, confusion matrix, business metrics

In [1]:
# Import libraries
import sys
sys.path.append('./src')

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from classifier import ClassificationPipeline, compare_models
from evaluation import evaluate_classifier, plot_confusion_matrix, plot_roc_curve

print("✅ All modules imported successfully!")

✅ All modules imported successfully!


## Step 1: Generate Sample Customer Data

Simulate customer churn data with imbalanced classes, as is typical of real churn datasets.

In [2]:
# Generate imbalanced churn dataset
X, y = make_classification(
    n_samples=2000,
    n_features=15,
    n_informative=12,
    n_redundant=3,
    weights=[0.85, 0.15],  # 15% churn rate (realistic)
    random_state=42
)

# Create feature names (illustrative labels only; the synthetic columns carry no real meaning)
feature_names = [
    'tenure_months', 'monthly_charges', 'total_charges', 'contract_type',
    'payment_method', 'internet_service', 'phone_service', 'tech_support',
    'online_security', 'streaming_tv', 'paperless_billing', 'num_services',
    'avg_call_duration', 'support_tickets', 'late_payments'
]

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['churn'] = y

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Dataset: {X.shape[0]} customers, {X.shape[1]} features")
print(f"Churn rate: {y.mean()*100:.1f}%")
print(f"Training set: {X_train.shape[0]} | Test set: {X_test.shape[0]}")

Dataset: 2000 customers, 15 features
Churn rate: 15.4%
Training set: 1600 | Test set: 400
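
The `stratify=y` argument in the split above keeps the churn rate roughly equal in the training and test sets. As a rough illustration of the idea, here is a hand-rolled sketch that splits each class separately. It is not scikit-learn's implementation; the helper `stratified_split_indices` and the toy labels `y_demo` are hypothetical names introduced for this example.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # row positions of this class
        rng.shuffle(idx)                   # shuffle within the class
        n_test = int(round(len(idx) * test_size))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Hypothetical labels: 1700 retained customers (0) and 300 churners (1)
y_demo = np.array([0] * 1700 + [1] * 300)
train_idx, test_idx = stratified_split_indices(y_demo)

print(f"Overall churn rate: {y_demo.mean():.3f}")
print(f"Train churn rate:   {y_demo[train_idx].mean():.3f}")
print(f"Test churn rate:    {y_demo[test_idx].mean():.3f}")
```

Without stratification, a small test set can end up with noticeably more or fewer churners than the training set, which makes recall and precision estimates less stable.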


## Step 2: Train XGBoost Model with SMOTE

Handle class imbalance with SMOTE (Synthetic Minority Over-sampling Technique), which adds synthetic minority-class samples to the training set.
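
To make the oversampling step concrete, the following is a minimal NumPy sketch of the core SMOTE idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority-class neighbors. It is only an illustration and is not necessarily what `ClassificationPipeline` does internally; `smote_sketch` and the toy array `X_minority` are hypothetical names.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)  # random minority samples
    pick = rng.integers(0, k, size=n_new)           # one random neighbor for each
    nbr = neighbors[base, pick]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Hypothetical toy data: 20 minority-class points with 3 features
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 3))
X_synthetic = smote_sketch(X_minority, n_new=80)
print(X_synthetic.shape)
```

Oversampling should be applied to the training split only; resampling the test set would distort the evaluation metrics.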

In [3]:
# Create and train classifier
classifier = ClassificationPipeline(
    model_type='xgboost',
    imbalance_strategy='smote',
    random_state=42
)

classifier.fit(X_train, y_train, feature_names=feature_names)

# Make predictions
y_pred = classifier.predict(X_test)
y_pred_proba = classifier.predict_proba(X_test)

print("✅ Model trained successfully!")
print(f"\nTest set predictions: {len(y_pred)} customers")
print(f"Predicted churners: {y_pred.sum()} ({y_pred.sum()/len(y_pred)*100:.1f}%)")

TypeError: ClassificationPipeline.__init__() got an unexpected keyword argument 'imbalance_strategy'
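
The `TypeError` above means the constructor does not accept a keyword named `imbalance_strategy`. One way to find out which keywords it does accept is to inspect its signature. The sketch below uses a hypothetical stand-in class, `DemoPipeline`, because the real `ClassificationPipeline` signature is not shown here; its parameters are assumptions for illustration only.

```python
import inspect

class DemoPipeline:
    """Hypothetical stand-in; the real ClassificationPipeline may differ."""
    def __init__(self, model_type="xgboost", random_state=None):
        self.model_type = model_type
        self.random_state = random_state

def accepted_kwargs(cls):
    """Return the constructor's parameter names, excluding self."""
    params = inspect.signature(cls.__init__).parameters
    return [name for name in params if name != "self"]

print(accepted_kwargs(DemoPipeline))
```

Calling `accepted_kwargs(ClassificationPipeline)` in the notebook should list the actual keyword names, which can then replace `imbalance_strategy` in the cell above.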

## Step 3: Evaluate Model Performance

In [None]:
# Evaluate classifier
metrics = evaluate_classifier(y_test, y_pred, y_pred_proba[:, 1])

print("📊 Model Performance:")
print(f"Accuracy:  {metrics.accuracy:.3f}")
print(f"Precision: {metrics.precision:.3f} (% of predicted churners who actually churned)")
print(f"Recall:    {metrics.recall:.3f} (% of actual churners we identified)")
print(f"F1 Score:  {metrics.f1_score:.3f}")
print(f"ROC-AUC:   {metrics.roc_auc:.3f}")

# Show top features
print("\n🎯 Top 5 Churn Drivers:")
feature_importance = classifier.get_feature_importance()
for idx, row in feature_importance.head(5).iterrows():
    print(f"  {row['feature']}: {row['importance']:.3f}")
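
For reference, the metrics printed above can be computed directly from the confusion-matrix counts. The sketch below is a plain NumPy illustration under that standard definition and is not the code behind `evaluate_classifier`; the function names and the toy labels and scores are hypothetical.

```python
import numpy as np

def basic_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = churn)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # churners correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed churners
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

def roc_auc_pairwise(y_true, y_score):
    """ROC-AUC as the fraction of (churner, non-churner) pairs in which the
    churner receives the higher score; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical toy example: 3 churners among 10 customers
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_score_demo = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05, 0.15, 0.25])
y_pred_demo = (y_score_demo >= 0.5).astype(int)  # 0.5 decision threshold

demo = basic_metrics(y_true_demo, y_pred_demo)
demo["roc_auc"] = roc_auc_pairwise(y_true_demo, y_score_demo)
for name, value in demo.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data, accuracy alone can mislead: with a churn rate near 15%, a model that predicts no churn at all would still be about 85% accurate, which is why precision, recall, and ROC-AUC are reported alongside it.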