# KKBox Churn Prediction - Model Training with MLflow

This notebook trains three models (Logistic Regression, XGBoost, Random Forest) with hyperparameter tuning and comprehensive MLflow tracking.

**Key Features:**
- Separate encoding strategies for different model types
- RandomizedSearchCV for efficient hyperparameter tuning
- Class weight handling for imbalanced data
- Nested MLflow runs (one parent run per model, one child run per logged hyperparameter trial)
- Multiple evaluation metrics (ROC-AUC, Precision, Recall, F1)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_curve, auc
)
from scipy.stats import uniform, randint
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
# MLflow configuration
mlflow_tracking_uri = 'http://mlflow:5000'
mlflow.set_tracking_uri(mlflow_tracking_uri)
mlflow.set_experiment("kkbox-churn-prediction")

print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")
print(f"MLflow Experiment: {mlflow.get_experiment_by_name('kkbox-churn-prediction').experiment_id}")

## Load Your Prepared Data

**Note**: Replace the data loading cell below with your own code.
The rest of the notebook expects these variables to be defined:
- `X_train`, `y_train`
- `X_val`, `y_val`
- `X_test`, `y_test`
- `X_oot`, `y_oot`
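One way to produce these variables from a single snapshot table is a time-based split. The sketch below is only an illustration, not the split this notebook actually uses: the helper name `split_by_snapshot`, the cutoff dates, and the toy rows are all hypothetical.

```python
import pandas as pd


def split_by_snapshot(df, feature_cols, label_col, val_start, test_start, oot_start):
    """Hypothetical helper: split rows into train/val/test/oot by snapshot_date."""
    dates = pd.to_datetime(df["snapshot_date"])
    masks = {
        "train": dates < val_start,
        "val": (dates >= val_start) & (dates < test_start),
        "test": (dates >= test_start) & (dates < oot_start),
        "oot": dates >= oot_start,
    }
    # Each entry is an (X, y) pair restricted to that period
    return {name: (df.loc[m, feature_cols], df.loc[m, label_col]) for name, m in masks.items()}


# Toy rows standing in for data_pdf (values are made up)
toy = pd.DataFrame({
    "snapshot_date": pd.to_datetime(["2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01"]),
    "active_days_w30": [10, 3, 25, 0],
    "label": [0, 1, 0, 1],
})

splits = split_by_snapshot(
    toy, ["active_days_w30"], "label",
    val_start=pd.Timestamp("2016-02-01"),   # assumed cutoff
    test_start=pd.Timestamp("2016-03-01"),  # assumed cutoff
    oot_start=pd.Timestamp("2016-04-01"),   # assumed cutoff
)
X_train, y_train = splits["train"]
X_oot, y_oot = splits["oot"]
```

Splitting by date keeps every out-of-time row strictly later than the training rows, which is what makes the OOT score a check on temporal drift.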

In [None]:
data_pdf = pd.read_csv("data_pdf.csv", parse_dates=["snapshot_date"])
data_pdf["snapshot_date"] = data_pdf["snapshot_date"].dt.date

In [None]:
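# YOUR DATA LOADING CODE HERE
# Example (uncomment and adapt; the print statements at the end of this cell
# raise NameError until X_train, y_train, etc. are defined):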
# train_pdf = pd.read_csv('train_data.csv')
# val_pdf = pd.read_csv('val_data.csv')
# test_pdf = pd.read_csv('test_data.csv')
# oot_pdf = pd.read_csv('oot_data.csv')

feature_cols = ['tenure_days_at_snapshot', 'registered_via', 'city_clean', 
                'sum_secs_w30', 'active_days_w30', 'complete_rate_w30', 
                'sum_secs_w7', 'engagement_ratio_7_30', 'days_since_last_play', 
                'trend_secs_w30', 'auto_renew_share', 'last_is_auto_renew']

# X_train = train_pdf[feature_cols]
# y_train = train_pdf["label"]
# X_val = val_pdf[feature_cols]
# y_val = val_pdf["label"]
# X_test = test_pdf[feature_cols]
# y_test = test_pdf["label"]
# X_oot = oot_pdf[feature_cols]
# y_oot = oot_pdf["label"]

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"OOT set shape: {X_oot.shape}")
print(f"\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))

## Data Preparation: Proper Encoding Strategy

**Important**: We fit encoders on training data and transform all sets consistently to avoid data leakage.
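The same principle can be carried into cross-validation by wrapping preprocessing and model in a single `Pipeline`, so each fold re-fits the encoder on its own training part. The following is a minimal sketch on made-up data, not the notebook's actual preprocessing; the `StandardScaler` step is an addition for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training and validation frames (values are made up)
train = pd.DataFrame({
    "city_clean": [1, 2, 1, 3, 2, 1],
    "active_days_w30": [10.0, 3.0, 25.0, 0.0, 12.0, 7.0],
})
y = [0, 1, 0, 1, 0, 1]
val = pd.DataFrame({"city_clean": [2, 9], "active_days_w30": [5.0, 20.0]})  # 9 is unseen

# Encoder and scaler statistics are learned only from the data passed to fit()
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city_clean"]),
    ("num", StandardScaler(), ["active_days_w30"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

pipe.fit(train, y)
val_proba = pipe.predict_proba(val)[:, 1]  # unseen category 9 encodes as all zeros
```

Passing such a pipeline to `RandomizedSearchCV` as the estimator would keep held-out folds out of the encoder fit as well.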

In [None]:
# Identify categorical and numerical columns
categorical_cols = ['registered_via', 'city_clean']
numerical_cols = [col for col in feature_cols if col not in categorical_cols]

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

In [None]:
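# Create one-hot encoder fitted on training data (for Logistic Regression).
# Note: with drop='first' and handle_unknown='ignore', an unseen category encodes
# as all zeros, which is the same encoding as the dropped first category.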
print("\n[INFO] Creating one-hot encoder for Logistic Regression...")
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
ohe.fit(X_train[categorical_cols])

# Transform all datasets
X_train_cat_ohe = ohe.transform(X_train[categorical_cols])
X_val_cat_ohe = ohe.transform(X_val[categorical_cols])
X_test_cat_ohe = ohe.transform(X_test[categorical_cols])
X_oot_cat_ohe = ohe.transform(X_oot[categorical_cols])

# Get feature names
ohe_feature_names = ohe.get_feature_names_out(categorical_cols)

# Combine with numerical features
X_train_lr = np.hstack([X_train[numerical_cols].values, X_train_cat_ohe])
X_val_lr = np.hstack([X_val[numerical_cols].values, X_val_cat_ohe])
X_test_lr = np.hstack([X_test[numerical_cols].values, X_test_cat_ohe])
X_oot_lr = np.hstack([X_oot[numerical_cols].values, X_oot_cat_ohe])

print(f"One-hot encoded training shape: {X_train_lr.shape}")
print(f"Number of one-hot encoded features: {len(ohe_feature_names)}")

In [None]:
# Tree-based models (XGBoost, Random Forest) can split on integer category codes,
# so no one-hot encoding is applied here.
# This assumes the categorical columns already hold integer codes.
print("\n[INFO] Preparing data for tree-based models (using original encoding)...")

X_train_tree = X_train.copy()
X_val_tree = X_val.copy()
X_test_tree = X_test.copy()
X_oot_tree = X_oot.copy()

# Ensure categorical columns are integer type
for col in categorical_cols:
    X_train_tree[col] = X_train_tree[col].astype(int)
    X_val_tree[col] = X_val_tree[col].astype(int)
    X_test_tree[col] = X_test_tree[col].astype(int)
    X_oot_tree[col] = X_oot_tree[col].astype(int)

print(f"Tree-based model training shape: {X_train_tree.shape}")
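The integer cast above only works when the categorical columns already contain integer codes. If they hold strings instead, one option is an `OrdinalEncoder` fitted on the training set, with unseen categories mapped to a sentinel value. This is a sketch under that assumption; the city names are invented.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (city names are made up)
train_cat = pd.DataFrame({"city_clean": ["taipei", "kaohsiung", "taipei"]})
oot_cat = pd.DataFrame({"city_clean": ["kaohsiung", "tainan"]})  # "tainan" is unseen

# Fit on training data only; unseen categories become -1 at transform time
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_cat)

train_codes = enc.transform(train_cat).astype(int)
oot_codes = enc.transform(oot_cat).astype(int)
```

Codes are assigned in sorted category order, so they carry no meaning beyond identity; tree models can still split on them.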

## Utility Functions for Evaluation and Logging

In [None]:
def evaluate_model(model, X, y, dataset_name="", threshold=0.5):
    """
    Comprehensive model evaluation with multiple metrics.
    
    Args:
        model: Trained model
        X: Feature matrix
        y: True labels
        dataset_name: Name of dataset (train/val/test/oot)
        threshold: Classification threshold for precision/recall/F1
    
    Returns:
        Tuple of (metrics dict, predicted probabilities, thresholded predictions)
    """
    # Get predictions
    y_pred_proba = model.predict_proba(X)[:, 1]
    y_pred = (y_pred_proba >= threshold).astype(int)
    
    # Calculate metrics
    metrics = {
        f'{dataset_name}_roc_auc': roc_auc_score(y, y_pred_proba),
        f'{dataset_name}_precision': precision_score(y, y_pred, zero_division=0),
        f'{dataset_name}_recall': recall_score(y, y_pred, zero_division=0),
        f'{dataset_name}_f1': f1_score(y, y_pred, zero_division=0),
    }
    
    return metrics, y_pred_proba, y_pred


def plot_roc_curve(y_true, y_pred_proba, title="ROC Curve"):
    """
    Plot and return ROC curve figure.
    """
    fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(title)
    ax.legend(loc="lower right")
    ax.grid(alpha=0.3)
    
    return fig


def plot_confusion_matrix(y_true, y_pred, title="Confusion Matrix"):
    """
    Plot and return confusion matrix figure.
    """
    cm = confusion_matrix(y_true, y_pred)
    
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(title)
    ax.set_xticklabels(['No Churn', 'Churn'])
    ax.set_yticklabels(['No Churn', 'Churn'])
    
    return fig

## Define Hyperparameter Search Spaces

In [None]:
# Logistic Regression hyperparameter space
lr_param_dist = {
    'C': uniform(0.01, 10),  # Inverse regularization strength, sampled from [0.01, 10.01]
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [1000, 2000, 3000]
}

# XGBoost hyperparameter space
xgb_param_dist = {
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(100, 500),
    'min_child_weight': randint(1, 10),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'gamma': uniform(0, 5),
    'reg_alpha': uniform(0, 1),
    'reg_lambda': uniform(0, 1)
}

# Random Forest hyperparameter space
rf_param_dist = {
    'n_estimators': randint(100, 500),
    'max_depth': [None] + list(range(5, 30, 5)),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

# Cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Number of random parameter combinations to try
N_ITER = 50  # Adjust based on computational budget

print(f"Hyperparameter tuning configuration:")
print(f"  - CV folds: {cv.n_splits}")
print(f"  - Random search iterations: {N_ITER}")
print(f"  - Scoring metric: roc_auc")
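A note on the distributions used above: `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, not from `[loc, scale]`. The sketch below checks this empirically and shows `loguniform` as a possible alternative for scale-type parameters such as `C`. The log-uniform bounds are an assumption, not values from this notebook.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) covers [loc, loc + scale]
c_linear = uniform(0.01, 10)    # support is [0.01, 10.01]
c_log = loguniform(1e-3, 1e2)   # assumed alternative range for C

linear_draws = c_linear.rvs(size=1000, random_state=42)
log_draws = c_log.rvs(size=1000, random_state=42)

print(f"uniform draws:    min={linear_draws.min():.3f}, max={linear_draws.max():.3f}")
print(f"loguniform draws: min={log_draws.min():.4f}, max={log_draws.max():.2f}")
```

A log-uniform prior spends as many draws between 0.001 and 0.01 as between 10 and 100, which tends to suit regularization parameters better than a linear range.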

## Model Training Pipeline

We'll create a reusable function that:
1. Performs RandomizedSearchCV with nested MLflow logging
2. Logs up to 10 hyperparameter combinations as child runs and the full CV results table as a CSV artifact
3. Evaluates on all datasets (train, val, test, oot)
4. Creates and logs visualizations
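When picking which trials to log, keep in mind that `cv_results_` is ordered by sampling order, not by score. A minimal sketch of selecting the top-ranked trials, using a small synthetic dataset in place of the churn data:

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; sizes are arbitrary
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": uniform(0.01, 10)},
    n_iter=8,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)

# rank_test_score == 1 marks the best mean CV score
cv_results = pd.DataFrame(search.cv_results_)
top_k = cv_results.sort_values("rank_test_score").head(3)
print(top_k[["params", "mean_test_score", "rank_test_score"]])
```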

In [None]:
def train_and_log_model(model_name, base_model, param_distributions, 
                        X_train, y_train, X_val, y_val, X_test, y_test, X_oot, y_oot,
                        n_iter=50, cv=5):
    """
    Train model with hyperparameter tuning and comprehensive MLflow logging.
    
    Args:
        model_name: Name of the model for MLflow tracking
        base_model: Scikit-learn estimator
        param_distributions: Dictionary of hyperparameter distributions
        X_train, y_train: Training data
        X_val, y_val: Validation data
        X_test, y_test: Test data
        X_oot, y_oot: Out-of-time data
        n_iter: Number of random search iterations
        cv: Cross-validation splitter or number of folds
    
    Returns:
        Tuple of (best_model, all_metrics): the best estimator from
        RandomizedSearchCV and a dict of metrics on all datasets
    """
    print(f"\n{'='*80}")
    print(f"Training {model_name}")
    print(f"{'='*80}")
    
    # Start parent run
    with mlflow.start_run(run_name=f"{model_name}_hyperparameter_tuning"):
        
        # Log parent run metadata
        mlflow.set_tag("model_type", model_name)
        mlflow.set_tag("tuning_method", "RandomizedSearchCV")
        mlflow.log_param("n_iter", n_iter)
        mlflow.log_param("cv_folds", cv.n_splits if hasattr(cv, 'n_splits') else cv)
        mlflow.log_param("random_state", RANDOM_STATE)
        mlflow.log_param("train_samples", len(y_train))
        mlflow.log_param("val_samples", len(y_val))
        mlflow.log_param("test_samples", len(y_test))
        mlflow.log_param("oot_samples", len(y_oot))
        
        # Perform RandomizedSearchCV
        print(f"\n[INFO] Starting RandomizedSearchCV with {n_iter} iterations...")
        random_search = RandomizedSearchCV(
            estimator=base_model,
            param_distributions=param_distributions,
            n_iter=n_iter,
            cv=cv,
            scoring='roc_auc',
            n_jobs=-1,
            verbose=1,
            random_state=RANDOM_STATE,
            return_train_score=True
        )
        
        random_search.fit(X_train, y_train)
        
        print(f"\n[INFO] Best CV ROC-AUC: {random_search.best_score_:.4f}")
        print(f"[INFO] Best parameters: {random_search.best_params_}")
        
        # Log best parameters
        for param_name, param_value in random_search.best_params_.items():
            mlflow.log_param(f"best_{param_name}", param_value)
        
        mlflow.log_metric("best_cv_roc_auc", random_search.best_score_)
        
        # Log the top 10 hyperparameter combinations (by CV rank) as child runs.
        # cv_results_ is in sampling order, so sort by rank before selecting.
        print(f"\n[INFO] Logging individual hyperparameter combinations...")
        cv_results = pd.DataFrame(random_search.cv_results_)
        top_trials = cv_results.sort_values('rank_test_score').head(10)
        
        for trial_num, (_, trial) in enumerate(top_trials.iterrows(), start=1):
            with mlflow.start_run(run_name=f"{model_name}_trial_{trial_num}", nested=True):
                # Log parameters for this trial
                for param_name, param_value in trial['params'].items():
                    mlflow.log_param(param_name, param_value)
                
                # Log CV metrics
                mlflow.log_metric("mean_cv_roc_auc", trial['mean_test_score'])
                mlflow.log_metric("std_cv_roc_auc", trial['std_test_score'])
                mlflow.log_metric("mean_train_roc_auc", trial['mean_train_score'])
                mlflow.log_metric("rank", trial['rank_test_score'])
        
        # Get best model
        best_model = random_search.best_estimator_
        
        # Evaluate on all datasets
        print(f"\n[INFO] Evaluating best model on all datasets...")
        
        # Training set
        train_metrics, train_proba, train_pred = evaluate_model(
            best_model, X_train, y_train, dataset_name="train"
        )
        
        # Validation set
        val_metrics, val_proba, val_pred = evaluate_model(
            best_model, X_val, y_val, dataset_name="val"
        )
        
        # Test set
        test_metrics, test_proba, test_pred = evaluate_model(
            best_model, X_test, y_test, dataset_name="test"
        )
        
        # OOT set
        oot_metrics, oot_proba, oot_pred = evaluate_model(
            best_model, X_oot, y_oot, dataset_name="oot"
        )
        
        # Log all metrics
        all_metrics = {**train_metrics, **val_metrics, **test_metrics, **oot_metrics}
        for metric_name, metric_value in all_metrics.items():
            mlflow.log_metric(metric_name, metric_value)
        
        # Print results
        print(f"\n[RESULTS] {model_name} Performance:")
        print(f"  Train ROC-AUC: {train_metrics['train_roc_auc']:.4f}")
        print(f"  Val   ROC-AUC: {val_metrics['val_roc_auc']:.4f}")
        print(f"  Test  ROC-AUC: {test_metrics['test_roc_auc']:.4f}")
        print(f"  OOT   ROC-AUC: {oot_metrics['oot_roc_auc']:.4f}")
        
        # Create and log visualizations
        print(f"\n[INFO] Creating visualizations...")
        
        # ROC curves for each dataset
        for dataset_name, y_true, y_proba in [
            ('train', y_train, train_proba),
            ('val', y_val, val_proba),
            ('test', y_test, test_proba),
            ('oot', y_oot, oot_proba)
        ]:
            fig = plot_roc_curve(y_true, y_proba, title=f"{model_name} - {dataset_name.upper()} ROC Curve")
            mlflow.log_figure(fig, f"roc_curve_{dataset_name}.png")
            plt.close(fig)
        
        # Confusion matrices for validation, test, and OOT sets
        for dataset_name, y_true, y_pred in [
            ('val', y_val, val_pred),
            ('test', y_test, test_pred),
            ('oot', y_oot, oot_pred)
        ]:
            fig = plot_confusion_matrix(y_true, y_pred, 
                                       title=f"{model_name} - {dataset_name.upper()} Confusion Matrix")
            mlflow.log_figure(fig, f"confusion_matrix_{dataset_name}.png")
            plt.close(fig)
        
        # Log model
        print(f"\n[INFO] Logging model to MLflow...")
        mlflow.sklearn.log_model(best_model, f"{model_name}_model")
        
        # Log CV results dataframe
        cv_results_path = f"/tmp/{model_name}_cv_results.csv"
        cv_results.to_csv(cv_results_path, index=False)
        mlflow.log_artifact(cv_results_path, "cv_results")
        
        print(f"\n[SUCCESS] {model_name} training complete!")
        
    return best_model, all_metrics

## 1. Train Logistic Regression (with One-Hot Encoding)

In [None]:
# Calculate class weights for imbalanced data
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.array([0, 1]),
    y=y_train
)
class_weight_dict = {0: class_weights[0], 1: class_weights[1]}

print(f"Class weights: {class_weight_dict}")
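For reference, the `'balanced'` heuristic computes each class weight as `n_samples / (n_classes * class_count)`. The sketch below reproduces that arithmetic on a made-up 90/10 label vector and relates it to the negative-to-positive ratio commonly used for XGBoost's `scale_pos_weight`.

```python
import numpy as np

# Toy labels: 90 non-churners, 10 churners (made up)
y_toy = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y_toy)                   # per-class sample counts
n_samples, n_classes = len(y_toy), len(counts)

# 'balanced' weights: n_samples / (n_classes * count)
balanced = n_samples / (n_classes * counts)

# Negative-to-positive ratio
neg_pos_ratio = counts[0] / counts[1]

print(f"balanced weights: {dict(enumerate(balanced.round(4)))}")
print(f"neg/pos ratio:    {neg_pos_ratio:.1f}")
```

The ratio of the two balanced weights equals the negative-to-positive ratio, so both approaches up-weight the minority class by the same relative factor.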

In [None]:
# Initialize Logistic Regression with class weights
lr_base = LogisticRegression(
    class_weight=class_weight_dict,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

# Train with hyperparameter tuning
lr_model, lr_metrics = train_and_log_model(
    model_name="LogisticRegression",
    base_model=lr_base,
    param_distributions=lr_param_dist,
    X_train=X_train_lr,
    y_train=y_train,
    X_val=X_val_lr,
    y_val=y_val,
    X_test=X_test_lr,
    y_test=y_test,
    X_oot=X_oot_lr,
    y_oot=y_oot,
    n_iter=N_ITER,
    cv=cv
)

## 2. Train XGBoost (with Label Encoding)

In [None]:
# Calculate scale_pos_weight for XGBoost (handles class imbalance)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f"XGBoost scale_pos_weight: {scale_pos_weight:.2f}")

In [None]:
# Initialize XGBoost
xgb_base = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    random_state=RANDOM_STATE,
    eval_metric='auc',
    use_label_encoder=False,  # Only needed on older XGBoost releases; the flag is deprecated in newer ones
    n_jobs=-1
)

# Train with hyperparameter tuning
xgb_model, xgb_metrics = train_and_log_model(
    model_name="XGBoost",
    base_model=xgb_base,
    param_distributions=xgb_param_dist,
    X_train=X_train_tree,
    y_train=y_train,
    X_val=X_val_tree,
    y_val=y_val,
    X_test=X_test_tree,
    y_test=y_test,
    X_oot=X_oot_tree,
    y_oot=y_oot,
    n_iter=N_ITER,
    cv=cv
)

## 3. Train Random Forest (with Label Encoding)

In [None]:
# Initialize Random Forest with class weights
rf_base = RandomForestClassifier(
    class_weight=class_weight_dict,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

# Train with hyperparameter tuning
rf_model, rf_metrics = train_and_log_model(
    model_name="RandomForest",
    base_model=rf_base,
    param_distributions=rf_param_dist,
    X_train=X_train_tree,
    y_train=y_train,
    X_val=X_val_tree,
    y_val=y_val,
    X_test=X_test_tree,
    y_test=y_test,
    X_oot=X_oot_tree,
    y_oot=y_oot,
    n_iter=N_ITER,
    cv=cv
)

## Model Comparison

In [None]:
# Compare all models
comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'XGBoost', 'Random Forest'],
    'Train ROC-AUC': [
        lr_metrics['train_roc_auc'],
        xgb_metrics['train_roc_auc'],
        rf_metrics['train_roc_auc']
    ],
    'Val ROC-AUC': [
        lr_metrics['val_roc_auc'],
        xgb_metrics['val_roc_auc'],
        rf_metrics['val_roc_auc']
    ],
    'Test ROC-AUC': [
        lr_metrics['test_roc_auc'],
        xgb_metrics['test_roc_auc'],
        rf_metrics['test_roc_auc']
    ],
    'OOT ROC-AUC': [
        lr_metrics['oot_roc_auc'],
        xgb_metrics['oot_roc_auc'],
        rf_metrics['oot_roc_auc']
    ],
    'Test F1': [
        lr_metrics['test_f1'],
        xgb_metrics['test_f1'],
        rf_metrics['test_f1']
    ]
})

print("\n" + "="*80)
print("FINAL MODEL COMPARISON")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

# Identify best model
best_model_idx = comparison_df['Test ROC-AUC'].idxmax()
best_model_name = comparison_df.loc[best_model_idx, 'Model']
print(f"\n🏆 Best Model (by Test ROC-AUC): {best_model_name}")
print(f"   Test ROC-AUC: {comparison_df.loc[best_model_idx, 'Test ROC-AUC']:.4f}")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# ROC-AUC comparison across datasets
datasets = ['Train', 'Val', 'Test', 'OOT']
x = np.arange(len(datasets))
width = 0.25

axes[0].bar(x - width, comparison_df.iloc[:, 1:5].iloc[0], width, label='Logistic Regression')
axes[0].bar(x, comparison_df.iloc[:, 1:5].iloc[1], width, label='XGBoost')
axes[0].bar(x + width, comparison_df.iloc[:, 1:5].iloc[2], width, label='Random Forest')
axes[0].set_xlabel('Dataset')
axes[0].set_ylabel('ROC-AUC')
axes[0].set_title('Model Comparison: ROC-AUC Across Datasets')
axes[0].set_xticks(x)
axes[0].set_xticklabels(datasets)
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)
axes[0].set_ylim([0.5, 1.0])

# Test set metrics comparison
test_metrics_data = {
    'ROC-AUC': comparison_df['Test ROC-AUC'].values,
    'F1': comparison_df['Test F1'].values
}
x = np.arange(len(comparison_df))
width = 0.35

axes[1].bar(x - width/2, test_metrics_data['ROC-AUC'], width, label='ROC-AUC')
axes[1].bar(x + width/2, test_metrics_data['F1'], width, label='F1')
axes[1].set_xlabel('Model')
axes[1].set_ylabel('Score')
axes[1].set_title('Test Set Metrics Comparison')
axes[1].set_xticks(x)
axes[1].set_xticklabels(comparison_df['Model'], rotation=15, ha='right')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim([0, 1.0])

plt.tight_layout()
plt.savefig('/tmp/model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

# Log comparison to MLflow
with mlflow.start_run(run_name="Model_Comparison_Summary"):
    mlflow.log_figure(fig, "model_comparison.png")
    comparison_df.to_csv('/tmp/model_comparison.csv', index=False)
    mlflow.log_artifact('/tmp/model_comparison.csv')
    mlflow.log_metric("best_test_roc_auc", comparison_df['Test ROC-AUC'].max())
    mlflow.set_tag("best_model", best_model_name)

print("\n✅ Model comparison logged to MLflow!")

## Next Steps

1. **Review MLflow UI**: Open `http://mlflow:5000` to explore:
   - All hyperparameter combinations tried
   - Metrics across different datasets
   - Model artifacts and visualizations

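2. **Model Selection**: Based on the comparison above, select your best model for production. Prefer validation metrics for the choice itself, and keep the test and OOT scores as a final check on the selected model.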

3. **Model Registry**: Register the best model in MLflow Model Registry:
   ```python
   # Example (replace <run_id> with the ID of the model's parent run):
   model_uri = "runs:/<run_id>/XGBoost_model"
   mlflow.register_model(model_uri, "kkbox-churn-predictor")
   ```

4. **Further Tuning**: If needed:
   - Increase `N_ITER` for more thorough search
   - Try different threshold values for precision/recall trade-off
   - Perform feature engineering based on model insights
   - Consider ensemble methods
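
To make the threshold suggestion concrete, here is a minimal sketch that sweeps candidate thresholds with `precision_recall_curve` and picks the one maximizing F1. The labels and scores are made up; in practice the sweep would run on validation predictions so that the test set stays untouched.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted churn probabilities (made up)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_proba)

# precision and recall have one more entry than thresholds; drop the final point
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)  # clip guards against 0/0

best_idx = int(np.argmax(f1))
best_threshold = thresholds[best_idx]
print(f"best threshold={best_threshold:.2f}, F1={f1[best_idx]:.3f}")
```

The chosen value can then be passed as the `threshold` argument of `evaluate_model`. F1 is only one possible criterion; a cost-weighted objective may fit a retention campaign better.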