# Model Building

In this notebook, we train several machine learning models to predict customer churn and use **MLflow** to track each run's parameters, metrics, and fitted model.

## Models to Train:
1. Logistic Regression
2. Random Forest Classifier
3. XGBoost Classifier
4. LightGBM Classifier

In [None]:
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import mlflow.xgboost
import mlflow.lightgbm
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Set pandas display options
pd.set_option('display.max_columns', None)

## 1. Load Processed Data

In [None]:
data_path = '../data/processed'

X_train = pd.read_csv(f'{data_path}/train_processed.csv')
X_val = pd.read_csv(f'{data_path}/val_processed.csv')

y_train = X_train.pop('Churn')
y_val = X_val.pop('Churn')

print(f"Train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"Val shape: {X_val.shape}, y_val shape: {y_val.shape}")
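
The models in this notebook are fitted on `X_train` and scored on `X_val`, so both frames need the same feature columns in the same order. The sketch below is one way to check that before training. `compare_feature_columns` is a hypothetical helper, not part of pandas or the original notebook, and it only compares column names, so it accepts any iterable such as `X_train.columns`.

```python
def compare_feature_columns(train_columns, val_columns):
    """Report how two column listings differ.

    Hypothetical helper: takes any iterables of column names,
    for example X_train.columns and X_val.columns.
    """
    train_columns = list(train_columns)
    val_columns = list(val_columns)
    train_set, val_set = set(train_columns), set(val_columns)
    return {
        # Columns the model is trained on but the validation frame lacks
        "missing_in_val": [c for c in train_columns if c not in val_set],
        # Columns that only appear in the validation frame
        "extra_in_val": [c for c in val_columns if c not in train_set],
        # True only when names and order are identical
        "same_order": train_columns == val_columns,
    }


# Illustrative column names, not taken from the real dataset
report = compare_feature_columns(
    ["tenure", "MonthlyCharges", "Contract"],
    ["tenure", "Contract", "TotalCharges"],
)
print(report)
```

In the notebook this would be called as `compare_feature_columns(X_train.columns, X_val.columns)` right after the two `pop` calls.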

## 2. Setup MLflow

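We set the tracking URI to `http://localhost:5000`, which assumes an MLflow tracking server is already running there, and select the experiment by name. `mlflow.set_experiment` creates the experiment if it does not exist yet.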
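
A hard-coded URI is fine for a local run, but MLflow also reads the `MLFLOW_TRACKING_URI` environment variable, which makes it easy to point the same notebook at a different server. A minimal sketch of that fallback logic follows. `resolve_tracking_uri` is a hypothetical helper, the example host name is made up, and the default simply mirrors the address used in this notebook.

```python
import os

DEFAULT_TRACKING_URI = "http://localhost:5000"  # same address as in this notebook


def resolve_tracking_uri(env=None):
    """Return the tracking URI from the environment, or the local default.

    Hypothetical helper; `env` is any mapping and defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Treat an unset or blank variable as "not configured"
    uri = (env.get("MLFLOW_TRACKING_URI") or "").strip()
    return uri or DEFAULT_TRACKING_URI


# No variable set: fall back to the local default
print(resolve_tracking_uri({}))
# Variable set: use it (host name below is a made-up example)
print(resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://mlflow.example.internal:5000"}))
```

The returned string can be passed straight to `mlflow.set_tracking_uri(...)`.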

In [None]:
tracking_uri = "http://localhost:5000"
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("Customer Churn Prediction")

print(f"MLflow tracking URI: {tracking_uri}")

## 3. Model Training & Tracking Loop

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # use_label_encoder is deprecated and ignored by recent XGBoost releases, so it is omitted
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=42),
    "LightGBM": LGBMClassifier(random_state=42, verbose=-1)
}

results = []

for model_name, model in models.items():
    with mlflow.start_run(run_name=model_name):
        print(f"Training {model_name}...")
        
        # Fit model
        model.fit(X_train, y_train)
        
        # Predict
        y_pred = model.predict(X_val)
        y_prob = model.predict_proba(X_val)[:, 1] if hasattr(model, "predict_proba") else None
        
        # Metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)
        auc = roc_auc_score(y_val, y_prob) if y_prob is not None else None
        
        # Log params and metrics
        mlflow.log_param("model_type", model_name)
        mlflow.log_params(model.get_params())
        
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.log_metric("f1_score", f1)
        # Compare against None so that a legitimate AUC of 0.0 is still logged
        if auc is not None:
            mlflow.log_metric("roc_auc", auc)
            
        # Log model
        mlflow.sklearn.log_model(model, "model")
        
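        # Formatting None with :.4f raises TypeError, so build the AUC text separately
        auc_text = f"{auc:.4f}" if auc is not None else "n/a"
        print(f"{model_name} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, AUC: {auc_text}")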
        
        results.append({
            "Model": model_name,
            "Accuracy": accuracy,
            "Precision": precision,
            "Recall": recall,
            "F1 Score": f1,
            "AUC": auc
        })
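
The four threshold-based metrics logged in the loop all derive from the confusion-matrix counts of the validation predictions. As a reference for reading those numbers, here is a plain-Python sketch of the definitions for binary labels where `1` is the positive (churn) class. `binary_metrics` is a hypothetical helper, not scikit-learn's implementation, and returning `0.0` on a zero denominator is a choice made here for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly left alone

    def safe_div(num, den):
        # Assumption for this sketch: an undefined ratio is reported as 0.0
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Toy labels for illustration only, not data from this notebook
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Accuracy counts true negatives, so with imbalanced churn labels it can stay high while recall is poor. F1 ignores true negatives, which makes it the more informative of the two for the positive class.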

## 4. Compare Results

Let's compare the models on the validation set, ranked by F1 score.

In [None]:
results_df = pd.DataFrame(results)
results_df.sort_values(by="F1 Score", ascending=False, inplace=True)
results_df
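
When the ranking feeds a later step, such as choosing which run to keep, it can help to select the winner programmatically instead of reading it off the table. The sketch below works on a list of dicts shaped like `results`. `best_by` is a hypothetical helper, the scores are made-up placeholders rather than output from this notebook, and rows whose metric is `None` (for example a model without `predict_proba`) are skipped.

```python
def best_by(rows, metric):
    """Return the row with the highest value for `metric`, ignoring None."""
    scored = [row for row in rows if row.get(metric) is not None]
    if not scored:
        raise ValueError(f"no row has a value for {metric!r}")
    return max(scored, key=lambda row: row[metric])


# Placeholder scores for illustration only, not real results
example_results = [
    {"Model": "Model A", "F1 Score": 0.58, "AUC": 0.83},
    {"Model": "Model B", "F1 Score": 0.61, "AUC": None},
    {"Model": "Model C", "F1 Score": 0.55, "AUC": 0.85},
]

print(best_by(example_results, "F1 Score")["Model"])
print(best_by(example_results, "AUC")["Model"])
```

With the notebook's own data the call would be `best_by(results, "F1 Score")`.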

In [None]:
# Plot model comparison
plt.figure(figsize=(10, 6))
sns.barplot(x="F1 Score", y="Model", data=results_df, palette="viridis")
plt.title("Model Comparison - F1 Score")
plt.show()