# 🤖 Customer Churn Prediction: Model Building & Training

**Objective**: Build and train machine learning models to predict customer churn with high accuracy and business impact.

**Dataset**: Engineered features from 7,043 customer records (40 features)

**Models Trained**:
- Logistic Regression (baseline)
- Random Forest (ensemble)
- Gradient Boosting (ensemble)

**Approach**:
1. Load preprocessed and engineered features
2. Split into training and test sets
3. Scale features with StandardScaler
4. Handle class imbalance by applying SMOTE to the training set
5. Train multiple models
6. Evaluate and compare performance
7. Select best model for deployment
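
The order of these steps matters for avoiding leakage: the training log later in this notebook shows the split happening first, then scaling, then SMOTE (5634 -> 8278 samples). A minimal sketch of the split-then-scale part on synthetic data follows. The 80/20 stratified split and `random_state=42` are assumptions, since the internals of `ChurnModelTrainer` are not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (roughly 27% positives), not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=42
)

# 1. Split first, stratifying so both sets keep the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Fit the scaler on the training rows only, then apply it to both sets.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# 3. Any resampling (e.g. SMOTE) would be applied to X_tr_scaled / y_tr only,
#    leaving the test set at its natural class balance.
print(X_tr_scaled.shape, X_te_scaled.shape)
print(f"train positive rate: {y_tr.mean():.3f}, test positive rate: {y_te.mean():.3f}")
```

Fitting the scaler and the oversampler on training rows only keeps the test-set metrics an honest estimate of performance on unseen customers.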

**Success Metrics**: AUC-ROC, Precision, Recall, F1-Score
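
For reference, the four metrics can be computed with `sklearn.metrics` as sketched below. The labels and scores are a made-up toy example, not results from this project, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy ground truth (1 = churned) and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7, 0.3, 0.2])

# Precision, recall and F1 need hard labels, so apply a threshold (0.5 here).
y_hat = (y_prob >= 0.5).astype(int)

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)         # computed from scores, not hard labels

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

AUC-ROC is threshold-independent, whereas precision, recall and F1 all change when the threshold moves, which is worth keeping in mind when comparing models on an imbalanced target such as churn.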


## Import Libraries

We import standard ML libraries, the imbalanced-learn SMOTE oversampler, and our custom training and evaluation modules (`src.model_training`, `src.model_evaluation`).


In [1]:
# Core Data Science Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)

# Handle Class Imbalance
from imblearn.over_sampling import SMOTE

# Model Persistence
import joblib
from pathlib import Path
import sys
import os
import logging

# Suppress warnings
warnings.filterwarnings('ignore')

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s: %(message)s'
)

# ====== IMPORT CUSTOM MODULES ======
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.model_training import ChurnModelTrainer
from src.model_evaluation import evaluate_model, compare_models

# Set random seed
np.random.seed(42)

# Visualization settings
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10

print("\n✅ All libraries imported successfully!")


✅ All libraries imported successfully!


## Load Preprocessed & Engineered Features

We load the feature-engineered dataset created in the previous notebook.


In [2]:
# Load preprocessed data
df = pd.read_csv('../data/processed/customer_churn_engineered.csv')

print(f"✓ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nFirst 5 rows:")
display(df.head())

# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

print(f"\n✓ Features (X): {X.shape}")
print(f"✓ Target (y): {y.shape}")
print(f"✓ Churn rate: {y.mean():.2%}")

✓ Dataset loaded: 7,043 rows × 7083 columns

First 5 rows:


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,tenure_group,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn
0,1,1,1,1,38.0,1,0,101.15,1398.6,2-4 years,...,False,True,False,True,False,False,False,True,False,0
1,1,0,0,0,1.0,1,0,70.4,50.65,0-1 year,...,False,False,False,False,False,False,False,False,True,1
2,1,1,0,0,45.0,1,1,97.05,4385.05,2-4 years,...,False,True,False,False,False,False,False,True,False,0
3,1,0,0,0,18.0,1,0,20.1,401.85,1-2 years,...,True,False,True,False,True,False,False,False,True,0
4,0,0,1,1,2.0,1,0,70.4,1398.6,0-1 year,...,False,False,False,False,False,False,False,True,False,1



✓ Features (X): (7043, 7082)
✓ Target (y): (7043,)
✓ Churn rate: 26.54%
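
The loaded frame has 7,083 columns, far more than the 40 features stated in the overview. One possible cause (an assumption, the source does not confirm it) is that a near-unique identifier column was one-hot encoded upstream, which creates one dummy column per customer. The sketch below shows a quick diagnostic on a toy frame; `customerID` and the helper `dummy_prefix_counts` are hypothetical names used for illustration.

```python
import pandas as pd

def dummy_prefix_counts(columns, sep="_"):
    """Count columns per prefix (text before the first `sep`) to spot exploded one-hot groups."""
    prefixes = pd.Series([str(c).split(sep, 1)[0] for c in columns])
    return prefixes.value_counts()

# Toy frame: an identifier column with one distinct value per row.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "Contract": ["One year", "Two year", "One year", "Month-to-month"],
    "tenure": [1, 24, 12, 3],
})

# One-hot encoding the identifier yields one dummy column per row.
encoded = pd.get_dummies(toy, columns=["customerID", "Contract"])
counts = dummy_prefix_counts(encoded.columns)
print(counts.head())
```

On the real data, `dummy_prefix_counts(df.columns).head(10)` would list the prefixes that contribute the most columns. The prefix split is crude (for example, `tenure_group` is counted under `tenure`), so treat the result as a pointer rather than a diagnosis.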


## Encode Categorical Features

Convert any remaining categorical variables to numerical format using Label Encoding.


In [3]:
# Encode categorical variables, keeping one fitted encoder per column
# so the same mapping can be reapplied at inference time
categorical_cols = X.select_dtypes(include=['object']).columns
label_encoders = {}

if len(categorical_cols) > 0:
    print(f"Encoding {len(categorical_cols)} categorical columns:")
    for col in categorical_cols:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
        label_encoders[col] = le
        print(f"  ✓ {col}")
else:
    print("✓ No categorical columns to encode")

print("\n✓ All features are now numerical")
display(X.head())


Encoding 2 categorical columns:
  ✓ tenure_group
  ✓ charge_category

✓ All features are now numerical


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,tenure_group,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,1,1,1,1,38.0,1,0,101.15,1398.6,2,...,False,False,True,False,True,False,False,False,True,False
1,1,0,0,0,1.0,1,0,70.4,50.65,0,...,False,False,False,False,False,False,False,False,False,True
2,1,1,0,0,45.0,1,1,97.05,4385.05,2,...,False,False,True,False,False,False,False,False,True,False
3,1,0,0,0,18.0,1,0,20.1,401.85,1,...,False,True,False,True,False,True,False,False,False,True
4,0,0,1,1,2.0,1,0,70.4,1398.6,0,...,False,False,False,False,False,False,False,False,True,False
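
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `2-4 years` became `2` and `0-1 year` became `0` in the output above. A small sketch of that behaviour follows, using only the three `tenure_group` values visible in the preview (the full column may contain more categories) and keeping the fitted encoder so the mapping can be inverted later.

```python
from sklearn.preprocessing import LabelEncoder

# The five tenure_group values shown in the preview rows.
values = ["2-4 years", "0-1 year", "2-4 years", "1-2 years", "0-1 year"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

print(list(encoder.classes_))                      # classes in sorted order
print(codes.tolist())                              # integer code per input value
print(encoder.inverse_transform([2, 0]).tolist())  # codes mapped back to labels
```

The codes follow string sort order, which happens to match the natural tenure order here because the labels start with digits. For categories without such an ordering, a tree model can cope with arbitrary codes, but a linear model such as Logistic Regression will read them as an ordinal scale.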


## Initialize Machine Learning Models

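We train and compare three classification algorithms (Random Forest, Logistic Regression, and Gradient Boosting) through the custom `ChurnModelTrainer`. Passing `balance_data=True` enables SMOTE oversampling, as the training log confirms.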


In [4]:
print("=" * 70)
print("MODEL TRAINING")
print("=" * 70)

trainer = ChurnModelTrainer()

X_train, X_test, y_train, y_test = trainer.train_all_models(
    X,
    y,
    balance_data=True
)

print("\n✅ Models trained successfully:")
for name in trainer.models:
    print(f" - {name}")

2026-01-03 17:13:53,525 - INFO: Training all models...
2026-01-03 17:13:53,526 - INFO: Splitting data into train and test sets...
2026-01-03 17:13:53,595 - INFO: Applying StandardScaler...


MODEL TRAINING


2026-01-03 17:13:57,722 - INFO: Applied SMOTE: 5634 -> 8278 samples
2026-01-03 17:13:57,748 - INFO: Training Random Forest...
2026-01-03 17:13:59,800 - INFO: ✅ Random Forest trained
2026-01-03 17:13:59,801 - INFO: Training Logistic Regression...
2026-01-03 17:14:00,352 - INFO: ✅ Logistic Regression trained
2026-01-03 17:14:00,352 - INFO: Training Gradient Boosting...
2026-01-03 17:17:19,647 - INFO: ✅ Gradient Boosting trained
2026-01-03 17:17:19,647 - INFO: ✅ Trained 3 models



✅ Models trained successfully:
 - Random Forest
 - Logistic Regression
 - Gradient Boosting
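
The log line `Applied SMOTE: 5634 -> 8278 samples` can be sanity-checked by hand. Assuming SMOTE balanced the two classes 1:1 (the default `sampling_strategy` in imbalanced-learn; the trainer's actual setting is not shown), each class ends up with 8278 / 2 = 4139 rows. The training set therefore held 4139 non-churners and 5634 - 4139 = 1495 churners (about 26.5%, in line with the overall churn rate), and 2644 synthetic churners were added. The snippet below checks that arithmetic and illustrates the interpolation idea behind SMOTE with a deliberately simplified NumPy function. `smote_like` is a hypothetical helper for illustration, not imbalanced-learn's implementation.

```python
import numpy as np

# Arithmetic from the training log, assuming a 1:1 balance after SMOTE.
n_before, n_after = 5634, 8278
n_majority = n_after // 2
n_minority = n_before - n_majority
n_synthetic = n_after - n_before
print(n_majority, n_minority, n_synthetic)

def smote_like(X_min, n_new, k=5, seed=42):
    """Simplified SMOTE-style sampler: interpolate between a minority point
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base point and one of its neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, SMOTE adds no information beyond the existing churners. It only rebalances what the classifier sees during training.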


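## Save Trained Models

We persist each trained model, the best model by test-set ROC-AUC, and the fitted scaler to `../models`.


In [5]: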
print("=" * 70)
print("SAVING MODELS")
print("=" * 70)

from pathlib import Path
import joblib
from sklearn.metrics import roc_auc_score

# --------------------------------------------------
# Create models directory
# --------------------------------------------------
models_dir = Path("../models")
models_dir.mkdir(parents=True, exist_ok=True)

# --------------------------------------------------
# Save all trained models
# --------------------------------------------------
saved_count = 0
model_scores = {}

for model_name, model in trainer.models.items():
    file_name = model_name.lower().replace(" ", "_") + ".pkl"
    file_path = models_dir / file_name

    try:
        joblib.dump(model, file_path, compress=3)
        file_size = file_path.stat().st_size / 1024
        saved_count += 1
        print(f"{model_name:25s} {file_name:30s} {file_size:7.2f} KB")

        # Evaluate model
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        roc_score = roc_auc_score(y_test, y_pred_proba)
        model_scores[model_name] = roc_score

    except Exception as e:
        print(f"{model_name:25s} FAILED: {str(e)}")

print("=" * 70)
print(f"✓ Successfully saved {saved_count}/{len(trainer.models)} models")

# --------------------------------------------------
# Identify & save the best model (highest ROC-AUC on the test set)
# --------------------------------------------------
if not model_scores:
    raise RuntimeError("No model was saved and scored; cannot select a best model")

best_model_name = max(model_scores, key=model_scores.get)
best_model = trainer.models[best_model_name]

best_model_path = models_dir / "best_model.pkl"
joblib.dump(best_model, best_model_path, compress=3)

print(f"\n✓ Best model identified: {best_model_name}")
print("✓ Best model saved as: best_model.pkl")
print(f"✓ ROC-AUC Score: {model_scores[best_model_name]:.4f}")

# --------------------------------------------------
# Save scaler (VERY IMPORTANT for inference)
# --------------------------------------------------
if hasattr(trainer, "scaler"):
    scaler_path = models_dir / "scaler.pkl"
    joblib.dump(trainer.scaler, scaler_path)
    scaler_size = scaler_path.stat().st_size / 1024
    print(f"✓ Scaler saved: scaler.pkl ({scaler_size:.2f} KB)")

print(f"\n✓ All artifacts saved to: {models_dir.resolve()}")


SAVING MODELS
Random Forest             random_forest.pkl               220.49 KB
Logistic Regression       logistic_regression.pkl          45.09 KB
Gradient Boosting         gradient_boosting.pkl            93.81 KB
✓ Successfully saved 3/3 models

✓ Best model identified: Gradient Boosting
✓ Best model saved as: best_model.pkl
✓ ROC-AUC Score: 0.8597
✓ Scaler saved: scaler.pkl (332.74 KB)

✓ All artifacts saved to: D:\Labmentix\customer-churn-analysis\models
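
At inference time the saved scaler has to be applied before the model sees new rows, with the same columns in the same order as during training. Below is a minimal round-trip sketch using a temporary directory and a toy model. The file names mirror the ones above, but the data and the `LogisticRegression` stand-in are illustrative assumptions, and it presumes the models were trained on scaled features, as the training log suggests.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the engineered churn features.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Train on scaled features and persist both artifacts together.
toy_scaler = StandardScaler().fit(X_toy)
toy_model = LogisticRegression().fit(toy_scaler.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp)
    joblib.dump(toy_model, artifacts / "best_model.pkl", compress=3)
    joblib.dump(toy_scaler, artifacts / "scaler.pkl")

    # Inference: load both, scale the new rows, then predict probabilities.
    loaded_model = joblib.load(artifacts / "best_model.pkl")
    loaded_scaler = joblib.load(artifacts / "scaler.pkl")
    X_new = rng.normal(size=(3, 4))
    churn_proba = loaded_model.predict_proba(loaded_scaler.transform(X_new))[:, 1]

print(churn_proba.round(3))
```

Skipping the scaler at inference would feed the model values on a different scale from the one it was trained on, which silently degrades predictions rather than raising an error.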
