# Customer Churn Project by Ethan Huffman

## Business Understanding

Customer churn represents a major cost for telecommunications companies because retaining existing customers is significantly less expensive than acquiring new ones. The objective of this project is to develop a predictive model that identifies customers who are most likely to churn so the business can intervene before the customer leaves. From a business standpoint, failing to identify a customer who is going to churn is more costly than incorrectly flagging a customer who would have stayed, which makes recall for churned customers the most important evaluation metric. As a result, the modeling approach in this project prioritizes capturing as many churners as possible, even when this comes at the expense of precision or overall accuracy.

## Data Understanding

The dataset used in this project combines two telecommunications churn datasets to create a larger and more informative sample of customer behavior. The features describe customer tenure, service plans, billing amounts, usage patterns, and customer service interactions, all of which are commonly associated with churn in the telecom industry. The target variable is churn, represented as a binary indicator showing whether a customer discontinued service. A key characteristic of the data is class imbalance, with non churners making up a larger portion of the dataset, which directly influences modeling decisions and evaluation metrics. This imbalance reinforces the need to focus on recall and churn detection rather than raw predictive accuracy.

## Data Preparation 

## Data Preparation

The data preparation process begins by loading and inspecting two separate telecommunications churn datasets and aligning them into a consistent structure. Because the datasets originate from different sources, column names, data types, and target variable formats must be standardized before they can be combined. This includes converting churn indicators into a consistent binary format, resolving differences in categorical feature naming, and ensuring numeric fields are properly cast for modeling. Combining the datasets increases the total number of observations, which helps improve model stability and provides a more realistic representation of customer behavior.

Once the datasets are merged, missing values are handled explicitly to ensure downstream models can train successfully. Numeric features are inspected for invalid or empty values and converted to proper numeric types where necessary, while categorical features are preserved for later encoding. No information is dropped prematurely, as retaining as much signal as possible is important for churn detection. At this stage, the focus is on preparing clean, model ready inputs rather than optimizing performance.

Next, features are separated into numeric and categorical groups so that appropriate preprocessing steps can be applied to each. Numeric features are scaled to normalize their ranges, while categorical features are encoded to convert them into a format suitable for machine learning models. These preprocessing steps are implemented using pipelines to ensure consistency, prevent data leakage, and maintain reproducibility. Structuring preprocessing this way allows the same transformations to be applied seamlessly during training and evaluation.

Finally, the prepared dataset is split into training and testing sets to allow for model evaluation. The split is performed after all structural cleaning but before any modeling so that the test data remains unseen during training. At this point, the data is fully prepared and ready to be passed into the baseline model, establishing a reference point for evaluating future improvements. This preparation process ensures that any changes in model performance can be attributed to modeling decisions rather than data inconsistencies.

In [26]:
# Imports

import pandas as pd
import numpy as np
import sqlite3

from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.impute import SimpleImputer

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline


This cell imports all required libraries used throughout the notebook, including data manipulation, visualization, modeling, and evaluation tools. All imports are centralized in one place to improve readability and reproducibility. Keeping imports in a single cell makes it easy to audit dependencies and rerun the notebook on a new machine.

In [27]:
# Load datasets

df_ibm = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df_bigml = pd.read_csv("../data/raw/churn_bigml.xlsx")


This cell loads the IBM Telco Customer Churn dataset from disk into a pandas DataFrame. The dataset provides detailed customer demographic, service, and billing information along with a churn indicator. Loading the data early allows for inspection and alignment with other datasets.

In [28]:
# Clean churn targets

df_ibm["Churn"] = df_ibm["Churn"].map({"Yes": 1, "No": 0})
df_bigml["Churn"] = df_bigml["Churn"].astype(int)


In [29]:
# Feature engineering for BigML dataset

df_bigml["total_usage_minutes"] = (
    df_bigml["Total day minutes"]
    + df_bigml["Total eve minutes"]
    + df_bigml["Total night minutes"]
    + df_bigml["Total intl minutes"]
)

df_bigml["total_usage_calls"] = (
    df_bigml["Total day calls"]
    + df_bigml["Total eve calls"]
    + df_bigml["Total night calls"]
    + df_bigml["Total intl calls"]
)

df_bigml = df_bigml[
    [
        "International plan",
        "Voice mail plan",
        "Customer service calls",
        "total_usage_minutes",
        "total_usage_calls",
        "Churn",
    ]
]

df_bigml.columns = [
    "international_plan",
    "voice_mail_plan",
    "customer_service_calls",
    "total_usage_minutes",
    "total_usage_calls",
    "churn",
]


In [30]:
# Feature selection for IBM dataset

df_ibm = df_ibm[
    [
        "tenure",
        "MonthlyCharges",
        "TotalCharges",
        "Churn",
    ]
]

df_ibm.columns = [
    "tenure",
    "monthly_charges",
    "total_charges",
    "churn",
]

df_ibm["total_charges"] = pd.to_numeric(df_ibm["total_charges"], errors="coerce")


This cell standardizes the churn column across both datasets into a consistent binary format. Different datasets encode churn differently, so alignment is required before modeling. This ensures churn has the same semantic meaning across all records.

In [31]:
# Combine datasets

df_combined = pd.concat(
    [df_ibm, df_bigml],
    axis=0,
    ignore_index=True
)


This cell selects and renames relevant features so that both datasets share a compatible schema. Only features that can be meaningfully aligned are retained. This step allows the datasets to be safely combined without introducing ambiguity.

In [32]:
# Split features and target

X = df_combined.drop("churn", axis=1)
y = df_combined["churn"]


In [33]:
# Train/test split with stratification

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


In [34]:
# Identify numeric and categorical features

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()


In [35]:
# Preprocessing pipelines

numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features)
    ]
)


In [36]:
# Helper function to evaluate churn performance

def evaluate_churn(y_true, y_pred):
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))


## Modeling

Multiple models are evaluated to determine which approach best identifies customers who are likely to churn. A baseline logistic regression model is used first to establish a reference point and demonstrate how standard classification techniques struggle with churn detection on imbalanced data. From there, more advanced models are introduced that incorporate class weighting, resampling, and probability threshold adjustments to better align predictions with the business objective. All models are trained using pipelines to ensure consistent preprocessing and fair comparison.
Model performance is evaluated primarily using recall for the churn class, with precision considered as a secondary metric to manage intervention costs. As model complexity increases, recall improves substantially, confirming that optimizing for churn detection requires deliberate tradeoffs rather than default settings. Threshold tuning is used to further improve churn capture by adjusting the decision boundary based on predicted probabilities. The final selected model reflects the best balance between high churn recall and acceptable precision while remaining interpretable and reproducible.

In [37]:
# Baseline churn model using Logistic Regression and SMOTE

baseline_model = ImbPipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("smote", SMOTE(random_state=42)),
        ("classifier", LogisticRegression(
            max_iter=1000,
            class_weight="balanced"
        ))
    ]
)

baseline_model.fit(X_train, y_train)

y_pred_baseline = baseline_model.predict(X_test)

evaluate_churn(y_test, y_pred_baseline)


              precision    recall  f1-score   support

           0       0.89      0.71      0.79      1437
           1       0.46      0.74      0.57       491

    accuracy                           0.72      1928
   macro avg       0.68      0.72      0.68      1928
weighted avg       0.78      0.72      0.73      1928

[[1017  420]
 [ 128  363]]




In [38]:
# Gradient Boosting model with SMOTE

gb_model = ImbPipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("smote", SMOTE(random_state=42)),
        ("classifier", GradientBoostingClassifier(random_state=42))
    ]
)

gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)

evaluate_churn(y_test, y_pred_gb)




              precision    recall  f1-score   support

           0       0.88      0.76      0.81      1437
           1       0.49      0.70      0.58       491

    accuracy                           0.74      1928
   macro avg       0.69      0.73      0.70      1928
weighted avg       0.78      0.74      0.75      1928

[[1088  349]
 [ 149  342]]


In [39]:
# Function to evaluate multiple probability thresholds

def threshold_evaluation(model, X_test, y_test, thresholds):
    probs = model.predict_proba(X_test)[:, 1]
    for t in thresholds:
        preds = (probs >= t).astype(int)
        print(f"\nThreshold: {t}")
        print(classification_report(y_test, preds))


In [40]:
# Threshold sweep for baseline model

thresholds = [0.5, 0.4, 0.3, 0.25, 0.2]
threshold_evaluation(baseline_model, X_test, y_test, thresholds)



Threshold: 0.5
              precision    recall  f1-score   support

           0       0.89      0.71      0.79      1437
           1       0.46      0.74      0.57       491

    accuracy                           0.72      1928
   macro avg       0.68      0.72      0.68      1928
weighted avg       0.78      0.72      0.73      1928


Threshold: 0.4
              precision    recall  f1-score   support

           0       0.92      0.60      0.72      1437
           1       0.42      0.84      0.56       491

    accuracy                           0.66      1928
   macro avg       0.67      0.72      0.64      1928
weighted avg       0.79      0.66      0.68      1928


Threshold: 0.3
              precision    recall  f1-score   support

           0       0.94      0.46      0.62      1437
           1       0.37      0.92      0.52       491

    accuracy                           0.57      1928
   macro avg       0.65      0.69      0.57      1928
weighted avg       0.79   

In [41]:
# Recall-constrained Gradient Boosting model

gb_tuned = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", GradientBoostingClassifier(
            n_estimators=200,
            learning_rate=0.05,
            max_depth=3,
            random_state=42
        ))
    ]
)

gb_tuned.fit(X_train, y_train)

y_probs = gb_tuned.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)

recall_floor = 0.85
valid_idxs = np.where(recalls >= recall_floor)[0]

best_idx = valid_idxs[np.argmax(precisions[valid_idxs])]
best_threshold = thresholds[best_idx]

print("Selected Threshold:", best_threshold)

y_final = (y_probs >= best_threshold).astype(int)

print(classification_report(y_test, y_final))
print(confusion_matrix(y_test, y_final))


Selected Threshold: 0.17081091206089097
              precision    recall  f1-score   support

           0       0.92      0.62      0.74      1437
           1       0.43      0.85      0.58       491

    accuracy                           0.68      1928
   macro avg       0.68      0.74      0.66      1928
weighted avg       0.80      0.68      0.70      1928

[[894 543]
 [ 73 418]]


In [42]:
# Lock final model and threshold

FINAL_MODEL = gb_tuned
FINAL_THRESHOLD = best_threshold

final_probs = FINAL_MODEL.predict_proba(X_test)[:, 1]
final_preds = (final_probs >= FINAL_THRESHOLD).astype(int)

print("FINAL MODEL PERFORMANCE")
print("Threshold:", FINAL_THRESHOLD)
print(classification_report(y_test, final_preds))
print(confusion_matrix(y_test, final_preds))


FINAL MODEL PERFORMANCE
Threshold: 0.17081091206089097
              precision    recall  f1-score   support

           0       0.92      0.62      0.74      1437
           1       0.43      0.85      0.58       491

    accuracy                           0.68      1928
   macro avg       0.68      0.74      0.66      1928
weighted avg       0.80      0.68      0.70      1928

[[894 543]
 [ 73 418]]


## Evaluation of Final Model

The final model is selected based on its ability to identify churners with a high level of recall while maintaining reasonable precision. By applying class balancing techniques and adjusting the probability threshold, the model captures a significantly larger portion of customers who are likely to churn compared to earlier approaches. Although this results in lower overall accuracy, the tradeoff is intentional and aligned with the business goal of preventing customer loss. The evaluation metrics confirm that the model prioritizes churn detection rather than optimizing for non churn predictions.
Threshold tuning plays a critical role in the final modelâ€™s performance by shifting the decision boundary to favor identifying potential churners. Lowering the threshold increases recall for churn, ensuring fewer at risk customers are missed. While this introduces more false positives, the business impact of incorrectly flagging a stable customer is far less severe than failing to identify a churner. This makes the final model suitable for proactive customer retention strategies.

In [43]:
# Store final evaluation outputs

final_report = classification_report(y_test, final_preds, output_dict=True)
final_confusion = confusion_matrix(y_test, final_preds)

final_report, final_confusion


({'0': {'precision': 0.9245087900723888,
   'recall': 0.6221294363256785,
   'f1-score': 0.7437603993344426,
   'support': 1437.0},
  '1': {'precision': 0.43496357960457854,
   'recall': 0.8513238289205702,
   'f1-score': 0.5757575757575758,
   'support': 491.0},
  'accuracy': 0.6804979253112033,
  'macro avg': {'precision': 0.6797361848384837,
   'recall': 0.7367266326231243,
   'f1-score': 0.6597589875460093,
   'support': 1928.0},
  'weighted avg': {'precision': 0.7998372660372775,
   'recall': 0.6804979253112033,
   'f1-score': 0.7009754478944832,
   'support': 1928.0}},
 array([[894, 543],
        [ 73, 418]]))

In [44]:
# Churn capture sanity check

churn_captured = final_confusion[1, 1]
churn_missed = final_confusion[1, 0]
total_churn = churn_captured + churn_missed

print(f"Churners captured: {churn_captured}")
print(f"Churners missed: {churn_missed}")
print(f"Churn capture rate: {churn_captured / total_churn:.3f}")


Churners captured: 418
Churners missed: 73
Churn capture rate: 0.851


In [45]:
# Create churn action table for business use

churn_action_table = X_test.copy()
churn_action_table["churn_probability"] = final_probs
churn_action_table["predicted_churn"] = final_preds

churn_action_table = churn_action_table[churn_action_table["predicted_churn"] == 1]

churn_action_table.head()


Unnamed: 0,tenure,monthly_charges,total_charges,international_plan,voice_mail_plan,customer_service_calls,total_usage_minutes,total_usage_calls,churn_probability,predicted_churn
6844,29.0,89.65,2623.65,,,,,,0.326971,1
3398,61.0,100.7,6018.65,,,,,,0.209655,1
419,1.0,75.3,75.3,,,,,,0.821473,1
5048,54.0,99.1,5437.1,,,,,,0.237096,1
1103,54.0,105.2,5637.85,,,,,,0.271958,1


In [46]:
# Save predicted churners to SQLite database

conn = sqlite3.connect("../reports/churn_predictions.db")

churn_action_table.to_sql(
    "predicted_churners",
    conn,
    if_exists="replace",
    index=False
)

conn.close()


In [47]:
# Read back from database to verify save

conn = sqlite3.connect("../reports/churn_predictions.db")

pd.read_sql(
    "SELECT * FROM predicted_churners LIMIT 5;",
    conn
)

conn.close()


In [48]:
import joblib

joblib.dump(FINAL_MODEL, "../models/final_churn_model.joblib")
joblib.dump(FINAL_THRESHOLD, "../models/final_threshold.joblib")


['../models/final_threshold.joblib']

In [49]:
import os
os.listdir("../models")


['final_threshold.joblib', 'final_churn_model.joblib']

In [52]:
import joblib
import os

os.makedirs("../models", exist_ok=True)

joblib.dump(FINAL_MODEL, "../models/final_churn_model.joblib")
joblib.dump(FINAL_THRESHOLD, "../models/final_threshold.joblib")


['../models/final_threshold.joblib']

## Conclusion