# Telecom Churn Prediction

by Ethan Huffman

# Business Understanding

Customer churn represents a significant financial risk for telecommunications companies, as acquiring new customers is substantially more expensive than retaining existing ones. This project addresses the business problem of identifying customers who are most likely to churn so that proactive retention strategies can be deployed before customers leave. By leveraging historical customer data related to tenure, billing behavior, service usage, and customer service interactions, the model helps surface early warning signs of churn. The primary stakeholder for this analysis is a telecommunications provider seeking to reduce churn-driven revenue loss through targeted intervention. The ultimate goal is not perfect accuracy, but maximizing the identification of high-risk customers so retention resources can be allocated effectively.

# Data Understanding

This project uses two publicly available telecommunications datasets that capture complementary aspects of customer behavior related to churn. The IBM Telco Customer Churn dataset provides customer-level information such as tenure, monthly charges, total charges, and churn outcomes, which reflect long-term relationship and billing patterns. The BigML Telecommunications dataset contributes detailed usage and service interaction data, including call activity and customer service calls, which capture short-term behavioral signals. Combining these datasets increases both the volume and diversity of churn-related information, allowing the model to learn from contractual, financial, and behavioral perspectives. While the datasets do not include personally identifiable information, they provide sufficient operational signals to support meaningful churn prediction and business decision-making.

# Data Preparation

In [1]:
# Import core data manipulation libraries
import pandas as pd
import numpy as np
import joblib
import os

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import modeling utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.impute import SimpleImputer

# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier


# Import imbalance handling
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline


This cell imports all core libraries required for data manipulation, visualization, modeling, and deployment throughout the project. Pandas and NumPy are used for data loading, cleaning, and numerical operations, while Matplotlib and Seaborn support exploratory analysis and visual inspection. Scikit-learn utilities are imported to handle data splitting, preprocessing, model training, and evaluation in a structured and reproducible way. The selected models include Logistic Regression as a baseline and Gradient Boosting as a more advanced, non-linear approach. Finally, imbalanced-learn tools are included to address class imbalance in churn prediction, which is critical for improving recall on churners.

In [2]:
# Load IBM Telco Customer Churn dataset
df_ibm = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Load BigML Telecommunications Churn dataset (CSV-formatted with .xlsx extension)
df_bigml = pd.read_csv("../data/raw/churn_bigml.xlsx")


This cell loads the two datasets used in the project from the local repository structure. The IBM Telco dataset provides detailed customer-level information such as tenure, contract type, billing behavior, and churn outcomes. The BigML telecommunications dataset adds complementary usage-based features, including call activity and customer service interactions. Together, these datasets allow the project to combine behavioral, contractual, and usage signals that are relevant to churn prediction. Loading both datasets at the start ensures all downstream preparation and modeling steps are reproducible and consistent across environments.

In [3]:
# Standardize churn target column in IBM dataset
df_ibm["Churn"] = df_ibm["Churn"].map({"Yes": 1, "No": 0})

# Convert churn target to integer in BigML dataset
df_bigml["Churn"] = df_bigml["Churn"].astype(int)


This cell standardizes the churn target variable across both datasets so they can be used together consistently in modeling. In the IBM dataset, churn is originally stored as categorical text values, which are mapped to binary numeric labels for machine learning compatibility. In the BigML dataset, the churn column is already binary but is explicitly cast to an integer type to ensure consistency. Aligning the target variable across datasets is a critical preparation step before merging or comparing results. This ensures that churn predictions are interpreted the same way regardless of the data source.
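
One caveat with `Series.map` is that any label missing from the mapping dictionary silently becomes `NaN` instead of raising an error. A minimal sketch of a guard, using a toy Series in place of the real column (the `"yes "` entry is a made-up example of a stray label, not something observed in the dataset):

```python
import pandas as pd

# Toy stand-in for the IBM "Churn" column; the last value is a hypothetical stray label
raw = pd.Series(["Yes", "No", "No", "yes "])
mapping = {"Yes": 1, "No": 0}

# Mapping directly leaves NaN wherever a label is not in the dictionary
naive = raw.map(mapping)
naive_unmapped = int(naive.isna().sum())

# Normalizing whitespace and case first recovers near-duplicate labels
cleaned = raw.str.strip().str.title()
mapped = cleaned.map(mapping)
unmapped = int(mapped.isna().sum())

print(naive_unmapped, unmapped, mapped.tolist())
```

Checking that the unmapped count is zero before casting to integer would surface unexpected labels early.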

In [4]:
# Create aggregated usage features from call data
df_bigml["total_usage_minutes"] = (
    df_bigml["Total day minutes"] +
    df_bigml["Total eve minutes"] +
    df_bigml["Total night minutes"] +
    df_bigml["Total intl minutes"]
)

df_bigml["total_usage_calls"] = (
    df_bigml["Total day calls"] +
    df_bigml["Total eve calls"] +
    df_bigml["Total night calls"] +
    df_bigml["Total intl calls"]
)


This cell engineers new aggregated usage features from the raw call-level data in the BigML dataset. Total usage minutes are calculated by summing call durations across day, evening, night, and international periods. Similarly, total usage calls are created by summing call counts across the same time segments. These engineered features provide a more holistic view of overall customer activity than individual time-based metrics. Consolidating usage in this way simplifies the feature space while preserving behavior signals that are strongly associated with churn risk.

In [5]:
# Select relevant columns from BigML dataset
df_bigml_selected = df_bigml[
    [
        "International plan",
        "Voice mail plan",
        "Customer service calls",
        "total_usage_minutes",
        "total_usage_calls",
        "Churn"
    ]
].copy()

# Rename columns to align with IBM dataset naming
df_bigml_selected.columns = [
    "international_plan",
    "voice_mail_plan",
    "customer_service_calls",
    "total_usage_minutes",
    "total_usage_calls",
    "churn"
]


This cell selects a focused subset of features from the BigML dataset that are most relevant to churn prediction. The chosen columns emphasize customer plans, service interactions, and overall usage behavior, all of which are known drivers of churn. The columns are then renamed to follow a consistent naming convention that aligns with the IBM dataset and improves readability. Standardizing feature names is especially important when combining datasets or building reusable pipelines. This step ensures downstream preprocessing and modeling logic can treat both data sources consistently.

In [6]:
# Select relevant columns from IBM dataset
df_ibm_selected = df_ibm[
    [
        "tenure",
        "MonthlyCharges",
        "TotalCharges",
        "Churn"
    ]
].copy()

# Rename columns for consistency
df_ibm_selected.columns = [
    "tenure",
    "monthly_charges",
    "total_charges",
    "churn"
]


This cell extracts the most relevant churn-related features from the IBM Telco dataset. The selected columns capture customer longevity, monthly billing behavior, and total revenue contribution, which are strong indicators of churn risk. Column names are then standardized to match the naming style used across the project. This consistency simplifies feature management when combining datasets and building preprocessing pipelines. Narrowing the IBM dataset to these core features also reduces noise and keeps the modeling focus aligned with the business problem.
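
In commonly distributed copies of the IBM CSV, `TotalCharges` contains a handful of blank strings, which makes pandas read the column as text rather than as a number. Whether that applies to the file used here is an assumption worth checking, because a text column would later be routed to the categorical branch of the preprocessor. A minimal sketch of the check and coercion on a toy frame (values are made up):

```python
import pandas as pd

# Toy stand-in for the selected IBM columns; the blank string mimics a missing total
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "total_charges": [" ", "350.5", "1200.0"],
})

# Inspect the dtype first: a non-numeric dtype signals that coercion is needed
print(toy["total_charges"].dtype)

# Coerce to numeric; entries that cannot be parsed become NaN for the imputer to fill
toy["total_charges"] = pd.to_numeric(toy["total_charges"], errors="coerce")
n_missing = int(toy["total_charges"].isna().sum())
print(toy["total_charges"].dtype, n_missing)
```

If the real column turns out to be numeric already, the coercion is a no-op and can be dropped.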

In [7]:
# Combine both datasets into a single feature-aligned dataset
df_combined = pd.concat(
    [df_ibm_selected, df_bigml_selected],
    axis=0,
    ignore_index=True
)


This cell combines the selected IBM and BigML datasets into a single, unified dataset for modeling. The concatenation is performed row-wise, so each source contributes its own customers as additional rows. Because the two selected feature sets share only the churn column, features that exist in one source are missing (NaN) for rows from the other source and are filled later by the imputers in the preprocessing pipeline. Combining the datasets increases the overall sample size and lets the model learn churn patterns from both contractual and usage-based perspectives. The resulting dataset serves as the foundation for all subsequent data preparation and modeling steps.
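
Row-wise concatenation keeps the union of columns, so a feature present in only one source is filled with `NaN` for the other source's rows. A toy illustration with made-up values and two of the project's column names:

```python
import pandas as pd

# Toy frames that share only the target column, like the two selected datasets
ibm_like = pd.DataFrame({"tenure": [1, 24], "churn": [1, 0]})
bigml_like = pd.DataFrame({"customer_service_calls": [4, 0], "churn": [1, 0]})

combined = pd.concat([ibm_like, bigml_like], axis=0, ignore_index=True)

# Count missing values per column: each source is NaN for the other's features
missing = combined.isna().sum()
print(combined)
print(missing)
```

Inspecting the per-column missing counts on the real combined frame is a quick way to see how much of each feature will be supplied by imputation.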

In [8]:
# Separate features and target variable
X = df_combined.drop("churn", axis=1)
y = df_combined["churn"]


This cell separates the input features from the target variable in preparation for modeling. The feature matrix contains all customer attributes that will be used to predict churn, while the target vector contains the churn labels. This separation is required by scikit-learn’s modeling API and ensures a clear distinction between inputs and outputs. Keeping the target out of the feature matrix also prevents the label itself from leaking into training. From this point forward, all preprocessing and modeling steps operate on these defined feature and target sets.

In [9]:
# Perform stratified train/test split to preserve churn distribution
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


This cell splits the dataset into training and testing subsets while preserving the original churn distribution. Stratification ensures that both the training and test sets contain a representative proportion of churners and non-churners. Maintaining this balance is especially important for churn prediction, where the target class is often imbalanced. A fixed random state is used to make the split reproducible across runs and environments. This split establishes a fair and consistent basis for evaluating model performance later in the notebook.
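
A quick way to confirm that stratification worked is to compare the positive rate in each split. The sketch below uses synthetic labels with roughly 25% positives (an illustrative assumption, not the project's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels; the 25% positive rate is illustrative only
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.25).astype(int)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification the positive rate should be nearly identical in both splits
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(round(train_rate, 3), round(test_rate, 3))
```

The same comparison can be run on `y_train` and `y_test` from the cell above.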

In [10]:
# Identify numeric and categorical columns for preprocessing
numeric_features = X_train.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X_train.select_dtypes(include=["object"]).columns.tolist()


This cell programmatically identifies which features are numeric and which are categorical within the training data. Separating features by data type allows different preprocessing steps to be applied appropriately to each group. Numeric features typically require scaling and imputation, while categorical features require encoding before modeling. Detecting these feature types dynamically makes the pipeline more flexible and less error-prone if the feature set changes. This step sets the foundation for building a clean and reusable preprocessing pipeline.

In [11]:
# Define numeric transformer: impute missing values, then scale
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

# Define categorical transformer: impute missing values, then one-hot encode
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)

# Combine both transformers into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)


This cell defines the preprocessing logic applied to the feature data before modeling. Numeric features are handled with a pipeline that imputes missing values using the median and then standardizes them to zero mean and unit variance. Categorical features are processed with a separate pipeline that fills missing values using the most frequent category and encodes them into numerical format using one-hot encoding. These two pipelines are combined into a single ColumnTransformer that applies the correct transformations to each feature type. This structured preprocessing approach ensures the model can handle missing data, mixed feature types, and unseen categories in a robust and reproducible way.
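
To see what the preprocessor produces, it can help to run the same structure on a tiny frame. The sketch below mirrors the transformers above on toy data (column values are made up, and the `toy_preprocessor` name is introduced only for this illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 30.0, 12.0],
    "international_plan": ["No", "Yes", np.nan, "No"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["tenure"]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["international_plan"]),
    ]
)

transformed = toy_preprocessor.fit_transform(toy)
# One scaled numeric column plus one indicator column per category ("No", "Yes")
print(transformed.shape)
```

The output width grows with the number of distinct categories, which is worth keeping in mind if a high-cardinality column ends up in the categorical branch.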

# Modeling

In [12]:
# Baseline churn model using Logistic Regression with SMOTE
baseline_model = ImbPipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("smote", SMOTE(random_state=42)),
        ("classifier", LogisticRegression(max_iter=1000))
    ]
)

# Train baseline model
baseline_model.fit(X_train, y_train)


This cell defines and trains the baseline churn prediction model using Logistic Regression. The model pipeline includes preprocessing to handle scaling, encoding, and missing values, followed by SMOTE to address class imbalance in the churn target. Logistic Regression is used as the baseline because it is interpretable and provides a strong reference point for more complex models. Training the model at this stage establishes an initial performance benchmark for churn detection. This baseline helps evaluate whether later modeling choices meaningfully improve the ability to identify churners.

In [13]:
# Evaluate baseline model performance
baseline_preds = baseline_model.predict(X_test)

print(classification_report(y_test, baseline_preds))
print(confusion_matrix(y_test, baseline_preds))


              precision    recall  f1-score   support

           0       0.87      0.78      0.82      1437
           1       0.50      0.65      0.57       491

    accuracy                           0.75      1928
   macro avg       0.68      0.71      0.69      1928
weighted avg       0.77      0.75      0.76      1928

[[1118  319]
 [ 172  319]]


This cell evaluates the performance of the baseline churn model on the held-out test dataset. Predictions are generated using the trained model and compared against the true churn labels. The classification report provides detailed metrics such as precision, recall, and F1-score, which are especially important for assessing churn detection performance. The confusion matrix offers a clear breakdown of true positives, false positives, and false negatives. Here the baseline identifies 319 of 491 churners (recall 0.65) at a precision of 0.50, which means 172 churners are missed and 319 non-churners are flagged. Together, these outputs show how well the baseline model identifies churners and where it falls short.

In [14]:
# Advanced churn model using Gradient Boosting with SMOTE
gb_model = ImbPipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("smote", SMOTE(random_state=42)),
        ("classifier", GradientBoostingClassifier(random_state=42))
    ]
)

# Train Gradient Boosting model
gb_model.fit(X_train, y_train)


This cell defines and trains a more advanced churn prediction model using Gradient Boosting. Gradient Boosting is capable of capturing complex, non-linear relationships in the data that simpler models may miss. As with the baseline model, preprocessing and SMOTE are included in the pipeline to ensure fair handling of feature scaling, encoding, missing values, and class imbalance. Training this model allows for a direct comparison against the baseline Logistic Regression approach. The goal of this step is to improve churn detection performance, particularly recall on churners, while maintaining reasonable precision.

In [15]:
# Helper function to evaluate churn performance at different thresholds
def evaluate_thresholds(model, X, y, thresholds):
    probs = model.predict_proba(X)[:, 1]
    for t in thresholds:
        preds = (probs >= t).astype(int)
        print(f"Threshold: {t}")
        print(classification_report(y, preds))


This cell defines a helper function used to evaluate model performance across different decision thresholds. Instead of relying on the default probability cutoff, the function converts predicted churn probabilities into binary predictions at various thresholds. This allows direct observation of how precision, recall, and overall performance change as the threshold is adjusted. Evaluating multiple thresholds is especially important in churn prediction, where recall on churners may be prioritized over overall accuracy. This function supports business-driven decision making by helping identify a threshold that best aligns with churn intervention goals.
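
When thresholds need to be compared side by side, it can be more convenient to collect the metrics than to print a full report for each cutoff. The following is a sketch of one such variant, not the notebook's helper: `threshold_table` is a hypothetical name, and the probabilities and labels are made up to exercise it.

```python
import numpy as np

def threshold_table(probs, y_true, thresholds):
    """Return positive-class precision and recall at each threshold."""
    probs = np.asarray(probs)
    y_true = np.asarray(y_true)
    rows = []
    for t in thresholds:
        preds = (probs >= t).astype(int)
        tp = int(((preds == 1) & (y_true == 1)).sum())
        fp = int(((preds == 1) & (y_true == 0)).sum())
        fn = int(((preds == 0) & (y_true == 1)).sum())
        # Guard against division by zero when nothing is predicted or present
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall})
    return rows

# Made-up probabilities and labels, only to exercise the function
demo_probs = [0.1, 0.4, 0.35, 0.8, 0.6, 0.55]
demo_labels = [0, 0, 1, 1, 0, 1]
table = threshold_table(demo_probs, demo_labels, [0.5, 0.3])
for row in table:
    print(row)
```

With a fitted model, `probs` would come from `model.predict_proba(X)[:, 1]`, exactly as in the helper above.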

In [16]:
# Evaluate multiple probability thresholds
thresholds = [0.5, 0.4, 0.3, 0.25, 0.2]
evaluate_thresholds(gb_model, X_test, y_test, thresholds)


Threshold: 0.5
              precision    recall  f1-score   support

           0       0.88      0.76      0.82      1437
           1       0.50      0.70      0.58       491

    accuracy                           0.75      1928
   macro avg       0.69      0.73      0.70      1928
weighted avg       0.78      0.75      0.76      1928

Threshold: 0.4
              precision    recall  f1-score   support

           0       0.92      0.65      0.76      1437
           1       0.45      0.82      0.58       491

    accuracy                           0.70      1928
   macro avg       0.68      0.74      0.67      1928
weighted avg       0.80      0.70      0.72      1928

Threshold: 0.3
              precision    recall  f1-score   support

           0       0.93      0.58      0.71      1437
           1       0.41      0.88      0.56       491

    accuracy                           0.65      1928
   macro avg       0.67      0.73      0.64      1928
weighted avg       0.80      

This cell applies the previously defined threshold evaluation function to the Gradient Boosting model using a range of probability cutoffs. Each threshold represents a different tradeoff between identifying more churners and limiting false positives. By printing performance metrics at multiple thresholds, this step makes the cost–benefit tradeoffs of churn intervention explicit. This analysis helps determine whether lowering the threshold meaningfully improves churn recall without making the model unusable for business operations. The output from this cell directly informs the selection of a final decision threshold aligned with business priorities.

In [17]:
# Set final model and selected business-aligned threshold
FINAL_MODEL = gb_model
FINAL_THRESHOLD = 0.17


This cell finalizes the model and decision threshold that will be used for deployment and downstream analysis. The Gradient Boosting model is selected as the final model based on its stronger performance relative to the baseline, particularly in identifying churners (recall of 0.70 versus 0.65 at the default 0.5 cutoff). The chosen threshold of 0.17 sits below the lowest cutoff in the evaluated list (0.2) and reflects a business-aligned decision to prioritize recall, even at the cost of additional false positives. This lower threshold allows the company to capture more at-risk customers for proactive intervention. Locking in both the model and threshold ensures consistency across evaluation, reporting, and application use.

In [18]:
# Generate final predictions using selected threshold
final_probs = FINAL_MODEL.predict_proba(X_test)[:, 1]
final_preds = (final_probs >= FINAL_THRESHOLD).astype(int)

print(classification_report(y_test, final_preds))
print(confusion_matrix(y_test, final_preds))


              precision    recall  f1-score   support

           0       0.95      0.34      0.50      1437
           1       0.33      0.95      0.49       491

    accuracy                           0.50      1928
   macro avg       0.64      0.65      0.50      1928
weighted avg       0.79      0.50      0.50      1928

[[488 949]
 [ 24 467]]


This cell generates the final churn predictions using the selected model and business-aligned probability threshold. Predicted probabilities are converted into binary churn flags based on the chosen cutoff, rather than the default threshold. The classification report summarizes how well the final model performs in terms of precision, recall, and overall balance. The confusion matrix provides a clear view of how many churners are correctly identified versus missed. At this cutoff the model catches 467 of 491 churners (recall 0.95) and misses 24, but precision falls to 0.33 because 949 non-churners are also flagged, and overall accuracy drops to 0.50. Together, these outputs show that the final model behaves as intended for maximizing churn detection, at a substantial cost in false positives.
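
The operational cost of the lower threshold can be read directly off the confusion matrices. The sketch below derives recall, precision, and the share of customers flagged from the two matrices printed in this notebook; `summarize` is a hypothetical helper introduced for this illustration.

```python
def summarize(matrix):
    """Derive churn-class metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "flagged_share": (tp + fp) / total,
        "missed_churners": fn,
    }

# Confusion matrices as printed for the baseline model and the final threshold
baseline = summarize([[1118, 319], [172, 319]])
final = summarize([[488, 949], [24, 467]])
print(baseline)
print(final)
```

On these numbers, the final threshold flags roughly 73% of test customers compared with about 33% for the baseline, while missed churners fall from 172 to 24.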

In [19]:
# Assign risk tiers based on churn probability
def assign_risk_tier(prob):
    if prob >= 0.75:
        return "Very High Risk"
    elif prob >= 0.50:
        return "High Risk"
    elif prob >= 0.30:
        return "Medium Risk"
    else:
        return "Low Risk"


This cell defines a function that converts raw churn probabilities into clear, interpretable risk tiers. Each tier represents a different level of urgency, allowing the business to prioritize retention actions based on predicted risk. Higher probability thresholds correspond to more aggressive intervention categories, such as “High Risk” and “Very High Risk.” Note that the tier cut points are independent of the 0.17 decision threshold, so a customer with a probability between 0.17 and 0.30 is flagged as a predicted churner but still falls in the “Low Risk” tier. Translating probabilities into risk tiers makes model outputs easier for non-technical stakeholders to understand and act upon. This step bridges the gap between machine learning predictions and real-world business decision-making.
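
For larger tables, the same tiering can be done in one vectorized call. The sketch below is an alternative to the function above, not a replacement for it, and assumes the same cut points; the probabilities are illustrative values chosen to hit each boundary.

```python
import numpy as np
import pandas as pd

# Same cut points as assign_risk_tier; right=False makes each bin include its lower edge
bins = [-np.inf, 0.30, 0.50, 0.75, np.inf]
labels = ["Low Risk", "Medium Risk", "High Risk", "Very High Risk"]

demo_probs = pd.Series([0.10, 0.30, 0.49, 0.50, 0.75, 1.00])  # illustrative values
tiers = pd.cut(demo_probs, bins=bins, labels=labels, right=False)
print(tiers.tolist())
```

Using infinite outer edges avoids dropping a probability of exactly 1.0, which a closed-open bin ending at 1.0 would exclude.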

In [20]:
# Create results table with churn probabilities and risk tiers
results = X_test.copy()
results["churn_probability"] = final_probs
results["predicted_churn"] = final_preds
results["risk_tier"] = results["churn_probability"].apply(assign_risk_tier)


This cell creates a structured results table that combines model predictions with the original feature data. Churn probabilities, binary churn predictions, and assigned risk tiers are appended to each customer record. This format allows analysts and business teams to inspect which customers are at risk alongside the feature values that describe them. The resulting table can be filtered, exported, or integrated into downstream tools such as dashboards or retention workflows. Organizing predictions this way makes the model’s output immediately actionable.

In [21]:
# Extract high-risk customers for business action
high_risk_customers = results[results["risk_tier"].isin(["High Risk", "Very High Risk"])]

# Make sure the output directory exists before writing
os.makedirs("../reports", exist_ok=True)

# Save high-risk customers to CSV for downstream use
high_risk_customers.to_csv("../reports/high_risk_customers.csv", index=False)


This cell filters the results to isolate customers who fall into the highest churn risk categories. By selecting only “High Risk” and “Very High Risk” customers, the analysis focuses on individuals most likely to churn without intervention. These customers represent the highest priority for retention strategies such as outreach, discounts, or service improvements. The filtered dataset is then saved as a CSV file for easy sharing and downstream use. This output enables business teams to operationalize the model’s predictions outside of the notebook environment.

In [22]:
# Save final model and threshold for deployment
os.makedirs("../models", exist_ok=True)

joblib.dump(FINAL_MODEL, "../models/final_churn_model.joblib")
joblib.dump(FINAL_THRESHOLD, "../models/final_threshold.joblib")


['../models/final_threshold.joblib']

This cell saves the finalized churn prediction model and the selected decision threshold to disk for reuse. The models directory is created if it does not already exist, ensuring the save operation succeeds across environments. Persisting the trained model allows it to be loaded later without retraining, which is essential for deployment and application use. Saving the threshold alongside the model ensures that predictions remain consistent with the business-aligned decision logic. This step enables the churn model to be integrated into external tools such as APIs or interactive applications.
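
A consuming application would load both artifacts and apply the threshold to predicted probabilities. The sketch below shows that round trip in a self-contained form: it trains a small stand-in model on synthetic data and writes to a temporary directory, whereas a real consumer would load the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model and data; the real artifacts are the pipeline and threshold saved above
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "final_churn_model.joblib")
    threshold_path = os.path.join(tmp, "final_threshold.joblib")
    joblib.dump(model, model_path)
    joblib.dump(0.17, threshold_path)

    # Later, typically in a separate process: load both artifacts
    loaded_model = joblib.load(model_path)
    loaded_threshold = joblib.load(threshold_path)

# Score new rows and apply the stored cutoff instead of the default 0.5
new_probs = loaded_model.predict_proba(X_demo[:5])[:, 1]
new_preds = (new_probs >= loaded_threshold).astype(int)
print(loaded_threshold, new_preds)
```

Because the saved pipeline includes preprocessing, the loading environment needs compatible versions of scikit-learn and imbalanced-learn to unpickle it.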