
aashir92/Customer_Churn_Prediction


library_name: scikit-learn
tags: tabular-classification, customer-churn, random-forest, gradio
datasets: WA_Fn-UseC_-Telco-Customer-Churn
language: en
metrics: accuracy, f1
base_model: sklearn-pipeline-random-forest

Model Card for Customer Churn Prediction Pipeline

This model is a trained Scikit-learn pipeline designed to predict whether a telecom customer is likely to churn based on account, service, and billing attributes.

Model Details

Model Description

This model acts as a churn-risk scoring engine for retention workflows. It combines preprocessing (imputation, scaling, one-hot encoding) and classification in a single serialized pipeline artifact for consistent training and inference behavior.
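The composition described here can be sketched with scikit-learn's `ColumnTransformer` and `Pipeline`. This is an illustrative sketch, not the repository's training script: the column subsets, hyperparameters, and synthetic fit below are assumptions chosen to show the structure.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative subsets of the Telco columns.
numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract", "InternetService", "PaymentMethod"]

preprocess = ColumnTransformer(
    [
        # Numeric branch: median imputation, then standardization.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        # Categorical branch: mode imputation, then one-hot encoding.
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("ohe", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
    ]
)

# Preprocessing and classifier live in one serializable artifact.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Tiny synthetic fit just to show the pipeline trains end to end.
X = pd.DataFrame({
    "tenure": [1, 24, 60, 5],
    "MonthlyCharges": [70.0, 55.5, 90.2, 20.1],
    "TotalCharges": [70.0, 1332.0, 5412.0, np.nan],
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "InternetService": ["Fiber optic", "DSL", "Fiber optic", np.nan],
    "PaymentMethod": ["Electronic check"] * 4,
})
y = [1, 0, 0, 1]
pipeline.fit(X, y)
```

Because preprocessing is fitted inside the pipeline, the same imputation statistics, scaling parameters, and category vocabulary apply at inference time, which is what keeps training and serving behavior consistent.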

  • Developed by: Aashir Hameed
  • Model type: Scikit-learn Tabular Classification Pipeline
  • Language(s): English (en) for feature labels/documentation
  • License: Apache 2.0
  • Trained from: Telco customer churn tabular dataset

Uses

Direct Use

This model is intended for churn risk scoring in:

  • CRM prioritization and retention campaigns
  • Proactive outreach workflows for high-risk customers
  • Batch scoring of customer cohorts

Binary output mapping:

  • 0: No Churn
  • 1: Churn
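For campaign prioritization, `predict_proba` gives a churn probability that can be ranked or thresholded rather than relying on the default 0.5 cut-off. A minimal sketch with a stand-in classifier and synthetic data (in practice, load the serialized pipeline with joblib as in the "How to Get Started" section; the 0.7 threshold is an illustrative assumption, not a calibrated choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model and data; replace with the loaded pipeline and real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)
model = LogisticRegression().fit(X, y)

churn_prob = model.predict_proba(X)[:, 1]       # probability of class 1 (Churn)
high_risk = np.argsort(churn_prob)[::-1][:10]   # top-10 highest-risk customers
flagged = churn_prob > 0.7                      # hypothetical outreach threshold
```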

Out-of-Scope Use

This model is not intended for:

  • Causal inference on churn drivers
  • Fairness-critical automated decisions without human review
  • Data distributions that significantly differ from the Telco training data

Bias, Risks, and Limitations

Like all supervised models, this pipeline may reflect historical biases and collection artifacts present in source data. Prediction confidence can degrade under distribution shift (for example new plans, pricing structures, or service bundles not represented in training data). The model should be monitored for drift and recalibrated/retrained on a schedule.
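A lightweight drift check can sit directly on the scoring path. The sketch below uses a simple standardized mean shift on one numeric feature; this is a stand-in for fuller monitoring such as PSI or KS tests, and the data and the 2.0 threshold are illustrative assumptions.

```python
import pandas as pd

# Reference (training-time) and live (scoring-time) samples; illustrative values.
reference = pd.DataFrame({"MonthlyCharges": [70.0, 55.5, 90.2, 20.1]})
live = pd.DataFrame({"MonthlyCharges": [150.0, 160.5, 140.2, 155.1]})

def mean_shift(ref: pd.Series, cur: pd.Series) -> float:
    """Absolute mean shift, in units of the reference standard deviation."""
    return abs(cur.mean() - ref.mean()) / ref.std()

shift = mean_shift(reference["MonthlyCharges"], live["MonthlyCharges"])
drifted = shift > 2.0  # flag the feature for recalibration/retraining review
```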

How to Get Started with the Model

Use the code below for inference with joblib:

from pathlib import Path
import joblib
import pandas as pd

model = joblib.load(Path("churn_model_v1.pkl"))

sample = pd.DataFrame(
    [
        {
            "gender": "Female",
            "SeniorCitizen": "0",
            "Partner": "Yes",
            "Dependents": "No",
            "tenure": 12,
            "PhoneService": "Yes",
            "MultipleLines": "No",
            "InternetService": "Fiber optic",
            "OnlineSecurity": "No",
            "OnlineBackup": "Yes",
            "DeviceProtection": "No",
            "TechSupport": "No",
            "StreamingTV": "Yes",
            "StreamingMovies": "Yes",
            "Contract": "Month-to-month",
            "PaperlessBilling": "Yes",
            "PaymentMethod": "Electronic check",
            "MonthlyCharges": 89.1,
            "TotalCharges": 1069.2,
        }
    ]
)

prediction = model.predict(sample)[0]            # 0 = No Churn, 1 = Churn
probability = model.predict_proba(sample)[0][1]  # estimated churn probability
print(prediction, probability)

Training Details

Training Data

The model was trained on WA_Fn-UseC_-Telco-Customer-Churn.csv with the standard churn target column (Churn).

Training Procedure

Preprocessing

  • Dropped non-predictive customerID
  • Coerced TotalCharges to numeric and removed rows with invalid target/critical numeric values
  • Numeric preprocessing: median imputation + standard scaling
  • Categorical preprocessing: most-frequent imputation + one-hot encoding (handle_unknown='ignore')
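The cleaning steps above can be sketched as follows. The miniature DataFrame is synthetic, and mapping `Churn` to 0/1 is an assumption about how the target was encoded; in the raw Telco CSV, `TotalCharges` contains blank strings for new customers, which is why the numeric coercion is needed.

```python
import pandas as pd

# Synthetic rows mimicking the raw Telco CSV.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "1069.2", "2310.7"],  # blank string = new customer
    "Churn": ["No", "Yes", "No"],
})

df = df.drop(columns=["customerID"])                              # non-predictive identifier
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges", "Churn"])                  # drop invalid rows
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})                # binary target
```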

Training Hyperparameters

  • Validation: Stratified K-Fold cross-validation (n_splits=5)
  • Model search: GridSearchCV with scoring = f1
  • Candidates: Logistic Regression and Random Forest
  • Winning model: Random Forest
  • Best params (winner):
    • class_weight=balanced
    • max_depth=8
    • min_samples_leaf=4
    • min_samples_split=2
    • n_estimators=200
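The search described above can be sketched like this, with a synthetic dataset standing in for the Telco data and the grid reduced to the reported winning values (the full search also covered Logistic Regression and wider grids):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary-classification stand-in for the Telco features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {
    "n_estimators": [200],
    "max_depth": [8],
    "min_samples_leaf": [4],
    "min_samples_split": [2],
    "class_weight": ["balanced"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)
```

`search.best_estimator_` is the refitted winner; in the repository's setup, the whole preprocessing-plus-model pipeline would be the estimator passed to `GridSearchCV` so that imputation and encoding are fitted inside each fold.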

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out split from the Telco dataset with stratified train/test partitioning.

Metrics

  • Accuracy
  • F1-score
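Both metrics correspond to the standard scikit-learn implementations; a small worked example with made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions: 4/6
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall: 2/3
```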

Results

  • Final Test Accuracy: 75.05%
  • Final Test F1-Score: 62.38%
  • Best CV F1-score: 63.96%

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

  • Hardware Type: Standard local CPU training environment
  • Training profile: Classical ML grid-search over two model families

Author & Contact

Aashir Hameed

About

Telco churn prediction built with scikit-learn pipelines: a Random Forest tuned via GridSearchCV with StratifiedKFold cross-validation, serialized as a joblib artifact and served through a Gradio UI on Hugging Face.
