# ChurnSense: Business Framing & Descriptive Analysis

## Project Context
Customer churn is a major source of revenue loss for subscription-based businesses.
Predicting churn risk is useful, but the real business challenge is deciding **which customers to intervene on** given a limited retention budget.

This notebook focuses on:
- clearly defining the business problem
- understanding what "churn" means in this dataset
- validating the target variable
- performing descriptive analysis to build intuition
- identifying data quality, bias, and leakage risks

No modeling decisions are finalized in this notebook.

## Business Problem

The core business question is **not**:

> "Can we predict churn accurately?"

Instead, the decision-oriented question is:

> **Given limited retention budget, which customers should we target to maximize expected retained value under uncertainty?**

This reframes churn modeling as a **decision-support problem**, not a pure prediction task.

Key constraints:
- Retention actions (discounts, incentives, outreach) cost money
- Intervening on all customers is infeasible
- Predictions are probabilistic, not certain

Therefore, the goal of ChurnSense is to:
1. Estimate churn risk **before churn occurs**
2. Rank customers by risk
3. Support intervention decisions using thresholds and cost–benefit tradeoffs
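The three goals above can be illustrated with a small expected-value calculation. The following is a minimal sketch, not the project's method: the `Customer` class, the `select_targets` helper, the flat `cost_per_offer`, and the `save_rate` (the chance an offer prevents a churn) are all assumptions introduced for illustration, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    churn_prob: float      # estimated probability of churn (model output)
    retained_value: float  # value kept if the customer stays (e.g., CLTV)


def select_targets(customers, budget, cost_per_offer, save_rate):
    """Greedy selection: rank by expected net gain, stop when the budget runs out.

    save_rate is the assumed probability that an offer prevents a churn
    that would otherwise have happened.
    """
    def expected_net_gain(c):
        return c.churn_prob * save_rate * c.retained_value - cost_per_offer

    ranked = sorted(customers, key=expected_net_gain, reverse=True)
    selected, spent = [], 0.0
    for c in ranked:
        # Stop at the first unprofitable customer or when the budget is exhausted
        if expected_net_gain(c) <= 0 or spent + cost_per_offer > budget:
            break
        selected.append(c.customer_id)
        spent += cost_per_offer
    return selected


# Hypothetical customers and parameters, for illustration only
customers = [
    Customer("A", churn_prob=0.90, retained_value=5000),
    Customer("B", churn_prob=0.10, retained_value=6000),
    Customer("C", churn_prob=0.60, retained_value=3000),
    Customer("D", churn_prob=0.05, retained_value=1000),
]
targets = select_targets(customers, budget=100, cost_per_offer=50, save_rate=0.3)
print(targets)
```

With a budget for two offers, the sketch picks the customers with the highest expected net gain rather than simply the highest churn probability or the highest value. A customer whose expected gain is below the offer cost is never selected, even if budget remains.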

## Data Science Workflow Alignment

This project follows a standard, defensible data science workflow:

1. Business problem framing 
2. Data understanding & leakage assessment ← **this notebook**
3. Data preprocessing & feature handling
4. Baseline modeling
5. Evaluation using business-relevant metrics
6. Threshold tuning and intervention analysis
7. Deployment as a decision-support tool

This notebook focuses on **Steps 1–2**.

## Load Data

In [1]:
import pandas as pd
import numpy as np

DATA_PATH = "../data/raw/churn.csv"
df = pd.read_csv(DATA_PATH)

df.head()

|  | CustomerID | Gender | Age | Under30 | SeniorCitizen | Married | Dependents | NumberofDependents | Country | State | ... | TotalExtraDataCharges | TotalLongDistanceCharges | TotalRevenue | SatisfactionScore | CustomerStatus | ChurnLabel | ChurnScore | CLTV | ChurnCategory | ChurnReason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8779-QRDMV | Male | 78 | No | Yes | No | No | 0 | United States | California | ... | 20 | 0.0 | 59.65 | 3 | Churned | Yes | 91 | 5433 | Competitor | Competitor offered more data |
| 1 | 7495-OOKFY | Female | 74 | No | Yes | Yes | Yes | 1 | United States | California | ... | 0 | 390.8 | 1024.1 | 3 | Churned | Yes | 69 | 5302 | Competitor | Competitor made better offer |
| 2 | 1658-BYGOY | Male | 71 | No | Yes | No | Yes | 3 | United States | California | ... | 0 | 203.94 | 1910.88 | 2 | Churned | Yes | 81 | 3179 | Competitor | Competitor made better offer |
| 3 | 4598-XLKNJ | Female | 78 | No | Yes | Yes | Yes | 1 | United States | California | ... | 0 | 494.0 | 2995.07 | 2 | Churned | Yes | 88 | 5337 | Dissatisfaction | Limited range of services |
| 4 | 4846-WHAFZ | Female | 80 | No | Yes | Yes | Yes | 1 | United States | California | ... | 0 | 234.21 | 3102.36 | 2 | Churned | Yes | 67 | 2793 | Price | Extra data charges |


## Dataset Overview

We begin by examining:
- number of observations
- number of features
- column names and data types

These checks confirm that the file loaded as expected (row count, column count, and data types) before any modeling.

In [2]:
df.shape

(7043, 50)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 50 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   CustomerID                     7043 non-null   object 
 1   Gender                         7043 non-null   object 
 2   Age                            7043 non-null   int64  
 3   Under30                        7043 non-null   object 
 4   SeniorCitizen                  7043 non-null   object 
 5   Married                        7043 non-null   object 
 6   Dependents                     7043 non-null   object 
 7   NumberofDependents             7043 non-null   int64  
 8   Country                        7043 non-null   object 
 9   State                          7043 non-null   object 
 10  City                           7043 non-null   object 
 11  ZipCode                        7043 non-null   int64  
 12  Latitude                       7043 non-null   float64
 ...

In [4]:
df.columns.tolist()

['CustomerID',
 'Gender',
 'Age',
 'Under30',
 'SeniorCitizen',
 'Married',
 'Dependents',
 'NumberofDependents',
 'Country',
 'State',
 'City',
 'ZipCode',
 'Latitude',
 'Longitude',
 'Population',
 'Quarter',
 'ReferredaFriend',
 'Number_of_Referrals',
 'TenureinMonths',
 'Offer',
 'PhoneService',
 'AvgMonthlyLongDistanceCharges',
 'MultipleLines',
 'InternetService',
 'InternetType',
 'AvgMonthlyGBDownload',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtectionPlan',
 'PremiumTechSupport',
 'StreamingTV',
 'StreamingMovies',
 'StreamingMusic',
 'UnlimitedData',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharge',
 'TotalCharges',
 'TotalRefunds',
 'TotalExtraDataCharges',
 'TotalLongDistanceCharges',
 'TotalRevenue',
 'SatisfactionScore',
 'CustomerStatus',
 'ChurnLabel',
 'ChurnScore',
 'CLTV',
 'ChurnCategory',
 'ChurnReason']
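The overview cells report shape and data types but do not test data quality, which this notebook lists among its goals. A minimal sketch of such checks follows, run on a toy frame with made-up values. `quality_report` is a hypothetical helper introduced here, not part of the project.

```python
import pandas as pd

# Toy stand-in for the churn DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B", "0002-B", "0003-C"],
    "MonthlyCharge": [70.35, None, 89.85, 18.25],
    "Country": ["United States"] * 4,
})


def quality_report(frame, id_col):
    """Summarize missing values, duplicate IDs, and constant columns."""
    missing = frame.isna().sum()
    return {
        # Only columns that actually contain missing values
        "missing_by_column": missing[missing > 0].to_dict(),
        # Rows whose ID repeats an earlier row
        "duplicate_ids": int(frame[id_col].duplicated().sum()),
        # Columns with a single distinct value carry no signal
        "constant_columns": [
            c for c in frame.columns if frame[c].nunique(dropna=False) == 1
        ],
    }


report = quality_report(toy, id_col="CustomerID")
print(report)
```

Applied to the real data, the call would be `quality_report(df, id_col="CustomerID")`. The preview rows all show the same `Country` and `State`, so a constant-column check of this kind would confirm or refute whether those columns vary at all.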

## Target Variable: What Does "Churn" Mean Here?

Before modeling, we must clearly identify:
- which column represents churn
- how it is encoded
- whether it reflects **future information**

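The target should record only the outcome, that is, whether the customer left during the observation window.
Any column that is populated only after churn has occurred (for example, a stated reason for leaving) belongs with the outcome, not with the features.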

In [5]:
# Look at unique values of likely target columns
potential_targets = ["ChurnLabel", "Churn", "Exited", "is_churn"]

for col in potential_targets:
    if col in df.columns:
        print(col, df[col].value_counts(dropna=False))

ChurnLabel ChurnLabel
No     5174
Yes    1869
Name: count, dtype: int64
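To validate the target further, the label can be encoded as 0/1 and cross-checked against `CustomerStatus`. The sketch below uses toy rows rather than the real data, and the status values `"Stayed"` and `"Joined"` are assumptions (only `"Churned"` appears in the preview above).

```python
import pandas as pd

# Toy rows mimicking the two columns (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "ChurnLabel": ["Yes", "No", "No", "Yes", "No"],
    "CustomerStatus": ["Churned", "Stayed", "Joined", "Churned", "Stayed"],
})

# Map the string label to 0/1; unexpected values become NaN and are caught below
y = toy["ChurnLabel"].map({"No": 0, "Yes": 1})
if y.isna().any():
    raise ValueError("unexpected label values in ChurnLabel")
y = y.astype(int)

# Cross-tabulate the label against the status column to see how they line up
agreement = pd.crosstab(toy["ChurnLabel"], toy["CustomerStatus"])
print(agreement)

# Every "Yes" row should have status "Churned", and every "No" row should not
consistent = (
    (toy["ChurnLabel"] == "Yes") == (toy["CustomerStatus"] == "Churned")
).all()
print(consistent)
```

If the two columns agree perfectly on the real data, `CustomerStatus` is effectively a copy of the target, which is a strong reason to keep it out of the feature set.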


## Churn Base Rate

The churn base rate tells us:
- how common churn is
- how imbalanced the problem may be
- which evaluation metrics are appropriate

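Because churners are usually the minority class, a model that always predicts "no churn" can score high accuracy while identifying no one to retain. Precision–recall metrics are therefore often more informative than accuracy.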

In [6]:
target_col = "ChurnLabel" if "ChurnLabel" in df.columns else "Churn"

df[target_col].value_counts(normalize=True)

ChurnLabel
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64
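The base rate also fixes the accuracy of a trivial baseline. The short calculation below uses the class counts printed earlier (5174 "No", 1869 "Yes") to show why accuracy alone is a weak metric here. The "always predict No" model is a hypothetical reference point, not something the project proposes.

```python
# Class counts as reported by value_counts() on ChurnLabel
n_stay, n_churn = 5174, 1869
n_total = n_stay + n_churn

# A trivial model that always predicts "No" is right for every non-churner
baseline_accuracy = n_stay / n_total

# ...but it never flags a churner, so recall on the churn class is zero
true_positives = 0
baseline_recall = true_positives / n_churn

print(f"baseline accuracy: {baseline_accuracy:.5f}")
print(f"baseline churn recall: {baseline_recall:.1f}")
```

Any candidate model therefore needs to be judged against roughly 73% accuracy as a floor, and on how many churners it actually finds.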

## Descriptive Statistics: Numerical Features

We inspect central tendency and spread for numerical variables to:
- understand customer behavior
- identify potential outliers
- build intuition for churn drivers

In [7]:
df.describe()

|  | Age | NumberofDependents | ZipCode | Latitude | Longitude | Population | Number_of_Referrals | TenureinMonths | AvgMonthlyLongDistanceCharges | AvgMonthlyGBDownload | MonthlyCharge | TotalCharges | TotalRefunds | TotalExtraDataCharges | TotalLongDistanceCharges | TotalRevenue | SatisfactionScore | ChurnScore | CLTV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 |
| mean | 46.509726 | 0.468692 | 93486.070567 | 36.197455 | -119.756684 | 22139.603294 | 1.951867 | 32.386767 | 22.958954 | 20.515405 | 64.761692 | 2280.381264 | 1.962182 | 6.860713 | 749.099262 | 3034.379056 | 3.244924 | 58.50504 | 4400.295755 |
| std | 16.750352 | 0.962802 | 1856.767505 | 2.468929 | 2.154425 | 21152.392837 | 3.001199 | 24.542061 | 15.448113 | 20.41894 | 30.090047 | 2266.220462 | 7.902614 | 25.104978 | 846.660055 | 2865.204542 | 1.201657 | 21.170031 | 1183.057152 |
| min | 19.0 | 0.0 | 90001.0 | 32.555828 | -124.301372 | 11.0 | 0.0 | 1.0 | 0.0 | 0.0 | 18.25 | 18.8 | 0.0 | 0.0 | 0.0 | 21.36 | 1.0 | 5.0 | 2003.0 |
| 25% | 32.0 | 0.0 | 92101.0 | 33.990646 | -121.78809 | 2344.0 | 0.0 | 9.0 | 9.21 | 3.0 | 35.5 | 400.15 | 0.0 | 0.0 | 70.545 | 605.61 | 3.0 | 40.0 | 3469.0 |
| 50% | 46.0 | 0.0 | 93518.0 | 36.205465 | -119.595293 | 17554.0 | 0.0 | 29.0 | 22.89 | 17.0 | 70.35 | 1394.55 | 0.0 | 0.0 | 401.44 | 2108.64 | 3.0 | 61.0 | 4527.0 |
| 75% | 60.0 | 0.0 | 95329.0 | 38.161321 | -117.969795 | 36125.0 | 3.0 | 55.0 | 36.395 | 27.0 | 89.85 | 3786.6 | 0.0 | 0.0 | 1191.1 | 4801.145 | 4.0 | 75.5 | 5380.5 |
| max | 80.0 | 9.0 | 96150.0 | 41.962127 | -114.192901 | 105285.0 | 11.0 | 72.0 | 49.99 | 85.0 | 118.75 | 8684.8 | 49.79 | 150.0 | 3564.72 | 11979.34 | 5.0 | 96.0 | 6500.0 |
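Overall summary statistics say little about churn drivers until they are split by the target. A minimal sketch of that comparison follows, using a toy sample with made-up values. The choice of `TenureinMonths` and `MonthlyCharge` is only an example, and no result for the real dataset is implied.

```python
import pandas as pd

# Toy sample with made-up values; column names follow the dataset
toy = pd.DataFrame({
    "ChurnLabel": ["Yes", "Yes", "No", "No", "No"],
    "TenureinMonths": [2, 6, 40, 60, 20],
    "MonthlyCharge": [90.0, 80.0, 50.0, 60.0, 70.0],
})

# Compare group means and medians for a few numeric features
by_churn = (
    toy.groupby("ChurnLabel")[["TenureinMonths", "MonthlyCharge"]]
    .agg(["mean", "median"])
)
print(by_churn)
```

On the real data the same call would use `df` in place of `toy`. Differences between the two groups are descriptive only and should not be read as causal effects.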


## Data Leakage Assessment

Data leakage occurs when features contain information that would not be available
at the time a prediction is made.

In churn datasets, common leakage columns include:
- churn reasons
- post-churn customer status
- vendor-generated churn scores

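These variables can inflate offline performance, but they are unavailable (or already determined by the outcome) at prediction time, so a model that relies on them cannot support real decisions.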

In [8]:
leakage_candidates = [
    "CustomerStatus", 
    "ChurnScore", 
    "ChurnReason", 
    "ChurnCategory"
]

[c for c in leakage_candidates if c in df.columns]

['CustomerStatus', 'ChurnScore', 'ChurnReason', 'ChurnCategory']
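Once the leakage candidates are confirmed, they can be removed when the feature matrix is built. The sketch below shows one way to do that on a toy frame with made-up values. Dropping `CustomerID` as well is an assumption made here, on the grounds that an identifier carries no predictive signal.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the dataset
toy = pd.DataFrame({
    "CustomerID": ["0001-A", "0002-B"],
    "TenureinMonths": [1, 48],
    "MonthlyCharge": [95.1, 45.3],
    "CustomerStatus": ["Churned", "Stayed"],
    "ChurnScore": [91, 30],
    "ChurnCategory": ["Competitor", None],
    "ChurnReason": ["Competitor made better offer", None],
    "ChurnLabel": ["Yes", "No"],
})

leakage_cols = ["CustomerStatus", "ChurnScore", "ChurnReason", "ChurnCategory"]
id_cols = ["CustomerID"]  # identifier, assumed to carry no predictive signal
target_col = "ChurnLabel"

# errors="ignore" keeps this safe if a listed column is absent
X = toy.drop(columns=leakage_cols + id_cols + [target_col], errors="ignore")
y = (toy[target_col] == "Yes").astype(int)

print(X.columns.tolist())
```

Keeping the exclusion list in one place makes it easy to audit later which columns were withheld from the model and why.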

## Note on Data Cleaning and Transformation

Data cleaning, encoding, and scaling are intentionally **deferred to the modeling pipeline** rather than performed directly in this notebook.

This ensures that:
- preprocessing is applied **consistently** during training and inference
- scaling and encoding parameters are learned from training data only, so no leakage is introduced through manual transformations
- the workflow remains reproducible and production-aligned

All feature preprocessing decisions are implemented using `scikit-learn` pipelines in subsequent modeling steps.
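For orientation, a pipeline of this kind might look like the sketch below. It is an illustration under assumptions, not the project's actual configuration: the toy data is made up, the `Contract` category strings are guesses, and logistic regression stands in for whatever baseline model is chosen later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with made-up values; column names follow the dataset
X = pd.DataFrame({
    "TenureinMonths": [1, 3, 60, 48, 2, 70],
    "MonthlyCharge": [95.0, 90.0, 40.0, 55.0, 99.0, 30.0],
    "Contract": ["Month-to-Month", "Month-to-Month", "Two Year",
                 "One Year", "Month-to-Month", "Two Year"],
})
y = [1, 1, 0, 0, 1, 0]

numeric_cols = ["TenureinMonths", "MonthlyCharge"]
categorical_cols = ["Contract"]

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Preprocessing and model are fit together, so the same fitted
# transformations are reused unchanged at inference time
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

new_customer = pd.DataFrame({
    "TenureinMonths": [5],
    "MonthlyCharge": [85.0],
    "Contract": ["Month-to-Month"],
})
churn_prob = model.predict_proba(new_customer)[0, 1]
print(round(churn_prob, 3))
```

Because the scaler and encoder live inside the pipeline, cross-validation refits them on each training fold, which is what keeps validation data from influencing the preprocessing.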

## Bias, Assumptions, and Limitations

Important considerations:
- This is observational data, not experimental data
- Causal conclusions cannot be drawn directly
- Customer behavior may be influenced by unobserved factors
- Historical churn patterns may not generalize perfectly

All modeling decisions in this project are made under uncertainty.
The goal is **better decisions**, not perfect prediction.

## Key Takeaways from Descriptive Analysis

- About **26.5%** of customers in this dataset (1,869 of 7,043) have churned, indicating a moderately imbalanced classification problem.
- Four columns contain **post-churn or outcome-derived information** (`CustomerStatus`, `ChurnScore`, `ChurnCategory`, `ChurnReason`) and must be excluded to avoid data leakage.
- Numerical features sit on very different scales (for example, `MonthlyCharge` ranges from 18.25 to 118.75, while `TotalRevenue` ranges from 21.36 to 11,979.34), reinforcing the need for **scaling in linear models**.
- This dataset is suitable for **risk ranking and decision support**, but not for causal inference.

These findings guide subsequent modeling, evaluation metrics, and threshold-based intervention decisions.