# ChurnSense: Baseline Modeling, Evaluation, and Threshold Tuning

## Purpose
This notebook retrains the baseline churn model, both to understand how it behaves and to evaluate it as a **decision tool**.

We will:
- define the modeling dataset (target + leakage removal)
- train a baseline Logistic Regression pipeline (with preprocessing)
- evaluate with ROC-AUC and PR-AUC
- examine confusion matrices at multiple thresholds
- choose a threshold (or Top-N policy) based on business tradeoffs (cost, capacity, ROI)

Key mindset:
> The goal is not "maximum accuracy." The goal is **better decisions under uncertainty**.

In [6]:
import pandas as pd
import numpy as np

from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    precision_recall_curve, confusion_matrix
)

## Load Data

We load the dataset and confirm basic shape.  
(We did descriptive analysis in Notebook 01; here we focus on modeling and decision evaluation.)

In [7]:
ROOT = Path("..").resolve()
DATA_PATH = ROOT / "data" / "raw" / "churn.csv"

df = pd.read_csv(DATA_PATH)
df.shape, df.head()

((7043, 50),
    CustomerID  Gender  Age Under30 SeniorCitizen Married Dependents  \
 0  8779-QRDMV    Male   78      No           Yes      No         No   
 1  7495-OOKFY  Female   74      No           Yes     Yes        Yes   
 2  1658-BYGOY    Male   71      No           Yes      No        Yes   
 3  4598-XLKNJ  Female   78      No           Yes     Yes        Yes   
 4  4846-WHAFZ  Female   80      No           Yes     Yes        Yes   
 
    NumberofDependents        Country       State  ... TotalExtraDataCharges  \
 0                   0  United States  California  ...                    20   
 1                   1  United States  California  ...                     0   
 2                   3  United States  California  ...                     0   
 3                   1  United States  California  ...                     0   
 4                   1  United States  California  ...                     0   
 
    TotalLongDistanceCharges  TotalRevenue  SatisfactionScore  Customer

## Target and Leakage Policy

A valid churn model must only use features that would be available **before** churn occurs.
Therefore, we:
1) identify the churn target column
2) drop known leakage columns if present (post-churn info)
3) convert the target into a binary 0/1 label
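
A hand-written list of leakage columns can miss a feature that quietly encodes the outcome. As a rough complement to the policy above, the sketch below scores each numeric column on its own and flags any that nearly separates the classes. The helper name `single_feature_auc` and the toy columns `leaky` and `noise` are illustrative assumptions, not part of this project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """AUC of each numeric column used alone as a score (hypothetical helper)."""
    scores = {}
    for col in X.select_dtypes(include="number").columns:
        values = X[col].fillna(X[col].median())
        if values.nunique() < 2:
            continue  # a constant column cannot rank anything
        auc = roc_auc_score(y, values)
        scores[col] = max(auc, 1 - auc)  # ignore the direction of the relationship
    return pd.Series(scores).sort_values(ascending=False)

# Toy data: `leaky` is almost a copy of the label, `noise` is unrelated to it.
rng = np.random.default_rng(0)
y_toy = pd.Series(rng.integers(0, 2, size=500))
X_toy = pd.DataFrame({
    "leaky": y_toy + rng.normal(0, 0.05, size=500),
    "noise": rng.normal(size=500),
})

leak_scores = single_feature_auc(X_toy, y_toy)
print(leak_scores)
```

A column with a single-feature AUC near 1.0 is not necessarily leakage, but it deserves a manual check of when and how its value is recorded.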

In [8]:
LEAKAGE_COLS = ["CustomerStatus", "ChurnScore", "ChurnReason", "ChurnCategory"]

def find_target_column(df: pd.DataFrame) -> str:
    candidates = ["ChurnLabel", "Churn", "Exited", "is_churn"]
    for c in candidates:
        if c in df.columns:
            return c
    raise ValueError(f"Target not found. Candidates: {candidates}")

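def to_binary_target(y: pd.Series) -> pd.Series:
    # Handles Yes/No, True/False, 0/1 (as strings, booleans, or numbers)
    if not pd.api.types.is_numeric_dtype(y):
        y = (
            y.astype(str).str.strip().str.lower()
             .map({"yes": 1, "no": 0, "true": 1, "false": 0, "1": 1, "0": 0})
        )
    # Fail loudly on missing or unmapped labels instead of a cryptic cast error
    if y.isna().any():
        raise ValueError("Target contains missing or unrecognized labels")
    return y.astype(int)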

target = find_target_column(df)
target

'ChurnLabel'

In [9]:
y = to_binary_target(df[target])

drop_cols = [target] + [c for c in LEAKAGE_COLS if c in df.columns]
X = df.drop(columns=drop_cols, errors="ignore")

print("Target:", target)
print("Dropped leakage columns:", [c for c in LEAKAGE_COLS if c in df.columns])
print("X shape:", X.shape, "| y mean (churn rate):", y.mean())

Target: ChurnLabel
Dropped leakage columns: ['CustomerStatus', 'ChurnScore', 'ChurnReason', 'ChurnCategory']
X shape: (7043, 45) | y mean (churn rate): 0.2653698707936959


## Train/Test Split

We use a stratified split so the churn rate is similar in train and test sets.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

y_train.mean(), y_test.mean()

(np.float64(0.2653532126375577), np.float64(0.2654364797728886))

## Baseline Pipeline: Preprocessing + Logistic Regression

We use a single scikit-learn Pipeline to ensure preprocessing is applied consistently:
- Numeric: median imputation + standard scaling (important for linear models)
- Categorical: most-frequent imputation + one-hot encoding
- Model: Logistic Regression (interpretable, strong baseline)

This pipeline is production-aligned because it prevents "train vs inference" mismatch.
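
To make the "train vs inference" point concrete, here is a minimal sketch on toy data. The frame, its `tenure` and `contract` columns, and the unseen category `"Three Year"` are all made up for illustration. Because the encoder lives inside the pipeline and uses `handle_unknown="ignore"`, a category that never appeared in training is encoded as all zeros rather than raising an error at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical values, not from churn.csv)
train = pd.DataFrame({
    "tenure": [1, 3, 24, 36, 2, 48],
    "contract": ["Month-to-Month", "Month-to-Month", "One Year",
                 "Two Year", "Month-to-Month", "Two Year"],
})
labels = [1, 1, 0, 0, 1, 0]

toy_model = Pipeline([
    ("preprocess", ColumnTransformer(
        [
            ("num", StandardScaler(), ["tenure"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
        ],
        sparse_threshold=0,  # always return a dense array so it is easy to inspect
    )),
    ("clf", LogisticRegression()),
])
toy_model.fit(train, labels)

# A row with a contract type the encoder has never seen
new_rows = pd.DataFrame({"tenure": [5], "contract": ["Three Year"]})
encoded = toy_model.named_steps["preprocess"].transform(new_rows)
proba_new = toy_model.predict_proba(new_rows)[:, 1]

print(encoded)    # one scaled numeric column, then three one-hot columns
print(proba_new)
```

The scaler and encoder statistics come only from the data passed to `fit`, so the same object can score new rows without recomputing anything.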

In [11]:
cat_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_cols = [c for c in X_train.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ],
    remainder="drop"
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=5000, solver="saga", n_jobs=-1))
])

model

Pipeline
  preprocess: ColumnTransformer(remainder='drop', sparse_threshold=0.3)
    num: SimpleImputer(strategy='median') -> StandardScaler(with_mean=True, with_std=True)
    cat: SimpleImputer(strategy='most_frequent') -> OneHotEncoder(handle_unknown='ignore', sparse_output=True)
  clf: LogisticRegression(penalty='l2', C=1.0, solver='saga', max_iter=5000, tol=0.0001)


## Train the Baseline Model

In [12]:
model.fit(X_train, y_train)



## Evaluation: Ranking Metrics

- **ROC-AUC**: probability the model ranks a random churner above a random non-churner  
- **PR-AUC** (computed here as average precision): focuses on the positive class (churn), more informative when classes are imbalanced

Next we evaluate decision thresholds because the business must decide who to intervene on.

In [19]:
proba = model.predict_proba(X_test)[:, 1]

roc = roc_auc_score(y_test, proba)
pr  = average_precision_score(y_test, proba)

roc, pr

(0.9919450257046165, 0.9816804473927203)

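With a test-set ROC-AUC of 0.992 and PR-AUC of 0.982, the model ranks churners well, so it can prioritize higher-risk customers even if the exact probabilities are imperfect. Scores this close to 1.0 can also signal residual leakage, so it is worth checking that none of the remaining features (for example `SatisfactionScore`) encodes post-churn information.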
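
The ranking interpretation of ROC-AUC can be checked directly. The sketch below uses synthetic labels and scores (assumptions for illustration only), compares every churner with every non-churner, and confirms that the fraction of correctly ordered pairs matches `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic data: churners tend to get higher scores than non-churners.
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
scores = rng.normal(loc=y_toy.astype(float), scale=1.0)

pos = scores[y_toy == 1]
neg = scores[y_toy == 0]

# Compare every (churner, non-churner) pair; ties count as half a win.
diff = pos[:, None] - neg[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc = roc_auc_score(y_toy, scores)
print(pairwise, auc)  # the two values agree up to floating-point error
```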

## Threshold Tuning

A probability is not a decision.  
To operationalize this model, we must choose either:

- a **threshold** (intervene if p(churn) ≥ t), or
- a **Top-N policy** (intervene on N highest-risk customers)

Below we compute precision/recall and confusion matrix counts across thresholds.

In [14]:
def threshold_report(y_true, y_proba, thresholds=None):
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)

    rows = []
    for t in thresholds:
        y_hat = (y_proba >= t).astype(int)
        # labels=[0, 1] guarantees a 2x2 matrix even if only one class is present
        tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall    = tp / (tp + fn) if (tp + fn) else 0.0
        fpr       = fp / (fp + tn) if (fp + tn) else 0.0

        rows.append({
            "threshold": float(t),
            "selected": int(y_hat.sum()),
            "precision": precision,
            "recall": recall,
            "false_positive_rate": fpr,
            "tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)
        })

    return pd.DataFrame(rows)

thr_df = threshold_report(y_test, proba)
thr_df

| | threshold | selected | precision | recall | false_positive_rate | tp | fp | fn | tn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.05 | 510 | 0.721569 | 0.983957 | 0.137198 | 368 | 142 | 6 | 893 |
| 1 | 0.1 | 467 | 0.779443 | 0.973262 | 0.099517 | 364 | 103 | 10 | 932 |
| 2 | 0.15 | 440 | 0.820455 | 0.965241 | 0.076329 | 361 | 79 | 13 | 956 |
| 3 | 0.2 | 417 | 0.851319 | 0.949198 | 0.059903 | 355 | 62 | 19 | 973 |
| 4 | 0.25 | 402 | 0.878109 | 0.94385 | 0.047343 | 353 | 49 | 21 | 986 |
| 5 | 0.3 | 389 | 0.894602 | 0.930481 | 0.039614 | 348 | 41 | 26 | 994 |
| 6 | 0.35 | 376 | 0.914894 | 0.919786 | 0.030918 | 344 | 32 | 30 | 1003 |
| 7 | 0.4 | 365 | 0.936986 | 0.914439 | 0.022222 | 342 | 23 | 32 | 1012 |
| 8 | 0.45 | 361 | 0.941828 | 0.909091 | 0.02029 | 340 | 21 | 34 | 1014 |
| 9 | 0.5 | 352 | 0.954545 | 0.898396 | 0.015459 | 336 | 16 | 38 | 1019 |
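
One way to turn these counts into a threshold choice is to attach money to them. The sketch below reuses `tp` and `fp` from a few rows of the table above; the economics (`value_per_save`, `save_rate`, `cost_per_contact`) are purely hypothetical placeholders and would need to come from the business.

```python
import pandas as pd

# tp/fp counts copied from the threshold table above (test set).
report = pd.DataFrame({
    "threshold": [0.05, 0.20, 0.30, 0.40, 0.50],
    "tp": [368, 355, 348, 342, 336],
    "fp": [142, 62, 41, 23, 16],
})

def add_net_value(report, value_per_save, save_rate, cost_per_contact):
    """Expected net value per threshold under assumed (hypothetical) economics."""
    out = report.copy()
    contacted = out["tp"] + out["fp"]
    # Only true churners can be saved; every contacted customer costs money.
    out["net_value"] = (
        out["tp"] * save_rate * value_per_save - contacted * cost_per_contact
    )
    return out

# Two illustrative scenarios: a cheap channel (email) and an expensive one (calls).
cheap = add_net_value(report, value_per_save=500, save_rate=0.3, cost_per_contact=10)
costly = add_net_value(report, value_per_save=500, save_rate=0.3, cost_per_contact=100)

best_cheap = cheap.loc[cheap["net_value"].idxmax(), "threshold"]
best_costly = costly.loc[costly["net_value"].idxmax(), "threshold"]
print(best_cheap, best_costly)
```

Under these assumed numbers the cheap channel favors the lowest threshold (contact almost everyone at risk), while the expensive channel favors the highest one. The model is the same in both cases; only the cost structure changes the decision.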


## Business Framing: Capacity-Constrained Intervention

In many retention workflows, the business has a fixed capacity or budget.
Examples:
- a call center can contact 300 customers/week
- marketing can send 500 retention offers/day

A Top-N policy maps cleanly to this reality.

Next we simulate a Top-N policy: pick the N highest-risk customers and evaluate outcomes.

In [15]:
def top_n_report(y_true, y_proba, n):
    idx = np.argsort(-y_proba)[:n]  # top n probs
    y_hat = np.zeros_like(y_true)
    y_hat[idx] = 1

    # labels=[0, 1] guarantees a 2x2 matrix even if only one class is present
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0

    return {
        "N": n,
        "selected": int(y_hat.sum()),
        "precision": precision,
        "recall": recall,
        "tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)
    }

for n in [100, 250, 500, 1000]:
    print(top_n_report(y_test.to_numpy(), proba, n))

{'N': 100, 'selected': 100, 'precision': np.float64(1.0), 'recall': np.float64(0.26737967914438504), 'tp': 100, 'fp': 0, 'fn': 274, 'tn': 1035}
{'N': 250, 'selected': 250, 'precision': np.float64(1.0), 'recall': np.float64(0.6684491978609626), 'tp': 250, 'fp': 0, 'fn': 124, 'tn': 1035}
{'N': 500, 'selected': 500, 'precision': np.float64(0.734), 'recall': np.float64(0.9812834224598931), 'tp': 367, 'fp': 133, 'fn': 7, 'tn': 902}
{'N': 1000, 'selected': 1000, 'precision': np.float64(0.374), 'recall': np.float64(1.0), 'tp': 374, 'fp': 626, 'fn': 0, 'tn': 409}
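
A Top-N policy and a threshold policy are two views of the same ranking: selecting the N highest scores is equivalent to thresholding at the N-th highest score, as long as there are no ties at the cutoff. The helper `top_n_cutoff` below is a hypothetical illustration on made-up scores.

```python
import numpy as np

def top_n_cutoff(y_proba, n):
    """Probability of the n-th highest-risk customer (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    n = min(n, len(y_proba))
    return float(np.sort(y_proba)[::-1][n - 1])

# Made-up scores for six customers
toy = np.array([0.91, 0.15, 0.62, 0.40, 0.88, 0.05])

cutoff = top_n_cutoff(toy, 3)
selected = int((toy >= cutoff).sum())
print(cutoff, selected)
```

If several customers share the cutoff score, thresholding selects all of them, so the count can exceed N. A capacity-bound team would then need a tie-breaking rule.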


## SQL Segment Analysis (DuckDB)

SQL is useful for fast, business-readable summaries such as churn rate by segment.
We use DuckDB in-notebook (no setup) to mirror how churn analysis often happens in warehouses.

This supports decisions like:
- which segments churn most
- where interventions might have the biggest impact

In [16]:
import duckdb
con = duckdb.connect()
con.register("churn", df)

<_duckdb.DuckDBPyConnection at 0x7808b93a5eb0>

In [17]:
# Build a SQL expression that yields 0/1 churn
if target == "ChurnLabel":
    churn_expr = "CASE WHEN ChurnLabel = 'Yes' THEN 1 ELSE 0 END"
elif target == "Churn":
    churn_expr = "CASE WHEN LOWER(CAST(Churn AS VARCHAR)) IN ('yes','true','1') THEN 1 ELSE 0 END"
else:
    churn_expr = f"CAST({target} AS INTEGER)"

query = f"""
SELECT
  Contract,
  COUNT(*) AS customers,
  AVG({churn_expr}) AS churn_rate
FROM churn
GROUP BY Contract
ORDER BY churn_rate DESC;
"""
con.execute(query).df()

| | Contract | customers | churn_rate |
|---|---|---|---|
| 0 | Month-to-Month | 3610 | 0.458449 |
| 1 | One Year | 1550 | 0.107097 |
| 2 | Two Year | 1883 | 0.025491 |
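
Segment summaries become more useful when they sit next to the model's scores. As a sketch, the pandas aggregation below mirrors the SQL query and adds the mean predicted risk per segment. The `scored` frame and its `churned` and `p_churn` columns are toy assumptions, not outputs of this notebook.

```python
import pandas as pd

# Toy scored customers (hypothetical labels and probabilities)
scored = pd.DataFrame({
    "Contract": ["Month-to-Month", "Month-to-Month", "One Year",
                 "Two Year", "Two Year", "Month-to-Month"],
    "churned": [1, 0, 0, 0, 0, 1],
    "p_churn": [0.80, 0.40, 0.20, 0.05, 0.10, 0.90],
})

# Same shape as the SQL result, plus the average model score per segment
segment_summary = (
    scored.groupby("Contract")
    .agg(
        customers=("churned", "size"),
        churn_rate=("churned", "mean"),
        mean_risk=("p_churn", "mean"),
    )
    .sort_values("churn_rate", ascending=False)
)
print(segment_summary)
```

A large gap between `churn_rate` and `mean_risk` within a segment would suggest the model is miscalibrated for that group.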


In [18]:
con.close() 

## Key Takeaways from Baseline Modeling

- The baseline **Logistic Regression model exhibits strong ranking performance** (test ROC-AUC 0.992, PR-AUC 0.982), indicating it can prioritize higher-risk customers even before threshold selection.

- **Predicted probabilities require operationalization**; effective churn management depends on translating risk scores into concrete intervention policies rather than relying on accuracy or default cutoffs.

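- **Precision–recall tradeoffs shift across thresholds**: moving the threshold from 0.05 to 0.50 raises precision from 0.72 to 0.95 while recall falls from 0.98 to 0.90, so the right operating point depends on business capacity and cost constraints.

- A **Top-N intervention strategy aligns naturally with real retention workflows**: on the test set, the top 250 customers are all churners (precision 1.00, recall 0.67), while the top 500 reach recall 0.98 at precision 0.73. This enables high-precision targeting under fixed outreach or budget limits without an arbitrary probability threshold.

- **SQL-based segment analysis contextualizes model outputs**, showing where churn is structurally concentrated (Month-to-Month contracts churn at 45.8%, versus 10.7% for One Year and 2.5% for Two Year) and supporting targeted, explainable intervention strategies.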

**Together, these results establish a defensible baseline and motivate comparison against more flexible models (e.g., tree-based approaches) and cost-aware intervention simulations.**