# 03 — Feature Engineering & Model Judgment

## Why this notebook exists
Notebook 2 established a strong baseline using a production-style preprocessing pipeline + logistic regression.

In this notebook we:
1. Create **interpretable engineered features** that reflect real customer behavior
2. Re-train the baseline model using these features
3. Compare against a simple tree-based model to understand the tradeoff between:
   - performance
   - interpretability
   - stability

**Goal:** improve ranking quality and decision usefulness — not chase marginal metric gains.

In [1]:
# Core
import numpy as np
import pandas as pd

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import roc_auc_score, average_precision_score

## Load Data

This notebook is intentionally self-contained.  
We reload the dataset used in Notebook 2 from `data/raw/churn.csv` so the workflow can be rerun from top to bottom without relying on previous notebooks.

In [6]:
from pathlib import Path
import pandas as pd

DATA_PATH = Path("..") / "data" / "raw" / "churn.csv"
df = pd.read_csv(DATA_PATH)

print(df.shape, df.columns[:10])

(7043, 50) Index(['CustomerID', 'Gender', 'Age', 'Under30', 'SeniorCitizen', 'Married',
       'Dependents', 'NumberofDependents', 'Country', 'State'],
      dtype='object')


## Define Target and Feature Matrix

We separate the churn outcome variable from the feature set so that
feature engineering, preprocessing, and modeling steps remain clean,
explicit, and reproducible.

In [10]:
TARGET_COL = "ChurnCategory"

df[TARGET_COL].value_counts(dropna=False).head(10)

ChurnCategory
NaN                5174
Competitor          841
Attitude            314
Dissatisfaction     303
Price               211
Other               200
Name: count, dtype: int64

## Define Binary Churn Target

`ChurnCategory` contains detailed churn reasons for customers who left,
and is missing (`NaN`) for customers who stayed.

For churn modeling, we define a binary target:
- `1` = customer churned (any churn reason)
- `0` = customer did not churn

This formulation aligns with the business objective of identifying
customers at risk of leaving.

In [11]:
TARGET_COL = "ChurnCategory"

y = df[TARGET_COL].notna().astype(int)
X = df.drop(columns=[TARGET_COL])

y.value_counts(), y.mean()

(ChurnCategory
 0    5174
 1    1869
 Name: count, dtype: int64,
 np.float64(0.2653698707936959))

## Feature Engineering Plan

The baseline model in Notebook 2 treated all features in their raw form.
In practice, customer behavior is often **non-linear** and better captured through
simple, interpretable transformations.

In this notebook, we engineer a small number of features guided by business intuition:

1. **Tenure Buckets**
   - Customers churn differently early vs late in their lifecycle
   - Bucketing tenure captures this non-linearity while remaining explainable

2. **Spending Intensity**
   - Total spend alone ignores how quickly revenue is generated
   - A spend-per-tenure ratio helps capture customer value dynamics

These features are designed to:
- Improve ranking quality
- Preserve interpretability
- Support downstream decision-making

In [12]:
# Make a copy so we don't mutate the original feature set
X_fe = X.copy()

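# Tenure Buckets (non-linear lifecycle effects)
# Intervals are right-closed: (0, 12], (12, 24], ... A tenure of exactly 0
# would fall outside every bucket and become NaN.
X_fe["tenure_bucket"] = pd.cut(
    X_fe["TenureinMonths"],
    bins=[0, 12, 24, 48, 72, np.inf],
    labels=["0-1yr", "1-2yr", "2-4yr", "4-6yr", "6yr+"]
)

# Spending Intensity (value over time)
# The +1 in the denominator avoids division by zero when tenure is 0.
X_fe["charges_per_tenure"] = X_fe["TotalCharges"] / (X_fe["TenureinMonths"] + 1)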
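
The bucket boundaries are easy to misread, so here is a minimal sketch of how `pd.cut` treats values that sit exactly on an edge. The tenure values are made up for illustration; only the bins and labels match the cell above. `include_lowest=True` is shown as one possible option, not something this notebook applies.

```python
import numpy as np
import pandas as pd

# Toy tenures chosen to sit on the bucket edges (hypothetical values).
tenure = pd.Series([0, 1, 12, 13, 72, 73], name="TenureinMonths")

bins = [0, 12, 24, 48, 72, np.inf]
labels = ["0-1yr", "1-2yr", "2-4yr", "4-6yr", "6yr+"]

# Default: intervals are right-closed and the first one is (0, 12],
# so a tenure of exactly 0 matches no bucket and becomes NaN.
default_buckets = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge (0).
inclusive_buckets = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "tenure": tenure,
    "default": default_buckets,
    "include_lowest": inclusive_buckets,
})
print(comparison)
```

If the real data contains customers with zero months of tenure, the default call leaves their bucket missing and the categorical imputer later fills it with the most frequent bucket.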


## Train / Test Split

We split the dataset into training and test sets using a stratified split
to ensure the churn rate remains consistent across both sets.

Because the test set is held out from fitting, metrics computed on it estimate how the model generalizes to unseen customers.

In [13]:
X_train_fe, X_test_fe, y_train, y_test = train_test_split(
    X_fe,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

y_train.mean(), y_test.mean()

(np.float64(0.2653532126375577), np.float64(0.2654364797728886))

## Preprocessing Pipeline (Engineered Features)

We use a single scikit-learn pipeline to ensure preprocessing is applied consistently.

- **Numeric features**: median imputation + standard scaling  
- **Categorical features**: most-frequent imputation + one-hot encoding  

This prevents train vs inference mismatch and keeps the workflow production-aligned.

In [14]:
cat_cols = X_train_fe.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_cols = [c for c in X_train_fe.columns if c not in cat_cols]

preprocess_fe = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ],
    remainder="drop"
)

len(num_cols), len(cat_cols), cat_cols[:10]

(20,
 31,
 ['CustomerID',
  'Gender',
  'Under30',
  'SeniorCitizen',
  'Married',
  'Dependents',
  'Country',
  'State',
  'City',
  'Quarter'])
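
The categorical list above includes `CustomerID`, which one-hot encoding would expand into one column per customer. A minimal sketch of a cardinality check that could flag such columns before fitting is shown below. The helper `cardinality_report` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature feature frame; column names mirror the notebook.
X_demo = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3", "D4", "E5", "F6"],
    "Contract": ["Monthly", "Monthly", "One Year", "Two Year", "Monthly", "One Year"],
    "Country": ["United States"] * 6,
})


def cardinality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count distinct values per column and express it as a share of rows."""
    n_unique = frame.nunique(dropna=True)
    report = pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(frame),
    })
    return report.sort_values("n_unique", ascending=False)


report = cardinality_report(X_demo)

# One distinct value per row suggests an identifier; a single value is constant.
id_like = report.index[report["unique_ratio"] == 1.0].tolist()
constant = report.index[report["n_unique"] <= 1].tolist()

print(report)
print("identifier-like:", id_like)
print("constant:", constant)
```

Identifier-like columns let a model memorize training rows, and constant columns carry no signal, so both are reasonable candidates to drop before encoding.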

## Feature-Engineered Logistic Regression

We retrain logistic regression using the engineered features and updated preprocessing
pipeline to evaluate whether these transformations improve ranking performance
relative to the baseline model in Notebook 2.

In [15]:
logreg_fe = Pipeline(steps=[
    ("preprocess", preprocess_fe),
    ("clf", LogisticRegression(
        max_iter=5000,
        solver="saga",
        n_jobs=-1
    ))
])

logreg_fe.fit(X_train_fe, y_train)

[Fitted pipeline diagram]
Pipeline: steps = preprocess, clf
  preprocess: ColumnTransformer(remainder='drop')
    num: SimpleImputer(strategy='median') -> StandardScaler()
    cat: SimpleImputer(strategy='most_frequent') -> OneHotEncoder(handle_unknown='ignore')
  clf: LogisticRegression(penalty='l2', C=1.0, solver='saga', max_iter=5000)


## Evaluation: Ranking Metrics

We evaluate the model using:
- **ROC-AUC**: overall ranking ability
- **PR-AUC**: ranking quality for the positive (churn) class

These metrics are threshold-independent and align with the downstream
decision-making objective of prioritizing high-risk customers.

In [16]:
proba_fe = logreg_fe.predict_proba(X_test_fe)[:, 1]

roc_fe = roc_auc_score(y_test, proba_fe)
pr_fe = average_precision_score(y_test, proba_fe)

roc_fe, pr_fe

(1.0, 1.0)
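
To put ranking metrics in context, it helps to know what an uninformative scorer would achieve. The sketch below uses synthetic labels and scores (not the churn data) to illustrate that random scores give a ROC-AUC near 0.5 and a PR-AUC near the positive-class rate. The 0.265 rate is borrowed from the target distribution above; everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly the notebook's churn rate.
n = 20_000
y_true = (rng.random(n) < 0.265).astype(int)

# Uninformative scores: pure noise, unrelated to the labels.
noise_scores = rng.random(n)

# Informative scores: noise plus a shift for the positive class.
signal_scores = rng.normal(size=n) + 1.5 * y_true

prevalence = y_true.mean()
print("prevalence:", round(float(prevalence), 3))
for name, scores in [("noise", noise_scores), ("signal", signal_scores)]:
    roc = roc_auc_score(y_true, scores)
    pr = average_precision_score(y_true, scores)
    print(f"{name}: ROC-AUC={roc:.3f}, PR-AUC={pr:.3f}")
```

Under these assumptions, a PR-AUC should be judged against the churn rate rather than against 0.5, and a score of exactly 1.0 on both metrics means the positives are perfectly separated from the negatives.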

## Leakage Audit: Why Perfect Metrics Are a Red Flag

The feature-engineered logistic regression achieved perfect ranking metrics
(ROC-AUC = 1.0, PR-AUC = 1.0). In real-world churn prediction, this level of
performance is highly unusual and often indicates **data leakage**.

Data leakage occurs when features contain information that would not be available
at the time a prediction is made (e.g., post-churn signals).

In this section, we audit the feature set to identify and remove any variables
that may leak post-outcome information, ensuring the model remains realistic
and deployable.

In [17]:
# Inspect column names for potential post-churn signals
X.columns.sort_values().tolist()

['Age',
 'AvgMonthlyGBDownload',
 'AvgMonthlyLongDistanceCharges',
 'CLTV',
 'ChurnLabel',
 'ChurnReason',
 'ChurnScore',
 'City',
 'Contract',
 'Country',
 'CustomerID',
 'CustomerStatus',
 'Dependents',
 'DeviceProtectionPlan',
 'Gender',
 'InternetService',
 'InternetType',
 'Latitude',
 'Longitude',
 'Married',
 'MonthlyCharge',
 'MultipleLines',
 'Number_of_Referrals',
 'NumberofDependents',
 'Offer',
 'OnlineBackup',
 'OnlineSecurity',
 'PaperlessBilling',
 'PaymentMethod',
 'PhoneService',
 'Population',
 'PremiumTechSupport',
 'Quarter',
 'ReferredaFriend',
 'SatisfactionScore',
 'SeniorCitizen',
 'State',
 'StreamingMovies',
 'StreamingMusic',
 'StreamingTV',
 'TenureinMonths',
 'TotalCharges',
 'TotalExtraDataCharges',
 'TotalLongDistanceCharges',
 'TotalRefunds',
 'TotalRevenue',
 'Under30',
 'UnlimitedData',
 'ZipCode']
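
Reading column names is one way to spot leakage. A complementary check, sketched below on synthetic data, scores each column on its own and flags any that separate the target almost perfectly. The helper `single_feature_auc`, the `max_categories` cutoff, and the toy values are assumptions for illustration and not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def single_feature_auc(frame: pd.DataFrame, target: pd.Series,
                       max_categories: int = 50) -> pd.Series:
    """ROC-AUC of each column used alone as a score, folded so 0.5 is the floor."""
    results = {}
    for col in frame.columns:
        values = frame[col]
        if pd.api.types.is_numeric_dtype(values):
            scores = values.astype(float)
            scores = scores.fillna(scores.median())
        else:
            keys = values.astype(object).fillna("missing")
            if keys.nunique() > max_categories:
                # Identifier-like columns would look perfect in-sample; skip them.
                continue
            # Score each category by its observed positive rate (screening only).
            scores = target.groupby(keys).transform("mean")
        auc = roc_auc_score(target, scores)
        results[col] = max(auc, 1 - auc)
    return pd.Series(results).sort_values(ascending=False)


# Synthetic example: two columns restate the outcome, one is unrelated.
rng = np.random.default_rng(0)
n = 1000
y_demo = pd.Series((rng.random(n) < 0.27).astype(int), name="churned")
X_demo = pd.DataFrame({
    "CustomerID": [f"C{i}" for i in range(n)],
    "Age": rng.integers(18, 80, size=n),
    "CustomerStatus": np.where(y_demo == 1, "Churned", "Stayed"),
    "ChurnScore": 50 + 40 * y_demo + rng.integers(0, 10, size=n),
})

aucs = single_feature_auc(X_demo, y_demo)
suspects = aucs[aucs > 0.99].index.tolist()
print(aucs)
print("suspect columns:", suspects)
```

A column that ranks the target almost perfectly by itself is not proof of leakage, but it is a strong prompt to check when and how that value is recorded.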

## Leakage Mitigation

Several features in the raw dataset encode post-churn information or outcomes
that would not be available at the time a retention decision is made.

Including these features results in unrealistically high performance and
invalidates the model for deployment.

We explicitly remove:
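- Direct churn indicators (`ChurnLabel`, `ChurnReason`, `ChurnScore`, `CustomerStatus`)
- Post-outcome financial variables (`TotalRefunds`, `TotalRevenue`, `CLTV`) and the outcome-dependent `SatisfactionScore`
- Identifier and proxy variables (`CustomerID`, `Quarter`, `Latitude`, `Longitude`, `ZipCode`)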

This ensures the model reflects a true *pre-churn prediction* scenario.

In [18]:
leakage_cols = [
    "ChurnLabel",
    "ChurnReason",
    "ChurnScore",
    "CustomerStatus",
    "TotalRefunds",
    "SatisfactionScore",
    "TotalRevenue",
    "CLTV",
    "Quarter",
    "CustomerID",
    "Latitude",
    "Longitude",
    "ZipCode"
]

X_clean = X_fe.drop(columns=[c for c in leakage_cols if c in X_fe.columns])

X_clean.shape

(7043, 38)

In [19]:
X_train_clean, X_test_clean, y_train, y_test = train_test_split(
    X_clean,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

y_train.mean(), y_test.mean()

(np.float64(0.2653532126375577), np.float64(0.2654364797728886))

## Model Performance After Leakage Removal

After removing post-outcome features, we retrain the model and re-evaluate
ranking performance.

A decrease in metrics is expected and desirable, as it reflects a realistic
prediction task.

In [20]:
# rebuild preprocessing columns
cat_cols = X_train_clean.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_cols = [c for c in X_train_clean.columns if c not in cat_cols]

preprocess_clean = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ]
)

logreg_clean = Pipeline(steps=[
    ("preprocess", preprocess_clean),
    ("clf", LogisticRegression(
        max_iter=5000,
        solver="saga",
        n_jobs=-1
    ))
])

logreg_clean.fit(X_train_clean, y_train)

proba_clean = logreg_clean.predict_proba(X_test_clean)[:, 1]

roc_clean = roc_auc_score(y_test, proba_clean)
pr_clean = average_precision_score(y_test, proba_clean)

roc_clean, pr_clean

(0.9047895838177168, 0.7791526288054998)

## Interpretation After Leakage Removal

After removing post-churn and outcome-dependent features, model performance
decreased from perfect scores (1.0, 1.0) to more realistic values:
ROC-AUC ≈ 0.905 and PR-AUC ≈ 0.779, against a test-set churn rate of about 0.265.

This is a desirable outcome:
- The model no longer relies on information unavailable at prediction time
- Ranking performance remains strong
- Results better reflect real-world deployment conditions

The feature-engineered logistic regression maintains high utility for
prioritizing customers at risk of churn.

## Notebook 3 Summary & Model Judgment

In this notebook, we extended the churn model with targeted feature engineering
while maintaining interpretability and deployment realism.

Key outcomes:
- Engineered lifecycle and value-based features to capture non-linear behavior
- Identified and removed post-churn data leakage
- Accepted a performance decrease in exchange for realistic, deployable metrics
- Achieved strong ranking performance (ROC-AUC ≈ 0.905, PR-AUC ≈ 0.779) using an interpretable logistic regression model

Final model choice reflects a deliberate tradeoff:
we prioritize transparency, stability, and decision usefulness over the marginal
performance gains a more complex model might offer.
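
The tradeoff against a tree-based model can be checked with a side-by-side evaluation on the same split. The sketch below is a minimal version of that comparison on synthetic data from `make_classification`. The hyperparameters are arbitrary assumptions, and the scores it prints do not describe the churn dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (roughly 27% positives).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=15, n_informative=6,
    n_clusters_per_class=1, weights=[0.73, 0.27], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(
        n_estimators=300, min_samples_leaf=5, random_state=42, n_jobs=-1
    ),
}

# Fit each candidate on the same training split and score the same test split.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = {
        "roc_auc": roc_auc_score(y_te, proba),
        "pr_auc": average_precision_score(y_te, proba),
    }

print(pd.DataFrame(scores).T)
```

On the real data, the same loop would take `X_train_clean` and `X_test_clean` with the preprocessing pipeline in front of each classifier, and repeating it across several seeds would give a sense of stability as well as performance.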

This model serves as a reliable input to downstream decision policy and cost
simulation, which is the focus of the next notebook.