# Telecom Customer Churn: Feature Engineering

This notebook performs feature engineering on the Telco Customer Churn dataset. It loads the processed data, verifies schema constraints, runs chi-squared tests to identify categorical features associated with churn, creates derived boolean and ordinal features (for example `contract_stability`, `fiber_no_support`, `manual_payment_early`, and `high_risk_new_monthly`), drops non-significant or redundant columns, and writes the finalized featured dataset to `data/03_featured/churn_featured.csv`.
The goal is to prepare a clean, informative feature set ready for modeling and evaluation.

### Imports

In [1]:
import pandas as pd
import os

from scipy.stats import chi2_contingency

# Local src
from src.df_overview import df_overview
from src.feature_engineering import feature_engineering
from src.schemas import POST_WRANGLING_SCHEMA

print('Done')

Done


In [2]:
data_path = os.path.join('..', 'data', '02_processed', 'customer_churn.csv')
df = pd.read_csv(data_path)

In [3]:
EXPECTED_COLUMNS = POST_WRANGLING_SCHEMA["expected_columns"]
MIN_ROWS = POST_WRANGLING_SCHEMA.get("min_rows", 0)

assert set(df.columns) == EXPECTED_COLUMNS, f"Column mismatch: expected {EXPECTED_COLUMNS}, got {set(df.columns)} — check data source."
assert len(df) >= MIN_ROWS, f"Row count {len(df)} is below minimum {MIN_ROWS} — check data source."
print(f"Schema OK: {len(df)} rows, {len(df.columns)} columns.")

Schema OK: 7032 rows, 20 columns.


### Chi2 Test

In [4]:
# Categorical predictors, excluding the target column
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
cat_cols.remove("churn")
results = []

for col in cat_cols:
    # Contingency table of feature level vs. churn outcome
    table = pd.crosstab(df[col], df["churn"])
    chi2, p, dof, expected = chi2_contingency(table)

    results.append({
        "Feature": col,
        "Chi2": chi2,
        "p_value": p
    })

results_df = pd.DataFrame(results).sort_values("p_value")
results_df

| | Feature | Chi2 | p_value |
|---|---|---|---|
| 13 | contract | 1179.545829 | 7.326182e-257 |
| 7 | online_security | 846.677389 | 1.400687e-184 |
| 10 | tech_support | 824.925564 | 7.407808e-180 |
| 6 | internet_service | 728.695614 | 5.831199e-159 |
| 15 | payment_method | 645.4299 | 1.42631e-139 |
| 8 | online_backup | 599.175185 | 7.776099e-131 |
| 9 | device_protection | 555.880327 | 1.959389e-121 |
| 12 | streaming_movies | 374.268432 | 5.35356e-82 |
| 11 | streaming_tv | 372.456502 | 1.324641e-81 |
| 14 | paperless_billing | 256.874908 | 8.236203e-58 |


- `gender` and `phone_service` are not statistically significant (p-values of 0.49 and 0.35), so they are candidates for dropping to reduce noise during predictive modeling.
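With 7032 rows, even modest associations yield vanishingly small p-values, so an effect size can help rank the significant features. The following is a minimal sketch, not part of the original notebook, that converts two of the reported chi-squared statistics into Cramér's V. The helper name `cramers_v` is hypothetical, and the number of levels for `contract` (3) is an assumption, since its values are not printed in this notebook.

```python
import math

def cramers_v(chi2, n_obs, n_rows, n_cols):
    """Cramér's V effect size for an n_rows x n_cols contingency table."""
    min_dim = min(n_rows, n_cols) - 1
    return math.sqrt(chi2 / (n_obs * min_dim))

N_OBS = 7032  # row count reported by the schema check

# feature -> (chi2 statistic copied from the results table, number of levels)
# The level count for `contract` is an assumption.
reported = {
    "contract": (1179.545829, 3),
    "paperless_billing": (256.874908, 2),
}

effect_sizes = {
    name: cramers_v(chi2, N_OBS, n_levels, 2)  # churn has 2 levels
    for name, (chi2, n_levels) in reported.items()
}

for name, v in effect_sizes.items():
    print(f"{name}: V = {v:.2f}")
```

Under these assumptions `contract` comes out at roughly 0.41 and `paperless_billing` at roughly 0.19, which suggests a much stronger association for contract type even though both p-values are effectively zero. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the value for a binary feature is slightly conservative.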

### Feature Engineering

In [5]:
df = feature_engineering(df, drop_replaced=True)
for i in df.columns:
    print(f"{i}: {df[i].unique()}")

senior_citizen: ['No' 'Yes']
partner: ['Yes' 'No']
dependents: ['No' 'Yes']
tenure: [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 39]
multiple_lines: ['No phone service' 'No' 'Yes']
internet_service: ['DSL' 'Fiber optic' 'No']
online_security: ['No' 'Yes' 'No internet service']
online_backup: ['Yes' 'No' 'No internet service']
device_protection: ['No' 'Yes' 'No internet service']
tech_support: ['No' 'Yes' 'No internet service']
streaming_tv: ['No' 'Yes' 'No internet service']
streaming_movies: ['No' 'Yes' 'No internet service']
paperless_billing: ['Yes' 'No']
payment_method: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
monthly_charges: [29.85 56.95 53.85 ... 63.1  44.2  78.7 ]
churn: ['No' 'Yes']
high_risk_tenure: ['high_risk_category', 'low_risk_category', 'medium_risk_cat

- created `contract_stability`, `fiber_no_support`, `manual_payment_early`, and `high_risk_new_monthly` (the output above also shows a new `high_risk_tenure` column);
- dropped the columns the chi-squared test flagged as not significant, along with original columns made redundant by the engineered features, to avoid redundancy and save computational resources during training;
- also dropped `total_charges` because of the 'paradox': it is collinear with `tenure`, so I had to choose between `monthly_charges` and `total_charges`, and `monthly_charges` stores the more important information.
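The collinearity behind the `total_charges` decision can be checked with a quick correlation. The sketch below uses a small toy frame rather than the real data, and it assumes that lifetime billing is approximately tenure multiplied by the monthly rate; the variable names are illustrative only.

```python
import pandas as pd

# Toy data, not taken from the Telco dataset.
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 72],
    "monthly_charges": [30.0, 80.0, 50.0, 90.0, 60.0],
})

# Assumption: total billing is roughly tenure x monthly rate.
toy["total_charges"] = toy["tenure"] * toy["monthly_charges"]

# Pearson correlation between the derived total and tenure.
corr_with_tenure = toy["total_charges"].corr(toy["tenure"])
print(f"corr(total_charges, tenure) = {corr_with_tenure:.2f}")
```

On the real data the same `Series.corr` call against `tenure` would show how much independent information `total_charges` adds once `tenure` and `monthly_charges` are already in the feature set.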

In [6]:
# Columns already dropped by feature_engineering(drop_replaced=True) above.
# Verify the expected schema:
df_overview(df)

(7032, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   senior_citizen         7032 non-null   object  
 1   partner                7032 non-null   object  
 2   dependents             7032 non-null   object  
 3   tenure                 7032 non-null   int64   
 4   multiple_lines         7032 non-null   object  
 5   internet_service       7032 non-null   object  
 6   online_security        7032 non-null   object  
 7   online_backup          7032 non-null   object  
 8   device_protection      7032 non-null   object  
 9   tech_support           7032 non-null   object  
 10  streaming_tv           7032 non-null   object  
 11  streaming_movies       7032 non-null   object  
 12  paperless_billing      7032 non-null   object  
 13  payment_method         7032 non-null   object  
 14  monthly_charges        7032 n

In [7]:
pd_to_csv_path = os.path.join('..', 'data', '03_featured', 'churn_featured.csv')

os.makedirs(os.path.dirname(pd_to_csv_path), exist_ok=True)

df.to_csv(pd_to_csv_path, index=False)

print(f"CSV saved to {pd_to_csv_path}")

CSV saved to ..\data\03_featured\churn_featured.csv


## Summary of Feature Engineering

### Features Created:

| Feature | Description |
|---|---|
| `contract_stability` | Ordinal score (0–2) encoding contract type risk; month-to-month = 0, two year = 2 |
| `fiber_no_support` | Boolean flag for fiber optic users without tech support — highest early-churn cohort |
| `manual_payment_early` | Boolean flag for electronic/mailed check payers in early tenure (≤12 months) |
| `high_risk_new_monthly` | Boolean flag combining month-to-month contract, early tenure, and high monthly charges |
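
The actual logic lives in `src.feature_engineering` and is not shown in this notebook. As a rough sketch of how the four flags described above could be derived, the function below is a hypothetical reconstruction: the name `add_risk_features`, the contract labels, and the `HIGH_MONTHLY_CHARGE` cut-off are all assumptions, and only the 12-month tenure threshold comes from the table.

```python
import pandas as pd

EARLY_TENURE_MONTHS = 12    # from the table above
HIGH_MONTHLY_CHARGE = 70.0  # assumed cut-off, not from the notebook

# Assumed contract labels; the notebook does not print them.
CONTRACT_ORDER = {"Month-to-month": 0, "One year": 1, "Two year": 2}
MANUAL_METHODS = ["Electronic check", "Mailed check"]

def add_risk_features(df):
    """Return a copy of df with four illustrative engineered columns."""
    out = df.copy()
    early = out["tenure"] <= EARLY_TENURE_MONTHS
    monthly_contract = out["contract"] == "Month-to-month"

    out["contract_stability"] = out["contract"].map(CONTRACT_ORDER)
    out["fiber_no_support"] = (
        (out["internet_service"] == "Fiber optic") & (out["tech_support"] == "No")
    )
    out["manual_payment_early"] = out["payment_method"].isin(MANUAL_METHODS) & early
    out["high_risk_new_monthly"] = (
        monthly_contract & early & (out["monthly_charges"] >= HIGH_MONTHLY_CHARGE)
    )
    return out

# Two toy customers: one high-risk profile, one stable profile.
toy = pd.DataFrame({
    "contract": ["Month-to-month", "Two year"],
    "tenure": [3, 40],
    "internet_service": ["Fiber optic", "DSL"],
    "tech_support": ["No", "Yes"],
    "payment_method": ["Electronic check", "Credit card (automatic)"],
    "monthly_charges": [85.0, 55.0],
})
features = add_risk_features(toy)
print(features[["contract_stability", "fiber_no_support",
                "manual_payment_early", "high_risk_new_monthly"]])
```

Returning a copy keeps the input frame unchanged, which makes the step safe to re-run in a notebook cell.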

### Columns Dropped:

| Column | Reason |
|---|---|
| `gender`, `phone_service` | Not statistically significant (chi² p-values: 0.49 & 0.35) |
| `contract` | Replaced by engineered features; retaining the original would be redundant (`internet_service` and `payment_method` still appear in the output above) |
| `total_charges` | Collinear with `tenure`; `monthly_charges` carries more actionable signal |

### Next Steps:
The featured dataset has been saved to `../data/03_featured/churn_featured.csv` and is ready for:
1. Classifier training and evaluation
2. Feature importance analysis
3. Class imbalance handling (e.g., SMOTE or class weights)
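
For the class-weight option in step 3, a common heuristic weights each class inversely to its frequency: `n_samples / (n_classes * class_count)`. The sketch below illustrates that formula on toy labels; the helper name `balanced_class_weights` is hypothetical and the label counts are made up, since the notebook does not report the churn distribution.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }

# Toy labels with a 3:1 imbalance, not the real churn distribution.
toy_labels = ["No", "No", "No", "Yes"]
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary like this could then be passed to a classifier that accepts per-class weights, so that errors on the minority class cost more during training.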
