## EDA 01: Macro-Economic Sensitivity (The "Golden Cross" Signal)
### 1. Overview
This analysis explores the impact of ***Consumer Price Index (CPI)*** growth on customer churn. 
To overcome the limitations of a static dataset lacking timestamps, we utilize ***tenure*** as a proxy for time. 
This allows us to define specific segments exposed to economic shocks, particularly the high-inflation period and the policy shift announcement in September 2025.

### 2. Research Hypothesis
- **Research**: Financial pressure from rising CPI will exert stronger churn pressure on customers with higher **`MonthlyCharges`**.
- **Analysis Baseline**:
  - Data Extraction Date: January 2026.
  - **'Survivor' Segment**: Customers with `tenure > 4`, who joined before the September 2025 policy announcement and remained active through the subsequent market volatility.
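
The tenure-as-time proxy can be made concrete with a small sketch. It assumes that `tenure` counts whole months back from the extraction month (January 2026); the helper name `approx_join_month` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Assumption: tenure is measured in whole months up to the extraction month.
EXTRACTION_MONTH = pd.Period("2026-01", freq="M")

def approx_join_month(tenure_months: float) -> pd.Period:
    """Approximate the calendar month in which a customer joined (hypothetical helper)."""
    return EXTRACTION_MONTH - int(tenure_months)

# Print the approximate join month for a few tenure values
for t in (2, 4, 5):
    print(f"tenure={t} -> joined around {approx_join_month(t)}")
```

Under this assumption, a tenure of 4 maps to September 2025 and a tenure of 5 maps to August 2025, which is why `tenure > 4` is read as "joined before the September 2025 announcement".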

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load data
df = pd.read_csv('../data/processed/cleaned_data.csv')

In [None]:
# Checking for missing values in key columns - confirmation of the cleaned data
df.info()
df.isnull().sum().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5534 entries, 0 to 5533
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        5534 non-null   object 
 1   gender            5534 non-null   object 
 2   SeniorCitizen     5534 non-null   int64  
 3   Partner           5534 non-null   object 
 4   Dependents        5534 non-null   object 
 5   tenure            5534 non-null   float64
 6   PhoneService      5534 non-null   object 
 7   MultipleLines     5534 non-null   object 
 8   InternetService   5534 non-null   object 
 9   OnlineSecurity    5534 non-null   object 
 10  OnlineBackup      5534 non-null   object 
 11  DeviceProtection  5534 non-null   object 
 12  TechSupport       5534 non-null   object 
 13  StreamingTV       5534 non-null   object 
 14  StreamingMovies   5534 non-null   object 
 15  Contract          5534 non-null   object 
 16  PaperlessBilling  5534 non-null   object 


np.int64(0)

In [None]:
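# Define groups by approximate join date (tenure in months, counted back from January 2026):
# before Sep 2025 / Sep-Oct 2025 / Nov 2025 or later
def categorize_tenure(t):
    if t >= 5: return 'Pre-Inflection (5+mo)'   # Joined before Sep 2025
    elif t >= 3: return 'Inflection (3-5mo)'    # Joined Sep-Oct 2025 (3 <= tenure < 5)
    else: return 'Fresh (<3mo)'                 # Joined Nov 2025 or later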

df['Segment'] = df['tenure'].apply(categorize_tenure)

print(df[['tenure', 'Segment']].head())

# Check the count of each group
print("\nValue counts for each segment")
print(df['Segment'].value_counts())


   tenure                Segment
0     1.0           Fresh (<3mo)
1    34.0  Pre-Inflection (5+mo)
2     2.0           Fresh (<3mo)
3    45.0  Pre-Inflection (5+mo)
4     2.0           Fresh (<3mo)

Value counts for each segment
Segment
Pre-Inflection (5+mo)    4595
Fresh (<3mo)              638
Inflection (3-5mo)        301
Name: count, dtype: int64
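
As a possible alternative to the row-wise `apply`, the same boundaries can be expressed with `pd.cut`. This is a minimal sketch on toy tenure values rather than the notebook's data; `right=False` makes each bin closed on the left, so 3 and 5 fall into the higher segment exactly as in `categorize_tenure`.

```python
import numpy as np
import pandas as pd

# Toy tenure values (in months), chosen to include the boundary cases 3 and 5
tenure = pd.Series([1.0, 34.0, 2.0, 45.0, 3.0, 4.0, 5.0])

# Bins are [-inf, 3), [3, 5), [5, inf) because right=False
segment = pd.cut(
    tenure,
    bins=[-np.inf, 3, 5, np.inf],
    right=False,
    labels=['Fresh (<3mo)', 'Inflection (3-5mo)', 'Pre-Inflection (5+mo)'],
)

print(pd.DataFrame({'tenure': tenure, 'Segment': segment}))
```

The result is an ordered categorical, so group-by output and plots follow the Fresh, Inflection, Pre-Inflection order instead of alphabetical order.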


**Next check:** Does the *Survivor* group actually have a higher churn rate when their monthly bills are high?
1. Calculate the average churn rate
2. Visualize the comparison

In [26]:
# Inspect the raw Churn labels before converting them to numeric
df['Churn'].unique()

# Map 'Yes' -> 1 and 'No' -> 0 in a new column, Churn_num
# Defensive code: strip whitespace and normalize case so variants such as ' yes' still match
df['Churn_num'] = (df['Churn'].str.strip().str.title()=='Yes').astype(int)

# Verify if the conversion captured all cases correctly
print("Conversion check")
print(df['Churn_num'].value_counts())

Conversion check
Churn_num
0    4063
1    1471
Name: count, dtype: int64


In [27]:
# Group by segment and calculate the mean of churn
# This represents the churn probability for each group

segment_churn = df.groupby('Segment')['Churn_num'].mean()

print("Average Churn Rate by Segment")
print(segment_churn)


Average Churn Rate by Segment
Segment
Fresh (<3mo)             0.590909
Inflection (3-5mo)       0.485050
Pre-Inflection (5+mo)    0.206311
Name: Churn_num, dtype: float64
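
A quick consistency check on the numbers above: multiplying each segment's churn rate by its size should recover the churner counts, and their sum should match the 1,471 churned customers from the conversion check. The sketch below re-enters the printed figures by hand rather than reading them from `df`.

```python
import pandas as pd

# Segment sizes and churn rates copied from the printed outputs above
counts = pd.Series({
    'Fresh (<3mo)': 638,
    'Inflection (3-5mo)': 301,
    'Pre-Inflection (5+mo)': 4595,
})
rates = pd.Series({
    'Fresh (<3mo)': 0.590909,
    'Inflection (3-5mo)': 0.485050,
    'Pre-Inflection (5+mo)': 0.206311,
})

# Implied number of churners per segment (rounded to whole customers)
churners = (counts * rates).round().astype(int)

# Size-weighted overall churn rate
overall_rate = churners.sum() / counts.sum()

print(churners)
print(f"Total churners: {churners.sum()} of {counts.sum()} ({overall_rate:.2%})")
```

Because the Pre-Inflection segment holds most customers, the overall rate sits much closer to its rate than to those of the two smaller segments.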


### Quantifying the 'Breaking Point' via Cross-Analysis
- To test the hypothesis that CPI-driven financial pressure exerts stronger churn pressure on high-value users, we cross-analyze **segments** against **price tiers**:
 1. ***Segmenting by MonthlyCharges***: Customers are split into **Low**, **Medium**, and **High** price tiers to measure ***Macro-Economic Sensitivity***.
 2. ***Testing Resilience***: Comparing the churn probability of the **Survivor** group (`Pre-Inflection`) against **New Adopters** (`Fresh`) in the *High* price tier indicates whether long-term exposure to inflation has reached a critical breaking point.
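
To illustrate how quantile-based tiering behaves, here is a minimal sketch on made-up `MonthlyCharges` values (not the notebook's data). `pd.qcut` with `q=3` places the cut points at the 1/3 and 2/3 quantiles, so each tier receives roughly the same number of customers regardless of how prices are distributed.

```python
import pandas as pd

# Toy monthly charges; the values are illustrative only
charges = pd.Series([20, 25, 30, 50, 60, 70, 90, 100, 110], name='MonthlyCharges')

# retbins=True also returns the bin edges that qcut derived from the data
tier, edges = pd.qcut(charges, q=3, labels=['Low', 'Medium', 'High'], retbins=True)

print(tier.value_counts().sort_index())
print("Bin edges:", edges)
```

One consequence of this design is that the tier boundaries are relative to the sample: "High" means the top third of this dataset's charges, not a fixed price threshold.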

In [28]:
# Create price tiers based on MonthlyCharges 
df['Price_tier'] = pd.qcut(df['MonthlyCharges'], q=3, labels=['Low', 'Medium', 'High'])

# Compare Churn rate by Segment and Price tier (Survivor/New with Price tier + Yes/No)
pivot_result = pd.crosstab(
    [df['Segment'], df['Price_tier']],
    df['Churn'],
    normalize='index'
).mul(100).round(2)


# Display 'Yes' column to focus on Churn Probability
print("Churn Probability (%) per Group:")
print(pivot_result[['Yes']])

Churn Probability (%) per Group:
Churn                               Yes
Segment               Price_tier       
Fresh (<3mo)          Low         43.30
                      Medium      72.37
                      High        86.67
Inflection (3-5mo)    Low         32.54
                      Medium      55.74
                      High        69.81
Pre-Inflection (5+mo) Low          8.86
                      Medium      20.11
                      High        30.60
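
One way to read this table against the hypothesis is to compare the High and Low tiers within each segment, both as an absolute gap (percentage points) and as a ratio. The sketch below re-enters the printed percentages by hand; the column names `gap_pp` and `ratio` are illustrative.

```python
import pandas as pd

# Churn probabilities (%) copied from the table above
churn_pct = pd.DataFrame(
    {
        'Low': [43.30, 32.54, 8.86],
        'Medium': [72.37, 55.74, 20.11],
        'High': [86.67, 69.81, 30.60],
    },
    index=['Fresh (<3mo)', 'Inflection (3-5mo)', 'Pre-Inflection (5+mo)'],
)

# High-vs-Low comparison within each segment
summary = pd.DataFrame({
    'gap_pp': (churn_pct['High'] - churn_pct['Low']).round(2),  # absolute difference
    'ratio': (churn_pct['High'] / churn_pct['Low']).round(2),   # relative difference
})

print(summary)
```

The absolute gap is largest in the Fresh segment, while the High-to-Low ratio is largest in the Pre-Inflection segment, so which group looks most price-sensitive depends on the measure chosen. Neither measure on its own attributes the difference to CPI.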


In [31]:
# Load reference files
cpi = pd.read_csv('../data/references/CPIRECSL_consumerPriceIndex_2023_2026.csv')
complaints = pd.read_csv('../data/references/youtube_premium_complain.csv')
churn_ts = pd.read_csv('../data/references/youtube_churn.csv')

# Convert dates
cpi['observation_date'] = pd.to_datetime(cpi['observation_date'])
complaints['Time'] = pd.to_datetime(complaints['Time'])
churn_ts['Time'] = pd.to_datetime(churn_ts['Time'])

print(f"* Reference data loaded: {len(cpi)} CPI records, {len(complaints)} complaint records, {len(churn_ts)} churn records")


* Reference data loaded: 37 CPI records, 26 complaint records, 26 churn records
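
With the date columns parsed, the reference series can be aligned on calendar month. The following is a hedged sketch on toy frames: only `observation_date` and `Time` come from the notebook, while the value columns `cpi_value` and `churn_index` are hypothetical placeholders for whatever the reference files actually contain.

```python
import pandas as pd

# Toy stand-ins for the reference files; value column names are assumptions
cpi_toy = pd.DataFrame({
    'observation_date': pd.to_datetime(['2025-07-01', '2025-08-01', '2025-09-01', '2025-10-01']),
    'cpi_value': [100.0, 101.0, 103.02, 104.0502],
})
churn_toy = pd.DataFrame({
    'Time': pd.to_datetime(['2025-08-15', '2025-09-15', '2025-10-15']),
    'churn_index': [40, 55, 60],
})

# Month-over-month CPI growth in percent
cpi_toy['cpi_growth_pct'] = cpi_toy['cpi_value'].pct_change().mul(100)

# Reduce both date columns to a monthly period so they share a join key
cpi_toy['month'] = cpi_toy['observation_date'].dt.to_period('M')
churn_toy['month'] = churn_toy['Time'].dt.to_period('M')

# Left join keeps every churn observation, even if a CPI month is missing
merged = churn_toy.merge(cpi_toy[['month', 'cpi_growth_pct']], on='month', how='left')
print(merged[['month', 'churn_index', 'cpi_growth_pct']])
```

Joining on a monthly period avoids mismatches when one series is stamped on the first of the month and another mid-month.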
