# Customer Churn EDA

This notebook performs exploratory data analysis (EDA) on the same synthetic
telecom churn dataset used by the training pipeline in this repository.

- Data generation code lives in `training/preprocess.py`.
- The model training script lives in `training/train.py`.

The goal is to show:

- Overall dataset structure
- Churn rate distribution
- Feature distributions and relationships with churn
- Simple correlations and baseline model behavior


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from training.preprocess import (
    FEATURE_COLUMNS,
    TARGET_COLUMN,
    generate_synthetic_churn_data,
)

%matplotlib inline
sns.set_theme(style="whitegrid", context="notebook")

In [None]:
# Generate the same synthetic dataset used by training
df = generate_synthetic_churn_data(n_samples=2000, random_state=42)
df.head()

In [None]:
df.shape, df.dtypes

In [None]:
df.isna().sum()

## Churn rate distribution

The target column is `churn` (1 = churned, 0 = retained).

In [None]:
churn_rate = df["churn"].mean()
print(f"Overall churn rate: {churn_rate:.3f}")

ax = sns.countplot(x="churn", data=df)
ax.set_xticks([0, 1])  # pin tick positions before relabeling them
ax.set_xticklabels(["Retained (0)", "Churned (1)"])
ax.set_title("Churn vs. Retained counts")
plt.show()

## Feature distributions

We inspect the distributions of key numeric features.

In [None]:
numeric_cols = ["tenure", "monthly_charges", "support_calls"]

fig, axes = plt.subplots(1, len(numeric_cols), figsize=(15, 4))
for ax, col in zip(axes, numeric_cols):
    sns.histplot(df[col], bins=30, kde=True, ax=ax)
    ax.set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()

## Churn rate by key segments

The synthetic generation code builds in several drivers of higher churn:

- Higher `monthly_charges`
- Shorter `tenure`
- `contract_type` = 0 (month-to-month)
- Higher `support_calls`
- `is_senior` = 1

We can verify these patterns by looking at segment-level churn rates.

In [None]:
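def plot_segment_churn_rate(df, by, title):
    """Bar plot of the mean churn rate for each value of `by`."""
    # observed=False keeps empty categorical bins (e.g. from pd.cut) and makes
    # the choice explicit, since the pandas default for it is changing.
    rates = df.groupby(by, observed=False)["churn"].mean().reset_index()
    ax = sns.barplot(x=by, y="churn", data=rates)
    ax.set_ylabel("Churn rate")
    ax.set_title(title)
    plt.show()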

plot_segment_churn_rate(df, "contract_type", "Churn rate by contract type")
plot_segment_churn_rate(df, "has_internet", "Churn rate by internet service")
plot_segment_churn_rate(df, "is_senior", "Churn rate by senior citizen flag")
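
Segment-level rates are easier to read next to the number of customers behind each bar, since a small segment can show an extreme rate by chance. A minimal sketch on a made-up toy frame follows; the rows and the contract codes 1 and 2 are illustrative assumptions, not output of `generate_synthetic_churn_data`.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "contract_type": [0, 0, 0, 0, 1, 1, 1, 2],
        "churn": [1, 1, 0, 1, 0, 1, 0, 0],
    }
)

# Churn rate and segment size side by side, one row per contract type.
summary = toy.groupby("contract_type")["churn"].agg(
    churn_rate="mean", n_customers="size"
)
print(summary)
```

The same aggregation can be applied to `df` with any of the segment columns used above.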

In [None]:
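# Churn rate by binned tenure.
# include_lowest=True keeps values equal to the lowest edge (e.g. tenure == 0),
# which pd.cut would otherwise map to NaN and silently drop from the plot.
df["tenure_bin"] = pd.cut(
    df["tenure"], bins=[0, 6, 12, 24, 48, 72], include_lowest=True
)
plot_segment_churn_rate(df, "tenure_bin", "Churn rate by tenure bin")

# Churn rate by binned monthly charges
df["monthly_bin"] = pd.cut(
    df["monthly_charges"], bins=[20, 40, 60, 80, 100, 120], include_lowest=True
)
plot_segment_churn_rate(df, "monthly_bin", "Churn rate by monthly charges bin")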
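
`pd.cut` builds right-closed intervals by default, so a value exactly equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. The sketch below shows the difference on a few hypothetical tenure values (chosen for illustration, not drawn from the dataset):

```python
import pandas as pd

values = pd.Series([0, 3, 6, 72])  # hypothetical tenure values
edges = [0, 6, 12, 24, 48, 72]

# Default: intervals are (0, 6], (6, 12], ... so 0 belongs to no bin.
default_bins = pd.cut(values, bins=edges)

# include_lowest=True widens the first interval to contain the lowest edge.
inclusive_bins = pd.cut(values, bins=edges, include_lowest=True)

# Number of values that landed in no bin under each setting.
n_unbinned_default = int(default_bins.isna().sum())
n_unbinned_inclusive = int(inclusive_bins.isna().sum())
print(n_unbinned_default, n_unbinned_inclusive)
```

Values outside the outermost edges become `NaN` in both cases, so it is worth checking `df["tenure_bin"].isna().sum()` after binning.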

## Correlations

We compute a simple correlation matrix for the features and the churn label.

In [None]:
corr = df[FEATURE_COLUMNS + [TARGET_COLUMN]].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.title("Correlation heatmap")
plt.show()
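
A heatmap shows every pairwise correlation, but for churn analysis the column of correlations with the target is usually the most relevant. With a 0/1 target, the Pearson coefficient is the point-biserial correlation, and it only captures linear association. A minimal sketch that ranks features by absolute correlation with the target, using a hand-made toy frame (the values are illustrative assumptions, not the notebook's data):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "tenure": [1, 3, 5, 12, 24, 36, 48, 60],
        "support_calls": [5, 4, 4, 2, 1, 1, 0, 0],
        "has_internet": [1, 0, 1, 1, 0, 1, 0, 1],
        "churn": [1, 1, 1, 0, 1, 0, 0, 0],
    }
)
target = "churn"

# Correlation of each feature with the target, strongest (by magnitude) first.
corr_with_target = (
    toy.corr()[target]
    .drop(target)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

On the notebook's data the equivalent would be `df[FEATURE_COLUMNS + [TARGET_COLUMN]].corr()[TARGET_COLUMN]`.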

## Baseline model check (optional)

We train a simple logistic regression model on the synthetic data to get a
quick sense of how separable churned and retained customers are. This is
separate from the main XGBoost model used by the training script, but it
serves as a sanity check on the signal in the features.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, roc_auc_score
from sklearn.model_selection import train_test_split

X = df[FEATURE_COLUMNS]
y = df[TARGET_COLUMN]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"Logistic regression ROC AUC: {auc:.3f}")

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=ax[0])
RocCurveDisplay.from_predictions(y_test, y_proba, ax=ax[1])
plt.tight_layout()
plt.show()
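
scikit-learn's `LogisticRegression` applies L2 regularization by default, so features on very different scales (for example `tenure` versus `monthly_charges`) are penalized unevenly. One variant worth comparing against the baseline is a pipeline that standardizes the features first. The sketch below runs on made-up data: the generating coefficients, sample size, and feature ranges are hypothetical and do not reproduce `generate_synthetic_churn_data`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: churn is more likely with short tenure and high charges.
rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 73, size=n)
monthly_charges = rng.uniform(20, 120, size=n)
logit = -0.08 * (tenure - 36) + 0.05 * (monthly_charges - 70)
churn = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_toy = np.column_stack([tenure, monthly_charges])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, churn, test_size=0.2, random_state=42, stratify=churn
)

# Standardize features, then fit the same logistic regression baseline.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_tr, y_tr)

toy_auc = roc_auc_score(y_te, scaled_clf.predict_proba(X_te)[:, 1])
print(f"Scaled logistic regression ROC AUC (toy data): {toy_auc:.3f}")
```

In the notebook itself, the same pipeline can be fitted on `X_train` and `y_train` and its ROC AUC compared with the unscaled baseline.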