# 📊 Customer Churn - Exploratory Data Analysis

This notebook explores the customer churn dataset to understand patterns and prepare for model building.

## Contents
1. Data Loading & Overview
2. Univariate Analysis
3. Churn Distribution & Analysis
4. Correlation Analysis
5. Key Insights

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualizations
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
sns.set_palette('husl')

print("Libraries loaded! ✅")

## 1. Data Loading & Overview

In [None]:
# Load data
df = pd.read_csv('../data/raw/customers.csv')

print(f"Dataset Shape: {df.shape}")
print(f"\nColumns ({len(df.columns)}):")
print(df.columns.tolist())

df.head(10)

In [None]:
# Data types and info
print("Data Types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nMissing Values:")
print(df.isnull().sum())
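
Raw null counts are easier to act on when shown next to percentages. The following is a minimal sketch on a toy frame, not the real `customers.csv`; `missing_summary` is a hypothetical helper that is not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data for illustration only (assumed values, not taken from the dataset)
toy = pd.DataFrame({
    "tenure": [1, 12, 30, 45],
    "total_charges": [29.9, np.nan, 1500.0, np.nan],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`.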

In [None]:
# Statistical summary
df.describe()

## 2. Univariate Analysis

In [None]:
# Numerical features distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Tenure
axes[0].hist(df['tenure'], bins=20, color='steelblue', edgecolor='white')
axes[0].set_title('Tenure Distribution', fontweight='bold')
axes[0].set_xlabel('Months')
axes[0].axvline(df['tenure'].mean(), color='red', linestyle='--', label=f'Mean: {df["tenure"].mean():.1f}')
axes[0].legend()

# Monthly Charges
axes[1].hist(df['monthly_charges'], bins=20, color='coral', edgecolor='white')
axes[1].set_title('Monthly Charges Distribution', fontweight='bold')
axes[1].set_xlabel('$ Amount')
axes[1].axvline(df['monthly_charges'].mean(), color='red', linestyle='--', label=f'Mean: ${df["monthly_charges"].mean():.0f}')
axes[1].legend()

# Total Charges
axes[2].hist(df['total_charges'], bins=20, color='seagreen', edgecolor='white')
axes[2].set_title('Total Charges Distribution', fontweight='bold')
axes[2].set_xlabel('$ Amount')
axes[2].axvline(df['total_charges'].mean(), color='red', linestyle='--', label=f'Mean: ${df["total_charges"].mean():.0f}')
axes[2].legend()

plt.tight_layout()
plt.show()

In [None]:
# Categorical features distribution
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

cat_cols = ['contract_type', 'payment_method', 'internet_service', 
            'tech_support', 'gender', 'senior_citizen']

for idx, col in enumerate(cat_cols):
    ax = axes[idx // 3, idx % 3]
    df[col].value_counts().plot(kind='bar', ax=ax, color=plt.cm.Pastel1.colors)
    ax.set_title(col.replace('_', ' ').title(), fontweight='bold')
    ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 3. Churn Distribution & Analysis

In [None]:
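# Overall churn rate (assumes churn is coded 0 = retained, 1 = churned)
# Sort by class value so the counts line up with the ['Retained', 'Churned'] labels
churn_counts = df['churn'].value_counts().sort_index()
churn_rate = df['churn'].mean()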

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Pie chart
colors = ['#2ecc71', '#e74c3c']
axes[0].pie(churn_counts, labels=['Retained', 'Churned'], autopct='%1.1f%%',
            colors=colors, explode=(0, 0.05), shadow=True, startangle=90)
axes[0].set_title(f'Overall Churn Rate: {churn_rate:.1%}', fontsize=14, fontweight='bold')

# Bar chart
churn_counts.plot(kind='bar', ax=axes[1], color=colors)
axes[1].set_title('Customer Count by Churn Status', fontweight='bold')
axes[1].set_xticklabels(['Retained', 'Churned'], rotation=0)
axes[1].set_ylabel('Count')

for i, v in enumerate(churn_counts):
    axes[1].text(i, v + 2, str(v), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
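
`value_counts()` orders its result by frequency, not by class value, so a fixed label list such as `['Retained', 'Churned']` only lines up when class 0 happens to be the majority. The sketch below shows the difference on toy data, assuming the 0/1 churn coding used throughout this notebook.

```python
import pandas as pd

# Toy data: churned customers (1) are the majority here
churn = pd.Series([1, 1, 1, 0])

by_frequency = churn.value_counts()           # most frequent class first
by_value = churn.value_counts().sort_index()  # ascending class value: 0, then 1

print(by_frequency.index.tolist())
print(by_value.index.tolist())
```

Sorting by index makes the label order independent of the class balance.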

In [None]:
# Churn rate by contract type, payment method, and internet service
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Contract Type
churn_by_contract = df.groupby('contract_type')['churn'].mean().sort_values(ascending=False)
churn_by_contract.plot(kind='bar', ax=axes[0], color=['#e74c3c', '#f39c12', '#2ecc71'])
axes[0].set_title('Churn Rate by Contract Type', fontweight='bold')
axes[0].set_ylabel('Churn Rate')
axes[0].tick_params(axis='x', rotation=45)

# Payment Method
churn_by_payment = df.groupby('payment_method')['churn'].mean().sort_values(ascending=False)
churn_by_payment.plot(kind='bar', ax=axes[1], color=plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, 4)))
axes[1].set_title('Churn Rate by Payment Method', fontweight='bold')
axes[1].set_ylabel('Churn Rate')
axes[1].tick_params(axis='x', rotation=45)

# Internet Service
churn_by_internet = df.groupby('internet_service')['churn'].mean().sort_values(ascending=False)
churn_by_internet.plot(kind='bar', ax=axes[2], color=['#e74c3c', '#3498db', '#2ecc71'])
axes[2].set_title('Churn Rate by Internet Service', fontweight='bold')
axes[2].set_ylabel('Churn Rate')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Churn by tenure groups
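df['tenure_group'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 72, 100],
                           labels=['0-1yr', '1-2yr', '2-4yr', '4-6yr', '6+yr'],
                           include_lowest=True)  # keep tenure == 0 in the first bin

fig, ax = plt.subplots(figsize=(10, 5))
# observed=False keeps all five groups so the bars match the five colors below
churn_by_tenure = df.groupby('tenure_group', observed=False)['churn'].agg(['mean', 'count'])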

bars = ax.bar(churn_by_tenure.index, churn_by_tenure['mean'], 
              color=plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, 5)), edgecolor='white', linewidth=2)
ax.set_title('Churn Rate by Customer Tenure', fontsize=14, fontweight='bold')
ax.set_xlabel('Tenure Group')
ax.set_ylabel('Churn Rate')

# Add count labels
for bar, count in zip(bars, churn_by_tenure['count']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
            f'n={count}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()
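
`pd.cut` builds right-closed intervals by default, so with a lowest edge of 0 a tenure of exactly 0 falls outside the first bin and becomes `NaN` unless `include_lowest=True` is passed. A small sketch on toy tenure values (assumed, not drawn from the dataset) shows the edge behaviour:

```python
import pandas as pd

tenure = pd.Series([0, 1, 12, 13, 72, 73])  # toy values around the bin edges
bins = [0, 12, 24, 48, 72, 100]
labels = ['0-1yr', '1-2yr', '2-4yr', '4-6yr', '6+yr']

# Default: the first interval is (0, 12], so tenure == 0 is unassigned
default_groups = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_groups = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(default_groups.isna().sum())
print(inclusive_groups.tolist())
```

Values of exactly 12 or 72 land in the lower group in both cases because the right edge is inclusive.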

## 4. Correlation Analysis

In [None]:
# Numerical correlation with churn
numeric_cols = ['tenure', 'monthly_charges', 'total_charges', 'senior_citizen', 'churn']
corr_matrix = df[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, 
            fmt='.2f', square=True, linewidths=1, ax=ax)
ax.set_title('Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nCorrelation with Churn:")
print(corr_matrix['churn'].sort_values(ascending=False))

In [None]:
# Churned vs Retained - Box plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for idx, col in enumerate(['tenure', 'monthly_charges', 'total_charges']):
    df.boxplot(column=col, by='churn', ax=axes[idx])
    axes[idx].set_title(col.replace('_', ' ').title(), fontweight='bold')
    axes[idx].set_xlabel('Churned')

plt.suptitle('Feature Distribution by Churn Status', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## 5. Key Insights

### 📊 Summary of Findings

**Churn Drivers (High Risk Factors):**
1. **Month-to-month contracts** have the highest churn rate (roughly 50% or more)
2. **Electronic check payments** correlate with higher churn
3. **Fiber optic internet** users churn more than DSL users
4. **New customers (tenure < 1 year)** are most likely to churn
5. **Higher monthly charges** are associated with increased churn

**Retention Indicators (Low Risk Factors):**
1. **Two-year contracts** have very low churn rates
2. **Longer tenure** strongly correlates with retention
3. **Bank transfer/credit card payments** indicate stable customers
4. **Tech support subscribers** are less likely to churn
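
The findings above can be checked against per-segment churn rates together with sample sizes, since a rate computed on a handful of customers is weak evidence. This is a minimal sketch, assuming `churn` is coded 0/1 as elsewhere in this notebook; `churn_rate_by` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd


def churn_rate_by(frame: pd.DataFrame, column: str, target: str = "churn") -> pd.DataFrame:
    """Churn rate and customer count per category, highest rate first."""
    return (
        frame.groupby(column)[target]
        .agg(churn_rate="mean", customers="count")
        .sort_values("churn_rate", ascending=False)
    )


# Toy data for illustration only (assumed values, not taken from the dataset)
toy = pd.DataFrame({
    "contract_type": ["Month-to-month", "Month-to-month", "One year",
                      "Two year", "Two year", "Month-to-month"],
    "churn": [1, 0, 0, 0, 0, 1],
})
rates = churn_rate_by(toy, "contract_type")
print(rates)
```

On the real data, `churn_rate_by(df, 'contract_type')` and the same call for `payment_method`, `internet_service`, and `tech_support` would give the numbers behind each bullet.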

### 💡 Recommendations
- Focus retention efforts on first-year customers
- Incentivize annual contract upgrades
- Investigate fiber optic service quality issues
- Promote automatic payment methods

In [None]:
# Final summary statistics
print("="*60)
print("DATASET SUMMARY")
print("="*60)
print(f"Total Customers: {len(df):,}")
print(f"Churned Customers: {df['churn'].sum():,} ({df['churn'].mean():.1%})")
print(f"Retained Customers: {(1-df['churn']).sum():,.0f} ({1-df['churn'].mean():.1%})")
print(f"\nAverage Tenure: {df['tenure'].mean():.1f} months")
print(f"Average Monthly Charges: ${df['monthly_charges'].mean():.2f}")
print(f"Average Total Charges: ${df['total_charges'].mean():.2f}")
print("="*60)