# 01 - Exploratory Data Analysis (EDA)
## Customer Retention Analytics - Raw Data Exploration

**Objective**: Explore the raw telecom customer data to understand:
- Dataset structure and dimensions
- Missing values and data quality issues
- Distribution of key variables
- Churn rates and patterns
- Correlations between features

**Data Source**: `data/raw/telecom_customer_data.csv` (relative to the project root; the notebook loads it as `../data/raw/telecom_customer_data.csv`)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set figure size default
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully")

## 1. Load Raw Data

In [None]:
# Load the raw dataset
df = pd.read_csv('../data/raw/telecom_customer_data.csv')

print(f"Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
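
Before going further, it can help to confirm that the columns used by later cells are actually present. The following is a minimal sketch, not part of the original pipeline: `check_required_columns` is a hypothetical helper, and the tiny in-memory CSV is made-up data standing in for the real file. The column names are the ones referenced elsewhere in this notebook.

```python
import io
import pandas as pd

# Columns referenced by later cells in this notebook.
REQUIRED_COLUMNS = [
    "CustomerID", "Gender", "SeniorCitizen", "Tenure", "Contract",
    "PaymentMethod", "InternetService", "Region", "MonthlyCharges",
    "TotalCharges", "SupportCalls", "Churn",
]

def check_required_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Made-up sample standing in for the real CSV; it deliberately omits some columns.
sample_csv = io.StringIO(
    "CustomerID,Gender,Tenure,MonthlyCharges,TotalCharges,Churn\n"
    "C001,Male,12,70.5,846.0,No\n"
    "C002,Female,3,89.9,120.0,Yes\n"
)
sample = pd.read_csv(sample_csv)

missing_cols = check_required_columns(sample)
print("Missing columns:", missing_cols)
```

On the real data, `check_required_columns(df)` should return an empty list; a non-empty result would explain a later `KeyError`.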

## 2. Dataset Overview

In [None]:
# Display first few rows
print("\n📊 First 5 rows:")
df.head()

In [None]:
# Dataset information
print("\n📋 Dataset Info:")
df.info()

In [None]:
# Statistical summary
print("\n📈 Statistical Summary (Numerical Features):")
df.describe()

## 3. Missing Values Analysis

In [None]:
# Check for missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print("\n⚠️ Missing Values Detected:")
    print(missing_df)
    
    # Visualize missing values
    fig, ax = plt.subplots(figsize=(10, 4))
    missing_df['Percentage'].plot(kind='barh', ax=ax, color='coral')
    ax.set_xlabel('Percentage of Missing Values')
    ax.set_title('Missing Values by Column')
    plt.tight_layout()
    plt.show()
else:
    print("\n✓ No missing values found!")
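
`isnull()` only counts true NaN/None values. Text columns can also hide missing data as empty or whitespace-only strings, which is one way a column like `TotalCharges` ends up stored as text. The sketch below shows one way to count such cells; `count_blank_strings` is a hypothetical helper and the demo frame is made-up data.

```python
import pandas as pd

def count_blank_strings(frame):
    """Count empty or whitespace-only cells in each non-numeric column."""
    counts = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # numeric columns cannot hold blank strings
        as_text = frame[col].dropna().astype(str)
        counts[col] = int(as_text.str.strip().eq("").sum())
    return pd.Series(counts, dtype="int64")

# Made-up example: two blank TotalCharges cells that isnull() does not flag.
demo = pd.DataFrame({
    "TotalCharges": ["846.0", " ", "120.5", ""],
    "Gender": ["Male", "Female", None, "Female"],
    "Tenure": [12, 3, 5, 8],
})

blank_counts = count_blank_strings(demo)
print(blank_counts)
print(demo.isnull().sum())
```

Running `count_blank_strings(df)` next to the `isnull()` summary gives a fuller picture of missingness before any type conversion.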

## 4. Target Variable Analysis: Churn

In [None]:
# Churn distribution
churn_counts = df['Churn'].value_counts()
churn_pct = (churn_counts / len(df)) * 100

print("\n🎯 Churn Distribution:")
print(f"  No Churn:  {churn_counts.get('No', 0):,} ({churn_pct.get('No', 0):.1f}%)")
print(f"  Churned:   {churn_counts.get('Yes', 0):,} ({churn_pct.get('Yes', 0):.1f}%)")
print(f"\n  Churn Rate: {churn_pct.get('Yes', 0):.1f}%")

In [None]:
# Visualize churn distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
colors = ['#2ecc71', '#e74c3c']
axes[0].pie(churn_counts, labels=churn_counts.index, autopct='%1.1f%%', 
            colors=colors, startangle=90)
axes[0].set_title('Churn Distribution (Pie Chart)', fontsize=14, fontweight='bold')

# Bar chart
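# Use the same category order as the pie chart so the colors match across both plots
sns.countplot(data=df, x='Churn', order=churn_counts.index, palette=colors, ax=axes[1])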
axes[1].set_title('Churn Distribution (Count)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Churn Status')
axes[1].set_ylabel('Count')

# Add value labels on bars
for container in axes[1].containers:
    axes[1].bar_label(container, fmt='%d')

plt.tight_layout()
plt.show()

print(f"\n💡 Insight: The target variable is imbalanced, with a churn rate of {churn_pct.get('Yes', 0):.1f}%.")
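
To put a number on the imbalance, the class shares can be summarized directly. This is a small sketch under assumptions: `class_balance` is a hypothetical helper and the labels are made-up. The majority share is also the accuracy a model would reach by always predicting the most common class, which is a useful baseline to keep in mind later.

```python
import pandas as pd

def class_balance(labels, positive="Yes"):
    """Summarize how unevenly the classes are represented (hypothetical helper)."""
    shares = labels.value_counts(normalize=True)
    majority_share = float(shares.max())
    minority_share = float(shares.min())
    return {
        "positive_share": float(shares.get(positive, 0.0)),
        # Accuracy of always predicting the most common class.
        "majority_baseline_accuracy": majority_share,
        "imbalance_ratio": majority_share / minority_share,
    }

# Made-up labels: 4 retained customers, 2 churned.
demo_labels = pd.Series(["No", "No", "Yes", "No", "Yes", "No"])
balance = class_balance(demo_labels)
print(balance)
```

On the real data this would be called as `class_balance(df['Churn'])`.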

## 5. Numerical Features Distribution

In [None]:
# Select numerical columns
numerical_cols = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'SupportCalls']

# Convert TotalCharges to numeric; entries that cannot be parsed (e.g. blank strings) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, col in enumerate(numerical_cols):
    # Histogram with KDE
    sns.histplot(data=df, x=col, kde=True, ax=axes[idx], color='steelblue')
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    
    # Add statistics
    mean_val = df[col].mean()
    median_val = df[col].median()
    axes[idx].axvline(mean_val, color='red', linestyle='--', linewidth=1.5, label=f'Mean: {mean_val:.1f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', linewidth=1.5, label=f'Median: {median_val:.1f}')
    axes[idx].legend()

plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print(f"  • Tenure: Mean = {df['Tenure'].mean():.1f} months, ranges from {df['Tenure'].min()} to {df['Tenure'].max()}")
print(f"  • Monthly Charges: Mean = ${df['MonthlyCharges'].mean():.2f}, ranges from ${df['MonthlyCharges'].min():.2f} to ${df['MonthlyCharges'].max():.2f}")
print(f"  • Support Calls: Mean = {df['SupportCalls'].mean():.1f} calls per customer")
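
The histograms show shape visually; skewness and an IQR-based outlier count give the same information as numbers. A minimal sketch, assuming the common 1.5 × IQR fence rule (`iqr_outlier_summary` is a hypothetical helper and the demo values are made-up):

```python
import pandas as pd

def iqr_outlier_summary(frame, columns, k=1.5):
    """Report skewness and the number of values outside the k * IQR fences."""
    rows = []
    for col in columns:
        values = pd.to_numeric(frame[col], errors="coerce").dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_outliers = int(((values < lower) | (values > upper)).sum())
        rows.append({
            "column": col,
            "skew": values.skew(),
            "lower_fence": lower,
            "upper_fence": upper,
            "n_outliers": n_outliers,
        })
    return pd.DataFrame(rows).set_index("column")

# Made-up data: SupportCalls has one extreme value, Tenure is symmetric.
demo = pd.DataFrame({
    "SupportCalls": [0, 1, 1, 2, 2, 2, 3, 3, 4, 25],
    "Tenure": [2, 8, 14, 20, 26, 32, 38, 44, 50, 56],
})
outlier_summary = iqr_outlier_summary(demo, ["SupportCalls", "Tenure"])
print(outlier_summary)
```

On the real data, `iqr_outlier_summary(df, numerical_cols)` would cover the four columns plotted above.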

## 6. Categorical Features Analysis

In [None]:
# Select key categorical features
categorical_cols = ['Contract', 'PaymentMethod', 'InternetService', 'Gender']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, col in enumerate(categorical_cols):
    value_counts = df[col].value_counts()
    
    # Bar plot
    sns.countplot(data=df, x=col, ax=axes[idx], palette='Set2', order=value_counts.index)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Count')
    axes[idx].tick_params(axis='x', rotation=45)
    
    # Add value labels
    for container in axes[idx].containers:
        axes[idx].bar_label(container, fmt='%d')

plt.tight_layout()
plt.show()

## 7. Churn Rate by Categorical Variables

In [None]:
# Calculate the churn rate (%) for each category of a given column
def calculate_churn_rate(df, column):
    """Return the percentage of churned customers per category, highest first."""
    churn_by_cat = df.groupby(column)['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)
    return churn_by_cat.sort_values(ascending=False)

# Analyze churn by key categorical variables
categories_to_analyze = ['Contract', 'PaymentMethod', 'InternetService', 'Region']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for idx, col in enumerate(categories_to_analyze):
    churn_rate = calculate_churn_rate(df, col)
    
    # Bar plot
    ax = axes[idx]
    bars = ax.bar(range(len(churn_rate)), churn_rate.values, color='coral')
    ax.set_xticks(range(len(churn_rate)))
    ax.set_xticklabels(churn_rate.index, rotation=45, ha='right')
    ax.set_title(f'Churn Rate by {col}', fontsize=12, fontweight='bold')
    ax.set_ylabel('Churn Rate (%)')
    ax.set_xlabel(col)
    ax.axhline(y=churn_pct.get('Yes', 0), color='red', linestyle='--', linewidth=1.5, label=f'Overall: {churn_pct.get("Yes", 0):.1f}%')
    ax.legend()
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.1f}%', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print("\n🔍 Churn Rate Insights:")
for col in categories_to_analyze:
    churn_rate = calculate_churn_rate(df, col)
    print(f"\n{col}:")
    for cat, rate in churn_rate.items():
        print(f"  • {cat}: {rate:.1f}%")
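
A churn rate is only as reliable as the group it comes from, so it can be useful to see group sizes next to the rates. The sketch below uses `pd.crosstab` with `normalize='index'` as an alternative way to get per-category percentages; the demo data is made-up.

```python
import pandas as pd

# Made-up example data.
demo = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# Row-normalized crosstab: each row sums to 100%.
rates = pd.crosstab(demo["Contract"], demo["Churn"], normalize="index") * 100

# Attach the number of customers behind each rate.
rates["n_customers"] = demo["Contract"].value_counts()
print(rates.round(1))
```

A category with very few customers can show an extreme rate purely by chance, which the `n_customers` column makes easy to spot.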

## 8. Numerical Variables vs Churn

In [None]:
# Box plots for numerical variables by churn status
numerical_for_boxplot = ['Tenure', 'MonthlyCharges', 'TotalCharges', 'SupportCalls']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, col in enumerate(numerical_for_boxplot):
    # Fix the category order so green always maps to 'No' and red to 'Yes'
    sns.boxplot(data=df, x='Churn', y=col, order=['No', 'Yes'], palette=['#2ecc71', '#e74c3c'], ax=axes[idx])
    axes[idx].set_title(f'{col} by Churn Status', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Churn')
    axes[idx].set_ylabel(col)

plt.tight_layout()
plt.show()

# Statistical comparison
print("\n📊 Average Values by Churn Status:\n")
for col in numerical_for_boxplot:
    no_churn_mean = df[df['Churn'] == 'No'][col].mean()
    yes_churn_mean = df[df['Churn'] == 'Yes'][col].mean()
    print(f"{col}:")
    print(f"  No Churn:  {no_churn_mean:.2f}")
    print(f"  Churned:   {yes_churn_mean:.2f}")
    print(f"  Difference: {yes_churn_mean - no_churn_mean:.2f}\n")
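
The same comparison can be produced as a single table with `groupby`, which also makes it easy to add the median (less sensitive to skewed columns than the mean). A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up example data.
demo = pd.DataFrame({
    "Churn": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "Tenure": [30, 24, 18, 6, 12, 18],
    "MonthlyCharges": [70.0, 80.0, 90.0, 85.0, 80.0, 90.0],
})
cols = ["Tenure", "MonthlyCharges"]

# Mean and median per churn group, one row per group.
summary = demo.groupby("Churn")[cols].agg(["mean", "median"])
print(summary)

# Difference in means (churned minus retained), matching the loop above.
means = demo.groupby("Churn")[cols].mean()
difference = means.loc["Yes"] - means.loc["No"]
print(difference)
```

On the real data, replacing `demo` with `df` and `cols` with `numerical_for_boxplot` should reproduce the printed averages.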

## 9. Correlation Analysis

In [None]:
# Select numerical columns for correlation
df_numeric = df[['SeniorCitizen', 'Tenure', 'MonthlyCharges', 'TotalCharges', 'SupportCalls']].copy()

# Add binary encoding for Churn
df_numeric['Churn_Binary'] = (df['Churn'] == 'Yes').astype(int)

# Calculate correlation matrix
correlation_matrix = df_numeric.corr()

# Visualize correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix (Numerical Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🔗 Correlation with Churn:")
churn_corr = correlation_matrix['Churn_Binary'].sort_values(ascending=False)
for feature, corr in churn_corr.items():
    if feature != 'Churn_Binary':
        print(f"  • {feature}: {corr:.3f}")
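
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association and can be pulled around by skewed columns or outliers. Comparing it with the rank-based Spearman correlation is a quick robustness check. The sketch below uses made-up data:

```python
import pandas as pd

# Made-up example data with one very long tenure.
demo = pd.DataFrame({
    "Tenure": [1, 2, 3, 5, 8, 13, 21, 60],
    "SupportCalls": [9, 7, 8, 4, 5, 2, 1, 0],
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})
demo["Churn_Binary"] = (demo["Churn"] == "Yes").astype(int)
numeric = demo[["Tenure", "SupportCalls", "Churn_Binary"]]

# Correlation of each feature with the binary churn flag, under both methods.
pearson = numeric.corr(method="pearson")["Churn_Binary"].drop("Churn_Binary")
spearman = numeric.corr(method="spearman")["Churn_Binary"].drop("Churn_Binary")

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(comparison.round(3))
```

Large gaps between the two columns suggest that a few extreme values or a non-linear relationship are driving the Pearson figure.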

## 10. Data Quality Issues Summary

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
duplicate_ids = df['CustomerID'].duplicated().sum()

# Check for inconsistent values
gender_values = df['Gender'].unique()

print("\n🔍 Data Quality Summary:\n")
print("1. Missing Values:")
if len(missing_df) > 0:
    for col, row in missing_df.iterrows():
        print(f"   • {col}: {row['Missing Count']:.0f} ({row['Percentage']:.2f}%)")
else:
    print("   ✓ No missing values")

print("\n2. Duplicate Records:")
print(f"   • Full duplicates: {duplicates}")
print(f"   • Duplicate CustomerIDs: {duplicate_ids}")

print("\n3. Data Type Issues:")
# TotalCharges was already coerced to numeric in Section 5, so the dtype shown here
# is the converted one; entries that failed to parse now appear as NaN.
print(f"   • TotalCharges dtype after coercion in Section 5: {df['TotalCharges'].dtype}")
print(f"   • TotalCharges values that are NaN (missing or non-numeric): {df['TotalCharges'].isna().sum()}")

print("\n4. Categorical Inconsistencies:")
print(f"   • Gender values: {list(gender_values)}")
if len(gender_values) > 2:
    print("   ⚠️ Should be standardized to Male/Female only")

print("\n💡 These issues will be addressed in the cleaning stage (02_Cleaning_EDA.ipynb)")
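
As a preview of the kind of standardization the cleaning stage might apply to `Gender`, here is a minimal sketch. The variant spellings in `GENDER_MAP` are hypothetical; the real ones should be read from `df['Gender'].unique()`. `standardize_gender` is also a hypothetical helper, not the cleaning notebook's actual code.

```python
import pandas as pd

# Hypothetical variants; replace with the spellings actually found in the data.
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}

def standardize_gender(series):
    """Map known spelling variants to Male/Female; unknown values become NaN."""
    cleaned = series.str.strip().str.lower()
    return cleaned.map(GENDER_MAP)

# Made-up example values.
raw = pd.Series(["Male", "male ", "F", "FEMALE", "M", None, "Unknown"])
standardized = standardize_gender(raw)

# Values that were present but not recognized, so they can be reviewed by hand.
unmapped = raw[standardized.isna() & raw.notna()].unique()

print(standardized.tolist())
print("Unmapped values:", list(unmapped))
```

Reporting unmapped values separately avoids silently turning unexpected categories into missing data.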

## 11. Key Findings Summary

### Dataset Overview
- **Size**: 7,537 records × 23 features
- **Churn Rate**: ~33.7% (high - above industry average)
- **Data Quality**: Minor issues with missing TotalCharges and duplicate records

### Key Insights

#### 1. **Contract Type is Critical**
- Month-to-month contracts have significantly higher churn (~45%)
- Two-year contracts show lowest churn (~11%)
- **Action**: Incentivize longer contract commitments

#### 2. **Payment Method Matters**
- Electronic check users churn more frequently
- Automatic payment methods show lower churn
- **Action**: Encourage migration to autopay

#### 3. **Tenure Impact**
- Churned customers have lower average tenure (18 vs 24 months)
- Early months are critical for retention
- **Action**: Focus on new customer onboarding

#### 4. **Support Calls Correlation**
- More support calls correlate with higher churn
- May indicate service dissatisfaction
- **Action**: Proactive support for high-call customers

#### 5. **Revenue Paradox**
- Churned customers actually pay slightly higher monthly charges ($82.89 vs $80.59)
- Suggests price sensitivity may not be the primary driver
- **Action**: Focus on value delivery, not just pricing
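
To see how small the gap in monthly charges is, the two averages quoted above can be compared in absolute and relative terms. This is plain arithmetic on the reported figures, not a statistical test:

```python
# Average monthly charges quoted in the finding above.
churned_avg = 82.89
retained_avg = 80.59

absolute_gap = churned_avg - retained_avg
relative_gap_pct = absolute_gap / retained_avg * 100

print(f"Absolute gap: ${absolute_gap:.2f}")
print(f"Relative gap: {relative_gap_pct:.1f}%")
```

A difference of under 3% is modest next to the gaps seen for contract type, which is consistent with price not being the main driver.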

### Next Steps
1. **Data Cleaning**: Address missing values, duplicates, and standardization
2. **Feature Engineering**: Create derived features (tenure groups, retention scores)
3. **Deep Analysis**: Segment customers and identify at-risk profiles
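
As an illustration of the tenure-group feature mentioned in step 2, the sketch below bins tenure with `pd.cut` and computes the churn rate per bin. The bin edges and labels are hypothetical choices, and the data is made-up; the actual feature engineering may use different groups.

```python
import pandas as pd

# Hypothetical bin edges (months) and labels.
TENURE_BINS = [0, 12, 24, 48, float("inf")]
TENURE_LABELS = ["0-12", "13-24", "25-48", "49+"]

# Made-up example data.
demo = pd.DataFrame({
    "Tenure": [1, 6, 12, 13, 20, 30, 48, 60],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Right-closed bins: 12 falls in "0-12", 13 in "13-24", and so on.
demo["TenureGroup"] = pd.cut(
    demo["Tenure"], bins=TENURE_BINS, labels=TENURE_LABELS, include_lowest=True
)
demo["Churned"] = demo["Churn"].eq("Yes")

# Churn rate (%) per tenure group, keeping empty groups visible.
churn_by_group = demo.groupby("TenureGroup", observed=False)["Churned"].mean() * 100
print(churn_by_group.round(1))
```

Grouping this way makes the "early months are critical" finding directly comparable across tenure bands.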

In [None]:
print("\n" + "="*70)
print("  EDA COMPLETE - Proceed to 02_Cleaning_EDA.ipynb")
print("="*70)