# Notebook 1: Data Exploration

## Objectives
- Load the Telco Customer Churn dataset
- Inspect dataset structure and types
- Analyze missing values and basic statistics
- Visualize churn distribution and key relationships
- Identify potential churn drivers

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Make the project's src/ directory importable
sys.path.append(os.path.abspath('../src'))

from data_preprocessing import load_data, clean_data

pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

## 1. Load Data

In [None]:
filepath = '../data/WA_Fn-UseC_-Telco-Customer-Churn.csv'
# Check if file exists to avoid error
if not os.path.exists(filepath):
    print("Dataset not found. Please download it from Kaggle and place it in 'data/' folder.")
else:
    df = load_data(filepath)
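    # A bare df.head() inside an if/else block is not rendered; display() is a Jupyter built-in
    display(df.head())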

## 2. Dataset Overview

In [None]:
if 'df' in locals():
    df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"
    print("\nShape:", df.shape)

## 3. Data Cleaning (Initial Look)
The `TotalCharges` column is typically read as `object` dtype because a few rows contain blank strings, so it needs conversion to a numeric type. We also check for missing values after cleaning.

In [None]:
if 'df' in locals():
    df_clean = clean_data(df)
    print("Total missing values after cleaning:", df_clean.isnull().sum().sum())
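
The body of `clean_data` is not shown in this notebook, so the following is only a sketch of one common way to handle a text-typed `TotalCharges` column, assuming blank strings are the cause. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (made-up values) mimicking the raw CSV: TotalCharges arrives as text,
# and one row holds a blank string instead of a number.
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
})

# errors="coerce" turns entries that cannot be parsed (the blank) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count missing values per column after the conversion.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Whether the resulting NaN rows are dropped or imputed is a separate decision; the real `clean_data` may do either.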

## 4. Visualizations

### Churn Distribution

In [None]:
if 'df_clean' in locals():
    plt.figure(figsize=(6, 6))
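    # value_counts() orders by frequency; sort by class so the counts line up with the labels below
    churn_counts = df_clean['Churn'].value_counts().sort_index()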
    plt.pie(churn_counts, labels=['No Churn (0)', 'Churn (1)'], autopct='%1.1f%%', colors=['#66b3ff','#ff9999'])
    plt.title('Churn Distribution')
    plt.show()

### Numerical Features: Tenure & Monthly Charges

In [None]:
if 'df_clean' in locals():
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    sns.histplot(data=df_clean, x='tenure', hue='Churn', multiple='stack', kde=True, ax=axes[0])
    axes[0].set_title('Tenure Distribution by Churn')
    
    sns.boxplot(data=df_clean, x='Churn', y='MonthlyCharges', ax=axes[1])
    axes[1].set_title('Monthly Charges by Churn')
    plt.show()

**Insight**: Churn is concentrated among customers with low tenure (new customers) and among those with higher monthly charges.
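
A visual impression like this can be backed by a small summary table. The sketch below compares group medians on a made-up `toy` frame, assuming `Churn` is encoded as 0/1 as the pie chart labels suggest; with the real data, `df_clean` would take the place of `toy`.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the Telco dataset.
toy = pd.DataFrame({
    "Churn": [0, 0, 0, 1, 1, 1],
    "tenure": [40, 60, 50, 2, 5, 8],
    "MonthlyCharges": [50.0, 60.0, 55.0, 80.0, 90.0, 85.0],
})

# Median tenure and monthly charge for each churn class.
summary = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()
print(summary)
```

Medians are used here rather than means because tenure and charges can be skewed.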

### Categorical Features: Contract Type

In [None]:
if 'df_clean' in locals():
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df_clean, x='Contract', hue='Churn')
    plt.title('Churn Count by Contract Type')  # countplot shows counts, not rates
    plt.show()

**Insight**: Month-to-month contracts show a much higher share of churned customers than one-year or two-year contracts.
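
The count plot shows absolute counts, so groups of different sizes are hard to compare. A per-contract churn rate makes the comparison direct. This is a minimal sketch on made-up data, assuming `Churn` is encoded as 0/1 so that its mean equals the churn rate.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Telco dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
    "Churn": [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
})

# With a 0/1 target, the group mean is the fraction of churned customers.
churn_rate = toy.groupby("Contract")["Churn"].mean()
print(churn_rate)
```

On the real data the same `groupby` would be applied to `df_clean`, provided `Contract` is still a categorical column after cleaning.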

### Correlation Heatmap

In [None]:
if 'df_clean' in locals():
    # Select numerical columns only for correlation
    numeric_df = df_clean.select_dtypes(include=[np.number])
    plt.figure(figsize=(10, 8))
    sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
    plt.title('Correlation Heatmap')
    plt.show()
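
To read potential churn drivers off the correlation matrix, it can help to extract just the `Churn` column and sort it. The sketch below does this on a made-up `toy` frame, assuming `Churn` is numeric (0/1) and therefore included in the numeric selection.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the Telco dataset.
toy = pd.DataFrame({
    "Churn": [0, 0, 0, 1, 1, 1],
    "tenure": [40, 60, 50, 2, 5, 8],
    "MonthlyCharges": [50.0, 60.0, 55.0, 80.0, 90.0, 85.0],
})

# Pearson correlation of each feature with Churn, most negative first.
churn_corr = toy.corr()["Churn"].drop("Churn").sort_values()
print(churn_corr)
```

Correlation with a binary target only captures linear association, so it is a screening aid rather than evidence of a causal driver.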