# Phase 2: Data Understanding & Exploratory Data Analysis

## Dataset Overview

- **Total Customers:** 7,043
- **Features:** 20 (after removing customerID)
- **Target Variable:** Churn (Yes/No)

## Key Findings

### Target Distribution

The dataset shows a moderate class imbalance:
- **No Churn:** 5,174 customers (73.5%)
- **Yes Churn:** 1,869 customers (26.5%)

This roughly 2.8:1 ratio means that accuracy alone would be a misleading metric. A naive model predicting "No Churn" for all customers would achieve 73.5% accuracy while identifying none of the churners, and so would provide zero business value.
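
The arithmetic behind this claim can be checked directly from the class counts listed above. The following is a minimal sketch; the variable names are illustrative and are not part of the notebook.

```python
# Class counts taken from the target distribution above
n_no, n_yes = 5174, 1869
total = n_no + n_yes

# A naive model that predicts "No" for every customer
true_negatives = n_no   # every non-churner is labelled correctly
true_positives = 0      # no churner is ever flagged

accuracy = (true_positives + true_negatives) / total
recall_churn = true_positives / n_yes   # share of actual churners caught
imbalance_ratio = n_no / n_yes

print(f"accuracy       = {accuracy:.3f}")
print(f"recall (churn) = {recall_churn:.3f}")
print(f"No:Yes ratio   = {imbalance_ratio:.2f}:1")
```

Accuracy looks respectable only because the majority class dominates; recall on the churn class is what exposes the problem.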

### Data Quality Issues

**TotalCharges Data Type Problem:**
- Currently stored as object (string) instead of a numeric type
- Contains 11 records whose value is a single space character (`' '`) rather than a number
- Requires conversion to numeric and handling of the resulting missing values in Phase 3
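
One possible way to do the conversion is sketched below on a small toy series, since the actual Phase 3 approach is not shown here. It assumes `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries such as blanks into `NaN`; the sample values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus blank entries
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15", " "])

# errors="coerce" replaces anything that cannot be parsed with NaN
converted = pd.to_numeric(total_charges, errors="coerce")

print(converted.dtype)          # numeric dtype instead of object
print(converted.isna().sum())   # number of blank entries that became NaN
```

How the resulting missing values are then handled (dropping or imputing) is a separate decision left to Phase 3.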

**Encoding Inconsistency:**
- SeniorCitizen is encoded as binary integers (0, 1)
- Other categorical variables use string labels (Yes/No, Male/Female)
- A consistent encoding across categorical variables is needed for model compatibility
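
As a rough illustration of what a consistent encoding could look like, the sketch below maps two-valued string columns to 0/1 so they match `SeniorCitizen`. The toy frame and the choice of which label becomes 1 are assumptions made for the example, not decisions taken in the notebook.

```python
import pandas as pd

# Toy frame mirroring the mixed encodings described above
df = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "Churn": ["Yes", "No", "Yes"],
    "gender": ["Female", "Male", "Female"],
})

# Map each two-valued string column to 0/1 (the mapping direction is arbitrary)
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})
df["gender"] = df["gender"].map({"Female": 0, "Male": 1})

print(df.dtypes)
```

Columns with more than two categories would need a different treatment, such as one-hot encoding.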

## Next Steps

Further analysis is required to understand:
- The relationship between tenure and churn
- The impact of monthly charges on churn rate
- The influence of contract type on customer retention
- How service-related features correlate with churn
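
Several of these questions reduce to comparing the churn rate across groups. A hypothetical helper for that is sketched below; `churn_rate_by` is not part of the notebook, the column name `Contract` is assumed, and the toy rows are for illustration only and are not values from the dataset.

```python
import pandas as pd

def churn_rate_by(df, column, target="Churn"):
    """Return the share of 'Yes' in `target` for each value of `column`."""
    return (df[target] == "Yes").groupby(df[column]).mean()

# Toy rows for illustration only; these are not values from the dataset
toy = pd.DataFrame({
    "Contract": ["A", "A", "B", "B"],
    "Churn": ["Yes", "No", "No", "No"],
})

rates = churn_rate_by(toy, "Contract")
print(rates)
```

For a continuous feature such as tenure, the same helper could be applied after binning the values, for example with `pd.cut`.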

In [13]:
# ============================
# Importing Libraries
# ============================

import pandas as pd


# ============================
# Loading Data
# ============================

data = pd.read_csv("Teleco_customer_churn.csv")

# Basic inspection
#print(data.head())
#print(data.shape)
#print(data.columns)
#print(data.info())


# ============================
# Identifier Handling
# ============================

# Drop customerID ONLY if it exists
if "customerID" in data.columns:
    data = data.drop(columns=["customerID"])


# ============================
# Target Distribution
# ============================

print(data["Churn"].value_counts())
print(data["Churn"].value_counts(normalize=True))


# ============================
# Data Quality Check
# ============================

# TotalCharges: dtype and number of entries that are a single space
print(data["TotalCharges"].dtype)
print((data["TotalCharges"] == " ").sum())

# Compare how categorical columns are encoded
print(data["SeniorCitizen"].unique())
print(data["gender"].unique())

#print(data.isnull().sum())

Churn
No     5174
Yes    1869
Name: count, dtype: int64
Churn
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64
object
11
[0 1]
['Female' 'Male']
