# Step 1 — Data Cleaning & Preprocessing

This notebook performs initial data cleaning for the Telco Customer Churn dataset. It will:

- Inspect the dataset for missing values and incorrect dtypes
- Clean the `TotalCharges` column (common issue: loaded as object)
- Apply a small, explicit missing-value policy (drop or impute based on fraction)
- Encode categorical variables with one-hot encoding via `pd.get_dummies` (binary columns and the `Churn` target are expanded into indicator columns as well)
- Save a cleaned CSV to `data/cleaned_telco_churn.csv`

Checklist:
- [ ] Load `WA_Fn-UseC_-Telco-Customer-Churn.csv`
- [ ] Inspect for missing values and bad dtypes
- [ ] Fix `TotalCharges` dtype and handle resulting NaNs
- [ ] Encode categorical variables
- [ ] Save cleaned dataset for modeling


In [1]:
# Imports
import pandas as pd
from pathlib import Path
import numpy as np

pd.set_option('display.max_columns', 80)
pd.set_option('display.width', 180)

# Load dataset (adjust path if needed)
csv_path = Path('..') / 'data' / 'WA_Fn-UseC_-Telco-Customer-Churn.csv'  # notebook lives in notebooks/
df = pd.read_csv(csv_path)
print('Loaded', csv_path)
print('Shape:', df.shape)
df.head()

# Quick structure check and missing values
df.info()
print('\nMissing values per column:')
print(df.isnull().sum())
print('\nBlank-string counts (object cols may contain spaces):')
for c in df.select_dtypes(include=['object']).columns:
    blanks = (df[c].astype(str).str.strip() == '').sum()
    if blanks > 0:
        print(f'{c}: {blanks} blank-like values')

# TotalCharges is frequently stored as object due to blank strings. Convert to numeric.
print('Before conversion dtype:', df['TotalCharges'].dtype)
# coerce errors to NaN so we can count/fix them
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print('After conversion dtype:', df['TotalCharges'].dtype)
num_missing_total = df['TotalCharges'].isna().sum()
print('TotalCharges NaNs:', num_missing_total, 'of', len(df), f'({num_missing_total/len(df):.2%})')


Loaded ../data/WA_Fn-UseC_-Telco-Customer-Churn.csv
Shape: (7043, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7
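
The reason `df.isnull().sum()` can report zero missing values while `TotalCharges` still fails to parse is that blank strings are ordinary strings, not NaN. A minimal sketch of that behavior, using made-up toy values rather than rows from the Telco file:

```python
import pandas as pd

# Toy values for illustration only (not rows from the Telco file).
raw = pd.Series(["29.85", " ", "1889.5", "", "108.15"], name="TotalCharges")

# isnull() finds nothing: blank strings count as valid (non-null) values.
print("NaNs before coercion:", raw.isnull().sum())

# Stripping whitespace exposes the blank-like entries.
blank_mask = raw.str.strip() == ""
print("Blank-like entries:", blank_mask.sum())

# errors='coerce' converts anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw, errors="coerce")
print("dtype after coercion:", converted.dtype)
print("NaNs after coercion:", converted.isna().sum())
```

After coercion the blanks become countable NaNs, which is what the missing-value policy in the next cell relies on.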

In [2]:
# Decide strategy for missing TotalCharges: drop if very few, otherwise impute with median.
missing_ratio = df['TotalCharges'].isna().mean()
if missing_ratio == 0:
    print('No missing TotalCharges — nothing to do')
elif missing_ratio <= 0.05:
    print('Missing fraction <=5%: dropping rows with missing TotalCharges')
    df = df.dropna(subset=['TotalCharges']).copy()
else:
    med = df['TotalCharges'].median()
    print(f'Missing fraction >5%: imputing TotalCharges with median = {med:.2f}')
    df['TotalCharges'] = df['TotalCharges'].fillna(med)

print('New shape after TotalCharges handling:', df.shape)

# Trim whitespace in object columns (.str.strip() keeps NaN as NaN, whereas astype(str) would turn it into the string 'nan')
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()

# Convert SeniorCitizen (0/1) to int (it's numeric already in many versions)
if 'SeniorCitizen' in df.columns:
    df['SeniorCitizen'] = df['SeniorCitizen'].astype(int)

# Drop customerID if present — not a feature
if 'customerID' in df.columns:
    df = df.drop(columns=['customerID'])

print('Dtypes after cleanup:')
print(df.dtypes)

# Identify categorical columns
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print('Categorical columns (to encode):', cat_cols)
print('Numeric columns:', num_cols)

# One-hot encode every categorical column with pandas.get_dummies (this includes the target, Churn).
# All levels are kept here for interpretability; set drop_first=True to drop one level per
# feature and avoid collinearity in linear models.
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=False)
print('Shape before encoding:', df.shape)
print('Shape after encoding:', df_encoded.shape)

# Show a few columns to confirm encoding
df_encoded.iloc[:, :30].head()

# Ensure output directory exists and save cleaned dataset
out_dir = Path('..') / 'data'
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / 'cleaned_telco_churn.csv'
df_encoded.to_csv(out_path, index=False)
print('Saved cleaned dataset to', out_path)
print('Final dataframe shape:', df_encoded.shape)


Missing fraction <=5%: dropping rows with missing TotalCharges
New shape after TotalCharges handling: (7032, 21)
Dtypes after cleanup:
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
Categorical columns (to encode): ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
Numeric c
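
Before handing the saved file to the next notebook, a quick round-trip check can confirm that the CSV reloads with the same shape and columns and no missing values. The sketch below is an assumption-laden illustration: `check_round_trip` is a hypothetical helper, the toy frame stands in for `df_encoded`, and a temporary directory is used so nothing under `data/` is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_encoded; the values are made up for illustration.
df_encoded = pd.DataFrame(
    {
        "tenure": [1, 34, 2],
        "TotalCharges": [29.85, 1889.5, 108.15],
        "Contract_Month-to-month": [1, 0, 1],
        "Contract_One year": [0, 1, 0],
    }
)


def check_round_trip(frame: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Hypothetical helper: write frame to CSV, reload it, and verify it survived."""
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    assert reloaded.shape == frame.shape, "shape changed on round trip"
    assert list(reloaded.columns) == list(frame.columns), "columns changed"
    assert not reloaded.isna().any().any(), "unexpected missing values"
    return reloaded


with tempfile.TemporaryDirectory() as tmp:
    reloaded = check_round_trip(df_encoded, Path(tmp) / "cleaned_check.csv")

print(reloaded.dtypes)
```

In the notebook itself, the same checks could be run against `out_path` right after `to_csv`.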

## Notes & Next Steps

- This notebook completes Step 1: cleaning and basic encoding. The cleaned CSV `data/cleaned_telco_churn.csv` is ready for feature engineering and model building.
- The next notebook will perform feature engineering, train baseline models (logistic regression and a tree-based model), and report baseline metrics.
- Later we will add explainability (SHAP/LIME) to show drivers of churn for business stakeholders.
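
The encoding cell in this notebook one-hot encodes every object column, including Yes/No columns and the `Churn` target. An alternative that keeps Yes/No columns as a single 0/1 column and one-hot encodes only the multi-class columns might look like the sketch below. It is not what the notebook currently does; the toy rows and the `yes_no` mapping are assumptions for illustration, and `dtype=int` is passed because recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the Telco dataset.
df = pd.DataFrame(
    {
        "Partner": ["Yes", "No", "No"],
        "PaperlessBilling": ["Yes", "Yes", "No"],
        "Contract": ["Month-to-month", "One year", "Two year"],
        "MonthlyCharges": [29.85, 56.95, 53.85],
        "Churn": ["No", "No", "Yes"],
    }
)

yes_no = {"Yes": 1, "No": 0}  # assumed mapping for Yes/No columns

# Split non-numeric columns into Yes/No columns and everything else.
non_numeric = df.select_dtypes(exclude="number").columns
binary_cols = [c for c in non_numeric if set(df[c].dropna().unique()) <= {"Yes", "No"}]
multi_cols = [c for c in non_numeric if c not in binary_cols]

encoded = df.copy()
for c in binary_cols:
    # astype(int) assumes no missing values remain in these columns.
    encoded[c] = encoded[c].map(yes_no).astype(int)

# One-hot encode only the multi-class columns, as 0/1 integers.
encoded = pd.get_dummies(encoded, columns=multi_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded["Churn"].tolist())
```

With this approach the target stays a single `Churn` column. Columns such as `gender` (Male/Female) do not match the Yes/No test, so this sketch would still one-hot encode them unless they are given their own mapping.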
