## 🧼 03. Data Cleaning & Preprocessing

In this notebook, we clean and prepare the dataset for modeling. The main steps include:

- Fixing incorrect data types (e.g., `TotalCharges`)
- Handling missing or invalid values
- Encoding categorical variables
- Creating derived or transformed features if needed
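
As a minimal sketch of the type-fixing step, the toy frame below (hypothetical values, not taken from the dataset) shows how `pd.to_numeric` with `errors='coerce'` turns non-numeric strings into `NaN`, so that they can be counted and dropped:

```python
import pandas as pd

# Toy data (hypothetical): charges stored as strings, one entry blank
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Non-numeric strings become NaN instead of raising an error
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the invalid entries before removing them
n_invalid = int(toy["TotalCharges"].isna().sum())
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

print(n_invalid, toy["TotalCharges"].dtype)
```

Counting the `NaN` values before dropping them makes it easy to report how many rows were removed.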


In [1]:

import pandas as pd
import numpy as np

# Load the raw Telco churn dataset (the same CSV used in notebook 02)
df = pd.read_csv("../data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Reapply conversion logic in case user skips notebook 02
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.dropna(subset=['TotalCharges']).reset_index(drop=True)

# Step 1: Convert binary Yes/No columns to 1/0
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']

for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})

# Step 2: Convert target variable
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Step 3: One-hot encode categorical features
# customerID is a unique identifier, not a feature; drop it so it is not
# expanded into one dummy column per customer
df = df.drop(columns=['customerID'])

# Churn and the binary columns are already numeric, so only true
# categorical features remain among the object columns
cat_cols = df.select_dtypes(include='object').columns.tolist()

# One-hot encode with drop_first to avoid redundant dummy columns
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# Final check
print("✅ Final cleaned dataset shape:", df_encoded.shape)
df_encoded.head()

df_encoded.to_csv("../data/cleaned_churn_data.csv", index=False)


✅ Final cleaned dataset shape: (7032, 31)
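
One-hot encoding adds one column per distinct value, so a column with a unique value in every row (an identifier, for instance) can inflate the frame to thousands of columns. A minimal sketch, on a hypothetical toy frame, for spotting such columns before encoding:

```python
import pandas as pd

# Toy data (hypothetical): an ID-like column plus a real categorical feature
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "Contract": ["Monthly", "Yearly", "Monthly", "Yearly"],
    "tenure": [1, 24, 3, 36],
})

# Number of distinct values in each non-numeric column
cardinality = toy.select_dtypes(exclude="number").nunique()

# Columns with a distinct value in every row behave like identifiers
id_like = cardinality[cardinality == len(toy)].index.tolist()

# Encode only after the identifier-like columns are removed
encoded = pd.get_dummies(toy.drop(columns=id_like), drop_first=True)

print(id_like, encoded.shape)
```

The "one distinct value per row" threshold is an assumption chosen for illustration; a looser cutoff may suit other datasets.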


## ✅ Data Cleaning Summary & Key Insights

- ✅ Converted `TotalCharges` from string to numeric and removed 11 rows with invalid values.
- ✅ Mapped binary columns (`Yes`/`No`) such as `Partner`, `Dependents`, and `PaperlessBilling` into numeric (1/0) format.
- ✅ Target variable `Churn` was encoded as a binary label (1 for churned, 0 for retained).
- ✅ Performed one-hot encoding on all remaining categorical variables to ensure model compatibility.
- ✅ Saved the final cleaned dataset to `../data/cleaned_churn_data.csv`; it is now ready for modeling.
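
One caveat with the `Yes`/`No` mapping is that `Series.map` silently returns `NaN` for any value missing from the dictionary. A small sketch, using a hypothetical toy column with an inconsistent spelling, of how unmapped values could be surfaced:

```python
import pandas as pd

# Toy data (hypothetical): the last entry uses a lowercase spelling
toy = pd.DataFrame({"Partner": ["Yes", "No", "yes"]})

mapped = toy["Partner"].map({"Yes": 1, "No": 0})

# Values that were not in the mapping dictionary come back as NaN
unmapped = toy.loc[mapped.isna(), "Partner"].unique().tolist()

print(unmapped)
```

An empty `unmapped` list indicates that every value was covered by the mapping.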

These cleaning steps give every column a consistent numeric type, keep models from reading a spurious order into categorical values, and leave the dataset ready for machine learning workflows.
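
Whether a frame is ready for modeling can be checked programmatically. The helper below, `check_model_ready`, is a hypothetical name introduced for this sketch and is not part of the notebook; it flags non-numeric columns and missing values:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def check_model_ready(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for col in frame.columns:
        # Dummy columns may be boolean, which is acceptable for modeling
        if not (is_numeric_dtype(frame[col]) or is_bool_dtype(frame[col])):
            problems.append(f"{col}: non-numeric dtype {frame[col].dtype}")
        if frame[col].isna().any():
            problems.append(f"{col}: contains missing values")
    return problems


# Toy data (hypothetical): one leftover string column should be flagged
toy = pd.DataFrame({
    "tenure": [1, 2],
    "Contract_Yearly": [True, False],
    "customerID": ["A1", "B2"],
})
issues = check_model_ready(toy)
print(issues)
```

Running the same check on `df_encoded` before saving would catch leftover string columns or `NaN` values early.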
