## Telco Customer Churn
### Data Cleaning & Preparation

**Objective**
Apply well-justified data cleaning and transformation rules based on
findings from previous EDA steps, while preserving business meaning
and analytical integrity.

### 1. Loading libraries and data

In [1]:
# loading libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import missingno as msno

import warnings
warnings.filterwarnings("ignore")

In [2]:
# loading data

df = pd.read_csv("../1_dataset/raw_data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

### 2. Re-evaluate Known Data Quality Issues

In [3]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors = "coerce")
df["TotalCharges"].isna().sum()

np.int64(11)

In [4]:
df[df["TotalCharges"].isna()][["tenure", "MonthlyCharges", "TotalCharges"]].head()

| | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| 488 | 0 | 52.55 | NaN |
| 753 | 0 | 20.25 | NaN |
| 936 | 0 | 80.85 | NaN |
| 1082 | 0 | 25.75 | NaN |
| 1340 | 0 | 56.05 | NaN |


- `TotalCharges` contains 11 missing values, caused by blank strings that
  `pd.to_numeric(..., errors = "coerce")` converts to `NaN`.
- Every row shown above has `tenure` = 0, so these cases sit at the very
  start of the customer lifecycle.
- They likely represent new customers who have not yet been billed,
  not invalid records.
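
The `head()` call above only displays five rows, so it helps to confirm the tenure pattern across all rows with missing `TotalCharges`. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative assumptions, not rows from the dataset); the same two checks could be run on `df`.

```python
import pandas as pd

# Hypothetical toy frame that mimics the raw columns; values are illustrative only.
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 3],
    "MonthlyCharges": [52.55, 70.00, 20.25, 30.00],
    "TotalCharges": [" ", "840.0", " ", "90.0"],
})

# Blank strings cannot be parsed as numbers, so errors="coerce" turns them into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

missing_mask = toy["TotalCharges"].isna()
n_missing = int(missing_mask.sum())

# Distribution of tenure among the rows with missing TotalCharges.
missing_tenure_counts = toy.loc[missing_mask, "tenure"].value_counts()

# True only if every missing row belongs to a customer with tenure 0.
all_zero_tenure = bool((toy.loc[missing_mask, "tenure"] == 0).all())
```

If `all_zero_tenure` is `True` on the real data, the "new customer" interpretation holds for every affected row rather than only for the ones displayed.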


### 3. Decision-Driven Bivariate Check

In [5]:
missing_tc = df[df["TotalCharges"].isna()]

missing_tc["Churn"].value_counts(normalize = True) * 100

Churn
No    100.0
Name: proportion, dtype: float64

In [6]:
df["Churn"].value_counts(normalize = True) * 100

Churn
No     73.463013
Yes    26.536987
Name: proportion, dtype: float64

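- All customers with missing `TotalCharges` are non-churners (100% `No`),
  compared with about 73.5% `No` in the overall population.
- With only 11 such rows, this gap is not strong evidence of a distinct
  churn pattern, but it is consistent with these being active new customers.
- Removing these rows would drop valid new customers and could bias
  early-tenure churn analysis.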
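
To read the two distributions side by side, together with the size of the subset, a small helper can be used. This is a sketch under assumptions: `compare_churn` is a hypothetical helper name, and the toy frame below stands in for `df`.

```python
import numpy as np
import pandas as pd

def compare_churn(frame, mask, target="Churn"):
    """Return churn percentages for a subset vs. the full frame, plus the subset size."""
    subset_pct = frame.loc[mask, target].value_counts(normalize=True) * 100
    overall_pct = frame[target].value_counts(normalize=True) * 100
    table = pd.concat(
        [subset_pct.rename("subset_pct"), overall_pct.rename("overall_pct")],
        axis=1,
    ).fillna(0.0)  # classes absent from the subset get 0%
    return table, int(mask.sum())

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "TotalCharges": [np.nan, np.nan, 100.0, 200.0, 300.0, 400.0],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

table, n_subset = compare_churn(toy, toy["TotalCharges"].isna())
```

Reporting `n_subset` next to the percentages makes it clear when a 100% figure rests on only a handful of rows.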

### 4. Cleaning Rule Definition

- We will keep all rows and fill null `TotalCharges` with `MonthlyCharges` × `tenure`.

In [7]:
# rows where TotalCharges is missing (computed once and reused)
missing_mask = df["TotalCharges"].isna()

df.loc[missing_mask, "TotalCharges"] = (
    df.loc[missing_mask, "MonthlyCharges"] * df.loc[missing_mask, "tenure"]
)

- This approach preserves all customer records while creating a
  business-consistent approximation of total charges. For customers with
  `tenure` = 0 the filled value is 0, i.e. nothing has been billed yet.
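
The rule can also be wrapped in a function and followed by a few sanity checks. The sketch below is illustrative: `fill_total_charges` is a hypothetical helper, and the toy frame is an assumption standing in for `df`.

```python
import numpy as np
import pandas as pd

def fill_total_charges(frame):
    """Fill missing TotalCharges with MonthlyCharges * tenure on a copy of the frame."""
    out = frame.copy()
    mask = out["TotalCharges"].isna()
    out.loc[mask, "TotalCharges"] = (
        out.loc[mask, "MonthlyCharges"] * out.loc[mask, "tenure"]
    )
    return out, mask

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "tenure": [0, 12, 0],
    "MonthlyCharges": [52.55, 70.00, 20.25],
    "TotalCharges": [np.nan, 840.0, np.nan],
})

cleaned, filled_mask = fill_total_charges(toy)

# Post-cleaning checks: no nulls remain and no rows were dropped.
assert cleaned["TotalCharges"].isna().sum() == 0
assert len(cleaned) == len(toy)

# Rows that were already populated must be unchanged.
assert cleaned.loc[~filled_mask, "TotalCharges"].equals(
    toy.loc[~filled_mask, "TotalCharges"]
)
```

Working on a copy keeps the raw frame available for comparison, which makes it easy to verify that only the originally missing rows were modified.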