Load raw data

In [22]:
import pandas as pd

df = pd.read_csv("../data/raw/Telco-Customer-Churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [23]:
# Convert TotalCharges to numeric; values that cannot be parsed become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
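
The effect of errors='coerce' can be checked on a small example. The sketch below uses a made-up Series (not the Telco data) with one blank entry, and counts how many values end up as NaN after conversion.

```python
import pandas as pd

# Toy column mimicking a string-typed charges field with one blank entry.
# These values are illustrative, not read from the notebook's CSV.
raw = pd.Series(["29.85", " ", "1889.5"])

# Unparseable entries (here the blank string) are turned into NaN.
converted = pd.to_numeric(raw, errors="coerce")

# Counting NaN values shows how many rows will need imputation.
n_missing = int(converted.isna().sum())
print(converted.tolist(), n_missing)
```

Running the same count on df['TotalCharges'] right after the conversion would show how many rows the next step has to fill.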

Handle missing values (business logic)

In [24]:
df['TotalCharges'] = df['TotalCharges'].fillna(
    df['TotalCharges'].median()
)
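
Before settling on a fill value, it can help to look at which rows are missing. The sketch below uses a made-up frame in which the missing charge belongs to a customer with tenure 0; whether the real data follows that pattern is an assumption to verify, not something this notebook establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing TotalCharges value at tenure 0.
toy = pd.DataFrame({
    "tenure": [0, 10, 20, 30],
    "TotalCharges": [np.nan, 300.0, 700.0, 1500.0],
})

# Inspect the rows with missing values before choosing how to fill them.
missing_rows = toy[toy["TotalCharges"].isna()]
print(missing_rows)

# Median imputation, the approach used in the cell above.
median_filled = toy["TotalCharges"].fillna(toy["TotalCharges"].median())
print(median_filled.tolist())
```

If the missing rows did turn out to be brand-new customers, filling with 0 would be a possible alternative to the median, since no charges have accrued yet.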

Create business-driven features

Tenure bucket

In [15]:
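# Bucket tenure (in months) into lifecycle stages.
# include_lowest=True keeps tenure == 0 in the first bucket instead of NaN.
df['tenure_bucket'] = pd.cut(
    df['tenure'],
    bins=[0, 12, 24, 48, 72],
    labels=['0-1yr', '1-2yr', '2-4yr', '4-6yr'],
    include_lowest=True
)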
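
By default pd.cut builds right-closed intervals such as (0, 12], so a value equal to the lowest bin edge falls outside every bucket. The sketch below, on a few made-up tenure values, compares the default behaviour with include_lowest=True.

```python
import pandas as pd

# Illustrative tenure values, including both bin edges (0 and 72).
tenure = pd.Series([0, 1, 12, 13, 72])
bins = [0, 12, 24, 48, 72]
labels = ['0-1yr', '1-2yr', '2-4yr', '4-6yr']

# Default: the first interval is (0, 12], so tenure 0 becomes NaN.
default_cut = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 12].
inclusive_cut = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.astype(str).tolist())
```

A value of exactly 12 lands in '0-1yr' and 13 in '1-2yr', which matches the bucket labels.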

High value customer

In [16]:
df['high_value_customer'] = (df['MonthlyCharges'] > 70).astype(int)

Encode target variable

In [17]:
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
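
Series.map returns NaN for any value that is not a key in the mapping, so a quick check for unmapped labels is a cheap safeguard. The sketch below uses a made-up Series with one unexpected spelling to show how such values can be surfaced.

```python
import pandas as pd

# Toy target column; the lowercase "yes" is a deliberate, hypothetical anomaly.
churn = pd.Series(["Yes", "No", "yes"])

encoded = churn.map({"Yes": 1, "No": 0})

# Any original label that produced NaN was not covered by the mapping.
unmapped = churn[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list here means every label was encoded and the target contains only 0 and 1.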

Encode categorical features

Separate features

In [18]:
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn']

One-hot encoding

In [19]:
X = pd.get_dummies(X, drop_first=True)
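
To see what drop_first=True does, the sketch below encodes a made-up frame with one three-level categorical column. The first level in sorted order is dropped and becomes the implicit baseline, while numeric columns pass through unchanged.

```python
import pandas as pd

# Illustrative frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 34, 60],
})

# Three levels yield two indicator columns; "Month-to-month" is the baseline.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters mainly for linear models.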

Scale numeric features

In [20]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

X[num_cols] = scaler.fit_transform(X[num_cols])
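
The cell above fits the scaler on the full dataset. If a train/test split happens later, one common alternative is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not influence the transform. The sketch below shows that pattern on made-up values; the split parameters are assumptions, not taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative numeric features (not the real dataset).
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75],
})

# Hypothetical split: 25% of rows held out for testing.
X_train, X_test = train_test_split(toy, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the held-out rows.
X_test_scaled = scaler.transform(X_test)
```

With this arrangement the fitted scaler would also need to be saved alongside the features so the same transform can be applied to new data.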

Save processed data

In [21]:
X.to_csv("../data/processed/X_features.csv", index=False)
y.to_csv("../data/processed/y_target.csv", index=False)
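
A quick round-trip check can confirm that the saved files load back with matching row counts. The sketch below writes toy stand-ins for X and y to a temporary directory rather than the notebook's ../data/processed/ paths, then reads them back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the processed features and target.
X = pd.DataFrame({"tenure": [0.5, -0.5], "high_value_customer": [1, 0]})
y = pd.Series([1, 0], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_features.csv"
    y_path = Path(tmp) / "y_target.csv"

    # Same calls as in the notebook, pointed at temporary files.
    X.to_csv(x_path, index=False)
    y.to_csv(y_path, index=False)

    # The target is written with a "Churn" header, so it reloads as a one-column frame.
    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)["Churn"]

print(len(X_back), len(y_back))
```

Because both files are written without an index, features and target stay aligned only by row order, so neither file should be shuffled or filtered independently.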

Feature Engineering Summary

  Missing values in TotalCharges were imputed with the column median, which is less sensitive to extreme billing amounts than the mean.

  Tenure-based buckets were created to capture customer lifecycle stages and non-linear churn behavior.

  A high-value customer flag (MonthlyCharges above 70) was added to identify revenue-critical customers for retention.

  The target variable Churn was encoded as binary, treating churn as the positive class.

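  Categorical features were one-hot encoded, dropping the first level of each, and the numeric features tenure, MonthlyCharges, and TotalCharges were standardized with StandardScaler for model compatibility.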

  Processed features and targets were saved to enable reproducible and modular model training.