# 🧹 Customer Churn – Data Preprocessing

This notebook cleans and preprocesses the churn dataset to prepare it for modeling. It includes:
- Dropping unnecessary columns
- Encoding categorical features
- Handling class imbalance (optional)
- Exporting the cleaned dataset


In [None]:
# 📦 Import libraries
import pandas as pd
import numpy as np

# 📂 Load dataset
df = pd.read_csv("../data/customer_churn.csv")

# 👁️ Quick preview
print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns.\n")
df.head()


Dataset has 100 rows and 21 columns.



| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST0000 | Male | 0 | Yes | Yes | 23 | No | No | DSL | Yes | ... | Yes | Yes | No | No | Month-to-month | No | Electronic check | 117.11 | 2693.53 | Yes |
| 1 | CUST0001 | Female | 1 | No | Yes | 71 | Yes | Yes | DSL | Yes | ... | No | Yes | No | No | Month-to-month | No | Mailed check | 114.43 | 8124.53 | Yes |
| 2 | CUST0002 | Male | 1 | Yes | Yes | 35 | No | No | DSL | Yes | ... | No | Yes | No internet service | Yes | One year | Yes | Credit card | 67.42 | 2359.7 | No |
| 3 | CUST0003 | Male | 1 | Yes | No | 37 | Yes | No | No | No internet service | ... | Yes | Yes | Yes | Yes | Two year | No | Bank transfer | 106.2 | 3929.4 | Yes |
| 4 | CUST0004 | Male | 1 | No | No | 24 | No | Yes | DSL | Yes | ... | No | No | Yes | No internet service | One year | Yes | Credit card | 104.45 | 2506.8 | Yes |


### 🧹 Drop Unnecessary Columns

In this step, we remove columns that do not contribute meaningful information for modeling. The `customerID` column is an identifier and does not carry predictive value, so we drop it.


In [None]:
# ❌ Drop non-informative columns
df.drop(columns=["customerID"], inplace=True)

# ✅ Confirm the change
print("Remaining columns:", df.columns.tolist())


Remaining columns: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
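
The claim that `customerID` is a pure identifier can be checked rather than assumed. The sketch below is only an illustration: the toy frame and the helper `find_identifier_columns` are hypothetical and not part of this notebook. It flags non-numeric columns in which every value is distinct, which is the typical signature of an ID column.

```python
import pandas as pd

# Toy frame standing in for the churn data (values are made up for illustration).
toy = pd.DataFrame({
    "customerID": ["CUST0000", "CUST0001", "CUST0002", "CUST0003"],
    "gender": ["Male", "Female", "Male", "Male"],
    "tenure": [23, 71, 35, 37],
})


def find_identifier_columns(frame):
    """Return non-numeric columns whose values are all distinct (hypothetical helper)."""
    text_cols = frame.select_dtypes(exclude="number").columns
    return [col for col in text_cols if frame[col].nunique() == len(frame)]


id_like = find_identifier_columns(toy)
print("Identifier-like columns:", id_like)
```

Numeric columns are skipped on purpose, since a continuous feature such as `TotalCharges` can also be unique per row without being an identifier.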


### 🔤 Encode Categorical Variables

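Most machine learning models require numerical input. In this step, we convert the categorical features into numeric indicator columns using one-hot encoding. Passing `drop_first=True` drops one level per feature, which avoids redundant, perfectly collinear columns. The target `Churn` is encoded along with the features and becomes `Churn_Yes`.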


In [None]:
# 🔍 Identify categorical columns
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# 🧠 Apply one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# ✅ Show shape and sample
print(f"Encoded dataset has {df_encoded.shape[1]} columns.")
df_encoded.head()


Categorical columns to encode: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
Encoded dataset has 31 columns.


| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | ... | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 23 | 117.11 | 2693.53 | True | True | True | False | False | False | ... | False | False | False | False | False | False | False | True | False | True |
| 1 | 1 | 71 | 114.43 | 8124.53 | False | False | True | True | False | True | ... | False | False | False | False | False | False | False | False | True | True |
| 2 | 1 | 35 | 67.42 | 2359.7 | True | True | True | False | False | False | ... | False | False | True | True | False | True | True | False | False | False |
| 3 | 1 | 37 | 106.2 | 3929.4 | True | True | False | True | False | False | ... | True | False | True | False | True | False | False | False | False | True |
| 4 | 1 | 24 | 104.45 | 2506.8 | True | False | False | False | False | True | ... | True | True | False | True | False | True | True | False | False | True |
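
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up two-column frame (the values are illustrative, not taken from the dataset). With three contract types, only two indicator columns are produced; the dropped level, `Month-to-month`, is represented by both indicators being off.

```python
import pandas as pd

# Made-up sample with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 24, 6],
})

# Levels are sorted alphabetically, so "Month-to-month" is the dropped baseline.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(encoded.columns.tolist())
# → ['tenure', 'Contract_One year', 'Contract_Two year']
print(encoded)
```

The same logic explains why the encoded churn data contains `Contract_One year` and `Contract_Two year` but no `Contract_Month-to-month` column.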


### ⚖️ Handle Class Imbalance

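When the target classes (churned vs. not churned) are imbalanced, a model can score well by favoring the majority class while missing most churners. One common remedy is to oversample the minority class with SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic minority samples by interpolating between existing minority samples and their nearest neighbors. If the modeling step evaluates on a held-out test set, resampling is usually applied to the training split only, so that synthetic samples do not leak into the evaluation.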
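
Because this step is optional, it can help to measure the imbalance before deciding to resample. The following is a rough sketch on made-up labels; in the notebook the series would be the `Churn` column. The threshold `MINORITY_THRESHOLD` is a hypothetical rule of thumb chosen for illustration, not a value used by this project.

```python
import pandas as pd

# Made-up target values for illustration: 8 non-churners, 2 churners.
y_toy = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Churn")

# Share of each class as a fraction of all rows.
class_share = y_toy.value_counts(normalize=True)
print(class_share)

# Hypothetical cutoff: treat the data as imbalanced if the rarest class
# makes up less than 40% of the rows.
MINORITY_THRESHOLD = 0.4
needs_resampling = bool(class_share.min() < MINORITY_THRESHOLD)
print("Resampling suggested:", needs_resampling)
```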


In [1]:
# 📦 Install imbalanced-learn if not already available
# !pip install imbalanced-learn  # Uncomment and run this line if needed

from imblearn.over_sampling import SMOTE

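# 🧩 Separate features and target
# Note: df_encoded is created in the encoding cell above. Run that cell first;
# executing this cell in a fresh kernel raises a NameError.
X = df_encoded.drop("Churn_Yes", axis=1)
y = df_encoded["Churn_Yes"]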

# 🔁 Apply SMOTE to balance the classes
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# 📊 Show new class distribution
print("Class distribution after SMOTE:")
print(y_resampled.value_counts())


NameError: name 'df_encoded' is not defined
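
For intuition about what SMOTE does when it oversamples, the sketch below generates synthetic points by interpolating between a minority sample and one of its nearest minority neighbors. It is a simplified illustration written with NumPy only, not imbalanced-learn's implementation; the function name `smote_like_sample`, the toy points, and the choice of `k=2` are all assumptions made for the example.

```python
import numpy as np


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbor.

    Simplified illustration only; assumes all rows of `minority` are distinct.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the chosen sample to every minority sample.
    dists = np.linalg.norm(minority - base, axis=1)
    # Position 0 after sorting is the sample itself (distance 0), so skip it.
    neighbor_idx = np.argsort(dists)[1:k + 1]
    neighbor = minority[rng.choice(neighbor_idx)]
    # Random point on the segment between the sample and its neighbor.
    lam = rng.random()
    return base + lam * (neighbor - base)


# Four made-up minority samples on the corners of a unit square.
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
rng = np.random.default_rng(42)
synthetic = np.array([smote_like_sample(minority, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority samples, which is why the technique suits continuous features better than one-hot indicator columns.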

### 💾 Export Cleaned Dataset

After preprocessing, we export the balanced dataset to the `data/` folder. This file will be used in the modeling step.


In [None]:
# 💾 Export cleaned and balanced dataset
df_final = X_resampled.copy()
df_final["Churn_Yes"] = y_resampled

output_path = "../data/customer_churn_cleaned.csv"
df_final.to_csv(output_path, index=False)
print(f"Cleaned dataset exported to: {output_path}")


Cleaned dataset exported to: ../data/customer_churn_cleaned.csv
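
Before handing the file to the modeling step, a quick round-trip check can confirm that the export is readable and complete. This is a minimal sketch under stated assumptions: it uses a small made-up frame and a temporary directory instead of the real `../data/` path, so it can run on its own.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for df_final with one boolean target column.
toy = pd.DataFrame({
    "tenure": [23, 71, 35],
    "MonthlyCharges": [117.11, 114.43, 67.42],
    "Churn_Yes": [True, True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "customer_churn_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare the reloaded file against the frame that was written.
same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
no_missing = not reloaded.isna().any().any()
print(same_shape, same_columns, no_missing)
```

Writing with `index=False`, as the export cell does, keeps the row index out of the file, so no extra unnamed column appears when the file is read back.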
