The goal of this notebook is to transform raw customer attributes into meaningful, model-ready features informed by EDA insights, while removing redundant or non-informative variables.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [2]:
# Load cleaned data
df = pd.read_csv("cleaned_telco_churn.csv")


In [3]:
df.head(10)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
5,Female,0,No,No,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
6,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
7,Female,0,No,No,10,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
8,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
9,Male,0,No,Yes,62,Yes,No,DSL,Yes,Yes,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No


In [4]:
df.shape

(7032, 20)

<h2>FEATURE ENGINEERING</h2>


In [5]:
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

Many estimators and metrics expect a numeric target, so Churn is encoded as 1 (Yes) and 0 (No), the standard binary encoding.
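A minimal sketch of a sanity check for this step, using toy labels rather than the real column. `Series.map` turns any value missing from the dictionary into NaN, so counting NaNs after the mapping catches unexpected spellings. The names `churn_raw`, `churn_encoded`, and `n_unmapped` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy labels, including one unexpected spelling ("yes") to show the failure mode.
churn_raw = pd.Series(["Yes", "No", "No", "yes"])

churn_encoded = churn_raw.map({"Yes": 1, "No": 0})

# Values absent from the mapping become NaN, so count them before modeling.
n_unmapped = int(churn_encoded.isna().sum())
print(f"unmapped labels: {n_unmapped}")
```

On the real data the count would be expected to be zero if the cleaning step already normalized the labels.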

In [6]:
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-1yr", "1-2yr", "2-4yr", "4-6yr"]
)


EDA showed that churn is highest at early tenure

Buckets capture customer lifecycle stages

Helps linear models fit a non-linear tenure effect and aids interpretability
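The bin edges above behave in a way that is easy to miss: by default `pd.cut` builds right-closed intervals and excludes the lowest edge, so a tenure of exactly 0 would fall outside every bucket. The sketch below uses illustrative tenure values (not the real column) to show this, along with the `include_lowest=True` option as one possible safeguard.

```python
import pandas as pd

# Illustrative values sitting on and around the bin edges.
tenure = pd.Series([0, 1, 12, 13, 24, 48, 72])

bins = [0, 12, 24, 48, 72]
labels = ["0-1yr", "1-2yr", "2-4yr", "4-6yr"]

# Default behaviour: intervals are (0, 12], (12, 24], (24, 48], (48, 72].
groups = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True makes the first interval include 0 as well.
groups_inclusive = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"tenure": tenure, "default": groups, "inclusive": groups_inclusive}))
```

If the cleaned data contains no zero-tenure rows, both calls give the same buckets.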

In [7]:
df["high_monthly_charge"] = (
    df["MonthlyCharges"] > df["MonthlyCharges"].median()
).astype(int)

Captures price sensitivity

Median threshold is robust to outliers

Easy to interpret
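To illustrate the robustness point, here is a small sketch with made-up charges. One extreme value moves the mean a long way but leaves the median, and therefore the flag threshold, unchanged. All values and variable names are hypothetical.

```python
import pandas as pd

# Made-up monthly charges, with and without one extreme value.
charges = pd.Series([20.0, 50.0, 70.0, 90.0, 110.0])
charges_with_outlier = pd.Series([20.0, 50.0, 70.0, 90.0, 10_000.0])

median_plain = charges.median()
median_outlier = charges_with_outlier.median()
mean_outlier = charges_with_outlier.mean()

# Same rule as the notebook: strictly above the median counts as "high".
flag = (charges_with_outlier > median_outlier).astype(int)

print(median_plain, median_outlier, mean_outlier)
print(flag.tolist())
```

Because the comparison is strict, a customer whose charge equals the median is flagged 0.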

In [8]:
contract_map = {
    "Month-to-month": 0,
    "One year": 1,
    "Two year": 2
}

df["contract_length"] = df["Contract"].map(contract_map)


Contract types have a natural order by commitment length

Month-to-month has the highest churn

Avoids unnecessary one-hot encoding
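As a rough comparison, the sketch below encodes a toy `Contract` series both ways. The ordinal map yields one integer column, while one-hot encoding yields one column per contract type. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Toy contract values covering all three levels.
contracts = pd.Series(
    ["Month-to-month", "One year", "Two year", "Month-to-month"],
    name="Contract",
)

contract_map = {"Month-to-month": 0, "One year": 1, "Two year": 2}

# Ordinal encoding: a single integer column.
ordinal = contracts.map(contract_map)

# One-hot encoding: one indicator column per level.
one_hot = pd.get_dummies(contracts, prefix="Contract")

print(ordinal.tolist())
print(list(one_hot.columns))
```

One design consequence worth keeping in mind: with the 0/1/2 coding, a linear model treats the step from month-to-month to one year as the same size as the step from one year to two years.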

In [9]:
df["auto_payment"] = df["PaymentMethod"].isin([
    "Bank transfer (automatic)",
    "Credit card (automatic)"
]).astype(int)

Auto-pay customers churn less

Strong behavioral signal

Reduces categorical complexity
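A small sketch of the same flag on the four payment methods that appear in the data. It also shows an alternative, matching the literal substring "(automatic)", which gives the same result here; that alternative is an illustration and not the notebook's method.

```python
import pandas as pd

# The four payment methods seen in the dataset.
methods = pd.Series([
    "Electronic check",
    "Mailed check",
    "Bank transfer (automatic)",
    "Credit card (automatic)",
])

# Explicit list, as in the notebook.
by_list = methods.isin([
    "Bank transfer (automatic)",
    "Credit card (automatic)",
]).astype(int)

# Alternative: literal substring match (regex=False keeps the parentheses literal).
by_substring = methods.str.contains("(automatic)", regex=False).astype(int)

print(by_list.tolist(), by_substring.tolist())
```

The explicit list is stricter: a new payment method would default to 0 rather than being matched by accident.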

In [10]:
support_cols = ["OnlineSecurity", "TechSupport"]

df["has_support_services"] = (
    (df[support_cols] == "Yes").any(axis=1)
).astype(int)

EDA showed that customers with support services churn less

Aggregating reduces noise
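The aggregation can be sketched on a toy frame as follows. The comparison `== "Yes"` treats both "No" and "No internet service" as False, so customers without internet get a 0, the same as customers who declined the services. The toy rows and the extra `n_support` count are assumptions added for illustration.

```python
import pandas as pd

# Toy rows: one service each, neither, and no internet at all.
services = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No", "No internet service"],
    "TechSupport": ["No", "Yes", "No", "No internet service"],
})

is_yes = services == "Yes"

# 1 if the customer has at least one of the two services.
has_support = is_yes.any(axis=1).astype(int)

# Optional finer-grained variant: how many of the services the customer has.
n_support = is_yes.sum(axis=1)

print(has_support.tolist(), n_support.tolist())
```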

In [11]:
df["fiber_optic"] = (df["InternetService"] == "Fiber optic").astype(int)

Fiber optic users churn more (EDA insight)

A single binary flag is simpler than one-hot encoding, at the cost of merging DSL and no-internet customers into one level

In [12]:
drop_cols = [
    "Contract",
    "PaymentMethod",
    "OnlineSecurity",
    "TechSupport",
    "InternetService"
]

df = df.drop(columns=drop_cols)

<h2> Encoding Remaining Categorical Features </h2>

In [13]:
cat_cols = df.select_dtypes(include=["object","category"]).columns

df_encoded = pd.get_dummies(
    df,
    columns=cat_cols,
    drop_first=True
)
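A short sketch of what `drop_first=True` does, on a toy frame with hypothetical rows. For each encoded column the first level in sorted order is dropped and becomes the implicit baseline. The `dtype=int` argument shown at the end is an optional variation for models or tools that prefer 0/1 integers over booleans; the notebook itself does not use it.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "MultipleLines": ["No", "Yes", "No phone service"],
    "tenure": [1, 34, 2],
})

# "Female" and "No" sort first, so they are dropped and act as baselines.
encoded = pd.get_dummies(toy, columns=["gender", "MultipleLines"], drop_first=True)

# Same encoding, but with integer indicator columns.
encoded_int = pd.get_dummies(
    toy, columns=["gender", "MultipleLines"], drop_first=True, dtype=int
)

print(list(encoded.columns))
print(encoded_int.dtypes)
```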


In [14]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 28 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   SeniorCitizen                         7032 non-null   int64  
 1   tenure                                7032 non-null   int64  
 2   MonthlyCharges                        7032 non-null   float64
 3   TotalCharges                          7032 non-null   float64
 4   Churn                                 7032 non-null   int64  
 5   high_monthly_charge                   7032 non-null   int64  
 6   contract_length                       7032 non-null   int64  
 7   auto_payment                          7032 non-null   int64  
 8   has_support_services                  7032 non-null   int64  
 9   fiber_optic                           7032 non-null   int64  
 10  gender_Male                           7032 non-null   bool   
 11  Partner_Yes      

In [15]:
df_encoded.shape


(7032, 28)

In [16]:
df_encoded.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'high_monthly_charge', 'contract_length', 'auto_payment',
       'has_support_services', 'fiber_optic', 'gender_Male', 'Partner_Yes',
       'Dependents_Yes', 'PhoneService_Yes', 'MultipleLines_No phone service',
       'MultipleLines_Yes', 'OnlineBackup_No internet service',
       'OnlineBackup_Yes', 'DeviceProtection_No internet service',
       'DeviceProtection_Yes', 'StreamingTV_No internet service',
       'StreamingTV_Yes', 'StreamingMovies_No internet service',
       'StreamingMovies_Yes', 'PaperlessBilling_Yes', 'tenure_group_1-2yr',
       'tenure_group_2-4yr', 'tenure_group_4-6yr'],
      dtype='object')

In [17]:
drop_cols = [
    "OnlineBackup_No internet service",
    "OnlineBackup_Yes",
    "DeviceProtection_No internet service",
    "DeviceProtection_Yes",
    "StreamingTV_No internet service",
    "StreamingTV_Yes",
    "StreamingMovies_No internet service",
    "StreamingMovies_Yes",
   
]

df_encoded = df_encoded.drop(columns=drop_cols)


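Note that has_support_services is built only from OnlineSecurity and TechSupport, so it does not capture OnlineBackup, DeviceProtection, StreamingTV, or StreamingMovies

These columns are dropped to simplify the feature set, on the assumption that they add little signal beyond the engineered features; this should be checked during modeling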

Reduces dimensionality

Improves interpretability

Reduces overfitting risk
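One way to check which dummy columns are strictly redundant is to look for columns with identical values. The sketch below does this on a toy frame (hypothetical rows, two services only). The "No internet service" indicators duplicate each other, while the "Yes" indicators do not, so the latter carry their own information.

```python
import pandas as pd

toy = pd.DataFrame({
    "OnlineBackup": ["Yes", "No", "No internet service"],
    "StreamingTV": ["No", "Yes", "No internet service"],
})

enc = pd.get_dummies(toy, drop_first=True)

# Transpose so each column becomes a row, then flag repeated rows.
is_duplicate = enc.T.duplicated().to_numpy()
duplicate_cols = enc.columns[is_duplicate].tolist()

# Direct comparison of the two "Yes" indicators.
yes_cols_identical = bool((enc["OnlineBackup_Yes"] == enc["StreamingTV_Yes"]).all())

print(duplicate_cols)
print(yes_cols_identical)
```

Running the same check on `df_encoded` before the drop would show which columns can be removed with no loss of information and which removals are a modeling choice.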

In [18]:
df_encoded.columns

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'high_monthly_charge', 'contract_length', 'auto_payment',
       'has_support_services', 'fiber_optic', 'gender_Male', 'Partner_Yes',
       'Dependents_Yes', 'PhoneService_Yes', 'MultipleLines_No phone service',
       'MultipleLines_Yes', 'PaperlessBilling_Yes', 'tenure_group_1-2yr',
       'tenure_group_2-4yr', 'tenure_group_4-6yr'],
      dtype='object')

In [19]:
df_encoded.head(10)

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,high_monthly_charge,contract_length,auto_payment,has_support_services,fiber_optic,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,PaperlessBilling_Yes,tenure_group_1-2yr,tenure_group_2-4yr,tenure_group_4-6yr
0,0,1,29.85,29.85,0,0,0,0,0,0,False,True,False,False,True,False,True,False,False,False
1,0,34,56.95,1889.5,0,0,1,0,1,0,True,False,False,True,False,False,False,False,True,False
2,0,2,53.85,108.15,1,0,0,0,1,0,True,False,False,True,False,False,True,False,False,False
3,0,45,42.3,1840.75,0,0,1,1,1,0,True,False,False,False,True,False,False,False,True,False
4,0,2,70.7,151.65,1,1,0,0,0,1,False,False,False,True,False,False,True,False,False,False
5,0,8,99.65,820.5,1,1,0,0,0,1,False,False,False,True,False,True,True,False,False,False
6,0,22,89.1,1949.4,0,1,0,1,0,1,True,False,True,True,False,True,True,True,False,False
7,0,10,29.75,301.9,0,0,0,0,1,0,False,False,False,False,True,False,False,False,False,False
8,0,28,104.8,3046.05,1,1,0,0,1,1,False,True,False,True,False,True,True,False,True,False
9,0,62,56.15,3487.95,0,0,1,1,1,0,True,False,True,True,False,False,False,False,False,True


In [20]:
df_encoded.to_csv("telco_model_ready.csv", index=False)

After feature engineering, I removed the remaining service-level dummy variables (OnlineBackup, DeviceProtection, StreamingTV, StreamingMovies) to keep the feature set compact. This improves interpretability; whether it sacrifices predictive signal has not been measured in this notebook and should be verified during modeling.

In [21]:
df_encoded.dtypes

SeniorCitizen                       int64
tenure                              int64
MonthlyCharges                    float64
TotalCharges                      float64
Churn                               int64
high_monthly_charge                 int64
contract_length                     int64
auto_payment                        int64
has_support_services                int64
fiber_optic                         int64
gender_Male                          bool
Partner_Yes                          bool
Dependents_Yes                       bool
PhoneService_Yes                     bool
MultipleLines_No phone service       bool
MultipleLines_Yes                    bool
PaperlessBilling_Yes                 bool
tenure_group_1-2yr                   bool
tenure_group_2-4yr                   bool
tenure_group_4-6yr                   bool
dtype: object