# Customer Churn: Feature Engineering

## Objective
This notebook focuses on creating meaningful features from the cleaned dataset
to improve model performance and interpretability.

We will:
- Create new features from existing columns
- Group related services into logical signals
- Convert raw values into business-meaningful indicators
- Preserve interpretability

No encoding or scaling is performed here.


In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [3]:
DATA_PATH = "../data/processed/cleaned_telco_churn.csv"

df = pd.read_csv(DATA_PATH)

print("Dataset Shape:", df.shape)
df.head()


Dataset Shape: (7032, 20)


,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   SeniorCitizen     7032 non-null   int64  
 2   Partner           7032 non-null   object 
 3   Dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   object 
 6   MultipleLines     7032 non-null   object 
 7   InternetService   7032 non-null   object 
 8   OnlineSecurity    7032 non-null   object 
 9   OnlineBackup      7032 non-null   object 
 10  DeviceProtection  7032 non-null   object 
 11  TechSupport       7032 non-null   object 
 12  StreamingTV       7032 non-null   object 
 13  StreamingMovies   7032 non-null   object 
 14  Contract          7032 non-null   object 
 15  PaperlessBilling  7032 non-null   object 
 16  PaymentMethod     7032 non-null   object 


### Feature Engineering Strategy

We will engineer:
1. Service usage intensity
2. Contract risk indicators
3. Customer lifecycle features
4. Billing behavior signals
5. Support & security dependency signals


In [9]:
def tenure_group(tenure):
    """Map tenure (in months) to a coarse lifecycle bucket."""
    if tenure <= 12:
        return "0-1 Year"
    elif tenure <= 24:
        return "1-2 Years"
    elif tenure <= 48:
        return "2-4 Years"
    else:
        return "4+ Years"

df["TenureGroup"] = df["tenure"].apply(tenure_group)

df["TenureGroup"].value_counts()


TenureGroup
4+ Years     2239
0-1 Year     2175
2-4 Years    1594
1-2 Years    1024
Name: count, dtype: int64
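
The same bucketing can also be expressed with `pd.cut`, which avoids a Python-level function call per row and yields an ordered categorical. The following is a minimal sketch on toy tenure values (not rows from the dataset); `include_lowest=True` is used so that a tenure of 0 falls into the first bucket, as it does in `tenure_group`.

```python
import pandas as pd

# Toy tenure values (months) chosen to sit on and around every bin edge.
tenure = pd.Series([0, 1, 12, 13, 24, 25, 48, 49, 72], name="tenure")

# Right-closed bins: [0, 12], (12, 24], (24, 48], (48, inf)
bins = [0, 12, 24, 48, float("inf")]
labels = ["0-1 Year", "1-2 Years", "2-4 Years", "4+ Years"]

tenure_group_cut = pd.cut(
    tenure, bins=bins, labels=labels, right=True, include_lowest=True
)

print(tenure_group_cut.tolist())
```

Unlike the function above, `pd.cut` returns missing values for anything outside the bin range (for example, negative tenure), so out-of-range inputs show up as NaN rather than being silently assigned to a bucket.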

In [11]:
# Monthly charge level: split MonthlyCharges into tertiles (equal-frequency bins)
df["MonthlyChargeLevel"] = pd.qcut(
    df["MonthlyCharges"],
    q=3,
    labels=["Low", "Medium", "High"]
)

df["MonthlyChargeLevel"].value_counts()


MonthlyChargeLevel
Low       2345
Medium    2345
High      2342
Name: count, dtype: int64
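
Because `pd.qcut` derives its bin edges from the data it is given, the Low/Medium/High thresholds are specific to this dataset. If the same levels need to be applied to new customers later, one option is to keep the edges and reuse them with `pd.cut`. The sketch below uses made-up charge values, and widening the outer edges to infinity is an illustrative choice, not something the notebook does.

```python
import numpy as np
import pandas as pd

labels = ["Low", "Medium", "High"]

# Toy "training" charges; not taken from the dataset.
charges = pd.Series([20.0, 25.0, 30.0, 55.0, 60.0, 65.0, 90.0, 100.0, 110.0])

# retbins=True also returns the tertile edges that qcut computed.
levels, edges = pd.qcut(charges, q=3, labels=labels, retbins=True)

# Widen the outer edges so unseen values outside the training range
# still receive a label instead of NaN.
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf

# Apply the stored edges to new (toy) values.
new_charges = pd.Series([18.0, 50.0, 120.0])
new_levels = pd.cut(new_charges, bins=edges, labels=labels)

print(new_levels.tolist())
```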

In [12]:
# Total service count
service_cols = [
    "PhoneService",
    "MultipleLines",
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies"
]

# Count the "Yes" entries per row; values such as "No phone service" do not count
df["TotalServices"] = df[service_cols].eq("Yes").sum(axis=1)

df["TotalServices"].describe()


count    7032.000000
mean        3.363339
std         2.062067
min         0.000000
25%         1.000000
50%         3.000000
75%         5.000000
max         8.000000
Name: TotalServices, dtype: float64

In [13]:
# Internet dependency: "Yes" for any InternetService value other than "No"
df["HasInternet"] = np.where(df["InternetService"].eq("No"), "No", "Yes")

df["HasInternet"].value_counts()


HasInternet
Yes    5512
No     1520
Name: count, dtype: int64

In [15]:
# Support and Security Risk Feature
support_cols = [
    "OnlineSecurity",
    "TechSupport"
]

# HighRisk only when both add-ons are exactly "No"; anything else is LowRisk
df["SupportRisk"] = np.where(
    df[support_cols].eq("No").all(axis=1), "HighRisk", "LowRisk"
)

df["SupportRisk"].value_counts()


SupportRisk
LowRisk     4479
HighRisk    2553
Name: count, dtype: int64
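
One consequence of this rule is worth checking: a row is HighRisk only when both columns are exactly `"No"`. Assuming the cleaned data still uses the value `"No internet service"` in these columns for customers without internet (an assumption; that value does not appear in the rows shown above), those customers are labeled LowRisk. A toy sketch of the behavior:

```python
import numpy as np
import pandas as pd

support_cols = ["OnlineSecurity", "TechSupport"]

# Toy rows covering each case; "No internet service" is an assumed value.
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service", "No"],
    "TechSupport": ["No", "No", "No internet service", "Yes"],
})

# Same rule as the notebook cell: both columns must equal "No".
both_no = toy[support_cols].eq("No").all(axis=1)
toy["SupportRisk"] = np.where(both_no, "HighRisk", "LowRisk")

print(toy)
```

If customers without internet should be treated separately, a third label could be added, but that would change the feature defined in this notebook.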

In [16]:
# Contract Risk Feature
df["ContractRisk"] = df["Contract"].map({
    "Month-to-month": "High",
    "One year": "Medium",
    "Two year": "Low"
})

df["ContractRisk"].value_counts()


ContractRisk
High      3875
Low       1685
Medium    1472
Name: count, dtype: int64
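
The three counts above sum to 7032, so every row was mapped. In general, `Series.map` with a dictionary returns NaN for any value missing from the dictionary, so a quick check for unmapped categories can catch typos or new contract types. A minimal sketch on toy values, where the lowercase `"two year"` entry is a deliberately invalid example:

```python
import pandas as pd

risk_map = {
    "Month-to-month": "High",
    "One year": "Medium",
    "Two year": "Low",
}

# Toy contract values; the last one is intentionally not in risk_map.
contract = pd.Series(["Month-to-month", "One year", "Two year", "two year"])

contract_risk = contract.map(risk_map)

# Values that failed to map show up as NaN in the result.
unmapped = contract[contract_risk.isna()].unique().tolist()
print("Unmapped contract values:", unmapped)
```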

In [18]:
# Average Monthly Spend Feature
df["AvgMonthlySpend"] = (df["TotalCharges"] / df["tenure"]).replace(
    [np.inf, -np.inf], 0
)

df["AvgMonthlySpend"].describe()


count    7032.000000
mean       64.799424
std        30.185891
min        13.775000
25%        36.179891
50%        70.373239
75%        90.179560
max       121.400000
Name: AvgMonthlySpend, dtype: float64
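
Replacing infinities handles a positive charge divided by zero tenure, but `0 / 0` produces NaN rather than infinity and would pass through untouched. The summary above shows 7032 non-null values, so neither case affects this dataset. A hedged sketch of a division that covers both cases, using toy values where the zero-tenure row is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy rows; the last one (zero tenure, zero charges) is hypothetical.
toy = pd.DataFrame({
    "TotalCharges": [29.85, 1889.5, 0.0],
    "tenure": [1, 34, 0],
})

# Mask non-positive tenure to NaN before dividing, then fill with 0.
safe_tenure = toy["tenure"].where(toy["tenure"] > 0)
toy["AvgMonthlySpend"] = toy["TotalCharges"].div(safe_tenure).fillna(0.0)

print(toy)
```

Filling with 0 mirrors the notebook's choice for the infinite case; whether 0 is the right default for brand-new customers is a modeling decision.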

In [20]:
FEATURED_PATH = "../data/processed/featured_telco_churn.csv"

df.to_csv(FEATURED_PATH, index=False)

print(f"Feature-engineered dataset saved to: {FEATURED_PATH}")


Feature-engineered dataset saved to: ../data/processed/featured_telco_churn.csv
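
One thing to keep in mind when reloading this file: CSV stores only text, so the categorical dtype and category order that `pd.qcut` gave `MonthlyChargeLevel` are not preserved. The sketch below round-trips a toy frame through an in-memory buffer (standing in for the file on disk) and restores the ordered categorical afterwards.

```python
import io

import pandas as pd

levels = ["Low", "Medium", "High"]

# Toy frame with an ordered categorical, similar to what pd.qcut produces.
toy = pd.DataFrame({
    "MonthlyChargeLevel": pd.Categorical(
        ["Low", "High", "Medium"], categories=levels, ordered=True
    )
})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the column is plain text; restore the ordered dtype.
level_dtype = pd.CategoricalDtype(categories=levels, ordered=True)
reloaded["MonthlyChargeLevel"] = reloaded["MonthlyChargeLevel"].astype(level_dtype)

print(reloaded["MonthlyChargeLevel"].dtype)
```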
