# Feature Engineering for Customer Churn Prediction

## Objective
The goal of this notebook is to:
- Transform raw customer data into meaningful features
- Encode business signals identified during EDA
- Prepare a clean, model-ready dataset for churn prediction

In [1]:
# Import Packages
import pandas as pd

In [2]:
# Load Data
df = pd.read_csv("../data/processed/cleaned_churn_data_jupyter.csv")

In [3]:
# Check Columns
print(df.columns)

Index(['customer_id', 'gender', 'senior_citizen', 'tenure_months',
       'contract_type', 'monthly_charges', 'total_charges', 'payment_method',
       'avg_monthly_usage', 'usage_trend', 'support_tickets_last_3m', 'churn'],
      dtype='object')


In [4]:
# Display Data Shape
df.shape

(10000, 12)

In [5]:
# Display Dataset Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   customer_id              10000 non-null  object 
 1   gender                   10000 non-null  int64  
 2   senior_citizen           10000 non-null  int64  
 3   tenure_months            10000 non-null  int64  
 4   contract_type            10000 non-null  int64  
 5   monthly_charges          10000 non-null  float64
 6   total_charges            10000 non-null  float64
 7   payment_method           10000 non-null  int64  
 8   avg_monthly_usage        10000 non-null  float64
 9   usage_trend              10000 non-null  int64  
 10  support_tickets_last_3m  10000 non-null  int64  
 11  churn                    10000 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 937.6+ KB


In [6]:
# Display Summary
df.describe()

| | gender | senior_citizen | tenure_months | contract_type | monthly_charges | total_charges | payment_method | avg_monthly_usage | usage_trend | support_tickets_last_3m | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 0.5013 | 0.1511 | 23.4746 | 0.6523 | 70.380092 | 1649.044038 | 1.4882 | 300.02333 | 1.1395 | 1.1919 | 0.3172 |
| std | 0.500023 | 0.358164 | 16.187037 | 0.79465 | 24.575732 | 1341.289226 | 1.11552 | 100.354051 | 0.852826 | 1.081384 | 0.465409 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 20.0 | 20.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 11.0 | 0.0 | 52.8975 | 684.9825 | 0.0 | 231.8 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 20.0 | 0.0 | 70.015 | 1290.14 | 1.0 | 301.3 | 1.0 | 1.0 | 0.0 |
| 75% | 1.0 | 0.0 | 32.0 | 1.0 | 87.2525 | 2225.5225 | 2.0 | 367.9 | 2.0 | 2.0 | 1.0 |
| max | 1.0 | 1.0 | 72.0 | 2.0 | 150.0 | 10050.0 | 3.0 | 600.0 | 2.0 | 7.0 | 1.0 |


In [7]:
# Display First Five Rows
df.head()

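| | customer_id | gender | senior_citizen | tenure_months | contract_type | monthly_charges | total_charges | payment_method | avg_monthly_usage | usage_trend | support_tickets_last_3m | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST_1 | 1 | 0 | 58 | 0 | 40.63 | 2356.54 | 0 | 335.3 | 2 | 1 | 0 |
| 1 | CUST_2 | 0 | 0 | 19 | 0 | 88.05 | 1672.95 | 0 | 271.9 | 2 | 2 | 0 |
| 2 | CUST_3 | 1 | 1 | 12 | 0 | 44.73 | 536.76 | 1 | 227.8 | 0 | 2 | 0 |
| 3 | CUST_4 | 1 | 0 | 11 | 0 | 84.89 | 933.79 | 2 | 124.0 | 2 | 3 | 1 |
| 4 | CUST_5 | 1 | 1 | 4 | 0 | 82.63 | 330.52 | 2 | 425.4 | 0 | 1 | 0 |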


## Data Cleaning

In [8]:
# Check Null Values
print(df.isnull().sum())

customer_id                0
gender                     0
senior_citizen             0
tenure_months              0
contract_type              0
monthly_charges            0
total_charges              0
payment_method             0
avg_monthly_usage          0
usage_trend                0
support_tickets_last_3m    0
churn                      0
dtype: int64


In [9]:
# Check NaN Values (isna() is an alias of isnull(), so this repeats the check above)
print(df.isna().sum())

customer_id                0
gender                     0
senior_citizen             0
tenure_months              0
contract_type              0
monthly_charges            0
total_charges              0
payment_method             0
avg_monthly_usage          0
usage_trend                0
support_tickets_last_3m    0
churn                      0
dtype: int64


In [10]:
# Check Duplicates
duplicated = df.duplicated()

print(duplicated)

0       False
1       False
2       False
3       False
4       False
        ...  
9995    False
9996    False
9997    False
9998    False
9999    False
Length: 10000, dtype: bool
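Printing the full boolean Series truncates the middle rows, so a `True` value could go unnoticed. A minimal sketch of reducing the check to a single count is shown here; the `toy` frame and its values are hypothetical stand-ins for `df`, not data from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; rows 1 and 2 are identical.
toy = pd.DataFrame({
    "customer_id": ["CUST_1", "CUST_2", "CUST_2"],
    "monthly_charges": [40.63, 88.05, 88.05],
})

# Count rows that exactly repeat an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Count rows whose customer_id repeats an earlier one, even if other columns differ.
n_duplicate_ids = int(toy.duplicated(subset="customer_id").sum())

print(n_duplicate_rows, n_duplicate_ids)
```

On the real data, the same two counts on `df` would confirm whether any rows or customer IDs repeat.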


## Feature Engineering

In [11]:
# Business-Driven Features

# Tenure Bucket
df["tenure_bucket"] = pd.cut(
    df["tenure_months"],
    bins=[0, 12, 36, 72],
    labels=["New", "Mid", "Long"]
)

**Tenure Bucket**

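Customers are segmented into New (1 to 12 months), Mid (13 to 36 months), and Long (37 to 72 months) tenure groups.
Each bin includes its upper edge, so a tenure of exactly 12 months falls in New.
New customers are typically more likely to churn.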
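The bin edges can be checked directly. The sketch below applies the same `bins` and `labels` to a handful of hand-picked boundary values (illustrative, not taken from the dataset) to show which bucket each one lands in.

```python
import pandas as pd

# Boundary values chosen by hand to probe each bin edge.
tenure = pd.Series([1, 12, 13, 36, 37, 72])

# Same bins and labels as the notebook cell; pd.cut is right-inclusive by default,
# so the intervals are (0, 12], (12, 36], (36, 72].
buckets = pd.cut(tenure, bins=[0, 12, 36, 72], labels=["New", "Mid", "Long"])

print(buckets.tolist())
# ['New', 'New', 'Mid', 'Mid', 'Long', 'Long']

# Values outside (0, 72] get no bucket and become NaN.
out_of_range = pd.cut(pd.Series([0, 73]), bins=[0, 12, 36, 72], labels=["New", "Mid", "Long"])
print(out_of_range.isna().tolist())
# [True, True]
```

Because tenure in this dataset ranges from 1 to 72, every customer receives a bucket; new data with longer tenures would need a wider top bin.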

In [12]:
# High-Value Customer Flag
df["high_value_customer"] = (df["monthly_charges"] > 90).astype(int)

**High-Value Customer Flag**

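Flags customers whose monthly charges exceed 90, a threshold just above the
75th percentile (87.25), so at most a quarter of customers qualify. The flag
helps quantify the revenue risk associated with churn.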
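One way the flag might be used to quantify revenue risk is sketched below. The `toy` frame, its values, and the `monthly_revenue_at_risk` name are hypothetical; only the `> 90` threshold comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "monthly_charges": [40.63, 88.05, 95.00, 120.50],
    "churn": [0, 0, 1, 0],
})

# Same rule as the notebook: strictly greater than 90.
toy["high_value_customer"] = (toy["monthly_charges"] > 90).astype(int)

# Share of customers flagged as high value.
share_flagged = toy["high_value_customer"].mean()

# Monthly charges attached to high-value customers who churned.
churned_high_value = (toy["high_value_customer"] == 1) & (toy["churn"] == 1)
monthly_revenue_at_risk = toy.loc[churned_high_value, "monthly_charges"].sum()

print(share_flagged, monthly_revenue_at_risk)
```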

In [13]:
# Support Intensity Level
df["support_intensity"] = pd.cut(
    df["support_tickets_last_3m"],
    bins=[-1, 1, 3, 10],
    labels=["Low", "Medium", "High"]
)

**Support Intensity**

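Ticket counts over the last three months are grouped into Low (0 to 1), Medium (2 to 3), and High (4 to 10).
Higher support interaction often signals dissatisfaction and churn risk.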
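Whether higher intensity actually tracks churn can be checked with a grouped churn rate. The following is a minimal sketch on hypothetical toy data; the numbers are invented to show the mechanics and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "support_tickets_last_3m": [0, 1, 2, 3, 4, 7],
    "churn": [0, 0, 0, 1, 1, 1],
})

# Same bins and labels as the notebook cell: (-1, 1], (1, 3], (3, 10].
toy["support_intensity"] = pd.cut(
    toy["support_tickets_last_3m"],
    bins=[-1, 1, 3, 10],
    labels=["Low", "Medium", "High"],
)

# Mean of the 0/1 churn column per level is the churn rate for that level.
churn_rate = toy.groupby("support_intensity", observed=False)["churn"].mean()

print(churn_rate.to_dict())
```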

In [14]:
# Encode Categorical Features

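# Most machine learning models require numeric inputs, so the two bucketed
# columns are one-hot encoded. drop_first=True drops the first level of each
# ("New" and "Low"), which then serves as the baseline category.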

df_encoded = pd.get_dummies(
    df,
    columns=["tenure_bucket", "support_intensity"],
    drop_first=True
)
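The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a hypothetical three-row frame; `dtype=int` is an optional extra, not used in the notebook, that yields 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical toy frame with one row per tenure bucket.
toy = pd.DataFrame({
    "tenure_bucket": pd.Categorical(
        ["New", "Mid", "Long"], categories=["New", "Mid", "Long"]
    ),
})

# drop_first=True removes the first category ("New"); a row with zeros in
# every remaining dummy column therefore means "New".
encoded = pd.get_dummies(toy, columns=["tenure_bucket"], drop_first=True, dtype=int)

print(encoded)
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters mainly for linear models; tree-based models generally work with or without it.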

In [15]:
# Check And Review
df_encoded.head()

| | customer_id | gender | senior_citizen | tenure_months | contract_type | monthly_charges | total_charges | payment_method | avg_monthly_usage | usage_trend | support_tickets_last_3m | churn | high_value_customer | tenure_bucket_Mid | tenure_bucket_Long | support_intensity_Medium | support_intensity_High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST_1 | 1 | 0 | 58 | 0 | 40.63 | 2356.54 | 0 | 335.3 | 2 | 1 | 0 | 0 | False | True | False | False |
| 1 | CUST_2 | 0 | 0 | 19 | 0 | 88.05 | 1672.95 | 0 | 271.9 | 2 | 2 | 0 | 0 | True | False | True | False |
| 2 | CUST_3 | 1 | 1 | 12 | 0 | 44.73 | 536.76 | 1 | 227.8 | 0 | 2 | 0 | 0 | False | False | True | False |
| 3 | CUST_4 | 1 | 0 | 11 | 0 | 84.89 | 933.79 | 2 | 124.0 | 2 | 3 | 1 | 0 | False | False | True | False |
| 4 | CUST_5 | 1 | 1 | 4 | 0 | 82.63 | 330.52 | 2 | 425.4 | 0 | 1 | 0 | 0 | False | False | False | False |


In [16]:
# Final Feature Review
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               10000 non-null  object 
 1   gender                    10000 non-null  int64  
 2   senior_citizen            10000 non-null  int64  
 3   tenure_months             10000 non-null  int64  
 4   contract_type             10000 non-null  int64  
 5   monthly_charges           10000 non-null  float64
 6   total_charges             10000 non-null  float64
 7   payment_method            10000 non-null  int64  
 8   avg_monthly_usage         10000 non-null  float64
 9   usage_trend               10000 non-null  int64  
 10  support_tickets_last_3m   10000 non-null  int64  
 11  churn                     10000 non-null  int64  
 12  high_value_customer       10000 non-null  int64  
 13  tenure_bucket_Mid         10000 non-null  bool   
 14  tenure_

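Apart from the customer_id identifier, which is still an object column and
should be dropped before modeling, the dataset now contains only numerical and
one-hot encoded (boolean) features, making it suitable for machine learning models.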
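Before modeling, the identifier and the target are usually separated from the feature matrix. A minimal sketch follows, assuming a hypothetical `toy_encoded` frame that mirrors a few columns of `df_encoded`; the names `X`, `y`, and `non_numeric` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df_encoded with a subset of its columns.
toy_encoded = pd.DataFrame({
    "customer_id": ["CUST_1", "CUST_2"],
    "tenure_months": [58, 19],
    "tenure_bucket_Mid": [False, True],
    "churn": [0, 0],
})

# customer_id is an identifier, not a predictor; churn is the target.
X = toy_encoded.drop(columns=["customer_id", "churn"])
y = toy_encoded["churn"]

# Sanity check: list any feature column that is neither numeric nor boolean.
non_numeric = X.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(list(X.columns), non_numeric)
```

An empty `non_numeric` list indicates that every remaining feature can be passed to a typical estimator as-is.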

In [17]:
# Save Feature-Engineered Dataset
df_encoded.to_csv(
    "../data/processed/feature_engineered_churn_jupyter.csv",
    index=False
)

## Feature Engineering Summary

- Converted tenure into categorical buckets to capture lifecycle effects
- Created a high-value customer indicator to quantify revenue risk
- Grouped support interactions to represent customer friction
- Applied one-hot encoding to prepare features for modeling

These features reflect real-world churn drivers identified during EDA.