# Step 6: Feature Engineering

## Preprocessing Pipeline Overview

This preprocessing pipeline outlines the steps needed to prepare the Telco Customer Churn dataset for modeling. Each step addresses a specific aspect of data quality, transformation, or feature creation, and each is covered in a separate Jupyter notebook.

**Step 1: Data Loading**: Loading the datasets into the workspace, ensuring all necessary files are correctly imported for analysis. This includes the Kaggle dataset and the IBM datasets.

**Step 2: Dataset Integration**: Combining relevant datasets into a single, unified dataset that will serve as the foundation for subsequent analysis.

**Step 3: Handling Missing Values**: Identifying and addressing missing values in the dataset to ensure data integrity. This step ensures no significant gaps hinder the analysis.

**Step 4: Data Type Conversion**: Converting data columns to appropriate data types to optimize memory usage and prepare for feature engineering, ensuring consistency across all columns.

**Step 5: Data Exploration**: Performing initial exploratory data analysis (EDA) to understand the dataset's structure and characteristics, and visualizing key features to gain insights into the data.

**Step 6: Feature Engineering**: Creating new features from the existing data to enhance model performance and capture additional insights. This includes transformations and derived features.

**Step 7: Outlier Detection**: Identifying and addressing outliers in the dataset to ensure they do not negatively impact the analysis or models.

**Step 8: Dataset Splitting**: Splitting the dataset into training and testing subsets for model development and evaluation. Holding out a test set allows performance to be measured on data the model has not seen.

In [129]:
# Load the dataset
import pandas as pd

# Assuming the dataset is in the sibling 2_data directory, relative to this notebook
df = pd.read_csv('../2_data/telcocustomerchurn_encoded.csv')

# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Print the columns of the dataframe
print(df.columns.tolist())

['Unnamed: 0', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'Churn Value', 'Churn Score', 'CLTV', 'LoyaltyID', 'Partner', 'Tenure', 'Monthly Charges', 'Churn', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Cont

## 6.1 Interaction Features

Working with the `telcocustomerchurn_encoded` dataset, we combine existing columns into interaction features that may capture relationships between variables:
- Senior Citizen * Number of Dependents: The number of dependents for senior citizens, and 0 for all other customers.
- Tenure in Months / Age: Months of tenure per year of the customer’s age, useful for identifying customers with a proportionally longer relationship with the company. Because the units differ (months vs. years), the value can exceed 1.
- Avg Monthly GB Download * Streaming Services Count: Combines average download volume with the number of streaming services (Streaming TV, Movies, Music) subscribed.
- Monthly Charge / Total Charges: High values indicate recent subscribers whose monthly charge is large relative to their lifetime charges. The ratio is undefined when Total Charges is 0.
- Avg Monthly GB Download * Tenure in Months: An estimate of cumulative data usage over the customer’s tenure.

In [130]:
# Assuming the dataframe is named df and it contains the necessary columns

# Senior Citizen * Number of Dependents
df['SeniorCitizen_Dependents'] = df['Senior Citizen'] * df['Number of Dependents']

# Tenure in Months / Age
df['Tenure_Age_Ratio'] = df['Tenure in Months'] / df['Age']

# Avg Monthly GB Download * Streaming Services Count
df['AvgMonthlyGB_StreamServices'] = df['Avg Monthly GB Download'] * (df['Streaming TV'] + df['Streaming Movies'] + df['Streaming Music'])

# Monthly Charge / Total Charges
df['MonthlyCharge_TotalCharges_Ratio'] = df['Monthly Charge'] / df['Total Charges']

# Avg Monthly GB Download * Tenure in Months
df['AvgMonthlyGB_Tenure'] = df['Avg Monthly GB Download'] * df['Tenure in Months']

# Display the new features
print(df[['SeniorCitizen_Dependents', 'Tenure_Age_Ratio', 'AvgMonthlyGB_StreamServices', 'MonthlyCharge_TotalCharges_Ratio', 'AvgMonthlyGB_Tenure']])

      SeniorCitizen_Dependents  Tenure_Age_Ratio  AvgMonthlyGB_StreamServices  \
0                            0          0.243243                           16   
1                            0          0.195652                           20   
2                            0          0.080000                            0   
3                            0          0.166667                            8   
4                            0          0.040000                           11   
5                            0          0.391304                          219   
6                            0          1.059701                           42   
7                            0          1.211538                            0   
8                            0          0.102941                            0   
9                            0          1.511628                           42   
10                           0          1.148936                           10   
11                          
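
The two ratio features above divide by `Age` and `Total Charges`. If any row has a zero denominator, pandas returns `inf` (or `NaN` for 0/0) instead of raising an error. The following is a minimal sketch of one way to guard against that; `safe_ratio` is a hypothetical helper, not part of the notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is zero."""
    # Replacing 0 with NaN avoids +/-inf in the result; the NaN values can
    # then be imputed or dropped explicitly in a later step.
    return numerator / denominator.replace(0, np.nan)


# Toy data using the notebook's column names (values are invented).
toy = pd.DataFrame({
    "Monthly Charge": [50.0, 80.0, 20.0],
    "Total Charges": [500.0, 0.0, 20.0],
})
toy["MonthlyCharge_TotalCharges_Ratio"] = safe_ratio(
    toy["Monthly Charge"], toy["Total Charges"]
)
print(toy)
```

Whether `NaN` is the right placeholder depends on how missing values are handled downstream; filling with 0 or a sentinel is an alternative.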

## 6.2 Aggregated Binary Counts

Create features that sum the 0/1-encoded “Yes” responses across multiple services to give a composite engagement score:
- Total Services Used: Sum of Phone Service, Online Security, Device Protection Plan, Premium Tech Support, Streaming TV, Streaming Movies, Streaming Music, Unlimited Data, and Paperless Billing.
- Streaming Services Count: Total “Yes” responses for Streaming TV, Streaming Movies, and Streaming Music.
- Internet and Phone Bundles: Binary column that is 1 for customers who have both Phone Service and Internet Service, and 0 otherwise.

In [131]:
# Total Services Used
df['Total_Services_Used'] = (
    df['Phone Service'] + 
    df['Online Security'] + 
    df['Device Protection Plan'] + 
    df['Premium Tech Support'] + 
    df['Streaming TV'] + 
    df['Streaming Movies'] + 
    df['Streaming Music'] + 
    df['Unlimited Data'] + 
    df['Paperless Billing']
)

# Streaming Services Count
df['Streaming_Services_Count'] = (
    df['Streaming TV'] + 
    df['Streaming Movies'] + 
    df['Streaming Music']
)

# Internet and Phone Bundles
df['Internet_Phone_Bundle'] = ((df['Phone Service'] == 1) & (df['Internet Service'] == 1)).astype(int)

# Display the new features
print(df[['Total_Services_Used', 'Streaming_Services_Count', 'Internet_Phone_Bundle']])


      Total_Services_Used  Streaming_Services_Count  Internet_Phone_Bundle
0                       5                         1                      1
1                       3                         2                      1
2                       4                         0                      1
3                       6                         2                      1
4                       5                         1                      1
5                       7                         3                      1
6                       9                         3                      1
7                       4                         0                      1
8                       4                         0                      1
9                       9                         3                      1
10                      4                         1                      0
11                      9                         3                      1
12                      3
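
The counts above can also be written by summing over a list of column names, which makes it easier to see and change which services are included. This is a small sketch on invented data; it assumes the columns are already encoded as 0/1, and `SERVICE_COLUMNS` is a hypothetical name holding only a subset of the notebook's columns.

```python
import pandas as pd

# Hypothetical subset of the notebook's service columns, assumed 0/1 encoded.
SERVICE_COLUMNS = ["Phone Service", "Online Security", "Streaming TV"]

toy = pd.DataFrame({
    "Phone Service": [1, 1, 0],
    "Online Security": [0, 1, 0],
    "Streaming TV": [1, 1, 0],
})

# Guard: the row sum is only a count if every value is 0 or 1.
assert toy[SERVICE_COLUMNS].isin([0, 1]).all().all()

# Row-wise sum across the listed columns.
toy["Total_Services_Used"] = toy[SERVICE_COLUMNS].sum(axis=1)
print(toy)
```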

## 6.3 Contract and Payment Analysis

- Tenure in Years: Convert Tenure in Months into years for easier interpretability.
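- Payment Flexibility: Group the one-hot encoded Payment Method columns into two binary flags, “Digital Payments” (bank withdrawal or credit card) and “Traditional Payments” (mailed check). This could reveal trends in payment preferences. Paperless Billing could be combined with these flags as well, but the cell below uses Payment Method only.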

In [132]:
# Tenure in Years
df['Tenure_in_Years'] = df['Tenure in Months'] / 12

# Payment Flexibility (vectorized; equivalent to a row-wise apply)
df['Digital_Payments'] = (
    (df['Payment Method_Bank Withdrawal'] == 1) | (df['Payment Method_Credit Card'] == 1)
).astype(int)
df['Traditional_Payments'] = (df['Payment Method_Mailed Check'] == 1).astype(int)

print(df[['Tenure_in_Years', 'Digital_Payments', 'Traditional_Payments']])

      Tenure_in_Years  Digital_Payments  Traditional_Payments
0            0.750000                 1                     0
1            0.750000                 1                     0
2            0.333333                 1                     0
3            1.083333                 1                     0
4            0.250000                 1                     0
5            0.750000                 1                     0
6            5.916667                 1                     0
7            5.250000                 1                     0
8            0.583333                 1                     0
9            5.416667                 1                     0
10           4.500000                 1                     0
11           6.000000                 1                     0
12           0.416667                 1                     0
13           6.000000                 1                     0
14           4.666667                 1                     0
15      
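
The section description mentions combining Payment Method with Paperless Billing into categories. One possible way to do that is sketched below on invented data. The category labels ("Fully Digital", "Mixed", "Traditional") and the `Payment_Flexibility` column name are assumptions made for illustration, not definitions from the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows using the notebook's one-hot column names (values are invented).
toy = pd.DataFrame({
    "Payment Method_Bank Withdrawal": [1, 0, 0, 0],
    "Payment Method_Credit Card": [0, 1, 0, 0],
    "Payment Method_Mailed Check": [0, 0, 1, 1],
    "Paperless Billing": [1, 0, 1, 0],
})

electronic = (
    (toy["Payment Method_Bank Withdrawal"] == 1)
    | (toy["Payment Method_Credit Card"] == 1)
)
paperless = toy["Paperless Billing"] == 1

# Hypothetical labels; conditions are checked in order and the first match wins.
toy["Payment_Flexibility"] = np.select(
    [electronic & paperless, ~electronic & ~paperless],
    ["Fully Digital", "Traditional"],
    default="Mixed",
)
print(toy["Payment_Flexibility"].tolist())
```

A single categorical column like this would still need to be encoded (for example with `pd.get_dummies`) before being passed to most models.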

## 6.4 Scaling

- Monthly Charges Scaled: Multiply Monthly Charge by 12 to express it on a yearly basis, making it easier to compare with Total Charges. This is a unit conversion, not a normalization or standardization.

In [133]:
# Monthly Charges Scaled
df['Monthly_Charges_Scaled'] = df['Monthly Charge'] * 12

print(df[['Monthly_Charges_Scaled']])

      Monthly_Charges_Scaled
0                      787.2
1                      718.8
2                      886.8
3                     1176.0
4                     1006.8
5                      832.8
6                     1316.4
7                     1015.8
8                      578.4
9                     1085.4
10                     542.4
11                    1401.6
12                     827.4
13                    1215.6
14                     540.6
15                    1149.0
16                     735.0
17                     865.2
18                     752.4
19                     301.2
20                     302.4
21                    1129.2
22                    1005.0
23                     238.2
24                     244.2
25                     366.0
26                    1244.4
27                     244.8
28                     235.2
29                     236.4
30                    1094.4
31                     245.4
32                    1389.6
33            

## 6.5 Other Features

- Revenue Clusters: Group customers into revenue tiers (Low, Medium, High) based on Total Charges. Customers at or below the 25th percentile form the "Low" tier, customers above the 75th percentile form the "High" tier, and the remaining customers are "Medium". This is quantile binning rather than clustering in the machine-learning sense, and Monthly Charge is not used in the tiers.
- Distance from Service Center: Calculate distances from the zip code or latitude/longitude if service center coordinates are known. This feature could help relate service quality to geography. It is not computed in the cell below.
- Refund-to-Charges Ratio: Total Refunds / Total Charges. Highlights customers whose refunds are large relative to what they have been charged.
- Extra Data Usage Cost Proportion: Total Extra Data Charges / Total Charges. Indicates how much of the customer’s charges are driven by exceeding data limits.
- Lifetime Value per Month: CLTV / Tenure in Months. Gives the customer’s lifetime value per month of tenure.
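
The Distance from Service Center feature could be computed from the `Latitude` and `Longitude` columns with the haversine (great-circle) formula. The sketch below assumes a single service center; `CENTER_LAT` and `CENTER_LON` are placeholder coordinates chosen for illustration, since the dataset does not provide service center locations.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical service center location (placeholder, not from the dataset).
CENTER_LAT, CENTER_LON = 34.05, -118.25

# Toy customer coordinates (invented values).
toy = pd.DataFrame({
    "Latitude": [34.05, 37.77],
    "Longitude": [-118.25, -122.42],
})
toy["Distance_to_Center_km"] = haversine_km(
    toy["Latitude"], toy["Longitude"], CENTER_LAT, CENTER_LON
)
print(toy)
```

With several service centers, one option would be to compute the distance to each and keep the minimum per customer.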

In [134]:
# Revenue Clusters
df['Revenue_Cluster'] = pd.qcut(df['Total Charges'], q=[0, .25, .75, 1], labels=['Low', 'Medium', 'High'])
# One-hot encode the Revenue_Cluster column directly as 0/1 integers
df = pd.get_dummies(df, columns=['Revenue_Cluster'], prefix='Revenue_Cluster', dtype=int)

# Refund-to-Charges Ratio
df['Refund_to_Charges_Ratio'] = df['Total Refunds'] / df['Total Charges']

# Extra Data Usage Cost Proportion
df['Extra_Data_Usage_Cost_Proportion'] = df['Total Extra Data Charges'] / df['Total Charges']

# Lifetime Value per Month
df['Lifetime_Value_per_Month'] = df['CLTV'] / df['Tenure in Months']

# Display the new features
print(df[['Revenue_Cluster_Low', 'Revenue_Cluster_Medium', 'Revenue_Cluster_High', 'Refund_to_Charges_Ratio', 'Extra_Data_Usage_Cost_Proportion', 'Lifetime_Value_per_Month']])

      Revenue_Cluster_Low  Revenue_Cluster_Medium  Revenue_Cluster_High  \
0                       0                       1                     0   
1                       0                       1                     0   
2                       1                       0                     0   
3                       0                       1                     0   
4                       1                       0                     0   
5                       0                       1                     0   
6                       0                       0                     1   
7                       0                       0                     1   
8                       1                       0                     0   
9                       0                       0                     1   
10                      0                       1                     0   
11                      0                       0                     1   
12                      1
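
To see how `pd.qcut` assigns the tiers used above, it can help to run it on a small series and inspect the bin edges it chose. This sketch uses invented values; `retbins=True` returns the edges alongside the labels.

```python
import pandas as pd

# Invented charges, evenly spaced so the quartiles are easy to follow.
charges = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="Total Charges")

# Same quantile cut points and labels as in the notebook cell.
tiers, edges = pd.qcut(
    charges,
    q=[0, 0.25, 0.75, 1],
    labels=["Low", "Medium", "High"],
    retbins=True,
)
print(edges)                         # bin edges at the 0th, 25th, 75th, 100th percentiles
print(tiers.value_counts(sort=False))  # number of customers per tier
```

By construction roughly a quarter of the rows fall into "Low", half into "Medium", and a quarter into "High". If many rows share the same value, quantile edges can coincide, in which case `pd.qcut` raises an error about non-unique bin edges.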

In [135]:
print(df.columns.tolist())
print(df.dtypes)

['Unnamed: 0', 'Count', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Satisfaction Score', 'Churn Value', 'Churn Score', 'CLTV', 'LoyaltyID', 'Partner', 'Tenure', 'Monthly Charges', 'Churn', 'Country_United States', 'State_California', 'Quarter_Q3', 'Offer_Offer A', 'Offer_Offer B', 'Offer_Offer C', 'Offer_Offer D', 'Offer_Offer E', 'Internet Type_Cable', 'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Cont

In [136]:
df.to_csv('../2_data/telcocustomerchurn_featured.csv', index=False)