# 🛠️ Customer Churn Analysis: Feature Engineering

**Goal**: Create new features and optimize existing variables to improve churn prediction accuracy for telecom customers.

**Dataset**: Cleaned and processed customer churn data with 7,043 records and 21 original columns (including the `customerID` identifier and the `Churn` target), plus 10 engineered features.

**Process**:
- Data loading and inspection
- Feature type identification (numerical vs. categorical)
- New feature creation (tenure groups, charge ratios, etc.)
- Categorical encoding
- Missing value handling
- Separating features and target
- Saving engineered dataset

**Outcome**: Enhanced, machine-learning-ready dataset with improved predictive signals for customer churn modeling.


### Import required libraries

In [1]:
import os
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import custom functions
from src.feature_engineering import (
    create_engineered_features,
    encode_categorical_variables,
    handle_missing_values,
    print_feature_engineering_summary
)

### Load processed/cleaned data

In [2]:
data_path = Path('../data/processed/customer_churn_cleaned.csv')
df = pd.read_csv(data_path)

print(f"Data loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Columns: {list(df.columns)}")
df.head()


Data loaded: 7,043 rows × 21 columns
Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,3167-SNQPL,Male,1,Yes,Yes,38.0,Yes,No,Fiber optic,No,...,No,No,Yes,Yes,Month-to-month,No,Electronic check,101.15,1398.6,0
1,6905-NIQIN,Male,0,No,No,1.0,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,No,Mailed check,70.4,50.65,1
2,3898-GUYTS,Male,1,No,No,45.0,Yes,Yes,Fiber optic,Yes,...,No,No,Yes,No,Month-to-month,Yes,Electronic check,97.05,4385.05,0
3,8499-BRXTD,Male,0,No,No,18.0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,No,Mailed check,20.1,401.85,0
4,4629-NRXKX,Female,0,Yes,Yes,2.0,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,No,Electronic check,70.4,1398.6,1
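A quick integrity check right after loading can catch duplicated IDs or an unexpected target encoding before any features are built. The sketch below is illustrative only: `check_loaded_frame` is a hypothetical helper, not part of `src.feature_engineering`, and the demo frame reuses just two rows from the preview above.

```python
import pandas as pd


def check_loaded_frame(df: pd.DataFrame,
                       id_col: str = "customerID",
                       target_col: str = "Churn") -> dict:
    """Return basic integrity facts about a freshly loaded churn frame.

    Hypothetical helper for illustration; column names follow the preview above.
    """
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        # Number of rows whose identifier repeats an earlier row
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        # Distinct target labels, sorted for stable comparison
        "target_values": sorted(df[target_col].unique().tolist()),
    }


# Tiny stand-in frame built from two rows of the preview
demo = pd.DataFrame({
    "customerID": ["3167-SNQPL", "6905-NIQIN"],
    "tenure": [38.0, 1.0],
    "Churn": [0, 1],
})
report = check_loaded_frame(demo)
```

On the real frame, calling `check_loaded_frame(df)` would be expected to report 7,043 rows and 21 columns if the load matches the output above.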


### Display data information

In [3]:
print("DATA OVERVIEW")
print("="*70)
print("\nData Info:")
df.info()

print("\n\nMissing Values:")
print(df.isnull().sum()[df.isnull().sum() > 0])

print("\n\nData Types:")
print(df.dtypes.value_counts())


DATA OVERVIEW

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   float64
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling

### Feature Types

In [4]:
# Identify categorical and numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Remove target variable from features
if 'Churn' in numerical_cols:
    numerical_cols.remove('Churn')

print("FEATURE TYPES")
print("="*70)
print(f"\nNumerical Features ({len(numerical_cols)}):")
print(numerical_cols)
print(f"\nCategorical Features ({len(categorical_cols)}):")
print(categorical_cols)

FEATURE TYPES

Numerical Features (4):
['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

Categorical Features (16):
['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
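Note that `customerID` is listed among the categorical features even though it is an identifier with one value per row. One way to flag such columns before encoding is to compare the number of unique values with the row count. The helper name `find_identifier_columns` and the 0.95 threshold below are assumptions made for this sketch.

```python
import pandas as pd


def find_identifier_columns(df: pd.DataFrame,
                            columns: list,
                            min_unique_ratio: float = 0.95) -> list:
    """Return columns whose share of unique values is at or above the threshold.

    Hypothetical helper; the 0.95 threshold is an arbitrary illustrative choice.
    """
    n_rows = len(df)
    if n_rows == 0:
        return []
    return [c for c in columns if df[c].nunique() / n_rows >= min_unique_ratio]


# Stand-in frame using values from the preview
demo = pd.DataFrame({
    "customerID": ["3167-SNQPL", "6905-NIQIN", "3898-GUYTS", "8499-BRXTD"],
    "gender": ["Male", "Male", "Male", "Male"],
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month", "One year"],
})
id_like = find_identifier_columns(demo, ["customerID", "gender", "Contract"])
```

Columns flagged this way can then be removed from `categorical_cols` so they are not treated as features.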


### Create new features

In [5]:
df_engineered = create_engineered_features(df)

# Display new features
print("\n\nNew Features Created:")
new_features = [col for col in df_engineered.columns if col not in df.columns]
print(new_features)
print(f"\nTotal new features: {len(new_features)}")

# Show sample of engineered data
df_engineered.head()

CREATING NEW FEATURES

[Data Type Validation]
✓ TotalCharges already numeric (float64)
✓ MonthlyCharges already numeric (float64)
✓ tenure already numeric (float64)
✓ Created: tenure_group
✓ Created: is_new_customer
✓ Created: is_long_term
✓ Created: total_revenue
✓ Created: charge_category
✓ Created: charge_ratio
✓ Created: num_services (counted from 8 service columns)
✓ Created: is_monthly_contract
✓ Created: paperless_billing_binary
✓ Created: family_size

New shape after feature engineering: (7043, 31)
Added 10 new features


New Features Created:
['tenure_group', 'is_new_customer', 'is_long_term', 'total_revenue', 'charge_category', 'charge_ratio', 'num_services', 'is_monthly_contract', 'paperless_billing_binary', 'family_size']

Total new features: 10


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,tenure_group,is_new_customer,is_long_term,total_revenue,charge_category,charge_ratio,num_services,is_monthly_contract,paperless_billing_binary,family_size
0,3167-SNQPL,Male,1,Yes,Yes,38.0,Yes,No,Fiber optic,No,...,2-4 years,0,0,3843.7,High,0.363774,4,1,0,2
1,6905-NIQIN,Male,0,No,No,1.0,Yes,No,Fiber optic,No,...,0-1 year,1,0,70.4,High,0.709384,2,1,0,0
2,3898-GUYTS,Male,1,No,No,45.0,Yes,Yes,Fiber optic,Yes,...,2-4 years,0,0,4367.25,High,1.003846,4,1,1,0
3,8499-BRXTD,Male,0,No,No,18.0,Yes,No,No,No internet service,...,1-2 years,0,0,361.8,Low,1.107635,1,0,0,0
4,4629-NRXKX,Female,0,Yes,Yes,2.0,Yes,No,Fiber optic,No,...,0-1 year,1,0,140.8,High,9.863188,1,1,0,2
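The body of `create_engineered_features` is not shown here, so the following is only a reconstruction that reproduces the five sample rows above. The formulas are inferred from those rows and are assumptions: `total_revenue = tenure * MonthlyCharges`, `charge_ratio = TotalCharges / (total_revenue + 1)`, `family_size` as a count of `Partner` and `Dependents` equal to "Yes", and tenure bin edges at 12, 24 and 48 months. The actual implementation may differ.

```python
import numpy as np
import pandas as pd


def sketch_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Reconstruct a few engineered features (inferred, not the project's code)."""
    out = df.copy()
    # Assumed bin edges in months; consistent with the sample rows only
    out["tenure_group"] = pd.cut(
        out["tenure"],
        bins=[0, 12, 24, 48, np.inf],
        labels=["0-1 year", "1-2 years", "2-4 years", "4+ years"],
    )
    out["total_revenue"] = out["tenure"] * out["MonthlyCharges"]
    # The +1 in the denominator avoids division by zero when tenure is 0
    out["charge_ratio"] = out["TotalCharges"] / (out["total_revenue"] + 1)
    out["family_size"] = (
        (out["Partner"] == "Yes").astype(int)
        + (out["Dependents"] == "Yes").astype(int)
    )
    return out


# The five preview rows, restricted to the columns the sketch needs
demo = pd.DataFrame({
    "tenure": [38.0, 1.0, 45.0, 18.0, 2.0],
    "MonthlyCharges": [101.15, 70.4, 97.05, 20.1, 70.4],
    "TotalCharges": [1398.6, 50.65, 4385.05, 401.85, 1398.6],
    "Partner": ["Yes", "No", "No", "No", "Yes"],
    "Dependents": ["Yes", "No", "No", "No", "Yes"],
})
features = sketch_engineered_features(demo)
```

A `charge_ratio` far from 1, such as 9.86 in the last row, means `TotalCharges` disagrees strongly with `tenure * MonthlyCharges`, which may be worth checking in the cleaned data.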


### Encode categorical variables

In [6]:
df_encoded = encode_categorical_variables(df_engineered)

# Show results
print("\n\nDataFrame after encoding:")
df_encoded.head()


ENCODING CATEGORICAL VARIABLES

Categorical columns to encode: 16
✓ One-Hot Encoded: customerID (7043 categories)
✓ Label Encoded (binary): gender
✓ Label Encoded (binary): Partner
✓ Label Encoded (binary): Dependents
✓ Label Encoded (binary): PhoneService
✓ One-Hot Encoded: MultipleLines (3 categories)
✓ One-Hot Encoded: InternetService (3 categories)
✓ One-Hot Encoded: OnlineSecurity (3 categories)
✓ One-Hot Encoded: OnlineBackup (3 categories)
✓ One-Hot Encoded: DeviceProtection (3 categories)
✓ One-Hot Encoded: TechSupport (3 categories)
✓ One-Hot Encoded: StreamingTV (3 categories)
✓ One-Hot Encoded: StreamingMovies (3 categories)
✓ One-Hot Encoded: Contract (3 categories)
✓ Label Encoded (binary): PaperlessBilling
✓ One-Hot Encoded: PaymentMethod (4 categories)

Shape after encoding: (7043, 7083)


DataFrame after encoding:


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,1,1,1,1,38.0,1,0,101.15,1398.6,0,...,False,False,True,False,True,False,False,False,True,False
1,1,0,0,0,1.0,1,0,70.4,50.65,1,...,False,False,False,False,False,False,False,False,False,True
2,1,1,0,0,45.0,1,1,97.05,4385.05,0,...,False,False,True,False,False,False,False,False,True,False
3,1,0,0,0,18.0,1,0,20.1,401.85,0,...,False,True,False,True,False,True,False,False,False,True
4,0,0,1,1,2.0,1,0,70.4,1398.6,1,...,False,False,False,False,False,False,False,False,True,False
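The log shows `customerID` being one-hot encoded into 7,043 categories, which accounts for nearly all of the 7,083 columns. If the encoder drops the first category of each variable (the column names in the preview suggest this, but it is an assumption), 7,042 of those columns are ID dummies and only 41 carry real information. A minimal sketch of dropping the identifier before encoding is shown below. `encode_without_identifier` is a hypothetical helper that uses plain `pd.get_dummies`, not the project's `encode_categorical_variables`.

```python
import pandas as pd


def encode_without_identifier(df: pd.DataFrame,
                              id_col: str = "customerID") -> pd.DataFrame:
    """Drop the identifier, then one-hot encode the remaining non-numeric columns.

    Hypothetical helper; drop_first=True mirrors what the column names above suggest.
    """
    features = df.drop(columns=[id_col], errors="ignore")
    cat_cols = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    return pd.get_dummies(features, columns=cat_cols, drop_first=True)


# Stand-in frame using values from the preview
demo = pd.DataFrame({
    "customerID": ["3167-SNQPL", "6905-NIQIN", "3898-GUYTS", "8499-BRXTD"],
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month", "One year"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Electronic check", "Mailed check"],
    "MonthlyCharges": [101.15, 70.4, 97.05, 20.1],
})

encoded = encode_without_identifier(demo)

# For comparison: encoding the ID as well adds one dummy column per extra customer
with_id = pd.get_dummies(
    demo, columns=["customerID", "Contract", "PaymentMethod"], drop_first=True
)
```

In the demo, `encoded` has 3 columns while `with_id` has 6. On the full dataset the gap grows with the number of customers.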


### Handle missing values

In [7]:
df_clean = handle_missing_values(df_encoded)

# Verify no missing values remain
print("\n\nVerification - Missing values after cleaning:")
print(df_clean.isnull().sum().sum())

print("\n" + "=" * 70)
print("FEATURE SELECTION")
print("=" * 70)

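# Note: this check only matches if 'customerID' is still a single column;
# after one-hot encoding it exists only as 'customerID_<id>' dummy columns.
if 'customerID' in df_clean.columns:
    df_clean.drop('customerID', axis=1, inplace=True)
    print("✓ Removed 'customerID' column (identifier, not a feature)")
    print(f"  Remaining columns: {len(df_clean.columns)}")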



HANDLING MISSING VALUES

Missing values found:
tenure_group    11
dtype: int64
✓ Filled tenure_group with mode ('4+ years')

Final missing values: 0


Verification - Missing values after cleaning:
0

FEATURE SELECTION
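The 11 missing `tenure_group` values were filled with the mode, `'4+ years'`. One plausible cause of those gaps (an assumption, since the binning code is not shown) is that customers with `tenure` of 0 fall outside the lowest bin, because `pd.cut` excludes the left edge by default. If so, they are brand-new customers and `'4+ years'` would mislabel them. The sketch below shows the effect and how `include_lowest=True` avoids it. The bin edges are assumed.

```python
import numpy as np
import pandas as pd

tenure = pd.Series([0.0, 1.0, 18.0, 45.0, 72.0], name="tenure")

# Assumed edges in months; the project's actual edges are not shown
bins = [0, 12, 24, 48, np.inf]
labels = ["0-1 year", "1-2 years", "2-4 years", "4+ years"]

# Default behaviour: intervals are (0, 12], (12, 24], ... so tenure == 0 becomes NaN
default_groups = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in "0-1 year"
inclusive_groups = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)
```

If this is indeed the cause, binning with `include_lowest=True` inside feature creation would remove the need for imputation on this column.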


### Preparing Final Dataset

In [8]:
# Separate features and target
X = df_clean.drop('Churn', axis=1)
y = df_clean['Churn']

print("="*70)
print("PREPARING FINAL DATASET")
print("="*70)
print(f"\nFeatures (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nChurn rate: {y.mean():.2%}")


PREPARING FINAL DATASET

Features (X) shape: (7043, 7082)
Target (y) shape: (7043,)

Target distribution:
Churn
0    5174
1    1869
Name: count, dtype: int64

Churn rate: 26.54%
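The encoding log lists only the 16 original object columns, and `tenure_group` was later filled with the string `'4+ years'`. This suggests, without confirming, that `tenure_group` and `charge_category` are still non-numeric in `X`. Most estimators need numeric input, so a check like the one below may be useful before modeling. `non_numeric_columns` is a hypothetical helper and the demo frame is a stand-in.

```python
import pandas as pd


def non_numeric_columns(X: pd.DataFrame) -> list:
    """List columns that are neither numeric nor boolean (hypothetical helper)."""
    return X.select_dtypes(exclude=["number", "bool"]).columns.tolist()


# Stand-in feature matrix with one categorical and one string column left over
demo_X = pd.DataFrame({
    "tenure": [38.0, 1.0, 18.0],
    "Contract_One year": [False, False, True],
    "tenure_group": pd.Categorical(["2-4 years", "0-1 year", "1-2 years"]),
    "charge_category": ["High", "High", "Low"],
})

leftover = non_numeric_columns(demo_X)

# One possible remedy: one-hot encode whatever is left over
demo_X_numeric = pd.get_dummies(demo_X, columns=leftover, drop_first=True)
```

Running `non_numeric_columns(X)` on the real feature matrix would show whether any such columns remain.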


### Print comprehensive summary

In [9]:
print_feature_engineering_summary(
    df_original=df,
    df_engineered=df_engineered,
    df_encoded=df_encoded,
    X=X,
    y=y
)



FEATURE ENGINEERING SUMMARY

[Dataset Transformation]:
   Original features: 21
   After engineering: 31
   After encoding: 7083
   Final features for modeling: 7082

[Target Variable]:
   Churn rate (overall): 26.54%
   Churned customers: 1,869
   Retained customers: 5,174

[✓] Feature Engineering Complete!
[✓] Data is ready for machine learning modeling


### Save engineered dataset

In [10]:
print("=" * 70)
print("SAVING PROCESSED DATA")
print("=" * 70)

processed_dir = Path("../data/processed")
processed_dir.mkdir(parents=True, exist_ok=True)

final_df = X.copy()
final_df["Churn"] = y.values

final_path = processed_dir / "customer_churn_engineered.csv"
final_df.to_csv(final_path, index=False)

print(f"✓ Final engineered dataset saved to: {final_path}")
print(f"✓ Shape: {final_df.shape}")

X.to_csv(processed_dir / "X_engineered.csv", index=False)
y.to_csv(processed_dir / "y.csv", index=False)

print("✓ X_engineered.csv and y.csv saved")

SAVING PROCESSED DATA
✓ Final engineered dataset saved to: ..\data\processed\customer_churn_engineered.csv
✓ Shape: (7043, 7083)
✓ X_engineered.csv and y.csv saved
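A round-trip check after saving can confirm that the file on disk matches what was written. The sketch below repeats the saving step on a small stand-in frame in a temporary directory. `save_and_verify` is a hypothetical helper, and only the file name is taken from the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(X: pd.DataFrame, y: pd.Series, out_dir) -> bool:
    """Write features plus target to CSV, reload, and compare shape and columns.

    Hypothetical helper for illustration only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    final_df = X.copy()
    final_df["Churn"] = y.values

    path = out_dir / "customer_churn_engineered.csv"
    final_df.to_csv(path, index=False)

    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == final_df.shape
    same_columns = list(reloaded.columns) == list(final_df.columns)
    return bool(same_shape and same_columns)


# Stand-in data; the real call would pass X and y from the cells above
demo_X = pd.DataFrame({"tenure": [38.0, 1.0], "MonthlyCharges": [101.15, 70.4]})
demo_y = pd.Series([0, 1], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    round_trip_ok = save_and_verify(demo_X, demo_y, tmp)
```

The check compares only shape and column names. Dtypes can change on reload, for example when categorical columns come back as strings.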
