# Telco Customer Churn - Feature Engineering

## Purpose and Objectives

This notebook focuses on feature engineering and preprocessing for the Telco Customer Churn prediction model, building on insights from our exploratory data analysis (EDA).

### Why Feature Engineering is Critical for Churn Prediction:

**🎯 Data Quality Issues**: Our EDA revealed that `TotalCharges` is stored as strings (including a few blank entries) and must be converted to a numeric type before machine learning algorithms can use it.

**🔄 Mixed Data Types**: The dataset contains both numerical features (tenure, charges) and categorical features (contract type, services) that require different preprocessing approaches.

### Key Steps in This Notebook:

1. **Data Loading & Cleaning**: Fix data quality issues identified in EDA
2. **Feature Type Separation**: Distinguish numerical vs categorical features
3. **Preprocessing Pipeline Development**: Create robust sklearn pipelines
4. **Data Transformation**: Apply preprocessing to training data
5. **Artifact Saving**: Store configurations for production use

## Import Required Libraries

In [55]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import json
from pathlib import Path

# Sklearn preprocessing and model selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Warnings
import warnings
warnings.filterwarnings('ignore')

In [56]:
# Configure display and plotting
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('default')
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

## Data Loading and Initial Setup

In [57]:
# Load the raw dataset
data_path = Path('../data/raw/Telco-Customer-Churn.csv')
df = pd.read_csv(data_path)

print(f"Dataset loaded successfully - Shape: {df.shape}")
print("Target variable: Churn")

Dataset loaded successfully - Shape: (7043, 21)
Target variable: Churn


In [58]:
# Basic dataset information
print(f"Dataset shape: {df.shape}")
print(f"Target distribution: {df['Churn'].value_counts().to_dict()}")

Dataset shape: (7043, 21)
Target distribution: {'No': 5174, 'Yes': 1869}


## Data Cleaning

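Based on our EDA findings, two data quality issues need attention before feature engineering: `TotalCharges` is stored as text with blank entries, and `customerID` is an identifier with no predictive value.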

In [59]:
# Inspect TotalCharges before conversion: count the blank-string entries
print(f"TotalCharges original type: {df['TotalCharges'].dtype}")
empty_totalcharges = df[df['TotalCharges'] == ' ']
print(f"Found {len(empty_totalcharges)} customers with empty TotalCharges")

TotalCharges original type: object
Found 11 customers with empty TotalCharges


In [60]:
# Convert TotalCharges to numeric (empty strings become NaN)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print(f"TotalCharges converted to: {df['TotalCharges'].dtype}")
print(f"Missing values created: {df['TotalCharges'].isnull().sum()}")

TotalCharges converted to: float64
Missing values created: 11
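
Before imputing, it can help to look at which rows were affected. The sketch below uses a hypothetical toy frame (not the Telco file) to show how `errors='coerce'` turns a blank string into `NaN` and how the affected rows can be listed next to `tenure`:

```python
import pandas as pd

# Hypothetical toy frame mimicking the TotalCharges issue (not the real data)
toy = pd.DataFrame({
    "tenure": [0, 12, 34],
    "TotalCharges": [" ", "350.5", "1889.5"],
})

# Blank strings cannot be parsed, so errors='coerce' maps them to NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# List the rows that became NaN so they can be inspected before imputing
affected = toy[toy["TotalCharges"].isna()]
print(affected)
```

Applying the same filter to `df` would show whether the 11 rows share a pattern (for example, very short tenure) before they are median-imputed in the pipeline.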


In [61]:
# Analyze missing values
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
})

missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
if len(missing_summary) > 0:
    print("Missing values found:")
    print(missing_summary)
else:
    print("No missing values found")

Missing values found:
                    Column  Missing_Count  Missing_Percentage
TotalCharges  TotalCharges             11            0.156183


In [62]:
# Remove identifier columns
print(f"Original columns: {df.shape[1]}")
df_clean = df.drop(columns=['customerID'])
print(f"Final shape after removing customerID: {df_clean.shape}")

Original columns: 21
Final shape after removing customerID: (7043, 20)


## Feature Type Identification

We need to separate features into numerical and categorical types for appropriate preprocessing.

In [63]:
# Identify numerical and categorical columns
# Get all columns except target
all_features = [col for col in df_clean.columns if col != 'Churn']

# Automatically detect numerical columns
numerical_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
if 'Churn' in numerical_cols:
    numerical_cols.remove('Churn')

# Automatically detect categorical columns  
categorical_cols = df_clean.select_dtypes(include=['object']).columns.tolist()
if 'Churn' in categorical_cols:
    categorical_cols.remove('Churn')

print(f"Numerical features ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical features ({len(categorical_cols)}): {categorical_cols}")
print(f"Total features: {len(numerical_cols) + len(categorical_cols)}")

Numerical features (4): ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
Categorical features (15): ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
Total features: 19


## Preprocessing Pipeline Development

We'll create sklearn pipelines to handle numerical and categorical features differently, ensuring robust preprocessing for both training and production.

In [64]:
# Create numerical preprocessing pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
print("Numerical pipeline created")

Numerical pipeline created
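
To make the two steps concrete, here is a minimal sketch on a toy column (illustrative values only) showing median imputation followed by standard scaling:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (illustrative only)
toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

scaled = toy_pipeline.fit_transform(toy)

# The median of the observed values [1, 2, 3] is 2; after filling, the column
# mean is also 2, so the imputed entry is scaled to 0
print(scaled.ravel())
```

After scaling, the column has zero mean and unit variance, so features such as `tenure` and `TotalCharges` end up on comparable scales.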


In [65]:
# Create categorical preprocessing pipeline  
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])
print("Categorical pipeline created")

Categorical pipeline created


In [66]:
# Combine pipelines with ColumnTransformer
preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, numerical_cols),
    ('categorical', categorical_pipeline, categorical_cols)
], remainder='passthrough')

print(f"Combined preprocessor created: {len(numerical_cols)} numerical + {len(categorical_cols)} categorical features")

Combined preprocessor created: 4 numerical + 15 categorical features


## Save Column Configuration

We'll save the feature column lists to a JSON file for reuse in training and production scripts.

In [67]:
# Create processed data directory and column configuration
processed_dir = Path('../data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)  # also create ../data if it is missing

# Prepare column configuration for saving
column_config = {
    'columns': {
        'numerical': numerical_cols,
        'categorical': categorical_cols,
        'all_features': numerical_cols + categorical_cols,
        'target': 'Churn'
    },
    'metadata': {
        'total_features': len(numerical_cols) + len(categorical_cols),
        'numerical_count': len(numerical_cols),
        'categorical_count': len(categorical_cols),
        'dataset_shape': df_clean.shape,
        'preprocessing_steps': {
            'numerical': ['median_imputation', 'standard_scaling'],
            'categorical': ['constant_imputation', 'one_hot_encoding']
        }
    }
}

print(f"Column configuration prepared: {len(numerical_cols)} numerical + {len(categorical_cols)} categorical features")

Column configuration prepared: 4 numerical + 15 categorical features


In [68]:
# Save column configuration to JSON file
columns_file = processed_dir / 'columns.json'
with open(columns_file, 'w') as f:
    json.dump(column_config, f, indent=2)

print(f"Column configuration saved to: {columns_file}")

Column configuration saved to: ..\data\processed\columns.json
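
A downstream script could read the configuration back roughly as follows. This is a sketch with a shortened stand-in config written to a temporary path, not the project file; note that the shape tuple comes back as a list after the JSON round trip:

```python
import json
import tempfile
from pathlib import Path

# Stand-in config with the same layout as column_config (values shortened)
config = {
    "columns": {"numerical": ["tenure"], "categorical": ["Contract"], "target": "Churn"},
    "metadata": {"dataset_shape": (7043, 20)},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "columns.json"  # temporary path, not the project file
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

# Rebuild the feature list in the same order used during preprocessing
feature_cols = loaded["columns"]["numerical"] + loaded["columns"]["categorical"]
print(feature_cols, loaded["metadata"]["dataset_shape"])
```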


## Train-Test Split

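We'll create a stratified train-test split so that both sets keep the same churn proportion as the full dataset (roughly 26.5% churners). Stratification preserves the class ratio; it does not balance the classes.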

In [70]:
# Prepare features and target
all_feature_cols = numerical_cols + categorical_cols
X = df_clean[all_feature_cols]
y = df_clean['Churn']

print(f"Features shape: {X.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")

Features shape: (7043, 19)
Target distribution: {'No': 5174, 'Yes': 1869}


In [71]:
# Perform stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Train set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Train target distribution: {y_train.value_counts().to_dict()}")
print(f"Test target distribution: {y_test.value_counts().to_dict()}")

Train set: 5634 samples (80.0%)
Test set: 1409 samples (20.0%)
Train target distribution: {'No': 4139, 'Yes': 1495}
Test target distribution: {'No': 1035, 'Yes': 374}
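
The effect of stratification can be checked directly from the counts printed above. A short sketch of the arithmetic:

```python
# Class counts copied from the printed outputs above
counts = {
    "full": {"No": 5174, "Yes": 1869},
    "train": {"No": 4139, "Yes": 1495},
    "test": {"No": 1035, "Yes": 374},
}

# Churn rate = churners / all customers in each split
churn_rate = {name: c["Yes"] / (c["No"] + c["Yes"]) for name, c in counts.items()}
for name, rate in churn_rate.items():
    print(f"{name}: {rate:.4f}")
```

All three rates come out at roughly 26.5%, which confirms that the split preserved the class ratio.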


## Preprocessing Demonstration

Let's fit the preprocessor on training data and demonstrate the transformation process.

In [72]:
# Fit the preprocessor on training data
preprocessor.fit(X_train)

# Get feature names after transformation
categorical_feature_names = list(preprocessor.named_transformers_['categorical'].named_steps['encoder'].get_feature_names_out(categorical_cols))
numerical_feature_names = numerical_cols
all_feature_names = numerical_feature_names + categorical_feature_names

print("Preprocessor fitted successfully")
print(f"Total features after preprocessing: {len(all_feature_names)} ({len(numerical_feature_names)} numerical + {len(categorical_feature_names)} categorical)")

Preprocessor fitted successfully
Total features after preprocessing: 30 (4 numerical + 26 categorical)


## Save Processed Data

Save the processed training and test sets for use in model training scripts.

In [73]:
# Transform both training and test sets
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"Training set: {X_train.shape} -> {X_train_processed.shape}")
print(f"Test set: {X_test.shape} -> {X_test_processed.shape}")

# Save processed data to numpy arrays
import os
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

np.save(f"{processed_dir}/X_train_processed.npy", X_train_processed)
np.save(f"{processed_dir}/X_test_processed.npy", X_test_processed)
np.save(f"{processed_dir}/y_train.npy", y_train.values)
np.save(f"{processed_dir}/y_test.npy", y_test.values)

# Save feature names for reference
with open(f"{processed_dir}/feature_names.json", 'w') as f:
    json.dump({
        'numerical_features': numerical_cols,
        'categorical_features': categorical_cols,
        'all_feature_names': all_feature_names
    }, f, indent=2)

print(f"Processed data saved to {processed_dir}/")

Training set: (5634, 19) -> (5634, 30)
Test set: (1409, 19) -> (1409, 30)
Processed data saved to ../data/processed/
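
When reloading these files, one detail may matter: label arrays saved from an object-dtype Series of strings are stored via pickle, so `np.load` needs `allow_pickle=True` for them. The sketch below uses a stand-in label Series and a temporary path (both assumptions, not the project files) and also shows one way to map the labels to 0/1:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Stand-in for y_train: string labels held in an object-dtype Series
y = pd.Series(["No", "Yes", "No"], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.npy"  # temporary path, not the project file
    np.save(path, y.values)  # object arrays are stored via pickle

    try:
        np.load(path)  # default allow_pickle=False rejects object arrays
        needs_pickle = False
    except ValueError:
        needs_pickle = True

    labels = np.load(path, allow_pickle=True)

# Map labels to 0/1 for estimators or metrics that expect numeric targets
y_binary = (labels == "Yes").astype(int)
print(needs_pickle, y_binary)
```

The processed feature matrices are plain float arrays, so they load with the default settings.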


## Summary

This notebook completed the feature engineering pipeline for the telco churn prediction project:

### Key Components:
- **Data Cleaning**: Converted `TotalCharges` to numeric (11 blank entries became NaN and are median-imputed in the pipeline) and removed the `customerID` identifier column
- **Feature Preprocessing**: Median imputation and StandardScaler for numerical features; constant imputation and OneHotEncoder (first category dropped) for categorical features
- **Pipeline Creation**: Reproducible sklearn pipelines for consistent preprocessing
- **Train-Test Split**: Stratified 80/20 split preserving target distribution
- **Data Export**: Processed arrays and feature names saved to `../data/processed/` for model training
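
For production use, the fitted preprocessor itself would also need to be persisted so that new data is transformed with the same learned medians, scales, and categories. This notebook does not do that; the following is only a sketch of one common approach using `joblib`, with a stand-in transformer and a hypothetical file name:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessor (any fitted sklearn transformer works the same way)
X = np.array([[1.0], [2.0], [3.0]])
fitted = StandardScaler().fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor.joblib"  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored object reproduces the original transformation
same_output = np.allclose(fitted.transform(X), restored.transform(X))
print(same_output)
```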