# Feature Engineering for Churn Prediction

## Objective
Transform our raw customer data into a format suitable for machine learning models.

## What We'll Do
1. **Fix data types** - Convert `TotalCharges` from text to numeric and fill its missing values
2. **Encode categorical variables** - Map two-valued columns to 0/1 and one-hot encode multi-category columns
3. **Drop non-predictive columns** - Remove `customerID`
4. **Split data** - Prepare for training and testing

## Why This Matters
Most machine learning models, including those in scikit-learn, expect numeric input and cannot work directly with text such as "Male" or "Month-to-month". We need to convert every column to a numerical format while preserving the information it carries.
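
As a minimal sketch of what "converting text to numbers" means here, the toy frame below uses a few hypothetical rows (the category names mirror the Telco data, the rows themselves are made up). A two-valued column becomes a single 0/1 column, and a multi-valued column becomes one indicator column per category.

```python
import pandas as pd

# Hypothetical toy rows; only the category names come from the Telco data.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Two-valued column: a single 0/1 column keeps all the information.
toy["gender"] = toy["gender"].map({"Male": 1, "Female": 0})

# Multi-valued column: one indicator column per category (one-hot encoding).
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded)
```

No ordering is implied between the contract types, which is why one-hot encoding is used for that column rather than mapping the categories to 0, 1, 2.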

---

In [68]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [69]:
df = pd.read_csv("C:/Users/Administrator/Desktop/user-churn-prediction/data/raw/Telco-Customer-Churn.csv")

In [70]:
df.shape

(7043, 21)

In [71]:
df.head(5)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [72]:
# Display basic info
print(f"Dataset shape: {df.shape}")
print(f"Total customers: {len(df):,}")

Dataset shape: (7043, 21)
Total customers: 7,043


In [73]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [74]:
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()


In [75]:
df.select_dtypes(include=['object']).nunique()

customerID          7043
gender                 2
Partner                2
Dependents             2
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
TotalCharges        6531
Churn                  2
dtype: int64
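
The unique-value counts above suggest a way to pick an encoding per column. The sketch below groups columns by cardinality on a hypothetical toy frame; the grouping rule is an assumption for illustration, not something this notebook automates.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the Telco data.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "gender": ["Female", "Male", "Male", "Female"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})

counts = toy.nunique()

# One distinct value per row: behaves like an identifier, not a feature.
id_like = counts[counts == len(toy)].index.tolist()
# Exactly two values: candidate for a 0/1 mapping.
binary = counts[counts == 2].index.tolist()
# A handful of values: candidate for one-hot encoding.
multi = counts[(counts > 2) & (counts < len(toy))].index.tolist()

print(id_like, binary, multi)
```

A column such as `TotalCharges`, with thousands of distinct text values, fits none of these groups cleanly, which is a hint that it is a numeric column stored as text.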

In [44]:
df.select_dtypes(include=['object']).columns

Index(['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges',
       'Churn'],
      dtype='object')

In [37]:
df["gender"].unique()

array(['Female', 'Male'], dtype=object)

In [42]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(f"\nNumerical columns ({len(numerical_cols)}):")
for col in numerical_cols:
    print(f"  - {col}")


Numerical columns (3):
  - SeniorCitizen
  - tenure
  - MonthlyCharges


In [45]:
df['TotalCharges'].dtype

dtype('O')


In [76]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')


In [77]:
df['TotalCharges'].dtype

dtype('float64')
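
To see what `errors='coerce'` does, here is a small sketch on hypothetical values. It assumes the unparseable entries are blank strings, which this notebook does not show directly. Comparing the raw and converted series reveals exactly which entries were turned into `NaN`.

```python
import pandas as pd

# Hypothetical raw values; " " stands in for a blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# Entries that cannot be parsed as numbers become NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but are NaN after conversion.
bad_entries = raw[converted.isna() & raw.notna()]

print(bad_entries.tolist())
print(converted.dtype)
```

Inspecting `bad_entries` before overwriting the column makes it clear whether the coerced values were genuinely empty or were malformed numbers that deserve a closer look.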

In [78]:
df['TotalCharges'].head(5)

0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
Name: TotalCharges, dtype: float64

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [81]:
df.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [83]:
df[df["TotalCharges"].isnull()][['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges']]

      customerID  tenure  MonthlyCharges  TotalCharges
488   4472-LVYGI       0           52.55           NaN
753   3115-CZMZD       0           20.25           NaN
936   5709-LVOEQ       0           80.85           NaN
1082  4367-NUYAO       0           25.75           NaN
1340  1371-DWPAZ       0           56.05           NaN
3331  7644-OMVMY       0           19.85           NaN
3826  3213-VVOLG       0           25.35           NaN
4380  2520-SGTTA       0           20.00           NaN
5218  2923-ARZLG       0           19.70           NaN
6670  4075-WKNIU       0           73.35           NaN


In [86]:
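# The rows with missing TotalCharges have tenure 0 (not billed yet), so fill with 0.
# Assign the result back rather than calling fillna(..., inplace=True) on a column
# selection, which recent pandas versions warn about.
df['TotalCharges'] = df['TotalCharges'].fillna(0)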

In [87]:
df['TotalCharges'].isnull().sum()

np.int64(0)
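
Filling with 0 is reasonable only if every missing `TotalCharges` belongs to a customer with `tenure == 0`. The sketch below makes that check explicit before filling; the toy rows are hypothetical and the guard is an illustrative addition, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows shaped like the Telco columns.
toy = pd.DataFrame({
    "tenure": [0, 34, 0],
    "MonthlyCharges": [52.55, 56.95, 20.25],
    "TotalCharges": [np.nan, 1889.5, np.nan],
})

missing = toy["TotalCharges"].isna()

# True only if every row with a missing total is a brand-new customer.
all_new = bool((toy.loc[missing, "tenure"] == 0).all())

if all_new:
    toy["TotalCharges"] = toy["TotalCharges"].fillna(0)

print(all_new, toy["TotalCharges"].tolist())
```

If the check failed, a fill value of 0 would understate what those customers have paid, and a different strategy (for example a median, or dropping the rows) would be worth considering.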

In [89]:
df.select_dtypes(include=['object']).nunique()

customerID          7043
gender                 2
Partner                2
Dependents             2
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
Churn                  2
dtype: int64


In [93]:
df["TechSupport"].unique()

array(['No', 'Yes', 'No internet service'], dtype=object)

In [94]:
df["PaymentMethod"].unique()

array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
       'Credit card (automatic)'], dtype=object)

In [95]:
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']

In [96]:
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})

In [97]:
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})

In [98]:
df[['gender', 'Partner', 'Dependents', 'Churn']].head()


   gender  Partner  Dependents  Churn
0       0        1           0      0
1       1        0           0      0
2       1        0           0      1
3       1        0           0      0
4       0        0           0      1
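
One caveat with `Series.map` and a dictionary: any value missing from the dictionary silently becomes `NaN`. The helper below, `map_strict`, is a hypothetical name introduced for illustration. It is a sketch of how to fail loudly instead, assuming unexpected spellings such as `'yes'` could appear in the data.

```python
import pandas as pd

def map_strict(series, mapping):
    """Map values with `mapping`, raising if any non-null value is not covered."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped values in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)

yes_no = {"Yes": 1, "No": 0}

# Clean input maps as expected.
clean = map_strict(pd.Series(["Yes", "No"], name="Partner"), yes_no)

# An unexpected spelling is reported instead of becoming NaN.
try:
    map_strict(pd.Series(["Yes", "No", "yes"], name="Partner"), yes_no)
    message = ""
except ValueError as err:
    message = str(err)

print(clean.tolist())
print(message)
```

In this notebook the `unique()` checks play the same role manually; a strict helper is mainly useful when the same encoding is re-run on fresh data.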


In [99]:
service_cols = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 
                'TechSupport', 'StreamingTV', 'StreamingMovies']

In [100]:
for col in service_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0, 'No internet service': 0})


In [101]:
df['MultipleLines'] = df['MultipleLines'].map({'Yes': 1, 'No': 0, 'No phone service': 0})

In [102]:
df[['MultipleLines', 'OnlineSecurity', 'TechSupport']].head()

   MultipleLines  OnlineSecurity  TechSupport
0              0               0            0
1              0               1            0
2              0               1            0
3              0               1            1
4              0               0            0


In [103]:
onehot_cols = ['InternetService', 'Contract', 'PaymentMethod']

In [104]:
# One indicator column per category; drop_first=False keeps every category explicit
df = pd.get_dummies(df, columns=onehot_cols, drop_first=False)

In [120]:
# Check new columns
print("New column names:")
df.columns.tolist()


New column names:


['customerID',
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'MultipleLines',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'PaperlessBilling',
 'MonthlyCharges',
 'TotalCharges',
 'Churn',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']

In [122]:
df["InternetService_DSL"].head() 

0     True
1     True
2     True
3     True
4    False
Name: InternetService_DSL, dtype: bool
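
The indicator columns come back as `bool` here. scikit-learn estimators generally accept boolean columns, but if strictly integer columns are preferred, a cast like the following sketch can be applied (toy data is hypothetical; passing `dtype=int` to `pd.get_dummies` is an alternative).

```python
import pandas as pd

# Hypothetical toy rows.
toy = pd.DataFrame({
    "tenure": [1, 34],
    "InternetService": ["DSL", "Fiber optic"],
})

dummies = pd.get_dummies(toy, columns=["InternetService"])

# Cast only the boolean indicator columns, leaving other columns untouched.
bool_cols = dummies.select_dtypes(include="bool").columns
dummies[bool_cols] = dummies[bool_cols].astype(int)

print(dummies.dtypes)
```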

In [123]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,...,InternetService_DSL,InternetService_Fiber optic,InternetService_No,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,7590-VHVEG,0,0,1,0,1,0,0,0,1,...,True,False,False,True,False,False,False,False,True,False
1,5575-GNVDE,1,0,0,0,34,1,0,1,0,...,True,False,False,False,True,False,False,False,False,True
2,3668-QPYBK,1,0,0,0,2,1,0,1,1,...,True,False,False,True,False,False,False,False,False,True
3,7795-CFOCW,1,0,0,0,45,0,0,1,0,...,True,False,False,False,True,False,True,False,False,False
4,9237-HQITU,0,0,0,0,2,1,0,0,0,...,False,True,False,True,False,False,False,False,True,False


In [129]:
len(df.columns)

28

In [130]:
df = df.drop('customerID', axis=1)

In [131]:
print("Final dataset shape:", df.shape)
print("Total features:", df.shape[1] - 1)

Final dataset shape: (7043, 27)
Total features: 26
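
Before splitting, it can help to confirm that no text columns remain. The helper `non_numeric_columns` below is a hypothetical name used for illustration; it relies on `pandas.api.types.is_numeric_dtype`, which treats boolean columns as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_columns(frame):
    """Return the names of columns that are neither numeric nor boolean."""
    return [col for col in frame.columns if not is_numeric_dtype(frame[col])]

# Hypothetical toy frame with one column left unencoded on purpose.
toy = pd.DataFrame({
    "tenure": [1, 34],
    "Partner": [1, 0],
    "InternetService_DSL": [True, False],
    "Contract": ["Month-to-month", "One year"],
})

leftover = non_numeric_columns(toy)
print(leftover)
```

An empty list from this check on the real frame would confirm that every remaining column can be passed to a model.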


## Step 6: Separate Features and Target
`X` = all columns except `Churn` (the features we use to predict)

`y` = the `Churn` column (the target we are trying to predict)

In [138]:
X = df.drop('Churn', axis=1)
y = df['Churn']

print("Features shape:", X.shape)
print("Target shape:", y.shape)

print(f"\nTarget distribution: {y.value_counts()}")


Features shape: (7043, 26)
Target shape: (7043,)

Target distribution: Churn
0    5174
1    1869
Name: count, dtype: int64


## Step 7: Split Data into Training and Testing Sets
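Hold out 20% of the rows for testing and train on the remaining 80%. Passing `stratify=y` keeps the churn rate (about 26.5%) the same in both sets, and `random_state=42` makes the split reproducible.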

In [140]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)



In [141]:
print("Training set:")
print("  X_train shape:", X_train.shape)
print("  y_train shape:", y_train.shape)
print("\nTest set:")
print("  X_test shape:", X_test.shape)
print("  y_test shape:", y_test.shape)

print("\nChurn distribution in training set:")
print(y_train.value_counts())
print("\nChurn distribution in test set:")
print(y_test.value_counts())

Training set:
  X_train shape: (5634, 26)
  y_train shape: (5634,)

Test set:
  X_test shape: (1409, 26)
  y_test shape: (1409,)

Churn distribution in training set:
Churn
0    4139
1    1495
Name: count, dtype: int64

Churn distribution in test set:
Churn
0    1035
1     374
Name: count, dtype: int64
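
The counts above can be turned into rates to confirm that stratification worked. The sketch below reproduces the idea on a hypothetical toy target with a 25% positive rate; the sizes are made up so that the split divides evenly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25% positives.
X = pd.DataFrame({"x": range(100)})
y = pd.Series([0] * 75 + [1] * 25, name="Churn")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With a 0/1 target, the mean is the positive rate.
overall_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()

print(overall_rate, train_rate, test_rate)
```

Applied to the real split, 1495 / 5634 and 374 / 1409 both come out at roughly 26.5%, matching the overall 1869 / 7043.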


## Step 8: Save Processed Data
Save the cleaned and encoded data for the modeling phase.

In [142]:
import os
# Create the output folder if it does not exist yet
os.makedirs('../data/processed', exist_ok=True)

In [143]:
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
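
When these files are read back in the modeling notebook, `pd.read_csv` returns a DataFrame even for the single-column target files. The sketch below shows one way to recover a 1-D Series; it writes to a temporary directory so it does not touch the real `../data/processed` folder, and the toy values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy target.
y_train = pd.Series([0, 1, 0], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_train.to_csv(path, index=False)

    # read_csv yields a one-column DataFrame; select the column to get a Series.
    loaded = pd.read_csv(path)["Churn"]

print(loaded.tolist())
```

Because the files are saved with `index=False`, the original row labels are not preserved; features and target stay aligned only by row order.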


## Summary: Feature Engineering Complete!

### What We Did:
1. ✅ Fixed TotalCharges data type (converted to numeric)
2. ✅ Handled 11 missing values (filled with 0)
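3. ✅ Encoded binary variables (Yes/No → 1/0, Male/Female → 1/0), treating "No internet service" and "No phone service" as 0
4. ✅ Encoded multi-category variables (one-hot encoding of InternetService, Contract, and PaymentMethod)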
5. ✅ Removed customerID (not predictive)
6. ✅ Split into training (80%) and test (20%) sets
7. ✅ Saved processed data

### Results:
- **26 features** ready for machine learning
- **5,634 training samples**
- **1,409 test samples**
- **Data is clean, numeric, and ready for modeling!**

### Next Phase:
Build machine learning models to predict customer churn!