# Telco Customer Churn Prediction - Feature Engineering & Preprocessing

## 1.0 Introduction

This notebook extends the work from `1.0-EDA-DataCleaning.ipynb` by focusing on preparing the cleaned Telco Churn dataset for machine learning model training. This involves transforming categorical features into a numerical format, scaling numerical features, and splitting the dataset into training and testing sets.

## 2.0 Data Loading and Initial Setup
* **Objective:** Load the dataset and re-apply the essential cleaning and initial preprocessing steps from the previous notebook to ensure the data is in a consistent state before advanced preprocessing.
### 2.1 Import Libraries

In [70]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer



### 2.2 Load Data and Re-apply Basic Cleaning
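The raw CSV is loaded and the essential cleaning steps from the EDA notebook are reapplied: column names are lowercased, `customerid` is dropped, `totalcharges` is converted to numeric (rows that fail to parse are removed), and `churn` is mapped to 1/0. This returns the DataFrame to its cleaned state.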

In [72]:
# Load raw data
df = pd.read_csv('../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# Re-apply the cleaning steps from the EDA notebook
df.columns = df.columns.str.lower()
df.drop('customerid', axis=1, inplace=True)
# Entries that cannot be parsed as numbers become NaN and are then dropped
df['totalcharges'] = pd.to_numeric(df['totalcharges'], errors='coerce')
df.dropna(subset=['totalcharges'], inplace=True)
# Encode the target as 1 (churned) / 0 (not churned)
df['churn'] = df['churn'].map({'Yes': 1, 'No': 0})
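
The effect of `errors='coerce'` followed by `dropna` can be seen on a small stand-in frame. This is a minimal sketch: the three rows and the blank-string entry are made-up illustrative values, not rows taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw file: one 'TotalCharges' entry is a blank string.
raw = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15"],
    "Churn": ["No", "Yes", "No"],
})

raw.columns = raw.columns.str.lower()

# errors='coerce' turns unparseable strings into NaN instead of raising.
raw["totalcharges"] = pd.to_numeric(raw["totalcharges"], errors="coerce")
n_missing = int(raw["totalcharges"].isna().sum())

# Drop the rows that failed to parse, then encode the target.
cleaned = raw.dropna(subset=["totalcharges"]).copy()
cleaned["churn"] = cleaned["churn"].map({"Yes": 1, "No": 0})

print(n_missing)      # 1
print(cleaned.shape)  # (2, 2)
```

Counting the NaN values before dropping them is a cheap way to confirm how many rows the cleaning step removes.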

In [73]:
df.shape

(7032, 20)

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   seniorcitizen     7032 non-null   int64  
 2   partner           7032 non-null   object 
 3   dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   phoneservice      7032 non-null   object 
 6   multiplelines     7032 non-null   object 
 7   internetservice   7032 non-null   object 
 8   onlinesecurity    7032 non-null   object 
 9   onlinebackup      7032 non-null   object 
 10  deviceprotection  7032 non-null   object 
 11  techsupport       7032 non-null   object 
 12  streamingtv       7032 non-null   object 
 13  streamingmovies   7032 non-null   object 
 14  contract          7032 non-null   object 
 15  paperlessbilling  7032 non-null   object 
 16  paymentmethod     7032 non-null   object 
 17  

## 3.0 Feature and Target Separation
* **Objective:** Separate the independent features (`X`) from the target variable (`y`). This is standard practice before applying machine learning algorithms.

In [76]:
X = df.drop('churn', axis=1)
y = df['churn']
print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')

X shape: (7032, 19)
y shape: (7032,)


## 4.0 Feature Engineering and Preprocessing Pipelines
* **Objective:** Prepare the features for machine learning models. This involves converting categorical features into a numerical format and scaling numerical features.
### 4.1 Identify column types
* Features are categorized into numerical and categorical types to apply appropriate preprocessing steps.

In [78]:
# Split features into numerical and categorical groups for preprocessing.
# Every column not listed as numerical (including the int64 seniorcitizen) is treated as categorical.
numerical_features = ['tenure', 'monthlycharges', 'totalcharges']
categorical_features = [col for col in X.columns if col not in numerical_features]
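
One consequence of listing the numerical columns by hand is worth seeing on a toy frame: an integer-coded flag such as `seniorcitizen` ends up in the categorical group, whereas a dtype-based selection would classify it as numeric. The sketch below uses four made-up columns; `X_toy` and `dtype_numeric` are illustrative names, not part of this notebook.

```python
import pandas as pd

# Toy frame with the same mix of dtypes as the real feature matrix.
X_toy = pd.DataFrame({
    "seniorcitizen": [0, 1, 0],
    "tenure": [1, 34, 2],
    "monthlycharges": [29.85, 56.95, 53.85],
    "contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Explicit list, as in the cell above: everything else is categorical.
numerical_features = ["tenure", "monthlycharges"]
categorical_features = [c for c in X_toy.columns if c not in numerical_features]

# For comparison: selecting by dtype also picks up the integer flag.
dtype_numeric = X_toy.select_dtypes(include="number").columns.tolist()

print(categorical_features)  # ['seniorcitizen', 'contract']
print(dtype_numeric)         # ['seniorcitizen', 'tenure', 'monthlycharges']
```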

### 4.2 One-Hot Encoding for Categorical Features
* **Implementation:** We use `OneHotEncoder` from `sklearn.preprocessing`. The `handle_unknown='ignore'` parameter prevents errors if a new, unseen category appears at transform time; such a category is encoded as all zeros for that feature. The `drop='first'` parameter avoids perfect multicollinearity (the "Dummy Variable Trap") by dropping the first category's column for each original feature, so a feature with k categories yields k-1 columns.
### 4.3 Scaling for Numerical Features
* **Implementation:** `StandardScaler` from `sklearn.preprocessing` is applied to the numerical features, rescaling each one to zero mean and unit variance.
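
Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation are computed per column. A minimal sketch on four made-up tenure values shows that `StandardScaler` matches the manual formula when the population standard deviation (`ddof=0`, NumPy's default) is used:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four illustrative tenure values as a single-column feature matrix.
tenure = np.array([[1.0], [12.0], [24.0], [72.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(tenure)

# Manual z-score using the population standard deviation (ddof=0).
manual = (tenure - tenure.mean(axis=0)) / tenure.std(axis=0)

print(np.allclose(scaled, manual))  # True
```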
### 4.4 Combining Preprocessing Steps with ColumnTransformer
* **Implementation:** `ColumnTransformer` from `sklearn.compose` applies different transformations to different subsets of columns and concatenates the results into a single feature matrix.


In [80]:
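# Preprocessing: scale numerical columns, one-hot encode categorical columns
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_features),
])
# Note: fitting on the full dataset means test rows also influence the scaler statistics
X_processed = preprocessor.fit_transform(X)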
 

## 5.0 Train-Test Split
* **Objective:** Divide the processed dataset into training and testing subsets. The model will be trained solely on the training data and evaluated on the unseen testing data to assess its generalization performance.

**Parameters:**

- `test_size=0.2`: 20% of the data will be used for the test set, and 80% for the training set.
- `random_state=42`: A fixed integer for the random state ensures that the split is reproducible, meaning you'll get the same train/test split every time you run the code.
- `stratify=y`: This is crucial for classification tasks, especially with imbalanced datasets (like churn). It ensures that the proportion of target classes (churned vs. non-churned) is approximately the same in both the training and testing sets as it is in the original dataset. This prevents scenarios where one set might have significantly more or fewer churned customers, leading to biased evaluation.
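
The effect of `stratify` and `random_state` can be checked on synthetic labels. This is a minimal sketch: the 25% positive rate and the 1000-row size are arbitrary illustrative choices, not the churn rate of this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1000 rows, 25% positives.
y_demo = np.array([1] * 250 + [0] * 750)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Repeating the call with the same random_state reproduces the split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))            # 800 200
print(y_tr.mean(), y_te.mean())        # 0.25 0.25
print(np.array_equal(X_te, X_te_again))  # True
```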

In [82]:
#train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, random_state=42, test_size=0.2, stratify=y)

In [83]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (5625, 30)
X_test shape: (1407, 30)
y_train shape: (5625,)
y_test shape: (1407,)


## 6.0 Saving Processed Data
* **Objective:** Save the preprocessed and split datasets (X_train, X_test, y_train, y_test) to disk. This allows for quick loading in subsequent notebooks (e.g., for model training) without re-running the entire preprocessing pipeline.

* **Method:** The data is saved using `numpy.savez`, which stores multiple NumPy arrays in a single uncompressed `.npz` archive (`numpy.savez_compressed` is the compressed variant). A dedicated `data/processed/` directory is used for these output files.



In [85]:
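import os

processed_data_dir = '../data/processed/'
# Create the output directory if it does not exist yet
os.makedirs(processed_data_dir, exist_ok=True)
# Store all four arrays in a single .npz archive, keyed by name
np.savez(processed_data_dir+'telco_churn_processed_data.npz', 
         X_train=X_train, 
         X_test=X_test,
         y_train=y_train,
         y_test=y_test)
print(f"\nProcessed data saved to {processed_data_dir}telco_churn_processed_data.npz")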



Processed data saved to ../data/processed/telco_churn_processed_data.npz
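
A later notebook can read the arrays back with `np.load`, using the same keyword names that were passed to `np.savez`. The sketch below round-trips two small made-up arrays through a temporary directory; the file name `example_processed_data.npz` is a placeholder, not the file written above.

```python
import os
import tempfile

import numpy as np

# Small stand-ins for the real arrays.
X_train_demo = np.arange(12, dtype=float).reshape(4, 3)
y_train_demo = np.array([0, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_processed_data.npz")
    np.savez(path, X_train=X_train_demo, y_train=y_train_demo)

    # np.load returns a dict-like archive; arrays are read when accessed by key.
    with np.load(path) as data:
        keys = sorted(data.files)
        X_loaded = data["X_train"]
        y_loaded = data["y_train"]

print(keys)                                      # ['X_train', 'y_train']
print(np.array_equal(X_loaded, X_train_demo))    # True
```

A pandas Series passed to `np.savez` (such as `y_train` here) is stored as a plain array, so its index is not preserved in the archive.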


## 7.0 Conclusion & Next Steps
This notebook has successfully transformed the raw Telco Churn dataset into a clean, preprocessed, and split format suitable for machine learning.

The next steps will involve:
1. **Model Training:** Selecting and training various machine learning classification models.
2. **Model Evaluation:** Assessing model performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score, ROC AUC).
3. **Hyperparameter Tuning:** Optimizing model parameters for improved performance.
4. **Model Interpretation:** Gaining insights into which features are most important for predicting churn.
