“In feature engineering, I cleaned the data, removed non-informative columns, encoded categorical variables, and scaled numerical features. I built the preprocessing pipeline with ColumnTransformer and fitted it on the training split only, which prevents data leakage and keeps preprocessing consistent between training and testing.”

In [5]:
import pandas as pd

df = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")

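# TotalCharges is read as text; values that cannot be parsed as numbers become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Drop rows with missing values (including the unparseable TotalCharges entries)
df = df.dropna()
# customerID is a unique identifier with no predictive value
df = df.drop(columns=['customerID'])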
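
The coercion step above can be checked on a small example before rows are dropped. This is a minimal sketch on made-up values, not the real dataset; the `toy` frame and its contents are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the values are made up.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# Strings that cannot be parsed as numbers (here, a blank) become NaN
# instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count how many rows would be lost before actually dropping them.
n_missing = int(toy["TotalCharges"].isna().sum())
cleaned = toy.dropna(subset=["TotalCharges"])

print(n_missing, len(cleaned))  # prints: 1 2
```

Counting the NaN values first makes it visible how many rows `dropna()` removes, which is easy to overlook when the call is applied to the whole frame.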


In [6]:
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})


In [7]:
df['Churn'].value_counts()


Churn
0    5163
1    1869
Name: count, dtype: int64
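
The counts above imply an imbalanced target. A short sketch of the arithmetic, using only the two counts printed above (the `counts` Series is rebuilt by hand here rather than read from the dataset):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 5163, 1: 1869}, name="count")

# Share of customers who churned (class 1).
churn_rate = counts[1] / counts.sum()

print(f"{churn_rate:.1%}")  # prints: 26.6%
```

Roughly one customer in four churned, which is why the later split is stratified on the target.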

In [8]:
X = df.drop('Churn', axis=1)
y = df['Churn']


In [9]:
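# int64/float64 columns are treated as numerical (this includes the 0/1 SeniorCitizen flag)
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
# String (object) columns are treated as categorical
categorical_cols = X.select_dtypes(include=['object']).columns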


In [10]:
%pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
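
To illustrate what `stratify=y` does, the sketch below repeats the split on a synthetic target with the same class counts as above. The feature matrix is a placeholder (row indices only), so this shows the split sizes and class ratios, not anything about the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the same class counts as the notebook (5163 / 1869).
y = np.array([0] * 5163 + [1] * 1869)
# Placeholder features: just the row index.
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(y_train), len(y_test))  # prints: 5625 1407
# With stratification, both splits keep roughly the overall churn rate.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Without stratification the churn share in the test set could drift from the overall rate by chance, which makes evaluation metrics harder to compare.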


In [12]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)


In [13]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
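
The fit-on-train, transform-on-test pattern above can be seen on a tiny example. This is a sketch on made-up rows; the two column names are borrowed from the dataset, but the values, the unseen "Three year" category, and `sparse_threshold=0` (used here only to force a dense array for easy inspection) are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows.
train = pd.DataFrame({
    "tenure": [1, 12, 24, 48],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
})
# Made-up test row with a category the encoder never saw during fit.
test = pd.DataFrame({"tenure": [6], "Contract": ["Three year"]})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Scaling statistics and category lists are learned from the training rows only.
train_out = pre.fit_transform(train)
# The test row is transformed with those learned values; nothing is refit.
test_out = pre.transform(test)

print(train_out.shape, test_out.shape)  # prints: (4, 4) (1, 4)
print(test_out[0, 1:])  # the unseen category encodes as all zeros
```

Because the test row is scaled with the training mean and standard deviation, no information from the test set reaches the fitted transformer.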


In [14]:
X_train_processed.shape, X_test_processed.shape


((5625, 45), (1407, 45))

In [16]:
import joblib
joblib.dump(preprocessor, "../src/preprocessor.pkl")


['../src/preprocessor.pkl']
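
A saved transformer can be restored later with `joblib.load` and reused without refitting. The sketch below shows the round trip with a small `StandardScaler` and a temporary directory; the scaler, the data, and the file location are stand-ins, not the notebook's actual preprocessor or path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessor: a scaler fitted on made-up data.
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor.pkl"
    joblib.dump(scaler, str(path))
    restored = joblib.load(str(path))

# The restored object applies exactly the same learned statistics.
new_rows = np.array([[2.0], [4.0]])
same = np.allclose(scaler.transform(new_rows), restored.transform(new_rows))
print(same)  # prints: True
```

Pickled scikit-learn objects are generally tied to the library version that wrote them, so it helps to record the scikit-learn version alongside the file.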

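* Dropped the unique identifier column (customerID), which carries no predictive signal.

* One-hot encoded categorical features so they are in a numerical format suitable for machine learning models; categories not seen during fitting are ignored at transform time.

* Standardized numerical features to zero mean and unit variance with StandardScaler.

* Applied a stratified 80/20 train–test split (5,625 training rows, 1,407 test rows) to evaluate model generalization.

* Fitted the preprocessor on the training split only and saved it with joblib, giving a clean, model-ready dataset with 45 features for downstream training and evaluation.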
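
For downstream training, one common way to keep the leakage guarantee during cross-validation is to wrap the preprocessor and the model in a single scikit-learn `Pipeline`, so the preprocessing is refit inside each fold. The sketch below is illustrative only: the data is random, and `LogisticRegression` is an assumed placeholder model, not one chosen in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Random stand-in data; two column names are borrowed from the dataset.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
y = rng.integers(0, 2, size=n)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ]
)

# Placeholder classifier; any scikit-learn estimator could take its place.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold fits the preprocessor on that fold's training part only.
scores = cross_val_score(model, X, y, cv=5)
print(len(scores))  # prints: 5
```

Since the target here is random, the scores themselves carry no meaning; the point is only that scaling and encoding happen inside each fold.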