# 🏦 Bank Customer Churn Prediction
## Notebook 4: Feature Engineering & Preprocessing

**Goal:** Transform the cleaned dataset into the numeric format required by scikit-learn models.

Two transformations happen here:
1. **Standardise numerical features**: put all features on the same scale (mean=0, std=1).
2. **Encode categorical features**: convert text labels to numeric dummy variables.

### ⚠️ Why SMOTE is NOT applied here

A common mistake is to apply SMOTE (oversampling) to the **full dataset** before the train/test split.  
This causes **data leakage** into the test set:

```
❌ WRONG ORDER (what causes 100% fake accuracy)
   Full data → SMOTE → train/test split
   Problem: synthetic samples generated from ALL data appear in the test set.
             The model has effectively seen variations of the test samples during training.
             Test accuracy becomes artificially inflated and meaningless.

✅ CORRECT ORDER (implemented in N5)
   Full data → train/test split → SMOTE on X_train only → train model → evaluate on REAL test set
   The test set contains only original, real customers the model has never seen.
```

This notebook saves the **imbalanced** (but scaled and encoded) data.  
SMOTE is applied correctly in N5, **after** the split.
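
The correct order can be sketched on toy data. This is only an illustration, not the N5 implementation: the data is random, the split is a plain shuffle, and a simplified interpolation between random minority rows stands in for SMOTE (real SMOTE interpolates towards nearest neighbours). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (assumption): 100 customers, 2 features, 20% churners (label 1).
X = rng.normal(size=(100, 2))
y = np.array([1] * 20 + [0] * 80)

# 1) Split FIRST: hold out 25 original rows as the test set.
idx = rng.permutation(len(X))
test_idx, train_idx = idx[:25], idx[25:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Oversample the minority class using TRAINING rows only.
#    Simplified stand-in for SMOTE: interpolate between two random minority rows.
minority = X_train[y_train == 1]
n_needed = int((y_train == 0).sum() - (y_train == 1).sum())
a = minority[rng.integers(len(minority), size=n_needed)]
b = minority[rng.integers(len(minority), size=n_needed)]
synthetic = a + rng.random((n_needed, 1)) * (b - a)

X_train_bal = np.vstack([X_train, synthetic])
y_train_bal = np.concatenate([y_train, np.ones(n_needed, dtype=int)])

print("train class counts:", np.bincount(y_train_bal))  # balanced after oversampling
print("test  class counts:", np.bincount(y_test))       # untouched, still imbalanced
```

Because the synthetic rows are built only from `X_train`, no information about the held-out rows reaches the training set, and the test set contains original rows only.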

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

sns.set_theme(style='whitegrid')

df = pd.read_csv('df_cleaned.csv')
print(f'Cleaned data loaded: {df.shape}')
df.head()

## 1. Separate Features (X) from Target (y)

In [None]:
X = df.drop('Exited', axis=1)   # Feature matrix: everything except the target
y = df['Exited']                 # Target vector: 0 or 1

print(f'X shape: {X.shape}  (features)')
print(f'y shape: {y.shape}  (target)')
print()
print('Feature columns:', X.columns.tolist())

## 2. Standardise Numerical Features

Many ML algorithms are sensitive to the *scale* of features. Without standardisation, `Balance` (range 0–250,000) would dominate `Tenure` (range 0–10) simply due to magnitude.

**StandardScaler** transforms each column to **mean = 0, std = 1**:  
`z = (x - μ) / σ`
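
As a quick worked example of the formula, here is a minimal sketch on four made-up balance values (the numbers are assumptions for illustration, not taken from the dataset):

```python
import numpy as np

# Hypothetical balances, chosen only to illustrate the formula.
balance = np.array([0.0, 50_000.0, 100_000.0, 250_000.0])

mu = balance.mean()
sigma = balance.std()          # population std (ddof=0), the convention StandardScaler uses
z = (balance - mu) / sigma     # z = (x - mu) / sigma, applied element-wise

print("z-scores:", z.round(3))
print("mean of z:", round(float(z.mean()), 6), " std of z:", round(float(z.std()), 6))
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original units.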

We use a **CustomScaler** that only scales selected columns. Binary flags like `HasCrCard` and `IsActiveMember` must NOT be scaled: their 0/1 values are already meaningful.

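> **Critical:** We call `.fit()` on the **full X** here because in N5 we will split
> the already-scaled data. The scaler learns mean/std from all 10,000 real customers,
> which is a reasonable reference for inference on new data. It does mean test-set rows
> contribute to the scaling statistics, a far milder leak than SMOTE before the split;
> fitting the scaler on `X_train` only would avoid it.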
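
For comparison, a stricter alternative to the approach used in this notebook is to split first and learn the scaling statistics from the training rows only. The sketch below is an assumption-laden illustration (random toy data, a simple 80/20 `sample` split, plain pandas arithmetic instead of `CustomScaler`), not what this notebook or N5 does:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-ins for two numerical columns (values are made up).
df = pd.DataFrame({
    "Balance": rng.uniform(0, 250_000, size=200),
    "Tenure": rng.integers(0, 11, size=200).astype(float),
})

# Split first (80/20), then learn mean/std from the training rows only.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

mu = train.mean(axis=0)
sigma = train.std(axis=0, ddof=0)   # population std, as StandardScaler uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # reuses the TRAINING statistics

print(train_scaled.mean().round(3).to_dict())  # zero by construction
print(test_scaled.mean().round(3).to_dict())   # generally near zero, not exactly zero
```

The test rows are scaled with statistics they did not help compute, which is the same situation a genuinely new customer is in at inference time.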

In [None]:
class CustomScaler(BaseEstimator, TransformerMixin):
    """
    Applies StandardScaler only to specified columns.
    All other columns are returned unchanged.
    Inheriting from BaseEstimator + TransformerMixin gives sklearn compatibility
    and a free .fit_transform() method.
    """

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.scaler    = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns   = columns
        self.with_mean = with_mean
        self.with_std  = with_std
        self.copy      = copy
        self.mean_     = None
        self.std_      = None

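    def fit(self, X, y=None):
        """Learn mean and std of the selected columns from the data passed in."""
        self.scaler.fit(X[self.columns], y)
        # Per-column statistics. axis=0 is explicit so the result does not depend
        # on how np.mean/np.std treat a DataFrame across pandas versions.
        self.mean_ = X[self.columns].mean(axis=0)
        self.std_  = X[self.columns].std(axis=0, ddof=0)  # ddof=0 matches StandardScaler
        return self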

    def transform(self, X, y=None, copy=None):
        """Apply scaling, preserving the original column order."""
        init_col_order = X.columns
        X_scaled    = pd.DataFrame(
            self.scaler.transform(X[self.columns]),
            columns=self.columns,
            index=X.index
        )
        X_notscaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_notscaled, X_scaled], axis=1)[init_col_order]

print('CustomScaler defined ✓')

In [None]:
# Columns to scale, explicitly listed to exclude binary flags
numerical_cols = ['CreditScore', 'Age', 'Tenure', 'Balance',
                  'NumOfProducts', 'EstimatedSalary',
                  'Satisfaction Score', 'Point Earned']

churn_scaler = CustomScaler(columns=numerical_cols)
churn_scaler.fit(X)          # Learn mean/std from all 10,000 real customers
X = churn_scaler.transform(X)

print('Scaling applied. Sanity checks:')
print(f'  Balance mean  (should ≈ 0) : {X["Balance"].mean():.4f}')
print(f'  Balance std   (should ≈ 1) : {X["Balance"].std():.4f}')
print(f'  HasCrCard mean (unchanged)  : {X["HasCrCard"].mean():.4f}  ← still a binary proportion')
X.head()

In [None]:
# Save the fitted scaler: the inference pipeline must load and use this same object
# CRITICAL: inference pipeline must call .transform(), never .fit_transform()
with open('Scaler_file.pkl', 'wb') as f:
    pickle.dump(churn_scaler, f)

print('✅ Scaler saved  →  Scaler_file.pkl')
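
To see why inference must call `.transform()` and never `.fit_transform()`, here is a minimal sketch. It assumes a plain `StandardScaler` as a stand-in for `CustomScaler`, toy values for two columns, and an in-memory pickle round-trip instead of `Scaler_file.pkl`:

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data (assumption): two columns standing in for Age and Balance.
X_train = np.array([[30.0, 0.0], [40.0, 50_000.0], [50.0, 100_000.0]])
scaler = StandardScaler().fit(X_train)

# Round-trip through pickle in memory (the notebook writes to a file instead).
restored = pickle.loads(pickle.dumps(scaler))

new_customer = np.array([[40.0, 100_000.0]])

ok = restored.transform(new_customer)               # uses the stored training mean/std
bad = StandardScaler().fit_transform(new_customer)  # refits on a single row

print("transform     :", ok.round(3))
print("fit_transform :", bad.round(3))
```

Refitting on one row makes that row its own mean, so every feature collapses to 0 and the model receives no information about the customer. The restored scaler keeps the statistics learned earlier.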

## 3. One-Hot Encoding for Categorical Features

ML models require numbers. We convert text categories to binary (0/1) columns.

**`drop_first=True`** removes one category per feature to avoid the **dummy variable trap** (perfect multicollinearity):

| Feature | Categories | After encoding |
|---|---|---|
| Geography | France, Germany, Spain | `Geography_Germany`, `Geography_Spain` (France = 0, 0) |
| Gender | Female, Male | `Gender_Male` (Female = 0) |
| Card Type | DIAMOND, GOLD, PLATINUM, SILVER | `Card Type_GOLD`, `Card Type_PLATINUM`, `Card Type_SILVER` (DIAMOND = 0, 0, 0) |
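
The dummy variable trap can be checked numerically. The sketch below uses a five-row toy `Geography` column (an assumption for illustration) and compares the rank of the design matrix, including an intercept column, with and without `drop_first=True`. The helper `rank_with_intercept` is hypothetical.

```python
import numpy as np
import pandas as pd

geo = pd.DataFrame({"Geography": ["France", "Germany", "Spain", "France", "Germany"]})

full = pd.get_dummies(geo, columns=["Geography"], dtype="int")
reduced = pd.get_dummies(geo, columns=["Geography"], drop_first=True, dtype="int")

# Every row of the full encoding sums to 1, which duplicates an intercept column.
print(full.sum(axis=1).tolist())

def rank_with_intercept(dummies):
    """Return (rank, number of columns) of [intercept | dummies]."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

print(rank_with_intercept(full))     # rank is lower than the column count
print(rank_with_intercept(reduced))  # rank equals the column count
```

With all three dummies present, one column is an exact linear combination of the others plus the intercept; dropping the first category removes that redundancy.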

In [None]:
categorical_cols = ['Geography', 'Gender', 'Card Type']

print('Unique values before encoding:')
for col in categorical_cols:
    print(f'  {col}: {sorted(X[col].unique())}')

data_dummies = pd.get_dummies(X, columns=categorical_cols, drop_first=True, dtype='int')

print(f'\nShape before encoding: {X.shape}')
print(f'Shape after  encoding: {data_dummies.shape}')

new_cols = [c for c in data_dummies.columns if c not in X.columns.tolist()]
print('\nNew dummy columns:', new_cols)

In [None]:
# Reorder to a fixed column sequence and re-attach the target
# Fixed order is critical: the model expects features in this exact arrangement
FEATURE_COLUMNS = [
    'HasCrCard', 'IsActiveMember', 'CreditScore', 'Age', 'Tenure',
    'Balance', 'NumOfProducts', 'EstimatedSalary', 'Satisfaction Score',
    'Point Earned', 'Geography_Germany', 'Geography_Spain', 'Gender_Male',
    'Card Type_GOLD', 'Card Type_PLATINUM', 'Card Type_SILVER'
]

data_processed = pd.concat([data_dummies[FEATURE_COLUMNS], y], axis=1)

print('Final processed data:')
print(f'  Shape : {data_processed.shape}  (10,000 real customers, NO synthetic samples)')
print(f'  Target: {data_processed["Exited"].value_counts().to_dict()}  (imbalanced; SMOTE applied in N5)')
data_processed.head()
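
The fixed column list also matters at inference time, when a single new customer contains only one category per feature. One possible way to align such a row with `FEATURE_COLUMNS` is sketched below. The customer values are made up and assumed to be already scaled, and the `reindex` approach is an illustration, not necessarily what the downstream pipeline does.

```python
import pandas as pd

FEATURE_COLUMNS = [
    'HasCrCard', 'IsActiveMember', 'CreditScore', 'Age', 'Tenure',
    'Balance', 'NumOfProducts', 'EstimatedSalary', 'Satisfaction Score',
    'Point Earned', 'Geography_Germany', 'Geography_Spain', 'Gender_Male',
    'Card Type_GOLD', 'Card Type_PLATINUM', 'Card Type_SILVER'
]

# Hypothetical new customer; numerical values are assumed to be already scaled.
new_customer = pd.DataFrame([{
    'HasCrCard': 1, 'IsActiveMember': 0, 'CreditScore': 0.3, 'Age': -0.5,
    'Tenure': 1.1, 'Balance': 0.0, 'NumOfProducts': -0.9, 'EstimatedSalary': 0.2,
    'Satisfaction Score': 0.0, 'Point Earned': 0.4,
    'Geography': 'Germany', 'Gender': 'Female', 'Card Type': 'GOLD',
}])

# No drop_first here: with a single row it would drop the only category present.
encoded = pd.get_dummies(new_customer, columns=['Geography', 'Gender', 'Card Type'],
                         dtype='int')

# reindex discards reference-level dummies (e.g. Gender_Female) and fills the
# dummies this customer does not have with 0, in the exact training order.
aligned = encoded.reindex(columns=FEATURE_COLUMNS, fill_value=0)
print(aligned.iloc[0].to_dict())
```

The result has the same 16 columns in the same order as the training data, whichever categories the new row happens to contain.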

## 4. Class Imbalance: Acknowledge, Do Not Fix Here

The target is imbalanced (~80% stayed, ~20% churned). We visualise it here to document the state of the data, but **do not apply SMOTE yet**.
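
A small sketch shows why this imbalance deserves attention later. It assumes an exact 80/20 toy target mirroring the approximate ratio above, and a trivial "model" that always predicts "stayed":

```python
import numpy as np

# Assumed toy target: 80 stayed (0) and 20 churned (1).
y = np.array([0] * 80 + [1] * 20)
always_stayed = np.zeros_like(y)   # baseline that never predicts churn

accuracy = (always_stayed == y).mean()
recall_churn = ((always_stayed == 1) & (y == 1)).sum() / (y == 1).sum()

print(f"accuracy: {accuracy:.2f}, churn recall: {recall_churn:.2f}")
```

The baseline reaches 80% accuracy while identifying no churners at all, so accuracy alone is a weak yardstick on this target.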

In [None]:
counts = data_processed['Exited'].value_counts().sort_index()  # fixed order: 0 (stayed), 1 (churned)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(['Stayed (0)', 'Churned (1)'], counts.values,
       color=['#4C72B0', '#DD8452'], edgecolor='white', linewidth=1.5)
for i, v in enumerate(counts.values):
    ax.text(i, v + 50, f'{v:,}\n({v/len(data_processed)*100:.1f}%)',
            ha='center', fontweight='bold')
ax.set_title('Class Imbalance in Saved Data\n(SMOTE will be applied in N5 after train/test split)',
             fontsize=11)
ax.set_ylabel('Count')
ax.set_ylim(0, 9500)
plt.tight_layout()
plt.show()

print('Imbalance ratio:', f"{counts[0]/counts[1]:.1f} : 1  (stayed : churned)")
print()
print('This imbalance is intentionally preserved in the saved file.')
print('SMOTE will be applied ONLY to X_train in N5, after the train/test split.')

## 5. Save Checkpoint

In [None]:
data_processed.to_csv('data_processed.csv', index=False)

print('✅ data_processed.csv saved')
print(f'   Rows    : {data_processed.shape[0]:,}  (original, real customers only)')
print(f'   Columns : {data_processed.shape[1]}  (16 features + 1 target)')
print()
print('Artifacts produced by N4:')
print('  Scaler_file.pkl    ← fitted CustomScaler for inference pipeline')
print('  data_processed.csv ← scaled + encoded, imbalanced, ready for N5')

---
## ✅ Feature Engineering Summary

| Step | Input | Output | Notes |
|---|---|---|---|
| Separate X / y | 14 cols | 13 features + 1 target | |
| Standardise 8 numerical cols | Raw values | mean=0, std=1 | HasCrCard, IsActiveMember NOT scaled |
| One-hot encode 3 categorical cols | 3 text columns | 6 binary columns (13 features become 16) | drop_first=True |
| Reorder columns | Mixed order | 16 features in fixed order | |
| **Save** | - | **10,000 rows × 17 cols, imbalanced** | |
| SMOTE | ❌ Not here | - | Applied in N5 after split |

➡️ Continue to **N5_Model_Train_Test_Selection**, where the correct train/test split and SMOTE pipeline are implemented.