  <h3 style="color: teal; background-color: white; padding: 10px; border-radius: 5px; text-align:center">
  3: Data Understanding & Exploratory Data Analysis (EDA)
</h3>

In [55]:
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


In [56]:
# -------------------------------
#  Load Data Saved from 02_
# -------------------------------

# Load DataFrame
df = pd.read_pickle("../outputs/bank_df.pkl")


<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  3.1 Target Variable Encoding
</h4>

The target variable `y` indicates whether a client subscribed to a term deposit after a marketing campaign. Because the later modeling steps expect a numeric target, it is encoded as a binary value:

`no` → `0`

`yes` → `1`

In [57]:
# Encode target variable 'y' to numerical values
df['y_encoded'] = df['y'].map({'no': 0, 'yes': 1})
df[['y', 'y_encoded']].head()



| | y | y_encoded |
|---|---|---|
| 0 | no | 0 |
| 1 | no | 0 |
| 2 | no | 0 |
| 3 | no | 0 |
| 4 | no | 0 |


The encoded variable `y_encoded` is used as the target for all subsequent modeling tasks.
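One thing worth checking after a dictionary-based `map` is whether any label fell outside the mapping, since pandas turns such values into `NaN` without raising an error. The sketch below uses a small hypothetical Series (not the bank data) with a mis-cased label and a missing value to show how unmapped entries could be detected.

```python
import pandas as pd

# Hypothetical labels for illustration only; the real column is df['y'].
labels = pd.Series(["no", "yes", "no", "Yes", None])

# Same mapping as in the cell above.
encoded = labels.map({"no": 0, "yes": 1})

# Values missing from the mapping become NaN instead of raising an error.
unmapped = labels[encoded.isna()]
print("Unmapped labels:", unmapped.tolist())
print("Number of unmapped rows:", int(encoded.isna().sum()))
```

If the count is zero on the real data, the encoded column can be used as the target without further cleaning.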

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  3.2 Feature Engineering
</h4>

In [58]:
# Campaign-related features
df['total_contacts'] = df['campaign'] + df['previous']  # contacts in the current campaign plus earlier campaigns
df['contacted_before'] = (df['pdays'] != 999).astype('int64')  # 1 if contacted before, 0 if pdays is the 999 sentinel


<b>`total_contacts`</b>

Why:
- `campaign` counts contacts made during the current campaign
- `previous` counts contacts made in earlier campaigns
- Individually, neither reflects the overall contact pressure on a client

What this feature captures:
- The total exposure of a client to marketing efforts
- Helps identify whether excessive contact affects subscription likelihood

<b>`contacted_before`</b>

Why:
- `pdays` = 999 means the client was never contacted before
- Used directly, `pdays` mixes a “never contacted” marker with a real day count, which can mislead models

What this feature captures:
- A clear binary signal:
  - 1: Client has been contacted before
  - 0: Client has never been contacted

Benefit:
- Simplifies interpretation
- Helps models distinguish new vs returning clients
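To make the two campaign features concrete, here is a minimal sketch on a hypothetical three-row frame. The column names match the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: a first-time client, a returning client, another first-time client.
toy = pd.DataFrame({
    "campaign": [1, 3, 2],     # contacts in the current campaign
    "previous": [0, 2, 0],     # contacts in earlier campaigns
    "pdays":    [999, 6, 999], # 999 is the "never contacted" sentinel
})

# Sum of current and earlier contacts.
toy["total_contacts"] = toy["campaign"] + toy["previous"]

# 1 when pdays holds a real day count, 0 when it holds the sentinel.
toy["contacted_before"] = (toy["pdays"] != 999).astype("int64")

print(toy)
```

The second row ends up with `total_contacts` of 5 and `contacted_before` of 1, while the two sentinel rows get `contacted_before` of 0.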

In [59]:
# Age-based feature
bins = [17, 25, 35, 45, 55, 65, 100]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

<b>`age_group`</b>

Why:
- Raw `age` is a continuous variable
- Customer behavior often differs by life stage, not exact age

What this feature captures:
- Age-based segments that are easier to interpret
- Non-linear relationships between age and subscription behavior

Benefit:
- Improves interpretability
- Works well with models that accept categorical inputs and with grouped visualizations
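Because `pd.cut` uses right-closed intervals by default, the bin edges above define the groups (17, 25], (25, 35], and so on up to (65, 100]. The sketch below applies the same `bins` and `labels` to a few hypothetical boundary ages to show where each one lands. Note that an age of exactly 17, or any age above 100, falls outside every interval and becomes `NaN`; whether such ages occur in the bank data is worth checking.

```python
import pandas as pd

bins = [17, 25, 35, 45, 55, 65, 100]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']

# Hypothetical boundary ages, chosen to probe the bin edges.
ages = pd.Series([17, 18, 25, 26, 65, 66, 100, 101])

# Default right=True: each interval excludes its left edge and includes its right edge.
groups = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"age": ages, "age_group": groups}))
print("Ages without a group:", ages[groups.isna()].tolist())
```

On the real data, `df['age_group'].isna().sum()` would reveal whether any client falls outside the bins.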

In [60]:
# Transform 'pdays'
df['pdays_transformed'] = df['pdays'].replace(999, -1)

<b>`pdays_transformed`</b>

Why:
- 999 is a sentinel, not a real day count; it represents “never contacted”
- Models may interpret it incorrectly as a very long gap since the last contact

What this feature captures:
- A cleaner numerical representation
- `-1` clearly indicates “no previous contact”

Benefit:
- Prevents misleading distance-based calculations
- Makes the feature usable for numerical models
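The effect of the sentinel on simple statistics can be illustrated with a hypothetical five-value Series. This is only a sketch of the distortion, not a measurement on the bank data.

```python
import pandas as pd

# Hypothetical pdays values: three never-contacted clients and two real day counts.
pdays = pd.Series([999, 999, 999, 3, 6])

# With the sentinel left in place, the mean is dominated by 999.
raw_mean = pdays.mean()

# Same replacement as in the cell above.
transformed = pdays.replace(999, -1)
transformed_mean = transformed.mean()

# Mean over clients who were actually contacted, for comparison.
contacted_mean = pdays[pdays != 999].mean()

print(f"mean with 999 sentinel: {raw_mean:.1f}")
print(f"mean with -1 sentinel:  {transformed_mean:.1f}")
print(f"mean of real day counts: {contacted_mean:.1f}")
```

Replacing 999 with -1 keeps the value range compact, although -1 is still a placeholder rather than a measured number of days, so pairing it with `contacted_before` remains useful.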

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  3.3 Feature Selection and Leakage Prevention
</h4>

The original dataset contains a variable named `duration`, representing the length of the phone call. This variable is only known after the call has been completed and therefore would not be available at the time of prediction in a real-world scenario.

Including this variable would introduce data leakage, leading to overly optimistic and unrealistic model performance. Consequently, `duration` is removed from the feature set.

In [61]:
# Prepare feature matrix X and target vector y
X = df.drop(columns=['y', 'y_encoded', 'duration'])
y = df['y_encoded']

<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  3.4 Feature Type Identification
</h4>

In [66]:
# Identify categorical features (object dtype)
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Identify numerical features (int64 and float64 dtypes)
numerical_features = (
    X.select_dtypes(include=['int64', 'float64'])
      .columns
      .tolist()
)

print("Categorical Features:", categorical_features)
print("Numerical Features:", numerical_features)



Categorical Features: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
Numerical Features: ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'total_contacts', 'contacted_before', 'pdays_transformed']
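Note that `age_group` appears in neither list above. `pd.cut` returns a column of `category` dtype, which `select_dtypes(include=['object'])` does not match, so the column is not passed to any transformer that is built from these lists. The sketch below reproduces this on a hypothetical three-row frame and shows one possible adjustment, adding `'category'` to the `include` list; whether to apply it here is a modeling decision.

```python
import pandas as pd

# Hypothetical frame with one object column and one category column.
toy = pd.DataFrame({
    "age": [22, 40, 70],
    "job": pd.Series(["admin", "services", "retired"], dtype=object),
})
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[17, 25, 35, 45, 55, 65, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56-65', '66+'],
)

# Selecting only object columns skips the category column created by pd.cut.
object_only = toy.select_dtypes(include=["object"]).columns.tolist()

# Including 'category' as well picks it up.
with_category = toy.select_dtypes(include=["object", "category"]).columns.tolist()

print("object only:  ", object_only)
print("with category:", with_category)
```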


<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  3.5 Feature Transformation Pipeline
</h4>

A unified preprocessing pipeline is constructed using a `ColumnTransformer` so that the same transformations are applied consistently to training and test data.

- Numerical features are standardized using `StandardScaler`, which centers each feature at zero and scales it to unit variance.
- Categorical features are encoded using `OneHotEncoder`, with:
  - `drop='first'` to reduce multicollinearity
  - `handle_unknown='ignore'` to process unseen categories in the test set without raising an error; such categories are encoded as all zeros, the same encoding as the dropped first category

In [63]:


preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)
    ]
)


<h4 style="
  margin-bottom: 4px;
  background-color: #f3f4f6;
  padding: 4px 8px;
  border-radius: 4px;
  display: inline-block;
  color: black;
">
  3.6 Summary
</h4>

The data preparation process ensures that:
- The target variable is correctly encoded
- Data leakage is explicitly avoided
- Feature types (numerical and categorical) are handled appropriately
- Domain-informed feature engineering is applied to capture customer behavior more effectively
- All transformations, including engineered features, are reproducible and compatible with model pipelines

Specifically, additional features were engineered to summarize campaign contact history, indicate prior customer contact, segment customers by age group, and handle special values in contact timing.

In [64]:
# Save artifacts with pickle for use in the next notebooks

import os
import pickle

os.makedirs("../outputs", exist_ok=True)

# Save prepared DataFrame
with open("../outputs/data_prepared.pkl", "wb") as f:
    pickle.dump(df, f)

# Save feature metadata
feature_metadata = {
    "categorical_features": categorical_features,
    "numerical_features": numerical_features,
    "engineered_features": [
        "total_contacts",
        "contacted_before",
        "age_group",
        "pdays_transformed"
    ]
}

with open("../outputs/feature_metadata.pkl", "wb") as f:
    pickle.dump(feature_metadata, f)

print("Data preparation artifacts saved using pickle.")

# Save preprocessor
with open("../outputs/preprocessor.pkl", "wb") as f:
    pickle.dump(preprocessor, f)

print("Preprocessor saved to outputs/preprocessor.pkl")


Data preparation artifacts saved using pickle.
Preprocessor saved to outputs/preprocessor.pkl
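A later notebook could read these artifacts back with `pickle.load`. The round trip below is a sketch that writes to a temporary directory rather than `../outputs`, and uses a shortened, hypothetical metadata dictionary so that it runs on its own.

```python
import os
import pickle
import tempfile

# Shortened, hypothetical metadata for illustration.
feature_metadata = {
    "categorical_features": ["job", "marital"],
    "numerical_features": ["age", "campaign"],
    "engineered_features": ["total_contacts", "contacted_before"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "feature_metadata.pkl")

    # Write the dictionary to disk.
    with open(path, "wb") as f:
        pickle.dump(feature_metadata, f)

    # Read it back, as a downstream notebook would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["numerical_features"])
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn object is best loaded with the same library version that saved it.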
