# **03 — Feature Engineering**

# 1. Objective
The objective of this notebook is to prepare the dataset for predictive modeling by transforming raw variables into a machine-learning–ready feature matrix.

Based on insights obtained during exploratory data analysis, this step includes:

* Target variable encoding
* Explicit handling of data leakage risks
* Treatment of categorical variables
* Creation of the final feature matrix (X) and target vector (y)
* Train-test split with class stratification

No models are trained or evaluated in this notebook.

# 2. Dataset Loading

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [13]:
# Load dataset
df = pd.read_csv(
    "data/bank-additional-full.csv",
    sep=";"
)

# 3. Target Variable Encoding
The target variable `y` is binary and represents whether the client subscribed to a term deposit.

In [14]:
# Encode target variable
df['y_binary'] = df['y'].map({'no': 0, 'yes': 1})

# Sanity check
df[['y', 'y_binary']].head()


|   | y | y_binary |
|---|---|---|
| 0 | no | 0 |
| 1 | no | 0 |
| 2 | no | 0 |
| 3 | no | 0 |
| 4 | no | 0 |
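
`Series.map` returns `NaN` for any value that is missing from the mapping, so an unexpected label would silently produce a missing target, and `head()` only inspects the first five rows. A minimal sketch of a stricter encoding, run on toy data; the helper name `encode_target` is hypothetical and not part of this notebook:

```python
import pandas as pd


def encode_target(labels: pd.Series) -> pd.Series:
    """Map 'no'/'yes' to 0/1 and fail loudly on any other label."""
    mapping = {"no": 0, "yes": 1}
    encoded = labels.map(mapping)
    if encoded.isna().any():
        # Collect the labels that the mapping did not cover
        unexpected = sorted(labels[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unmapped target labels: {unexpected}")
    return encoded.astype("int64")


# Toy series standing in for df['y'] (values are made up for illustration)
toy = pd.Series(["no", "yes", "no"])
toy_encoded = encode_target(toy)
```

Casting to `int64` only after the check keeps the target an integer column, which it could not be if any `NaN` had slipped through.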


# 4. Data Leakage Control

## 4.1 Excluding Call Duration

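The variable `duration` records the length of the last call in seconds, so its value is only known once the call has ended, by which point the outcome is known as well.

Including this variable would introduce data leakage: a model used to decide whom to call has to make its prediction before the call takes place, when `duration` is not yet available.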

In [15]:
# Drop the original string target (superseded by y_binary) and the leakage-prone feature
leakage_features = ['y', 'duration']
df_model = df.drop(columns=leakage_features)
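
Because later cells could reintroduce a dropped column by accident (for example by going back to `df` instead of `df_model`), a guard can be run just before the split. This is a sketch under that assumption; `assert_no_leakage` is a hypothetical helper, not something the notebook defines:

```python
from typing import Iterable

import pandas as pd


def assert_no_leakage(
    frame: pd.DataFrame, forbidden: Iterable[str] = ("y", "duration")
) -> None:
    """Raise if any column known to leak the outcome is still present."""
    present = [col for col in forbidden if col in frame.columns]
    if present:
        raise ValueError(f"Leakage-prone columns still present: {present}")


# Toy frame standing in for df (values are made up for illustration)
toy = pd.DataFrame(
    {"age": [30, 41], "duration": [120, 0], "y": ["yes", "no"], "y_binary": [1, 0]}
)
toy_model = toy.drop(columns=["y", "duration"])
assert_no_leakage(toy_model)  # passes silently; the raw toy frame would raise
```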


# 5. Handling Categorical Variables

## 5.1 Identification of Categorical and Numerical Features

In [16]:
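categorical_features = df_model.select_dtypes(include=['object']).columns.tolist()
# Note: numerical_features still contains the target column 'y_binary' at this point;
# it is separated from the features in Section 6.
numerical_features = df_model.select_dtypes(exclude=['object']).columns.tolist()

categorical_features, numerical_features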


(['job',
  'marital',
  'education',
  'default',
  'housing',
  'loan',
  'contact',
  'month',
  'day_of_week',
  'poutcome'],
 ['age',
  'campaign',
  'pdays',
  'previous',
  'emp.var.rate',
  'cons.price.idx',
  'cons.conf.idx',
  'euribor3m',
  'nr.employed',
  'y_binary'])

## 5.2 Treatment of Informative Missing Values
Several categorical variables contain placeholder values such as `"unknown"` or `"nonexistent"`.

These values are kept as categories of their own rather than imputed or removed, because they may carry predictive information and represent real business states (e.g., no prior contact).

No action is taken at this stage; the encoding step treats them like any other category.
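
To judge how much weight these placeholders carry, it can help to measure how often they occur in each column. The following is a minimal sketch on toy data; the helper `placeholder_share` and the `PLACEHOLDERS` list are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Labels treated as informative placeholders (taken from the text above)
PLACEHOLDERS = ["unknown", "nonexistent"]


def placeholder_share(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Fraction of rows per column that hold a placeholder label."""
    return frame[columns].isin(PLACEHOLDERS).mean().sort_values(ascending=False)


# Toy frame standing in for df_model (values are made up for illustration)
toy = pd.DataFrame(
    {
        "job": ["admin.", "unknown", "services", "unknown"],
        "poutcome": ["nonexistent", "nonexistent", "success", "nonexistent"],
        "age": [30, 41, 52, 28],
    }
)
shares = placeholder_share(toy, ["job", "poutcome"])
```

On the real data, the call would presumably be `placeholder_share(df_model, categorical_features)`, using the list built in Section 5.1.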

## 5.3 One-Hot Encoding

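Categorical variables are converted into numerical format using one-hot encoding. The first level of each variable is dropped (`drop_first=True`) so that it acts as the reference category and the resulting indicator columns are not perfectly collinear.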

In [17]:
df_encoded = pd.get_dummies(
    df_model,
    columns=categorical_features,
    drop_first=True
)
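
The effect of `drop_first=True` is easiest to see on a small example. The sketch below uses made-up toy data and passes `dtype=int` only so that the indicators print as 0/1 regardless of pandas version; the notebook's own call does not set it:

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
toy = pd.DataFrame(
    {
        "contact": ["cellular", "telephone", "cellular"],
        "poutcome": ["failure", "nonexistent", "success"],
        "age": [30, 41, 52],
    }
)
categorical = ["contact", "poutcome"]

encoded = pd.get_dummies(toy, columns=categorical, drop_first=True, dtype=int)

# Each variable with k levels contributes k - 1 indicator columns
expected_dummies = sum(toy[col].nunique() - 1 for col in categorical)
```

Here `contact` (2 levels) yields one column and `poutcome` (3 levels) yields two, so `encoded` has one numeric plus three indicator columns. A row whose indicators for a variable are all zero belongs to the dropped reference level, which is the first level in sorted order (`cellular` and `failure` in this toy case).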


# 6. Feature Matrix and Target Vector

In [18]:
# Separate features and target
X = df_encoded.drop(columns=['y_binary'])
y = df_encoded['y_binary']

X.shape, y.shape


((41188, 52), (41188,))

# 7. Train-Test Split
Given the strong class imbalance observed during EDA (roughly 11% of clients subscribed), the split is stratified on `y` to preserve class proportions in both sets. 20% of the rows are held out for testing.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [20]:
# Verify class distribution
y_train.mean(), y_test.mean()


(np.float64(0.11265553869499241), np.float64(0.11264870114105366))
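
The near-identical positive rates above are what stratification is meant to guarantee. A small synthetic illustration follows, assuming a 10% positive class chosen so that the counts divide evenly; the arrays are made up and unrelated to the bank data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with exactly 10 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,
)

# With stratification, both splits keep the 10% positive rate
train_rate, test_rate = y_tr.mean(), y_te.mean()
```

Without `stratify`, the 20 test rows would be a plain random draw and could contain anywhere from zero to several positives, which makes metrics on a minority class unstable.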

# 8. Output Artifacts
This notebook produces the following outputs for downstream modeling:

* `X_train`, `X_test`
* `y_train`, `y_test`
* Fully encoded, leakage-free feature matrix

These artifacts are used directly in the modeling and evaluation stage.
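
The notebook does not show how these objects reach the modeling stage. If they are handed over through files rather than kept in the same session, one option is to pickle each split. The sketch below is an assumption throughout: `save_splits`, the `processed` directory, and the file names are hypothetical, and a temporary directory is used so the example leaves nothing behind:

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_splits(splits: dict, out_dir: Path) -> list:
    """Pickle each named DataFrame/Series to out_dir and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, obj in splits.items():
        path = out_dir / f"{name}.pkl"
        obj.to_pickle(path)
        paths.append(path)
    return paths


# Toy objects standing in for the real splits (values are made up)
toy_splits = {
    "X_train": pd.DataFrame({"age": [30, 41]}),
    "y_train": pd.Series([0, 1], name="y_binary"),
}

with tempfile.TemporaryDirectory() as tmp:
    written = save_splits(toy_splits, Path(tmp) / "processed")
    reloaded = pd.read_pickle(written[0])  # round-trip check on X_train
```

Pickle preserves dtypes and the index exactly, which CSV does not; it should only be loaded from sources you trust.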