# 02 – Preprocessing Pipeline (Imputation, Encoding, Scaling, Split)

### 🎯 Objectives
- Handle missing values (imputation).
- Encode categorical variables.
- Scale numerical features.
- Split dataset into train/test sets.

---

### 1. Missing Values (Imputation)
- **Task:** Identify missing values per feature.
- **Plan:**  
  - Numeric features → impute using mean/median.  
  - Categorical features → impute using mode.  
- **Acceptance Criteria:**  
  - Missing values summarized in a table.  
  - Imputation strategy documented and applied.

---

### 2. Encoding (Categorical Variables)
- **Task:** Convert categorical variables to numeric.  
- **Plan:**  
  - Binary variables (e.g., `male`, `currentSmoker`, `diabetes`) → label encoding (0/1).  
  - Multi-category (if any) → one-hot encoding.  
- **Acceptance Criteria:**  
  - All categorical columns converted to numeric without loss of information.

---

### 3. Scaling (Numerical Variables)
- **Task:** Standardize/normalize numeric features.  
- **Plan:**  
  - Try StandardScaler (mean=0, std=1).  
  - Optionally compare with MinMaxScaler (range 0–1).  
- **Acceptance Criteria:**  
  - Scaled dataset ready for models sensitive to feature scale (e.g., Logistic Regression, SVM).
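
The optional comparison in the plan could be sketched roughly as follows. The toy values are illustrative assumptions, not rows from the dataset; only the column names `sysBP` and `diabetes` are borrowed from it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (hypothetical values): one wide-range feature and one binary feature
toy = pd.DataFrame({
    "sysBP": [106.0, 121.0, 127.5, 150.0, 130.0],
    "diabetes": [0, 0, 0, 1, 0],
})

standardized = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
normalized = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler: each column ends up with mean 0 and population std 1
print(standardized.describe().loc[["mean", "min", "max"]])
# MinMaxScaler: each column is rescaled to span the range 0 to 1
print(normalized.describe().loc[["mean", "min", "max"]])
```

Both scalers are linear rescalings, so the choice mainly affects how outliers and regularization interact with each feature rather than the ordering of values.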

---

### 4. Splitting the Dataset
- **Task:** Create training and testing sets.  
- **Plan:**  
  - Use `train_test_split` with `stratify` set to the target (`TenYearCHD`).  
  - Ratio: 80% train / 20% test.  
- **Acceptance Criteria:**  
  - Train/test sets created with preserved class balance.

---

### ✅ Expected Outcome
- Clean dataset saved in `data/processed/`.  
- Clear documentation of imputation, encoding, scaling, and splitting steps.  
- Ready-to-use data for modeling.


In [2]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
df = pd.read_csv("../data/processed/eda_cleaned.csv")

In [5]:
df.isnull().sum()

male                0
age                 0
education          91
currentSmoker       0
cigsPerDay          0
BPMeds             51
prevalentStroke     0
prevalentHyp        0
diabetes            0
totChol             0
sysBP               0
diaBP               0
BMI                 0
heartRate           0
glucose             0
TenYearCHD          0
MAP                 0
Age_Group           0
HeavySmoker         0
PackYears           0
glucose_missing     0
outlier_IF          0
outlier_count       0
dtype: int64

Column    | Missing Before | Strategy | Missing After
--------- | -------------- | -------- | -------------
education | 91             | Mode     | 0
BPMeds    | 51             | Mode     | 0


In [7]:
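# Fill every column that has gaps with its mode (most frequent value).
# Only `education` and `BPMeds` have missing values at this point.
for col in df.columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])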
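
The before/after summary table shown earlier can also be generated rather than typed by hand. The sketch below does this on a toy frame; the values are hypothetical and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "education": [1.0, 2.0, np.nan, 1.0, 4.0],
    "BPMeds": [0.0, np.nan, 0.0, 1.0, 0.0],
    "age": [39, 46, 48, 61, 46],
})

missing_before = toy.isnull().sum()
cols_with_gaps = missing_before[missing_before > 0].index

imputed = toy.copy()
for col in cols_with_gaps:
    # mode() returns sorted values, so iloc[0] picks the smallest one on ties
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

# One row per imputed column: count before, strategy used, count after
summary = pd.DataFrame({
    "Missing Before": missing_before[cols_with_gaps],
    "Strategy": "Mode",
    "Missing After": imputed[cols_with_gaps].isnull().sum(),
})
print(summary)
```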

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3774 entries, 0 to 3773
Data columns (total 23 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   male             3774 non-null   int64  
 1   age              3774 non-null   int64  
 2   education        3774 non-null   float64
 3   currentSmoker    3774 non-null   int64  
 4   cigsPerDay       3774 non-null   float64
 5   BPMeds           3774 non-null   float64
 6   prevalentStroke  3774 non-null   int64  
 7   prevalentHyp     3774 non-null   int64  
 8   diabetes         3774 non-null   int64  
 9   totChol          3774 non-null   float64
 10  sysBP            3774 non-null   float64
 11  diaBP            3774 non-null   float64
 12  BMI              3774 non-null   float64
 13  heartRate        3774 non-null   float64
 14  glucose          3774 non-null   float64
 15  TenYearCHD       3774 non-null   int64  
 16  MAP              3774 non-null   float64
 17  Age_Group     

### Derived Features – Outlier Indicators

- **outlier_IF**  
  - Binary flag (0/1) generated by IsolationForest.  
  - Indicates whether a record was detected as an outlier.  

- **outlier_count**  
  - Numeric feature representing the number of variables where a record was flagged as an outlier.  
  - Higher values indicate more extreme cases.  

**Decision:**  
- Keep both features for now.  
- They may capture additional risk information.  
- Will evaluate their impact during modeling (compare models with and without these features).
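
For reference, a flag like `outlier_IF` could be produced along the following lines. This is a minimal sketch on synthetic data: the `contamination` value, the feature set, and the "1 = outlier" convention are assumptions, because the step that created the column is not part of this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for two numeric features, plus one planted extreme record
X = np.vstack([rng.normal(size=(200, 2)), [[12.0, 12.0]]])

iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)  # scikit-learn returns -1 for outliers, 1 for inliers

# Assumed convention: 1 = flagged as outlier, 0 = not flagged
outlier_flag = (labels == -1).astype(int)
print(outlier_flag.sum(), "of", len(outlier_flag), "records flagged")
```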


In [11]:
df_encoded = pd.get_dummies(df, columns=['Age_Group'], prefix='Age_Group', drop_first=True)
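
To make the effect of `drop_first=True` concrete, here is a small sketch on toy data. The labels `Young`, `Middle`, and `Senior` are hypothetical and shorter than the real `Age_Group` levels.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [35, 45, 65, 52],
    "Age_Group": ["Young", "Middle", "Senior", "Middle"],  # hypothetical labels
})

encoded = pd.get_dummies(
    toy, columns=["Age_Group"], prefix="Age_Group", drop_first=True, dtype=int
)

# Levels are sorted alphabetically, so "Middle" is dropped and becomes the baseline
print(encoded.columns.tolist())
# A row with zeros in every Age_Group_* column belongs to the dropped level
print(encoded)
```

Dropping one level avoids perfectly collinear dummy columns, which matters for linear models such as Logistic Regression.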

In [23]:
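from sklearn.preprocessing import StandardScaler

# Separate features and target so the target is never scaled
x = df_encoded.drop(columns=['TenYearCHD'])
y = df_encoded['TenYearCHD']

# Standardize every feature column (binary and one-hot columns included).
# Note: the scaler is fit on the full dataset here, before the train/test split.
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

# fit_transform returns a NumPy array, so restore the column names
x_scaled = pd.DataFrame(x_scaled, columns=x.columns)

# Re-attach the target; reset_index keeps its rows aligned with x_scaled
df_scaled = pd.concat([x_scaled, y.reset_index(drop=True)], axis=1)

df_scaled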

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,...,MAP,HeavySmoker,PackYears,glucose_missing,outlier_IF,outlier_count,Age_Group_Middle Age II (50–59),Age_Group_Senior (60–70),Age_Group_Young Adult (30–39),TenYearCHD
0,1.123650,-1.235977,1.996573,-0.973845,-0.756532,-0.175687,-0.076574,-0.672382,-0.166667,-0.965341,...,-1.206052,-0.348634,-0.725434,0.0,0.225745,-0.342636,-0.674876,-0.444652,2.573760,0
1,-0.889957,-0.420349,0.041444,-0.973845,-0.756532,-0.175687,-0.076574,-0.672382,-0.166667,0.319160,...,-0.350541,-0.348634,-0.725434,0.0,0.225745,-0.342636,-0.674876,-0.444652,-0.388537,0
2,1.123650,-0.187312,-0.936120,1.026858,0.935961,-0.175687,-0.076574,-0.672382,-0.166667,0.202388,...,-0.246493,-0.348634,0.933781,0.0,0.225745,-0.342636,-0.674876,-0.444652,-0.388537,0
3,-0.889957,1.327426,1.019009,1.026858,1.782207,-0.175687,-0.076574,1.487251,-0.166667,-0.264704,...,0.967408,2.868342,2.841879,0.0,0.225745,1.395415,-0.674876,2.248952,-0.388537,1
4,-0.889957,-0.420349,1.019009,1.026858,1.189835,-0.175687,-0.076574,-0.672382,-0.166667,1.136570,...,-0.003713,2.868342,1.055457,0.0,0.225745,-0.342636,-0.674876,-0.444652,-0.388537,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3769,1.123650,0.162243,1.019009,1.026858,2.882327,-0.175687,-0.076574,-0.672382,-0.166667,-0.685086,...,-0.269615,2.868342,3.198610,0.0,0.225745,1.395415,1.481755,-0.444652,-0.388537,0
3770,-0.889957,-0.187312,0.041444,1.026858,0.935961,-0.175687,-0.076574,-0.672382,-0.166667,0.272451,...,-0.535517,-0.348634,0.933781,0.0,0.225745,-0.342636,-0.674876,-0.444652,-0.388537,0
3771,-0.889957,0.278761,0.041444,-0.973845,-0.756532,-0.175687,-0.076574,-0.672382,-0.166667,0.762897,...,0.030970,-0.348634,-0.725434,0.0,0.225745,-0.342636,1.481755,-0.444652,-0.388537,0
3772,1.123650,-1.119459,1.019009,-0.973845,-0.756532,-0.175687,-0.076574,1.487251,-0.166667,-1.198886,...,0.898042,-0.348634,-0.725434,0.0,0.225745,-0.342636,-0.674876,-0.444652,-0.388537,0


### Scaling Strategy – Standardization

- **Scaler Used:** `StandardScaler` (from scikit-learn).  
- **Transformation:** Each feature is standardized:  
$$
z = \frac{x - \mu}{\sigma}
$$
where $\mu$ = mean and $\sigma$ = standard deviation.

- **Why Standardization?**
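  - Logistic Regression (our baseline model) is sensitive to feature scales, because scikit-learn's `LogisticRegression` applies L2 regularization by default and the penalty treats all coefficients alike.  
  - Ensures that features with larger numeric ranges (e.g., `sysBP`, `totChol`) do not dominate smaller ones (e.g., `diabetes` = 0/1).  
  - Puts coefficients on a comparable scale: a one-unit change in a standardized feature corresponds to a one-standard-deviation change in the original feature.  
  - Improves convergence and training stability for models fitted with gradient-based solvers.  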

- **Note:**  
  - Tree-based models (Random Forest, Gradient Boosting) do not require scaling.  
  - However, we keep standardized features to allow fair comparison across models.

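All features standardized → mean = 0, std = 1. The exception is any constant column, which `StandardScaler` maps to all zeros; `glucose_missing` appears to be one, since every value shown in the output above is 0.0.  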
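
In this notebook the scaler is fit on the full dataset and the train/test split happens afterwards, so test rows contribute to $\mu$ and $\sigma$. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The values are generated at random and are illustrative assumptions; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in (hypothetical values) for two numeric features and a binary target
X = pd.DataFrame({
    "sysBP": rng.normal(130, 20, size=500),
    "totChol": rng.normal(235, 45, size=500),
})
y = pd.Series(rng.binomial(1, 0.15, size=500), name="TenYearCHD")

# 1) Split first, stratifying on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2) Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

# Training columns are centred at 0; test columns are only approximately centred
print(X_train_scaled.mean().round(3).tolist())
print(X_test_scaled.mean().round(3).tolist())
```

With this ordering the test set plays no part in estimating the scaling parameters, and the same fitted `scaler` can be applied to new records later.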


In [26]:
from sklearn.model_selection import train_test_split

X = df_scaled.drop(columns="TenYearCHD")
y = df_scaled["TenYearCHD"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (3019, 24)
X_test shape: (755, 24)
y_train shape: (3019,)
y_test shape: (755,)


In [27]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

TenYearCHD
0    0.847963
1    0.152037
Name: proportion, dtype: float64
TenYearCHD
0    0.847682
1    0.152318
Name: proportion, dtype: float64


# 02 – Preprocessing Pipeline

## ✦ MT1.2 | Preprocessing Steps

### 1. Missing Values (Imputation)
- **education**
  - Missing before: 91
  - Strategy: Mode (most common value)
  - Missing after: 0
- **BPMeds**
  - Missing before: 51
  - Strategy: Mode (most common value = 0)
  - Missing after: 0

✅ All missing values imputed.

---

### 2. Encoding (Categorical Variables)
- **Age_Group**
  - One-Hot Encoding with `drop_first=True`.
  - New columns:
    - Age_Group_Middle Age II (50–59)
    - Age_Group_Senior (60–70)
    - Age_Group_Young Adult (30–39)
- Other categorical variables (male, currentSmoker, diabetes, prevalentStroke, prevalentHyp, BPMeds) already in binary numeric format.

✅ All categorical variables converted to numeric.

---

### 3. Scaling (Numerical Features)
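- Applied `StandardScaler` to all 24 features (excluding the target), including the binary and one-hot columns; the scaler was fit on the full dataset before splitting.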
- Transformation:
  $$
  z = \frac{x - \mu}{\sigma}
  $$
  where $\mu$ = mean, $\sigma$ = standard deviation.
- **Benefits:**
  - Ensures fair comparison across features with different ranges.
  - Logistic Regression coefficients become comparable across features (a one-unit change = a one-standard-deviation change in the original feature).
  - Improves optimization and convergence for gradient-based models.

✅ All features standardized (mean=0, std=1).
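
As a quick check of the formula, the sketch below standardizes a toy column by hand and compares the result with `StandardScaler`. The values are hypothetical. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `pandas.Series.std` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sysBP-like values, shaped as a single-column feature matrix
x = np.array([[106.0], [121.0], [127.5], [150.0], [130.0]])

mu = x.mean()
sigma = x.std()  # NumPy's default is ddof=0, matching StandardScaler
z_manual = (x - mu) / sigma

z_sklearn = StandardScaler().fit_transform(x)

# Both routes should agree up to floating-point rounding
print(np.allclose(z_manual, z_sklearn))
```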

---

### 4. Splitting the Dataset
- Used `train_test_split` with 80% train / 20% test.
- `stratify=y` to preserve class balance.
- Sizes:
  - X_train: (3019, 24)
  - X_test: (755, 24)
  - y_train: (3019,)
  - y_test: (755,)
- Target distribution preserved:
  - Train: 84.8% no CHD / 15.2% CHD
  - Test: 84.8% no CHD / 15.2% CHD

✅ Train/test sets created with stratified sampling.

---

## ✅ Outcome
- Dataset fully imputed, encoded, scaled, and split.
- Final objects ready for modeling:
  - `X_train, X_test, y_train, y_test`
- Next step → move to **03-modeling.ipynb** to build and evaluate models.


In [None]:
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)