# 03 — Feature Engineering & Preprocessing
## HumanForYou — Employee Attrition Prediction

---

### Objective

Transform raw merged data into **model-ready features**:
1. Handle missing values with justified strategies
2. Encode categorical variables (ordinal vs. one-hot)
3. Engineer new features from existing ones
4. Scale numerical features
5. Address class imbalance (SMOTE)
6. Export train/test splits for modeling

> This notebook expects `merged_data.csv` from **01_Data_Validation_Pipeline**.

## Section 1: Setup

In [1]:
# ==============================================================================
# IMPORTS
# ==============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path

# Only suppress expected warnings, not real errors
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE

# --- Path Configuration (same logic as notebook 01) ---
_cwd = Path.cwd()
if (_cwd / "data" / "raw").exists():
    PROJECT_ROOT = _cwd
elif (_cwd.parent / "data" / "raw").exists():
    PROJECT_ROOT = _cwd.parent
else:
    raise FileNotFoundError(
        "Cannot find project root: 'data/raw/' not found in CWD or parent. "
        "Run this notebook from the project root or notebooks/ directory."
    )

OUTPUT_DIR = str(PROJECT_ROOT / "outputs")

df = pd.read_csv(f"{OUTPUT_DIR}/merged_data.csv")

# Binary target
df["Attrition"] = (df["Attrition"] == "Yes").astype(int)

print(f"Loaded: {df.shape[0]} rows x {df.shape[1]} columns")

Loaded: 4410 rows x 29 columns


## Section 2: Feature Engineering & Encoding (before split)

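Row-wise features (ratios, row means) and fixed-mapping encodings use no statistics estimated from the data, so creating them before the split carries no leakage risk.
Imputation is fitted on data, so it happens **after** the split to avoid data leakage.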
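
The leakage argument can be checked on toy data. The sketch below is illustrative only (the values are made up, and `add_row_wise_features` is a hypothetical helper, not part of this notebook): it computes two of the derived features on a full frame and on a subset of rows, and confirms that the subset gets the same values either way because each value depends only on its own row.

```python
import numpy as np
import pandas as pd

# Toy data; values are illustrative, not taken from merged_data.csv.
toy = pd.DataFrame({
    "YearsSinceLastPromotion": [0, 2, 5, 1],
    "YearsAtCompany": [0, 4, 9, 1],
    "EnvironmentSatisfaction": [3.0, np.nan, 1.0, 4.0],
    "JobSatisfaction": [4.0, 2.0, 1.0, np.nan],
    "WorkLifeBalance": [2.0, 4.0, 1.0, 3.0],
})
survey_items = ["EnvironmentSatisfaction", "JobSatisfaction", "WorkLifeBalance"]
new_cols = ["PromotionStagnation", "SatisfactionScore"]


def add_row_wise_features(frame):
    """Hypothetical helper: derive features from values in the same row only."""
    out = frame.copy()
    # The +1 in the denominator avoids division by zero for new hires.
    out["PromotionStagnation"] = out["YearsSinceLastPromotion"] / (out["YearsAtCompany"] + 1)
    # mean(axis=1) skips NaN items, so a partly answered survey still gets a score.
    out["SatisfactionScore"] = out[survey_items].mean(axis=1)
    return out


full = add_row_wise_features(toy)                # features computed before "splitting"
subset = add_row_wise_features(toy.iloc[[1, 3]])  # features computed after "splitting"

matches = np.allclose(full.loc[[1, 3], new_cols], subset[new_cols])
print(matches)
```

A fitted transformation (mean imputation, scaling) would not pass this check, because its parameters change with the rows it sees.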

In [2]:
# ==============================================================================
# FEATURE ENGINEERING (safe before split — no fitted statistics)
# ==============================================================================

# Income per job level
if "MonthlyIncome" in df.columns and "JobLevel" in df.columns:
    df["IncomePerJobLevel"] = df["MonthlyIncome"] / df["JobLevel"]

# Promotion stagnation
if "YearsSinceLastPromotion" in df.columns and "YearsAtCompany" in df.columns:
    df["PromotionStagnation"] = df["YearsSinceLastPromotion"] / (df["YearsAtCompany"] + 1)

# Satisfaction composite score (row mean of the survey items; missing items are skipped)
survey_items = ["EnvironmentSatisfaction", "JobSatisfaction", "WorkLifeBalance"]
existing_items = [c for c in survey_items if c in df.columns]
if existing_items:
    df["SatisfactionScore"] = df[existing_items].mean(axis=1)

# Manager stability
if "YearsWithCurrManager" in df.columns and "YearsAtCompany" in df.columns:
    df["ManagerStability"] = df["YearsWithCurrManager"] / (df["YearsAtCompany"] + 1)

# REMOVED: LongHours. It is derived directly from avg_working_hours (r=0.835),
# so keeping it would encode the same signal again (avg_departure_hour, which
# carried that signal as well, was already removed in NB01).

new_features = ["IncomePerJobLevel", "PromotionStagnation", "SatisfactionScore", "ManagerStability"]
new_features = [f for f in new_features if f in df.columns]
print(f"New features created: {new_features}")

# ==============================================================================
# CATEGORICAL ENCODING (deterministic — no leakage)
# ==============================================================================

cat_cols = df.select_dtypes(include="object").columns.tolist()
print(f"Categorical columns to encode: {cat_cols}")

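# Ordinal encoding for BusinessTravel (travel frequency has a natural order)
bt_map = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
if "BusinessTravel" in df.columns:
    # .map() turns any label missing from bt_map into NaN, so fail loudly instead
    unexpected = set(df["BusinessTravel"].dropna()) - set(bt_map)
    if unexpected:
        raise ValueError(f"Unmapped BusinessTravel labels: {sorted(unexpected)}")
    df["BusinessTravel"] = df["BusinessTravel"].map(bt_map)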

# One-hot encoding for remaining categoricals
ohe_cols = [c for c in cat_cols if c != "BusinessTravel"]
df = pd.get_dummies(df, columns=ohe_cols, drop_first=True, dtype=int)

print(f"Post-encoding shape: {df.shape[0]} rows x {df.shape[1]} columns")

New features created: ['IncomePerJobLevel', 'PromotionStagnation', 'SatisfactionScore', 'ManagerStability']
Categorical columns to encode: ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']
Post-encoding shape: 4410 rows x 46 columns
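
As a minimal sketch of the two encoding styles used above, on a toy frame with made-up rows: the ordered category becomes one integer column, while the unordered one becomes indicator columns. With `drop_first=True`, pandas drops the first category in sorted order (here `Divorced`), which then acts as the baseline.

```python
import pandas as pd

# Toy frame; the rows are illustrative, not taken from merged_data.csv.
toy = pd.DataFrame({
    "BusinessTravel": ["Non-Travel", "Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single"],
})

# Ordinal: the categories are ordered, so a single integer column is enough.
bt_map = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
toy["BusinessTravel"] = toy["BusinessTravel"].map(bt_map)

# Nominal: no order, so one 0/1 column per category except the dropped baseline.
encoded = pd.get_dummies(toy, columns=["MaritalStatus"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with both indicator columns at 0 is therefore a `Divorced` employee, not a missing value.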


## Section 3: Train / Test Split

Split **before** imputation to prevent data leakage.

In [3]:
# ==============================================================================
# TRAIN / TEST SPLIT (before imputation to avoid leakage)
# ==============================================================================

# Save and drop EmployeeID
if "EmployeeID" in df.columns:
    employee_ids = df["EmployeeID"].copy()
    df = df.drop(columns=["EmployeeID"])

X = df.drop(columns=["Attrition"])
y = df["Attrition"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# train_test_split returns new DataFrames, but pandas may still flag them as
# slices of X and emit SettingWithCopyWarning on the column assignments below.
# An explicit .copy() makes ownership unambiguous.
X_train = X_train.copy()
X_test = X_test.copy()

print(f"Train set: {X_train.shape[0]} samples ({y_train.mean()*100:.1f}% attrition)")
print(f"Test set:  {X_test.shape[0]} samples ({y_test.mean()*100:.1f}% attrition)")
print(f"Missing in train: {X_train.isnull().sum().sum()}")
print(f"Missing in test:  {X_test.isnull().sum().sum()}")

Train set: 3528 samples (16.1% attrition)
Test set:  882 samples (16.1% attrition)
Missing in train: 89
Missing in test:  22
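
The identical attrition rates in train and test come from `stratify=y`. A small sketch on synthetic labels (160 positives out of 1000, chosen only to mimic a roughly 16% minority class) shows the effect in isolation:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target; illustrative only.
y = pd.Series([1] * 160 + [0] * 840, name="Attrition")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, both parts keep (almost exactly) the original class ratio.
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Without `stratify`, the minority share in a test set of this size can drift by a few points from split to split, which makes metrics such as recall harder to compare.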


## Section 4: Imputation (fit on train only)

**Strategy**:
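- **KNN imputation** (k=5, distance-weighted) for all columns with missing values; neighbours are searched using those columns only
- Fills each gap from similar employees rather than one global value, which preserves local data structure and reduces imputation bias
- Imputation runs before scaling, so neighbour distances are dominated by the wider-range columns (e.g. `TotalWorkingYears` vs. the 1–4 survey items)
- Aligned with ethics document recommendation (avoid selection bias from median imputation)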
- **Fit on train only** → transform both train and test (no data leakage)

In [4]:
# ==============================================================================
# IMPUTATION — KNNImputer (fit on train only, transform both)
# ==============================================================================
# Using KNNImputer instead of SimpleImputer(median) to:
# - Preserve local data structure and reduce imputation bias
# - Align with ethics document recommendation (avoid selection bias)

# Identify columns with missing values
cols_with_na = X_train.columns[X_train.isnull().any()].tolist()
print(f"Columns to impute: {cols_with_na}")

# KNNImputer — distance-weighted, fit on train only
imputer = KNNImputer(n_neighbors=5, weights="distance")
X_train[cols_with_na] = imputer.fit_transform(X_train[cols_with_na])
X_test[cols_with_na] = imputer.transform(X_test[cols_with_na])

print(f"✅ KNNImputer (k=5, distance-weighted) — consistent with ethics document")
print(f"   Remaining NaN — train: {X_train.isnull().sum().sum()}, test: {X_test.isnull().sum().sum()}")

Columns to impute: ['NumCompaniesWorked', 'TotalWorkingYears', 'EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance']
✅ KNNImputer (k=5, distance-weighted) — consistent with ethics document
   Remaining NaN — train: 0, test: 0
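
The fit-on-train pattern can be sketched on a toy example. The frames below are made up, and `n_neighbors=2` is used only because the toy set is tiny. The key point is that `transform` on the test frame draws its donor rows from the training data seen during `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data; values are illustrative only.
train = pd.DataFrame({
    "TotalWorkingYears": [2.0, 10.0, 11.0, 30.0, np.nan],
    "JobSatisfaction":   [1.0, 3.0, np.nan, 4.0, 2.0],
})
test = pd.DataFrame({
    "TotalWorkingYears": [np.nan, 12.0],
    "JobSatisfaction":   [3.0, np.nan],
})

imputer = KNNImputer(n_neighbors=2, weights="distance")
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# transform() reuses the training rows as donors; the test rows never influence the fit.
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_imputed)
```

Imputed values are weighted averages of neighbours, so ordinal items such as `JobSatisfaction` can come out non-integer; whether to round them is a modeling choice this sketch does not make.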


## Section 4b: Feature Selection & Scaling

Pipeline:
1. **Feature whitelist** — apply the 24-feature selection from ablation analysis  
   (hybrid: |r| >= 0.03 with Attrition OR top 82% cumulative Gini importance)
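2. **Variance threshold** on **unscaled** data: remove near-constant features (variance < 0.01). The cutoff depends on each feature's units, so a rate bounded in [0, 1] can fall below it; in this run it removes `late_arrival_rate`, leaving 23 features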
3. **Correlation filter**: remove features with |r| > 0.90 to reduce multicollinearity
4. **Scaling** (StandardScaler — fit on train only)

> See `outputs/feature_selection_final.csv` for the full feature audit (47 → 24).  
> Dropping 20+ noise features costs < 0.5% F1 (validated by 5-scenario ablation test).
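
A variance cutoff applied to unscaled data is sensitive to units. The sketch below uses a hypothetical rate column (random values, not the real `late_arrival_rate`) expressed once as a fraction and once as a percentage: the information is identical, but only the percentage version passes a 0.01 threshold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# Hypothetical rate in [0, 0.2]; illustrative values only.
rate = rng.uniform(0.0, 0.2, size=500)
toy = pd.DataFrame({
    "rate_fraction": rate,          # expressed as a fraction
    "rate_percent": rate * 100.0,   # same information, expressed in percent
})

selector = VarianceThreshold(threshold=0.01).fit(toy)
kept = toy.columns[selector.get_support()].tolist()

print(toy.var(ddof=0).to_dict())
print(f"kept: {kept}")
```

A removal by this filter therefore says a feature has a small numeric range, not necessarily that it carries no signal.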

In [5]:
# ==============================================================================
# FEATURE SELECTION & SCALING
# ==============================================================================
# Selection validated by ablation tests (see outputs/ablation_results.json):
#   47 features → 24 features: Delta CV-F1 < 0.5%
#   Method: hybrid (|r| >= 0.03 with Attrition) ∪ (top 82% Gini importance)

from sklearn.feature_selection import VarianceThreshold

n_before = X_train.shape[1]

# ── 1) Feature whitelist (from ablation analysis) ────────────────────────
FINAL_FEATURES = [
    # Badge H1 (2 features)
    "avg_working_hours", "late_arrival_rate",
    # HR core (12 features)
    "Age", "TotalWorkingYears", "YearsAtCompany", "MonthlyIncome",
    "YearsWithCurrManager", "NumCompaniesWorked", "DistanceFromHome",
    "PercentSalaryHike", "TrainingTimesLastYear", "YearsSinceLastPromotion",
    "BusinessTravel", "MaritalStatus_Single",
    # Surveys (3 features)
    "EnvironmentSatisfaction", "JobSatisfaction", "WorkLifeBalance",
    # Derived (4 features)
    "IncomePerJobLevel", "ManagerStability", "SatisfactionScore", "PromotionStagnation",
    # OHE kept (3 features)
    "MaritalStatus_Married", "JobRole_Manufacturing Director",
    "EducationField_Technical Degree",
]

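available = [f for f in FINAL_FEATURES if f in X_train.columns]
missing = [f for f in FINAL_FEATURES if f not in X_train.columns]
dropped = [f for f in X_train.columns if f not in available]
if missing:
    # A typo or an upstream schema change would otherwise shrink the whitelist silently
    print(f"WARNING: whitelist features not found in X_train: {missing}")
X_train = X_train[available]
X_test  = X_test[available]
print(f"Feature whitelist applied: {n_before} -> {len(available)} features")
print(f"  Dropped ({len(dropped)}): {dropped[:10]}{'...' if len(dropped) > 10 else ''}")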

# ── 2) Remove near-constant features (variance < 0.01, UNSCALED) ────────
var_selector = VarianceThreshold(threshold=0.01)
var_selector.fit(X_train)
low_var_cols = X_train.columns[~var_selector.get_support()].tolist()

if low_var_cols:
    print(f"\nLow-variance features removed ({len(low_var_cols)}): {low_var_cols}")
    X_train = X_train.drop(columns=low_var_cols)
    X_test = X_test.drop(columns=low_var_cols)
else:
    print("\nNo low-variance features found.")

# ── 3) Correlation filter (|r| > 0.90) ──────────────────────────────────
corr_matrix = X_train.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr_cols = [col for col in upper_tri.columns if any(upper_tri[col] > 0.90)]

if high_corr_cols:
    print(f"Highly correlated features removed ({len(high_corr_cols)}): {high_corr_cols}")
    for col in high_corr_cols:
        partners = upper_tri.index[upper_tri[col] > 0.90].tolist()
        for p in partners:
            print(f"    {col} <-> {p}: r = {corr_matrix.loc[p, col]:.4f}")
    X_train = X_train.drop(columns=high_corr_cols)
    X_test = X_test.drop(columns=high_corr_cols)
else:
    print("No highly correlated feature pairs found (threshold: 0.90).")

# ── 4) Scaling — fit on train only ──────────────────────────────────────
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled  = pd.DataFrame(scaler.transform(X_test),      columns=X_test.columns,  index=X_test.index)

n_after = X_train_scaled.shape[1]
print(f"\nScaling applied (StandardScaler — fit on train only).")
print(f"Final pipeline: {n_before} -> {n_after} features")
print(f"\nFinal feature list ({n_after}):")
for i, col in enumerate(X_train_scaled.columns, 1):
    print(f"  {i:2d}. {col}")

Feature whitelist applied: 44 -> 24 features
  Dropped (20): ['Education', 'JobLevel', 'StockOptionLevel', 'JobInvolvement', 'PerformanceRating', 'absence_rate', 'Department_Research & Development', 'Department_Sales', 'EducationField_Life Sciences', 'EducationField_Marketing']...

Low-variance features removed (1): ['late_arrival_rate']
No highly correlated feature pairs found (threshold: 0.90).

Scaling applied (StandardScaler — fit on train only).
Final pipeline: 44 -> 23 features

Final feature list (23):
   1. avg_working_hours
   2. Age
   3. TotalWorkingYears
   4. YearsAtCompany
   5. MonthlyIncome
   6. YearsWithCurrManager
   7. NumCompaniesWorked
   8. DistanceFromHome
   9. PercentSalaryHike
  10. TrainingTimesLastYear
  11. YearsSinceLastPromotion
  12. BusinessTravel
  13. MaritalStatus_Single
  14. EnvironmentSatisfaction
  15. JobSatisfaction
  16. WorkLifeBalance
  17. IncomePerJobLevel
  18. ManagerStability
  19. SatisfactionScore
  20. PromotionStagnation
  21. MaritalStatus_Married
  22. JobRole_Manufacturing Director
  23. EducationField_Technical Degree
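
No pair exceeded the 0.90 threshold in this run, so the correlation filter had nothing to drop. To show what it does when a pair does exceed the threshold, here is a minimal sketch on synthetic columns (the names `a`, `a_copy` and `b` are placeholders), using the same upper-triangle approach:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=300),  # almost identical to "a"
    "b": rng.normal(size=300),                       # independent of "a"
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it correlates strongly with any column to its left.
to_drop = [col for col in upper.columns if (upper[col] > 0.90).any()]

print(to_drop)
```

With this approach the later column of each correlated pair is removed, so the column order decides which of the two survives.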

## Section 5: Class Imbalance — SMOTE

Apply SMOTE **only on the training set** to avoid data leakage: the test set keeps its real class distribution and contains no synthetic samples.

In [6]:
# ==============================================================================
# SMOTE OVERSAMPLING (train set only)
# ==============================================================================

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print(f"Before SMOTE: {y_train.value_counts().to_dict()}")
print(f"After  SMOTE: {pd.Series(y_train_resampled).value_counts().to_dict()}")

Before SMOTE: {0: 2959, 1: 569}
After  SMOTE: {0: 2959, 1: 2959}
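
For intuition about where the 2,390 extra minority rows come from, the sketch below illustrates the interpolation idea behind SMOTE in plain NumPy. It is a simplified illustration, not imblearn's implementation, and `smote_like_samples` is a hypothetical helper: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style interpolation (illustration only, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # pick a minority sample
        nb = neighbours[base, rng.integers(k)]   # pick one of its k nearest neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic


X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points)
```

Because neighbours are chosen by Euclidean distance, running SMOTE on the scaled matrix, as above, keeps any single wide-range feature from dominating the neighbour search. Interpolation also produces fractional values for binary columns such as the one-hot indicators.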


## Section 6: Export Preprocessed Data

In [7]:
# ==============================================================================
# EXPORT
# ==============================================================================
import joblib

# Save processed data
X_train_resampled.to_csv(f"{OUTPUT_DIR}/X_train.csv", index=False)
X_test_scaled.to_csv(f"{OUTPUT_DIR}/X_test.csv", index=False)
pd.Series(y_train_resampled, name="Attrition").to_csv(f"{OUTPUT_DIR}/y_train.csv", index=False)
y_test.to_csv(f"{OUTPUT_DIR}/y_test.csv", index=False)

# Also save non-SMOTE versions for fairness analysis and honest CV
X_train_scaled.to_csv(f"{OUTPUT_DIR}/X_train_no_smote.csv", index=False)
y_train.to_csv(f"{OUTPUT_DIR}/y_train_no_smote.csv", index=False)

# Save pre-scaling train/test for fairness (unscaled binary columns)
X_train.to_csv(f"{OUTPUT_DIR}/X_train_unscaled.csv", index=False)
X_test.to_csv(f"{OUTPUT_DIR}/X_test_unscaled.csv", index=False)

# Save scaler and imputer for reproducibility
joblib.dump(scaler, f"{OUTPUT_DIR}/scaler.joblib")
joblib.dump(imputer, f"{OUTPUT_DIR}/imputer.joblib")

# Save feature names
feature_names = list(X_train.columns)
pd.Series(feature_names).to_csv(f"{OUTPUT_DIR}/feature_names.csv", index=False, header=False)

print(f"Exported to {OUTPUT_DIR}/:")
print(f"  X_train.csv              ({X_train_resampled.shape}) — SMOTE + scaled")
print(f"  X_test.csv               ({X_test_scaled.shape}) — scaled")
print(f"  X_train_unscaled.csv     ({X_train.shape}) — for fairness analysis")
print(f"  X_test_unscaled.csv      ({X_test.shape}) — for fairness analysis")
print(f"  y_train.csv / y_test.csv")
print(f"  scaler.joblib / imputer.joblib")
print(f"  feature_names.csv        ({len(feature_names)} features)")
print("\nPipeline: split -> impute (fit train) -> scale (fit train) -> SMOTE (train only)")
print("No data leakage.")
print("\n-> Proceed to 04_Model_Benchmark.ipynb")

Exported to C:\Users\yanis\Documents\CESI\A5\AI Project\HumanForYou\outputs/:
  X_train.csv              ((5918, 23)) — SMOTE + scaled
  X_test.csv               ((882, 23)) — scaled
  X_train_unscaled.csv     ((3528, 23)) — for fairness analysis
  X_test_unscaled.csv      ((882, 23)) — for fairness analysis
  y_train.csv / y_test.csv
  scaler.joblib / imputer.joblib
  feature_names.csv        (23 features)

Pipeline: split -> impute (fit train) -> scale (fit train) -> SMOTE (train only)
No data leakage.

-> Proceed to 04_Model_Benchmark.ipynb
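
The saved `scaler.joblib` lets a later notebook apply exactly the training-set mean and standard deviation to new rows. A minimal round-trip sketch, using a temporary directory and stand-in arrays rather than the real outputs folder:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # stand-in for training data
X_new = np.array([[2.0, 40.0]])                            # stand-in for unseen data

scaler = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scaler.joblib"
    joblib.dump(scaler, path)
    reloaded = joblib.load(path)

# The reloaded scaler carries the statistics learned from X_fit, not from X_new.
same = np.allclose(scaler.transform(X_new), reloaded.transform(X_new))
print(same)
```

When reusing the real scaler, the columns must be in the order recorded in `feature_names.csv`, since the scaler stores one mean and one scale per column position.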
