# 🔧 Data Processing & Feature Engineering for ML

**"Garbage In, Garbage Out"** - Data quality determines model quality!

This notebook covers the core steps for preparing data for production ML systems.

**Learning Goals:**
- Master data cleaning techniques
- Handle missing data professionally
- Engineer powerful features
- Scale and normalize data properly
- Handle imbalanced datasets
- Build production-ready pipelines

**Interview Topics Covered:**
- Data preprocessing strategies
- Feature engineering techniques
- Handling categorical variables
- Dealing with outliers
- Data leakage prevention
- Pipeline design

**Sources:**
- "Feature Engineering for Machine Learning" - Zheng & Casari (2018)
- "Hands-On Machine Learning" - Géron (2019), Chapter 2
- "Python for Data Analysis" - McKinney (2017)

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Scikit-learn preprocessing
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder, OrdinalEncoder
)
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Feature engineering
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (14, 8)
sns.set_palette('husl')
np.random.seed(42)

print("✅ Libraries loaded successfully!")
print(f"Pandas: {pd.__version__}, NumPy: {np.__version__}")

## 📊 Part 1: Data Cleaning - The Foundation

**Data preparation is often said to take around 80% of the effort in an ML project.**

**Common Data Quality Issues:**
1. Missing values
2. Duplicate records
3. Inconsistent formatting
4. Outliers
5. Invalid values
6. Data type mismatches

**Interview Question:** *"How do you handle missing data in a dataset?"*

**Source:** "Python for Data Analysis" Chapter 7
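
The first three issues in the list above can each be handled with a single pandas call. The following is a minimal sketch on a made-up toy frame (`raw` and its values are assumptions for illustration, not the notebook's dataset), and the 0 to 120 age range is an assumed validity rule:

```python
import numpy as np
import pandas as pd

# Toy frame with a duplicate row, an impossible age, and inconsistent city labels
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 51, 51, -999, 27],
    "city": ["NYC", "ny", "ny", "New York", "LA"],
})

# 1. Drop exact duplicate rows
cleaned = raw.drop_duplicates().reset_index(drop=True)

# 2. Treat impossible values as missing so they can be imputed later
#    (assumed rule: a valid age lies between 0 and 120)
cleaned["age"] = cleaned["age"].where(cleaned["age"].between(0, 120), np.nan)

# 3. Map spelling variants onto one canonical label
cleaned["city"] = cleaned["city"].replace({"New York": "NYC", "ny": "NYC"})

print(cleaned)
```

Converting sentinel values such as -999 to `NaN` (rather than overwriting them immediately) keeps the decision about how to impute them in one place.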

In [None]:
# Create realistic messy dataset
np.random.seed(42)
n_samples = 1000

# Generate base data
data = {
    'customer_id': range(1, n_samples + 1),
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.normal(60000, 25000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'years_employed': np.random.exponential(5, n_samples),
    'num_credit_cards': np.random.poisson(3, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix'], n_samples),
    'default': np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
}

df = pd.DataFrame(data)

# Introduce realistic data quality issues
# 1. Missing values (age and credit_score: MCAR; income: MAR)
missing_indices_age = np.random.choice(df.index, size=50, replace=False)
df.loc[missing_indices_age, 'age'] = np.nan

# Missing income (MAR - depends on default)
missing_income = df[df['default'] == 1].sample(frac=0.2).index
df.loc[missing_income, 'income'] = np.nan

# Missing credit score
missing_credit = np.random.choice(df.index, size=80, replace=False)
df.loc[missing_credit, 'credit_score'] = np.nan

# 2. Duplicates
duplicate_rows = df.sample(n=20)
df = pd.concat([df, duplicate_rows], ignore_index=True)

# 3. Outliers
outlier_indices = np.random.choice(df.index, size=10, replace=False)
df.loc[outlier_indices, 'income'] = df.loc[outlier_indices, 'income'] * 10

# 4. Invalid values
invalid_indices = np.random.choice(df.index, size=5, replace=False)
df.loc[invalid_indices, 'age'] = -999

# 5. Inconsistent formatting
df.loc[df['city'] == 'NYC', 'city'] = np.random.choice(['NYC', 'New York', 'ny'], 
                                                         size=(df['city'] == 'NYC').sum())

print("🔍 DATA QUALITY ASSESSMENT")
print("="*70)
print(f"\n📏 Dataset Shape: {df.shape}")
print(f"\n📊 Data Types:\n{df.dtypes}")
print(f"\n❌ Missing Values:")
missing_summary = df.isnull().sum()
missing_pct = (missing_summary / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing_summary,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing'] > 0])

print(f"\n🔄 Duplicates: {df.duplicated().sum()} rows")
print(f"\n⚠️ Invalid Ages (negative): {(df['age'] < 0).sum()}")

print("\n📊 First few rows:")
print(df.head(10))

In [None]:
# Visualize data quality issues
fig = plt.figure(figsize=(18, 12))

# 1. Missing data heatmap
ax1 = plt.subplot(2, 3, 1)
sns.heatmap(df.isnull().head(100), cmap='viridis', cbar=False, yticklabels=False, ax=ax1)
ax1.set_title('Missing Data Pattern (First 100 rows)\nYellow = Missing', fontweight='bold')
ax1.set_xlabel('Features')
plt.xticks(rotation=45, ha='right')

# 2. Missing data bar chart
ax2 = plt.subplot(2, 3, 2)
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
ax2.barh(range(len(missing_counts)), missing_counts.values, color='coral')
ax2.set_yticks(range(len(missing_counts)))
ax2.set_yticklabels(missing_counts.index)
ax2.set_xlabel('Count')
ax2.set_title('Missing Values by Feature', fontweight='bold')
ax2.grid(True, alpha=0.3, axis='x')

# 3. Income distribution with outliers
ax3 = plt.subplot(2, 3, 3)
ax3.boxplot(df['income'].dropna(), vert=True, patch_artist=True,
            boxprops=dict(facecolor='lightblue'))
ax3.set_ylabel('Income ($)')
ax3.set_title('Income Distribution\n(Note: Extreme outliers)', fontweight='bold')
ax3.grid(True, alpha=0.3, axis='y')

# 4. City value inconsistencies
ax4 = plt.subplot(2, 3, 4)
city_counts = df['city'].value_counts()
ax4.bar(range(len(city_counts)), city_counts.values, color='skyblue', edgecolor='black')
ax4.set_xticks(range(len(city_counts)))
ax4.set_xticklabels(city_counts.index, rotation=45, ha='right')
ax4.set_ylabel('Count')
ax4.set_title('City Distribution\n(Note: NYC inconsistency)', fontweight='bold')
ax4.grid(True, alpha=0.3, axis='y')

# 5. Age distribution showing invalid values
ax5 = plt.subplot(2, 3, 5)
ax5.hist(df['age'].dropna(), bins=30, alpha=0.7, color='green', edgecolor='black')
ax5.axvline(0, color='red', linestyle='--', linewidth=2, label='Invalid threshold')
ax5.set_xlabel('Age')
ax5.set_ylabel('Frequency')
ax5.set_title(f'Age Distribution\n({(df["age"] < 0).sum()} invalid values)', fontweight='bold')
ax5.legend()
ax5.grid(True, alpha=0.3)

# 6. Summary statistics table
ax6 = plt.subplot(2, 3, 6)
ax6.axis('off')

summary_data = [
    ['Issue', 'Count', 'Action Needed'],
    ['Missing Values', f"{df.isnull().sum().sum()}", 'Impute/Remove'],
    ['Duplicates', f"{df.duplicated().sum()}", 'Remove'],
    ['Invalid Ages', f"{(df['age'] < 0).sum()}", 'Replace/Remove'],
    ['NYC Variants', f"{(df['city'].isin(['NYC', 'New York', 'ny'])).sum()}", 'Standardize'],
    ['Outliers (Income)', '~10', 'Cap/Transform'],
]

table = ax6.table(cellText=summary_data, cellLoc='left', loc='center',
                  colWidths=[0.3, 0.2, 0.3])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2.5)

for i in range(3):
    table[(0, i)].set_facecolor('lightblue')
    table[(0, i)].set_text_props(weight='bold')

ax6.set_title('Data Quality Issues Summary', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Data Cleaning Strategy:")
print("  1. Remove duplicates first")
print("  2. Fix invalid/impossible values")
print("  3. Standardize inconsistent formats")
print("  4. Handle outliers")
print("  5. Impute missing values last")

### 1.1 Handling Missing Data - Complete Guide

**Interview Question:** *"What are the different types of missing data and how do you handle them?"*

**Answer:**

**Three Types of Missingness:**
1. **MCAR** (Missing Completely At Random) - Missingness is random, unrelated to any variable
   - Safe to delete or impute with simple methods
   
2. **MAR** (Missing At Random) - Missingness depends on observed variables
   - Example: Income missing more often for defaulters
   - Use advanced imputation (KNN, MICE)
   
3. **MNAR** (Missing Not At Random) - Missingness depends on the missing value itself
   - Example: High earners don't report income
   - Most difficult, may need domain knowledge

**Strategies:**
- Delete: If < 5% missing and MCAR
- Mean/Median/Mode: Simple, works for MCAR
- Forward/Backward Fill: For time series
- KNN Imputation: Uses similar records
- Model-based: MICE, iterative imputation
- Flag missing: Add indicator variable

**Source:** "Feature Engineering for Machine Learning" Chapter 3
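
The "Mean/Median/Mode" and "Flag missing" strategies can be combined in one step. A minimal sketch, assuming a small made-up array `X_demo` (not the notebook's dataset): `SimpleImputer` with `add_indicator=True` fills the gaps and appends one binary column for each feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features (age, income) with one gap each; values are illustrative only
X_demo = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [40.0, np.nan],
    [31.0, 58000.0],
])

# Median imputation plus "was missing" indicator columns
imputer_demo = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer_demo.fit_transform(X_demo)

# Columns 0-1: imputed features, columns 2-3: missing indicators
print(X_filled)
```

In a real workflow the imputer would be fit on the training split only and then applied to the test split, so that test-set statistics do not influence the fill values.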

In [None]:
# Comprehensive missing data handling
print("🔧 MISSING DATA HANDLING STRATEGIES")
print("="*70)

# Start with clean copy
df_clean = df.copy()

# Step 1: Analyze missing patterns
print("\n📊 Missing Data Analysis:")
for col in df_clean.columns:
    missing_count = df_clean[col].isnull().sum()
    if missing_count > 0:
        missing_pct = missing_count / len(df_clean) * 100
        print(f"  {col}: {missing_count} ({missing_pct:.2f}%)")

# Strategy 1: Simple imputation (mean/median/mode)
print("\n📍 Strategy 1: Simple Imputation")

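# Age: invalid sentinel values (-999) must be fixed before imputation,
# otherwise they distort the KNN distances computed below
age_median = df_clean.loc[df_clean['age'] >= 0, 'age'].median()
df_clean.loc[df_clean['age'] < 0, 'age'] = np.nan
# Assign the result back; chained fillna(..., inplace=True) is unreliable in recent pandas
df_clean['age'] = df_clean['age'].fillna(age_median)
print(f"  Age: Filled with median = {age_median:.1f}")

# Income: Use median (robust to the injected outliers)
income_median = df_clean['income'].median()
df_clean['income'] = df_clean['income'].fillna(income_median)
print(f"  Income: Filled with median = ${income_median:,.0f}")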

# Strategy 2: KNN Imputation for credit score
print("\n📍 Strategy 2: KNN Imputation (credit_score)")
print("  Using 5 nearest neighbors based on age, income, years_employed")

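# Prepare data for KNN imputation
# Note: KNNImputer measures Euclidean distance on raw values, so income (the
# largest-scale column) dominates here; scale the features first in real projects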
features_for_imputation = ['age', 'income', 'years_employed', 'credit_score']
imputer = KNNImputer(n_neighbors=5)
df_clean[features_for_imputation] = imputer.fit_transform(df_clean[features_for_imputation])
print("  ✅ Credit score imputed using KNN")

# Strategy 3: Create missing indicator
print("\n📍 Strategy 3: Missing Indicator Features")
df_clean['age_was_missing'] = df['age'].isnull().astype(int)
df_clean['income_was_missing'] = df['income'].isnull().astype(int)
df_clean['credit_was_missing'] = df['credit_score'].isnull().astype(int)
print("  ✅ Added binary indicators for originally missing values")

print("\n✅ Missing data handled!")
print(f"Remaining missing: {df_clean.isnull().sum().sum()}")

In [None]:
# Compare imputation methods visually
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Original data (before missing values)
original_credit = data['credit_score']
df_missing_credit = df['credit_score'].copy()

# Method 1: Mean imputation
df_mean = df_missing_credit.copy()
df_mean.fillna(df_mean.mean(), inplace=True)

# Method 2: Median imputation
df_median = df_missing_credit.copy()
df_median.fillna(df_median.median(), inplace=True)

# Method 3: KNN imputation (already done above)
df_knn = df_clean['credit_score'].copy()

# Plot comparisons
axes[0, 0].hist(original_credit, bins=30, alpha=0.7, color='blue', edgecolor='black')
axes[0, 0].set_title('Original Distribution (Before Missing)', fontweight='bold')
axes[0, 0].set_xlabel('Credit Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].hist(df_mean, bins=30, alpha=0.7, color='orange', edgecolor='black')
axes[0, 1].axvline(df_missing_credit.mean(), color='red', linestyle='--', 
                   linewidth=2, label=f'Mean: {df_missing_credit.mean():.0f}')
axes[0, 1].set_title('Mean Imputation\n(Creates spike at mean)', fontweight='bold')
axes[0, 1].set_xlabel('Credit Score')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].hist(df_median, bins=30, alpha=0.7, color='green', edgecolor='black')
axes[1, 0].axvline(df_missing_credit.median(), color='red', linestyle='--', 
                   linewidth=2, label=f'Median: {df_missing_credit.median():.0f}')
axes[1, 0].set_title('Median Imputation\n(Creates spike at median)', fontweight='bold')
axes[1, 0].set_xlabel('Credit Score')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].hist(df_knn, bins=30, alpha=0.7, color='purple', edgecolor='black')
axes[1, 1].set_title('KNN Imputation\n(Preserves distribution better)', fontweight='bold')
axes[1, 1].set_xlabel('Credit Score')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Mean/Median imputation creates artificial spikes")
print("  • KNN imputation preserves distribution shape better")
print("  • Always compare original vs imputed distributions")
print("  • Consider adding 'was_missing' indicator features")

print("\n🎯 Interview Tip:")
print("  'I would first analyze the missing pattern (MCAR/MAR/MNAR),")
print("   then choose appropriate imputation based on percentage missing")
print("   and relationship with other variables. I always add missing")
print("   indicators and validate imputation preserves distributions.'")

### 1.2 Outlier Detection and Treatment

**Interview Question:** *"How do you detect and handle outliers?"*

**Answer:**

**Detection Methods:**
1. **Statistical:**
   - Z-score (|z| > 3)
   - IQR method (Q1 - 1.5×IQR, Q3 + 1.5×IQR)
   - Modified Z-score (MAD-based)

2. **Distance-based:**
   - DBSCAN
   - Isolation Forest
   - Local Outlier Factor (LOF)

**Treatment Strategies:**
- Remove: If data errors or < 1% of data
- Cap/Winsorize: Replace with percentile values
- Transform: Log, square root, Box-Cox
- Separate Model: Train separate model for outliers
- Robust Methods: Use algorithms less sensitive to outliers

**Important:** Always understand WHY outliers exist before removing!
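
The modified Z-score listed above replaces the mean and standard deviation with the median and the median absolute deviation (MAD), so the outliers themselves do not inflate the yardstick. A minimal sketch, assuming the commonly cited constant 0.6745 and cutoff 3.5; `modified_z_scores` and the `incomes` array are hypothetical names and values for illustration:

```python
import numpy as np

def modified_z_scores(values):
    """Return MAD-based modified Z-scores: 0.6745 * (x - median) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # All deviations are zero for at least half the data; no usable scale
        return np.zeros_like(values)
    return 0.6745 * (values - median) / mad

# Illustrative incomes with one extreme value
incomes = np.array([42_000, 48_000, 51_000, 55_000, 58_000, 61_000, 600_000])
scores = modified_z_scores(incomes)

# Flag points whose modified Z-score exceeds the assumed 3.5 cutoff
flagged = incomes[np.abs(scores) > 3.5]
print(flagged)
```

Because the median and MAD barely move when one value is extreme, this method can flag outliers that a plain Z-score would mask.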

In [None]:
# Comprehensive outlier detection
print("🔍 OUTLIER DETECTION & TREATMENT")
print("="*70)

# Fix invalid ages first
df_clean.loc[df_clean['age'] < 0, 'age'] = age_median

# Method 1: Z-score
print("\n📊 Method 1: Z-Score (|z| > 3)")
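# Compute the Z-score as a Series so it stays aligned with df_clean's index
income_mean = df_clean['income'].mean()
income_std = df_clean['income'].std(ddof=0)
income_z = ((df_clean['income'] - income_mean) / income_std).abs()
outliers_zscore = df_clean.index[income_z > 3]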
print(f"  Income outliers detected: {len(outliers_zscore)}")

# Method 2: IQR
print("\n📊 Method 2: IQR (Interquartile Range)")
Q1 = df_clean['income'].quantile(0.25)
Q3 = df_clean['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df_clean[(df_clean['income'] < lower_bound) | 
                        (df_clean['income'] > upper_bound)].index
print(f"  Lower bound: ${lower_bound:,.0f}")
print(f"  Upper bound: ${upper_bound:,.0f}")
print(f"  Outliers detected: {len(outliers_iqr)}")

# Method 3: Isolation Forest
print("\n📊 Method 3: Isolation Forest (ML-based)")
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_labels = iso_forest.fit_predict(df_clean[['income', 'age', 'credit_score']])
outliers_iso = df_clean[outlier_labels == -1].index
print(f"  Outliers detected: {len(outliers_iso)}")

# Treatment Strategy
print("\n🔧 Treatment: Winsorization (Capping)")
print("  Capping at 1st and 99th percentiles")

df_clean['income_original'] = df_clean['income'].copy()
lower_cap = df_clean['income'].quantile(0.01)
upper_cap = df_clean['income'].quantile(0.99)

df_clean['income_capped'] = df_clean['income'].clip(lower=lower_cap, upper=upper_cap)

print(f"  Lower cap: ${lower_cap:,.0f}")
print(f"  Upper cap: ${upper_cap:,.0f}")
print(f"  Values capped: {(df_clean['income'] != df_clean['income_capped']).sum()}")

# Treatment Strategy 2: Log transformation
print("\n🔧 Treatment: Log Transformation")
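# log1p(x) = log(1 + x) handles zeros but is undefined for x <= -1.
# The simulated incomes can be negative, so clip at 0 before transforming.
df_clean['income_log'] = np.log1p(df_clean['income'].clip(lower=0))
print("  ✅ Applied log(1 + income) transformation (negative incomes clipped to 0)")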

In [None]:
# Visualize outlier detection methods
fig = plt.figure(figsize=(18, 12))

# Plot 1: Original distribution with outliers marked
ax1 = plt.subplot(2, 3, 1)
ax1.hist(df_clean['income_original'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
ax1.axvline(lower_bound, color='red', linestyle='--', linewidth=2, label='IQR bounds')
ax1.axvline(upper_bound, color='red', linestyle='--', linewidth=2)
ax1.set_xlabel('Income ($)')
ax1.set_ylabel('Frequency')
ax1.set_title('Original Distribution\n(Note extreme outliers)', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Box plot comparison
ax2 = plt.subplot(2, 3, 2)
box_data = [df_clean['income_original'], df_clean['income_capped']]
bp = ax2.boxplot(box_data, patch_artist=True)
# Set tick labels separately; the `labels` keyword is deprecated in newer Matplotlib
ax2.set_xticklabels(['Original', 'Capped'])
for patch, color in zip(bp['boxes'], ['lightblue', 'lightgreen']):
    patch.set_facecolor(color)
ax2.set_ylabel('Income ($)')
ax2.set_title('Box Plot Comparison\n(Before vs After Capping)', fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

# Plot 3: Capped distribution
ax3 = plt.subplot(2, 3, 3)
ax3.hist(df_clean['income_capped'], bins=50, alpha=0.7, color='lightgreen', edgecolor='black')
ax3.set_xlabel('Income ($)')
ax3.set_ylabel('Frequency')
ax3.set_title('After Winsorization\n(Outliers capped)', fontweight='bold')
ax3.grid(True, alpha=0.3)

# Plot 4: Log transformed
ax4 = plt.subplot(2, 3, 4)
ax4.hist(df_clean['income_log'], bins=50, alpha=0.7, color='coral', edgecolor='black')
ax4.set_xlabel('Log(Income)')
ax4.set_ylabel('Frequency')
ax4.set_title('Log Transformation\n(More normal distribution)', fontweight='bold')
ax4.grid(True, alpha=0.3)

# Plot 5: Q-Q plot before
ax5 = plt.subplot(2, 3, 5)
stats.probplot(df_clean['income_original'], dist="norm", plot=ax5)
ax5.set_title('Q-Q Plot: Original\n(Deviates from normal)', fontweight='bold')
ax5.grid(True, alpha=0.3)

# Plot 6: Q-Q plot after log transform
ax6 = plt.subplot(2, 3, 6)
stats.probplot(df_clean['income_log'], dist="norm", plot=ax6)
ax6.set_title('Q-Q Plot: Log Transformed\n(Closer to normal)', fontweight='bold')
ax6.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Outliers can severely skew distributions")
print("  • Capping preserves more data than removal")
print("  • Log transformation reduces right skew (confirm with the Q-Q plot)")
print("  • Q-Q plots help verify normality assumptions")

print("\n🎯 Interview Tip:")
print("  'I use multiple methods (IQR, Z-score, Isolation Forest) to")
print("   detect outliers, then investigate whether they are errors or")
print("   valid extreme values. For treatment, I prefer Winsorization")
print("   over removal to preserve sample size, or use robust models")
print("   like tree-based methods that handle outliers naturally.'")

## 🎨 Part 2: Feature Engineering - Creating Predictive Power

**"Applied machine learning is basically feature engineering"** - Andrew Ng

**Interview Question:** *"What feature engineering techniques do you know?"*

**Key Techniques:**
1. **Numerical Transformations:**
   - Scaling/Normalization
   - Log, sqrt, power transforms
   - Binning/Discretization
   - Polynomial features

2. **Categorical Encoding:**
   - One-Hot Encoding
   - Label Encoding
   - Target Encoding
   - Frequency Encoding

3. **Feature Creation:**
   - Domain-specific features
   - Interactions
   - Aggregations
   - Time-based features

4. **Feature Selection:**
   - Filter methods (correlation, chi-square)
   - Wrapper methods (RFE)
   - Embedded methods (L1 regularization)

**Source:** "Feature Engineering for Machine Learning" Chapters 2-5
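
As an illustration of the filter methods listed under Feature Selection, the sketch below scores each feature with an ANOVA F-test and keeps the top three. It assumes a synthetic dataset from `make_classification`; the names `X_demo`, `y_demo`, and `selector` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter method: rank features by ANOVA F-score against the target, keep the best 3
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

kept = selector.get_support(indices=True)
print("Kept feature indices:", kept)
print("Reduced shape:", X_selected.shape)
```

Filter methods score each feature independently of the final model, which makes them fast but blind to feature interactions; wrapper and embedded methods trade speed for that awareness.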

In [None]:
# Complete feature engineering pipeline
print("🎨 FEATURE ENGINEERING PIPELINE")
print("="*70)

# Use capped income for stability
df_clean['income'] = df_clean['income_capped']

# Clean up city names (handle inconsistencies)
df_clean['city'] = df_clean['city'].replace({'New York': 'NYC', 'ny': 'NYC'})

# 1. Create domain-specific features
print("\n🔧 Step 1: Domain-Specific Feature Creation")

# Debt-to-income proxy (no debt column exists, so assume ~$1,000 of debt per credit card)
df_clean['debt_to_income'] = df_clean['num_credit_cards'] * 1000 / (df_clean['income'] + 1)
print("  ✅ Created: debt_to_income")

# Age groups (life stages)
df_clean['age_group'] = pd.cut(df_clean['age'], 
                                bins=[0, 25, 35, 50, 65, 100],
                                labels=['18-25', '26-35', '36-50', '51-65', '66+'])
print("  ✅ Created: age_group (binned age)")

# Income brackets (open-ended lower bin so zero or negative incomes are not left as NaN)
df_clean['income_bracket'] = pd.cut(df_clean['income'],
                                     bins=[-np.inf, 30000, 60000, 100000, np.inf],
                                     labels=['Low', 'Medium', 'High', 'Very High'])
print("  ✅ Created: income_bracket")

# Credit score categories
df_clean['credit_category'] = pd.cut(df_clean['credit_score'],
                                      bins=[0, 580, 670, 740, 800, 850],
                                      labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'])
print("  ✅ Created: credit_category")

# Financial stability score (composite)
df_clean['stability_score'] = (
    (df_clean['credit_score'] / 850) * 0.4 +
    (df_clean['years_employed'] / df_clean['years_employed'].max()) * 0.3 +
    (df_clean['income'] / df_clean['income'].max()) * 0.3
)
print("  ✅ Created: stability_score (weighted composite)")

# 2. Interaction features
print("\n🔧 Step 2: Interaction Features")

df_clean['income_per_age'] = df_clean['income'] / (df_clean['age'] + 1)
print("  ✅ Created: income_per_age")

df_clean['credit_income_interaction'] = df_clean['credit_score'] * df_clean['income'] / 1000000
print("  ✅ Created: credit_income_interaction")

# 3. Polynomial features (for specific numerical columns)
print("\n🔧 Step 3: Polynomial Features")

df_clean['income_squared'] = df_clean['income'] ** 2
df_clean['age_squared'] = df_clean['age'] ** 2
print("  ✅ Created: squared features for income and age")

# 4. Aggregation features (by categorical groups)
print("\n🔧 Step 4: Aggregation Features")

# Mean income by city
city_income_mean = df_clean.groupby('city')['income'].transform('mean')
df_clean['city_income_mean'] = city_income_mean
df_clean['income_vs_city_avg'] = df_clean['income'] / (city_income_mean + 1)
print("  ✅ Created: city_income_mean, income_vs_city_avg")

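# Education level aggregations
# ⚠️ This feature is built from the target, so it is a form of target encoding.
# Computing it on the full dataset leaks label information; in a real workflow,
# compute the rates on the training split (or out-of-fold) only.
edu_default_rate = df_clean.groupby('education')['default'].transform('mean')
df_clean['education_risk'] = edu_default_rate
print("  ✅ Created: education_risk (default rate by education)")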

print(f"\n📊 Total columns after feature engineering: {len(df_clean.columns)}")
print(f"   Original features: {len(data.keys())}")
print(f"   New features: {len(df_clean.columns) - len(data.keys())}")

### 2.1 Encoding Categorical Variables

**Interview Question:** *"When would you use One-Hot Encoding vs Label Encoding?"*

**Answer:**

**One-Hot Encoding:**
- **Use when:** Categorical variable has no ordinal relationship
- **Examples:** city, color, product type
- **Pros:** No false ordinality
- **Cons:** High dimensionality with many categories
- **Models:** Linear models, neural networks

**Label Encoding:**
- **Use when:** Variable has an ordinal relationship, or as a compact shortcut for tree-based models, which can split on arbitrary integer codes
- **Examples:** education level (HS < BS < MS < PhD)
- **Pros:** Compact representation
- **Cons:** Implies order (false for nominal variables)
- **Models:** Tree-based (handle it well), ordinal features
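
In scikit-learn, `LabelEncoder` is documented for encoding target labels, while `OrdinalEncoder` is the transformer meant for input features and accepts an explicit category order. A minimal sketch, assuming a small made-up `education` frame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative feature column (not the notebook's dataset)
education = pd.DataFrame({"education": ["Master", "High School", "PhD", "Bachelor"]})

# Pass the order explicitly; the default would sort categories alphabetically
order = [["High School", "Bachelor", "Master", "PhD"]]
encoder = OrdinalEncoder(categories=order)

codes = encoder.fit_transform(education).ravel()
print(codes)
```

Without the `categories` argument the codes would follow alphabetical order (Bachelor, High School, Master, PhD), which does not match the real ranking.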

**Other Methods:**
- **Target Encoding:** Replace with target mean (watch for leakage!)
- **Frequency Encoding:** Replace with category frequency
- **Binary Encoding:** For high-cardinality features
- **Embeddings:** For neural networks, high-cardinality
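
The leakage warning on Target Encoding can be made concrete with out-of-fold encoding: each row is encoded using category means computed from the other folds, so a row never sees its own label. This is a sketch under stated assumptions; `out_of_fold_target_encode` is a hypothetical helper, the `demo` data is randomly generated, and the function is meant to be applied to a training frame only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(frame, column, target, n_splits=5, seed=42):
    """Encode `column` with target means computed on the other folds."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    global_mean = frame[target].mean()
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_pos, apply_pos in folds.split(frame):
        # Category means come only from rows outside the fold being encoded
        category_means = frame.iloc[fit_pos].groupby(column)[target].mean()
        apply_values = frame[column].iloc[apply_pos].map(category_means)
        encoded.iloc[apply_pos] = apply_values.to_numpy()
    # Categories unseen in the fitting folds fall back to the overall mean
    return encoded.fillna(global_mean)

# Illustrative random data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "city": rng.choice(["NYC", "LA", "Chicago"], size=200),
    "default": rng.integers(0, 2, size=200),
})
demo["city_te"] = out_of_fold_target_encode(demo, "city", "default")
print(demo.head())
```

For the test set, the usual approach is to map categories to means computed on the full training set, never on test labels.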

In [None]:
# Comprehensive categorical encoding
print("🏷️ CATEGORICAL ENCODING STRATEGIES")
print("="*70)

# Prepare data
df_encoded = df_clean.copy()

# Method 1: Label Encoding (for ordinal)
print("\n📊 Method 1: Label Encoding (Ordinal)")

# Education has natural order
education_mapping = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}
df_encoded['education_encoded'] = df_encoded['education'].map(education_mapping)
print(f"  Education: {education_mapping}")
print("  ✅ Preserves ordinal relationship")

# Credit category has order
credit_cat_mapping = {
    'Poor': 0,
    'Fair': 1,
    'Good': 2,
    'Very Good': 3,
    'Excellent': 4
}
df_encoded['credit_category_encoded'] = df_encoded['credit_category'].map(credit_cat_mapping)
print(f"  Credit Category: Poor=0 → Excellent=4")

# Method 2: One-Hot Encoding (for nominal)
print("\n📊 Method 2: One-Hot Encoding (Nominal)")

# City has no natural order
city_dummies = pd.get_dummies(df_encoded['city'], prefix='city', drop_first=True)
df_encoded = pd.concat([df_encoded, city_dummies], axis=1)
print(f"  City: Created {len(city_dummies.columns)} binary columns")
print(f"  Columns: {list(city_dummies.columns)}")
print("  ✅ drop_first=True to avoid dummy variable trap")

# Method 3: Frequency Encoding
print("\n📊 Method 3: Frequency Encoding")

city_freq = df_encoded['city'].value_counts(normalize=True)
df_encoded['city_frequency'] = df_encoded['city'].map(city_freq)
print("  City: Replaced with occurrence frequency")
print(f"  Example frequencies: {city_freq.head().to_dict()}")

# Method 4: Target Encoding (naive version, for illustration only)
print("\n📊 Method 4: Target Encoding (Mean of target)")
print("  ⚠️ WARNING: Must use cross-validation to prevent leakage!")

# Naive version: the means are computed on the same rows they are applied to,
# which leaks the target. In production, compute them out-of-fold on training data.
city_target_mean = df_encoded.groupby('city')['default'].mean()
df_encoded['city_target_encoded'] = df_encoded['city'].map(city_target_mean)
print(f"  City default rates: {city_target_mean.to_dict()}")
print("  ✅ Encodes target signal directly (powerful but risky!)")

# Method 5: Binary Encoding (for high cardinality)
print("\n📊 Method 5: Binary Encoding")
print("  Useful for features with 100s of categories (e.g., zip codes)")
print("  Represents category as binary digits")
print("  Example: 7 → 111, 3 → 011 (uses about ceil(log2(n)) columns)")

print(f"\n✅ Encoding complete!")
print(f"   Total columns: {len(df_encoded.columns)}")

In [None]:
# Visualize encoding effects
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Plot 1: Label encoding visualization
ax1 = axes[0, 0]
edu_counts = df_encoded.groupby('education_encoded')['default'].mean()
ax1.bar(edu_counts.index, edu_counts.values, color='skyblue', edgecolor='black')
ax1.set_xticks(edu_counts.index)
ax1.set_xticklabels(['HS', 'Bachelor', 'Master', 'PhD'])
ax1.set_xlabel('Education Level')
ax1.set_ylabel('Default Rate')
ax1.set_title('Label Encoding: Education\n(Preserves order)', fontweight='bold')
ax1.grid(True, alpha=0.3, axis='y')

# Plot 2: One-hot encoding visualization
ax2 = axes[0, 1]
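# Use only the one-hot columns: a 'city_' prefix match would also pick up
# city_income_mean, city_frequency, and city_target_encoded
city_cols = list(city_dummies.columns)
city_encoded_df = df_encoded[city_cols].head(10).astype(int)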
sns.heatmap(city_encoded_df.T, cmap='RdYlGn', cbar=True, 
            linewidths=0.5, annot=True, fmt='g', ax=ax2)
ax2.set_title('One-Hot Encoding: City\n(Binary columns)', fontweight='bold')
ax2.set_xlabel('Sample Index')

# Plot 3: Frequency encoding
ax3 = axes[0, 2]
city_freq_df = df_encoded.groupby('city')['city_frequency'].first().sort_values(ascending=False)
ax3.barh(range(len(city_freq_df)), city_freq_df.values, color='coral', edgecolor='black')
ax3.set_yticks(range(len(city_freq_df)))
ax3.set_yticklabels(city_freq_df.index)
ax3.set_xlabel('Frequency')
ax3.set_title('Frequency Encoding\n(Occurrence rate)', fontweight='bold')
ax3.grid(True, alpha=0.3, axis='x')

# Plot 4: Target encoding
ax4 = axes[1, 0]
city_target_df = df_encoded.groupby('city')['city_target_encoded'].first().sort_values()
bars = ax4.barh(range(len(city_target_df)), city_target_df.values, 
                color='lightgreen', edgecolor='black')
ax4.set_yticks(range(len(city_target_df)))
ax4.set_yticklabels(city_target_df.index)
ax4.set_xlabel('Default Rate')
ax4.set_title('Target Encoding\n(Mean of target variable)', fontweight='bold')
ax4.grid(True, alpha=0.3, axis='x')

# Plot 5: Comparison of methods
ax5 = axes[1, 1]
encoding_summary = pd.DataFrame({
    'Method': ['Label', 'One-Hot', 'Frequency', 'Target'],
    'Columns Created': [1, len(city_dummies.columns), 1, 1],
    'Best For': ['Ordinal', 'Nominal', 'High cardinality', 'High cardinality (with CV)']
})

ax5.axis('off')
table = ax5.table(cellText=encoding_summary.values, 
                  colLabels=encoding_summary.columns,
                  cellLoc='center', loc='center',
                  colWidths=[0.3, 0.35, 0.35])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2.5)

for i in range(len(encoding_summary.columns)):
    table[(0, i)].set_facecolor('lightblue')
    table[(0, i)].set_text_props(weight='bold')

ax5.set_title('Encoding Methods Summary', fontsize=14, fontweight='bold', pad=20)

# Plot 6: Dimensionality comparison
ax6 = axes[1, 2]
methods = ['Original', 'Label\nEncoding', 'One-Hot\nEncoding']
dimensions = [
    1,  # Original city column
    1,  # Label encoding: 1 column
    len(city_dummies.columns)  # One-hot: n - 1 columns (drop_first=True)
]
colors = ['gray', 'lightblue', 'lightcoral']
bars = ax6.bar(methods, dimensions, color=colors, edgecolor='black')
ax6.set_ylabel('Number of Columns')
ax6.set_title('Dimensionality Impact\n(City encoding)', fontweight='bold')
ax6.grid(True, alpha=0.3, axis='y')

for bar, dim in zip(bars, dimensions):
    height = bar.get_height()
    ax6.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(dim)}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Key Takeaways:")
print("  • One-Hot: Safe but increases dimensionality")
print("  • Label: Compact but implies order (use for trees)")
print("  • Frequency: Useful for high-cardinality features")
print("  • Target: Powerful but needs CV to prevent leakage")

print("\n🎯 Interview Answer Template:")
print("  'For nominal variables like city, I use One-Hot encoding for")
print("   linear models, but Label encoding for tree-based models since")
print("   they can handle it. For ordinal variables like education, I use")
print("   Label encoding with proper ordering. For high-cardinality features,")
print("   I consider Target encoding with cross-validation or embeddings.'")

## ⚖️ Part 3: Feature Scaling - Critical for Many Algorithms

**Interview Question:** *"When and why do you need to scale features?"*

**Answer:**

**When Scaling Matters:**
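- ✅ **USUALLY NEEDED:** SVM, KNN, Neural Networks, PCA, and Linear/Logistic Regression when regularized or trained with gradient descent (plain least-squares predictions do not change with scaling)
- ❌ **NOT NEEDED:** Tree-based models (Decision Trees, Random Forest, XGBoost)

**Why?**
- Distance-based algorithms are sensitive to feature magnitude
- Gradient descent converges faster with scaled features
- Regularization (L1/L2) penalizes coefficients by size, and coefficient size depends on feature scale, so unscaled features are penalized unevenly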

**Scaling Methods:**

1. **StandardScaler** (Z-score normalization)
   - Formula: (x - μ) / σ
   - Result: mean=0, std=1
   - Use: Most common default; works best for roughly symmetric features (normality is not required)

2. **MinMaxScaler** (Min-Max normalization)
   - Formula: (x - min) / (max - min)
   - Result: range [0, 1]
   - Use: Bounded features, neural networks

3. **RobustScaler**
   - Formula: (x - median) / IQR
   - Result: median=0, IQR=1 (both statistics are robust to outliers)
   - Use: Data with outliers

4. **MaxAbsScaler**
   - Formula: x / max(|x|)
   - Result: range [-1, 1]
   - Use: Sparse data

**CRITICAL:** Always fit scaler on training set only, then transform test set!
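
The fit-on-train rule looks like this in practice. A minimal sketch, assuming a randomly generated single-feature array `X_demo` (illustrative only): the scaler learns its mean and standard deviation from the training rows and reuses them unchanged on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative income-like feature
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=60_000, scale=25_000, size=(200, 1))

X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("Train mean/std:", X_train_scaled.mean(), X_train_scaled.std())
print("Test mean/std: ", X_test_scaled.mean(), X_test_scaled.std())
```

The test-set mean and standard deviation will be close to, but not exactly, 0 and 1. That is expected: forcing them to be exact would require peeking at the test data. Wrapping the scaler and model in a `Pipeline` applies the same rule automatically inside cross-validation.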

In [None]:
# Comprehensive feature scaling
print("⚖️ FEATURE SCALING STRATEGIES")
print("="*70)

# Select numerical features for scaling
numerical_features = ['age', 'income', 'credit_score', 'years_employed', 'num_credit_cards']
X = df_clean[numerical_features].copy()

# Remove any remaining NaN
X = X.fillna(X.median())

print(f"\n📊 Original Data Statistics:")
print(X.describe())

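# Demo only: the scalers below are fit on all rows to compare their outputs.
# In a real workflow, fit on the training split and only transform the test split.
# Method 1: StandardScaler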
print("\n⚖️ Method 1: StandardScaler (Z-score)")
scaler_standard = StandardScaler()
X_standard = pd.DataFrame(
    scaler_standard.fit_transform(X),
    columns=[f"{col}_standard" for col in X.columns]
)
print("  Formula: (x - mean) / std")
print("  Result: mean ≈ 0, std ≈ 1")
print(f"\n  Transformed means: {X_standard.mean().round(6).to_dict()}")
# StandardScaler divides by the population std (ddof=0), so report the same statistic
print(f"  Transformed stds: {X_standard.std(ddof=0).round(6).to_dict()}")

# Method 2: MinMaxScaler
print("\n⚖️ Method 2: MinMaxScaler (Min-Max)")
scaler_minmax = MinMaxScaler()
X_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(X),
    columns=[f"{col}_minmax" for col in X.columns]
)
print("  Formula: (x - min) / (max - min)")
print("  Result: range [0, 1]")
print(f"\n  Transformed mins: {X_minmax.min().round(6).to_dict()}")
print(f"  Transformed maxs: {X_minmax.max().round(6).to_dict()}")

# Method 3: RobustScaler
print("\n⚖️ Method 3: RobustScaler (Robust to outliers)")
scaler_robust = RobustScaler()
X_robust = pd.DataFrame(
    scaler_robust.fit_transform(X),
    columns=[f"{col}_robust" for col in X.columns]
)
print("  Formula: (x - median) / IQR")
print("  Result: median ≈ 0, less affected by outliers")
print(f"\n  Transformed medians: {X_robust.median().round(6).to_dict()}")

# Compare distributions
print("\n📊 Comparison Summary:")
comparison = pd.DataFrame({
    'Method': ['Original', 'StandardScaler', 'MinMaxScaler', 'RobustScaler'],
    'Income Mean': [
        X['income'].mean(),
        X_standard['income_standard'].mean(),
        X_minmax['income_minmax'].mean(),
        X_robust['income_robust'].mean()
    ],
    'Income Std': [
        X['income'].std(),
        X_standard['income_standard'].std(),
        X_minmax['income_minmax'].std(),
        X_robust['income_robust'].std()
    ],
    'Income Range': [
        X['income'].max() - X['income'].min(),
        X_standard['income_standard'].max() - X_standard['income_standard'].min(),
        X_minmax['income_minmax'].max() - X_minmax['income_minmax'].min(),
        X_robust['income_robust'].max() - X_robust['income_robust'].min()
    ]
})
print(comparison.round(2))

In [None]:
# Visualize scaling effects
fig, axes = plt.subplots(3, 4, figsize=(20, 15))

# Feature to visualize in detail
feature = 'income'
feature_idx = numerical_features.index(feature)

# Row 1: Distributions
axes[0, 0].hist(X[feature], bins=50, alpha=0.7, color='blue', edgecolor='black')
axes[0, 0].set_title(f'Original {feature}', fontweight='bold')
axes[0, 0].set_xlabel(feature)
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].hist(X_standard.iloc[:, feature_idx], bins=50, alpha=0.7, color='green', edgecolor='black')
axes[0, 1].set_title(f'StandardScaler', fontweight='bold')
axes[0, 1].set_xlabel('Scaled Value')
axes[0, 1].grid(True, alpha=0.3)

axes[0, 2].hist(X_minmax.iloc[:, feature_idx], bins=50, alpha=0.7, color='orange', edgecolor='black')
axes[0, 2].set_title(f'MinMaxScaler', fontweight='bold')
axes[0, 2].set_xlabel('Scaled Value')
axes[0, 2].grid(True, alpha=0.3)

axes[0, 3].hist(X_robust.iloc[:, feature_idx], bins=50, alpha=0.7, color='purple', edgecolor='black')
axes[0, 3].set_title(f'RobustScaler', fontweight='bold')
axes[0, 3].set_xlabel('Scaled Value')
axes[0, 3].grid(True, alpha=0.3)

# Row 2: Box plots
box_data = [
    X[feature],
    X_standard.iloc[:, feature_idx],
    X_minmax.iloc[:, feature_idx],
    X_robust.iloc[:, feature_idx]
]

for idx, (ax, data, title, color) in enumerate(zip(
    axes[1, :],
    box_data,
    ['Original', 'StandardScaler', 'MinMaxScaler', 'RobustScaler'],
    ['lightblue', 'lightgreen', 'lightyellow', 'lightpink']
)):
    bp = ax.boxplot(data, vert=True, patch_artist=True)
    bp['boxes'][0].set_facecolor(color)
    ax.set_title(title, fontweight='bold')
    ax.set_ylabel('Value')
    ax.grid(True, alpha=0.3, axis='y')

# Row 3: Compare all features side by side
# Original data
ax = axes[2, 0]
X_sample = X.head(100)
for col in X_sample.columns:
    ax.plot(X_sample[col].values, alpha=0.6, label=col)
ax.set_title('Original Features\n(Different scales)', fontweight='bold')
ax.set_xlabel('Sample')
ax.set_ylabel('Value')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)

# StandardScaler
ax = axes[2, 1]
X_standard_sample = X_standard.head(100)
for col in X_standard_sample.columns:
    ax.plot(X_standard_sample[col].values, alpha=0.6)
ax.set_title('StandardScaler\n(All features comparable)', fontweight='bold')
ax.set_xlabel('Sample')
ax.set_ylabel('Scaled Value')
ax.grid(True, alpha=0.3)

# MinMaxScaler
ax = axes[2, 2]
X_minmax_sample = X_minmax.head(100)
for col in X_minmax_sample.columns:
    ax.plot(X_minmax_sample[col].values, alpha=0.6)
ax.set_title('MinMaxScaler\n(All in [0,1])', fontweight='bold')
ax.set_xlabel('Sample')
ax.set_ylabel('Scaled Value')
ax.grid(True, alpha=0.3)

# RobustScaler
ax = axes[2, 3]
X_robust_sample = X_robust.head(100)
for col in X_robust_sample.columns:
    ax.plot(X_robust_sample[col].values, alpha=0.6)
ax.set_title('RobustScaler\n(Less affected by outliers)', fontweight='bold')
ax.set_xlabel('Sample')
ax.set_ylabel('Scaled Value')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 When to Use Each Scaler:")
print("\n📊 StandardScaler:")
print("  ✅ Most common choice")
print("  ✅ Data is approximately normal")
print("  ✅ Linear models, SVM, neural networks")
print("  ❌ Many outliers present")

print("\n📊 MinMaxScaler:")
print("  ✅ Need bounded range [0, 1]")
print("  ✅ Neural networks (sigmoid/tanh activation)")
print("  ✅ Image data")
print("  ❌ Sensitive to outliers")

print("\n📊 RobustScaler:")
print("  ✅ Many outliers in data")
print("  ✅ Want outliers to have less influence")
print("  ✅ Heavy-tailed distributions")
print("  ❌ Output is not bounded to a fixed range")

print("\n🎯 Interview Answer Template:")
print("  'I always scale features for distance-based algorithms like KNN and")
print("   SVM, and for regularized linear models. StandardScaler is my")
print("   default choice, but I use RobustScaler if I detect outliers, and")
print("   MinMaxScaler for neural networks when I need bounded inputs.")
print("   Critically, I fit the scaler only on training data to prevent")
print("   data leakage, then transform both train and test sets.'")

## 🏗️ Part 4: Building Production Pipeline

**Interview Question:** *"How do you prevent data leakage in your ML pipeline?"*

**Answer:**

**Data Leakage:** When information that would not be available at prediction time (for example, from the test set or from the target) leaks into the training process

**Common Causes:**
1. Scaling before train-test split
2. Feature engineering using entire dataset
3. Target encoding without cross-validation
4. Time series: using future to predict past
5. Including target variable in features

**Prevention:**
- ✅ Always split data FIRST
- ✅ Fit preprocessors only on training data
- ✅ Use Pipelines (sklearn.pipeline.Pipeline)
- ✅ Cross-validation for meta-features
- ✅ Be careful with time-based features
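
The "cross-validation for meta-features" point is easiest to see with target encoding. Below is a hedged sketch of out-of-fold mean-target encoding; `oof_target_encode` is a hypothetical helper written for this example (not a scikit-learn function), and the tiny `demo` frame is made up. It is meant to be applied to training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, seed=42):
    """Hypothetical helper: encode each row with the mean target of its
    category, computed only from rows in the OTHER folds."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(categories):
        fit_target = target.iloc[fit_idx]
        # Per-category means from the fitting folds only
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        values = categories.iloc[enc_idx].map(fold_means)
        # Categories unseen in the fitting folds fall back to the fitting-fold mean
        values = values.fillna(fit_target.mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded

# Made-up training data for illustration
demo = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF", "LA", "NYC", "SF", "LA", "NYC", "SF"],
    "default": [1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
})
demo["city_te"] = oof_target_encode(demo["city"], demo["default"])
print(demo)
```

Because a row's own label never enters its encoded value, the model cannot read the target back out of the feature. At prediction time, new rows would typically be encoded with category means computed from the full training set.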

**Pipeline Benefits:**
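- Prevents preprocessing leakage, as long as the whole pipeline is fit on training data only (or refit inside each cross-validation fold)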
- Reproducible
- Easier to deploy
- Cleaner code
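
One place where this pays off is cross-validation. The sketch below uses synthetic data (the `income` and `city` columns and the label rule are assumptions for illustration) and passes a whole pipeline to `cross_val_score`, so the imputer, scaler, and encoder are refit on the training portion of every fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
income = rng.normal(50_000, 15_000, n)
# Synthetic label: lower income (plus noise) means a higher chance of default
y_demo = (income + rng.normal(0, 10_000, n) < 45_000).astype(int)

demo = pd.DataFrame({
    "income": income,
    "city": rng.choice(["NYC", "LA", "SF"], n),
})
# Inject some missing values so the imputer has work to do
demo.loc[rng.choice(n, 20, replace=False), "income"] = np.nan

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold refits the entire pipeline on that fold's training rows
scores = cross_val_score(model, demo, y_demo, cv=5, scoring="roc_auc")
print("ROC-AUC per fold:", scores.round(3))
print("Mean ROC-AUC:", scores.mean().round(3))
```

Preprocessing the full dataset once and then cross-validating only the classifier would let each validation fold influence the imputation medians and scaling statistics, which is the leakage the pipeline avoids.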

In [None]:
# Build complete production pipeline
print("🏗️ PRODUCTION-READY ML PIPELINE")
print("="*70)

# Prepare clean dataset
# Select features and target
feature_columns = ['age', 'income', 'credit_score', 'years_employed', 
                   'num_credit_cards', 'education', 'city']
X = df_clean[feature_columns].copy()
y = df_clean['default'].copy()

# Leave missing values in place: the SimpleImputer steps in the pipeline below
# learn their fill values from the training split only, which avoids leakage
# (filling with full-dataset medians here would leak test-set statistics)

print("\n📊 Dataset Preparation:")
print(f"  Samples: {len(X)}")
print(f"  Features: {X.shape[1]}")
print(f"  Target distribution: {y.value_counts().to_dict()}")

# CRITICAL: Split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\n✂️ Train-Test Split:")
print(f"  Train: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  Test: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"  ✅ Stratified split maintains class balance")

# Define numerical and categorical features
numerical_features = ['age', 'income', 'credit_score', 'years_employed', 'num_credit_cards']
categorical_features = ['education', 'city']

print(f"\n🔧 Feature Types:")
print(f"  Numerical: {numerical_features}")
print(f"  Categorical: {categorical_features}")

# Create preprocessing pipelines
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

print("\n🏗️ Pipeline Structure:")
print("")
print("  Numerical Pipeline:")
print("    1. SimpleImputer (median)")
print("    2. StandardScaler")
print("")
print("  Categorical Pipeline:")
print("    1. SimpleImputer (constant='missing')")
print("    2. OneHotEncoder (handle_unknown='ignore')")
print("")

# Fit and transform
print("\n⚙️ Fitting Pipeline on Training Data...")
X_train_processed = preprocessor.fit_transform(X_train)
print("  ✅ Pipeline fitted on training data only")

print("\n⚙️ Transforming Test Data...")
X_test_processed = preprocessor.transform(X_test)
print("  ✅ Test data transformed using training statistics")

print(f"\n📊 Processed Data Shape:")
print(f"  Train: {X_train_processed.shape}")
print(f"  Test: {X_test_processed.shape}")
print(f"  Features expanded: {X.shape[1]} → {X_train_processed.shape[1]}")
print(f"  (Due to one-hot encoding)")

# Get feature names after preprocessing
feature_names = (numerical_features + 
                list(preprocessor.named_transformers_['cat']
                     .named_steps['onehot'].get_feature_names_out(categorical_features)))

print(f"\n📋 Final Feature Names ({len(feature_names)} total):")
n_num = len(numerical_features)
print(f"  Numerical ({n_num}): {numerical_features}")
print(f"  Categorical ({len(feature_names) - n_num}): {list(feature_names[n_num:])}")

In [None]:
# Complete ML pipeline with model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

print("🤖 COMPLETE ML PIPELINE WITH MODEL")
print("="*70)

# Create full pipeline (preprocessing + model)
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])

print("\n🏗️ Full Pipeline:")
print("  1. Preprocessing (numerical + categorical)")
print("  2. Logistic Regression Classifier")

# Train
print("\n⚙️ Training Pipeline...")
full_pipeline.fit(X_train, y_train)
print("  ✅ Training complete!")

# Predict
print("\n🔮 Making Predictions...")
y_train_pred = full_pipeline.predict(X_train)
y_test_pred = full_pipeline.predict(X_test)
y_test_proba = full_pipeline.predict_proba(X_test)[:, 1]

# Evaluate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("\n📊 Model Performance:")
print("\nTraining Set:")
print(f"  Accuracy: {accuracy_score(y_train, y_train_pred):.4f}")
print(f"  Precision: {precision_score(y_train, y_train_pred):.4f}")
print(f"  Recall: {recall_score(y_train, y_train_pred):.4f}")
print(f"  F1-Score: {f1_score(y_train, y_train_pred):.4f}")

print("\nTest Set:")
print(f"  Accuracy: {accuracy_score(y_test, y_test_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_test_pred):.4f}")
print(f"  Recall: {recall_score(y_test, y_test_pred):.4f}")
print(f"  F1-Score: {f1_score(y_test, y_test_pred):.4f}")
print(f"  ROC-AUC: {roc_auc_score(y_test, y_test_proba):.4f}")

print("\n📋 Classification Report:")
print(classification_report(y_test, y_test_pred, target_names=['No Default', 'Default']))

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
axes[0].set_title('Confusion Matrix', fontweight='bold', fontsize=14)
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
axes[1].plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {roc_auc_score(y_test, y_test_proba):.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate', fontsize=12)
axes[1].set_title('ROC Curve', fontweight='bold', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Pipeline Benefits:")
print("  • No data leakage (transformations fit on train only)")
print("  • Easy to deploy (single object)")
print("  • Reproducible (same transformations always)")
print("  • Clean code (no manual transformations)")

print("\n🎯 Interview Answer Template:")
print("  'I always build scikit-learn Pipelines that combine preprocessing")
print("   and modeling. I split data first, fit the pipeline only on training")
print("   data to prevent leakage, use ColumnTransformer for different feature")
print("   types, and the pipeline makes deployment straightforward since all")
print("   transformations are encapsulated in a single object.'")