# Tutorial 06: Feature Engineering Operations

## Module 3: Data Preparation

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Master techniques for handling missing values** (deletion, imputation methods)
2. **Apply various feature scaling methods** (normalization, standardization, log scaling)
3. **Implement discretization and bucketing** strategies
4. **Encode categorical variables effectively** (label, one-hot, target, and frequency encoding)

---

## Table of Contents

1. [Introduction to Feature Engineering](#1-introduction)
2. [Handling Missing Values](#2-missing-values)
3. [Feature Scaling](#3-feature-scaling)
4. [Discretization and Bucketing](#4-discretization)
5. [Encoding Categorical Features](#5-encoding)
6. [Feature Engineering Pipeline](#6-pipeline)
7. [Hands-on Exercise](#7-exercise)
8. [Summary and Key Takeaways](#8-summary)

---

## 1. Introduction to Feature Engineering <a id='1-introduction'></a>

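Feature engineering transforms raw data into model inputs (features) that better represent the problem: filling gaps, putting numeric columns on comparable scales, grouping values into bins, and converting categories into numbers.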

### Why Feature Engineering Matters

- **Better features = Better models**: Good features make simple models perform well
- **Domain knowledge**: Encodes expert understanding into the model
- **Data quality**: Handles missing values, outliers, and inconsistencies

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Any, Optional, Tuple
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings

warnings.filterwarnings('ignore')
np.random.seed(42)

print("Libraries imported successfully!")

In [None]:
# Create sample dataset
def create_sample_dataset(n_samples: int = 1000) -> pd.DataFrame:
    np.random.seed(42)
    
    df = pd.DataFrame({
        'age': np.random.normal(40, 15, n_samples).clip(18, 85),
        'income': np.random.lognormal(10.5, 0.8, n_samples),
        'credit_score': np.random.normal(700, 100, n_samples).clip(300, 850),
        'years_employed': np.random.exponential(5, n_samples).clip(0, 40),
        'num_products': np.random.poisson(3, n_samples),
        'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples, p=[0.3, 0.4, 0.2, 0.1]),
        'employment_type': np.random.choice(['Full-time', 'Part-time', 'Self-employed', 'Unemployed'], n_samples, p=[0.6, 0.15, 0.2, 0.05]),
        'region': np.random.choice(['North', 'South', 'East', 'West', 'Central'], n_samples),
        'customer_segment': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], n_samples, p=[0.5, 0.3, 0.15, 0.05]),
        'churned': np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
    })
    
    # Add missing values
    for col in ['age', 'income', 'credit_score', 'education', 'region']:
        mask = np.random.random(n_samples) < 0.1
        df.loc[mask, col] = np.nan
    
    # Add outliers
    outlier_idx = np.random.choice(n_samples, 20, replace=False)
    df.loc[outlier_idx, 'income'] = np.random.uniform(1e6, 5e6, 20)
    
    return df

df = create_sample_dataset(1000)
print("Sample Dataset:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nMissing Values:\n{df.isnull().sum()}")

---

## 2. Handling Missing Values <a id='2-missing-values'></a>

| Strategy | When to Use | Pros | Cons |
|----------|-------------|------|------|
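| **Deletion** | Data missing completely at random (MCAR), few missing rows | Simple | Loses data |
| **Mean/Median** | Numerical, MCAR | Fast | Ignores relationships between features; shrinks variance |
| **Mode** | Categorical | Simple | Inflates the most frequent category |
| **KNN** | Missingness related to other features | Captures relationships | Slow on large datasets |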
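
As a rough illustration of the median strategy in the table, the sketch below fits scikit-learn's `SimpleImputer` on a training split and reuses the learned median on the test split. The toy `income` column and the split sizes are assumptions made for this example only; `add_indicator=True` is an optional extra that keeps a flag for rows that were imputed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): one skewed numeric column with every 10th value missing
rng = np.random.default_rng(0)
toy = pd.DataFrame({"income": rng.lognormal(10.5, 0.8, 200)})
toy.iloc[::10, 0] = np.nan

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# add_indicator=True appends a 0/1 column marking the rows that were imputed
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_imp = imputer.fit_transform(train)  # median is learned from the training rows only
test_imp = imputer.transform(test)        # the same median is reused on the test rows

print("Learned median:", imputer.statistics_[0])
print("Train shape:", train_imp.shape, "Test shape:", test_imp.shape)
```

Keeping the indicator column lets a downstream model distinguish observed values from filled ones, which can matter when values are not missing completely at random.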

In [None]:
# Missing Value Handler
class MissingValueHandler:
    def __init__(self, df: pd.DataFrame):
        self.original_df = df.copy()
        self.df = df.copy()
    
    def analyze(self) -> pd.DataFrame:
        analysis = []
        for col in self.df.columns:
            missing_count = self.df[col].isnull().sum()
            missing_pct = missing_count / len(self.df) * 100
            analysis.append({'column': col, 'missing_count': missing_count, 'missing_pct': f"{missing_pct:.1f}%"})
        return pd.DataFrame(analysis)
    
    def impute_mean(self, columns: List[str]) -> 'MissingValueHandler':
        for col in columns:
            if col in self.df.columns and pd.api.types.is_numeric_dtype(self.df[col]):
                self.df[col] = self.df[col].fillna(self.df[col].mean())
        return self
    
    def impute_median(self, columns: List[str]) -> 'MissingValueHandler':
        for col in columns:
            if col in self.df.columns and pd.api.types.is_numeric_dtype(self.df[col]):
                self.df[col] = self.df[col].fillna(self.df[col].median())
        return self
    
    def impute_mode(self, columns: List[str]) -> 'MissingValueHandler':
        for col in columns:
            if col in self.df.columns:
                self.df[col] = self.df[col].fillna(self.df[col].mode()[0])
        return self
    
    def impute_knn(self, columns: List[str], n_neighbors: int = 5) -> 'MissingValueHandler':
        num_cols = [c for c in columns if c in self.df.columns and pd.api.types.is_numeric_dtype(self.df[c])]
        if num_cols:
            imputer = KNNImputer(n_neighbors=n_neighbors)
            self.df[num_cols] = imputer.fit_transform(self.df[num_cols])
        return self
    
    def get_result(self) -> pd.DataFrame:
        return self.df

# Demo
print("Missing Value Analysis:")
handler = MissingValueHandler(df)
print(handler.analyze().to_string(index=False))

In [None]:
# Compare imputation strategies
print("\nComparing Imputation Strategies:")

original_income = df['income'].dropna()

df_mean = df.copy()
df_mean['income'] = df_mean['income'].fillna(df_mean['income'].mean())

df_median = df.copy()
df_median['income'] = df_median['income'].fillna(df_median['income'].median())

print(f"Original: mean={original_income.mean():.2f}, std={original_income.std():.2f}")
print(f"Mean Imp: mean={df_mean['income'].mean():.2f}, std={df_mean['income'].std():.2f}")
print(f"Median Imp: mean={df_median['income'].mean():.2f}, std={df_median['income'].std():.2f}")

In [None]:
# Visualize imputation effects
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].hist(original_income, bins=50, alpha=0.7, color='blue')
axes[0].set_title('Original (Non-null)')

axes[1].hist(df_mean['income'], bins=50, alpha=0.7, color='green')
axes[1].set_title('Mean Imputation')

axes[2].hist(df_median['income'], bins=50, alpha=0.7, color='orange')
axes[2].set_title('Median Imputation')

plt.tight_layout()
plt.show()

---

## 3. Feature Scaling <a id='3-feature-scaling'></a>

| Method | Formula | Range | Best For |
|--------|---------|-------|----------|
| **StandardScaler** | (x - mean) / std | Unbounded | Normal distributions |
| **MinMaxScaler** | (x - min) / (max - min) | [0, 1] | Bounded features |
| **RobustScaler** | (x - median) / IQR | Unbounded | Data with outliers |
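
To make the formulas in the table concrete, here is a small sketch that computes each one by hand with NumPy and checks the result against the corresponding scikit-learn scaler. The five-value column `x` is a made-up example with one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy column (hypothetical values) with one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Hand-computed versions of the three formulas
standard = (x - x.mean()) / x.std()            # population std (ddof=0)
minmax = (x - x.min()) / (x.max() - x.min())
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)              # IQR = Q3 - Q1

print("Standard matches:", np.allclose(standard, StandardScaler().fit_transform(x)))
print("MinMax matches:  ", np.allclose(minmax, MinMaxScaler().fit_transform(x)))
print("Robust matches:  ", np.allclose(robust, RobustScaler().fit_transform(x)))
print("Robust values:   ", robust.ravel())
```

In this example the median is 3 and the IQR is 2, so the four ordinary values land in [-1, 0.5] under the robust formula while the outlier stays far away at 48.5. Under min-max scaling the same outlier squeezes the ordinary values into roughly the bottom 3% of the [0, 1] range.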

In [None]:
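# Prepare clean data
# Note: KNNImputer measures distance on the raw values, so large-scale columns
# such as income dominate the choice of neighbors.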
num_cols = ['age', 'income', 'credit_score', 'years_employed']
df_clean = df.copy()
imputer = KNNImputer(n_neighbors=5)
df_clean[num_cols] = imputer.fit_transform(df_clean[num_cols])

# Apply different scalers
income = df_clean['income'].values.reshape(-1, 1)

income_standard = StandardScaler().fit_transform(income)
income_minmax = MinMaxScaler().fit_transform(income)
income_robust = RobustScaler().fit_transform(income)
income_log = np.log1p(income)

# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].hist(income, bins=50, alpha=0.7, color='blue')
axes[0, 0].set_title(f'Original\nMean: {income.mean():.0f}')

axes[0, 1].hist(income_standard, bins=50, alpha=0.7, color='green')
axes[0, 1].set_title(f'StandardScaler\nMean: {income_standard.mean():.2f}')

axes[1, 0].hist(income_minmax, bins=50, alpha=0.7, color='orange')
axes[1, 0].set_title(f'MinMaxScaler\nRange: [{income_minmax.min():.2f}, {income_minmax.max():.2f}]')

axes[1, 1].hist(income_log, bins=50, alpha=0.7, color='red')
axes[1, 1].set_title(f'Log Transform\nMean: {income_log.mean():.2f}')

plt.tight_layout()
plt.show()
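
The effect of the log transform plotted above can also be checked numerically. The sketch below, which assumes a synthetic log-normal sample rather than the tutorial's `df_clean`, compares skewness before and after `np.log1p` and shows that `np.expm1` reverses the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumed parameters, similar in shape to an income column)
rng = np.random.default_rng(42)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5000))

log_income = np.log1p(income)  # log(1 + x), which is defined at x = 0

print(f"Skewness before: {income.skew():.2f}")
print(f"Skewness after:  {log_income.skew():.2f}")

# np.expm1 is the inverse of np.log1p, so the original values can be recovered
recovered = np.expm1(log_income)
print("Round trip OK:", np.allclose(recovered, income))
```

Unlike the scalers, the log transform changes the shape of the distribution, not just its location and spread, which is why it is often applied before standardization on heavily skewed columns.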

In [None]:
# Impact of scaling on model performance
from sklearn.pipeline import make_pipeline

print("Impact of Scaling on Logistic Regression:")

X = df_clean[num_cols].copy()
y = df_clean['churned']

lr = LogisticRegression(max_iter=1000)

results = []
for name, scaler in [('No Scaling', None), ('Standard', StandardScaler()), ('MinMax', MinMaxScaler()), ('Robust', RobustScaler())]:
    # Wrap the scaler in a pipeline so it is re-fit on each CV training fold;
    # scaling the full dataset up front would leak validation-fold statistics.
    model = make_pipeline(scaler, lr) if scaler is not None else lr
    scores = cross_val_score(model, X, y, cv=5)
    results.append({'Method': name, 'Mean CV Score': f"{scores.mean():.4f}", 'Std': f"{scores.std():.4f}"})

print(pd.DataFrame(results).to_string(index=False))

---

## 4. Discretization and Bucketing <a id='4-discretization'></a>

| Method | Description | Use Case |
|--------|-------------|----------|
| **Equal-width** | Same range per bin | Uniform distribution |
| **Equal-frequency** | Same count per bin | Skewed distribution |
| **Custom** | Domain-specific bins | Business rules |
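
The difference between equal-width and equal-frequency binning is easiest to see on skewed data. The following sketch uses a synthetic exponential sample (an assumption for illustration) and compares the bin counts produced by `pd.cut` and `pd.qcut`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (hypothetical data)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=5, size=1000))

equal_width = pd.cut(skewed, bins=4)  # 4 intervals of identical width
equal_freq = pd.qcut(skewed, q=4)     # 4 intervals holding about 25% of the rows each

print("Equal-width counts:")
print(equal_width.value_counts().sort_index())
print("\nEqual-frequency counts:")
print(equal_freq.value_counts().sort_index())
```

With equal-width bins most rows pile into the first interval and the upper intervals are nearly empty, whereas the quantile-based bins stay balanced at the cost of very uneven interval widths.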

In [None]:
# Discretization demo
df_binned = df_clean.copy()

# Age: Custom bins
df_binned['age_bin'] = pd.cut(df_binned['age'], bins=[0, 25, 35, 50, 65, 100],
                               labels=['Young Adult', 'Adult', 'Middle Age', 'Senior', 'Elderly'])

# Income: Equal-frequency bins
df_binned['income_bin'] = pd.qcut(df_binned['income'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Credit Score: Custom bins (include_lowest=True keeps scores of exactly 300 in 'Poor')
df_binned['credit_bin'] = pd.cut(df_binned['credit_score'], bins=[300, 580, 670, 740, 800, 850],
                                  labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'],
                                  include_lowest=True)

print("Discretized Features:")
print(df_binned[['age', 'age_bin', 'income', 'income_bin', 'credit_score', 'credit_bin']].head(10))

In [None]:
# Visualize bins
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

df_binned['age'].hist(bins=30, ax=axes[0, 0], color='steelblue', alpha=0.7)
axes[0, 0].set_title('Age Distribution')

df_binned['age_bin'].value_counts().plot(kind='bar', ax=axes[0, 1], color='coral')
axes[0, 1].set_title('Age Bins')
axes[0, 1].tick_params(axis='x', rotation=45)

df_binned['credit_score'].hist(bins=30, ax=axes[1, 0], color='green', alpha=0.7)
axes[1, 0].set_title('Credit Score Distribution')

order = ['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']
df_binned['credit_bin'].value_counts().reindex(order).plot(kind='bar', ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Credit Score Bins')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

---

## 5. Encoding Categorical Features <a id='5-encoding'></a>

| Method | Description | Best For |
|--------|-------------|----------|
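| **Label Encoding** | One arbitrary integer per category | Tree models; ordinal features only when the order is mapped explicitly |
| **One-Hot Encoding** | One binary column per category | Nominal features, linear models |
| **Target Encoding** | Mean target per category | High cardinality (compute out-of-fold to avoid leakage) |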

In [None]:
# Encoding Demo
df_encoded = df_clean.copy()

# Fill missing categorical values
for col in ['education', 'region']:
    df_encoded[col] = df_encoded[col].fillna(df_encoded[col].mode()[0])

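# Label Encoding
# LabelEncoder assigns integers in sorted (alphabetical) order, so the codes carry no ordinal meaning
le = LabelEncoder()
df_encoded['region_label'] = le.fit_transform(df_encoded['region'])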

# Ordinal Encoding for education
edu_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
df_encoded['education_ordinal'] = df_encoded['education'].map(edu_order)

# One-Hot Encoding
employment_dummies = pd.get_dummies(df_encoded['employment_type'], prefix='emp')
df_encoded = pd.concat([df_encoded, employment_dummies], axis=1)

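# Target Encoding
# Note: the means are computed on all rows, so each row's own target leaks into its feature.
# This is fine for a demo; for modeling, compute the means on training folds only.
target_means = df_encoded.groupby('region')['churned'].mean()
df_encoded['region_target'] = df_encoded['region'].map(target_means)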

# Frequency Encoding
freq = df_encoded['region'].value_counts(normalize=True)
df_encoded['region_freq'] = df_encoded['region'].map(freq)

print("Encoded Features:")
cols_to_show = ['region', 'region_label', 'region_target', 'region_freq', 'education', 'education_ordinal']
print(df_encoded[cols_to_show].head(10))
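
The target encoding in the cell above uses every row's target, including the row being encoded. One common way to limit that leakage is out-of-fold encoding with smoothing, sketched below. The helper name `oof_target_encode`, its `smoothing` parameter, and the synthetic `region`/`churned` series are all assumptions for illustration, not part of the tutorial's code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat: pd.Series, target: pd.Series, n_splits: int = 5,
                      smoothing: float = 10.0, seed: int = 42) -> pd.Series:
    """Out-of-fold target encoding with additive smoothing (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        fit_cat, fit_y = cat.iloc[fit_idx], target.iloc[fit_idx]
        prior = fit_y.mean()  # overall target rate in the fitting folds
        stats = fit_y.groupby(fit_cat).agg(["mean", "count"])
        # Shrink each category mean toward the prior; rare categories shrink more
        smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (stats["count"] + smoothing)
        # Rows in the held-out fold are encoded with statistics they did not contribute to;
        # categories absent from the fitting folds fall back to the prior
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded

# Synthetic data (assumed distributions)
rng = np.random.default_rng(0)
region = pd.Series(rng.choice(["North", "South", "East", "West", "Central"], 500))
churned = pd.Series(rng.choice([0, 1], 500, p=[0.85, 0.15]))

region_oof = oof_target_encode(region, churned)
print(region_oof.describe())
```

Because each fold is encoded from different fitting rows, the same category can receive slightly different values across folds. That variation is expected and is part of what keeps the model from memorizing the target.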

In [None]:
# Visualize encoding
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

df_encoded['region'].value_counts().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Original Region Distribution')
axes[0].tick_params(axis='x', rotation=45)

target_means.sort_values().plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Target Encoded (Churn Rate)')
axes[1].tick_params(axis='x', rotation=45)

employment_dummies.sum().plot(kind='bar', ax=axes[2], color='green')
axes[2].set_title('One-Hot: Employment Type')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Compare encoding impact
print("\nEncoding Impact on Model Performance:")

y = df_encoded['churned']

# Label encoding
X_label = df_encoded[['age', 'income', 'credit_score', 'years_employed']].copy()
for col in ['education', 'employment_type', 'region', 'customer_segment']:
    le = LabelEncoder()
    X_label[col] = le.fit_transform(df_encoded[col])

# One-hot encoding
X_onehot = df_encoded[['age', 'income', 'credit_score', 'years_employed']].copy()
for col in ['education', 'employment_type', 'region', 'customer_segment']:
    dummies = pd.get_dummies(df_encoded[col], prefix=col)
    X_onehot = pd.concat([X_onehot, dummies], axis=1)

from sklearn.pipeline import make_pipeline

results = []
for name, X in [('Label Encoding', X_label), ('One-Hot Encoding', X_onehot)]:
    # Scale inside the pipeline so the scaler is fit on each CV training fold only
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    results.append({'Encoding': name, 'Mean CV': f"{scores.mean():.4f}", 'Std': f"{scores.std():.4f}"})

print(pd.DataFrame(results).to_string(index=False))

---

## 6. Feature Engineering Pipeline <a id='6-pipeline'></a>

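This section combines the operations above into a single chainable pipeline class. Each step fits on the full DataFrame it receives, which is convenient for exploration but should be replaced by train-only fitting before evaluating a model.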

In [None]:
# Complete Pipeline
class FeatureEngineeringPipeline:
    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.steps = []
    
    def handle_missing(self, num_strategy: str = 'median', cat_strategy: str = 'mode') -> 'FeatureEngineeringPipeline':
        for col in self.df.columns:
            if self.df[col].isnull().sum() > 0:
                if pd.api.types.is_numeric_dtype(self.df[col]):
                    fill_val = self.df[col].median() if num_strategy == 'median' else self.df[col].mean()
                    self.df[col] = self.df[col].fillna(fill_val)
                else:
                    self.df[col] = self.df[col].fillna(self.df[col].mode()[0] if cat_strategy == 'mode' else 'Unknown')
        self.steps.append('handle_missing')
        return self
    
    def remove_outliers(self, columns: List[str], factor: float = 3) -> 'FeatureEngineeringPipeline':
        for col in columns:
            if col in self.df.columns and pd.api.types.is_numeric_dtype(self.df[col]):
                Q1, Q3 = self.df[col].quantile([0.25, 0.75])
                IQR = Q3 - Q1
                self.df = self.df[(self.df[col] >= Q1 - factor*IQR) & (self.df[col] <= Q3 + factor*IQR)]
        self.steps.append('remove_outliers')
        return self
    
    def scale(self, columns: List[str], method: str = 'standard') -> 'FeatureEngineeringPipeline':
        # Any method other than 'standard' or 'minmax' falls back to RobustScaler
        if method == 'standard':
            scaler = StandardScaler()
        elif method == 'minmax':
            scaler = MinMaxScaler()
        else:
            scaler = RobustScaler()
        cols = [c for c in columns if c in self.df.columns]
        if cols:
            self.df[cols] = scaler.fit_transform(self.df[cols])
        self.steps.append(f'scale_{method}')
        return self
    
    def encode(self, columns: List[str], method: str = 'onehot') -> 'FeatureEngineeringPipeline':
        for col in columns:
            if col not in self.df.columns:
                continue
            if method == 'onehot':
                dummies = pd.get_dummies(self.df[col], prefix=col)
                self.df = pd.concat([self.df.drop(columns=[col]), dummies], axis=1)
            elif method == 'label':
                self.df[col] = LabelEncoder().fit_transform(self.df[col].astype(str))
        self.steps.append(f'encode_{method}')
        return self
    
    def get_result(self) -> Tuple[pd.DataFrame, List[str]]:
        return self.df, self.steps

print("FeatureEngineeringPipeline defined!")
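
The `remove_outliers` step relies on the interquartile-range (IQR) rule. As a worked example, the sketch below applies the same rule to eight made-up values so the fences can be checked by hand.

```python
import pandas as pd

# Toy values (hypothetical); 200 is an obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 200])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
factor = 3  # same default as remove_outliers
lower, upper = q1 - factor * iqr, q3 + factor * iqr

kept = values[(values >= lower) & (values <= upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print("Kept:", kept.tolist())
```

Here Q1 is 12.75 and Q3 is 16.5, so the IQR is 3.75 and the fences are 1.5 and 27.75; only the value 200 is dropped. A factor of 3 is more lenient than the widely used 1.5, so it removes only extreme values.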

In [None]:
# Run complete pipeline
print("=" * 60)
print("COMPLETE PIPELINE")
print("=" * 60)

pipeline = FeatureEngineeringPipeline(df)

df_processed, steps = (
    pipeline
    .handle_missing(num_strategy='median', cat_strategy='mode')
    .remove_outliers(['income', 'years_employed'], factor=3)
    .scale(['age', 'income', 'credit_score', 'years_employed'], method='standard')
    .encode(['education', 'employment_type', 'region', 'customer_segment'], method='onehot')
    .get_result()
)

print(f"\nOriginal shape: {df.shape}")
print(f"Processed shape: {df_processed.shape}")
print(f"\nSteps applied: {steps}")
print(f"\nColumns: {list(df_processed.columns)}")
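
The chained class above transforms the whole DataFrame at once. For model evaluation, a similar sequence can be expressed with scikit-learn's `Pipeline` and `ColumnTransformer`, so that every step is re-fit inside each cross-validation fold. The sketch below is one possible arrangement under assumed column names and synthetic data; it is not a drop-in replacement for `FeatureEngineeringPipeline` (for example, it does not remove outlier rows).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame (hypothetical columns, loosely mirroring the tutorial's dataset)
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "age": rng.normal(40, 15, n).clip(18, 85),
    "income": rng.lognormal(10.5, 0.8, n),
    "region": pd.Series(rng.choice(["North", "South", "East", "West"], n), dtype=object),
    "churned": rng.choice([0, 1], n, p=[0.85, 0.15]),
})
data.loc[rng.random(n) < 0.1, "income"] = np.nan
data.loc[rng.random(n) < 0.1, "region"] = np.nan

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["region"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Each CV fold re-fits the imputers, scaler, and encoder on its own training split
scores = cross_val_score(model, data.drop(columns="churned"), data["churned"], cv=5)
print(f"Mean CV accuracy: {scores.mean():.4f}")
```

Bundling preprocessing with the estimator also means the fitted object can be applied to new data with a single `predict` call, using exactly the statistics learned during training.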

---

## 7. Hands-on Exercise <a id='7-exercise'></a>

Apply feature engineering to a loan dataset.

In [None]:
# Exercise: Loan Dataset
np.random.seed(42)
n = 1000

loan_df = pd.DataFrame({
    'loan_amount': np.random.lognormal(10, 0.5, n),
    'interest_rate': np.random.uniform(5, 25, n),
    'term_months': np.random.choice([12, 24, 36, 48, 60], n),
    'annual_income': np.random.lognormal(11, 0.6, n),
    'debt_to_income': np.random.uniform(0.1, 0.6, n),
    'credit_history': np.random.choice(['Short', 'Medium', 'Long'], n, p=[0.3, 0.4, 0.3]),
    'home_ownership': np.random.choice(['Rent', 'Own', 'Mortgage', 'Other'], n, p=[0.35, 0.2, 0.4, 0.05]),
    'purpose': np.random.choice(['Debt Consolidation', 'Home Improvement', 'Medical', 'Education', 'Other'], n),
    'default': np.random.choice([0, 1], n, p=[0.88, 0.12])
})

# Add missing values
for col in ['annual_income', 'debt_to_income', 'credit_history']:
    mask = np.random.random(n) < 0.08
    loan_df.loc[mask, col] = np.nan

print("Loan Dataset:")
print(loan_df.head())
print(f"\nMissing: {loan_df.isnull().sum().sum()}")

In [None]:
# YOUR TASK: Apply feature engineering
loan_pipeline = FeatureEngineeringPipeline(loan_df)

loan_processed, loan_steps = (
    loan_pipeline
    .handle_missing()
    .remove_outliers(['loan_amount', 'annual_income'], factor=3)
    .scale(['loan_amount', 'interest_rate', 'annual_income', 'debt_to_income'], method='robust')
    .encode(['credit_history', 'home_ownership', 'purpose'], method='onehot')
    .get_result()
)

print(f"Original: {loan_df.shape} -> Processed: {loan_processed.shape}")
print(f"Steps: {loan_steps}")

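# Train model
# Note: 'default' is sampled independently of the features, so accuracy should sit
# near the majority-class rate (about 0.88) rather than reflect real predictive signal.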
X = loan_processed.drop(columns=['default'])
y = loan_processed['default']

scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print(f"\nRandom Forest CV Score: {scores.mean():.4f} (+/- {scores.std():.4f})")

---

## 8. Summary and Key Takeaways <a id='8-summary'></a>

### Key Concepts

1. **Missing Values**: Mean, median, mode, KNN imputation
2. **Scaling**: StandardScaler, MinMaxScaler, RobustScaler, log transform
3. **Discretization**: Equal-width, equal-frequency, custom bins
4. **Encoding**: Label, one-hot, target, frequency encoding

### Best Practices

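- Inspect missingness patterns and distributions before choosing an imputation or scaling strategy
- Use RobustScaler for data with outliers
- Use target encoding for high-cardinality categories, and compute it out-of-fold to avoid target leakage
- Fit transformers (imputers, scalers, encoders) on training data only, then apply them to validation and test data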

### Next Steps

Tutorial 07: Feature Engineering for Unstructured Data

In [None]:
print("=" * 60)
print("TUTORIAL 06 COMPLETE: Feature Engineering Operations")
print("=" * 60)
print("\nTopics covered:")
print("  1. Handling Missing Values")
print("  2. Feature Scaling")
print("  3. Discretization and Bucketing")
print("  4. Categorical Encoding")
print("  5. Complete Pipeline")
print("\nNext: Tutorial 07 - Feature Engineering for Unstructured Data")