# 05 - Feature Engineering Operations

---

## What the Chapter Says

The chapter covers **Feature Engineering** for structured data with these specific operations:

1. **Missing Values**: deletion (row/column) vs imputation (defaults, mean/median/mode)
2. **Feature Scaling**: normalization (min-max), standardization (z-score), log scaling
3. **Discretization/Bucketing**: continuous → categorical (with age bucketing example)
4. **Encoding Categorical Features**: integer encoding, one-hot encoding, embedding learning

Each operation includes pros and cons as specified in the chapter.

---

## Meta Interview Signal

| Level | Expectations |
|-------|-------------|
| **E5** | Knows all operations and when to use each. Can implement them. Understands pros/cons. |
| **E6** | Discusses feature engineering at scale (batch vs online features). Proposes feature store design. Considers feature drift and freshness. |

---

## Setup: Create Synthetic Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Create synthetic user engagement data (matching chapter examples)
n = 1000

raw_data = pd.DataFrame({
    'user_id': range(n),
    
    # Numerical features with various issues
    'age': np.random.choice(
        list(range(13, 80)) + [None], n, 
        p=[0.01]*67 + [0.33]  # 33% missing
    ),
    'watch_time_sec': np.random.exponential(300, n),  # Highly skewed
    'num_sessions': np.random.poisson(10, n),
    'follower_count': np.random.pareto(2, n) * 100,  # Power law, needs log
    
    # Categorical features
    'device': np.random.choice(['iOS', 'Android', 'Web', None], n, p=[0.35, 0.35, 0.25, 0.05]),
    'country': np.random.choice(['US', 'UK', 'IN', 'BR', 'DE', 'JP', 'FR', 'CA', 'AU', 'MX'], n),
    'content_category': np.random.choice(['sports', 'music', 'news', 'comedy', 'education', 'gaming'], n),
    
    # Target: did user engage? (like/share)
    'engaged': np.random.choice([0, 1], n, p=[0.7, 0.3])
})

# Convert age to numeric (with NaN)
raw_data['age'] = pd.to_numeric(raw_data['age'], errors='coerce')

print("RAW DATA OVERVIEW")
print("="*60)
print(f"Shape: {raw_data.shape}")
print(f"\nMissing values:")
print(raw_data.isnull().sum())
print(f"\nSample:")
print(raw_data.head(10))

---

## 1) Missing Values (Chapter Content)

The chapter specifies two approaches:

| Approach | Methods | Drawback |
|----------|---------|----------|
| **Deletion** | Row deletion, Column deletion | Reduces data quantity |
| **Imputation** | Defaults, Mean/Median/Mode | Introduces noise |

In [None]:
# Demonstrate missing value handling
print("="*60)
print("MISSING VALUES HANDLING (Chapter Methods)")
print("="*60)

df = raw_data.copy()
print(f"\nOriginal data: {len(df)} rows")
print(f"Missing values: age={df['age'].isnull().sum()}, device={df['device'].isnull().sum()}")

In [None]:
# Method 1: Row Deletion
print("\n" + "-"*40)
print("METHOD 1: Row Deletion")
print("-"*40)

df_row_deleted = df.dropna(subset=['age', 'device'])
rows_lost = len(df) - len(df_row_deleted)

print(f"After row deletion: {len(df_row_deleted)} rows")
print(f"Rows lost: {rows_lost} ({rows_lost/len(df)*100:.1f}%)")
print(f"\n[Drawback]: Reduces data quantity - lost {rows_lost/len(df)*100:.1f}% of data!")
print("[When to use]: When missing data is random and you have plenty of data")

In [None]:
# Method 2: Column Deletion
print("\n" + "-"*40)
print("METHOD 2: Column Deletion")
print("-"*40)

# If a column has >50% missing, consider dropping
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing percentage per column:")
print(missing_pct[missing_pct > 0])

# In this case, age has 33% missing - borderline
print(f"\n[Drawback]: Lose the entire feature's predictive power")
print("[When to use]: When column has very high missing rate (>50-70%)")
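
The cell above only reports missing rates. A minimal sketch of actually dropping columns past a cutoff could look like the following; the helper name `drop_sparse_columns`, the toy frame, and the 50% threshold are all assumptions for illustration, not something the chapter prescribes.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing_frac: float = 0.5) -> pd.DataFrame:
    """Return a copy without columns whose missing fraction exceeds the cutoff."""
    missing_frac = frame.isnull().mean()  # fraction of NaN per column
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return frame[keep].copy()

# Hypothetical toy data
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],                      # 25% missing, kept
    "last_purchase": [np.nan, np.nan, np.nan, 3.0],   # 75% missing, dropped
    "sessions": [4, 7, 1, 9],                         # complete, kept
})
reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

The threshold is a judgment call: a sparse column can still be worth keeping if the fact that a value is present is itself predictive.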

In [None]:
# Method 3: Imputation - Default Value
print("\n" + "-"*40)
print("METHOD 3: Imputation - Default Value")
print("-"*40)

df_imputed_default = df.copy()
df_imputed_default['device'] = df_imputed_default['device'].fillna('Unknown')
df_imputed_default['age'] = df_imputed_default['age'].fillna(-1)  # Sentinel value

print(f"Device value counts after imputation:")
print(df_imputed_default['device'].value_counts())
print(f"\n[Drawback]: Introduces artificial category; sentinel values can confuse models")
print("[When to use]: For categorical features where 'unknown' is meaningful")

In [None]:
# Method 4: Imputation - Mean/Median/Mode
print("\n" + "-"*40)
print("METHOD 4: Imputation - Mean/Median/Mode")
print("-"*40)

df_imputed_stats = df.copy()

# Numerical: Use median (robust to outliers)
age_median = df_imputed_stats['age'].median()
df_imputed_stats['age'] = df_imputed_stats['age'].fillna(age_median)

# Categorical: Use mode (most frequent)
device_mode = df_imputed_stats['device'].mode()[0]
df_imputed_stats['device'] = df_imputed_stats['device'].fillna(device_mode)

print(f"Age imputed with median: {age_median}")
print(f"Device imputed with mode: {device_mode}")
print(f"\n[Drawback]: Introduces noise - all missing ages now look the same")
print("[When to use]: When missing is random and feature is important")

In [None]:
# Visualize the effect of imputation
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Original age distribution (without NaN)
axes[0].hist(df['age'].dropna(), bins=30, alpha=0.7, color='blue', edgecolor='black')
axes[0].axvline(age_median, color='red', linestyle='--', linewidth=2, label=f'Median={age_median}')
axes[0].set_title('Original Age Distribution')
axes[0].set_xlabel('Age')
axes[0].legend()

# After imputation
axes[1].hist(df_imputed_stats['age'], bins=30, alpha=0.7, color='green', edgecolor='black')
axes[1].axvline(age_median, color='red', linestyle='--', linewidth=2, label=f'Spike at median')
axes[1].set_title('After Median Imputation')
axes[1].set_xlabel('Age')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n[Notice]: The spike at the median is the 'noise' introduced by imputation")
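
One common way to soften the "all missing ages look the same" drawback is to keep a binary was-missing flag next to the imputed value, so a model can still tell imputed rows apart. The sketch below assumes scikit-learn's `SimpleImputer` with `add_indicator=True`; the toy column and the `age_was_missing` name are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_age = pd.DataFrame({"age": [20.0, np.nan, 30.0, 40.0, np.nan]})

# Median imputation plus one extra 0/1 column marking originally missing rows
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(toy_age)

age_imputed = pd.DataFrame(imputed, columns=["age", "age_was_missing"])
print(age_imputed)
```

Whether the flag helps depends on why values are missing; if missingness is unrelated to the target, the extra column mostly adds noise.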

---

## 2) Feature Scaling (Chapter Content)

The chapter specifies three methods:

| Method | Idea | Use Case |
|--------|------|----------|
| **Normalization** | Min-max scaling to [0, 1] | Bounded range needed |
| **Standardization** | Z-score (mean=0, std=1) | Gaussian-like data |
| **Log Scaling** | Reduce skewness | Power-law distributions |

In [None]:
print("="*60)
print("FEATURE SCALING (Chapter Methods)")
print("="*60)

# Use imputed data
df_scaling = df_imputed_stats.copy()

# Show original distributions
print("\nOriginal feature statistics:")
print(df_scaling[['age', 'watch_time_sec', 'follower_count']].describe())

In [None]:
# Method 1: Normalization (Min-Max)
print("\n" + "-"*40)
print("METHOD 1: Normalization (Min-Max Scaling)")
print("-"*40)

scaler_minmax = MinMaxScaler()
age_normalized = scaler_minmax.fit_transform(df_scaling[['age']])

print(f"Formula: x_scaled = (x - min) / (max - min)")
print(f"\nOriginal age: min={df_scaling['age'].min()}, max={df_scaling['age'].max()}")
print(f"Normalized age: min={age_normalized.min():.3f}, max={age_normalized.max():.3f}")
print(f"\n[Use case]: When you need bounded range [0, 1]")
print("[Drawback]: Sensitive to outliers (one extreme value maps to 1.0 and squeezes the rest toward 0)")

In [None]:
# Method 2: Standardization (Z-score)
print("\n" + "-"*40)
print("METHOD 2: Standardization (Z-score)")
print("-"*40)

scaler_std = StandardScaler()
age_standardized = scaler_std.fit_transform(df_scaling[['age']])

print(f"Formula: z = (x - mean) / std")
print(f"\nOriginal age: mean={df_scaling['age'].mean():.2f}, std={df_scaling['age'].std():.2f}")
print(f"Standardized age: mean={age_standardized.mean():.6f}, std={age_standardized.std():.3f}")
print(f"\n[Use case]: When data is roughly Gaussian, or the algorithm expects zero-mean, unit-variance inputs")
print("[Benefit]: Not bounded, preserves outlier information")

In [None]:
# Method 3: Log Scaling (for skewed data)
print("\n" + "-"*40)
print("METHOD 3: Log Scaling")
print("-"*40)

# Follower count is heavily skewed (power law)
print(f"follower_count statistics before log:")
print(f"  mean: {df_scaling['follower_count'].mean():.2f}")
print(f"  median: {df_scaling['follower_count'].median():.2f}")
print(f"  skewness: {df_scaling['follower_count'].skew():.2f}")

# Apply log transform (add 1 to handle zeros)
follower_log = np.log1p(df_scaling['follower_count'])

print(f"\nAfter log transform:")
print(f"  mean: {follower_log.mean():.2f}")
print(f"  median: {follower_log.median():.2f}")
print(f"  skewness: {follower_log.skew():.2f}")
print(f"\n[Use case]: Power-law distributions (followers, likes, views)")
print("[Benefit]: Reduces skewness, speeds up convergence")

In [None]:
# Visualize scaling effects
fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# Row 1: Age scaling
axes[0, 0].hist(df_scaling['age'], bins=30, color='blue', alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Age: Original')

axes[0, 1].hist(age_normalized, bins=30, color='green', alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Age: Normalized [0,1]')

axes[0, 2].hist(age_standardized, bins=30, color='orange', alpha=0.7, edgecolor='black')
axes[0, 2].set_title('Age: Standardized (z-score)')

# Row 2: Follower count (skewed)
axes[1, 0].hist(df_scaling['follower_count'], bins=50, color='blue', alpha=0.7, edgecolor='black')
axes[1, 0].set_title(f'Followers: Original (skew={df_scaling["follower_count"].skew():.2f})')

axes[1, 1].hist(follower_log, bins=30, color='purple', alpha=0.7, edgecolor='black')
axes[1, 1].set_title(f'Followers: Log Transformed (skew={follower_log.skew():.2f})')

axes[1, 2].axis('off')
axes[1, 2].text(0.1, 0.7, 'Scaling Summary:', fontsize=12, fontweight='bold')
axes[1, 2].text(0.1, 0.5, '• Normalization: [0,1] range', fontsize=10)
axes[1, 2].text(0.1, 0.35, '• Standardization: mean=0, std=1', fontsize=10)
axes[1, 2].text(0.1, 0.2, '• Log: reduces skewness', fontsize=10)

plt.tight_layout()
plt.show()
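
To make the three formulas concrete, here is a small sketch that applies them by hand with NumPy and checks the first two against scikit-learn. The toy values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[13.0], [25.0], [40.0], [79.0]])  # toy ages, one feature column

# Min-max: (x - min) / (max - min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std, with the population std (ddof=0) as StandardScaler uses
x_zscore = (x - x.mean()) / x.std()

# Log scaling: log(1 + x), so that zero maps to zero
x_log = np.log1p(x)

print(np.allclose(x_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(x_zscore, StandardScaler().fit_transform(x)))
```

Note that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), so a standard deviation computed with pandas will differ slightly from the one `StandardScaler` divides by.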

---

## 3) Discretization / Bucketing (Chapter Content)

The chapter specifies:
- Turn continuous features into categorical buckets
- **Age bucketing example**: 0-9, 10-19, 20-39, 40-59, 60+

In [None]:
print("="*60)
print("DISCRETIZATION / BUCKETING (Chapter Content)")
print("="*60)

df_bucket = df_imputed_stats.copy()

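# Chapter's exact age bucketing example
# pd.cut uses right-closed intervals by default: (0, 9], (9, 19], ...
# include_lowest=True closes the first interval so that age 0 lands in '0-9'.
# Ages above 100 fall outside the last bin and become NaN.
age_bins = [0, 9, 19, 39, 59, 100]
age_labels = ['0-9', '10-19', '20-39', '40-59', '60+']

df_bucket['age_bucket'] = pd.cut(df_bucket['age'], bins=age_bins, labels=age_labels, include_lowest=True)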

print("\nAge Bucketing (Chapter Example)")
print(f"Bins: {age_bins}")
print(f"Labels: {age_labels}")
print(f"\nBucket distribution:")
print(df_bucket['age_bucket'].value_counts().sort_index())

In [None]:
# Bucket watch time (quintiles)
print("\n" + "-"*40)
print("Watch Time Bucketing (Quantile-based)")
print("-"*40)

df_bucket['watch_time_bucket'] = pd.qcut(
    df_bucket['watch_time_sec'], 
    q=5, 
    labels=['very_low', 'low', 'medium', 'high', 'very_high']
)

print(f"\nBucket distribution:")
print(df_bucket['watch_time_bucket'].value_counts())

In [None]:
# Visualize bucketing
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Age buckets
df_bucket['age_bucket'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='#4CAF50', edgecolor='black')
axes[0].set_title('Age Buckets (Chapter Example)')
axes[0].set_xlabel('Age Bucket')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# Watch time buckets
order = ['very_low', 'low', 'medium', 'high', 'very_high']
df_bucket['watch_time_bucket'].value_counts()[order].plot(kind='bar', ax=axes[1], color='#2196F3', edgecolor='black')
axes[1].set_title('Watch Time Buckets (Quantile-based)')
axes[1].set_xlabel('Watch Time Bucket')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n[When to use]: Non-linear relationships (especially with linear models), interpretability")
print("[Drawback]: Loss of granularity, boundary effects")
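
Boundary effects are easy to get wrong with `pd.cut`, so it helps to test the edges explicitly. The sketch below is one possible variant, assuming left-closed bins and an open-ended last bucket so that fractional and very large ages are still assigned; the test ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 9, 10, 19, 20, 39, 40, 59, 60, 85])

# Left-closed bins: [0, 10), [10, 20), [20, 40), [40, 60), [60, inf)
edges = [0, 10, 20, 40, 60, np.inf]
labels = ["0-9", "10-19", "20-39", "40-59", "60+"]

buckets = pd.cut(ages, bins=edges, labels=labels, right=False)
print(pd.DataFrame({"age": ages, "bucket": buckets}))
```

With `right=False`, each label's lower bound is included in its own bucket, which matches how the labels read.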

---

## 4) Encoding Categorical Features (Chapter Content)

The chapter specifies:

| Method | When to Use | Issue |
|--------|-------------|-------|
| **Integer Encoding** | Only when ordinal relationship exists | Implies order when there isn't |
| **One-Hot Encoding** | Nominal categories | High-cardinality explosion |
| **Embedding Learning** | High-cardinality categories | Requires training |

In [None]:
print("="*60)
print("ENCODING CATEGORICAL FEATURES (Chapter Content)")
print("="*60)

df_encode = df_imputed_stats.copy()
print(f"\nCategorical features to encode:")
print(f"  device: {df_encode['device'].nunique()} unique values")
print(f"  country: {df_encode['country'].nunique()} unique values")
print(f"  content_category: {df_encode['content_category'].nunique()} unique values")

In [None]:
# Method 1: Integer Encoding (for ordinal data ONLY)
print("\n" + "-"*40)
print("METHOD 1: Integer Encoding")
print("-"*40)

# Create an ordinal feature (e.g., engagement level)
df_encode['engagement_level'] = pd.cut(
    df_encode['watch_time_sec'],
    bins=[0, 60, 180, 600, float('inf')],
    labels=['none', 'low', 'medium', 'high']
)

# Integer encoding makes sense here (there IS an order)
engagement_mapping = {'none': 0, 'low': 1, 'medium': 2, 'high': 3}
# .map() on a categorical column returns a categorical; cast to get plain integers
df_encode['engagement_encoded'] = df_encode['engagement_level'].map(engagement_mapping).astype(int)

print(f"Ordinal encoding for engagement_level:")
print(f"  {engagement_mapping}")
print(f"\n[When to use]: ONLY when ordinal relationship exists (low < medium < high)")
print("[Drawback]: If used on nominal data, implies false ordering (iOS < Android < Web?)")
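
An alternative to a hand-written mapping is to declare the order once and let pandas assign the codes. This is a sketch assuming an ordered `pd.Categorical`; the level names mirror the example above and the sample values are made up.

```python
import pandas as pd

levels = ["none", "low", "medium", "high"]  # explicit order, lowest first
raw = pd.Series(["low", "high", "none", "medium", "low"])

# ordered=True records that none < low < medium < high
ordered = pd.Categorical(raw, categories=levels, ordered=True)
codes = pd.Series(ordered.codes, index=raw.index)
print(codes.tolist())
```

Values that are not in `levels` get the code `-1`, which makes unexpected categories easy to spot.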

In [None]:
# Method 2: One-Hot Encoding
print("\n" + "-"*40)
print("METHOD 2: One-Hot Encoding")
print("-"*40)

# One-hot encode device (nominal, low cardinality)
device_onehot = pd.get_dummies(df_encode['device'], prefix='device')

print(f"Original 'device' column: 1 column with {df_encode['device'].nunique()} categories")
print(f"After one-hot: {device_onehot.shape[1]} columns")
print(f"\nOne-hot columns: {list(device_onehot.columns)}")
print(f"\nSample:")
print(pd.concat([df_encode['device'].head(), device_onehot.head()], axis=1))

In [None]:
# One-hot encoding failure case: high cardinality
print("\n" + "-"*40)
print("ONE-HOT ENCODING: High Cardinality Problem")
print("-"*40)

# Simulate high-cardinality feature (e.g., user_id, product_id)
n_unique_items = 100_000  # 100K products
n_rows = 1_000_000        # 1M rows

print(f"\nScenario: {n_unique_items:,} unique product IDs")
print(f"One-hot encoding would create: {n_unique_items:,} binary columns!")
print(f"Memory for {n_rows:,} rows: ~{n_unique_items * n_rows / 1e9:.0f}GB (dense, assuming 1 byte per value)")
print(f"\n[Problem]: Sparse, high-dimensional, slow training")
print("[Solution]: Use embedding learning instead")
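
The memory estimate above assumes a dense matrix. As a rough illustration, the sketch below one-hot encodes a scaled-down stand-in (10,000 rows and up to 5,000 IDs, both arbitrary) and compares a dense 1-byte-per-cell estimate with the sparse storage that scikit-learn's `OneHotEncoder` returns by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_rows, n_products = 10_000, 5_000  # scaled-down stand-in for 1M rows x 100K IDs
product_ids = rng.integers(0, n_products, size=(n_rows, 1))

encoder = OneHotEncoder(handle_unknown="ignore")  # sparse output by default
onehot = encoder.fit_transform(product_ids)

dense_bytes = onehot.shape[0] * onehot.shape[1]  # 1 byte per cell if densified
sparse_bytes = onehot.data.nbytes + onehot.indices.nbytes + onehot.indptr.nbytes

print(f"shape: {onehot.shape}, non-zeros: {onehot.nnz}")
print(f"dense (1 byte/cell): {dense_bytes / 1e6:.1f} MB")
print(f"sparse storage:      {sparse_bytes / 1e6:.3f} MB")
```

Sparse storage addresses memory only. The model still sees one input dimension per category, which is why embeddings remain the usual answer at high cardinality.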

In [None]:
# Method 3: Embedding Learning (conceptual)
print("\n" + "-"*40)
print("METHOD 3: Embedding Learning")
print("-"*40)

print("""
For HIGH-CARDINALITY categorical features:

Instead of:  user_id → [0, 0, 0, 1, 0, 0, ..., 0]  (100K dimensions)

Learn:       user_id → [0.23, -0.15, 0.87, ..., 0.42]  (dense embedding, e.g., 64 dimensions)

Benefits:
• Dense, low-dimensional representation
• Captures semantic similarity (similar users have similar embeddings)
• Learned during model training

Used in:
• Recommendation systems (user/item embeddings)
• NLP (word embeddings like Word2Vec)
• Ads CTR prediction (ad/user embeddings)
""")

# Simulate learned embeddings
np.random.seed(42)
embedding_dim = 8
n_categories = 1000

# Pretend these are learned embeddings
fake_embeddings = np.random.randn(n_categories, embedding_dim)

print(f"\nSimulated embedding table shape: {fake_embeddings.shape}")
print(f"  {n_categories} categories × {embedding_dim} dimensions")
print(f"\nSample embedding for category 0:")
print(f"  {fake_embeddings[0]}")
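
Mechanically, an embedding layer is a table lookup: each integer-encoded category selects one row of a weight matrix. The sketch below uses random numbers as a stand-in for learned weights and shows that the lookup equals multiplying a one-hot matrix by the table; the IDs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_categories, embedding_dim = 1000, 8
embedding_table = rng.normal(size=(n_categories, embedding_dim))  # stand-in for learned weights

category_ids = np.array([3, 17, 3, 999])  # integer-encoded categories for one batch

# Lookup: each ID selects one row of the table
looked_up = embedding_table[category_ids]

# Equivalent but wasteful view: one-hot matrix times the table
one_hot = np.eye(n_categories)[category_ids]
via_matmul = one_hot @ embedding_table

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

In a real model the table is a trainable parameter updated by backpropagation, so only the rows for IDs seen in a batch receive gradient.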

---

## E4) Data Prep Talking Points Checklist (Chapter Content)

The chapter requires discussing:

In [None]:
talking_points = pd.DataFrame({
    'Topic': [
        'Availability / Collection / Size / Freshness',
        'Storage',
        'Feature Engineering',
        'Privacy',
        'Bias'
    ],
    'Questions to Address': [
        'What data exists? How is it collected? How much? How fresh?',
        'Cloud vs device? Formats? Multimodal data?',
        'Missing data handling? Normalization? Constructed features? Combining text+numbers+images?',
        'Sensitive data? Anonymization? On-device constraints?',
        'What kinds of bias? How to mitigate?'
    ],
    'Example Answer': [
        'We have 1 year of click logs, updated hourly, 10TB total',
        'Event logs in cloud data warehouse, features served from Redis',
        'Median imputation for age, log transform for followers, embeddings for user IDs',
        'Remove PII, train on aggregated stats, differential privacy for sensitive features',
        'Selection bias in training data, mitigate with stratified sampling'
    ]
})

print("="*100)
print("DATA PREP TALKING POINTS CHECKLIST (Chapter Content)")
print("="*100)
for _, row in talking_points.iterrows():
    print(f"\n{row['Topic']}")
    print(f"  Q: {row['Questions to Address']}")
    print(f"  Example: {row['Example Answer']}")

---

## Complete Feature Engineering Pipeline

In [None]:
# Complete pipeline combining all operations
def feature_engineering_pipeline(df):
    """Complete feature engineering pipeline (Chapter operations)"""
    result = df.copy()
    
    print("FEATURE ENGINEERING PIPELINE")
    print("="*60)
    print(f"Input shape: {result.shape}")
    
    # 1. Handle missing values
    print("\n1. Handling missing values...")
    result['age'] = result['age'].fillna(result['age'].median())
    result['device'] = result['device'].fillna('Unknown')
    print(f"   Imputed age with median, device with 'Unknown'")
    
    # 2. Feature scaling
    print("\n2. Scaling features...")
    result['age_scaled'] = StandardScaler().fit_transform(result[['age']]).ravel()  # flatten (n, 1) to 1-D
    result['watch_time_log'] = np.log1p(result['watch_time_sec'])
    result['follower_log'] = np.log1p(result['follower_count'])
    print(f"   Standardized age, log-transformed watch_time and followers")
    
    # 3. Discretization
    print("\n3. Bucketing continuous features...")
    age_bins = [0, 9, 19, 39, 59, 100]
    age_labels = ['0-9', '10-19', '20-39', '40-59', '60+']
    result['age_bucket'] = pd.cut(result['age'], bins=age_bins, labels=age_labels, include_lowest=True)
    print(f"   Created age buckets: {age_labels}")
    
    # 4. Encode categoricals
    print("\n4. Encoding categorical features...")
    device_dummies = pd.get_dummies(result['device'], prefix='device')
    result = pd.concat([result, device_dummies], axis=1)
    print(f"   One-hot encoded device: {list(device_dummies.columns)}")
    
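    # High-cardinality features would use learned embeddings in production.
    # The integer IDs below are only lookup indices for an embedding table;
    # feeding them to a model as a numeric feature would imply a false order.
    result['country_encoded'] = LabelEncoder().fit_transform(result['country'])
    print(f"   Integer encoded country (would use embeddings in production)")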
    
    print(f"\nOutput shape: {result.shape}")
    return result

# Run pipeline
processed_data = feature_engineering_pipeline(raw_data)

In [None]:
# Show processed data
print("\nProcessed Data Sample:")
print(processed_data.head())

print(f"\nNew columns created:")
new_cols = [c for c in processed_data.columns if c not in raw_data.columns]
print(new_cols)
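
The function above computes the median and scaling statistics on whatever frame it receives. When there is a train/test split, those statistics are usually learned on the training rows only and reused on the rest. The sketch below shows one way to do that with scikit-learn's `ColumnTransformer`; the toy data, split ratio, and step names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": np.where(rng.random(n) < 0.3, np.nan, rng.integers(13, 80, n)),
    "follower_count": rng.pareto(2, n) * 100,
    "device": rng.choice(["iOS", "Android", "Web"], n),
})
toy.loc[rng.random(n) < 0.05, "device"] = np.nan  # roughly 5% missing devices

train, test = train_test_split(toy, test_size=0.25, random_state=0)

preprocess = ColumnTransformer(
    [
        ("age", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["age"]),
        ("followers", FunctionTransformer(np.log1p), ["follower_count"]),
        ("device", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["device"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train = preprocess.fit_transform(train)  # statistics learned from train only
X_test = preprocess.transform(test)        # same statistics reused on test
print(X_train.shape, X_test.shape)
```

`handle_unknown="ignore"` means a device value that appears only at test time is encoded as all zeros instead of raising an error.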

---

## Tradeoffs (Chapter-Aligned)

| Tradeoff | Discussion | Interview Signal |
|----------|------------|------------------|
| **Deletion vs Imputation** | Lose data vs introduce noise | E5: Knows both. E6: Discusses impact on model bias |
| **Normalization vs Standardization** | Bounded range vs unbounded | E5: Knows when to use each. E6: Discusses outlier handling |
| **One-Hot vs Embeddings** | Simple vs learnable | E5: Understands cardinality issue. E6: Proposes embedding strategies |
| **Bucketing granularity** | Fine vs coarse buckets | E5: Can implement. E6: Discusses business-relevant buckets |

---

## Meta Interview Signal (Detailed)

### E5 Answer Expectations

- Knows all four feature engineering operations (missing values, scaling, bucketing, encoding)
- Can implement each with sklearn/pandas
- Understands pros and cons of each method
- Can explain when to use each approach

### E6 Additions

- **Feature store design**: "We compute batch features daily and real-time features on-the-fly, served from Redis"
- **Feature drift**: "We monitor feature distributions and alert when they drift from training"
- **Embedding strategies**: "For cold-start users, we use average embeddings or fallback to demographic features"
- **Scale considerations**: "At Meta scale, feature engineering is a distributed Spark job running on thousands of nodes"
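
As a concrete illustration of the feature drift point, the sketch below computes a population stability index (PSI) between a training-time sample and a serving-time sample of one feature. The source does not say which drift metric is used; PSI is one common choice and is assumed here, along with the function name, bin count, and synthetic data.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training-time sample and a serving-time sample of one feature."""
    # Interior bin edges come from quantiles of the training distribution
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    exp_counts = np.bincount(np.searchsorted(inner_edges, expected), minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(inner_edges, actual), minlength=n_bins)
    exp_frac = np.clip(exp_counts / len(expected), eps, None)  # avoid log(0)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
train_sample = rng.exponential(300, 5000)  # e.g. watch time at training time
same_dist = rng.exponential(300, 5000)     # serving data, no drift
shifted = rng.exponential(450, 5000)       # serving data, mean has moved

psi_same = population_stability_index(train_sample, same_dist)
psi_shifted = population_stability_index(train_sample, shifted)
print(f"PSI without drift: {psi_same:.4f}")
print(f"PSI with drift:    {psi_shifted:.4f}")
```

The alert threshold is a tuning choice per feature; in practice it is set by watching how the score behaves on periods known to be stable.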

---

## Interview Drills

### Drill 1: Missing Value Strategy
For each scenario, choose deletion or imputation and justify:
- 5% of user ages are missing
- 60% of a new feature "last_purchase_date" is missing
- 2% of target labels are missing

### Drill 2: Scaling Selection
Choose the appropriate scaling method:
- User age for a neural network
- Number of followers (power law distribution)
- Temperature in Celsius for a linear regression

### Drill 3: Age Bucketing
Reproduce the chapter's age bucketing example from memory: 0-9, 10-19, 20-39, 40-59, 60+

### Drill 4: Encoding Decision
For each feature, choose integer, one-hot, or embedding encoding:
- Device type (iOS, Android, Web)
- User ID (100M users)
- Product category (100 categories)
- Rating (1-5 stars)

### Drill 5: Complete Pipeline
Design a feature engineering pipeline for a video recommendation system. Include:
- What features would you extract?
- How would you handle missing values?
- What scaling would you apply?
- How would you encode categorical features?