# Chapter 30: Feature Engineering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/30_feature_engineering.ipynb)

This notebook accompanies Chapter 30 of the BANA 4080 textbook. It provides interactive examples of feature engineering techniques including encoding, scaling, feature creation, handling missing data, and building scikit-learn pipelines.

## Learning Objectives

By working through this notebook, you will be able to:

- Apply different encoding strategies for categorical variables (dummy/one-hot, label, and ordinal encoding)
- Scale and normalize numerical features using StandardScaler and MinMaxScaler
- Create new features using polynomial terms, interaction terms, and domain knowledge
- Handle missing data strategically through imputation or deletion, including missingness indicators
- Build end-to-end feature engineering pipelines with scikit-learn to prevent data leakage
- Recognize when different techniques are appropriate based on your data, model, and goals

## Setup

Let's start by importing the libraries we'll need and loading the Ames housing dataset.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer

# Scikit-learn pipeline tools
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')

print("✓ Libraries imported successfully")

### Load the Ames Housing Data

We'll use the Ames housing dataset throughout this notebook. This dataset contains information about house sales in Ames, Iowa.

In [None]:
# Load data - adjust path if running in Google Colab
try:
    # Try local path first
    ames = pd.read_csv('../data/ames_clean.csv')
except FileNotFoundError:
    # If in Colab, load from GitHub
    url = 'https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/ames_clean.csv'
    ames = pd.read_csv(url)

print(f"Dataset shape: {ames.shape}")
print(f"\nFirst few rows:")
ames.head()

Let's quickly explore the dataset structure:

In [None]:
# Dataset information
print("Dataset Info:")
print(f"Number of rows: {len(ames):,}")
print(f"Number of columns: {len(ames.columns)}")
print(f"\nColumn types:")
print(ames.dtypes.value_counts())
print(f"\nTarget variable (SalePrice) summary:")
print(ames['SalePrice'].describe())

---

## 1. Encoding Categorical Variables

Most machine learning algorithms expect numeric input, so categorical variables must be converted to numbers before modeling.

### 1.1 Dummy/One-Hot Encoding

Creates separate binary columns for each category. Best for nominal variables (no inherent order).

In [None]:
# Look at the BldgType (building type) variable
print("Building types in the dataset:")
print(ames['BldgType'].value_counts())
print(f"\nNumber of unique building types: {ames['BldgType'].nunique()}")

In [None]:
# Create dummy variables for building type
bldg_dummies = pd.get_dummies(ames['BldgType'], prefix='BldgType')

print("Dummy encoded building types (first 5 rows):")
print(bldg_dummies.head())
print(f"\nNumber of columns created: {len(bldg_dummies.columns)}")

In [None]:
# Avoid the dummy variable trap by dropping the first category
bldg_dummies_safe = pd.get_dummies(ames['BldgType'], 
                                   prefix='BldgType',
                                   drop_first=True)

print(f"Original columns: {len(bldg_dummies.columns)}")
print(f"After dropping first: {len(bldg_dummies_safe.columns)}")
print(f"\nColumns kept: {list(bldg_dummies_safe.columns)}")

💡 **Key Insight:** For linear regression, use `drop_first=True` to avoid multicollinearity (the dummy variable trap).
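
To make the trap concrete, here is a minimal sketch on a made-up three-category column (the toy values are assumptions, not rows from the Ames data). It appends an intercept column, as linear regression does implicitly, and compares the matrix rank with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for BldgType
toy = pd.DataFrame({'BldgType': ['1Fam', 'Duplex', 'Twnhs', '1Fam', 'Duplex', 'Twnhs']})

full = pd.get_dummies(toy['BldgType'], dtype=float)
reduced = pd.get_dummies(toy['BldgType'], drop_first=True, dtype=float)

# Add the intercept column that linear regression fits implicitly
intercept = np.ones((len(toy), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Every row of the full dummy set sums to 1, which duplicates the intercept
print("Row sums of full dummies:", full.sum(axis=1).tolist())
print("Columns / rank with all dummies:", X_full.shape[1], np.linalg.matrix_rank(X_full))
print("Columns / rank after drop_first:", X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

When the rank is lower than the number of columns, one column is a linear combination of the others, so the coefficients are not uniquely determined.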

### 1.2 Label Encoding

Assigns a unique integer to each category, creating just one column. More compact but can be misleading for linear models.

In [None]:
# Check how many neighborhoods we have
print(f"Number of unique neighborhoods: {ames['Neighborhood'].nunique()}")
print(f"\nSample neighborhoods:")
print(ames['Neighborhood'].value_counts().head())

In [None]:
# Apply label encoding
le = LabelEncoder()
ames_encoded = ames.copy()
ames_encoded['Neighborhood_Encoded'] = le.fit_transform(ames['Neighborhood'])

# Show the mapping for a few examples
print("Label encoding results (first 10 rows):")
print(ames_encoded[['Neighborhood', 'Neighborhood_Encoded']].head(10))

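⚠️ **Warning:** Avoid label encoding for nominal (unordered) features in linear models! The model will treat the integer codes as if they had magnitude and order. Note also that scikit-learn documents `LabelEncoder` as a tool for encoding target labels (`y`); it is used on a feature here only to illustrate the idea.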
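
To see where the codes come from, the sketch below fits a `LabelEncoder` on a few made-up neighborhood labels (chosen for illustration only) and prints the resulting mapping. The codes follow the sorted order of the labels, which has nothing to do with price or quality.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical neighborhood labels for illustration
names = ['Veenker', 'Blmngtn', 'NAmes', 'Blmngtn']

toy_le = LabelEncoder()
codes = toy_le.fit_transform(names)

# classes_ is stored in sorted order, so each code is just a sort position
mapping = {label: int(code) for code, label in enumerate(toy_le.classes_)}
print("Codes:", codes.tolist())
print("Mapping:", mapping)
```

A linear model given these codes would treat the label with code 2 as "twice" the label with code 1, which is an artifact of alphabetical order.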

### 1.3 Ordinal Encoding

For categorical variables with a natural order, create custom mappings that preserve the ordering.

In [None]:
# Look at the exterior quality variable
print("Exterior quality categories:")
print(ames['ExterQual'].value_counts().sort_index())

In [None]:
# Create custom ordinal mapping that preserves the quality order
# Po = Poor, Fa = Fair, TA = Typical/Average, Gd = Good, Ex = Excellent
quality_map = {
    'Po': 1,  # Poor
    'Fa': 2,  # Fair
    'TA': 3,  # Typical/Average
    'Gd': 4,  # Good
    'Ex': 5   # Excellent
}

# Apply the mapping
ames_encoded['ExterQual_Encoded'] = ames['ExterQual'].map(quality_map)

# Show the results
print("Ordinal encoding results (first 10 rows):")
print(ames_encoded[['ExterQual', 'ExterQual_Encoded']].head(10))

In [None]:
# Check whether mean prices rise with the encoded quality level
print("Mean sale price by exterior quality:")
quality_prices = ames_encoded.groupby('ExterQual_Encoded')['SalePrice'].mean().sort_index()
print(quality_prices)
if quality_prices.is_monotonic_increasing:
    print("\n✓ Higher quality ratings correspond to higher prices!")
else:
    print("\n⚠ Mean prices do not rise at every quality level; inspect the group sizes.")
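
As an alternative to a hand-written dictionary, scikit-learn's `OrdinalEncoder` accepts an explicit category order and can be dropped into a pipeline. The sketch below uses a small made-up column; note that the codes start at 0 rather than 1, so they differ from the `quality_map` values by a constant.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column using the same quality abbreviations
toy_quality = pd.DataFrame({'ExterQual': ['TA', 'Gd', 'Ex', 'Fa', 'TA']})

# Categories listed from worst to best; position in the list becomes the code
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
ordinal = OrdinalEncoder(categories=[quality_order])

toy_quality['ExterQual_Encoded'] = ordinal.fit_transform(toy_quality[['ExterQual']]).ravel()
print(toy_quality)
```

Listing all five levels keeps the codes stable even when a level such as `Po` does not appear in the data being fit.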

### 🎯 Try It Yourself: Encoding

Try encoding the `KitchenQual` variable (kitchen quality) using ordinal encoding. Create an appropriate quality mapping and verify it makes sense.

In [None]:
# TODO: Create a quality_map for KitchenQual
# TODO: Apply the mapping
# TODO: Verify by checking mean sale prices

# Your code here:


---

## 2. Scaling and Normalization

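Many ML algorithms, especially distance-based and regularized models, are sensitive to the scale of features: a feature measured in thousands can dominate one measured in single digits. Scaling puts all features on a comparable footing.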

### 2.1 StandardScaler (Z-score Normalization)

Transforms each feature to have mean=0 and standard deviation=1 by subtracting the feature's mean and dividing by its standard deviation.

In [None]:
# Create sample data with different scales
data = pd.DataFrame({
    'HouseSize': [1200, 1800, 950, 2400, 1600],
    'Bedrooms': [2, 3, 2, 4, 3]
})

print("Original data:")
print(data)
print(f"\nHouseSize range: {data['HouseSize'].min()} to {data['HouseSize'].max()}")
print(f"Bedrooms range: {data['Bedrooms'].min()} to {data['Bedrooms'].max()}")

In [None]:
# Apply StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

scaled_df = pd.DataFrame(
    scaled_data,
    columns=['HouseSize_Scaled', 'Bedrooms_Scaled']
)

print("After StandardScaler:")
print(scaled_df)
print(f"\nMean: {scaled_df.mean().values}")
# StandardScaler divides by the population std (ddof=0); pandas .std() defaults to ddof=1
print(f"Std: {scaled_df.std(ddof=0).values}")
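
The z-score calculation can also be reproduced by hand. The sketch below reuses the five house sizes from the sample data and checks the manual result against `StandardScaler`. It also shows why the choice of `ddof` matters when verifying the result: the scaler uses the population standard deviation, while pandas' `.std()` defaults to the sample version.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

sizes = np.array([[1200.0], [1800.0], [950.0], [2400.0], [1600.0]])

# Manual z-score: subtract the mean, divide by the population std (ddof=0)
manual = (sizes - sizes.mean(axis=0)) / sizes.std(axis=0, ddof=0)
sklearn_scaled = StandardScaler().fit_transform(sizes)

print("Matches StandardScaler:", np.allclose(manual, sklearn_scaled))
print("Population std (ddof=0):", sklearn_scaled.std(ddof=0))
print("Sample std (ddof=1):", sklearn_scaled.std(ddof=1))
```

With only five rows the two standard deviations differ noticeably; on a dataset the size of Ames the gap is negligible.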

### 2.2 MinMaxScaler

Transforms features to a fixed range, typically [0, 1].

In [None]:
# Apply MinMaxScaler
minmax = MinMaxScaler()
minmax_scaled = minmax.fit_transform(data)

minmax_df = pd.DataFrame(
    minmax_scaled,
    columns=['HouseSize_MinMax', 'Bedrooms_MinMax']
)

print("After MinMaxScaler:")
print(minmax_df)
print(f"\nMin: {minmax_df.min().values}")
print(f"Max: {minmax_df.max().values}")
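
Min-max scaling is equally easy to reproduce by hand: subtract the minimum and divide by the range. The sketch below does this for the bedroom counts from the sample data, then transforms a hypothetical new value (5 bedrooms, an assumption for illustration) that lies outside the range seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

bedrooms = np.array([[2.0], [3.0], [2.0], [4.0], [3.0]])

# Manual min-max: (x - min) / (max - min)
col_min = bedrooms.min(axis=0)
col_max = bedrooms.max(axis=0)
manual = (bedrooms - col_min) / (col_max - col_min)

toy_minmax = MinMaxScaler()
sklearn_scaled = toy_minmax.fit_transform(bedrooms)
print("Matches MinMaxScaler:", np.allclose(manual, sklearn_scaled))

# A value larger than the fitted maximum maps above 1 (no clipping by default)
new_value = toy_minmax.transform(np.array([[5.0]]))
print("Scaled value for 5 bedrooms:", new_value.ravel().tolist())
```

The [0, 1] range is therefore guaranteed only for the data the scaler was fit on, which is one reason outliers and new data deserve a look before choosing this scaler.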

### 2.3 Visualizing the Effect of Scaling

In [None]:
# Create visualization comparing original, StandardScaler, and MinMaxScaler
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original data
axes[0].scatter(data['HouseSize'], data['Bedrooms'], s=100, alpha=0.6)
axes[0].set_xlabel('HouseSize')
axes[0].set_ylabel('Bedrooms')
axes[0].set_title('Original Data\n(Different Scales)')
axes[0].grid(True, alpha=0.3)

# StandardScaler
axes[1].scatter(scaled_df['HouseSize_Scaled'], scaled_df['Bedrooms_Scaled'], 
               s=100, alpha=0.6, color='orange')
axes[1].set_xlabel('HouseSize (Standardized)')
axes[1].set_ylabel('Bedrooms (Standardized)')
axes[1].set_title('StandardScaler\n(Mean=0, Std=1)')
axes[1].grid(True, alpha=0.3)

# MinMaxScaler
axes[2].scatter(minmax_df['HouseSize_MinMax'], minmax_df['Bedrooms_MinMax'], 
               s=100, alpha=0.6, color='green')
axes[2].set_xlabel('HouseSize (Min-Max)')
axes[2].set_ylabel('Bedrooms (Min-Max)')
axes[2].set_title('MinMaxScaler\n(Range [0,1])')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🎯 Try It Yourself: Scaling

Apply StandardScaler to three Ames features: `GrLivArea`, `YearBuilt`, and `GarageArea`. Compare their ranges before and after scaling.

In [None]:
# TODO: Select the three features
# TODO: Apply StandardScaler
# TODO: Compare the ranges and verify mean≈0, std≈1

# Your code here:


---

## 3. Creating New Features

Some of the most powerful features are ones you create yourself by combining or transforming existing features.

### 3.1 Domain-Specific Features

Use domain knowledge to create meaningful features.

In [None]:
# Create domain-specific features for the Ames dataset
ames_features = ames.copy()
# Ages are measured relative to today, so these values shift by one each calendar year
current_year = pd.Timestamp.now().year

# Feature 1: House age (more intuitive than year built)
ames_features['Age'] = current_year - ames_features['YearBuilt']

# Feature 2: Was the house renovated?
ames_features['Was_Renovated'] = (ames_features['YearRemodAdd'] > ames_features['YearBuilt']).astype(int)

# Feature 3: Years since renovation
ames_features['Years_Since_Reno'] = current_year - ames_features['YearRemodAdd']

# Feature 4: Total bathrooms
ames_features['Total_Baths'] = (ames_features['FullBath'] + 
                                0.5 * ames_features['HalfBath'] + 
                                ames_features['BsmtFullBath'] + 
                                0.5 * ames_features['BsmtHalfBath'])

# Feature 5: Square feet per bathroom
# (the small 0.1 offset avoids division by zero when Total_Baths is 0)
ames_features['Sqft_Per_Bath'] = ames_features['GrLivArea'] / (ames_features['Total_Baths'] + 0.1)

print("New features created:")
print(ames_features[['Age', 'Was_Renovated', 'Years_Since_Reno', 'Total_Baths', 'Sqft_Per_Bath']].head())

In [None]:
# Check correlation of new features with SalePrice
new_features = ['Age', 'Was_Renovated', 'Years_Since_Reno', 'Total_Baths', 'Sqft_Per_Bath']
correlations = ames_features[new_features + ['SalePrice']].corr()['SalePrice'].drop('SalePrice').sort_values(ascending=False)

print("Correlation with SalePrice:")
print(correlations)

# Visualize
plt.figure(figsize=(10, 5))
correlations.plot(kind='barh', color='skyblue')
plt.xlabel('Correlation with SalePrice')
plt.title('New Feature Correlations')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 3.2 Polynomial Features

Capture non-linear relationships by creating powers of features.

In [None]:
# Example with simple data
simple_data = pd.DataFrame({
    'x1': [1, 2, 3],
    'x2': [4, 5, 6]
})

print("Original data:")
print(simple_data)

In [None]:
# Create polynomial features (degree=2)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(simple_data)

poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out())
print("After polynomial transformation (degree=2):")
print(poly_df)
print(f"\nOriginal features: {simple_data.shape[1]}")
print(f"Polynomial features: {poly_df.shape[1]}")
print(f"\nFeatures created: {list(poly_df.columns)}")

### 3.3 Visualizing Polynomial Relationships

Let's see how polynomial features help capture non-linear patterns.

In [None]:
# Create synthetic data with a non-linear relationship (diminishing returns)
np.random.seed(42)
X = np.linspace(500, 3500, 50).reshape(-1, 1)  # Square footage
# Price follows a curve with diminishing returns + some noise
y = 50000 + 100*X - 0.015*(X**2) + np.random.normal(0, 10000, X.shape)

# Fit three models: linear, 2nd degree polynomial, 3rd degree polynomial
models = {}
predictions = {}

# Linear model
models['Linear'] = LinearRegression()
models['Linear'].fit(X, y)
predictions['Linear'] = models['Linear'].predict(X)

# 2nd degree polynomial
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_poly2 = poly2.fit_transform(X)
models['2nd Degree'] = LinearRegression()
models['2nd Degree'].fit(X_poly2, y)
predictions['2nd Degree'] = models['2nd Degree'].predict(X_poly2)

# 3rd degree polynomial
poly3 = PolynomialFeatures(degree=3, include_bias=False)
X_poly3 = poly3.fit_transform(X)
models['3rd Degree'] = LinearRegression()
models['3rd Degree'].fit(X_poly3, y)
predictions['3rd Degree'] = models['3rd Degree'].predict(X_poly3)

# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(X, y, alpha=0.5, s=30, label='Actual Data', color='gray')
ax.plot(X, predictions['Linear'], label='Linear (no polynomial)',
        linewidth=2, linestyle='--', color='red')
ax.plot(X, predictions['2nd Degree'], label='2nd Degree Polynomial',
        linewidth=2, color='blue')
ax.plot(X, predictions['3rd Degree'], label='3rd Degree Polynomial',
        linewidth=2, linestyle=':', color='green')

ax.set_xlabel('Square Footage', fontsize=12)
ax.set_ylabel('House Price ($)', fontsize=12)
ax.set_title('How Polynomial Features Capture Non-Linear Relationships', fontsize=14)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Notice how the 2nd degree polynomial captures the curve,")
print("while the linear model misses the diminishing returns pattern.")

### 3.4 Interaction Terms

Capture how features affect each other.

In [None]:
# Real estate example: size × quality interaction
houses = pd.DataFrame({
    'Size_Sqft': [1200, 1200, 2400, 2400],
    'Neighborhood_Quality': [1, 5, 1, 5]  # 1=poor, 5=excellent
})

# Create interaction
houses['Size_x_Quality'] = houses['Size_Sqft'] * houses['Neighborhood_Quality']

print("Interaction term example:")
print(houses)
print("\n💡 The interaction captures that an extra sq ft in a good neighborhood")
print("   is worth more than in a poor neighborhood.")

### 🎯 Try It Yourself: Feature Creation

Create a new feature that combines `GrLivArea` and `OverallQual` as an interaction term. Check if it correlates more strongly with `SalePrice` than the individual features.

In [None]:
# TODO: Create interaction feature: GrLivArea × OverallQual
# TODO: Calculate correlations with SalePrice
# TODO: Compare to individual feature correlations

# Your code here:


---

## 4. Handling Missing Data

Real-world data often has missing values. We need strategies to handle them.

### 4.1 Identifying Missing Values

In [None]:
# Check for missing values in Ames dataset
missing_counts = ames.isnull().sum()
missing_percent = (missing_counts / len(ames)) * 100

missing_df = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Percent': missing_percent
}).sort_values('Missing_Count', ascending=False)

# Show features with missing values
missing_features = missing_df[missing_df['Missing_Count'] > 0]
print(f"Features with missing values: {len(missing_features)}")
print("\nTop 10 features by missing count:")
print(missing_features.head(10))

In [None]:
# Visualize missing data patterns
if len(missing_features) > 0:
    plt.figure(figsize=(10, 6))
    top_missing = missing_features.head(15)
    plt.barh(range(len(top_missing)), top_missing['Percent'])
    plt.yticks(range(len(top_missing)), top_missing.index)
    plt.xlabel('Percentage Missing (%)')
    plt.title('Features with Most Missing Values')
    plt.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()

### 4.2 Imputation Strategies

#### Mean/Median Imputation (Numerical Features)

In [None]:
# Create data with missing values
data_missing = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, np.nan, 35],
    'Income': [50000, 60000, 55000, np.nan, 70000, 65000]
})

print("Original data with missing values:")
print(data_missing)

In [None]:
# Impute with median (more robust to outliers)
imputer = SimpleImputer(strategy='median')
imputed_data = imputer.fit_transform(data_missing)

imputed_df = pd.DataFrame(imputed_data, columns=['Age', 'Income'])
print("After median imputation:")
print(imputed_df)
print(f"\nMedian Age used for imputation: {imputer.statistics_[0]}")
print(f"Median Income used for imputation: {imputer.statistics_[1]}")

#### Mode Imputation (Categorical Features)

In [None]:
# Categorical data with missing values
cat_data = pd.DataFrame({
    'Color': ['Red', 'Blue', np.nan, 'Red', 'Blue', np.nan, 'Red']
})

print("Original categorical data:")
print(cat_data)
print(f"\nMissing values: {cat_data['Color'].isnull().sum()}")

In [None]:
# Impute with most frequent value (mode)
imputer = SimpleImputer(strategy='most_frequent')
cat_data['Color_Imputed'] = imputer.fit_transform(cat_data[['Color']]).ravel()

print("After mode imputation:")
print(cat_data)
print(f"\nMost frequent value used: {imputer.statistics_[0]}")

#### Constant Imputation

In [None]:
# Impute with a constant value
imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
cat_data['Color_Constant'] = imputer.fit_transform(cat_data[['Color']]).ravel()

print("After constant imputation with 'Unknown':")
print(cat_data[['Color', 'Color_Constant']])
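
Deletion is the other option alongside imputation. The sketch below is one possible approach on made-up data: drop columns that are mostly empty, then drop the remaining incomplete rows. The 50% cutoff and the `Mostly_Missing` column are assumptions for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Mostly_Missing is 5/6 empty
toy = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, np.nan, 35],
    'Income': [50000, 60000, 55000, np.nan, 70000, 65000],
    'Mostly_Missing': [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan]
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Keep columns where at most half of the values are missing (assumed cutoff)
kept_cols = toy.loc[:, missing_share <= 0.5]

# Drop any rows that still contain a missing value
complete_rows = kept_cols.dropna()

print("Columns kept:", list(kept_cols.columns))
print(f"Rows before: {len(toy)}, rows after: {len(complete_rows)}")
```

Here row deletion discards half of the rows, which illustrates why imputation is often preferred when missing values are spread across many rows.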

### 4.3 Missingness Indicators

Sometimes the fact that a value is missing is itself informative.

In [None]:
# Create fresh data for this example
indicator_data = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, np.nan, 35],
    'Income': [50000, 60000, 55000, np.nan, 70000, 65000]
})

# Create indicator for missingness BEFORE imputing
indicator_data['Age_Was_Missing'] = indicator_data['Age'].isna().astype(int)
indicator_data['Income_Was_Missing'] = indicator_data['Income'].isna().astype(int)

# Then impute the original columns
indicator_data['Age'] = indicator_data['Age'].fillna(indicator_data['Age'].median())
indicator_data['Income'] = indicator_data['Income'].fillna(indicator_data['Income'].median())

print("Data with missingness indicators:")
print(indicator_data)
print("\n💡 Now the model has both the imputed value AND information about whether it was missing!")
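
scikit-learn can produce the same indicator columns in one step through the `add_indicator` option of `SimpleImputer`, which is convenient inside a pipeline. The sketch below repeats the toy data; the indicator column names are assigned by hand here and are only a labeling choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, np.nan, 35],
    'Income': [50000, 60000, 55000, np.nan, 70000, 65000]
})

# add_indicator=True appends one 0/1 column per feature that had missing values
imp = SimpleImputer(strategy='median', add_indicator=True)
out = imp.fit_transform(toy)

# Column labels chosen by hand to match the manual example
result = pd.DataFrame(
    out,
    columns=['Age', 'Income', 'Age_Was_Missing', 'Income_Was_Missing']
)
print(result)
```

Because the imputer learns both the medians and the set of indicator columns during `fit`, the same transformation can later be applied to test data without recomputing anything.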

### 🎯 Try It Yourself: Missing Data

For the Ames dataset, pick a feature with missing values (like `LotFrontage`). Create a missingness indicator, impute with median, and check if houses with missing values have different sale prices.

In [None]:
# TODO: Pick a feature with missing values
# TODO: Create missingness indicator
# TODO: Impute with median
# TODO: Compare mean SalePrice for missing vs non-missing

# Your code here:


---

## 5. Building Scikit-Learn Pipelines

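Pipelines chain transformations and a model into a single object. Because every step is fit only on the data passed to `fit`, statistics such as means, medians, and category lists come from the training set alone, which prevents data leakage.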
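
A minimal sketch of the leakage point, using synthetic data (the numbers and variable names are assumptions for illustration): after fitting the pipeline on the training split, the scaler's stored mean equals the training mean, not the mean of the full dataset. During cross-validation the whole pipeline is re-fit inside each fold, so the held-out fold never influences the scaling.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=1500, scale=400, size=(200, 2))
y_demo = 100 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 5000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler's statistics come from the training rows only
fitted_scaler = demo_pipeline.named_steps['scaler']
print("Scaler mean equals training mean:", np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print("Scaler mean equals full-data mean:", np.allclose(fitted_scaler.mean_, X_demo.mean(axis=0)))

# Cross-validation re-fits scaler and model within each training fold
scores = cross_val_score(demo_pipeline, X_tr, y_tr, cv=5)
print("Cross-validated R² scores:", np.round(scores, 3))
```

Scaling the full dataset first and splitting afterwards would let test-set values shape the training features, which is the leakage the pipeline avoids.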

### 5.1 Simple Pipeline Example

In [None]:
# Create a simple pipeline: scaling → model
simple_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Prepare simple data
X_simple = pd.DataFrame({
    'GrLivArea': [1200, 1500, 1800, 2100],
    'YearBuilt': [1990, 2000, 1985, 2015]
})
y_simple = pd.Series([200000, 250000, 240000, 350000])

# Fit the entire pipeline
simple_pipeline.fit(X_simple, y_simple)

# Make predictions (scaling happens automatically!)
# Use a new name so the `predictions` dict from Section 3.3 is not overwritten
simple_predictions = simple_pipeline.predict(X_simple)
print("Predictions:", simple_predictions)
print("\n✓ Pipeline automatically scales data before prediction!")

### 5.2 End-to-End Pipeline with Mixed Features

Real datasets have both numerical and categorical features requiring different preprocessing.

In [None]:
# Define feature types for Ames
numeric_features = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF']
categorical_features = ['Neighborhood', 'BldgType']

# Create preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    # sparse_output requires scikit-learn >= 1.2 (older releases use sparse=False)
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

print("✓ Preprocessor created with separate pipelines for numeric and categorical features")

In [None]:
# Create full pipeline with model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Ridge(alpha=1.0))
])

# Prepare data
X = ames[numeric_features + categorical_features]
y = ames['SalePrice']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]:,}")
print(f"Test set size: {X_test.shape[0]:,}")

In [None]:
# Fit the entire pipeline
print("Fitting pipeline...")
full_pipeline.fit(X_train, y_train)

# Evaluate
train_score = full_pipeline.score(X_train, y_train)
test_score = full_pipeline.score(X_test, y_test)

# Get predictions for additional metrics
y_train_pred = full_pipeline.predict(X_train)
y_test_pred = full_pipeline.predict(X_test)

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("\n" + "="*50)
print("PIPELINE RESULTS")
print("="*50)
print(f"Training R² Score: {train_score:.3f}")
print(f"Test R² Score: {test_score:.3f}")
print(f"\nTraining RMSE: ${train_rmse:,.0f}")
print(f"Test RMSE: ${test_rmse:,.0f}")
print("="*50)
print("\n✓ Pipeline handles all preprocessing automatically!")

### 5.3 Visualizing Pipeline Predictions

In [None]:
# Plot actual vs predicted
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Training set
ax1.scatter(y_train, y_train_pred, alpha=0.5, s=20)
ax1.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 
         'r--', lw=2, label='Perfect Prediction')
ax1.set_xlabel('Actual Price ($)', fontsize=12)
ax1.set_ylabel('Predicted Price ($)', fontsize=12)
ax1.set_title(f'Training Set\nR² = {train_score:.3f}, RMSE = ${train_rmse:,.0f}', fontsize=12)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Test set
ax2.scatter(y_test, y_test_pred, alpha=0.5, s=20, color='orange')
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
         'r--', lw=2, label='Perfect Prediction')
ax2.set_xlabel('Actual Price ($)', fontsize=12)
ax2.set_ylabel('Predicted Price ($)', fontsize=12)
ax2.set_title(f'Test Set\nR² = {test_score:.3f}, RMSE = ${test_rmse:,.0f}', fontsize=12)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🎯 Try It Yourself: Build Your Own Pipeline

Extend the pipeline above by:
1. Adding more numerical features (like `OverallQual`, `GarageCars`)
2. Adding more categorical features (like `MSZoning`)
3. Trying different models (LinearRegression, Ridge with different alpha values)
4. Comparing the performance of each version

In [None]:
# TODO: Define new feature lists
# TODO: Create preprocessor with ColumnTransformer
# TODO: Build pipeline with your choice of model
# TODO: Fit and evaluate
# TODO: Compare to the baseline pipeline above

# Your code here:


---

## 6. Summary and Key Takeaways

### What We Learned

In this notebook, we explored:

1. **Encoding Categorical Variables**
   - Dummy/one-hot encoding for nominal variables
   - Label encoding for high-cardinality features (use with tree-based models)
   - Ordinal encoding for variables with natural order

2. **Scaling and Normalization**
   - StandardScaler (mean=0, std=1) for most use cases
   - MinMaxScaler (range [0,1]) for bounded ranges
   - When scaling matters (distance-based algorithms) vs doesn't (tree-based)

3. **Creating New Features**
   - Domain-specific features using expert knowledge
   - Polynomial features for capturing non-linearity
   - Interaction terms for feature dependencies

4. **Handling Missing Data**
   - Imputation strategies (mean/median, mode, constant)
   - Missingness indicators to preserve signal
   - When to drop vs impute

5. **Scikit-Learn Pipelines**
   - Chain transformations to prevent data leakage
   - ColumnTransformer for mixed feature types
   - End-to-end reproducible workflows

### Critical Principles

✅ **Feature engineering often matters more than model selection**  
✅ **Always use pipelines to prevent data leakage**  
✅ **Fit transformers on training data only**  
✅ **Domain knowledge creates the most powerful features**  
✅ **Test different approaches and let results guide decisions**  

### Next Steps

- Practice with different datasets
- Experiment with feature combinations
- Explore advanced techniques (target encoding, feature selection, dimensionality reduction)
- Build end-to-end ML projects with proper pipelines

---

## 📚 Additional Resources

**Books:**
- [Feature Engineering for Machine Learning](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/) by Alice Zheng and Amanda Casari
- [Feature Engineering and Selection](http://www.feat.engineering/) by Max Kuhn and Kjell Johnson

**Online Resources:**
- [Kaggle's Feature Engineering Course](https://www.kaggle.com/learn/feature-engineering)
- [Scikit-learn User Guide: Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Scikit-learn User Guide: Pipelines](https://scikit-learn.org/stable/modules/compose.html)

**Practice:**
- [Kaggle Competitions](https://www.kaggle.com/competitions) - Real-world datasets
- [Kaggle Notebooks](https://www.kaggle.com/code) - Learn from others' feature engineering