# Ames Housing Price Prediction with Feature Engineering

This notebook applies feature engineering techniques to the Ames Housing dataset to improve house price prediction using a Random Forest regression model.

We begin with a baseline model trained on raw features, then engineer intuition-driven features based on real estate domain knowledge. By comparing model performance before and after feature engineering, we evaluate how thoughtful feature design improves predictive accuracy.

The goal is not to tune algorithms, but to demonstrate how domain understanding translates into better features and stronger models.


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

import matplotlib.pyplot as plt
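

The cells below rely on a DataFrame `df` and a target name `TARGET` that are never defined in this notebook. A minimal sketch of a loading step follows. It assumes the target column is named `SalePrice`, as in the Kaggle-style Ames export whose column names are used here. The helper `load_ames` and the path `ames.csv` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Assumption: the target column is named "SalePrice", as in the
# Kaggle-style Ames export whose column names this notebook uses.
TARGET = "SalePrice"


def load_ames(path):
    """Read the Ames CSV at `path` and drop rows without a sale price."""
    data = pd.read_csv(path)
    if TARGET not in data.columns:
        raise KeyError(f"expected a '{TARGET}' column in {path}")
    return data.dropna(subset=[TARGET]).reset_index(drop=True)


# Hypothetical usage; replace the path with the actual file location:
# df = load_ames("ames.csv")
```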



# Baseline Model: Raw Features Only

We begin by training a baseline Random Forest model using a small set of key raw features commonly known to influence home prices.

These features capture:
- Overall quality
- Living area
- Garage size
- Basement size
- Construction year (a proxy for home age)
- Neighborhood effects

This baseline provides a reference point for measuring the impact of feature engineering.


In [None]:
baseline_features = [
    "OverallQual",
    "GrLivArea",
    "GarageCars",
    "TotalBsmtSF",
    "YearBuilt",
    "Neighborhood"
]

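# Assumes `df` (the Ames data) and `TARGET` (the name of the sale-price
# column) were defined in an earlier cell.
# .copy() avoids pandas' SettingWithCopyWarning when filling values below.
X_base = df[baseline_features].copy()
y = df[TARGET]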

# Handle missing values
for col in X_base.select_dtypes(include="number"):
    X_base[col] = X_base[col].fillna(X_base[col].median())

for col in X_base.select_dtypes(include="object"):
    X_base[col] = X_base[col].fillna("None")

# One-hot encode categoricals
X_base = pd.get_dummies(X_base, drop_first=True)
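

The cell above computes fill values over all rows, including those that later land in the test split. One common alternative is to split first and learn the fill values from the training rows only. The sketch below illustrates that idea on a made-up toy frame; the data, the positional split, and the variable names are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_base; values are made up for illustration.
toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, np.nan, 2000.0, 2500.0, np.nan],
    "Neighborhood": ["NAmes", None, "OldTown", "NAmes", "OldTown", "NAmes"],
})

# Split first (a fixed positional split here, for brevity).
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn numeric fill values from the training rows only.
num_cols = train.select_dtypes(include="number").columns
medians = train[num_cols].median()

# Apply the same training-derived values to both splits.
train[num_cols] = train[num_cols].fillna(medians)
test[num_cols] = test[num_cols].fillna(medians)
train["Neighborhood"] = train["Neighborhood"].fillna("None")
test["Neighborhood"] = test["Neighborhood"].fillna("None")

# One-hot encode, then align test columns to the training layout.
train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True).reindex(
    columns=train_enc.columns, fill_value=0
)
```

The `reindex` step matters because a category that appears in only one split would otherwise produce mismatched columns between training and test data.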


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_base, y, test_size=0.2, random_state=42
)

baseline_model = RandomForestRegressor(
    n_estimators=100, random_state=42
)

baseline_model.fit(X_train, y_train)

baseline_preds = baseline_model.predict(X_test)

baseline_r2 = r2_score(y_test, baseline_preds)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_preds))

baseline_r2, baseline_rmse


# Baseline Model Performance

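- R² measures the share of variance in house prices that the model explains on the test set.
- RMSE is the square root of the mean squared prediction error, in dollars. It can be read as a typical error size, with large misses weighted more heavily than small ones.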

These metrics will be compared directly against the engineered model to quantify improvement.
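
To make the two metrics concrete, here is a small worked example that computes them by hand with NumPy. The prices and predictions are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up prices and predictions, in dollars, for illustration only.
y_true = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])
y_pred = np.array([110_000.0, 190_000.0, 320_000.0, 380_000.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Mean absolute error, shown for contrast with RMSE.
mae = np.mean(np.abs(residuals))

print(f"RMSE = {rmse:,.0f}, MAE = {mae:,.0f}, R^2 = {r2:.3f}")
# prints "RMSE = 15,811, MAE = 15,000, R^2 = 0.980"
```

RMSE comes out above the mean absolute error here because squaring gives the two $20,000 misses more weight than the two $10,000 misses.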


# Feature Engineering

We now create intuition-driven features based on real estate knowledge. Each feature is designed to capture relationships or value signals that raw features do not express directly.


In [None]:
df_fe = df.copy()

# 1. Aggregation feature: total bathrooms (half baths count as 0.5)
df_fe["total_bathrooms"] = (
    df_fe["FullBath"] +
    0.5 * df_fe["HalfBath"] +
    df_fe["BsmtFullBath"] +
    0.5 * df_fe["BsmtHalfBath"]
)

# 2. Derived Metric: Home age
df_fe["home_age"] = df_fe["YrSold"] - df_fe["YearBuilt"]

# 3. Binary indicator: has fireplace
df_fe["has_fireplace"] = (df_fe["Fireplaces"] > 0).astype(int)

# 4. Ratio feature: quality per square foot (+1 guards against division by zero)
df_fe["quality_per_sqft"] = (
    df_fe["OverallQual"] / (df_fe["GrLivArea"] + 1)
)

# 5. Aggregation Feature: Total square footage
df_fe["total_sf"] = (
    df_fe["GrLivArea"] + df_fe["TotalBsmtSF"]
)


# Engineered Feature Explanations

**Total Bathrooms**  
Bathrooms are a major value driver. Combining above- and below-ground bathrooms into one count, with half baths weighted at 0.5, reflects functional living convenience more accurately than the four separate counts.

**Home Age**  
Buyers typically prefer newer homes or homes built to modern construction standards. Home age (sale year minus construction year) expresses depreciation more directly than the raw year built, although the two are closely related because the sales in this dataset span only a few years.

**Has Fireplace**  
Fireplaces are premium amenities that increase perceived comfort and resale value, especially in colder climates like Iowa.

**Quality per Square Foot**  
Larger homes are not always more valuable unless quality scales with size. This ratio divides the overall quality rating by above-ground living area, so it captures how well-built the living space is relative to its size.

**Total Square Footage**  
Buyers value total usable space. Combining basement and above-ground living area better represents the full size of the home.
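
As a quick illustration of the five features described above, the sketch below applies the same formulas to two made-up homes. The column names follow the notebook; the values are invented and do not come from the Ames data.

```python
import pandas as pd

# Two made-up homes; column names follow the notebook, values do not
# come from the Ames data.
toy = pd.DataFrame({
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
    "YrSold": [2008, 2010],
    "YearBuilt": [2003, 1950],
    "Fireplaces": [1, 0],
    "OverallQual": [7, 5],
    "GrLivArea": [1709, 999],
    "TotalBsmtSF": [856, 0],
})

toy["total_bathrooms"] = (
    toy["FullBath"] + 0.5 * toy["HalfBath"]
    + toy["BsmtFullBath"] + 0.5 * toy["BsmtHalfBath"]
)
toy["home_age"] = toy["YrSold"] - toy["YearBuilt"]
toy["has_fireplace"] = (toy["Fireplaces"] > 0).astype(int)
toy["quality_per_sqft"] = toy["OverallQual"] / (toy["GrLivArea"] + 1)
toy["total_sf"] = toy["GrLivArea"] + toy["TotalBsmtSF"]

print(toy[["total_bathrooms", "home_age", "has_fireplace",
           "quality_per_sqft", "total_sf"]])
```

The second home has the lower quality rating (5 versus 7) but the higher `quality_per_sqft` (5/1000 versus 7/1710), because it is much smaller. The ratio therefore rewards small size as well as high quality, which is worth keeping in mind when interpreting it.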


In [None]:
engineered_features = baseline_features + [
    "total_bathrooms",
    "home_age",
    "has_fireplace",
    "quality_per_sqft",
    "total_sf"
]

# .copy() avoids pandas' SettingWithCopyWarning when filling values below
X_eng = df_fe[engineered_features].copy()

# Handle missing values
for col in X_eng.select_dtypes(include="number"):
    X_eng[col] = X_eng[col].fillna(X_eng[col].median())

for col in X_eng.select_dtypes(include="object"):
    X_eng[col] = X_eng[col].fillna("None")

X_eng = pd.get_dummies(X_eng, drop_first=True)


In [None]:
X_train_e, X_test_e, y_train_e, y_test_e = train_test_split(
    X_eng, y, test_size=0.2, random_state=42
)

engineered_model = RandomForestRegressor(
    n_estimators=100, random_state=42
)

engineered_model.fit(X_train_e, y_train_e)

engineered_preds = engineered_model.predict(X_test_e)

eng_r2 = r2_score(y_test_e, engineered_preds)
eng_rmse = np.sqrt(mean_squared_error(y_test_e, engineered_preds))

eng_r2, eng_rmse


# Model Comparison

In [None]:
# Fill the comparison table with the metrics computed above
comparison_table = """
| Model | R² | RMSE |
|------|----|------|
| Baseline | {:.3f} | ${:,.0f} |
| Engineered | {:.3f} | ${:,.0f} |
""".format(baseline_r2, baseline_rmse, eng_r2, eng_rmse)

print(comparison_table)


Feature engineering improved explanatory power and reduced prediction error, demonstrating the value of domain-informed features.
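
To express the comparison as a single number, the change in a metric can be reported relative to the baseline. The helper below is a hypothetical sketch, and the example values are placeholders rather than results from this notebook.

```python
def relative_change(baseline, engineered):
    """Fractional change from baseline to engineered (negative = decrease)."""
    return (engineered - baseline) / baseline


# Placeholder numbers for illustration; they are not notebook results.
example_baseline_rmse = 30_000.0
example_engineered_rmse = 27_000.0

change = relative_change(example_baseline_rmse, example_engineered_rmse)
print(f"RMSE change: {change:+.1%}")  # prints "RMSE change: -10.0%"
```

In the notebook itself the call would be `relative_change(baseline_rmse, eng_rmse)`. A difference measured on a single train/test split can be noisy, so a small change should be read with caution.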


In [None]:
importance_df = pd.DataFrame({
    "feature": X_eng.columns,
    "importance": engineered_model.feature_importances_
}).sort_values("importance", ascending=False)

importance_df.head(15)


In [None]:
plt.figure(figsize=(10,6))
plt.barh(
    importance_df.head(15)["feature"],
    importance_df.head(15)["importance"]
)
plt.gca().invert_yaxis()
plt.title("Top 15 Feature Importances")
plt.xlabel("Importance")
plt.show()
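

Because `Neighborhood` is one-hot encoded, its importance is spread across many dummy columns, which can make it look weaker than it is in a per-column ranking. A minimal sketch of summing importances back to the source feature follows. The importance values are made up, and the helper `source_feature` and list `encoded_sources` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up importances standing in for importance_df; not notebook results.
toy_importance = pd.DataFrame({
    "feature": ["OverallQual", "total_sf", "Neighborhood_NridgHt",
                "Neighborhood_OldTown", "Neighborhood_Edwards"],
    "importance": [0.50, 0.30, 0.10, 0.06, 0.04],
})

# Assumption: every dummy column is named "<source>_<level>", as
# pd.get_dummies produces, and only Neighborhood was encoded.
encoded_sources = ["Neighborhood"]


def source_feature(column):
    """Map a dummy column back to the feature it was derived from."""
    for source in encoded_sources:
        if column.startswith(source + "_"):
            return source
    return column


grouped = (
    toy_importance
    .assign(source=toy_importance["feature"].map(source_feature))
    .groupby("source")["importance"]
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```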


# Most Valuable Features

- **OverallQual** and **GrLivArea** remained the strongest predictors.
- **Total Square Footage** emerged as a top engineered feature, confirming buyer focus on usable space.
- **Quality per Square Foot** performed well by capturing efficiency, not just size.
- **Home Age** helped account for depreciation effects.

Some features, such as fireplaces, contributed less than expected, showing that not all amenities significantly impact price once core factors are accounted for.

This highlights an important lesson: feature engineering is iterative, and only a subset of features provides real predictive value.
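
The rankings above come from `feature_importances_`, which for a random forest are impurity-based and derived from the training data. Permutation importance on held-out data is a common cross-check for deciding which features carry real predictive value. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the data and feature names are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant.
rng = np.random.default_rng(42)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = 3.0 * signal + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

# Shuffle each column of the held-out set in turn and measure the score drop.
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42
)

for name, mean_drop in zip(["signal", "noise"], result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```

Applied to this notebook, the call would take `engineered_model`, `X_test_e`, and `y_test_e`. A feature whose mean score drop is close to zero is a candidate for removal in the next iteration.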
