Feature engineering is often more impactful than algorithm tuning when building machine learning models. In this assignment, we use the Ames Housing dataset to demonstrate how domain knowledge about real estate can be transformed into meaningful predictive features.

We begin by training a baseline Random Forest model using raw housing attributes. Then, we systematically engineer new features based on real estate intuition, retrain the model, and compare performance using R² and RMSE. Finally, we analyze feature importance to identify which features contributed the most predictive value.
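
As a reference for the two metrics used throughout, the following minimal sketch computes RMSE and R² by hand and cross-checks them against scikit-learn. The four prices and predictions are made up for illustration and are not taken from the Ames data.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical sale prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 280000.0])

# RMSE: square root of the mean squared residual, in the target's units (dollars)
residuals = y_true - y_pred
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Cross-check against scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is reported in dollars, so lower is better, while R² is unitless and higher is better.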

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error


In [2]:
# Load Ames Housing training data
df = pd.read_csv("data/train.csv")

df.head()



The goal of this model is to predict SalePrice, the final sale price of each home.

In [None]:
target = "SalePrice"

baseline_features = [
    "GrLivArea",
    "OverallQual",
    "OverallCond",
    "YearBuilt",
    "TotalBsmtSF",
    "GarageCars",
    "GarageArea",
    "FullBath",
    "TotRmsAbvGrd",
    "Fireplaces",
    "Neighborhood",
    "HouseStyle",
    "ExterQual",
    "KitchenQual"
]

X_base = df[baseline_features]
y = df[target]


In [None]:
# Fill categorical missing values
X_base = X_base.fillna({
    col: "None" for col in X_base.select_dtypes(include="object").columns
})

# Fill numeric missing values
X_base = X_base.fillna(X_base.median(numeric_only=True))

# One-hot encode categorical variables
X_base = pd.get_dummies(X_base, drop_first=True)
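
The cell above computes medians and dummy columns on the full dataset before the train/test split. An alternative, sketched here and not the approach this notebook uses, is to wrap the same steps in a scikit-learn `Pipeline` so that imputation and encoding are learned from the training rows only. The `toy` frame, its values, and the names `preprocess` and `model` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small synthetic stand-in for the housing frame (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "GrLivArea": rng.integers(800, 3000, size=n).astype(float),
    "OverallQual": rng.integers(1, 11, size=n).astype(float),
    "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], size=n),
})
toy.loc[::25, "GrLivArea"] = np.nan  # inject a few missing values
toy_y = 50 * toy["GrLivArea"].fillna(1500) + 10000 * toy["OverallQual"]

numeric_cols = ["GrLivArea", "OverallQual"]
categorical_cols = ["Neighborhood"]

# Median-impute numeric columns; fill and one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="None")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Split first, then fit: the medians and categories come from X_tr only
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42
)
model.fit(X_tr, y_tr)
pipeline_r2 = model.score(X_te, y_te)
print(pipeline_r2)
```

For a tree ensemble on this dataset the difference from pre-split imputation is probably small, but the pipeline form keeps test rows out of every fitted statistic.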


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_base, y, test_size=0.2, random_state=42
)

baseline_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

baseline_model.fit(X_train, y_train)

baseline_preds = baseline_model.predict(X_test)

baseline_r2 = r2_score(y_test, baseline_preds)
# RMSE via sqrt of MSE; the squared=False argument has been removed from recent scikit-learn releases
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_preds))

baseline_r2, baseline_rmse


Understanding which raw features drive predictions helps establish a benchmark before feature engineering.

In [None]:
baseline_importance = pd.DataFrame({
    "Feature": X_base.columns,
    "Importance": baseline_model.feature_importances_
}).sort_values(by="Importance", ascending=False)

baseline_importance.head(10)
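
Because the categorical columns are one-hot encoded, the importance of a raw feature such as Neighborhood is spread across several dummy columns. One way to read importances at the level of the raw features, sketched here with made-up importance values, is to sum the dummy columns back into their source column. The helper `source_feature` and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical importances for a one-hot encoded frame (illustrative numbers only)
encoded_importance = pd.Series({
    "GrLivArea": 0.40,
    "OverallQual": 0.35,
    "Neighborhood_NAmes": 0.10,
    "Neighborhood_OldTown": 0.05,
    "KitchenQual_Gd": 0.06,
    "KitchenQual_TA": 0.04,
})
categorical_sources = ["Neighborhood", "KitchenQual"]


def source_feature(column: str) -> str:
    """Map a dummy column such as 'Neighborhood_NAmes' back to 'Neighborhood'."""
    for source in categorical_sources:
        if column.startswith(source + "_"):
            return source
    return column  # numeric columns keep their own name


# Group by the mapped index label and sum the importances within each group
raw_importance = (
    encoded_importance.groupby(source_feature).sum().sort_values(ascending=False)
)
print(raw_importance)
```

With a real model, `encoded_importance` would be built from the encoded column names and `feature_importances_`, as in the cell above.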


Below are five engineered features created using real estate intuition:

In [None]:
df_fe = df.copy()

# Aggregation Feature
df_fe["TotalSF"] = df_fe["GrLivArea"] + df_fe["TotalBsmtSF"].fillna(0)

# Derived Metric
df_fe["HouseAge"] = df_fe["YrSold"] - df_fe["YearBuilt"]

# Weighted Aggregation Feature (each half bath counts as 0.5)
df_fe["TotalBathrooms"] = (
    df_fe["FullBath"] +
    0.5 * df_fe["HalfBath"] +
    df_fe["BsmtFullBath"].fillna(0) +
    0.5 * df_fe["BsmtHalfBath"].fillna(0)
)

# Categorical Boolean Feature
df_fe["HasFireplace"] = (df_fe["Fireplaces"] > 0).astype(int)

# Quality-Based Feature
df_fe["QualityPerSqFt"] = df_fe["OverallQual"] / df_fe["GrLivArea"]


TotalSF
This feature approximates the home's total interior space by combining above-ground living area with total basement square footage, which in the Ames data includes unfinished basement area. Buyers typically value overall space more than individual components, so it is expected to be a strong predictor of price.

HouseAge
HouseAge measures how old the home was at the time of sale. Newer homes generally command higher prices due to reduced maintenance and modern design.

TotalBathrooms
This feature combines full and half bathrooms, both above ground and in the basement, into a single weighted total in which each half bath counts as 0.5. Homes with more bathrooms tend to be more functional and appealing to buyers.

HasFireplace
This boolean feature indicates whether the home has a fireplace. Fireplaces are desirable amenities that can increase comfort and perceived value.

QualityPerSqFt
This quality-based metric normalizes overall quality by square footage. It helps distinguish high-quality homes from larger but lower-quality properties.
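
To make the five definitions concrete, the sketch below applies the same formulas to two hypothetical homes. The values are made up for illustration; the second home has missing basement fields to show how `fillna(0)` treats them.

```python
import pandas as pd

# Two hypothetical homes; values are made up for illustration
toy = pd.DataFrame({
    "GrLivArea": [1500, 2000],
    "TotalBsmtSF": [800.0, None],   # second home has no basement record
    "YrSold": [2010, 2008],
    "YearBuilt": [2000, 1958],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1.0, None],
    "BsmtHalfBath": [0.0, None],
    "Fireplaces": [1, 0],
    "OverallQual": [6, 5],
})

# Same formulas as in the feature engineering cell
toy["TotalSF"] = toy["GrLivArea"] + toy["TotalBsmtSF"].fillna(0)
toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]
toy["TotalBathrooms"] = (
    toy["FullBath"]
    + 0.5 * toy["HalfBath"]
    + toy["BsmtFullBath"].fillna(0)
    + 0.5 * toy["BsmtHalfBath"].fillna(0)
)
toy["HasFireplace"] = (toy["Fireplaces"] > 0).astype(int)
toy["QualityPerSqFt"] = toy["OverallQual"] / toy["GrLivArea"]

print(toy[["TotalSF", "HouseAge", "TotalBathrooms", "HasFireplace", "QualityPerSqFt"]])
```

The first home comes out at 2300 total square feet, 10 years old, with 3.5 bathrooms; the second, with no basement data, keeps its 2000 above-ground square feet and a single bathroom.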

In [None]:
engineered_features = baseline_features + [
    "TotalSF",
    "HouseAge",
    "TotalBathrooms",
    "HasFireplace",
    "QualityPerSqFt"
]

X_eng = df_fe[engineered_features]

# Handle missing values
X_eng = X_eng.fillna({
    col: "None" for col in X_eng.select_dtypes(include="object").columns
})
X_eng = X_eng.fillna(X_eng.median(numeric_only=True))

# Encode categoricals
X_eng = pd.get_dummies(X_eng, drop_first=True)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_eng, y, test_size=0.2, random_state=42
)

engineered_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)

engineered_model.fit(X_train, y_train)

eng_preds = engineered_model.predict(X_test)

eng_r2 = r2_score(y_test, eng_preds)
# RMSE via sqrt of MSE; the squared=False argument has been removed from recent scikit-learn releases
eng_rmse = np.sqrt(mean_squared_error(y_test, eng_preds))

eng_r2, eng_rmse

In [None]:
comparison = pd.DataFrame({
    "Model": ["Baseline", "Engineered"],
    "R²": [baseline_r2, eng_r2],
    "RMSE": [baseline_rmse, eng_rmse]
})

comparison["R² Improvement"] = comparison["R²"] - baseline_r2
comparison["RMSE Improvement"] = baseline_rmse - comparison["RMSE"]

comparison


In [None]:
importance = pd.DataFrame({
    "Feature": X_eng.columns,
    "Importance": engineered_model.feature_importances_
}).sort_values(by="Importance", ascending=False)

importance.head(15)


In [None]:
importance.head(15).plot(
    x="Feature",
    y="Importance",
    kind="barh",
    figsize=(8,6),
    title="Top 15 Feature Importances"
)
plt.gca().invert_yaxis()
plt.show()


TotalSF and OverallQual were the strongest predictors, confirming that size and quality dominate housing prices.

QualityPerSqFt performed especially well, likely because it captures quality relative to size rather than scale alone.

HouseAge ranked highly, reflecting buyer preference for newer homes.

HasFireplace contributed modestly, suggesting amenities matter but are secondary.

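This analysis supports the view that thoughtful feature engineering can improve model performance without any change to the algorithm: both models used the same Random Forest settings, so the difference in scores comes from the features alone.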
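
A single train/test split cannot show whether an improvement holds across different samples. One way to probe this, sketched here on synthetic data and not on the Ames results, is to compare k-fold cross-validated RMSE for the raw and engineered feature sets. The price formula, the column values, and the helper `cv_rmse` are assumptions made for this illustration; the synthetic price is built to depend on total area, so the outcome says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
n = 300
synthetic = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, size=n),
    "TotalBsmtSF": rng.uniform(0, 1500, size=n),
    "OverallQual": rng.integers(1, 11, size=n),
})
# Assumed price process: driven by total area and quality, plus noise
price = (
    60 * (synthetic["GrLivArea"] + synthetic["TotalBsmtSF"])
    + 12000 * synthetic["OverallQual"]
    + rng.normal(0, 10000, size=n)
)

X_raw = synthetic.copy()
X_fe = synthetic.assign(TotalSF=synthetic["GrLivArea"] + synthetic["TotalBsmtSF"])

# Same folds for both feature sets so the scores are directly comparable
cv = KFold(n_splits=5, shuffle=True, random_state=42)


def cv_rmse(X):
    """Return the per-fold RMSE of a fixed Random Forest on feature frame X."""
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    scores = cross_val_score(
        model, X, price, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores  # scikit-learn negates errors so that higher is better


raw_rmse = cv_rmse(X_raw)
fe_rmse = cv_rmse(X_fe)
print("raw:       ", raw_rmse.mean(), raw_rmse.std())
print("engineered:", fe_rmse.mean(), fe_rmse.std())
```

Comparing the mean and spread of the per-fold errors gives a rough sense of whether a gap between the two feature sets is larger than fold-to-fold variation.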