# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook highlights what HOPUS can do, namely:
- clean the raw data,
- train a variety of models for the prediction of real estate prices, and
- evaluate the performance of these models.

## Technical preliminaries

In [1]:
# We clone the HOPUS repository to have access to all its data and routines
!git clone https://github.com/aremondtiedrez/hopus.git
%cd hopus

Cloning into 'hopus'...
remote: Enumerating objects: 313, done.
remote: Counting objects: 100% (109/109), done.
remote: Compressing objects: 100% (91/91), done.
remote: Total 313 (delta 64), reused 44 (delta 17), pack-reused 204 (from 1)
Receiving objects: 100% (313/313), 769.88 KiB | 3.61 MiB/s, done.
Resolving deltas: 100% (173/173), done.
/content/hopus


In [2]:
# Import requisite modules from HOPUS
import evaluation
import models
import preprocessing

## Data cleaning

In [3]:
hpi = preprocessing.home_price_index.load()
preprocessing.home_price_index.preprocess(hpi)

listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

In [4]:
price_rmse = evaluation.hpi_rmse(listings_data, target="price")
log_price_rmse = evaluation.hpi_rmse(listings_data, target="logPrice")
print(
    "When using the available home price index\n"
    "instead of the true home price index,\n"
    f"the price RMSE is ${price_rmse/1_000:.0f}k and \n"
    f"the log-price RMSE is {log_price_rmse:.3f}."
)

When using the available home price index
instead of the true home price index,
the price RMSE is $10k and 
the log-price RMSE is 0.021.
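
For reference, the two error measures reported above can be illustrated on a toy example. The sketch below computes a price RMSE and a log-price RMSE on made-up numbers; the helper `rmse` and every value in it are assumptions for illustration and do not reproduce the internals of `evaluation.hpi_rmse`.

```python
import numpy as np


def rmse(predicted, actual):
    """Root-mean-square error between two equally sized arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Made-up prices (in dollars), for illustration only
demo_actual = np.array([300_000.0, 450_000.0, 520_000.0])
demo_predicted = np.array([310_000.0, 440_000.0, 520_000.0])

# RMSE on the price scale (dollars) and on the log-price scale (dimensionless)
demo_price_rmse = rmse(demo_predicted, demo_actual)
demo_log_price_rmse = rmse(np.log(demo_predicted), np.log(demo_actual))

print(f"price RMSE: ${demo_price_rmse / 1_000:.1f}k")
print(f"log-price RMSE: {demo_log_price_rmse:.3f}")
```

Because a small difference of logarithms approximates a relative difference, a log-price RMSE of about 0.02 can be read as a typical relative error of roughly 2%.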


## Baseline model
The baseline model uses the average (time-normalized) price per square foot in each ZIP code.
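
The idea behind such a baseline can be sketched with plain pandas. This is a minimal illustration and not the code of `models.Baseline`: the column names `zipCode`, `squareFootage`, and `price` are assumptions, the data is made up, and the time normalization is left out.

```python
import pandas as pd

# Made-up listings; the column names are assumptions, not HOPUS's actual schema
toy_listings = pd.DataFrame(
    {
        "zipCode": ["10001", "10001", "10002", "10002"],
        "squareFootage": [1_000, 2_000, 1_500, 500],
        "price": [500_000, 900_000, 300_000, 100_000],
    }
)

# "Fit": average price per square foot within each ZIP code
toy_listings["pricePerSqFt"] = toy_listings["price"] / toy_listings["squareFootage"]
avg_price_per_sqft = toy_listings.groupby("zipCode")["pricePerSqFt"].mean()

# "Predict": the ZIP-code average multiplied by the listing's square footage
toy_listings["predictedPrice"] = (
    toy_listings["zipCode"].map(avg_price_per_sqft) * toy_listings["squareFootage"]
)

print(toy_listings[["zipCode", "price", "predictedPrice"]])
```

A baseline of this kind has very few degrees of freedom, which may explain why its training and test errors tend to stay close to each other.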

In [5]:
import numpy as np
import secrets

In [6]:
model = models.Baseline()
model.fit(listings_data, None)
train_mse = model.evaluate(listings_data, listings_data["price"])
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $157.830k


In [7]:
seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.Baseline, listings_data, listings_data["price"], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $158k
Cross-validation test error:     $159k


## Technical preliminary (before linear regression or XGBoost)
We group the columns into key features, auxiliary features, and target
(as well as into information columns and unused columns).
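
The later cells select columns with expressions such as `listings_data["keyPredictionFeatures"]` and `listings_data[("target", "price")]`, which suggests two-level column labels. The sketch below shows how such a grouping could be built with a pandas `MultiIndex`; the toy column names and their assignment to groups are assumptions, and the actual behavior of `group_columns` may differ.

```python
import pandas as pd

# Made-up flat table; the column names are illustrative assumptions
flat = pd.DataFrame(
    {
        "livingArea": [1_000, 2_000],
        "bedrooms": [2, 4],
        "yearBuilt": [1990, 2005],
        "price": [500_000, 900_000],
    }
)

# Hypothetical assignment of each column to a group
column_groups = {
    "livingArea": "keyPredictionFeatures",
    "bedrooms": "keyPredictionFeatures",
    "yearBuilt": "auxiliaryPredictionFeatures",
    "price": "target",
}

# Replace the flat labels by (group, name) pairs
grouped = flat.copy()
grouped.columns = pd.MultiIndex.from_tuples(
    [(column_groups[name], name) for name in flat.columns]
)

# Selecting a group returns all of its columns at once
key_features = grouped["keyPredictionFeatures"]
all_features = grouped[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
demo_price = grouped[("target", "price")]
```

With this layout, switching between feature sets only requires changing the list of group labels, not enumerating individual columns.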

In [8]:
preprocessing.property_listings.group_columns(listings_data)

## Linear regression: training and evaluation

In [None]:
import numpy as np
import secrets
from sklearn.model_selection import train_test_split

In [None]:
seed = secrets.randbits(32)

# Train-test split
# Splitting features and target in a single call keeps their rows aligned
train_features, test_features, train_target, test_target = train_test_split(
    listings_data["keyPredictionFeatures"],
    listings_data[("target", "price")],
    train_size=0.8,
    shuffle=True,
    random_state=seed,
)

# Train model
model = models.LinearRegression()
model.fit(train_features, train_target)

# Evaluate model
train_rmse = np.sqrt(model.evaluate(train_features, train_target))
test_rmse = np.sqrt(model.evaluate(test_features, test_target))

# Report evaluations
print(f"Seed: {seed}")
print(f"Training error: ${train_rmse / 1_000:.3f}k")
print(f"Test error:     ${test_rmse / 1_000:.3f}k")

Seed: 1480318944
Training error: $161.367k
Test error:     $137.479k


## Linear regression: evaluation with cross-validation

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $156k
Cross-validation test error:     $159k
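
To make the evaluation above more concrete, here is a minimal sketch of cross-validation written with NumPy only. It assumes a k-fold scheme with an ordinary least-squares model; the function `kfold_cv_mse` and the synthetic data are hypothetical, and `evaluation.cv_evaluation` may split the data differently. As in the notebook, the MSE is averaged over the splits first and the square root is taken afterwards.

```python
import numpy as np


def kfold_cv_mse(features, target, n_splits, seed):
    """Average train and test MSE of least squares over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(target)), n_splits)
    design = np.column_stack([np.ones(len(target)), features])  # add an intercept
    train_mse, test_mse = [], []
    for test_idx in folds:
        train_mask = np.ones(len(target), dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training rows only
        coef, *_ = np.linalg.lstsq(design[train_mask], target[train_mask], rcond=None)
        residuals = design @ coef - target
        train_mse.append(np.mean(residuals[train_mask] ** 2))
        test_mse.append(np.mean(residuals[test_idx] ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))


# Synthetic data: price = 200 * area + noise with standard deviation 10,000
data_rng = np.random.default_rng(0)
demo_area = data_rng.uniform(500, 3_000, size=200)
demo_prices = 200 * demo_area + data_rng.normal(0, 10_000, size=200)

demo_train_mse, demo_test_mse = kfold_cv_mse(
    demo_area.reshape(-1, 1), demo_prices, n_splits=10, seed=1
)
print(f"Cross-validation training error: ${np.sqrt(demo_train_mse) / 1_000:.1f}k")
print(f"Cross-validation test error:     ${np.sqrt(demo_test_mse) / 1_000:.1f}k")
```

Fixing the seed makes the folds reproducible, which is what allows different data subsets or models to be compared on identical splits.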


## Linear regression: cross-validation for various data subsets

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)

In [None]:
# --------------------------------------------------
# PART A: USING ONLY THE KEY PREDICTION FEATURES
# --------------------------------------------------

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers and without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $248k
Cross-validation test error:     $252k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $163k
Cross-validation test error:     $167k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $240k
Cross-validation test error:     $244k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $156k
Cross-validation test error:     $160k


In [None]:
# ------------------------------------------------------------
# PART B: USING THE KEY AND THE AUXILIARY PREDICTION FEATURES
# ------------------------------------------------------------
features_label = ["keyPredictionFeatures", "auxiliaryPredictionFeatures"]

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers and without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $230k
Cross-validation test error:     $248k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $144k
Cross-validation test error:     $161k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $227k
Cross-validation test error:     $240k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $138k
Cross-validation test error:     $151k


## XGBoost: training and cross-validation

In [9]:
import numpy as np
import secrets

In [10]:
features = listings_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
target = listings_data[("target", "price")]

hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)
train_rmse = np.sqrt(model.evaluate(features, target))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $27.750k


In [11]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $26k
Cross-validation test error:     $127k


## XGBoost: hierarchical hyperparameter search

In [14]:
import numpy as np
import pandas as pd
import secrets

In [64]:
# STEP 1
# We use sensible default choices and aim to find a good number of estimators to use
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

# Experiment parameters
experiment_parameters = {
    "seed": None,
    "n_splits": 50,
}

experiment_records = []
n_experiments = 4
# START - HYPERPARAMETER-DEPENDENT SECTION
for n_estimators in (10, 30, 100, 300, 1000):
    hyperparameters["n_estimators"] = n_estimators
# END - HYPERPARAMETER-DEPENDENT SECTION
    for _ in range(n_experiments):
        experiment_parameters["seed"] = secrets.randbits(32)
        train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, **experiment_parameters, hyperparameters=hyperparameters)
        experiment_result = {
            "train_cv_mse": train_cv_mse,
            "test_cv_mse": test_cv_mse,
        }
        record = {
            **experiment_parameters,
            **hyperparameters,
            **experiment_result,
        }
        experiment_records.append(record)

experiment_records = pd.DataFrame(experiment_records)

In [65]:
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
# The result is renamed because, after the square root, the values are RMSEs rather than MSEs
np.sqrt(experiment_records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean()).rename("test_cv_rmse")

| max_depth | min_child_weight | gamma | subsample | colsample_bytree | scale_pos_weight | learning_rate | n_estimators | test_cv_rmse |
|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 10 | 165328.884302 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 30 | 133277.373184 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 100 | 125176.910132 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 300 | 125621.217459 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 1000 | 125821.204287 |
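
The aggregation step of this search can be sketched on a small, made-up table of experiment records. The numbers below are invented and only the `n_estimators` column varies; the sketch averages the test MSE over repeated experiments, converts it to an RMSE, and picks the value with the lowest error.

```python
import numpy as np
import pandas as pd

# Made-up experiment records: two repetitions per value of n_estimators
demo_records = pd.DataFrame(
    {
        "n_estimators": [10, 10, 100, 100, 1000, 1000],
        "test_cv_mse": [2.7e10, 2.8e10, 1.5e10, 1.6e10, 1.6e10, 1.6e10],
    }
)

# Average the MSE over repetitions first, then take the square root
demo_mean_mse = demo_records.groupby("n_estimators")["test_cv_mse"].mean()
demo_rmse = np.sqrt(demo_mean_mse).rename("test_cv_rmse")

# Value of n_estimators with the smallest cross-validated test RMSE
best_n_estimators = demo_rmse.idxmin()
print(demo_rmse)
print(f"Best n_estimators: {best_n_estimators}")
```

When several values give nearly identical errors, as for 100, 300, and 1000 estimators in the results above, choosing the smallest of them is a common way to keep training cheap.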
