# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook highlights what HOPUS can do, namely:
- clean the raw data,
- train a variety of models for the prediction of real estate prices, and
- evaluate the performance of these models.

## Technical preliminaries

In [1]:
# We clone the HOPUS repository to have access to all its data and routines
!git clone https://github.com/aremondtiedrez/hopus.git
%cd hopus

Cloning into 'hopus'...
remote: Enumerating objects: 257, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 257 (delta 27), reused 25 (delta 9), pack-reused 204 (from 1)
Receiving objects: 100% (257/257), 755.91 KiB | 8.89 MiB/s, done.
Resolving deltas: 100% (136/136), done.
/content/hopus


In [2]:
# Import requisite modules from HOPUS
import evaluation
import models
import preprocessing

## Data cleaning

In [3]:
hpi = preprocessing.home_price_index.load()
preprocessing.home_price_index.preprocess(hpi)

listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

In [4]:
price_rmse = evaluation.hpi_rmse(listings_data, target="price")
log_price_rmse = evaluation.hpi_rmse(listings_data, target="logPrice")
print(
    "When using the available home price index\n"
    "instead of the true home price index,\n"
    f"the price RMSE is ${price_rmse/1_000:.0f}k and \n"
    f"the log-price RMSE is {log_price_rmse:.3f}."
)

When using the available home price index
instead of the true home price index,
the price RMSE is $10k and 
the log-price RMSE is 0.021.
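
As a point of reference for the two numbers above, the following minimal sketch shows how an RMSE on prices and on log-prices can be computed. It is not HOPUS's `evaluation.hpi_rmse`, whose implementation is not shown in this notebook; the helper `rmse` and the sample prices are hypothetical and chosen only for illustration.

```python
import numpy as np


def rmse(predicted, actual):
    """Root-mean-square error between two equally shaped arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Hypothetical prices, for illustration only (not HOPUS data)
actual_price = np.array([300_000.0, 450_000.0, 520_000.0])
predicted_price = np.array([310_000.0, 440_000.0, 530_000.0])

# Error measured in dollars
price_rmse = rmse(predicted_price, actual_price)

# Error measured on the logarithmic scale, i.e. in relative terms
log_price_rmse = rmse(np.log(predicted_price), np.log(actual_price))

print(f"price RMSE:     ${price_rmse / 1_000:.0f}k")
print(f"log-price RMSE: {log_price_rmse:.3f}")
```

Because a log-price error is a difference of logarithms, a small log-price RMSE can be read as a typical relative error: a value of 0.021 corresponds to roughly 2%.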


In [5]:
preprocessing.property_listings.group_columns(listings_data)

## Linear regression: training and evaluation

In [6]:
import numpy as np
import secrets
from sklearn.model_selection import train_test_split

In [7]:
seed = secrets.randbits(32)

# Train-test split (features and target are split together so their rows stay aligned)
train_features, test_features, train_target, test_target = train_test_split(
    listings_data["keyPredictionFeatures"],
    listings_data[("target", "price")],
    train_size=0.8,
    shuffle=True,
    random_state=seed,
)

# Train model
model = models.LinearRegressionModel()
model.fit(train_features, train_target)

# Evaluate model
train_rmse = np.sqrt(model.evaluate(train_features, train_target))
test_rmse = np.sqrt(model.evaluate(test_features, test_target))

# Report evaluations
print(f"Seed: {seed}")
print(f"Training error: ${train_rmse / 1_000:.3f}k")
print(f"Test error:     ${test_rmse / 1_000:.3f}k")

Seed: 2291501443
Training error: $156.786k
Test error:     $153.281k
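
The cell above takes the square root of `model.evaluate(...)`, which suggests that `evaluate` returns a mean squared error. The sketch below reproduces the same split, fit, and evaluate flow with scikit-learn's `LinearRegression` standing in for `models.LinearRegressionModel`. The synthetic data and the meaning attached to each feature are assumptions made for illustration; they are not the HOPUS data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two features and a noisy linear price
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.uniform(50, 300, size=200),  # e.g. living area
    rng.integers(1, 6, size=200),    # e.g. number of bedrooms
])
noise = rng.normal(0, 5_000, size=200)
target = 2_000 * features[:, 0] + 10_000 * features[:, 1] + noise

# A single call applies the same shuffle to features and target
train_X, test_X, train_y, test_y = train_test_split(
    features, target, train_size=0.8, shuffle=True, random_state=0
)

model = LinearRegression().fit(train_X, train_y)

# RMSE is the square root of the mean squared error
train_rmse = np.sqrt(mean_squared_error(train_y, model.predict(train_X)))
test_rmse = np.sqrt(mean_squared_error(test_y, model.predict(test_X)))

print(f"Training error: ${train_rmse / 1_000:.3f}k")
print(f"Test error:     ${test_rmse / 1_000:.3f}k")
```

A single split gives a noisy estimate: depending on the seed, the test error can land above or below the training error, as it does in the run reported above.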


## Linear regression: evaluation with cross-validation

In [8]:
import numpy as np
import secrets

In [9]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, "keyPredictionFeatures", ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $156k
Cross-validation test error:     $160k
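
`evaluation.cv_evaluation` is not shown in this notebook, so its exact resampling scheme is unknown here. One plausible reading of the call above is that `100` is the number of random train/test splits and that the returned values are mean squared errors averaged over those splits. Under that assumption, a minimal sketch with scikit-learn's `ShuffleSplit` could look as follows; the function name `repeated_split_mse` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit


def repeated_split_mse(features, target, n_splits, seed):
    """Average train and test MSE over repeated random 80/20 splits."""
    splitter = ShuffleSplit(n_splits=n_splits, train_size=0.8, random_state=seed)
    train_mse, test_mse = [], []
    for train_idx, test_idx in splitter.split(features):
        # Fit a fresh model on each training subset
        model = LinearRegression().fit(features[train_idx], target[train_idx])
        train_mse.append(
            mean_squared_error(target[train_idx], model.predict(features[train_idx]))
        )
        test_mse.append(
            mean_squared_error(target[test_idx], model.predict(features[test_idx]))
        )
    return float(np.mean(train_mse)), float(np.mean(test_mse))


# Synthetic stand-in data (hypothetical, not the HOPUS listings)
rng = np.random.default_rng(0)
features = rng.uniform(50, 300, size=(200, 2))
target = 2_000 * features[:, 0] + 500 * features[:, 1] + rng.normal(0, 5_000, size=200)

train_cv_mse, test_cv_mse = repeated_split_mse(features, target, n_splits=100, seed=0)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.1f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.1f}k")
```

Averaging over many splits smooths out the seed-to-seed variation of a single train/test split, which makes the result better suited to comparing models or preprocessing choices.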


## Linear regression: cross-validation for various data subsets

In [10]:
import numpy as np
import secrets

In [11]:
seed = secrets.randbits(32)

In [12]:
# --------------------------------------------------
# PART A: USING ONLY THE KEY PREDICTION FEATURES
# --------------------------------------------------
features_label = "keyPredictionFeatures"

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers or imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $248k
Cross-validation test error:     $252k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $163k
Cross-validation test error:     $167k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $240k
Cross-validation test error:     $245k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $156k
Cross-validation test error:     $159k


In [13]:
# ------------------------------------------------------------
# PART B: USING THE KEY AND THE AUXILIARY PREDICTION FEATURES
# ------------------------------------------------------------
features_label = ["keyPredictionFeatures", "auxiliaryPredictionFeatures"]

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers or imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegressionModel, listings_data, features_label, ("target", "price"), 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print("Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $230k
Cross-validation test error:     $247k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $144k
Cross-validation test error:     $161k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $227k
Cross-validation test error:     $242k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $138k
Cross-validation test error:     $151k
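
The four cases in each part differ only in which cleaning steps are enabled, so the same comparison can be written as a loop over all on/off combinations. The sketch below illustrates that pattern on toy data. `run_ablation` and the toy steps are hypothetical, and unlike the HOPUS routines (which appear to modify `listings_data` in place), the toy steps return a new object.

```python
from itertools import product


def run_ablation(load, steps, evaluate):
    """Evaluate every on/off combination of the given cleaning steps.

    load     -- callable returning a fresh copy of the raw data
    steps    -- dict mapping a step name to a callable: data -> cleaned data
    evaluate -- callable: data -> score
    """
    results = {}
    for flags in product([False, True], repeat=len(steps)):
        data = load()  # reload so that earlier cases do not leak into later ones
        for step, enabled in zip(steps.values(), flags):
            if enabled:
                data = step(data)
        results[flags] = evaluate(data)
    return results


# Toy stand-ins: None marks a missing value, 1000.0 plays the role of an outlier
def load():
    return [1.0, 2.0, None, 3.0, 1000.0, None]


steps = {
    "drop_outliers": lambda data: [x for x in data if x is None or x < 100],
    "drop_missing": lambda data: [x for x in data if x is not None],
}

# Here the "score" is simply the number of samples that survive cleaning
results = run_ablation(load, steps, evaluate=len)
for flags, score in results.items():
    enabled = [name for name, on in zip(steps, flags) if on]
    print(f"steps enabled: {enabled or 'none'} -> {score} samples kept")
```

In the notebook's setting, `load` would wrap the load and preprocess calls, each step would wrap one of the `drop_*` routines, and `evaluate` would wrap `group_columns` followed by the cross-validation call.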
