# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook highlights what HOPUS can do, namely
- clean the raw data,
- train a variety of models for the prediction of real estate prices,
- evaluate the performance of these models, and
- make predictions on new data and display the predictions on a map.

## Technical preliminaries

We install the `hopus` package and import it.
We also import other libraries, or tools from other libraries, which will be of use.

In [None]:
pip install git+https://github.com/aremondtiedrez/hopus.git

In [None]:
import numpy as np
import pandas as pd
import secrets

from sklearn.model_selection import train_test_split

import hopus

## Data pipeline

The data used here is obtained via
the [RentCast API](https://www.rentcast.io/api).

Our data pipeline proceeds in three steps.

1. We clean the raw data. The data provided by the RentCast API is already
   of good quality, but nonetheless some gaps in the data are present which
   we must account for. Moreover, the RentCast data comes in the form of a JSON
   file which, due to its hierarchical and nested nature, needs to be untangled
   in order to put the data into a more digestible tabular format.

   Another important step at this stage has to do with the evolution of prices
   over time. Predicting how real estates prices evolve over time is
   an important, complex, and challenging question. We therefore avoid it
   entirely here. Instead we use a national (U.S.) home price index
   (known as HPI below) to "normalize" home prices over time and
   then predict these "normalized" home prices.

2. We remove outliers and samples which are missing key features.

   Why remove outliers? Because, when an unusual real estate sale listing
   is encountered, it is best to gather more information and proceed with caution. For example: some houses are bought at prices significantly above
   market rates by developers than then raze them and then redevelop the lot
   into a multi-apartment building. Without access to additional information,
   such as zoning information which indicates neighbourhoods where such
   redevelomenpts are possible, or such as demographic and economic data, which
   indicates where such redevelopments are likely to be profitable,
   it is unlikely that our model could reliably predict the sale price
   of such properties.

   Why remove samples which are missing key features? For two reasons.
   First, because by *key* features we mean things such as
   the number of bathrooms or the lot size. These are quantities that
   can always be gathered "by-hand".
   Second, because the overwhelming majority of real estate listings data
   accessed via the RentCast API contain all of the features we label
   as key features. Only about 5% of the listings fails to do so.

3. We split the data into a training set and a test set.
   (Nothing about this step is specific to the data or problem at hand.)

In [None]:
# Load the data
hpi = hopus.preprocessing.home_price_index.load_demo_data()
listings_data = hopus.preprocessing.property_listings.load_demo_data()

# Pre-process the data, i.e. clean the data
hopus.preprocessing.home_price_index.preprocess(hpi)
listings_data = hopus.preprocessing.property_listings.preprocess(listings_data, hpi)

# Remove outliers and listings which are missing key features
hopus.preprocessing.property_listings.drop_outliers(listings_data)
hopus.preprocessing.property_listings.drop_missing_key_features(listings_data)

# Split the data into training and testing sets.
# The training and test data obtained by using the seed below
# are already saved in the HOPUS package, to enable easy access
# later in this notebook.
# (The seed used was randomly generated via secrets.randbits)
seed = 726109520
training_data, test_data = train_test_split(listings_data, train_size=0.8, shuffle=True, random_state=seed)

## Training models

### Pedagogical prelude: predicting prices (and not *log*-prices)

Ultimately we will seek to train models that predict *log*-prices.
After all, a prediction error of \$50,000 is never desirable,
but it is much easier to swallow on a property whose value is around \$1,000,000
than on a property whose value is about \$200,000.

In other words: what we care about here is not the *absolute* error
but the error *relative* to the property's value. In some sense,
that is precisely what predicting *log*-prices will take care of:
an absolute error in the prediction of the *log*-price is really
a relative error in the prediction of the price.

That being said, for pedagogical and expository purposes it is a good idea
to start by predicting prices, instead of log-prices, since the magnitude
of the errors obtained are much more readily interpretable. In other words:
it is immediately clear what a prediction error of \$50,000 means,
when predicting the price. It is less clear what a prediction error of 0.13
means when predicting the log-price.

In [None]:
# First we load the training data
training_data = hopus.demo.load_training_data()

#### The `Baseline` model

Before we do anything fancier, we should establish a baseline to see
how well "reasonable guesses" perform when predicting home prices.
And what is a common way to guess the price of a house? Well, you would look
at nearby houses, see what their price-per-square-foot is, and use that
to predict the price of the house of interest. That is exactly how
the `Baseline` model proceeds, where "nearby houses" means "all houses with
the same ZIP code".

In [None]:
from hopus.models import Baseline

In [None]:
# Train the model
model = Baseline()
model.fit(features=training_data, target=None)

# Compute and report the training error
train_mse = model.evaluate(features=training_data, target=training_data["price"])
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.0f}k")

Training error: $154k


In [None]:
# Of course, the training error underestimates the true error.
# Instead we can use the cross-validation error to better estimate the true error.
seed = secrets.randbits(32)
train_cv_mse, dev_cv_mse, trained_models = hopus.evaluation.cv_evaluation(Baseline, training_data, training_data["price"], n_splits=10, seed=seed)
train_cv_rmse, dev_cv_rmse = np.sqrt(np.array([train_cv_mse, dev_cv_mse]))
print(f"Cross-validation error: ${dev_cv_rmse / 1_000:.0f}k")

Cross-validation error: $155k


#### The `LinearRegression` model

The `Baseline` model above cares only about geography, and about
where a home is located. (That is, after all, the prime directive
of real estate: "Location, location, location".)

We could also estimate the price of a house in a completely different way,
dismissing location entirely. We would instead look at quantities such as
the number of bedrooms, the number of bathrooms, the number of floors, or
the year the house was built, and predict the price based on a combination
of these values (newer houses are more expensive, those with fewer bedrooms
are cheaper).

This is what the `LinearRegression` model does. It also also makes its best
effort to predict the price if that price varies only *linearly* with each
of the quantities mentionned above. (This is, of course, unreasonable: the seventeenth bedroom is not as valuable as the second one. Nonetheless, this
model remains valuable.)

Technical note: the `hopus.models.LinearRegression` class is a thin wrapper
around the . The wrapper is helpful in that it allows using the same interface
to access both the `Baseline` model above and
this `LinearRegression` model here.

In [None]:
from hopus.models import LinearRegression

In [None]:
# An extra dose of data pre-processing is done here since
# the linear regression model will only depend on features
# deemed to be "key" features.

# WARNING: Running this cell more than once may cause issues.
# If so, simply reload the training data above before re-running this cell.

hopus.preprocessing.property_listings.group_columns(training_data)

In [None]:
# Extract the features and target
features = training_data["keyPredictionFeatures"]
target = training_data[("target", "price")]

# Train the model
model = LinearRegression()
model.fit(features, target)

# Compute and report the training error
train_mse = model.evaluate(features, target)
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.0f}k")

Training error: $148k


In [None]:
# Once again, the training error underestimates the true error.
# We use the cross-validation error to better estimate the true error.
seed = secrets.randbits(32)
train_cv_mse, dev_cv_mse, trained_models = hopus.evaluation.cv_evaluation(LinearRegression, features, target, n_splits=10, seed=seed)
train_cv_rmse, dev_cv_rmse = np.sqrt(np.array([train_cv_mse, dev_cv_mse]))
print(f"Cross-validation error: ${dev_cv_rmse / 1_000:.0f}k")

Cross-validation error: $152k


#### The `BoostedTrees` model

The two models above, `Baseline` and `LinearRegression`,
perform similarly even though their approaches are very different:
the former focuses solely on geographical location, the latter eschews it
entirely!

We now train a more complicated model. Since the data is *tabular*,
a reasonable class of models to use are *decision trees*.
In particular, the `BoostedTrees` model discussed here
leverages `xgboost`, a popular library for *boosted* decision trees
(in short, boosted decision trees repeatedly train one decision tree
after another, focusing on the samples on which the previous trees
are producing large errors).

The good news is that, even though this `BoostedTrees` model
is built on a completely different library than the previous two models,
the `hopus.models` interface stays the same.

In [None]:
from hopus.models import BoostedTrees

In [None]:
# Extract the features and target
features = training_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
target = training_data[("target", "price")]

# One of the appeal of the xgboost library is
# that additional performance can be squeezed out
# of judicious hyper-parameter choices.
# The hyper-parameter below were chosen after
# a careful hyper-parameter search.
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

# Train the model
model = BoostedTrees(**hyperparameters)
model.fit(features, target)

# Compute and report the training error
train_mse = model.evaluate(features, target)
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.1f}k")

Training error: $13.0k


In [None]:
# That training error looks amazing! But, be careful:
# here the training error grossly underestimates the true error.
# Once again, we use the cross-validation error to better estimate the true error.
# (This may take between about a minute to run.)
seed = secrets.randbits(32)
train_cv_mse, dev_cv_mse, trained_models = hopus.evaluation.cv_evaluation(BoostedTrees, features, target, n_splits=10, seed=seed, hyperparameters=hyperparameters)
train_cv_rmse, dev_cv_rmse = np.sqrt(np.array([train_cv_mse, dev_cv_mse]))
print(f"Cross-validation error: ${dev_cv_rmse / 1_000:.0f}k")

Cross-validation error: $132k


### Predicting *log*-prices

As mentioned earlier, what we are really after
is the prediction of *log*-prices, and not merely prices.

Nonetheless, we proceed as above otherwise and compare
the performance of the `Baseline`, `LinearRegression`,
and `BoostedTrees` models.

In [None]:
# We start with the Baseline model.
# To avoid a mismatch between how the data is formatted and
# what the Baseline model expects, we reload the training data
training_data = hopus.demo.load_training_data()

# Train a Baseline model and, if desired, save it
# Note: the Baseline model is trained the same way whether we seek
# to predict prices or log-prices. The difference comes later,
# namely when this model is evaluated.
model = Baseline()
model.fit(training_data, None)
#model.save("trained_baseline.csv")

# Compute the cross-validation error of the Baseline model
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = hopus.evaluation.cv_evaluation(Baseline, training_data, training_data["logPrice"], n_splits=10, seed=seed, target_type="log_price")
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
print(f"Cross-validation test error: {test_cv_rmse:.3f}")

Cross-validation test error: 0.317


In [None]:
# We now turn our attention to the LinearRegression and BoostedTrees models.
# They expect the data in a different format than the Baseline model,
# so we process the training data first.

# WARNING: Running this cell more than once may cause issues.
# If so, simply relo
hopus.preprocessing.property_listings.group_columns(training_data)

In [None]:
# First, the LinearRegression model.

# Extract the features and target
features = training_data["keyPredictionFeatures"]
target = training_data[("target", "logPrice")]

# Train a LinearRegression model, and, if desired, save it
model = LinearRegression()
model.fit(features, target)
#model.save("trained_linear_regression.npz")

# Compute the cross-validation error of the LinearRegression model
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = hopus.evaluation.cv_evaluation(LinearRegression, features, target, n_splits=10, seed=seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
print(f"Cross-validation test error: {test_cv_rmse:.3f}")

Cross-validation test error: 0.281


In [None]:
# Second, the BoostedTrees model.

# Extract the features and target
features = training_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
target = training_data[("target", "logPrice")]

# Once again, the hyper-parameter below were chosen
# after a careful hyper-parameter search
hyperparameters = {
    "max_depth": 3,
    "min_child_weight": 5,
    "gamma": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.4,
    "reg_lambda": 1,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

# Train a `BoostedTrees` model and, if desired, save ti
model = BoostedTrees(**hyperparameters)
model.fit(features, target)
#model.save("trained_boosted_trees.json")

# Compute the cross-validation error of the BoostedTrees model
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = hopus.evaluation.cv_evaluation(BoostedTrees, features, target, n_splits=10, seed=seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
print(f"Cross-validation test error: {test_cv_rmse:.3f}")

Cross-validation test error: 0.259


**Model selection**: The best performing model is the `BoostedTree` model,
and so that is the model we will use going forward.

## Evaluating the chosen model

Since the best performing model is the `BoostedTrees` model,
that is the one we will use going forward.

Keep in mind, however, that since we "optimized" over the cross-validation
error, namely choosing the model with the lowest such error, we can no longer
use that error as a good estimate of the true error.

Instead, we bring back the *test* data to do so.

In [None]:
# Load the test data and process it
test_data = hopus.demo.load_test_data()
hopus.preprocessing.property_listings.group_columns(test_data)

# Extract the features and target
test_features = test_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
test_target = test_data[("target", "logPrice")]

In [None]:
# Load the trained BoostedTrees model
model = hopus.demo.load_trained_model("BoostedTrees")

# Typically, to load a model we would proceed as follows.
# We would first create the (untrained) model, then
# load its trained parameters from `filename`:
# model = BoostedTrees()
# model.load(filename)

# Here, for demonstration purposes, a trained Baseline model
# and a trained LinearRegression model can be loaded as follows:
# model = hopus.demo.load_trained_model("Baseline")
# model = hopus.demo.load_trained_model("LinearRegression")

In [None]:
# Estimate the error by computing the test error
test_mse = model.evaluate(test_features, test_target)
test_rmse = np.sqrt(test_mse)
print(f"Test error: {test_rmse:.3f}")

Test error: 0.228


## Predictions

Now that we have selected a model and estimated its error,
we can make predictions over new data and display the predictions
on a geographical map.

In [47]:
# Load new data (here, for simplicity, we use the test data as new data)
data = hopus.demo.load_test_data()
geographical_data = data[["latitude", "longitude", "addressLine1"]]

# Load the model
model = hopus.demo.load_trained_model("BoostedTrees")

# Make predictions
hopus.preprocessing.property_listings.group_columns(data)
features = data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
predictions = np.exp(model.predict(features))

# Format the predictions and combine them with the geographical data and the true price
predictions = pd.Series(predictions)
predictions.name = "predictedPrice"
predictions = predictions.apply(lambda value: f"${value:,.0f}")

# Combine the predictions with the true price and with the geographical data
truth = data[("target", "price")].apply(lambda value: f"${value:,.0f}")
truth.name = "truePrice"
geographical_predictions = pd.concat([geographical_data, truth, predictions], axis=1)

# Save the geographical data as a CSV file
geographical_predictions.to_csv("predictions.csv", index=False)

In [48]:
geographical_predictions

Unnamed: 0,latitude,longitude,addressLine1,truePrice,predictedPrice
0,43.144371,-89.309030,4138 Grayhawk Trl,"$410,000","$360,716"
1,43.079762,-89.473906,1121 Lorraine Dr,"$374,000","$395,339"
2,43.066756,-89.469030,5201 Burnett Dr,"$475,000","$466,880"
3,43.043863,-89.386815,2013 Sundstrom St,"$320,000","$323,857"
4,43.092631,-89.330444,3441 Hargrove St,"$277,000","$273,727"
...,...,...,...,...,...
329,43.062265,-89.464542,322 Cheyenne Trl,"$640,000","$565,584"
330,43.065900,-89.312330,5105 Camden Rd,"$360,000","$347,773"
331,43.069807,-89.425577,2226 Van Hise Ave,"$549,700","$562,818"
332,43.082758,-89.362019,1228 Spaight St,"$825,000","$607,226"


After all this hard work, the resulting predictions are shown on this [interactive map](https://aremondtiedrez.github.io/hopus/predictions_map.html).