# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook demonstrates what HOPUS can do, namely:
- clean the raw data,
- train a variety of models for the prediction of real estate prices, and
- evaluate the performance of these models.

## Technical preliminaries

In [1]:
# We clone the HOPUS repository to have access to all its data and routines
!git clone https://github.com/aremondtiedrez/hopus.git
%cd hopus

Cloning into 'hopus'...
remote: Enumerating objects: 234, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 234 (delta 14), reused 18 (delta 6), pack-reused 204 (from 1)
Receiving objects: 100% (234/234), 748.57 KiB | 12.27 MiB/s, done.
Resolving deltas: 100% (123/123), done.
/content/hopus


In [2]:
# Import requisite modules from HOPUS
import evaluation
import preprocessing

## Data cleaning

In [3]:
# Load the home price index (HPI) and preprocess it
hpi = preprocessing.home_price_index.load()
preprocessing.home_price_index.preprocess(hpi)

# Load the property listings and preprocess them using the HPI
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

# Drop outliers and listings that are missing key features
preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

In [4]:
price_rmse = evaluation.hpi_rmse(listings_data, target="price")
log_price_rmse = evaluation.hpi_rmse(listings_data, target="logPrice")
print(
    "When using the available home price index\n"
    "instead of the true home price index,\n"
    f"the price RMSE is ${price_rmse/1_000:.0f}k and \n"
    f"the log-price RMSE is {log_price_rmse:.3f}."
)

When using the available home price index
instead of the true home price index,
the price RMSE is $10k and 
the log-price RMSE is 0.021.
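
For reference, the root-mean-square error (RMSE) is the square root of the mean squared difference between two sets of values. The sketch below computes it on prices and on log-prices for a few made-up numbers. It is only an illustration: the prices are hypothetical and the helper `rmse` is not part of HOPUS, whose `evaluation.hpi_rmse` may be implemented differently.

```python
import numpy as np


def rmse(actual, predicted):
    """Root-mean-square error between two arrays of equal length."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical prices, chosen for illustration only (not taken from the HOPUS data)
true_prices = np.array([400_000.0, 550_000.0, 620_000.0])
# Estimates that are off by +2%, -2%, and +2% respectively
estimated_prices = true_prices * np.array([1.02, 0.98, 1.02])

price_rmse = rmse(true_prices, estimated_prices)
log_price_rmse = rmse(np.log(true_prices), np.log(estimated_prices))

print(f"Price RMSE:     ${price_rmse / 1_000:.1f}k")
print(f"Log-price RMSE: {log_price_rmse:.3f}")
```

Because a difference in log-price is the logarithm of a price ratio, a small log-price RMSE can be read as an approximate relative error: a value of 0.021 corresponds to price ratios that are typically off by about 2%.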


In [5]:
# Group the columns; the groups "keyPredictionFeatures" and "target" are used for indexing below
preprocessing.property_listings.group_columns(listings_data)
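
The training cell further down selects data with `listings_data["keyPredictionFeatures"]` and `listings_data[("target", "price")]`. One way this kind of indexing can work is with a two-level pandas column index. The sketch below illustrates the idea on a tiny made-up table; the column names `livingArea` and `bedrooms` and the grouping dictionary are hypothetical, and whether `group_columns` builds its groups exactly this way is an assumption.

```python
import pandas as pd

# Hypothetical flat table (column names invented for illustration)
flat = pd.DataFrame(
    {
        "livingArea": [120.0, 95.0],
        "bedrooms": [3, 2],
        "price": [450_000.0, 380_000.0],
    }
)

# Hypothetical mapping from each column to the group it belongs to
groups = {
    "livingArea": "keyPredictionFeatures",
    "bedrooms": "keyPredictionFeatures",
    "price": "target",
}

# Replace the flat columns with (group, column) pairs
grouped = flat.copy()
grouped.columns = pd.MultiIndex.from_tuples(
    [(groups[name], name) for name in flat.columns]
)

# Selecting a group returns a DataFrame with all columns in that group
features = grouped["keyPredictionFeatures"]
# Selecting a (group, column) pair returns a single Series
prices = grouped[("target", "price")]

print(features)
print(prices)
```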

## Linear regression: training and evaluation

In [6]:
import numpy as np
import secrets

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [23]:
# Draw a fresh random seed, so the split (and hence the errors) changes from run to run
seed = secrets.randbits(32)

# Train-test split (a single call splits the features and the target consistently)
train_features, test_features, train_target, test_target = train_test_split(
    listings_data["keyPredictionFeatures"],
    listings_data[("target", "price")],
    train_size=0.8,
    shuffle=True,
    random_state=seed,
)

# Train model
model = LinearRegression()
model.fit(train_features, train_target)

# Evaluate model
train_predictions = model.predict(train_features)
test_predictions = model.predict(test_features)

train_rmse = np.sqrt(mean_squared_error(train_target, train_predictions))
test_rmse = np.sqrt(mean_squared_error(test_target, test_predictions))

In [24]:
# Report evaluations
print(f"Training error: ${train_rmse / 1_000:.0f}k")
print(f"Test error:     ${test_rmse / 1_000:.0f}k")

Training error: $158k
Test error:     $146k
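
In the run above the test error is lower than the training error, which can happen by chance with a single random split, especially since the seed is redrawn on every run. One way to gauge how much the errors depend on the split is to repeat the split several times and look at the spread. The sketch below does this with scikit-learn's `ShuffleSplit` on synthetic stand-in data; the helper `rmse_over_splits` and the synthetic data are illustrative assumptions, not part of HOPUS, and the numbers it prints say nothing about the real listings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit


def rmse_over_splits(features, target, n_splits=20, seed=0):
    """Return the training and test RMSE over repeated random 80/20 splits."""
    features = np.asarray(features)
    target = np.asarray(target)
    splitter = ShuffleSplit(n_splits=n_splits, train_size=0.8, random_state=seed)

    train_errors, test_errors = [], []
    for train_idx, test_idx in splitter.split(features):
        model = LinearRegression().fit(features[train_idx], target[train_idx])
        train_mse = mean_squared_error(target[train_idx], model.predict(features[train_idx]))
        test_mse = mean_squared_error(target[test_idx], model.predict(features[test_idx]))
        train_errors.append(np.sqrt(train_mse))
        test_errors.append(np.sqrt(test_mse))
    return np.array(train_errors), np.array(test_errors)


# Synthetic stand-in data: a linear signal plus Gaussian noise (not the HOPUS listings)
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(500, 3))
demo_target = demo_features @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=500)

train_errors, test_errors = rmse_over_splits(demo_features, demo_target)
print(f"Training RMSE: {train_errors.mean():.3f} +/- {train_errors.std():.3f}")
print(f"Test RMSE:     {test_errors.mean():.3f} +/- {test_errors.std():.3f}")
```

With the real data, the same helper could be called with `listings_data["keyPredictionFeatures"]` and `listings_data[("target", "price")]`, assuming both contain only numeric values.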
