# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook highlights what HOPUS can do, namely
- clean the raw data,
- train a variety of models for the prediction of real estate prices, and
- evaluate the performance of these models.

## Technical preliminaries

In [1]:
# We clone the HOPUS repository to have access to all its data and routines
!git clone https://github.com/aremondtiedrez/hopus.git
%cd hopus

Cloning into 'hopus'...
remote: Enumerating objects: 361, done.[K
remote: Counting objects: 100% (157/157), done.[K
remote: Compressing objects: 100% (131/131), done.[K
remote: Total 361 (delta 98), reused 59 (delta 25), pack-reused 204 (from 1)[K
Receiving objects: 100% (361/361), 786.70 KiB | 5.92 MiB/s, done.
Resolving deltas: 100% (207/207), done.
/content/hopus


In [2]:
# Import requisite modules from HOPUS
import evaluation
import models
import preprocessing

## Data cleaning

In [3]:
hpi = preprocessing.home_price_index.load()
preprocessing.home_price_index.preprocess(hpi)

listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

In [4]:
price_rmse = evaluation.hpi_rmse(listings_data, target="price")
log_price_rmse = evaluation.hpi_rmse(listings_data, target="logPrice")
print(
    "When using the available home price index\n"
    "instead of the true home price index,\n"
    f"the price RMSE is ${price_rmse/1_000:.0f}k and \n"
    f"the log-price RMSE is {log_price_rmse:.3f}."
)

When using the available home price index
instead of the true home price index,
the price RMSE is $10k and 
the log-price RMSE is 0.021.


## Baseline model
Average (time-normalized) price-per-square-foot over each ZIP code

In [5]:
import numpy as np
import secrets

In [6]:
model = models.Baseline()
model.fit(listings_data, None)
train_mse = model.evaluate(listings_data, listings_data["price"])
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $157.830k


In [7]:
# Save the model
model.save("baseline_model")
del model

In [8]:
# Loading the model
model = models.Baseline()
model.load("baseline_model")
loaded_rmse = np.sqrt(model.evaluate(listings_data, listings_data["price"]))
print(f"Training error: ${loaded_rmse / 1_000:.3f}k")

Training error: $157.830k


In [10]:
# Cross-validation
seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.Baseline, listings_data, listings_data["price"], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $158k
Cross-validation test error:     $158k


## Baseline model for *log*-prices

In [13]:
# Training and evaluating a single model
model = models.Baseline()
model.fit(listings_data, None)
train_mse = model.evaluate(listings_data, listings_data["logPrice"], target_type="log_price")
train_rmse = np.sqrt(train_mse)
print(f"Training error: {train_rmse:.3f}")

Training error: 0.317


In [16]:
# Cross-validation
seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.Baseline, listings_data, listings_data["logPrice"], 100, seed, target_type="log_price")
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Cross-validation training error: 0.317
Cross-validation test error:     0.318


## Technical preliminary (before linear regression or XGBoost)
We group the columns into key features, auxiliary features, and target
(as well as into information columns and unused columns).

In [17]:
preprocessing.property_listings.group_columns(listings_data)

## Linear regression: training and evaluation

In [18]:
import numpy as np
import secrets
from sklearn.model_selection import train_test_split

In [19]:
# Train a single model, then save it, delete it, and load it back

# Train
model = models.LinearRegression()
model.fit(listings_data["keyPredictionFeatures"], listings_data[("target", "price")])

# Evaluate
train_rmse = np.sqrt(model.evaluate(listings_data["keyPredictionFeatures"], listings_data[("target", "price")]))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

# Save
model.save("linear_regression_model")

# Delete
del model

# Load
model = models.LinearRegression()
model.load("linear_regression_model")
print(f"Training error (after deleting the original model and loading it back): ${train_rmse / 1_000:.3f}k")

Training error: $155.970k
Training error (after deleting the original model and loading it back): $155.970k


In [20]:
# Cross-validation
seed = secrets.randbits(32)

# Train-test split
train_features, test_features = train_test_split(listings_data["keyPredictionFeatures"], train_size=0.8, shuffle=True, random_state=seed)
train_target, test_target = train_test_split(listings_data[("target", "price")], train_size=0.8, shuffle=True, random_state=seed)

# Train model
model = models.LinearRegression()
model.fit(train_features, train_target)

# Evaluate model
train_rmse = np.sqrt(model.evaluate(train_features, train_target))
test_rmse = np.sqrt(model.evaluate(test_features, test_target))

# Report evaluations
print(f"Seed: {seed}")
print(f"Training error: ${train_rmse / 1_000:.3f}k")
print(f"Test error:     ${test_rmse / 1_000:.3f}k")

Seed: 3645453190
Training error: $152.175k
Test error:     $171.501k


## Linear regression: evaluation with cross-validation

In [21]:
import numpy as np
import secrets

In [24]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $156k
Cross-validation test error:     $159k


## Linear regression for *log*-prices: evaluation with cross-validation

In [25]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(
    models.LinearRegression,
    listings_data["keyPredictionFeatures"],
    listings_data[("target", "logPrice")],
    100,
    seed,
    target_type="log_price"
)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Cross-validation training error: 0.280
Cross-validation test error:     0.281


## Linear regression: cross-validation for various data subsets

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)

In [None]:
# --------------------------------------------------
# PART A: USING ONLY THE KEY PREDICTION FEATURES
# --------------------------------------------------

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 1: without outliers nor imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $248k
Cross-validation test error:     $252k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $163k
Cross-validation test error:     $167k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $240k
Cross-validation test error:     $244k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $156k
Cross-validation test error:     $160k


In [None]:
# ------------------------------------------------------------
# PART B: USING THE KEY AND THE AUXILIARY PREDICTION FEATURES
# ------------------------------------------------------------
features_label = ["keyPredictionFeatures", "auxiliaryPredictionFeatures"]

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 1: without outliers nor imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $230k
Cross-validation test error:     $248k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $144k
Cross-validation test error:     $161k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $227k
Cross-validation test error:     $240k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $138k
Cross-validation test error:     $151k


## XGBoost: training and cross-validation

In [26]:
import numpy as np
import secrets

In [27]:
# Training a single model
features = listings_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
target = listings_data[("target", "price")]

hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)
train_rmse = np.sqrt(model.evaluate(features, target))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $27.750k


In [28]:
# Saving a model
model.save("boosted_tree_model")
del model

  self.get_booster().save_model(fname)


In [29]:
# Loading a model
model = models.BoostedTrees()
model.load("boosted_tree_model")

train_rmse = np.sqrt(model.evaluate(features, target))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $27.750k


  self.get_booster().load_model(fname)


In [31]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $26k
Cross-validation test error:     $125k


## XGBoost: training and cross-validation for *log*-prices

In [33]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, np.log(target), 100, seed, hyperparameters=hyperparameters, target_type="log_price")
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Cross-validation training error: 0.065
Cross-validation test error:     0.255


## XGBoost: Hierarchical hyperparameter search

We follow the hierarchical hyperparameter search procedure described in
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [34]:
import numpy as np
import pandas as pd

from itertools import product

### Step 1: `n_estimators`

In [None]:
# STEP 1
# We use sensible default choices and aim to find a good number of estimators to use
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for n_estimators in (10, 30, 100, 300, 1000):
    hyperparameters["n_estimators"] = n_estimators
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,n_estimators,Unnamed: 8_level_1
5,1,0,0.8,0.8,1,0.1,10,167866.093173
5,1,0,0.8,0.8,1,0.1,30,154970.591688
5,1,0,0.8,0.8,1,0.1,100,142922.38565
5,1,0,0.8,0.8,1,0.1,300,131397.791722
5,1,0,0.8,0.8,1,0.1,1000,141849.177067


### Step 2: `max_depth` and `min_child_weight`

In [None]:
# STEP 2
# We fix `n_estimators = 300` from the previous step and
# now seek to find good values for `max_depth` and `min_child_weight`.
hyperparameters = {
    "n_estimators": 100,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for max_depth, min_child_weight in product((3, 5, 7, 9), (1, 3, 5)):
    hyperparameters["max_depth"] = max_depth
    hyperparameters["min_child_weight"] = min_child_weight
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
n_estimators,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,max_depth,min_child_weight,Unnamed: 8_level_1
100,0,0.8,0.8,1,0.1,3,1,122362.546601
100,0,0.8,0.8,1,0.1,3,3,130029.778251
100,0,0.8,0.8,1,0.1,3,5,131322.190905
100,0,0.8,0.8,1,0.1,5,1,140524.192897
100,0,0.8,0.8,1,0.1,5,3,131566.630501
100,0,0.8,0.8,1,0.1,5,5,132154.710467
100,0,0.8,0.8,1,0.1,7,1,148857.905292
100,0,0.8,0.8,1,0.1,7,3,132531.70392
100,0,0.8,0.8,1,0.1,7,5,132905.108605
100,0,0.8,0.8,1,0.1,9,1,123762.845216


### Step 3: `gamma`

In [None]:
# STEP 3
# We fix `max_depth = 5` and `min_child_weight = 3` from the previous step and
# now seek to find a good value for `gamma`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for gamma in (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    hyperparameters["gamma"] = gamma
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
n_estimators,max_depth,min_child_weight,subsample,colsample_bytree,scale_pos_weight,learning_rate,gamma,Unnamed: 8_level_1
100,5,3,0.8,0.8,1,0.1,0.0,129868.921322
100,5,3,0.8,0.8,1,0.1,0.1,139951.417399
100,5,3,0.8,0.8,1,0.1,0.2,136127.287884
100,5,3,0.8,0.8,1,0.1,0.3,127061.01858
100,5,3,0.8,0.8,1,0.1,0.4,128624.666416
100,5,3,0.8,0.8,1,0.1,0.5,154614.412506
100,5,3,0.8,0.8,1,0.1,0.6,130056.174284
100,5,3,0.8,0.8,1,0.1,0.7,134921.474051


### Step 4: `subsample` and `colsample_bytree`

In [None]:
# STEP 4
# We fix `gamma = 0.2` from the previous step and
# now seek to find good values for `subsample` and `colsample_bytree`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for subsample, colsample_bytree in product((0.6, 0.7, 0.8, 0.9), (0.6, 0.7, 0.8, 0.9)):
    hyperparameters["subsample"] = subsample
    hyperparameters["colsample_bytree"] = colsample_bytree
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
n_estimators,max_depth,min_child_weight,gamma,scale_pos_weight,learning_rate,subsample,colsample_bytree,Unnamed: 8_level_1
100,5,3,0.2,1,0.1,0.6,0.6,132869.031376
100,5,3,0.2,1,0.1,0.6,0.7,131054.764492
100,5,3,0.2,1,0.1,0.6,0.8,138038.242894
100,5,3,0.2,1,0.1,0.6,0.9,127430.389929
100,5,3,0.2,1,0.1,0.7,0.6,138724.340417
100,5,3,0.2,1,0.1,0.7,0.7,142196.856449
100,5,3,0.2,1,0.1,0.7,0.8,140327.985786
100,5,3,0.2,1,0.1,0.7,0.9,130399.316808
100,5,3,0.2,1,0.1,0.8,0.6,127570.024483
100,5,3,0.2,1,0.1,0.8,0.7,127403.831654


### Step 5: `reg_lambda`

In [None]:
# STEP 5
# We fix `subsample = 0.6` and `colsample_bytree = 0.9` from the previous step and
# now seek to find a good values for `reg_lambda`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
n_experiments = 10
for reg_lambda in (0, 1e-5, 1e-4, 1e-3, 1e-2, 1, 10, 100):
    hyperparameters["reg_lambda"] = reg_lambda
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,test_cv_mse
n_estimators,max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,reg_lambda,Unnamed: 9_level_1
100,5,3,0.2,0.6,0.9,1,0.1,0.0,130598.809918
100,5,3,0.2,0.6,0.9,1,0.1,1e-05,129145.321268
100,5,3,0.2,0.6,0.9,1,0.1,0.0001,139496.294011
100,5,3,0.2,0.6,0.9,1,0.1,0.001,144907.922949
100,5,3,0.2,0.6,0.9,1,0.1,0.01,132371.172048
100,5,3,0.2,0.6,0.9,1,0.1,1.0,133882.609197
100,5,3,0.2,0.6,0.9,1,0.1,10.0,140659.833262
100,5,3,0.2,0.6,0.9,1,0.1,100.0,153363.197463


### Step 6: Evaluate the final hyperparameter choice and train a final model

In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

seed = secrets.randbits(32)

start_time = time.time()
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Duration of the experiment: 13.0 minutes.
Cross-validation training error: $14k
Cross-validation test error:     $119k


In [35]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

# Train the model, then save it
model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)

In [None]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)

In [None]:
model._model.save_model("xgb.model")

  self.get_booster().save_model(fname)


In [None]:
model._model.load_model("xgb.model")

  self.get_booster().load_model(fname)


In [None]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)

## XGBoost: Hierarchical hyperparameter search for the prediction of *log*-prices

We follow the hierarchical hyperparameter search procedure described in
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [36]:
import numpy as np
import pandas as pd

from itertools import product

### Step 1: `n_estimators`

In [None]:
# STEP 1
# We use sensible default choices and aim to find a good number of estimators to use
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for n_estimators in (10, 30, 100, 300, 1000):
    hyperparameters["n_estimators"] = n_estimators
    record = evaluation.run_experiment(features, np.log(target), models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,n_estimators,Unnamed: 8_level_1
5,1,0,0.8,0.8,1,0.1,10,167866.093173
5,1,0,0.8,0.8,1,0.1,30,154970.591688
5,1,0,0.8,0.8,1,0.1,100,142922.38565
5,1,0,0.8,0.8,1,0.1,300,131397.791722
5,1,0,0.8,0.8,1,0.1,1000,141849.177067


### Step 2: `max_depth` and `min_child_weight`

In [None]:
# STEP 2
# We fix `n_estimators = 300` from the previous step and
# now seek to find good values for `max_depth` and `min_child_weight`.
hyperparameters = {
    "n_estimators": 100,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for max_depth, min_child_weight in product((3, 5, 7, 9), (1, 3, 5)):
    hyperparameters["max_depth"] = max_depth
    hyperparameters["min_child_weight"] = min_child_weight
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
n_estimators,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,max_depth,min_child_weight,Unnamed: 8_level_1
100,0,0.8,0.8,1,0.1,3,1,122362.546601
100,0,0.8,0.8,1,0.1,3,3,130029.778251
100,0,0.8,0.8,1,0.1,3,5,131322.190905
100,0,0.8,0.8,1,0.1,5,1,140524.192897
100,0,0.8,0.8,1,0.1,5,3,131566.630501
100,0,0.8,0.8,1,0.1,5,5,132154.710467
100,0,0.8,0.8,1,0.1,7,1,148857.905292
100,0,0.8,0.8,1,0.1,7,3,132531.70392
100,0,0.8,0.8,1,0.1,7,5,132905.108605
100,0,0.8,0.8,1,0.1,9,1,123762.845216


### Step 3: `gamma`

In [None]:
# STEP 3
# We fix `max_depth = 5` and `min_child_weight = 3` from the previous step and
# now seek to find a good value for `gamma`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for gamma in (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    hyperparameters["gamma"] = gamma
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
n_estimators,max_depth,min_child_weight,subsample,colsample_bytree,scale_pos_weight,learning_rate,gamma,Unnamed: 8_level_1
100,5,3,0.8,0.8,1,0.1,0.0,129868.921322
100,5,3,0.8,0.8,1,0.1,0.1,139951.417399
100,5,3,0.8,0.8,1,0.1,0.2,136127.287884
100,5,3,0.8,0.8,1,0.1,0.3,127061.01858
100,5,3,0.8,0.8,1,0.1,0.4,128624.666416
100,5,3,0.8,0.8,1,0.1,0.5,154614.412506
100,5,3,0.8,0.8,1,0.1,0.6,130056.174284
100,5,3,0.8,0.8,1,0.1,0.7,134921.474051


### Step 4: `subsample` and `colsample_bytree`

In [None]:
# STEP 4
# We fix `gamma = 0.2` from the previous step and
# now seek to find good values for `subsample` and `colsample_bytree`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for subsample, colsample_bytree in product((0.6, 0.7, 0.8, 0.9), (0.6, 0.7, 0.8, 0.9)):
    hyperparameters["subsample"] = subsample
    hyperparameters["colsample_bytree"] = colsample_bytree
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,test_cv_mse
n_estimators,max_depth,min_child_weight,gamma,scale_pos_weight,learning_rate,subsample,colsample_bytree,Unnamed: 8_level_1
100,5,3,0.2,1,0.1,0.6,0.6,132869.031376
100,5,3,0.2,1,0.1,0.6,0.7,131054.764492
100,5,3,0.2,1,0.1,0.6,0.8,138038.242894
100,5,3,0.2,1,0.1,0.6,0.9,127430.389929
100,5,3,0.2,1,0.1,0.7,0.6,138724.340417
100,5,3,0.2,1,0.1,0.7,0.7,142196.856449
100,5,3,0.2,1,0.1,0.7,0.8,140327.985786
100,5,3,0.2,1,0.1,0.7,0.9,130399.316808
100,5,3,0.2,1,0.1,0.8,0.6,127570.024483
100,5,3,0.2,1,0.1,0.8,0.7,127403.831654


### Step 5: `reg_lambda`

In [None]:
# STEP 5
# We fix `subsample = 0.6` and `colsample_bytree = 0.9` from the previous step and
# now seek to find a good values for `reg_lambda`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
n_experiments = 10
for reg_lambda in (0, 1e-5, 1e-4, 1e-3, 1e-2, 1, 10, 100):
    hyperparameters["reg_lambda"] = reg_lambda
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,test_cv_mse
n_estimators,max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,reg_lambda,Unnamed: 9_level_1
100,5,3,0.2,0.6,0.9,1,0.1,0.0,130598.809918
100,5,3,0.2,0.6,0.9,1,0.1,1e-05,129145.321268
100,5,3,0.2,0.6,0.9,1,0.1,0.0001,139496.294011
100,5,3,0.2,0.6,0.9,1,0.1,0.001,144907.922949
100,5,3,0.2,0.6,0.9,1,0.1,0.01,132371.172048
100,5,3,0.2,0.6,0.9,1,0.1,1.0,133882.609197
100,5,3,0.2,0.6,0.9,1,0.1,10.0,140659.833262
100,5,3,0.2,0.6,0.9,1,0.1,100.0,153363.197463


### Step 6: Evaluate the final hyperparameter choice and train a final model

In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

seed = secrets.randbits(32)

start_time = time.time()
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Duration of the experiment: 13.0 minutes.
Cross-validation training error: $14k
Cross-validation test error:     $119k


In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

# Train the model, then save it
model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)

In [None]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)

In [None]:
model._model.save_model("xgb.model")

  self.get_booster().save_model(fname)


In [None]:
model._model.load_model("xgb.model")

  self.get_booster().load_model(fname)


In [None]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)