# Random Forests vs Histogram Gradient Boosting models

In this example we compare the performance of Random Forest (RF) and Histogram Gradient Boosting (HGBT) models in terms of score and computation time on a regression dataset, though all the concepts presented here apply to classification as well.

The comparison is made by varying the parameters that control the number of trees according to each estimator:

n_estimators controls the number of trees in the forest. It’s a fixed number.

max_iter is the maximum number of iterations in a gradient boosting based model. The number of iterations corresponds to the number of trees for regression and binary classification problems; for multiclass classification, one tree per class is built at each iteration. The actual number of trees built can be lower than this maximum, since it depends on the stopping criteria.

HGBT uses gradient boosting to iteratively improve the model’s performance by fitting each tree to the negative gradient of the loss function with respect to the predicted value. RFs, on the other hand, are based on bagging: each tree is trained independently on a bootstrap sample, and the trees' outputs are combined by averaging for regression or by a majority vote for classification.
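The two strategies can be contrasted with a hand-rolled sketch built on plain decision trees. This is only an illustration under simplifying assumptions (squared-error loss, a fixed learning rate, synthetic one-dimensional data, no feature binning). It is not how scikit-learn implements either estimator.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Boosting: trees are fitted sequentially, each one on the negative gradient
# of the loss. For the squared error 0.5 * (y - pred) ** 2, the negative
# gradient with respect to pred is simply the residual y - pred.
learning_rate = 0.1  # illustrative value
boosted_pred = np.full_like(y, y.mean())
for _ in range(50):
    residuals = y - boosted_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    boosted_pred += learning_rate * tree.predict(X)

# Bagging: trees are fitted independently on bootstrap samples, and their
# predictions are averaged (regression case).
tree_preds = []
for _ in range(50):
    idx = rng.randint(0, len(X), size=len(X))  # sample with replacement
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
    tree_preds.append(tree.fit(X[idx], y[idx]).predict(X))
bagged_pred = np.mean(tree_preds, axis=0)

baseline_mse = np.mean((y - y.mean()) ** 2)
boosted_mse = np.mean((y - boosted_pred) ** 2)
bagged_mse = np.mean((y - bagged_pred) ** 2)
print(f"Training MSE, constant baseline: {baseline_mse:.3f}")
print(f"Training MSE, boosting sketch:   {boosted_mse:.3f}")
print(f"Training MSE, bagging sketch:    {bagged_mse:.3f}")
```

The trees in the boosting loop depend on each other, so they must be built one after another. The trees in the bagging loop are independent and could be built in parallel.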


# 1. Load the dataset

In [84]:
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
n_samples, n_features = X.shape

In [87]:
X.head(), y.head()

(   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
 
    Longitude  
 0    -122.23  
 1    -122.22  
 2    -122.24  
 3    -122.25  
 4    -122.25  ,
 0    4.526
 1    3.585
 2    3.521
 3    3.413
 4    3.422
 Name: MedHouseVal, dtype: float64)

In [88]:
X.shape

(20640, 8)

# 2. Compute score and computation times

Notice that many parts of the implementation of HistGradientBoostingClassifier and HistGradientBoostingRegressor are parallelized by default.

The implementation of RandomForestRegressor and RandomForestClassifier can also be run on multiple cores by using the n_jobs parameter, here set to match the number of physical cores on the host machine.

In [89]:
import joblib

N_CORES = joblib.cpu_count(only_physical_cores=True)
print(f"Number of physical cores: {N_CORES}")

Number of physical cores: 1
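The effect of n_jobs on a random forest can be checked with a rough timing sketch. The synthetic dataset, the hypothetical helper `fit_and_time` and the choice of 2 workers are assumptions for illustration. On a single-core machine such as the one above, little or no speed-up should be expected.

```python
from time import perf_counter

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=2000, n_features=8, noise=5.0, random_state=0)


def fit_and_time(n_jobs):
    """Fit a forest with the given n_jobs and return (model, fit time in s)."""
    model = RandomForestRegressor(
        n_estimators=50, min_samples_leaf=5, random_state=0, n_jobs=n_jobs
    )
    start = perf_counter()
    model.fit(X, y)
    return model, perf_counter() - start


serial_model, serial_time = fit_and_time(n_jobs=1)
parallel_model, parallel_time = fit_and_time(n_jobs=2)

# With the same random_state, the number of workers should change the fit
# time but not the predictions (up to floating-point rounding).
same_predictions = np.allclose(serial_model.predict(X), parallel_model.predict(X))
print(f"n_jobs=1: {serial_time:.2f} s, n_jobs=2: {parallel_time:.2f} s")
print(f"Same predictions: {same_predictions}")
```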


In [90]:
import pandas as pd

from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

models = {
    "Random Forest": RandomForestRegressor(
        min_samples_leaf=5, random_state=0, n_jobs=N_CORES
    ),
    "Hist Gradient Boosting": HistGradientBoostingRegressor(
        max_leaf_nodes=15, random_state=0, early_stopping=False
    ),
}
param_grids = {
    "Random Forest": {"n_estimators": [10, 20, 50, 100]},
    "Hist Gradient Boosting": {"max_iter": [10, 20, 50, 100, 300, 500]},
}
cv = KFold(n_splits=4, shuffle=True, random_state=0)

results = []
for name, model in models.items():
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[name],
        return_train_score=True,
        cv=cv,
    ).fit(X, y)
    result = {"model": name, "cv_results": pd.DataFrame(grid_search.cv_results_)}
    results.append(result)

In [93]:
len(results)

2
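The cv_results_ tables are wide, so it can help to keep only the columns used in the comparison. The sketch below runs on a small synthetic grid search so that it is self-contained. The helper name `summarize` and the column selection are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Small synthetic problem and grid, used only to keep the sketch fast.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
search = GridSearchCV(
    estimator=RandomForestRegressor(min_samples_leaf=5, random_state=0),
    param_grid={"n_estimators": [5, 10]},
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
    return_train_score=True,
).fit(X, y)


def summarize(cv_results, param_name):
    """Keep the timing and score columns for one tuned parameter."""
    columns = [
        f"param_{param_name}",
        "mean_fit_time",
        "mean_score_time",
        "mean_train_score",  # only present when return_train_score=True
        "mean_test_score",
        "std_test_score",
    ]
    return pd.DataFrame(cv_results)[columns]


summary = summarize(search.cv_results_, "n_estimators")
print(summary.to_string(index=False))
```

The same helper could be applied to each entry of results, for example `summarize(results[0]["cv_results"], "n_estimators")`.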

# 3. Performance of Random Forest

In [92]:
results[0]

{'model': 'Random Forest',
 'cv_results':    mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
 0       1.645796      0.430740         0.013860        0.001694   
 1       2.571259      0.271305         0.021254        0.002835   
 2       6.254258      0.142329         0.049803        0.005455   
 3      12.709991      0.032782         0.087468        0.012664   
 
    param_n_estimators                 params  split0_test_score  \
 0                  10   {'n_estimators': 10}           0.775129   
 1                  20   {'n_estimators': 20}           0.780163   
 2                  50   {'n_estimators': 50}           0.787525   
 3                 100  {'n_estimators': 100}           0.790414   
 
    split1_test_score  split2_test_score  split3_test_score  mean_test_score  \
 0           0.802565           0.797065           0.798195         0.793239   
 1           0.808334           0.802799           0.802941         0.798559   
 2           0.812032           0.8

# 4. Performance of Histogram Gradient Boosting

In [94]:
results[1]

{'model': 'Hist Gradient Boosting',
 'cv_results':    mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
 0       0.090888      0.007222         0.010296        0.000744   
 1       0.082654      0.002437         0.011589        0.000909   
 2       0.157705      0.003031         0.024664        0.001274   
 3       0.272489      0.006223         0.044208        0.002045   
 4       0.732775      0.024231         0.109540        0.004732   
 5       1.301649      0.217493         0.191668        0.037432   
 
    param_max_iter             params  split0_test_score  split1_test_score  \
 0              10   {'max_iter': 10}           0.570686           0.582963   
 1              20   {'max_iter': 20}           0.702434           0.720154   
 2              50   {'max_iter': 50}           0.791318           0.804498   
 3             100  {'max_iter': 100}           0.818657           0.825981   
 4             300  {'max_iter': 300}           0.842970           0.845785  

# 5. Plot and visualize the results

In [95]:
import plotly.colors as colors
import plotly.express as px
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    shared_yaxes=True,
    subplot_titles=["Train time vs score", "Predict time vs score"],
)
model_names = [result["model"] for result in results]
colors_list = colors.qualitative.Plotly * (
    len(model_names) // len(colors.qualitative.Plotly) + 1
)

for idx, result in enumerate(results):
    cv_results = result["cv_results"].round(3)
    model_name = result["model"]
    param_name = list(param_grids[model_name].keys())[0]
    cv_results[param_name] = cv_results["param_" + param_name]
    cv_results["model"] = model_name

    scatter_fig = px.scatter(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
        error_x="std_fit_time",
        error_y="std_test_score",
        hover_data=param_name,
        color="model",
    )
    line_fig = px.line(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=1)
    fig.add_trace(line_trace, row=1, col=1)

    scatter_fig = px.scatter(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
        error_x="std_score_time",
        error_y="std_test_score",
        hover_data=param_name,
    )
    line_fig = px.line(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=2)
    fig.add_trace(line_trace, row=1, col=2)

fig.update_layout(
    xaxis=dict(title="Train time (s) - lower is better"),
    yaxis=dict(title="Test R2 score - higher is better"),
    xaxis2=dict(title="Predict time (s) - lower is better"),
    legend=dict(x=0.72, y=0.05, traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text="Speed-score trade-off of tree-based ensembles"),
)

# 6. Conclusion

Overall, the Histogram Gradient Boosting models should usually dominate the Random Forest models in the “test score vs training speed” trade-off: the HGBT curve should sit to the top left of the RF curve without ever crossing it. In the run above, on a single physical core, HGBT with max_iter=100 reached a test R2 score of about 0.82 on the first split in roughly 0.27 s of fit time, while RF with n_estimators=100 reached about 0.79 in roughly 12.7 s. The “test score vs prediction speed” trade-off is less clear-cut, but it is most often favorable to HGBT as well.

It is always a good idea to check both kinds of model (with hyper-parameter tuning) and compare their performance on your specific problem to determine which is the best fit. Still, HGBT almost always offers a more favorable speed-accuracy trade-off than RF, whether with the default hyper-parameters or when the hyper-parameter tuning cost is included.
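The “top left” claim can also be checked numerically. The sketch below uses the (mean_fit_time, split0_test_score) pairs visible in the truncated outputs above. It uses the first split only because the mean test scores are cut off. The helper `is_dominated` is a hypothetical name, and the result reflects this single run rather than a general guarantee.

```python
# (mean fit time in seconds, test R2 on split 0), copied from the outputs above.
rf_points = [
    (1.645796, 0.775129),   # n_estimators=10
    (2.571259, 0.780163),   # n_estimators=20
    (6.254258, 0.787525),   # n_estimators=50
    (12.709991, 0.790414),  # n_estimators=100
]
hgbt_points = [
    (0.090888, 0.570686),  # max_iter=10
    (0.082654, 0.702434),  # max_iter=20
    (0.157705, 0.791318),  # max_iter=50
    (0.272489, 0.818657),  # max_iter=100
    (0.732775, 0.842970),  # max_iter=300
]


def is_dominated(point, others):
    """True if some other point is at least as fast and at least as accurate."""
    fit_time, score = point
    return any(t <= fit_time and s >= score for t, s in others)


rf_dominated = [is_dominated(p, hgbt_points) for p in rf_points]
hgbt_dominated = [is_dominated(p, rf_points) for p in hgbt_points]
print(f"RF configurations dominated by HGBT: {sum(rf_dominated)} of {len(rf_points)}")
print(f"HGBT configurations dominated by RF: {sum(hgbt_dominated)} of {len(hgbt_points)}")
```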

There is one exception to this rule of thumb, though: when training a multiclass classification model with a large number of possible classes, HGBT internally fits one tree per class at each boosting iteration, whereas the trees used by RF models are natively multiclass. This should improve the speed-accuracy trade-off of the RF models in that case.