
Replace random forest #1116

Open
benjamc opened this issue Jun 24, 2024 · 1 comment
Labels
dependency Pull requests that update a dependency file.

benjamc (Contributor) commented Jun 24, 2024

Issue: Installation of the C++ random forest is difficult; replace it with something Pythonic.

H.S.:

  • RF implemented as described in "Algorithm runtime prediction: Methods & evaluation" by Frank Hutter, Lin Xu, Holger H. Hoos, and Kevin Leyton-Brown, Section 4.3.2. SMAC uses the same implementation but different hyperparameters.
  • Changes related to bias/variance → F. reduces variance.
  • In SMAC: the law of total variance is not used in the computation. H.S. tried using it, which led to worse performance.
  • max_features is really function dependent but should not be a problem. Maybe optimizing the hyperparameters of the scikit-learn forest is enough. BBOB: extremely randomized forests (scikit-learn / skopt, bias/variance) work a bit better.
  • Idea: Integrate skopt models into SMAC
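As a rough illustration of the scikit-learn route mentioned above, here is a sketch (not SMAC or skopt code; the toy objective and all hyperparameters are made up) of getting a mean and an empirical variance out of extremely randomized trees by aggregating per-tree predictions:

```python
# Sketch only: estimate predictive mean and variance with scikit-learn's
# ExtraTreesRegressor (extremely randomized trees) by aggregating per-tree
# predictions -- a possible Pythonic stand-in for the C++ random forest.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, size=(100, 2))
y = np.sum(X**2, axis=1)  # toy objective (not from the issue)

model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

X_test = rng.uniform(-5.0, 5.0, size=(5, 2))
# One row per tree: shape (n_trees, n_points)
per_tree = np.stack([tree.predict(X_test) for tree in model.estimators_])
mean = per_tree.mean(axis=0)  # surrogate mean
var = per_tree.var(axis=0)    # empirical variance across trees
```

Whether this variance estimate is good enough for the acquisition function is exactly the bias/variance question raised above.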
@hadarshavit

I investigated it a bit more.
In the original SMAC (see the extended version: https://www.cs.ubc.ca/labs/algorithms/Projects/SMAC/papers/10-TR-SMAC.pdf, Section 4.1, "Transformations of the Cost Metric"), the authors explain the transformation applied when aggregating the leaf samples, which happens in line 222 of the current SMAC implementation:

preds_as_array = np.log(np.nanmean(np.exp(preds_as_array), axis=2) + VERY_SMALL_NUMBER)

Note that the current implementation computes each leaf value for every sample, which can also create huge matrices (the preds_as_array matrix).
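To make that aggregation concrete, here is a minimal reproduction of the quoted line on a tiny hand-built array; the shape (n_points, n_trees, n_leaf_samples) and the NaN-padding convention are my assumptions, not verified against SMAC's source:

```python
# Sketch: reproduce the log-space aggregation from the quoted line.
# Assumed shape of preds_as_array: (n_points, n_trees, n_leaf_samples),
# with NaN padding for leaves holding fewer samples (ignored by nanmean).
import numpy as np

VERY_SMALL_NUMBER = 1e-10  # guards log() against exact zeros

preds_as_array = np.log(np.array([
    [[1.0, 2.0, np.nan],      # point 0, tree 0: leaf samples 1.0 and 2.0
     [3.0, np.nan, np.nan]],  # point 0, tree 1: single leaf sample 3.0
]))

# Average in the original (exp) scale, then map back to log space.
preds = np.log(np.nanmean(np.exp(preds_as_array), axis=2) + VERY_SMALL_NUMBER)
# preds has shape (n_points, n_trees); preds[0, 0] is log(mean(1.0, 2.0))
```

The key point is that the mean is taken in the original cost scale, not the log scale, which is why a plain mean over log-space leaf values would give a different (geometric-mean) result.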

I checked the scikit-learn implementation of random forests. There is an option to set the DecisionTreeRegressor splitter to "random" instead of "best" (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor), which I think is more similar to the SMAC implementation.
To support the log transformation, a change to the criterion is required (i.e., computing the node value in a different way: https://github.com/scikit-learn/scikit-learn/blob/4aeb191100f409c880d033683972ab9f47963fa4/sklearn/tree/_criterion.pyx#L1032). Such a change should be possible, as different criteria already use different terminal values ("MSE and Poisson deviance both set the predicted value of terminal nodes to the learned mean value of the node whereas the MAE sets the predicted value of terminal nodes to the median", from https://scikit-learn.org/stable/modules/tree.html#regression-criteria).
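For reference, the splitter option itself is just a constructor argument; everything in this sketch besides splitter="random" (the toy data and hyperparameters) is made up:

```python
# Sketch: DecisionTreeRegressor with splitter="random" draws random split
# thresholds instead of searching exhaustively for the best one, which is
# closer in spirit to SMAC's randomized trees than the default "best".
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)  # toy target

tree = DecisionTreeRegressor(splitter="random", random_state=0).fit(X, y)
pred = tree.predict(X[:5])
```

The custom terminal-value computation would still require a Cython criterion subclass; the splitter swap alone only changes how split points are chosen, not how leaf values are aggregated.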

@benjamc benjamc added this to the v2.3 milestone Oct 16, 2024