Refactoring proposal to enable large-scale forestError test set computations #6
Comments
I'm on board for this. It's a good idea, and one that I remember we've discussed before. Thanks for putting together this well-thought-out roadmap for how to implement it. I actually have a bit of time over the coming weeks, so I'd also be happy to make these changes if you haven't already and have other things on your plate. Let me know how you'd like to proceed.
As it happens, I'm in any-day-now mode waiting for a new baby... so if you're freer to implement, thanks! I'd be happy to do some extra testing and performance grinding over the coming months.
Oh, wow, definitely let me handle this then. Best wishes to you and your family!
Just pushed the major changes. I still haven't modularized the computation of
The refactoring is complete and uploaded to CRAN. Please let me know if you encounter any issues using the new version. Thanks for the proposal! |
@benjilu hope all's well since we last caught up on the package -- still using it all the time.
Under the current design, `forestError` calls scale reasonably well out to 50k or 100k test-set observations for forests with moderate tree counts (100-200). Beyond that point, growth in either test-set size or tree count makes the row multiplication in the edgelist join of `train_node` to `test_node` break memory limits: the joined table holds roughly one row per (test observation, tree) pair, each further multiplied by the number of OOB training observations sharing that terminal node.
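A back-of-envelope count shows the scale of the problem. All numbers below are illustrative assumptions, not measurements from the package:

```r
# Rough row count for the edgelist join (illustrative assumptions only).
n_test       <- 1e5  # test observations
n_trees      <- 200  # trees in the forest
oob_per_node <- 25   # assumed avg OOB training obs per terminal node
rows_after_join <- n_test * n_trees * oob_per_node
rows_after_join  # 5e8 rows, far past memory limits at several bytes per column
```

Even before aggregation, a join of that size dominates memory regardless of how efficiently the per-row work is done.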
It's still possible to take a large test set and iterate, doing e.g.
... but that recomputes everything to do with the training set in every iteration over batch ID.
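For reference, the elided batching loop might look like this. This is a sketch, not the original snippet; `rf`, `X.train`, `Y.train`, `X.test`, and `batch_size` are hypothetical names, and `$estimates` assumes the default return shape:

```r
# Sketch: batching a large test set through the current API.
batch_id <- ceiling(seq_len(nrow(X.test)) / batch_size)
estimates_list <- lapply(split(seq_len(nrow(X.test)), batch_id), function(idx) {
  # Each call redoes all training-set work from scratch.
  quantForestError(rf, X.train, X.test[idx, , drop = FALSE], Y.train)$estimates
})
estimates <- do.call(rbind, estimates_list)
```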
So, here's a proposal for a moderately large refactoring of the `quantForestError` function into independently reusable components. The main objective would be to separate the two costly parts of the computation: (1) turning a forest into the train-set tree/node OOB error data structure used internally, and returning it; (2) taking in that training error data and returning the computed test-set statistics.
This can easily be wrapped in a single function (like now) so existing code doesn't break.
`quantForestError` would gain two optional parameters (e.g. `use_train_nodes = NULL` and `return_train_nodes = FALSE`):

- If `return_train_nodes` is `TRUE`, return the final computed form of `long_train_nodes` alongside the usual output.
- If `use_train_nodes` is not `NULL`, it must be a `long_train_nodes` data object returned from a prior computation, so the steps required to create it can be skipped on a second pass.

Internally, we'd want to change a few things:

- The computation of `long_train_nodes` should happen in a (probably also exported) function called within `quantForestError`.
- The computation of the `estimates` dataframe for a specific test set should get broken out.

A few other benefits of doing this, besides scalability:

- Having `long_train_nodes` available separately allows for experimentation with other ways of summarizing the forest errors, beyond just bias and quantile statistics; here I'm thinking of e.g. clustering and second-stage bias-correction models beyond a node-wise mean. Those wouldn't fit naturally inside a single function but could be other bolt-ons later.

I'm pretty happy to do the work and send a PR over the next couple of weeks, but this is a very large set of changes and I wanted to check first. Happy to fork/modify on my own outside of the main/CRAN version if you prefer that too.
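For concreteness, the proposed two-pass usage might look like the following sketch. The parameter names come from this proposal; the return-value shape (e.g. a `train_nodes` element) is my assumption, not a settled design:

```r
# Sketch of the proposed interface (hypothetical return shape).
# First pass: compute test-set estimates AND keep the training structure.
first <- quantForestError(forest, X.train, X.test.batch1, Y.train,
                          return_train_nodes = TRUE)
train_nodes <- first$train_nodes  # precomputed tree/node OOB error data

# Later passes: reuse the structure and skip all training-set computation.
second <- quantForestError(forest, X.train, X.test.batch2, Y.train,
                           use_train_nodes = train_nodes)
```

Because the wrapper keeps its current signature and defaults, existing callers would be unaffected.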