Refactoring proposal to enable large-scale forestError test set computations #6

Closed
sheffe opened this issue Jul 28, 2021 · 5 comments

Comments

@sheffe
Contributor

sheffe commented Jul 28, 2021

@benjilu hope all's well since we last caught up on the package -- still using it all the time.

Under the current design, forestError calls scale reasonably well out to 50k or 100k observations in the test set for forests with moderate (100-200) tree counts. Beyond that point, whether in test set observations or tree count, the multiplicative blow-up in row counts from the edgelist join of train_node to test_node starts to break memory limits.

It's still possible to take a large test set and iterate, doing e.g.

huge_test_df %>%
  split(.$batch_id) %>%
  purrr::map(
    .f = ~quantForestError(
      forest = forest,
      X.train = trainset[c("x1", "x2", "x3")],
      X.test = .x[c("x1", "x2", "x3")],
      Y.train = trainset$y
    )
  )

... but that recomputes everything to do with the training set in every iteration over batch ID.

So, here's a proposal for a moderately large refactoring of the quantForestError function into independently reusable components. The main objective would be to separate the two costly parts of the computation: (1) turn a forest into the trainset tree/node OOB error data structure we use internally, and return it; (2) take in the training error data and return the computed test set statistics.

This can easily be wrapped in a single function (like now) so existing code doesn't break.

  • The interface to quantForestError would gain two optional parameters (e.g. use_train_nodes = NULL and return_train_nodes = FALSE).
  • If return_train_nodes is TRUE, also return the final computed form of long_train_nodes.
  • If use_train_nodes is not NULL, it must be a long_train_nodes data object returned from a prior computation, so we can skip the steps required to create it on a second pass.
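To make the interface concrete, here's a rough sketch of the two-pass workflow this would enable. The parameter names (use_train_nodes, return_train_nodes) and the shape of the return value are assumptions from this proposal, not an existing API:

```r
# Pass 1: run one batch with return_train_nodes = TRUE (hypothetical
# parameter) so the training-set OOB error structure comes back with
# the estimates and can be reused.
first_batch <- quantForestError(
  forest = forest,
  X.train = trainset[c("x1", "x2", "x3")],
  X.test = huge_test_df[huge_test_df$batch_id == 1, c("x1", "x2", "x3")],
  Y.train = trainset$y,
  return_train_nodes = TRUE
)

# Pass 2+: feed the precomputed structure back in via use_train_nodes
# (also hypothetical), skipping the expensive trainset steps per batch.
rest <- huge_test_df %>%
  dplyr::filter(batch_id != 1) %>%
  split(.$batch_id) %>%
  purrr::map(
    .f = ~quantForestError(
      forest = forest,
      X.train = trainset[c("x1", "x2", "x3")],
      X.test = .x[c("x1", "x2", "x3")],
      Y.train = trainset$y,
      use_train_nodes = first_batch$long_train_nodes
    )
  )
```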

Internally, we'd want to change a few things:

  • The input-forest class-specific extraction of data/params would need to be broken out further, so we're not grabbing forest/train/test params all at once;
  • Computing the final form of long_train_nodes should happen in a (probably also exported) function called within quantForestError;
  • Computing the estimates dataframe for a specific test set should get broken out
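In pseudocode, the internal split might look something like the following (all function names here are placeholders for illustration, not proposals for the final API):

```r
# 1. Break the class-specific extraction apart, so forest, train, and
#    test params are grabbed independently rather than all at once.
forest_params <- extract_forest_params(forest)  # placeholder name

# 2. Exported helper: build the final long_train_nodes structure
#    (the tree/node OOB error data) from the forest and training data.
long_train_nodes <- compute_train_nodes(forest, X.train, Y.train)  # placeholder

# 3. Exported helper: compute the estimates dataframe for one test set
#    from the precomputed training error structure.
estimates <- estimate_test_errors(long_train_nodes, forest, X.test)  # placeholder

# quantForestError() then becomes a thin wrapper over steps 2 and 3,
# so existing code keeps working unchanged.
```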

A few other benefits of doing this, besides scalability, would be

  • IMHO this will make the code overall easier to understand by separating concerns; the method is mostly encapsulated in a very long single function as things stand
  • Returning long_train_nodes separately allows for experimentation on other ways of summarizing the forest errors, beyond just bias and quantile statistics; here I'm thinking of e.g. clustering and second-stage bias correction models beyond a node-wise mean. Those wouldn't fit naturally inside a single function but could be other bolt-ons later.

I'm pretty happy to do the work and send a PR over the next couple of weeks, but this is a very large set of changes and wanted to check first. Happy to fork/modify on my own outside of the main/CRAN version if you prefer that too.

@benjilu
Owner

benjilu commented Aug 3, 2021

I'm on board for this. It's a good idea, and one that I remember we've discussed before. Thanks for putting together this well-thought-out roadmap for how to implement it.

I actually have a bit of time over the coming weeks, so I'd also be happy to make these changes if you haven't already and have other things on your plate. Let me know how you'd like to proceed.

@sheffe
Contributor Author

sheffe commented Aug 3, 2021

As it happens, I'm expecting a new baby any day now... so if you're freer to implement, thanks! I'd be happy to do some extra testing and performance grinding over the coming months.

@benjilu
Owner

benjilu commented Aug 3, 2021

Oh, wow, definitely let me handle this then. Best wishes to you and your family!

@benjilu
Owner

benjilu commented Aug 8, 2021

Just pushed the major changes. I still haven't modularized the computation of estimates, but, as far as I can tell, that should be straightforward to implement. Will do that and more thoroughly test the package soon.

@benjilu
Owner

benjilu commented Aug 10, 2021

The refactoring is complete and uploaded to CRAN. Please let me know if you encounter any issues using the new version. Thanks for the proposal!

@benjilu benjilu closed this as completed Aug 10, 2021