Refactoring proposal to enable large-scale forestError test set computations #6
Comments
I'm on board for this. It's a good idea, and one that I remember we've discussed before. Thanks for putting together this well-thought-out roadmap for how to implement it. I actually have a bit of time over the coming weeks, so I'd also be happy to make these changes if you haven't already and have other things on your plate. Let me know how you'd like to proceed.
As it happens, I'm in any-day-now mode waiting for a new baby... so if you're freer to implement, thanks! I'd be happy to do some extra testing and performance grinding over the coming months.
Oh, wow, definitely let me handle this then. Best wishes to you and your family!
Just pushed the major changes. I still haven't modularized the computation of
The refactoring is complete and uploaded to CRAN. Please let me know if you encounter any issues using the new version. Thanks for the proposal! |
@benjilu hope all's well since we last caught up on the package -- still using it all the time.
Under the current design, `forestError` calls scale reasonably well out to 50k or 100k test-set observations for forests with moderate tree counts (100-200). Beyond that point, growth in either test-set size or tree count makes the row multiplication in the edgelist join of `train_node` to `test_node` break memory limits: the joined table holds roughly one row per (test observation, tree) pair, each further multiplied by the number of OOB training observations sharing that terminal node.
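A back-of-envelope count shows the scale of the problem. All numbers below are illustrative assumptions, not measurements from the package:

```r
# Rough row count for the edgelist join (illustrative assumptions only).
n_test       <- 1e5  # test observations
n_trees      <- 200  # trees in the forest
oob_per_node <- 25   # assumed avg OOB training obs per terminal node
rows_after_join <- n_test * n_trees * oob_per_node
rows_after_join  # 5e8 rows, far past memory limits at several bytes per column
```

Even before aggregation, a join of that size dominates memory regardless of how efficiently the per-row work is done.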
It's still possible to take a large test set and iterate, doing e.g.
... but that recomputes everything to do with the training set in every iteration over batch ID.
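For reference, the elided batching loop might look like this. This is a sketch, not the original snippet; `rf`, `X.train`, `Y.train`, `X.test`, and `batch_size` are hypothetical names, and `$estimates` assumes the default return shape:

```r
# Sketch: batching a large test set through the current API.
batch_id <- ceiling(seq_len(nrow(X.test)) / batch_size)
estimates_list <- lapply(split(seq_len(nrow(X.test)), batch_id), function(idx) {
  # Each call redoes all training-set work from scratch.
  quantForestError(rf, X.train, X.test[idx, , drop = FALSE], Y.train)$estimates
})
estimates <- do.call(rbind, estimates_list)
```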
So, here's a proposal for a moderately large refactoring of the `quantForestError` function into independently reusable components. The main objective would be to separate the two costly parts of the computation: (1) turning a forest into the train-set tree/node OOB error data structure used internally, and returning it; (2) taking in that training error data and returning the computed test-set statistics.
This can easily be wrapped in a single function (like now) so existing code doesn't break.
`quantForestError` would gain two optional parameters (e.g. `use_train_nodes = NULL` and `return_train_nodes = FALSE`):

- If `return_train_nodes` is `TRUE`, return the final computed form of `long_train_nodes` alongside the usual output.
- If `use_train_nodes` is not `NULL`, it must be a `long_train_nodes` data object returned from a prior computation, so the steps required to create it can be skipped on a second pass.

Internally, we'd want to change a few things:

- The computation of `long_train_nodes` should happen in a (probably also exported) function called within `quantForestError`.
- The computation of the `estimates` dataframe for a specific test set should get broken out.

A few other benefits of doing this, besides scalability:

- Having `long_train_nodes` available separately allows for experimentation with other ways of summarizing the forest errors, beyond just bias and quantile statistics; here I'm thinking of e.g. clustering and second-stage bias-correction models beyond a node-wise mean. Those wouldn't fit naturally inside a single function but could be other bolt-ons later.

I'm pretty happy to do the work and send a PR over the next couple of weeks, but this is a very large set of changes and I wanted to check first. Happy to fork/modify on my own outside of the main/CRAN version if you prefer that too.
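For concreteness, the proposed two-pass usage might look like the following sketch. The parameter names come from this proposal; the return-value shape (e.g. a `train_nodes` element) is my assumption, not a settled design:

```r
# Sketch of the proposed interface (hypothetical return shape).
# First pass: compute test-set estimates AND keep the training structure.
first <- quantForestError(forest, X.train, X.test.batch1, Y.train,
                          return_train_nodes = TRUE)
train_nodes <- first$train_nodes  # precomputed tree/node OOB error data

# Later passes: reuse the structure and skip all training-set computation.
second <- quantForestError(forest, X.train, X.test.batch2, Y.train,
                           use_train_nodes = train_nodes)
```

Because the wrapper keeps its current signature and defaults, existing callers would be unaffected.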