
Speedups for epix_slide #386

Merged: 19 commits, Dec 14, 2023

Conversation

@nmdefries (Contributor) commented Dec 4, 2023:

Speeds epix_slide up ~2x on a series of test cases.

@nmdefries nmdefries marked this pull request as draft December 4, 2023 22:44
R/epi_df.R (outdated review thread)
The review thread is anchored on this hunk:

```r
.drop = private$drop) %>%
  dplyr::group_modify(group_modify_fn,
                      f = f, ...,
                      ref_time_value = ref_time_value,
                      new_col = new_col,
                      .keep = TRUE)
)
})
}) %>% rbindlist()
```
@brookslogan (Contributor) commented Dec 4, 2023:

Nice speedup! Is this going to change the output class of our result? This might need to have a setDF + as_tibble (but check to make sure we can't possibly be aliasing something, e.g., in the 1-reference-time case).
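As a standalone sketch (not the PR's code) of the class concern: `rbindlist()` returns a data.table, so restoring a tibble takes an explicit `setDF()` plus `as_tibble()`:

```r
library(data.table)
library(tibble)

# Two per-group results, as tibbles:
parts <- list(
  tibble(time_value = as.Date("2012-01-01"), slide_value = 5),
  tibble(time_value = as.Date("2012-01-02"), slide_value = 6)
)

res <- rbindlist(parts)
class(res)  # "data.table" "data.frame": the tibble class is gone

setDF(res)            # strip the data.table class in place (no copy)
res <- as_tibble(res) # restore the tibble class
class(res)            # "tbl_df" "tbl" "data.frame"
```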

@nmdefries (Contributor, author) replied:

I fixed the class aspect, but I'm not sure about the aliasing. My attempt at investigating this used lobstr::ref to compare the addresses of the list of epi_dfs containing computed values against the tibble of computed values after rbindlist:

```r
> data$DT

# Input data
   geo_value sex time_value    version value
1:        al   M 2012-01-01 2012-01-01    25
2:        ca   M 2012-01-01 2012-01-01   100

> data %>%
+   epix_slide(f = ~ 5, before = 0, ref_time_values = as.Date("2012-01-01")) %>%
+   ungroup()

# Output of lobstr::ref(<list of epi_dfs containing computed values>,
#                       <tibble of computed values after rbindlist>)
█ [1:0x559b7f330b08] <list>
└─█ [2:0x559b80ee4dd8] <tibble[,2]>
  ├─time_value = [3:0x559b7f330b40] <date>
  └─slide_value = █ [4:0x559b7f467918] <list>
                  └─[5:0x559b7f331160] <dbl>

█ [6:0x559b80ee37d8] <tibble[,2]>
├─time_value = [7:0x559b7f4842f8] <date>
└─slide_value = █ [8:0x559b7f484288] <list>
                └─[5:0x559b7f331160]

# Output of epix_slide
# A tibble: 1 × 2
  time_value slide_value
* <date>           <dbl>
1 2012-01-01           5
```

Note that node [5:0x559b7f331160] appears in both trees: the computed dbl vector is shared between the per-group epi_dfs and the final tibble, which suggests the output does alias the intermediate results.
@brookslogan (Contributor) replied:

Does this maintain the speed benefits? group_by could potentially be costly. I'll try to double-check the no-aliasing part.

@nmdefries (Contributor, author) replied Dec 12, 2023:

> Does this maintain the speed benefits? group_by could potentially be costly.

Yes, it seems to. I was also wondering about the time spent in that group_by, but it didn't seem to have much impact. Edit: it's called fewer times now because it's not in the inner loop.

@brookslogan (Contributor) replied:

Great! Regarding aliasing:

  • Actually, we don't need to avoid it; we are fine with our outputs aliasing things, I think.
  • I've updated the docs to reflect the above and some other mutation & aliasing details. Please sanity-check. (I hope we can eventually simplify and make all these notes internal by providing a filter method and hiding the DT away from the user.)
  • We already probably don't ever alias.
    • all_versions = FALSE gets input that has gone through a DT -> tibble conversion using as_tibble with no setDT, which should prevent DT/column aliasing
    • all_versions = TRUE has a manual step making sure we don't alias the DT (I'm actually not sure why now... maybe I was thinking we were going to be mutating it like in fill_through_version), but it might somehow alias the columns (I'm not sure this is possible... but since the aliasing behavior of unique didn't appear to be documented before, we should probably assume it could change to alias columns instead of the DT).
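A hedged way to check the aliasing concerns above: `data.table::address()` exposes whether two objects share memory (toy example, not the archive's actual DT):

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT        # plain assignment: both names point at the same object
deep  <- copy(DT)  # copy() forces a deep copy

address(DT) == address(alias)  # TRUE: same underlying object
address(DT) == address(deep)   # FALSE: independent copy

# A DT -> tibble conversion without setDT() likewise allocates a new
# container, which is why the all_versions = FALSE path above should
# not alias the archive's DT.
```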

@nmdefries (Contributor, author) replied:

Thanks for adding all of those comments, they look good.

I'll remove some pipes as a final change.

@brookslogan (Contributor) replied:

Did removing pipes actually make a difference in your testing? I just tried re-introducing magrittr pipes to what I thought would be the highest-traffic pipe operations, the group_by and group_modify inside the FUN arg to lapply, but don't see any real change.

@brookslogan (Contributor) replied:

Also, it'd be nice if you could summarize which changes here seemed highest-impact, and note any other places you think might benefit significantly from optimization.

@nmdefries (Contributor, author) replied:

I don't see that de-piping changed runtime. Theoretically, extra function calls always take more time, but the cost is just so low compared to everything else we're doing here.

On the other hand, not using pipes does make profiles easier to read. Pipes somehow cause functions to be marked as not having a sensible parent call, so e.g. group_modify shows up in an unexpected spot on the profile that doesn't match the code.
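For illustration (toy data, not the PR's code), here is the same computation written piped and de-piped; the results match, but profilers attribute the nested form to the expected call site:

```r
library(dplyr)

piped <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

nested <- summarize(group_by(mtcars, cyl), mean_mpg = mean(mpg))

all.equal(piped, nested)  # TRUE: same result, different profile shape
```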

@brookslogan (Contributor) commented:

These are nice cheap speedups. FYI a couple other potential avenues at a much higher cost:

  • Changing the backend... I've checked out a few, but so far no improvements. We might still need to check whether some of these alternatives have explicit indexing options to speed them up.
  • Partitioning DT based on the requested ref_time_values, squashing each partition's updates using unique with fromLast = TRUE (or a rolling join if it's faster), then applying those squashed updates in order to build the snapshots sequentially.
  • Parallelism stuff. Conflicts with the sequential trick above.
  • Maybe partitioning DT by group or epikey before looping over the as-of logic (rather than while processing each snapshot), especially if rolling joins are involved.

These are probably recorded in another issue or issues somewhere. All definitely seem more involved than these super-simple improvements, but if any catch your eye, feel free to try them! (Probably separately, it will be nice to have these simple improvements first!)
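A hedged sketch of the "squash updates" bullet above; the column layout follows the example input shown earlier in this thread, not the archive's actual internals:

```r
library(data.table)

# Three successive revisions of the same (geo_value, time_value) key:
updates <- data.table(
  geo_value  = "ca",
  time_value = as.Date("2012-01-01"),
  version    = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03")),
  value      = c(100, 110, 120)
)

# Squash to the latest revision of each key; this assumes `updates` is
# already sorted by version within key, as an archive's DT would be.
squashed <- unique(updates, by = c("geo_value", "time_value"), fromLast = TRUE)
squashed$value  # 120: only the most recent revision survives
```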

R/grouped_epi_archive.R (outdated review thread)
@nmdefries (Contributor, author) commented Dec 5, 2023:

> other potential avenues

@brookslogan If there are other particular epiprocess use cases you'd like to be faster, I can profile and investigate those as well.

The above changes were based on a task from the examples:

```r
# dataset is 2000 obs, but individual changes were benchmarked on
# datasets of varying size (20-20k)
dataset %>%
  group_by(geo_value) %>%
  epix_slide(f = ~ 5, before = 7) %>%
  ungroup()
```

R/grouped_epi_archive.R (outdated review thread)
@nmdefries (Contributor, author) commented Dec 6, 2023:

Profiles show that messaging takes a fair amount of time. dplyr::group_map and dplyr::group_split look like:

```r
function (.data, .f, ..., .keep = FALSE)
{
    lifecycle::signal_stage("experimental", "group_map()")
    UseMethod("group_map")
}
```

with the lifecycle bit taking a lot of time. After the changes already included in this PR, the call to signal_stage takes ~1 of 8 s in a test call. This seems easy to remove, if we can get dplyr's other functions to use a modified version of these two functions. I'm assuming this is bad practice, and maybe not impactful enough to be worth the potential jankiness/fragility, but I'm curious about other takes on this idea.
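A hedged sketch of what that modified version could look like (not endorsed above, and `fast_group_map` is a hypothetical name): a local generic that dispatches to dplyr's registered group_map methods while skipping the lifecycle signal:

```r
library(dplyr)

fast_group_map <- function(.data, .f, ..., .keep = FALSE) {
  # Same dispatch as dplyr::group_map, minus lifecycle::signal_stage();
  # the inner group_split call still signals, so this only removes the
  # outer signal.
  UseMethod("group_map")
}

out <- mtcars %>%
  group_by(cyl) %>%
  fast_group_map(~ nrow(.x))

length(out)  # one element per cyl group
```

Whether a ~10% win is worth this fragility is the open question in the thread.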

@dshemetov (Contributor) commented:

Seems like something that would cause more development work downstream as those functions change, so I'm not that thrilled about it.

Question: are group_map and group_split called very often? Is there a way for us to get around this by calling them less often?

@nmdefries (Contributor, author) replied:

Yeah, it's a bad idea... but tempting since the change is so small 😀 We use group_map and group_split indirectly; I'm not entirely sure where they're coming from.

@dshemetov (Contributor) replied:

It does seem like an easy change for ~10% speed boost!

Hm, since they're used indirectly, does this mean loading epiprocess would have to overwrite those functions in the dplyr namespace?

@nmdefries (Contributor, author) replied:

> epiprocess would have to overwrite those functions in the dplyr namespace?

Yep. I was hoping there could be a workaround, but unclear.

@nmdefries nmdefries marked this pull request as ready for review December 7, 2023 03:22
@brookslogan (Contributor) commented:

> @brookslogan If there are other particular epiprocess use cases you'd like to be faster, I can profile and investigate those as well.

I don't know yet. I didn't expect epix_slide overhead to even be an issue for pseudoprospective forecasting, since the fitting routines we were using a while back were pretty slow. In the process of finishing up some chapters I will hopefully get a sense of what it feels like for lag extraction, and I might ask you to profile if it seems slow. There's a lot of specialized stuff we could do for lag extraction, though; it doesn't necessarily have to be a slide operation.

Re. group_map and group_split: if we could refactor to call them directly, then we could also easily refactor to call our own versions of group_map and group_split that UseMethod without the lifecycle call, though this has some chance of breakage if dplyr decides to move some pre/post-processing into the generic. But we also might want to try to "group split" the archive on the outside and loop over ref_time_values on the inside, like we do in epi_slide, rather than the other way around; this could potentially avoid some unneeded sorts by group_by inside the ref_time_value loop, though it might be hard to determine beforehand which approach incurs more function-call overhead.

@nmdefries (Contributor, author) commented:

@brookslogan if you review this before I'm back and everything looks good, feel free to merge.

@brookslogan (Contributor) left a review comment:

Looks good to me! I have some trailing requests about documenting whether the de-piping was worth it and which optimizations had the most impact, but this seems ready to merge.

@nmdefries (Contributor, author) commented Dec 14, 2023:

Slow bits in epix_slide from a test call. 23.5 s is the baseline runtime.

  • [done] 25% of time (5.8 of 23.5 s) spent in x$slide -> map_dfr -> bind_rows
  • [done] 5% (1.2 of 23.5 s) spent in as_epi_df.tbl_df -> new_epi_df -> relocate.data.frame
  • [done] ?? spent in group_by-across-tidyselect::all_of
  • [done] ?? spent in turning computation output into a tibble
  • [probably intractable] 16% of time (3.7 of 23.5 s) spent in group_modify.grouped_df -> group_map -> lifecycle::signal_stage or group_modify.grouped_df -> group_map.data.frame -> group_split -> lifecycle::signal_stage
  • [intractable] 25% of time (6 of 23.5 s) spent in signal, from group_modify.epi_df -> dplyr_reconstruct -> dplyr_reconstruct.grouped_df -> grouped_df -> compute_groups -> signal, from group_split -> signal, and from other places (apparently a logging feature, not necessary for production: https://github.com/tidyverse/dplyr/pull/4751/files). Too deep in dplyr to change.

It was hard to tell in the profile of Dmitry's tests which functions were coming from epiprocess (pipe-caused "unaffiliation" might be part of the problem). I could tell that epix_slide -> anonymous -> purrr::map_dfr -> group_by was slow, and that epi_df.R::new_epi_df -> relocate.data.frame was slow. Beyond those, there were many slow dplyr functions not clearly coming from epiprocess (unnest.data.frame, arrange.data.frame, rename.data.frame, group_by.data.frame, select.data.frame, mutate.data.frame).

@nmdefries (Contributor, author) commented:
Code to profile, and output of baseline profile.

baseline_profile_example.tar.gz

@nmdefries (Contributor, author) commented Dec 14, 2023:
Remaining slow parts are doing group_by and group_modify on different objects inside the lapply-compute loop.

[Screenshot of profile, 2023-12-14]

20231214_profile.tar.gz

We can't just swap in a different function here, so, as Logan was saying above (here and here), this gets into more demanding changes. We should figure out which particular types of requests/computations are slow, and how much slower than desired, before spending a lot of time on a rewrite.

@nmdefries nmdefries merged commit b444a3c into dev Dec 14, 2023
2 checks passed
@nmdefries nmdefries deleted the ndefries/speedups branch December 14, 2023 22:01