
Speedups for epix_slide #386

Merged: 19 commits, Dec 14, 2023

Conversation

@nmdefries (Contributor) commented Dec 4, 2023:

Speeds epix_slide up ~2x on a series of test cases.

@nmdefries nmdefries marked this pull request as draft December 4, 2023 22:44
R/epi_df.R (outdated review thread)
The review thread is anchored on this hunk:

```r
.drop = private$drop) %>%
  dplyr::group_modify(group_modify_fn,
                      f = f, ...,
                      ref_time_value = ref_time_value,
                      new_col = new_col,
                      .keep = TRUE)
)
})
}) %>% rbindlist()
```
@brookslogan (Contributor) commented Dec 4, 2023:

Nice speedup! Is this going to change the output class of our result? This might need to have a setDF + as_tibble (but check to make sure we can't possibly be aliasing something, e.g., in the 1-reference-time case).
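As a standalone sketch (not the PR's code) of the class concern: `rbindlist()` returns a data.table, so restoring a tibble takes an explicit `setDF()` plus `as_tibble()`:

```r
library(data.table)
library(tibble)

# Two per-group results, as tibbles:
parts <- list(
  tibble(time_value = as.Date("2012-01-01"), slide_value = 5),
  tibble(time_value = as.Date("2012-01-02"), slide_value = 6)
)

res <- rbindlist(parts)
class(res)  # "data.table" "data.frame": the tibble class is gone

setDF(res)            # strip the data.table class in place (no copy)
res <- as_tibble(res) # restore the tibble class
class(res)            # "tbl_df" "tbl" "data.frame"
```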

@nmdefries (Contributor, author) replied:

I fixed the class aspect, but I'm not sure about the aliasing. My attempt at investigating this used lobstr::ref to compare the addresses of the list of epi_dfs containing computed values against the tibble of computed values after rbindlist:

```r
> data$DT

# Input data
   geo_value sex time_value    version value
1:        al   M 2012-01-01 2012-01-01    25
2:        ca   M 2012-01-01 2012-01-01   100

> data %>%
+   epix_slide(f = ~ 5, before = 0, ref_time_values = as.Date("2012-01-01")) %>%
+   ungroup()

# Output of lobstr::ref(<list of epi_dfs containing computed values>,
#                       <tibble of computed values after rbindlist>)
█ [1:0x559b7f330b08] <list>
└─█ [2:0x559b80ee4dd8] <tibble[,2]>
  ├─time_value = [3:0x559b7f330b40] <date>
  └─slide_value = █ [4:0x559b7f467918] <list>
                  └─[5:0x559b7f331160] <dbl>

█ [6:0x559b80ee37d8] <tibble[,2]>
├─time_value = [7:0x559b7f4842f8] <date>
└─slide_value = █ [8:0x559b7f484288] <list>
                └─[5:0x559b7f331160]

# Output of epix_slide
# A tibble: 1 × 2
  time_value slide_value
* <date>           <dbl>
1 2012-01-01           5
```

Note that node [5:0x559b7f331160] appears in both trees: the computed dbl vector is shared between the per-group epi_dfs and the final tibble, which suggests the output does alias the intermediate results.
@brookslogan (Contributor) replied:

Does this maintain the speed benefits? group_by could potentially be costly. I'll try to double-check the no-aliasing part.

@nmdefries (Contributor, author) replied Dec 12, 2023:

> Does this maintain the speed benefits? group_by could potentially be costly.

Yes, it seems to. I was also wondering about the time spent in that group_by, but it didn't seem to have much impact. Edit: it's called fewer times now because it's not in the inner loop.

@brookslogan (Contributor) replied:

Great! Regarding aliasing:

  • Actually, we don't need to avoid it; we are fine with our outputs aliasing things, I think.
  • I've updated the docs to reflect the above and some other mutation & aliasing details. Please sanity-check. (I hope we can eventually simplify and make all these notes internal by providing a filter method and hiding the DT away from the user.)
  • We already probably don't ever alias.
    • all_versions = FALSE gets input that has gone through a DT -> tibble conversion using as_tibble with no setDT, which should prevent DT/column aliasing
    • all_versions = TRUE has a manual step making sure we don't alias the DT (I'm actually not sure why now... maybe I was thinking we were going to be mutating it like in fill_through_version), but it might somehow alias the columns (I'm not sure this is possible... but since the aliasing behavior of unique didn't appear to be documented before, we should probably assume it could change to alias columns instead of the DT).
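A hedged way to check the aliasing concerns above: `data.table::address()` exposes whether two objects share memory (toy example, not the archive's actual DT):

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT        # plain assignment: both names point at the same object
deep  <- copy(DT)  # copy() forces a deep copy

address(DT) == address(alias)  # TRUE: same underlying object
address(DT) == address(deep)   # FALSE: independent copy

# A DT -> tibble conversion without setDT() likewise allocates a new
# container, which is why the all_versions = FALSE path above should
# not alias the archive's DT.
```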

@nmdefries (Contributor, author) replied:

Thanks for adding all of those comments, they look good.

I'll remove some pipes as a final change.

@brookslogan (Contributor) replied:

Did removing pipes actually make a difference in your testing? I just tried re-introducing magrittr pipes to what I thought would be the highest-traffic pipe operations, the group_by and group_modify inside the FUN arg to lapply, but don't see any real change.

@brookslogan (Contributor) replied:

Also, it'd be nice if you could summarize which changes here seemed highest-impact, and note any other places you think might benefit significantly from optimization.

@nmdefries (Contributor, author) replied:

I don't see that de-piping changed runtime. Theoretically, extra function calls always take more time, but the cost is just so low compared to everything else we're doing here.

On the other hand, not using pipes does make profiles easier to read. Pipes somehow cause functions to be marked as not having a sensible parent call, so e.g. group_modify shows up in an unexpected spot on the profile that doesn't match the code.
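For illustration (toy data, not the PR's code), here is the same computation written piped and de-piped; the results match, but profilers attribute the nested form to the expected call site:

```r
library(dplyr)

piped <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

nested <- summarize(group_by(mtcars, cyl), mean_mpg = mean(mpg))

all.equal(piped, nested)  # TRUE: same result, different profile shape
```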

@brookslogan (Contributor) commented:

These are nice cheap speedups. FYI a couple other potential avenues at a much higher cost:

  • Changing the backend... I've checked out a few, but so far no improvements. We might still need to check whether some of these alternatives have explicit indexing options to speed them up.
  • Partitioning DT based on the requested ref_time_values, squashing each partition's updates using unique with fromLast = TRUE (or a rolling join if it's faster), then applying those squashed updates in order to build the snapshots sequentially.
  • Parallelism stuff. Conflicts with the sequential trick above.
  • Maybe partitioning DT by group or epikey before looping over the as-of logic (rather than while processing each snapshot), especially if rolling joins are involved.

These are probably recorded in another issue or issues somewhere. All definitely seem more involved than these super-simple improvements, but if any catch your eye, feel free to try them! (Probably separately, it will be nice to have these simple improvements first!)
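A hedged sketch of the "squash updates" bullet above; the column layout follows the example input shown earlier in this thread, not the archive's actual internals:

```r
library(data.table)

# Three successive revisions of the same (geo_value, time_value) key:
updates <- data.table(
  geo_value  = "ca",
  time_value = as.Date("2012-01-01"),
  version    = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03")),
  value      = c(100, 110, 120)
)

# Squash to the latest revision of each key; this assumes `updates` is
# already sorted by version within key, as an archive's DT would be.
squashed <- unique(updates, by = c("geo_value", "time_value"), fromLast = TRUE)
squashed$value  # 120: only the most recent revision survives
```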

R/grouped_epi_archive.R (outdated review thread)
@nmdefries (Contributor, author) commented Dec 5, 2023:

> other potential avenues

@brookslogan If there are other particular epiprocess use cases you'd like to be faster, I can profile and investigate those as well.

The above changes were based on a task from the examples:

```r
# dataset is 2000 obs, but individual changes were benchmarked on
# datasets of varying size (20-20k)
dataset %>%
  group_by(geo_value) %>%
  epix_slide(f = ~ 5, before = 7) %>%
  ungroup()
```

R/grouped_epi_archive.R (outdated review thread)
@nmdefries (Contributor, author) commented Dec 6, 2023:

Profiles show that messaging takes a fair amount of time. dplyr::group_map and dplyr::group_split look like:

```r
function (.data, .f, ..., .keep = FALSE)
{
    lifecycle::signal_stage("experimental", "group_map()")
    UseMethod("group_map")
}
```

with the lifecycle bit taking a lot of time. After the changes already included in this PR, the call to signal_stage takes ~1 of 8 s in a test call. This seems easy to remove, if we can get dplyr's other functions to use a modified version of these two functions. I'm assuming this is bad practice, and maybe not impactful enough to be worth the potential jankiness/fragility, but I'm curious about other takes on this idea.
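A hedged sketch of what that modified version could look like (not endorsed above, and `fast_group_map` is a hypothetical name): a local generic that dispatches to dplyr's registered group_map methods while skipping the lifecycle signal:

```r
library(dplyr)

fast_group_map <- function(.data, .f, ..., .keep = FALSE) {
  # Same dispatch as dplyr::group_map, minus lifecycle::signal_stage();
  # the inner group_split call still signals, so this only removes the
  # outer signal.
  UseMethod("group_map")
}

out <- mtcars %>%
  group_by(cyl) %>%
  fast_group_map(~ nrow(.x))

length(out)  # one element per cyl group
```

Whether a ~10% win is worth this fragility is the open question in the thread.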

@dshemetov (Contributor) commented:

Seems like something that would cause more development work downstream as those functions change, so I'm not that thrilled about it.

Question: are group_map and group_split called very often? Is there a way for us to get around this by calling them less often?

@nmdefries (Contributor, author) replied:

Yeah, it's a bad idea... but tempting since the change is so small 😀 We use group_map and group_split indirectly; I'm not entirely sure where they're coming from.

@dshemetov (Contributor) replied:

It does seem like an easy change for ~10% speed boost!

Hm, since they're used indirectly, does this mean loading epiprocess would have to overwrite those functions in the dplyr namespace?

@nmdefries (Contributor, author) replied:

> epiprocess would have to overwrite those functions in the dplyr namespace?

Yep. I was hoping there could be a workaround, but unclear.

@nmdefries nmdefries marked this pull request as ready for review December 7, 2023 03:22
@brookslogan (Contributor) commented:

> @brookslogan If there are other particular epiprocess use cases you'd like to be faster, I can profile and investigate those as well.

I don't know yet. I didn't expect epix_slide overhead to even be an issue for pseudoprospective forecasting, since the fitting routines we were using a while back were pretty slow. In the process of finishing up some chapters I will hopefully get a sense of what it feels like for lag extraction, and I might ask you to profile if it seems slow. There's a lot of specialized stuff we could do for lag extraction, though; it doesn't necessarily have to be a slide operation.

Re. group_map and group_split: if we could refactor to call them directly, then we could also easily refactor to call our own versions of group_map and group_split that UseMethod without the lifecycle call, though this has some chance of breakage if dplyr decides to move some pre/post-processing into the generic. But we also might want to try to "group split" the archive on the outside and loop over ref_time_values on the inside, like we do in epi_slide, rather than the other way around; this could potentially avoid some unneeded sorts by group_by inside the ref_time_value loop, though it might be hard to determine beforehand which approach incurs more function-call overhead.

@nmdefries (Contributor, author) commented:

@brookslogan if you review this before I'm back and everything looks good, feel free to merge.

@brookslogan (Contributor) left a review comment:

Looks good to me! I have some trailing requests about documenting whether the de-piping was worth it and which optimizations had the most impact, but this seems ready to merge.

@nmdefries (Contributor, author) commented Dec 14, 2023:

Slow bits in epix_slide from a test call. 23.5 s is the baseline runtime.

  • [done] 25% of time (5.8 of 23.5 s) spent in x$slide -> map_dfr -> bind_rows
  • [done] 5% (1.2 of 23.5 s) spent in as_epi_df.tbl_df -> new_epi_df -> relocate.data.frame
  • [done] ?? spent in group_by-across-tidyselect::all_of
  • [done] ?? spent in turning computation output into a tibble
  • [probably intractable] 16% of time (3.7 of 23.5 s) spent in group_modify.grouped_df -> group_map -> lifecycle::signal_stage or group_modify.grouped_df -> group_map.data.frame -> group_split -> lifecycle::signal_stage
  • [intractable] 25% of time (6 of 23.5 s) spent in signal, from group_modify.epi_df -> dplyr_reconstruct -> dplyr_reconstruct.grouped_df -> grouped_df -> compute_groups -> signal, from group_split -> signal, and from other places (apparently a logging feature, not necessary for production: https://github.com/tidyverse/dplyr/pull/4751/files). Too deep in dplyr to change.

It was hard to tell in the profile of Dmitry's tests which functions were coming from epiprocess (pipe-caused "unaffiliation" might be part of the problem). I could tell that epix_slide -> anonymous -> purrr::map_dfr -> group_by was slow, and that epi_df.R::new_epi_df -> relocate.data.frame was slow. Beyond those, there were many slow dplyr functions not clearly coming from epiprocess (unnest.data.frame, arrange.data.frame, rename.data.frame, group_by.data.frame, select.data.frame, mutate.data.frame).

@nmdefries (Contributor, author) commented:
Code to profile, and output of baseline profile.

baseline_profile_example.tar.gz

@nmdefries (Contributor, author) commented Dec 14, 2023:
Remaining slow parts are doing group_by and group_modify on different objects inside the lapply-compute loop.

[Screenshot of profile, 2023-12-14]

20231214_profile.tar.gz

We can't just swap in a different function here, so, as Logan was saying above (here and here), this gets into more demanding changes. We should figure out which particular types of requests/computations are slow, and how much slower than desired, before spending a lot of time on a rewrite.

@nmdefries nmdefries merged commit b444a3c into dev Dec 14, 2023
2 checks passed
@nmdefries nmdefries deleted the ndefries/speedups branch December 14, 2023 22:01