Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review and consider applying ?dplyr_extending for epi_df #223

Open
brookslogan opened this issue Aug 31, 2022 · 11 comments
Open

Review and consider applying ?dplyr_extending for epi_df #223

brookslogan opened this issue Aug 31, 2022 · 11 comments
Assignees
Labels
cleanup improvements to developing&building experience&quality, not directly to built pkg&docs P2 low priority

Comments

@brookslogan
Copy link
Contributor

I believe that we implemented epi_df without the use of ?dplyr_extending. We may have:

  • Missing S3 implementations
  • Unnecessary S3 implementations
  • Noncompliant behavior

We should review ?dplyr_extending to see if it hints at bugs, enhancements, code simplifications, etc. (E.g., the mention of [ might have hinted that we would want a [ implementation before we discovered it on our own.)

@brookslogan brookslogan added the P2 low priority label Aug 31, 2022
@brookslogan brookslogan added the cleanup improvements to developing&building experience&quality, not directly to built pkg&docs label Sep 20, 2022
@brookslogan
Copy link
Contributor Author

brookslogan commented Oct 29, 2022

Might have ran into one of these cases in #64 where grouped epi_dfs being bound together lost their epi_dfness. Implementing a dplyr_reconstruct.epi_df based on [.epi_df seems to partially resolve. But note this exacerbates #242 (which underlies #213).

May also related to #117, cmu-delphi/epipredict#51.

@brookslogan brookslogan self-assigned this Nov 1, 2022
brookslogan pushed a commit that referenced this issue Nov 1, 2022
* Fix failing grouped `epix_slide` test, where `group_modify` dropped the
  `epi_df` class when binding results together, by implementing
  `dplyr_reconstruct.epi_df`.
* Implement the rest of `?dplyr_extending` and remove some now-unnecessary S3
  methods for dplyr verbs, addressing #195, #223, and changing `epi[x]_slide` to
  leave grouping intact.
* Update tests for grouped slides to reflect new behavior.
* Update NEWS.md.
brookslogan pushed a commit that referenced this issue Nov 1, 2022
* Half-fix failing grouped `epix_slide` test, where `group_modify` dropped the
  `epi_df` class when binding results together, by implementing
  `dplyr_reconstruct.epi_df`.  Somehow a session documenting and testing
  immediately before this commit and then after it will have the tests pass, but
  the tests won't pass off of this branch directly.  The class vector involved
  in the failure does appear to change.
* Explicate `epix_slide` `count` derivation.
* Implement the rest of `?dplyr_extending` and remove some now-unnecessary S3
  methods for dplyr verbs, addressing #195, #223, and changing `epi[x]_slide` to
  leave grouping intact.
* Update tests for grouped slides to reflect new behavior.
* Update NEWS.md.
brookslogan pushed a commit that referenced this issue Nov 2, 2022
…modify`

* Implement `?dplyr_extending` and remove some now-unnecessary S3 methods for
  dplyr verbs, addressing #195, #223, and failing `epix_slide` test.
* Don't ungroup `epix_slide` result.  Update corresponding test.
* Update NEWS.md.
* Explicate `epix_slide` `count` derivation in comments and variable names.
* Fix some desynced duplicated code in `epix_slide` and use `group_modify` again
  instead of `summarize` in order to keep slide computation input available as
  an `epi_df`.
brookslogan pushed a commit that referenced this issue Nov 2, 2022
…modify`

* Implement `?dplyr_extending` and remove some now-unnecessary S3 methods for
  dplyr verbs, addressing #195, #223, and failing `epix_slide` test.
* Don't ungroup `epix_slide` result.  Update corresponding test.
* Update NEWS.md.
* Explicate `epix_slide` `count` derivation in comments and variable names.
* Fix some desynced duplicated code in `epix_slide` and use `group_modify` again
  instead of `summarize` in order to keep slide computation input available as
  an `epi_df`.
brookslogan pushed a commit that referenced this issue Nov 3, 2022
…modify`

* Implement `?dplyr_extending` and remove some now-unnecessary S3 methods for
  dplyr verbs, addressing #195, #223, and failing `epix_slide` test.
* Don't ungroup `epix_slide` result.  Update corresponding test.
* Update NEWS.md.
* Explicate `epix_slide` `count` derivation in comments and variable names.
* Fix some desynced duplicated code in `epix_slide` and use `group_modify` again
  instead of `summarize` in order to keep slide computation input available as
  an `epi_df`.
@brookslogan
Copy link
Contributor Author

brookslogan commented Nov 16, 2022

A potential issue with ?dplyr_extending guidance: arrange on grouped epi_dfs drops epi_dfness. We need to implement dplyr_row_slice as well. (We also need to implement other methods interacting with grouping: group_by, ungroup, group_modify [or maybe not; an appropriate group_data + any other required extensions of grouped_df methods (see below) might get this "for free"/"properly"]; as well as unnest; we may be missing some others.)

@brookslogan
Copy link
Contributor Author

There are a few other issues with grouped epi dfs in the current dev / lcb/make_epix_slide_more_like_reframe branches; see #113, #298. Note:

  1. Part of this may be due to missing a group_data method, alluded to in ?dplyr_extending. We'd want either/both (a) "epi_df" before "grouped_df" in the class vector to be able to dispatch to group_data.epi_df instead of group_data.grouped_df, or (b) to mess with the class of the "groups" attr generated by group_by (doesn't sound great.)
  2. Note in comments of Make print.epi_df header mirror style of print.tsibble and print.tbl_df #113 that we want "grouped_df" before "epi_df" due to some pillar methods for grouped_df expecting it to be before more base-like classes in the class vector.

1(a) and 2 (and maybe more points) seem to conflict. One potential resolution:

  • Make grouped epi_dfs use the class c("grouped_edf", "grouped_df", "epi_df", "tbl_df", "tbl", "data.frame"). That's also suggested in this older thread.

@brookslogan
Copy link
Contributor Author

brookslogan commented Apr 24, 2023

While applying the "potential resolution" above seems to help in some aspects, it also loses epi_df-ness in a lot of other situations. The problem appears to be that grouped_df's methods don't really respect dplyr_extending for grouped_* classes in front of it. So we'd need to look at methods(,"grouped_df") and/or methods(,"grouped_{somethingelse}") and make sure to override/mirror the things we want. [2024-09-24: perhaps this wasn't/isn't the case; it seems you can get these ops to work as long as you make sure not to lose your head class after any NextMethods in dplyr_extending impls, including e.g. names<-, and this may be sufficient. Also, looking at grouped_df S3 methods might not have even worked, as there might be some non-S3 functions manually branching on grouped_df-ness.] Related: #145. [We'll need to look in some packages beyond dplyr for these, e.g., for vctrs::restore.]

[Same approach would need to be applied for rowwise edfs. Plus, in any approach, we will want to review tidyr for more methods that may need wrapped like, e.g., unnest.]

@brookslogan
Copy link
Contributor Author

brookslogan commented Apr 24, 2023

More recent reference: tidyverse/dplyr#5480

Also, I'm not sure if the approaches with a separate grouped_{something} class approach were all limited to tidyverts and not tidyverse. It seems like tidyverse things like dtplyr and sf might use either (a) manual delegation rather than via NextMethod (and have custom print with as_tibble, format, [-1] to remove header line(s)) or (b) the single-class-in-front-of-grouped_df approach with some additional methods implemented (and just has the same "# A tibble: [...]" line sticking around).

@brookslogan
Copy link
Contributor Author

brookslogan commented Apr 27, 2023

Progress is in lcb/round-out-grouped-epi_dfs. However, it already seems to have some problems with summarize drop_last seeing time_value as an additional grouping variable when we are only grouping by geo_value, and thus grouping results by geo_value by default when outputting 1 row per computation [within epi_cor. Seems like we can't just use reclass or dplyr_reconstruct everywhere without thinking, and/or there's an issue with the current dplyr_reconstruct.grouped_edf; dplyr_reconstruct.grouped_df tries to take grouping from template (intersecting with available columns in data)]. Also needs further, more focused, tests that it actually fixes what we wanted to fix. Some other existing test(s) are broken from not dropping data.table attributes that are now being kept around by group_modify result assembly in epix_slide (and kept inside the computations by as_tibble).

@brookslogan
Copy link
Contributor Author

brookslogan commented Apr 27, 2023

The questions raised by how to reconstruct a grouped summarize might tie into questions about when epix_slide should output an epi_df, when geo_value might either come from "outside", when the the slide is grouped by geo_value, or the "inside", when the slide isn't grouped but the user computations output epi_dfs (with geo_values). A useful concept might be an epi_df_selection/epi_df_slice/sub_epi_df or an epi_df_group, which would be a special class we'd use, e.g., as the first or both args in group_modify computations, but only decaying if essential columns don't appear in either arg, rather than decaying when they're gone from the first arg. First, this special class would solve us having to use .keep = TRUE to keep a relevant class passed in the first arg of group_modify, and second, it might help in other summarization situations to avoid having to attempt reconstructing things against a template that might not be very similar (since summarization can alter so much). [Alternatively, there could be an epi_df_groups class output from group_data and hopefully retained until it's time to unnest, etc.]

@brookslogan
Copy link
Contributor Author

Another point where things seem unclear: should we drop metadata in an as_tibble.epi_df method? This seems like it would correspond to grouped_df, rowwise_df, and other classes' behavior of dropping groupings (though the former is viewed in vctrs as a more general concept than ordinary tibbles, while epi_dfs are a narrower concept). And we have the metadata dropped when as_tibble-ing grouped epi_dfs currently anyway. I'll likely implement metadata-dropping in #311.

@brookslogan
Copy link
Contributor Author

brookslogan commented Jul 30, 2024

Having an epi_df_slice/epi_df_selection/sub_epi_df / epi_df_group / epi_df_view class would also allow us to disable accidental use of mean/sum on .x in slide computations when .x$value was meant instead. By simply providing an S3 method that complains. Currently we lose epi-dfness when doing slides grouped by geos, so doing the same for epi_dfs isn't going to catch much. [Though alternatively, we may be able to use output column de-duplication on unnesting (rather than errors) + do keep = TRUE for epi_slides as we do epix_slides. Then epi_df impls should catch it.]

@brookslogan
Copy link
Contributor Author

Some notes:

  • Our current application of ?dplyr_extending is awkward, routing many/all core ops through a dplyr_reconstruct with all validation checks. Experimenting with a "real" application in brookslogan/dplyr.extending.test
  • We avoid some issues by using .keep = TRUE in group_map ops both epix_slide and epi_slide now/soon. But if a user directly uses group_modify with the default .keep = FALSE, then they could run into annoyances, so we should at least maintain a group_modify method to fix up there. But there's less need for a slice/group/view class given these adjustments.

@brookslogan
Copy link
Contributor Author

  • Requiring key uniqueness makes geo/time aggregation awkward; it'd be nice if we could edf %>% group_by(geo_value = state_of(geo_value), time_value = week_of(time_value)) %>% summarize(value = sum(value)) and get an epi_df out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup improvements to developing&building experience&quality, not directly to built pkg&docs P2 low priority
Projects
None yet
Development

No branches or pull requests

2 participants