`epi_df` argument refactoring #460

dsweber2 · 2024-06-07T17:03:21Z

Checklist

Please:

- Make sure this PR is against "dev", not "main" (unless this is a release PR).
- Request a review from one of the current main reviewers: brookslogan, nmdefries.
- Make sure to bump the version number in DESCRIPTION. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).
- Describe changes made in NEWS.md, making sure breaking changes (backwards-incompatible changes to the documented interface) are noted. Collect the changes under the next release number (e.g. if you are on 1.7.2, then write your changes under the 1.8 heading).
- See DEVELOPMENT.md for more information on the development process.

Change explanations for reviewer

There are a few things we want to change about the arguments to epi_df and/or as_epi_df:

- columns that aren't named exactly time_value but are unambiguously meant to be that should be interpreted as such, e.g. date.
- columns that aren't named exactly geo_value but are unambiguously meant to be that should be interpreted as such, e.g. geo_id.
- tidyselect to handle renaming (so if you had both forecast_date and target_date you could tell it time_value = target_date to disambiguate)
- split additional metadata into "additional" and a separate other_keys= argument, since other_keys is by far the most useful part of it
- update the docs to reflect these changes
- all of these, also for epi_archives. Some have already happened (e.g. other_keys is already separated)
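As a rough illustration of the tidyselect renaming item, here is a minimal sketch that calls `dplyr::rename()` directly instead of `as_epi_df()`. The data frame and its values are made up, and the real constructor may do more than this.

```r
library(dplyr)

# Hypothetical forecast-style data with two date-like columns.
forecasts <- data.frame(
  geo_id = c("ca", "ca"),
  forecast_date = as.Date(c("2024-06-01", "2024-06-01")),
  target_date = as.Date(c("2024-06-08", "2024-06-15")),
  value = c(10, 12)
)

# The caller names the column that should become time_value, which removes
# the ambiguity between forecast_date and target_date.
renamed <- rename(forecasts, time_value = target_date, geo_value = geo_id)
names(renamed)
```

The renamed columns keep their original positions, so only the names change.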

Current list of time_value equivalents:

c(
      time_value = "date",
      time_value = "time",
      time_value = "datetime",
      time_value = "dateTime",
      time_value = "date_time",
      time_value = "forecast_date",
      time_value = "target_date",
      time_value = "week",
      time_value = "day",
      time_value = "epiweek",
      time_value = "month",
      time_value = "year",
      time_value = "yearmon",
      time_value = "yearMon",
      time_value = "dates",
      time_value = "time_values",
      time_value = "forecast_dates",
      time_value = "target_dates"
)

Current list of geo_value equivalents:

c(
      geo_value = "geo_values",
      geo_value = "geo_id",
      geo_value = "geos",
      geo_value = "fips",
      geo_value = "zip",
      geo_value = "county",
      geo_value = "hrr",
      geo_value = "msa",
      geo_value = "state",
      geo_value = "province",
      geo_value = "nation",
      geo_value = "states",
      geo_value = "provinces",
      geo_value = "counties"
)

Current list of version equivalents:

c(
      version = "issue",
      version = "release"
)
And for all of these, the Snake_Case capitalized versions
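One way such lists could be used to find a matching column, including the Snake_Case variants, is sketched below. `find_equivalents()` is a hypothetical helper written for this illustration in base R, not the function added in this PR, and the candidate list is shortened.

```r
# Hypothetical helper: return the columns of `x` whose names match a candidate
# name or its Snake_Case capitalized form (e.g. "target_date" -> "Target_Date").
find_equivalents <- function(x, candidates) {
  # Upper-case the first letter of each underscore-separated piece.
  capitalized <- gsub("(^|_)([a-z])", "\\1\\U\\2", candidates, perl = TRUE)
  intersect(names(x), unique(c(candidates, capitalized)))
}

# Shortened, illustrative candidate list.
time_candidates <- c("date", "time", "datetime", "target_date")

df <- data.frame(geo_id = "ca", Date = as.Date("2024-06-01"), value = 1)
find_equivalents(df, time_candidates)
```

A constructor could then rename the single match to time_value, or refuse to guess when the result has more than one element.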

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Resolves "Promote" other_keys to be printed, a constructor parameter, and more clearly documented #186, though some features there are out of scope; I'm going to make separate issues for:
- print.epi_df should print the other_keys metadata
- keys documentation
- possibly "regenerate the saved data sets"
Resolves [enh] promote other_keys #446
Resolves #456

Future extensions

combining multiple columns into unique keys, e.g. time_value = join_by(month, year) where month and year are separate columns
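A minimal base R sketch of what combining separate columns could amount to is shown below. The `join_by(month, year)` syntax above is only a proposal, so this example builds the date by hand; the data and the choice of the first day of the month as the representative date are assumptions.

```r
# Hypothetical data with the time information split across two columns.
monthly <- data.frame(
  geo_value = c("ca", "ca"),
  year = c(2023L, 2024L),
  month = c(12L, 1L),
  value = c(5, 7)
)

# Represent each (year, month) pair by the first day of that month.
monthly$time_value <- as.Date(sprintf("%04d-%02d-01", monthly$year, monthly$month))
monthly[c("geo_value", "time_value", "value")]
```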

brookslogan · 2024-06-07T18:09:05Z

Nice lists; some notes/opinions:

I don't think time_value should be matched to forecast_date or forecast_dates automatically. target_date and target_dates seem fine though. This will help prevent misaligning forecast and signal data.
location should be a possibility for geo_value (think the Hub uses this), and maybe jurisdiction.
time_value seems like it's sort of missing week, epiweek, EW, month, mon, year, yearmon, but those would take more complicated logic as e.g., week or epiweek could be YYYYww format or ww format and rely on a separate year column, plus it's ambiguous what type of week numbering system is being used simply from the name (even "epiweek" can differ between the US and other nations).
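The week-numbering ambiguity can be seen with base R's date formatting alone. This sketch compares two conventions for the same date; neither is claimed to be the definition any particular data source uses for "epiweek".

```r
d <- as.Date("2021-01-03") # a Sunday

# ISO 8601 week number (weeks start on Monday).
iso_week <- format(d, "%V")
# Week number counted with Sunday as the first day of the week.
sunday_week <- format(d, "%U")

c(iso = iso_week, sunday = sunday_week)
```

The two conventions give different week numbers for this date, so a column named week or epiweek does not identify a date without knowing which system produced it.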

nmdefries · 2024-06-07T19:13:55Z

time_value = time, datetime

dsweber2 · 2024-06-07T19:45:48Z

I don't think time_value should be matched to forecast_date or forecast_dates automatically. target_date and target_dates seem fine though. This will help prevent misaligning forecast and signal data.

So, if both forecast_date and target_date are present, with the current setup it's going to throw an error, asking the user to specify which they want. Does that avoid the footguns you're hoping to avoid?
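The behavior described here could look roughly like the following base R sketch. `pick_time_column()` and its error wording are hypothetical and stand in for whatever the PR actually implements.

```r
# Hypothetical check: refuse to guess when several candidate columns are present.
pick_time_column <- function(x, candidates = c("forecast_date", "target_date")) {
  present <- intersect(candidates, names(x))
  if (length(present) > 1) {
    stop(
      "Multiple possible time_value columns: ",
      paste(present, collapse = ", "),
      ". Please specify one, e.g. time_value = ", present[[1]], "."
    )
  }
  present
}

both <- data.frame(
  forecast_date = as.Date("2024-06-01"),
  target_date = as.Date("2024-06-08")
)
# Capture the error message instead of stopping the script.
result <- tryCatch(pick_time_column(both), error = function(e) conditionMessage(e))
result
```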

location should be a possibility for geo_value (think the Hub uses this), and maybe jurisdiction.

oh yeah that makes sense thanks!

time_value seems like it's sort of missing week, epiweek, EW, month, mon, year, yearmon, but those would take more complicated logic as e.g., week or epiweek could be YYYYww format or ww format and rely on a separate year column, plus it's ambiguous what type of week numbering system is being used simply from the name (even "epiweek" can differ between the US and other nations).

I guess I should just add whichever of these are supported by guess_time_type? The multi-column case does make sense, but I think I'm going to put that as out of scope for now and leave it as a future enhancement.

brookslogan · 2024-06-10T17:58:18Z

So, if both forecast_date and target_date are present, with the current setup it's going to throw an error, asking the user to specify which they want. Does that avoid the footguns you're hoping to avoid?

In that instance, yes, although I'd also be fine with guessing it to be the target_date.

But suppose there is only a forecast_date (no target_date); I don't think we should guess it to be the time_value.

I guess I should just add whichever of these are supported by guess_time_type? The multi-column case does make sense, but I think I'm going to put that as out of scope for now and leave it as a future enhancement.

Sorry, this was mostly me reasoning about why it's good to exclude these. You could maybe accept yearmon & yearmonth, and perhaps year if it appears by itself without month or mon or week or any other possible pairings you could think of. But the rest are too ambiguous by name. If you detect an appropriate class from tsibble (e.g., whatever tsibble::yearweek outputs) then that is less fraught --- tsibble does disambiguate these. But the simplest "solution" is just to exclude all of these possibilities and require the user to specify.

dajmcdon · 2024-06-19T18:56:51Z

On forecast_date / target_date, I actually think we might want it to match forecast_date. For example, {hubUtils} would only contain a forecast_date in their model output: https://hubverse-org.github.io/hubData/reference/as_model_out_tbl.html

dsweber2 · 2024-06-25T00:09:34Z

Looking into actually implementing the other_keys change, I/someone should do that in a separate PR, because it's sufficiently wide-ranging w/docs/vignette updates that it should be separated out.

I think this is ready if someone wants to do a review. @lcbrooks I think most of the concerning cases you're thinking about will be taken care of by multiple names matching and thus forcing an error. It seems like there are legit use cases for forecast_date, so if it's the only date-like column, we should be using it.

brookslogan · 2024-06-25T00:20:51Z

But even if we have chopped off the target date, we would not want the forecast date to be the time value. We would want to first reattach the target date and then convert. If we were ever to put these in an epi_df at all, it seems a bit of a mismatch vs a dedicated predictions format or archive. I think forecast date should just be excluded from the considered set.

dsweber2 · 2024-06-25T15:38:16Z

If someone were trying to, say, smooth the scores, I could see using forecast_date on its own. This was actually the use case that got me started on this. It just seems very prescriptive to say that a user making an epi_df where there's nothing else that looks like a time value doesn't want time_value = forecast_date.

brookslogan · 2024-06-25T15:40:56Z

At the same time, we've had bugs from lining up forecast dates of forecasts with time values of signals, and making the default make this easier seems undesirable.

brookslogan · 2024-06-25T15:46:37Z

This is more a matter of personal preference, but I'd also actually probably prefer the prescriptive approach for data structures [or just column names, but we force people to use time_value regardless of what it actually represents, which is the opposite of what I'm imagining] and an optionally nonprescriptive interface for functionality. E.g. an option or function to slide by forecast date rather than time value.

dsweber2 · 2024-06-25T15:49:39Z

In the interest of shipping things rather than leaving every PR open, I've dropped it as a default; time_value = forecast_date will work, so the option isn't gone gone, just not a default. I'd appreciate a review and merge from someone in the not too distant future.

nmdefries

Brief review, will be more thorough in second pass. Please add some tests for the new functions.

R/archive.R

R/epi_df.R

nmdefries · 2024-07-03T20:34:57Z

R/epi_df.R

  if (!test_subset(c("geo_value", "time_value"), names(x))) {
    cli_abort(
-      "Columns `geo_value` and `time_value` must be present in `x`."
+      "Either columns `geo_value` and `time_value` must be present in `x`, or related columns (see the internal


question: do we need the same type of check and error in the as_epi_archive version of this?

it's already there? The most recent commit includes a wording change to match this though.

R/utils.R

nmdefries · 2024-07-03T20:46:12Z

R/archive.R

  assert_data_frame(x)
+  x <- rename(x, ...)


question: do we need to do a tryCatch here, too?

The error in this case is

as_epi_archive(rename(dt, weirdName = version), version = weirdName, version = time_value)
Error in `rename()` at epiprocess/R/archive.R:461:3:
! Names must be unique.
✖ These names are duplicated:
  * "version" at locations 1 and 2.
Run `rlang::last_trace()` to see where the error occurred.

which I think makes it more obvious what went wrong. For the other rename, the reason the names are redundant is buried a little deeper, so I wanted to give some context.
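For reference, wrapping a rename in `tryCatch()` to add context might look like the sketch below. `rename_with_context()` and its message are hypothetical and are not the code in R/epi_df.R or R/archive.R.

```r
library(dplyr)

# Hypothetical wrapper: re-throw a rename() failure with extra context.
rename_with_context <- function(x, ...) {
  tryCatch(
    rename(x, ...),
    error = function(e) {
      stop(
        "Renaming columns failed; check for arguments that map two columns ",
        "to the same name. Original error: ", conditionMessage(e),
        call. = FALSE
      )
    }
  )
}

dt <- data.frame(weirdName = 1:2, time_value = 3:4)
# Two columns are mapped to `version`, so rename() errors; capture the message.
msg <- tryCatch(
  rename_with_context(dt, version = weirdName, version = time_value),
  error = function(e) conditionMessage(e)
)
msg
```

Whether the extra wrapping is worth it depends on how clear the underlying error already is, which is the trade-off discussed above.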

dsweber2 self-assigned this Jun 20, 2024

dsweber2 mentioned this pull request Jun 25, 2024

as_epi_df convienence: combine separate day/month/year columns etc #476

Open

dsweber2 force-pushed the autoName branch from 08059b0 to e07ec94 Compare June 25, 2024 00:06

dsweber2 marked this pull request as ready for review June 25, 2024 00:07

dsweber2 requested review from brookslogan and nmdefries June 25, 2024 15:50

nmdefries reviewed Jul 3, 2024

View reviewed changes

dsweber2 force-pushed the autoName branch from af2dc58 to 9a451b0 Compare July 9, 2024 16:47

dsweber2 and others added 11 commits July 9, 2024 12:44

basic auto-naming

1ab1095

geo_value and version, separate functions, more ex

0f0fe40

docs: document (GHA)

ec23ec7

errant renamed variables

a30d41e

More tests, ... tidyselect, doc as_epi_df, more values

441bdae

docs: document (GHA)

c8b73ce

remove forecast_date as a default

b5e9d20

happier linter

7af380b

wrong test, better too many columns error

b7414a8

minor refactor to make col name subs accessible

e144803

Nat's suggestions

924e5ee

dsweber2 and others added 4 commits July 9, 2024 12:44

docs: document (GHA)

5825181

avoid arg prefix-completion for versions_end

33ef7c7

style: styler (GHA)

536d7d7

docs: document (GHA)

243c20e

dsweber2 force-pushed the autoName branch from f9499e6 to 243c20e Compare July 9, 2024 17:44
