Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epi_df argument refactoring #460

Open
wants to merge 15 commits into
base: dev
Choose a base branch
from
Open

epi_df argument refactoring #460

wants to merge 15 commits into from

Conversation

dsweber2
Copy link
Contributor

@dsweber2 dsweber2 commented Jun 7, 2024

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

There's a couple of things we want to change about the arguments to epi_df and/or as_epi_df:

  • columns that aren't named exactly time_value but are unambiguously meant to be that should be interpreted as such, e.g. date.
  • columns that aren't named exactly geo_value but are unambiguously meant to be that should be interpreted as such, e.g. geo_id.
  • tidyselect to handle renaming (so if you had both forecast_date and target_date you could tell it time_value = target_date to disambiguate)
  • break additional metadata into "additional" and other_keys=, by far the most useful part of that
  • update the docs to reflect these changes
  • all of these, also for epi_archives. Some have already happened (e.g. other_keys is already separated)

Current list of time_value equivalents:

c(
         time_value = "date",
        time_value = "time",
        time_value = "datetime",
        time_value = "dateTime",
        tmie_value = "date_time",
        time_value = "forecast_date",
        time_value = "target_date",
        time_value = "week",
        time_value = "day",
        time_value = "epiweek",
        time_value = "month",
        time_value = "year",
        time_value = "yearmon",
        time_value = "yearMon",
        time_value = "dates",
        time_value = "time_values",
        time_value = "forecast_dates",
        time_value = "target_dates"

)

Current list of geo_value equivalents:

c(
      geo_value = "geo_values",
      geo_value = "geo_id",
      geo_value = "geos",
      geo_value = "fips",
      geo_value = "zip",
      geo_value = "county",
      geo_value = "hrr",
      geo_value = "msa",
      geo_value = "state",
      geo_value = "province",
      geo_value = "nation",
      geo_value = "states",
      geo_value = "provinces",
      geo_value = "counties"
)

Current list of version equivalents:

c(
      version = "issue",
      version = "release"
)
And for all of these, the Snake_Case capitalized versions

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Future extensions

  • combining multiple columns into unique keys, e.g. time_value = join_by(month, year) where month and year are separate columns

@brookslogan
Copy link
Contributor

Nice lists; some notes/opinions:

  • I don't think time_value should not be matched to forecast_date or forecast_dates automatically. target_date and target_dates seem fine though. This will help prevent misaligning forecast and signal data.
  • location should be a possibility for geo_value (think the Hub uses this), and maybe jurisdiction.
  • time_value seems like it's sort of missing week, epiweek, EW, month, mon, year, yearmon, but those would take more complicated logic as e.g., week or epiweek could be YYYYww format or ww format and rely on a separate year column, plus it's ambiguous what type of week numbering system is being used simply from the name (even "epiweek" can differ between the US and other nations).

@nmdefries
Copy link
Contributor

time_value = time, datetime

@dsweber2
Copy link
Contributor Author

dsweber2 commented Jun 7, 2024

I don't think time_value should not be matched to forecast_date or forecast_dates automatically. target_date and target_dates seem fine though. This will help prevent misaligning forecast and signal data.

So, if both forecast_date and target_date are present, with the current setup it's going to throw an error, asking the user to specify which they want. Does that avoid the footguns you're hoping to avoid?

location should be a possibility for geo_value (think the Hub uses this), and maybe jurisdiction.

oh yeah that makes sense thanks!

time_value seems like it's sort of missing week, epiweek, EW, month, mon, year, yearmon, but those would take more complicated logic as e.g., week or epiweek could be YYYYww format or ww format and rely on a separate year column, plus it's ambiguous what type of week numbering system is being used simply from the name (even "epiweek" can differ between the US and other nations).

I guess I should just add whichever of these are supported by guess_time_type? The multi-column case does make sense, but I think I'm going to put that as out of scope for now and leave it as a future enhancement.

@brookslogan
Copy link
Contributor

So, if both forecast_date and target_date are present, with the current setup it's going to throw an error, asking the user to specify which they want. Does that avoid the footguns you're hoping to avoid?

In that instance, yes, although I'd also be fine with guessing it to be the target_date.

But suppose there is only a forecast_date (no target_date); I don't think we should guess it to be the time_value.

I guess I should just add whichever of these are supported by guess_time_type? The multi-column case does make sense, but I think I'm going to put that as out of scope for now and leave it as a future enhancement.

Sorry, this was mostly me reasoning about why it's good to exclude these. You could maybe accept yearmon & yearmonth, and perhaps year if it appears by itself without month or mon or week or any other possible pairings you could think of. But the rest are too ambiguous by name. If you detect a an appropriate class from tsibble (e.g., whatever tsibble::yearweek outputs) then that is less fraught --- tsibble does disambiguate these. But the simplest "solution" is just to exclude all of these possibilities and require the user to specify.

@dajmcdon
Copy link
Contributor

On forecast_date / target_date, I actually think we might want it to match forecast_date. For example, {hubUtils} would only contain a forecast_date in their model output: https://hubverse-org.github.io/hubData/reference/as_model_out_tbl.html

@dsweber2
Copy link
Contributor Author

dsweber2 commented Jun 25, 2024

Looking into actually implementing the other_keys change, I/someone should do that in a separate PR, because it's sufficiently wide-ranging w/docs/vignette updates that it should be separated out.

I think this is ready if someone wants to do a review. @lcbrooks I think most of the concerning cases you're thinking about will be taken care of by either multiple names triggering, and thus forcing an error. Seems like there are legit use cases for forecast_date, so if it's the only date-like, we should be using it.

@brookslogan
Copy link
Contributor

But even if we have chopped off target date, we would not want forecast date to be the time value. We would want to first reattach the target date then convert.. if we were ever to put these in an epi df at all, it seems a bit of a mismatch vs a dedicated predictions format or archive. I think forecast date should just be excluded from the considered set.

@dsweber2
Copy link
Contributor Author

If someone were trying to say, smooth the scores, I could see using forecast_date on it's own. This was actually the use-case that got me started down making this. It just seems very prescriptive to say that a user making an epi_df where there's nothing else that looks like time value doesn't want time_value=forecast_date.

@brookslogan
Copy link
Contributor

At the same time, we've had bugs from lining up forecast dates of forecasts with time values of signals, and making the default make this easier seems undesirable.

@brookslogan
Copy link
Contributor

brookslogan commented Jun 25, 2024

This is more into personal preferences, but I'd also actually probably prefer the prescriptive approach for data structures [or just column names, but we force people to use time_value regardless of what it actually represents which is the opposite of what I'm imagining] and a optionally nonprescriptive interface for functionality. E.g. an option or function to slide by forecast date rather than time value.

@dsweber2
Copy link
Contributor Author

In the interest of shipping things rather than leaving every PR open, I've dropped it as a default; time_value = forecast_date will work, so the option isn't gone gone, just not a default. I'd appreciate a review and merge from someone in the not too distant future.

Copy link
Contributor

@nmdefries nmdefries left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brief review, will be more thorough in second pass. Please add some tests for the new functions.

R/archive.R Outdated Show resolved Hide resolved
R/epi_df.R Show resolved Hide resolved
if (!test_subset(c("geo_value", "time_value"), names(x))) {
cli_abort(
"Columns `geo_value` and `time_value` must be present in `x`."
"Either columns `geo_value` and `time_value` must be present in `x`, or related columns (see the internal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do we need the same type of check and error in the as_epi_archive version of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's already there? The most recent commit includes a wording change to match this though.

R/utils.R Outdated Show resolved Hide resolved
assert_data_frame(x)
x <- rename(x, ...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do we need to do a tryCatch here, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error in this case is

as_epi_archive(rename(dt, weirdName = version), version = weirdName, version = time_value)
Error in `rename()` at �]8;line = 461:col = 3;file:///home/dsweber/allHail/delphi/epiprocess/R/archive.R�epiprocess/R/archive.R:461:3�]8;;�:
! Names must be unique.
✖ These names are duplicated:
  * "version" at locations 1 and 2.
Run `�]8;;rstudio:run:rlang::last_trace()�rlang::last_trace()�]8;;�` to see where the error occurred.

which I think is more obvious what went wrong. For the other rename, it's buried a little deeper why the names are redundant, so I wanted to give some context

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants