Skip to content

Conversation

brookslogan
Copy link
Contributor

This is a work in progress. Opening the PR just for any feedback and discussion.

Add `epi_tibble_archive` as an `R6` class with just
- `$new` (requires the user to perform any compression of snapshots/updates into
  minimal diffs if desired, and to reconstruct upon updating the archive itself)
- `$epi_tibble_as_of`

The next priority is to implement `epi_slide` for `epi_tibble_archive`s.
@brookslogan brookslogan marked this pull request as draft November 9, 2021 17:16
These need testing. The `epi_slide` implementation needs testing and
documentation and examples. The issues vignette still needs to be completed with
this implementation.
@brookslogan brookslogan mentioned this pull request Nov 9, 2021
- Renamed `epi_df_archive` to `epi_archive`
- Started updating documentation for `epi_slide()`
- Rebuilt documentation
- This allows us to create an `epi_df` object directly from columns,
  just like `tibble()` or `data.frame()`
- It also gives us a better place to point the documentation to, for
  describing the structure of an `epi_df` object (previously we were
  pointing to `as_epi_df()`, which was a bit awkward)
- Rename `as.epi_df()` to `as_epi_df()` to be consistent with tibble
  and tsibble functions
- Rebuild doc and vignettes
- Trying to make `epi_archive` "look" more like `epi_df` in terms
  of both the way the user sees it, and the way we maintain it
- Rename "issue" to be "version"
- Enforce "geo_value", "time_value" columns (no custom names)
- Added `epi_archive()` and `as_epi_archive()` functions
- Ensure that `epi_archive$as_of()` returns a tsibble, and make
  sure that this can be pushed through `as_epi_df()` without us
  losing the `tbl_ts` class
- Fiddle with various small tweaks to `epi_df` stuff
- Check out vignettes/weird.Rmd. Something weird is going on. When
  we have an `epi_df` that is `tbl_df`, then the `epi_df` class is
  kept when we do things like `dplyr::select()`. But when we have
  an `epi_df` that is `tbl_ts`, then the `epi_df` class is DROPPED
  when we do things like `dplyr::select()`
- I have no idea why this is going on. Also, I'm not sure what
  behavior we actually want. Seems like being brittle could be a
  good thing across many dplyr verbs (e.g., `dplyr::select()` on
  an `epi_df` could drop the required columns of an `epi_df, so
  the result SHOULDN'T be an `epi_df`)
@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 3, 2022

Weird thing is probably because [relocate calls [ and:]

tibble:::`[.tbl_df`

looks like it copies the class and other attributes using tibble:::vectbl_restore (calling tibble_restore_impl),
but

tsibble:::`[.tbl_ts`

does not look like it does the same [it rebuilds the result as a tibble or tsibble rather than reassigning attributes].

@ryantibs ryantibs changed the title Add epi_tibble_archive Add epi_archive Feb 3, 2022
- Refactor the way key variables work within an `epi_archive` so
  we have "other key variables" (apart from geo, time, version)
  which is back to the way Logan had it originally
- Get rid of `as_of()` returning a tsibble; now it gives a tibble
  but it puts the key variables into the metadata
- Add `as_tibble.epi_df()` S3 method so that it knows how to set
  the key variables sensibly from "geo_value" and the metadata, if
  keys exist in the metadata; and it also takes the index to be
  "time_value". This means there will be no loss of information if
  we do something like epi_archive -> epi_df -> tsibble
- Continuing to edit/refine the way we implement and document this
- Now we have the following symmetry: `epi_df` and `epi_archive`
  each have their own documentation files (and hopefully this is
  picked up on with auto-linking when the vignettes are rebuilt),
  which explain the formats in a kind of symmetric way
- We also have `as_epi_df()` and `as_epi_archive()` as fronts for
  converting data to those formats, which also proceed somewhat
  symmetrically
- Got rid of explicit constructor for `epi_df`; but the explicit
  constructor for `epi_archive` still lives in R6's `new()` method
  (which is confusingly overwritten by `initialize()`, yet still
  accessible to users via `new()`). I think this lack of symmetry
  is fine, because these are genuinely different structures
- Cleaned up the `as_of()` method of for `epi_archive`, and also
  introduced a `min_time_value` argument which should make sliding
  much more efficient
- Have yet to look at sliding, or revisit vignettes in any way
- Also fix assorted little things here and there
- Still need to work through slide and rebuild vignettes
- Clean up a lot more on `epi_archive` (and `epi_df` by symmetry)
- Simplify `epi_slide()`: now it does not require the user to set
  the column type explicitly, only specify whether they want it to
  be a list column or not
- Re-implement sliding for archive objects. Make it a method of the
  `epi_archive` class rather than a special case of `epi_slide()`.
  Just simpler that way. Also, I think Logan's understanding or
  conception of sliding here was different than mine: mine is now
  clearly explained in the documentation. Basically it is just the
  version-aware analog of `epi_slide()` for `epi_df` objects
- Start writing epi archive vignette. Got decently far but got
  stuck before I could test out the `slide()` method. I got stuck
  on this: I think we need a `join()` method to `epi_archive` so
  it knows how to join on another data table onto its own one, and
  perform LOCF appropriately, update its metadata, etc. I want to
  do this at the data table level (rather than data frame or tibble
  level) because it should be faster. But something isn't quite
  working for me
- Committing for now; still need to figure out joins, and test out
  sliding, then finish and rebuild vignettes
- I forgot to include it in the last commit
- Now we have two grouping variables, both customizable: `shift_by`
  and `cor_by`. The former specifies which variables we want to grp
  on before shifting, and the latter before computing correlations.
  They can be different in general
- Update correlation vignette and test out it works as expected
- Move epi_archive stuff into a wip folder (along with derivative
  vignette, which is no longer something to highlight)
- Rebuld documentation and vignettes
@ryantibs ryantibs marked this pull request as ready for review February 7, 2022 17:08
@ryantibs
Copy link
Member

ryantibs commented Feb 7, 2022

@lcbrooks I'm going to go ahead and merge this PR. I ended up doing a lot of work on this branch that I think will be valuable beyond archive that I just want to get this in the main branch, and take a break from the archive work while I pursue other things. Summary outside of the epi_archive work:

  • Cleaned up a lot of epi_df and epi_slide() behavior so that they're simpler and more complete
  • Generalized epi_cor() so that it takes two "by" arguments: shift_by and cor_by
  • Lots of documentation improvements and updates

This archive work is now in a wip folder, and currently removed from the main package (so no users try to user it at this point). We should revisit it in another PR to finish it off. It is close to being finished. Summary of what's left:

  • I think we need a join() method for epi_archive so it knows how to join on another data table onto its own one, and perform LOCF appropriately, update its metadata, etc. I want to do this at the data table level (rather than data frame or tibble level) because it should be faster
  • Still need to test out the slide() method for epi_archive (which I reimplemented)
  • Still need to finish off the archive vignette, probably the best idea just to repeat the case forecasting example from the slide vignette, but with doctor's visits as a covariate

@ryantibs ryantibs merged commit 23c829f into main Feb 7, 2022
@ryantibs ryantibs deleted the introduce-epi-archive branch February 20, 2022 02:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants