Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi signals #156

Merged
merged 21 commits into from
Nov 1, 2020
Merged

Multi signals #156

merged 21 commits into from
Nov 1, 2020

Conversation

ryantibs
Copy link
Member

@ryantibs ryantibs commented Oct 28, 2020

Hi all, particularly capnrefsmmat @chinandrew, this finishes off the covidcast_signals() function, and provides related signal aggregation functionality.

Because I branched this off of evalcast-review (which was branched off of evalcast), the commit log here is huge. My first "real" commit for this branch starts at b42b353. Once evalcast gets merged into main #85 (should happen within a couple days), the commit log here will look a lot cleaner, and we can merge this branch. chinandrew has already provided an informal (?) review, maybe he can take one more look.

┆Issue is synchronized with this Asana task by Unito

- As written, latest_issue() would retain all attributes of the
  original data frame, including the row names. But the row names
  should change because of the filtering
- Left the original column names in there for good measure too
- Applied the same fix to earliest_issue()
- Operates on a list of covidcast_signal data frames, as provided
  by covidcast_signals(), for example
- Can specify "wide" or "tall" format, with former as the default
- For convenience, this can operate on a list of data frame with
  time-shifted values appended to their columns, as provided by
  append_shifts(), for example
- Also, downstream helpers covidcast_longer(), covidcast_wider()
  are there if you change your mind
- covidcast_wider() is not fully implemented: it won't work when
  time-shifted columns are present
- Neither of the downstream helpers are documented
- I'm not certain that having the dt column be dynamically present
  (only present when there are time-shifted columns) is the best
  solution; but it seems confusing to add it as the default, since
  there user wouldn't know it's coming from if he/she never called
  append_shifts()
- I'm also not certain I like the following asymmetry: when you
  call append_shifts() you always get a wide result, but when you
  call aggregate_signals() you can choose between a wide or a tall
  result. If you wanted to get column-shifted values in tall form,
  then you'd have to do something silly, where you wrap the result
  in a list and then call aggregate_signals() with format = "tall"
- On the other hand, I don't want to add yet another option to
  append_shifts(), because that feels like it's getting overly
  featurized, offering every possible feature in every possible
  case
- Thanks to Andrew Chin for helpful feedback
- Rename append_shifts() to apply_shifts(): now you have to ask
  for dt = 0 in order to retain the original value column, else it
  is dropped. This is cleaner behavior and leads to easier logic
  downstream
- Hide apply_shifts() from the documentation. Remains in NAMESPACE
  so technically a user can call it but it's not visible. This is
  probably best as it presents a simpler set of options to the end
  user, and hides the overlap and somewhat ugly asymmetry between
  apply_shifts() and aggregate_signals()
- Change aggregate_signals() so that it takes an argument dt to do
  shifts; clarify in the documentation the three main use cases
  for this function
- Fix up covidcast_wider() and covidcast_taller(), and vignette:
- Strike direction from the record: doc, covidcast_signal object,
  plotting functionality
- Remove geo_type from the attributes of covidcast_signal object,
  since it already appears in the metadata attribute
- Change existing dependencies to fetch geo_type from metadata
- Change vignettes to use USAFacts rather than JHU CSSE cata
- Reorder columns in covidcast_signal object in a more sensible
  way, and modify aggregation functions to preserve this ordering
@chinandrew chinandrew changed the base branch from main to evalcast October 28, 2020 18:21
@chinandrew
Copy link
Collaborator

Changed the based of this PR to evalcast so 1) if it gets merged it doesn't accidently merge in all of evalcast to main, and 2) makes the diff and commit log a bit easier to read. It's easy to switch back if needed
Screenshot from 2020-10-28 11-23-29

- These were incompatible with how aggregate_signals() interprets
  dt, and I think the latter is more natural. Negative translates
  into a lag value, positive into a lead value
- Update documentation, correlation vignette
- Modify the intro-api talk to use the new dt_x argument properly,
  and it looks like nothing else needs to change (no other
  dependencies on covidcast_cor() with dt_x or dt_y being nonzero)
- Also fix a small bug to allow aggregate_signals() to operate
  directly on a data frame (not just a list)
- Do a small bit of code refactoring and some doc updates
@chinandrew
Copy link
Collaborator

As mentioned in #10 the interface here looks good to me, though someone more familiar with R paradigms should have the final say.

@ryantibs
Copy link
Member Author

@chinandrew @capnrefsmmat Few more updates. First, I filled out the vignette so this branch should be good to go. Andrew approves but suggests somebody more familiar with R takes a look. To repeat this doesn't need to be merged until evalcast gets merged, which shouldn't be for a few days at least. If Alex you're too busy then maybe @JedGrabman can take a look.

Second, a spotted a discrepancy in the way I defined dt in aggregate_signals() and covidcast_cor(). The definition in the former function is more natural; the latter has it flipped. So I changed everything surrounding covidcast_cor() to match aggregate_signals() and looked around for code that would break; all I could find was the correlations vignette and the intro-api talk, both of which I fixed.

To summarize:

  • This should be merged into main after evalcast gets merged.
  • This requires us to rebuild the documentation and the vignettes.
  • Maybe we should record as an update that covidcast_cor() now interprets dt_x and dt_y in the opposite way it did before, so that negative values mean lagging and positive values mean leading (which is more natural).

@capnrefsmmat
Copy link
Contributor

Maybe we should record as an update that covidcast_cor() now interprets dt_x and dt_y in the opposite way it did before, so that negative values mean lagging and positive values mean leading (which is more natural).

That can go into NEWS.md, and we can bump the minor version of the package accordingly. Note that this branch doesn't contain #148, so there'll be a conflict if you do this.

@ryantibs
Copy link
Member Author

@capnrefsmmat @chinandrew Upon reflection there was really no reason for me to have branched this off of the evalcast branch. I was just working on something else that I needed the evalcast package for, and then I started implementing these multi signal functions because I also needed them. But my "real" commits starting with b42b353 have noting to do with evalcast.

Since the evalcast branch is actually held up with more development needed, what's the best way to jut pick off the commits starting with b42b353 through the end, and move them to a separate branch, which is created off of main? Then we could just merge that new into main.

@ryantibs ryantibs mentioned this pull request Oct 31, 2020
@capnrefsmmat
Copy link
Contributor

We'd have to carefully cherry-pick, since there are changes to evalcast in this series of commits, namely in
1e674d8. Those already seem to have a merge conflict, so we'd want to cherry-pick every other commit into a new branch.

That's doable, although whether or not we do it, I think it still makes sense to merge evalcast and continue further development of it in new PRs targeting specific features/fixes.

@chinandrew
Copy link
Collaborator

chinandrew commented Oct 31, 2020

You should either be able to do a rebase like git rebase --onto main evalcast multi-signals, or creating a new branch and cherry-picking the commit range you want: git cherry-pick start_commit^..end_commit

EDIT: looks like alex beat me to it.

@ryantibs
Copy link
Member Author

ryantibs commented Nov 1, 2020

Thanks both of you! I guess I could try to cherry-pick but it might be easiest just to merge evalcast to main as it stands now (Alex I just left a message agreeing with your logic on #85).

@ryantibs ryantibs changed the base branch from evalcast to main November 1, 2020 00:59
capnrefsmmat and others added 3 commits October 31, 2020 21:12
- Change vignettes references to URL links (since vignettes aren't
  built along with the package by default)
- Use rlang::warn more consistently (whenever it might be useful)
- Copyedit a few small spots including the DESCRIPTION
@ryantibs
Copy link
Member Author

ryantibs commented Nov 1, 2020

@capnrefsmmat Well I snuck in a few other really small light housekeeping things. See the details of my last commit. I managed to do it before you merge this into main (but apparently not before you merged main into this!).

Copy link
Contributor

@capnrefsmmat capnrefsmmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed two issues while preparing to merge

R-packages/covidcast/R/cor.R Outdated Show resolved Hide resolved
R-packages/covidcast/R/wrangle.R Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Finish the covidcast_signals function
3 participants