
Conversation

@ryantibs
Member

@ryantibs ryantibs commented Nov 3, 2020

Hi all, particularly @jacobbien @dajmcdon @bnaras @JedGrabman on the evalcast development side, and also @larrywasserman @ValerieVentura @ajgreen93 @capolitsch @brookslogan on the forecaster development/evaluation side.

I spent a little while trying to put evalcast through a "dry run", where I analyzed CMU-TimeSeries in tandem with a bunch of the major forecasters up on the COVID Hub. See here for a preview of covidhub_evaluation.html.

There are a bunch of noteworthy things here, both on the evalcast development side and on the forecaster evaluation side.

Evalcast development

I ran into several blockers which prevented this from being an "easy" task with evalcast. I marked each with a TODO in the notebook, then suggested what we need to do to address the blocker. I know @dajmcdon opened up a bunch of new issues. It would be worth going through my TODOs to map them onto the issues and see if there's anything new that we need to track. N.B. @dajmcdon

Actually, I didn't mark any of the plotting + analysis functionality that I created here to do evaluation as a TODO, but I do find these kinds of plots + analyses helpful (for many forecasters and many forecast dates, I found them more helpful than the plotting options offered by evalcast, to be honest). So it would be great to integrate this functionality into evalcast.

I will say one last thing on the evalcast development side: speed kills. Several parts here were slow (read my TODOs); if we can make a lot of this run faster, then that would be ideal for real forecaster development "in the wild". Otherwise if it ends up being way too slow, then I could see people getting frustrated and writing their own custom code.

Forecast evaluation

I compared CMU-TimeSeries, for state death incidence forecasts, to about 12 or so other forecasters, over the forecast dates available for CMU-TimeSeries. I looked at a bunch of different ways of scoring and analyzing the forecasts. I didn't really write any interpretations or takeaways in the notebook itself; it's pretty bare bones. The summary:

  1. The ensemble rules, any way you look at it.

  2. In terms of WIS (or AE), CMU-TimeSeries is usually around 3rd or 4th (not counting the ensemble), following YYG-ParamSearch, UMass-MechBayes, and OliverWyman-Navigator, depending on the setup: the ahead value, and the choice of mean or median as the aggregator. These 4 are usually all quite close.

  3. CMU-TimeSeries has pretty bad coverage for central 80% intervals at ahead = 1. Better for ahead = 2, 3, and 4.

  4. I'm partial to looking at scaled WIS, where you scale by the WIS of COVIDhub-baseline. As I explain in the notebook, you can think of this as providing a nonparametric spatiotemporal adjustment. The results end up being grossly similar: CMU-TimeSeries, YYG-ParamSearch, UMass-MechBayes, and OliverWyman-Navigator occupy the top 4 spots (not counting the ensemble), in some order, depending on the setup.

  5. Finally, I considered a kind of "pairwise tournament" inspired by Johannes Bracher's recent analysis. Here you basically look at the scaled WIS of every forecaster with respect to every other one and create one big matrix. Then you can summarize the performance of each forecaster using a geometric mean of the rows.

    I'm showing you what it looks like with the median as the aggregation function (and this is done over all aheads). I like the median because it's the most robust (you can see in the code that, for the mean, I needed to use a trimmed mean to avoid overflow):

    [Image: pairwise_tournament (matrix of pairwise median WIS ratios across forecasters)]

    For row a and column b, you can read the cell as showing median{ WIS(a) / WIS(b) }. And here's the overall summary of performance, given by the geometric mean of the rows (a rough code sketch of the whole computation follows the table):

    rank  forecaster             theta
       1  COVIDhub-ensemble      0.7100264
       2  UMass-MechBayes        0.7813381
       3  CMU-TimeSeries         0.8130536
       4  YYG-ParamSearch        0.8170257
       5  OliverWyman-Navigator  0.8176374
       6  GT-DeepCOVID           0.9035405
       7  COVIDhub-baseline      1.0974327
       8  LANL-GrowthRate        1.1369224
       9  UCLA-SuEIR             1.1818464
      10  MOBS-GLEAM_COVID       1.3135599
      11  UT-Mobility            1.3613481
      12  JHU_IDD-CovidSP        1.3929529
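To make items 4 and 5 concrete, here is a rough sketch of how the scaled WIS and the pairwise tournament could be computed. This is not the notebook's actual code, and the data layout (a long-format data frame `scores` with columns `forecaster`, `forecast_date`, `ahead`, `geo_value`, `wis`) is an assumption, not the exact evalcast output:

```r
# Sketch only; column names are illustrative, not the exact evalcast output.
library(dplyr)
library(tidyr)

# Scaled WIS: divide each forecaster's WIS by COVIDhub-baseline's WIS on the
# same forecast task (location, forecast date, ahead).
baseline <- scores %>%
  filter(forecaster == "COVIDhub-baseline") %>%
  select(forecast_date, ahead, geo_value, baseline_wis = wis)

scaled <- scores %>%
  inner_join(baseline, by = c("forecast_date", "ahead", "geo_value")) %>%
  mutate(scaled_wis = wis / baseline_wis)

# Aggregate per forecaster, with either the mean or the median as aggregator.
scaled_summary <- scaled %>%
  group_by(forecaster, ahead) %>%
  summarize(mean_scaled_wis = mean(scaled_wis, na.rm = TRUE),
            median_scaled_wis = median(scaled_wis, na.rm = TRUE),
            .groups = "drop")

# Pairwise tournament: for every ordered pair (a, b), take the median of
# WIS(a) / WIS(b) over all common forecast tasks (all aheads pooled).
pairs <- scores %>%
  inner_join(scores, by = c("forecast_date", "ahead", "geo_value"),
             suffix = c("_a", "_b")) %>%
  group_by(forecaster_a, forecaster_b) %>%
  summarize(ratio = median(wis_a / wis_b, na.rm = TRUE), .groups = "drop")

# Matrix form, then summarize each forecaster by the geometric mean of its row.
mat <- pairs %>%
  pivot_wider(names_from = forecaster_b, values_from = ratio) %>%
  tibble::column_to_rownames("forecaster_a") %>%
  as.matrix()

theta <- sort(exp(rowMeans(log(mat), na.rm = TRUE)))
```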

After we smooth out the infrastructure headaches (or even earlier, starting now), what I'm looking for is somebody to take over maintenance of this notebook. We should:

  • create a more streamlined version that doesn't look at all these options (I was just experimenting to see what was informative);
  • integrate other informative parts of the earlier evaluation reports that Jacob + others have built up (where does this code live?);
  • and update it regularly (READ: once per week!).

This person would likely also work closely with @capolitsch as he starts to work through our list of planned improvements to CMU-TimeSeries for state death forecasts. (Evaluation is, of course, going to be crucial for that.) Any takers? Or suggestions?

@ryantibs
Member Author

@krivard I see you're the code owner, but it looks like I have admin rights to merge this. I'd like to go ahead and merge this into main, but I'll wait a little bit in case you object. Summary: it's a work in progress, but functional. It's only a demo notebook anyway, to be used internally like our other notebooks, and it doesn't affect anything else, so it doesn't really matter.

@dajmcdon @JedGrabman @sgsmob Don't view this as a finished product. As you iteratively work on evalcast, please also iteratively improve this notebook.

@krivard
Contributor

krivard commented Nov 12, 2020

It looks like this notebook would be included in the batch that gets processed nightly by make.R for the signal dashboards (whose results are then published publicly). Is that intentional?

@ryantibs
Member Author

ryantibs commented Nov 12, 2020

Good point @krivard; there's no problem with making this public (it's not sensitive or anything). But I would exclude it from make.R only because it will be very very slow, and because @JedGrabman @sgsmob @dajmcdon are actively working to make it better and we don't need it automated until they are farther along.

@krivard
Contributor

krivard commented Nov 13, 2020

@ryantibs cool, let me know when you've added the fix to make.R and I'll approve.

@krivard krivard self-requested a review November 13, 2020 15:24
- Light renaming; make sure every dashboard has "dashboard" in its filename
- Modify make.R so it only compiles filenames containing "dashboard"
@ryantibs
Member Author

@krivard Done. I renamed some files and fixed up make.R (commit msg gives details).
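For context, a minimal sketch of the kind of filter the commit message describes. This is not the actual make.R diff, whose structure may differ; the working directory and output directory are assumptions:

```r
# Sketch only -- the real make.R may be organized differently.
# Compile only the notebooks whose filenames contain "dashboard".
rmd_files <- list.files(path = ".", pattern = "\\.Rmd$", full.names = TRUE)
dashboard_files <- rmd_files[grepl("dashboard", basename(rmd_files))]

for (f in dashboard_files) {
  # output_dir here is an assumption, not necessarily what make.R uses
  rmarkdown::render(f, output_dir = "docs")
}
```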

Contributor

@capnrefsmmat capnrefsmmat left a comment


One bug to fix, then this can merge

Contributor

@capnrefsmmat capnrefsmmat left a comment


Fixed

@capnrefsmmat capnrefsmmat merged commit 2547d3d into main Nov 14, 2020
@ryantibs ryantibs deleted the covidhub-eval branch November 14, 2020 21:19