Draft COVID Hub eval #240
@krivard I see you're the code owner, but it looks like I have admin rights to merge this. I'd like to go ahead and merge this into main, but I'll wait a little while in case you object. Summary: it's a work in progress, but functional. It's only a demo notebook anyway, to be used internally like our other notebooks, and it doesn't affect anything else, so the risk is low. @dajmcdon @JedGrabman @sgsmob Don't view this as a finished product. As you iteratively work on evalcast, please also iteratively improve this notebook.
It looks like this notebook would be included in the batch that gets processed nightly by |
Good point @krivard; there's no problem with making this public (it's not sensitive or anything). But I would exclude it from make.R, only because it will be very slow, and because @JedGrabman @sgsmob @dajmcdon are actively working to make it better and we don't need it automated until they are further along.
@ryantibs cool, let me know when you've added the fix to |
- Light renaming, make sure every dashboard has "dashboard" in its filename
- Modify make.R so it only compiles filenames containing "dashboard"
@krivard Done. I renamed some files and fixed up make.R (commit msg gives details).
One bug to fix, then this can merge
Fixed
Hi all, particularly @jacobbien @dajmcdon @bnaras @JedGrabman on the evalcast development side, and also @larrywasserman @ValerieVentura @ajgreen93 @capolitsch @brookslogan on the forecaster development/evaluation side.
I spent a little while trying to put evalcast through a "dry run", where I analyzed CMU-TimeSeries in tandem with a bunch of the major forecasters up on the COVID Hub. See here for a preview of covidhub_evaluation.html.
There are a bunch of noteworthy things here, both on the evalcast development side and on the forecaster evaluation side.
Evalcast development
I ran into several blockers which prevented this from being an "easy" task with evalcast. I marked each with a TODO in the notebook, then suggested what we need to do to address the blocker. I know @dajmcdon opened up a bunch of new issues. It would be worth going through my TODOs to map them onto the issues and see if there's anything new that we need to track. Nb @dajmcdon
Actually, I didn't mark any of the plotting + analysis functionality that I created here to do evaluation as a TODO, but I do find these kinds of plots + analysis helpful (for many forecasters and many forecast dates, I found them more helpful than the plotting options offered by evalcast, to be honest). So it would be great to integrate this functionality into evalcast.
I will say one last thing on the evalcast development side: speed kills. Several parts here were slow (read my TODOs); if we can make a lot of this run faster, then that would be ideal for real forecaster development "in the wild". Otherwise if it ends up being way too slow, then I could see people getting frustrated and writing their own custom code.
Forecast evaluation
I compared CMU-TimeSeries, for state death incidence forecasts, to about 12 or so other forecasters, over the forecast dates available for CMU-TimeSeries. I looked at a bunch of different ways of scoring and analyzing the forecasts. I didn't really write any interpretations or takeaways in the notebook itself; it's pretty bare bones. The summary:
The ensemble rules, in every way you look at it.
In terms of WIS (or AE), CMU-TimeSeries is usually around 3rd or 4th, not counting the ensemble, following YYG-ParamSearch, UMass-MechBayes, and OliverWyman-Navigator, depending on the setup: ahead value, and the choice of mean or median as the aggregator. These 4 are all usually quite close.
CMU-TimeSeries has pretty bad coverage for central 80% intervals at ahead = 1. Better for ahead = 2, 3, and 4.
I'm partial to looking at scaled WIS, where you scale by the WIS of COVIDhub-baseline. As I explain in the notebook, you can think of this as providing a nonparametric spatiotemporal adjustment. The results end up being broadly similar: CMU-TimeSeries, YYG-ParamSearch, UMass-MechBayes, and OliverWyman-Navigator occupy the top 4 spots (not counting the ensemble), in some order, depending on the setup.
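For anyone who wants to see the scoring in concrete terms, here is a minimal Python sketch of WIS and of scaling by a baseline. It is not the notebook's code and does not use evalcast; the function names (`interval_score`, `weighted_interval_score`, `scaled_wis`) and the `{alpha: (lower, upper)}` input layout are assumptions made for illustration. The WIS weighting follows the standard definition (weight 1/2 on the absolute error of the median, weight alpha/2 on each central interval).

```python
import numpy as np

def interval_score(y, lower, upper, alpha):
    """Interval score for a central (1 - alpha) prediction interval."""
    width = upper - lower
    below = (2.0 / alpha) * max(lower - y, 0.0)  # penalty if y falls below the interval
    above = (2.0 / alpha) * max(y - upper, 0.0)  # penalty if y falls above the interval
    return width + below + above

def weighted_interval_score(y, median, intervals):
    """WIS from a predictive median and a dict {alpha: (lower, upper)} (hypothetical layout)."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(y, lower, upper, alpha)
    return total / (len(intervals) + 0.5)

def scaled_wis(wis, baseline_wis):
    """Elementwise WIS ratio against the baseline on the same (location, date, ahead) tasks.

    Assumes both arrays are aligned task by task and the baseline WIS is positive.
    """
    return np.asarray(wis, dtype=float) / np.asarray(baseline_wis, dtype=float)

# Aggregate with either the mean or the median, as discussed above.
ratios = scaled_wis([2.0, 3.0, 10.0], [4.0, 3.0, 5.0])
print(np.mean(ratios), np.median(ratios))
```

Values below 1 mean the forecaster beat the baseline on that task; because each task is divided by the baseline's score on the same task, hard locations and hard weeks are discounted automatically.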
Finally, I considered a kind of "pairwise tournament" inspired by Johannes Bracher's recent analysis. Here you basically look at the scaled WIS of every forecaster with respect to every other one and create one big matrix. Then you can summarize the performance of each forecaster using a geometric mean of the rows.
I'm showing you what it looks like for the median as the aggregation function (and this is done over all aheads). I like the median because it's the most robust (you can see in the code that I needed to use a trimmed mean to avoid overflow):
For row a and column b, you can read the cell as showing you median{ WIS(a) / WIS(b) }. And here's the overall summary of performance, given by the geometric mean of the rows:
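As a rough illustration of the tournament described above (again, not the notebook's actual code), the matrix and its row summaries can be sketched in Python like this. The function name `pairwise_tournament` and the input layout are hypothetical, and the sketch assumes every forecaster has a strictly positive WIS on the same tasks in the same order. It also assumes the diagonal (always 1) is included in the geometric mean, which the description above does not specify.

```python
import numpy as np

def pairwise_tournament(wis):
    """wis: dict {forecaster: WIS values over the same tasks, in the same order}.

    Returns (names, matrix, summary), where matrix[a, b] = median(WIS(a) / WIS(b))
    and summary maps each forecaster to the geometric mean of its row.
    """
    names = sorted(wis)
    scores = np.array([wis[n] for n in names], dtype=float)
    # ratios[a, b, t] = WIS(a) / WIS(b) on task t
    ratios = scores[:, None, :] / scores[None, :, :]
    matrix = np.median(ratios, axis=2)
    # Geometric mean of each row, computed on the log scale for stability.
    summary = np.exp(np.log(matrix).mean(axis=1))
    return names, matrix, dict(zip(names, summary))

names, matrix, summary = pairwise_tournament({
    "A": [1.0, 2.0, 4.0],
    "B": [2.0, 4.0, 8.0],
})
print(names)
print(matrix)
print(summary)
```

A row summary below 1 means that forecaster tends to have lower WIS than the field. Swapping `np.median` for a mean would make the matrix sensitive to a handful of tasks with tiny denominators, which is consistent with the robustness point made above.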
After we smooth out the infrastructure headaches (or even earlier starting now), what I'm looking for is for somebody to take over maintenance of this notebook. We should:
This person would likely also work closely with @capolitsch as he starts to work through our list of planned improvements to CMU-TimeSeries for state death forecasts. (Evaluation is of course going to be crucial for that.) Any takers? Or suggestions?