
Conversation

@ryantibs
Member

@ryantibs ryantibs commented Nov 3, 2020

Hi all, particularly @jacobbien @dajmcdon @bnaras @JedGrabman on the evalcast development side, and also @larrywasserman @ValerieVentura @ajgreen93 @capolitsch @brookslogan on the forecaster development/evaluation side.

I spent a little while trying to put evalcast through a "dry run", where I analyzed CMU-TimeSeries in tandem with a bunch of the major forecasters up on the COVID Hub. See here for a preview of covidhub_evaluation.html.

There are a bunch of noteworthy things here, both on the evalcast development side and on the forecaster evaluation side.

Evalcast development

I ran into several blockers which prevented this from being an "easy" task with evalcast. I marked each with a TODO in the notebook, then suggested what we need to do to address the blocker. I know @dajmcdon opened up a bunch of new issues. It would be worth going through my TODOs to map them onto the issues and see if there's anything new that we need to track. N.B. @dajmcdon

Actually, I didn't mark any of the plotting + analysis functionality that I created here to do evaluation as a TODO, but I do find these kinds of plots + analyses helpful (for many forecasters and many forecast dates, I found them more helpful than the plotting options offered by evalcast, to be honest). So it would be great to integrate this functionality into evalcast.

I will say one last thing on the evalcast development side: speed kills. Several parts here were slow (read my TODOs); if we can make a lot of this run faster, then that would be ideal for real forecaster development "in the wild". Otherwise if it ends up being way too slow, then I could see people getting frustrated and writing their own custom code.

Forecast evaluation

I compared CMU-TimeSeries, for state death incidence forecasts, to about 12 or so other forecasters, over the forecast dates available for CMU-TimeSeries. I looked at a bunch of different ways of scoring and analyzing the forecasts. I didn't really write any interpretations or takeaways in the notebook itself; it's pretty bare bones. The summary:

  1. The ensemble rules, any way you look at it.

  2. In terms of WIS (or AE), CMU-TimeSeries is usually around 3rd or 4th (not counting the ensemble), following YYG-ParamSearch, UMass-MechBayes, and OliverWyman-Navigator, depending on the setup: the ahead value, and the choice of mean or median as the aggregator. These 4 are usually all quite close.

  3. CMU-TimeSeries has pretty bad coverage for central 80% intervals at ahead = 1. Better for ahead = 2, 3, and 4.

  4. I'm partial to looking at scaled WIS, where you scale by the WIS of COVIDhub-baseline. As I explain in the notebook, you can think of this as providing a nonparametric spatiotemporal adjustment. The results end up being grossly similar: CMU-TimeSeries, YYG-ParamSearch, UMass-MechBayes, and OliverWyman-Navigator occupy the top 4 spots (not counting the ensemble), in some order, depending on the setup.

  5. Finally, I considered a kind of "pairwise tournament" inspired by Johannes Bracher's recent analysis. Here you basically look at the scaled WIS of every forecaster with respect to every other one and create one big matrix. Then you can summarize the performance of each forecaster using a geometric mean of the rows.

    I'm showing you what it looks like with the median as the aggregation function (and this is done over all aheads). I like the median because it's the most robust (you can see in the code that, for the mean, I needed to use a trimmed mean to avoid overflow):

    [Image: pairwise_tournament (matrix of pairwise median WIS ratios across forecasters)]

    For row a and column b, you can read the cell as showing median{ WIS(a) / WIS(b) }. And here's the overall summary of performance, given by the geometric mean of the rows (a rough code sketch of the whole computation follows the table):

    rank  forecaster             theta
       1  COVIDhub-ensemble      0.7100264
       2  UMass-MechBayes        0.7813381
       3  CMU-TimeSeries         0.8130536
       4  YYG-ParamSearch        0.8170257
       5  OliverWyman-Navigator  0.8176374
       6  GT-DeepCOVID           0.9035405
       7  COVIDhub-baseline      1.0974327
       8  LANL-GrowthRate        1.1369224
       9  UCLA-SuEIR             1.1818464
      10  MOBS-GLEAM_COVID       1.3135599
      11  UT-Mobility            1.3613481
      12  JHU_IDD-CovidSP        1.3929529
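To make items 4 and 5 concrete, here is a rough sketch of how the scaled WIS and the pairwise tournament could be computed. This is not the notebook's actual code, and the data layout (a long-format data frame `scores` with columns `forecaster`, `forecast_date`, `ahead`, `geo_value`, `wis`) is an assumption, not the exact evalcast output:

```r
# Sketch only; column names are illustrative, not the exact evalcast output.
library(dplyr)
library(tidyr)

# Scaled WIS: divide each forecaster's WIS by COVIDhub-baseline's WIS on the
# same forecast task (location, forecast date, ahead).
baseline <- scores %>%
  filter(forecaster == "COVIDhub-baseline") %>%
  select(forecast_date, ahead, geo_value, baseline_wis = wis)

scaled <- scores %>%
  inner_join(baseline, by = c("forecast_date", "ahead", "geo_value")) %>%
  mutate(scaled_wis = wis / baseline_wis)

# Aggregate per forecaster, with either the mean or the median as aggregator.
scaled_summary <- scaled %>%
  group_by(forecaster, ahead) %>%
  summarize(mean_scaled_wis = mean(scaled_wis, na.rm = TRUE),
            median_scaled_wis = median(scaled_wis, na.rm = TRUE),
            .groups = "drop")

# Pairwise tournament: for every ordered pair (a, b), take the median of
# WIS(a) / WIS(b) over all common forecast tasks (all aheads pooled).
pairs <- scores %>%
  inner_join(scores, by = c("forecast_date", "ahead", "geo_value"),
             suffix = c("_a", "_b")) %>%
  group_by(forecaster_a, forecaster_b) %>%
  summarize(ratio = median(wis_a / wis_b, na.rm = TRUE), .groups = "drop")

# Matrix form, then summarize each forecaster by the geometric mean of its row.
mat <- pairs %>%
  pivot_wider(names_from = forecaster_b, values_from = ratio) %>%
  tibble::column_to_rownames("forecaster_a") %>%
  as.matrix()

theta <- sort(exp(rowMeans(log(mat), na.rm = TRUE)))
```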

After we smooth out the infrastructure headaches (or even earlier, starting now), what I'm looking for is somebody to take over maintenance of this notebook. We should:

  • create a more streamlined version that doesn't look at all these options (I was just experimenting to see what was informative);
  • integrate other informative parts of the earlier evaluation reports that Jacob + others have built up (where does this code live?);
  • and update it regularly (READ: once per week!).

This person would likely also work closely with @capolitsch as he starts to work through our list of planned improvements to CMU-TimeSeries for state death forecasts. (Evaluation is, of course, going to be crucial for that.) Any takers? Or suggestions?

@ryantibs
Member Author

@krivard I see you're the code owner, but it looks like I have admin rights to merge this. I'd like to go ahead and merge this into main, but I'll wait a little bit in case you object. Summary: it's a work in progress, but functional. It's only a demo notebook anyway, to be used internally like our other notebooks, and it doesn't affect anything else, so it doesn't really matter.

@dajmcdon @JedGrabman @sgsmob Don't view this as a finished product. As you iteratively work on evalcast, please also iteratively improve this notebook.

@krivard
Contributor

krivard commented Nov 12, 2020

It looks like this notebook would be included in the batch that gets processed nightly by make.R for the signal dashboards (whose results are then published publicly). Is that intentional?

@ryantibs
Member Author

ryantibs commented Nov 12, 2020

Good point @krivard; there's no problem with making this public (it's not sensitive or anything). But I would exclude it from make.R only because it will be very very slow, and because @JedGrabman @sgsmob @dajmcdon are actively working to make it better and we don't need it automated until they are farther along.

@krivard
Contributor

krivard commented Nov 13, 2020

@ryantibs cool, let me know when you've added the fix to make.R and I'll approve.

@krivard krivard self-requested a review November 13, 2020 15:24
- Light renaming; make sure every dashboard has "dashboard" in its filename
- Modify make.R so it only compiles filenames containing "dashboard"
@ryantibs
Member Author

@krivard Done. I renamed some files and fixed up make.R (commit msg gives details).
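For context, a minimal sketch of the kind of filter the commit message describes. This is not the actual make.R diff, whose structure may differ; the working directory and output directory are assumptions:

```r
# Sketch only -- the real make.R may be organized differently.
# Compile only the notebooks whose filenames contain "dashboard".
rmd_files <- list.files(path = ".", pattern = "\\.Rmd$", full.names = TRUE)
dashboard_files <- rmd_files[grepl("dashboard", basename(rmd_files))]

for (f in dashboard_files) {
  # output_dir here is an assumption, not necessarily what make.R uses
  rmarkdown::render(f, output_dir = "docs")
}
```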

Contributor

@capnrefsmmat capnrefsmmat left a comment


One bug to fix, then this can merge

Contributor

@capnrefsmmat capnrefsmmat left a comment


Fixed

@capnrefsmmat capnrefsmmat merged commit 2547d3d into main Nov 14, 2020
@ryantibs ryantibs deleted the covidhub-eval branch November 14, 2020 21:19