Notebook: Confirmed Cases vs. Deaths #87

Closed
wants to merge 17 commits into from

@caglorithm caglorithm commented Mar 21, 2020

In this notebook I compare the number and change of confirmed cases per day with the number and change of deaths per country.

Output: https://caglorithm.github.io/covid19-analysis/covid-cases-to-deaths/

  • The notebook follows ideas from Tomas Pueyo's article "Coronavirus: Why You Must Act Now"
  • Estimated time of infection is 20 days before death
  • Fatality rate of each country is calculated for estimated infections
  • An overview of the number of confirmed cases vs. the number of deaths per country is given
  • Distance of the maximum of cumulative deaths and new deaths to the number of infections is used to estimate the progression of the infection across countries.

image
image
image
image

@review-notebook-app

Check out this pull request on  ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

@caglorithm
Author

Can't seem to trigger a new build. Any idea why?

@hamelsmu
Contributor

I need to clean up the CI system; this is not your fault. Ignore CI for now. I’ll review this in the next couple of days.

@hamelsmu
Contributor

There is a tremendous number of charts in this. I think, in many situations "less is more". Some suggestions

Can we get rid of one of these pairs of charts? Do we need both? I know that they both show something useful, just trying to encourage brevity, somehow.

image

I am not sure how this chart below helps convey useful information (it is super cool though I love looking at it!) about the fatality rate. What am I supposed to learn from this chart? Does this chart tell me anything interesting about the fatality rate between countries or help to explain the difference? What is the story here?

image

I'm not sure I understand the y-axis title in the chart below; perhaps it should be reworded to "confirmed cases per death"?

image

I'm not sure I understand dividing confirmed cases by the number of deaths, as they occur at different times. Also, what is the story or takeaway from this chart? What do you want the audience to learn? This is not so apparent to me. I would suggest getting rid of this chart; a more informative chart might show the overall rate of deaths/confirmed cases (not over time), with some error bars.
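One way to put error bars on an overall deaths/confirmed-cases ratio is a binomial confidence interval. The sketch below uses the Wilson score interval; the function name and the example counts are hypothetical, and the interval reflects only sampling noise, not the reporting delay between cases and deaths discussed in this thread.

```python
import math


def wilson_interval(deaths, cases, z=1.96):
    """Wilson score interval for the proportion deaths / cases.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if cases <= 0:
        raise ValueError("cases must be positive")
    p = deaths / cases
    denom = 1 + z**2 / cases
    center = (p + z**2 / (2 * cases)) / denom
    half_width = z * math.sqrt(p * (1 - p) / cases + z**2 / (4 * cases**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, for illustration only.
low, high = wilson_interval(deaths=50, cases=1000)
print(f"rate = {50 / 1000:.3f}, interval = ({low:.3f}, {high:.3f})")
```

The two returned bounds could serve as the lower and upper ends of an error bar per country.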

It would be good if you ran spell check as I see several spelling errors here.

On these charts, do we need all of these lines? Can we get rid of some of these marks? It is kind of busy.

image

Suggestion: get rid of maximum new deaths

Meta Suggestion: there is too much going on in this dashboard, suggest breaking it up into two dashboards, perhaps. Also try to get rid of extraneous information if possible.

Contributor

@hamelsmu hamelsmu left a comment

Thanks so much for submitting this PR! I think you have some really interesting information; it just needs to be presented more parsimoniously, perhaps broken up into different dashboards and with non-critical information removed.

@hamelsmu
Contributor

cc: @dansbecker b/c I think you will enjoy this

@caglorithm
Author

caglorithm commented Mar 21, 2020

Thanks for the review and your suggestions! I want to address them:

> There is a tremendous number of charts in this. I think, in many situations "less is more".

I agree! I have split the fatality rate display into another notebook, 2020-03-21-covid19-fatality-rates.ipynb, as you suggested. I want to plot each country above a certain threshold of cases, though, so I make them all available. For my chosen threshold values, the number of countries has increased in the last few days....

> Can we get rid of one of these pairs of charts? Do we need both? I know that they both show something useful, just trying to encourage brevity, somehow.

Both plots are informative on their own. The upper panel shows the total numbers. From this, for example, the number of estimated infections can be compared visually with the number of confirmed infections. The lower panels allow a comparison of the speed of growth of each wave and make it easier to see their temporal succession.

I've added more information about the plots now in the description, following your suggestions

I've cleaned up the plot a bit for better readability
image

> I am not sure how this chart below helps convey useful information (it is super cool though I love looking at it!) about the fatality rate. What am I supposed to learn from this chart? Does this chart tell me anything interesting about the fatality rate between countries or help to explain the difference? What is the story here?

My goal is to make the numbers accessible and not really tell a story.

> I'm not sure I understand the y-axis title in the chart below; perhaps it should be reworded to "confirmed cases per death"?

This plot is now in a separate notebook. I've changed the label of the axis according to your suggestion, thanks for making me aware of this.

> I'm not sure I understand dividing confirmed cases by the number of deaths, as they occur at different times. Also, what is the story or takeaway from this chart? What do you want the audience to learn? This is not so apparent to me. I would suggest getting rid of this chart; a more informative chart might show the overall rate of deaths/confirmed cases (not over time), with some error bars.

I've added an extended description for this part to make the displayed data easier to interpret. My goal was not to teach the audience anything but to present the numbers clearly. However, the data is indicative of whether tests are done early, before the wave of deaths, or not.

> On these charts, do we need all of these lines? Can we get rid of some of these marks? It is kind of busy.

I agree! I have removed the second measure and only display the measure that is used to compute the country-wise statistics below.

image
image

@hamelsmu
Contributor

  • The notebook 2020-03-21-covid19-cases-to-deaths still has the same graphs in it? I thought you were going to separate things out?
  • If you want to dump all the countries into a dashboard, I'm afraid that having a separate graph on the screen for each country is spamming the end user; it makes it really difficult to find the information you are looking for. May I suggest you refactor these graphs to use something like Altair? See this example. I can help if needed, but it may take 3+ weeks for me to get to this. If you want to start down this path, it will be helpful.
  • Let's think more about your title. We can do a better job communicating what 2020-03-21-covid19-cases-to-deaths.ipynb is about. Imagine these graphs are in a newspaper: what headline would you put on it, even if there is no story? For example, I believe the title needs to convey that you are showing the lag between being infected, being identified, and then also deaths. Currently the title and description don't do a good job of communicating this.
  • Please proofread this PR carefully. For example, in the section "Ahead of the curve":
> Some countries start testing the population earlier in the outbreak than others. The time delay between the wave of deaths and the wave of confirmed cases is indicative for how early a country is detecting new cases ahead of the increase of deaths. Earlier detection means a better chances for successful isolation of an infected person and treatment of the desease.

> We measure the distance of the maximum of cumulative deaths and new deaths to the number of infections to estimate the progression of the infection across countries.

> If, in the early phase of the infection wave, the number of deaths rises faster than the number of confirmed cases, the distance drops, indicating that

> A comparison of countries with respect to their mean time for reponse is presented below.

You will see that this needs some work, as the sentence "the distance drops, indicating that" stops abruptly.

  • Only one of your notebooks has an associated picture in the front matter? Why is that?
  • Did you preview the dashboards locally per the contributing guide? if not please do so.
  • I have other feedback, but let's start with those first, as those are the major problems I see.

Thanks for your patience and willingness to work through this! Appreciate it

@dansbecker

I'm a little concerned to see a growing ecosystem of analysis following the approach Tomas Pueyo suggests, but using wildly different assumptions and thus reaching totally different conclusions.

For example, @hamelsmu forwarded me another notebook this morning that assumed the time-to-death was 7 days. I don't know what the right answer is, but I'd point out that the 3x difference in assumed time-to-death implies almost an order of magnitude difference in predicted infection rates. And this is all driven by an assumption which has received less analysis than I'd hope.
Here is one set of opinions worth comparing your predictions to:
image

Having so little domain knowledge, I should hedge my claims about what's plausible... but this result seems quite suspect to me:

image

Italy had 5000 new infections a day when they hit their 50th positive test? That just seems too high.

Another way to look at it: the prediction grew by about 70X over the last 17 days I see predicted infections for. That is a daily growth rate of 68%. That's the highest growth rate estimate I've seen in any analysis of either infections or positive cases. If you were to extrapolate that forward to today (where we don't have predictions), it would be about 0.5M infections. That seems too high (recognizing that there are many reasons we might expect the infection growth to decelerate).

Given that estimated infections is so sensitive to the time-to-death assumption, and we see a wide range of results being published due to differences in this assumption, I think it'd be especially interesting to see an analysis to determine the correct value for that parameter. I don't immediately know the right way to do that analysis. I think graphs like this make a nice starting point for that analysis, though again it's not obvious to me how to turn that into the relevant statistic.
image

Also, most of these analyses I've seen use a point value for that. I think that approach has a serious shortcoming due to Jensen's inequality.

If you have a goal of reducing the number of graphs, I personally think you don't lose much by showing the raw/unnormalized numbers only. Any trend I'd want to see from the normalized graph, my eye can pick out pretty easily from the unnormalized graph.

image

One last comment: At some point, Hamel also suggested I double down on learning Altair. I was hesitant because I'd previously spent some time doing something slowly in Altair that I could have done quickly in matplotlib. But I followed the materials he recommended here and it's changed the way I do data visualization. If you want to try creating a dynamic Altair dashboard as he suggests, those are very good materials to use.

@hamelsmu
Contributor

Thanks so much for reviewing, @dansbecker. I really appreciate having another pair of eyes on this; it is very helpful to me 🙇 ❤️

@caglorithm
Author

caglorithm commented Mar 22, 2020

Thank you both for your detailed responses. I really appreciate the effort and I'm glad about discussing the results with you! I am learning along as I'm reading through your suggestions.

To @dansbecker's response:

> wildly different assumptions and thus reaching totally different conclusions

I am not aware of other attempts with wildly different assumptions and would appreciate it if you could point me towards them.

> another notebook this morning that assumed the time-to-death was 7 days

I don't know what this notebook does right or wrong, but the time from infection to death is certainly nowhere near 7 days. The value I've used is not something I came up with myself; it is available from the aggregate COVID-19 data repository here. Therein, you will find:
image
image

The time from infection to death is the incubation period plus the time from symptom onset to death. The data has been updated since I last looked, so the current mean value has changed from 20 to 23 days. I will update this accordingly in the notebook.

> Italy had 5000 new infections a day when they hit their 50th positive test? That just seems too high.

At the most recent case mortality rate of 8% in Italy, 627 patients died yesterday. These patients were infected, on average, 20 days before yesterday. From this, I conclude that 20 days earlier there would have been about 7837 new infections (627 / 0.08).

I also want to add that the upper panels of the plots show that these values are not grossly overestimated. The integral of the estimated new infections is close to the integral of the new confirmed cases.

> Another way to look at it: the prediction grew by about 70X over the last 17 days I see predicted infections for. That is a daily growth rate of 68%. That's the highest growth rate estimate I've seen in any analysis of either infections or positive cases.

The predicted value is the new death cases multiplied by a constant factor (the inverse fatality rate) and shifted on the time axis (by 20 days). Your observations about its growth properties are equally true for the number of new deaths.
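The scale-and-shift described here can be sketched in a few lines of pandas. This is a minimal illustration under the stated assumptions (a constant fatality rate and a fixed delay), not the notebook's actual code; the function name `estimate_new_infections` is hypothetical, and the numbers are the Italy figures quoted in this comment.

```python
import pandas as pd


def estimate_new_infections(new_deaths, fatality_rate, delay_days):
    """Scale daily deaths by 1 / fatality_rate and move them delay_days earlier.

    Assumes one row per day. The last delay_days rows become NaN because
    the deaths that would inform them have not happened yet.
    """
    if not 0 < fatality_rate <= 1:
        raise ValueError("fatality_rate must be in (0, 1]")
    scaled = new_deaths / fatality_rate
    return scaled.shift(-delay_days)


# Toy series: 21 days, with 627 deaths on the last day (value from the thread).
dates = pd.date_range("2020-03-01", periods=21, freq="D")
new_deaths = pd.Series(0.0, index=dates)
new_deaths.iloc[-1] = 627

estimated = estimate_new_infections(new_deaths, fatality_rate=0.08, delay_days=20)
print(estimated.iloc[0])  # estimate for the day 20 days before the last one
```

Because the estimate is a constant multiple of a shifted copy of the death series, it has exactly the same growth rate as the deaths themselves.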

Edit: Observe the growth rate for Iran from the above-mentioned repository:
image

> If you were to extrapolate that forward to today (where we don't have predictions), it would be about 0.5M infections. That seems too high (recognizing that there are many reasons we might expect the infection growth to decelerate).

There might be a very large difference between the number of confirmed cases and actual infections today in many countries. The graphs are not suited for extrapolation, since that would basically mean that you know what the exponential (or rather logistic) growth will look like. Extrapolating from an exponential can be dangerous (imagine looking at a graph of China 30 days after the 50th case).

> Given that estimated infections is so sensitive to the time-to-death assumption, and we see a wide range of results being published due to differences in this assumption, I think it'd be especially interesting to see an analysis to determine the correct value for that parameter. I don't immediately know the right way to do that analysis. I think graphs like this make a nice starting point for that analysis, though again it's not obvious to me how to turn that into the relevant statistic.

I agree that the values are very sensitive to the chosen delay. I've chosen to compare the delay between the new deaths and the new confirmed cases in order to estimate how well a country is ramping up its testing abilities compared to the growth of new deaths. I should think about incorporating the actual TP/TN values that some countries publish (for example, in Germany, 5% of the tests done turn out positive).

I didn't plot the absolute delay values for each country to compare them directly, but chose to plot their difference from the mean of all countries. I'm aware that this is not an established way of analysing this, but I thought it gave a nice visual intuition. A better way would be to do a proper delay analysis, using an autocorrelation measure for example.
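A delay analysis of this kind could be sketched with a cross-correlation between the two daily series (cross-correlation rather than autocorrelation, since two different series are being compared). The example below is only an illustration: the function name `estimate_lag` is hypothetical, and the data are synthetic bumps with a known 7-day offset, not real case counts.

```python
import numpy as np


def estimate_lag(cases, deaths):
    """Return the lag (in days) at which deaths best align with cases.

    A positive result means the deaths curve trails the cases curve.
    Both inputs are equally spaced daily series.
    """
    c = np.asarray(cases, dtype=float)
    d = np.asarray(deaths, dtype=float)
    c = c - c.mean()
    d = d - d.mean()
    # np.correlate(a, v, "full")[k] corresponds to lag k - (len(v) - 1).
    xcorr = np.correlate(d, c, mode="full")
    lags = np.arange(-(len(c) - 1), len(d))
    return int(lags[np.argmax(xcorr)])


# Synthetic example: deaths follow cases with a 7-day offset and a 5% scale.
t = np.arange(60)
cases = 1000 * np.exp(-0.5 * ((t - 25) / 3.0) ** 2)
deaths = 0.05 * 1000 * np.exp(-0.5 * ((t - 32) / 3.0) ** 2)

print(estimate_lag(cases, deaths))
```

On real data the peak of the cross-correlation is usually much broader, so the recovered lag should be read as a rough estimate rather than a precise delay.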

> Also, most of these analyses I've seen use a point value for that. I think that approach has a serious shortcoming due to Jensen's inequality.

Not sure if I get this.

> If you have a goal of reducing the number of graphs, I personally think you don't lose much by showing the raw/unnormalized numbers only. Any trend I'd want to see from the normalized graph, my eye can pick out pretty easily from the unnormalized graph.

I really want to keep both plots. I've been looking at them myself having to look up for the actual numbers but look down to see whether a value increased or decreased. It might be worth investing some time to be able to switch between different countries using altair though.

@hamelsmu
Contributor

hamelsmu commented Mar 22, 2020 via email

@dansbecker

dansbecker commented Mar 22, 2020

Thanks for the analysis on time-to-death @caglorithm .

I find it very compelling, and your response is going to inform the way I think about this problem moving forward.

Here is the notebook that uses 7 days. Pinging @jwrichar to think about whether he should change the 7 day assumption in his notebook.

I'm still missing something with your explanation: "I've been looking at them myself having to look up for the actual numbers but look down to see whether a value increased or decreased."

Don't the relative heights of the bars in the top graph say whether something went up or down?

For my Jensen's inequality comment, here's my very simplified example (I apologize for not doing a better job explaining my view than I'm about to). Say half of people die in 30 days and half die in 10. The mean (and median) time to death are both 20. But if you extrapolate cases 20 days ago off of deaths today, you're going to get the wrong number.

We could put some numbers on this. Let's stipulate a true data generating process as follows:

  • The doubling time is 5 days.
  • The case fatality rate is 5%.
  • 10 people died today. Half (5) were infected 30 days ago, and half (5) were infected 10 days ago.

The Pueyo method estimates infections 20 days ago, because that's the mean and median time-to-death.

It says 10 people died, and case fatality is 5%. So it'd estimate 200 infections 20 days ago.

Now let's break it down by subgroup time-to-death:
The people who were infected 10 days ago are not informative about infections 20 days ago.
Instead, we extrapolate from the people with a 30-day time-to-death.
Using his method for this subgroup, those 5 deaths correspond to 100 people who were infected 30 days ago. With doubling happening every 5 days, the count doubles twice between 30 days ago and 20 days ago. So 100 infections at T-30 correspond to 400 people at T-20.

The difference I'd call out here is that using median time to death reports 200, vs the 400 which would be "correct" in this toy example.

I don't like that my example does this extrapolation step from T-30 to T-20. But the point is that the mean and median time to death don't work as proxies for the integral over time to death in a non-linear system (an implication of Jensen's inequality).
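The arithmetic of this toy example can be reproduced directly. The sketch below uses only the numbers stipulated above (5-day doubling, 5% case fatality, delays of 10 and 30 days) and follows the same simplifications; the variable names are illustrative. The last two lines show the Jensen's inequality gap itself: the growth factor at the mean delay differs from the mean of the growth factors.

```python
doubling_time = 5.0   # days
cfr = 0.05            # case fatality rate
deaths_today = 10
mean_delay = 20       # mean (and median) of the 10-day and 30-day delays

# Point-value method: attribute all deaths to infections mean_delay days ago.
point_estimate = deaths_today / cfr

# Subgroup method from the comment: use only the 30-day deaths, then grow
# the T-30 infections forward to T-20 at the assumed doubling time.
slow_deaths = 5
slow_delay = 30
infections_at_t30 = slow_deaths / cfr
growth_t30_to_t20 = 2 ** ((slow_delay - mean_delay) / doubling_time)
subgroup_estimate = infections_at_t30 * growth_t30_to_t20

# Jensen's inequality: growth over the mean delay vs. mean growth over delays.
delays = [10, 30]
growth_at_mean_delay = 2 ** ((sum(delays) / len(delays)) / doubling_time)
mean_growth = sum(2 ** (d / doubling_time) for d in delays) / len(delays)

print(point_estimate, subgroup_estimate)   # about 200 vs. about 400
print(growth_at_mean_delay, mean_growth)   # 16 vs. 34
```

Because exponential growth is convex in the delay, the mean growth factor is always at least as large as the growth factor at the mean delay, which is why a single point value can mislead.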

I don't mean that example as a criticism of your work. I'm seeing a lot of Pueyo-style analysis. Every model has limitations, and I just want to call this out as one of the limitations of this particular approach.

@jwrichar
Contributor

> Here is the notebook that uses 7 days. Pinging @jwrichar to think about whether he should change the 7 day assumption in his notebook.

Thanks for the ping @dansbecker.

As has been discussed in this thread, the average-days-to-mortality value is critical for estimating mortality rate. In reality, this is the average days from the reporting of a test result to eventual mortality. And, as has been well documented, there are all sorts of delays in (1) contracting the virus to first symptoms, (2) first symptoms to symptoms severe enough to get tested, (3) doing the test (assuming it's even available and you qualify!) to the test results. So the ideal ~3 week (?) period gets pushed down significantly.

It would be great to be able to estimate that delay, but the reality is that it would simply be a degenerate parameter in the analysis (unless you made very strong assumptions somewhere else).

In the notebook you refer to, my main goal was to try to estimate the true case counts per US state, not to estimate the mortality rate. I anchor the analysis in the assumption that the underlying mortality rate is identical across all states, so that's where the window comes in. My hypothesis here is that (at least to first order) the window parameter does not impact the inferred test-volume-vs-case-underreporting curve. (In fact, we should test that 😄 ).

Thanks to @birdsarah for pointing out these underlying assumptions.

@dansbecker

@jwrichar I think you're misunderstanding my comment.

Neither the notebook in this PR nor my interpretation of your notebook is about estimating mortality rates.

Instead, I believe you are using deaths to estimate true infections N days prior, where N is the delay between infection and death. I believe your analysis uses a value of 7, whereas the data in the comments on this PR show infection-to-death typically takes 20 days.

Assigning infection rates to the right time period is especially important given the quick growth in infection. At current infection growth rates, using a time-to-death assumption of 20 days rather than 7 days will change your estimated underreporting ratio by nearly an order of magnitude.
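The size of this effect can be sketched under a pure exponential-growth assumption: moving the assumed delay from 7 to 20 days adds 13 days of growth between the estimated infections and the present, which is a factor of 2^(13/T) for a doubling time of T days. The function name and the doubling times tried below are illustrative assumptions, not values taken from either notebook.

```python
def delay_shift_factor(delay_short, delay_long, doubling_time):
    """Growth accumulated over the extra days between two delay assumptions."""
    return 2 ** ((delay_long - delay_short) / doubling_time)


# How much a 7-day vs. 20-day assumption matters for a few doubling times.
for doubling_time in (3, 4, 5):
    factor = delay_shift_factor(7, 20, doubling_time)
    print(f"doubling every {doubling_time} days -> factor {factor:.1f}")
```

For doubling times of roughly four to five days the factor lands between about 6 and 10, consistent with "nearly an order of magnitude".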

Incidentally, I suspect that's what's causing the mismatch between your estimated underreporting ratio and what most epidemiologists believe (e.g. as shown in this graph) https://user-images.githubusercontent.com/1390442/77235952-c9dddb00-6b7f-11ea-9ac2-68786bc901c1.png

@jwrichar
Contributor

@dansbecker thanks for the follow-up note.

> Instead, I believe you are using deaths to estimate true infections N days prior, where N is the delay between infection and death.

That is true, but only insofar as the model is estimating true infections N days prior in order to ascertain the most likely relationship between (state-wise) testing volume and case underreporting, subject to satisfying the underlying model assumption that the true (N-day) mortality rate is identical across all states.

Then, I take that fitted test-volume vs. case-underreported relationship (actually the posterior samples from the MCMC) and apply it to today's testing volume. The hypothesis that I was trying to lay out in my previous note is that the derivation of the "most likely" relationship is probably not all that sensitive to the choice of N (although the magnitude of the underlying mortality rates would be!).

> Incidentally, I suspect that's what's causing the mismatch between your estimated underreporting ratio and what most epidemiologists believe (e.g. as shown in this graph) https://user-images.githubusercontent.com/1390442/77235952-c9dddb00-6b7f-11ea-9ac2-68786bc901c1.png

I'm not so sure about that. I think the fundamental reason for those differences is the lack of testing for asymptomatic cases. Since tests in the US are so heavily biased toward symptomatic individuals, the model is completely ignorant of the entire sub-population of asymptomatic cases. I'm guessing that the domain experts have this info built in to their predictions (I think I read that in 1 small Italian town where they did complete testing, ~50% of the cases were actually asymptomatic!) Perhaps when the testing volumes drastically increase in some states, this effect will start being captured by the model -- before that point, I think we'd have to directly encode domain knowledge of expected % of asymptomatic cases into the model.

Thanks again for the feedback and questions! Keep 'em coming.

@vladpke
Contributor

vladpke commented Mar 24, 2020

Johns Hopkins CSSE just deprecated the time series files you use in your notebooks and added new files with the same format. You will just have to update your urls to reflect these changes, since the deprecated files have incorrect data starting today. Have a look at the ---DEPRICATED WARNING--- in the link below.

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
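A minimal sketch of pointing a notebook at the replacement files might look like the following. The `_global` file names are given as I understand them from the JHU CSSE directory linked above, so please check them against that directory; the helper name `to_country_series` is hypothetical, and the reshaping assumes the wide layout (one column per date) that the time series files use.

```python
import pandas as pd

BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
)
# Assumed replacement file names; verify against the repository listing.
CONFIRMED_URL = BASE_URL + "time_series_covid19_confirmed_global.csv"
DEATHS_URL = BASE_URL + "time_series_covid19_deaths_global.csv"


def to_country_series(wide):
    """Sum provinces per country and return a frame indexed by date.

    Expects the columns Province/State, Country/Region, Lat, Long,
    followed by one column per date in m/d/yy format.
    """
    counts = wide.drop(columns=["Province/State", "Lat", "Long"])
    per_country = counts.groupby("Country/Region").sum().T
    per_country.index = pd.to_datetime(per_country.index, format="%m/%d/%y")
    return per_country


# Usage (requires network access):
# confirmed = to_country_series(pd.read_csv(CONFIRMED_URL))
# deaths = to_country_series(pd.read_csv(DEATHS_URL))
```

Keeping the URLs in one place makes it easier to adjust the notebooks if the upstream files are renamed again.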

@hamelsmu
Contributor

@caglorithm Let us know if you still plan on working on this PR. I'll go ahead and close this if there is no activity, but you can always feel free to reopen when you are ready, and would be happy to review anytime. Thank you 🙇

@hamelsmu
Contributor

Temporarily closing this PR. Feel free to reopen or comment to continue.

@hamelsmu hamelsmu closed this Mar 25, 2020
@caglorithm
Author

@vladpke Thank you for letting me know, I'm updating the notebooks. Something has changed with the format of the files as far as I can see.

@hamelsmu Thank you a lot for the review. I'm hosting this on my own github page right now (thanks for fastpages, it's amazing!). I'd be happy to contribute to this repo though so I'm fine with stripping the notebook to a state that you find useful for this project. We can leave the PR closed for now, I will bump you later when I have something :)

@dansbecker Thank you a lot for the insights. I will dig into it a bit more and get back to you if I have more to share.

Have a great day everyone!

@hamelsmu
Contributor

Thanks @caglorithm appreciate it
