Notebook: Confirmed Cases vs. Deaths #87
Conversation
Check out this pull request on ReviewNB. You'll be able to see the Jupyter notebook diff and discuss changes. Powered by ReviewNB.
Can't seem to trigger a new build. Any idea why?
I need to clean up the CI system, so this is not your fault. Ignore CI for now. I'll review this soon, in the next couple of days.
There are a tremendous number of charts in this. I think that in many situations "less is more". Some suggestions:

- Can we get rid of one of these pairs of charts? Do we need both? I know they both show something useful; I'm just trying to encourage brevity somehow.
- I am not sure how the chart below helps convey useful information about the fatality rate (it is super cool, though, and I love looking at it!). What am I supposed to learn from this chart? Does it tell me anything interesting about the fatality rate between countries, or help explain the difference? What is the story here?
- I'm not sure I understand the y-axis title in the chart below; perhaps it should be re-worded. I'm also not sure I understand dividing confirmed cases by the number of deaths, as they occur at different times. What is the story or takeaway from this chart? What do you want the audience to learn? It is not apparent to me. I would suggest getting rid of this chart; a more informative one might show the overall rate of deaths/confirmed cases (not over time), with some error bars.
- It would be good if you ran a spell check, as I see several spelling errors here.
- On these charts, do we need all of these lines? Can we get rid of some of these marks? It is kind of busy.
- Suggestion: get rid of Meta.
- Suggestion: there is too much going on in this dashboard; consider breaking it up into two dashboards, perhaps. Also try to get rid of extraneous information if possible.
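The "overall rate with error bars" suggestion could be sketched like this. All numbers and country labels below are made up for illustration; the real notebook would pull cumulative deaths and confirmed cases from the JHU time series.

```python
import numpy as np

# Hypothetical cumulative counts per country (illustrative only).
countries = ["A", "B", "C"]
deaths    = np.array([5000, 100, 500])
confirmed = np.array([60000, 25000, 33000])

rate = deaths / confirmed
# Normal-approximation 95% interval for a proportion -- a rough choice
# that ignores reporting delay, but enough for a first visual comparison.
err = 1.96 * np.sqrt(rate * (1 - rate) / confirmed)

for c, r, e in zip(countries, rate, err):
    print(f"{c}: {100 * r:.1f}% ± {100 * e:.2f}%")
```

Plotted as a single bar-with-error-bars chart, this replaces several time-series panels with one comparison.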
Thanks so much for submitting this PR! I think you have some really interesting information; it just needs to be presented more parsimoniously, perhaps broken up into different dashboards, and perhaps with non-critical information eliminated.
cc: @dansbecker b/c I think you will enjoy this
You will see that this needs some work.
Thanks for your patience and willingness to work through this! Appreciate it.
I'm a little concerned to see a growing ecosystem of analysis following the approach Tomas Pueyo suggests, but using wildly different assumptions and thus reaching totally different conclusions. For example, @hamelsmu forwarded me another notebook this morning that assumed the time-to-death was 7 days. I don't know what the right answer is, but I'd point out that the 3x difference in assumed time-to-death implies almost an order of magnitude difference in predicted infection rates. And this is all driven by an assumption which has received less analysis than I'd hope.

Having so little domain knowledge, I should hedge my claims about what's plausible... but this result seems quite suspect to me: Italy had 5000 new infections a day when they hit their 50th positive test? That just seems too high.

Another way to look at it: the prediction grew by about 70X over the last 17 days I see predicted infections for. That is a daily growth rate of about 28%. That's the highest growth rate estimate I've seen in any analysis of either infections or positive cases. If you were to extrapolate that forward to today (where we don't have predictions), it would be about 0.5M infections. That seems too high (recognizing that there are many reasons we might expect the infection growth to decelerate).

Given that estimated infections is so sensitive to the time-to-death assumption, and we see a wide range of results being published due to differences in this assumption, I think it'd be especially interesting to see an analysis to determine the correct value for that parameter. I don't immediately know the right way to do that analysis. I think graphs like this make a nice starting point, though again it's not obvious to me how to turn that into the relevant statistic.

Also, most of these analyses I've seen use a point value for that. I think that approach has a serious shortcoming due to Jensen's inequality.
If you have a goal of reducing the number of graphs, I personally think you don't lose much by showing the raw/unnormalized numbers only. Any trend I'd want to see from the normalized graph, my eye can pick out pretty easily from the unnormalized graph.

One last comment: at some point, Hamel also suggested I double down on learning Altair. I was hesitant because I'd previously spent some time doing something slowly in Altair that I could have done quickly in matplotlib. But I followed the materials he recommended here and it's changed the way I do data visualization. If you want to try creating a dynamic Altair dashboard as he suggests, those are very good materials to use.
Thanks so much for reviewing @dansbecker, really appreciate having another pair of eyes on this; it is very helpful to me 🙇 ❤️
Thank you both for your detailed responses. I really appreciate the effort, and I'm glad to discuss the results with you! I am learning as I read through your suggestions. To @dansbecker's response:
> wildly different assumptions and thus reaching totally different conclusions

I am not aware of other attempts with wildly different assumptions and would appreciate it if you could point me towards them.
> another notebook this morning that assumed the time-to-death was 7 days

I don't know what this notebook does right or wrong, but the time from infection to death is certainly nowhere near 7 days. The value I've used is not something I came up with; it is available from the aggregate COVID-19 data repository [here](https://github.com/midas-network/COVID-19/tree/master/parameter_estimates/2019_novel_coronavirus). Therein, you will find:

![image](https://user-images.githubusercontent.com/7763212/77248882-3d91ed80-6c3d-11ea-85f5-939a3148a616.png)
![image](https://user-images.githubusercontent.com/7763212/77248885-4256a180-6c3d-11ea-97a6-c40f84e20e06.png)

The time from infection to death is the incubation period plus the time from symptom onset to death. The data has been updated since I last looked, so the current mean **value changed from 20 to 23 days**. I will update this accordingly in the notebook.
> Italy had 5000 new infections a day when they hit their 50th positive test? That just seems too high.

At the current case fatality rate of 8% in Italy, 627 patients died yesterday. These patients were infected, on average, 20 days before yesterday. From this I conclude that 20 days before, there would have been 7837 infections. I also want to add that it is visible in the upper panels of the plots that these values are not harshly overestimated: the integral of the estimated new infections is close to the integral of the new confirmed cases.
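The arithmetic described above can be written out in two lines; the numbers are the ones from this comment:

```python
# Backcast sketch: estimated infections N days ago equals
# new deaths today divided by the assumed case fatality rate.
new_deaths = 627            # deaths reported in Italy "yesterday"
case_fatality_rate = 0.08   # assumed CFR
delay_days = 20             # assumed mean time from infection to death

estimated_infections = new_deaths / case_fatality_rate
print(int(estimated_infections), "infections,", delay_days, "days ago")  # 7837
```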
> Another way to look at it: the prediction grew by about 70X over the last 17 days I see predicted infections for. [...] That's the highest growth rate estimate I've seen in any analysis of either infections or positive cases.

The predicted value is just the number of new deaths multiplied by a constant factor (the inverse fatality rate) and shifted on the time axis (by 20 days). Your observations about its growth properties are therefore equally true for the number of new deaths. Edit: observe the growth rate for Iran from the above-mentioned repository:
> If you were to extrapolate that forward to today (where we don't have predictions), it would be about 0.5M infections. That seems too high (recognizing that there are many reasons we might expect the infection growth to decelerate).

There might be a very strong difference between the number of confirmed cases and actual infections today, in many countries. The graphs are not suited for extrapolation, since that would basically mean you already know what the exponential (or rather logistic) growth will look like. Extrapolating from an exponential can be dangerous (imagine looking at a graph of China from 30 days after the 50th case).
> Given that estimated infections is so sensitive to the time-to-death assumption, and we see a wide range of results being published due to differences in this assumption, I think it'd be especially interesting to see an analysis to determine the correct value for that parameter.

I agree that the values are very sensitive to the chosen delay. I chose to compare the delay between the new deaths and the new confirmed cases in order to estimate how well a country is ramping up its testing capacity relative to the growth of new deaths. I should think about incorporating the actual TP/TN values that some countries publish (for example, in Germany, 5% of the tests done turn out positive). I didn't plot the absolute delay values for each country and compare them directly; I chose to plot their difference from the mean of all countries. I'm aware that this is not an established way of analysing this, but I thought it gave a nice visual intuition. A better way would be a proper delay analysis, using a cross-correlation measure for example.
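That "proper delay analysis" could look something like this sketch on synthetic data (my own toy series, not the real ones): pick the lag that maximizes the correlation between the two curves.

```python
import numpy as np

# Toy lag analysis between two epidemic curves: deaths here are a scaled
# copy of cases shifted by 6 days, and we recover that shift by maximizing
# the cross-correlation over candidate lags.
t = np.arange(60)
cases = np.exp(-((t - 30) / 8.0) ** 2)    # synthetic bump-shaped curve
true_lag = 6
deaths = 0.05 * np.roll(cases, true_lag)  # same curve, 6 days later

def best_lag(x, y, max_lag=20):
    """Return the shift of y relative to x with the highest correlation."""
    scores = []
    for lag in range(max_lag + 1):
        scores.append(np.corrcoef(x[: len(x) - lag], y[lag:])[0, 1])
    return int(np.argmax(scores))

print(best_lag(cases, deaths))  # recovers the 6-day shift
```

On real, noisy case/death series the peak would be less clean, but the same scan gives a principled per-country delay estimate.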
> Also, most of these analyses I've seen use a point value for that. I think that approach has a serious shortcoming due to Jensen's inequality.

Not sure if I get this.

> If you have a goal of reducing the number of graphs, I personally think you don't lose much by showing the raw/unnormalized numbers only.

I really want to keep both plots. I've been using them myself, looking at the upper plot for the actual numbers and at the lower one to see whether a value increased or decreased. It might be worth investing some time to be able to switch between different countries using altair, though.
Hi, maybe in this case it doesn't make sense to include these dashboards and merge this PR? I think your visualizations are really cool, and I'm glad you already published them with fastpages on your own blog! It's just that as an editor I have to curate the content on this site a bit, and I don't want to see you waste time, especially if you are happy with what you have! If this feedback is too onerous, which I can totally understand, there is nothing wrong with closing this PR and moving on. I just wanted to offer that as a fellow human being who understands that you might have other plans for your life than editing charts (especially ones you like).
Thanks for the analysis on time-to-death @caglorithm. I find it very compelling, and your response is going to inform the way I think about this problem moving forward. Here is the notebook that uses 7 days. Pinging @jwrichar to think about whether he should change the 7-day assumption in his notebook.

I'm still missing something in your explanation. Don't the relative heights of the bars in the top graph say whether something went up or down?

For my Jensen's inequality comment, here's my very simplified example (I apologize for not doing a better job explaining my view than I'm about to). Say half of people die in 30 days and half die in 10. The mean (and median) time to death are both 20. But if you extrapolate cases 20 days ago off of deaths today, you're going to get the wrong number. We could put some numbers on this. Let's stipulate a true data generating process as follows: it says 10 people died, and case fatality is 5%, so it'd estimate 200 infections 20 days ago. Now let's break it down by each of the subgroup times-to-death: the difference I'd call out here is that using the median time to death reports 200, vs. the 400 which would be "correct" in this toy example.

I don't like that my example does this extrapolation step from T-30 to T-20. But the point is that the mean and median time to death don't work as proxies for the integral over time to death in a non-linear system (an implication of Jensen's inequality).

I don't mean that example as a criticism of your work. I'm seeing a lot of Pueyo-style analysis. Every model has limitations, and I just want to call this out as one of the limitations of this particular approach.
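The mechanism can be checked numerically. This is my own toy version with my own numbers (and without the T-30 to T-20 extrapolation step, so the bias here goes in the opposite direction), but it shows the same Jensen's-inequality effect: under exponential growth, backcasting with the mean time-to-death is biased whenever the true times-to-death are spread out.

```python
import numpy as np

# Assumptions (mine, for illustration): infections grow 20% per day, CFR is
# 5%, and conditional on dying, half of patients die 10 days after infection
# and half 30 days after -- so the mean time-to-death is exactly 20 days.
r = np.log(1.2)
cfr = 0.05

def infections(t):
    return 100.0 * np.exp(r * t)

T = 60
deaths_today = cfr * 0.5 * (infections(T - 10) + infections(T - 30))

naive = deaths_today / cfr       # backcast assuming everyone dies at day 20
actual = infections(T - 20)      # true infections 20 days ago

# The ratio works out to cosh(10 * r) ~= 3.2: the point-estimate backcast
# is off by more than 3x even though 20 days is exactly the mean delay.
print(naive / actual)
```

The bias factor grows with both the growth rate and the spread of the time-to-death distribution, which is exactly why a point value for that parameter is risky.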
Thanks for the ping @dansbecker. As has been discussed in this thread, the average-days-to-mortality value is critical for estimating mortality rate. In reality, this is the average days from the reporting of a test result to eventual mortality. And, as has been well documented, there are all sorts of delays in (1) contracting the virus to first symptoms, (2) first symptoms to symptoms severe enough to get tested, (3) doing the test (assuming it's even available and you qualify!) to the test results. So the ideal ~3 week (?) period gets pushed down significantly. It would be great to be able to estimate that delay, but the reality is that it would simply be a degenerate parameter in the analysis (unless you made very strong assumptions somewhere else).

In the notebook you refer to, my main goal was to try to estimate the true case counts per US state, not to estimate the mortality rate. I anchor the analysis in the assumption that the underlying mortality rate is identical across all states, so that's where the window comes in. My hypothesis here is that (at least to first order) the window parameter does not impact the inferred test-volume-vs-case-underreporting curve. (In fact, we should test that 😄.) Thanks to @birdsarah for pointing out these underlying assumptions.
@jwrichar I think you're misunderstanding my comment. Neither the notebook in this PR nor my interpretation of your notebook is about estimating mortality rates. Instead, I believe you are using deaths to estimate true infections N days prior, where N is the delay between infection and death. I believe your analysis uses a value of 7, whereas the data in the comments on this PR show infection-to-death typically takes 20 days.

Assigning infection rates to the right time period is especially important given the quick growth in infection. At current infection growth rates, using a time-to-death assumption of 20 days rather than 7 days will change your estimated underreporting ratio by nearly an order of magnitude. Incidentally, I suspect that's what's causing the mismatch between your estimated underreporting ratio and what most epidemiologists believe (e.g. as shown in this graph):

https://user-images.githubusercontent.com/1390442/77235952-c9dddb00-6b7f-11ea-9ac2-68786bc901c1.png
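The "nearly an order of magnitude" claim follows from compounding. Under assumed exponential growth (the rate below is my illustrative pick, not a measured one), moving the time-to-death assumption from 7 to 20 days rescales the backcast by the growth over the extra 13 days:

```python
# Illustrative: at 20% daily growth, the same death count implies roughly
# 10x more infections when attributed 20 days back instead of 7 days back.
daily_growth = 1.20                    # assumed infection growth per day
scale = daily_growth ** (20 - 7)
print(f"{scale:.1f}x")                 # about 10.7x
```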
@dansbecker thanks for the follow-up note.
That is true, but only insofar as the model estimates true infections N days prior in order to ascertain the most likely relationship between (state-wise) testing volume and case underreporting, subject to satisfying the underlying model assumption that the true (N-day) mortality rate is identical across all states. I then take that fitted test-volume-vs-case-underreporting relationship (actually the posterior samples from the MCMC) and apply it to today's testing volume. The hypothesis I was trying to lay out in my previous note is that the derivation of the "most likely" relationship is probably not all that sensitive to the choice of N (although the magnitude of the underlying mortality rates would be!).
I'm not so sure about that. I think the fundamental reason for those differences is the lack of testing of asymptomatic cases. Since tests in the US are so heavily biased toward symptomatic individuals, the model is completely ignorant of the entire sub-population of asymptomatic cases. I'm guessing that the domain experts have this info built into their predictions (I think I read that in one small Italian town where they did complete testing, ~50% of the cases were actually asymptomatic!). Perhaps when the testing volumes drastically increase in some states, this effect will start being captured by the model; before that point, I think we'd have to directly encode domain knowledge of the expected % of asymptomatic cases into the model. Thanks again for the feedback and questions! Keep 'em coming.
Johns Hopkins CSSE just deprecated the time-series files you use in your notebooks and added new files with the same format. You will just have to update your URLs to reflect these changes, since the deprecated files have incorrect data starting today. Have a look at the ---DEPRICATED WARNING--- in the link below.

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
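For reference, the rename looked roughly like this (filenames from memory of the CSSE repo; double-check them against the link above before relying on this):

```python
# Deprecated vs. replacement global time-series files in the CSSE repo.
base = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_time_series/")
renames = {
    "time_series_19-covid-Confirmed.csv": "time_series_covid19_confirmed_global.csv",
    "time_series_19-covid-Deaths.csv": "time_series_covid19_deaths_global.csv",
}
for old, new in renames.items():
    print(base + old, "->", base + new)
```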
@caglorithm Let us know if you still plan on working on this PR. I'll go ahead and close this if there is no activity, but you can always feel free to reopen when you are ready, and I would be happy to review anytime. Thank you 🙇
Temporarily closing this PR. Feel free to reopen or comment to continue.
@vladpke Thank you for letting me know; I'm updating the notebooks. Something has changed with the format of the files, as far as I can see.

@hamelsmu Thank you a lot for the review. I'm hosting this on my own GitHub page right now (thanks for …).

@dansbecker Thank you a lot for the insights. I will dig into it a bit more and get back to you if I have more to share.

Have a great day, everyone!
Thanks @caglorithm, appreciate it.
In this notebook I compare the number and change of confirmed cases per day with the number and change of deaths per country.
Output: https://caglorithm.github.io/covid19-analysis/covid-cases-to-deaths/