---
layout: default
title: 'Analysis of COVID-19 Mortality'
author: Deepayan Sarkar
code: collapse
---
```{r opts, echo = FALSE, results = "hide", warning = FALSE, message = FALSE}
library(knitr) # provides opts_chunk
opts_chunk$set(cache = FALSE, cache.path = '~/knitr-cache/covid19/', autodep = TRUE,
               comment = "", warning = TRUE, message = TRUE,
               knitr.table.format = "html",
               ## dev.args = list(pointsize = 12),
               fig.width = 12, fig.height = 12, dpi = 120, fig.path = 'figures/deaths-')
library(lattice)
library(RColorBrewer)
library(latticeExtra)
bpaired <- RColorBrewer::brewer.pal(n = 12, name = "Paired")
ct <- custom.theme(symbol = bpaired[c(FALSE, TRUE)], fill = bpaired[c(TRUE, FALSE)])
ct$strip.background$col <- "grey90"
ct$strip.border$col <- "grey50"
ct$axis.line$col <- "grey50"
lattice.options(default.theme = ct)
```
```{r}
TARGET.cases <- "time_series_covid19_confirmed_global.csv"
TARGET.deaths <- "time_series_covid19_deaths_global.csv"
SURL <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series"
## To download the latest version, delete the files and run again
for (target in c(TARGET.cases, TARGET.deaths))
{
    if (!file.exists(target))
        download.file(sprintf("%s/%s", SURL, target), destfile = target)
}
covid.cases <- read.csv(TARGET.cases, check.names = FALSE, stringsAsFactors = FALSE)
covid.deaths <- read.csv(TARGET.deaths, check.names = FALSE, stringsAsFactors = FALSE)
if (!identical(dimnames(covid.cases), dimnames(covid.deaths)))
stop("Cases and death data have different structure... check versions.")
keep <- covid.deaths[[length(covid.deaths)]] > 100 # at least 100 deaths
covid.cases <- covid.cases[keep, ]
covid.deaths <- covid.deaths[keep, ]
```
[This note was last updated using data downloaded on
`r {as.Date(file.mtime(TARGET.deaths))}`. Here is the
[source](https://github.com/deepayan/deepayan.github.io/blob/master/covid-19/deaths.rmd)
of this analysis. Click
<a href="#" data-toggle="collapse"
data-target="div.sourceCode" aria-expanded="true">here</a> to show /
hide the R code used. ]
To understand how the COVID-19 pandemic has spread over time in
various countries, the number of cases (or deaths) over time is not
necessarily the best quantity to visualize. Instead, we have used the
[doubling time](doubling), which is the number of days it took for the
total number of cases to reach the current number from when it was
half the current number. The doubling time is a measure of the rate of
growth: if it does not change with time in a country, the number of
cases is growing exponentially. With measures to control spread, the
doubling time should systematically increase with time.
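This connection between a constant doubling time and exponential growth is easy to verify numerically (the growth rate below is hypothetical, not fitted to any country's data):

```{r}
## Under exponential growth at a constant daily rate r, cases(t) = c0 * (1 + r)^t,
## so the doubling time log(2) / log(1 + r) is the same on every day t.
r <- 0.25                           # hypothetical 25% daily growth
doubling.days <- log(2) / log(1 + r)
round(doubling.days, 2)             # about 3.11 days
```

Conversely, if counts are growing sub-exponentially, successive doublings take longer and longer, which is what an increasing doubling-time curve shows.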
Another very important aspect of the COVID-19 pandemic is its
mortality rate. That is, once an individual is infected, how likely is
he/she to die? This of course depends on various attributes of the
individual such as age and other underlying conditions, but even a
crude (marginal) country-specific mortality rate is rather difficult
to estimate.
# The problem with the naive death rate estimate
The main difficulty is in knowing the size of the vulnerable
population. Although it is possible that deaths due to COVID-19 are
not attributed to it, this is not very likely to happen, especially in
countries where the virus has already spread enough to cause a
substantial number of deaths. On the other hand, the reported number
of confirmed cases may only be a fraction of the true number of cases,
either due to lack of testing, or changes in testing protocols, or
various other reasons. Thus, the best we can hope for is to estimate
the mortality rate among _identified_ cases.
There is yet another problem. The cumulative daily number of confirmed
cases and deaths are available (e.g., from [JHU
CSSE](https://github.com/CSSEGISandData/COVID-19/) on Github), and one
could naively estimate the mortality rate by dividing the (current)
total number of deaths by the total number of infections. However,
doing this will always underestimate the mortality rate, because the
denominator contains (many) recently diagnosed patients who have not
yet died, some of whom eventually will. As pointed out
[here](https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca),
the correct way to estimate the death rate is to consider only
_closed_ cases (confirmed cases who have either died or been
cured). But that data is not easily available (the number of
_recovered_ patients by date is available, but that is not enough).
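A toy calculation (entirely hypothetical numbers) makes the size of this bias concrete: while many cases remain unresolved, the naive ratio can sit far below the death rate among closed cases.

```{r}
## Hypothetical snapshot: 1000 resolved cases (100 of them deaths),
## plus 4000 recent cases whose outcome is not yet known.
died   <- 100
closed <- 1000                            # cases with a known outcome
active <- 4000                            # unresolved cases
naive.rate  <- died / (closed + active)   # denominator includes unresolved cases
closed.rate <- died / closed              # rate among closed cases only
c(naive = naive.rate, closed = closed.rate)   # 0.02 vs 0.10
```

Here the naive estimate is a fifth of the closed-case rate, purely because the denominator is inflated by cases whose outcome is still pending.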
The plot below shows how the naive death rates (proportion of deaths
over number of confirmed cases), on any given day, have changed over
time for countries with at least 100 deaths.[^1] Although the estimate
is eventually (asymptotically) supposed to stabilize (as the fraction
of "active" cases decreases), this has clearly not happened yet for
most countries.
[^1]: As [before](doubling), we apply a crude "smoothing" to account
for lags in updating data: If two consecutive days have the same
total count followed by a large increase on the following day,
then the most likely explanation is that data was not updated on
the second day. In such cases, the count of the middle day is
replaced by the geometric mean of its neighbours.
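On toy counts, the geometric-mean correction behaves as follows (a self-contained sketch of the rule, using made-up counts):

```{r}
## Day 2's count equals day 1's and is followed by a jump, so we assume a
## missed update and replace it with the geometric mean of its neighbours.
x <- c(100, 100, 400)
x[2] <- sqrt(x[1] * x[3])           # sqrt(100 * 400) = 200
x                                   # 100 200 400
```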
```{r, warning=FALSE}
correctLag <- function(x)
{
n <- length(x)
stopifnot(n > 2)
for (i in seq(2, n-1))
if (x[i] == x[i-1])
x[i] <- sqrt(x[i-1] * x[i+1])
x
}
extractCasesTS <- function(d)
{
x <- t(data.matrix(d[, -c(1:4)]))
x[x == -1] <- NA
colnames(x) <-
with(d, ifelse(`Province/State` == "", `Country/Region`,
paste(`Country/Region`, `Province/State`,
sep = "/")))
apply(x, 2, correctLag)
}
xcovid.cases <- extractCasesTS(covid.cases)
xcovid.deaths <- extractCasesTS(covid.deaths)
D <- nrow(xcovid.deaths)
total.deaths <- xcovid.deaths[D, , drop = TRUE]
```
```{r}
porder <- rev(order(total.deaths))
death.rate <- 100 * (xcovid.deaths / xcovid.cases)
dr.naive.latest <- death.rate[D, , drop = TRUE]
start.date <- as.Date("2020-01-22")
xat <- pretty(start.date + c(0, D-1))
(dr.naive <-
xyplot(ts(death.rate[, porder], start = start.date),
type = "o", grid = TRUE, layout = c(6, 6),
par.settings = simpleTheme(pch = 16, cex = 0.5),
scales = list(alternating = 3, rot = 45,
x = list(at = xat, labels = format(xat, format = "%d %b")),
y = list(relation = "same")),
xlab = NULL, ylab = "Death rate (per cent)",
as.table = TRUE, between = list(x = 0.5, y = 0.5),
ylim = extendrange(range(tail(death.rate, 10))))
)
```
# The lag-adjusted death rate
A very simple alternative is to assume that the correct denominator
for the proportion of deaths is not the _current_ number of cases, but
rather the number of cases a few days ago (representing the population
of patients that would have died by now if they were to die at all).
But what should that lag be? We know that two weeks is probably long
enough (perhaps too long), but otherwise we do not have enough
information to guess. So, we tried lags of one day, two days, three
days, and so on, and finally settled on a lag of __7 days__ because it
is the _smallest_ lag for which the lag-adjusted death rate stabilized
in Hubei (China); being the region with the longest history, we expect
it to give the most stable estimate. The following plot shows how this
lag-adjusted death rate has changed over time, and compares it with
the naive death rate.
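In symbols, writing $D(t)$ for cumulative deaths and $C(t)$ for cumulative confirmed cases on day $t$, the lag-$L$ adjusted estimate is

$$
\hat{r}_L(t) = \frac{D(t)}{C(t - L)}, \qquad L = 7,
$$

so today's deaths are compared against the pool of patients diagnosed at least a week earlier.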
```{r}
LAG <- 7 # lowest lag for which Hubei estimates flatten out
death.rate <- 100 * tail(xcovid.deaths, -LAG) / head(xcovid.cases, -LAG)
dr.adjusted.latest <- tail(death.rate, 1)
start.date <- as.Date("2020-01-22") + LAG
xat <- pretty(start.date + c(0, nrow(death.rate)-1))
dr.adjusted <-
xyplot(ts(death.rate[, porder], start = start.date),
type = "o", layout = c(6, 6),
par.settings = simpleTheme(pch = 16, cex = 0.5),
col = ct$superpose.symbol$col[2],
as.table = TRUE, between = list(x = 0.5, y = 0.5),
ylim = c(0, 30))
update(dr.naive + dr.adjusted, ylim = c(0, 30),
auto.key = list(lines = TRUE, points = FALSE, columns = 2, type = "o",
text = c("Naive estimate", "One week lag-adjusted estimate")))
```
The adjusted death rates have less systematic trends than the naive
estimate, but clearly there is still a lot of instability.
Why are the rates so different across countries? That is difficult to
answer at this point. It possibly depends on how overwhelmed the
health-care system is, but it is also important to remember that this
is not the death rate among all _infected_ individuals, but rather
only among all _identified_ individuals. In countries where possibly
infected patients are not tested when their symptoms are mild, the
estimated death rate will be artificially high. Countries may also not
be consistent in which deaths they are attributing to COVID-19.
A useful, but not surprising, proxy for how efficiently a country is
testing turns out to be its GDP. The following plot shows, only for
European countries (which should be otherwise more or less
homogeneous), the relationship between per capita GDP as of 2018
(source: [World Bank](https://data.worldbank.org/)) and adjusted
mortality rate.
```{r, fig.height=8}
ppgdp <- read.csv("PPGDP_2018.csv", row.names = 1) # world bank data, indexed by country
data(UN, package = "carData")
latest.dr <- as.data.frame(t(rbind(dr.naive.latest,
dr.adjusted.latest,
tail(xcovid.deaths, 1))))
names(latest.dr) <- c("naive", "adjusted", "tdeaths")
## adjust some names to match
mapNamesUN <- function(s)
{
s[s == "US"] <- "United States"
s[s == "Korea, South"] <- "Republic of Korea"
s[substring(s, 1, 5) == "China"] <- "China"
s
}
mapNamesWB <- function(s)
{
s[s == "US"] <- "United States"
s[s == "Iran"] <- "Iran, Islamic Rep."
s[s == "Korea, South"] <- "Korea, Rep."
s[substring(s, 1, 5) == "China"] <- "China"
s
}
latest.dr <- cbind(latest.dr,
country = rownames(latest.dr),
ppgdp[mapNamesWB(rownames(latest.dr)), , drop = FALSE],
UN[mapNamesUN(rownames(latest.dr)), , drop = FALSE])
with(subset(latest.dr, region == "Europe"),
xyplot(adjusted ~ (PPGDP_2018/1000), pch = 16, cex = 1.5, aspect = 0.75,
xlab = "GDP per capita (x 1000 USD) in 2018",
ylab = "One week lag-adjusted COVID-19 mortality rate (per cent)",
xlim = extendrange(range(PPGDP_2018/1000), f = 0.15),
panel = function(x, y, ...) {
## panel.abline(lm(y ~ x, weights = tdeaths), col.line = "grey90", lwd = 3)
panel.xyplot(x, y, ...)
panel.text(x, y, labels = country, pos = 2, srt = 0)
}))
```
<!-- The grey line is a weighted linear regression line fit with the number of deaths as weights. -->
<!-- Some extra analysis reported in doubling.rmd -->
<!-- Total deaths -->
```{r tdeath, fig.height=9, fig.show="hide",echo=FALSE,results=FALSE}
deaths.1weekago <- xcovid.deaths[D - 7, , drop = TRUE]
dotplot(total.deaths[rev(porder)], total.1 = deaths.1weekago[rev(porder)],
xlim = c(40, NA),
panel = function(x, y, ..., total.1, col) {
col <- trellis.par.get("superpose.line")$col
dot.line <- trellis.par.get("dot.line")
panel.abline(h = unique(y), col = dot.line$col, lwd = dot.line$lwd)
panel.segments(log10(total.1), y, x, y, col = col[2], lwd = 3)
panel.points(x, y, pch = 16, col = col[1])
},
xlab = "Total deaths (NOTE: log scale)",
ylab = "Countries / regions (ordered by deaths per cases)",
scales = list(x = list(alternating = 3, log = TRUE,
equispaced.log = FALSE)))
```
<!--
The doubling time of deaths across these countries. The grey line
represents the corresponding doubling time of the number of cases.
-->
```{r ddeath, warning=FALSE,fig.height=9, fig.show="hide",echo=FALSE,results=FALSE}
tdouble <- function(n, x, min = 50)
{
if (x[n] < min) return (NA_real_)
x <- head(x, n)
x <- c(0, x[x > 0])
i <- seq_along(x)
f <- approxfun(x, i)
diff(f(max(x) * c(0.5, 1)))
}
doubling.ts <- function(region, d, min = 50)
{
t <- seq(as.Date("2020-01-22"), by = 1, length.out = nrow(d))
td <- sapply(1:nrow(d), tdouble,
x = d[, region, drop = TRUE], min = min)
data.frame(region = region, date = t, tdouble = td)
}
regions <- colnames(xcovid.deaths)[porder]
devolution.deaths <-
droplevels(na.omit(do.call(rbind,
lapply(regions, doubling.ts,
d = xcovid.deaths, min = 50))))
devolution.cases <-
droplevels(na.omit(do.call(rbind,
lapply(regions, doubling.ts,
d = xcovid.cases, min = 50))))
p.deaths <-
xyplot(tdouble ~ date | factor(region, levels = regions),
data = devolution.deaths, type = "o", pch = ".", cex = 2, grid = TRUE,
xlab = "Date", ylab = "Doubling time for deaths (days)",
scales = list(alternating = 3, x = list(rot = 45)),
ylim = c(0, 30), as.table = TRUE, between = list(x = 0.5, y = 0.5))
p.cases <-
xyplot(tdouble ~ date | factor(region, levels = regions),
data = devolution.cases, type = "l", pch = 16, col = "grey50",
xlab = "Date", ylab = "Doubling time for cases (days)",
scales = list(alternating = 3, x = list(rot = 45)),
ylim = c(0, 30), as.table = TRUE, between = list(x = 0.5, y = 0.5))
update(p.deaths + p.cases, layout = c(5, 5))
```
# How fast are deaths increasing?
The number of deaths is a better measure of how serious the COVID-19
epidemic is in a country, compared to the number of detected cases,
because deaths are less likely to be missed. The following plot,
inspired by the very nice visualizations in the [Financial
Times](https://www.ft.com/coronavirus-latest), shows the growth in the
number of deaths, starting from the day the count exceeded 10, for
countries with at least 2000 deaths.
```{r, fig.height=8}
deathsSince10 <- function(region)
{
x <- xcovid.deaths[, region, drop = TRUE]
x <- x[x > 10]
data.frame(region = region, day = seq_along(x),
deaths = x, total = tail(x, 1))
}
deaths.10 <- do.call(rbind, lapply(colnames(xcovid.deaths), deathsSince10))
panel.glabel <- function(x, y, group.value, col.symbol, ...) # x,y vectors; group.value scalar
{
n <- length(x)
panel.text(x[n], y[n], label = group.value, pos = 4, col = col.symbol, srt = 40)
}
xyplot(deaths ~ day, data = subset(deaths.10, total >= 2000), grid = TRUE,
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
xlab = "Days since number of deaths exceeded 10",
groups = region, type = "o") +
glayer_(panel.glabel(x, y, group.value = group.value, ...))
```
Compare these with other countries:
```{r}
bg <- xyplot(deaths ~ day, data = subset(deaths.10, total >= 2000), grid = TRUE,
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
col = "grey", groups = region, type = "l")
fg <-
xyplot(deaths ~ day | reorder(region, -total), data = subset(deaths.10, total < 2000),
xlab = "Days since number of deaths exceeded 10", pch = ".", cex = 3,
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
as.table = TRUE, between = list(x = 0.5, y = 0.5), layout = c(5, 5),
type = "o", ylim = c(NA, 12500))
fg + as.layer(bg, under = TRUE)
```
# Which countries have reached their peak?
Another important landmark for any country is when it has passed its
peak, that is, the daily number of new deaths (or cases) has started
going down. The following plots show this trend, after applying a
little bit of smoothing on the cumulative death counts.
```{r, fig.height=8,warning=FALSE}
newDeathsSince10 <- function(region)
{
x <- xcovid.deaths[, region, drop = TRUE]
x <- x[x > 10]
days <- seq_along(x)
xsmooth <- fitted(loess(x ~ days, span = 0.35))
## xsmooth <- sinh(predict(smooth.spline(days, asinh(x)), days)$y)
data.frame(region = region, day = days[-1],
deaths = diff(xsmooth), total = tail(x, 1))
}
new.deaths.10 <- do.call(rbind, lapply(colnames(xcovid.deaths), newDeathsSince10))
panel.glabel <- function(x, y, group.value, col.symbol, ...) # x,y vectors; group.value scalar
{
n <- length(x)
panel.text(x[n], y[n], label = group.value, pos = 4, col = col.symbol, srt = 40)
}
xyplot(deaths ~ day, data = subset(new.deaths.10, total >= 2000), grid = TRUE,
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
xlab = "Days since number of deaths exceeded 10", ylim = c(1, NA),
groups = region, type = "o") +
glayer_(panel.glabel(x, y, group.value = group.value, ...))
```
There's a lot of overlap, so let's look at these countries separately:
```{r,warning=FALSE,fig.height=8}
bg <- xyplot(deaths ~ day, data = subset(new.deaths.10, total >= 2000), grid = TRUE,
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
col = "grey", groups = region, type = "l", ylim = c(1, NA))
fg1 <-
xyplot(deaths ~ day | reorder(region, -total), data = subset(new.deaths.10, total >= 2000),
xlab = "Days since number of deaths exceeded 10", pch = ".", cex = 3, ylim = c(1, NA),
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
as.table = TRUE, between = list(x = 0.5, y = 0.5),
type = "o")
fg1 + as.layer(bg, under = TRUE)
```
As of May 12, most of these countries appear to have passed their
peaks, except Mexico and Brazil.
Compare these with other countries (but beware of potential artifacts
of the smoothing):
```{r,warning=FALSE}
fg2 <-
xyplot(deaths ~ day | reorder(region, -total),
data = subset(new.deaths.10, total < 2000),
xlab = "Days since number of deaths exceeded 10",
pch = ".", cex = 3, ylim = c(1, NA),
scales = list(alternating = 3, y = list(log = 10, equispaced.log = FALSE)),
as.table = TRUE, between = list(x = 0.5, y = 0.5), layout = c(5, 5),
type = "o")
fg2 + as.layer(bg, under = TRUE)
```