/
modeltools.Rmd
226 lines (184 loc) · 8.41 KB
/
modeltools.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: Get started with `modeltools`
description: An introductory tutorial with examples.
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Get started with modeltools}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
This package provides tools for building COVID-19 models, with a focus on
forecasting and hotspot prediction models. It is designed to be used in
conjunction with the `covidcast` and `evalcast` packages.
## Installing
This package is not on CRAN yet, so it can be installed using the
[`devtools`](https://cran.r-project.org/package=devtools) package:
```{r, eval = FALSE}
devtools::install_github("cmu-delphi/covidcast", ref = "main",
subdir = "R-packages/modeltools")
```
Building the vignettes, such as this getting started guide, takes a significant
amount of time. They are not included in the package by default. If you want to
include vignettes, then use this modified command:
```{r, eval = FALSE}
devtools::install_github("cmu-delphi/covidcast", ref = "main",
subdir = "R-packages/modeltools",
build_vignettes = TRUE, dependencies = TRUE)
```
For this getting started vignette, we'll fetch COVID-19 case rates, both raw and
smoothed via 7-day trailing averages, from the USAFacts data source. We'll look
at data at the state level for four states, between the start of June and mid
November.
```{r}
library(covidcast)
start_day <- "2020-06-01"
end_day <- "2020-11-15"
geo_values <- c("ca", "fl", "ny", "tx")
signals <- suppressMessages(
covidcast_signals(data_source = "usa-facts",
signal = c("confirmed_incidence_prop",
"confirmed_7dav_incidence_prop"),
start_day = start_day, end_day = end_day,
geo_type = "state", geo_values = geo_values))
summary(signals[[1]])
summary(signals[[2]])
```
## Slide with a formula
One of the most basic tools in the `modeltools` package is `slide_by_geo()`,
which is based on the family of functions provided by the `slider` package. In
`modeltools`, to "slide" means to apply the function or formula over a trailing
window of `n` days of data, grouped by `geo_value`. Many other functions in the
package, such as `pct_change()` and `estimate_deriv()`, use `slide_by_geo()` as
their workhorse.
For example, to apply a 7-day trailing average to the raw values in the first
`covidcast_signal` data frame, we can specify a formula via the `slide_fun`
argument of `slide_by_geo()`:
```{r, message = FALSE, warning = FALSE}
library(modeltools)
library(dplyr)
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7) %>%
select(geo_value, time_value, value, slide_value) %>%
arrange(geo_value) %>%
head(10)
```
The formula specified via `slide_fun` has access to all columns present in the
original `covidcast_signal` data frame, and must refer to them with the prefix
`.x$`. Here the function `Mean()` is a simple wrapper around `mean()` that omits
`NA` values by default (provided by the `modeltools` package).
Notice that `slide_by_geo()` returns a data frame with a new column appended
that contains the results (from sliding the formula), named "slide_value" by
default. We can instead specify a name up front using the `col_name` argument:
```{r}
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7,
col_name = "7dav") %>%
select(geo_value, time_value, value, `7dav`) %>%
arrange(geo_value) %>%
head(10)
```
As a simple sanity check, we compare the 7-day trailing average computed via
`slide_by_geo()` to the values of the smoothed signal from the API:
```{r}
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7,
col_name = "7dav") %>%
full_join(signals[[2]], by = c("geo_value", "time_value")) %>%
mutate(difference = `7dav` - value.y) %>%
filter(time_value >= "2020-06-07") %>%
summarize(max(abs(difference)))
```
Note that this check would *fail* before June 7 because the API access to data
before June 1, whereas `slide_by_geo()` does not (when a trailing window of `n`
days isn't available, `slide_by_geo()` just applies the function or formula to
whatever data is available).
## Slide with a function
We can also pass a function for the `slide_fun` argument in `slide_by_geo()`. In
this case, the passed function must have the following argument structure: `x`,
a data frame the same column names as the original data frame; followed by any
number of named additional arguments; and ending with `...`, to capture general
additional arguments. Recreating the last example of a 7-day trailing average:
```{r}
slide_by_geo(signals[[1]], slide_fun = function(x, ...) Mean(x$value), n = 7,
col_name = "7dav") %>%
select(geo_value, time_value, value, `7dav`) %>%
arrange(geo_value) %>%
head(10)
```
As a more sophisticated example, here we show how to *sensorize* the doctor
visits signal. The sensor values are defined by the predicted values from a
local (in time) regression of past case rates (response) on past doctor visits
(covariate).
```{r}
# Regression for doctor visits sensorization
dv_regression = function(x, m = 3, ...) {
n = nrow(x)
if (n <= m+1) return(NA) # Take care of trivial case
return(tryCatch(suppressWarnings(suppressMessages({
# Fit a regression, leaving out the last m days of data
lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ])
# Form prediction for the last day of data, and return
predict(lm_obj, newdata = x[n, ])
})),
error = function(e) return(NA)))
}
# Fetch doctor visits signal and join it to case rates
joined = suppressMessages(
covidcast_signal(data_source = "doctor-visits",
signal = "smoothed_adj_cli",
start_day = start_day, end_day = end_day,
geo_type = "state", geo_values = geo_values)) %>%
full_join(signals[[2]], by = c("geo_value", "time_value"))
# Perform sensorization for each state; use the last n = 56 days (8 weeks) of
# data, minus the last m = 3 days, for the regression
slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3,
col_name = "sensor") %>%
select(geo_value, time_value, value.x, value.y, sensor) %>%
arrange(geo_value) %>%
head(10)
```
Above, the first 4 elements are `NA` because of insufficient training data (we
need at least 2 training samples for simple linear regression, and we omit the
last 3 days of data).
## Returning complex objects
As a final example, we show how to use `slide_by_geo()` to output more complex
objects. We amend the last example so that `dv_regression()` returns both the
predicted value (sensor) and the fitted linear model object. In order to use
this with `slide_by_geo()`, we need to set the `col_type` argument to be "list":
```{r}
dv_regression = function(x, m = 3, ...) {
n = nrow(x)
if (n <= m+1) return(list(lm_obj = NA, sensor = NA)) # Trivial case
return(tryCatch(suppressWarnings(suppressMessages({
# Fit a regression, leaving out the last days of data
n = nrow(x)
lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ])
# Form prediction for the last day of data
sensor = predict(lm_obj, newdata = x[n, ])
# Return the fitted lm object and prediction
list(lm_obj = lm_obj, sensor = sensor)
})),
error = function(e) return(list(lm_obj = NA, sensor = NA))))
}
joined <- slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3,
col_name = "sensor_obj", col_type = "list")
class(joined$sensor_obj)
names(joined$sensor_obj[[5]]) # The first 4 are lists filled with NAs
joined$sensor_obj[[5]]$lm_obj
joined$sensor_obj[[5]]$sensor
```
This allows for post-inspection of the sensor models, which may be helpful for
diagnostic purposes. Note the way in which we structure the return argument in
`dv_regression()` in its failure cases: it is always a list with the same named
arguments. This helps keep the post-inspection code just a bit more simple (it
is already a hairy enough, to protect against extracting coefficients from `NA`
objects).
```{r}
joined %>%
rowwise() %>%
mutate(sensor = sensor_obj$sensor,
intercept = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA,
coef(sensor_obj$lm_obj)[1]),
slope = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA,
coef(sensor_obj$lm_obj)[2])) %>%
select(geo_value, time_value, sensor, intercept, slope) %>%
arrange(geo_value) %>%
head(20)
```