-
Notifications
You must be signed in to change notification settings - Fork 28
/
quantgen.R
287 lines (267 loc) · 14.3 KB
/
quantgen.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
#' Simple quantile autoregressive forecaster based on `quantgen`
#'
#' A simple quantile autoregressive forecaster based on `quantgen`, to be used
#' with `evalcast`, via [evalcast::get_predictions()]. See the [quantgen
#' forecast
#' vignette](https://cmu-delphi.github.io/covidcast/modeltoolsR/articles/quantgen-forecast.html)
#' for examples.
#'
#' @param df_list List of data frames of signal values to use for forecasting, of the format
#' that is returned by [covidcast::covidcast_signals()].
#' @param forecast_date Date object or string of the form "YYYY-MM-DD",
#' indicating the date on which forecasts will be made. For example, if
#' `forecast_date = "2020-05-11"`, `incidence_period = "day"`, and `ahead =
#' 3`, then, forecasts would be made for "2020-05-14".
#' @param signals Tibble with columns `data_source` and `signal` that specifies
#' which variables are being fetched from the COVIDcast API, and populated in
#' `df`. Each row of `signals` represents a separate signal, and first row is
#' taken to be the response. An optional column `start_day` can also be
#' included. This can be a Date object or string in the form "YYYY-MM-DD",
#' indicating the earliest date of data needed from that data source.
#' Importantly, `start_day` can also be a function (represented as a list
#' column) that takes a forecast date and returns a start date for model
#' training (again, Date object or string in the form "YYYY-MM-DD"). The
#' latter is useful when the start date should be computed dynamically from
#' the forecast date (e.g., when the forecaster only trains on the most recent
#' 4 weeks of data).
#' @param incidence_period One of "day or "epiweek", indicating the period over
#' which forecasts are being made. Default is "day".
#' @param ahead Vector of ahead values, indicating how many days/epiweeks ahead
#' to forecast. If `incidence_period = "day"`, then `ahead = 1` means the day
#' after forecast date. If `incidence_period = "epiweek"` and the forecast
#' date falls on a Sunday or Monday, then `ahead = 1` means the epiweek that
#' includes the forecast date; if `forecast_date` falls on a Tuesday through
#' Saturday, then it means the following epiweek.
#' @param n Size of the local training window (in days/weeks, depending on
#' `incidence_period`) to use. For example, if `n = 14`, and `incidence_period
#' = "day"`, then to make a 1-day-ahead forecast on December 15, we train on
#' data from December 1 to December 14.
#' @param lags Vector of lag values to use as features in the autoregressive
#' model. For example, when `incidence_period = "day"`, setting `lags = c(0,
#' 7, 14)`means we use the current value of each signal (defined by a row of
#' the `signals` tibble), as well as the values 7 and 14 days ago, as the
#' features. Recall that the response is defined by the first row of the
#' `signals` tibble. Note that `lags` can also be a list of vectors of lag
#' values, this list having the same length as the number of rows of
#' `signals`, in order to apply a different set of shifts to each signal.
#' Default is 0, which means no additional lags (only current values) for each
#' signal.
#' @param tau Vector of quantile levels for the probabilistic forecast. If not
#' specified, defaults to the levels required by the COVID Forecast Hub.
#' @param transform,inv_trans Transformation and inverse transformations to use
#' for the response/features. The former `transform` can be a function or a
#' list of functions, this list having the same length as the number of rows
#' in the `signals` tibble, in order to apply the same transformation or a
#' different transformation to each signal. These transformations will be
#' applied before fitting the quantile model. The latter argument `inv_trans`
#' specifies the inverse transformation to use on the response variable
#' (inverse of `transform` if this is a function, or of `transform[[1]]` if
#' `transform` is a list), which will be applied post prediction from the
#' quantile model. Several convenience functions for transformations exist as
#' part of the `quantgen` package. Default is `NULL` for both `transform` and
#' `inv_trans`, which means no transformations are applied.
#' @param featurize Function to construct custom features before the quantile
#' model is fit. As input, this function must take a data frame with columns
#' `geo_value`, `time_value`, then the transformed, lagged signal values. This
#' function must return a data frame with columns `geo_value`, `time_value`,
#' then any custom features. The rows of the returned data frame *must not* be
#' reordered.
#' @param noncross Should noncrossing constraints be applied? These force the
#' predicted quantiles to be properly ordered across all quantile levels being
#' considered. The default is `FALSE`. If `TRUE`, then noncrossing constraints
#' are applied to the estimated quantiles at all points specified by the next
#' argument.
#' @param noncross_points One of "all", "test", "train" indicating which points
#' to use for the noncrossing constraints: the default "all" means to use both
#' training and testing sets combined, while "test" or "train" means to use
#' just one set, training or testing, respectively.
#' @param cv_type One of "forward" or "random", indicating the type of
#' cross-validation to perform. If "random", then `nfolds` folds are chosen by
#' dividing training data points randomly (the default being `nfolds = 5`). If
#' "forward", the default, then we instead use a "forward-validation" approach
#' that better reflects the way predictions are made in the current time
#' series forecasting context. Roughly, this works as follows: the data points
#' from the first `n - nfolds` time values are used for model training, and
#' then predictions are made at the earliest possible forecast date after this
#' training period. We march forward one time point at a time and repeat. In
#' either case ("random" or "forward"), the loss function used for computing
#' validation error is quantile regression loss (read the documentation for
#' `quantgen::cv_quantile_lasso()` for more details); and the final quantile
#' model is refit on the full training set using the validation-optimal tuning
#' parameter.
#' @param verbose Should progress be printed out to the console? Default is
#' `FALSE`.
#' @param ... Additional arguments. Any parameter accepted by
#' `quantgen::cv_quantile_lasso()` (for model training) or by
#' `quantgen:::predict.cv_quantile_genlasso()` (for model prediction) can be
#' passed here. For example, `nfolds`, for specifying the number of folds used
#' in cross-validation, or `lambda`, for specifying the tuning parameter
#' values over which to perform cross-validation (the default allows
#' `quantgen::cv_quantile_lasso()` to set the lambda sequence itself). Note
#' that fixing a single tuning parameter value (such as `lambda = 0`)
#' effectively disables cross-validation and fits a quantile model at the
#' given tuning parameter value (here unregularized quantile autoregression).
#'
#' @return Data frame with columns `ahead`, `geo_value`, `quantile`, and
#' `value`. The `quantile` column gives the probabilities associated with
#' quantile forecasts for that location and ahead.
#'
#' @importFrom dplyr filter select pull summarize between bind_cols
#' @importFrom tidyr pivot_longer
#' @export
quantgen_forecaster = function(df_list, forecast_date, signals, incidence_period,
ahead, geo_type,
n = 4 * ifelse(incidence_period == "day", 7, 1),
lags = 0, tau = modeltools::covidhub_probs,
transform = NULL, inv_trans = NULL,
featurize = NULL, noncross = FALSE,
noncross_points = c("all", "test", "train"),
cv_type = c("forward", "random"),
verbose = FALSE, ...) {
# Check lags vector or list
if (any(unlist(lags) < 0)) stop("All lags must be nonnegative.")
if (!is.list(lags)) lags = rep(list(lags), nrow(signals))
else if (length(lags) != nrow(signals)) {
stop(paste("If `lags` is a list, it must have length equal to the number",
"of signals."))
}
# Check transform function or list, and apply them if we need to
if (!is.null(transform)) {
if (!is.list(transform)) {
transform = rep(list(transform), nrow(signals))
}
else if (length(transform) != nrow(signals)) {
stop(paste("If `transform` is a list, it must have length equal to the",
"number of signals."))
}
if (is.null(inv_trans)) {
stop("If `transform` is specified, then `inv_trans` must be as well.")
}
for (i in 1:length(df_list)) {
df_list[[i]] = df_list[[i]] %>% mutate(value = transform[[i]](value))
}
}
# Define dt by flipping the sign of lags, include dt = +ahead as a response
# shift, for each ahead value, for convenience later
dt = lapply(lags, "-")
dt[[1]] = c(dt[[1]], ahead)
# Append shifts, and aggregate into wide format
df_wide = covidcast::aggregate_signals(df_list, dt = dt, format = "wide")
# Separate out into feature data frame, featurize if we need to
df_features = df_wide %>%
select(geo_value, time_value, tidyselect::matches("^value(\\+0|-)"))
feature_end_date = df_features %>%
summarize(max(time_value)) %>% pull()
if (!is.null(featurize)) df_features = featurize(df_features)
# Identify params for quantgen training and prediction functions
params = list(...)
params$tau = tau
params$noncross = noncross
# perform CV if (i) lambda not provided, or (ii) more than one lambda
# value provided
cv = is.null(params$lambda) || length(params$lambda) > 1
if (cv) {
train_fun = quantgen::cv_quantile_lasso
predict_fun = quantgen:::predict.cv_quantile_genlasso
}
else {
train_fun = quantgen::quantile_lasso
predict_fun = quantgen:::predict.quantile_genlasso
}
train_names = names(as.list(args(train_fun)))
predict_names = names(as.list(args(predict_fun)))
train_params = params[names(params) %in% train_names]
predict_params = params[names(params) %in% predict_names]
# Check noncross_points and cv_type
if (noncross) noncross_points = match.arg(noncross_points)
if (cv) cv_type = match.arg(cv_type)
# Test objects that remain invariant over ahead values
test_geo_value = df_features %>%
filter(time_value == max(time_value)) %>%
select(geo_value) %>% pull()
newx = df_features %>%
filter(time_value == max(time_value)) %>%
select(-c(geo_value, time_value)) %>% as.matrix()
# Loop over ahead values, fit model, make predictions
result = vector(mode = "list", length = length(ahead))
for (i in 1:length(ahead)) {
a = ahead[i]
if (verbose) cat(sprintf("%s%i", ifelse(i == 1, "\nahead = ", ", "), a))
# Training end date
response_end_date = df_wide %>%
select(time_value, tidyselect::starts_with(sprintf("value+%i:", a))) %>%
tidyr::drop_na() %>%
summarize(max(time_value)) %>% pull()
train_end_date = min(feature_end_date, response_end_date)
# Training x and y
x = df_features %>%
filter(between(time_value,
train_end_date - n + 1,
train_end_date)) %>%
select(-c(geo_value, time_value)) %>% as.matrix()
y = df_wide %>%
filter(between(time_value,
train_end_date - n + 1,
train_end_date)) %>%
select(tidyselect::starts_with(sprintf("value+%i:", a))) %>% pull()
# Define noncrossing constraints, if we need to
if (noncross) {
if (noncross_points == "all") x0 = rbind(x, newx)
else if (noncross_points == "test") x0 = newx
else if (noncross_points == "train") x0 = x
x0 = na.omit(x0) # Just in case, since quantgen won't check this
train_params$x0 = x0
}
# Define forward-validation folds, if we need to
if (cv && cv_type == "forward") {
# Training time values
train_time_value = df_wide %>%
filter(between(time_value,
train_end_date - n + 1,
train_end_date)) %>%
select(time_value) %>% pull()
# Training and test folds
nfolds = ifelse(!is.null(params$nfolds), params$nfolds, 5)
train_test_inds = list(train = vector(mode = "list", length = nfolds),
test = vector(mode = "list", length = nfolds))
for (k in 1:nfolds) {
train_test_inds$train[[k]] = which(
between(train_time_value,
train_end_date - n + k,
train_end_date - nfolds + k - 1))
train_test_inds$test[[k]] = which(
train_time_value == train_end_date - nfolds + k)
}
train_params$train_test_inds = train_test_inds
}
# Add training x and y to training params list, fit model
train_params$x = x
train_params$y = y;
train_obj = do.call(train_fun, train_params)
# Add training object and newx to test params list, make predictions
predict_params$object = train_obj
predict_params$object$inv_trans = inv_trans # Let quantgen handle this
predict_params$newx = newx
predict_mat = do.call(predict_fun, predict_params)
# Do some wrangling to get it into evalcast "long" format
colnames(predict_mat) = tau
predict_df = bind_cols(geo_value = test_geo_value,
predict_mat) %>%
pivot_longer(cols = -geo_value,
names_to = "quantile",
values_to = "value") %>%
mutate(ahead = a)
# TODO: allow train_obj to be appended to forecaster's output. This would
# be nice for post-hoc analysis of the fitted models, etc. Two options for
# doing so: 1. append as a column to returned df; but this would be a big
# waste of memory since it'd get repeated for each forecast made from the
# same fitted model, 2. include it as an attribute of the returned df, but
# unless we're super careful this would get squashed by downstream uses of
# rbind(), map(), etc. We could be careful here, but then we'd also have
# to keep track of what evalcast does
result[[i]] = predict_df
}
if (verbose) cat("\n")
# Collapse predictions into one big data frame, and return
return(do.call(rbind, result))
}