R-packages/modeltools/vignettes/modeltools.Rmd

---
title: Get started with `modeltools`
description: An introductory tutorial with examples.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Get started with modeltools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This package provides tools for building COVID-19 models, with a focus on 
forecasting and hotspot prediction models. It is designed to be used in 
conjunction with the `covidcast` and `evalcast` packages. 

## Installing

This package is not on CRAN yet, so it can be installed using the
[`devtools`](https://cran.r-project.org/package=devtools) package:

```{r, eval = FALSE}
devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/modeltools")
```

Building the vignettes, such as this getting started guide, takes a significant
amount of time. They are not included in the package by default. If you want to
include vignettes, then use this modified command:

```{r, eval = FALSE}
devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/modeltools",
                         build_vignettes = TRUE, dependencies = TRUE)
```

For this getting started vignette, we'll fetch COVID-19 case rates, both raw and
smoothed via 7-day trailing averages, from the USAFacts data source. We'll look
at data at the state level for four states, between the start of June and mid 
November.

```{r}
library(covidcast)

start_day <- "2020-06-01"
end_day <- "2020-11-15"
geo_values <- c("ca", "fl", "ny", "tx")

signals <- suppressMessages(
  covidcast_signals(data_source = "usa-facts", 
                    signal = c("confirmed_incidence_prop",
                               "confirmed_7dav_incidence_prop"),
                    start_day = start_day, end_day = end_day, 
                    geo_type = "state", geo_values = geo_values))

summary(signals[[1]])
summary(signals[[2]])
```

## Slide with a formula

One of the most basic tools in the `modeltools` package is `slide_by_geo()`, 
which is based on the family of functions provided by the `slider` package. In
`modeltools`, to "slide" means to apply the function or formula over a trailing
window of `n` days of data, grouped by `geo_value`. Many other functions in the
package, such as `pct_change()` and `estimate_deriv()`, use `slide_by_geo()` as
their workhorse. 

For example, to apply a 7-day trailing average to the raw values in the first
`covidcast_signal` data frame, we can specify a formula via the `slide_fun` 
argument of `slide_by_geo()`:

```{r, message = FALSE, warning = FALSE}
library(modeltools)
library(dplyr)

slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7) %>%
  select(geo_value, time_value, value, slide_value) %>%
  arrange(geo_value) %>% 
  head(10)
```

The formula specified via `slide_fun` has access to all columns present in the
original `covidcast_signal` data frame, and must refer to them with the prefix
`.x$`. Here the function `Mean()` is a simple wrapper around `mean()` that omits
`NA` values by default (provided by the `modeltools` package). 

Notice that `slide_by_geo()` returns a data frame with a new column appended 
that contains the results (from sliding the formula), named "slide_value" by 
default. We can instead specify a name up front using the `col_name` argument: 

```{r}
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7, 
             col_name = "7dav") %>%
  select(geo_value, time_value, value, `7dav`) %>%
  arrange(geo_value) %>% 
  head(10)
```

As a simple sanity check, we compare the 7-day trailing average computed via
`slide_by_geo()` to the values of the smoothed signal from the API: 

```{r}
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7, 
             col_name = "7dav") %>%
  full_join(signals[[2]], by = c("geo_value", "time_value")) %>%
  mutate(difference = `7dav` - value.y) %>%
  filter(time_value >= "2020-06-07") %>%
  summarize(max(abs(difference)))
```

Note that this check would *fail* before June 7 because the API access to data 
before June 1, whereas `slide_by_geo()` does not (when a trailing window of `n` 
days isn't available, `slide_by_geo()` just applies the function or formula to 
whatever data is available).

## Slide with a function 

We can also pass a function for the `slide_fun` argument in `slide_by_geo()`. In 
this case, the passed function must have the following argument structure: `x`, 
a data frame the same column names as the original data frame; followed by any
number of named additional arguments; and ending with `...`, to capture general 
additional arguments. Recreating the last example of a 7-day trailing average: 

```{r}
slide_by_geo(signals[[1]], slide_fun = function(x, ...) Mean(x$value), n = 7,
             col_name = "7dav") %>%
  select(geo_value, time_value, value, `7dav`) %>%
  arrange(geo_value) %>% 
  head(10)
```

As a more sophisticated example, here we show how to *sensorize* the doctor 
visits signal. The sensor values are defined by the predicted values from a 
local (in time) regression of past case rates (response) on past doctor visits 
(covariate).

```{r}
# Regression for doctor visits sensorization
dv_regression = function(x, m = 3, ...) {
  n = nrow(x)
  if (n <= m+1) return(NA) # Take care of trivial case
  
  return(tryCatch(suppressWarnings(suppressMessages({
    # Fit a regression, leaving out the last m days of data
    lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ])
    
    # Form prediction for the last day of data, and return
    predict(lm_obj, newdata = x[n, ])
  })),
  error = function(e) return(NA)))
}

# Fetch doctor visits signal and join it to case rates
joined = suppressMessages(
  covidcast_signal(data_source = "doctor-visits", 
                   signal = "smoothed_adj_cli", 
                   start_day = start_day, end_day = end_day, 
                   geo_type = "state", geo_values = geo_values)) %>%
  full_join(signals[[2]], by = c("geo_value", "time_value"))
 
# Perform sensorization for each state; use the last n = 56 days (8 weeks) of 
# data, minus the last m = 3 days, for the regression
slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3, 
             col_name = "sensor") %>%
  select(geo_value, time_value, value.x, value.y, sensor) %>%
  arrange(geo_value) %>% 
  head(10)
```

Above, the first 4 elements are `NA` because of insufficient training data (we 
need at least 2 training samples for simple linear regression, and we omit the 
last 3 days of data).

## Returning complex objects

As a final example, we show how to use `slide_by_geo()` to output more complex
objects. We amend the last example so that `dv_regression()` returns both the
predicted value (sensor) and the fitted linear model object. In order to use 
this with `slide_by_geo()`, we need to set the `col_type` argument to be "list":

```{r}
dv_regression = function(x, m = 3, ...) {
  n = nrow(x)
  if (n <= m+1) return(list(lm_obj = NA, sensor = NA)) # Trivial case
  
  return(tryCatch(suppressWarnings(suppressMessages({
    # Fit a regression, leaving out the last days of data
    n = nrow(x)
    lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ])
    
    # Form prediction for the last day of data
    sensor = predict(lm_obj, newdata = x[n, ])
    
    # Return the fitted lm object and prediction
    list(lm_obj = lm_obj, sensor = sensor)
  })),
  error = function(e) return(list(lm_obj = NA, sensor = NA))))
}

joined <- slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3, 
                       col_name = "sensor_obj", col_type = "list")

class(joined$sensor_obj)
names(joined$sensor_obj[[5]]) # The first 4 are lists filled with NAs
joined$sensor_obj[[5]]$lm_obj
joined$sensor_obj[[5]]$sensor
```

This allows for post-inspection of the sensor models, which may be helpful for
diagnostic purposes. Note the way in which we structure the return argument in 
`dv_regression()` in its failure cases: it is always a list with the same named
arguments. This helps keep the post-inspection code just a bit more simple (it
is already a hairy enough, to protect against extracting coefficients from `NA` 
objects).

```{r}
joined %>% 
  rowwise() %>%
  mutate(sensor = sensor_obj$sensor, 
         intercept = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA,
                            coef(sensor_obj$lm_obj)[1]),
         slope = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA, 
                        coef(sensor_obj$lm_obj)[2])) %>%
  select(geo_value, time_value, sensor, intercept, slope) %>%
  arrange(geo_value) %>% 
  head(20)
```