Use incidence2 accessor functions or subset columns directly? #79

Open
Bisaloo opened this issue Apr 11, 2023 · 2 comments



Bisaloo commented Apr 11, 2023

From #77 (review):

incidence() does allow you to specify the resulting count variable column name as an argument. This would allow you to define relevant columns in variables at the start and then treat everything as if a normal data frame for consistency of access. This is less good when used programmatically but potentially better for more interactive use?

We then have two ways to extract count data, groups or dates from an incidence2 object:

  • use the dedicated accessor functions
  • use standard data.frame subsetting. This is possible because we rename columns to a stable name when we pass the input dataset through incidence().

Benefits of accessor functions

The extra level of abstraction likely makes it more robust to possible future breaking changes. For example, even if a future version of incidence2 chose not to rename the columns but instead to use a purely tag-based system, as in linelist, the accessors would likely deal with the breaking change under the hood and provide a stable interface.

On the other hand, we are already supposed to deal with breaking changes by pinning specific versions of our dependencies, as discussed in #69.

Benefits of direct subsetting

  • Users are more likely to be familiar with the syntax of standard data.frame subsetting
  • It works on all objects, including those that drop their incidence2 class
  • It doesn't tie us so strongly to incidence2
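
To make the trade-off concrete, here is a minimal base R sketch. It does not use incidence2 at all: `make_toy()` and `toy_count()` are hypothetical helpers invented for illustration, and the toy object only mimics the idea of an accessor that looks up the count column for you.

```r
# Toy stand-in for an aggregated incidence table (NOT a real incidence2 object).
# The name of the count column is stored in an attribute, as a hypothetical
# illustration of how an accessor can hide the column name from callers.
make_toy <- function(count_col = "count") {
    x <- data.frame(
        date_index = as.Date("2014-04-07") + 7 * 0:2,
        value = c(1L, 4L, 9L)
    )
    names(x)[2] <- count_col
    attr(x, "count_col") <- count_col
    x
}

# Hypothetical accessor: finds the count column via the attribute.
toy_count <- function(x) x[[attr(x, "count_col")]]

old <- make_toy("count")
new <- make_toy("n")  # simulate a future release that renames the column

# Accessor: the same calling code works for both layouts.
sum(toy_count(old))
sum(toy_count(new))

# Direct subsetting: familiar syntax, but tied to the column name.
sum(old$count)
is.null(new$count)  # the hard-coded name no longer exists after the rename
```

With pinned dependency versions the rename in `new` would never reach us unannounced, which is why the robustness argument carries less weight here than it otherwise might.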

TimTaylor commented Apr 11, 2023

You've convinced me I need to write a "design" vignette (it's been planned for a while but ... ... time). I'll address accessors here and leave a separate comment on general use of {incidence2} below (so you can hide it as a little off-topic for the issue).

On accessors (get_xxx()). These were mainly aimed at those wanting to provide methods for <incidence2> objects. I hadn't really thought of them being used in pipelines, but the {episoap} templates (and templates in general) are an interesting case and I can see why they could be useful. I think you have captured well the pros and cons above.


TimTaylor commented Apr 11, 2023

Following from above, some thoughts on {incidence2} and where it is best used (and not used). I'll use a crude dichotomy of "interactive" to mean any sort of analysis pipeline and "programmatic" to mean in a package.

In interactive settings, the benefit of {incidence2} is most apparent for complex aggregations of linelist with multiple date indices, or pre-aggregated data with multiple count variables, e.g.

library(incidence2)
library(outbreaks)
library(dplyr)

# linelist example
ebola <- ebola_sim_clean$linelist
(grouped_inci <- incidence(
    ebola,
    date_index = c(
        onset = "date_of_onset",
        infection = "date_of_infection"
    ), 
    interval = "isoweek",
    groups = "gender"
))
#> # incidence:  218 x 4
#> # count vars: infection, onset
#> # groups:     gender
#>    date_index gender count_variable count
#>  * <isowk>    <fct>  <chr>          <int>
#>  1 2014-W12   f      infection          1
#>  2 2014-W15   f      onset              1
#>  3 2014-W15   m      infection          1
#>  4 2014-W16   f      infection          1
#>  5 2014-W16   m      onset              1
#>  6 2014-W17   f      infection          4
#>  7 2014-W17   f      onset              4
#>  8 2014-W17   m      onset              1
#>  9 2014-W18   f      infection          7
#> 10 2014-W18   f      onset              4
#> # ℹ 208 more rows

plot(grouped_inci, angle = 45, border_colour = "white")

# pre-aggregated example
covid <- covidregionaldataUK
(monthly_covid <- 
    covid |> 
    filter(!region %in% c("England", "Scotland", "Northern Ireland", "Wales")) |> 
    incidence(
        date_index = "date",
        groups = "region",
        counts = c("cases_new", "deaths_new"),
        interval = "yearmonth"
    ))
#> # incidence:  324 x 4
#> # count vars: cases_new, deaths_new
#> # groups:     region
#>    date_index region          count_variable count
#>  * <yrmon>    <chr>           <fct>          <dbl>
#>  1 2020-Jan   East Midlands   cases_new         NA
#>  2 2020-Jan   East Midlands   deaths_new        NA
#>  3 2020-Jan   East of England cases_new         NA
#>  4 2020-Jan   East of England deaths_new        NA
#>  5 2020-Jan   London          cases_new         NA
#>  6 2020-Jan   London          deaths_new        NA
#>  7 2020-Jan   North East      cases_new         NA
#>  8 2020-Jan   North East      deaths_new        NA
#>  9 2020-Jan   North West      cases_new         NA
#> 10 2020-Jan   North West      deaths_new        NA
#> # ℹ 314 more rows


# exclude deaths from plot due to scale
monthly_covid |> 
    subset(count_variable == "cases_new") |> 
    plot(nrow = 3, angle = 45, border_colour = "white")
#> Warning: Removed 26 rows containing missing values (`position_stack()`).

Where it may be preferable to use {grates} directly is for simpler aggregations of a single date_index, where you are not worried about the additional formatting of output and the default print methods:

# e.g. For some this may be sufficient
ebola |> 
    mutate(isoweek = as_isoweek(date_of_onset)) |> 
    count(isoweek, gender) |> 
    head(n = 10L)
#>     isoweek gender  n
#> 1  2014-W15      f  1
#> 2  2014-W16      m  1
#> 3  2014-W17      f  4
#> 4  2014-W17      m  1
#> 5  2014-W18      f  4
#> 6  2014-W19      f  9
#> 7  2014-W19      m  3
#> 8  2014-W20      f  7
#> 9  2014-W20      m 10
#> 10 2014-W21      f  8

# as opposed to
incidence(
    ebola,
    date_index = c(onset = "date_of_onset"),
    interval = "isoweek",
    groups = "gender"
)
#> # incidence:  109 x 4
#> # count vars: onset
#> # groups:     gender
#>    date_index gender count_variable count
#>  * <isowk>    <fct>  <chr>          <int>
#>  1 2014-W15   f      onset              1
#>  2 2014-W16   m      onset              1
#>  3 2014-W17   f      onset              4
#>  4 2014-W17   m      onset              1
#>  5 2014-W18   f      onset              4
#>  6 2014-W19   f      onset              9
#>  7 2014-W19   m      onset              3
#>  8 2014-W20   f      onset              7
#>  9 2014-W20   m      onset             10
#> 10 2014-W21   f      onset              8
#> # ℹ 99 more rows

For programmatic use the benefits are more apparent: knowledge of the object's invariants and structure makes it simple for developers to enable nice workflows such as

library(i2extras)

out <- 
    ebola |> 
    incidence(date_index = "date_of_onset", interval = "week", groups = "hospital") |> 
    slice_head(n = 120L) |> 
    fit_curve(model = "poisson", alpha = 0.05)

# plot with a prediction interval but not a confidence interval
plot(out, ci = FALSE, pi = TRUE, angle = 45, border_colour = "white")

# estimate growth rate
growth_rate(out)
#> # A tibble: 6 × 10
#>   count_variable hospital      model     r r_lower r_upper growth_or_decay  time
#>   <chr>          <fct>         <lis> <dbl>   <dbl>   <dbl> <chr>           <dbl>
#> 1 date_of_onset  Connaught Ho… <glm> 0.197   0.177   0.217 doubling         3.53
#> 2 date_of_onset  Military Hos… <glm> 0.173   0.147   0.200 doubling         4.00
#> 3 date_of_onset  other         <glm> 0.170   0.141   0.200 doubling         4.09
#> 4 date_of_onset  Princess Chr… <glm> 0.142   0.101   0.188 doubling         4.87
#> 5 date_of_onset  Rokupa Hospi… <glm> 0.178   0.133   0.228 doubling         3.89
#> 6 date_of_onset  <NA>          <glm> 0.184   0.164   0.205 doubling         3.77
#> # ℹ 2 more variables: time_lower <dbl>, time_upper <dbl>

Created on 2023-04-11 with reprex v2.0.2
