
Maximum delay treated as observed and not nowcast #116

Closed · Tracked by #92
seabbs opened this issue Jul 14, 2022 · 5 comments

Labels: bug (Something isn't working), help wanted (Extra attention is needed), high-priority, todo 🗒️

@seabbs (Collaborator) commented Jul 14, 2022

In develop (introduced in #113), we now model observations at the maximum delay by adding all of the probability that a report occurs after the maximum delay to the maximum-delay bin. This works well for posterior predictions. Unfortunately, when we nowcast we treat all observed data as known and only nowcast unobserved dates. This means that reports beyond the maximum delay are included neither by the model nor in the data (as they have not yet been observed in real-world data).

See here:

https://github.com/epiforecasts/epinowcast/blob/21b9311ca13d0939d9a9905f36e6d5284411ad48/inst/stan/epinowcast.stan#L218
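As a rough illustration of the bucketing behaviour described above (a standalone sketch using a hypothetical lognormal delay distribution, not the package or Stan code), all probability mass beyond the maximum delay is folded into the final delay bin:

# Illustrative sketch only: bucket reporting probability beyond the maximum delay
# into the final bin, as the current model does for posterior predictions.
max_delay <- 10
delays <- 0:30
pmf <- dlnorm(delays + 1, meanlog = 1, sdlog = 0.5) # hypothetical delay distribution
pmf <- pmf / sum(pmf)

bucketed <- pmf[seq_len(max_delay)]
bucketed[max_delay] <- bucketed[max_delay] + sum(pmf[(max_delay + 1):length(pmf)])
sum(bucketed) # 1: reports beyond the maximum delay are lumped into the last bin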

The likely solution to this is to add 1 to the maximum delay, assign the observations beyond that point to this additional date, and update the nowcast model code so that this additional delay is also nowcast.

Another option is to alter the nowcast part of the generated quantities so that it uses a posterior prediction rather than the observation for the maximum delay.

This highlights another issue that we have potentially introduced. The count at the maximum delay now keeps changing after the maximum delay has passed, so it is larger for reference dates further from the latest report date (see the posterior predictions below). This likely introduces bias into the likelihood. One potential solution is to exclude the date that holds the bucketed additional reports (either the current max delay or the max delay plus one). Another option is to only include it after some additional time has passed (in effect having a second maximum delay).
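To make this concrete, here is a toy example (hypothetical numbers, not package code) showing that, with bucketing, the count assigned to the maximum delay for a single reference date grows the longer we wait before taking a data snapshot, so this "observation" is not stable over time:

# Toy example: the bucketed count at the maximum delay depends on when we look
set.seed(123)
max_delay <- 10
individual_delays <- rpois(1000, 6) # hypothetical delays for 1000 cases on one reference date

# Count assigned to the maximum delay when snapshotting s days after the reference date
count_at_max <- sapply(max_delay:20, function(s) {
  sum(individual_delays >= max_delay & individual_delays <= s)
})
names(count_at_max) <- paste0("day_", max_delay:20)
count_at_max # grows with the snapshot day, biasing the likelihood for that date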

Related to #114 (@adrian-lison).

The following code (using the feature-redesign-interface branch) reproduces this issue.

# Load packages
library(epinowcast)
library(data.table)
library(ggplot2)

# Use 2 cores
options(mc.cores = 2)

# Load and filter Germany hospitalisations
nat_germany_hosp <- germany_covid19_hosp[location == "DE"][age_group %in% "00+"]
nat_germany_hosp <- enw_filter_report_dates(
  nat_germany_hosp, latest_date = "2021-10-01"
)

# Make sure observations are complete
nat_germany_hosp <- enw_complete_dates(
  nat_germany_hosp, by = c("location", "age_group")
)
# Make a retrospective dataset
retro_nat_germany <- enw_filter_report_dates(
  nat_germany_hosp, remove_days = 40
)
retro_nat_germany <- enw_filter_reference_dates(
  retro_nat_germany, include_days = 40
)

# Get latest observations for the same time period
latest_obs <- enw_latest_data(nat_germany_hosp)
latest_obs <- enw_filter_reference_dates(
  latest_obs, remove_days = 40, include_days = 20
)

# Preprocess observations (note this maximum delay is likely too short)
pobs <- enw_preprocess_data(retro_nat_germany, max_delay = 10)

# Fit the default nowcast model and produce a nowcast
# Note that we have reduced samples for this example to reduce runtimes
nowcast <- epinowcast(pobs,
  fit = enw_fit_opts(
    save_warmup = FALSE, pp = TRUE,
    chains = 2, iter_warmup = 500, iter_sampling = 500
  )
)

nowcast

# Plot the nowcast vs the latest available observations
plot(nowcast, latest_obs = latest_obs)

# Plot posterior predictions
plot(nowcast, type = "posterior") +
  facet_wrap(vars(reference_date), scales = "free")

[Figure: nowcast plot against latest observations]
[Figure: posterior prediction plots by reference date]

seabbs added the bug, help wanted, todo 🗒️, and high-priority labels on Jul 14, 2022
seabbs mentioned this issue on Jul 14, 2022 (21 tasks)
seabbs changed the title from "Bug: Maximum delay treated as observed and not nowcast" to "Maximum delay treated as observed and not nowcast" on Jul 14, 2022
@adrian-lison (Collaborator) commented:

Hmm, interesting and important point! But I do not fully get it yet. Leaving your second point aside for a moment, I would only have expected the "nowcast" for reference dates dmax days in the past or longer to be biased. For reference dates closer to the present, the cases at the maximum delay should be nowcasted, and if we get the overall expected cases right, the nowcast for the maximum delay should cover all cases at the maximum delay and beyond, or not?

And yes, the second issue is also something to discuss further. Theoretically, the max delay should be chosen such that no or only very few cases are reported beyond it. But I agree that if it is chosen too short, it could bias the estimates for the expected cases over time. If we want to avoid this, I would agree that we have to switch to the "drop/ignore" strategy, meaning that we then risk biasing our case estimates downwards, but with no difference in the bias over time. I don't think I am a big fan of the 2nd max delay approach, as I still think that the maximum delay should be chosen large enough in the first place, and if we have efficiency issues we should try to solve them in a different way.

@seabbs (Collaborator, Author) commented Jul 15, 2022

For reference dates closer to the present, the cases at the maximum delay should be nowcasted, and if we get the overall expected cases right, the nowcast for the maximum delay should cover all cases at the maximum delay and beyond, or not?

Yes, exactly. This only impacts the final day (i.e. at horizon -max_delay), as the count for that day is treated as an observation (for dmax) rather than being nowcast (you can see this in the first plot above, where the last day is noticeably different to all later days). That has been fine up to now as that count did not change, but now that we lump new reports beyond the maximum delay into this data point, it changes as the max delay shortens/lengthens.

, as I still think that the maximum delay should be chosen large enough in the first place

We don't want setting a short delay to introduce bias, which it currently does; we want it to make the uncertainty larger instead. What upside do you think there is from including this in the likelihood at the moment, if it is biased based on when we observe our data?

@seabbs (Collaborator, Author) commented Jul 16, 2022

So I am thinking we roll back to the old definition of excluding counts beyond the maximum delay and normalising the probability distribution (as we want all expected cases to eventually be reported). If we don't do this, I think we are open to bias (if things remain the same) or will have identifiability issues if the probability of report doesn't sum to 1 for observed data. Very happy to discuss this more though, as it is really key to get right. We can discuss this in detail at the next community call?
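For concreteness, a minimal sketch of the exclude-and-normalise strategy (again with a hypothetical delay distribution, not the package code): reports beyond the maximum delay are dropped and the retained delay probabilities are rescaled to sum to 1 within the window.

# Illustrative sketch of the "exclude and normalise" strategy
max_delay <- 10
delays <- 0:30
pmf <- dlnorm(delays + 1, meanlog = 1, sdlog = 0.5) # hypothetical delay distribution
pmf <- pmf / sum(pmf)

truncated <- pmf[seq_len(max_delay)]     # drop probability mass beyond the maximum delay
normalised <- truncated / sum(truncated) # rescale so reporting within the window sums to 1
sum(normalised) # 1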

@seabbs (Collaborator, Author) commented Jul 16, 2022

In #121 I have hotfixed this by rolling back to excluding counts beyond the maximum delay and updating the documentation to this effect. The more I think about this, the more convinced I have become that this is actually the correct strategy in general, but I definitely think we need to discuss it as I know I am in the minority on this. My reasons are:

  1. If we allow observations beyond the max delay to be added to the max delay, then observations change depending on when we look. This is a particular issue when we compare retrospective data with real-time data, where performance will differ.
  2. If we exclude but don't normalise, we end up with an identifiability problem. This is because the expectation or the reporting distribution can be changed to achieve the same overall count (see the sketch after this list).
  3. If we don't do this, we are treating hazard effects and parametric effects differently. This is because the total probability of reporting in the hazard model sums to 1 over the observed period.
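A toy numeric example of the identifiability problem in point 2 (hypothetical numbers only): if the retained reporting probabilities do not sum to 1, a larger expectation combined with proportionally smaller reporting probabilities gives exactly the same expected counts, so the two cannot be separated by the likelihood.

# Toy example of the identifiability problem (hypothetical numbers)
expected_reports <- function(expected_cases, report_prob) expected_cases * report_prob

expected_reports(100, c(0.5, 0.3, 0.1))             # 50 30 10 (probabilities sum to 0.9)
expected_reports(100 / 0.9, c(0.5, 0.3, 0.1) * 0.9) # also 50 30 10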

I think the way to frame this model, and the use of a maximum delay in general, is that we are shifting the question from nowcasting what will ever be reported to nowcasting what will be reported up to D. We should probably work on phrasing this better in the model vignette.

@seabbs (Collaborator, Author) commented Jul 18, 2022

Closed in favour of #122

seabbs closed this as completed Jul 18, 2022