Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify stat_max #237

Merged
merged 17 commits into from
May 16, 2024
Merged

Clarify stat_max #237

merged 17 commits into from
May 16, 2024

Conversation

jamesmbaazam
Copy link
Member

@jamesmbaazam jamesmbaazam commented Mar 22, 2024

This PR closes #193 by renaming stat_max to stat_threshold and improves the documentation to help clarify some nuances between stat_threshold in simulate_chains() and simulate_summary(). It also renames the infinite argument (old name of stat_max) in rborel() to censor_at and improves its documentation. In particular,

  • stat_threshold in simulate_chains() is a stopping criterion for chains. When the cumulative statistic of chains reaches or surpasses it, they end
  • stat_threshold in simulate_summary() is both a stopping criterion for chains and a censoring limit. When the cumulative statistics of chains reach or surpass it, they end. Additionally, chain statistic values that are >= stat_threshold are set to Inf.
  • censor_at in the borel functions is passed to stat_threshold in simulate_summary() as a stopping criterion and for censoring chain sizes but does not refer to chain sizes as used in simulate_summary(), hence the name change and dedicated documentation.

@jamesmbaazam jamesmbaazam added the documentation Improvements or additions to documentation label Mar 22, 2024
@jamesmbaazam jamesmbaazam marked this pull request as ready for review March 22, 2024 18:13
@jamesmbaazam
Copy link
Member Author

jamesmbaazam commented May 2, 2024

The need to clarify this argument came up today in the following tutorial. A participant asked, "Why do some chains produce more cumulative cases than what stat_max is?" My answer was roughly, "Chains may still produce more cases than stat_max near the stopping criterion but will not continue to produce more cases after the stopping criterion is met. For example, if the stopping criterion is 10 cases, a chain may produce 12 cases near the stopping criterion." This raises the need to clarify stat_max further as it suggests that chains cannot produce outcomes above it.

@joshwlambert and @avallecam Do you have thoughts on how to clarify this further? I'm tagging you because we had a brief discussion about it after the tutorial session.

R/simulate.r Outdated Show resolved Hide resolved
Copy link
Contributor

@sbfnk sbfnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do some chains produce more cumulative cases than what stat_max is?

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

R/borel.r Outdated Show resolved Hide resolved
R/simulate.r Outdated Show resolved Hide resolved
@joshwlambert
Copy link
Member

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

My preference would be to either 1) keep the name stat_max and make it a hard limit for the simulation so the user cannot exceed this number, or 2) change the argument name to replace max with another word that doesn't imply a hard limit and then improve documentation to how this works.

This is where I think the confusion stemmed from in the tutorial because it is a max that does not behave as a max.

@joshwlambert
Copy link
Member

FYI we have the same functionality dilemma in {simulist} and have chosen to have a soft limit so the "maximum" can be exceeded and the function returns a warning to let the user know the "maximum" (either default 1e4 or user-specified) has been exceeded and states the number of cases returned.

https://github.com/epiverse-trace/simulist/blob/main/R/sim_linelist.R#L70

This discussion also makes it clear that the documentation in sim_linelist() could also be improved.

@jamesmbaazam
Copy link
Member Author

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

I think this might be the desirable behaviour. We could treat the last generation of offspring as potential offspring and sample the number needed to reach stat_max by finding the difference between the cumulative number before the last generation and stat_max.

@jamesmbaazam
Copy link
Member Author

Would an alternative be to ensure it doesn't, i.e. select cases in the final generation such that the stat is stat_max?

My preference would be to either 1) keep the name stat_max and make it a hard limit for the simulation so the user cannot exceed this number, or 2) change the argument name to replace max with another word that doesn't imply a hard limit and then improve documentation to how this works.

This is where I think the confusion stemmed from in the tutorial because it is a max that does not behave as a max.

Thanks, Josh.

I think option 1 is better because it can be used to simulate desirable behavior. Allowing simulations to go beyond the limit set will make the results difficult to explain.

@jamesmbaazam
Copy link
Member Author

FYI we have the same functionality dilemma in {simulist} and have chosen to have a soft limit so the "maximum" can be exceeded and the function returns a warning to let the user know the "maximum" (either default 1e4 or user-specified) has been exceeded and states the number of cases returned.

epiverse-trace/simulist@main/R/sim_linelist.R#L70

This discussion also makes it clear that the documentation in sim_linelist() could also be improved.

I may be misinterpreting your definition of "soft" but from the code & output, it does seem like you don't return values above the max, so I would still interpret it as a "hard" limit.

@joshwlambert
Copy link
Member

I think option 1 is better because it can be used to simulate desirable behavior. Allowing simulations to go beyond the limit set will make the results difficult to explain.

Sounds good to me.

I may be misinterpreting your definition of "soft" but from the code & output, it does seem like you don't return values above the max, so I would still interpret it as a "hard" limit.

Here's a reprex to show what I mean. The outbreak_size argument takes a vector of two elements a hard lower limit, and a soft upper limit for the outbreak size.

set.seed(2)
library(simulist)
linelist <- sim_linelist(
  contact_distribution = function(x) dpois(x = x, lambda = 2),
  infect_period = function(x) dgamma(x = x, shape = 3, scale = 3),
  prob_infect = 0.55,
  onset_to_hosp = function(x) dgamma(x = x, shape = 2, scale = 2),
  onset_to_death = function(x) dgamma(x = x, shape = 2, scale = 2),
  outbreak_size = c(5, 20)
)
#> Warning: Number of cases exceeds maximum outbreak size. 
#> Returning data early with 29 cases and 44 total contacts (including cases).

Created on 2024-05-03 with reprex v2.1.0

@jamesmbaazam
Copy link
Member Author

I think option 1 is better because it can be used to simulate desirable behavior. Allowing simulations to go beyond the limit set will make the results difficult to explain.

Sounds good to me.

I may be misinterpreting your definition of "soft" but from the code & output, it does seem like you don't return values above the max, so I would still interpret it as a "hard" limit.

Here's a reprex to show what I mean. The outbreak_size argument takes a vector of two elements a hard lower limit, and a soft upper limit for the outbreak size.

set.seed(2)
library(simulist)
linelist <- sim_linelist(
  contact_distribution = function(x) dpois(x = x, lambda = 2),
  infect_period = function(x) dgamma(x = x, shape = 3, scale = 3),
  prob_infect = 0.55,
  onset_to_hosp = function(x) dgamma(x = x, shape = 2, scale = 2),
  onset_to_death = function(x) dgamma(x = x, shape = 2, scale = 2),
  outbreak_size = c(5, 20)
)
#> Warning: Number of cases exceeds maximum outbreak size. 
#> Returning data early with 29 cases and 44 total contacts (including cases).

Created on 2024-05-03 with reprex v2.1.0

Ah! Thanks for this. That makes sense. I don't think we should have a hard lower limit in {epichains} as it is by definition the seeing cases, if the outbreak doesn't take off. The soft upper limit is what is already implemented as stat_max, so it's currently a matter of documentation which can be solved with a name change and a proper wording.

My concern is more about whether it is desirable behaviour. For example, if a user sets stat_max to 500 cases and gets 800 cases, that's quite undesirable. I think the solution would be what I suggested in #237 (comment), where we sample the remainder from the excess cases and return exactly stat_max.

@jamesmbaazam jamesmbaazam force-pushed the change-truncation-to-censoring branch from b517fee to 1a20941 Compare May 7, 2024 13:11
R/borel.r Outdated Show resolved Hide resolved
@jamesmbaazam
Copy link
Member Author

Upon further discussion with Seb, we've decided it might be better to rename stat_max to stat_threshold and improve the wording to indicate that it is not a hard limit and can be exceeded due to stochastic effects. The discussion about making it a hard limit has been logged in #243 for further brainstorming.

@jamesmbaazam jamesmbaazam force-pushed the change-truncation-to-censoring branch 2 times, most recently from 9e3627e to d7d32e9 Compare May 15, 2024 15:40
@jamesmbaazam jamesmbaazam reopened this May 15, 2024
@jamesmbaazam jamesmbaazam force-pushed the change-truncation-to-censoring branch from 91920fb to 5e12628 Compare May 15, 2024 16:02
@jamesmbaazam jamesmbaazam requested a review from sbfnk May 15, 2024 16:03
@jamesmbaazam jamesmbaazam merged commit a86b53d into main May 16, 2024
9 checks passed
@jamesmbaazam jamesmbaazam deleted the change-truncation-to-censoring branch May 16, 2024 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix disparities in meaning of stat_max in various functions
4 participants