
Possible API for models #160

Closed
TimTaylor opened this issue Feb 1, 2024 · 25 comments
Assignees
Labels
Discussion Issues kept open for discussions in the comments

Comments

@TimTaylor
Contributor

TimTaylor commented Feb 1, 2024

Overview

Expose two functions to users:

  • An unsafe one (little / no input checking) that works on individual runs and interventions. This is more of an internal function, but we expose it for "advanced" users. It only returns results and is not classed.

  • A vectorised one that allows comparisons of interventions:

    • Vectorised over parameters (e.g. alpha, beta, gamma, contact_matrix).
    • Takes a list of interventions to compare, as well as a number of iterations to run for each parameter combination.
    • Optionally (TBD) allows specification of random_seed to ensure each individual run can be matched (not needed when a list of interventions to compare is given, only when a user is handling comparisons themselves) - see the challenges/questions section below.
    • May be classed (TBD).

Motivation for this API

  • Input checking overhead adds up over a large number of runs. As we expect users to explore different levels of uncertainty, the number of runs we expect them to perform adds up. We want to provide a safe way for users to do multiple runs without paying the checking overhead repeatedly.

  • For stochastic models, comparing interventions across multiple runs involves careful handling of random number streams. This is easy to ignore and get wrong. To this end we do want to support users in managing this. Whether this should be part of a scenarios package or not can be addressed later once we have experience of the best API.

SEIR Example

# unsafe
.seir <- function(alpha, beta, gamma, contact_matrix, ..., intervention = NULL) {
    # implementation
    # ...
    # ...
    
    # just return the results
}

# safe
seir <- function(
        alpha, beta, gamma, contact_matrix,
        ...,
        intervention = list(NULL),
        n = 1,
        random_seeds = NULL
) {
    # implementation
    # ...
    # ...
    
    # return results linked to parameters, interventions and run number (for
    # stochastic models). Do we also want to return the seeds for each
    # iteration?
}

Challenges / questions

  • Do interventions combine nicely (e.g. vaccinations and contact reductions) into a single intervention, or will we need to take lists of lists? Implementation-wise this is not a problem, but will it feel unwieldy for users?

  • If a user wants to do separate function calls to compare interventions, then we need to provide a way for them to match random seeds for each internal iteration. This means that we would need to return the random seed for each iteration. For the default random number generator in R this corresponds to an integer vector of length 626. Having to return length(param) * n vectors of length 626 could get unwieldy (and large) - thoughts???

  • What format for output of the vectorised function? List, or nested data.frame / tibble / data.table?

  • Do we class the output?

  • Are the random seed challenges the things that could motivate a scenarios package? Would handling it in epidemics be too much? Too early to say?
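On the seed-size point above, the length of the saved RNG state can be checked directly (a minimal sketch; applies to the default Mersenne-Twister generator):

```r
# .Random.seed for the default Mersenne-Twister generator is an integer
# vector of length 626: the RNG kind, the stream position, and 624 words
# of internal state.
set.seed(1)
seed_length <- length(.Random.seed)
```

So each stored seed is 626 integers, which is where the length(param) * n * 626 storage concern comes from.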

@TimTaylor
Contributor Author

@pratikunterwegs @bahadzie @BlackEdder @rozeggo @Bisaloo @chartgerink. Just a sketch / brain dump after our discussions yesterday. Thoughts appreciated.

@pratikunterwegs
Collaborator

pratikunterwegs commented Feb 1, 2024

Thanks @TimTaylor for this sketch.

Broadly, the external user-facing 'safe' and internal 'unsafe' implementation already exists for all C++ implementations (default, Vacamole, diphtheria), where the C++/Rcpp function .model_*_cpp() is called by the corresponding exported function model_*_cpp(). Implementing this structure for the ebola model (in R) shouldn't be too challenging. The exported functions could check inputs as vectors or lists rather than single objects, combine them into a parameter table, and call the internal function on those parameters.

I'm happy to hear suggestions from anyone who was not at the meeting regarding this.

Do interventions combine nicely (e.g. vaccinations and contact reductions) into a single intervention, or will we need to take lists of lists? Implementation-wise this is not a problem, but will it feel unwieldy for users?

Contact interventions and vaccinations combine well with objects of the same type. Rate interventions also combine, but the combination must target the same parameter; e.g. combining an intervention on $\beta$ and $\gamma$ would work but would be wrong.

Users aiming to run two different scenarios, differing in their interventions, would indeed need to pass a list of lists. While unwieldy, we do need to see it from a sustainability perspective - the codebase is already unapproachable for the majority of the team so it would be good to avoid further complication (as bundling interventions and vaccinations together would require unbundling the object once passed to C++).

If a user wants to do separate function calls to compare interventions, then we need to provide a way for them to match random seeds for each internal iteration. This means that we would need to return the random seed for each iteration. For the default random number generator in R this corresponds to an integer vector of length 626. Having to return length(param) * n vectors of length 626 could get unwieldy (and large) - thoughts???

This case assumes parameter uncertainty, hence the seed. I would say the user could/should pass a vector of parameter values, which could be drawn from a distribution and passed to each function call. Eventually, functionality could be added to e.g. {epiparameter} <epidist> to sample with a seed - have been speaking with @joshwlambert about this.
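The suggestion above can be sketched as follows (assuming the vectorised seir() interface proposed at the top of this issue, which does not exist yet - the seir() calls are illustrative only):

```r
# Draw parameter values once, outside the model function, so the exact same
# draws can be reused across scenario calls and results remain comparable.
set.seed(1)
beta_draws <- rnorm(1000, mean = 1.3 / 7, sd = 0.01)

# Hypothetical calls against the proposed API (not runnable as-is):
# res_baseline <- seir(alpha, beta = beta_draws, gamma, contact_matrix)
# res_scenario <- seir(alpha, beta = beta_draws, gamma, contact_matrix,
#                      intervention = list(interventions_1))
```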

What format for output of the vectorised function? List, or nested data.frame / tibble / data.table?

Suggestions on what works here would be great. The key consideration is that the output would eventually be ingested by a function that calculates differences between discrete scenarios (e.g. intervention 1 vs intervention 2), and there would need to be a way to identify these scenarios within the output.

Do we class the output?

I think we decided yesterday to eventually class the output. Happy to hear suggestions.

@TimTaylor
Contributor Author

This case assumes parameter uncertainty, hence the seed. I would say the user could/should pass a vector of parameter values, which could be drawn from a distribution and passed to each function call. Eventually, functionality could be added to e.g. {epiparameter} <epidist> to sample with a seed - have been speaking with @joshwlambert about this.

Unless I misunderstood, the seed issue is more about stochastic models, where you need to compare interventions across hundreds of realisations and so need to start with the same seeds???

@pratikunterwegs
Collaborator

This case assumes parameter uncertainty, hence the seed. I would say the user could/should pass a vector of parameter values, which could be drawn from a distribution and passed to each function call. Eventually, functionality could be added to e.g. {epiparameter} <epidist> to sample with a seed - have been speaking with @joshwlambert about this.

Unless I misunderstood, the seed issue is more about stochastic models, where you need to compare interventions across hundreds of realisations and so need to start with the same seeds???

Well those too, my omission.

@TimTaylor
Contributor Author

Let's ensure we're on the same page re: seeds and stochasticity. AFAICT:

Simple case: a list (of lists of) multiple interventions within one function call

seir(
    alpha, beta, gamma, contact_matrix,
    intervention = list(interventions_1, interventions_2),
    n = 1000
)

In this case we can easily ensure .Random.seed is matched across the two sets of interventions for each of the 1000 repetitions.
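A minimal sketch of how that internal matching could work (hypothetical - run_matched and run_one are stand-ins, not epidemics code):

```r
# Sketch: capture the RNG state at the start of each repetition and restore
# it before running each intervention set, so repetition i uses an identical
# random number stream for every intervention. `run_one` is a stand-in for
# the unsafe single-run function.
run_matched <- function(run_one, interventions, n) {
  if (!exists(".Random.seed", envir = globalenv())) set.seed(NULL)
  out <- vector("list", n)
  for (i in seq_len(n)) {
    seed_i <- get(".Random.seed", envir = globalenv())
    out[[i]] <- lapply(interventions, function(intervention) {
      assign(".Random.seed", seed_i, envir = globalenv())
      run_one(intervention)
    })
  }
  out
}

# With a stochastic stand-in model, both interventions within a repetition
# see the same stream, while repetitions differ:
set.seed(42)
res <- run_matched(function(iv) runif(1), list("a", "b"), n = 3)
```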

More complicated case: different interventions across two function calls

res1 <- seir(
    alpha, beta, gamma, contact_matrix,
    intervention = list(interventions_1),
    n = 1000
)

res2 <- seir(
    alpha, beta, gamma, contact_matrix,
    intervention = list(interventions_2),
    n = 1000,
    random_seeds = res1$random_seeds # list of the seeds - something along these lines???
)

We need some way to ensure comparability for the user. Now (arguably) this is where {scenarios} could come in, but I think it's easier to think about it within {epidemics} first, as I'm not sure how easy making a generic {scenarios} package would be.

@pratikunterwegs
Collaborator

Yes, this is basically what we would need - if we go for this approach of parameter and seed management for users.

@pratikunterwegs pratikunterwegs self-assigned this Feb 1, 2024
@pratikunterwegs
Collaborator

I'm tackling the various sub-issues in this thread one by one:

  1. Parameter uncertainty,
  2. Stochastic models needing multiple runs,
  3. Return type.

First up is the API for parameter uncertainty, which is relatively easy to implement if users draw parameter values and pass them to the function. A draft of the internal workings for the default model, using {data.table}, is in R/model_default.R on the feature/replicates branch. @TimTaylor @BlackEdder is this in line with your needs/ideas? Also happy to hear from @Bisaloo and @chartgerink.

Some questions, happy to discuss here or on Slack next week:

  1. I'm debating whether to have checks for only one parameter being allowed to have uncertainty per (exported) function call.
  2. I'm also unsure how to prevent questionable use of the functionality by people passing draws from multiple different distributions, e.g. N(mu = 0.18, sd = 0.01), and N(mu = 0.25, sd = 0.03).
  3. The return type is a nested <data.table>: expanding this for every row in the model output would make a very large object (as list-inheriting columns are also replicated for each row). Happy to hear solutions to this, including whether it's solved by having a class.

Attaching a small reprex here:

library(epidemics)
library(socialmixr)
#> 
#> Attaching package: 'socialmixr'
#> The following object is masked from 'package:utils':
#> 
#>     cite
library(microbenchmark)

polymod <- socialmixr::polymod
contact_data <- socialmixr::contact_matrix(
  polymod,
  countries = "United Kingdom",
  age.limits = c(0, 20, 40),
  symmetric = TRUE
)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> Removing participants that have contacts without age information. To change this behaviour, set the 'missing.contact.age' option
contact_matrix <- t(contact_data$matrix)
demography_vector <- contact_data$demography$population

# Prepare some initial objects
uk_population <- population(
  name = "UK population",
  contact_matrix = contact_matrix,
  demography_vector = demography_vector,
  initial_conditions = matrix(
    c(0.9999, 0.0001, 0, 0, 0),
    nrow = nrow(contact_matrix), ncol = 5L,
    byrow = TRUE
  )
)

data = model_default_cpp(
  uk_population,
  transmissibility = rnorm(1000, 1.3/7, 0.01)
)

data
#>            population transmissibility infectiousness_rate recovery_rate
#>                <list>            <num>               <num>         <num>
#>    1: <population[4]>        0.1992952                 0.5     0.1428571
#>    2: <population[4]>        0.1688840                 0.5     0.1428571
#>    3: <population[4]>        0.1958680                 0.5     0.1428571
#>    4: <population[4]>        0.1798174                 0.5     0.1428571
#>    5: <population[4]>        0.1896007                 0.5     0.1428571
#>   ---                                                                   
#>  996: <population[4]>        0.1779147                 0.5     0.1428571
#>  997: <population[4]>        0.1905737                 0.5     0.1428571
#>  998: <population[4]>        0.1799411                 0.5     0.1428571
#>  999: <population[4]>        0.1824868                 0.5     0.1428571
#> 1000: <population[4]>        0.1764790                 0.5     0.1428571
#>       intervention vaccination time_dependence time_end increment
#>             <list>      <list>          <list>    <num>     <num>
#>    1:                                               100         1
#>    2:                                               100         1
#>    3:                                               100         1
#>    4:                                               100         1
#>    5:                                               100         1
#>   ---                                                            
#>  996:                                               100         1
#>  997:                                               100         1
#>  998:                                               100         1
#>  999:                                               100         1
#> 1000:                                               100         1
#>                       data run_id
#>                     <list>  <int>
#>    1: <data.frame[1515x4]>      1
#>    2: <data.frame[1515x4]>      2
#>    3: <data.frame[1515x4]>      3
#>    4: <data.frame[1515x4]>      4
#>    5: <data.frame[1515x4]>      5
#>   ---                            
#>  996: <data.frame[1515x4]>    996
#>  997: <data.frame[1515x4]>    997
#>  998: <data.frame[1515x4]>    998
#>  999: <data.frame[1515x4]>    999
#> 1000: <data.frame[1515x4]>   1000

# benchmarking
microbenchmark(
  model_default_cpp(
    uk_population,
    transmissibility = rnorm(1000, 1.3/7, 0.01)
  ),
  times = 10
)
#> Warning in microbenchmark(model_default_cpp(uk_population, transmissibility =
#> rnorm(1000, : less accurate nanosecond times to avoid potential integer
#> overflows
#> Unit: seconds
#>                                                                                expr
#>  model_default_cpp(uk_population, transmissibility = rnorm(1000,      1.3/7, 0.01))
#>       min       lq     mean  median       uq      max neval
#>  1.781095 1.816511 1.839991 1.84262 1.865887 1.878012    10

Created on 2024-02-02 with reprex v2.0.2

@pratikunterwegs
Collaborator

pratikunterwegs commented Feb 5, 2024

Adding some thoughts on this API for @TimTaylor.

Re: a safe/unsafe implementation: composable elements are cross-checked against each other. E.g. interventions and vaccinations must have the same number of coefficients as demography groups in the population. Allowing a list of (lists of) interventions requires cross-checking each element using check_model_args_*() if the function is to fail informatively.

This could be avoided if each model function call allowed only a single list of interventions (on contacts and parameters; i.e., current functionality), as this would reduce input cross-checking to 1 instance. This still allows parameter uncertainty, as these are not needed for the cross-checks on composable elements. The user would have to run the model for each intervention scenario, but it would make for a faster run time.

@chartgerink
Member

I'm still getting familiar with the package specifics, so happy to follow this discussion but unable to provide meaningful insights on the detailed elements.

For what it's worth, I would recommend doing as little seed handling within the functions as possible. If at all possible, it would be helpful to rely on seed setting outside of the function, taking it as a given. I do not know to what degree this is possible in this specific use case, of course 😊

If you end up including seed setting, please make sure to set the seeds in a non-global manner (as is best practice). You can do this with withr::with_seed().
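For illustration, withr::with_seed() evaluates an expression with a temporary seed and restores the previous global RNG state afterwards (assumes the {withr} package is installed):

```r
library(withr)

# The expression is evaluated with the given seed, so repeated calls with
# the same seed produce the same draw...
x1 <- with_seed(42, runif(1))
x2 <- with_seed(42, runif(1))

# ...and the caller's RNG state is restored afterwards, so the global
# stream is not perturbed by the call.
set.seed(1)
state_before <- .Random.seed
invisible(with_seed(99, runif(10)))
state_after <- .Random.seed
```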

@pratikunterwegs
Collaborator

For what it's worth, I would recommend doing as little seed handling within the functions as possible. If at all possible, it would be helpful to rely on seed setting outside of the function, taking it as a given. I do not know to what degree this is possible in this specific use case, of course

I agree with this. Setting a seed in R is pretty easy so I think this should be part of training and not built into the tool. Still, it's a balance between building a robust package and a tool whose convenience promotes uptake, I've found.

@TimTaylor
Contributor Author

@pratikunterwegs / @chartgerink - I won't really have too much time to look at this this week (I've not yet worked through the reprex you showed). But to add some more context / commentary to your comments:

This could be avoided if each model function call allowed only a single list of interventions (on contacts and parameters; i.e., current functionality), as this would reduce input cross-checking to 1 instance. This still allows parameter uncertainty, as these are not needed for the cross-checks on composable elements.

Would allowing parameter uncertainty and different interventions but restricting to a single population matrix be a solution? I think allowing multiple lists of interventions is likely to be useful within the call, as managing the random number streams will be awkward otherwise for users. If we went the single-list-of-interventions route then we would need to return a list of n multiple .Random.seeds and likely (if we are being careful) the random number generator used. This would then need passing in to the next call the user performs. Arguably this may need to be returned anyway to allow for multiple calls, but this is one area where I do think we can help users without it being too OTT.

For what it's worth, I would recommend doing as little seed handling within the functions as possible. If at all possible, it would be helpful to rely on seed setting outside of the function, taking it as a given. I do not know to what degree this is possible in this specific use case, of course 😊

At a high level users will still be able to set.seed() at the start of their scripts. The issue here is that managing streams across multiple stochastic realisations with multiple interventions and parameters is much more fiddly. If it's just one repetition at a time with just one list of interventions then, yes, it wouldn't be too onerous on the user (they'd still need to manage seeds between repetitions and have a good way of doing so). Basically, how much we need to manage the seeds is very linked to whether we go with this sort of API or not.

If you end up including seed setting, please make sure to set the seeds in a non-global manner (as is best practice). You can do this with withr::with_seed().

We wouldn't be setting the initial seed - only ensuring the seeds align between replications so AFAICT nothing would need restoring. I'd be worried if a stochastic function didn't change the seed 😅

I agree with this. Setting a seed in R is pretty easy so I think this should be part of training and not built into the tool.

Generally yes, but here there is likely a lot of bookkeeping for people to do (and possibly get wrong), involving both getting and setting seeds across interventions and parameters.

Repeating sentence from earlier as very relevant here:

how much we need to manage the seeds is very linked to whether we go with this sort of API or not.

@pratikunterwegs
Collaborator

Thanks @TimTaylor - we (@rozeggo, @bahadzie and I) just met again, and I'll be working through the various sub-issues here bit by bit. So rather than one giant PR with a fully thought-through implementation, I'll make several smaller PRs where it would be great if you could take a look once you have time.

Would allowing parameter uncertainty and different interventions but restricting to a single population matrix be a solution?

No - each intervention would have to be checked against the population matrix (or the vector of demography group size), as this determines whether they are compatible. Consider a single <population> with multiple intervention lists - even one with a poorly specified intervention (e.g. reducing contacts of a non-existent group) would cause an error - hence cross checking.

If we went the single-list-of-interventions route then we would need to return a list of n multiple .Random.seeds and likely (if we are being careful) the random number generator used. This would then need passing in to the next call the user performs. Arguably this may need to be returned anyway to allow for multiple calls, but this is one area where I do think we can help users without it being too OTT.

This is still quite ott imo. The simpler way to start imo is to allow passing a vector of parameters for parameter uncertainty, and having users reuse that vector for alternative scenarios.

At a high level users will still be able to set.seed() at the start of their scripts. The issue here is that managing streams across multiple stochastic realisations with multiple interventions and parameters is much more fiddly.

The issue of combined parameter uncertainty and stochastic uncertainty is one I'll set aside for now while focusing on parameter uncertainty, but definitely to be taken up in the coming weeks.

Generally yes, but here there is likely a lot of bookkeeping for people to do (and possibly get wrong), involving both getting and setting seeds across interventions and parameters.

This strikes me as a nice-to-have, hand-holding feature, rather than critically necessary for the tool. I don't mind working towards this but perhaps something to keep to one side for now.

how much we need to manage the seeds is very linked to whether we go with this sort of API or not.

We have decided to go with this API - but seed management can be left to one side until we get to a doing-comparisons stage. That's still a while away. I don't personally see the benefits of storing or returning seeds just yet, but perhaps they will become clearer as this effort progresses.

@TimTaylor
Contributor Author

Thanks @TimTaylor - we (@rozeggo, @bahadzie and I) just met again, and I'll be working through the various sub-issues here bit by bit. So rather than one giant PR with a fully thought-through implementation, I'll make several smaller PRs where it would be great if you could take a look once you have time.

Yeah that's cool. Think we are getting confused at the mo.

No - each intervention would have to be checked against the population matrix (or the vector of demography group size), as this determines whether they are compatible. Consider a single <population> with multiple intervention lists - even one with a poorly specified intervention (e.g. reducing contacts of a non-existent group) would cause an error - hence cross checking.

Yeah, I don't see this as a problem. The main saving comes from not having to check e.g. 3000 different parameters, just 3. I think it's fine to check multiple intervention lists against the population matrix, as I doubt there is much overhead there.

That's still a while away. I don't personally see the benefits of storing or returning seeds just yet, but perhaps they will become clearer as this effort progresses.

The issue manifests if you allow users to do multiple repetitions for a set of parameters and they want to compare interventions across multiple calls (not a problem if done within one call). Whilst a user could match the seed for the first repetition, they couldn't for subsequent repetitions, as the intervention will affect the random number stream.

@Bisaloo
Member

Bisaloo commented Feb 5, 2024

Thanks for sharing the previous discussion and the interesting points raised here.

Generally, I'm in favour of having a dual API which allows us to solve the simplicity-flexibility trade-off by placing two points on the scale: access to the full flexibility via a low-level API, and a simplified, less flexible high-level API.

The way I had imagined it, however, is that the high-level API / the vectorised function would be part of {scenarios}. This may seem like a minor point but I believe it would be helpful to ensure we follow good design patterns and limit coupling between both APIs. On this note, the current proposal doubles the size of the namespace, which may not be ideal. An alternative approach would be to have a single generic function for the high-level API (e.g., run_scenarios(model, ...)). I understand: 1. this means going back to the initial idea of epidemics, which had a generic epidemics() function; 2. run_scenarios() is not much more than a glorified replicate(); but I think this is inevitable, and if we are able to add some hand-holding in terms of input checking, output formatting, seed management, etc., it may still be worth it.

On a related note, I think we're probably in agreement about this but it's worth explicitly mentioning it: we should follow the vctrs strategy on recycling to avoid unexpected behaviour.

In terms of class, I agree a nested data.frame is probably the best choice, rather than a list. Even though it's more complex than a non-nested data.frame, it still has a similar feel for users, and tidyverse tools are directly usable with it.

The return type is a nested <data.table>

While I appreciate the need for data.table internally, is it possible to convert it to a data.frame before returning so that non-data.table users don't get confused?

I'm also quite uncomfortable with the seed management, as I suspect it contains a couple of issues and edge cases - such as if we want to allow parallel processing in the future. But I'll need to think about it more, and it seems anyway that we don't really have a choice.

@TimTaylor
Contributor Author

TimTaylor commented Feb 5, 2024

I strongly suggest another meeting/discussion to agree the API to work towards before any further development is done, as I think there's a risk of going in circles otherwise.

@pratikunterwegs
Collaborator

Thanks @Bisaloo, just a few clarifications below.

On this note, the current proposal doubles the size of the namespace, which may not be ideal.

I agree that the internal function should not be exported; otherwise, the two-level structure already exists for Rcpp model functions, which are called by R-only exported functions. Additionally, I'm opening an issue to propose removal of the R-only versions of ODE model code in favour of the C++ implementation, which would reduce the namespace and the codebase. This is because our new understanding of the use case (1000s of runs per modelling script/task) suggests that a slower implementation is no longer useful, adds to maintenance, and is too deeply buried to be a good teaching tool.

On a related note, I think we're probably in agreement about this but it's worth explicitly mentioning it: we should follow the vctrs strategy on recycling to avoid unexpected behaviour.

I agree with this, although users might be hoping for base R-style recycling, and this might need to be addressed - something for later.
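For reference, the difference between the two recycling strategies (a sketch; requires the {vctrs} package):

```r
# Base R recycles any shorter vector, warning only when the longer length is
# not a multiple of the shorter; vctrs recycles only size-1 inputs and
# errors on any other size mismatch.
base_result <- 1:2 + rep(1, 6)                     # silent recycling: 2 vs 6
vctrs_result <- vctrs::vec_recycle_common(1, 1:6)  # size 1 recycled to 6
failed <- inherits(
  try(vctrs::vec_recycle_common(1:2, 1:6), silent = TRUE),
  "try-error"
)                                                  # incompatible sizes: error
```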

The return type is a nested <data.table>

While I appreciate the need for data.table internally, is it possible to convert it to a data.frame before returning so that non-data.table users don't get confused?

Agreed - all {epidemics} outputs are data.frames; this is only for an initial prototype on the branch.

@TimTaylor - happy to have another meeting, but I think this would be more productive if we were discussing a prototype rather than a hypothetical. I'll get an initial version of parameter uncertainty in the default model up and running this week, and would be happy to meet to discuss that. Will put this in the Slack now.

@Bisaloo
Member

Bisaloo commented Feb 7, 2024

An alternative approach would be to have a single generic function for the high-level API (e.g., run_scenarios(model, ...)). I understand: 1. this means going back to the initial idea of epidemics, which had a generic epidemics() function; 2. run_scenarios() is not much more than a glorified replicate(); but I think this is inevitable, and if we are able to add some hand-holding in terms of input checking, output formatting, seed management, etc., it may still be worth it.

For future reference, I am now convinced the approach proposed in this issue first message is the best path forward. In particular, @TimTaylor mentioned the following benefits:

  • Allow vector input to reduce the number of input checks required for repeated iterations and/or scenario comparisons.
  • Opportunity to move the looping into C++ (only if ever deemed necessary).

Also for future reference, I should clarify that the main benefit of this implementation IMO is not really performance or sparing users from writing their own loops (this level of R proficiency will likely be necessary to use any of our packages anyway), but it allows us to have a stable and predictable output format (or class) for scenario modelling, which facilitates integration with downstream packages.

I have mentioned reservations about how parallel computing would work in this framework but we agreed:

  • we can cross that bridge when we get there. Due to the very high performance of existing epidemics code, we may never have a need for parallel processing
  • if we preserve the low-level interface, users that require flexibility in how the loop is implemented (sequential vs parallel, specific framework, etc.) can still write their own loop.

@TimTaylor
Contributor Author

it allows us to have a stable and predictable output format (or class) for scenario modelling, which facilitates integration with downstream packages.

The structure of the classed output probably warrants an additional issue raising. Points to consider:

  • stochastic versus deterministic model output.
  • Whether we want to capture warnings and errors (could require additional columns to store the information)

@pratikunterwegs
Collaborator

Thanks both, I'm opening some related issues now, which will be collected under this project.

The structure of the classed output is already open for discussion under #156 if you have any thoughts.

@pratikunterwegs
Collaborator

pratikunterwegs commented Feb 7, 2024

I have mentioned reservations about how parallel computing would work in this framework but we agreed:

  • we can cross that bridge when we get there. Due to the very high performance of existing epidemics code, we may never have a need for parallel processing.

The code that this issue really affects is the stochastic Ebola model, which is also the slowest one as it is written in pure R. A good way forward here could be to refactor model_ebola_r() to C++. I am especially looking for a good way to implement prob_discrete_erlang() in C++ without including more libraries. I would appreciate it if there's a mathematical hack for the Erlang distribution that allows replacing it with the Gamma distribution - any thoughts @adamkucharski?

Edit: an alternative is of course to calculate the output of prob_discrete_erlang() once for each parameter set in R, and pass it to an internal function that is in C++ - this is straightforward to implement.
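On the Erlang/Gamma point: the Erlang distribution is exactly the Gamma distribution with integer shape, so a discretised version can be computed from pgamma() differences with no extra library. A sketch (an assumption about what prob_discrete_erlang() computes - the package's implementation may normalise or truncate differently):

```r
# Discretised Erlang probabilities via the Gamma CDF: Erlang(shape, rate)
# with integer shape is Gamma(shape, rate), so P(i - 1 < X <= i) is a
# difference of pgamma() values.
discretise_erlang <- function(shape, rate, n_bins) {
  bins <- seq_len(n_bins)
  p <- pgamma(bins, shape = shape, rate = rate) -
    pgamma(bins - 1, shape = shape, rate = rate)
  p / sum(p) # renormalise over the truncated support
}

p <- discretise_erlang(shape = 2, rate = 0.5, n_bins = 50)
```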

@TimTaylor
Contributor Author

TimTaylor commented Feb 7, 2024

@pratikunterwegs - I've submitted a small PR (#171) that may improve prob_discrete_erlang() performance. I'm unclear why an external library is needed to implement it in C++, though - am I missing something obvious?

@pratikunterwegs
Collaborator

pratikunterwegs commented Feb 7, 2024

Thanks @TimTaylor, will take a look and merge soon. My initial thought was to check whether I could find this functionality in GSL through RcppGSL; my next thought was to replace the Erlang with the Gamma distribution. But the simpler solution is of course to peel off these initial parts from the later iteration step.

One other speed bottleneck is the usage of stats::rmultinom() - if there were a faster option that would be a boost too, as this is called $2T*N_\text{replicates}$ times for any $T$ model timesteps. I'll try to profile this code in the coming week for a better understanding.
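If the bottleneck is many separate multinomial draws (assuming stats::rmultinom() is the call in question), one avenue is that rmultinom() is already vectorised over the number of draws, so replicates that share the same size and prob could be drawn in one call rather than in a loop - a sketch, and only applicable where those arguments do not differ per replicate:

```r
# One call draws 1000 multinomial samples at once, returned as a 3 x 1000
# matrix with one column per draw, which may be cheaper than 1000 separate
# rmultinom() calls inside the replicate loop.
draws <- rmultinom(n = 1000, size = 500, prob = c(0.2, 0.3, 0.5))
```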

@TimTaylor
Contributor Author

TimTaylor commented Feb 7, 2024

If prob_discrete_erlang() remains a bottleneck we can probably knock up a C version (but agree it would be nice to get it from elsewhere). factorial() calls gamma() under the hood, and this is exposed in Rmath.h AFAIK, so it should be relatively straightforward to port over (if necessary, etc. etc.).

@pratikunterwegs
Collaborator

I did actually have a C++ version which, IIRC, was not much faster. It was removed in #140. Perhaps speeding up stats::rmultinom() could help more - will take a look once the vectorisation project stabilises.

@pratikunterwegs pratikunterwegs added the Discussion (Issues kept open for discussions in the comments) and C++ (Related to C++ code) labels Feb 16, 2024
@pratikunterwegs
Collaborator

I'm closing this issue as many of the discussion items have been implemented in #176 and upcoming #211, but will pin it for future reference as there are some issues that do require further consideration (epidemics class, seed management, parallelisation, among others).

@pratikunterwegs pratikunterwegs removed the C++ Related to C++ code label Apr 23, 2024
@pratikunterwegs pratikunterwegs pinned this issue Apr 23, 2024