Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing Stan dependency versions to package users #400

Open
zsusswein opened this issue Nov 24, 2023 · 13 comments
Open

Exposing Stan dependency versions to package users #400

zsusswein opened this issue Nov 24, 2023 · 13 comments
Labels
help wanted Extra attention is needed infrastructure

Comments

@zsusswein
Copy link
Contributor

Is your feature request related to a problem? Please describe.

I use {epinowcast} in a containerized pipeline where we try to maximize reproducibility. When building new versions of the image, we build {epinowcast} from source from this repository pinned to a specific commit. But it can get a bit messy with layer caching and moving versions to get the latest features.

We try to manually keep track of the associated versions of the Stan dependencies and install the correct versions into the container, but it would be a helpful feature to expose those versions programatically.

Describe the solution you'd like
I'm not sure what the most idiomatic solution is, but ideally we'd be able to call a simple function or variable to get the needed information. Something like epinowcast::get_cmdstan_version().

Describe alternatives you've considered

Even a simple message to the user that printed on package load would be a nice quality-of-life improvement.

@seabbs
Copy link
Collaborator

seabbs commented Nov 24, 2023

This is a good idea and would also help us in development I think. The only thing we want to be careful of is to not make this overly restrictive as stan is pretty stable so its likely our model won't actually depend on a specific version and we don't want people to install new versions for no reason.

I'm not sure we want to supply a function to do this. We could encode the stan dependency into the DESCRIPTION but that might be overly proscriptive. We could also use the new #386 to add the version of cmdstan to main we are use the package work with and then use that in a function to serve the latest recent version?

That would be good as it would be automated but it would be bad as it would always be the least version of cmdstan and we wouldn't know how far back our model would support.

@seabbs seabbs added help wanted Extra attention is needed infrastructure labels Nov 24, 2023
@pearsonca
Copy link
Collaborator

Is your feature request related to a problem? Please describe.

I use {epinowcast} in a containerized pipeline where we try to maximize reproducibility. When building new versions of the image, we build {epinowcast} from source from this repository pinned to a specific commit. But it can get a bit messy with layer caching and moving versions to get the latest features.

Prying a bit: how are you defining "reproducibility"? In the scientific sense (i.e. prioritizing ability to specify these precise conditions => get these precise results) or in the engineering sense (i.e. on a budget, reliably churning out parts within quality tolerance standards + minimal disruptions to production line with maintenance/upgrades)?

In my opinion, these are antagonistic, so the answer can't be "both". So: what we might actually do to solve the problem depends on which perspective you're arguing to prioritize.

@zsusswein
Copy link
Contributor Author

zsusswein commented Nov 27, 2023

Prying a bit: how are you defining "reproducibility"?

This is a good question and not something I've thought much about before. Perhaps $p < 0.05$?1

In my opinion, these are antagonistic, so the answer can't be "both".

Thinking out loud: seeing the engineering and scientific perspectives as orthogonal makes sense to me, but I'd be interested to hear what you see as actively antagonistic. I'm not sure I'm totally following?

In practice, I see my goals here as closer to your engineering perspective. The priorities are to (1) deliver "correct"2 results (2) on time each week (3) within a reasonable budget of time and computational resources.

Aspirationally, the results are both deterministic for a particular model x dataset x random seed combination and also in some sense "true" about the world. I think containerization helps with the former, but not at all with the latter.

Does that help at all?

Footnotes

  1. Kidding! We're all Bayesians here.

  2. "Correct" is a total weasel word here and you'd be right to call me on it.

@pearsonca
Copy link
Collaborator

Re "antagonistic" - getting runs to produce the precise same results, alongside a fully interpretable audit trail, is going to cost (money, time, sanity, etc) that could be better spent on "useful" features (the phrase I would use instead of "correct", though also a bit vague) e.g. visualizations of results or open-source contributions to improve package diagnostics/messages/etc (hint-hint).

Making results precisely match is useful for algorithmic diagnosis - that can be good for figuring out what's wrong with your machinery, but also...that's not the point of the machine, right?

Aside: if results are sensitive to random number seed (or more broadly, generation), then they aren't results. Which is to say: the specific quantities whatever pipeline you build spits out aren't the result - "do X over Y" is the result.

@pearsonca
Copy link
Collaborator

pearsonca commented Nov 29, 2023

@zsusswein calling your attention to this discussion as well: https://community.epinowcast.org/t/epinowcast-command-line-interface/207/3, as that seems pertinent to what you're doing. If you could provide example use cases / user stories / etc, the potential outcome of that work would likely be more useful to you et al.

(not that it solves the how-do-i-record-environment-state-so-i-can-later-diagnosis-problems, but it might place it in a different, more manageable box).

@seabbs
Copy link
Collaborator

seabbs commented Nov 29, 2023

This is a nice conversation but perhaps best suited for the community forum or a epinowcast discussion item. Getting back to the question at hand. @zsusswein do you have a suggested implementation that would meet your use case and do you think your use case has a wider scope than you/your team (and can you give evidence to this effect).

@zsusswein
Copy link
Contributor Author

do you have a suggested implementation that would meet your use case

I think the simplest workable solution from my perspective is to expose the versions used in CI for {epinowcast}? I don't mind updating frequently and I would think the type of person who cares about this would feel similarly.

I'm not sure what the right approach is from a package/CRAN perspective. It feels idiomatic to set an option or environment flag, but I'm not sure that's necessarily best practice?

do you think your use case has a wider scope than you/your team (and can you give evidence to this effect).

I like to think it would, but I don't know and have no evidence.

@adrian-lison
Copy link
Collaborator

Just as a side note related to this, I think we currently need stan 2.33 (the most recent version) as a minimum to be able to compile the models. If people have an older version of cmdstan installed, they will get an error complaining about the model syntax etc. This is likely confusing, so adding functionality to inform the user about the minimum version required would be valuable.

@adrian-lison
Copy link
Collaborator

To always be aware of our minimum version, we could also create a tweak of #386 to always run with a specified minimum version (e.g. starting with 2.33 for now). When this one fails but the #386 github action with the most recent stan version does not, we know it's because our code includes some new features and we need to update the minimum version.

@seabbs
Copy link
Collaborator

seabbs commented Jan 9, 2024

@adrian-lison do you think we should aim to support a range of versions or should we just try and say which version for certain works?

@seabbs
Copy link
Collaborator

seabbs commented Jun 13, 2024

@zsusswein unfortunatly on one has had capacity to get to this. If this is still useful for your team would you have time to take a crack at it with some support.

@zsusswein
Copy link
Contributor Author

Sure! Happy to give it a go. A few questions:

  1. Do you want to capture the cmdstan version present on the system at package build, the version in DESCRIPTION.SystemRequirements, or vend it yourself (in R/data.R or R/sysdata.rda perhaps)?
    • Present at build time is simplest but means that it won't work if a user installs from source. Probably a dealbreaker.
    • Relying on SystemRequirements feels most idiomatic but it's unclear to me as a non-dev on this project whether that's a reliable source of information
    • Vending yourself is easy and reliable (track Issue 385: GitHub Action to check compilation and correctness of cmdstan syntax. #386 maybe?), but perhaps confusing if it conflicts with SystemRequirements
  2. How do you want to provide the information? It might be nice to echo {cmdstanr}'s syntax, where they have an internal function latest_released_version() that tracks the cmdstan version the user could update to and then suggest an update on load. I'm not sure a suggested update is needed here, but perhaps a similar latest_supported_cmdstan_version() function? You could group it with Utilities in the roxygen2 function reference?
  3. What do you want the latest supported version to actually be? My (source-free) understanding of "supported" is something along the lines of "we will fix bugs coming from this version but if you're on an earlier version please upgrade first to see if the bug goes away." If that very rough definition works here, then I'd suggest you track Issue 385: GitHub Action to check compilation and correctness of cmdstan syntax. #386 and stick to the version you test under (which is latest?).

@seabbs
Copy link
Collaborator

seabbs commented Jun 13, 2024

It does seem like SystemRequirements would make sense though I think we want to make this a soft requirement (i.e know this works but other versions may also work) so based on that vending it ourselves based on the version used with a specific release makes sense.

How do you want to provide the information? It might be nice to echo {cmdstanr}'s syntax, where they have an internal function

I think this might be a bit heavy in the first instance. We could give an .onload message telling users a version or write it as a header in the stan files?

What do you want the latest supported version to actually be?

For now I am thinking of it less as supported and more built with. In general we will always be "supporting" the latest cmdstan release for any bug fixes etc.

and stick to the version you test under (which is latest?).

yes it is and agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed infrastructure
Projects
Status: No status
Development

No branches or pull requests

4 participants