How to create a long-term reproducible environment YAML file? #787

Closed
fgp opened this issue May 17, 2019 · 19 comments

@fgp

fgp commented May 17, 2019

I'm shipping a data-analysis pipeline that we developed (https://github.com/Cibiv/ipoolseq-pipeline) with an environment definition YAML file (ipoolseq.yaml), and would like that file to produce the same conda environment now, in a month, and in a year. However, every ipoolseq.yaml that I create breaks after a few months. Usually because someone removes the "main" label from some of the package versions referenced by my file, causing "conda env create -f ..." to fail.

I also tried using "explicit" package lists, which contain full package URLs, but those seem to be even less stable than the YAML files listing package versions. So, given that one of the purposes of conda is precisely my use-case, namely to provide reproducible software environments, how do I get it to actually DO that?
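For readers unfamiliar with the "explicit" lists mentioned above: `conda list --explicit` writes a text file with an `@EXPLICIT` marker followed by one full package URL per line (optionally with a `#<md5>` suffix). The following is a minimal sketch, not part of the original thread, of reading such a file in Python; the function name and the sample package names are hypothetical placeholders.

```python
def parse_explicit_spec(text):
    """Return the package URLs listed after the @EXPLICIT marker."""
    urls = []
    seen_marker = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the leading "#" comment header.
        if not line or line.startswith("#"):
            continue
        if line == "@EXPLICIT":
            seen_marker = True
            continue
        if seen_marker:
            # Drop an optional "#<md5>" suffix after the URL.
            urls.append(line.split("#", 1)[0])
    return urls


# Illustrative file content; the package names are made up.
SAMPLE = """\
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/examplepkg-1.0-h0_0.tar.bz2
https://conda.anaconda.org/bioconda/linux-64/othertool-2.3-py37_1.tar.bz2#0123456789abcdef0123456789abcdef
"""

urls = parse_explicit_spec(SAMPLE)
```

Each URL in such a list encodes the channel and, where present, the label, which is why relabeling a package can invalidate the list.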

@fgp fgp changed the title How to create an long-term reproducible environment YAML file? How to create a long-term reproducible environment YAML file? May 17, 2019
@ocefpaf
Member

ocefpaf commented May 17, 2019

So, given that one of the purposes of conda is precisely my use-case,

No tool can promise exact and complete reproducibility over time b/c the packages must be moved around for many reasons: keep pinning consistent, actual fixes in the packaging, new versions that are incompatible with the stack, storage, etc.

However, we do try our best to never remove a package and it should be always available for download via the full URL but, as you already know, that can be unstable.

The only way to get what you want is to freeze the environment with a docker image or conda-pack. The environment.yaml can only promise "some short term" reproducibility. It is meant to evolve the package listing over time.

@ocefpaf ocefpaf closed this as completed May 17, 2019
@fgp
Author

fgp commented May 17, 2019

OK, this is an extremely disappointing answer. At least for Bioconda, exact reproducibility is an important selling point. Hearing that conda-forge is not actually interested in such a use-case honestly is very surprising to say the least!

And of course conda could promise exact and complete reproducibility, claiming that such a thing is impossible is ludicrous! It's simply due to some subtle design issues that it doesn't do so today!

There is absolutely no reason why e.g. Anaconda cloud couldn't provide additional stable URLs for packages, and why e.g. conda export --explicit couldn't list those stable URLs instead of whatever URL was originally used to fetch the package. In none of the cases where I had issues with reproducibility, the packages were actually removed, so storage space seems to be a non-issue.

Or conda env create could simply search across all labels, which would probably fix 90% of the issues.

Or there are probably tons of other solutions.

@isuruf
Member

isuruf commented May 17, 2019

I guess we could provide all packages in a separate channel. @scopatz, @mariusvniekerk, can this be done with metachannel? We also want to avoid repodata patching for older packages to resolve correctly.

@isuruf isuruf reopened this May 17, 2019
@ocefpaf
Member

ocefpaf commented May 17, 2019

Hearing that conda-forge is not actually interested in such a use-case honestly is very surprising to say the least!

No one said conda-forge is not interested in it. It is just not possible to provide a stable environment by keeping all the packages ever built on main. Besides, like I said, exact reproducibility will always be impossible unless you freeze your env when you create it.

There is absolutely no reason why e.g. Anaconda cloud couldn't provide additional stable URLs for packages, and why e.g. conda export --explicit couldn't list those stable URLs instead of whatever URL was originally used to fetch the package. In none of the cases where I had issues with reproducibility, the packages were actually removed, so storage space seems to be a non-issue.

That exists already. Some changes happened during the move to CDN-based mirrors, but you can access the exact URL. Still, that will never be guaranteed to work over the years b/c services change, technologies change, etc. Again, long-term reproducibility is only obtained by freezing the environment.

I guess we could provide all packages in a separate channel. @scopatz, @mariusvniekerk, can this be done with metachannel?

Why? Exact URLs are available and probably better than a channel where an env solution will be really slow and/or impossible.

We also want to avoid repodata patching for older packages to resolve correctly.

+100 on that!

@fgp
Author

fgp commented May 17, 2019

Hearing that conda-forge is not actually interested in such a use-case honestly is very surprising to say the least!

No one said conda-forge is not interested in it. It is just not possible to provide a stable environment by keeping all the packages ever built on main. Besides, like I said, exact reproducibility will always be impossible unless you freeze your env when you create it.

Let me rephrase that. Currently, the half-life of my environment definitions is something like 3 months. If it were 3 years, I'd already be completely happy. Linux distributions manage that easily, why shouldn't conda?

There is absolutely no reason why e.g. Anaconda cloud couldn't provide additional stable URLs for packages, and why e.g. conda export --explicit couldn't list those stable URLs instead of whatever URL was originally used to fetch the package. In none of the cases where I had issues with reproducibility, the packages were actually removed, so storage space seems to be a non-issue.

That exists already. Some changes happened during the move to CDN-based mirrors, but you can access the exact URL. Still, that will never be guaranteed to work over the years b/c services change, technologies change, etc. Again, long-term reproducibility is only obtained by freezing the environment.

If it exists, how do I use it? I agree 100% that the most sensible way to reproduce an environment would be to re-download the exact packages, and re-install them without even checking dependencies. But I never managed to create a URL list from an environment that is sufficiently stable. As I said, usually some package label would change (usually be set to 'broken', even though the package works perfectly in my env), which changes the URL.

@ocefpaf
Member

ocefpaf commented May 17, 2019

Let me rephrase that. Currently, the half-life of my environment definitions is something like 3 months. If it were 3 years, I'd already be completely happy. Linux distributions manage that easily, why shouldn't conda?

First let us disambiguate a few things: conda is the package manager that gives you access to the channels defaults (curated by AnacondaInc) and conda-forge (curated by the community) and many others. Some Linux distributions aim for long term stability at the sacrifice of latest packages and fixes, that is why Linux distributions have third party repositories/unstable repos, so people can have latest packages that are not available by default. BTW, conda was created partially to solve that problem too, otherwise we would all be using our system package managers. With that said, defaults will be a more "conservative" alternative than conda-forge but both will move faster than Linux distributions.

If it exists, how do I use it?

I never used that myself b/c I prefer, and recommend, the "freeze your env" option. @msarahan can probably comment on ways to use the exact URLs.

I agree 100% that the most sensible way to reproduce an environment would be to re-download the exact packages, and re-install them without even checking dependencies.

What I meant by freezing your env is not to re-download the packages based on a list. I mean a docker image or conda-pack which will save the packages at creation time and freeze them, protecting you from repodata patches, services changes, packages being moved around, etc.
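As an illustration of the conda-pack route described above, here is a minimal sketch (not from the thread) that wraps the documented `conda pack -n <env> -o <archive>` command line. The helper names are hypothetical, and it assumes conda-pack is installed in the environment that runs the script.

```python
import subprocess


def conda_pack_command(env_name, output):
    """Build the argument list for archiving a named conda environment."""
    return ["conda", "pack", "-n", env_name, "-o", output]


def freeze_environment(env_name, output):
    """Run conda-pack; raises CalledProcessError if the command fails."""
    subprocess.run(conda_pack_command(env_name, output), check=True)
    return output


# Example: the command that would archive an environment named "ipoolseq".
cmd = conda_pack_command("ipoolseq", "ipoolseq.tar.gz")
```

The resulting archive contains the installed packages themselves, so it does not depend on any package URL staying valid. On the target machine the archive is unpacked and `conda-unpack` is run inside it to fix up path prefixes.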

@ocefpaf
Member

ocefpaf commented May 17, 2019

@isuruf I'm closing this b/c we (conda-forge and defaults) are moving to epochs releases that will remove old packages from main as we move along. This was discussed at length in our meetings and it was based on offering a stable environment and fast solution vs. a bloated main.

Also, the packages are never removed, the exact URL option exists (just need docs on how to use it but that is a question for AnacondaInc and not conda-forge), and there are better solutions for long term environment stability.

I guess that the TL;DR is that there is no action for us to do here. (Maybe add a reference to docs on how to use the exact reference in our docs?)

@ocefpaf ocefpaf closed this as completed May 17, 2019
@fgp
Author

fgp commented May 17, 2019

Also, the packages are never removed, the exact URL option exists (just need docs on how to use it but that is a question for AnacondaInc and not conda-forge), and there are better solutions for long term environment stability.

Sorry but this conversation is getting a bit frustrating.

First, I actually tried using conda export --explicit before even opening this issue (months ago, actually), and the URLs I got were never stable - they always depended on the package's label, which as I discovered WILL change, it's only a matter of time. You keep mentioning stable URLs - so could you please point me to documentation on how to ACTUALLY generate a list of stable URLs for a set of packages? Or to source code, I don't care, just SOMEWHERE...

Second, I believe I'm raising a genuine concern here. Maybe one that conda-forge doesn't want to address or cannot address. But until there is consensus here that this is the case, it's not particularly helpful for you to keep closing this issue.

@ocefpaf
Member

ocefpaf commented May 17, 2019

Sorry but this conversation is getting a bit frustrating.

It is always frustrating when technology limitations prevent us from having what we want. You are missing the point though: having all packages in a single channel leads to unsolvable envs and/or really slow solve times.

You keep mentioning stable URLs - so could you please point me to documentation on how to ACTUALLY generate a list of stable URLs for a set of packages? Or to source code, I don't care, just SOMEWHERE...

Like I said above, @msarahan probably knows more, but I do recommend that you open an issue or ask upstream with AnacondaInc, as that is an anaconda.org question.

Second, I believe I'm raising a genuine concern here.

Indeed. We are limited to solutions like freeze the environment at creation time.

Maybe one that conda-forge doesn't want to address or cannot address.

Again, not true. I explained at length our efforts in keeping the artifacts, never removing them, and avoiding repodata patching.

But until there is consensus here that this is the case, it's not particularly helpful for you to keep closing this issue.

There are no actions we can take, that is why I'm closing the issue. What do you want us to do? Move all packages to main and grind conda-forge to a halt? Breaking it for everyone? Creating a new label with everything is possible but won't solve your problem b/c it will be unsolvable! Please try to understand the technical details here instead of trying to force your will into something that cannot be solved.

PS: check the conda docs for the exact URL usage, if there is none open an issue with AnacondaInc. That is the path forward.

@fgp
Author

fgp commented May 17, 2019

Sorry but this conversation is getting a bit frustrating.

It is always frustrating when technology limitations prevent us from having what we want.

Please don't be patronizing, thanks.

You are missing the point though: having all packages in a single channel leads to unsolvable envs and/or really slow solve times.

I never suggested having all packages in a single channel.

You keep mentioning stable URLs - so could you please point me to documentation on how to ACTUALLY generate a list of stable URLs for a set of packages? Or to source code, I don't care, just SOMEWHERE...

Like I said above, @msarahan probably knows more, but I do recommend that you open an issue or ask upstream with AnacondaInc, as that is an anaconda.org question.

I actually did. I never heard back. Point taken, though - I'm on my own here.

Second, I believe I'm raising a genuine concern here.

Indeed. We are limited to solutions like freeze the environment at creation time.

With freezing you mean "conda-pack", right? I'll probably do that in the long run (once I figure out where to host the env), for lack of other options. Ironically though, the conda-pack homepage has the following to say on the issue (under Use Cases):

"Archiving an environment in a functioning state. Note that a more sustainable way to do this is to specify your environment as a environment.yml, and recreate the environment when needed."

But until there is consensus here that this is the case, it's not particularly helpful for you to keep closing this issue.

There are no actions we can take, that is why I'm closing the issue. What do you want us to do? Move all packages to main and grind conda-forge to a halt? Breaking it for everyone? Creating a new label with everything is possible but won't solve your problem b/c it will be unsolvable! Please try to understand the technical details here instead of trying to force your will into something that cannot be solved.

No, what I'd actually like is for conda export --explicit to work the way it should, i.e. produce URLs that work until packages are deleted for good due to space constraints. That might be an Anaconda issue, not a conda-forge issue, but I believe that if conda-forge were to propose a solution, the Anaconda people would probably listen, whereas it seems my trying to convince them that their URLs are not stable enough fell on deaf ears.

PS: check the conda docs for the exact URL usage, if there is none open an issue with AnacondaInc. That is the path forward.

Right. Only that I did that, and had I found a solution in the docs (or any documentation on the URLs at all, that is), I wouldn't have raised the issue here in the first place. As I mentioned multiple times. Still thanks though, I guess. At least I know now that generating --explicit package lists was in principle the right thing to do, so the fact that it doesn't actually work is a bug somewhere. And I learned that here is not the right place to have any sort of meaningful discussion about this issue, so I'll refrain from bringing it up again.

@ocefpaf
Member

ocefpaf commented May 17, 2019

Please don't be patronizing, thanks.

I'm not. I am as frustrated as you b/c reproducibility is a big concern of mine. I believe your frustration is preventing you from understanding the technical issues I'm describing.

I never suggested having all packages in a single channel.

That is the only alternative to the exact URLs and that was what @isuruf suggested. Please read all the messages.

I actually did. I never heard back. Point taken, though - I'm on my own here.

It may take a while for people to respond. Please be patient. You are not on your own. I just asked about the exact URL usage in many forums and I'm waiting on answers. Meanwhile I'm digging into the docs and looking into it.

No, what I'd actually like is for conda export --explicit to work the way it should, i.e. produce URLs that work until packages are deleted for good due to space constraints.

Again, feature request to conda, please ask AnacondaInc and not conda-forge.

Right. Only that I did that, and had I found a solution in the docs (or any documentation on the URLs at all, that is),

You wrote that you found one, but it seems that you did not? I'm confused. Can you cross-reference the issues you opened and post the links you found?

And I learned that here is not the right place to have any sort of meaningful discussion about this issue, so I'll refrain from bringing it up again.

Right. Not about conda feature requests. Anything about conda-forge is fair game though.

@ocefpaf
Member

ocefpaf commented May 17, 2019

Also, please take a look at conda-pack. It does what you want and it is safer, faster, and more reliable.

@fgp
Author

fgp commented May 19, 2019

No, what I'd actually like is for conda export --explicit to work the way it should, i.e. produce URLs that work until packages are deleted for good due to space constraints.

Again, feature request to conda, please ask AnacondaInc and not conda-forge.

I did that, as I already pointed out. I just checked, it was assigned case number Case 10717. Never received an answer though (other than "i forwarded your request to the appropriate team" or something like that), which is why I raised the issue here, believing that I might find people here willing to discuss options. Seems I picked the wrong forum, though. Sorry for that.

Right. Only that I did that, and had I found a solution in the docs (or any documentation on the URLs at all, that is),

You wrote that you found one, but it seems that you did not? I'm confused. Can you cross-reference the issues you opened and post the links you found?

Um, you truncated my sentence to the point of mangling the meaning. What I wrote was "Right. Only that I did that, and had I found a solution in the docs (or any documentation on the URLs at all, that is), I wouldn't have raised the issue here in the first place." In plain English: NO, I didn't find a solution. And because I DIDN'T, I raised the issue here.

@fgp
Author

fgp commented May 19, 2019

Also, please take a look at conda-pack. It does what you want and it is safer, faster, and more reliable.

As I said, I will do that, for lack of a better (short-term) option. But I can't resist to once more quote conda-pack's own definition of its use-cases to you:

https://conda.github.io/conda-pack/ (Use Cases):

  • Archiving an environment in a functioning state. Note that a more sustainable way to do this is to specify your environment as a environment.yml, and recreate the environment when needed.

This stands in direct conflict to your claim that conda and conda-forge are only meant to, and will only be ever able to, provide what you call short-term reproducibility.

@xylar
Contributor

xylar commented May 19, 2019

@fgp, it's clear you've found the process of getting to reproducible packages quite frustrating. I can understand that. However, I think the tone you've taken in this discussion is really counterproductive. You're trying to enlist help from a community that could, indeed, put its weight behind what you're trying to accomplish. By responding in an angry and antagonistic way, you are more likely to drive those who could help you away than enlist their help.

I realize that you felt shut down by @ocefpaf initially closing the issue quickly as outside the scope of the community. He felt that this wasn't the right place to raise this issue but you were frustrated because you'd tried the right venue already and got no response. I think that meant the whole discussion got off on the wrong foot. I'm hoping it can return to a more civil tone and/or be moved to a different venue.

As stated several times above, that's probably conda, not conda-forge but it would be appropriate to point to this discussion and perhaps try to bring conda-forge team members into the discussion as they have the time and interest to engage.

@bgruening
Contributor

@fgp the bio* community has written up this paper to solve the underlying problem in a different way. We do freeze long-term reproducible pipelines in containers. You can read more about this here https://www.biorxiv.org/content/10.1101/200683v2 and if you like check out biocontainers.

If you create a PR similar to this one: https://github.com/BioContainers/multi-package-containers/pull/707/files
we will create a minimal container for you for Docker, rkt and Singularity.

@ocefpaf
Member

ocefpaf commented May 19, 2019

@fgp the bio* community has written up this paper to solve the underlying problem in a different way. We do freeze long-term reproducible pipelines in containers. You can read more about this here https://www.biorxiv.org/content/10.1101/200683v2 and if you like check out biocontainers.

The bioconda containerization is the state of the art for this IMO. However, as I mentioned above, I'm OK with the conda-pack solution b/c it is more practical in the short term.

@fgp
Author

fgp commented May 19, 2019

@fgp, it's clear you've found the process of getting to reproducible packages quite frustrating. I can understand that. However, I think the tone you've taken in this discussion is really counterproductive. You're trying to enlist help from a community that could, indeed, put its weight behind what you're trying to accomplish. By responding in an angry and antagonistic way, you are more likely to drive those who could help you away than enlist their help.

Let me say that it was never my intention to come across as antagonistic, and I'm truly sorry if I did! I am indeed frustrated by my experience with constantly breaking conda environments, but I do of course respect the work people do here and know that everybody generally tries and does their best. In fact, I believe my frustration is rooted in my enthusiasm for the conda universe, and in particular conda-forge and bioconda, projects to which I have contributed in the past, and which I consider to be the single best solution for making bioinformatical software available to a non-programmer audience. Seeing conda missing the mark by so little (from my perspective) towards the goal of solving all my software installation problems is thus painful for me.

I realize that you felt shut down by @ocefpaf initially closing the issue quickly as outside the scope of the community. He felt that this wasn't the right place to raise this issue but you were frustrated because you'd tried the right venue already and got no response. I think that meant the whole discussion got off on the wrong foot. I'm hoping it can return to a more civil tone and/or be moved to a different venue.

Re-reading the thread I agree with your reading of the situation. I'll try to raise this issue over at the conda project, although I'm not sure that is the right place either -- solving the problem of URL stability probably requires cooperation from Anaconda as well, but we'll see.

Unfortunately, any further action on my part will have to wait now anyway -- it became clear to me that the only realistic short-term solution is either conda-pack or containerization (thank you @bgruening for the pointer! I wasn't aware of the paper, and I'm sure it'll be an interesting read!), so I'll have to work on implementing that for my pipeline.

@msarahan
Member

@fgp

I did that, as I already pointed out. I just checked, it was assigned case number Case 10717. Never received an answer though (other than "i forwarded your request to the appropriate team" or something like that), which is why I raised the issue here, believing that I might find people here willing to discuss options. Seems I picked the wrong forum, though. Sorry for that.

It sounds like you tried to email support@anaconda.com. That's customer support, not support for Anaconda projects in general. I'm not surprised that you didn't get anywhere. This is a better place to ask questions: https://github.com/ContinuumIO/anaconda-issues but we're still very limited on time.

As @ocefpaf mentions, the technical reasons for the mess have a long and sordid history.

repo.continuum.io / repo.anaconda.com have always been served from an s3 bucket, which looks to the world like a static URL. URLs for these packages don't change, unless a package gets yanked. That happens very rarely - maybe once a year on average. Problems with metadata are what cause packages to be removed, but we usually "hotfix" the repodata instead of removing the package. There is no notion of labels on defaults.

Anything on anaconda.org is served by a much more complicated web app. That web app is backed by mongodb, where the packages are stored. Unfortunately, a single, static URL for every object was not designed into the system. I think PyPI among others have shown that it's a good idea, and I hope we can work it into future systems. The current anaconda.org is in maintenance mode, and no feature like this can be added right now.

Now, to make things more complicated: conda-forge and bioconda are mirrored to a CDN. That means that things could perhaps be more stable than the web app, but we're mirroring whatever these channels choose to do. If people move packages to other labels, there's no unique URL for a package. The huge advantage that the CDN gives conda-forge and bioconda is the ability to hotfix repodata. Now they can fix broken metadata, rather than just get rid of it by removing packages. However, this hotfixing/CDN is currently only available for the main label, not any other labels.
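To make the label problem concrete, the sketch below (an editorial illustration, not part of the thread) splits a package URL into channel, label, subdir and filename. It assumes the commonly seen layout `https://conda.anaconda.org/<channel>/[label/<label>/]<subdir>/<filename>`. The function name and the package filename are hypothetical.

```python
from urllib.parse import urlparse


def split_package_url(url):
    """Split a channel URL into channel, label, subdir and filename.

    Assumes the path is <channel>/<subdir>/<file> for the main label
    and <channel>/label/<label>/<subdir>/<file> for any other label.
    """
    parts = urlparse(url).path.strip("/").split("/")
    channel = parts[0]
    if len(parts) >= 5 and parts[1] == "label":
        label = parts[2]
    else:
        label = "main"
    return {
        "channel": channel,
        "label": label,
        "subdir": parts[-2],
        "filename": parts[-1],
    }


# The same (made-up) package file before and after being relabeled.
main_url = "https://conda.anaconda.org/conda-forge/linux-64/examplepkg-1.0-h0_0.tar.bz2"
broken_url = "https://conda.anaconda.org/conda-forge/label/broken/linux-64/examplepkg-1.0-h0_0.tar.bz2"

before = split_package_url(main_url)
after = split_package_url(broken_url)
```

Under this assumed layout the filename is identical in both cases and only the label segment differs, which matches the observation in the thread that the package still exists but the recorded URL no longer resolves.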

As for epochs: this is pretty close to working. Conda-build 3.18.0 introduced a new current_repodata.json file that is strictly the newest packages. This will allow people to keep packages in the same place, and have an ever-growing repodata.json that on its own would be horribly slow or broken for using in solves, while keeping long-term stability. As long as you used exact package specs, solves with the full repodata are still fast. There's more work to do to have epochs - time slices, but the idea is the same.
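The "exact package specs" mentioned above pin name, version and build string together. As a hedged sketch (not from the thread), the helper below derives such a `name=version=build` spec from a package filename, relying on the usual `<name>-<version>-<build>` filename convention. The function name and sample filenames are hypothetical.

```python
def exact_spec_from_filename(filename):
    """Turn '<name>-<version>-<build>.tar.bz2' into 'name=version=build'."""
    for ext in (".tar.bz2", ".conda"):
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"not a conda package filename: {filename!r}")
    # Package names may contain hyphens, so split from the right.
    name, version, build = stem.rsplit("-", 2)
    return f"{name}={version}={build}"


# Made-up filenames for illustration.
spec_a = exact_spec_from_filename("examplepkg-1.0-h0_0.tar.bz2")
spec_b = exact_spec_from_filename("example-pkg-2.1.3-py37_1.tar.bz2")
```

Specs of this form can be listed under `dependencies:` in an environment YAML file, so the solver has only one candidate per package to consider.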

Conda 4.7 is the other part of this implementation, of course. It's on conda-canary. There are still quite a few kinks to work out, but it should be pretty good by the end of this week. The concept of restricted views on metadata (pioneered by conda-metachannel) is what will help us here. Since there has been no way to view only a subset until now, and since the main label needs to always be a self-consistent set, conda-forge has been stuck with re-labeling or removing files. Hopefully we can do better soon.
