How to create a long-term reproducible environment YAML file? #787
No tool can promise exact and complete reproducibility over time b/c the packages must be moved around for many reasons: keeping pinning consistent, actual fixes in the packaging, new versions that are incompatible with the stack, storage, etc. However, we do try our best to never remove a package, and it should always be available for download via the full URL but, as you already know, that can be unstable. The only way to get what you want is to freeze the environment with a docker image or conda-pack. The |
|
OK, this is an extremely disappointing answer. At least for Bioconda, exact reproducibility is an important selling point. Hearing that conda-forge is not actually interested in such a use-case honestly is very surprising to say the least! And of course conda could promise exact and complete reproducibility, claiming that such a thing is impossible is ludicrous! It's simply due to some subtle design issues that it doesn't do so today! There is absolutely no reason why e.g. Anaconda cloud couldn't provide additional stable URLs for packages, and why e.g. Or Or there are probably tons of other solutions. |
|
I guess we could provide all packages in a separate channel. @scopatz, @mariusvniekerk, can this be done with metachannel? We also want to avoid repodata patching for older packages to resolve correctly. |
No one said conda-forge is not interested in it. It is just not possible to provide a stable environment by keeping all the packages ever built on
That exists already. Some changes happened during the move to CDN-based mirrors, but you can access the exact URL. Still, that will never be guaranteed to work over the years b/c services change, technologies change, etc. Again, long-term reproducibility is only obtained by freezing the environment.
Why? Exact URLs are available and probably better than a channel where an env solution will be really slow and/or impossible.
+100 on that! |
Let me rephrase that. Currently, the half-life of my environment definitions is something like 3 months. If it were 3 years, I'd already be completely happy. Linux distributions manage that easily, why shouldn't conda?
If it exists, how do I use it? I agree 100% that the most sensible way to reproduce an environment would be to re-download the exact packages, and re-install them without even checking dependencies. But I never managed to create a URL list from an environment that is sufficiently stable. As I said, usually some package label would change (usually being set to 'broken', even though the package works perfectly in my env), which changes the URL. |
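To make the label problem described above concrete, here is a minimal Python sketch. It assumes an explicit spec file of the kind written by `conda list --explicit` (an `@EXPLICIT` marker followed by one package URL per line), and it assumes the anaconda.org layout in which a non-main label appears as `/<channel>/label/<label>/<subdir>/<file>`. The helper names `parse_explicit` and `label_variants` are hypothetical, and the set of labels to try is only an example. It shows why a relabelled package yields a different URL, and how one might enumerate the candidate URLs; it does not download anything.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_explicit(text):
    """Return the package URLs listed in the text of an explicit spec file."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and the "@EXPLICIT" marker line.
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        urls.append(line)
    return urls


def label_variants(url, labels=("main", "broken")):
    """Yield the same package URL as it would look under each label.

    Assumption: "main" is served without a /label/ segment, any other
    label as /<channel>/label/<label>/<subdir>/<file>.
    """
    parts = urlsplit(url)
    segs = parts.path.strip("/").split("/")
    channel = segs[0]
    # Drop an existing "label/<name>" pair so we start from a neutral path.
    if len(segs) >= 4 and segs[1] == "label":
        rest = segs[3:]
    else:
        rest = segs[1:]
    for label in labels:
        prefix = [channel] if label == "main" else [channel, "label", label]
        path = "/" + "/".join(prefix + rest)
        # Query and fragment (e.g. an md5 suffix) are carried over unchanged.
        yield urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

A tool built on this could try each candidate in turn when the originally recorded URL stops resolving. Whether a given label actually holds the file is something only the server can answer.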
First let us disambiguate a few things:
I never used that myself b/c I prefer, and recommend, the "freeze your env" option. @msarahan can probably comment on ways to use the exact URLs.
What I meant by freezing your env is not to re-download the packages based on a list. I mean a docker image or conda-pack, which will save the packages at creation time and freeze them, protecting you from repodata patches, service changes, packages being moved around, etc. |
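As a rough illustration of "save the packages at creation time", the Python sketch below records a SHA-256 checksum for every package archive in a local directory and later checks that the frozen copy is still complete and unmodified. This is not what conda-pack or a docker image does internally; it is a hypothetical complementary helper. The function names and the assumption that archives end in `.tar.bz2` or `.conda` are illustrative.

```python
import hashlib
from pathlib import Path

# Assumed archive suffixes for conda packages kept in the frozen directory.
ARCHIVE_SUFFIXES = (".tar.bz2", ".conda")


def _sha256(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(pkg_dir):
    """Map each package archive in pkg_dir to its SHA-256 digest."""
    manifest = {}
    for path in sorted(Path(pkg_dir).iterdir()):
        if path.is_file() and path.name.endswith(ARCHIVE_SUFFIXES):
            manifest[path.name] = _sha256(path)
    return manifest


def verify_manifest(pkg_dir, manifest):
    """Return the names of archives that are missing or whose digest changed."""
    problems = []
    for name, digest in manifest.items():
        path = Path(pkg_dir) / name
        if not path.is_file() or _sha256(path) != digest:
            problems.append(name)
    return problems
```

Stored next to the archives (for example as JSON), such a manifest lets a later reader confirm that the frozen files are the ones the environment was originally built from, independent of any upstream relabelling.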
|
@isuruf I'm closing this b/c we (conda-forge and defaults) are moving to epochs releases that will remove old packages from Also, the packages are never removed, the exact URL option exists (just need docs on how to use it but that is a question for AnacondaInc and not I guess that the TL;DR is that there is no action for us to do here. (Maybe add a reference to docs on how to use the exact reference in our docs?) |
Sorry but this conversation is getting a bit frustrating. First, I actually tried using Second, I believe I'm raising a genuine concern here. Maybe one that conda-forge doesn't want to address or cannot address. But until there is consensus here that this is the case, it's not particularly helpful for you to keep closing this issue. |
It is always frustrating when technology limitations prevent us from having what we want. You are missing the point though: having all packages in a single channel leads to unsolvable envs and/or really slow solve times.
Like I said above, @msarahan probably knows more, but I do recommend that you open an issue or ask upstream with AnacondaInc, as that is an anaconda.org question.
Indeed. We are limited to solutions like freeze the environment at creation time.
Again, not true. I explained at length our efforts in keeping the artifacts, never removing them, and avoiding repodata patching.
There are no actions we can take, that is why I'm closing the issue. What do you want us to do? Move all packages to PS: check the conda docs for the exact URL usage, if there is none open an issue with AnacondaInc. That is the path forward. |
Please don't be patronizing, thanks.
I never suggested having all packages in a single channel.
I actually did. I never heard back. Point taken, though - I'm on my own here.
With freezing you mean "conda-pack", right? I'll probably do that in the long run (once I figure out where to host the env), for lack of other options. Ironically though, the conda-pack homepage has the following to say on the issue (under Use Cases): "Archiving an environment in a functioning state. Note that a more sustainable way to do this is to specify your environment as a environment.yml, and recreate the environment when needed."
No, what I'd actually like is for
Right. Only that I did that, and had I found a solution in the docs (or any documentation on the URLs at all, that is), I wouldn't have raised the issue here in the first place. As I mentioned multiple times. Still thanks though, I guess. At least I know now that generating |
I'm not. I am as frustrated as you b/c reproducibility is a big concern of mine. I believe your frustration is preventing you from understanding the technical issues I'm describing.
That is the only alternative to the exact URLs and that was what @isuruf suggested. Please read all the messages.
It may take a while for people to respond. Please be patient. You are not on your own. I just asked about the exact URL usage in many forums and I'm waiting on answers. Meanwhile I'm digging into the docs and looking into it.
Again, feature request to
You wrote that you found but it seems that you did not? I'm confused. Can you cross-reference the issues you opened and post the links you found?
Right. Not about |
|
Also, please take a look at |
I did that, as I already pointed out. I just checked, it was assigned case number Case 10717. Never received an answer though (other than "i forwarded your request to the appropriate team" or something like that), which is why I raised the issue here, believing that I might find people here willing to discuss options. Seems I picked the wrong forum, though. Sorry for that.
Um, you truncated my sentence to the point of mangling the meaning. What I wrote was "Right. Only that I did that, and had I found a solution in the docs (or any documentation on the URLs at all, that is), I wouldn't have raised the issue here in the first place." In plain English: NO, I didn't find a solution. And because I DIDN'T, I raised the issue here. |
As I said, I will do that, for lack of a better (short-term) option. But I can't resist to once more quote conda-pack's own definition of its use-cases to you:
This stands in direct conflict with your claim that conda and conda-forge are only meant to, and will only ever be able to, provide what you call short-term reproducibility. |
|
@fgp, it's clear you've found the process of getting to reproducible packages quite frustrating. I can understand that. However, I think the tone you've taken in this discussion is really counterproductive. You're trying to enlist help from a community that could, indeed, put its weight behind what you're trying to accomplish. By responding in an angry and antagonistic way, you are more likely to drive those who could help you away than enlist their help. I realize that you felt shut down by @ocefpaf initially closing the issue quickly as outside the scope of the community. He felt that this wasn't the right place to raise this issue, but you were frustrated because you'd tried the right venue already and got no response. I think that meant the whole discussion got off on the wrong foot. I'm hoping it can return to a more civil tone and/or be moved to a different venue. As stated several times above, that's probably conda, not conda-forge, but it would be appropriate to point to this discussion and perhaps try to bring conda-forge team members into the discussion as they have the time and interest to engage. |
|
@fgp the bio* community has written up this paper to solve the underlying problem in a different way. We do freeze long-term reproducible pipelines in containers. You can read more about this here https://www.biorxiv.org/content/10.1101/200683v2 and if you like check out biocontainers. If you can create a PR similar to this one: https://github.com/BioContainers/multi-package-containers/pull/707/files |
The bioconda containerization is the state of the art for this IMO. However, as I mentioned above, I'm OK with the conda-pack solution b/c it is more practical in the short term. |
Let me say that it was never my intention to come across as antagonistic, and I'm truly sorry if I did! I am indeed frustrated by my experience with constantly breaking conda environments, but I do of course respect the work people do here and know that everybody generally tries and does their best. In fact, I believe my frustration is rooted in my enthusiasm for the conda universe, and in particular conda-forge and bioconda, projects to which I have contributed in the past, and which I consider to be the single best solution for making bioinformatical software available to a non-programmer audience. Seeing conda missing the mark by so little (from my perspective) towards the goal of solving all my software installation problems is thus painful for me.
Re-reading the thread I agree with your reading of the situation. I'll try to raise this issue over at the conda project, although I'm not sure that is the right place either -- solving the problem of URL stability probably requires cooperation from Anaconda as well, but we'll see. Unfortunately, any further action on my part will have to wait now anyway -- it became clear to me that the only realistic short-term solution is either conda-pack or containerization (thank you @bgruening for the pointer! I wasn't aware of the paper, and I'm sure it'll be an interesting read!), so I'll have to work on implementing that for my pipeline. |
It sounds like you tried to email support@anaconda.com. That's customer support, not support for Anaconda projects in general. I'm not surprised that you didn't get anywhere. This is a better place to ask questions: https://github.com/ContinuumIO/anaconda-issues but we're still very limited on time.
As @ocefpaf mentions, the technical reasons for the mess have a long and sordid history. repo.continuum.io / repo.anaconda.com have always been served from an s3 bucket, which looks to the world like a static URL. URLs for these packages don't change, unless a package gets yanked. That happens very rarely - maybe once a year on average. Problems with metadata are what cause packages to be removed, but we usually "hotfix" the repodata instead of removing the package. There is no notion of labels on defaults.
Anything on anaconda.org is served by a much more complicated web app. That web app is backed by mongodb, where the packages are stored. Unfortunately, a single, static URL for every object was not designed into the system. I think PyPI among others have shown that it's a good idea, and I hope we can work it into future systems. The current anaconda.org is in maintenance mode, and no feature like this can be added right now.
Now, to make things more complicated: conda-forge and bioconda are mirrored to a CDN. That means that things could perhaps be more stable than the web app, but we're mirroring whatever these channels choose to do. If people move packages to other labels, there's no unique URL for a package. The huge advantage that the CDN gives conda-forge and bioconda is the ability to hotfix repodata. Now they can fix broken metadata, rather than just get rid of it by removing packages. However, this hotfixing/CDN is currently only available for the main label, not any other labels.
As for epochs: this is pretty close to working. Conda-build 3.18.0 introduced a new current_repodata.json file that is strictly the newest packages. This will allow people to keep packages in the same place, and have an ever-growing repodata.json that on its own would be horribly slow or broken for using in solves, while keeping long-term stability. As long as you use exact package specs, solves with the full repodata are still fast. There's more work to do to have epochs - time slices, but the idea is the same. Conda 4.7 is the other part of this implementation, of course. It's on conda-canary. There are still quite a few kinks to work out, but it should be pretty good by the end of this week.
The concept of restricted views on metadata (pioneered by conda-metachannel) is what will help us here. Since there has been no way to view only a subset until now, and since the main label needs to always be a self-consistent set, conda-forge has been stuck with re-labeling or removing files. Hopefully we can do better soon. |
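The "time slice" idea above can be sketched in a few lines of Python. The sketch assumes a repodata-like dictionary with a top-level `"packages"` mapping from filename to a record that may carry a `"timestamp"` field in milliseconds. The function name `repodata_as_of` and the handling of records without a timestamp are assumptions made for illustration. This is not how conda, conda-build, or conda-metachannel implement their views, only a toy version of a restricted view on metadata.

```python
def repodata_as_of(repodata, cutoff_ms):
    """Return a restricted copy of repodata holding only packages whose
    timestamp is at or before cutoff_ms.

    Records without a "timestamp" field are treated as timestamp 0, i.e.
    as old packages that are always kept (an assumption of this sketch).
    """
    kept = {
        filename: record
        for filename, record in repodata.get("packages", {}).items()
        if record.get("timestamp", 0) <= cutoff_ms
    }
    # Preserve the "info" block and swap in the filtered package map.
    return {"info": repodata.get("info", {}), "packages": kept}
```

A solver handed such a view would only see the packages that existed at the chosen point in time, while the full, ever-growing index stays in place for exact package specs.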
I'm shipping a data-analysis pipeline that we developed (https://github.com/Cibiv/ipoolseq-pipeline) with an environment definition YAML file (ipoolseq.yaml), and would like that file to produce the same conda environment now, in a month, and in a year. However, every ipoolseq.yaml that I create breaks after a few months. Usually because someone removes the "main" label from some of the package versions referenced by my file, causing "conda env create -f ..." to fail.
I also tried using "explicit" package lists, which contain full package URLs, but those seem to be even less stable than the YAML files listing package versions. So, given that one of the purposes of conda is precisely my use-case, namely to provide reproducible software environments, how do I get it to actually DO that?