# unify environment specifications among conda, conda-env, and anaconda-project #7248

## Comments
Okay, so after trying out a bunch of these, working with pipenv, looking longingly at piptools, and trying to check whether things are correct even in conda envs, the features I'm really after are:
For the abstract requirements file format:
For the concrete requirements file format:
Nice to haves:
Other observations:
---
Having observed the development of the python ecosystem for some years, I think this is one of the most important features, and conda is - IMHO - the only project with the potential to bridge this feature gap to ecosystems such as php (composer), ruby (bundler), or java (gradle). While pip talks about "requirements" and conda about "environments", I'd propose clear terminology for the aspect of describing desired dependencies (requirements) versus the aspect of locking and sharing a fixed, solved dependency list (environment): maybe something like "environment.yml" and "environment.yml.lock" (in analogy to pyproject.toml and its .lock), or "requirements.yml" and "environment.yml" in analogy to existing files. Of course, the dependencies would still be conda dependencies, and conda could still wrap pip dependencies.

Backwards compatibility is very important as well. Supporting the current workflow and description format must be ensured to accommodate slower project cycles. Nevertheless, it would make sense to give a more expressive format enough thought to support expressiveness similar to existing description files (not just a flat list of dependencies). If such a workflow were designed the right way, I would see no necessity for files like pyproject.toml (see pypa/pip#4144 or https://github.com/sdispater/poetry), because conda - as a more general package manager - is far better suited for this job.
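To make that split concrete, here is a rough sketch of what the two files might contain. The file names and exact layout are purely illustrative, not a settled format:

```yaml
# requirements.yml -- abstract, hand-edited: what the project *wants*
# (hypothetical file name and layout)
channels:
  - conda-forge
dependencies:
  - python >=3.7
  - numpy
```

```yaml
# environment.yml.lock -- concrete, machine-written by the solver:
# exact versions and builds, suitable for recreating the env as-is
# (hypothetical file name and layout)
dependencies:
  - python=3.7.3=h0371630_0
  - numpy=1.16.4=py37h7e9f1db_0
```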
---
Just adding a related reference from poetry - a package publishing tool that uses pyproject.toml and manages dependencies.
---
Just adding more info about reproducibility of mixed conda+pip multichannel environments: conda cannot auto-export-reproduce a fixed-version conda+pip combination.
---
Hey, I'm working on an EU project called PaNOSC which (broadly) aims to improve the reproducibility of data analysis for European photon and neutron research facilities. One of the major requirements is to be able to identically reproduce a python environment in the 'long-term' future, on the order of multiple years.

I would like to use conda to manage these environments as many scientists are already familiar with it, but it lacks a few features required to make the environment setup truly reproducible for many of the use cases I have looked at (as the issues mentioned here show). So right now I'm leaning towards creating a tool which combines a deterministic python environment manager (e.g. poetry, pyenv, dotlock) with Spack for 'external' non-python dependency management. The work mentioned here would help make conda a viable alternative to my current plans, and I would much rather stick to just using conda instead of two glued-together solutions.

So I'm wondering: is there a rough timeline for this integration, or a timeline for the work on anaconda-project?
---
Near to mid-term: 3-6 months.
---
@RobertRosca Actually conda is almost capable of reproducing the environment - see this option. But there is a problem with packages changing their sha-256 hashes (without changing version and build); see this issue: #7882. If that isn't addressed (and I haven't seen any plans to fix it), then I would not recommend conda for reproducible environments.
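For reference, the explicit-spec round trip that comment is alluding to looks roughly like this (a minimal sketch; the file name is arbitrary):

```bash
# Write an explicit spec: exact package URLs, optionally with md5 hashes
conda list --explicit --md5 > spec-file.txt

# Recreate the environment from it later -- no solver run needed
conda create --name restored --file spec-file.txt
```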
---
@RobertRosca But thinking about it a bit more: conda packages can change hashes, and pip packages can be deleted from pypi by maintainers. I guess this almost never happens, but still, they can do it. So the only way to guarantee the identity of the envs is to keep your own mirror of the used packages. In this case conda is as good as pip (but better in other regards).
---
The new package format discussed in #7882 is done. There's support for it in conda-build as of conda/conda-build#3334, and #8265 will add it for conda. We really never want to change packages; we change repodata to correct issues. Whether we bake those changes back into the packages is something we currently do not do, but might consider - it determines whether one can recreate a valid index with only packages (without some extra repodata patch info). Package signing will also affect the sha256 of the outer container, but not the inner one. The new package format will require new tooling to ensure reproducibility, but it should improve things.
---
Sounds like reproducibility is almost here. Great! By the way: am I right that conda packages are never deleted from the main channels and conda-forge? Or can there be some cases where this happens?
---
In very rare cases, we may remove things. We reserve that for when a package is greatly breaking the system and we cannot otherwise remove it. Generally, we prefer to hotfix repodata so that the package is unsatisfiable; users may add a special channel to unblock that and make it satisfiable again.
---
Thanks for the help and quick answers =D

One of the problems for my use case is that not all dependencies will be installed from conda or pypi, as a lot of the packages which need to be tracked are in-house tools developed by scientists, so they're only available as a git repository. Understandably this isn't dealt with too well by conda, as the point of conda is to manage conda packages. For example, if you make a conda environment file, put in a github url (as a pip requirement), install it, then use … which is where the ability of poetry/pipenv to handle arbitrary VCS dependencies comes in very handy.

I put quotes around 'only' since that seems like it might not be done because it's out of the scope of the anaconda project, as I guess from the anaconda point of view it would be odd to support non-conda dependencies to the same level as conda ones, and that the suggested approach would be to use …

Is that right, or are there plans to expand …?
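For context, a pip VCS dependency in a conda environment file looks something like this (the package name and URL are made up for illustration):

```yaml
# environment.yml with a pip-installed git dependency (illustrative)
name: analysis
dependencies:
  - python=3.7
  - pip
  - pip:
      # hypothetical in-house tool, available only as a git repo
      - git+https://github.com/example-org/inhouse-tool.git@v1.2
```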
---
I started this conversation with @kalefranz but want to open the discussion here. I have developed an internal corporate package that helps automate conda environments and make them more portable. My plan is to open source my project, but ideally the features would be merged into conda. I want to describe the intention so that we can figure out the right way to implement them later.

**Automatic Environment Activation**
Essentially … Or maybe it prompts before activating. The typical usage is to have a git repo, and whenever I cd into the git repo (or a subdirectory) it either prompts me or automatically runs conda activate. Cd'ing to a new project directory conda-deactivates the old environment and conda-activates the new environment tied to that project.

**Automatic Environment Syncing**
Maintain an environment file (that is portable across platforms) that is tied to a project and automatically updates when the user runs a conda command (install, remove, etc.) for that environment. The goal is to have all collaborators on the same environment at the same time. If another user updates the file then conda prompts you to re-solve your local environment (similar to …).

**R package support**
R package interoperability support similar to pip interoperability support.
In terms of reproducibility of R packages there are good examples, such as mybinder with install.R, and checkpoint using MRAN. Another challenge I have seen for this kind of functionality is how to handle corporate mirrors and installing from GitHub and/or enterprise GitHub. Pip is generally flexible enough to handle this (though I would echo the request above for automatically tracking github urls), but we have custom-built solutions for R at my company.
---
> Automatic Environment Activation

The dependencies of a project are specified in the …, and later keep it up to date with …
---
I think the automatic environment activation would be a nice feature to have if it were something you opt in to, e.g. via a line in .condarc
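Something like this, purely hypothetical - no such key exists in .condarc today:

```yaml
# .condarc -- hypothetical opt-in key, invented for illustration
auto_activate_project_env: true
```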
---
Let's try to keep this issue focused on the actual environment spec, and not how activation works. That's a separate part of the process, and we definitely don't want to force a particular activation behavior on anyone, though I agree that having both automatic and manual activation are useful to have.
---
Just a comment to say that I find @dhirschfeld's idea of reusing the meta.yaml interesting (#7248 (comment)) - a single source of truth for the conda dependencies sounds appealing to me (be it meta.yaml or something else that meta.yaml can load). Although, I can imagine dev-only dependencies (linter, code formatter, ...) being useful in the environment file, but less so in the meta.yaml.
---
Yeah, my suggestion would work better if there were a better way to specify optional dependencies:
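For illustration only - a hypothetical syntax for optional/dev groups in an environment file (nothing like this exists in conda today):

```yaml
name: example
dependencies:
  - python=3.8
  - numpy
# hypothetical optional groups, in the spirit of pipenv's [dev-packages]
optional-dependencies:
  dev:
    - black
    - pytest
```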
---
Could conda benefit from plugging into the pyproject.toml efforts? I'm a little bit confused about the differences between frontends and backends, but couldn't the deps and dev-deps just be put inside the toml, obviating the need for both the meta.yaml and the environment.yaml?
---
meta.yaml and environment.yaml differ by a lot more than just deps vs dev-deps. meta.yaml supports a lot more templating and otherwise dynamic behavior at build time. I am also always confused by what constitutes frontend vs backend. conda-build and conda will probably have to support pyproject.toml somehow at some point, but I don't think pyproject.toml alone will be sufficient to replace conda's spec format(s). Conda is much more than just python, and a spec file for just python may not be enough for conda.
---
Looking forward to this streamlining. I would expect usage to be split mainly between …
---
We often find that we have two or three environment definitions for our projects:
It would be good if there were a nice way to either combine multiple environment specs …
---
Aren't these issues largely solved by conda-devenv? If you follow the discussions there, they don't want to become 'conda build', as conda build is full-featured and geared towards making packages. I'm interested in having one spec that covers package, build, run, dev, etc. But I think it should primarily be interpreted as an 'env spec' (not a package spec) so you don't have to face the complexity of conda build and packaging just to have some structure around dev envs.
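From memory, conda-devenv's include mechanism looks roughly like this; the file paths are illustrative, and the exact syntax should be checked against the conda-devenv docs:

```yaml
# environment.devenv.yml (rough sketch of conda-devenv's format)
name: myproject
includes:
  - {{ root }}/../base/environment.devenv.yml  # shared base spec
dependencies:
  - pytest  # dev-only additions for this project
```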
---
Well, no - I think we're mixing some issues up. With the successful rise of poetry, a major gap in the python package-development-build cycle is about to close, comparing it to java and php. And I agree, this is not the focus of conda, which acts more as a "cross-platform binary package manager". What I was missing back in 2018 was an easier python package build process and, separately from that, a more expressive versionable environment spec for python projects beyond libraries (e.g. builds for pypi). The first aspect will clearly be fixed by poetry and further python language developments. A question then is: in which cases are we going to see broader usage of conda? Really looking forward to some conda specification(s) which …
There are quite a few similarities to package build dependency managers …
---
I'd like to share the current technique I'm using to provide reproducible conda environments. I doubt it's cross-platform, but I've only got to support one.

I'm using GitLab CI on a repo which includes an environment.yml [1]. The CI job installs the environment and then spits out a spec file [2]. That spec file gets a name based on the commit hash and the time (because our channels are always updating), it gets uploaded to an http server, and we build a docker image which has it installed (again versioned with the hash and time). The spec file is quick to install because the solve is already done in CI; you're just downloading the files listed in the spec file.

We have to consider the date/time because the state of the channels determines how our environment will solve. For example, an environment which only includes …

The versioning allows you to keep environments loose, allowing updates. Sometimes you may have to manually pin packages due to failed tests downstream. Or, if you don't care about updates, just keep using the environment solved however long ago.

This approach requires the desired files to be present, and it assumes that the files won't change. I've read that might not always be strictly true with the conda mirrors. An improvement I'd like to make is a mirror of my channels (conda/conda-forge) on ZFS/btrfs/something so I can serve historical daily snapshots. Then I could point my environment at a specific daily channel snapshot and it should solve the same way no matter when it's solved.

If I'm making any assumptions or mistakes I'd love to know.

1: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
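A minimal sketch of what such a CI job might look like; the image name, environment name, and artifact handling here are assumptions, not the author's actual configuration:

```yaml
# .gitlab-ci.yml (illustrative sketch)
solve-environment:
  image: continuumio/miniconda3   # assumed base image
  script:
    # solve and install the loose environment.yml
    - conda env create -f environment.yml -n pinned
    # export the fully solved environment as an explicit spec file,
    # versioned with the commit hash and a timestamp
    - conda list -n pinned --explicit > "spec-${CI_COMMIT_SHORT_SHA}-$(date +%Y%m%dT%H%M%S).txt"
  artifacts:
    paths:
      - spec-*.txt
```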
---
I'm in general agreement with your well-thought-out reply (thanks). But I still see conda envs, at their core, as something that you 'do stuff' in. So the 'types' of projects you enumerated can all be contained in conda env(s). Now, we may talk about how much a conda env spec 'sees'/interprets specific build and package system specs. (I don't believe poetry should be in the business of venvs! And as for 2, people who think just filling Dockerfiles saves them from pkg mgt are dumb... uggh, you should not have to do system-level containment unless you are dealing with the system. But that's another discussion.)

I'm coming more from number 3, where package and versioning formality is not as important as pinning (external) dep versions. Again, you can hack together something based on conda-devenv that does a good chunk of what is being described in this thread; you can create a dep tree via includes and invoke whatever build/installation/setup process you want. For me, this was key to finding the balance between a (complicated) build dep manager and (unstructured) prototyping env(s); I need multiple envs that can exist simultaneously and can potentially communicate (Docker here would introduce unnecessary complications and overhead). As an analogy, it's sort of like Makefiles, but on the level of environments, where conda is doing the hard work of solving envs. What the 'directives' of the makefile are is what we're discussing here.

The system that accommodates 3 is the same one that accommodates 1 and 2. I just don't see why there should be a distinction; you create an env and develop towards your specific needs in one seamless process. Start with a simple environment.yml and evolve it into more structured environment ymls that could contain poetry elements. Same process, beginning with simple envs as the development 'module' instead of having to create a package.
---
We ran into the same problems with … Our solution is similar to yours, only instead of saving the environment spec file, we package the environment into a .tar.gz archive using conda-pack. See here for an example: https://gitlab.com/conda-envs/default/-/pipelines
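The conda-pack flow, for anyone unfamiliar, looks roughly like this (environment and archive names are illustrative):

```bash
# Package a solved environment into a relocatable archive
conda pack -n myenv -o myenv.tar.gz

# Later, on a compatible machine (same OS/arch):
mkdir -p myenv
tar -xzf myenv.tar.gz -C myenv
source myenv/bin/activate
conda-unpack   # rewrites the path prefixes baked into the env
```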
---
Reproducibility would mean not depending upon network resources, so I vote for conda-pack for 100% reproducibility. It ties you down to an OS/machine, but we have docker and VMs for taking care of running code decades later :) Not uncommon in scientific settings.
---
I don't understand the link with network resources: if your lock file includes hashes of the sources, how would that not be reproducible?
---
Well, there's reproducibility within the life of conda (really anaconda.org) and reproducibility external to the life of conda, where the package repository has disappeared, the format has been deprecated, etc. conda-pack solves the latter issue. It's a valid concern for many people working on long-term data preservation, but I assume it is out of scope of such a proposal; it could, however, be integrated as a feature for lock files (e.g. materialize a lock to a pack).

Speaking as an engineer from the Rubin Observatory, where we have a hundred or more downstream consumers of our conda environments (and now a conda-forge metapackage): we had previously used a conda-env specification and periodically would do installs and export a …

One thing I would really like to see is for the lock generation mechanism to work cross-platform. I would like to be able to create lock file(s) for a list of supported system architectures without necessarily having access to those architectures. I'm sure others have thought the same recently with the introduction of the Apple M1. I believe when I looked, this was not quite trivial because conda relies on python reporting the OS directly (…). Following from that, I do not believe that the lock file should simply be named …

Today we still use conda as our default, but often tell users to install mamba in many scenarios. We have been trying to keep up to date with the latest packages partly to make sure conda solves faster. I know this is probably a very hard problem to solve, and there are several intertwined issues I've mentioned which are not directly tied to a specification. But ideally there would be a unified platform/architecture-agnostic specification (similar to the dependencies section in a meta.yaml today) that can be used directly by conda-build or conda, and from which you should be able to produce lock files for all platforms/archs (or a unified lock file). Bonus points for thinking of a way of letting a user create a new environment, in a way which is likely to be compatible with an existing environment, if given the environment specification, a lock file, and the new packages they would like to add to the environment.

P.S. Sorry if rambling (I can blame DST); we do appreciate the system and the ecosystem around it very much, I just wanted to document some of the experience around shipping environments to users.
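As an aside on the cross-platform point: the community conda-lock tool already renders per-platform lock files from a single spec without access to each architecture. A minimal sketch (flags from memory; check the conda-lock docs for details):

```bash
# Produce lock files for several platforms from one environment.yml
conda-lock --file environment.yml \
    --platform linux-64 \
    --platform osx-arm64 \
    --platform win-64
```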
---
Referenced in an outside commit:

> … conda-env — See conda/conda#7248 for a list of issues with conda-env. The biggest issue is that it does not respect the channels list.
---
The `plugins::env` label (pertains to conda-env) marks some 45 open issues related to feature requests, bug reports, and usability problems with environment specification.
We currently have three different tools for environment specification, all of which have slightly different functionality.
### conda
There are two primary commands: `conda list --export` and `conda list --explicit`. Both print packages to stdout, and the user is expected to use io redirection to write the output to a file. The file can later be used with `conda create` or `conda install` as the parameter associated with a `--file` flag.

- `conda list --export` prints to stdout an alphabetized list of packages in name=version=build format.
- `conda list --explicit`, optionally interacting with an `--md5` flag, prints to stdout a toposorted list of package urls (optionally with an md5 checksum as the url fragment).
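For concreteness, the two output formats look roughly like this (package names, versions, and hash fragments are illustrative):

```
# conda list --export (name=version=build, alphabetized)
numpy=1.16.4=py37h7e9f1db_0
python=3.7.3=h0371630_0

# conda list --explicit --md5 (toposorted package urls)
@EXPLICIT
https://repo.anaconda.com/pkgs/main/linux-64/python-3.7.3-h0371630_0.tar.bz2#2a...
https://repo.anaconda.com/pkgs/main/linux-64/numpy-1.16.4-py37h7e9f1db_0.tar.bz2#5b...
```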
A somewhat hidden feature of conda is the ability to roll back to a previous environment state, facilitated by `conda list --revisions` and `conda install --revision`. To support this capability, conda keeps a full history of the environment in `conda-meta/history`. This history ends up being incredibly powerful, and in the future it can be more fully leveraged to produce an environment specification that captures only the user's top-level requested packages.

### conda-env
Conda-env uses an `environment.yml` file as its environment spec, and introduces confusingly duplicated `conda env create`, `conda env update`, and `conda env remove` commands. The single capability introduced by conda-env beyond what's available in conda is the ability to specify packages to be installed by pip, in addition to packages to be installed by conda.
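An illustrative `environment.yml` showing that pip section (the package names are made up):

```yaml
name: example
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy
  - pip
  - pip:
      - some-pypi-only-package  # hypothetical pip-installed dependency
```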
### anaconda-project

The design concepts for anaconda-project are described here. Capabilities introduced in anaconda-project that were not already included in conda or conda-env are: …
Capabilities of anaconda-project that are beyond the scope of what conda will include are: …
### unification
We propose unifying the capabilities outlined here--and many/most of the feature requests under the environment-spec tag--into and under conda proper. This unification project will result in two new environment specification file formats: one that maximizes portability and one that maximizes reproducibility. The formats will be backward-incompatible with previous versions of conda, and we will also stop support for all existing environment specification formats (that means full deprecation of conda-env in particular, and likely anaconda-project as well). This decision is deliberate, with the purpose being that this unification project does not simply create n+1 ways of doing the same thing (and thus n+1 formats and workflows to support indefinitely). This backward-incompatible change means we will facilitate migration paths from the old ways to the new way. The work will be marked by a major version bump to conda (e.g. conda 5.0).
When designing the user experience for the new environment specification, we should carefully study, learn from, and reference other package management tools. While perhaps not an exact "condafied" re-implementation of pipenv, the workflow should feel natural and intuitive to users familiar with the best parts of pipenv. We will also study package management tools outside of the python ecosystem, such as yarn, npm, bundler, and cargo.