unify environment specifications among conda, conda-env, and anaconda-project #7248

Open · Tracked by #11341
kalefranz opened this issue May 3, 2018 · 32 comments
Labels: epic (a high-level collection of smaller related issues), plugins::env (pertains to conda-env), source::governance (created by members of the conda governance, https://github.com/conda-incubator/governance), type::feature (request for a new feature or capability)

@kalefranz (Contributor) commented May 3, 2018

The plugins::env (pertains to conda-env) label marks some 45 open issues related to feature requests, bug reports, and usability problems with environment specification.

We currently have three different tools for environment specification, all of which have slightly different functionality.

conda

There are two primary commands: conda list --export and conda list --explicit. Both print packages to stdout, and the user is expected to use I/O redirection to write the output to a file. The file can later be passed to conda create or conda install via the --file flag, as sketched after the examples below.

  1. conda list --export, which prints to stdout an alphabetized list of packages in name=version=build format:
anaconda-client=1.6.14=py27_0
asn1crypto=0.24.0=py27_0
attrs=17.4.0=py27_0
beautifulsoup4=4.6.0=py27h9416283_1
ca-certificates=2018.03.07=0
certifi=2018.4.16=py27_0
  2. conda list --explicit, optionally combined with an --md5 flag, which prints to stdout a toposorted list of package URLs (optionally with an md5 checksum as the URL fragment):
@EXPLICIT
https://repo.anaconda.com/pkgs/main/osx-64/ca-certificates-2018.03.07-0.tar.bz2#8a61ffe6635a912082978bb8127b5f4b
https://repo.continuum.io/pkgs/main/osx-64/conda-env-2.6.0-h36134e3_0.tar.bz2#0876f412b6123634607ae9c22a7bc805
https://repo.continuum.io/pkgs/main/osx-64/libcxxabi-4.0.1-hebd6815_0.tar.bz2#73ee4e71dae58f5173d649952c4c2b35
https://repo.continuum.io/pkgs/main/osx-64/tk-8.6.7-h35a86e2_3.tar.bz2#5694507135f7eb1c7b59f6c08f01fc0c
https://repo.continuum.io/pkgs/main/osx-64/yaml-0.1.7-hc338f04_2.tar.bz2#dab654341f57e56b615a678800262b0e
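
Both output formats round-trip through the --file flag. As a minimal sketch (the file and environment names here are placeholders):

conda list --export > spec.txt
conda create --name cloned-env --file spec.txt

An --explicit file is used the same way, with the added property that installing from it bypasses the solver entirely.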

A somewhat hidden feature of conda is the ability to roll back to a previous environment state, facilitated by conda list --revisions and conda install --revision. To support this capability, conda keeps a full history of the environment in conda-meta/history. This history ends up being incredibly powerful, and in the future it can be more fully leveraged to produce an environment specification that captures only the user's top-level requested packages.
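
For example (the revision number here is illustrative):

conda list --revisions      # numbered history of changes to the environment
conda install --revision 2  # roll the environment back to revision 2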

conda-env

Conda-env uses an environment.yml file as its environment spec, and introduces the confusingly duplicative conda env create, conda env update, and conda env remove commands. The single capability introduced by conda-env beyond what's available in conda is the ability to specify packages to be installed by pip, in addition to packages to be installed by conda.
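
A typical environment.yml exercising that pip capability looks like this (package names and versions are illustrative):

name: myproject
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy
  - pip
  - pip:
    - some-pypi-only-package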

anaconda-project

The design concepts for anaconda-project are described here. Capabilities introduced in anaconda-project that were not already included in conda or conda-env are

  1. The ability to manage environment variables as part of an environment specification.
  2. The concept of a lock file designed to maximize reproducibility of an environment, at the potential cost of its portability (a rough anaconda-project.yml sketch follows the next list). This is in contrast to a generic environment description that contains the minimal amount of information necessary to recreate the environment, which is designed to maximize portability (e.g. across operating systems) at the potential cost of exact bit-level reproducibility.

Capabilities of anaconda-project that are beyond the scope of what conda will include are

  1. the ability to manage individual, non-conda-package data files
  2. process initialization or management, such as running a database server
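
Returning to the two capabilities listed first (environment variables and a lock file), a rough sketch in anaconda-project's own format, with field names as that project documents them and purely illustrative values:

# anaconda-project.yml
name: myproject
packages:
  - python=3.6
  - pandas
variables:
  DB_PASSWORD:
    description: password for the analysis database

Running anaconda-project lock then records the solved, per-platform package versions in a separate anaconda-project-lock.yml.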

unification

We propose unifying the capabilities outlined here, along with many or most of the feature requests under the environment-spec tag, into conda proper. This unification project will result in two new environment specification file formats: one that maximizes portability and one that maximizes reproducibility. The formats will be backward-incompatible with previous versions of conda, and we will also drop support for all existing environment specification formats (that means full deprecation of conda-env in particular, and likely anaconda-project as well). This decision is deliberate: the purpose is that this unification project does not simply create n+1 ways of doing the same thing (and thus n+1 formats and workflows to support indefinitely). Because the change is backward-incompatible, we will facilitate migration paths from the old ways to the new way. The work will be marked by a major version bump to conda (e.g. conda 5.0).

When designing the user experience for the new environment specification, we should carefully study, learn from, and reference other package management tools. While perhaps not an exact "condafied" re-implementation of pipenv, the workflow should feel natural and intuitive to users familiar with the best parts of pipenv. We will also study package management tools outside of the python ecosystem, such as yarn, npm, bundler, and cargo.
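
To make the two-format idea concrete, a purely hypothetical sketch (none of the file names, fields, or syntax below are decided):

# portable spec: only top-level user intent; the solver fills in the rest
channels:
  - defaults
dependencies:
  - python >=3.6
  - pandas

# lock: exact, toposorted package URLs plus checksums per platform,
# conceptually what conda list --explicit --md5 emits today, e.g.
# https://repo.anaconda.com/pkgs/main/osx-64/ca-certificates-2018.03.07-0.tar.bz2#8a61ffe6635a912082978bb8127b5f4b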

@kalefranz added the type::feature, epic, source::governance, and tag:environment-spec labels May 3, 2018
@kalefranz added this to the 5.0 milestone May 3, 2018
@cjw296 commented May 3, 2018

I can't resist including this:
xkcd 1987

@cjw296 commented May 3, 2018

Okay, so from trying out a bunch of these, working with pipenv, looking longingly at pip-tools, and trying to check whether environments are correct (even conda envs), the features I'm really after are:

  • an abstract requirements file, where I can specify the packages I want
  • a concrete requirements file, often called a lock file, that specifies the full and exact versions of all the packages and their dependencies.
  • management of both conda and pip packages. conda-forge is nice, but it's very far from complete.
  • binaries and scripts that are usable without activating the environment (pipenv gets this spectacularly wrong), specifically so things like setuptools console_scripts can be included in crontabs.
  • conda install/uninstall should update both the abstract and concrete requirements files whenever a package is installed, removed, or updated.
  • a 'check' command that lets you check the current environment against the concrete requirements file; it must be usable in CI systems (eg: picky check). (See the note at the end of this comment.)

For the abstract requirements file format:

  • which channels to use, in priority order
  • version ranges for packages (eg: pandas<0.20, python=3, etc)

For the concrete requirements file format:

  • which channel a package should be installed from
  • the exact versions required for all operating systems (like anaconda-project)
  • the exact versions of packages only required for certain OSes (eg: appnope)

Nice to haves:

  • a way to get explanation of why each package is in an env (ie: what packages depend on it)
  • a way to upgrade one or more packages, with confirmation, updating the abstract and concrete requirements files as needed
  • a way to prune packages that are no longer needed, based on the packages listed in the abstract requirements file

Other observations:

  • anaconda-project has a tonne of fluff that didn't seem very necessary; my hope is the solution to all this will be minimal and as simple as possible
  • conda env feels close, and is pointed to by most of the docs. Sadly, I think it gets the channels stuff quite wrong: no priorities, and it doesn't record which channel a package came from
  • conda list looks "best" for conda packages, but I'm not sure how good it is for round-tripping, particularly for pip packages.
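
On the 'check' point: newer conda versions ship a conda compare subcommand that diffs an environment against an environment file, which looks like at least a partial answer (whether its output and exit status are strict enough for CI use is worth verifying):

conda compare --name myenv environment.yml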

@JulianStier commented Jun 28, 2018

Observing the development of the python ecosystem for some years, this is one of the most important features, and conda is, IMHO, the only project with the potential to bridge this feature gap to ecosystems such as PHP (Composer), Ruby (Bundler), or Java (Gradle).
The Composer schema is good inspiration for what a dependency manager can include. Rethinking this could also help create a simpler path for build tools.

While pip talks about "requirements" and conda about "environments", I'd propose clear terminology for the two aspects: describing desired dependencies (requirements), and locking and sharing a fixed, solved dependency list (environment). Maybe something like "environment.yml" and "environment.yml.lock" (in analogy to pyproject.toml and its lock file), or "requirements.yml" and "environment.yml" in analogy to existing files. Of course, the dependencies would still be conda dependencies, and conda could still wrap pip dependencies.

Backwards compatibility is very important as well: the workflow and description format used until now must keep being supported, to accommodate slower project cycles. Nevertheless, it would make sense to give a more expressive format enough thought to support at least the expressiveness of existing description files (not just a list of dependencies).
Providing features to import from files such as requirements.txt and Pipfile (see #4205) could also draw people working with those formats into using conda.

If such a workflow were designed the right way, I would see no need for files like pyproject.toml (see pypa/pip#4144 or https://github.com/sdispater/poetry), because conda, as a more general package manager, is far better suited for this job.

@pylang commented Oct 30, 2018

Just adding a related reference: poetry, a package publishing tool that uses pyproject.toml and manages dependencies.

@kiwi0fruit

Just adding more info about reproducibility of mixed conda+pip multichannel environments:

conda cannot auto-export and reproduce a fixed-version conda+pip combination

@RobertRosca

Hey, I'm working on an EU project called PaNOSC which (broadly) aims to improve the reproducibility of data analysis for European photon and neutron research facilities.

One of the major requirements is to be able to identically reproduce a python environment in the 'long-term' future, on the order of multiple years.

I would like to use conda to manage these environments as many scientists are already familiar with it, but it does not have a few features required to make the environment setup truly reproducible for many of the use cases I have looked at (as the issues mentioned here show); so right now I'm leaning towards creating a tool which combines a deterministic python environment manager (e.g. poetry, pyenv, dotlock) with Spack for 'external' non-python dependency management.

The work mentioned here would help make conda a viable alternative to my current plans, and I would much rather stick to just using conda instead of two glued-together solutions.

So I'm wondering if there is a rough timeline for this integration, or a timeline for the work on anaconda-project?

@msarahan (Contributor)

Near to mid-term: 3-6 months.

@kiwi0fruit commented Mar 14, 2019

@RobertRosca Actually, conda is almost capable of reproducing environments. See this option.

But there is a problem with packages changing their SHA-256 hashes (without changing version and build). See this issue: #7882

If this isn't addressed (and I haven't seen any plans to fix it), then I would not recommend conda for reproducible environments.
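
The option referred to above is the explicit package list with checksums; installing from it bypasses the solver and fetches exactly the listed files:

conda list --explicit --md5 > locked-env.txt
conda create --name restored --file locked-env.txt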

@kiwi0fruit commented Mar 14, 2019

@RobertRosca But thinking about it a bit more: conda packages can change hashes, and pip packages can be deleted from PyPI by maintainers. I guess this almost never happens, but still, they can do it. So the only way to guarantee identical envs is to keep your own mirror of the packages used. In this case conda is as good as pip (and better in other regards).

@msarahan (Contributor)

The new package format discussed in #7882 is done. There's support for it in conda-build as of conda/conda-build#3334, and #8265 will add it for conda.

We really never want to change packages. We change repodata to correct issues. Whether we bake those changes back into the packages is something we currently do not do, but might consider. It determines whether one can recreate a valid index with only packages (not involving some extra repodata patch info). Package signing will also affect the sha256 of the outer container, but not the inner one. The new package format will require new tooling to ensure reproducibility, but it should improve things.

@kiwi0fruit commented Mar 14, 2019

Sounds like reproducibility is almost here. Great!

By the way: am I right that conda packages are never deleted from the main channels and conda-forge? Or are there cases where that happens?

@msarahan (Contributor)

In very rare cases, we may remove things. We reserve that for when a package is greatly breaking the system, and we cannot otherwise remove it. Generally, we prefer to hotfix repodata such that the package is unsatisfiable, and users may add a special channel to unblock that and make it satisfiable.

@RobertRosca

Thanks for the help and quick answers =D

One of the problems for my use case is that not all dependencies will be installed from conda or pypi, as a lot of the packages which need to be tracked are in-house tools developed by scientists, so they're only available as a git repository.

Understandably this isn't dealt with too well by conda, as the point of conda is to manage conda packages. For example, if you make a conda environment file, put in a github URL (as a pip requirement), install it, then use conda env export, you only see the name and tag of the repository. So you lose the URL and don't have a commit hash.

Which is where the ability of poetry/pipenv to handle arbitrary VCS dependencies comes in very handy, and anaconda-project looks to be a great step towards bringing similar functionality to conda, albeit (from what I understand) 'only' for conda-based dependencies.

I put quotes around 'only' since that limitation might be deliberate rather than an omission: from the anaconda point of view it would be odd to support non-conda dependencies to the same level as conda ones, and I guess the suggested approach would be to use conda-build as an easy way of creating a conda package which can then be used with anaconda-project.

Is that right, or are there plans to expand anaconda-project to also lock non-conda/VCS dependencies in a similar way to poetry, pipenv, or dotlock?

@quartox commented May 9, 2019

I started this conversation with @kalefranz but want to open the discussion here. I have developed an internal corporate package that helps automate and make conda environments more portable. My plan is to open source my project, but ideally the features would be merged into conda. I want to describe the intention so that we can figure out the right way to implement them later.

Automatic Environment Activation

Essentially

(base) $ cd /path/to/project/
(project-env) $

Or maybe it prompts before activating. The typical usage is to have a git repo, and whenever I cd into the git repo (or a subdirectory) it either prompts me or automatically runs conda activate. cd-ing into a different project directory deactivates the current environment and activates the environment tied to the new project. (A minimal opt-in shell hook along these lines is sketched below.)
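
A minimal sketch of such an opt-in hook as a bash function (hypothetical; not a conda feature), wrapping cd to look for an environment.yml:

# in ~/.bashrc; assumes the env file carries a name: field
cd() {
    builtin cd "$@" || return
    if [ -f environment.yml ]; then
        local env_name
        env_name=$(sed -n 's/^name:[[:space:]]*//p' environment.yml)
        [ -n "$env_name" ] && conda activate "$env_name"
    fi
}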

Automatic Environment Syncing

Maintain an environment file (that is portable across platforms) that is tied to a project and automatically updates when the user runs a conda command (install, remove, etc.) for that environment. The goal is to have all collaborators on the same environment at the same time. If another user updates the file then conda prompts you to re-solve your local environment (similar to conda env update --prune -f environment.yml). The typical use case is to store this file in a git repo and have the user manually commit the changes when they add/remove packages. Git pull typically prompts the update of the local environment.

R package support

R package interoperability support similar to pip interoperability support.

  • conda list shows R packages installed by the user.
  • section of the environment spec file for R packages.

In terms of reproducibility of the R packages, there are good examples such as mybinder with install.R and checkpoint using MRAN. Another challenge I have seen for this kind of functionality is how to handle corporate mirrors and installing from GitHub and/or enterprise GitHub. Pip is generally flexible enough to handle this (though I would echo the request above for automatically tracking github URLs), but we have custom-built solutions for R at my company.

@dhirschfeld (Contributor)

> Automatic Environment Activation

I don't know how you could hijack cd to automatically activate an environment, but that's certainly not something I'd want, and I wouldn't want to be annoyed by prompts either. IMHO activating an environment is something which should be done explicitly.


The dependencies of a project are specified in the meta.yaml, so it might be nice if conda env create could take a meta.yaml file as well as an environment.yml. The user could then:

git clone org/repo
cd repo
conda env create --name repo --file ./conda-recipe/meta.yaml
conda activate repo

...and later keep it up to date with:

git pull
conda env update --file ./conda-recipe/meta.yaml
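
For reference, the part of a meta.yaml that such a command would have to read looks roughly like this (simplified; real recipes often use Jinja templating that a plain YAML parser can't handle):

package:
  name: repo
  version: "1.0.0"
requirements:
  build:
    - python
    - setuptools
  run:
    - python
    - pandas >=0.23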

@benlindsay

I think the automatic environment activation would be a nice feature to have if it was something you opt into with, e.g., a line in .condarc.

@msarahan (Contributor)

Let's try to keep this issue focused on the actual environment spec, and not how activation works. That's a separate part of the process, and we definitely don't want to force a particular activation behavior on anyone, though I agree that having both automatic and manual activation are useful to have.

@Sarcasm commented May 10, 2019

Just a comment to say that I find @dhirschfeld's idea of reusing the meta.yaml interesting (#7248 (comment)); a single source of truth for the conda dependencies sounds appealing to me (be it meta.yaml or something else that meta.yaml can load).

Although, I can imagine dev-only dependencies (linter, code formatter, ...) being useful in the environment file, but less so in the meta.yaml.
poetry and cargo have something called dev-dependencies.

@dhirschfeld (Contributor)

Yeah, my suggestion would work better if there were a better way to specify optional dependencies:
#7502

@3tilley commented Aug 11, 2019

Could conda benefit from plugging into the pyproject.toml efforts? I'm a little bit confused about the differences between frontends and backends, but couldn't the deps and dev-deps just be put inside the toml, obviating the need for the meta.yaml and the environment.yaml?

@msarahan
Copy link
Contributor

meta.yaml and environment.yaml differ by a lot more than just deps vs dev-deps. meta.yaml supports a lot more templating and otherwise dynamic behavior at build time. I am also always confused by what constitutes frontend vs backend. conda-build and conda will probably have to support pyproject.toml somehow at some point, but I don't think pyproject.toml alone will be sufficient to replace conda's spec format(s). Conda is much more than just python, and a spec file for just python may not be enough for conda.

@GuSuku commented Apr 4, 2020

Look forward to this streamlining.

I would expect usage to be split mostly between conda and conda-env, with anaconda-project a distant third. The following is the main reason for me to go with conda-env instead of conda; fixing this disparity would consolidate the two major usages:

> The single capability introduced by conda-env beyond what's available in conda is the ability to specify packages to be installed by pip, in addition to packages to be installed by conda.

@analog-cbarber (Contributor)

We often find that we have two or three environment definitions for our projects:

  • just the runtime dependencies for indirect inclusion in meta.yaml and setup.py
  • build/test dependencies for CI builds
  • development environment dependencies - adds packages that developers may want to
    use locally (e.g. jupyter) but are not needed for the CI

It would be good if there were a nice way to either combine multiple environment specs or to have a single spec that provides a way to classify types of dependencies (see the sketch below).
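
One approximation with today's tooling is layering specs with conda env update (the file names here are placeholders):

conda env create --name proj --file environment.runtime.yml
conda env update --name proj --file environment.ci.yml    # add build/test deps
conda env update --name proj --file environment.dev.yml   # add jupyter etc.

Later layers can only add packages, though; a first-class way to classify dependencies would still be better.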

@majidaldo (Contributor)

> @JulianStier: Observing the development of the python ecosystem for some years, this is one of the most important features […]

> @analog-cbarber: We often find that we have two or three environment definitions for our projects […]

Aren't these issues largely solved by conda devenv? If you follow the discussions there, they don't want to become 'conda build', as conda-build is full-featured and geared towards making packages. I'm interested in having one spec that covers package, build, run, dev, etc. But I think it should primarily be interpreted as an 'env spec' (not a package spec), so you don't have to face the complexity of conda-build and packaging just to have some structure around dev envs.

@JulianStier

> @majidaldo: Aren't these issues largely solved by conda devenv? […]

Well, no. I think we're mixing some issues up. At least with the successful rise of poetry, a major gap in the python package-development-build cycle is about to close when comparing it to Java and PHP. And I agree, this is not the focus of conda, which acts more as a "cross-platform binary package manager". What I was missing back in 2018 was an easier python package build process and, separately from that, a more expressive, versionable environment for python projects beyond libraries (e.g. builds for pypi).

The first aspect will clearly be fixed by poetry and further python language developments.
The second aspect should be solved by (something like) conda.
In my opinion the current environment specification is not expressive enough (e.g. around version ranges), which might be good for exact reproducibility but lacks more collaborative working capabilities, e.g. when I am sharing my project with colleagues and would like to differentiate between environment requirements and a reproducible freeze.
A lot of those requirements are summarized above by cjw296.

A question, then, is in which cases we're going to see broader usage of conda.
From my own experience I see at least three different types of projects: 1) python package development, 2) application development, and 3) scientific experiments / reproducible environments / working environments.
The first will clearly be covered mostly by tools such as poetry (especially with further refinements in upcoming python versions). It makes python build, test, and deployment incredibly easy, and as long as there are no further requirements for C libraries there will be no need for further package management, which is actually very good, because the development life cycle got quite a bit easier.
The second type (applications) might need more packaging/deployment structure, with requirements going beyond capabilities exclusively limited to python packages. Those builds could be based on conda, but I see lots of projects which rather go with full docker builds and tests. I'm not sure how strongly conda can define its feature set here.
For the third type of projects I still see huge potential, and this is where anaconda-project lived and got strong. Especially the brutal R ecosystem used in data science projects will still stand on conda for a long time, and I guess a lot of people are still transitioning to this type of prototyping/development. Having a reproducible scientific experiment project is quite a challenge and is often not based only on python/R packages. All the major autodiff frameworks such as pytorch and tensorflow gain a lot of ease by directly providing both CUDA driver support and the python library through conda.

Really looking forward to some conda specification(s) which

  • focus on semantic versioning
  • can be frozen (similar to the list export / list explicit)
  • allows broad development version ranges, e.g. numpy>1.16
  • is maybe based on a canonical name or can be easily activated within a project/directory (currently I have skeleton files "sur-xxx" for it to use conda in bash with auto-completion)
  • could itself be versioned to detect changes
  • supports/wraps pip and poetry (and considers future python changes)

There are quite a few similarities to package build dependency managers.

@etcet commented Oct 14, 2020

I'd like to share the current technique I'm using to provide reproducible conda environments. I doubt it's cross-platform, but I've only got to support one.

I'm using GitLab CI on a repo which includes an environment.yml [1]. The CI job installs the environment and then spits out a spec file [2]. That spec file gets a name based on the commit hash and the time (because our channels are always updating); it gets uploaded to an HTTP server, and we build a docker image which has it installed (again versioned with the hash and time). The spec file is quick to install because the solve is already done in CI; you're just downloading the files specified in the spec file.

We have to consider the date/time because the state of the channels determines how our environment will solve. For example, an environment which only includes python would give us a different version if solved today compared to last year. Environments with dozens of packages will solve differently day to day.

The versioning lets you keep environments loose, allowing updates. Sometimes you may have to manually pin packages due to failed tests downstream. Or, if you don't care about updates, just keep using the environment solved however long ago.

This approach requires the desired files to be present, and it assumes that the files won't change. I've read that might not always be strictly true with the conda mirrors.

An improvement I'd like to make is a mirror of my channels (conda/conda-forge) on ZFS/btrfs/something, so I can serve historical daily snapshots. Then I could pin the specific daily channel snapshot in my environment, and it should solve the same way no matter when it's solved.

If I'm making any assumptions or mistakes I'd love to know.

1: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
2: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#building-identical-conda-environments
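
A condensed sketch of that pipeline (the job name and file names are hypothetical):

# .gitlab-ci.yml
lock-environment:
  script:
    - conda env create --name build-env --file environment.yml
    - conda list --name build-env --explicit > "spec-${CI_COMMIT_SHORT_SHA}-$(date +%Y%m%d%H%M).txt"
    # upload spec-*.txt to the internal server, then build a docker image
    # that installs it with: conda create --file <the spec file>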

@majidaldo (Contributor) commented Oct 14, 2020

> @majidaldo: Aren't these issues largely solved by conda devenv? […]

> @JulianStier: Well, no. I think we're mixing some issues up. […]

I'm in general agreement with your well-thought-out reply (thanks). But I still see conda envs, at their core, as something that you 'do stuff' in. So the 'types' of projects you enumerated can all be contained in conda env(s). Now, we may talk about how much a conda env spec 'sees'/interprets specific build and package system specs. (I don't believe poetry should be in the business of venvs! And as for 2, people who think just filling Dockerfiles saves them from package management are dumb... ugh, you should not have to do system-level containment unless you are dealing with the system. But that's another discussion.)

I'm coming more from number 3, where package and versioning formality is not as important as pinning (external) dependency versions. Again, you can hack together something based on conda devenv that does a good chunk of what is being described in this thread; you can create a dep tree via includes and invoke whatever build/installation/setup process you want. For me, this was key to finding the balance between a (complicated) build dependency manager and (unstructured) prototyping envs; I need multiple envs that can exist simultaneously and can potentially communicate (Docker here would introduce unnecessary complications and overhead).

As an analogy, it's sort of like Makefiles but on the level of environments where conda is doing the hard work of solving envs. What the 'directives' of the makefile are is what we're discussing here.

The system that accommodates 3 is the same one that accommodates 1 and 2. I just don't see why there should be a distinction; you create an env and develop towards your specific needs in one seamless process. Start with a simple environment.yml and evolve it into more structured environment ymls that could contain poetry elements: the same process, beginning with simple envs as the development 'module' instead of having to create a package.

@infokiller

Since no one has mentioned it, I want to suggest looking closely at Nix and Guix (note: I'm talking about the package managers, not the closely related Linux distros based on them). Reproducibility is at their core and they do a lot of things right.
Bazel may also be worth looking into.

@ostrokach commented Oct 14, 2020

> @etcet: I'd like to share the current technique I'm using to provide reproducible conda environments. […]

We ran into the same problems with environment.yaml files that you describe. An environment.yaml file that created a functional environment one day could, a few days later, produce a different environment where some things no longer work, due to drift in the packages available from different channels.

Our solution is similar to yours, only instead of saving the environment spec file, we package the environment into a .tar.gz archive using conda-pack. See here for an example: https://gitlab.com/conda-envs/default/-/pipelines.
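
For reference, the conda-pack round trip is short (environment and file names are placeholders):

conda pack -n myenv -o myenv.tar.gz              # on the source machine

mkdir -p myenv && tar -xzf myenv.tar.gz -C myenv # on the target machine
source myenv/bin/activate
conda-unpack                                     # rewrite hard-coded prefix paths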

@dwr-psandhu

Reproducibility would mean not depending upon network resources, so I vote for conda-pack for 100% reproducibility.

It ties you down to an OS/machine, but we have docker and VMs for taking care of running code decades later :) Not uncommon in scientific settings.

@cjw296 commented Mar 16, 2021

I don't understand the link with network resources; if your lock file includes hashes of the sources, how would that not be reproducible?

@brianv0 commented Mar 16, 2021

Well, there's reproducibility within the life of conda (really anaconda.org), and reproducibility external to the life of conda, where the package repository has disappeared, the format has been deprecated, etc. conda-pack solves the latter issue. It's a valid concern for many people working on long-term data preservation; I assume it is out of scope for such a proposal, but it could be integrated as a feature for lock files (e.g. materializing a lock to a pack).

Speaking as an engineer from the Rubin Observatory, where we have a hundred or more downstream consumers of our conda environments (and now a conda-forge metapackage): we had previously used a conda-env specification and would periodically do installs and export a conda list --explicit for lock files (for every architecture combination we support), because explicit files bypass the solver when recreating an environment (conda-env export would still trigger the solver, of course). The users would consume the lock files. conda-lock is similar, with the extra SHA256 hashes, but we eschewed it for the time being to avoid the extra dependency.

We have since reverted that as the default and moved to a metapackage in conda-forge (which users now "consume"), with not very much pinning, because solver time was atrocious if you wanted to add something on top of that environment (unless you deleted the history file after environment creation); we were stuck with an old version of boost for a while. The canonical example is providing a user with the environment you used to create and analyze a dataset, and the first thing they do is try to install jupyter or dask or something like that. That is not the case with mamba, because it actually ignores the history, but in many cases it seemed conservative in the upgrades/downgrades it was doing, and it was still user-beware in that scenario.

We decided the metapackage was a win in the short term because users were often modifying their environments later, but an ideal middle ground for us seems to be generating a file of the dependency versions we care about at the time of generating the recipe (somewhere between the loose metapackage and the lock file, which locks down to a build number and hash). That was a reasoning for #10210. Still, the "lock" files are extremely useful for recreating environments today and will be used for production data processing, particularly because they are fast due to bypassing the solver, so we still generate them (but our conda-env is now just the metapackage).

One thing I would really like to see is for the lock generation mechanism to work cross-platform. I would like to be able to create lock files for a list of supported system architectures without necessarily having access to those architectures; I'm sure others have thought the same recently with the introduction of the Apple M1. I believe when I looked this was not quite trivial, because conda relies on python's reporting of the OS directly (sys, platform) and that was sprinkled throughout the code, but my memory may be fuzzy. Of course, you ultimately need a VM matching the data-processing OS/architecture to reproduce things exactly, but we have rarely seen differences (other than speed) across platforms/architectures to date (in the context of conda-forge channels).

Following from that, I do not believe the lock file should simply be named environment.[ext].lock, for example, unless it actually holds all the available architectures.

Today we still use conda as our default, but often tell users to install mamba in many scenarios. We have been trying to keep up to date with the latest packages partly to make sure conda solves faster.

I know this is probably a very hard problem to solve, and there are several intertwined issues I've mentioned which are not directly tied to a specification. But ideally there would be a unified, platform/architecture-agnostic specification (similar to the dependencies section in a meta.yaml today) that can be used directly by conda-build or conda, and you should be able to produce lock files for all platforms/architectures (or a unified lock file) from that specification.

Bonus points for thinking of a way to let a user create a new environment that is likely to be compatible with an existing one, given the environment specification, a lock file, and the new packages they would like to add.

P.S. Sorry if I'm rambling (I can blame DST); we do appreciate the system and the ecosystem around it very much, I just wanted to document some of the experience around shipping environments to users.
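
On the cross-platform point: the conda-lock project mentioned above attempts exactly this, solving for multiple platforms from a single machine. A rough sketch of its CLI (flags per the conda-lock README; the environment name is a placeholder):

conda-lock --file environment.yml --platform linux-64 --platform osx-64 --platform win-64
conda-lock install --name locked-env conda-lock.yml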

ostrokach added a commit to ostrokach/protein-adjacency-net that referenced this issue Mar 23, 2021
… conda-env

See conda/conda#7248 for a list of issues with conda-env.
The biggest issue is that it does not respect the channels list.
@kenodegard added the plugins::env (pertains to conda-env) label and removed the tag::environment-spec label Jan 18, 2022
@kenodegard pinned this issue Mar 10, 2022
@kenodegard unpinned this issue Mar 10, 2022
@kenodegard mentioned this issue Mar 16, 2022
@travishathaway removed this from the 5.0.0 milestone Oct 26, 2022