
Formalize container best practice (esp. for complex tools) #37

Open
jmchilton opened this issue Feb 27, 2017 · 5 comments

Comments

@jmchilton
Contributor

tl;dr - Should it be a best practice to (1) register combinations of requirements for complex tools and publish all needed combinations to a container registry, or (2) have Galaxy just build complex containers on demand for such tools?

I think there is probably broad consensus that the "mulled" approach to building containers should be part of a best practice for using containers with Galaxy. From an operations perspective it produces tiny containers that are easy and quick to deploy and manage; from a reproducibility and support perspective it allows the same (best-practice Conda) binaries to work on bare metal or inside a container; and from a developer perspective it will ideally become much more transparent than a Dockerfile-based approach.

The follow-up recommendation is less clear in my opinion. We currently have thousands of containers for individual requirements that can be used with tools that work with BioConda and have only a single requirement tag. For tools that contain multiple requirement tags - which I contend are not a corner case but a very mainstream and typical use case - we could recommend two different things as a best practice.

Put another way - should Galaxy (1) fetch the containers it needs or (2) build them.

Pros of (1) are:

  • From a Galaxy and admin perspective, tools with multiple requirements are handled no differently from tools with single requirements.
  • I feel better about the reproducibility of this approach.
  • I feel better about the ability to exactly test the ultimate environments.
  • As hinted at in Holistic Approach to Container Caching galaxyproject/galaxy#3673 - this approach would work better with different deployment scenarios where nodes fetch their own containers by various mechanisms.
  • Provides more surface area for value-added features, such as Singularity containers.

Pros of (2) are:

  • This is more flexible and adapts to new tools, requirements, channels, etc. on the fly.
  • Requires less upfront work by the tool author (or perhaps the tool shed).
  • No need to manage a large assortment of existing containers. If changes to the approach are needed, we can just push a Galaxy patch and not have to update or rebuild containers. (Though I'm not sure we will ever need to rebuild anything if we get the testing right upfront.)

Ping @bgruening, @mvdbeek, @jxtx.

@bgruening
Member

I actually see both approaches living in parallel. I think we should advertise building these containers upfront as best practice, but if they are not available we build them.
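The fetch-first, build-as-fallback policy described here could be sketched roughly as follows. This is a hypothetical illustration only: `resolve_container` and the three callables are placeholder names standing in for whatever registry and Docker machinery a deployment actually uses, not Galaxy APIs.

```python
def resolve_container(image_name, registry_has, pull_container, build_container):
    """Prefer a pre-built container from a registry (approach 1);
    only build locally (approach 2) when none has been published.

    All three callables are stand-ins for real registry/Docker logic.
    """
    if registry_has(image_name):
        # Published upfront: reproducible and already tested.
        return pull_container(image_name)
    # Fallback: flexible, but untested until built.
    return build_container(image_name)

# Toy usage with stub callables:
published = {"quay.io/biocontainers/samtools:1.3.1"}
result = resolve_container(
    "quay.io/biocontainers/samtools:1.3.1",
    registry_has=lambda n: n in published,
    pull_container=lambda n: f"pulled:{n}",
    build_container=lambda n: f"built:{n}",
)
```

The design point is simply that both approaches share one resolution path, so the "best practice" registry and the on-demand build differ only in which branch fires.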

Somewhere on my ToDo list is to extend https://github.com/BioContainers/mulled and create a tiny website to assemble conda packages and create mixed-mulled containers. The names should be normalised and hashed in a unique way. The aim is to get the same container back from any randomly assembled requirements.txt file containing the same packages.
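The deterministic-naming idea above can be sketched as below. This is a minimal illustration of hashing a normalised requirement set, not the actual mulled naming algorithm; the function name and the `mulled-v2-` prefix here are assumptions for the example.

```python
import hashlib

def mulled_style_name(requirements):
    """Derive a deterministic container name from conda (name, version) pairs.

    Sorting normalises the input, so any permutation of the same
    requirements (e.g. differently ordered requirements.txt files)
    hashes to the same name. Illustrative only.
    """
    targets = sorted(f"{name}={version}" for name, version in requirements)
    digest = hashlib.sha1("\n".join(targets).encode("utf-8")).hexdigest()
    return f"mulled-v2-{digest}"

# The same packages in any order yield the same container name:
a = mulled_style_name([("samtools", "1.3.1"), ("bwa", "0.7.15")])
b = mulled_style_name([("bwa", "0.7.15"), ("samtools", "1.3.1")])
```

Because the name is a pure function of the package set, Galaxy (or a website, or a CI job) can compute it independently and check a registry for an existing image before building anything.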

We could also think about integrating this into the IUC Travis testing, so that IUC creates these containers on PR merge.

I think there is a benefit in generating them outside of Galaxy, for the reasons you mentioned, but also because I want to generate more ... like Singularity images - and with this everyone can benefit - and in turn we get more care and funding for BioConda.

@jxtx

jxtx commented Feb 28, 2017

I actually see both approaches living in parallel. I think we should advertise building these containers upfront as best practice, but if they are not available we build them.

Do we register and push a container to an external repository when we build one?

I much prefer option 1 for reproducibility. I can see 2 being important for development, but I wouldn't like to see production Galaxy instances using this approach.

@mvdbeek
Contributor

mvdbeek commented Feb 28, 2017

I think we'll need both. We can't be sure that a dependency (in a container) really works (almost) everywhere until we have tested it in a bare-bones container, so ideally the IUC tool tests would build and run (maybe also push on merge) the container. planemo could have a --local_container option for this.

For production instances we should probably not default to building locally.
In addition to the reproducibility problem @jxtx mentioned, I think on busy sites building many containers at once could kill Docker, in a way that is probably worse than activating many conda environments in parallel. Of course we could do the container building in a separate job on the cluster nodes, but then you'd have to build it at least once per worker or introduce some smarter logic to distribute the container.

Also, if you build locally you will not know upfront whether the built container will work - so what would you do if it doesn't? Rebuild until it does? That seems wasteful and is already a minus point for conda_auto_install.

@jmchilton
Contributor Author

Thanks all - I don't agree with every nuance, but in large part I agree with most of this. I appreciate yinz taking the time to respond. My goal for the next few days of development is to establish that we can state that having an existing container is considered best practice. I'll take that and work on it. Hopefully we will have a process in place by the GCC.

I will, however, say in defense of (2): as long as it is cached, it is no worse for reproducibility than allowing each site to install the binary dependencies locally once - as we do now and have always done. I get that (1) is much better than what we've traditionally done, so we should do it.

In response to conda_auto_install being problematic - it is just implemented very naively IMO. If it were implemented intelligently at all, it would be a lot less crappy and would be a perfectly fine idea.

@mvdbeek
Contributor

mvdbeek commented Feb 28, 2017

In response to conda_auto_install being problematic - it is just implemented very naively IMO. If it were implemented intelligently at all, it would be a lot less crappy and would be a perfectly fine idea.

I agree, I think this is a good idea for certain scenarios. I was just mentioning this as an example for the extra work that would be involved in managing the container lifecycle.
