
create docker container with all deps and data pre-installed #393

Open · 4 tasks

brentp opened this issue Mar 17, 2015 · 23 comments


brentp (Collaborator) commented Mar 17, 2015

To ease use, it would be nice to have a Docker container. Things to address:

This would also make a portable and shareable analysis environment.

@chapmanb @roryk I know you have thought about these; any further things to consider?

dgaston (Contributor) commented Mar 17, 2015

Regarding the data, it would probably be better to distribute it separately and mount it as a shared volume within the Docker container, along with a volume for wherever the user wants to put outputs. While I don't think Docker Hub has size limits on images, there seem to be recurring problems with both pushing and pulling large images (moby/moby#2292 has reports of issues with large layers going back to 2013, and as recently as 11 days ago).

That said, I agree that shipping the data with the container greatly aids portability as well as reproducibility, since those images can and should be versioned.

brentp (Collaborator, Author) commented Mar 17, 2015

Good thoughts. I'd like to have the data inside so that it's a single download to start running gemini. The data on the image should be ~3GB or so. The issue you link mentions a 5GB hard limit, and most of the reported problems are on push; I would check whether pull is more reliable. We could also create an image with everything except dbSNP and 1000 Genomes, which would remove the majority of the data from the image.

dgaston (Contributor) commented Mar 17, 2015

3 GB probably isn't too bad; not sure why I was thinking it might be bigger. I agree that a single download would be good. I'm willing to test this out by creating an image and seeing whether I can push and pull it from the public repo.

dgaston (Contributor) commented Mar 17, 2015

It would fit nicely with some other work I am doing anyway. I am working on a web interface that includes GEMINI integration, among other features, and I want to move it to a Docker-based distribution.

brentp (Collaborator, Author) commented Mar 17, 2015

@dgaston it would be awesome if you wanted to create an image. We have an impending release with data changes, so it might be good to wait until then.

brentp (Collaborator, Author) commented Mar 17, 2015

Actually, it looks like it will be quite a bit larger than 3GB.

bgruening (Contributor) commented:

I don't think a gemini image with data included is a good idea. We should think about separate data volumes: http://docs.docker.com/userguide/dockervolumes/

If that doesn't work, it should be simple to trigger the data download from within the container.
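
For illustration, the data-volume-container pattern from that guide might look roughly like the following sketch (the container and image names here are hypothetical):

# create a container whose only purpose is to own the shared data volume
# (names are hypothetical placeholders)
docker run --name gemini-data -v /gemini/data ubuntu:14.04 /bin/true

# any gemini container can then attach the same volume
docker run --volumes-from gemini-data some/gemini-image gemini --help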

chapmanb (Collaborator) commented:

Brent and Daniel;
This would be great. More docker, please. We include a ready-to-go GEMINI, without the data, as part of the bcbio docker image, so you could import that directly to try it out:

https://github.com/chapmanb/bcbio-nextgen-vm#docker-image-installation

This also includes a bunch of other tools but should at least be usable as a starting point.

Regarding the discussion points:

  • Don't include the data. It'll be too big. I have a separate data download stage as part of running bcbio with Docker:

https://github.com/chapmanb/bcbio-nextgen-vm#installation

  • You do need root (or effectively root) permissions to run Docker right now. Namespace support might be coming in 1.6, which would avoid this; I've been patiently waiting for that support for a long while.
  • My experience is that the Docker index can't handle large images. I got lots of timeouts and failures when trying this, and now do a direct download of a gzipped tarball from S3.
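
For what it's worth, that kind of direct download can be as simple as piping the tarball into docker load; this is a sketch with a placeholder URL, not bcbio's actual download location:

# fetch a pre-built image tarball and load it into the local docker daemon
# (placeholder URL, for illustration only)
curl -L https://example-bucket.s3.amazonaws.com/gemini-image.tar.gz | gunzip | docker load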

dgaston (Contributor) commented Mar 17, 2015

Good points, Brad. I have been following your use of Docker in bcbio, which is how I was introduced to Docker as a tool in the first place. I agree that keeping the data separate is the best idea; Docker volumes work well. I'll look at how you handle it in bcbio in terms of managing the user's download and having that path reflected inside the Docker container for linking, since that is usually specified in the Dockerfile.

chapmanb (Collaborator) commented:

Thanks for looking into this. I'd definitely be open to learning new ways of handling data, especially via data volumes. Some of that didn't exist, or I didn't understand it, when I wrote that code. It would be awesome to learn approaches that don't need so much custom code and rely more on all the cool Docker tooling.

brentp (Collaborator, Author) commented Mar 17, 2015

Thanks for all the insight. I updated the list.

If Docker 1.6 supports use without sudo and data volumes meet our needs, that should be a good combination to explore.

dgaston (Contributor) commented Mar 17, 2015

Probably worth exploring and experimenting before then. Obviously the permissions may be an issue for some users until that is fixed; the really security-conscious could isolate the install inside a Vagrant VM for an extra layer of isolation.

dgaston (Contributor) commented Mar 20, 2015

Hi everyone, I have started working on this, building out a functional image to start with and working my way up from there. I thought about different ways of doing it, including forking the GEMINI GitHub project itself and adding a Dockerfile, but for managing versions I think it is best to build an image that follows the installation script.

Right now I have an image that uses Ubuntu 14.04 as the base, with all the system-level dependencies installed (I think; I haven't run into any issues yet), and GEMINI and all of its dependencies installed using the install script with the no-data option.

This isn't currently set up like bcbio-nextgen-vm, where a wrapper script sits around all of the Docker components, so right now it is executed using the docker command line itself.

The next step is to handle volumes for interacting with the user's filesystem and to automount a data directory for the GEMINI data.

The Docker Hub repo is at https://registry.hub.docker.com/u/dgaston/gemini/

Follow-up steps will be to write a wrapper script that handles downloading and setting up the data directory and interacting with Docker. Any and all feedback is appreciated.
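
As a rough illustration, a Dockerfile following that approach might look something like the sketch below, written as a shell snippet that generates and builds it. The package list and install paths are assumptions on my part, not the contents of the actual image:

# sketch: build an image by running the install script with the no-data option
# (package names and paths are assumptions, not the actual dgaston/gemini image)
cat > Dockerfile <<'EOF'
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y \
    python2.7-dev git wget curl build-essential zlib1g-dev
RUN wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py \
    && python2.7 gemini_install.py --nodata /usr/local /usr/local/share/gemini
ENV PATH /usr/local/share/gemini/anaconda/bin:/usr/local/bin:$PATH
EOF
docker build -t gemini-nodata .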

brentp (Collaborator, Author) commented Mar 20, 2015

@dgaston, this is excellent! So, you were able to just use the gemini_install.py script to get all the deps?

Have you looked at the data volumes?

dgaston (Contributor) commented Mar 20, 2015

Yes, the gemini_install.py script executed with just --nodata works for creating the image. At least, the build completes successfully and I can create a container and run gemini without arguments, or gemini --help, and get the help output.

Since you still have to specify a datadir even when you run with --nodata, the datadir inside the container is already set in the gemini config file. So if you launch the container using the -v flag, you can specify a data volume. For instance, my gemini files are in /data/shared/gemini/data on the host and the datadir inside the container is /root/gemini/data:

docker run -t -v /data/shared/gemini/data:/root/gemini/data -i dgaston/gemini:v0.1a /bin/bash

That starts a Docker container with the host directory mounted as a data volume. The container isn't executing gemini, just running bash, so I can drop into it and check that things are in the right spot, and they are.

I could add another directory with, say, a VCF file I want to run, and check it from inside the container before I continue building on the Dockerfile and any external scripts.

Right now I am trying to figure out the best way of installing just the data into a local directory. I could write a simple script that grabs it from the AWS S3 bucket, but it would be nice to work with the versioning system that gemini_install.py has. At the moment it seems a little hard to untangle installing just the data from installing all of the GEMINI components.

dgaston (Contributor) commented Mar 20, 2015

@arq5x or @chapmanb might have some ideas there.

chapmanb (Collaborator) commented:

Daniel;
Great work putting this together. You're right on with the data directory. Mount the internal gemini data directory somewhere external, then you can run:

gemini update --dataonly

and it'll download/update just the data, mapping it to the non-Docker local directory. This is essentially all bcbio does with its scripts: it just manages a consistent set of directories so users don't need to know about the details. Thanks again for all the great work on this.
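
With the image from earlier in this thread, that flow might look like the following sketch; the host path and internal datadir are the ones dgaston reported above, and it assumes gemini is on the container's PATH:

# populate the host data directory by running the update inside the container
# (a sketch, not a tested command)
docker run -t -i -v /data/shared/gemini/data:/root/gemini/data \
    dgaston/gemini:v0.1a gemini update --dataonly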

dgaston (Contributor) commented Mar 21, 2015

Oh, that's perfect; of course the update will work on its own with the data-only flag. Cheers. I have updated the image so that the entrypoint is the gemini command itself, which means a command line like:

docker run -t -v /data/shared/gemini/data:/root/gemini/data -i dgaston/gemini:v0.1a

is all that is needed; it defaults to --help if no other command-line parameters are passed. You can mount multiple volumes, so mounting your current working directory as well would be a good idea, to keep results and output saved outside of the container (see the sketch below). It probably also makes sense to orchestrate things a bit further: mount the data directory into a data-only container first, and then use --volumes-from to get the data into the gemini container. That could be accomplished with a wrapper script and some config files, of course, but I'm looking into Docker's built-in orchestration tools, since they have acquired Fig and some other third-party applications for orchestrating multi-container applications.
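
For example, mounting the current working directory alongside the data keeps the database and any output on the host; this is a sketch, and the VCF and database names are placeholders:

# run a gemini subcommand through the image's entrypoint, writing results
# to the mounted working directory (file names are placeholders)
docker run -t -i \
    -v /data/shared/gemini/data:/root/gemini/data \
    -v "$(pwd)":/work -w /work \
    dgaston/gemini:v0.1a load -t snpEff -v my.vcf my.db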

dgaston (Contributor) commented Mar 23, 2015

Well, everything seems to be going smoothly. I updated the image to the 0.12.2 fix with cyvcf, and it is happily chugging away loading a VCF into a gemini database at the moment. This will work nicely for storing and re-running previous versions of GEMINI in a fairly straightforward way. The only real issue is that maintaining older versions of the data files is still a bit cumbersome.

brentp (Collaborator, Author) commented Mar 26, 2015

@dgaston, what did you use for the base image? I think this would be a nice way to test gemini on a completely new system so we make sure all of the data files and deps get pulled in.

brentp (Collaborator, Author) commented Mar 26, 2015

On a bare docker image, I have this recipe to run tests on the dev branch:

# start the container; the first time this will take a while.
docker run -t -i ubuntu:14.04 /bin/bash

# the commands below run inside the container's shell
apt-get update && apt-get -y upgrade
apt-get -y install python2.7-dev git-core wget curl g++-4.8 zlib1g-dev g++ build-essential
ln -s /usr/bin/python2.7 /usr/bin/python

# install
git clone https://github.com/arq5x/gemini.git && cd gemini \
    && git branch --track dev origin/dev \
    && git checkout dev \
    && python2.7 gemini/scripts/gemini_install.py --gemini-version unstable /tools /data

# test
PATH=/data/anaconda/bin:/tools/bin/:$PATH
cd /data/gemini && bash master-test.sh

dgaston (Contributor) commented Mar 27, 2015

I used Ubuntu 14.04 as my base, but you could definitely use a smaller one, I'm sure. There are some tweaks that can hopefully be made to flatten the image a bit more.

dgaston (Contributor) commented Mar 27, 2015

@brentp That could be translated into a Dockerfile for testing pretty easily as well. I could tweak it and push it to the same repo. I have my Dockerfile stashed in this git repo: https://github.com/dgaston/gemini-docker
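
For reference, a rough sketch of that translation, written as a shell snippet that generates and builds a test Dockerfile; it simply mirrors the recipe above and is untested:

# sketch only: translate the test recipe above into a Dockerfile and build it
cat > Dockerfile.test <<'EOF'
FROM ubuntu:14.04
RUN apt-get update && apt-get -y upgrade && \
    apt-get -y install python2.7-dev git-core wget curl g++-4.8 zlib1g-dev g++ build-essential
RUN ln -s /usr/bin/python2.7 /usr/bin/python
RUN git clone https://github.com/arq5x/gemini.git && cd gemini \
    && git branch --track dev origin/dev \
    && git checkout dev \
    && python2.7 gemini/scripts/gemini_install.py --gemini-version unstable /tools /data
ENV PATH /data/anaconda/bin:/tools/bin:$PATH
CMD cd /data/gemini && bash master-test.sh
EOF
docker build -f Dockerfile.test -t gemini-test .
docker run -t -i gemini-test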
