
create docker container with all deps and data pre-installed #393

Open · 4 tasks

brentp opened this issue Mar 17, 2015 · 23 comments


brentp (Collaborator) commented Mar 17, 2015

To ease use, it would be nice to have a Docker container. Things to address:

This would also make a portable and shareable analysis environment.

@chapmanb @roryk I know you have thought about these; any further things to consider?

dgaston (Contributor) commented Mar 17, 2015

Regarding the data, it would probably be better to distribute it separately and mount it as a shared volume within the Docker container, along with a volume for wherever the user wants to put outputs. While I don't think Docker Hub has size limits on images, there seem to be recurring problems with both pushing and pulling large images (moby/moby#2292 has reports of issues with large layers going back to 2013, and as recently as 11 days ago).

That said, I agree that shipping the data with the container greatly aids portability as well as reproducibility, since those images can and should be versioned.

brentp (Collaborator, Author) commented Mar 17, 2015

Good thoughts. I'd like to have the data inside so that it's a single download to start running gemini. The data on the image should be ~3GB or so. The issue you link mentions a 5GB hard limit, and most of the reported problems are on push; I would check whether pull is more reliable. We could also create an image with everything except dbSNP and 1000 Genomes, which would remove the majority of the data from the image.

dgaston (Contributor) commented Mar 17, 2015

3 GB probably isn't too bad; not sure why I was thinking it might be bigger. I agree that a single download would be good. I'm willing to test this out by creating an image and seeing whether I can push and pull it from the public repo.

dgaston (Contributor) commented Mar 17, 2015

It would fit nicely with some other work I am doing anyway. I am working on a web interface that includes GEMINI integration, among other features, and I want to move it to a Docker-based distribution.

brentp (Collaborator, Author) commented Mar 17, 2015

@dgaston it would be awesome if you wanted to create an image. We have an impending release with data changes, so it might be good to wait until then.

brentp (Collaborator, Author) commented Mar 17, 2015

Actually, it looks like it will be quite a bit larger than 3GB.

bgruening (Contributor) commented:

I don't think a gemini image with data included is a good idea. We should think about separate data volumes: http://docs.docker.com/userguide/dockervolumes/

If that doesn't work, it should be simple to trigger the data download from within the container.
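
For illustration, the data-volume-container pattern from that guide might look roughly like the following sketch (the container and image names here are hypothetical):

# create a container whose only purpose is to own the shared data volume
# (names are hypothetical placeholders)
docker run --name gemini-data -v /gemini/data ubuntu:14.04 /bin/true

# any gemini container can then attach the same volume
docker run --volumes-from gemini-data some/gemini-image gemini --help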

chapmanb (Collaborator) commented:

Brent and Daniel;
This would be great. More docker, please. We include a ready-to-go GEMINI, without the data, as part of the bcbio docker image, so you could import that directly to try it out:

https://github.com/chapmanb/bcbio-nextgen-vm#docker-image-installation

This also includes a bunch of other tools but should at least be usable as a starting point.

Regarding the discussion points:

  • Don't include the data. It'll be too big. I have a separate data download stage as part of running bcbio with Docker:

https://github.com/chapmanb/bcbio-nextgen-vm#installation

  • You do need root (or effectively root) permissions to run Docker right now. Namespace support might be coming in 1.6, which would avoid this; I've been patiently waiting for that support for a long while.
  • My experience is that the Docker index can't handle large images. I got lots of timeouts and failures when trying this, and now do a direct download of a gzipped tarball from S3.
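
For what it's worth, that kind of direct download can be as simple as piping the tarball into docker load; this is a sketch with a placeholder URL, not bcbio's actual download location:

# fetch a pre-built image tarball and load it into the local docker daemon
# (placeholder URL, for illustration only)
curl -L https://example-bucket.s3.amazonaws.com/gemini-image.tar.gz | gunzip | docker load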

dgaston (Contributor) commented Mar 17, 2015

Good points, Brad. I have been following your use of Docker in bcbio, which is how I was introduced to Docker as a tool in the first place. I agree that keeping the data separate is the best idea; Docker volumes work well. I'll look at how you handle it in bcbio in terms of managing the user's download and having that path reflected inside the Docker container for linking, since that is usually specified in the Dockerfile.

chapmanb (Collaborator) commented:

Thanks for looking into this. I'd definitely be open to learning new ways of handling data, especially via data volumes. Some of that didn't exist, or I didn't understand it, when I wrote that code. It would be awesome to learn approaches that don't need so much custom code and rely more on all the cool Docker tooling.

brentp (Collaborator, Author) commented Mar 17, 2015

Thanks for all the insight. I updated the list.

If Docker 1.6 supports use without sudo and data volumes meet our needs, that should be a good combination to explore.

dgaston (Contributor) commented Mar 17, 2015

Probably worth exploring and experimenting before then. Obviously the permissions may be an issue for some users until that is fixed; the really security-conscious could isolate the install inside a Vagrant VM for an extra layer of isolation.

dgaston (Contributor) commented Mar 20, 2015

Hi everyone, I have started working on this, building out a functional image to start with and working my way up from there. I thought about different ways of doing it, including forking the GEMINI GitHub project itself and adding a Dockerfile, but for managing versions I think it is best to build an image that follows the installation script.

Right now I have an image that uses Ubuntu 14.04 as the base, with all the system-level dependencies installed (I think; I haven't run into any issues yet), and GEMINI and all of its dependencies installed using the install script with the no-data option.

This isn't currently set up like bcbio-nextgen-vm, where a wrapper script sits around all of the Docker components, so right now it is executed using the docker command line itself.

The next step is to handle volumes for interacting with the user's filesystem and to automount a data directory for the GEMINI data.

The Docker Hub repo is at https://registry.hub.docker.com/u/dgaston/gemini/

Follow-up steps will be to write a wrapper script that handles downloading and setting up the data directory and interacting with Docker. Any and all feedback is appreciated.
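
As a rough illustration, a Dockerfile following that approach might look something like the sketch below, written as a shell snippet that generates and builds it. The package list and install paths are assumptions on my part, not the contents of the actual image:

# sketch: build an image by running the install script with the no-data option
# (package names and paths are assumptions, not the actual dgaston/gemini image)
cat > Dockerfile <<'EOF'
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y \
    python2.7-dev git wget curl build-essential zlib1g-dev
RUN wget https://raw.github.com/arq5x/gemini/master/gemini/scripts/gemini_install.py \
    && python2.7 gemini_install.py --nodata /usr/local /usr/local/share/gemini
ENV PATH /usr/local/share/gemini/anaconda/bin:/usr/local/bin:$PATH
EOF
docker build -t gemini-nodata .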

brentp (Collaborator, Author) commented Mar 20, 2015

@dgaston, this is excellent! So, you were able to just use the gemini_install.py script to get all the deps?

Have you looked at the data volumes?

dgaston (Contributor) commented Mar 20, 2015

Yes, the gemini_install.py script executed with just --nodata works for creating the image. At least, the build completes successfully and I can create a container and run gemini without arguments, or gemini --help, and get the help output.

Since you still have to specify a datadir even when you run with --nodata, the datadir inside the container is already set in the gemini config file. So if you launch the container using the -v flag, you can specify a data volume. For instance, my gemini files are in /data/shared/gemini/data on the host and the datadir inside the container is /root/gemini/data:

docker run -t -v /data/shared/gemini/data:/root/gemini/data -i dgaston/gemini:v0.1a /bin/bash

That starts a Docker container with the host directory mounted as a data volume. The container isn't executing gemini, just running bash, so I can drop into it and check that things are in the right spot, and they are.

I could add another directory with, say, a VCF file I want to run, and check it from inside the container before I continue building on the Dockerfile and any external scripts.

Right now I am trying to figure out the best way of installing just the data into a local directory. I could write a simple script that grabs it from the AWS S3 bucket, but it would be nice to work with the versioning system that gemini_install.py has. At the moment it seems a little hard to untangle installing just the data from installing all of the GEMINI components.

dgaston (Contributor) commented Mar 20, 2015

@arq5x or @chapmanb might have some ideas there.

chapmanb (Collaborator) commented:

Daniel;
Great work putting this together. You're right on with the data directory. Mount the internal gemini data directory somewhere external, then you can run:

gemini update --dataonly

and it'll download/update just the data, mapping it to the non-Docker local directory. This is essentially all bcbio does with its scripts: it just manages a consistent set of directories so users don't need to know about the details. Thanks again for all the great work on this.
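
With the image from earlier in this thread, that flow might look like the following sketch; the host path and internal datadir are the ones dgaston reported above, and it assumes gemini is on the container's PATH:

# populate the host data directory by running the update inside the container
# (a sketch, not a tested command)
docker run -t -i -v /data/shared/gemini/data:/root/gemini/data \
    dgaston/gemini:v0.1a gemini update --dataonly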

dgaston (Contributor) commented Mar 21, 2015

Oh, that's perfect; of course the update will work on its own with the data-only flag. Cheers. I have updated the image so that the entrypoint is the gemini command itself, which means a command line like:

docker run -t -v /data/shared/gemini/data:/root/gemini/data -i dgaston/gemini:v0.1a

is all that is needed; it defaults to --help if no other command-line parameters are passed. You can mount multiple volumes, so mounting your current working directory as well would be a good idea, to keep results and output saved outside of the container (see the sketch below). It probably also makes sense to orchestrate things a bit further: mount the data directory into a data-only container first, and then use --volumes-from to get the data into the gemini container. That could be accomplished with a wrapper script and some config files, of course, but I'm looking into Docker's built-in orchestration tools, since they have acquired Fig and some other third-party applications for orchestrating multi-container applications.
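
For example, mounting the current working directory alongside the data keeps the database and any output on the host; this is a sketch, and the VCF and database names are placeholders:

# run a gemini subcommand through the image's entrypoint, writing results
# to the mounted working directory (file names are placeholders)
docker run -t -i \
    -v /data/shared/gemini/data:/root/gemini/data \
    -v "$(pwd)":/work -w /work \
    dgaston/gemini:v0.1a load -t snpEff -v my.vcf my.db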

dgaston (Contributor) commented Mar 23, 2015

Well, everything seems to be going smoothly. I updated the image to the 0.12.2 fix with cyvcf, and it is happily chugging away loading a VCF into a gemini database at the moment. This will work nicely for storing and re-running previous versions of GEMINI in a fairly straightforward way. The only real issue is that maintaining older versions of the data files is still a bit cumbersome.

brentp (Collaborator, Author) commented Mar 26, 2015

@dgaston, what did you use for the base image? I think this would be a nice way to test gemini on a completely new system so we make sure all of the data files and deps get pulled in.

brentp (Collaborator, Author) commented Mar 26, 2015

On a bare docker image, I have this recipe to run tests on the dev branch:

# start the container; the first time this will take a while.
docker run -t -i ubuntu:14.04 /bin/bash

# the commands below run inside the container's shell
apt-get update && apt-get -y upgrade
apt-get -y install python2.7-dev git-core wget curl g++-4.8 zlib1g-dev g++ build-essential
ln -s /usr/bin/python2.7 /usr/bin/python

# install
git clone https://github.com/arq5x/gemini.git && cd gemini \
    && git branch --track dev origin/dev \
    && git checkout dev \
    && python2.7 gemini/scripts/gemini_install.py --gemini-version unstable /tools /data

# test
PATH=/data/anaconda/bin:/tools/bin/:$PATH
cd /data/gemini && bash master-test.sh

dgaston (Contributor) commented Mar 27, 2015

I used Ubuntu 14.04 as my base, but you could definitely use a smaller one, I'm sure. There are some tweaks that can hopefully be made to flatten the image a bit more.

dgaston (Contributor) commented Mar 27, 2015

@brentp That could be translated into a Dockerfile for testing pretty easily as well. I could tweak it and push it to the same repo. I have my Dockerfile stashed in this git repo: https://github.com/dgaston/gemini-docker
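
For reference, a rough sketch of that translation, written as a shell snippet that generates and builds a test Dockerfile; it simply mirrors the recipe above and is untested:

# sketch only: translate the test recipe above into a Dockerfile and build it
cat > Dockerfile.test <<'EOF'
FROM ubuntu:14.04
RUN apt-get update && apt-get -y upgrade && \
    apt-get -y install python2.7-dev git-core wget curl g++-4.8 zlib1g-dev g++ build-essential
RUN ln -s /usr/bin/python2.7 /usr/bin/python
RUN git clone https://github.com/arq5x/gemini.git && cd gemini \
    && git branch --track dev origin/dev \
    && git checkout dev \
    && python2.7 gemini/scripts/gemini_install.py --gemini-version unstable /tools /data
ENV PATH /data/anaconda/bin:/tools/bin:$PATH
CMD cd /data/gemini && bash master-test.sh
EOF
docker build -f Dockerfile.test -t gemini-test .
docker run -t -i gemini-test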
