create docker container with all deps and data pre-installed #393
Regarding the data, it would probably be better to distribute the data separately and mount it as a shared volume within the docker container, along with the volume where the user wants to put outputs. While I don't think Docker Hub has size limits on images, there seem to be some recurring problems people have both pushing and pulling large images (moby/moby#2292 has reports for large layers going back to 2013 and as recently as 11 days ago). That said, I agree that having the data in the container greatly aids portability as well as reproducibility, since those images can/should be versioned.
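As a rough sketch of that volume layout, assuming a hypothetical image name and host paths (none of these are from this thread):

# hypothetical sketch: keep annotation data and outputs on the host
# and mount them into the container as volumes
docker run -t -i \
    -v /local/gemini/data:/gemini/data \
    -v /local/results:/output \
    some/gemini-image /bin/bash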
Good thoughts. I'd like to have the data inside so it's a single download to start running gemini. The data on the image should be ~3GB or so. The issue you link mentions a 5GB hard limit and that most of the problems are on push; I would check to see if pull is more reliable. I guess we could also create an image with everything but dbsnp and 1000G; that would remove the majority of the data from the image.
3 GB probably isn't too bad. Not sure why I was thinking it might be bigger. I agree that a single download would be good. I'm willing to test this out a bit in terms of creating an image and seeing if I can push/pull it from the public repo.
It would fit nicely with some other work that I am doing anyway. I am working on a web interface with features including gemini integration, and I want to move that to a Docker-based distribution.
@dgaston that would be awesome if you wanted to create an image. We have an impending release with data changes so it might be good to wait until then.
actually, it looks like it will be quite a bit larger than 3GB.
I don't think a […]. If this doesn't work, it should be simple to trigger the data download from within the container.
Brent and Daniel;
https://github.com/chapmanb/bcbio-nextgen-vm#docker-image-installation
This also includes a bunch of other tools but should at least be usable as a starting point. Regarding the discussion points: […]
https://github.com/chapmanb/bcbio-nextgen-vm#installation
Good points Brad. I have been following your use of Docker in bcbio, which is how I got introduced to Docker as a tool in the first place. I agree that keeping the data separate is the best idea, and Docker volumes work well. I'll look at how you handle it in bcbio in terms of managing the user download and having that path reflected in the Docker container itself for linking, since that is usually specified in the Dockerfile.
Thanks for looking into this. I'd definitely be open to learning new ways of handling data, especially via data volumes. Some of that didn't exist, or I didn't understand it, when I wrote that code. It would be awesome to learn ways that don't need so much custom code and rely more on all the cool Docker tooling.
thanks for all the insight. I updated the list. if docker 1.6 supports use without sudo and the data volumes meet our needs, then that should be a good combination to explore. |
Probably worth exploring and experimenting before then. Obviously the permissions may be an issue for some users until that is fixed. For the really security conscious, you could probably also isolate your install inside a Vagrant VM for further security.
Hi everyone, I have started working on this, building out a functional image to start with and working my way up from there. I thought about different ways of doing this, including forking the GEMINI github project itself and adding a Dockerfile, but for managing versions I think it is best to build an image following the installation script.

Right now I have an image that uses Ubuntu 14.04 as the base, with all dependencies needed at the system level installed (I think; I haven't run into any issues yet), and GEMINI and all its dependencies installed using the install script with the no-data option. This isn't currently set up like bcbio-nextgen-vm, where a wrapper script sits around all of the docker components, so right now it is executed using the docker command line itself. The next step is to handle volumes for interacting with the user's filesystem and to automount a data directory for the GEMINI data. The Dockerhub repo is at https://registry.hub.docker.com/u/dgaston/gemini/

Follow-up steps will be to write a wrapper script that handles downloading and setting up the data directory and interacting with docker. Any and all feedback is appreciated.
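A hedged example of running that image straight from the docker CLI (assuming gemini is on the image's PATH; the v0.1a tag is illustrative):

# pull the image from the Docker Hub repo above and run gemini's help
docker pull dgaston/gemini:v0.1a
docker run -t -i dgaston/gemini:v0.1a gemini --help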
@dgaston, this is excellent! So, you were able to just use the gemini_install.py script to get all the deps? Have you looked at the data-volumes? |
Yes, the gemini_install.py script executed with just --nodata works for creating the image. At least, the build completes successfully, and I can create a container and run gemini without arguments (or gemini --help) and get the help output. Since you still have to specify a datadir even when you run with --nodata, the datadir inside the container is already in the gemini config file. So if you launch the container using the -v flag you can specify a data volume. For instance, my gemini files are in /data/shared/gemini/data and the datadir is /root/gemini/data:

docker run -t -v /data/shared/gemini/data:/root/gemini/data -i dgaston/gemini:v0.1a /bin/bash

That starts a docker container with the host directory as a data volume. The container isn't executing gemini, just running bash, so I can drop into the container and check that things are in the right spot, and they are. I could add another directory with, say, a vcf file I want to run, and check it from inside the container before I continue building on the Dockerfile and any external scripts.

Right now I am trying to figure out the best way of installing just the data in a local directory. Of course I can write a simple script that grabs it from the AWS S3 bucket, but it would be nice to work with the versioning system that gemini_install.py has. It seems a little hard at the moment to untangle installation of all gemini components from just the data, though.
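A hedged sketch of adding that extra directory (the /work mount and my.vcf are illustrative, not from this thread):

# mount the data volume plus a working directory holding a vcf to test
docker run -t -i \
    -v /data/shared/gemini/data:/root/gemini/data \
    -v /data/shared/projects/current:/work \
    dgaston/gemini:v0.1a /bin/bash
# then, inside the container, something like:
# gemini load -v /work/my.vcf /work/my.db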
Daniel;
For installing just the data, you should be able to run gemini update --dataonly inside the container, with the shared data directory mounted as a volume, and it'll download/update just the data, mapping it to the non-docker local directory. This is essentially all bcbio does with the scripts; it just manages a consistent set of directories so users don't need to know about the details. Thanks again for all the great work on this.
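Concretely, a sketch of that data-only refresh, with the paths and tag borrowed from the examples above:

# refresh only the annotation data; writes land in the mounted host directory
docker run -t -i -v /data/shared/gemini/data:/root/gemini/data \
    dgaston/gemini:v0.1a gemini update --dataonly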
Oh that's perfect, of course the update will work on its own with the data-only flag. Cheers. I have updated the image on docker hub so that the entrypoint is the gemini command itself, meaning a command line like:

docker run -t -v /data/shared/gemini/data:/root/gemini/data -i dgaston/gemini:v0.1a

is all that is needed; it defaults to --help if no other command line parameters are passed. You can mount multiple volumes, so mounting your current working directory would be a good idea, to have persistent results and output saved outside of the container.

It probably also makes sense to orchestrate it a little further: mount the data directory into a data-only container first, and then use --volumes-from to get the data into the gemini container. That could be accomplished with a wrapper script and some config files, but I'm looking into docker's built-in orchestration tools, since they have acquired fig and some other third-party applications for orchestrating multi-container applications.
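A hedged sketch of that data-only container pattern (the gemini-data name is illustrative; --volumes-from is the standard Docker flag):

# create a named container that only holds the gemini data volume
docker run -v /data/shared/gemini/data:/root/gemini/data \
    --name gemini-data ubuntu:14.04 /bin/true
# run gemini with the volumes pulled in from the data container;
# with the gemini entrypoint, arguments go straight to gemini
docker run -t -i --volumes-from gemini-data dgaston/gemini:v0.1a --help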
Well, everything seems to be going smoothly. Updated the image to the 0.12.2 fix with cyvcf, and it is happily chugging away loading a vcf into a gemini database at the moment. This will work nicely for being able to store and re-run previous versions of GEMINI in a fairly straightforward way. The only real issue is that maintaining older versions of the data files is still a bit cumbersome.
@dgaston, what did you use for the base image? I think this would be a nice way to test gemini on a completely new system so we make sure all of the data files and deps get pulled in.
On a bare docker image, I have this recipe to run tests on the dev branch:

# start the container; the first time this will take a while
docker run -t -i ubuntu:14.04 /bin/bash
apt-get update && apt-get -y upgrade
apt-get -y install python2.7-dev git-core wget curl g++-4.8 zlib1g-dev g++ build-essential
ln -s /usr/bin/python2.7 /usr/bin/python
# install
git clone https://github.com/arq5x/gemini.git && cd gemini \
&& git branch --track dev origin/dev \
&& git checkout dev \
&& python2.7 gemini/scripts/gemini_install.py --gemini-version unstable /tools /data
# test
PATH=/data/anaconda/bin:/tools/bin/:$PATH
cd /data/gemini && bash master-test.sh
I used Ubuntu 14.04 as my base, but you could definitely use a smaller one, I'm sure. There are some tweaks that hopefully can be made to flatten the image a bit more.
@brentp That could be pretty easily translated into a Dockerfile for testing as well. I could tweak it and push it to the same repo. I have my Dockerfile stashed in this git repo: https://github.com/dgaston/gemini-docker |
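A minimal sketch of that translation, assuming the recipe above works unchanged as build steps (untested; the install and test paths come straight from the recipe):

FROM ubuntu:14.04
# system-level dependencies from the recipe above
RUN apt-get update && apt-get -y upgrade && \
    apt-get -y install python2.7-dev git-core wget curl g++-4.8 zlib1g-dev g++ build-essential && \
    ln -s /usr/bin/python2.7 /usr/bin/python
# install the dev branch of gemini via the install script
RUN git clone https://github.com/arq5x/gemini.git && cd gemini && \
    git checkout -t origin/dev && \
    python2.7 gemini/scripts/gemini_install.py --gemini-version unstable /tools /data
ENV PATH /data/anaconda/bin:/tools/bin:$PATH
# run the test suite by default
WORKDIR /data/gemini
CMD ["bash", "master-test.sh"]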
To ease use, it would be nice to have a docker container. Things to address:

- will dockerhub host a large image? [NO]
- this would also make a portable and shareable analysis environment.
@chapmanb @roryk I know you have thought about these, any further things to consider?