Run bcbio-nextgen genomic sequencing analyses using isolated containers and virtual machines

bcbio-nextgen-vm

Run bcbio-nextgen genomic sequencing analysis pipelines using code and tools on cloud platforms or isolated inside lightweight containers. This enables:

  • Improved installation: Pre-installing all required biological code, tools and system libraries inside a container removes the difficulties associated with supporting multiple platforms. Installation only requires setting up docker and downloading the latest container.
  • Pipeline isolation: Third-party software used in processing is fully isolated and will not impact existing tools or software. This eliminates the need for modules or PATH manipulation to provide partial isolation.
  • Full reproducibility: You can maintain snapshots of the code and processing environment indefinitely, providing the ability to re-run an older analysis by reverting to an archived snapshot.

This currently supports running on Amazon Web Services (AWS) and locally with lightweight docker containers. The bcbio documentation contains details on using bcbio-vm to run analyses on AWS. Work is also in progress to migrate bcbio's pipeline descriptions to the Common Workflow Language (CWL).

We support using bcbio-vm for both AWS and local docker usage on Linux systems. On Mac OSX, only AWS usage currently works. Local docker support for Mac OSX is a work in progress and we have more details on the current status below. We welcome feedback and problem reports.

Installation

  • Install bcbio-vm using conda with an isolated Miniconda Python and link to a location on your PATH:

    wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
    bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/bcbio-vm/anaconda
    ~/install/bcbio-vm/anaconda/bin/conda install --yes -c bioconda bcbio-nextgen-vm
    ln -s ~/install/bcbio-vm/anaconda/bin/bcbio_vm.py /usr/local/bin/bcbio_vm.py
    ln -s ~/install/bcbio-vm/anaconda/bin/arvados-cwl-runner /usr/local/bin/arvados-cwl-runner
    ln -s ~/install/bcbio-vm/anaconda/bin/cwltool /usr/local/bin/cwltool
    ln -s ~/install/bcbio-vm/anaconda/bin/conda /usr/local/bin/bcbiovm_conda
    

    If you're using bcbio-vm from your local machine to run on a pre-built remote AWS instance or on an Arvados cloud instance, this is all you need to get started. If you'd like to run locally or on a server with Docker, keep following the instructions to install the third-party tools and data.

  • Install docker on your system. You will need root permissions.

  • Set up a docker group to provide the ability to run Docker without being root. Some installations, like the Debian/Ubuntu packages, do this automatically. You'll also want to add the trusted user who will be managing and testing docker images to this group:

    sudo groupadd docker
    sudo service docker restart
    sudo gpasswd -a ${USERNAME} docker
    newgrp docker
    
  • Ensure the driver script is setgid to the docker group. This allows users to run bcbio-nextgen without needing to be in the docker group or have root access. To avoid security issues, bcbio_vm.py sanitizes input arguments and uses a small wrapper script to run the internal docker process as the calling user, so the process only has the permissions available to that user:

    sudo chgrp docker /usr/local/bin/bcbio_vm.py
    sudo chmod g+s /usr/local/bin/bcbio_vm.py
    
  • Install a dockerized bcbio-nextgen. This downloads the latest bcbio docker image with software and tools, along with genome data:

    bcbio_vm.py --datadir=~/install/bcbio-vm/data install --data --tools \
      --genomes GRCh37 --aligners bwa
    

    For more details on expected download sizes, see the bcbio system requirements documentation. By default, the installation will download and import the default docker image as bcbio/bcbio. You can specify an alternative image location with --image your_image_name, and skip the --tools argument if this image is already present and configured.

    If you have an existing bcbio-nextgen installation and want to avoid re-installing existing genome data, first symlink to the current installation data:

    mkdir -p ~/install/bcbio-vm/data
    cd ~/install/bcbio-vm/data
    ln -s /usr/local/share/bcbio_nextgen/genomes
    ln -s /usr/local/share/gemini/data gemini_data
    
  • If you didn't use the recommended installation organization (a shared directory with code under anaconda and data under data), save the data location once for each user of bcbio-nextgen. This avoids needing to specify the location of data directories on subsequent runs:

    bcbio_vm.py --datadir=~/install/bcbio-vm/data saveconfig
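
Before starting a local run, it can help to confirm that the docker and setgid steps above took effect. The following is a minimal sketch in plain shell; on_path, is_setgid and preflight are hypothetical helper names, not part of bcbio-vm, and the default driver path simply mirrors the symlink location used above:

```shell
# Hypothetical helpers (not part of bcbio-vm) for checking the setup steps.

# Succeeds when the named command can be found on PATH.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds when FILE exists and has its setgid bit set (as after `chmod g+s`).
is_setgid() {
  [ -g "$1" ]
}

# Prints one line per check and returns non-zero if any check fails.
# The driver path defaults to the symlink location used in the steps above.
preflight() {
  driver="${1:-/usr/local/bin/bcbio_vm.py}"
  rc=0
  if on_path docker; then
    echo "ok: docker found on PATH"
  else
    echo "missing: docker not found on PATH"
    rc=1
  fi
  if is_setgid "$driver"; then
    echo "ok: $driver is setgid"
  else
    echo "missing: $driver is not setgid"
    rc=1
  fi
  return "$rc"
}
```

Calling preflight with no arguments checks the default location; pass a different path if the driver script was linked elsewhere. The sketch only reports problems and does not change any permissions.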
    

Running

Usage of bcbio_vm.py is similar to bcbio_nextgen.py, with some cleanups to make the command line more consistent. To run an analysis on a prepared bcbio-nextgen sample configuration file:

bcbio_vm.py run -n 4 sample_config.yaml
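
A small wrapper can guard against a missing configuration file and pick a core count before launching a local run. This is a minimal sketch: run_cmd is a hypothetical helper (not part of bcbio-vm) that only prints the command, and defaulting to every online core reported by getconf is an assumption rather than a bcbio recommendation:

```shell
# Hypothetical helper (not part of bcbio-vm): print the local run command for
# a sample configuration, without executing it.
#   $1: path to the sample configuration file
#   $2: number of cores (optional; defaults to the online core count)
run_cmd() {
  config="$1"
  cores="${2:-$(getconf _NPROCESSORS_ONLN)}"
  if [ ! -r "$config" ]; then
    echo "cannot read sample configuration: $config" >&2
    return 1
  fi
  echo "bcbio_vm.py run -n $cores $config"
}
```

With a readable sample_config.yaml in the current directory, `run_cmd sample_config.yaml 4` prints the same command shown above, which can be inspected and then executed.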

To run distributed on a cluster using IPython parallel:

bcbio_vm.py ipython sample_config.yaml torque your_queue -n 64

bcbio-nextgen also contains tests that exercise docker functionality:

cd bcbio-nextgen/tests
./run_tests.sh docker
./run_tests.sh docker_ipython

Upgrading

bcbio-nextgen-vm enables easy updates of the wrapper code, tools and data. To update the wrapper code:

bcbio_vm.py install --wrapper

To update tools, with a download of the latest docker image:

bcbio_vm.py install --tools

To update the associated data files:

bcbio_vm.py install --data

Combine these arguments in a single command to update everything at once.
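
As a sketch of a combined update, the hypothetical helper below (not part of bcbio-vm) assembles one install command from the three arguments shown in this section and prints it rather than running it. It assumes the arguments can be given together, as the text above indicates:

```shell
# Hypothetical helper (not part of bcbio-vm): print a single
# `bcbio_vm.py install` command covering the requested components.
# Valid components: wrapper, tools, data.
upgrade_cmd() {
  if [ "$#" -eq 0 ]; then
    echo "usage: upgrade_cmd [wrapper] [tools] [data]" >&2
    return 1
  fi
  cmd="bcbio_vm.py install"
  for part in "$@"; do
    case "$part" in
      wrapper|tools|data) cmd="$cmd --$part" ;;
      *) echo "unknown component: $part" >&2; return 1 ;;
    esac
  done
  echo "$cmd"
}
```

For example, `upgrade_cmd wrapper tools data` prints `bcbio_vm.py install --wrapper --tools --data`.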

Development Notes

These notes are for building containers from scratch or developing on bcbio-nextgen.

Mac OSX docker support

Running Docker on Mac OSX requires using a virtual machine wrapper. The recommended approach is to use boot2docker, which wraps docker inside VirtualBox.

The current issue is mounting external directories into boot2docker. The mounts work as of Docker 1.3 but do not preserve the original user ID and group ID; instead, they are mounted as root. Since bcbio runs as the original user to avoid security issues, you don't have permissions to make modifications in the directories. There is an open issue on the problem and we're currently not sure about the best approach or workaround.
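
The symptom can be checked from a shell before starting a run. The following is a minimal sketch, assuming a shell whose test builtin supports -O (bash and dash do); check_workdir is a hypothetical helper, not part of bcbio-vm, and it only diagnoses the ownership problem rather than working around it:

```shell
# Hypothetical helper (not part of bcbio-vm): report whether a directory is
# usable by the calling user, which is the user bcbio runs as.
# Returns 0 if usable, 1 if not owned or not writable, 2 if not a directory.
check_workdir() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir"
    return 2
  fi
  # -O: owned by the effective user ID; -w: writable by the current user.
  if [ -O "$dir" ] && [ -w "$dir" ]; then
    echo "ok: $dir is owned and writable by $(id -un)"
    return 0
  fi
  # The third field of `ls -ldn` is the numeric owner ID of the directory.
  owner_uid=$(ls -ldn "$dir" | awk '{print $3}')
  echo "problem: $dir has owner uid $owner_uid, current uid is $(id -u)"
  return 1
}
```

Run inside the environment where the directory is mounted, a root-owned mount shows up as an owner uid of 0 that differs from the current user's uid.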

Also, if you experience timeouts while pulling the docker image on OSX, please try to reboot the VirtualBox VM running boot2docker and/or upgrade it via:

docker-machine upgrade <boot2docker_VM>

We'd be happy to accept patches/suggestions from interested Mac OSX users.

Docker image installation

Install the current bcbio docker image into your local repository by hand with:

docker pull bcbio/bcbio

The installer does this automatically, but this is useful if you want to work with the bcbio-nextgen docker image independently from the wrapper.

Updates

To update bcbio-nextgen in a local docker instance during development, first clone the development code:

git clone https://github.com/chapmanb/bcbio-nextgen
cd bcbio-nextgen

Edit the code as needed, then update your local install with:

bcbio_vm.py devel setup_install

You can update the tools in your local container with:

bcbio_vm.py devel upgrade_tools

and register a GATK jar inside the container with:

bcbio_vm.py devel register gatk /path/to/GenomeAnalysisTK.tar.bz2

Creating docker image

Docker hub builds the bcbio docker image. We manually trigger this build to avoid overloading Docker hub services with a long rebuild on every change to the bcbio repository.

Preparing pre-built genomes

bcbio_vm downloads pre-built reference genomes when running analyses, so they do not need to be present on the initial machine images. To create the pre-built tarballs for a specific genome, start and bootstrap a single bcbio machine using the elasticluster interface. On that machine, start a screen session, then run:

bcbio_vm.py devel biodata --genomes GRCh37 --aligners bwa --aligners bowtie2 --datatarget vep

This requires permissions to write to the biodata bucket.