Python scripts and modules for automated next gen sequencing analysis. These provide a fully automated pipeline for taking sequencing results from an Illumina sequencer, converting them to standard Fastq format, aligning to a reference genome, doing SNP calling, and producing a summary PDF of results.

The pipeline runs on single multicore machines, in compute clusters managed by LSF or SGE, or on the Amazon cloud. This tutorial describes running the pipeline on Amazon with CloudBioLinux and CloudMan.

The scripts are tightly integrated with the Galaxy web-based analysis tool. Tracking of samples occurs via a web based LIMS system, and processed results are uploaded into Galaxy Data Libraries for researcher access and additional analysis. See the installation instructions for the front end and a detailed description of the full system.

Overview of workflow

Code structure

The main scripts that handle automation of the analysis and storage are:

  • scripts/illumina_finished_msg.py -- Sits on the sequencer output machine; run via a cron job every hour to check for new output runs.

  • scripts/ -- Main server script; runs specific servers for top level analysis management, storage, or distributed processing of individual analysis steps:

    • Called with -q toplevel -- Coordinate sample processing, running the full automated analysis and optionally uploading results to Galaxy. The illumina_finished_msg script and a Galaxy graphical front end both send messages on this queue to start processing.

    • Called with no queue arguments -- Run individual steps in the analysis pipeline. Start multiple servers on distributed machines connected with a shared filesystem to allow scaling on a cluster or Amazon EC2.

    • Called with -q storage -- Manage long term storage of larger files, like qseq and images.

Specify system-specific information in configuration files:

  • config/transfer_info.yaml -- Configuration on the sequencing machine, specifying where to check for new sequencing data.
  • config/post_process.yaml -- Configuration for analysis and storage. This contains links to Galaxy, program locations and customization for processing algorithms.
  • config/universe_wsgi.ini -- Variables used from your Galaxy server configuration, including RabbitMQ details for communication between the sequencing and analysis machines.
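
As a minimal illustration of how these files fit together (not the pipeline's own loader), the analysis-side settings could be read as below; the function name is hypothetical and PyYAML is assumed to be installed:

    import ConfigParser

    import yaml


    def load_configs(post_process_yaml, universe_ini):
        # post_process.yaml holds the analysis and storage settings
        with open(post_process_yaml) as in_handle:
            config = yaml.safe_load(in_handle)
        # universe_wsgi.ini supplies the [galaxy_amqp] messaging details
        parser = ConfigParser.SafeConfigParser()
        parser.read(universe_ini)
        if parser.has_section("galaxy_amqp"):
            amqp = dict(parser.items("galaxy_amqp"))
        else:
            amqp = {}
        return config, amqp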

Scripts involved in the processing:

  • scripts/ -- Drives the high level analysis of sequencing lanes based on information specified through the Galaxy LIMS system or in a YAML configuration file. Also produces a PDF summary file with statistics on alignments, duplicates, GC distribution, quality scores, and other metrics of interest.
  • scripts/ -- Handles storing and uploading Fastq, alignment, analysis and summary files to Galaxy.


Required libraries and data files

The code drives a number of next-generation sequencing analysis tools; install these on any machines involved in the processing. The CloudBioLinux and Galaxy CloudMan projects contain automated scripts to help with installation.

Alternatively, the tools can be installed manually; the Requirements section below lists all software used by the pipeline.

Scripts and configuration

Clone a copy of the code from GitHub:

  git clone git://

Use a recent version of Python 2 (2.6 or 2.7), and install with:

  cd bcbb/nextgen && python setup.py install

The setup script installs required python library dependencies.

Copy the YAML and ini files in config and adjust them to match your environment. It is also a good idea to add any third-party binaries you are using to your $PATH.

Parallel execution

The pipeline runs in parallel in two different ways:

  • multiple cores -- Analyses will run in parallel using multiple cores on a single machine. This requires only the multiprocessing Python library, included by default with most Python installations. Change num_cores in your post_process.yaml file to specify the number of parallel cores to use (a minimal sketch of this mode follows the list).

  • parallel messaging -- This allows scaling beyond the cores on a single machine. It requires multiple machines with a shared filesystem, with communication handled using RabbitMQ messaging. This is ideally suited for clusters.
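
A minimal sketch of the multiple-cores mode, assuming each lane can be processed independently; process_lane is a hypothetical stand-in for the real per-lane work and num_cores mirrors the post_process.yaml setting:

    import multiprocessing


    def process_lane(lane_info):
        # placeholder for the real per-lane alignment and analysis work
        return lane_info


    def run_parallel(lane_items, num_cores):
        # distribute the independent lane jobs across the available cores
        pool = multiprocessing.Pool(num_cores)
        try:
            return pool.map(process_lane, lane_items)
        finally:
            pool.close()
            pool.join()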

To enable parallel messaging:

  1. Configure RabbitMQ as described below. Ensure all processing machines can talk to the RabbitMQ server on port 5672. Update universe_wsgi.ini to contain the server details.

  2. Change num_cores in post_process.yaml to messaging.

The helper script runs pipelines in a distributed cluster environment. It takes care of starting worker nodes, running the processing, and then cleaning up after jobs:

  1. Edit your post_process.yaml file to set parameters in the distributed section corresponding to your environment: this includes the type of cluster management, arguments to start jobs, and the number of workers to start.

  2. Run the helper script with the post_process.yaml configuration file, the directory of fastq files, and a run_info.yaml file specifying the fastq details, barcodes, and the types of analyses to run.

If you have a different architecture you can run the distributed processing by hand:

  1. Start the processing server on each processing machine. This takes one argument, the post_process.yaml file (which references the universe_wsgi.ini configuration).

  2. Run the analysis script. This offloads parallel work to the workers started in the previous step, but also does processing itself, so it should also be submitted as a job in cluster environments.


Testing

The test suite exercises the scripts driving the analysis, so it is a good starting point to ensure correct installation. Run the tests from the main code directory using nose:

  nosetests -v -s

tests/ exercises the full framework using an automatically downloaded test dataset. It runs through barcode deconvolution, alignment and full SNP analysis. Adjust the test configuration files for your environment:

  • tests/data/automated/post_process.yaml -- May need adjustment to point to installed software in non-standard locations. Change the num_cores parameter to test multicore and messaging-based parallel execution.
  • tests/data/automated/universe_wsgi.ini -- Defines the location of the RabbitMQ server for messaging based parallel execution during tests.
  • tests/data/automated/run_info.yaml -- The analysis variable can be changed to 'Standard' if SNP calling is not required in your environment. This will run a smaller pipeline of alignment and analysis.

RabbitMQ messaging server

RabbitMQ messaging manages communication between the sequencing machine and the analysis machine. This allows complete separation between all of the machines. The RabbitMQ server can run anywhere; an easy solution is to install it on the Galaxy and analysis server:

    (yum or apt-get) install rabbitmq-server

Set up RabbitMQ for passing Galaxy and processing messages:

    rabbitmqctl add_user <username> <password>
    rabbitmqctl add_vhost bionextgen
    rabbitmqctl set_permissions -p bionextgen <username> ".*" ".*" ".*"

Then adjust the [galaxy_amqp] section of your universe_wsgi.ini Galaxy configuration file. An example configuration is available in the config directory; you'll need to specifically change these three values:

    host = <host you installed the RabbitMQ server on>
    userid = <username>
    password = <password>
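
As a rough illustration of the message traffic involved (not the pipeline's exact code), a message could be published to the bionextgen vhost with the kombu library; the exchange name, queue name and payload below are assumptions for illustration only:

    from kombu import Connection, Exchange, Queue

    # hypothetical exchange/queue names; the vhost matches the rabbitmqctl setup above
    exchange = Exchange("bionextgen-exchange", type="direct")
    queue = Queue("toplevel", exchange, routing_key="toplevel")

    url = "amqp://<username>:<password>@<host>:5672/bionextgen"
    with Connection(url) as conn:
        producer = conn.Producer(serializer="json")
        # a hypothetical payload pointing at a finished run directory
        producer.publish({"directory": "/path/to/run"},
                         exchange=exchange, routing_key="toplevel",
                         declare=[queue])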

ssh keys

The sequencing, analysis and storage machines transfer files using secure copy. This requires that you can securely copy files between machines without passwords, using ssh public key authentication. You want to enable password-less ssh for the following machine combinations:

  • Analysis server to illumina_finished_msg machine
  • Storage server to illumina_finished_msg machine

Sequencing machines

The sequencer automation has been fully tested using Illumina GAII and HiSeq sequencing machines. The framework is general and supports other platforms; we welcome feedback from researchers with different machines at their institutions.

Illumina machines produce run directories named with the date, machine identifier, and flowcell ID. A shortened name, with just the date and flowcell ID, is used to uniquely identify each flowcell during processing.
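
A minimal sketch of deriving that shortened name, assuming the conventional <date>_<machine>_<run number>_<flowcell ID> directory layout; the function name is hypothetical:

    import os


    def short_flowcell_name(run_dir):
        # split e.g. <date>_<machine>_<run number>_<flowcell ID> on underscores
        parts = os.path.basename(run_dir.rstrip("/")).split("_")
        date, flowcell_id = parts[0], parts[-1]
        return "%s_%s" % (date, flowcell_id)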

Development environment

The installation instructions assume that you have full root access to install Python modules and packages (production environment). If this is not the case, you may want to install a Python virtualenv and other tools automatically in your $HOME to ease development, using the following script:

    wget && ./


Requirements

Next gen analysis

Processing infrastructure

  • RabbitMQ for communication between machines
  • LaTeX and pdflatex for report generation

Optional software for generating report graphs

  • R with ggplot2, plyr, sqldf libraries.
  • rpy2. Build R with shared libraries available: ./configure --enable-R-shlib.

Python modules installed with the package

Variant calling

Broad's GATK pipeline drives variant (SNP and indel) calling. This requires some associated data files, and also has some configurable options. The relevant section from the post_process.yaml file is:

coverage_depth: "low" # other options: high
coverage_interval: "exome" # other options: genome, regional
dbsnp: variation/dbsnp_132.vcf
train_hapmap: variation/hapmap_3.3.vcf
train_1000g_omni: variation/1000G_omni2.5.vcf
train_indels: variation/indels_mills_devine.vcf

The dbSNP and training files are from the GATK resource bundle. These are inputs into the training models for recalibration. The automated CloudBioLinux data scripts will download and install these in the variation subdirectory relative to the genome files.

The coverage_depth and coverage_interval are adjustable from the defaults within individual run_info.yaml files describing the samples; for example, to specify a fastq file with standard phred quality scores, full genome coverage and high sequencing depth:

- analysis: SNP calling
  quality_format: Standard
  coverage_interval: genome
  coverage_depth: high
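
A minimal sketch of how the per-sample overrides could be merged over the post_process.yaml defaults; the assumption that these settings live under an algorithm section of post_process.yaml, and the function name, are illustrative only:

    import yaml


    def sample_settings(post_process_yaml, run_info_yaml):
        # assumed location of the defaults; adjust to your post_process.yaml layout
        with open(post_process_yaml) as in_handle:
            defaults = yaml.safe_load(in_handle).get("algorithm", {})
        with open(run_info_yaml) as in_handle:
            samples = yaml.safe_load(in_handle)
        # per-sample keys from run_info.yaml win over the global defaults
        return [dict(defaults, **sample) for sample in samples]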

Internals: files generated by this pipeline

Initial Fastq files (pre-analysis)

After basecalling, a number of Fastq files are generated by the conversion script, with 1|2 in the file name indicating the forward and reverse read respectively. Only reads that pass the quality filter (PF) are kept. Quality scores are not converted to Sanger encoding, but are kept in the format produced by the instrument (currently Illumina 1.3+). Dots in the sequences are converted into Ns.
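
A minimal sketch of the dot-to-N conversion on a 4-line-per-record fastq file; the function name is hypothetical:

    def convert_dots_to_ns(in_fastq, out_fastq):
        with open(in_fastq) as in_handle:
            with open(out_fastq, "w") as out_handle:
                for i, line in enumerate(in_handle):
                    if i % 4 == 1:  # the sequence line of each fastq record
                        line = line.replace(".", "N")
                    out_handle.write(line)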

Post-processing generated files

These are distributed in three locations: the alignments directory, the *_barcode directories and the top-level run directory.

Top level directory

These files include plain text, images and structured data, useful both for automated parsing and for human inspection of different aspects of the sequencing results.

  • run_summary.yaml

Contains a structured, per-lane and per-barcode view of the run parameters and metrics (a small parsing sketch follows this file list):

- barcode_id: '2'
  barcode_type: SampleSheet
  lane: '8'
  metrics:
    Aligned: '1204752'
    Pair duplicates: '881332'
    Read length: '101'
    Reads: '11594437'
  request: ''
  researcher: ''
  sample: ''
  • PDF files

Taking a look at a specific sample in a lane, we find the different sub-components that make up the final summary report (*-summary.pdf). The summary contains GC content, read quality figures and an insert size histogram.

  • BigWig files

Per-sample files converted to BigWig format with wigToBigWig.

  • *_metrics files

They contain plain-text output metrics from Picard, which are parsed later and used to generate plots. For instance, sort-dup.align_metrics stores the output of net.sf.picard.analysis.CollectAlignmentSummaryMetrics.
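
A minimal sketch of pulling numbers back out of run_summary.yaml, assuming the per-barcode metrics are nested under a metrics key as in the example above (an assumption); the function name is hypothetical:

    import yaml


    def alignment_rates(run_summary_yaml):
        with open(run_summary_yaml) as in_handle:
            entries = yaml.safe_load(in_handle)
        for entry in entries:
            metrics = entry.get("metrics", {})
            aligned = float(metrics.get("Aligned", 0))
            reads = float(metrics.get("Reads", 0)) or 1.0
            # report the percentage of aligned reads per lane/barcode
            print "lane %s, barcode %s: %.1f%% aligned" % (
                entry.get("lane"), entry.get("barcode_id"),
                100.0 * aligned / reads)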

alignments directory

Contains the results of the alignments for each sample; for example, lane 1, barcode id 5 is aligned in SAM and BAM formats. For convenience, to facilitate SNP calling for instance, a sorted BAM file is also generated.


*_barcode directories

These contain fastq files following the naming scheme described above. They are the result of the demultiplexing process; the "unmatched" files contain the reads that did not pass the approximate barcode matching algorithm.




*-barcodes.cfg contains a simple mapping between barcode ids and the actual barcode sequences.


The _bc.metrics file contains the read distribution for each barcode:

2   11594437
3   20247932
9   14390566
unmatched   908420

Barcodes are added to the 3' end of the first sequence; that way, the files remain platform-independent and can be easily handled downstream. A GitHub discussion on the project explains how demultiplexing works in more detail. The demultiplexing itself is performed by a dedicated script.
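
A minimal sketch of the kind of approximate matching involved, checking the 3' end of each read against the known barcodes with a small mismatch allowance; the real script may differ in details such as indel handling and trimming, and the function name is hypothetical:

    def match_barcode(read_seq, barcodes, max_mismatch=1):
        # barcodes is a dict of barcode id -> barcode sequence
        best_id, best_mm = "unmatched", max_mismatch + 1
        for bc_id, bc_seq in barcodes.items():
            tail = read_seq[-len(bc_seq):]
            mismatches = sum(1 for a, b in zip(tail, bc_seq) if a != b)
            if mismatches < best_mm:
                best_id, best_mm = bc_id, mismatches
        return best_id

    # e.g. match_barcode("ACGTTTAGGC", {"2": "TTAGGC", "3": "ATCACG"}) -> "2"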
