Python scripts and modules for automated next gen sequencing analysis. These provide a fully automated pipeline for taking sequencing results from an Illumina sequencer, converting them to standard Fastq format, aligning to a reference genome, doing SNP calling, and producing a summary PDF of results.
The pipeline runs on single multicore machines, in compute clusters managed by LSF or SGE, or on the Amazon cloud. This tutorial describes running the pipeline on Amazon with CloudBioLinux and CloudMan.
The scripts are tightly integrated with the Galaxy web-based analysis tool. Tracking of samples occurs via a web based LIMS system, and processed results are uploading into Galaxy Data Libraries for researcher access and additional analysis. See the installation instructions for the front end and a detailed description of the full system.
The main scripts that handle automation of the analysis and storage are:
scripts/illumina_finished_msg.py-- Sits on the sequencer output machine; run via a cron job every hour to check for new output runs.
scripts/nextgen_analysis_server.py-- Main server script; runs specific servers for top level analysis management, storage, or distributed processing of individual analysis steps:
-q toplevel-- Coordinate sample processing, running the full automated analysis and optionally uploading results to Galaxy.
illumina_finished_msg.pyand a Galaxy graphical front end both send messages on this queue for starting processing.
Called with no queue arguments -- Run individual steps in the analysis pipeline. Start multiple servers on distributed machines connected with a shared filesystem to allow scaling on a cluster or Amazon EC2.
-q storage-- Manage long term storage of larger files, like qseq and images.
Specify system specific information in configuration files:
config/transfer_info.yaml-- Configuration on the sequencing machine, specifying where to check for new sequencing data.
config/post_process.yaml-- Configuration for analysis and storage. This contains links to Galaxy, program locations and customization for processing algorithms.
config/universe_wsgi.ini-- Variables used from your Galaxy server configuration, including RabbitMQ details for communication between the sequencing and analysis machines.
Scripts involved in the processing:
scripts/automated_initial_analysis.py-- Drives the high level analysis of sequencing lanes based on information specified through the Galaxy LIMS system or in a YAML configuration file. Also produces a PDF summary file with statistics on alignments, duplicates, GC distribution, quality scores, and other metrics of interest.
scripts/upload_to_galaxy.py-- Handles storing and uploading Fastq, alignment, analysis and summary files to Galaxy.
Required libraries and data files
The code drives a number of next-generation sequencing analysis tools; install these on any machines involved in the processing. The CloudBioLinux and Galaxy CloudMan projects contain automated scripts to help with installation:
- Install bioinformatics software
- Install data files like genome builds in association with Galaxy: my script and from the Galaxy team.
Or they can be install manually; the Requirements section below lists all software used by the pipeline.
Scripts and configuration
Clone a copy of the code from GitHub:
git clone https://github.com/chapmanb/bcbb.git
Use a recent version of Python 2 (2.6 or 2.7), and install with:
cd bcbb/nextgen && python setup.py install
The setup script installs required python library dependencies.
Copy the YAML & ini files in config and adjust them to match your environment. It is also a good idea to set your $PATH pointing to any third-party binaries you are using.
The pipeline runs in parallel in two different ways:
multiple cores -- Analyses will run in parallel using multiple cores on a single machine. This requires only the
mulitprocessingPython library, included by default with most Python installations. Change
post_process.yamlfile to specify the number of parallel cores to use.
parallel messaging -- This allows scaling beyond the cores on a single machine. It requires multiple machines with a shared filesystem, with communication handled using RabbitMQ messaging. This is ideally suited for clusters.
To enable parallel messaging:
Configure RabbitMQ as described below. Ensure all processing machines can talk to the RabbitMQ server on port 5672. Update
universe_wsgi.inito contain the server details.
distributed_nextgen_pipeline.py helper script runs pipelines in a
distributed cluster environment. It takes care of starting worker
nodes, running the processing, and then cleaning up after jobs:
post_process.yamlfile to set parameters in the
distributedsection corresponding to your environment: this includes the type of cluster management, arguments to start jobs, and the number of workers to start.
distributed_nextgen_pipeline.pyusing the same arguments as
post_process.yamlconfiguration file, the directory of fastq files, and a
run_info.yamlfile specifying the fastq details, barcodes, and the types of analyses to run.
If you have a different architecture you can run the distributed processing by hand:
Start the processing server,
nextgen_analysis_server.pyon each processing machine. This takes one argument, the
post_process.yamlfile (which references the
Run the analysis script
automated_initial_analysis.py. This will offload parallel work to the workers started in step 3 but also does processing itself, so is also submitted as a job in cluster environments.
The test suite exercises the scripts driving the analysis, so are a good starting point to ensure correct installation. Run tests from the main code directory using nose:
nosetest -v -s
tests/test_automated_analysis.py exercises the full framework using
an automatically downloaded test dataset. It runs through barcode
deconvolution, alignment and full SNP analysis. Tweak the
configuration for the tests for your environment:
tests/data/automated/post_process.yaml-- May need adjustment to point to installed software in non-standard locations. Change the num_cores parameter to test multiple processor and parallel execution.
tests/data/automated/universe_wsgi.ini-- Defines the location of the RabbitMQ server for messaging based parallel execution during tests.
tests/data/automated/run_info.yaml-- Change the
analysisvariable can to 'Standard' if SNP calling is not required in your environment. This will run a smaller pipeline of alignment and analysis.
RabbitMQ messaging server
RabbitMQ messaging manages communication between the sequencing machine and the analysis machine. This allows complete separation between all of the machines. The RabbitMQ server can run anywhere; an easy solution is to install it on the Galaxy and analysis server:
(yum or apt-get) install rabbitmq-server
Setup rabbitmq for passing Galaxy and processing messages:
rabbitmqctl add_user <username> <password> rabbitmqctl add_vhost bionextgen rabbitmqctl set_permissions -p bionextgen <username> ".*" ".*" ".*"
Then adjust the
[galaxy_amqp] section of your
Galaxy configuration file. An example configuration is available in
the config directory; you'll need to specifically change these three
[galaxy_amqp] host = <host you installed the RabbitMQ server on> userid = <username> password = <password>
The sequencing, analysis and storage machines transfer files using secure copy. This requires that you can securely copy files between machines without passwords, using ssh public key authentication. You want to enable password-less ssh for the following machine combinations:
- Analysis server to
- Storage server to
The sequencer automation has been fully tested using Illumina GAII and HiSeq sequencing machines. The framework is general and supports other platforms; we welcome feedback from researchers with different machines at their institutions.
Illumina machines produce run directories that include the date, machine identifier, and flowcell ID:
A shortened name, with just date and flowcell ID, is used to uniquely identify each flowcell during processing.
The installation instructions assume that you have full root access to install python modules and packages (production environment). If this is not the case, you may want to install a python VirtualEnv and other tools automatically on your $HOME to ease your development needs using the following script:
wget https://bitbucket.org/brainstorm/custom_env/raw/1cd4f4ae27d5/pyHost.sh && ./pyHost.sh
Next gen analysis
- bowtie or bwa
- GATK (version 1.2; only for variant calling)
- snpEff (version 2.0.2; only for variant calling)
- RabbitMQ for communication between machines
- LaTeX and pdflatex for report generation
Optional software for generating report graphs
- R with ggplot2, plyr, sqldf libraries.
rpy2. Build R with shared libraries available:
Python modules installed with the package
Broad's GATK pipeline drives variant (SNPs and Indels) analysis. This
requires some associated data files, and also has some configurable options. The
relevant section from the
post_process.yaml file is:
coverage_depth: "low" # other options: high coverage_interval: "exome" # other options: genome, regional dbsnp: variation/dbsnp_132.vcf train_hapmap: variation/hapmap_3.3.vcf train_1000g_omni: variation/1000G_omni2.5.vcf train_indels: variation/Mills_Devine_2hit.indels.vcf
The dbSNP and training files are from the GATK resource bundle. These are inputs into the training models for recalibration. The automated CloudBioLinux data scripts will download and install these in the variation subdirectory relative to the genome files.
coverage_interval are adjustable from the defaults
run_info.yaml files describing the sample; a fastq file
with standard phred quality scores, full genome coverage and high sequencing
- analysis: SNP calling algorithm: quality_format: Standard coverage_interval: genome coverage_depth: high
The YAML configuration file provides a number of hooks to customize analysis.
Place these under the
analysis keyword. For variant calling:
alignerAligner to use: [bwa, bowtie, bowtie2, mosaik, novoalign]
trim_readsWhether to trim off 3' B-only ends from fastq reads [false, true]
align_split_size: Split FASTQ files into specified number of records per file. Allows parallelization at the cost of increased temporary disk space usage.
variantcallerVariant calling algorithm [gatk, freebayes]
quality_formatQuality format of fastq inputs [illumina, standard]
coverage_intervalRegions covered by sequencing. Influences GATK options for filtering [exome, genome, regional]
coverage_depthDepth of sequencing coverage. Influences GATK variant calling [high, low]
hybrid_targetBED file with target regions for hybrid selection experiments.
variant_regionsBED file of regions to call variants in.
ploidyPloidy of called reads. Defaults to 2 (diploid).
Global reference files for variant calling and assessment:
train_indelsTraining files for GATK variant recalibration.
call_backgroundBackground VCF to use for calling.
Internals: files generated by this pipeline
Initial Fastq files (pre-analysis)
After basecalling, a number of FastQ files are generated via solexa_qseq_to_fastq.py script:
1_081227_B45GT6ABXX_1_fastq.txt 1_081227_B45GT6ABXX_2_fastq.txt (...)
The template of them being:
Where 1|2 is the forward and reverse read respectively. Only reads that pass the quality filter (PF) are kept. Quality scores are not converted to Sanger, but are kept in the produced format (currently Illumina 1.3+). The dots in the fastq files are converted into Ns.
Post-processing generated files
Those are distributed in three directories: alignments, barcode and the top level run directory.
Top level directory
Those files comprise both plain text files, images and structured data that can be useful both automatically and in human-readable form to get different aspects about the sequencing results.
Contains a structured view of the run global parameters.
- barcode_id: '2' barcode_type: SampleSheet lane: '8' metrics: Aligned: '1204752' Pair duplicates: '881332' Read length: '101' Reads: '11594437' request: '' researcher: '' sample: ''
- PDF files
Taking a look at an specific sample in a lane, we find the different
sub-components that conform the final summary report (
The summary contains GC content, read quality figures and an insert size
6_110126_B816J0ABXX_5_1_fastq_txt_quality.pdf 6_110126_B816J0ABXX_5_2_fastq.txt.segments.hist.pdf 6_110126_B816J0ABXX_5-sort-dup-gc.pdf 6_110126_B816J0ABXX_5_1_fastq.txt.segments.hist.pdf 6_110126_B816J0ABXX_5-sort_1_fastq_qual.pdf 6_110126_B816J0ABXX_5-sort-dup-insert.pdf 6_110126_B816J0ABXX_5_2_fastq_txt_quality.pdf 6_110126_B816J0ABXX_5-sort_2_fastq_qual.pdf 6_110126_B816J0ABXX_5-sort-summary.pdf
- BigWig files
Per-sample wigtoBigWig converted files.
They contain plain text values from picard, namely:
1_110126_B816J0ABXX_5-sort-dup.align_metrics 1_110126_B816J0ABXX_5-sort-dup.gc_metrics 1_110126_B816J0ABXX_5-sort-dup.dup_metrics 1_110126_B816J0ABXX_5-sort-dup.insert_metrics
They contain output metrics from Picard, which are parsed later and
used to generate plots. For instance, for
output from net.sf.picard.analysis.CollectAlignmentSummaryMetrics is
Contains the results of the alignments for each sample. As we see on the listing below, lane 1, barcode id 5 has been aligned in SAM and BAM formats. For convenience, to facilitate SNP calling, for instance, a sorted BAM file is also generated.
1_110126_B816J0ABXX_5.sam 1_110126_B816J0ABXX_5.bam 1_110126_B816J0ABXX_5-sort.bam 1_110126_B816J0ABXX_5_1_fastq.bam
Those contain fastq files conforming with the naming schema we've seen before. They are the result of the demultiplexing process, where the "unmatched" files contain the reads that have not passed the approximate barcoding matching algorithm:
4_110126_B816J0ABXX_1_1_fastq.txt 4_110126_B816J0ABXX_5_2_fastq.txt 4_110126_B816J0ABXX_1_2_fastq.txt 4_110126_B816J0ABXX_6_1_fastq.txt 4_110126_B816J0ABXX_5_1_fastq.txt 4_110126_B816J0ABXX_6_2_fastq.txt 4_110126_B816J0ABXX_unmatched_1_fastq.txt 4_110126_B816J0ABXX_unmatched_2_fastq.txt 4_110126_B816J0ABXX_bc.metrics SampleSheet-barcodes.cfg
*-barcodes.cfg contains a simple mapping between barcode id's and the actual barcode sequence:
3 ATCACGA 2 ACTTGAA 9 TAGCTTA (...)
_bc.metrics file has a plain read distribution for each barcode:
2 11594437 3 20247932 9 14390566 unmatched 908420
Barcodes are added to the 3' end of the first sequence. That way, it remains platform-independent and can be easily handled downstream. This GitHub discussion explains how demultiplexing works. The demultiplexing is performed by the barcode_sort_trim.py script.