cgpbox

The cgpbox project encapsulates the core Cancer Genome Project analysis pipeline in an easy-to-use docker image:

Docker Repository on Quay

The pipeline is optimised for somatic variant calling on Illumina paired-end sequencing data mapped with BWA-mem.

Analysis performed

cgpbox will perform the following analysis (not necessarily in this order):

  • Basic genotype call using the standard Sequenom SNP locations
    • GRCh37 locations here
  • A comparison of the called genotype between tumour and normal.
  • An evaluation of gender using 4 chrY specific SNPs
    • GRCh37 locations here
    • The list is not ordered: the first 2 are part of the standard Sequenom QCplex, and the additional ones are included for improved accuracy in patchy sequencing.
  • Copy Number Variation (CNV) using ascatNgs
  • Insertion and deletion (InDel) calling using cgpPindel
  • Single Nucleotide Variant (SNV) calling using CaVEMan
  • Gene annotation of SNV and InDel calls using VAGrENT
  • Structural Variation (SV) calls using BRASS
    • Basic gene annotation via grass

Running the docker image

The bulk of this repository manages the build of the docker image so that users don't have to build it themselves.

Provided you have a base system configured to run docker then you only need to fulfil the following requirements:

  1. Ability to provide a large workspace as a volume mount point.
    • Workspace needs to be ~25% of the sum of your BAM inputs.
    • Normally it is simplest to place the BAMs in the same area.
  2. 24 cores or more for sensible turn around times.
  3. ~4GB RAM per core

The required resources are unfortunately large but the system does run many elements in parallel to reduce wall-time.
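As a rough aid, the rules of thumb above can be turned into a quick estimate. This is a minimal sketch: the function names `workspace_gb` and `ram_gb` are illustrative and not part of cgpbox, and BAM sizes are assumed to be given in whole gigabytes.

```shell
# Sketch: estimate cgpbox resource needs from the rules of thumb above.
# Function names are illustrative, not part of cgpbox.

# Workspace in GB: ~25% of the summed BAM sizes (sizes given in whole GB).
workspace_gb() {
  total=0
  for size_gb in "$@"; do
    total=$((total + size_gb))
  done
  # Round up so the estimate never undershoots.
  echo $(( (total + 3) / 4 ))
}

# RAM in GB: ~4GB per core.
ram_gb() {
  echo $(( $1 * 4 ))
}

workspace_gb 120 80   # two BAMs of 120GB and 80GB -> 50
ram_gb 24             # the suggested minimum of 24 cores -> 96
```

Treat the results as a starting point only; the figures in the list above are approximate.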


Test run

The current test dataset takes quite a long time to run. We are working to find more suitable data that we can share. Please see Running your data to use your own sample pair.

To run the pre-built docker image with the test data log into a docker enabled host and run the following:

$ cd ~
$ curl -sSL --retry 10 -O https://raw.githubusercontent.com/cancerit/cgpbox/master/examples/run.params
$ export MOUNT_POINT=/some/large/storage
$ (docker run --rm -v $MOUNT_POINT:/datastore -v ~/run.params:/datastore/run.params quay.io/wtsicgp/cgp_in_a_box > ~/run.out) >& ~/run.err &

$MOUNT_POINT should be a storage area with ~25GB of space for this test.

Result files will be written to $MOUNT_POINT/output
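Before starting the test it may be worth confirming that the storage area really has the space. The following is a sketch only: `has_free_gb` is a hypothetical helper, not something shipped with cgpbox, and it assumes a POSIX `df` and `awk` on the host.

```shell
# Sketch: succeed only if DIR has at least MIN_GB gigabytes free.
# has_free_gb is a hypothetical helper, not part of cgpbox.
has_free_gb() {
  dir=$1
  min_gb=$2
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Falls back to the current directory if MOUNT_POINT is not set.
if has_free_gb "${MOUNT_POINT:-.}" 25; then
  echo "enough space for the test run"
else
  echo "less than 25GB free" >&2
fi
```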

Running your data

To analyse your own tumour/normal BAM pairs, modify the example run.params file referenced in Test run.

The run.params file contains comments to assist you but here are the critical items:

  • NAME_* - Should match the sample names found in the headers of the BAM files.
  • *_MT - Refers to data linked to the MuTant/tumour sample.
  • *_WT - Refers to data linked to the WildType/Normal sample.
  • BAM_* - Paths to the input BAM files as seen from inside the docker container (e.g. under /datastore), not the paths on the host.

You can also force the CPU count to a specified value. By default the image uses all cores available to the docker container. If you need more memory per core, set a value lower than the actual core count, for example CPU=4 (uncommenting the line if needed).
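A quick sanity check of an edited run.params can catch a missing item before a long run starts. This is a sketch under assumptions: the four variable names are inferred from the NAME_*, BAM_*, *_MT and *_WT patterns above and should be confirmed against examples/run.params, and `check_params` is a hypothetical helper.

```shell
# Sketch: check that a params file has an uncommented assignment for each
# critical item. The variable names are inferred from the patterns above
# (an assumption); confirm them against examples/run.params.
check_params() {
  rc=0
  for var in NAME_MT NAME_WT BAM_MT BAM_WT; do
    # A commented line such as "#BAM_WT=..." does not match.
    if ! grep -Eq "^[[:space:]]*${var}=" "$1"; then
      echo "missing: $var" >&2
      rc=1
    fi
  done
  return $rc
}
```

Usage would be along the lines of `check_params ~/run.params && echo "params look complete"`. The check only looks for the assignments; it does not verify that the BAM paths exist inside the container.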

Please see Input requirements.

PRE-EXEC array

This is an optional section to provide actions that should be performed prior to the main analysis being triggered. In the example run.params this downloads and unpacks the test dataset.

The uses are only limited by the tools available within the docker image (S3 tools are already included). If there is a good case for additional tools please raise an issue.

If not needed comment out or delete.
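To illustrate the idea of a list of actions run before the main analysis, here is a minimal sketch that runs command strings in order and stops at the first failure. It is not the code the image uses: `run_steps` and the example steps are hypothetical, and the real syntax is the one shown in examples/run.params.

```shell
# Sketch: run a list of command strings in order, stopping at the first
# failure. This illustrates the idea of a pre-exec list; it is not the
# code the image uses.
run_steps() {
  for step in "$@"; do
    echo "running: $step" >&2
    sh -c "$step" || return 1
  done
}

# Hypothetical steps; a real list might download and unpack input data.
run_steps 'echo fetch inputs' 'echo unpack inputs'
```

Stopping on the first failure is a design choice in this sketch: there is little point unpacking an archive whose download failed.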

POST-EXEC array

This is an optional section to provide actions that should be performed after the main analysis has completed. In the example run.params this shows how you could automatically trigger an upload to an S3 bucket.

The uses are only limited by the tools available within the docker image (S3 tools are already included). If there is a good case for additional tools please raise an issue.

If not needed comment out or delete.

Other params not documented here

There are some other parameters that have not been documented here as they relate to future features. Basic notes are included with all parameters in examples/run.params.

Input requirements

cgpbox expects to be provided with a pair of BAM files (one tumour, one normal) each:

  • Mapped with BWA-mem
    • Having valid ReadGroup headers including LB and SM tags
    • See SAM/BAM specification here for more details.
  • Duplicates marked.
  • BAM indexes created.
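Since the NAME_* values in run.params should match the sample names in the BAM headers, it can help to list those names first. In the SAM specification the sample name is the SM tag of an @RG header line. The sketch below reads header text on stdin; `rg_sample_names` is a hypothetical helper, not part of cgpbox.

```shell
# Sketch: list the distinct sample names (SM tags) found in @RG header
# lines read from stdin. rg_sample_names is a hypothetical helper.
rg_sample_names() {
  awk -F '\t' '
    $1 == "@RG" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}
```

Assuming samtools is installed on the host, a typical use would be `samtools view -H tumour.bam | rg_sample_names`, where tumour.bam is a placeholder for your own file. A correctly prepared BAM should normally yield exactly one name.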

Data mapped in a different fashion

Data mapped using a different algorithm may process successfully; however, we are unlikely to be able to provide detailed support.

If you already have a mapped BAM you can re-map with all of the above handled for you using the bwa_mem.pl script which is part of PCAP-core.

Monitoring

A simple web page lets you monitor the progress of your job. It only provides evidence that things are progressing, and requires the base host (not the docker container) to have python installed:

$ cd $MOUNT_POINT/site
$ sudo python -m SimpleHTTPServer 80 >& ~/monitor.log&

Then point your browser at:

http://yourhost/html/index.html
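Note that SimpleHTTPServer exists only in Python 2; in Python 3 the same built-in server is the http.server module. The sketch below picks the module name from the Python major version; `http_server_module` is a hypothetical helper.

```shell
# Sketch: print the built-in web server module for a Python major version.
# SimpleHTTPServer exists only in Python 2; Python 3 renamed it http.server.
http_server_module() {
  case $1 in
    2) echo SimpleHTTPServer ;;
    3) echo http.server ;;
    *) echo "unsupported Python major version: $1" >&2; return 1 ;;
  esac
}
```

On a host with Python 3, the equivalent of the command above would be `python3 -m http.server 8000` run from $MOUNT_POINT/site. Choosing an unprivileged port such as 8000 avoids the need for sudo, in which case the page is at http://yourhost:8000/html/index.html.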

Example display: startup

Example display: mid run

Output

On completion the data files used to generate the website are copied into the output location along with files containing timing/memory data. These can be found at $MOUNT_POINT/output/*.time and are of the form:

$ cat ascat.time
command:ascat.pl -o /datastore/output/HCC1143_vs_HCC1143_BL/ascat -t /datastore/output/tmp/HCC1143.bam -n /datastore/output/tmp/HCC1143_BL.bam -s /datastore/reference_files/ascat/SnpLocus.tsv -sp /datastore/reference_files/ascat/SnpPositions.tsv -sg /datastore/reference_files/ascat/SnpGcCorrections.tsv -r /datastore/reference_files/genome.fa -q 20 -g L -rs Human -ra GRCh37 -pr WGS -pl ILLUMINA -c 8
real:1390.62
user:2106.95
sys:40.48
text:0k
data:0k
max:2183804k
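Because each line is a simple key:value pair, the files are easy to summarise. This is a minimal sketch assuming the layout shown above; `time_summary` is a hypothetical helper, not part of cgpbox.

```shell
# Sketch: print "<real seconds> <max memory in kB>" for one .time file.
# time_summary is a hypothetical helper, not part of cgpbox.
time_summary() {
  awk -F: '
    $1 == "real" { real = $2 }
    $1 == "max"  { max = $2; sub(/k$/, "", max) }
    END          { print real, max }' "$1"
}
```

For the ascat.time example above this prints `1390.62 2183804`. To cover a whole run you could loop over the files, for example `for f in "$MOUNT_POINT"/output/*.time; do printf '%s ' "$f"; time_summary "$f"; done`.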

Additionally all of the data in the output folder is packaged as a tar.gz for easy retrieval (example data set: $MOUNT_POINT/result_HCC1143_vs_HCC1143_BL.tar.gz). Please see examples/run.params for an example of using post-exec to push your data to AWS.

Primary analysis software

cgpbox incorporates the following cancerit projects:

Dependencies

Additionally these have dependencies on the following software packages which may have different license restrictions to the cancerit packages:

LICENSE

Copyright (c) 2016 Genome Research Ltd.

Author: Cancer Genome Project cgpit@sanger.ac.uk

This file is part of cgpbox.

cgpbox is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

  1. The usage of a range of years within a copyright statement contained within this distribution should be interpreted as being equivalent to a list of years including the first and last year specified and all consecutive years between them. For example, a copyright statement that reads ‘Copyright (c) 2005, 2007-2009, 2011-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2007, 2008, 2009, 2011, 2012’ and a copyright statement that reads ‘Copyright (c) 2005-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012’.