CancerPipeline

######################################################################

CancerPipeline

This is a simple pipeline to automate the processing of cancer samples downstream of variant calling.

The pipeline steps are as follows:

Annotation

Filter variants according to coverage, allele frequency or quality cutoffs.
Annotate variants with Annovar and SNPeff.
Transform sample vcfs into mafs.
Unite all samples into a common output maf/vcf.

MuSiC

Prepare input files for MuSiC (i.e. intersection of samples available as bam and samples in the united maf).
Run all MuSiC tools sequentially.

MutSigCV

Run MutSigCV on the united maf.

CancerPipeline is written in Python/ruffus and aware of which steps have been run before on which files. Each rerun of a pipeline will thus only touch the files that have changed since the last time it was run, not stupidly rerun all of them.

##Input

CancerPipeline allows the definition of different projects, each with different filtering settings, a different name etc. If new samples arrive, a project can easily be rerun to bring all folders, files and functional analyses for that project up to date. In the following examples, our project is called myproject.

The pipeline expects all unfiltered input vcf to be located in one folder. Different projects can share the same raw vcf folder.

.
└── root
    └── vcfs_raw
        ├── sample1.vcf
        └── sample2.vcf

CancerPipeline will then create output folders per project for the filtered and annotated vcfs and an output folder for the functional analysis tools.

.
└── root
    ├── analysis_myproject
    │   ├── input
    │   │   └── myproject_input.txt
    │   └── output
    │       └── myproject_sign_genes.txt
    ├── vcfs_myproject
    │   ├── sample1_annotated.tsv
    │   ├── sample1.vcf
    │   ├── sample2_annotated.tsv
    │   └── sample2.vcf
    └── vcfs_raw
        ├── sample1.vcf
        └── sample2.vcf

File locations and settings per project are saved into a regular text file in Python configuration format (see below). The file contains information on each project that was run, serving both as a config file for the current runs and a log file for past runs.

Example project definition file

[myproject]

root = /home/user/root/
raw_vcf_folder = /home/user/root/vcfs_raw/
bam_folder = /home/user/bams/
bed = /home/user/root/data/myproject.bed
ref = /home/user/root/data/hg19.fasta

cpus = 1
verbose_logging = False 
min_cov = 100
min_varfreq = 0.05

Location settings

Use absolute paths!

root: the root folder of this pipeline
raw_vcf_folder: the folder containing the unfiltered vcfs
bed: the bed file for this panel
ref: path to the reference sequence (in fasta format)
bam_folder (optional): the folder containing the bam and bam.bai files. If not supplied, MuSiC will not be started.
whitelist (optional): a list of samples from the raw_vcf_folder to process (default: use whole folder)
blacklist (optional): a list of samples from the raw_vcf_folder to NOT process (default: use whole folder)

Filtering settings (optional)

min_cov: Minimum accepted coverage for a variant position
min_varfreq: Minimum variant frequency for a variant position
min_qual: Minimum quality for a variant position

Run flags (optional)

vcf_type: select the type of input vcf (iontorrent/illumina_strelka) (default: iontorrent)
cpus: number of CPUs to use in parallel (default: 1)
functional_analysis: if False, don't run MutSigCV and MuSiC (default: True)
verbose_logging: if set to True, will result in more output while running (default: False)
version_numbers_not_in_blacklist: legacy flag, don't use

##Usage

If there's only one project in the definition file, the only argument needed is the name of the file:

> python CancerPipeline.py project_definitions.txt

If there's more than one project defined, we need to select a project with the 2nd argument::

> python CancerPipeline.py project_definitions.txt myproject

##Output

For each unfiltered input mysample, the following outputs are created in a new folder named ./myproject/:

mysample.vcf: all variants that passed the filtering step
mysample.maf: all filtered variants, in maf format
mysample_annotated.tsv: all filtered variants with annotation from the vcf itself, Annovar and SNPeff

The following files are created for all unfiltered inputs together:

all_samples_myproject.maf: all filtered variants in maf format
all_samples_myproject.tsv: all filtered and annotated variants

The folder analysis_myproject contains intermediate and output files for MuSiC and MutSigCV. The list of significant genes created by is in ./analysis_myproject/output/myproject_mutsigcv.sig_genes.txt

##Dependencies

CoverageCheck was developed on Ubuntu 13.10 with Python 2.7 and the Python packages numpy (1.8.1+), pandas (0.13.1+) and ruffus (2.4+).

The pipeline depends on the installation of the following 3rd-party tools:

Annovar (including the databases for all annotation sources, see Annovar documentation at http://www.openbioinformatics.org/annovar/annovar_download.html)
SNPeff (v3.5)
vcf2maf (github.com/ckandoth/vcf2maf)
MutSigCV (v1.4)
Genome Music (v0.4)

The install directories of these tools need to be set in ToolConfig.py.

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
AnnotationRowParsers.py		AnnotationRowParsers.py
AnnotationRowParsers.pyc		AnnotationRowParsers.pyc
CancerPipeline.py		CancerPipeline.py
README.md		README.md
SampleAnnotation.py		SampleAnnotation.py
SampleAnnotation.pyc		SampleAnnotation.pyc
ToolConfig.py		ToolConfig.py
ToolConfig.pyc		ToolConfig.pyc
annotate_individual_vcf.py		annotate_individual_vcf.py
automate_annovar.py		automate_annovar.py
automate_annovar.pyc		automate_annovar.pyc
automate_mutsig.py		automate_mutsig.py
automate_mutsig.pyc		automate_mutsig.pyc
automate_snpeff.py		automate_snpeff.py
automate_snpeff.pyc		automate_snpeff.pyc
automate_vcf2maf.py		automate_vcf2maf.py
automate_vcf2maf.pyc		automate_vcf2maf.pyc
example_project_definitions.txt		example_project_definitions.txt
pipeline.png		pipeline.png
pipeline_config.py		pipeline_config.py
prepare_and_run_music.py		prepare_and_run_music.py
prepare_and_run_music.pyc		prepare_and_run_music.pyc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CancerPipeline

About

Releases

Packages

Languages

aweller/CancerPipeline

Folders and files

Latest commit

History

Repository files navigation

CancerPipeline

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages