A comprehensive pipeline to analyze and visualize structural variants
This is the master branch, which implements the TCGA_Virus workflow.
BreakPointSurveyor (BPS) is a set of core libraries (BreakPointSurveyor-Core) and workflows (this project) which, with optional external tools, evaluate genomic sequence data to discover, analyze, and provide a visual summary of breakpoint events.
The BreakPointSurveyor project provides three reference workflows, each implemented as a separate git branch. These workflows (and the links to view them) are:
- TCGA_Virus (master branch): Comprehensive workflow and data for one TCGA virus-positive sample (TCGA-BA-4077-01B-01D-2268-08) which has been aligned to a custom reference
- 1000SV (1000SV branch): Analysis of discordant reads in a publicly available human sample
- Synthetic (Synthetic branch): Creation and analysis of a small dataset containing inter-chromosomal and intra-chromosomal breakpoints
Matthew A. Wyczalkowski, Kristine M. Wylie, Song Cao, Michael D. McLellan, Jennifer Flynn, Mo Huang, Kai Ye, Xian Fan, Ken Chen, Michael C. Wendl, Li Ding; BreakPoint Surveyor: A Pipeline for Structural Variant Visualization. Bioinformatics 2017. doi: 10.1093/bioinformatics/btx362
Download BreakPointSurveyor and its three example workflows with,
git clone --recursive https://github.com/ding-lab/BreakPointSurveyor.git
See here for detailed installation instructions. The Getting started with the Synthetic branch section has instructions on working with a relatively small test dataset. See also the BPS developer guide for information about implementing your own workflow.
Getting Started with Docker
Draft documentation, though Docker is the way to go
./data is not a good option if ./data already exists.
docker pull mwyczalkowski/breakpoint_surveyor
mkdir ./data_docker
docker run -v $PWD/data_docker:/data -it mwyczalkowski/breakpoint_surveyor
This will download the container and then run it. The host directory ./data_docker is mounted as /data inside the container. Once the container is running, run the entire Synthetic workflow with,
cd /usr/local/BreakPointSurveyor
git checkout Synthetic # this ends up being weird
./run_BPS A*
... etc
NOTE: git checkout Synthetic does not work as expected; it leaves master directories hanging around.
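Because an existing ./data directory is a poor choice for the mount point, the host directory can be chosen defensively. The following is a minimal sketch, not part of BPS: `pick_data_dir` and `start_bps_container` are hypothetical helper names, and the only Docker commands used are the `docker pull` and `docker run -v ... -it` calls already shown above.

```shell
# Hypothetical helper (not part of BPS): print the first path, starting from
# the given base name, that does not exist yet (base, base_1, base_2, ...).
pick_data_dir() {
    local base="${1:-./data_docker}"
    local candidate="$base"
    local n=1
    while [ -e "$candidate" ]; do
        candidate="${base}_${n}"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Hypothetical wrapper around the commands shown above: create an unused host
# directory and mount it at /data inside the container.
start_bps_container() {
    local host_dir
    host_dir="$(pick_data_dir "${1:-./data_docker}")"
    mkdir -p "$host_dir" || return 1
    docker pull mwyczalkowski/breakpoint_surveyor
    # docker -v expects an absolute host path, so resolve it first
    docker run -v "$(cd "$host_dir" && pwd):/data" -it mwyczalkowski/breakpoint_surveyor
}
```

Calling `start_bps_container` with no arguments uses ./data_docker if it is free, and ./data_docker_1, ./data_docker_2, and so on otherwise.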
BPS generates two types of plots: structure plots and expression plots. The figures below are generated by the TCGA_Virus workflow.
Structure plots visualize breakpoints as points with X,Y coordinates given by the breakpoint position along each chromosome. Such figures also display read depth, gene and exon annotations, and a copy number histogram. In this workflow, read depth and discordant reads are obtained from aligned WGS data, and calls from various structural variant tools are shown. Breakpoint predictions from other tools, whether from WGS or RNA-Seq data, can be readily integrated into the structure plot.
See T_PlotStructure for interpretation and details.
Expression plots illustrate relative gene expression near breakpoints, with gene position, size, orientation, and name shown. Expression is obtained for the sample and a population of controls from either processed expression data (e.g., TCGA RSEM) or RNA-Seq data directly.
See U_PlotExpression for interpretation and details.
The BreakPoint Surveyor project has three layers:
- BPS Core: core analysis and plotting, typically in R or Python
- BPS Workflow: project- and locale-specific workflows, mostly implemented as Bash scripts
- BPS Data: BPS-generated secondary data, graphical objects, and plots
For convenience, the workflows demonstrated here combine the Workflow and Data layers; also, the Core layer is implemented as a submodule and downloaded together with this project.
The BPS workflow is designed for scalability, and has been used to process batches of hundreds of whole genome and RNA-Seq datasets. It consists of a series of directories, each of which implements a stage in the BPS workflow. The order of processing is indicated by the stage prefix. The figure below illustrates the stages and their relationship in the TCGA_Virus workflow.
Below is a list of the stages associated with the TCGA_Virus workflow (master branch) and their descriptions:
- A_Reference: Reference-specific analysis and files.
- B_ExonGene: Generate exon and gene definitions files.
- C_Project: Create list of BAMs, both realigned WGS and RNA-Seq. Create BAMs in
- F_PindelRP: Run Pindel and process breakpoint predictions.
- G_Discordant: Process realigned BAM file to extract discordant human-virus reads
- H_NovoBreak: Identify breakpoint with novoBreak
- I_Contig: Create contigs using Tigra-SV and realign them
- J_PlotList: Identify target regions for further processing and visualization
- K_ReadDepth: Evaluate read depth in target regions, obtain BAM file statistics for both WGS and RNA-Seq data
- L_Expression: Analyze expression in vicinity of integration events using RNA-Seq data.
- M_RSEM_Expression: Analyze expression in vicinity of integration events using TCGA RSEM data.
- N_DrawBreakpoint: Plot breakpoint coordinates from various predictors to breakpoint panel GGP.
- O_DrawDepth: Create read depth/copy number panel GGP and add breakpoint predictions
- P_DrawAnnotation: Create annotation panel GGP showing genes and exons
- Q_DrawHistogram: Create histogram panel GGP showing distribution of read depth
- T_PlotStructure: Assemble GGP panels into BPS structure plot and save as PDF
- U_PlotExpression: Create BPS Expression plot based on expression P-values and save as PDF
The 1000SV and Synthetic workflows generally have a subset of these stages. See BPS Developer Guide for additional information about developing new workflow stages. The BreakPointSurveyor-Core project (distributed as a submodule of this project) has details about BPS utilities underlying these stages.
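Since processing order follows the letter prefix of each stage directory, the order can be recovered by sorting directory names. The sketch below is illustrative only: `list_stages` is a hypothetical helper, not a BPS script, and it assumes stage directories are named with a single capital letter followed by an underscore, as in the list above.

```shell
# Hypothetical helper (not part of BPS): print stage directories such as
# A_Reference or G_Discordant in processing order, one per line.
list_stages() {
    local root="${1:-.}"
    local d
    # The trailing slash in the glob restricts matches to directories.
    for d in "$root"/[A-Z]_*/; do
        [ -d "$d" ] || continue   # skip the literal pattern when nothing matches
        basename "$d"
    done | LC_ALL=C sort
}
```

Running `list_stages .` from the top of a checked-out workflow branch would print the stages present on that branch, which for 1000SV and Synthetic is a subset of the list above.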
Genomic datasets tend to be very large and frequently have restrictions on access and distribution. Each of the three workflows operates on distinct datasets of various size, clinical relevance, and availability, to demonstrate different BreakPointSurveyor capabilities.
In general, the workflows include all intermediate data which is allowed to be distributed and which is not prohibitively large.
TCGA_Virus workflow
The TCGA_Virus workflow provides an in-depth analysis of a virus integration event in the TCGA WGS sample (TCGA-BA-4077-01B-01D-2268-08), which is a head and neck cancer sample. Because of TCGA restrictions we do not distribute any sequence data. After downloading, sequence data was aligned to a custom reference which includes human and virus sequences (details). We do not distribute the reference because of size constraints.
Relative expression calculations require a case and a population of controls. We provide two examples of expression calculations:
- Expression calculated directly from RNA-Seq data (RPKM)
- Expression obtained from a precomputed matrix of expression (TCGA RSEM)
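For orientation, RPKM (reads per kilobase of transcript per million mapped reads) is conventionally defined as 10^9 × C / (N × L), where C is the number of reads mapped to the gene, L is the gene length in bases, and N is the total number of mapped reads. The sketch below only evaluates that standard definition; `rpkm` is a hypothetical helper, the input numbers are made up for illustration, and the exact normalization used in the L_Expression stage is documented with that stage, not here.

```shell
# Hypothetical helper: RPKM from (reads in gene, gene length in bp, total mapped reads).
# RPKM = 1e9 * reads_in_gene / (gene_length_bp * total_mapped_reads)
rpkm() {
    awk -v c="$1" -v len="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", 1e9 * c / (len * n) }'
}

# Illustrative values only: 500 reads on a 2000 bp gene, 20 million mapped reads.
rpkm 500 2000 20000000   # prints 12.50
```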
The 1000SV workflow investigates interchromosomal human-human breakpoints in a publicly available human sample from the 1000 Genomes project, NA19240, which was sequenced at high (80X) coverage; this 65Gb file can be downloaded here.
The analysis focuses on two events with interchromosomal discordant reads. Expression analysis is not performed in the 1000SV workflow. We demonstrate using attributes to provide additional information about discordant reads.
The Synthetic workflow generates simple breakpoints (inter- and intra-chromosomal) and corresponding synthetic read datasets of modest size which can be analyzed and visualized in BPS. We create a custom reference, consisting only of the chromosomes of interest, for improved performance (this reference is not distributed due to size).
We then generate a breakpoint sequence from sections of the human reference, and synthetic (simulated) reads are created. These are re-aligned to the custom reference. The resulting BAM file is then analyzed similarly to the 1000SV workflow. Expression analysis is not performed in the Synthetic workflow.
The Synthetic branch also illustrates more elaborate exon/gene annotations as well as an intrachromosomal inversion/duplication event.
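The idea of building a breakpoint sequence from sections of a reference can be illustrated with a toy example. This is a sketch under stated assumptions, not the Synthetic workflow's implementation: `make_breakpoint_seq` is a hypothetical helper that works on short literal strings, whereas the workflow operates on real reference sequence and then simulates and realigns reads.

```shell
# Hypothetical toy helper: join the first a_end bases of sequence A with the
# part of sequence B that starts at 1-based position b_start.
make_breakpoint_seq() {
    local seq_a="$1" a_end="$2" seq_b="$3" b_start="$4"
    local left right
    left="$(printf '%s\n' "$seq_a" | cut -c "1-${a_end}")"
    right="$(printf '%s\n' "$seq_b" | cut -c "${b_start}-")"
    printf '%s%s\n' "$left" "$right"
}

# Junction between base 4 of the first segment and base 5 of the second.
make_breakpoint_seq ACGTACGT 4 TTTTGGGG 5   # prints ACGTGGGG
```

Passing segments from two different chromosomes mimics an inter-chromosomal breakpoint; passing two segments of the same chromosome mimics an intra-chromosomal one.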
The Synthetic workflow utilizes a relatively small dataset which is created from scratch, and can be run relatively quickly on a laptop computer. It is a good place to start working with BPS.
There are a number of dependencies you'll need to install to get started. You'll need the Core dependencies as well as BWA, described here.
Get a fresh copy of BPS and switch to the Synthetic branch with,
git clone --recursive https://github.com/ding-lab/BreakPointSurveyor.git
cd BreakPointSurveyor
git checkout Synthetic
Edit bps.config to locate the installed software.
The idea is to run each stage in order according to its first letter. You can run an entire stage with,
Each of these eleven stages consists of one or more steps. These steps are named starting with a number (e.g., 1_get_BAM_paths.sh), and consist of shell scripts which execute a specific task. See the documentation for each stage, as well as the contents of each step's script file, for details about implementation and debugging.
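When debugging a single stage it can help to run its numbered steps by hand in order. The following is a minimal sketch of that loop; `run_stage_steps` is a hypothetical helper and not the project's own runner, and it assumes each step can be started with `bash` from inside its stage directory.

```shell
# Hypothetical helper (not part of BPS): run every numbered step script in one
# stage directory in ascending numeric order, stopping at the first failure.
run_stage_steps() {
    local stage_dir="$1"
    (
        cd "$stage_dir" || exit 1
        for f in [0-9]*_*.sh; do
            [ -f "$f" ] && printf '%s\n' "$f"
        done |
        sort -t_ -k1,1n |   # numeric sort on the prefix, so 10_ runs after 2_
        while IFS= read -r step; do
            echo "Running $stage_dir/$step" >&2
            # stdin is redirected so a step cannot consume the list of steps
            bash "$step" < /dev/null || exit 1
        done
    )
}
```

For example, `run_stage_steps C_Project` would run 1_get_BAM_paths.sh before any step numbered 2 or higher, and return a non-zero status if a step fails.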
Performance per stage for TCGA_Virus branch, obtained with
- A_Reference B_ExonGene C_Project: <1 second
- F_PindelRP: 124 seconds
- G_Discordant: 3800 seconds
- H_NovoBreak: 1614 seconds
- I_Contig: 210 seconds
- J_PlotList: 666 seconds
- K_ReadDepth: 3329 seconds
- L_Expression: 1602 seconds
- M_RSEM_Expression: 649 seconds
- N_DrawBreakpoint O_DrawDepth P_DrawAnnotation Q_DrawHistogram T_PlotStructure U_PlotExpression: 20 seconds
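As a quick check of overall run time, the per-stage figures above can be summed. This sketch only adds the reported numbers (the group listed as under one second is left out); `fmt_hms` is a hypothetical formatting helper.

```shell
# Per-stage timings in seconds, copied from the list above.
total=0
for t in 124 3800 1614 210 666 3329 1602 649 20; do
    total=$((total + t))
done

# Hypothetical helper: format a number of seconds as hours, minutes, seconds.
fmt_hms() {
    printf '%dh %dm %ds\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

echo "Total: ${total} s ($(fmt_hms "$total"))"   # Total: 12014 s (3h 20m 14s)
```

Three stages (G_Discordant, K_ReadDepth, and H_NovoBreak) account for 8743 of the 12014 seconds, so they dominate the total.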
Matthew A. Wyczalkowski, firstname.lastname@example.org
This software is licensed under the GNU General Public License v3.0
This work was supported by the National Cancer Institute [R01CA178383, R01CA180006, 1U24CA211006-01, and 1U24CA210972-01 to Li Ding, R01CA172652 to Ken Chen]; and National Human Genome Research Institute [U01HG006517 to Li Ding].