Skip to content
Browse files

initial commit

  • Loading branch information...
gavinha committed Aug 7, 2018
1 parent dd5ab71 commit 8d8df71a117e040c48f2be2ee0fe93741debef0f
Showing with 194 additions and 2 deletions.
  1. +194 −2
@@ -1,2 +1,194 @@
# SV_10X_analysis
Structural variation analysis of 10X Genomics linked-read data
# *Snakemake workflow for structural rearrangement analysis of 10X Genomics linked-read WGS*

## Description
This workflow will run the [SvABA]( structural variation (SV) analysis for a set of tumor-normal pairs, starting from the BAM files aligned using [Long Ranger]( software. The analysis includes structural variant prediction and assessment of barcode (BX) overlap from 10X linked-reads. It will also look for [TitanCNA]( results and combine these results to output SV classes.
This analysis was used in and described in the publicationL
Viswanathan SR*, Ha G*, Hoff A*, et al. Structural Alterations Driving Castration-Resistant Prostate Cancer Revealed by Linked-Read Genome Sequencing. *Cell* 174, 433–447.e19 (2018).

## Contact
Gavin Ha
Fred Hutchinson Cancer Research Center
contact: <> or <>
Date: August 3, 2018

## Requirements
### Software packages or libraries
- R-3.4
- Python 3.4
- snakemake-3.12.0
- [samtools-1.3.1](

# Files in the workflow
### Scripts used by the workflow
The following scripts are used by this snakemake workflow:
- [barCodeOverlap.R](code/barCodeOverlap.R) -
- [combineSVABAandTITAN.R](code/combineSVABAandTITAN.R) -
- [plotTitanSvaba.R](code/plotTitanSvaba.R) - Extracts allelic counts from the tumor sample at the germline heterozygous SNP sites
- [svaba_utils.R](code/svaba_utils.R) - Main R script to run TitanCNA
- [tenX_utils.R](code/tenX_utils.R) - R script to select optimal solution for each sample
- [plotting.R](code/plotting.R) - R script to merge autosomes and chrX results, plus post-processing steps including adjusting max copy values.

### Tumour-Normal sample list [config/samples.yaml](config/samples.yaml)
The list of tumour-normal paired samples should be defined in a YAML file. In particular, the [Long Ranger]( (v2.2.2) analysis directory is listed under samples. See `config/samples.yaml` for an example. Both fields `samples` and `pairings` must to be provided. `pairings` key must match the tumour sample while the value must match the normal sample.
tumor_sample_1: /path/to/tumor/longranger/dir
normal_sample_1: /path/to/normal/longranger/dir
tumor_sample_1: normal_sample_1

# Running the analysis

## 1. Invoking the snakemake workflow for SvABA
The first snakemake file [svaba.snakefile](svaba.snakefile) will
a. Run the full [SvABA]( analysis
b. Compute barcode overlap values for barcode rescue in the next step.
# show commands and workflow
snakemake -s svaba.snakefile.snakefile -np
# run the workflow locally using 5 cores
snakemake -s svaba.snakefile --cores 5

Users can use launch the jobs on a cluster.
An implementation that works with Broad UGER (qsub) is provided. Parameters for memory, runtime, and parallel environment can be specified directly in the snakemake files; default values for each rule has already been set in `params` within the [config.yaml](config/config.yaml) and the command below can be used as-is. Other cluster parameters can be set directly in [](config/
Note: users will need to adjust these for use with their cluster-specific settings
snakemake -s svaba.snakefile --cluster-sync "qsub -l h_vmem={params.mem},h_rt={params.runtime} {}" -j 50 --jobscript config/

## 2. Integrating SV and copy number results ([TitanCNA](
The second snakemake file [combineSvabaTitan.snakefile](combineSvabaTitan.snakefile) will
a. Combine SvABA and [TitanCNA (10X analysis)]( results to assign copy number to the breakpoints and rearrangement classes to the SV events.
b. Plot copy number and SV for each chromosome.
snakemake -s combineSvabaTitan.snakefile --cores 5
snakemake -s combineSvabaTitan.snakefile --cluster-sync "qsub -l h_vmem={params.mem},h_rt={params.runtime} {}" -j 50 --jobscript config/

To generate custom zoomed-in plots for a region of interest, users can specify the coordinates in the [config.yaml](config/config.yaml) file. Change the values in these fields, for example:
plot_zoom: TRUE
plot_chrs: X
plot_startPos: 66000000
plot_endPos: 67000000

# Configuration and settings
All settings for the workflow are contained in [config/config.yaml](config/config.yaml). The settings are organized by paths to scripts and reference files and then by each step in the workflow.

### 1. Path to tools
- `svaba_exe` is the compiled SvABA executable
svaba_exe: /path/to/svaba
samTools: /path/to/samtools

### 2. Path to scripts
These are provided in this repo under [code/](code/).
tenX_funcs: code/tenX_utils.R
svaba_funcs: code/svaba_utils.R
plot_funcs: code/plotting.R
bxRescue_script: code/barCodeOverlap.R
combineSVCN_script: code/combineSVABAandTITAN.R
plotSVCN_script: code/plotTitanSvaba.R

### 3. Path to R package files
Specify the directory in which [TitanCNA]( are installed.
*Set these if the R files in these libraries have been modified or updated but not yet installed or updated in R*.
titan_libdir: /path/to/TitanCNA/

### 4. Path to TitanCNA 10X snakemake results
Specifies the TitanCNA results to be merged with SvABA results in [combineSvabaTitan.snakefile](combineSvabaTitan.snakefile).
titan_results: /path/to/TitanCNA/snakemake_results/

### 5. Reference files and settings
Global reference files used by the `snakefiles` and scripts.
- `refGenome` specify the reference genome used in the Long Ranger analysis
- `genomeStyle` specifies the chromosome naming convention to used for **output** files. Input files can be any convention as long as it is the same genome build. Only use `UCSC` (e.g. chr1) or `NCBI` (e.g. 1).
`cytobandFile` is used for plotting the chromosome idiograms and only needs to specify [data/cytoBand_hg38.txt](data/cytoBand_hg38.txt) if using hg38.
`chrs` specifies the chromosomes to analyze; users do not need to be concerned about chromosome naming convention here as the code will handle it based on the `genomeStyle`.
refGenome: /path/to/ref/genome.fasta
genomeBuild: hg38
genomeStyle: UCSC
cytobandFile: data/cytoBand_hg38.txt # only need if hg38
chrs: c(1:22, \"X\")

### 6. Long Ranger bam filen
Set this to the filenames that are used for the BAM files generated by Long Ranger. The current filenames are ones generated by [Long Ranger]( v2.2.2
bamFileName: phased_possorted_bam.bam

### 7. Cluster resource parameter settings
If you are using a cluster, then these are resource settings for memory and runtime limits, and parallel environments.
These are the default settings for all tasks that do not have rule-specific resources.
*Note that these settings are used for the Broad's UGER cluster so users will need to modify this for their own clusters.*
# invoke using: snakemake -s svaba.snakefile --cluster-sync "qsub -l h_vmem={params.mem},h_rt={params.runtime} {}" -j 200 --jobscript config/
std_mem: 4G # memory limit
std_runtime: "05:00:00" # runtime limit
std_numCores: -pe smp 1 -binding linear:1 # use one core

### 8. [svaba.snakefile](svaba.snakefile) settings: SvABA
The cluster resources for SvABA are set here. 3G of memory for each of the 4 cores totals to 12G set as the limit.
- `svaba_dbSNPindelVCF` specifies a VCF file containing known indels to use for germline filtering [Homo_sapiens_assembly38.known_indels.vcf.gz](

svaba_dbSNPindelVCF: /cga/meyerson/References/hg19/dbsnp_indel.vcf
svaba_mem: 3G # per core
svaba_runtime: "300:00:00"
svaba_numThreads: 4
svaba_numCores: -pe smp 4 -binding linear:4 # should match svaba_numThreads

### 9. [svaba.snakefile](svaba.snakefile) settings: Barcode counting
Minimum thresholds and settings for barcode (BX) counting.
- `bxRescue_minLength` sets minimum length of the intra-chromosomal SV event to be consider for barcode counting. Shorter events are excluded, except for fold-back inversions.
- `bxRescue_windowSize` sets the window size region to the left or right of the breakpoint for barcode counting.
bxRescue_minMapQ: 20
bxRescue_minLength: 10000
bxRescue_windowSize: 1000
bxRescue_minReadOverlapSupport: 2

### 10. [combineSvabaTitan.snakefile](combineSvabaTitan.snakefile) settings: Plotting
Settings used for plotting copy number and SV results.
- `plot_zoom` indicates the plot should be focused on a specific region smaller than a whole chromosome. If set to `TRUE`, then `plot_chrs` (should only be a single chr), `plot_startPos`, `plot_endPos` should be set.
- `plot_geneFile` is a text file listing the regions to annotate in the plot. 4 column file: name, chr, start, stop.
plot_zoom: FALSE
plot_chrs: c(1:22, \"X\") # "None" will also plot all chromosomes
plot_startPos: None
plot_endPos: None
plot_geneFile: data/AR_coord.txt ## include list of genes to annotate on plot
plot_ylim: c(-2,6)
plot_size: c(8,4)
plot_type: titan ## titan - will include haplotype fraction
plot_format: png

0 comments on commit 8d8df71

Please sign in to comment.
You can’t perform that action at this time.