This repository has been archived by the owner on May 13, 2020. It is now read-only.

Dev (#6)
* Update wdl to GATK4.1

* Updated imports to git tag URL, updated gatk docker in json, updated Readme structure

* minor edits and link to documentation in readme
bshifaw committed Feb 13, 2019
1 parent fd444bf commit 216157d
Showing 6 changed files with 223 additions and 55 deletions.
56 changes: 37 additions & 19 deletions README.md
@@ -1,22 +1,14 @@
# somatic-cnvs
Workflows for somatic copy number variant analysis

## Running the Somatic CNV WDL
### Purpose :
Workflows for somatic copy number variant analysis.

### Which WDL should you use?
### cnv_somatic_panel_workflow :
Builds a panel of normals (PON) for the cnv pair workflow.

- Building a panel of normals (PoN): ``cnv_somatic_panel_workflow.wdl``
- Running a matched pair: ``cnv_somatic_pair_workflow.wdl``
#### Requirements/Expectations

#### Setting up parameter json file for a run

To get started, create the json template (using ``java -jar wdltool.jar inputs <workflow>``) for the workflow you wish to run and adjust parameters accordingly.

*Please note that there are optional workflow-level and task-level parameters that do not appear in the template file. These are set to reasonable values by default, but can also be adjusted if desired.*
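As a sketch of that flow, the template below is a mock of what the `inputs` command might emit; the real key names come from running `wdltool` against the actual workflow file:

```shell
# Generate the real template (shown for reference; requires wdltool):
#   java -jar wdltool.jar inputs cnv_somatic_panel_workflow.wdl > inputs.template.json
# The template is mocked here so the fill-in step can be shown end to end.
cat > inputs.template.json <<'EOF'
{
  "CNVSomaticPanelWorkflow.gatk_docker": "String",
  "CNVSomaticPanelWorkflow.normal_bams": "Array[File]"
}
EOF

# Fill in a value; for anything beyond a one-off edit, a JSON-aware tool
# such as jq is safer than sed.
sed 's#"CNVSomaticPanelWorkflow.gatk_docker": "String"#"CNVSomaticPanelWorkflow.gatk_docker": "broadinstitute/gatk:4.1.0.0"#' \
    inputs.template.json > inputs.json
grep gatk_docker inputs.json
```

Any keys left with placeholder type strings (or unwanted optional keys) should be filled in or deleted before handing the json to Cromwell.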

#### Required parameters in the somatic panel workflow

Important: The normal_bams samples in the json can be used to test the WDL; they are NOT to be used to create a panel of normals for sequence analysis. For instructions on creating a proper PON, please refer to the user documents [Panel of Normals](https://software.broadinstitute.org/gatk/documentation/article?id=11053) and [Generate a CNV panel of normals with CreateReadCountPanelOfNormals](https://gatkforums.broadinstitute.org/dsde/discussion/11682#2).

The reference used must be the same between PoN and case samples.

@@ -37,7 +29,14 @@ In addition, there are optional workflow-level and task-level parameters that

Further explanation of other task-level parameters may be found by invoking the ``--help`` documentation available in the gatk.jar for each tool.

#### Required parameters in the somatic pair workflow
#### Outputs
- Read count PoN in HDF5 format
- Additional metrics

### cnv_somatic_pair_workflow :
Running a matched pair to obtain somatic copy number variants.

#### Requirements/Expectations

The reference and bins (if specified) must be the same between PoN and case samples.

@@ -62,8 +61,27 @@ To invoke Oncotator on the called tumor copy-ratio segments:

Further explanation of these task-level parameters may be found by invoking the ``--help`` documentation available in the gatk.jar for each tool.

## Important
- The data in gs://gatk-test-data/1kgp are from the 1000 Genomes Project (http://www.internationalgenome.org/home) and are provided as is.
If you have questions on the data, please direct them to the 1000 Genomes Project email at info@1000genomes.org. Do NOT post questions about the data to the GATK forum.
#### Outputs
- Modeled segments for tumor and normal
- Modeled segments plot for tumor and normal
- Denoised copy ratios for tumor and normal
- Denoised copy ratios plot for tumor and normal
- Denoised copy ratios lim 4 plot for tumor and normal
- Additional metrics

### Software version requirements :
- GATK 4.1 or later

Cromwell version support
- Successfully tested on v37

### Important Note :
- Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
- For help running workflows on the Google Cloud Platform or locally please view the following tutorial [(How to) Execute Workflows from the gatk-workflows Git Organization](https://software.broadinstitute.org/gatk/documentation/article?id=12521).
- The following material is provided by the GATK Team. Please post any questions or concerns to one of our forum sites: [GATK](https://gatkforums.broadinstitute.org/gatk/categories/ask-the-team/), [FireCloud](https://gatkforums.broadinstitute.org/firecloud/categories/ask-the-firecloud-team), [Terra](https://broadinstitute.zendesk.com/hc/en-us/community/topics/360000500432-General-Discussion), or [WDL/Cromwell](https://gatkforums.broadinstitute.org/wdl/categories/ask-the-wdl-team).
- Please visit the [User Guide](https://software.broadinstitute.org/gatk/documentation/) site for further documentation on our workflows and tools.

### LICENSING :
Copyright Broad Institute, 2019 | BSD-3
This script is released under the WDL open source code license (BSD-3) (full license text at https://github.com/openwdl/wdl/blob/master/LICENSE). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.
148 changes: 134 additions & 14 deletions cnv_common_tasks.wdl
@@ -56,6 +56,11 @@ task AnnotateIntervals {
File ref_fasta
File ref_fasta_fai
File ref_fasta_dict
File? mappability_track_bed
File? mappability_track_bed_idx
File? segmental_duplication_track_bed
File? segmental_duplication_track_bed_idx
Int? feature_query_lookahead
File? gatk4_jar_override

# Runtime parameters
@@ -68,6 +73,10 @@

Int machine_mem_mb = select_first([mem_gb, 2]) * 1000
Int command_mem_mb = machine_mem_mb - 500

# Determine output filename
String filename = select_first([intervals, "wgs.preprocessed"])
String base_filename = basename(filename, ".interval_list")

command <<<
set -e
@@ -76,8 +85,11 @@
gatk --java-options "-Xmx${command_mem_mb}m" AnnotateIntervals \
-L ${intervals} \
--reference ${ref_fasta} \
${"--mappability-track " + mappability_track_bed} \
${"--segmental-duplication-track " + segmental_duplication_track_bed} \
--feature-query-lookahead ${default=1000000 feature_query_lookahead} \
--interval-merging-rule OVERLAPPING_ONLY \
--output annotated_intervals.tsv
--output ${base_filename}.annotated.tsv
>>>

runtime {
@@ -89,7 +101,77 @@ }
}

output {
File annotated_intervals = "annotated_intervals.tsv"
File annotated_intervals = "${base_filename}.annotated.tsv"
}
}

task FilterIntervals {
File intervals
File? blacklist_intervals
File? annotated_intervals
Array[File]? read_count_files
Float? minimum_gc_content
Float? maximum_gc_content
Float? minimum_mappability
Float? maximum_mappability
Float? minimum_segmental_duplication_content
Float? maximum_segmental_duplication_content
Int? low_count_filter_count_threshold
Float? low_count_filter_percentage_of_samples
Float? extreme_count_filter_minimum_percentile
Float? extreme_count_filter_maximum_percentile
Float? extreme_count_filter_percentage_of_samples
File? gatk4_jar_override

# Runtime parameters
String gatk_docker
Int? mem_gb
Int? disk_space_gb
Boolean use_ssd = false
Int? cpu
Int? preemptible_attempts

Int machine_mem_mb = select_first([mem_gb, 7]) * 1000
Int command_mem_mb = machine_mem_mb - 500

# Determine output filename
String filename = select_first([intervals, "wgs.preprocessed"])
String base_filename = basename(filename, ".interval_list")

command <<<
set -e
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk4_jar_override}

gatk --java-options "-Xmx${command_mem_mb}m" FilterIntervals \
-L ${intervals} \
${"-XL " + blacklist_intervals} \
${"--annotated-intervals " + annotated_intervals} \
${if defined(read_count_files) then "--input " else ""} ${sep=" --input " read_count_files} \
--minimum-gc-content ${default="0.1" minimum_gc_content} \
--maximum-gc-content ${default="0.9" maximum_gc_content} \
--minimum-mappability ${default="0.9" minimum_mappability} \
--maximum-mappability ${default="1.0" maximum_mappability} \
--minimum-segmental-duplication-content ${default="0.0" minimum_segmental_duplication_content} \
--maximum-segmental-duplication-content ${default="0.5" maximum_segmental_duplication_content} \
--low-count-filter-count-threshold ${default="5" low_count_filter_count_threshold} \
--low-count-filter-percentage-of-samples ${default="90.0" low_count_filter_percentage_of_samples} \
--extreme-count-filter-minimum-percentile ${default="1.0" extreme_count_filter_minimum_percentile} \
--extreme-count-filter-maximum-percentile ${default="99.0" extreme_count_filter_maximum_percentile} \
--extreme-count-filter-percentage-of-samples ${default="90.0" extreme_count_filter_percentage_of_samples} \
--interval-merging-rule OVERLAPPING_ONLY \
--output ${base_filename}.filtered.interval_list
>>>

runtime {
docker: "${gatk_docker}"
memory: machine_mem_mb + " MB"
disks: "local-disk " + select_first([disk_space_gb, 50]) + if use_ssd then " SSD" else " HDD"
cpu: select_first([cpu, 1])
preemptible: select_first([preemptible_attempts, 5])
}

output {
File filtered_intervals = "${base_filename}.filtered.interval_list"
}
}

@@ -200,6 +282,8 @@ task CollectAllelicCounts {
task ScatterIntervals {
File interval_list
Int num_intervals_per_scatter
String? output_dir
File? gatk4_jar_override

# Runtime parameters
String gatk_docker
@@ -210,16 +294,35 @@
Int? preemptible_attempts

Int machine_mem_mb = select_first([mem_gb, 2]) * 1000
Int command_mem_mb = machine_mem_mb - 500

# If optional output_dir not specified, use "out";
String output_dir_ = select_first([output_dir, "out"])

String base_filename = basename(interval_list, ".interval_list")

command <<<
set -e

grep @ ${interval_list} > header.txt
grep -v @ ${interval_list} > all_intervals.txt
split -l ${num_intervals_per_scatter} --numeric-suffixes all_intervals.txt ${base_filename}.scattered.
for i in ${base_filename}.scattered.*; do cat header.txt $i > $i.interval_list; done
mkdir ${output_dir_}
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk4_jar_override}

{
>&2 echo "Attempting to run IntervalListTools..."
gatk --java-options "-Xmx${command_mem_mb}m" IntervalListTools \
--INPUT ${interval_list} \
--SUBDIVISION_MODE INTERVAL_COUNT \
--SCATTER_CONTENT ${num_intervals_per_scatter} \
--OUTPUT ${output_dir_} &&
# output files are named output_dir_/temp_0001_of_N/scattered.interval_list, etc. (N = number of shards);
# we rename them as output_dir_/base_filename.scattered.0000.interval_list, etc.
ls ${output_dir_}/*/scattered.interval_list | \
cat -n | \
while read n filename; do mv $filename ${output_dir_}/${base_filename}.scattered.$(printf "%04d" $n).interval_list; done
} || {
# if only a single shard is required, then we can just rename the original interval list
>&2 echo "IntervalListTools failed because only a single shard is required. Copying original interval list..."
cp ${interval_list} ${output_dir_}/${base_filename}.scattered.1.interval_list
}
>>>

runtime {
@@ -231,14 +334,18 @@
}

output {
Array[File] scattered_interval_lists = glob("${base_filename}.scattered.*.interval_list")
Array[File] scattered_interval_lists = glob("${output_dir_}/${base_filename}.scattered.*.interval_list")
}
}
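The rename step in `ScatterIntervals` above can be exercised on its own. In this sketch the `IntervalListTools` output layout is mocked with empty files; the `temp_0001_of_N` directory names are an assumption based on the comment in the task:

```shell
set -e
base_filename="ice_targets"
output_dir_="out"

# Mock the shard layout that IntervalListTools writes.
mkdir -p ${output_dir_}/temp_0001_of_2 ${output_dir_}/temp_0002_of_2
touch ${output_dir_}/temp_0001_of_2/scattered.interval_list \
      ${output_dir_}/temp_0002_of_2/scattered.interval_list

# Rename shards to output_dir_/base_filename.scattered.NNNN.interval_list,
# exactly as the task's while-read loop does.
ls ${output_dir_}/*/scattered.interval_list | \
    cat -n | \
    while read n filename; do
        mv "$filename" ${output_dir_}/${base_filename}.scattered.$(printf "%04d" "$n").interval_list
    done

ls ${output_dir_}/*.interval_list
```

The zero-padded suffix keeps lexicographic and numeric shard order in agreement, which matters because the task's output is collected with a `glob()`.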

task PostprocessGermlineCNVCalls {
String entity_id
Array[File] gcnv_calls_tars
Array[File] gcnv_model_tars
Array[File] calling_configs
Array[File] denoising_configs
Array[File] gcnvkernel_version
Array[File] sharded_interval_lists
File contig_ploidy_calls_tar
Array[String]? allosomal_contigs
Int ref_copy_number_autosomal_contigs
@@ -258,21 +365,33 @@

String genotyped_intervals_vcf_filename = "genotyped-intervals-${entity_id}.vcf.gz"
String genotyped_segments_vcf_filename = "genotyped-segments-${entity_id}.vcf.gz"
Boolean allosomal_contigs_specified = defined(allosomal_contigs) && length(select_first([allosomal_contigs, []])) > 0

Array[String] allosomal_contigs_args = if defined(allosomal_contigs) then prefix("--allosomal-contig ", select_first([allosomal_contigs])) else []

String dollar = "$" #WDL workaround for using array[@], see https://github.com/broadinstitute/cromwell/issues/1819

command <<<
set -e
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk4_jar_override}

sharded_interval_lists_array=(${sep=" " sharded_interval_lists})

# untar calls to CALLS_0, CALLS_1, etc directories and build the command line
# also copy over shard config and interval files
gcnv_calls_tar_array=(${sep=" " gcnv_calls_tars})
calling_configs_array=(${sep=" " calling_configs})
denoising_configs_array=(${sep=" " denoising_configs})
gcnvkernel_version_array=(${sep=" " gcnvkernel_version})
sharded_interval_lists_array=(${sep=" " sharded_interval_lists})
calls_args=""
for index in ${dollar}{!gcnv_calls_tar_array[@]}; do
gcnv_calls_tar=${dollar}{gcnv_calls_tar_array[$index]}
mkdir CALLS_$index
tar xzf $gcnv_calls_tar -C CALLS_$index
mkdir -p CALLS_$index/SAMPLE_${sample_index}
tar xzf $gcnv_calls_tar -C CALLS_$index/SAMPLE_${sample_index}
cp ${dollar}{calling_configs_array[$index]} CALLS_$index/
cp ${dollar}{denoising_configs_array[$index]} CALLS_$index/
cp ${dollar}{gcnvkernel_version_array[$index]} CALLS_$index/
cp ${dollar}{sharded_interval_lists_array[$index]} CALLS_$index/
calls_args="$calls_args --calls-shard-path CALLS_$index"
done

@@ -289,17 +408,18 @@
mkdir extracted-contig-ploidy-calls
tar xzf ${contig_ploidy_calls_tar} -C extracted-contig-ploidy-calls

allosomal_contigs_args="--allosomal-contig ${sep=" --allosomal-contig " allosomal_contigs}"

gatk --java-options "-Xmx${command_mem_mb}m" PostprocessGermlineCNVCalls \
$calls_args \
$model_args \
${true="$allosomal_contigs_args" false="" allosomal_contigs_specified} \
${sep=" " allosomal_contigs_args} \
--autosomal-ref-copy-number ${ref_copy_number_autosomal_contigs} \
--contig-ploidy-calls extracted-contig-ploidy-calls \
--sample-index ${sample_index} \
--output-genotyped-intervals ${genotyped_intervals_vcf_filename} \
--output-genotyped-segments ${genotyped_segments_vcf_filename}

rm -r CALLS_*
rm -r MODEL_*
>>>
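The `${dollar}` indirection in the command block above exists because Cromwell's `${...}` placeholder syntax collides with bash's own parameter expansion. Once Cromwell substitutes `${dollar}` with `$`, the untar loop is ordinary indexed-array iteration, roughly as in this standalone sketch (the per-shard `tar xzf` and config copies are mocked out with `mkdir` so it runs without real shard tarballs):

```shell
set -e

# Mock list of per-shard call tarballs.
gcnv_calls_tar_array=(shard0.tar.gz shard1.tar.gz)

# Build one --calls-shard-path argument per shard, as the task does.
calls_args=""
for index in "${!gcnv_calls_tar_array[@]}"; do
    mkdir -p CALLS_$index        # stands in for: tar xzf <shard> -C CALLS_$index/...
    calls_args="$calls_args --calls-shard-path CALLS_$index"
done

echo "$calls_args"
```

After interpolation the loop body sees `${!gcnv_calls_tar_array[@]}` verbatim, which bash expands to the array indices `0 1`.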

runtime {
Expand Down
2 changes: 1 addition & 1 deletion cnv_somatic_pair_workflow.b37.inputs.json
@@ -14,7 +14,7 @@
"CNVSomaticPairWorkflow.intervals": "gs://gatk-test-data/cnv/somatic/ice_targets.tsv.interval_list",

"##_COMMENT3": "Docker",
"CNVSomaticPairWorkflow.gatk_docker": "broadinstitute/gatk:4.0.8.0",
"CNVSomaticPairWorkflow.gatk_docker": "broadinstitute/gatk:4.1.0.0",
"##CNVSomaticPairWorkflow.oncotator_docker": "(optional) String?",

"##_COMMENT4": "Memory Optional",

0 comments on commit 216157d
