update sv scripts to only copy a single bam file and index, and respect project parameter #4646

Merged
5 commits merged into master from cw_copy_only_sample_bam_on_cluster_start on Apr 24, 2018

Conversation

@cwhelan (Member) commented on Apr 11, 2018

This PR makes four changes to the SV pipeline management scripts:

  1. Use the new glob argument when copying data into HDFS, so that only the requested BAM file and its index are copied rather than everything in the GCS bucket directory. This is useful when the bucket directory contains multiple samples, as is the case with our HGSV trios (see the sketch below).
  2. Respect the project argument at each phase of the pipeline.
  3. Include the cluster name in the output directory, so that when multiple samples are being processed from the same branch in parallel it is easier to tell which results directory contains which sample, and results won't collide if two runs happen to start at the same timestamp.
  4. If the user requests not to copy the FASTQ files from the main management script, do not direct the SV discovery tool to write them to disk.
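A minimal bash sketch of the first change, for illustration only: the helper name (copy_from_gcs_to_hdfs), its flags, and the variable names are assumptions, not the scripts' actual interface. The point is that a glob derived from the requested BAM matches the BAM and its index and nothing else in the bucket directory.

```bash
# Hypothetical sketch: the helper, flags, and paths below are illustrative,
# not the real interface of the SV pipeline management scripts.
INPUT_BAM="gs://my-bucket/hgsv-trio/HG00514.bam"       # assumed example path
SAMPLE_BAM_GLOB="$(basename "${INPUT_BAM}")*"          # matches HG00514.bam and HG00514.bam.bai only

copy_from_gcs_to_hdfs \
    --project "${GCS_PROJECT}" \
    --input-directory "$(dirname "${INPUT_BAM}")" \
    --glob "${SAMPLE_BAM_GLOB}" \
    --output-directory "hdfs:///data"
```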

@cwhelan added the SV label on Apr 11, 2018
@cwhelan force-pushed the cw_copy_only_sample_bam_on_cluster_start branch 3 times, most recently from addfad9 to 55fb225, on April 11, 2018
@codecov-io commented on Apr 17, 2018

Codecov Report

Merging #4646 into master will increase coverage by 0.044%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #4646       +/-   ##
===============================================
+ Coverage     79.844%   79.887%   +0.044%     
- Complexity     17334     17366       +32     
===============================================
  Files           1074      1074               
  Lines          62938     63085      +147     
  Branches       10186     10232       +46     
===============================================
+ Hits           50252     50397      +145     
+ Misses          8707      8705        -2     
- Partials        3979      3983        +4
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...lkers/ReferenceConfidenceVariantContextMerger.java | 95.5% <0%> (+0.243%) | 101% <0%> (+30%) ⬆️ |
| ...utils/smithwaterman/SmithWatermanIntelAligner.java | 90% <0%> (+40%) | 3% <0%> (+2%) ⬆️ |

```diff
@@ -285,7 +285,7 @@ done
 ########################################################################################################################
 # call runWholePipeline
 # set output directory to datetime-git branch-git hash stamped folder
-OUTPUT_DIR="/results/$(date "+%Y-%m-%d_%H.%M.%S")-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"
+OUTPUT_DIR="/results/$(date "+%Y-%m-%d_%H.%M.%S")-${CLUSTER_NAME}-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"
```
@TedBrookings (Contributor) commented on the diff:

Since the default CLUSTER_NAME duplicates some of the information already in OUTPUT_DIR, would it meet the stated goals of the PR (while keeping OUTPUT_DIR more concise) to replace it with SANITIZED_BAM?
Independently of this, I'd suggest giving OUTPUT_DIR an optional override, which would make it easier to script over the results. E.g., my suggestion for this line would be something like:
OUTPUT_DIR="/results/${SV_OUTPUT_DIR:-"$(date "+%Y-%m-%d_%H.%M.%S")-${SANITIZED_BAM}-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"}"
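As a usage note (the invocation below is hypothetical; only SV_OUTPUT_DIR and manage_sv_pipeline.sh appear in this thread, and the remaining arguments are placeholders), such an override would let a caller pin the results directory up front rather than discovering the timestamped default afterwards:

```bash
# Hypothetical invocation: with SV_OUTPUT_DIR set, results land in a known
# directory; when it is unset, the timestamp/branch/hash default applies.
SV_OUTPUT_DIR="hg00514_run1" bash manage_sv_pipeline.sh <project> <bam> <other args...>
```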

@TedBrookings (Contributor) left a comment:

These changes all seem good and necessary. Two comments for consideration:

  1. In an inline comment I have two potentially independent suggestions for changing the default OUTPUT_DIR in manage_sv_pipeline.sh
  2. Should we at some point rationalize the name of runWholePipeline.sh into snake-case?

@cwhelan force-pushed the cw_copy_only_sample_bam_on_cluster_start branch from 55fb225 to 5e5ded9 on April 23, 2018
@cwhelan (Member, Author) commented on Apr 23, 2018

@TedBrookings I implemented these suggestions of yours:

  • replace CLUSTER_NAME with SANITIZED_BAM in the results directory
  • allow output directory to be overridden by setting SV_OUTPUT_DIR
  • rename runWholePipeline to run-whole-pipeline

I also found and fixed two other bugs:

  • added a set -f to the create-cluster script to avoid having bash expand the wildcard glob for that step (this caused a problem if a file matching the pattern was present locally)
  • changed how the copy-results script gets cluster info so that it parses the result header (this fixes a problem when using preemptible workers, because the output of the gcloud dataproc clusters list command then has a different number of columns); both fixes are sketched below

Can you give these changes a quick re-review?
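For reference, here is a rough bash sketch of what those two fixes might look like; it is a reconstruction under stated assumptions (the ZONE column name, the variable names, and the surrounding commands are guesses), not the PR's actual code.

```bash
# Fix 1 (create-cluster script): disable shell globbing so the BAM glob is
# passed along as a literal pattern; without set -f, a local file that
# happened to match the pattern would be expanded in place by bash.
set -f
BAM_GLOB="${SANITIZED_BAM}*"    # variable name assumed; e.g. HG00514.bam*

# Fix 2 (copy-results script): find the wanted column by parsing the header of
# `gcloud dataproc clusters list` rather than hard-coding its position, since
# preemptible workers add a column and shift the fields.
gcloud dataproc clusters list --project "${PROJECT}" | awk -v cluster="${CLUSTER_NAME}" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ZONE") col = i; next }  # header row: locate column (ZONE is an assumption)
    $1 == cluster { print $col }                                           # row for our cluster: print that field
'
```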

@TedBrookings (Contributor) left a comment:

Your awk-fu is stronger than mine. I think I 70% understand it, but it seems to produce sensible results.
The other changes are all good.

@cwhelan merged commit e530c2e into master on Apr 24, 2018
@cwhelan deleted the cw_copy_only_sample_bam_on_cluster_start branch on April 24, 2018
cwhelan added a commit to cwhelan/gatk-linked-reads that referenced this pull request on May 25, 2018:
update sv scripts to only copy a single bam file and index, and respect project parameter (broadinstitute#4646)

* update SV scripts to only copy a single bam file, respect project, include cluster name in results to avoid collisions, and fix an issue with preemptible workers