update sv scripts to only copy a single bam file and index, and respect project parameter #4646

Merged
5 commits merged into master from cw_copy_only_sample_bam_on_cluster_start on Apr 24, 2018

Conversation

@cwhelan (Member) commented on Apr 11, 2018

This PR makes four changes to the SV pipeline management scripts:

  1. Use the new glob argument when copying data into HDFS, so that only the requested BAM file and its index are copied rather than everything in the GCS bucket directory. This is useful when the bucket directory contains multiple samples, as is the case with our HGSV trios (see the sketch below).
  2. Respect the project argument at each phase of the pipeline.
  3. Include the cluster name in the output directory, so that when multiple samples are being processed from the same branch in parallel it is easier to tell which results directory contains which sample, and results won't collide if two runs happen to start at the same timestamp.
  4. If the user requests not to copy the FASTQ files from the main management script, do not direct the SV discovery tool to write them to disk.
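A minimal bash sketch of the first change, for illustration only: the helper name (copy_from_gcs_to_hdfs), its flags, and the variable names are assumptions, not the scripts' actual interface. The point is that a glob derived from the requested BAM matches the BAM and its index and nothing else in the bucket directory.

```bash
# Hypothetical sketch: the helper, flags, and paths below are illustrative,
# not the real interface of the SV pipeline management scripts.
INPUT_BAM="gs://my-bucket/hgsv-trio/HG00514.bam"       # assumed example path
SAMPLE_BAM_GLOB="$(basename "${INPUT_BAM}")*"          # matches HG00514.bam and HG00514.bam.bai only

copy_from_gcs_to_hdfs \
    --project "${GCS_PROJECT}" \
    --input-directory "$(dirname "${INPUT_BAM}")" \
    --glob "${SAMPLE_BAM_GLOB}" \
    --output-directory "hdfs:///data"
```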

@cwhelan added the SV label on Apr 11, 2018
@cwhelan force-pushed the cw_copy_only_sample_bam_on_cluster_start branch 3 times, most recently from addfad9 to 55fb225, on April 11, 2018
@codecov-io commented on Apr 17, 2018

Codecov Report

Merging #4646 into master will increase coverage by 0.044%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #4646       +/-   ##
===============================================
+ Coverage     79.844%   79.887%   +0.044%     
- Complexity     17334     17366       +32     
===============================================
  Files           1074      1074               
  Lines          62938     63085      +147     
  Branches       10186     10232       +46     
===============================================
+ Hits           50252     50397      +145     
+ Misses          8707      8705        -2     
- Partials        3979      3983        +4
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...lkers/ReferenceConfidenceVariantContextMerger.java | 95.5% <0%> (+0.243%) | 101% <0%> (+30%) ⬆️ |
| ...utils/smithwaterman/SmithWatermanIntelAligner.java | 90% <0%> (+40%) | 3% <0%> (+2%) ⬆️ |

```diff
@@ -285,7 +285,7 @@ done
 ########################################################################################################################
 # call runWholePipeline
 # set output directory to datetime-git branch-git hash stamped folder
-OUTPUT_DIR="/results/$(date "+%Y-%m-%d_%H.%M.%S")-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"
+OUTPUT_DIR="/results/$(date "+%Y-%m-%d_%H.%M.%S")-${CLUSTER_NAME}-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"
```
@TedBrookings (Contributor) commented on the diff:

Since the default CLUSTER_NAME duplicates some of the information already in OUTPUT_DIR, would it meet the stated goals of the PR (while keeping OUTPUT_DIR more concise) to replace it with SANITIZED_BAM?
Independently of this, I'd suggest giving OUTPUT_DIR an optional override, which would make it easier to script over the results. E.g., my suggestion for this line would be something like:
OUTPUT_DIR="/results/${SV_OUTPUT_DIR:-"$(date "+%Y-%m-%d_%H.%M.%S")-${SANITIZED_BAM}-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"}"
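As a usage note (the invocation below is hypothetical; only SV_OUTPUT_DIR and manage_sv_pipeline.sh appear in this thread, and the remaining arguments are placeholders), such an override would let a caller pin the results directory up front rather than discovering the timestamped default afterwards:

```bash
# Hypothetical invocation: with SV_OUTPUT_DIR set, results land in a known
# directory; when it is unset, the timestamp/branch/hash default applies.
SV_OUTPUT_DIR="hg00514_run1" bash manage_sv_pipeline.sh <project> <bam> <other args...>
```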

@TedBrookings (Contributor) left a comment:

These changes all seem good and necessary. Two comments for consideration:

  1. In an inline comment I have two potentially independent suggestions for changing the default OUTPUT_DIR in manage_sv_pipeline.sh
  2. Should we at some point rationalize the name of runWholePipeline.sh into snake-case?

@cwhelan force-pushed the cw_copy_only_sample_bam_on_cluster_start branch from 55fb225 to 5e5ded9 on April 23, 2018
@cwhelan (Member, Author) commented on Apr 23, 2018

@TedBrookings I implemented these suggestions of yours:

  • replace CLUSTER_NAME with SANITIZED_BAM in the results directory
  • allow output directory to be overridden by setting SV_OUTPUT_DIR
  • rename runWholePipeline to run-whole-pipeline

I also found and fixed two other bugs:

  • added a set -f to the create-cluster script to avoid having bash expand the wildcard glob for that step (this caused a problem if a file matching the pattern was present locally)
  • changed how the copy-results script gets cluster info so that it parses the result header (this fixes a problem when using preemptible workers, because the output of the gcloud dataproc clusters list command then has a different number of columns); both fixes are sketched below

Can you give these changes a quick re-review?
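For reference, here is a rough bash sketch of what those two fixes might look like; it is a reconstruction under stated assumptions (the ZONE column name, the variable names, and the surrounding commands are guesses), not the PR's actual code.

```bash
# Fix 1 (create-cluster script): disable shell globbing so the BAM glob is
# passed along as a literal pattern; without set -f, a local file that
# happened to match the pattern would be expanded in place by bash.
set -f
BAM_GLOB="${SANITIZED_BAM}*"    # variable name assumed; e.g. HG00514.bam*

# Fix 2 (copy-results script): find the wanted column by parsing the header of
# `gcloud dataproc clusters list` rather than hard-coding its position, since
# preemptible workers add a column and shift the fields.
gcloud dataproc clusters list --project "${PROJECT}" | awk -v cluster="${CLUSTER_NAME}" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ZONE") col = i; next }  # header row: locate column (ZONE is an assumption)
    $1 == cluster { print $col }                                           # row for our cluster: print that field
'
```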

@TedBrookings (Contributor) left a comment:

Your awk-fu is stronger than mine. I think I 70% understand it, but it seems to produce sensible results.
The other changes are all good.

@cwhelan merged commit e530c2e into master on Apr 24, 2018
@cwhelan deleted the cw_copy_only_sample_bam_on_cluster_start branch on April 24, 2018
cwhelan added a commit to cwhelan/gatk-linked-reads that referenced this pull request on May 25, 2018:
update sv scripts to only copy a single bam file and index, and respect project parameter (broadinstitute#4646)

* update SV scripts to only copy a single bam file, respect project, include cluster name in results to avoid collisions, and fix an issue with preemptible workers