-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update sv scripts to only copy a single bam file and index, and respect project parameter #4646
Conversation
addfad9
to
55fb225
Compare
Codecov Report
@@ Coverage Diff @@
## master #4646 +/- ##
===============================================
+ Coverage 79.844% 79.887% +0.044%
- Complexity 17334 17366 +32
===============================================
Files 1074 1074
Lines 62938 63085 +147
Branches 10186 10232 +46
===============================================
+ Hits 50252 50397 +145
+ Misses 8707 8705 -2
- Partials 3979 3983 +4
|
scripts/sv/manage_sv_pipeline.sh
Outdated
@@ -285,7 +285,7 @@ done | |||
######################################################################################################################## | |||
# call runWholePipeline | |||
# set output directory to datetime-git branch-git hash stamped folder | |||
OUTPUT_DIR="/results/$(date "+%Y-%m-%d_%H.%M.%S")-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}" | |||
OUTPUT_DIR="/results/$(date "+%Y-%m-%d_%H.%M.%S")-${CLUSTER_NAME}-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the default CLUSTER_NAME duplicates some of the information already in OUTPUT_DIR, would it meet the stated goals of the PR (while keeping OUTPUT_DIR more concise) to replace it with SANITIZED_BAM?
Independently of this I'd suggest giving OUTPUT_DIR an optional override, which yields the easiest scripting of processing the results. e.g. my suggestion for this line would be something like:
OUTPUT_DIR="/results/${SV_OUTPUT_DIR:-"$(date "+%Y-%m-%d_%H.%M.%S")-${SANITIZED_BAM}-${GIT_BRANCH}-${GATK_JAR_HASH}${UNTRACKED_COMMIT}"}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes all seem good and necessary. Two comments For consideration.
- In an inline comment I have two potentially independent suggestions for changing the default OUTPUT_DIR in manage_sv_pipeline.sh
- Should we at some point rationalize the name of runWholePipeline.sh into snake-case?
…de cluster name in results to avoid collisions
55fb225
to
5e5ded9
Compare
@TedBrookings I implemented these suggestions of yours:
I also found and fixed two other bugs:
Can you give these changes a quick re-review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your awk-fu is stronger than mine. I think I 70% understand it, but it seems to produce sensible results.
The other changes are all good.
…ct project parameter (broadinstitute#4646) * update SV scripts to only copy a single bam file, respect project, include cluster name in results to avoid collisions, and fix an issue with preemptible workers
This PR makes four changes to the SV pipeline management scripts: First it uses the new glob argument for copying data into HDFS to only copy the requested BAM file and its index, rather than everything in the GCS bucket directory (this is useful when the bucket directory contains multiple samples, as is the case with our HGSV trios). Second, the scripts now respect the project argument at each phase of the pipeline. Third, the output directory will include the name of the cluster so that if multiple samples are being processed by the same branch in parallel it's easier to figure out which results directory contains which sample and the results won't collide if two runs happen to start at the same timestamp. Finally, if the user requests not to copy the FASTQ files from the main management script, the script will not direct the SV discovery tool to write them to disk.