WDLs to revert/convert between BAM, FASTQ and uBAM #83
Note that this builds on #81 |
The third case (which we need for Monkol) is in progress. I'm considering splitting out the CRAM-to-BAM conversion to make the revert-and-sort step more generic. |
ktibbett
commented
Jan 8, 2017
|
@vdauwera that seems like a very good idea. |
|
Great, will do that then, @ktibbett. |
Change notes
Hmm, the test failed -- RevertSam failed with this error:
I'm going to assume this means I should set SO in RevertSam, but then I wonder why the separate SortSam step -- or was that supposed to be changing to coordinate? There was some ambiguity in the original script: RevertSam said |
|
Huh, the RevertSam command had a |
This is just running RevertSam now. |
ktibbett
commented
Jan 8, 2017
dshiga
commented
Jan 9, 2017
|
Hi @vdauwera, just to give some context about the original WDL and sorting: For our use case, we run the workflow on coordinate-sorted CRAMs, so the input to RevertSam is a coordinate-sorted BAM. By specifying SORT_ORDER=coordinate, we prevent RevertSam itself from trying to do any additional sorting -- by default it will sort in queryname order. (It would have been clearer to put SORT_ORDER=unsorted here instead. I didn't do that at the time because I was misled by some of the RevertSam code that makes it look like it will try to make a SortingCollection unless you exactly match the SORT_ORDER already in the input BAM header, though that's not actually the case.) However, we actually want the outputs to be queryname-sorted, because the uBAMs that we generate on-prem and push to the cloud are normally queryname-sorted, and we wanted this workflow to give us back exactly the same uBAMs we would have fed into the single-sample pipeline to make the CRAM. So then why not just let RevertSam do that sorting? I figured that instead of one big sort in RevertSam, which requires a lot of disk and memory and takes a long time, we would be better off sorting each of the resulting read-group-level uBAMs, which can be done in parallel on smaller machines. This is not at all evident from the original WDL itself, though, and I intend to add some clarifying comments to it! |
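The split described above (RevertSam without an extra sort, followed by one queryname SortSam per read-group uBAM) can be sketched as the command lines each step would run. This is a minimal sketch, not the WDL from this PR: the jar path, file names, and helper names are hypothetical, and `OUTPUT_BY_READGROUP=true` is an assumption about how the read-group-level uBAMs get produced. `INPUT`, `OUTPUT`, and `SORT_ORDER` are standard Picard arguments.

```python
from typing import List

PICARD_JAR = "picard.jar"  # hypothetical path; adjust to your environment


def revert_sam_cmd(input_bam: str, output_dir: str,
                   sort_order: str = "coordinate") -> List[str]:
    """Build a RevertSam command that does not re-sort a coordinate-sorted input.

    sort_order="coordinate" mirrors the original script; the thread notes that
    "unsorted" would have expressed the intent more clearly.
    """
    return [
        "java", "-jar", PICARD_JAR, "RevertSam",
        f"INPUT={input_bam}",
        f"OUTPUT={output_dir}",
        "OUTPUT_BY_READGROUP=true",  # assumption: one uBAM per read group
        f"SORT_ORDER={sort_order}",
    ]


def sort_sam_cmd(unsorted_ubam: str, sorted_ubam: str) -> List[str]:
    """Build a queryname SortSam command for a single read-group uBAM."""
    return [
        "java", "-jar", PICARD_JAR, "SortSam",
        f"INPUT={unsorted_ubam}",
        f"OUTPUT={sorted_ubam}",
        "SORT_ORDER=queryname",
    ]


def plan(input_bam: str, read_groups: List[str]) -> List[List[str]]:
    """Return the RevertSam command followed by one SortSam command per read group.

    The SortSam commands are independent of each other, so a workflow engine
    can scatter them across smaller machines.
    """
    commands = [revert_sam_cmd(input_bam, "reverted")]
    for rg in read_groups:
        # Hypothetical file naming for the per-read-group outputs.
        commands.append(sort_sam_cmd(f"reverted/{rg}.bam", f"sorted/{rg}.bam"))
    return commands


if __name__ == "__main__":
    for cmd in plan("sample.bam", ["rgA", "rgB"]):  # hypothetical names
        print(" ".join(cmd))
```

Only the first command is serial; everything after it can run concurrently, which is the reason for keeping the sort out of RevertSam.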
|
Oh I see, that makes way more sense now -- thanks for clarifying, @dshiga! Do you have any benchmarking numbers on the runtime of the RevertSam job if doing the sort vs. total runtime of unsorted RevertSam + separate SortSam step? We don't have a choice for the use case I'm covering because we have to use SANITIZE=true, but I'd like to document this in a note on efficiency. |
dshiga
commented
Jan 9, 2017
|
I can tell you that running the existing workflow on a 45 GB CRAM took ~11.5 hours, 11 hours of which was RevertSam. The sort jobs took ~30 minutes each (but ran in parallel), and the validate jobs, also in parallel, took about 5 minutes. Unfortunately, I don't have numbers for doing the sorting inside RevertSam. I just vaguely remember it taking a long time and having trouble running real (as opposed to slimmed down) data through it, because it wanted so much memory and disk. |
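As a rough check on these numbers, the wall-clock time is the serial RevertSam step plus the slowest job in each parallel stage. The following is a minimal sketch, assuming the three stages run strictly one after another. The read-group count of 4 is made up for illustration, since the thread does not give it.

```python
from typing import Sequence


def wall_clock_hours(revert_h: float,
                     sort_h: Sequence[float],
                     validate_h: Sequence[float]) -> float:
    """Serial RevertSam, then parallel sort jobs, then parallel validate jobs."""
    return revert_h + max(sort_h) + max(validate_h)


n_read_groups = 4  # hypothetical; not stated in the thread
total = wall_clock_hours(11.0, [0.5] * n_read_groups, [5 / 60] * n_read_groups)
revert_share = 11.0 / total

print(f"estimated total: {total:.2f} h")
print(f"RevertSam share: {revert_share:.0%}")
```

The estimate lands close to the reported ~11.5 hours, with RevertSam accounting for roughly 95% of it, so the parallel sort and validate stages contribute little to the total.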
|
Ah ok, thanks for this. |
|
This is ready for review. |
|
Need to update the output syntax for workflow nesting. |
All workflows successful on Cromwell 24. |
vdauwera
commented
Jan 8, 2017
Three different use cases: