WDLs to revert/convert between BAM, FASTQ and uBAM #83

Merged
merged 1 commit into from Jan 22, 2017

Conversation

Projects
None yet
3 participants
Collaborator

vdauwera commented Jan 8, 2017 edited

Three different use cases:

  • Reverting several per-readgroup bams to paired FASTQs
  • Converting paired FASTQs to uBAM
  • Reverting multi-readgroup BAM to uBAM
Collaborator

vdauwera commented Jan 8, 2017

Note that this builds on #81

Collaborator

vdauwera commented Jan 8, 2017 edited

  • Test RevertRGBamsToPairedFastQsWf: 8480ab89-f89b-455d-b64e-9760d9ba1f39
  • Test ConvertPairedFastQToUnmappedBamWf: c9b61450-aeb2-4b65-821b-8903e9d73d30
Collaborator

vdauwera commented Jan 8, 2017 edited

Third case (which we need for Monkol) in progress. I'm considering splitting out the cram to bam conversion to make the revert & sort step more generic.

ktibbett commented Jan 8, 2017

@vdauwera that seems like a very good idea.

Collaborator

vdauwera commented Jan 8, 2017

Great, will do that then, @ktibbett.

Collaborator

vdauwera commented Jan 8, 2017

  • Test RevertBamToUnmappedRGBamsWf: abb24610-b621-4b5a-9e10-590be65b2624

Change notes

  • I stripped the original script down to just RevertSam + SortSam. Converting from CRAM is too specific, and I think validation should be done separately from processing for the generic use case.

  • In RevertSam I added SANITIZE and MAX_DISCARD_FRACTION but replace OUTPUT_MAP by just OUTPUT for the sake of being simple and generic.

  • Currently the inputs json uses some older b37-aligned data but I could update it to hg38 if desired. As it stands it aligns nicely with the use case so I don't really care either way.

  • Added a bunch of comments + a header to document usage and license terms.

Collaborator

vdauwera commented Jan 8, 2017

Hmm test failed -- RevertSam failed with this error:

SORT_ORDER must be queryname when sanitization is enabled with SANITIZE=true.

I'm going to assume this means I should set SO in RevertSam, but then I wonder why the separate SortSam step -- or was that supposed to be changing to coordinate? There was some ambiguity in the original script: RevertSam said SORT_ORDER=queryname but filename said .coord.sorted.unmapped.bam.

Collaborator

vdauwera commented Jan 8, 2017

Huh, the RevertSam command had a SORT_ORDER=coordinate line I hadn't noticed. Will change that. @ktibbett do you know why the script was emitting coordinate-sorted ubams but then sorting them by queryname, instead of just emitting qname-sorted from the get-go? This doesn't make a lot of sense to me... but I'm probably missing something.

Collaborator

vdauwera commented Jan 8, 2017 edited

  • Test RevertBamToUnmappedRGBamsWf with SortSam stripped out: 5175e654-0838-4cc0-b9a4-1898abbad6f5

This is just running RevertSam now.

ktibbett commented Jan 8, 2017

@dshiga , can you please look at @vdauwera 's question about RevertSam outputting in coordinate-sorted order...?

dshiga commented Jan 9, 2017

Hi @vdauwera, just to give some context about the original WDL and sorting:

For our use case, we run the workflow on coordinate sorted CRAMs and so the input to RevertSam is a coordinate sorted BAM. By specifying SORT_ORDER=coordinate, we prevent RevertSam itself from trying to do any additional sorting -- by default it will sort in queryname order. (It would have been clearer to put SORT_ORDER=unsorted here instead. I didn't do that at the time because I was misled by some of the RevertSam code that makes it look like it will try to make a SortingCollection unless you exactly match the SORT_ORDER already in the input BAM header, though that's not actually the case.)

However -- we actually want the outputs to be queryname sorted, because the ubams that we generate on prem and push to the cloud are normally queryname sorted and we wanted this workflow to give us back exactly the same ubams we would have fed into the single sample pipeline to make the CRAM.

So then why not just let RevertSam do that sorting? I figured that instead of one big sort in RevertSam, which requires a lot of disk and memory and takes a long time, we would be better off sorting each of the resulting read group level ubams, which can be done in parallel on smaller machines.

This is not at all evident from the original WDL itself, though, and I intend to add some clarifying comments to it!

Collaborator

vdauwera commented Jan 9, 2017

Oh I see, that makes way more sense now -- thanks for clarifying, @dshiga! Do you have any benchmarking numbers on the runtime of the RevertSam job if doing the sort vs. total runtime of unsorted RevertSam + separate SortSam step? We don't have a choice for the use case I'm covering because we have to use SANITIZE = true but I'd like to document this in a note on efficiency.

dshiga commented Jan 9, 2017

I can tell you that running the existing workflow on a 45 GB CRAM took ~11.5 hours, 11 hours of which was RevertSam. The sort jobs took ~30 minutes each (but ran in parallel), and the validate jobs, also in parallel, took about 5 minutes.

Unfortunately, I don't have numbers for doing the sorting inside RevertSam. I just vaguely remember it taking a long time and having trouble running real (as opposed to slimmed down) data through it, because it wanted so much memory and disk.

Collaborator

vdauwera commented Jan 9, 2017

Ah ok, thanks for this.

Collaborator

vdauwera commented Jan 10, 2017

This is ready for review.

Collaborator

vdauwera commented Jan 20, 2017

Need to update the output syntax for workflow nesting.

Collaborator

vdauwera commented Jan 21, 2017 edited

  • Update & test ConvertPairedFastQToUnmappedBamWf: bd54620c-89c1-44e9-929b-46427556920e
  • Update & test RevertBamToUnmappedRGBamsWf: de19eda3-578b-4c7d-bbd3-7559368f4986
  • Update & test RevertRGBamsToPairedFastQsWf: d06c9971-a5c8-43e8-9e39-25db270b2b7e
Collaborator

vdauwera commented Jan 21, 2017

All workflows successful on Cromwell 24.

@vdauwera vdauwera WDLs for generic conversion/reversion use cases
 - revert batch of single-readgroup SAM/BAMs to FASTQ
 - convert batch of paired FASTQs to uBAM per readgroup
 - revert multi-readgroup SAM/BAM to uBAM
575a3c3

@vdauwera vdauwera merged commit 7ac575f into develop Jan 22, 2017

1 check was pending

code-review/pullapprove Approval required by 2 of: cjllanwarne, geoffjentry, Horneth, kshakir, mcovarr
Details

vdauwera deleted the gvda_reversion_wdls branch Jan 22, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment