Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_bams_bulk #6

Open
wants to merge 125 commits into
base: master
Choose a base branch
from
Open

merge_bams_bulk #6

wants to merge 125 commits into from

Conversation

lakras
Copy link
Contributor

@lakras lakras commented Dec 15, 2019

Bulk version of merge_bams.

merge_bams_bulk takes as input:

  • Array[File]+ in_bams: an array of input bam files
  • File in_bam_out_bam_table: a file containing a list of input bam filenames and the corresponding output bam filenames (tab-separated, one input file per line)
  • File? reheader_table: an optional reheader table

Input bam files are matched to output bam files according to the mapping in in_bam_out_bam_table.

(I am no longer including bulk versions of align_and_plot and assemble_denovo because I do not think they are necessary now that the batch runner beta tool is functional.)

…scatter, so it actually shows up as an input on dnanexus
…riables in original assemble_denovo so that assemble_denovo_bulk can provide them when calling assemble_denovo as subworkflow
pipes/WDL/workflows/merge_bams_bulk.wdl Outdated Show resolved Hide resolved
Comment on lines +25 to +34
String placeholder = write_map(in_bam_to_out_bam) # need to touch map in outer scatter for it to be seen in inner scatter

# retrieves the input bam files for this output bam file
scatter (in_bam in in_bams) {
String in_bam_basename = basename(in_bam, ".bam")
if(in_bam_to_out_bam[in_bam_basename] == out_bam) {
File relevant_in_bam = in_bam
}
}
Array[File] relevant_in_bams = select_all(relevant_in_bam) # gathers input bam files from the scatter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinating -- as I stare at this, it appears you're basically trying to pull off two things in these particular lines of code highlighted here:

  1. invert the data structure (the Map) that might be transposed from what would make the workflow a bit more intuitive (you wanted to scatter on the unique values instead of the unique keys, so arguably this Map is backwards from what you want)
  2. find a way to resolve your filenames-in-a-table Strings into proper Files, something I never thought was possible, but this pulls it off by converting every File back into a String (with basename) and testing one by one in the if statement

Is it possible we could avoid all this if the workflow took, as input, an Map[String, Array[File]] out_filename_to_in_bams, where the keys were output filenames (as Strings) and the values were all the input BAM files (as WDL File objects, not text) that are to be merged into one output? Then this whole workflow could be as easy as:

workflow merge_bams_bulk {
  Map[String, Array[File]] out_filename_to_in_bams
  File? reheader_table
  String? docker="quay.io/broadinstitute/viral-core"

  scatter (pair in out_filename_to_in_bams) {
    call demux.merge_and_reheader_bams {
      input:
        out_basename = basename(pair.left, ".bam"),
        in_bams = pair.right,
        reheader_table = reheader_table,
        docker = docker
    }
  }
}

Then maybe if there's really a UI-based desire to present the input mapping as a table with filenames, perhaps we make another WDL workflow (merge_bams_bulk_from_tsv or something like that?) that reads the input text table and Array[File] input like you have here, and converts them all to a Map[String,Array[File]] using the above magic you've already written, and then invokes the simplified workflow I proposed as a subworkflow.

Copy link
Contributor Author

@lakras lakras Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be so glorious. Unfortunately Map[String, Array[File]] is not supported by DNAnexus, or at least not in any way I can understand.

The closest thing I could find to an explanation is at https://github.com/dnanexus/dxWDL/blob/master/doc/Internals_v0_81.md:

Ragged arrays of files (Array[Array[File]]), and other more complex WDL types, are mapped to two fields: a flat array of files, and a hash, which is a json serialized representation of the WDL value. The flat file array informs the job manager about data objects that need to be closed and cloned into the workspace.

—or at https://libraries.io/github/dnanexus/dxWDL under "Design":

Ragged arrays of files (Array[Array[File]]), and other more complex WDL types, are mapped to two fields: a flat array of files, and a file, which is a json serialized representation of the WDL value. The flat file array informs the job manager about data objects that need to be closed and cloned into the workspace.

I don't actually understand what this is saying, or if the takeaway is that there is some way to make complex types work with the UI. If you have any insights beyond mine then maybe we could make it work.

You can play around with a demo of what this and other complex datatypes are like at Hep A MA DPH/pipelines/2.0.11.0-86-g2003e73-lk-bulkifying_demo_of_complex_input_types/merge_bams_bulk (I can't figure out how to link to it directly, sorry.) Basically there are two inputs created for each complex type: the opportunity to select files and a string input. What to do with them, or if there is anything that can be done with them, I couldn't figure out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow that quote under "Design" sounds exactly what you're doing in your pipeline. You take an input Array[File] of input bam Files and then you take a text file input that contains a serialized representation of the two dimensional data structure. Is there any reason you have to use a tab file instead of a json with the exact same info? I assume you're creating your large tab file programmatically anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I previously just didn't parse that at all, but no reason except that I couldn't figure out how the actual UI inputs work.

For any complex data type (in this case we are looking at map_of_strings_to_arrays_of_files), there are two inputs. The first takes a flat array of files, as described:
Screen Shot 2019-12-16 at 12 53 08 PM
Screen Shot 2019-12-16 at 12 54 07 PM

The second takes a string:
Screen Shot 2019-12-16 at 12 53 24 PM

Somehow, this string should contain the json Map[String, Array[File]]. But how? If the user has to copy/paste the mapping, I actually think the current version of merge_bams_bulk, where the mapping lives in a file, is better for reproducibility and figuring out any errors. Or should it be a hash, as described in the first of the two explanations? If so, a hash of what? created with what hash function? (And if I can't figure it out, how would an end user know what to do?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my main issue. Anytime we write a WDL pipeline, I want it to be portable across different execution environments--otherwise, why are we writing it in WDL in the first place? So it should run just fine on your laptop, on a cluster, on DNAnexus, on Terra, and other environments I've never used before. One of the main principles in making that happen is to try to remove as much of the unix file system concept from the high level pipeline itself, leaving it only to the code within each task, because most of the time, these pipelines run in places where directories don't exist, and so forth.

Talking about a bunch of filenames in the text content of an input file starts getting very infrastructure specific, because the way files are organized and referred to are very infrastructure specific. For example, in the case of DNAnexus, the official / unambiguous way to refer to a file object is by its DNAnexus opaque file identifier (project-ABC:file-XYZ). In other places, you have directories. Sometimes there's a non-opaque file identifier, such as with most cloud buckets. So what the execution engines generally promote is capturing the input files in an infrastructure-specific way when you execute the workflow and describe the Files, but handling the plumbing of the pipelines in an opaque (don't make any assumptions) kind of way. So using DNAnexus's preferred way of taking in those input mappings is at least something that doesn't mess with compatibility with other platforms. Another approach, as I mentioned before, is to keep the basic part of this pipeline clean and generic (take in the Map[String, Array[File]]) and use an optional/add-on task right before it to make life easier for DNAnexus users (perhaps even named in a way that makes it clear that it's platform-specific).

As for the json representation, almost every WDL executor uses json anyway for describing inputs to executions (that's what you used in that dx-defaults file you deleted). It'd be something to the effect of:

 {
   "outfilename1.bam": ["project-ABC:file-XYZ1A", "project-ABC:file-XYZ1B"]
   "outfilename2.bam": ["project-ABC:file-XYZ2A", "project-ABC:file-XYZ2B"] 
 }

I mean, maybe those input files could be presented in a directory/filename-like representation instead of the opaque identifiers I used above, but maybe not -- the dx-defaults don't accept those forms, but you never know until you try. Anyway, although the above json would be annoying to human-write, I'm assuming your tabfile was script-produced anyway, and there's no real reason it couldn't produce this form of the output as well.

This might seem like splitting hairs between two ways that are semantically doing the same thing (providing a serialized mapping of the two-dimensional data structure, one in tsv form, one in json, and the former is certainly simpler), but I think there's actually a significant difference philosophically between the two: in the latter case, we're telling the execution engine (whether it's DNAnexus or something else) the data structure in a way it understands and having it resolve it. In the former case, we're telling our own custom-written code about the mapping and having it resolve it -- and when it comes to File objects, I think this potentially gets dangerous, as it breaks an intended abstraction layer, and could easily have difficulty porting more broadly.

Also, if the primary purpose of the preceding bits is to solve a UI issue on DNAnexus, it seems that it would be sensible to make this part of the solution very DNAnexus-specific code. Similar to the assembly stats python script (that is DNAnexus specific), we could imagine a "parse-the-tabfile-to-json" script that is also DNAnexus specific and provided separately from all the WDL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on all counts. My only concern is that I reaaaaally hope it doesn't turn out that we need to use "project-ABC:file-XYZ1A"-type filenames in the json. I like being able to see the mapping for debugging and to verify that I didn't mess anything up—though that can be/is printed as an intermediate output, so at least we could see that. But also because an input file with those filenames would be a lot harder to put together. I guess we'll see! I'll play around with actually using the json inputs later this week.

pipes/dnax/dx-defaults-assemble_denovo_bulk.json Outdated Show resolved Hide resolved
Comment on lines +47 to +57
task clean_in_bam_out_bam_table {
File table

command {
cat ${table} | sed 's/[.]bam$//g' | sed $'s/[.]bam\t/\t/g' | tee in_bam_out_bam_table
}

output {
File clean_table = "in_bam_out_bam_table"
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This task seems unnecessary -- you're already calling the WDL basename function on all the members of the table, so why do it twice (once in WDL, once in a task)?

Copy link
Contributor Author

@lakras lakras Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(As you definitely already can tell,) my thinking here was to deal with the user potentially including the .bam extension in the input and/or output bam filenames listed in in_bam_out_bam_table. I'd like to be able to handle, for example, a mapping like

input_a1.bam	output_a
input_a2	output_a
input_b1	output_b.bam
input_b2	output_b
input_c1.bam	output_c.bam
input_d1	output_d.bam

or even a more reasonable

input_a1.bam	output_a.bam
input_a2.bam	output_a.bam

vs.

input_a1	output_a
input_a2	output_a

I ran into problems with this line:
if(in_bam_to_out_bam[in_bam_basename] == out_bam)

Note that in_bam_basename is the basename of an input bam file from the list of bam files provided by the user (Array[File]+ in_bams), while in_bam_to_out_bam is the above mapping file provided by the user. (This map is built directly from the contents of the file, and is not mutable.)

This would be fine if we could check the map for in_bam_basename and in_bam_basename + ".bam". Unfortunately, I couldn't find a way to check if a key is present in a map—and if the key is not in the map, in_bam_to_out_bam[in_bam_basename] gives me an error. So I decided to clean out the .bams from inside the user-provided mapping file before the map is created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants