merge_bams_bulk #6

lakras · 2019-12-15T05:43:59Z

Bulk version of merge_bams.

merge_bams_bulk takes as input:

Array[File]+ in_bams: an array of input bam files
File in_bam_out_bam_table: a file containing a list of input bam filenames and the corresponding output bam filenames (tab-separated, one input file per line)
File? reheader_table: an optional reheader table

Input bam files are matched to output bam files according to the mapping in in_bam_out_bam_table.

(I am no longer including bulk versions of align_and_plot and assemble_denovo because I do not think they are necessary now that the batch runner beta tool is functional.)

…scatter, so it actually shows up as an input on dnanexus

…bles

…n file

…all to assemble_denovo

…riables in original assemble_denovo so that assemble_denovo_bulk can provide them when calling assemble_denovo as subworkflow

… defaults for assemble_denovo_bulk too

… hardcoded

… with a hardcoded key

…ppears

…runner beta works again

pipes/WDL/workflows/merge_bams_bulk.wdl

dpark01 · 2019-12-16T02:23:42Z

pipes/WDL/workflows/merge_bams_bulk.wdl

+        String placeholder = write_map(in_bam_to_out_bam) # need to touch map in outer scatter for it to be seen in inner scatter
+
+        # retrieves the input bam files for this output bam file
+        scatter (in_bam in in_bams) {
+            String in_bam_basename = basename(in_bam, ".bam")
+            if(in_bam_to_out_bam[in_bam_basename] == out_bam) {
+                File relevant_in_bam = in_bam
+            }
+        }
+        Array[File] relevant_in_bams = select_all(relevant_in_bam) # gathers input bam files from the scatter


Fascinating -- as I stare at this, it appears you're basically trying to pull off two things in these particular lines of code highlighted here:

invert the data structure (the Map) that might be transposed from what would make the workflow a bit more intuitive (you wanted to scatter on the unique values instead of the unique keys, so arguably this Map is backwards from what you want)

find a way to resolve your filenames-in-a-table Strings into proper Files, something I never thought was possible, but this pulls it off by converting every File back into a String (with basename) and testing one by one in the if statement

Is it possible we could avoid all this if the workflow took, as input, an Map[String, Array[File]] out_filename_to_in_bams, where the keys were output filenames (as Strings) and the values were all the input BAM files (as WDL File objects, not text) that are to be merged into one output? Then this whole workflow could be as easy as:

workflow merge_bams_bulk { Map[String, Array[File]] out_filename_to_in_bams File? reheader_table String? docker="quay.io/broadinstitute/viral-core" scatter (pair in out_filename_to_in_bams) { call demux.merge_and_reheader_bams { input: out_basename = basename(pair.left, ".bam"), in_bams = pair.right, reheader_table = reheader_table, docker = docker } } }

Then maybe if there's really a UI-based desire to present the input mapping as a table with filenames, perhaps we make another WDL workflow (merge_bams_bulk_from_tsv or something like that?) that reads the input text table and Array[File] input like you have here, and converts them all to a Map[String,Array[File]] using the above magic you've already written, and then invokes the simplified workflow I proposed as a subworkflow.

That would be so glorious. Unfortunately Map[String, Array[File]] is not supported by DNAnexus, or at least not in any way I can understand.

The closest thing I could find to an explanation is at https://github.com/dnanexus/dxWDL/blob/master/doc/Internals_v0_81.md:

Ragged arrays of files (Array[Array[File]]), and other more complex WDL types, are mapped to two fields: a flat array of files, and a hash, which is a json serialized representation of the WDL value. The flat file array informs the job manager about data objects that need to be closed and cloned into the workspace.

—or at https://libraries.io/github/dnanexus/dxWDL under "Design":

Ragged arrays of files (Array[Array[File]]), and other more complex WDL types, are mapped to two fields: a flat array of files, and a file, which is a json serialized representation of the WDL value. The flat file array informs the job manager about data objects that need to be closed and cloned into the workspace.

I don't actually understand what this is saying, or if the takeaway is that there is some way to make complex types work with the UI. If you have any insights beyond mine then maybe we could make it work.

You can play around with a demo of what this and other complex datatypes are like at Hep A MA DPH/pipelines/2.0.11.0-86-g2003e73-lk-bulkifying_demo_of_complex_input_types/merge_bams_bulk (I can't figure out how to link to it directly, sorry.) Basically there are two inputs created for each complex type: the opportunity to select files and a string input. What to do with them, or if there is anything that can be done with them, I couldn't figure out.

Wow that quote under "Design" sounds exactly what you're doing in your pipeline. You take an input Array[File] of input bam Files and then you take a text file input that contains a serialized representation of the two dimensional data structure. Is there any reason you have to use a tab file instead of a json with the exact same info? I assume you're creating your large tab file programmatically anyway?

I previously just didn't parse that at all, but no reason except that I couldn't figure out how the actual UI inputs work.

For any complex data type (in this case we are looking at map_of_strings_to_arrays_of_files), there are two inputs. The first takes a flat array of files, as described:

The second takes a string:

Somehow, this string should contain the json Map[String, Array[File]]. But how? If the user has to copy/paste the mapping, I actually think the current version of merge_bams_bulk, where the mapping lives in a file, is better for reproducibility and figuring out any errors. Or should it be a hash, as described in the first of the two explanations? If so, a hash of what? created with what hash function? (And if I can't figure it out, how would an end user know what to do?)

Here's my main issue. Anytime we write a WDL pipeline, I want it to be portable across different execution environments--otherwise, why are we writing it in WDL in the first place? So it should run just fine on your laptop, on a cluster, on DNAnexus, on Terra, and other environments I've never used before. One of the main principles in making that happen is to try to remove as much of the unix file system concept from the high level pipeline itself, leaving it only to the code within each task, because most of the time, these pipelines run in places where directories don't exist, and so forth.

Talking about a bunch of filenames in the text content of an input file starts getting very infrastructure specific, because the way files are organized and referred to are very infrastructure specific. For example, in the case of DNAnexus, the official / unambiguous way to refer to a file object is by its DNAnexus opaque file identifier (project-ABC:file-XYZ). In other places, you have directories. Sometimes there's a non-opaque file identifier, such as with most cloud buckets. So what the execution engines generally promote is capturing the input files in an infrastructure-specific way when you execute the workflow and describe the Files, but handling the plumbing of the pipelines in an opaque (don't make any assumptions) kind of way. So using DNAnexus's preferred way of taking in those input mappings is at least something that doesn't mess with compatibility with other platforms. Another approach, as I mentioned before, is to keep the basic part of this pipeline clean and generic (take in the Map[String, Array[File]]) and use an optional/add-on task right before it to make life easier for DNAnexus users (perhaps even named in a way that makes it clear that it's platform-specific).

As for the json representation, almost every WDL executor uses json anyway for describing inputs to executions (that's what you used in that dx-defaults file you deleted). It'd be something to the effect of:

{ "outfilename1.bam": ["project-ABC:file-XYZ1A", "project-ABC:file-XYZ1B"] "outfilename2.bam": ["project-ABC:file-XYZ2A", "project-ABC:file-XYZ2B"] }

I mean, maybe those input files could be presented in a directory/filename-like representation instead of the opaque identifiers I used above, but maybe not -- the dx-defaults don't accept those forms, but you never know until you try. Anyway, although the above json would be annoying to human-write, I'm assuming your tabfile was script-produced anyway, and there's no real reason it couldn't produce this form of the output as well.

This might seem like splitting hairs between two ways that are semantically doing the same thing (providing a serialized mapping of the two-dimensional data structure, one in tsv form, one in json, and the former is certainly simpler), but I think there's actually a significant difference philosophically between the two: in the latter case, we're telling the execution engine (whether it's DNAnexus or something else) the data structure in a way it understands and having it resolve it. In the former case, we're telling our own custom-written code about the mapping and having it resolve it -- and when it comes to File objects, I think this potentially gets dangerous, as it breaks an intended abstraction layer, and could easily have difficulty porting more broadly.

Also, if the primary purpose of the preceding bits is to solve a UI issue on DNAnexus, it seems that it would be sensible to make this part of the solution very DNAnexus-specific code. Similar to the assembly stats python script (that is DNAnexus specific), we could imagine a "parse-the-tabfile-to-json" script that is also DNAnexus specific and provided separately from all the WDL.

I agree on all counts. My only concern is that I reaaaaally hope it doesn't turn out that we need to use "project-ABC:file-XYZ1A"-type filenames in the json. I like being able to see the mapping for debugging and to verify that I didn't mess anything up—though that can be/is printed as an intermediate output, so at least we could see that. But also because an input file with those filenames would be a lot harder to put together. I guess we'll see! I'll play around with actually using the json inputs later this week.

pipes/dnax/dx-defaults-assemble_denovo_bulk.json

dpark01 · 2019-12-16T02:27:35Z

pipes/WDL/workflows/merge_bams_bulk.wdl

+task clean_in_bam_out_bam_table {
+    File table
+
+    command {
+    	cat ${table} | sed 's/[.]bam$//g' | sed $'s/[.]bam\t/\t/g' | tee in_bam_out_bam_table
+    }
+
+    output {
+        File clean_table = "in_bam_out_bam_table"
+    }
+}


This task seems unnecessary -- you're already calling the WDL basename function on all the members of the table, so why do it twice (once in WDL, once in a task)?

(As you definitely already can tell,) my thinking here was to deal with the user potentially including the .bam extension in the input and/or output bam filenames listed in in_bam_out_bam_table. I'd like to be able to handle, for example, a mapping like

input_a1.bam output_a input_a2 output_a input_b1 output_b.bam input_b2 output_b input_c1.bam output_c.bam input_d1 output_d.bam

or even a more reasonable

input_a1.bam output_a.bam input_a2.bam output_a.bam

vs.

input_a1 output_a input_a2 output_a

I ran into problems with this line:
if(in_bam_to_out_bam[in_bam_basename] == out_bam)

Note that in_bam_basename is the basename of an input bam file from the list of bam files provided by the user (Array[File]+ in_bams), while in_bam_to_out_bam is the above mapping file provided by the user. (This map is built directly from the contents of the file, and is not mutable.)

This would be fine if we could check the map for in_bam_basename and in_bam_basename + ".bam". Unfortunately, I couldn't find a way to check if a key is present in a map—and if the key is not in the map, in_bam_to_out_bam[in_bam_basename] gives me an error. So I decided to clean out the .bams from inside the user-provided mapping file before the map is created.

lakras added 30 commits November 12, 2019 11:11

added assemble_denovo_bulk with one task inside scatter

fa66e5d

added second task as-is to workflow

77c7c4d

gave the tasks aliases

4bd3f22

one task in the workflow, with aliasing

2ec6e0f

added dx-defaults-assemble_denovo_bulk faile

92d119d

made lastal_db_fasta from first task a workflow variable outside the …

1333e6d

…scatter, so it actually shows up as an input on dnanexus

added all tasks without aliasing with input variables as global varia…

7327f77

…bles

added missing commas, oops

c14ab66

limited to two tasks in workflow

ab5e06c

three tasks in workflow

cda9c49

deleted extra whitespace to see if that makes any difference

5f56e1e

added trim_clip_db as a workflow-level variable

309b84a

added remaining 2 tasks to workflow with workflow-level variables

2103912

moved default value of trim_clip_db to workflow-level variable in jso…

49e36d7

…n file

turned interior of scatter in assemble_denovo_bulk into subworkflow c…

6899f34

…all to assemble_denovo

actually named the subworkflow oops

0eb1b28

added additional arguments

2b475f9

actually named the subworkflow again gah

6ddae5b

added lastal_db_fasta and reference_genome_fasta as workflow-level va…

f905e27

…riables in original assemble_denovo so that assemble_denovo_bulk can provide them when calling assemble_denovo as subworkflow

moved trim_clip_db out to workflow-level variable and added it to the…

2c69f08

… defaults for assemble_denovo_bulk too

fixed dumb syntax error

1af46b3

actually fixed dumb syntax error

011403a

reverted back to before subworkflow

4d79c3b

...now without dumb syntax error

fbea2de

added option to add novocraft license

589a18f

removed empty defaults file for assemble_denovo_bulk

ad3861e

trying subworkflows again...

fb2644f

without syntax error

135c0f1

renamed subworkflow to avoid confusion

6ca90ed

fixed typo

1280d6b

lakras added 23 commits November 19, 2019 19:33

hardcoded map again

3f6314b

trying to access elements of map at different scopes

e47d6a5

fixed dumb error oops

c0f3ab4

put hardcoded test first

e5c81d8

removed references to hardcoded map

4281ef3

removed all except the inner-most test map access

183c550

pulled map access out of if statement

bd16496

map access in outer scatter

38e21fe

map access outside of scatter

b2ea552

map access in inner scatter, outer scatter, and outside of scatter

4575496

reading out_bams from in_bam_out_bam_table instead of its own file or…

3e45419

… hardcoded

map access in outer scatter commented out

20ccbbd

deleted commented out code

280ff5f

tried adding length of map

a5ecff6

added length of map as access to map, instead of accessing an element…

d0f4166

… with a hardcoded key

trying write_map as map touch

9456b6f

cleaned it up

4e330d3

renamed aa_merge_bams_bulk.wdl back to merge_bams_bulk.wdl

b10c66d

added function to remove .bam from inside in_bam_out_bam file if it a…

c211760

…ppears

removed align_and_plot_bulk and assemble_denovo_bulk since the batch …

ad5f699

…runner beta works again

removed variable declaration for cleaned in_bam_to_out_bam file

d03aa0e

improved comment

888bb2a

renamed intermediate file cleaned_table

85d8e62

lakras requested review from dpark01 and tomkinsc December 15, 2019 05:44

dpark01 reviewed Dec 16, 2019

View reviewed changes

lakras and others added 4 commits December 15, 2019 22:15

deleted defaults for assemble_denovo_bulk

97ddcbd

removed out_bams variable that is only used once

4e6d91e

Merge branch 'master' into lk-bulkifying

8c69883

Merge branch 'master' into lk-bulkifying

877f573

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge_bams_bulk #6

merge_bams_bulk #6

lakras commented Dec 15, 2019

dpark01 Dec 16, 2019

lakras Dec 16, 2019 •

edited

Loading

dpark01 Dec 16, 2019

lakras Dec 16, 2019

dpark01 Dec 16, 2019

lakras Dec 16, 2019

dpark01 Dec 16, 2019

lakras Dec 16, 2019 •

edited

Loading

merge_bams_bulk #6

Are you sure you want to change the base?

merge_bams_bulk #6

Conversation

lakras commented Dec 15, 2019

dpark01 Dec 16, 2019

Choose a reason for hiding this comment

lakras Dec 16, 2019 • edited Loading

Choose a reason for hiding this comment

dpark01 Dec 16, 2019

Choose a reason for hiding this comment

lakras Dec 16, 2019

Choose a reason for hiding this comment

dpark01 Dec 16, 2019

Choose a reason for hiding this comment

lakras Dec 16, 2019

Choose a reason for hiding this comment

dpark01 Dec 16, 2019

Choose a reason for hiding this comment

lakras Dec 16, 2019 • edited Loading

Choose a reason for hiding this comment

lakras Dec 16, 2019 •

edited

Loading

lakras Dec 16, 2019 •

edited

Loading