Secondary index files and directories in WDL #2269

Closed
katevoss opened this issue May 13, 2017 · 15 comments

Comments

@katevoss

@vdauwera commented on Mon Apr 24 2017

Need to put in a Cromwell ticket for this. Basic ask: have Cromwell automatically look for (and co-localize) accessory files when given files with specific extensions. E.g. if I give it a foo.bam file, it should look for foo.bai.

Note that sometimes it's just a matter of swapping the extension, but sometimes it's adding another extension, and there can be multiple accessory files, e.g. reference.fasta is always accompanied by both reference.fasta.fai and reference.dict.

This would ideally be configurable by the Cromwell admin, who would set up a list of primary file extensions and their accessory file naming patterns. Bonus points if the user can provide their own config on the command line to override the server's config. And also I want a pet unicorn that farts glitter.
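
The lookup described above can be pictured as a table mapping each primary extension to naming patterns for its accessory files. The Python sketch below is only an illustration of that idea: `ACCESSORY_PATTERNS`, `accessory_paths`, and the `{stem}`/`{path}` placeholders are hypothetical names, not Cromwell configuration keys or API.

```python
import posixpath

# Hypothetical admin-supplied table: primary extension -> accessory patterns.
# "{stem}" is the path without its final extension; "{path}" is the full path.
ACCESSORY_PATTERNS = {
    ".bam": ["{stem}.bai"],                   # swap the extension
    ".fasta": ["{path}.fai", "{stem}.dict"],  # append one extension, swap another
}


def accessory_paths(primary, patterns=ACCESSORY_PATTERNS):
    """Return the accessory file paths expected alongside `primary`."""
    stem, ext = posixpath.splitext(primary)
    return [t.format(stem=stem, path=primary) for t in patterns.get(ext, [])]


if __name__ == "__main__":
    print(accessory_paths("foo.bam"))
    print(accessory_paths("ref/reference.fasta"))
```

A user-level override could be modeled by passing a merged table, for example `accessory_paths(p, {**ACCESSORY_PATTERNS, **user_patterns})`, where `user_patterns` is again a hypothetical name.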


WDL folks;
This is a followup from a recent discussion about getting compatible bcbio generated WDL (http://gatkforums.broadinstitute.org/wdl/discussion/9257/object-attribute-access-and-secondary-index-files). Thanks to all the great help you've provided we now have compatible WDL output that passes validation:

https://github.com/bcbio/test_bcbio_cwl/blob/master/run_info-cwl-wdl

This is brilliant, and I'd like to move into testing runs with Cromwell. Before starting this, there is one major area I know we're missing in the conversion, handling of secondary files and directories of files. CWL has the notion of secondaryFiles (http://www.commonwl.org/v1.0/Workflow.html#File) which you can use to block these and ensure they get staged/run next to each other. I use this in bcbio and wanted to figure out the best way to map it into WDL.

There are two cases we use these for:

  • Index files associated with compressed inputs, like BAM bai indices and bgzip VCF tbi indices. These are a single index file attached to the original file that should get staged in the same directory when running.
  • Directories of index files like bwa or snpeff. These are a bit trickier since they can have many files and a variable number depending on the input.

What is the recommended way to deal with these cases in WDL? I'll have to re-engineer bcbio to be able to represent and pass these and wanted to do so in a way that was forward compatible with WDL's thoughts and plans. I've seen recommendations on current hacks like explicitly declaring the indexes as separate files, or tarring up a directory of files and passing that as input. I'm not clear enough on staging files from WDL/Cromwell to understand if these are guaranteed to always go in the right place (bai next to bam, all indexes in the same directory).
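
To make the two workarounds mentioned above concrete, here is a minimal Python sketch of what they amount to on the file system: placing an index in the same directory as its primary file, and packing a directory of index files into a single tarball that is unpacked inside the task. It is not how Cromwell localizes files; the function names `stage_together`, `pack_directory`, and `unpack_directory` are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def stage_together(files, workdir):
    """Copy every file into `workdir` so an index sits next to its primary file."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    staged = []
    for f in map(Path, files):
        dest = workdir / f.name
        if dest.exists():
            raise FileExistsError(f"basename collision while staging: {dest.name}")
        shutil.copy(f, dest)
        staged.append(dest)
    return staged


def pack_directory(src_dir, tar_path):
    """Bundle a directory holding any number of index files into one tarball."""
    src_dir = Path(src_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    return Path(tar_path)


def unpack_directory(tar_path, dest_dir):
    """Extract the tarball under `dest_dir` and return the restored directory."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as tar:
        top = tar.getnames()[0].split("/")[0]  # the archived directory's name
        tar.extractall(dest_dir)
    return dest_dir / top
```

The tarball route keeps a variable number of index files behind a single `File` input, at the cost of an extra unpack step in every task that needs them.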

Thanks for any thoughts/suggestions/tips.

This Issue was generated from your [forums]
[forums]: http://gatkforums.broadinstitute.org/wdl/discussion/9299/secondary-index-files-and-directories-in-wdl/p1


@vdauwera commented on Thu May 04 2017

@katevoss this is a very common request from the Cromwell user community

@katevoss katevoss added this to the Poss Q - WDL author joy milestone May 13, 2017
@geoffjentry
Contributor

Not keen on the specific implementation suggestion. But the general idea is obviously a good one.

@vdauwera
Contributor

Not sure if you mean the unicorn or the file-finding setup; I am of course open to alternate suggestions on both.

This would bring much author joy forth into the world.

@geoffjentry
Contributor

Hah

I'd prefer to provide a clean way to specify any collection of files (see CWL's secondary files concept) and then syntactic sugar in the form of specialized types, e.g. BamFile, which knows to look for an index.

Having it be configurable at the Cromwell level implies a potential lack of portability for WDLs
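
As a rough illustration of the "generic collection plus specialized sugar" idea, the Python sketch below models a bundle type and two convenience constructors. Nothing here is WDL or Cromwell syntax: `FileBundle`, `bam_file`, and `fasta_file` are hypothetical names, and the index naming follows the foo.bam / foo.bai and reference.fasta examples given earlier in this thread.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class FileBundle:
    """A primary file plus an arbitrary collection of secondary files."""
    primary: str
    secondary: tuple = ()

    def all_files(self):
        """Everything that must be staged together, primary file first."""
        return (self.primary, *self.secondary)


def bam_file(path):
    """Sugar over FileBundle: a BAM that knows to look for its .bai index."""
    stem, _ = posixpath.splitext(path)
    return FileBundle(path, (stem + ".bai",))


def fasta_file(path):
    """Sugar over FileBundle: a reference with its .fai index and .dict file."""
    stem, _ = posixpath.splitext(path)
    return FileBundle(path, (path + ".fai", stem + ".dict"))
```

Because the naming rule travels with the type rather than with a server's configuration, the same workflow would resolve the same files on any engine, which is the portability concern raised here.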

@vdauwera
Contributor

Oh hmm very good point, hadn't viewed it from the portability angle.

@katevoss katevoss removed this from the Poss Q - WDL author joy milestone Jul 18, 2017
@kshakir
Contributor

kshakir commented Jul 25, 2017

@davidbenjamin has an interesting proposal for user-defined / explicit sets of params for WDL: https://github.com/broadinstitute/wdl/issues/102

Depending on how "CWL support" addresses the secondaryFiles mentioned above, similar WDL features will presumably follow.

@vdauwera
Contributor

FYI this was a key item in feedback from our WDL sessions in the UK workshops; having to specify accessory files is a big source of annoyance. Not that it's any surprise, but we're definitely getting confirmation from real users.

@geoffjentry
Contributor

We should certainly heed the lesson that CWL learned to provide both the concepts of directory and secondary files. They wound up implementing the former because people were also trying to do that and shoehorning it into the latter.

@kcibul
Contributor

kcibul commented Jul 25, 2017 via email

@patmagee

@kcibul fyi openwdl/wdl#160

not strictly secondary files, but it's close

@illusional
Collaborator

Is there any progress on secondary files? I see we have structs, which I could probably use, but I'm looking for a concept that makes it simpler to pick up index files rather than writing more globs and more mappings. We got directory support in WDL (openwdl/wdl#241) and in Cromwell (#3980).

I understand that the language and the engine are different, but Cromwell has some concept of these secondary files as the CWL implementation supports it.

illusional added a commit to PMCC-BioinformaticsCore/janis that referenced this issue Feb 14, 2019
@patmagee

@illusional While I cannot speak on behalf of the Cromwell team on what they are implementing, I can say that there have been no discussions around secondary files for WDL. My inclination is that we will try to steer clear of it within WDL. However, I encourage you to create an issue or make a PR in the WDL repo suggesting this change, and we can allow the community to determine whether or not it should be something supported.

@geoffjentry
Contributor

I would agree w/ @patmagee that this is a matter for the OpenWDL group. Any Cromwell-level constructs to get at the underlying functionality would require non-portable WDLs to be written. I'll tag @cjllanwarne in case he has any clever ideas on how to express the concept in portable WDL in a less sucky way.

I disagree with @patmagee that WDL should steer clear of the concept. IMO not doing this in the first place was one of the larger mistakes we made in the early days of WDL. Perhaps something with Object. We're seeing something similar play out in GA4GH land w/ DRS: the concept of a file bundle seems inescapable, and it's not quite the same thing as Directory.

@vdauwera
Contributor

vdauwera commented Feb 16, 2019 via email

@illusional
Collaborator

Thanks @patmagee, @geoffjentry and all for directing me to the correct place. I've created a discussion over at openwdl/wdl#289 as a place to have the conversation. If anyone finds this conversation, I'd love to see any thoughts you have on how accessory files may be specified in WDL.

@geoffjentry
Contributor

Sounds great @illusional. I'm going to close this issue from the Cromwell side.
