New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: add a GLOB value to FileType enum #77

Open
pditommaso opened this Issue Aug 3, 2017 · 21 comments

Comments

Projects
None yet
6 participants
@pditommaso

pditommaso commented Aug 3, 2017

The purpose of this proposal is the extend the ability of TES API to handle dynamically generated output files which name may not be known at the time the task definition by adding a GLOB value to the possible types of a TaskParameter output.

This would allow the possibility to specify as output one or more files matching a glob pattern eg. *.bam or sample_x.{bam,bai}, etc.

The server would resolve the list of matching files and would return them to the client when a FULL task view is retrieved.

The output name field can be used to logically group the files that resolves against the same glob pattern. For example having the following output in the task definition:

output[ {name: "bam", path: "*.bam", type: "GLOB" } ]

the task view would return:

output[ 
   {name: "bam", path: "/some/path/pair_x.bam", type: "FILE" },
   {name: "bam", path: "/some/path/pair_y.bam", type: "FILE" },
   {name: "bam", path: "/some/path/pair_z.bam", type: "FILE" }
 ]  
@geoffjentry

This comment has been minimized.

Contributor

geoffjentry commented Aug 3, 2017

Not a blocker but would need to be explicit which globbing rules are in play. I know you linked to bash, but that's not always a given.

@pditommaso

This comment has been minimized.

pditommaso commented Aug 4, 2017

My proposal is to stick with BASH standard wildcards, see here aka globbing patterns, hence:

  • ? matches any single character
  • * matches any number of characters
  • [] species a range eg
  • {} specifies alternative choices or wildcards
  • [!] negation

I agree that BASH may not be always given, however these wildcards are quite common (maybe with the exception of the negation) and powerful. It should not to be hard to implement in any programming language (also I guess it's possible to map them to a regexp).

@pditommaso

This comment has been minimized.

pditommaso commented Aug 4, 2017

I'm realising that maybe add this as an extra GLOB value to the FileType enum is not the best idea (also considering the discussion on #76).

Another option could be to define a path pseudo-protocol glob: which can be used to prefix a wildcard path. For example:

output[ { name: "bam", path: "glob:*.bam" } ]

This looks more flexible and keep open for possible extensions if needed. However the type field may still be needed, at least to return the type of the matching paths (file or directory). This would be cheap to determinate on the server side (files are local) but expensive for the client for which they would be remote resources to query.

@buchanae

This comment has been minimized.

Contributor

buchanae commented Aug 4, 2017

To be clear, globs here are used to define which output files will be uploaded to storage, correct? Conversely, globs are not being used to change the output of a GetTask or ListTasks call, correct?

Another option could be to define a path pseudo-protocol glob: which can be used to prefix a wildcard path.

If you have code to handle glob paths, you probably don't need the "glob:" prefix in order to recognize a glob.

Alternatively, you can achieve this with a post-processing Executor that runs a sh -c "mv path/to/*.txt outputdir" and then uploads outputdir. The output file list is available in the task logs here

@pditommaso

This comment has been minimized.

pditommaso commented Aug 4, 2017

To be clear, globs here are used to define which output files will be uploaded to storage, correct?

Yes

Conversely, globs are not being used to change the output of a GetTask or ListTasks call, correct?

No. But it would be needed a way to fetch the actual list of resolved file names/paths.

If you have code to handle glob paths, you probably don't need the "glob:" prefix in order to recognize a glob.

Maybe no. But how the server implementation knows that is a glob instead of a concreate file name?

Alternatively, you can achieve this with a post-processing Executor that runs a sh -c "mv path/to/*.txt outputdir"

Smart, but it looks like more an hack. Wouldn't be better to handle in a proper manner?

The output file list is available in the task logs here

If I'm not wrong a GetTask returns a Task message which contains an outputs field. Is it supposed to report the list of outputs produced by the task? or instead the one in the TaskLog? of both?

@buchanae

This comment has been minimized.

Contributor

buchanae commented Aug 4, 2017

No. But it would be needed a way to fetch the actual list of resolved file names/paths.

I think that's what TaskLog.OutputFileLog is for.

Maybe no. But how the server implementation knows that is a glob instead of a concrete file name?

The implementation can do whatever bash does. Actually, the implementation would probably use a library that will do this automatically.

Smart, but it looks like more an hack. Wouldn't be better to handle in a proper manner?

I agree globs would be more convenient. Is the convenience worth the extra work, complexity, bugs, etc?

If I'm not wrong a GetTask returns a Task message which contains an outputs field. Is it supposed to report the list of outputs produced by the task? or instead the one in the TaskLog? of both?

I can see how that's confusing. Task.outputs is the output file definition from the original task message (the one sent to CreateTask). TaskLog.outputs is the final list of files that was uploaded.

@pditommaso

This comment has been minimized.

pditommaso commented Aug 4, 2017

I agree globs would be more convenient. Is the convenience worth the extra work, complexity, bugs, etc?

The problem I see with this solution is that it alters the path of the resulting files, having an impact on the downstream tasks which depend on that results.

TaskLog.outputs is the final list of files that was uploaded

Fine.

@buchanae

This comment has been minimized.

Contributor

buchanae commented Aug 4, 2017

The problem I see with this solution is that it alters the path of the resulting files, having an impact on the downstream tasks which depend on that results.

Good point.

@buchanae

This comment has been minimized.

Contributor

buchanae commented Oct 24, 2017

Given this output definition:

{ "url": "s3://my-bucket/my-data/", "path": "/data/bwa/*/*.bam"}

And these files:

/data/bwa/sample-1/chunk-1.bam
/data/bwa/sample-1/chunk-2.bam
/data/bwa/sample-2/chunk-1.bam
/data/bwa/sample-2/chunk-2.bam

What does TaskLog.outputs look like when the task has completed? What are the rules for how the implementation determines what the path under "s3://my-bucket/my-data/" is?

@pditommaso

This comment has been minimized.

pditommaso commented Oct 24, 2017

In the same manner as the current specification states to handle:

{ "url": "s3://my-bucket/my-data/", "path": "/data/bwa/sample-1/chunk-1.bam"}

The implementation should only applies the glob to find which file names match that pattern, then applies the current implementation/rule to resolve each of them against the target URL.

@buchanae

This comment has been minimized.

Contributor

buchanae commented Oct 24, 2017

So the resulting URL would be s3://my-bucket/my-data/chunk-1.bam? The two chunk-1.bam files would conflict in that case.

@pditommaso

This comment has been minimized.

pditommaso commented Oct 24, 2017

Interesting, now I'm realising your point is how to handle subdirectories. In NF the output declaration only allows relative paths and subdirectory structure is maintained, hence it would be:

{ "url": "s3://my-bucket/my-data/", "path": "*/*.bam"}  

and ideally it would produce:

s3://my-bucket/my-data/sample-1/chunk-1.bam
s3://my-bucket/my-data/sample-1/chunk-2.bam
s3://my-bucket/my-data/sample-2/chunk-1.bam
s3://my-bucket/my-data/sample-2/chunk-2.bam

In the context of this specification I see three alternatives:

  1. Allow only basic glob (no subdirectory).
  2. Allow only relative path containing a glob (in the same manner as NF).
  3. Allows absolute paths with glob. Only the subdirectory structure from which the first glob appears is maintained in the target URL. That should be coherent with the current implementation.
@buchanae

This comment has been minimized.

Contributor

buchanae commented Oct 24, 2017

Ok, thanks for clarifying. Another consideration is whether double star globs are supported, and whether globs follow symlinks.

@pditommaso

This comment has been minimized.

pditommaso commented Oct 24, 2017

Good points. IMO it's a yes to both of them (even tho double stars glob in practice is rarely used, it's not so critical).

@erikvdbergh

This comment has been minimized.

erikvdbergh commented Jun 8, 2018

Hi, wanted to add my 2 cents on this. We've been talking to @mr-c about globbing and it would greatly assist the efforts on running CWL workflows on a TES endpoint to have this available. glob is a very commonly used strategy for output files in CWL and it would help if this was in the TES spec too. The unfortunate truth today is still that many tools do not use customizable output filenames so a glob opotion in this case would be valuable.

In CWL, globbing follows POSIX rules as outlined here: https://www.systutorials.com/docs/linux/man/7-glob/ which are equivalent to the bash rules as far as I understand. So that should be the standard here too IMO.

@geoffjentry

This comment has been minimized.

Contributor

geoffjentry commented Jun 8, 2018

Hi @erikvdbergh - from my perspective the one thing to take into consideration here is that both WDL and CWL are in heavy use by GA4GH driver projects so any globbing solution should work smoothly for both.

The WDL spec describes how they handle globbing here: https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md#globs

It seems to me as long as it's standardized on what Bash does (which seems POSIX) that it should be fine.

@adamstruck

This comment has been minimized.

Member

adamstruck commented Jun 8, 2018

I think it makes sense to add glob support as well.

My question is whether or not we need to explicitly add a GLOB file type. All output paths could be treated as a potential glob pattern. This relates to #76 where there has been debate as to whether we need the file type field in the spec at all.

@adamstruck

This comment has been minimized.

Member

adamstruck commented Jun 8, 2018

If a glob doesn't match any files would that be considered an error?

@pditommaso

This comment has been minimized.

pditommaso commented Jun 8, 2018

This should be delegated to the client. The role of GLOB file type is to allow the client to specify a pattern for an output file name(s) that will be resolved as runtime. If the resulting list of files is empty the client can choose to ignore or report the error the condition.

@geoffjentry

This comment has been minimized.

Contributor

geoffjentry commented Jun 10, 2018

This isn't strictly required. For instance in Cromwell using PAPI v1, while they provided globbing support we didn't use theirs as their globbing pattern was different from ours. We handled it manually by wrapping the job we sent to PAPI.

My personal $0.02 is to just go this route as it's a) simpler (no need to agree on globbing patterns, nothing for TES implementations to implement) and b) not strictly needed

However, I'll reiterate that based on the F2F in Toronto these sorts of decisions ultimately need to go down to the drivers - both the drivers who are using TES (which I think is just EBI at the moment) and WES implementations used by the drivers who might be talking to it (which is just Arvados and Cromwell at the moment). As one of those 3 legs, I've mentioned my own preference above but defer to @erikvdbergh

@mr-c

This comment has been minimized.

mr-c commented Sep 18, 2018

FYI, Bash globs != POSIX globs:
https://en.wikipedia.org/wiki/Glob_(programming)#Syntax

I recommend going with POSIX globs; BASH seems to be a super-set (but someone should verify that).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment