Proposal: add a GLOB value to FileType enum #77
Comments
Not a blocker but would need to be explicit which globbing rules are in play. I know you linked to bash, but that's not always a given. |
My proposal is to stick with BASH standard wildcards, see here aka globbing patterns, hence:
I agree that BASH may not always be a given; however, these wildcards are quite common (maybe with the exception of the negation) and powerful. It should not be hard to implement them in any programming language (I also guess it's possible to map them to a regexp). |
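As a rough illustration of the "map them to a regexp" idea, here is a minimal Python sketch that relies on the standard library's `fnmatch.translate`. The file names and the helper `glob_to_regex` are hypothetical and only serve the example; nothing here is prescribed by the proposal.

```python
import fnmatch
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a shell-style wildcard pattern into a regular expression."""
    return re.compile(fnmatch.translate(pattern))


# Hypothetical file names, for illustration only.
names = ["a.bam", "a.bai", "b.bam", "notes.txt"]

bam = glob_to_regex("*.bam")
matched = [n for n in names if bam.match(n)]

# Negation inside a bracket expression: names not starting with "a".
not_a = glob_to_regex("[!a]*.bam")
negated = [n for n in names if not_a.match(n)]

print(matched)
print(negated)
```

One caveat with this shortcut: in `fnmatch`, `*` also matches path separators, which differs from how a shell expands paths, so a real implementation would need to decide which behaviour it wants.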
I'm realising that maybe add this as an extra Another option could be to define a path pseudo-protocol
This looks more flexible and keep open for possible extensions if needed. However the |
To be clear, globs here are used to define which output files will be uploaded to storage, correct? Conversely, globs are not being used to change the output of a GetTask or ListTasks call, correct?
If you have code to handle glob paths, you probably don't need the "glob:" prefix in order to recognize a glob. Alternatively, you can achieve this with a post-processing Executor that runs a |
Yes
No. But a way to fetch the actual list of resolved file names/paths would be needed.
Maybe not. But how does the server implementation know that it is a glob rather than a concrete file name?
Smart, but it looks more like a hack. Wouldn't it be better to handle this in a proper manner?
If I'm not wrong a |
I think that's what TaskLog.OutputFileLog is for.
The implementation can do whatever bash does. Actually, the implementation would probably use a library that will do this automatically.
I agree globs would be more convenient. Is the convenience worth the extra work, complexity, bugs, etc?
I can see how that's confusing. |
The problem I see with this solution is that it alters the paths of the resulting files, which has an impact on the downstream tasks that depend on those results.
Fine. |
Good point. |
Given this output definition:
And these files:
What does TaskLog.outputs look like when the task has completed? What are the rules for how the implementation determines what the path under "s3://my-bucket/my-data/" is? |
In the same manner as the current specification states to handle:
The implementation should only apply the glob to find which file names match that pattern, then apply the current implementation/rule to resolve each of them against the target URL. |
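A minimal sketch of that two-step resolution, in Python. The function name `resolve_glob_outputs`, the file names, and the `keep_subdirs` flag are assumptions made for illustration; how subdirectories map onto the target URL is exactly what is still being discussed in this thread, so the flag only shows the two obvious options rather than a decided rule.

```python
import glob
import os
import tempfile


def resolve_glob_outputs(workdir, pattern, target_url, keep_subdirs=False):
    """Return (local_path, destination_url) pairs for files matching pattern.

    Step 1: expand the glob relative to the working directory.
    Step 2: resolve each match against the target URL (hypothetical rule).
    """
    base = target_url.rstrip("/")
    # Escape the directory part so only `pattern` is treated as a glob.
    full_pattern = os.path.join(glob.escape(workdir), pattern)
    pairs = []
    for path in sorted(glob.glob(full_pattern, recursive=True)):
        if not os.path.isfile(path):
            continue
        rel = os.path.relpath(path, workdir).replace(os.sep, "/")
        suffix = rel if keep_subdirs else os.path.basename(rel)
        pairs.append((path, f"{base}/{suffix}"))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", "b.txt", os.path.join("sub", "c.txt"), "d.log"):
        open(os.path.join(tmp, rel), "w").close()

    target = "s3://my-bucket/my-data/"
    flat = [url for _, url in resolve_glob_outputs(tmp, "*.txt", target)]
    nested = [
        url
        for _, url in resolve_glob_outputs(tmp, "**/*.txt", target, keep_subdirs=True)
    ]
```

With `keep_subdirs=False`, two matches with the same base name in different directories would collide at the destination, which is one reason the subdirectory rule needs to be made explicit.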
So the resulting URL would be |
Interesting, now I'm realising your point is how to handle subdirectories. In NF the output declaration only allows relative paths and subdirectory structure is maintained, hence it would be:
and ideally it would produce:
In the context of this specification I see three alternatives:
|
Ok, thanks for clarifying. Another consideration is whether double star globs are supported, and whether globs follow symlinks. |
Good points. IMO it's a yes to both of them (even though double-star globs are rarely used in practice, so that one is not so critical). |
Hi, wanted to add my 2 cents on this. We've been talking to @mr-c about globbing and it would greatly assist the efforts on running CWL workflows on a TES endpoint to have this available. In CWL, globbing follows POSIX rules as outlined here: https://www.systutorials.com/docs/linux/man/7-glob/ which are equivalent to the bash rules as far as I understand. So that should be the standard here too IMO. |
Hi @erikvdbergh - from my perspective the one thing to take into consideration here is that both WDL and CWL are in heavy use by GA4GH driver projects so any globbing solution should work smoothly for both. The WDL spec describes how they handle globbing here: https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md#globs It seems to me as long as it's standardized on what Bash does (which seems POSIX) that it should be fine. |
I think it makes sense to add glob support as well. My question is whether or not we need to explicitly add a GLOB file type. All output paths could be treated as a potential glob pattern. This relates to #76 where there has been debate as to whether we need the file type field in the spec at all. |
If a glob doesn't match any files would that be considered an error? |
This should be delegated to the client. The role of the GLOB file type is to allow the client to specify a pattern for output file name(s) that will be resolved at runtime. If the resulting list of files is empty, the client can choose to ignore or report the error condition. |
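To make the "delegate to the client" idea concrete, here is a small Python sketch of a client-side check. It assumes the client has already collected the resolved files per output name into a dictionary; that shape, and the names `empty_globs` and `check_outputs`, are hypothetical and not part of the TES API.

```python
def empty_globs(resolved):
    """Return the output names whose glob matched no files."""
    return sorted(name for name, files in resolved.items() if not files)


def check_outputs(resolved, required=()):
    """Raise if a required glob is empty; otherwise report the empty ones."""
    missing = [name for name in empty_globs(resolved) if name in required]
    if missing:
        raise ValueError(f"required glob outputs matched no files: {missing}")
    return empty_globs(resolved)


# Hypothetical resolved outputs: one glob matched a file, one matched nothing.
resolved = {"bams": ["a.bam"], "logs": []}

# A lenient client only notes that "logs" is empty.
ignored = check_outputs(resolved)
```

A stricter client would pass `required={"logs"}` and treat the resulting `ValueError` as a task failure.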
This isn't strictly required. For instance, in Cromwell using PAPI v1, while they provided globbing support, we didn't use theirs as their globbing pattern was different from ours. We handled it manually by wrapping the job we sent to PAPI. My personal $0.02 is to just go this route as it's a) simpler (no need to agree on globbing patterns, nothing for TES implementations to implement) and b) not strictly needed. However, I'll reiterate that based on the F2F in Toronto these sorts of decisions ultimately need to go down to the drivers - both the drivers who are using TES (which I think is just EBI at the moment) and WES implementations used by the drivers who might be talking to it (which is just Arvados and Cromwell at the moment). As one of those 3 legs, I've mentioned my own preference above but defer to @erikvdbergh |
FYI, Bash globs != POSIX globs: I recommend going with POSIX globs; BASH seems to be a super-set (but someone should verify that). |
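One concrete difference: brace expansion such as `sample_x.{bam,bai}` (the example used in the proposal) is a Bash shell feature rather than part of POSIX pattern matching, and Python's `fnmatch` likewise treats braces as literal characters. The sketch below shows one way an implementation could layer a simplified brace expansion on top of plain wildcard matching. The helpers `expand_braces` and `matches` are hypothetical, and the expansion is deliberately limited: no escaping, and nested groups are not handled the way Bash handles them.

```python
import fnmatch
import re

# A brace group with at least one comma and no nested braces.
_BRACE = re.compile(r"\{([^{}]*,[^{}]*)\}")


def expand_braces(pattern):
    """Expand {a,b} groups into a list of brace-free patterns (simplified)."""
    m = _BRACE.search(pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[: m.start()], pattern[m.end():]
    expanded = []
    for alternative in m.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def matches(name, pattern):
    """Match name against pattern after expanding brace groups."""
    return any(fnmatch.fnmatchcase(name, p) for p in expand_braces(pattern))


pattern = "sample_x.{bam,bai}"
expanded = expand_braces(pattern)

# Plain wildcard matching treats the braces literally, so this is False.
plain = fnmatch.fnmatchcase("sample_x.bam", pattern)
# With the expansion step, the same name matches.
with_braces = matches("sample_x.bam", pattern)
```

Whether TES should require this kind of expansion or stay with the plain POSIX wildcard set is the open question; the sketch only shows that the two are not interchangeable.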
Pinging this thread to see if there's an update. Looks like there is need from both the CWL and NF folks to have this available. |
We're looking at adding this for v1.1. Does anyone have a PR that translates this to OpenAPI? |
Being unable to natively handle dynamically generated output files is a potential blocker for the wider adoption of TES backends for upstream implementers, particularly Nextflow, as well as WDL- and CWL-based workflow engines, so there is a push to address this issue in TES v1.1. Before going ahead with one or multiple PRs for addressing pattern matching/globbing, I would like to summarize the discussion in this thread.
Context: sending task requests
The proposal concerns specifying pattern matching/wildcards ("globs") when sending a task request. In the current TES specification (commit 61558fd), the relevant portions of the specification are the following:
|
The purpose of this proposal is to extend the ability of the TES API to handle dynamically generated output files whose names may not be known at the time of the task definition, by adding a GLOB value to the possible types of a TaskParameter output.
This would make it possible to specify as output one or more files matching a glob pattern, e.g. *.bam or sample_x.{bam,bai}, etc.
The server would resolve the list of matching files and would return them to the client when a FULL task view is retrieved.
The output name field can be used to logically group the files that resolve against the same glob pattern. For example, having the following output in the task definition:
the task view would return: