Proposal: add a GLOB value to FileType enum #77

Closed
pditommaso opened this issue Aug 3, 2017 · 24 comments · Fixed by #185

Comments

@pditommaso

pditommaso commented Aug 3, 2017

The purpose of this proposal is to extend the ability of the TES API to handle dynamically generated output files whose names may not be known at the time of task definition, by adding a GLOB value to the possible types of a TaskParameter output.

This would make it possible to specify as output one or more files matching a glob pattern, e.g., *.bam or sample_x.{bam,bai}.

The server would resolve the list of matching files and would return them to the client when a FULL task view is retrieved.

The output name field can be used to logically group the files that resolve against the same glob pattern. For example, given the following output in the task definition:

output[ {name: "bam", path: "*.bam", type: "GLOB" } ]

the task view would return:

output[ 
   {name: "bam", path: "/some/path/pair_x.bam", type: "FILE" },
   {name: "bam", path: "/some/path/pair_y.bam", type: "FILE" },
   {name: "bam", path: "/some/path/pair_z.bam", type: "FILE" }
 ]

@geoffjentry
Contributor

Not a blocker but would need to be explicit which globbing rules are in play. I know you linked to bash, but that's not always a given.

@pditommaso
Author

My proposal is to stick with BASH standard wildcards, see here aka globbing patterns, hence:

  • ? matches any single character
  • * matches any number of characters
  • [] specifies a range eg
  • {} specifies alternative choices or wildcards
  • [!] negation

I agree that BASH may not always be a given; however, these wildcards are quite common (maybe with the exception of negation) and powerful. They should not be hard to implement in any programming language (I guess it's also possible to map them to a regexp).
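
The wildcard rules listed above can be sketched in Python. This is only an illustration, not part of the proposal: the helper names expand_braces and glob_matches are hypothetical. Brace groups are expanded first, and the remaining wildcards (?, *, [] and [!]) are delegated to the standard-library fnmatch module. Two simplifications are assumed: unlike Bash pathname expansion, fnmatch lets * match "/", and this sketch also expands single-item groups such as {a}.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of brace-free patterns."""
    # Find a group that contains no further braces (innermost first).
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for alternative in match.group(1).split(","):
        # Recurse so that any remaining groups are expanded as well.
        results.extend(expand_braces(head + alternative + tail))
    return results


def glob_matches(pattern, name):
    """Return True if name matches any brace expansion of pattern."""
    return any(fnmatch.fnmatchcase(name, p) for p in expand_braces(pattern))
```

For example, glob_matches("sample_x.{bam,bai}", "sample_x.bai") checks the name against both sample_x.bam and sample_x.bai.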

@pditommaso
Author

pditommaso commented Aug 4, 2017

I'm realising that maybe adding this as an extra GLOB value to the FileType enum is not the best idea (also considering the discussion on #76).

Another option could be to define a path pseudo-protocol glob: which can be used to prefix a wildcard path. For example:

output[ { name: "bam", path: "glob:*.bam" } ]

This looks more flexible and leaves room for possible extensions if needed. However, the type field may still be needed, at least to return the type of the matching paths (file or directory). This would be cheap to determine on the server side (files are local) but expensive for the client, for which they would be remote resources to query.

@buchanae
Contributor

buchanae commented Aug 4, 2017

To be clear, globs here are used to define which output files will be uploaded to storage, correct? Conversely, globs are not being used to change the output of a GetTask or ListTasks call, correct?

Another option could be to define a path pseudo-protocol glob: which can be used to prefix a wildcard path.

If you have code to handle glob paths, you probably don't need the "glob:" prefix in order to recognize a glob.

Alternatively, you can achieve this with a post-processing Executor that runs a sh -c "mv path/to/*.txt outputdir" and then uploads outputdir. The output file list is available in the task logs here
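
A rough sketch of this workaround as a task definition built in Python. The function name task_with_collector and all argument values are hypothetical, and it assumes that both executors can see the same files (for instance via a shared volume); only the executors, outputs, image, command, path, url and type fields come from the TES schema.

```python
def task_with_collector(image, command, pattern, outdir, url):
    """Build a TES task dict whose last executor gathers glob matches."""
    return {
        "executors": [
            # The actual tool invocation.
            {"image": image, "command": command},
            # Post-processing step: the shell inside the container expands
            # the glob and moves the matches into a single directory.
            {
                "image": image,
                "command": ["sh", "-c", f"mkdir -p {outdir} && mv {pattern} {outdir}"],
            },
        ],
        # The collected directory is uploaded as one DIRECTORY output.
        "outputs": [{"path": outdir, "url": url, "type": "DIRECTORY"}],
    }
```

Note that the files end up under outdir rather than at their original paths.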

@pditommaso
Author

pditommaso commented Aug 4, 2017

To be clear, globs here are used to define which output files will be uploaded to storage, correct?

Yes

Conversely, globs are not being used to change the output of a GetTask or ListTasks call, correct?

No. But a way to fetch the actual list of resolved file names/paths would be needed.

If you have code to handle glob paths, you probably don't need the "glob:" prefix in order to recognize a glob.

Maybe not. But how does the server implementation know that it is a glob instead of a concrete file name?

Alternatively, you can achieve this with a post-processing Executor that runs a sh -c "mv path/to/*.txt outputdir"

Smart, but it looks more like a hack. Wouldn't it be better to handle this in a proper manner?

The output file list is available in the task logs here

If I'm not wrong, GetTask returns a Task message which contains an outputs field. Is it supposed to report the list of outputs produced by the task, or instead the one in the TaskLog? Or both?

@buchanae
Contributor

buchanae commented Aug 4, 2017

No. But a way to fetch the actual list of resolved file names/paths would be needed.

I think that's what TaskLog.OutputFileLog is for.

Maybe not. But how does the server implementation know that it is a glob instead of a concrete file name?

The implementation can do whatever bash does. Actually, the implementation would probably use a library that will do this automatically.

Smart, but it looks more like a hack. Wouldn't it be better to handle this in a proper manner?

I agree globs would be more convenient. Is the convenience worth the extra work, complexity, bugs, etc?

If I'm not wrong, GetTask returns a Task message which contains an outputs field. Is it supposed to report the list of outputs produced by the task, or instead the one in the TaskLog? Or both?

I can see how that's confusing. Task.outputs is the output file definition from the original task message (the one sent to CreateTask). TaskLog.outputs is the final list of files that was uploaded.

@pditommaso
Author

I agree globs would be more convenient. Is the convenience worth the extra work, complexity, bugs, etc?

The problem I see with this solution is that it alters the path of the resulting files, having an impact on the downstream tasks which depend on those results.

TaskLog.outputs is the final list of files that was uploaded

Fine.

@buchanae
Contributor

buchanae commented Aug 4, 2017

The problem I see with this solution is that it alters the path of the resulting files, having an impact on the downstream tasks which depend on those results.

Good point.

@buchanae
Contributor

Given this output definition:

{ "url": "s3://my-bucket/my-data/", "path": "/data/bwa/*/*.bam"}

And these files:

/data/bwa/sample-1/chunk-1.bam
/data/bwa/sample-1/chunk-2.bam
/data/bwa/sample-2/chunk-1.bam
/data/bwa/sample-2/chunk-2.bam

What does TaskLog.outputs look like when the task has completed? What are the rules for how the implementation determines what the path under "s3://my-bucket/my-data/" is?

@pditommaso
Author

pditommaso commented Oct 24, 2017

In the same manner as the current specification states to handle:

{ "url": "s3://my-bucket/my-data/", "path": "/data/bwa/sample-1/chunk-1.bam"}

The implementation should only apply the glob to find which file names match that pattern, then apply the current implementation/rule to resolve each of them against the target URL.

@buchanae
Contributor

buchanae commented Oct 24, 2017

So the resulting URL would be s3://my-bucket/my-data/chunk-1.bam? The two chunk-1.bam files would conflict in that case.

@pditommaso
Author

pditommaso commented Oct 24, 2017

Interesting, now I'm realising your point is how to handle subdirectories. In NF the output declaration only allows relative paths and subdirectory structure is maintained, hence it would be:

{ "url": "s3://my-bucket/my-data/", "path": "*/*.bam"}  

and ideally it would produce:

s3://my-bucket/my-data/sample-1/chunk-1.bam
s3://my-bucket/my-data/sample-1/chunk-2.bam
s3://my-bucket/my-data/sample-2/chunk-1.bam
s3://my-bucket/my-data/sample-2/chunk-2.bam

In the context of this specification I see three alternatives:

  1. Allow only basic glob (no subdirectory).
  2. Allow only relative path containing a glob (in the same manner as NF).
  3. Allow absolute paths with globs. Only the subdirectory structure from the point where the first glob appears is maintained in the target URL. That should be coherent with the current implementation.
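
A minimal sketch of how the third alternative could be resolved, assuming the kept sub-path starts at the first pattern component that contains a wildcard character. The function name target_url and the GLOB_CHARS set are hypothetical, not part of the specification.

```python
# Characters treated as wildcard markers in this sketch (an assumption).
GLOB_CHARS = set("*?[{")


def target_url(base_url, pattern, matched_path):
    """Map a matched container path to a URL under base_url."""
    parts = pattern.split("/")
    # Index of the first path component that contains a wildcard; without
    # any wildcard only the file name is kept, as for a plain output path.
    first_glob = next(
        (i for i, part in enumerate(parts) if GLOB_CHARS & set(part)),
        len(parts) - 1,
    )
    kept = matched_path.split("/")[first_glob:]
    return base_url.rstrip("/") + "/" + "/".join(kept)
```

With the pattern /data/bwa/*/*.bam and the URL s3://my-bucket/my-data/, the file /data/bwa/sample-1/chunk-1.bam maps to s3://my-bucket/my-data/sample-1/chunk-1.bam, so the two chunk-1.bam files no longer conflict.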

@buchanae
Contributor

Ok, thanks for clarifying. Another consideration is whether double star globs are supported, and whether globs follow symlinks.

@pditommaso
Author

Good points. IMO it's a yes to both of them (even though double-star globs are rarely used in practice, so it's not so critical).

@erikvdbergh

Hi, wanted to add my 2 cents on this. We've been talking to @mr-c about globbing and it would greatly assist the efforts on running CWL workflows on a TES endpoint to have this available. glob is a very commonly used strategy for output files in CWL and it would help if this was in the TES spec too. The unfortunate truth today is still that many tools do not use customizable output filenames, so a glob option in this case would be valuable.

In CWL, globbing follows POSIX rules as outlined here: https://www.systutorials.com/docs/linux/man/7-glob/ which are equivalent to the bash rules as far as I understand. So that should be the standard here too IMO.

@geoffjentry
Contributor

Hi @erikvdbergh - from my perspective the one thing to take into consideration here is that both WDL and CWL are in heavy use by GA4GH driver projects so any globbing solution should work smoothly for both.

The WDL spec describes how they handle globbing here: https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md#globs

It seems to me as long as it's standardized on what Bash does (which seems POSIX) that it should be fine.

@adamstruck
Member

I think it makes sense to add glob support as well.

My question is whether or not we need to explicitly add a GLOB file type. All output paths could be treated as a potential glob pattern. This relates to #76 where there has been debate as to whether we need the file type field in the spec at all.

@adamstruck
Member

If a glob doesn't match any files would that be considered an error?

@pditommaso
Author

This should be delegated to the client. The role of the GLOB file type is to allow the client to specify a pattern for output file name(s) that will be resolved at runtime. If the resulting list of files is empty, the client can choose to ignore or report the error condition.

@geoffjentry
Contributor

This isn't strictly required. For instance in Cromwell using PAPI v1, while they provided globbing support we didn't use theirs as their globbing pattern was different from ours. We handled it manually by wrapping the job we sent to PAPI.

My personal $0.02 is to just go this route as it's a) simpler (no need to agree on globbing patterns, nothing for TES implementations to implement) and b) not strictly needed

However, I'll reiterate that based on the F2F in Toronto these sorts of decisions ultimately need to go down to the drivers - both the drivers who are using TES (which I think is just EBI at the moment) and WES implementations used by the drivers who might be talking to it (which is just Arvados and Cromwell at the moment). As one of those 3 legs, I've mentioned my own preference above but defer to @erikvdbergh

@mr-c

mr-c commented Sep 18, 2018

FYI, Bash globs != POSIX globs:
https://en.wikipedia.org/wiki/Glob_(programming)#Syntax

I recommend going with POSIX globs; BASH seems to be a super-set (but someone should verify that).

@wleepang

Pinging this thread to see if there's an update. Looks like there is a need from both the CWL and NF folks to have this available.

@kellrott kellrott added this to the 1.1 milestone Oct 12, 2021
@kellrott
Member

We're looking at adding this for v1.1. Does anyone have a PR that translates this to OpenAPI?

@uniqueg
Contributor

uniqueg commented Jul 28, 2022

Understanding that being unable to natively handle dynamically generated output files is a potential blocker for the wider adoption of TES backends for upstream implementers, particularly Nextflow, as well as WDL- and CWL-based workflow engines, there is a push to address this issue in TES v1.1.

Before going ahead with one or multiple PRs for addressing pattern matching/globbing, I would like to summarize the discussion in this thread.

Context: sending task requests

The proposal concerns specifying pattern matching/wildcards ("globs") when sending a task request. In the current TES specification (commit 61558fd) the relevant portions of the specification are the following:

requestBody of the POST /tasks operation definition

requestBody:
  content:
    application/json:
      schema:
        $ref: '#/components/schemas/tesTask'
  required: true

outputs property of the tesTask schema definition

outputs:
  type: array
  description: |-
    Output files.
    Outputs will be uploaded from the executor container to long-term storage.
  items:
    $ref: '#/components/schemas/tesOutput'
  example:
    - { "path" : "/data/outfile", "url" : "s3://my-object-store/outfile-1", type: "FILE" }

tesOutput schema definition

tesOutput:
  required:
  - path
  - type
  - url
  type: object
  properties:
    name:
      type: string
      description: User-provided name of output file
    description:
      type: string
      description: Optional users provided description field, can be used for documentation.
    url:
      type: string
      description: |-
        URL for the file to be copied by the TES server after the task is complete.
        For Example:
         - `s3://my-object-store/file1`
         - `gs://my-bucket/file2`
         - `file:///path/to/my/file`
    path:
      type: string
      description: |-
        Path of the file inside the container.
        Must be an absolute path.
    type:
      $ref: '#/components/schemas/tesFileType'
  description: Output describes Task output files.

tesFileType enum definition

tesFileType:
  type: string
  default: FILE
  enum:
  - FILE
  - DIRECTORY

Context: accessing output files

URLs to output files accessible by the client are specified in tesTask.logs, specifically in the following schemas:

logs property of the tesTask schema definition

logs:
  type: array
  description: |-
    Task logging information.
    Normally, this will contain only one entry, but in the case where
    a task fails and is retried, an entry will be appended to this list.
  readOnly: true
  items:
    $ref: '#/components/schemas/tesTaskLog'

outputs property of the tesTaskLog schema definition

outputs:
  type: array
  description: |-
    Information about all output files. Directory outputs are
    flattened into separate items.
  items:
    $ref: '#/components/schemas/tesOutputFileLog'

tesOutputFileLog schema definition

tesOutputFileLog:
  required:
  - path
  - size_bytes
  - url
  type: object
  properties:
    url:
      type: string
      description: URL of the file in storage, e.g. s3://bucket/file.txt
    path:
      type: string
      description: Path of the file inside the container. Must be an absolute
        path.
    size_bytes:
      type: string
      description: |-
        Size of the file in bytes. Note, this is currently coded as a string
        because official JSON doesn't support int64 numbers.
      format: int64
      example:
        - "1024"
  description: |-
    OutputFileLog describes a single output file. This describes
    file details after the task has completed successfully,
    for logging purposes.
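
As a small illustration of how a client could read the resolved outputs from these schemas, the following sketch collects the URLs from a task that has already been retrieved and parsed into a dict. The function name resolved_output_urls is hypothetical, and it assumes that the last entry in logs belongs to the most recent attempt, since entries are appended on retries.

```python
def resolved_output_urls(task):
    """Return the output URLs recorded in the latest tesTaskLog entry."""
    logs = task.get("logs") or []
    if not logs:
        # No attempt has been logged yet, so nothing has been uploaded.
        return []
    # Each item follows the tesOutputFileLog schema (url, path, size_bytes).
    return [entry["url"] for entry in logs[-1].get("outputs") or []]
```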

Current way of addressing pattern matching

The tesOutput.path definition does not explicitly define the expected behavior of a TES implementation in cases where clients provide globs. While the implicit assumption in this proposal/thread is that TES implementations would not support resolving pattern matching wildcards, TES implementations that were to support them would arguably not violate the specification. Hence, the expected behavior for globs is currently underspecified.

Should TES implementations be required to support pattern matching in task outputs?

Pro

The main argument for explicitly requiring TES implementations to support pattern matching in task requests is that situations where outputs are generated dynamically are common in the execution of workflows and of individual tools that do not allow specifying precise output names.

Con

The main argument against supporting pattern matching is that it introduces unnecessary complexity on the server side, which may potentially be avoidable by extra work on the client side (see comment by @geoffjentry) or by using a "post-processing" executor to move matching files to an output directory (see comment by @buchanae). The former solution seems to work nicely for Cromwell, but it is unclear how well this strategy translates to other workflow engines/languages or other types of clients. Likewise, as suggested by @pditommaso here, the latter solution has the problem that it would change the task outputs from multiple files to a single directory - which might have implications for downstream tasks that may be difficult to handle for clients.

How to indicate that TES should interpret a specified output as a glob?

Three solutions have been suggested, as listed below (together with pros and cons for each):

  1. Add an item GLOB to the tesFileType enum, then set the tesOutput.type property to GLOB.
    + Non-breaking change
    + Unambiguous
    + Does not require escape characters for filenames including wildcard characters
    − Semantically incorrect: a glob is not a file type
    − There was some support for dropping the tesOutput.type property altogether (see Proposal: make TaskParameter type field optional #76); however, if Making file type an optional argument #155 is merged, this is likely not going to be a question anymore
  2. Prefix the tesOutput.path with a "protocol" prefix, e.g., glob:.
    + Non-breaking change
    + Does not require escape characters for filenames including wildcard characters
    − May still be ambiguous (i.e., if a filename starts with glob:)
  3. Have TES always apply pattern matching to tesOutput.path, similar to how most shells (e.g., Bash) handle this.
    + Only requires changes at the documentation/description level
    + Unambiguous
    + Consistent with behavior that many users know from shells
    − Might constitute a breaking change (behavior/expectations for filenames containing wildcard characters change if implementations choose to interpret tesOutput.path values literally); however, TES is actually underspecified in this regard

Which globbing rules to prescribe?

Both Bash and POSIX pattern matching have been suggested. For Bash, it is important to distinguish between pattern matching and brace expansion (see this article), which are often referred to together as wildcards, because functionally they may behave similarly (i.e., select multiple files based on patterns). When excluding brace expansion, pattern matching rules for Bash and POSIX appear to be largely identical.

Another question was whether other Bash-/shell-specific features, such as double asterisks (**) should be supported.

The main arguments for Bash wildcards (pattern matching, brace expansion and double asterisks) were that they are powerful and that many users are already familiar with them. Also, as stated by @geoffjentry, WDL supports Bash globs, although it is not specified exactly which wildcards are supported and whether they include brace expansion and/or double asterisks.

The main arguments for POSIX rules are that they are clearly specified and that, due to better compatibility, there would be more implementations available to parse them, thus making it easier for TES developers to adopt/implement them. As pointed out by @erikvdbergh, CWL relies on the POSIX specification for pattern matching.

How to construct remote storage URLs?

In the current version of the specification, tesOutput.url requires a fully qualified URL at which an individual output will be made available before a given task concludes. But when outputs are generated dynamically, clients cannot possibly know these URLs, so TES servers need to construct them on the fly at runtime and then report the compiled URLs via tesOutputFileLog.url.

There was agreement that in the case of globbing, tesOutput.url should be used to indicate to the server a URL pointing to a directory on the desired storage solution where outputs selected by the glob should be provided. However, plainly copying over all matched outputs to a given directory can lead to name clashes if wildcards are present in subdirectories, because in a situation like the following (taken from this comment by @buchanae), two output files with conflicting paths would be created in the bucket:

tesOutput schema definition:

{
   "url": "s3://my-bucket/task-data/",
   "path": "/data/*/*"
}

Output files:

/data/sample-1/chunk.bam
/data/sample-2/chunk.bam

Only one output file/object would be created in the bucket:

s3://my-bucket/task-data/chunk.bam
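
The clash can be reproduced with a short sketch that maps each matched path to the URL it would receive if only the file name were appended to tesOutput.url. The function name flattened_collisions is hypothetical and the flattening rule is the assumption described above, not a behavior prescribed by the specification.

```python
from collections import defaultdict


def flattened_collisions(base_url, matched_paths):
    """Return URLs that more than one matched path would be written to."""
    by_url = defaultdict(list)
    for path in matched_paths:
        # Keep only the file name, dropping all subdirectories.
        filename = path.rsplit("/", 1)[-1]
        by_url[base_url.rstrip("/") + "/" + filename].append(path)
    return {url: paths for url, paths in by_url.items() if len(paths) > 1}
```

A server could run such a check before uploading and fail the task, or fall back to one of the strategies discussed next.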

@pditommaso summarized possible solutions to the problem in this comment, to which I have added some possible advantages and disadvantages:

  1. Do not allow pattern matching in subdirectories in tesOutput.path.
    + Avoids the problem entirely
    − Less flexibility; not all use cases may be covered
    − Unusual/unexpected pattern matching behavior that clients would need to be (made) aware of
  2. Allow only relative paths when specifying a path with pattern matching wildcards, then appending the entire path to the bucket location/directory when forming URLs.
    + Allows flexibility of using globs across the entire file path
    + Consistent with usual pattern matching behavior known, e.g., from Bash
    − Breaks with the current provision that only absolute paths are allowed for tesOutput.path
    − Introduces inconsistency between glob-containing and non-globbed outputs
    − Introduces some complexity that TES implementers need to shoulder
  3. Allow pattern matching in subdirectories and form URLs by appending the entire subdirectory tree and filename to the bucket location/directory, starting with the first directory that differs across the different output files.
    + Allows flexibility of using globs across the entire file path
    + Consistent with usual pattern matching behavior known, e.g., from Bash
    − Introduces quite a bit of complexity that TES implementers need to shoulder

In a private discussion, @kellrott further suggested a fourth option:

  4. Define a new property tesOutput.path_prefix that is used to indicate to TES implementations a prefix of tesOutput.path that TES implementations should trim to identify the subdirectory tree to be recreated at the directory specified by tesOutput.url whenever wildcards are used (the property would be ignored otherwise).
    + Allows flexibility of using globs across the entire file path
    + Consistent with usual pattern matching behavior known, e.g., from Bash
    + Maximizes clients' control over eventual output URLs
    − Introduces some complexity that TES implementers need to shoulder
    − Introduces a new property that is only relevant when wildcards are available
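
A minimal sketch of how the fourth option might be applied on the server side, assuming path_prefix must be a leading directory of every matched path. The function name url_with_prefix and the error handling are hypothetical; only the idea of trimming the prefix and recreating the remainder under tesOutput.url comes from the suggestion.

```python
def url_with_prefix(base_url, path_prefix, matched_path):
    """Trim path_prefix from matched_path and append the rest to base_url."""
    # Normalise so that "/data" does not accidentally match "/database/...".
    prefix = path_prefix.rstrip("/") + "/"
    if not matched_path.startswith(prefix):
        raise ValueError(f"{matched_path!r} is not located under {path_prefix!r}")
    relative = matched_path[len(prefix):]
    return base_url.rstrip("/") + "/" + relative
```

With path_prefix set to /data and the URL s3://my-bucket/task-data/, the file /data/sample-1/chunk.bam would be uploaded to s3://my-bucket/task-data/sample-1/chunk.bam.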
