File objects should include type #42

ghost · 2015-05-12T15:25:22Z

Current way of representing file objects:

path: /path/to/file.ext
size: 42
metadata:
  key: value
secondaryFiles:
 - "@type": "File"
   path: /path/to/file.ext.ndx
 - "@type": "File"
   path: /path/to/file.replaced_ext

This makes us rely on type definitions to tell whether something is a file, which is bad since type definitions can be unions and are therefore ambiguous.

I'd suggest we include the type explicitly, add a "name" property, and represent secondary files with a naming convention. Also, metadata keys should hopefully be prefixed with the ontology base. Example:

class: File
path: /path/to/file.ext
name: file.ext
size: 42
metadata:
  prefix:key: value
secondaryFiles:
 - ".ext"
 - "^.replaced"

tetron · 2015-05-23T02:10:57Z

class: File is fine. It just occurred to me that we can represent string constants in Avro by using an Enum with a single symbol.
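
The single-symbol Enum idea can be sketched as an Avro-style schema written as a Python dict. The record layout and the names `File_class` and `matches_enum` are hypothetical, chosen for illustration; this is not the schema that was actually committed.

```python
# Avro-style record schema written as a plain dict (hypothetical layout).
FILE_SCHEMA = {
    "type": "record",
    "name": "File",
    "fields": [
        # An enum with exactly one symbol behaves like a string constant:
        # the only valid value of "class" is "File".
        {"name": "class",
         "type": {"type": "enum", "name": "File_class", "symbols": ["File"]}},
        {"name": "path", "type": "string"},
    ],
}


def matches_enum(enum_schema, value):
    """Return True if value is one of the enum's symbols."""
    return value in enum_schema["symbols"]


class_type = FILE_SCHEMA["fields"][0]["type"]
print(matches_enum(class_type, "File"))  # True
print(matches_enum(class_type, "file"))  # False
```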

I don't understand the purpose of a separate name field. What happens if the path and the name are not in agreement? Is that a validation error? Or is it intended to be added by CWL runtime after loading and validating the input object, for the benefit of scripts?
If there are patterns in secondaryFiles, is this just a shorthand for file.ext.ext and file.replaced, which could be converted into literals in an early preprocessing step? You can already specify secondaryFiles patterns in a File input parameter in the tool description. What is the use case? Having patterns in the input object itself seems useful only if you a) have something for which the secondary files can't be matched in advance (which does seem like a valid case for files with dependencies, like a Python script that imports modules) and b) are writing the input object by hand, so writing out explicit dependencies is tedious.
We haven't completely figured out metadata, so I prefer to keep it out of the spec until we have a complete design.
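
The preprocessing step raised above, turning patterns into literal paths, might look like the following sketch. It assumes a leading `^` strips the last extension before the rest of the pattern is appended, which matches the file.ext.ext and file.replaced readings given here. Handling several carets in a row is an extra assumption, and `expand_secondary_file` is a hypothetical name.

```python
import posixpath


def expand_secondary_file(primary_path, pattern):
    """Expand one secondaryFiles pattern against a primary file path.

    Assumed convention: each leading "^" removes the last extension of
    the primary file name; the remainder of the pattern is appended.
    """
    directory, basename = posixpath.split(primary_path)
    while pattern.startswith("^"):
        pattern = pattern[1:]
        stem, dot, _ext = basename.rpartition(".")
        # Only strip when there is a real extension (not a leading dot).
        if dot and stem:
            basename = stem
    return posixpath.join(directory, basename + pattern)


print(expand_secondary_file("/path/to/file.ext", ".ext"))
# /path/to/file.ext.ext
print(expand_secondary_file("/path/to/file.ext", "^.replaced"))
# /path/to/file.replaced
```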

ghost · 2015-05-27T12:30:46Z

> Or is it intended to be added by CWL runtime after loading and validating the input object, for the benefit of scripts?

Yes. I guess it's a minor benefit, so can just leave it out.

Regarding secondary files, I think they are only really useful for describing tools that create them (e.g. bamtools index would specify .bai on its "indexed" output) so that the engine can make sure that the index file always stays with the data file (when moving to long-term storage or between compute nodes).

Don't think we need to require users/engines to supply the secondary files list at all, can just have it as a property of output bindings. If we do, perhaps it's simplest to always list secondary files as patterns (not whole file structs)?

tetron · 2015-05-27T20:15:32Z

secondaryFiles is important for validating that the secondary files are actually there (such as an index), and similarly for ensuring that secondary output files are included alongside the primary output file. However, I think it makes sense to be part of the input or output parameter descriptions for the tool, and not required to be written out in the input object. (What I disagree with is having the patterns be in the input object).
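
A rough sketch of that validation, assuming the patterns come from the tool's parameter description and that a leading `^` strips the last extension of the primary file name. The function names are hypothetical, and the `exists` argument is there only so the check can be tried without touching the filesystem.

```python
import os
import posixpath


def expand_pattern(primary_path, pattern):
    """Turn one pattern into a literal path (assumed '^' convention)."""
    directory, basename = posixpath.split(primary_path)
    while pattern.startswith("^"):
        pattern = pattern[1:]
        stem, dot, _ext = basename.rpartition(".")
        if dot and stem:
            basename = stem
    return posixpath.join(directory, basename + pattern)


def missing_secondary_files(primary_path, patterns, exists=os.path.exists):
    """Return the expected secondary file paths that are not present."""
    expected = [expand_pattern(primary_path, p) for p in patterns]
    return [path for path in expected if not exists(path)]


# Pretend only the .bai index is present next to the primary file.
present = {"/path/to/file.bam", "/path/to/file.bam.bai"}
print(missing_secondary_files("/path/to/file.bam", [".bai"],
                              exists=present.__contains__))
# []
print(missing_secondary_files("/path/to/file.bam", [".bai", "^.replaced"],
                              exists=present.__contains__))
# ['/path/to/file.replaced']
```

An engine could run the same check on tool outputs to confirm that the declared secondary files were produced alongside the primary output.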

ghost · 2015-05-29T12:00:57Z

So, patterns in secondaryFiles property of output bindings to signify which produced files to "attach". Where would we place the patterns to signify which secondary files are required? Avro objects with type: File?

Optional list of full file objects in secondaryFiles property of class: File objects sounds good to me. Was that what you meant? Example:

class: File
size: 42
path: /path/to/file.bam
secondaryFiles:
  - class: File
    size: 13
    path: /path/to/file.bam.bai
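
With the explicit class field and full file objects under secondaryFiles, an engine can collect everything that has to travel with a primary file without consulting type definitions. A minimal sketch, assuming the object has already been loaded into Python dicts; `collect_paths` is a hypothetical helper, not part of any CWL implementation.

```python
def collect_paths(file_obj):
    """Return the primary path followed by all secondary file paths."""
    if file_obj.get("class") != "File":
        raise ValueError("expected an object with class: File")
    paths = [file_obj["path"]]
    # secondaryFiles is optional; each entry is itself a full File object.
    for secondary in file_obj.get("secondaryFiles", []):
        paths.extend(collect_paths(secondary))
    return paths


bam = {
    "class": "File",
    "size": 42,
    "path": "/path/to/file.bam",
    "secondaryFiles": [
        {"class": "File", "size": 13, "path": "/path/to/file.bam.bai"},
    ],
}
print(collect_paths(bam))
# ['/path/to/file.bam', '/path/to/file.bam.bai']
```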

tetron · 2015-06-01T02:02:34Z

Including class: File in the input object is now enforced in the schema.

I think we're on the same page with secondaryFiles; it remains to be documented and we'll need to write tests. However, I'm closing this issue since the main topic is now completed.

tetron added this to the Draft 2 milestone May 13, 2015

tetron pushed a commit that referenced this issue May 25, 2015

Implement issue #42: require "class: File" for file-type input objects.

adea007

tetron closed this as completed Jun 1, 2015

tetron mentioned this issue Jun 1, 2015

Draft 2 changes #36

Closed

mr-c added a commit that referenced this issue Sep 6, 2016

Merge pull request #42 from common-workflow-language/specialize-refscope

13061e5

Can use refScope and mapSubject/mapPredicate on specialize.
