File objects should include type #42

Closed
ghost opened this issue May 12, 2015 · 5 comments

ghost commented May 12, 2015

Current way of representing file objects:

path: /path/to/file.ext
size: 42
metadata:
  key: value
secondaryFiles:
 - "@type": "File"
   path: /path/to/file.ext.ndx
 - "@type": "File"
   path: /path/to/file.replaced_ext

It makes us rely on type definitions to tell if something is a file or not, which is bad since type defs can be unions and thus ambiguous.

I'd suggest we include the type explicitly, add a "name" property, and represent secondary files with a naming convention. Also, metadata keys should ideally be prefixed with an ontology base. Example:

class: File
path: /path/to/file.ext
name: file.ext
size: 42
metadata:
  prefix:key: value
secondaryFiles:
 - ".ext"
 - "^.replaced"
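
A minimal sketch of how the naming convention above could be expanded into concrete paths. It assumes that a plain pattern is appended to the primary path and that each leading `^` strips one extension first, which matches the reading "file.ext.ext and file.replaced" given later in this thread. The function name `expand_secondary` is hypothetical and not part of any CWL implementation.

```python
def expand_secondary(primary_path: str, pattern: str) -> str:
    """Resolve one secondaryFiles pattern against a primary file path.

    Assumed convention: each leading '^' removes the last extension of the
    primary path; the rest of the pattern is appended as a suffix.
    """
    path = primary_path
    while pattern.startswith("^"):
        pattern = pattern[1:]
        # Split off the file name so dots in directory names are ignored.
        directory, sep, name = path.rpartition("/")
        stem, dot, _ext = name.rpartition(".")
        # Strip only when there is an extension and a non-empty stem
        # (so a dotfile such as ".bashrc" is left unchanged).
        if dot and stem:
            path = directory + sep + stem
    return path + pattern
```

With the example above, `expand_secondary("/path/to/file.ext", ".ext")` gives `/path/to/file.ext.ext` and `expand_secondary("/path/to/file.ext", "^.replaced")` gives `/path/to/file.replaced`.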

tetron added this to the Draft 2 milestone May 13, 2015
Member

tetron commented May 23, 2015

class: File is fine. It just occurred to me that we can represent string constants in Avro by using an Enum with a single symbol.
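
A rough sketch of the single-symbol enum idea, written as a Python dict that serializes to an Avro-style JSON schema. The record layout, the enum name `FileClass`, and the helper `class_is_valid` are assumptions for illustration only; the actual schema adopted by the project may differ.

```python
import json

# Hypothetical schema fragment; field names follow the examples in this thread.
FILE_SCHEMA = {
    "type": "record",
    "name": "File",
    "fields": [
        {
            "name": "class",
            # An enum with exactly one symbol behaves like a string constant:
            # the only value that validates is "File".
            "type": {"type": "enum", "name": "FileClass", "symbols": ["File"]},
        },
        {"name": "path", "type": "string"},
        # Optional size: a union with null listed first and a null default.
        {"name": "size", "type": ["null", "long"], "default": None},
    ],
}


def class_is_valid(value: str) -> bool:
    """Check a 'class' value against the enum's single symbol."""
    symbols = FILE_SCHEMA["fields"][0]["type"]["symbols"]
    return value in symbols


# JSON form of the schema, as it would appear in a schema file.
schema_json = json.dumps(FILE_SCHEMA, indent=2)
```

Because the enum admits only one symbol, a union such as "File or string" can be told apart by inspecting the `class` field instead of guessing from the shape of the object.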

  • I don't understand the purpose of a separate name field. What happens if the path and the name are not in agreement? Is that a validation error? Or is it intended to be added by CWL runtime after loading and validating the input object, for the benefit of scripts?
  • If there are patterns in secondaryFiles, is this just a shorthand for file.ext.ext and file.replaced, which could be converted into literals in an early preprocessing step? You can already specify secondaryFiles patterns in a File input parameter in the tool description. What is the use case? Having patterns in the input object itself seems useful only if you a) have something for which the secondary files can't be matched in advance (which does seem like a valid case for files with dependencies, like a Python script that imports modules) and b) are writing the input object by hand, so writing out explicit dependencies is tedious.
  • We haven't completely figured out metadata, so I prefer to keep it out of the spec until we have a complete design.

Author

ghost commented May 27, 2015

> Or is it intended to be added by CWL runtime after loading and validating the input object, for the benefit of scripts?

Yes. I guess it's a minor benefit, so we can just leave it out.

Regarding secondary files, I think they are only really useful for describing tools that create them (e.g. bamtools index would specify .bai on its "indexed" output) so that the engine can make sure that the index file always stays with the data file (when moving to long-term storage or between compute nodes).

I don't think we need to require users/engines to supply the secondary files list at all; we can just have it as a property of output bindings. If we do require it, perhaps it's simplest to always list secondary files as patterns (not whole file structs)?

Member

tetron commented May 27, 2015

secondaryFiles is important for validating that the secondary files are actually there (such as an index), and similarly for ensuring that secondary output files are included alongside the primary output file. However, I think it makes sense for it to be part of the input or output parameter descriptions for the tool, and not required to be written out in the input object. (What I disagree with is having the patterns in the input object.)
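
As a sketch of the validation step described here, the check below takes the suffixes declared on a tool parameter and reports which expected secondary files are absent. For brevity it handles only plain suffix patterns such as `.bai`; the function name `missing_secondary_files` and the injectable `exists` predicate are assumptions made for this example.

```python
import os
from typing import Callable, Iterable, List


def missing_secondary_files(
    primary_path: str,
    suffixes: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the expected secondary file paths that are not present.

    `suffixes` come from the tool's parameter description, not from the
    input object; each one is appended to the primary path.
    """
    expected = [primary_path + suffix for suffix in suffixes]
    return [path for path in expected if not exists(path)]
```

An engine could call this before running a tool and fail early if the returned list is non-empty. Passing a custom `exists` makes the same check usable against remote or long-term storage.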

Author

ghost commented May 29, 2015

So, patterns in the secondaryFiles property of output bindings would signify which produced files to "attach". Where would we place the patterns that signify which secondary files are required? On Avro objects with type: File?

An optional list of full file objects in the secondaryFiles property of class: File objects sounds good to me. Is that what you meant? Example:

class: File
size: 42
path: /path/to/file.bam
secondaryFiles:
  - class: File
    size: 13
    path: /path/to/file.bam.bai
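
A minimal sketch of how an engine might check an input object of this shape, assuming that every file (including nested ones) must carry `class: File` and a string `path`, and that `secondaryFiles` holds full file objects instead of patterns. The function name `validate_file` and the error wording are hypothetical.

```python
from typing import Any, List


def validate_file(obj: Any, where: str = "input") -> List[str]:
    """Collect validation errors for a File object and its secondaryFiles."""
    if not isinstance(obj, dict):
        return [f"{where}: expected a File object, got {type(obj).__name__}"]
    errors: List[str] = []
    # The explicit class tag removes the ambiguity of union types.
    if obj.get("class") != "File":
        errors.append(f"{where}: 'class' must be 'File'")
    if not isinstance(obj.get("path"), str):
        errors.append(f"{where}: 'path' must be a string")
    secondary = obj.get("secondaryFiles", [])
    if not isinstance(secondary, list):
        errors.append(f"{where}: 'secondaryFiles' must be a list")
    else:
        # Secondary files are full File objects, so validate them recursively.
        for index, child in enumerate(secondary):
            errors.extend(validate_file(child, f"{where}.secondaryFiles[{index}]"))
    return errors
```

For the example above, loaded into a dict, the function returns an empty list. A bare pattern string such as ".bai" inside `secondaryFiles` would be reported as an error under this assumption.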

Member

tetron commented Jun 1, 2015

Including class: File in the input object is now enforced in the schema.

I think we're on the same page with secondaryFiles; it remains to be documented, and we'll need to write tests. However, I'm closing this issue since the main topic is now completed.

tetron closed this as completed Jun 1, 2015
tetron mentioned this issue Jun 1, 2015
mr-c added a commit that referenced this issue Sep 6, 2016
Can use refScope and mapSubject/mapPredicate on specialize.