
RFC: Directory type #172

Closed
ghost opened this issue Nov 30, 2015 · 21 comments

Comments

@ghost

ghost commented Nov 30, 2015

The spec doesn't say anything about using directories as inputs/outputs. Quite a few tools use directories, so we'll need to support that.

A workaround that has been used is to just treat them as files and ignore all properties except for path. However, it doesn't take much to support them completely:

  • Add a Directory/Dir/DIR/Folder type.
  • Define size in this context to refer to sum of all file sizes inside the directory.
  • Define (hopefully reuse) a checksum schema for directories (e.g. find -exec sha | sha).
  • Remove secondaryFiles and loadContent in this context.

Thoughts?
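The proposed size and checksum definitions could be sketched as follows. This is a hypothetical Python illustration only; the function names are mine and not part of any spec:

```python
# Hypothetical sketch of the proposed Directory semantics: "size" as the
# sum of all file sizes, and a checksum built by hashing the per-file
# digests in a stable path order (the `find -exec sha | sha` idea).
import hashlib
import os

def directory_size(root):
    """Sum of the sizes of every regular file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def directory_checksum(root):
    """Hash of the per-file digests, sorted by relative path so the
    result does not depend on filesystem enumeration order."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            entries.append((os.path.relpath(path, root), digest))
    outer = hashlib.sha1()
    for relpath, digest in sorted(entries):
        outer.update(("%s  %s\n" % (digest, relpath)).encode())
    return outer.hexdigest()
```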

@portah
Member

portah commented Nov 30, 2015

cwltool allows specifying a directory as type File:

  - id: '#genomeDir'
    type:
      - File
      - string
    description: |
      string: path to the directory where genome files are stored (if
      runMode!=generateGenome) or will be generated (if runMode==generateGenome)
    inputBinding:
      position: 1
      prefix: '--genomeDir'

so when I use the type File, I provide a path to a directory. Job file:

{
  "readFilesIn": [{"class": "File", "path": "../test-files/SRR1031972.fastq"}],
  "genomeDir": {"class": "File", "path": "../test-files/dm3/"},
  "outFileNamePrefix":"./test-files/SRR1031972.",
  "outSAMunmapped1": "Within",
  "runThreadN": 4
}

What I did for the output is a bit controversial; perhaps there is a more elegant solution:

outputs:
  - id: "#indices"
    type: ["null",File]
    outputBinding:
      glob: |
        ${
          if (inputs.runMode != "genomeGenerate")
            return [];
          return inputs.genomeDir+"/Genome";
        }
      secondaryFiles: |
        ${
          var p=inputs.genomeDir;
          return [
            {"path": p+"/SA", "class":"File"},
            {"path": p+"/SAindex", "class":"File"},
            {"path": p+"/chrNameLength.txt", "class":"File"},
            {"path": p+"/chrLength.txt", "class":"File"},
            {"path": p+"/chrStart.txt", "class":"File"},
            {"path": p+"/geneInfo.tab", "class":"File"},
            {"path": p+"/sjdbList.fromGTF.out.tab", "class":"File"},
            {"path": p+"/chrName.txt", "class":"File"},
            {"path": p+"/exonGeTrInfo.tab", "class":"File"},
            {"path": p+"/genomeParameters.txt", "class":"File"},
            {"path": p+"/sjdbList.out.tab", "class":"File"},
            {"path": p+"/exonInfo.tab", "class":"File"},
            {"path": p+"/sjdbInfo.txt", "class":"File"},
            {"path": p+"/transcriptInfo.tab", "class":"File"}
          ];
        }

  - id: "#aligned"
    type: ["null",File]
    outputBinding:
      glob: |
          ${
            if (inputs.runMode == "genomeGenerate")
              return [];

            var p = inputs.outFileNamePrefix?inputs.outFileNamePrefix:"";
            if (inputs.outSAMtype.indexOf("SAM") > -1) {
                return p+"Aligned.out.sam";
            } else {
             if ( inputs.outSAMtype.indexOf("SortedByCoordinate") > -1 )
                return p+"Aligned.sortedByCoord.out.bam";
              else
                return p+"Aligned.out.bam";
            }
          }
      secondaryFiles: |
         ${
            var p=inputs.outFileNamePrefix?inputs.outFileNamePrefix:"";
            return [
              {"path": p+"Log.final.out", "class":"File"},
              {"path": p+"SJ.out.tab", "class":"File"},
              {"path": p+"Log.out", "class":"File"}
            ];
         }

@portah
Member

portah commented Nov 30, 2015

  • Add a Directory/Dir/DIR/Folder type.
  • Define size in this context to refer to sum of all file sizes inside the directory.

From the provided example, and other similar cases (tools that create genome indices in a directory), I think providing the list of output files is better than listing the whole directory. Since it is usual to use one directory for indices, you will also have files from other tools there.

  • Define (hopefully reuse) a checksum schema for directories (e.g. find -exec sha | sha).

Providing SHA/size for the secondaryFiles might be reasonable.

  • Remove secondaryFiles and loadContent in this context.

I think keeping loadContents is also good :) to actually see the list of files in the directory :)

@ghost
Author

ghost commented Nov 30, 2015

From the provided example, and other similar cases (tools that create genome indices in a directory), I think providing the list of output files is better than listing the whole directory. Since it is usual to use one directory for indices, you will also have files from other tools there.

STAR genomeGenerate is a nice example, though I'm not exactly sure how it works.

Are the indices created with only the reference file as input, or do you also specify other (GTF?) files? The latter leads to creating a set of indices specific to the combination of those two files (reference and splice junctions). There may be other arguments as well; I can't recall.
If the indices are not defined only by the reference genome, it might be much harder to reuse a single folder with other indices (.dict, .fai).

IIRC, STAR takes the folder as input, so it may be cleaner to have genomeGenerate produce a folder, unless these index files are used on their own.

I might be very wrong about this though. Listing them as secondary files seems totally okay and is the reason the feature was introduced in the first place :)

As a side note, describing individual STAR subcommands as separate CWL files may remove the need for so much JavaScript in the outputs (but it's awesome to see it can be done nicely in a single CWL file!)

I think keeping loadContents is also good :) to actually see the list of files in the directory :)

Good idea. +1

Excluding STAR and similar cases, it would still be very beneficial to pass directories as values - some tools require very specific nested directory structures as input (e.g. some sequencer outputs are nested dir structures).

@mr-c
Member

mr-c commented Feb 16, 2016

@ntijanic sounds like this request isn't needed? Or am I reading the comments here wrong?

@mr-c
Member

mr-c commented Mar 10, 2016

+1 for a proper directory input type that would automatically make the contents available to the tool

@ghost
Author

ghost commented Mar 10, 2016

+1 for a proper directory input type that would automatically make the contents available to the tool

Yep. I guess this is what the request represents. Still needed.

@tetron
Member

tetron commented Mar 10, 2016

@portah "cwltool allows specifying a directory as type File": that's a bug that has recently been fixed.

@ntijanic

What do you think of this?

class: Directory
listing:
  - name: file1
    class: File
    source: /path/to/file1
  - name: file2
    class: File
    source: /path/to/another_file
  - name: dir1
    class: Directory
    listing:
    - name: file3
      class: File
      source: /path/to/third_file

This would create a directory structure assembled from the files specified in the source fields.

As a convenience for the user, we could allow specifying a Directory with a source field in the input object and construct the directory object from the source listing.

We could add a requirement that specifies a Directory object as the initial contents of the designated output directory:

requirements:
  - class: InitialOutputContentsRequirement
    initialContents:
      class: Directory
      listing:
      - class: File
        path: /path/to/stuff.bam

This would obsolete CreateFileRequirement.

The idea needs to be developed a bit more, but I think this could provide a more general solution for tools that need to read and write in the same directory.
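As a rough illustration of that convenience, an implementation could expand a plain source path into the nested structure. This is a sketch only; the field names follow the example in this comment and the function name is made up:

```python
# Hypothetical sketch: expand a plain filesystem path into the nested
# Directory/File structure proposed above. The "class"/"name"/"listing"/
# "source" fields mirror the example and are not an official schema.
import os

def build_directory_object(source):
    """Recursively scan `source` and return a Directory object."""
    listing = []
    for name in sorted(os.listdir(source)):
        path = os.path.join(source, name)
        if os.path.isdir(path):
            entry = build_directory_object(path)
            entry["name"] = name
        else:
            entry = {"class": "File", "name": name, "source": path}
        listing.append(entry)
    return {"class": "Directory", "listing": listing}
```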

@tetron
Member

tetron commented Mar 10, 2016

I should mention that I don't think this obsoletes secondaryFiles; it is turning out to be very useful to use secondaryFiles to represent the dependencies of a file. However, the interaction between Directory objects and File objects that include secondaryFiles will need to be worked out.

@ghost
Author

ghost commented Mar 11, 2016

I think I like it, but I'm still confused on some points:

  • The class: Directory objects are values that appear on process ports and can be piped in workflows, right?
  • When is the path property used, and when is the source property used? It seems to me we should use path except when directing the engine to create a file/directory (like in the requirement you described).
  • Is the listing property available at all times, or only created when requested by loadContents or an equivalent? If the former, it would allow additional file metadata to be propagated, but I think people mostly just need the latter.
  • How do you envision output bindings that will report directories? Just a regular glob?

We should find a couple of tools that need to work with directories and make example tool and job objects. One is mentioned on the mailing list (srepo/mcrpipeline on Docker Hub, I think), and the Illumina BCL converters require a folder structure as well.

@tetron
Member

tetron commented Mar 25, 2016

Goal: to be able to describe, create, and capture file hierarchies, in order
to support tasks where the flat structure implied by the current File concept
is inadequate.

class: CommandLineTool

# Obsoletes CreateFileRequirement; describes the initial contents of the output directory.
initialWorkDir:
  class: Directory
  listing:
    # "foo" file shows up in directory
    - class: File
      path: /foo

    # When used in "initialWorkDir", an expression can also return a file
    # or a list of files.
    - |
      ${ return [{
        // "bar" file shows up in directory
        class: "File",
        path: "/bar"
      }, {
        // "baz" file shows up in directory
        class: "File",
        path: "/baz"
      }]; }

    # Reuse FileDef from CreateFileRequirement to describe files where you need
    # custom name or content.
    - fileName: quux
      fileContent:
        class: File
        path: /baz
      # specify it should be a writable copy, by default files may be symlinks
      # or read-only.
      writable: true

    # secondaryFiles also get added to the directory.
    - class: File
      path: /foo2
      secondaryFiles:
        - class: File
          path: /foo2.index

    # If you need to rename secondaryFiles of the primary file, use an
    # expression; a rename rule based on a pattern is hard to generalize.
    - fileName: zing
      fileContent: $(inputs.baz.secondaryFiles[0])

    # A subdirectory entry recursively includes a Directory object.
    - subdirName: example_subdir
      subdirContent:
        class: Directory
        listing:
          - ...

In the above example, this will be presented to the tool as a directory listing
with the following files and directories:

foo  bar  baz  quux  foo2  foo2.index  zing  example_subdir/

Directory objects can be input or output of tools. Directories are added by
path to the command line. The runtime is responsible for staging files so
they show up in a directory tree with the correct structure.
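A minimal sketch of that staging step, assuming the class/name/listing/source fields from the earlier example (a real runtime might symlink or bind-mount instead of copying):

```python
# Hypothetical sketch: materialize a Directory object on disk so the
# tool sees the described tree. Real implementations may symlink or
# bind-mount rather than copy; this one copies for simplicity.
import os
import shutil

def stage_directory(directory, target):
    """Create `target` with the contents described by `directory`."""
    os.makedirs(target, exist_ok=True)
    for entry in directory.get("listing", []):
        dest = os.path.join(target, entry["name"])
        if entry["class"] == "Directory":
            stage_directory(entry, dest)   # recurse into subdirectories
        else:
            shutil.copyfile(entry["source"], dest)
```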

inputs:
 - id: exampleDir
   type: Directory
   inputBinding:
     prefix: --input-dir

Command line will then be something like tool --input-dir /tmp/working123
where "working123" has the contents described by the directory object.

Capturing Directories for output doesn't change much, if at all:

outputs:
 - id: exampleOutput
   type: Directory
   # Collect all txt files, including subdirectories
   # depends on issue #134 (add ** syntax for matching across subdirectories)
   outputBinding:
     glob: "**.txt"

@mr-c
Member

mr-c commented Apr 15, 2016

@tetron This looks great to me. @ntijanic Shall we go ahead and add this to draft-4 or should we wait until after the 1.0 release?

@kmhernan

kmhernan commented Apr 15, 2016

I think forcing you to explicitly write out all of the files located in the directory (if that's what the comments above are implying) makes it very unreasonable for situations involving annotation cache files. For example, look at the Variant Effect Predictor, ANNOVAR, or SnpEff. Writing out every single file for those would be a daunting task. Not sure if anyone has any thoughts on this.

Edit: just looked at my VEP cache, it has 21,383 files in it 😞

@ghost
Author

ghost commented Apr 17, 2016

@kmhernan I don't think you'd need to enumerate all the files. If using a directory as input for the tool, you should be able to just specify the directory. If staging files, you should also be able to specify only the directory.
IIUC, enumerating the files should only be needed in place of the current CreateFileRequirement.

@tetron For initialWorkDir, can we reuse only the File and Directory objects already defined as CWL values and get rid of fileName, fileContent, subdirName, and subdirContent? For example:

class: CommandLineTool
initialWorkDir:
  - {class: File, name: quux, path: /baz, writable: true}
  - {class: File, name: zing, path: $(inputs.baz.secondaryFiles[0].path)}
  - {class: Directory, name: example_subdir, listing: []}

@psaffrey-illumina
Contributor

+1 for directory support.

I see bcl2fastq has already been mentioned; these tools work on sequencing output directories, which can contain even more files than the VEP caches (>100,000). Another example is the Isaac aligner:

https://github.com/Illumina/isaac2

which produces index files with a central XML file referring to them. Directory support is probably the easiest way to accommodate this.

@tetron
Member

tetron commented Apr 20, 2016

The Directory object needs to model the contents of the directory to ensure that the correct files are staged inside the container and/or transferred to the compute node. However, as @ntijanic says, on the user interface side we can allow users to provide just the directory path and have the implementation scan the filesystem.

@mr-c
Member

mr-c commented Apr 20, 2016

Sounds like we have broad agreement. Who is volunteering to implement this in the reference implementation so we can validate it?

@psaffrey-illumina
Contributor

@tetron: does this mean that the docker command itself would stage every file individually? Will that work for directories with hundreds of thousands of files?

@tetron
Member

tetron commented Apr 20, 2016

@psaffrey-illumina It's an implementation detail because different CWL implementations will handle files and directories in different ways. I expect in most cases that directory contents will be staged separately (if necessary) and then the whole directory will be bind mounted into the container.
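For example, a runtime that stages the directory once on the host could mount the whole tree into the container with a single bind mount, rather than one mount per file. The helper name and container mount point below are illustrative, not what any implementation actually does:

```python
# Hypothetical sketch: a single read-only Docker bind mount for a
# staged directory tree (`-v host:container:ro` is standard Docker
# `docker run` syntax). The container path is an arbitrary choice.
def docker_mount_args(host_dir, container_dir="/var/lib/cwl/input"):
    """Build docker CLI arguments mounting `host_dir` read-only."""
    return ["-v", "%s:%s:ro" % (host_dir, container_dir)]
```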

@psaffrey-illumina
Contributor

@mr-c: I would be willing to have a go at implementing this, but I've never worked on the spec before so I would need a bit of guidance. Should I just be able to copy the entry for files and modify it?

@mr-c
Member

mr-c commented Apr 21, 2016

@psaffrey-illumina let's do it together, as @tetron has basically done all the implementation up to this point.

I've rebased the draft-4 branch on the latest code: https://github.com/common-workflow-language/common-workflow-language/tree/draft-4

And I'm updating cwltool to recognize draft-4.dev1 as a valid version of CWL in https://github.com/common-workflow-language/cwltool/tree/draft-4

@mr-c
Member

mr-c commented Jul 7, 2016

Done!

@mr-c mr-c closed this as completed Jul 7, 2016
@mr-c mr-c removed this from the unscheduled milestone Dec 13, 2016