
RFC: Directory type #172

Closed
ghost opened this issue Nov 30, 2015 · 21 comments

Comments

@ghost

ghost commented Nov 30, 2015

The spec doesn't say anything about using directories as inputs/outputs. Quite a few tools use directories, so we'll need to support that.

A workaround that has been used is to just treat them as files and ignore all properties except for path. However, it doesn't take much to support them completely:

  • Add a Directory/Dir/DIR/Folder type.
  • Define size in this context to refer to sum of all file sizes inside the directory.
  • Define (hopefully reuse) a checksum schema for directories (e.g. find -exec sha | sha).
  • Remove secondaryFiles and loadContent in this context.

Thoughts?
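The proposed size and checksum definitions could be sketched as follows. This is a hypothetical Python illustration only; the function names are mine and not part of any spec:

```python
# Hypothetical sketch of the proposed Directory semantics: "size" as the
# sum of all file sizes, and a checksum built by hashing the per-file
# digests in a stable path order (the `find -exec sha | sha` idea).
import hashlib
import os

def directory_size(root):
    """Sum of the sizes of every regular file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def directory_checksum(root):
    """Hash of the per-file digests, sorted by relative path so the
    result does not depend on filesystem enumeration order."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            entries.append((os.path.relpath(path, root), digest))
    outer = hashlib.sha1()
    for relpath, digest in sorted(entries):
        outer.update(("%s  %s\n" % (digest, relpath)).encode())
    return outer.hexdigest()
```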

@portah
Member

portah commented Nov 30, 2015

cwltool allows specifying a directory as type File:

  - id: '#genomeDir'
    type:
      - File
      - string
    description: |
      string: path to the directory where genome files are stored (if
      runMode!=generateGenome) or will be generated (if runMode==generateGenome)
    inputBinding:
      position: 1
      prefix: '--genomeDir'

so when I use the type File, I provide a path to a directory. Job file:

{
  "readFilesIn": [{"class": "File", "path": "../test-files/SRR1031972.fastq"}],
  "genomeDir": {"class": "File", "path": "../test-files/dm3/"},
  "outFileNamePrefix":"./test-files/SRR1031972.",
  "outSAMunmapped1": "Within",
  "runThreadN": 4
}

What I did for the output is a bit controversial; perhaps there is a more elegant solution:

outputs:
  - id: "#indices"
    type: ["null",File]
    outputBinding:
      glob: |
        ${
          if (inputs.runMode != "genomeGenerate")
            return [];
          return inputs.genomeDir+"/Genome";
        }
      secondaryFiles: |
        ${
          var p=inputs.genomeDir;
          return [
            {"path": p+"/SA", "class":"File"},
            {"path": p+"/SAindex", "class":"File"},
            {"path": p+"/chrNameLength.txt", "class":"File"},
            {"path": p+"/chrLength.txt", "class":"File"},
            {"path": p+"/chrStart.txt", "class":"File"},
            {"path": p+"/geneInfo.tab", "class":"File"},
            {"path": p+"/sjdbList.fromGTF.out.tab", "class":"File"},
            {"path": p+"/chrName.txt", "class":"File"},
            {"path": p+"/exonGeTrInfo.tab", "class":"File"},
            {"path": p+"/genomeParameters.txt", "class":"File"},
            {"path": p+"/sjdbList.out.tab", "class":"File"},
            {"path": p+"/exonInfo.tab", "class":"File"},
            {"path": p+"/sjdbInfo.txt", "class":"File"},
            {"path": p+"/transcriptInfo.tab", "class":"File"}
          ];
        }

  - id: "#aligned"
    type: ["null",File]
    outputBinding:
      glob: |
          ${
            if (inputs.runMode == "genomeGenerate")
              return [];

            var p = inputs.outFileNamePrefix?inputs.outFileNamePrefix:"";
            if (inputs.outSAMtype.indexOf("SAM") > -1) {
                return p+"Aligned.out.sam";
            } else {
             if ( inputs.outSAMtype.indexOf("SortedByCoordinate") > -1 )
                return p+"Aligned.sortedByCoord.out.bam";
              else
                return p+"Aligned.out.bam";
            }
          }
      secondaryFiles: |
         ${
            var p=inputs.outFileNamePrefix?inputs.outFileNamePrefix:"";
            return [
              {"path": p+"Log.final.out", "class":"File"},
              {"path": p+"SJ.out.tab", "class":"File"},
              {"path": p+"Log.out", "class":"File"}
            ];
         }

@portah
Member

portah commented Nov 30, 2015

  • Add a Directory/Dir/DIR/Folder type.
  • Define size in this context to refer to sum of all file sizes inside the directory.

From the provided example, and other similar cases (tools that create genome indices in a directory), I think providing the list of output files is better than listing the whole directory. Since it is usual to use one directory for indices, you will also have files from other tools there.

  • Define (hopefully reuse) a checksum schema for directories (e.g. find -exec sha | sha).

Providing SHA/size for the secondaryFiles might be reasonable.

  • Remove secondaryFiles and loadContent in this context.

I think keeping loadContents is also good :) to actually see the list of files in the directory :)

@ghost
Author

ghost commented Nov 30, 2015

From the provided example, and other similar cases (tools that create genome indices in a directory), I think providing the list of output files is better than listing the whole directory. Since it is usual to use one directory for indices, you will also have files from other tools there.

STAR genomeGenerate is a nice example, though I'm not exactly sure how it works.

Are the indices created with only the reference file as input, or do you also specify other (GTF?) files? The latter leads to creating a set of indices specific to the combination of those two files (reference and splice junctions). There may be other arguments as well; I can't recall.
If the indices are not defined only by the reference genome, it might be much harder to reuse a single folder with other indices (.dict, .fai).

IIRC, STAR takes the folder as input, so it may be cleaner to have genomeGenerate produce a folder, unless these index files are used on their own.

I might be very wrong about this though. Listing them as secondary files seems totally okay and is the reason the feature was introduced in the first place :)

As a side note, describing individual STAR subcommands as separate CWL files may remove the need for so much JavaScript in the outputs (but it's awesome to see it can be done nicely in a single CWL file!)

I think keeping loadContents is also good :) to actually see the list of files in the directory :)

Good idea. +1

Excluding STAR and similar cases, it would still be very beneficial to pass directories as values - some tools require very specific nested directory structures as input (e.g. some sequencer outputs are nested dir structures).

@mr-c
Member

mr-c commented Feb 16, 2016

@ntijanic sounds like this request isn't needed? Or am I reading the comments here wrong?

@mr-c
Member

mr-c commented Mar 10, 2016

+1 for a proper directory input type that would automatically make the contents available to the tool

@ghost
Author

ghost commented Mar 10, 2016

+1 for a proper directory input type that would automatically make the contents available to the tool

Yep. I guess this is what the request represents. Still needed.

@tetron
Member

tetron commented Mar 10, 2016

@portah "cwltool allows specifying a directory as type File": that's a bug that has recently been fixed.

@ntijanic

What do you think of this?

class: Directory
listing:
  - name: file1
    class: File
    source: /path/to/file1
  - name: file2
    class: File
    source: /path/to/another_file
  - name: dir1
    class: Directory
    listing:
    - name: file3
      class: File
      source: /path/to/third_file

This would create a directory structure assembled from the files specified in the source fields.

As a convenience for the user, we could allow specifying a Directory with a source field in the input object and construct the directory object from the source listing.

We could add a requirement that specifies a Directory object as the initial contents of the designated output directory:

requirements:
  - class: InitialOutputContentsRequirement
    initialContents:
      class: Directory
      listing:
      - class: File
        path: /path/to/stuff.bam

This would obsolete CreateFileRequirement.

The idea needs to be developed a bit more, but I think this could provide a more general solution for tools that need to read and write in the same directory.
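As a rough illustration of that convenience, an implementation could expand a plain source path into the nested structure. This is a sketch only; the field names follow the example in this comment and the function name is made up:

```python
# Hypothetical sketch: expand a plain filesystem path into the nested
# Directory/File structure proposed above. The "class"/"name"/"listing"/
# "source" fields mirror the example and are not an official schema.
import os

def build_directory_object(source):
    """Recursively scan `source` and return a Directory object."""
    listing = []
    for name in sorted(os.listdir(source)):
        path = os.path.join(source, name)
        if os.path.isdir(path):
            entry = build_directory_object(path)
            entry["name"] = name
        else:
            entry = {"class": "File", "name": name, "source": path}
        listing.append(entry)
    return {"class": "Directory", "listing": listing}
```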

@tetron
Member

tetron commented Mar 10, 2016

I should mention that I don't think this obsoletes secondaryFiles; it is turning out to be very useful to use secondaryFiles to represent the dependencies of a file. However, the interaction between Directory objects and File objects that include secondaryFiles will need to be worked out.

@ghost
Author

ghost commented Mar 11, 2016

I think I like it, but I'm still confused on some points:

  • The class: Directory objects are values that appear on process ports and can be piped in workflows, right?
  • When is the path property used, and when is the source property used? It seems to me we should use path except when directing the engine to create a file/directory (like in the requirement you described).
  • Is the listing property available at all times, or only created when requested by loadContents or an equivalent? If the former, it would allow additional file metadata to be propagated, but I think people mostly just need the latter.
  • How do you envision output bindings that will report directories? Just a regular glob?

We should find a couple of tools that need to work with directories and make example tool and job objects. One is mentioned on the mailing list (srepo/mcrpipeline on Docker Hub, I think), and the Illumina BCL converters require a folder structure as well.

@tetron
Member

tetron commented Mar 25, 2016

Goal: to be able to describe, create, and capture file hierarchies, in order
to support tasks where the flat structure implied by the current File concept
is inadequate.

class: CommandLineTool

# Obsoletes CreateFileRequirement; describes the initial contents of the output directory.
initialWorkDir:
  class: Directory
  listing:
    # "foo" file shows up in directory
    - class: File
      path: /foo

    # When used in "initialWorkDir", an expression can also return a file
    # or a list of files.
    - |
      ${ return [{
        // "bar" file shows up in directory
        class: "File",
        path: "/bar"
      }, {
        // "baz" file shows up in directory
        class: "File",
        path: "/baz"
      }]; }

    # Reuse FileDef from CreateFileRequirement to describe files where you need
    # custom name or content.
    - fileName: quux
      fileContent:
        class: File
        path: /baz
      # specify it should be a writable copy, by default files may be symlinks
      # or read-only.
      writable: true

    # secondaryFiles also get added to the directory.
    - class: File
      path: /foo2
      secondaryFiles:
        - class: File
          path: /foo2.index

    # If you need to rename secondaryFiles of the primary file, use an
    # expression; a rename rule based on a pattern is hard to generalize.
    - fileName: zing
      fileContent: $(inputs.baz.secondaryFiles[0])

    # A subdirectory entry recursively includes a Directory object.
    - subdirName: example_subdir
      subdirContent:
        class: Directory
        listing:
          - ...

In the above example, this will be presented to the tool as a directory listing
with the following files and directories:

foo  bar  baz  quux  foo2  foo2.index  zing  example_subdir/

Directory objects can be input or output of tools. Directories are added by
path to the command line. The runtime is responsible for staging files so
they show up in a directory tree with the correct structure.
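A minimal sketch of that staging step, assuming the class/name/listing/source fields from the earlier example (a real runtime might symlink or bind-mount instead of copying):

```python
# Hypothetical sketch: materialize a Directory object on disk so the
# tool sees the described tree. Real implementations may symlink or
# bind-mount rather than copy; this one copies for simplicity.
import os
import shutil

def stage_directory(directory, target):
    """Create `target` with the contents described by `directory`."""
    os.makedirs(target, exist_ok=True)
    for entry in directory.get("listing", []):
        dest = os.path.join(target, entry["name"])
        if entry["class"] == "Directory":
            stage_directory(entry, dest)   # recurse into subdirectories
        else:
            shutil.copyfile(entry["source"], dest)
```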

inputs:
 - id: exampleDir
   type: Directory
   inputBinding:
     prefix: --input-dir

Command line will then be something like tool --input-dir /tmp/working123
where "working123" has the contents described by the directory object.

Capturing Directories for output doesn't change much, if at all:

outputs:
 - id: exampleOutput
   type: Directory
   # Collect all txt files, including subdirectories
   # depends on issue #134 (add ** syntax for matching across subdirectories)
   outputBinding:
     glob: "**.txt"

@mr-c
Member

mr-c commented Apr 15, 2016

@tetron This looks great to me. @ntijanic Shall we go ahead and add this to draft-4 or should we wait until after the 1.0 release?

@kmhernan

kmhernan commented Apr 15, 2016

I think forcing you to explicitly write out all of the files located in the directory (if that's what the comments above are implying) makes it very unreasonable for situations involving annotation cache files. For example, look at the Variant Effect Predictor, ANNOVAR, or SnpEff. Writing out every single file for those would be a daunting task. Not sure if anyone has any thoughts on this.

Edit: just looked at my VEP cache, it has 21,383 files in it 😞

@ghost
Author

ghost commented Apr 17, 2016

@kmhernan I don't think you'd need to enumerate all the files. If using a directory as input for the tool, you should be able to just specify the directory. If staging files, you should also be able to specify only the directory.
IIUC, enumerating the files should only be needed in place of the current CreateFileRequirement.

@tetron For initialWorkDir, can we reuse only the File and Directory objects already defined as CWL values and get rid of fileName, fileContent, subdirName, and subdirContent? For example:

class: CommandLineTool
initialWorkDir:
  - {class: File, name: quux, path: /baz, writable: true}
  - {class: File, name: zing, path: $(inputs.baz.secondaryFiles[0].path)}
  - {class: Directory, name: example_subdir, listing: []}

@psaffrey-illumina
Contributor

+1 for directory support.

I see bcl2fastq has already been mentioned; these tools work on sequencing output directories, which can contain even more files than the VEP caches (>100,000). Another example is the Isaac aligner:

https://github.com/Illumina/isaac2

which produces index files with a central XML file referring to them. Directory support is probably the easiest way to accommodate this.

@tetron
Member

tetron commented Apr 20, 2016

The Directory object needs to model the contents of the directory to ensure that the correct files are staged inside the container and/or transferred to the compute node. However, as @ntijanic says, on the user interface side we can allow users to provide just the directory path and have the implementation scan the filesystem.

@mr-c
Member

mr-c commented Apr 20, 2016

Sounds like we have broad agreement. Who is volunteering to implement this in the reference implementation so we can validate it?

@psaffrey-illumina
Contributor

@tetron: does this mean that the docker command itself would stage every file individually? Will that work for directories with hundreds of thousands of files?

@tetron
Member

tetron commented Apr 20, 2016

@psaffrey-illumina It's an implementation detail because different CWL implementations will handle files and directories in different ways. I expect in most cases that directory contents will be staged separately (if necessary) and then the whole directory will be bind mounted into the container.
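For example, a runtime that stages the directory once on the host could mount the whole tree into the container with a single bind mount, rather than one mount per file. The helper name and container mount point below are illustrative, not what any implementation actually does:

```python
# Hypothetical sketch: a single read-only Docker bind mount for a
# staged directory tree (`-v host:container:ro` is standard Docker
# `docker run` syntax). The container path is an arbitrary choice.
def docker_mount_args(host_dir, container_dir="/var/lib/cwl/input"):
    """Build docker CLI arguments mounting `host_dir` read-only."""
    return ["-v", "%s:%s:ro" % (host_dir, container_dir)]
```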

@psaffrey-illumina
Contributor

@mr-c: I would be willing to have a go at implementing this, but I've never worked on the spec before so I would need a bit of guidance. Should I just be able to copy the entry for files and modify it?

@mr-c
Member

mr-c commented Apr 21, 2016

@psaffrey-illumina let's do it together, as @tetron has basically done all the implementation up to this point.

I've rebased the draft-4 branch on the latest code: https://github.com/common-workflow-language/common-workflow-language/tree/draft-4

And I'm updating cwltool to recognize draft-4.dev1 as a valid version of CWL in https://github.com/common-workflow-language/cwltool/tree/draft-4

@mr-c
Member

mr-c commented Jul 7, 2016

Done!

@mr-c mr-c closed this as completed Jul 7, 2016
@mr-c mr-c removed this from the unscheduled milestone Dec 13, 2016