RFC: Directory type #172
cwltool allows specifying a directory as type File:

```yaml
- id: '#genomeDir'
  type:
    - File
    - string
  description: |
    string: path to the directory where genome files are stored (if
    runMode!=generateGenome) or will be generated (if runMode==generateGenome)
  inputBinding:
    position: 1
    prefix: '--genomeDir'
```

So when I use it as the type File, I provide the path to a directory. Job file:

```json
{
  "readFilesIn": [{"class": "File", "path": "../test-files/SRR1031972.fastq"}],
  "genomeDir": {"class": "File", "path": "../test-files/dm3/"},
  "outFileNamePrefix": "./test-files/SRR1031972.",
  "outSAMunmapped1": "Within",
  "runThreadN": 4
}
```

What I did in the outputs is a bit controversial; perhaps there is a more elegant solution:

```yaml
outputs:
  - id: "#indices"
    type: ["null", File]
    outputBinding:
      glob: |
        ${
          if (inputs.runMode != "genomeGenerate")
            return [];
          return inputs.genomeDir + "/Genome";
        }
    secondaryFiles: |
      ${
        var p = inputs.genomeDir;
        return [
          {"path": p+"/SA", "class": "File"},
          {"path": p+"/SAindex", "class": "File"},
          {"path": p+"/chrNameLength.txt", "class": "File"},
          {"path": p+"/chrLength.txt", "class": "File"},
          {"path": p+"/chrStart.txt", "class": "File"},
          {"path": p+"/geneInfo.tab", "class": "File"},
          {"path": p+"/sjdbList.fromGTF.out.tab", "class": "File"},
          {"path": p+"/chrName.txt", "class": "File"},
          {"path": p+"/exonGeTrInfo.tab", "class": "File"},
          {"path": p+"/genomeParameters.txt", "class": "File"},
          {"path": p+"/sjdbList.out.tab", "class": "File"},
          {"path": p+"/exonInfo.tab", "class": "File"},
          {"path": p+"/sjdbInfo.txt", "class": "File"},
          {"path": p+"/transcriptInfo.tab", "class": "File"}
        ];
      }
  - id: "#aligned"
    type: ["null", File]
    outputBinding:
      glob: |
        ${
          if (inputs.runMode == "genomeGenerate")
            return [];
          var p = inputs.outFileNamePrefix ? inputs.outFileNamePrefix : "";
          if (inputs.outSAMtype.indexOf("SAM") > -1) {
            return p + "Aligned.out.sam";
          } else {
            if (inputs.outSAMtype.indexOf("SortedByCoordinate") > -1)
              return p + "Aligned.sortedByCoord.out.bam";
            else
              return p + "Aligned.out.bam";
          }
        }
    secondaryFiles: |
      ${
        var p = inputs.outFileNamePrefix ? inputs.outFileNamePrefix : "";
        return [
          {"path": p+"Log.final.out", "class": "File"},
          {"path": p+"SJ.out.tab", "class": "File"},
          {"path": p+"Log.out", "class": "File"}
        ];
      }
```

---
From the provided example, and other similar cases (tools that create genome indices in a directory), I think it is better to provide the list of output files than to list the whole directory. It is common to use one directory for indices, so you will have files from other tools there as well.

Providing SHA/size for the secondaryFiles might be reasonable.

I think keeping loadContents is also good :) so you can actually see the list of files in the directory :)

---
STAR genomeGenerate is a nice example, though I'm not exactly sure how it works. Are the indices created with only the reference file as input, or do you also specify other (GTF?) files? That leads to creating a set of indices specific to the combination of those two files (reference and splice junctions). There may be other arguments as well; I can't recall. IIRC, STAR takes the folder as input, so it may be cleaner to have genomeGenerate produce a folder, unless these index files are used on their own. I might be very wrong about this, though. Listing them as secondaryFiles seems totally okay, and is the reason the feature was introduced in the first place :) As a side note, describing individual STAR subcommands as separate CWL files may remove the need for so much JavaScript in the outputs (but it's awesome to see it can be done nicely in a single CWL file!)

---

Good idea, +1. Excluding STAR and similar cases, it would still be very beneficial to pass directories as values; some tools require very specific nested directory structures as input (e.g. some sequencer outputs are nested directory structures).

---

@ntijanic Sounds like this request isn't needed? Or am I reading the comments here wrong?

---

+1 for a proper directory input type that would automatically make the contents available to the tool.

---

Yeap, I guess this is what the request represents. Still needed.

---
@portah "cwltool allows specifying a directory as type File": that's a bug that's been recently fixed.

@ntijanic What do you think of this?

```yaml
class: Directory
listing:
  - name: file1
    class: File
    source: /path/to/file1
  - name: file2
    class: File
    source: /path/to/another_file
  - name: dir1
    class: Directory
    listing:
      - name: file3
        class: File
        source: /path/to/third_file
```

This would create a directory structure assembled from the files specified in the `listing`. As a convenience for the user, we could allow specifying a Directory with just a `path` and have the implementation generate the listing.

We could add a requirement that specifies a Directory object as the initial contents of the designated output directory:

```yaml
requirements:
  - class: InitialOutputContentsRequirement
    initialContents:
      class: Directory
      listing:
        - class: File
          path: /path/to/stuff.bam
```

This would obsolete `CreateFileRequirement`. The idea needs to be developed a bit more, but I think this could provide a more general solution for tools that need to read and write in the same directory.

---
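A minimal sketch (editor's illustration, not part of the proposal) of how an implementation might materialize such a Directory object into an actual directory tree; the helper name `materialize` and the copy-based staging policy are assumptions:

```python
import os
import shutil

def materialize(directory, dest):
    """Recursively create the directory `dest` from a Directory object's `listing`.

    Each File entry is copied from its `source`; an implementation could
    symlink instead, depending on its staging policy.
    """
    os.makedirs(dest, exist_ok=True)
    for entry in directory.get("listing", []):
        target = os.path.join(dest, entry["name"])
        if entry["class"] == "File":
            shutil.copy(entry["source"], target)
        elif entry["class"] == "Directory":
            materialize(entry, target)
```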
I should mention that I don't think this obsoletes `CreateFileRequirement` entirely.

---
I think I like it, but I'm still confused on some points.

We should find a couple of tools that need to work with directories and make example tool and job objects. One is mentioned on the mailing list (srepo/mcrpipeline on Docker Hub, I think), and the Illumina BCL converters require a folder structure as well.

---
Goal: to be able to describe, create, and capture file hierarchies.

```yaml
class: CommandLineTool
# Obsoletes CreateFileRequirement; describes the initial contents of the
# output directory.
initialWorkDir:
  class: Directory
  listing:
    # "foo" file shows up in the directory
    - class: File
      path: /foo
    # When used in "initialWorkDir", an expression can also return a file or
    # a list of files.
    - |
      ${ return [{
        // "bar" file shows up in the directory
        "class": "File",
        "path": "/bar"
      }, {
        // "baz" file shows up in the directory
        "class": "File",
        "path": "/baz"
      }]; }
    # Reuse FileDef from CreateFileRequirement to describe files where you
    # need a custom name or content.
    - fileName: quux
      fileContent:
        class: File
        path: /baz
      # Specify that it should be a writable copy; by default files may be
      # symlinks or read-only.
      writable: true
    # secondaryFiles also get added to the directory.
    - class: File
      path: /foo2
      secondaryFiles:
        - class: File
          path: /foo2.index
    # If you need to rename secondaryFiles of the primary file, use an
    # expression, since a pattern-based approach is hard to generalize.
    - fileName: zing
      fileContent: $(inputs.baz.secondaryFiles[0])
    # A Subdirectory object recursively includes a Directory object.
    - subdirName: example_subdir
      subdirContent:
        class: Directory
        listing:
          - ...
```

In the above example, this will be presented to the tool as a directory listing.

Directory objects can be input or output of tools:

```yaml
inputs:
  - id: exampleDir
    type: Directory
    inputBinding:
      prefix: --input-dir
```

The command line will then contain the prefix followed by the path where the directory was staged.

Capturing Directories for output doesn't change much, if at all:

```yaml
outputs:
  - id: exampleOutput
    type: Directory
    # Collect all txt files, including subdirectories.
    # Depends on issue #134 (add ** syntax for matching across subdirectories).
    outputBinding:
      glob: "**.txt"
```

---
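As an aside on the `**` syntax: recursive glob in most languages matches files at any depth. A quick Python sketch of the intended semantics (editor's illustration, not cwltool code; the helper name is hypothetical):

```python
from pathlib import Path

def collect_txt(outdir):
    """Return relative paths of all .txt files under outdir, at any depth."""
    return sorted(str(p.relative_to(outdir))
                  for p in Path(outdir).glob("**/*.txt"))
```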
@tetron This looks great to me. @ntijanic Shall we go ahead and add this to draft-4, or should we wait until after the 1.0 release?

---

I think forcing you to explicitly list all of the files located in the directory, if that's what the comments above are implying, makes it very unreasonable for situations involving annotation cache files. For example, look at the Variant Effect Predictor, Annovar, or SnpEff. Writing out every single file for those would be a daunting task. Not sure if anyone has any thoughts on this. Edit: I just looked at my VEP cache; it has 21,383 files in it 😞

---
@kmhernan I don't think you'd need to enumerate all the files. If using it as input for the tool, you should be able to just specify the directory. If staging them, you should also be able to specify only the directory.

@tetron For initialWorkDir, can we just reuse the `File` and `Directory` classes?

```yaml
class: CommandLineTool
initialWorkDir:
  - {class: File, name: quux, path: /baz, writable: true}
  - {class: File, name: zing, path: $(inputs.baz.secondaryFiles[0].path)}
  - {class: Directory, name: example_subdir, listing: []}
```

---
+1 for directory support. I see bcltofastq has already been mentioned; these tools work on sequencing output directories, which can contain even more files than the VEP caches (>100,000). Another example is the Isaac aligner (https://github.com/Illumina/isaac2), which produces index files with a central XML file referring to them. Directory support is probably the easiest way to accommodate this.

---
The Directory object needs to model the contents of the directory to ensure that the correct files are staged inside the container and/or transferred to the compute node. However, as @ntijanic says, on the user interface side we can allow users to provide just the directory path and have the implementation scan the file system directory.

---
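That user-interface convenience could be sketched like this (editor's illustration; the helper name `scan_directory` and the dict shape are assumptions matching the proposal above): expanding a plain path into a Directory object by scanning the file system.

```python
import os

def scan_directory(path):
    """Build a Directory object (dict) by recursively scanning `path`."""
    listing = []
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if entry.is_dir():
            subdir = scan_directory(entry.path)
            subdir["name"] = entry.name
            listing.append(subdir)
        else:
            listing.append({"class": "File", "name": entry.name,
                            "source": entry.path})
    return {"class": "Directory", "listing": listing}
```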
Sounds like we have broad agreement. Who is volunteering to implement this in the reference implementation so we can validate it?

---

@tetron: does this mean that the docker command itself would stage every file individually? Will that work for directories with hundreds of thousands of files?

---

@psaffrey-illumina It's an implementation detail, because different CWL implementations will handle files and directories in different ways. I expect in most cases that directory contents will be staged separately (if necessary) and then the whole directory will be bind mounted into the container.

---

@mr-c: I would be willing to have a go at implementing this, but I've never worked on the spec before, so I would need a bit of guidance. Should I just be able to copy the entry for files and modify it?

---

@psaffrey-illumina Let's do it together, as basically @tetron has done all the implementation up to this point. I've rebased the draft-4 branch on the latest code: https://github.com/common-workflow-language/common-workflow-language/tree/draft-4 And I'm updating cwltool to recognize it.

---

Done!
---

The spec doesn't say anything about using directories as inputs/outputs. Quite a few tools use directories, so we'll need to support that.

A workaround that has been used is to treat directories as files and ignore all properties except for `path`. However, it doesn't take much to support them completely:

- define `size` in this context to refer to the sum of all file sizes inside the directory;
- decide how `secondaryFiles` and `loadContents` behave in this context.

Thoughts?
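The proposed `size` semantics can be sketched in a few lines (editor's illustration; the helper name `directory_size` is hypothetical):

```python
import os

def directory_size(path):
    """Sum of the sizes of all files inside `path`, recursively."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```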