Preserve folder hierarchies on the platform #415

Open
RachelDuffin opened this issue Jan 9, 2023 · 10 comments

Comments

@RachelDuffin

My team would find the previously requested behaviour in dnanexus/dxWDL#168 very useful.

We are investigating switching over to WDL workflows and are almost there for one of our pipelines. However, one of the problems is that whilst the '--stage-relative-output-folder' argument allows specifying the relative output folder per task for all files produced by that task, it does not allow for individual files to be placed in different directories.

It would be very helpful if the platform allowed for this behaviour through specification in the outputs section of each WDL task, rather than needing to create a separate reorg app (this just adds to the code base that needs to be maintained and means that files are only moved to the correct location at the end of the workflow rather than at the point of delocalisation).

I noticed it has been several years since the last comment on closed issue dnanexus/dxWDL#168 and was hoping you would be able to provide an update as to whether the exploratory work went anywhere and whether the professional services department decided that this functionality would be useful to incorporate into the dxCompiler.

Many thanks!

@sclan
Collaborator

sclan commented Jan 10, 2023

The Directory output feature from WDL 2.0 is supported:
https://github.com/dnanexus/dxCompiler/blob/develop/doc/ExpertOptions.md#directory-outputs

@RachelDuffin
Author

Hi Stanley, I did try to use this but I wasn't able to get it to work. I will give it another try and let you know if I am still having issues.

@sclan
Collaborator

sclan commented Jan 10, 2023

Hi Rachel, you may find the test cases helpful as a template:
https://github.com/dnanexus/dxCompiler/tree/develop/test/wdl_2_0
We run the tests from the test folder as part of our integration tests, so all of them have to pass in order for a dxCompiler release to be possible. Please make sure that the latest release is used for compilation, since the directory output support is relatively new.

@RachelDuffin
Author

RachelDuffin commented Jan 11, 2023 via email

@Gvaihir
Contributor

Gvaihir commented Jan 17, 2023

Could you elaborate on what specifically does not work?
The expected and tested behavior is that the code in the task organizes the outputs on the worker into directories, according to the logic of the given task. For example:

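# Create the parent and all four subfolders in one go
mkdir -p outputs/folder_{1..4}
for i in {1..4};
do
    # Folder name must match the folder_N pattern created above
    echo "Hello" > outputs/folder_${i}/my_file_${i}
done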

Then declare your task's outputs like this:

output {
    Directory outdir = "outputs/"
}

Your next task will then take the Directory "outputs/" from the previous task (above) as its input.
Where does this behavior break for you?

@sclan
Collaborator

sclan commented Jan 17, 2023

I think Rachel's goal is to be able to specify (or find a way to specify) one individual file from a task's directory output and use it as the input for the next task.

I don't believe such a feature is supported by either WDL or dxCompiler/executor. If a task generates a directory output, that output can only be passed to downstream task(s) as a directory input via WDL. I can't find any case in the WDL spec that allows picking and choosing among the contents of a directory output:

https://github.com/openwdl/wdl/blob/main/versions/development/SPEC.md

@Gvaihir
Contributor

Gvaihir commented Jan 18, 2023

Then simply use a Directory as the input and point to the specific file in the command section of your task.
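
A minimal shell sketch of that suggestion, assuming the Directory input has already been localized to the worker and its path is interpolated into the command section. The helper name `pick_file` and the demo layout are hypothetical, chosen only for illustration; they are not part of dxCompiler or WDL.

```shell
# Hypothetical helper: resolve one expected file inside a localized
# Directory input. In a WDL command section, the first argument would be
# the interpolated Directory input; here it is a temporary demo directory.
pick_file() {
    pick_dir="${1%/}"                  # strip a single trailing slash, if any
    pick_path="${pick_dir}/${2}"
    if [ ! -f "$pick_path" ]; then
        echo "expected file not found: ${pick_path}" >&2
        return 1
    fi
    printf '%s\n' "$pick_path"
}

# Demo: build a layout like the one in the earlier example.
demo_dir="$(mktemp -d)"
for i in 1 2 3 4; do
    mkdir -p "${demo_dir}/outputs/folder_${i}"
    echo "Hello" > "${demo_dir}/outputs/folder_${i}/my_file_${i}"
done

# Select a single file from the directory and use only that file.
selected="$(pick_file "${demo_dir}/outputs/" "folder_2/my_file_2")"
cat "$selected"
```

In a real task the first argument would come from the task's Directory input rather than from a temporary directory, and the relative file name would have to be known to the task in advance.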

@RachelDuffin
Author

My goal is to be able to specify relative output locations for individual files.

Using a directory as an output would not work because my workflow uses both WDL tasks and native applets imported as WDL tasks. The native applets require file inputs, whereas the outputs passed on from the previous task would be directories. It isn't feasible to go through and change all our existing apps to take a different input, given the lengthy validation / release / quality management process we have to go through to make changes in a clinical setting. Also, depending on the task, you may not always want the outputs in the same directory.

We would also only want the next task/app to run if all the required input files exist. If you were to run a workflow that took a directory input, a previous task/app could finish without outputting all the files required for the next task to run (not all required files exist within that directory). With a directory input, the next task would still run and fail with an internal error, as opposed to just failing to start. This incurs a cost to the customer.

Not only that, but if you want to group all files within one directory, e.g. a 'bams' directory, and there are multiple BAMs within that directory (e.g. across many samples), specifying that directory as input would mean that all files within that directory would be uploaded to the worker, again incurring greater cost and a greater runtime for the app despite potentially only needing a single file from that directory. In my case I would be running many concurrent workflows, and the app would have no way of knowing which BAM file would be the correct one to use for the command, i.e. the BAM file specific to just that workflow.
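
To make the missing-file concern concrete, here is a hedged shell sketch of a guard that a task taking a directory input could run at the top of its command section. The function name `require_files` and the sample file names are hypothetical. Note that this only makes the failure early and explicit; the job has still started and the directory has still been transferred, so it does not remove the cost described above.

```shell
# Hypothetical guard: verify that every required file exists inside a
# directory before doing any work. Prints each missing path to stderr
# and returns non-zero if any required file is absent.
require_files() {
    req_dir="${1%/}"
    shift
    req_missing=0
    for req_name in "$@"; do
        if [ ! -f "${req_dir}/${req_name}" ]; then
            echo "missing required input: ${req_dir}/${req_name}" >&2
            req_missing=1
        fi
    done
    return "$req_missing"
}

# Demo: a 'bams' style directory containing files for one sample only
# (file names are illustrative).
bams_dir="$(mktemp -d)"
touch "${bams_dir}/sample_A.bam" "${bams_dir}/sample_A.bam.bai"

if require_files "$bams_dir" sample_A.bam sample_A.bam.bai; then
    echo "all inputs present"
fi
```

With explicit File inputs this check is unnecessary, because the task cannot be launched until each file exists, which is the behaviour being requested here.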

I know it is possible to output files from a task to a specific subdirectory using the '--stage-relative-output-folder' argument. However, this sets the output directory for all files output by that task, which is not always desirable, and it doesn't allow output locations to be specified for individual files within that task (something that is possible, and that we already do, for native applets).

This behaviour makes WDL workflows inflexible in comparison to native workflows. I think users would benefit if the relative output locations specified within the task output section were reflected within the project hierarchy, relative to the specified destination folder.

I also know that a reorg app is an option. However, this is also not ideal, as it only moves the files once the workflow has completed.

When compiling/running a workflow using Cromwell, if a relative output location is specified within the task for an output file, e.g. "output/file.txt", the file is placed within an output directory relative to the execution directory. Is it not possible to replicate similar behaviour for dxCompiler, but relative to the project?

@sclan
Collaborator

sclan commented Jan 19, 2023

If your downstream / native app needs an individual file from the output directory, can you declare a File output alongside the Directory output in the (upstream) task?

That way you get both the output directory and the individual output File from the task, and the latter can be used as the input for the downstream task.

@RachelDuffin
Author

Hi, I suppose this would be possible. However, it could be misleading, as we have multiple apps that output files into the same folder. If a folder is declared as an output from one app, I don't think it would be clear which files within that folder came from that app and which came from a different app. I think allowing relative output locations to be specified for individual files is the desired behaviour, as it is explicit.


3 participants