Preserve folder hierarchies on the platform #415

RachelDuffin · 2023-01-09T14:15:44Z

My team would find the previously requested behaviour in dnanexus/dxWDL#168 very useful.

We are investigating switching over to WDL workflows and are almost there for one of our pipelines. However, one of the problems is that whilst the '--stage-relative-output-folder' argument allows specifying the relative output folder per task for all files produced by that task, it does not allow for individual files to be placed in different directories.

It would be very helpful if the platform allowed for this behaviour through specification in the outputs section of each WDL task, rather than needing to create a separate reorg app (this just adds to the code base that needs to be maintained and means that files are only moved to the correct location at the end of the workflow rather than at the point of delocalisation).

I noticed it has been several years since the last comment on closed issue dnanexus/dxWDL#168 and was hoping you would be able to provide an update as to whether the exploratory work went anywhere and whether the professional services department decided that this functionality would be useful to incorporate into the dxCompiler.

Many thanks!

sclan · 2023-01-10T14:05:21Z

The Directory output feature from WDL 2.0 is supported:
https://github.com/dnanexus/dxCompiler/blob/develop/doc/ExpertOptions.md#directory-outputs

RachelDuffin · 2023-01-10T17:24:14Z

Hi Stanley, I did try to use this but I wasn't able to get it to work. I will give it another try and let you know if I am still having issues.

sclan · 2023-01-10T22:05:01Z

Hi Rachel, you may find the test cases helpful as a template:
https://github.com/dnanexus/dxCompiler/tree/develop/test/wdl_2_0
We include the tests from the test folder in our integration tests, so all of them have to pass before a dxCompiler release is possible. Please make sure that you compile with the latest release, since directory output support is relatively new.

RachelDuffin · 2023-01-11T08:35:58Z

Hi Stanley, Thanks for sharing this. It doesn't look like there is a way to be able to use the Directory outputs to output the files to a specific directory but then to still be able to specify individual files from that task as inputs to the next task - is that correct or is there a way to do this? Many thanks, Rachel

Gvaihir · 2023-01-17T07:02:49Z

Could you elaborate on what specifically does not work?
The expected and tested behavior is that the code in the task organizes the outputs on a worker into directories, according to the logic of the given task. For example:

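# Create outputs/folder_1 .. outputs/folder_4 (-p also creates outputs/ itself)
mkdir -p outputs/folder_{1..4}
# Write one file into each folder; the folder name must match the mkdir pattern
for i in {1..4}; do
    echo "Hello" > "outputs/folder_${i}/my_file_${i}"
done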

Then make your task's outputs look like this:

output {
    Directory outdir = "outputs/"
}

Your next task will then take the Directory "outputs/" from the previous task (above) as its input.
Where does this behavior break for you?

sclan · 2023-01-17T13:16:20Z

I think Rachel's goal is to be able to specify (or find a way to specify) one individual file from a task directory output and use it as the input for the next task.

I don't believe such a feature is supported by either WDL or dxCompiler/executor. If the task generates a directory output, that output can only be passed as a Directory input to the downstream task(s) via WDL. I can't find any case in the WDL spec that allows picking and choosing among the contents of a directory output:

https://github.com/openwdl/wdl/blob/main/versions/development/SPEC.md

Gvaihir · 2023-01-18T19:22:00Z

Then simply use a Directory as the input and point to the specific file in the command section of your task.
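
A minimal sketch of what that suggestion could look like, written as the plain shell that would sit in a task's command section. The helper name `pick_file` and its arguments are hypothetical, and the file layout simply reuses the `outputs/folder_N/my_file_N` example from earlier in the thread; in a real task the directory path would come from the Directory input rather than being created locally. This is an illustration only, not documented dxCompiler behaviour.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the path of one file inside a directory input,
# failing with a clear message if that file is missing.
pick_file() {
    local input_dir="$1"   # path of the localized Directory input (assumed)
    local rel_path="$2"    # wanted file, relative to input_dir
    local target="${input_dir%/}/${rel_path}"
    if [[ ! -f "$target" ]]; then
        echo "required file not found: $target" >&2
        return 1
    fi
    printf '%s\n' "$target"
}

# Recreate the example layout so the sketch is self-contained.
mkdir -p outputs/folder_{1..4}
for i in {1..4}; do
    echo "Hello" > "outputs/folder_${i}/my_file_${i}"
done

# Use a single file from the directory in the rest of the command.
selected="$(pick_file outputs/ folder_2/my_file_2)"
cat "$selected"
```

The existence check only makes a missing file fail with an explicit message at the start of the command; it does not stop the task from being launched, and it does not change which files are transferred to the worker.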

RachelDuffin · 2023-01-19T13:57:57Z

My goal is to be able to specify relative output locations for individual files.

Using a directory as an output would not work because in my case I am using both WDL tasks and native applets imported as WDL tasks (these have file inputs) in my workflow. The native applets require file inputs, whereas the outputs from the previous task passed as inputs to the next task would be directories. It isn't feasible to go through and change all our existing apps to take a different input, given the lengthy validation / release / quality management process we have to go through to make changes in a clinical setting. Also, depending on the task, you may not always want the outputs in the same directory.

We would also only want the next task/app to run if the inputs consisted of all the required files, i.e. the input files exist. If you were to run a workflow that took a directory input, it could be that a previous task/app finished but did not output all the required files for the next task to run (not all required files exist within that directory). With a directory input the next task would still run and fail with an internal error, as opposed to just failing to start. This incurs a cost to the customer.

Not only that, but if you want to group all files within one directory, e.g. a 'bams' directory, and there are multiple bams within that directory (e.g. across many samples), specifying that directory as input would mean that all files within that directory would be uploaded to the worker, again incurring greater cost and a greater runtime for the app despite potentially only needing a single file from that directory. In my case I would be running many concurrent workflows, so the app would have no way of knowing which bam file would be the correct one to use for the command, i.e. the bam file that was specific to just that workflow.

I know it is possible to output files from a task to a specific subdirectory using the '--stage-relative-output-folder' argument. However, this specifies the output directory for all files output by that task, which is not always desirable, and it doesn't allow specifying output locations for individual files within that task (which is something that is possible and that we already do for native applets).

The behaviour makes WDL workflows inflexible in comparison to native workflows, and I think there would be benefit to the user in supporting the relative output locations specified within the task output section being reflected within the project hierarchy / relative to the specified destination folder.

I also know that a reorg app is an option however this is also not ideal as it only moves the files at the end of the completion of the workflow.

When compiling/ running a workflow using Cromwell, if a relative output location is specified within the task for an output file e.g. "output/file.txt", the file is then placed within an output directory relative to the execution directory. Is it not possible to replicate similar behaviour but relative to the project for the dxCompiler?

sclan · 2023-01-19T16:35:54Z

If it is the individual output file among the output directory that your downstream / native app needs, can you declare a File output along with the directory output in the (upstream) task?

That way you get both the output directory and the individual output File from the task, and the latter can be used as input for the downstream task.

RachelDuffin · 2023-03-10T09:59:40Z

Hi, I suppose this would be possible, however it can be misleading as we have multiple apps that output files into the same folder. If a folder is declared as an output from one app, I don't think it would be clear which files within that folder came from that app and which came from a different app. I think that allowing you to specify a relative output location for individual files is the desired behaviour, as it is explicit.
