Workflows implicit conversions #5384

Open
EngyNasr opened this issue Jul 2, 2023 · 2 comments

Comments

@EngyNasr
Contributor

EngyNasr commented Jul 2, 2023

Implicit Conversions:

Some tools fail when run in a workflow but succeed when run on their own, for example Krakentools: Extract Kraken Reads By ID (older version) and Filter Sequence by ID.

After some investigation, we noticed that when these tools run on their own (outside a workflow), Galaxy implicitly decompresses the zipped input files, so the job succeeds. When the same tools run with exactly the same inputs within a workflow, this implicit decompression does not take place, so the job fails.

Example of the implicit datatype conversion performed when the tools run standalone in a history, outside a workflow:
image

The initial solution was to add the Convert compressed file to uncompressed tool to the workflow before these tools, as shown in green in the figure below.

image

However, this initial solution is not optimal: with hundreds or thousands of sequence files, running the workflow dramatically increases the size of the user's history.

We have therefore proposed another solution: updating the tool wrappers themselves to perform the decompression internally, without needing the Convert compressed file to uncompressed tool, as we did for Krakentools: Extract Kraken Reads By ID (current version).
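
As an illustration only (this is not the actual Krakentools or Filter sequences by ID wrapper code), tool-side handling of compressed input could be sketched in Python as below. The sketch assumes the inputs are gzip-compressed and detects this from the two-byte gzip magic number rather than from the file name. The helper names `is_gzipped`, `open_text`, and `count_fasta_records` are hypothetical.

```python
import gzip

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    if is_gzipped(path):
        # gzip.open in "rt" mode decompresses on the fly,
        # so no uncompressed copy is written to disk.
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fasta_records(path):
    """Count FASTA header lines, whether or not the file is compressed."""
    with open_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Because the data is decompressed while it is being read, a tool written this way would not add an uncompressed copy of each input to the history, regardless of whether it runs inside a workflow.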

The optimal solution would be to update Galaxy workflows to perform implicit conversions similar to those done when a tool runs outside a workflow.

Important note: this implicit conversion issue only occurs when the input is a collection of zipped files. If the input is a single zipped file, these tools work fine both within and outside a workflow.

@bernt-matthias
Contributor

> However, this initial solution is not optimal: with hundreds or thousands of sequence files, running the workflow dramatically increases the size of the user's history.

Note that implicitly converted datasets are also added to the history, but as hidden datasets (as far as I know). They have the same name and number but are hidden. Therefore I would not agree with your statement:

> The optimal solution would be to update Galaxy workflows to perform implicit conversions similar to those done when a tool runs outside a workflow.

But I would rather say that the following is optimal (trading space for run time):

> We have therefore proposed another solution: updating the tool wrappers themselves to perform the decompression internally, without needing the Convert compressed file to uncompressed tool, as we did for Krakentools: Extract Kraken Reads By ID (current version).

Please note the ongoing efforts to have as many tools as possible accept both zipped and unzipped inputs: #2312. In the linked PRs you may also find some approaches for doing so on the tool side.

Nevertheless, I would say that the problem with the implicitly converted datasets is a bug. Or what do you think, @mvdbeek?

> This implicit conversion issue only occurs when the input is a collection of zipped files. If the input is a single zipped file, these tools work fine both within and outside a workflow.

@EngyNasr do you have a small example workflow where the failure occurs?

@EngyNasr
Contributor Author

EngyNasr commented Jul 4, 2023

Here is a small example workflow: https://usegalaxy.eu/u/engy.nasr/w/collection-implicit-conversion-example

and a small example History: https://usegalaxy.eu/u/engy.nasr/h/collection-trial-implicit-conversion

You can reproduce the history by running the workflow on the first collection, "Spiked Samples", in the history.

As you can see, Filter sequences by ID failed, but if you rerun the tool in the history it succeeds, since it then runs outside the workflow.
That is due to the implicit conversion (in the form of decompression) that is performed when the tool runs outside a workflow but not when it runs within one.
