Problems with Collection output created via structured_like from a data input with multiple=True #7392

Open
blankenberg opened this issue Feb 20, 2019 · 4 comments

@blankenberg
Member

I have a tool with an input of type data and multiple=True:

<param name="input_input" type="data" label="Input" format="anvio_db" optional="True" multiple="True" argument="" help="Anvi'o database for migration"/>

I want to create an output collection in a 1:1 fashion against this input dataset list:

<collection name="output_input" type="list" label="${tool.name} on ${on_string}: Input" structured_like="input_input" format_source="input_input" metadata_source="input_input"/>

If I use the UI tab to switch the input to 'Dataset collections', it almost works as expected. However, the metadata is not properly propagated before command-line generation: the values for every item in the new collection are inherited from the first element of the input dataset 'list', whereas they should be copied from each parallel element. In the example tool below, metadata.anvio_basename is therefore incorrect in the generated command line. I could work around this metadata issue by using the metadata of the actual input directly, but that is a bad hack.
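
To make the difference concrete, here is a minimal Python sketch of the two propagation strategies. It is a simplified model, not Galaxy's implementation: the dict-based datasets, the function names, and the example `anvio_basename` values are all assumptions made for illustration.

```python
# Hypothetical, simplified model: each dataset is a dict with a "metadata" dict.
# The anvio_basename values below are made up for illustration.
inputs = [
    {"name": "a", "metadata": {"anvio_basename": "CONTIGS.db"}},
    {"name": "b", "metadata": {"anvio_basename": "PROFILE.db"}},
]


def propagate_from_first(datasets):
    """Observed behavior: every output element copies the first input's metadata."""
    first = datasets[0]["metadata"]
    return [{"name": d["name"], "metadata": dict(first)} for d in datasets]


def propagate_pairwise(datasets):
    """Expected behavior: each output element copies its parallel input's metadata."""
    return [{"name": d["name"], "metadata": dict(d["metadata"])} for d in datasets]


observed = [o["metadata"]["anvio_basename"] for o in propagate_from_first(inputs)]
expected = [o["metadata"]["anvio_basename"] for o in propagate_pairwise(inputs)]
print(observed)  # ['CONTIGS.db', 'CONTIGS.db']
print(expected)  # ['CONTIGS.db', 'PROFILE.db']
```

With the first strategy, the command line would reference the same basename for every output element, which matches the symptom described above.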

When selecting 1+ datasets in the standard 'Multiple datasets' mode, it doesn't work at all and instead complains that there is no input collection (there isn't; it's just a 'list' of datasets). But this should work.

galaxy.tools DEBUG 2019-02-20 08:49:38,905 [p:64318,w:1,m:0] [uWSGIWorker1Core2] Validated and populated state for tool request (20.455 ms)
galaxy.tools ERROR 2019-02-20 08:49:38,946 [p:64318,w:1,m:0] [uWSGIWorker1Core2] Exception caught while attempting tool execution:
Traceback (most recent call last):
  File "lib/galaxy/tools/__init__.py", line 1435, in handle_single_execution
    collection_info=collection_info,
  File "lib/galaxy/tools/__init__.py", line 1517, in execute
    return self.tool_action.execute(self, trans, incoming=incoming, set_output_hid=set_output_hid, history=history, **kwargs)
  File "lib/galaxy/tools/actions/__init__.py", line 438, in execute
    known_outputs = output.known_outputs(input_collections, collections_manager.type_registry)
  File "lib/galaxy/tools/parser/output_objects.py", line 126, in known_outputs
    collection_prototype = self.structure.collection_prototype(inputs, type_registry)
  File "lib/galaxy/tools/parser/output_objects.py", line 203, in collection_prototype
    collection_prototype = inputs[self.structured_like].collection
KeyError: 'input_input'
galaxy.tools.execute WARNING 2019-02-20 08:49:38,946 [p:64318,w:1,m:0] [uWSGIWorker1Core2] There was a failure executing a job for tool [anvi_migrate_db] - Error executing tool: 'input_input'
galaxy.tools.execute DEBUG 2019-02-20 08:49:38,946 [p:64318,w:1,m:0] [uWSGIWorker1Core2] Executed 1 job(s) for tool anvi_migrate_db request: (40.876 ms)
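
The failure mode in the traceback can be reproduced in isolation with a small Python sketch. This is an assumption-laden model, not Galaxy code: `CollectionInput` and the standalone `collection_prototype` function are hypothetical stand-ins, and it assumes that a plain 'Multiple datasets' selection never appears in the mapping of input collections.

```python
class CollectionInput:
    """Hypothetical stand-in for an input that wraps a real dataset collection."""

    def __init__(self, collection):
        self.collection = collection


def collection_prototype(inputs, structured_like):
    # Mirrors the failing line from the traceback:
    #   collection_prototype = inputs[self.structured_like].collection
    return inputs[structured_like].collection


# 'Dataset collections' mode: the input is present as a collection.
with_collection = {"input_input": CollectionInput(["db1", "db2"])}
prototype = collection_prototype(with_collection, "input_input")
print(prototype)  # ['db1', 'db2']

# 'Multiple datasets' mode (assumed): the plain dataset list is not in the mapping.
without_collection = {}
error = None
try:
    collection_prototype(without_collection, "input_input")
except KeyError as err:
    error = str(err)
print("KeyError:", error)  # KeyError: 'input_input'
```

The lookup only succeeds when the name given in structured_like is a key of the input-collections mapping, which would explain why the same tool works in one UI mode and fails in the other.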

Here is an example tool XML:

<tool id="anvi_migrate_db" name="anvi-migrate-db" version="5.3.0">
    <requirements>
        <requirement type="package" version="5.3.0">anvio</requirement>
    </requirements>
    <stdio>
        <exit_code range="1:" />
    </stdio>
    <version_command>anvi-migrate-db --version</version_command>
    <command><![CDATA[
        #if $input_input:
            #for $GXY_I, ($gxy_input_input, $gxy_output_input) in $enumerate( $zip( $input_input, $output_input ) ):
                #if $GXY_I != 0:
                    &&
                #end if
                cp -R '${gxy_input_input.extra_files_path}' '${gxy_output_input.extra_files_path}'
            #end for
        #else
            echo ''
        #end if
        &&
        anvi-migrate-db
        #for $gxy_output_input in $output_input:
            "${gxy_output_input.extra_files_path}/${gxy_output_input.metadata.anvio_basename}"
        #end for
        --just-do-it
        #if $str( $target_version ):
            --target-version '${target_version}'
        #end if
        &> '${GALAXY_ANVIO_LOG}'
    ]]></command>
    <inputs>
        <param name="input_input" type="data" label="Input" format="anvio_db" optional="True" multiple="True" argument="" help="Anvi'o database for migration"/>
        <param name="target_version" type="text" label="Target Version" value="" optional="True" argument="--target-version" help="Anvi'o will stop upgrading your database when it reaches to this version."/>
    </inputs>
    <outputs>
        <collection name="output_input" type="list" label="${tool.name} on ${on_string}: Input" structured_like="input_input" format_source="input_input" metadata_source="input_input"/>
        <data name="GALAXY_ANVIO_LOG" format="txt" label="${tool.name} on ${on_string}: Log"/>
    </outputs>
</tool>

@mvdbeek
Member

mvdbeek commented Feb 20, 2019

I'm probably missing something here, but structured_like only works for collection inputs, so you'd need to make this a list input, I think. The documentation says 'This is the name of input collection or dataset to derive "structure"', but that's wrong: it only works for, and is designed for, collections.

If you want collection in / collection out while needing access to all input elements you need to make the input a collection, I think.

@blankenberg
Member Author

I agree that what you are saying is what is happening, that in my second case there is no 'real' collection. But it is not correct behavior. Abstractly, a list is a list.

Additionally, IIRC, by best practice standards, tool inputs taking a list should use a standard data input with multiple=True, and not a collection=list input; this standard is predicated on treating each kind of 'list' equally.

@mvdbeek
Member

mvdbeek commented Feb 20, 2019

> Additionally, IIRC, by best practice standards, tool inputs taking a list should use a standard data input with multiple=True, and not a collection=list input; this standard is predicated on treating each kind of 'list' equally.

I think for reductions you should use a data input with multiple="True". I disagree for the case where you need to keep the structure: there it should be a normal data input if all elements are independent, and otherwise a list input / list output. I mean there are substantial differences in the flow of a multiple="true" input compared to a list input, for instance the multiple="true" input would let you create a collection from a single input. Worth implementing for sure, but I don't think this is a bug (except for the documentation ...).
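
The three flows distinguished here can be sketched in Python. This is only a conceptual illustration under simplifying assumptions: datasets are plain strings, and the function names `reduction`, `map_over`, and `list_to_list` are hypothetical, not Galaxy APIs.

```python
def reduction(datasets):
    """N inputs -> 1 output in a single job (e.g. a merge); the multiple="true" case."""
    return "+".join(datasets)


def map_over(datasets, tool):
    """N independent inputs -> N jobs with one output each; the plain data input case."""
    return [tool(d) for d in datasets]


def list_to_list(datasets):
    """N inputs -> N outputs in a single job that sees every element at once."""
    total = len(datasets)
    return [f"{d} ({i + 1}/{total})" for i, d in enumerate(datasets)]


print(reduction(["a", "b"]))            # a+b
print(map_over(["a", "b"], str.upper))  # ['A', 'B']
print(list_to_list(["a", "b"]))         # ['a (1/2)', 'b (2/2)']
```

The tool in this issue falls into the third category: one job consumes all inputs and must produce one output per input, so it needs both access to the whole list and a preserved structure.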

@blankenberg
Member Author

> Worth implementing for sure, but I don't think this is a bug (except for the documentation ...).

I am not convinced that the documentation is incorrect about the ideal behavior. I think if you select a list of datasets under multiple=True, then it should behave the same as if you gave it a collection=list containing datasets. In fact, this is exactly what happens, with the (imho incorrect) exception of the output creation.

> I mean there are substantial differences in the flow of a multiple="true" input compared to a list input, for instance the multiple="true" input would let you create a collection from a single input.

I am not sure I see these differences here. When I put the interface into 'Data collection' mode (with multiple=True), it displays as HID: collection_name (as list), note the '(as list)', and the structured output is created properly -- it should also work when given just a standard dataset list.

Otherwise, the UX greatly suffers. I can already create a collection=list with a single dataset using the 'build list' tool, but that doesn't mean it is a good idea. But if it explicitly required a collection=list, we would force a user to manually create a collection=list in order to use a tool that just consumes a list of things and creates an equivalent output.

I am not sure if I am missing something important here, as the behavior I expect seems really non-ambiguous. A list is a list when multiple=True.
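
As a rough illustration of the behavior being requested, the following Python sketch derives the same output structure from either kind of 'list'. It is not a proposed patch: `ListCollection`, `structure_of`, and the rule that a plain selection uses dataset names as element identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ListCollection:
    """Hypothetical stand-in for a collection=list input."""

    elements: list  # list of (identifier, dataset) pairs


def structure_of(value):
    """Return the element identifiers an output list would be structured like."""
    if isinstance(value, ListCollection):
        return [identifier for identifier, _ in value.elements]
    # Plain multiple=True selection: assume identifiers default to dataset names.
    return [dataset["name"] for dataset in value]


as_collection = ListCollection(
    elements=[("sample1", {"name": "sample1"}), ("sample2", {"name": "sample2"})]
)
as_plain_list = [{"name": "sample1"}, {"name": "sample2"}]

print(structure_of(as_collection))  # ['sample1', 'sample2']
print(structure_of(as_plain_list))  # ['sample1', 'sample2']
```

Under these assumptions, both input styles yield the same output structure, which is the "a list is a list" behavior argued for above.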
