Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection Operations (Limited) #2434

Merged
merged 10 commits into from
Jun 13, 2016

Conversation

jmchilton
Copy link
Member

Overview

This PR introduces Tool-derived framework-level plumbing for dealing collections at the model level instead of at the file level, allowing operations that generate new HDAs and collections without duplicating Dataset objects. Together operations vastly expand the expressiveness of Galaxy workflows.

What's Different

This is progress toward #1644 and contains only the parts of the ill-fated #1313 that do not touch JavaScript or the idea of operations. Therefore this PR doesn't provide the ability to filter out datasets in a collection based on metadata or group datasets into lists of lists based on expressions. Loosing the former means it is impossible to use different tools in workflow for instance based on sample metadata (switch algorithms or switch various flags based on read statistics for instance) and loosing the latter vastly decreases the power of lists of lists - and makes it much harder to imagine workflows that use statistical clustering for instance to affect workflow structure.

This variant of these tools however does include better dataset labeling, actual help text, and introduces a mechanism to ensure they are in the tool panel by default.

The Collection Operations:

  • Zip (two datasets -> paired collection). Like all these tools, it can be mapped over - so it can easily be used to take two dataset lists and build a list of pairs for instance.
  • Unzip (paired collection -> two datasets).
  • Filter failed datasets (list -> list). Given a list it produces another list without any failed datasets. (Most commonly requested of these operations.)
  • Flatten collection (* -> list). Produces a flat list from any collection, joining identifiers on user supplied character.

Issues and Notes

Unlike more traditional tools, these do interactive checking of inputs so they cannot be "queued up" ahead of time during interactive use. They are workflow aware though, so there is no problem using them with workflows.

Testing

Each operation contains tests (either in the form of API tests or simple tool tests that will be included with -framwork tests). These can all be executed using:

./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_unzip_collection
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_inputs
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_list_inputs
./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_workflow_run_zip_collections
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_filter_failed

@jmchilton
Copy link
Member Author

Test were passing but I rebased to fix conflict in bundled artifacts.

@martenson
Copy link
Member

martenson commented Jun 7, 2016

I started to review this and will update this post with findings.

  • 'filter_failed' tool does not allow me to choose input despite having two collections in history (one of them even has a failed dataset)
    screenshot 2016-06-07 11 29 55
  • while running 'zip_colection' I see many of these:
    .venv/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:185: SAWarning: Unicode type received non-unicode bind param value 'paired'. (this warning may be suppressed after 10 occurrences)
  • I tried to use 'zip collection' on 2 collections to get list of collections. It seems to succeed
galaxy.tools.actions.model_operations INFO 2016-06-07 11:45:37,667 Calling produce_outputs, tool is <galaxy.tools.ZipCollectionTool object at 0x10a190410>
galaxy.tools.execute DEBUG 2016-06-07 11:45:37,671 Tool [__ZIP_COLLECTION__] created job [201] (221.519 ms)
galaxy.tools.actions INFO 2016-06-07 11:45:37,757 Verified access to datasets for Job[unflushed,tool_id=__ZIP_COLLECTION__] (3.814 ms)
galaxy.tools.actions.model_operations INFO 2016-06-07 11:45:37,781 Calling produce_outputs, tool is <galaxy.tools.ZipCollectionTool object at 0x10a190410>
galaxy.tools.execute DEBUG 2016-06-07 11:45:37,784 Tool [__ZIP_COLLECTION__] created job [202] (100.084 ms)
galaxy.tools.execute DEBUG 2016-06-07 11:45:37,792 Executed 2 job(s) for tool __ZIP_COLLECTION__ request: (365.056 ms)
galaxy.managers.collections DEBUG 2016-06-07 11:45:37,838 Created collection with 2 elements

However I cannot see the result in the history. I can see it as an option in the tool input though...
screenshot 2016-06-07 11 46 47
Furthermore if I flatten the 'invisible list of collections' I will get proper collection of datasets that is visible in history. Note the dataset with hid7 that is still missing.
screenshot 2016-06-07 11 51 08
However if I try to unzip the 'invisible list of collections' I will get a 'bad request' 400
screenshot 2016-06-07 13 30 43

alaxy.tools ERROR 2016-06-07 13:30:26,390 Exception caught while attempting tool execution:
Traceback (most recent call last):
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/__init__.py", line 1156, in handle_single_execution
    job, out_data = self.execute( trans, incoming=params, history=history, rerun_remap_job_id=rerun_remap_job_id, mapping_over_collection=mapping_over_collection, execution_cache=execution_cache )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/__init__.py", line 1236, in execute
    return self.tool_action.execute( self, trans, incoming=incoming, set_output_hid=set_output_hid, history=history, **kwargs )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/actions/model_operations.py", line 51, in execute
    self._produce_outputs( trans, tool, out_data, output_collections, incoming=incoming, history=history )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/actions/model_operations.py", line 65, in _produce_outputs
    tool.produce_outputs( trans, out_data, output_collections, incoming, history=history )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/__init__.py", line 2220, in produce_outputs
    assert hdca.collection.collection_type == "paired"
AssertionError

</test>
</tests>
<help>
This tool takes a list dataset collction and filters out the failed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unclear on what input this tool wants.

jmchilton and others added 8 commits June 10, 2016 13:34
This special class of tools leverages the infrastructure for tool inputs, tool state tracking, tool module for workflows, tool API, etc... without actually producing command-line jobs. Instead these tools are provided the input model objects and are expected to produce output model objects directly. This provides an oppertunity to copy HDAs without copying the underlying datasets.

The first driving use case for these tools are also included - namely tools that allow zipping and unzipping paired collections. These tools can be mapped over lists (e.g. list:paired to (list, list) or the inverse) using much of the existing infrastructure for tools. Test cases included that validate these work with mapping operations and in workflows.

The most obvious advantage of these versus traditional tools that do the same thing is that the data isn't copied on disk - new HDAs are created directly from the source datasets.

Testing:

This PR includes various API test cases for functionality, these can be run with the following command:

```
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_unzip_collection
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_inputs
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_list_inputs
./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_workflow_run_zip_collections
```
This differs from a traditional tool in that its inputs don't need to be in an 'ok' state and instead of creating new datasets and duplicating data on disk, new HDAs are created from the existing datasets.
Testing:

```
./run_tests.sh -framework -id __FLATTEN__
```
@jmchilton
Copy link
Member Author

@martenson Thanks for the detailed review.

  • I fixed the problem where "Filter Failed Datasets" would only allow "Ok" datasets in the drop down. Obviously this wasn't a problem when testing from the API - but for the drop down I added a framework that allows Tool subclasses to define what a valid dataset input state for that tool is.
  • The unzip was broken when you would "map" over a collection - so it could unzip a pair into two datasets - but it wouldn't unzip list of paired datasets into two lists - it does this now.
  • There are some general UI problems with this - I think this would be improved with No Visual Indication of Mapping over a Collection #2467 - an apparent unintentional regression of the tool form since 15.10.
  • This warning already exists when using collections - I've seen it many times when running tests but never noticed any problems result. SAWarning: Unicode type received non-unicode bind param value 'paired'. (this warning may be suppressed after 10 occurrences). For instance it occurs if you run the tests in managers.test_CollectionManager.
  • I was not able to replicate the invisible collection problem. If you manually refresh the history does it show up?

This still needs more polish - I'll admit that but I'd like to have something in by the GCC and something at least a little usable for large collections that encounter errors.

@martenson
Copy link
Member

martenson commented Jun 13, 2016

@jmchilton

  • filter_failed still does not allow me to select some collections (pair of datasets in this case)
    screenshot 2016-06-13 15 16 12
  • the invisible collection reliably happens when you try to zip_collection on 2 collections (and yes, I tried refreshing it :)
  • I can now unzip the invisible collections with success and it seems they are unzipped correctly (so I have the two collections in history the invisible one was created from)

Should address some confusion caused by the tool and reference in this review (galaxyproject#2434 (comment)).
@jmchilton
Copy link
Member Author

@martenson - my response to your three points.

  • Filtering a pair of datasets doesn't make sense - the pair is atomic and requires two things. The collection_type of the tool is explicitly list. Now absolutely it should take a list:paired and filter that - removing any pair that has one or two failures - that is an enhancement that should be done but I don't think should block this?

    The help for this tool explicitly mentions lists and not collections, though I just pushed an update to the tool "description" as well to indicate this.

  • Refreshing doesn't work... have you tried turning it off and on again? I don't know I'll look into this some more.

  • Great.

@martenson
Copy link
Member

@jmchilton I did try to turn it off and back on again in fact

@martenson
Copy link
Member

And yes, the filtering on paired is definitely not a blocker

I will dig around the invis collection a bit more. Can you still not reproduce it?

@jmchilton
Copy link
Member Author

Okay - I can get an invisible collection if I map something that doesn't result in a collection type that the GUI knows how to handle... so if I map over a list:paired and another list:paired - it would make a list:paired:paired. Which is ... well... non-sense. It is the correct thing to produce but it isn't something that the user should ever want to do. I wish the GUI would display these - but that is unrelated to these tools and a known issue (I mean list:paired:paireds are mostly useless - but the GUI wouldn't display a list:list or a list:list:paired or a list:list:list or a list:list:list:paired which would all be useful in this setting or that I think).

I can get other tools that produce collections to result in the same behavior - if you map a tool that produces a list over a list - a totally reasonable and useful action - you'd get a list:list which I think likewise the GUI just doesn't display. It isn't these tools.

@martenson
Copy link
Member

nice work @jmchilton - thank you for pushing this forward!

@martenson martenson merged commit 0ed456e into galaxyproject:dev Jun 13, 2016
@Takadonet
Copy link
Contributor

@jmchilton Is there any chance that this can be backported to 16.01 release?

@martenson
Copy link
Member

@Takadonet we backport only bugfixes; two releases backport would have to be a security bug or something similar...

@Takadonet
Copy link
Contributor

Sorry, let me rephrase, you THINK it will work on 16.01? Have user begging for this feature for many months.

@nsoranzo nsoranzo deleted the fewer_collection_opts branch June 22, 2018 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants