Collection Operations (Limited) #2434

jmchilton · 2016-05-31T18:09:47Z

Overview

This PR introduces Tool-derived framework-level plumbing for dealing collections at the model level instead of at the file level, allowing operations that generate new HDAs and collections without duplicating Dataset objects. Together operations vastly expand the expressiveness of Galaxy workflows.

What's Different

This is progress toward #1644 and contains only the parts of the ill-fated #1313 that do not touch JavaScript or the idea of operations. Therefore this PR doesn't provide the ability to filter out datasets in a collection based on metadata or group datasets into lists of lists based on expressions. Loosing the former means it is impossible to use different tools in workflow for instance based on sample metadata (switch algorithms or switch various flags based on read statistics for instance) and loosing the latter vastly decreases the power of lists of lists - and makes it much harder to imagine workflows that use statistical clustering for instance to affect workflow structure.

This variant of these tools however does include better dataset labeling, actual help text, and introduces a mechanism to ensure they are in the tool panel by default.

The Collection Operations:

Zip (two datasets -> paired collection). Like all these tools, it can be mapped over - so it can easily be used to take two dataset lists and build a list of pairs for instance.
Unzip (paired collection -> two datasets).
Filter failed datasets (list -> list). Given a list it produces another list without any failed datasets. (Most commonly requested of these operations.)
Flatten collection (* -> list). Produces a flat list from any collection, joining identifiers on user supplied character.

Issues and Notes

Unlike more traditional tools, these do interactive checking of inputs so they cannot be "queued up" ahead of time during interactive use. They are workflow aware though, so there is no problem using them with workflows.

Testing

Each operation contains tests (either in the form of API tests or simple tool tests that will be included with -framwork tests). These can all be executed using:

./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_unzip_collection
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_inputs
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_list_inputs
./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_workflow_run_zip_collections
./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_filter_failed

jmchilton · 2016-06-07T01:08:01Z

Test were passing but I rebased to fix conflict in bundled artifacts.

martenson · 2016-06-07T15:37:51Z

I started to review this and will update this post with findings.

'filter_failed' tool does not allow me to choose input despite having two collections in history (one of them even has a failed dataset)
while running 'zip_colection' I see many of these:
.venv/lib/python2.7/site-packages/sqlalchemy/sql/sqltypes.py:185: SAWarning: Unicode type received non-unicode bind param value 'paired'. (this warning may be suppressed after 10 occurrences)
I tried to use 'zip collection' on 2 collections to get list of collections. It seems to succeed

galaxy.tools.actions.model_operations INFO 2016-06-07 11:45:37,667 Calling produce_outputs, tool is <galaxy.tools.ZipCollectionTool object at 0x10a190410>
galaxy.tools.execute DEBUG 2016-06-07 11:45:37,671 Tool [__ZIP_COLLECTION__] created job [201] (221.519 ms)
galaxy.tools.actions INFO 2016-06-07 11:45:37,757 Verified access to datasets for Job[unflushed,tool_id=__ZIP_COLLECTION__] (3.814 ms)
galaxy.tools.actions.model_operations INFO 2016-06-07 11:45:37,781 Calling produce_outputs, tool is <galaxy.tools.ZipCollectionTool object at 0x10a190410>
galaxy.tools.execute DEBUG 2016-06-07 11:45:37,784 Tool [__ZIP_COLLECTION__] created job [202] (100.084 ms)
galaxy.tools.execute DEBUG 2016-06-07 11:45:37,792 Executed 2 job(s) for tool __ZIP_COLLECTION__ request: (365.056 ms)
galaxy.managers.collections DEBUG 2016-06-07 11:45:37,838 Created collection with 2 elements

However I cannot see the result in the history. I can see it as an option in the tool input though...

Furthermore if I flatten the 'invisible list of collections' I will get proper collection of datasets that is visible in history. Note the dataset with hid7 that is still missing.

However if I try to unzip the 'invisible list of collections' I will get a 'bad request' 400

alaxy.tools ERROR 2016-06-07 13:30:26,390 Exception caught while attempting tool execution:
Traceback (most recent call last):
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/__init__.py", line 1156, in handle_single_execution
    job, out_data = self.execute( trans, incoming=params, history=history, rerun_remap_job_id=rerun_remap_job_id, mapping_over_collection=mapping_over_collection, execution_cache=execution_cache )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/__init__.py", line 1236, in execute
    return self.tool_action.execute( self, trans, incoming=incoming, set_output_hid=set_output_hid, history=history, **kwargs )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/actions/model_operations.py", line 51, in execute
    self._produce_outputs( trans, tool, out_data, output_collections, incoming=incoming, history=history )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/actions/model_operations.py", line 65, in _produce_outputs
    tool.produce_outputs( trans, out_data, output_collections, incoming, history=history )
  File "/Users/marten/devel/git/galaxy/lib/galaxy/tools/__init__.py", line 2220, in produce_outputs
    assert hdca.collection.collection_type == "paired"
AssertionError

martenson · 2016-06-07T17:35:34Z

lib/galaxy/tools/filter_failed_collection.xml

+        </test>
+    </tests>
+    <help>
+        This tool takes a list dataset collction and filters out the failed 


I am unclear on what input this tool wants.

This special class of tools leverages the infrastructure for tool inputs, tool state tracking, tool module for workflows, tool API, etc... without actually producing command-line jobs. Instead these tools are provided the input model objects and are expected to produce output model objects directly. This provides an oppertunity to copy HDAs without copying the underlying datasets. The first driving use case for these tools are also included - namely tools that allow zipping and unzipping paired collections. These tools can be mapped over lists (e.g. list:paired to (list, list) or the inverse) using much of the existing infrastructure for tools. Test cases included that validate these work with mapping operations and in workflows. The most obvious advantage of these versus traditional tools that do the same thing is that the data isn't copied on disk - new HDAs are created directly from the source datasets. Testing: This PR includes various API test cases for functionality, these can be run with the following command: ``` ./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_unzip_collection ./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_inputs ./run_tests.sh -api test/api/test_tools.py:ToolsTestCase.test_zip_list_inputs ./run_tests.sh -api test/api/test_workflows.py:WorkflowsApiTestCase.test_workflow_run_zip_collections ```

This differs from a traditional tool in that its inputs don't need to be in an 'ok' state and instead of creating new datasets and duplicating data on disk, new HDAs are created from the existing datasets.

Testing: ``` ./run_tests.sh -framework -id __FLATTEN__ ```

jmchilton · 2016-06-10T19:34:27Z

@martenson Thanks for the detailed review.

I fixed the problem where "Filter Failed Datasets" would only allow "Ok" datasets in the drop down. Obviously this wasn't a problem when testing from the API - but for the drop down I added a framework that allows Tool subclasses to define what a valid dataset input state for that tool is.
The unzip was broken when you would "map" over a collection - so it could unzip a pair into two datasets - but it wouldn't unzip list of paired datasets into two lists - it does this now.
There are some general UI problems with this - I think this would be improved with No Visual Indication of Mapping over a Collection #2467 - an apparent unintentional regression of the tool form since 15.10.
This warning already exists when using collections - I've seen it many times when running tests but never noticed any problems result. SAWarning: Unicode type received non-unicode bind param value 'paired'. (this warning may be suppressed after 10 occurrences). For instance it occurs if you run the tests in managers.test_CollectionManager.
I was not able to replicate the invisible collection problem. If you manually refresh the history does it show up?

This still needs more polish - I'll admit that but I'd like to have something in by the GCC and something at least a little usable for large collections that encounter errors.

martenson · 2016-06-13T19:26:39Z

@jmchilton

filter_failed still does not allow me to select some collections (pair of datasets in this case)
the invisible collection reliably happens when you try to zip_collection on 2 collections (and yes, I tried refreshing it :)
I can now unzip the invisible collections with success and it seems they are unzipped correctly (so I have the two collections in history the invisible one was created from)

Should address some confusion caused by the tool and reference in this review (galaxyproject#2434 (comment)).

jmchilton · 2016-06-13T19:47:51Z

@martenson - my response to your three points.

Filtering a pair of datasets doesn't make sense - the pair is atomic and requires two things. The collection_type of the tool is explicitly list. Now absolutely it should take a list:paired and filter that - removing any pair that has one or two failures - that is an enhancement that should be done but I don't think should block this?

The help for this tool explicitly mentions lists and not collections, though I just pushed an update to the tool "description" as well to indicate this.
Refreshing doesn't work... have you tried turning it off and on again? I don't know I'll look into this some more.
Great.

martenson · 2016-06-13T19:57:54Z

@jmchilton I did try to turn it off and back on again in fact

martenson · 2016-06-13T19:59:37Z

And yes, the filtering on paired is definitely not a blocker

I will dig around the invis collection a bit more. Can you still not reproduce it?

jmchilton · 2016-06-13T20:00:40Z

Okay - I can get an invisible collection if I map something that doesn't result in a collection type that the GUI knows how to handle... so if I map over a list:paired and another list:paired - it would make a list:paired:paired. Which is ... well... non-sense. It is the correct thing to produce but it isn't something that the user should ever want to do. I wish the GUI would display these - but that is unrelated to these tools and a known issue (I mean list:paired:paireds are mostly useless - but the GUI wouldn't display a list:list or a list:list:paired or a list:list:list or a list:list:list:paired which would all be useful in this setting or that I think).

I can get other tools that produce collections to result in the same behavior - if you map a tool that produces a list over a list - a totally reasonable and useful action - you'd get a list:list which I think likewise the GUI just doesn't display. It isn't these tools.

martenson · 2016-06-13T20:20:38Z

nice work @jmchilton - thank you for pushing this forward!

Takadonet · 2016-06-13T20:34:21Z

@jmchilton Is there any chance that this can be backported to 16.01 release?

martenson · 2016-06-13T20:38:15Z

@Takadonet we backport only bugfixes; two releases backport would have to be a security bug or something similar...

Takadonet · 2016-06-13T20:39:21Z

Sorry, let me rephrase, you THINK it will work on 16.01? Have user begging for this feature for many months.

jmchilton added kind/enhancement status/review area/framework area/tools area/dataset-collections area/tool-framework labels May 31, 2016

jmchilton added this to the 16.07 milestone May 31, 2016

jmchilton force-pushed the fewer_collection_opts branch from ce263a0 to c886f1e Compare May 31, 2016 18:10

jmchilton mentioned this pull request May 31, 2016

Planemo 2016 Abstract galaxyproject/planemo#408

Closed

51 tasks

jmchilton force-pushed the fewer_collection_opts branch from c886f1e to 9a86a03 Compare June 7, 2016 01:07

martenson reviewed Jun 7, 2016
View reviewed changes

martenson mentioned this pull request Jun 7, 2016

fix few typos jmchilton/galaxy#48

Merged

jmchilton and others added 8 commits June 10, 2016 13:34

Reduce duplication in galaxy.tools.parameters.wrapped.

a80eaae

New model tool that filters failed datasets out of a collection.

ddeff4e

This differs from a traditional tool in that its inputs don't need to be in an 'ok' state and instead of creating new datasets and duplicating data on disk, new HDAs are created from the existing datasets.

New model tool that flattens any collection into a list.

68a7206

Testing: ``` ./run_tests.sh -framework -id __FLATTEN__ ```

Improve configuration and display of collection operation tools.

8a00565

Update unzip collection action to allow it to be mapped over.

a2578eb

fix few typos

d9a108a

More general handling of what tool inputs are required in what state.

80ac816

jmchilton force-pushed the fewer_collection_opts branch from 4420830 to 80ac816 Compare June 10, 2016 18:24

Test and linting fixes for fewer_collection_opts branch.

d8ca09b

Update "description" of filter_failed_collection to mention only lists.

af5f3bb

Should address some confusion caused by the tool and reference in this review (galaxyproject#2434 (comment)).

martenson merged commit 0ed456e into galaxyproject:dev Jun 13, 2016

jmchilton mentioned this pull request Jun 13, 2016

Limited Collection Types Render in History #2495

Closed

nsoranzo deleted the fewer_collection_opts branch June 22, 2018 16:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collection Operations (Limited) #2434

Collection Operations (Limited) #2434

jmchilton commented May 31, 2016

jmchilton commented Jun 7, 2016

martenson commented Jun 7, 2016 •

edited

Loading

martenson Jun 7, 2016

jmchilton commented Jun 10, 2016

martenson commented Jun 13, 2016 •

edited

Loading

jmchilton commented Jun 13, 2016

martenson commented Jun 13, 2016

martenson commented Jun 13, 2016

jmchilton commented Jun 13, 2016

martenson commented Jun 13, 2016

Takadonet commented Jun 13, 2016

martenson commented Jun 13, 2016

Takadonet commented Jun 13, 2016

Collection Operations (Limited) #2434

Collection Operations (Limited) #2434

Conversation

jmchilton commented May 31, 2016

Overview

What's Different

The Collection Operations:

Issues and Notes

Testing

jmchilton commented Jun 7, 2016

martenson commented Jun 7, 2016 • edited Loading

martenson Jun 7, 2016

Choose a reason for hiding this comment

jmchilton commented Jun 10, 2016

martenson commented Jun 13, 2016 • edited Loading

jmchilton commented Jun 13, 2016

martenson commented Jun 13, 2016

martenson commented Jun 13, 2016

jmchilton commented Jun 13, 2016

martenson commented Jun 13, 2016

Takadonet commented Jun 13, 2016

martenson commented Jun 13, 2016

Takadonet commented Jun 13, 2016

martenson commented Jun 7, 2016 •

edited

Loading

martenson commented Jun 13, 2016 •

edited

Loading