Implement Collection Operation to Relabelling Identifiers #3603

jmchilton · 2017-02-13T19:33:39Z

... against supplied labels from an input dataset of type txt.

I think it needs to be touched up but the basic operation seems to work so far. I think what remains to be done is:

Validate uniqueness of identifiers and provide nice messages if they are not unique.
Validate that at least the required number of lines are present in the file and provide a nice message if not.
Add strict mode to ensure exactly the correct number of lines is added.
Find where validation of identifiers happens in the API and apply same validation here - try not to let unsafe identifiers be created.
Consider more advanced modes - selecting a column, applying a regex replace, picking two columns for nested lists, etc. None of this may be needed in the first iteration.
~~Consider another mode where a collection is labelled against an existing collection - should that be a separate tool or the same tool?~~

A collection operation variant of the tool from @pjbriggs at https://github.com/pjbriggs/Amplicon_analysis-galaxy/blob/77340d8bb2470a646deba4933625413fc70985d1/relabel_samples.xml.
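The validation items listed above (enough lines, an optional strict mode, unique identifiers) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `relabel_in_order` and its `strict` parameter are hypothetical names, and the real operation works on Galaxy collection elements rather than plain strings.

```python
def relabel_in_order(identifiers, label_lines, strict=False):
    """Pair existing element identifiers with new labels, one label per line.

    Hypothetical helper for illustration only; not part of Galaxy.
    """
    # Drop line endings and ignore blank lines.
    new_labels = [line.rstrip("\r\n") for line in label_lines]
    new_labels = [label for label in new_labels if label]

    # At least one label per collection element is required.
    if len(new_labels) < len(identifiers):
        raise ValueError(
            "Expected at least %d labels but the file contains only %d."
            % (len(identifiers), len(new_labels))
        )

    # Strict mode: the file must contain exactly one label per element.
    if strict and len(new_labels) != len(identifiers):
        raise ValueError(
            "Strict mode: expected exactly %d labels but found %d."
            % (len(identifiers), len(new_labels))
        )

    # Only the first len(identifiers) labels are used.
    new_labels = new_labels[: len(identifiers)]

    # New identifiers must be unique within the collection.
    seen = set()
    for label in new_labels:
        if label in seen:
            raise ValueError("Duplicate new identifier: %r" % label)
        seen.add(label)

    return list(zip(identifiers, new_labels))
```

Raising one descriptive error per failed check is what would let the tool report "nice messages" instead of a generic failure.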

mvdbeek · 2017-02-13T20:24:53Z

Consider more advanced modes - selecting a column, applying a regex replace, picking two columns for nested lists, etc. None of this may be needed in the first iteration.
Consider another mode where a collection is labelled against an existing collection - should that be a separate tool or the same tool?

The last point could be solved with a tool that just dumps the element identifiers (that wouldn't necessarily need to be shipped with Galaxy), and then you're free to use all the fantastic column/regex tools in Galaxy land to build a 2-column list (if we make this tool accept a 2-column list...).

pjbriggs · 2017-02-24T12:33:12Z

@jmchilton thanks again for making this first implementation, it looks really useful. I've patched our local Galaxy and the tool works as advertised for me.

A couple of observations/suggestions from initial usage:

Could you allow the user to also specify the new name for the relabelled collection? (There doesn't seem to be a way to rename a collection once it's been created.)
It's potentially quite easy to accidentally mislabel collection items with the single column list, as it relies on the user correctly ordering the new labels in the input file. The option of using a 2 column list specifying old and new identifiers would help to mitigate this (as it's easier to see what will map to what).
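To illustrate why a 2-column list is harder to get wrong, here is a minimal sketch in which the lookup is keyed on the old identifier, so the order of lines in the file no longer matters. `relabel_from_mapping` is a hypothetical name, and a tab-separated old/new layout is assumed.

```python
def relabel_from_mapping(identifiers, mapping_lines):
    """Map old identifiers to new ones using tab-separated 'old<TAB>new' lines.

    Hypothetical helper for illustration only; not part of Galaxy.
    """
    mapping = {}
    for line in mapping_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("Expected two tab-separated columns, got: %r" % line)
        old, new = fields
        mapping[old] = new

    # Every element in the collection needs an entry in the file.
    missing = [ident for ident in identifiers if ident not in mapping]
    if missing:
        raise ValueError("No new identifier supplied for: %s" % ", ".join(missing))

    # The result follows the collection order, not the file order.
    return [(ident, mapping[ident]) for ident in identifiers]
```

With this layout a mis-ordered file still produces the intended labels, and a missing or misspelled old identifier is reported instead of silently shifting every label by one.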

HTH. Thanks again!

nsoranzo · 2017-02-24T12:44:28Z

Could you allow the user to also specify the new name for the relabelled collection? (There doesn't seem to be a way to rename a collection once it's been created.)

If you click on the collection in the history panel, it will show the list of the collection datasets. There you can rename the collection by clicking on its name.

pjbriggs · 2017-02-24T13:49:40Z

@nsoranzo thanks! I never noticed that before (I was looking for something like the "pencil" icon on regular datasets).

mvdbeek · 2017-03-20T17:14:02Z

@pjbriggs @jmchilton I've opened a PR with a simple implementation of a 2-column list rename here

mvdbeek · 2017-03-21T10:05:51Z

lib/galaxy/tools/relabel_from_file.xml

+                    </element>
+                </collection>
+            </param>
+            <param name="labels" value="new_labels_2.txt" ftype="txt" />


ftype needs to be tabular, that's why the test failed :(.

jmchilton · 2017-04-13T20:45:58Z

@mvdbeek I reworked that different modality to be an explicit user choice - I hope that is okay (see 0057dc6). I'd rather not have such a large change in behavior depend on a subtle datatype difference if that is okay.


- Better error handling (check for bad characters when creating collections).
- Implement a strict mode parameter to do even more validation.
- Rework tabular vs txt mode to be explicit user choice.
- Mirror fix in release_17.01 for datasets not having a history.

jmchilton · 2017-04-14T03:23:56Z

A couple more fixes rebased into 09bcdd9 so I am pulling this out of WIP.

@mvdbeek Your mapping approach is pretty cool and it makes me think the two missing collection operations - "group by" and "filter" - could easily be implemented by inspecting a table this way. I'd say someday we would still want "filter by expression" and "group by expression" tools - just like we'd like to have "relabel by expression" - but these should exist beside "relabel by file", "group by file", and "filter by file" tools. I'll try to get something ready for 17.05 - it might be a stretch though.

mvdbeek · 2017-04-14T11:30:19Z

lib/galaxy/tools/__init__.py

+        if how_type == "tabular":
+            # We have a tabular file, where the first column is an existing element identifier,
+            # and the second column is the new element identifier.
+            source_new_label = (line.strip().split('\t') for line in new_labels)


Maybe we should use strip('\r\n') here? See the discussion here: galaxyproject/tools-iuc#1233 (comment)
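A small sketch of the difference this suggestion makes. The linked discussion is not reproduced here, so the motivation is an assumption: a bare `strip()` removes all leading and trailing whitespace, including tabs, so an empty last column disappears before the split. `split_fields` is a hypothetical helper name.

```python
def split_fields(line):
    """Split a tab-separated line, removing only the line ending."""
    return line.rstrip("\r\n").split("\t")


line = "sample1\t\r\n"  # the second column is present but empty

# strip() also eats the trailing tab, so the row collapses to one field.
print(line.strip().split("\t"))

# Removing only '\r' and '\n' keeps both columns.
print(split_fields(line))
```

The first call yields a single field and the second yields two (the second one empty), so code that unpacks each row into an old and a new identifier only behaves predictably with the second form.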

mvdbeek · 2017-04-14T11:35:49Z

I reworked that different modality to be an explicit user choice - I hope that is okay (see 0057dc6). I'd rather not have such a large change in behavior depend on a subtle datatype difference if that is okay.

That's better, I agree.

I'll try to get something ready for 17.05 - it might be a stretch though.

Is that for the "filter by expression" and "group by expression" tools? I think this PR looks good to me; from your bullet points I think 1, 2 and 3 are addressed, right?


jmchilton · 2017-04-17T17:16:20Z

Is that for the "filter by expression" and "group by expression" tools ?

No - I meant I hope to open the "from file" variants this week. The more I think about it - the more I like them. The expression stuff will come in a future release.

I think this PR looks good to me, from your bullet-points I think 1,2 and 3 are addressed, right?

Those bullet points are indeed addressed now - I added one more test case and a fix to cover duplicate identifiers in particular. I think the only bullet not addressed now is "Consider another mode where a collection is labelled against an existing collection - should that be a separate tool or the same tool?", which I am happy to push off for now.

jmchilton added area/dataset-collections area/tools kind/enhancement status/WIP labels Feb 13, 2017

jmchilton added this to the 17.05 milestone Feb 13, 2017

This was referenced Feb 14, 2017

WIP Galaxy tool to relabel samples in a "list-of-pairs" dataset collection pjbriggs/Amplicon_analysis-galaxy#20

Closed

Rename pairs in a "list of pairs" dataset collection for input to pipeline pjbriggs/Amplicon_analysis-galaxy#18

Closed

mvdbeek reviewed Mar 21, 2017


jmchilton mentioned this pull request Mar 22, 2017

Add a pencil to dataset collections. #3800

Open

martenson mentioned this pull request Mar 28, 2017

The Roadmap #1928

Closed

jmchilton and others added 4 commits April 13, 2017 23:11

Add possibility to rename collection items from a tabular file

d21c14f

Fix failing test case in relable_from_file based on comment from @mvd…

338130a

…beek.

jmchilton force-pushed the collection_op_relabel_from_file branch from 0057dc6 to 09bcdd9 Compare April 14, 2017 03:16

jmchilton added status/review and removed status/WIP labels Apr 14, 2017

jmchilton changed the title ~~[WIP] Implement Collection Operation to Relabelling Identifiers~~ Implement Collection Operation to Relabelling Identifiers Apr 14, 2017

mvdbeek reviewed Apr 14, 2017


Fix relabel_from_file collection operation error handling if duplicat…

2288a88

…e identifiers.

jmchilton mentioned this pull request Apr 17, 2017

Collection Operation - Filtering from a File #3940

Merged

bgruening merged commit 2288a88 into galaxyproject:dev Apr 20, 2017
