Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Collection Operation to Relabelling Identifiers #3603

Merged

Conversation

Projects
None yet
5 participants
@jmchilton
Copy link
Member

commented Feb 13, 2017

... against supplied labels from an input dataset of type txt.

I think it needs to be touched up but the basic operation seems to work so far. I think what remains to be done is:

  • Validate uniqueness of identifiers and provide nice messages if they are not unique.
  • Validate that at least the required number of lines are present in the file and provide a nice message if not.
  • Add strict mode to ensure exactly the correct number of lines is added.
  • Find where validation of identifiers happens in the API and apply same validation here - try not to let unsafe identifiers be created.
  • Consider more advanced modes - selecting a column, apply a regex replace, pick two columns for nested lists, etc.... None of this may need to be needed in the first iteration.
  • Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool.

A collection operation variant of the tool from @pjbriggs at https://github.com/pjbriggs/Amplicon_analysis-galaxy/blob/77340d8bb2470a646deba4933625413fc70985d1/relabel_samples.xml.

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Feb 13, 2017

Consider more advanced modes - selecting a column, apply a regex replace, pick two columns for nested lists, etc.... None of this may need to be needed in the first iteration.
Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool.

The last point could be solved with a tool that just dumps the element identifiers (that wouldn't necessarily need to be shipped with galaxy), and then you're free to use all the fantastic column/ regex tools in galaxy land to build a 2 column list (if we make this tool accept a 2 column list ...) .

@pjbriggs

This comment has been minimized.

Copy link

commented Feb 24, 2017

@jmchilton thanks again for making this first implementation, it looks really useful. I've patched our local Galaxy and the tool works as advertised for me.

A couple of observations/suggestions from initial usage:

  • Could you allow the user to also specify the new name for the relabelled collection? (There doesn't seem to be a way to rename a collection once it's been created.)
  • It's potentially quite easy to accidentally mislabel collection items with the single column list, as it relies on the user correctly ordering the new labels in the input file. The option of using a 2 column list specifying old and new identifiers would help to mitigate this (as it's easier to see what will map to what).

HTH. Thanks again!

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Feb 24, 2017

Could you allow the user to also specify the new name for the relabelled collection? (There doesn't seem to be a way to rename a collection once it's been created.)

If you click on the collection in the history panel, it will show the list of the collection datasets. There you can rename the collection by clicking on its name.

@pjbriggs

This comment has been minimized.

Copy link

commented Feb 24, 2017

@nsoranzo thanks! I never noticed that before (I was looking for something like the "pencil" icon on regular datasets).

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Mar 20, 2017

@pjbriggs @jmchilton I've opened a PR with a simple implementation of a 2-column list rename here

</element>
</collection>
</param>
<param name="labels" value="new_labels_2.txt" ftype="txt" />

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Mar 21, 2017

Member

ftype needs to be tabular, that's why the test failed :(.

@martenson martenson referenced this pull request Mar 28, 2017

Closed

The Roadmap #1928

13 of 110 tasks complete
@jmchilton

This comment has been minimized.

Copy link
Member Author

commented Apr 13, 2017

@mvdbeek I reworked that different modality to be an explicit user choice - I hope that is okay (see 0057dc6). I'd rather not have such a large change in behavior depend on a subtle datatype difference if that is okay.

jmchilton and others added some commits Feb 13, 2017

Start work on a relabelling collection operation.
I think it needs to be touched up but the basic operation seems to work so far. I think what remains to be done is:

 - Validate uniqueness of identifiers and provide nice messages if they are not unique.
 - Validate that at least the required number of lines are present in the file and provide a nice message if not.
 - Add strict mode to ensure exactly the correct number of lines is added.
 - Find where validation of identifiers happens in the API and apply same validation here - try not to let unsafe identifiers be created.
 - Consider more advanced modes - selecting a column, apply a regex replace, pick two columns for nested lists, etc.... None of this may need to be needed in the first iteration.
 - Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool.
Various improvements to relabel from file collection operation.
- Better error handling (check for bad characters when creating collections).
- Implement a strict mode parameter to do even more validation.
- Rework tabular vs txt mode to be explicit user choice.
- Mirror fix in release_17.01 for datasets not having a history,

@jmchilton jmchilton force-pushed the jmchilton:collection_op_relabel_from_file branch from 0057dc6 to 09bcdd9 Apr 14, 2017

@jmchilton

This comment has been minimized.

Copy link
Member Author

commented Apr 14, 2017

A couple more fixes rebased into 09bcdd9 so I am pulling this out of WIP.

@mvdbeek Your mapping approach is pretty cool and it makes me think the two missing collections - "group by" and "filter" could easily be implemented by inspecting a table this way. I'd say someday we would still want a "filter by expression" and "group by expression" tools - just like we'd like to have "relabel by expression" - but these should exist beside "relabel by file", "group by file", and "filter by file" tools. I'll try to get something ready for 17.05 - it might be a stretch though.

@jmchilton jmchilton added status/review and removed status/WIP labels Apr 14, 2017

@jmchilton jmchilton changed the title [WIP] Implement Collection Operation to Relabelling Identifiers Implement Collection Operation to Relabelling Identifiers Apr 14, 2017

if how_type == "tabular":
# We have a tabular file, where the first column is an existing element identifier,
# and the second column is the new element identifier.
source_new_label = (line.strip().split('\t') for line in new_labels)

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Apr 14, 2017

Member

Maybe we should use strip('\r\n') here? See the discussion here: galaxyproject/tools-iuc#1233 (comment)

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 14, 2017

I reworked that different modality to be an explicit user choice - I hope that is okay (see 0057dc6). I'd rather not have such a large change in behavior depend on a subtle datatype difference if that is okay.

That's better, I agree.

I'll try to get something ready for 17.05 - it might be a stretch though.

Is that for the "filter by expression" and "group by expression" tools ? I think this PR looks good to me, from your bullet-points I think 1,2 and 3 are addressed, right?

@jmchilton

This comment has been minimized.

Copy link
Member Author

commented Apr 17, 2017

Is that for the "filter by expression" and "group by expression" tools ?

No - I meant I hope to open the "from file" variants this week. The more I think about it - the more I like them. The expression stuff will come in a future release.

I think this PR looks good to me, from your bullet-points I think 1,2 and 3 are addressed, right?

Now those bullet points are indeed - I added one more test case and fix to cover duplicate identifiers in particular. I think the only bullet not addressed now is "Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool." Which I am happy to push off for now.

@bgruening bgruening merged commit 2288a88 into galaxyproject:dev Apr 20, 2017

5 checks passed

api test Build finished. 274 tests run, 0 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 146 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 34 tests run, 0 skipped, 0 failed.
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.