Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Collection Operation to Relabelling Identifiers #3603

Merged

Conversation

jmchilton
Copy link
Member

@jmchilton jmchilton commented Feb 13, 2017

... against supplied labels from an input dataset of type txt.

I think it needs to be touched up but the basic operation seems to work so far. I think what remains to be done is:

  • Validate uniqueness of identifiers and provide nice messages if they are not unique.
  • Validate that at least the required number of lines are present in the file and provide a nice message if not.
  • Add strict mode to ensure exactly the correct number of lines is added.
  • Find where validation of identifiers happens in the API and apply same validation here - try not to let unsafe identifiers be created.
  • Consider more advanced modes - selecting a column, apply a regex replace, pick two columns for nested lists, etc.... None of this may need to be needed in the first iteration.
  • Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool.

A collection operation variant of the tool from @pjbriggs at https://github.com/pjbriggs/Amplicon_analysis-galaxy/blob/77340d8bb2470a646deba4933625413fc70985d1/relabel_samples.xml.

@mvdbeek
Copy link
Member

mvdbeek commented Feb 13, 2017

Consider more advanced modes - selecting a column, apply a regex replace, pick two columns for nested lists, etc.... None of this may need to be needed in the first iteration.
Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool.

The last point could be solved with a tool that just dumps the element identifiers (that wouldn't necessarily need to be shipped with galaxy), and then you're free to use all the fantastic column/ regex tools in galaxy land to build a 2 column list (if we make this tool accept a 2 column list ...) .

@pjbriggs
Copy link

@jmchilton thanks again for making this first implementation, it looks really useful. I've patched our local Galaxy and the tool works as advertised for me.

A couple of observations/suggestions from initial usage:

  • Could you allow the user to also specify the new name for the relabelled collection? (There doesn't seem to be a way to rename a collection once it's been created.)
  • It's potentially quite easy to accidentally mislabel collection items with the single column list, as it relies on the user correctly ordering the new labels in the input file. The option of using a 2 column list specifying old and new identifiers would help to mitigate this (as it's easier to see what will map to what).

HTH. Thanks again!

@nsoranzo
Copy link
Member

Could you allow the user to also specify the new name for the relabelled collection? (There doesn't seem to be a way to rename a collection once it's been created.)

If you click on the collection in the history panel, it will show the list of the collection datasets. There you can rename the collection by clicking on its name.

@pjbriggs
Copy link

@nsoranzo thanks! I never noticed that before (I was looking for something like the "pencil" icon on regular datasets).

@mvdbeek
Copy link
Member

mvdbeek commented Mar 20, 2017

@pjbriggs @jmchilton I've opened a PR with a simple implementation of a 2-column list rename here

</element>
</collection>
</param>
<param name="labels" value="new_labels_2.txt" ftype="txt" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftype needs to be tabular, that's why the test failed :(.

@jmchilton
Copy link
Member Author

@mvdbeek I reworked that different modality to be an explicit user choice - I hope that is okay (see 0057dc6). I'd rather not have such a large change in behavior depend on a subtle datatype difference if that is okay.

jmchilton and others added 4 commits April 13, 2017 23:11
I think it needs to be touched up but the basic operation seems to work so far. I think what remains to be done is:

 - Validate uniqueness of identifiers and provide nice messages if they are not unique.
 - Validate that at least the required number of lines are present in the file and provide a nice message if not.
 - Add strict mode to ensure exactly the correct number of lines is added.
 - Find where validation of identifiers happens in the API and apply same validation here - try not to let unsafe identifiers be created.
 - Consider more advanced modes - selecting a column, apply a regex replace, pick two columns for nested lists, etc.... None of this may need to be needed in the first iteration.
 - Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool.
- Better error handling (check for bad characters when creating collections).
- Implement a strict mode parameter to do even more validation.
- Rework tabular vs txt mode to be explicit user choice.
- Mirror fix in release_17.01 for datasets not having a history,
@jmchilton jmchilton force-pushed the collection_op_relabel_from_file branch from 0057dc6 to 09bcdd9 Compare April 14, 2017 03:16
@jmchilton
Copy link
Member Author

A couple more fixes rebased into 09bcdd9 so I am pulling this out of WIP.

@mvdbeek Your mapping approach is pretty cool and it makes me think the two missing collections - "group by" and "filter" could easily be implemented by inspecting a table this way. I'd say someday we would still want a "filter by expression" and "group by expression" tools - just like we'd like to have "relabel by expression" - but these should exist beside "relabel by file", "group by file", and "filter by file" tools. I'll try to get something ready for 17.05 - it might be a stretch though.

@jmchilton jmchilton changed the title [WIP] Implement Collection Operation to Relabelling Identifiers Implement Collection Operation to Relabelling Identifiers Apr 14, 2017
if how_type == "tabular":
# We have a tabular file, where the first column is an existing element identifier,
# and the second column is the new element identifier.
source_new_label = (line.strip().split('\t') for line in new_labels)
Copy link
Member

@mvdbeek mvdbeek Apr 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use strip('\r\n') here? See the discussion here: galaxyproject/tools-iuc#1233 (comment)

@mvdbeek
Copy link
Member

mvdbeek commented Apr 14, 2017

I reworked that different modality to be an explicit user choice - I hope that is okay (see 0057dc6). I'd rather not have such a large change in behavior depend on a subtle datatype difference if that is okay.

That's better, I agree.

I'll try to get something ready for 17.05 - it might be a stretch though.

Is that for the "filter by expression" and "group by expression" tools ? I think this PR looks good to me, from your bullet-points I think 1,2 and 3 are addressed, right?

@jmchilton
Copy link
Member Author

Is that for the "filter by expression" and "group by expression" tools ?

No - I meant I hope to open the "from file" variants this week. The more I think about it - the more I like them. The expression stuff will come in a future release.

I think this PR looks good to me, from your bullet-points I think 1,2 and 3 are addressed, right?

Now those bullet points are indeed - I added one more test case and fix to cover duplicate identifiers in particular. I think the only bullet not addressed now is "Consider another mode where a collection is labelled against an existing collection - should that be a separate tool of the same tool." Which I am happy to push off for now.

@bgruening bgruening merged commit 2288a88 into galaxyproject:dev Apr 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants