Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add emboss datatypes [WIP] #148

Merged
merged 2 commits into from Jun 14, 2015
Merged

Conversation

bgruening
Copy link
Member

This PR is to start discussions and get some more feedback.

@bgruening
Copy link
Member Author

Is there any way to deprecate data types?

@nsoranzo
Copy link
Member

Do we want to have display_in_upload="true" or display_in_upload="false" for all emboss data types?

I normally see no reason for display_in_upload="false".

ncbi is a FASTA format, but with a special header. Do we want to subclass from fasta?

Sure, or just get rid of it.

@blankenberg
Copy link
Member

Is this for a targeting a new EMBOSS version? One of the really nice things about having these datatypes out of the distribution is that if you don't install the emboss suite, they aren't very useful and so won't show up in the format drop downs for upload/changing datatype, etc.

If you do have the emboss suite installed, you'll want to be able to upload most of the datatypes though, so changing the display_in_upload to false doesn't fix the long list issue.

<datatype extension="embl" type="galaxy.datatypes.data:Text" subclass="True"/>
<datatype extension="fitch" type="galaxy.datatypes.data:Text" subclass="True"/>
<datatype extension="gcg" type="galaxy.datatypes.data:Text" subclass="True"/>
<datatype extension="genbank" type="galaxy.datatypes.data:Text" subclass="True"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, really don't like this being a plain text format without a useful sniffer, but it's such a bother to write sniffers for ALL of these formats.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add this as needed later and hopefully with some help from biopython or a general sniffing library. Branching from galaxy in it's own self-standing project.

@peterjc
Copy link
Contributor

peterjc commented Apr 23, 2015

I'd also not bother adding "ncbi" given it is just a means of telling EMBOSS to parse FASTA file identifier lines a particular way.

I fact, I would prefer not adding all these in one big commit. Rather I would selectively add things gradually, focussing on file formats being used elsewhere (e.g. I would find "genbank" and "embl").

@bgruening
Copy link
Member Author

@peterjc yes a lot of this seems to be very specific. Selectively adding formats but result in trouble with people using the emboss_datatype package, isn't it?

@bgruening
Copy link
Member Author

@blankenberg It's somewhere on my todo list and we even have build the binaries now. But writing a ACD Galaxy wrapper is not so easy.

ref: (https://docs.google.com/document/d/1kwuBZXAFnXOHBQp5XHqPLof9QJSsWMKBD-AYUeJiaGM/edit)

I see the datatypes_conf.xml.samle as example for admins. So if you want to have emboss, enable it, tweak it.

@peterjc
Copy link
Contributor

peterjc commented Apr 24, 2015

What would break if Galaxy has a datatype (e.g. "genbank") which is also (re)defined by a tool shed repository? Likewise competing definitions from two tool shed repositories?

This ought to work, and we should file bugs if not.

@bgruening
Copy link
Member Author

If we really enhance the definitions of genbank or emble this will conflict and afaik we do not have a mechanism to handle datatype conflicts.
We could just merge this and fill bugs afterwards, but I'm a little bit hesitant to do this because we also have no (or do we have?) a datatype deprecation guide?

@peterjc
Copy link
Contributor

peterjc commented Apr 24, 2015

Adding a whole bunch of extra datatypes to Galaxy, and then removing some of them would be bad. As @bgruening says we don't have a policy or mechanism for this kind of thing.

So I am -1 on this PR.

@bgruening
Copy link
Member Author

@peterjc any idea how we can fix this? We want to have theses datatypes in Galaxy. At least genbank, embl are highly requested by all that are dealing with genome annotation. Installing emboss datatypes only to have clustalw datatype seems also to be bad: galaxyproject/tools-devteam#117

I once started to split the datatypes up into single repositories, but stoped because we somehow agreed to put new datatypes into Galaxy: https://github.com/bgruening/galaxytools/tree/master/datatypes/emboss_datatypes

Is putting every datatype (with comments) into the .sample file worse than using datatypes from the TS that are conflicting?
Would it work for you to put them into the .sample file but commented out? Any enhance our documentation so that admins can simple activate certain datatypes?

I appreciate your '-1', because I'm also not sure. But we should find a solution here. This situations bugs me since years now (I have a lot of chemical datatypes as well). As I see it, datatypes are easier to handle than tools (we do not need to version them) and I don't see them in the TS currently. I also think it's easy enough for administrators to enable them via the datatype_conf.xml.

@peterjc
Copy link
Contributor

peterjc commented Apr 24, 2015

I would support adding embl, genbank and clustal (note not clustalw) datatypes to the Galaxy core.

I guess adding all the EMBOSS derived datatypes into the sample configuration but commented out might be a way forward...

Another option is splitting out bits of emboss_datatypes into sub-repositories (which it can then depend on)?

What's best depends really on the ToolShed roadmap; CC @davebx

@jmchilton
Copy link
Member

@peterjc Galaxy's definition of the datatype will be used I believe - so I don't believe bringing these in will break emboss.

Competing tool sheds, competing repositories with same definitions, multiple revisions of the same repository are all broken in my opinion and there really isn't an easy fix. The tool shed is designed to distributed decentralized, multi-versioned repositories and datatypes in Galaxy need to have a single authoritative version.

We have discussed this on the IUC mailing list and in team meetings and my position has been in the short and medium term that we should be bringing the useful stuff into core and work on metadata as an option to replace custom datatypes that are of limited scope. Longer term we need to have a datatype registry I think.

In light of #80 and related efforts I would particularly love to see any of these corresponding to EDAM datatypes in core.

@peterjc
Copy link
Contributor

peterjc commented Apr 24, 2015

OK - so how many of the emboss_datatypes are "useful" enough to warrant immediate inclusion in the Galaxy core? I would suggest maybe only embl, genbank, clustal, phylip, nexus?

(There is a real usability cost to having too many datatypes listed)

<datatype extension="nexus" type="galaxy.datatypes.data:Text" subclass="True"/>
<datatype extension="nexusnon" type="galaxy.datatypes.data:Text" subclass="True"/>
<datatype extension="phylip" type="galaxy.datatypes.data:Text" subclass="True"/>
<datatype extension="phylipnon" type="galaxy.datatypes.data:Text" subclass="True"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be subclassing with nexus and nexusnon, and phylip and phylipnon?

See http://emboss.sourceforge.net/docs/themes/SequenceFormats.html - phylip is the base class covering both interlaced and non-interlaced PHYLIP files, phylipnon ought to be a sublcass covering only non-interlaced PHYLIP files.

There is also a separate possible split between strict and relaxed interpretations of the PHYLIP format, something Biopython does (strict has a hard taxon name limit): http://biopython.org/wiki/AlignIO#File_Formats

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be subclassing with nexus and nexusnon, and phylip and phylipnon?

Is subclassing a subclass working?
How does this look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if/how to do this without defining Python classes (which we ought to do anyway in order to add sniffer methods, see discussion about only adding useful datatypes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can specify type_extension="ext" instead of type="module..."

example:

    <datatype extension="bgzip" type="galaxy.datatypes.binary:Binary" subclass="True" />
    <datatype extension="vcf_bgzip" type_extension="bgzip" subclass="True" >
      <display file="igv/vcf.xml" />
      <converter file="vcf_bgzip_to_tabix_converter.xml" target_datatype="tabix"/>
    </datatype>

@blankenberg
Copy link
Member

If we are going to be pulling new datatypes into Galaxy, I'd like to see them be really useful, and for the most part simply dynamically subclassing another datatype, and not adding at least sniffing (and good metadata, nice peeks, etc), in most cases doesn't really fit the bill.

Not that there aren't good use cases for only using subclass="True" (its a great feature ;) ), just that we can do some neat things with well-defined datatypes beyond simplifying input option filtering.

I will admit though, that I was on the losing side of the "just pull all these datatype (back) into Galaxy" arguments -- I think fixing the underlying issues re e.g. multiplicity would have been the better way to go.

@bgruening
Copy link
Member Author

@peterjc

(There is a real usability cost to having too many datatypes listed)

Is this still valid with the new select2 and search-able drop-down lists?

@bgruening
Copy link
Member Author

@blankenberg is there anything wrong in adding them as sub-class and subsequently adding classes around it? I think we all agree that we want to have more powerful datatype definitions.

@blankenberg
Copy link
Member

If you want to have good metadata, you'll want to have it included in the original implementation. Otherwise, generally, when you decide to 'fix' the datatype and add metadata, you'll need to make each one 'optional', otherwise you'll make the previously existing datasets of that datatype end up with 'missing metadata' errors when run through tools.

@jmchilton
Copy link
Member

@blankenberg I feel like arguing that general point... but for these particular datatypes can't we admit this is already a problem? They have existed for a long time and are popular and so tool authors would have to take potentially missing metadata into account anyway?

@blankenberg
Copy link
Member

@jmchilton agree it would cause problems already anyway with these specific types on main and elsewhere, but wanted folks to be aware of specific longterm drawbacks of a stub-datatype vs. a full datatype, if the full datatype is the intended target.

@jmchilton
Copy link
Member

Dan has expressed that he is -.5 on this - but not a -1. The consensus of the remaining devteam in the chat and the IUC was to merge this. I will merge this tomorrow - giving everyone one last day to review it (... also if someone wanted to add EDAM formats in that time as well :)).

galaxy-iuc/standards#13 (comment)

@jmchilton jmchilton removed the wip label Jun 11, 2015
@bgruening
Copy link
Member Author

@jmchilton I will try to get on this and add EDAM, but I would need 1 day more. I can do this in a second PR as well. As you wish.

Conflicts:
	config/datatypes_conf.xml.sample
jmchilton added a commit that referenced this pull request Jun 14, 2015
@jmchilton jmchilton merged commit 723c107 into galaxyproject:dev Jun 14, 2015
mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Jan 24, 2017
Remove parameters deprecated in v0.5.3 .
@nsoranzo nsoranzo deleted the emboss_datatypes branch January 8, 2021 13:38
nsoranzo added a commit to nsoranzo/galaxy that referenced this pull request Jan 8, 2021
Missed in galaxyproject#148

Used as output format by some EMBOSS tools, and also as optional output
by T-Coffee.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants