New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add emboss datatypes [WIP] #148
Conversation
Is there any way to deprecate data types? |
I normally see no reason for
Sure, or just get rid of it. |
Is this for a targeting a new EMBOSS version? One of the really nice things about having these datatypes out of the distribution is that if you don't install the emboss suite, they aren't very useful and so won't show up in the format drop downs for upload/changing datatype, etc. If you do have the emboss suite installed, you'll want to be able to upload most of the datatypes though, so changing the display_in_upload to false doesn't fix the long list issue. |
<datatype extension="embl" type="galaxy.datatypes.data:Text" subclass="True"/> | ||
<datatype extension="fitch" type="galaxy.datatypes.data:Text" subclass="True"/> | ||
<datatype extension="gcg" type="galaxy.datatypes.data:Text" subclass="True"/> | ||
<datatype extension="genbank" type="galaxy.datatypes.data:Text" subclass="True"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, really don't like this being a plain text format without a useful sniffer, but it's such a bother to write sniffers for ALL of these formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add this as needed later and hopefully with some help from biopython or a general sniffing library. Branching from galaxy in it's own self-standing project.
I'd also not bother adding "ncbi" given it is just a means of telling EMBOSS to parse FASTA file identifier lines a particular way. I fact, I would prefer not adding all these in one big commit. Rather I would selectively add things gradually, focussing on file formats being used elsewhere (e.g. I would find "genbank" and "embl"). |
@peterjc yes a lot of this seems to be very specific. Selectively adding formats but result in trouble with people using the |
@blankenberg It's somewhere on my todo list and we even have build the binaries now. But writing a ACD Galaxy wrapper is not so easy. ref: (https://docs.google.com/document/d/1kwuBZXAFnXOHBQp5XHqPLof9QJSsWMKBD-AYUeJiaGM/edit) I see the |
What would break if Galaxy has a datatype (e.g. "genbank") which is also (re)defined by a tool shed repository? Likewise competing definitions from two tool shed repositories? This ought to work, and we should file bugs if not. |
If we really enhance the definitions of genbank or emble this will conflict and afaik we do not have a mechanism to handle datatype conflicts. |
Adding a whole bunch of extra datatypes to Galaxy, and then removing some of them would be bad. As @bgruening says we don't have a policy or mechanism for this kind of thing. So I am -1 on this PR. |
@peterjc any idea how we can fix this? We want to have theses datatypes in Galaxy. At least I once started to split the datatypes up into single repositories, but stoped because we somehow agreed to put new datatypes into Galaxy: https://github.com/bgruening/galaxytools/tree/master/datatypes/emboss_datatypes Is putting every datatype (with comments) into the I appreciate your '-1', because I'm also not sure. But we should find a solution here. This situations bugs me since years now (I have a lot of chemical datatypes as well). As I see it, datatypes are easier to handle than tools (we do not need to version them) and I don't see them in the TS currently. I also think it's easy enough for administrators to enable them via the |
I would support adding I guess adding all the EMBOSS derived datatypes into the sample configuration but commented out might be a way forward... Another option is splitting out bits of What's best depends really on the ToolShed roadmap; CC @davebx |
@peterjc Galaxy's definition of the datatype will be used I believe - so I don't believe bringing these in will break emboss. Competing tool sheds, competing repositories with same definitions, multiple revisions of the same repository are all broken in my opinion and there really isn't an easy fix. The tool shed is designed to distributed decentralized, multi-versioned repositories and datatypes in Galaxy need to have a single authoritative version. We have discussed this on the IUC mailing list and in team meetings and my position has been in the short and medium term that we should be bringing the useful stuff into core and work on metadata as an option to replace custom datatypes that are of limited scope. Longer term we need to have a datatype registry I think. In light of #80 and related efforts I would particularly love to see any of these corresponding to EDAM datatypes in core. |
OK - so how many of the (There is a real usability cost to having too many datatypes listed) |
<datatype extension="nexus" type="galaxy.datatypes.data:Text" subclass="True"/> | ||
<datatype extension="nexusnon" type="galaxy.datatypes.data:Text" subclass="True"/> | ||
<datatype extension="phylip" type="galaxy.datatypes.data:Text" subclass="True"/> | ||
<datatype extension="phylipnon" type="galaxy.datatypes.data:Text" subclass="True"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be subclassing with nexus
and nexusnon
, and phylip
and phylipnon
?
See http://emboss.sourceforge.net/docs/themes/SequenceFormats.html - phylip
is the base class covering both interlaced and non-interlaced PHYLIP files, phylipnon
ought to be a sublcass covering only non-interlaced PHYLIP files.
There is also a separate possible split between strict and relaxed interpretations of the PHYLIP format, something Biopython does (strict has a hard taxon name limit): http://biopython.org/wiki/AlignIO#File_Formats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be subclassing with nexus and nexusnon, and phylip and phylipnon?
Is subclassing a subclass working?
How does this look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if/how to do this without defining Python classes (which we ought to do anyway in order to add sniffer methods, see discussion about only adding useful datatypes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can specify type_extension="ext"
instead of type="module..."
example:
<datatype extension="bgzip" type="galaxy.datatypes.binary:Binary" subclass="True" />
<datatype extension="vcf_bgzip" type_extension="bgzip" subclass="True" >
<display file="igv/vcf.xml" />
<converter file="vcf_bgzip_to_tabix_converter.xml" target_datatype="tabix"/>
</datatype>
If we are going to be pulling new datatypes into Galaxy, I'd like to see them be really useful, and for the most part simply dynamically subclassing another datatype, and not adding at least sniffing (and good metadata, nice peeks, etc), in most cases doesn't really fit the bill. Not that there aren't good use cases for only using I will admit though, that I was on the losing side of the "just pull all these datatype (back) into Galaxy" arguments -- I think fixing the underlying issues re e.g. multiplicity would have been the better way to go. |
Is this still valid with the new |
@blankenberg is there anything wrong in adding them as sub-class and subsequently adding classes around it? I think we all agree that we want to have more powerful datatype definitions. |
If you want to have good metadata, you'll want to have it included in the original implementation. Otherwise, generally, when you decide to 'fix' the datatype and add metadata, you'll need to make each one 'optional', otherwise you'll make the previously existing datasets of that datatype end up with 'missing metadata' errors when run through tools. |
@blankenberg I feel like arguing that general point... but for these particular datatypes can't we admit this is already a problem? They have existed for a long time and are popular and so tool authors would have to take potentially missing metadata into account anyway? |
@jmchilton agree it would cause problems already anyway with these specific types on main and elsewhere, but wanted folks to be aware of specific longterm drawbacks of a stub-datatype vs. a full datatype, if the full datatype is the intended target. |
Dan has expressed that he is -.5 on this - but not a -1. The consensus of the remaining devteam in the chat and the IUC was to merge this. I will merge this tomorrow - giving everyone one last day to review it (... also if someone wanted to add EDAM formats in that time as well :)). |
@jmchilton I will try to get on this and add EDAM, but I would need 1 day more. I can do this in a second PR as well. As you wish. |
Conflicts: config/datatypes_conf.xml.sample
Remove parameters deprecated in v0.5.3 .
Missed in galaxyproject#148 Used as output format by some EMBOSS tools, and also as optional output by T-Coffee.
This PR is to start discussions and get some more feedback.