Add emboss datatypes [WIP] #148

bgruening · 2015-04-23T13:51:25Z

This PR is to start discussions and get some more feedback.

bgruening · 2015-04-23T14:07:33Z

Is there any way to deprecate data types?

nsoranzo · 2015-04-23T17:08:01Z

Do we want to have display_in_upload="true" or display_in_upload="false" for all emboss data types?

I normally see no reason for display_in_upload="false".

ncbi is a FASTA format, but with a special header. Do we want to subclass from fasta?

Sure, or just get rid of it.

blankenberg · 2015-04-23T20:11:35Z

Is this for targeting a new EMBOSS version? One of the really nice things about having these datatypes out of the distribution is that if you don't install the emboss suite, they aren't very useful and so won't show up in the format drop downs for upload/changing datatype, etc.

If you do have the emboss suite installed, you'll want to be able to upload most of the datatypes though, so changing the display_in_upload to false doesn't fix the long list issue.

hexylena · 2015-04-23T20:49:33Z

config/datatypes_conf.xml.sample

+    <datatype extension="embl" type="galaxy.datatypes.data:Text" subclass="True"/>
+    <datatype extension="fitch" type="galaxy.datatypes.data:Text" subclass="True"/>
+    <datatype extension="gcg" type="galaxy.datatypes.data:Text" subclass="True"/>
+    <datatype extension="genbank" type="galaxy.datatypes.data:Text" subclass="True"/>


ugh, really don't like this being a plain text format without a useful sniffer, but it's such a bother to write sniffers for ALL of these formats.

I think we can add this as needed later, hopefully with some help from Biopython or a general sniffing library, branched off from Galaxy into its own stand-alone project.
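
For context on how little a basic sniffer needs, here is a minimal sketch in plain Python. It only inspects the first non-blank line for the well-known leading keywords of a few of the formats discussed here. The function name `sniff_flatfile` and the returned labels are assumptions made for illustration; this is not Galaxy's sniffer interface, and a real sniffer would validate more than one line.

```python
def sniff_flatfile(lines):
    """Guess a format label from the first non-blank line of a text file.

    Hypothetical helper for illustration only; not Galaxy's sniffer API.
    Returns None when the first non-blank line matches none of the checks.
    """
    for line in lines:
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith("LOCUS"):
            return "genbank"  # GenBank records open with a LOCUS line
        if line.startswith("ID   "):
            return "embl"  # EMBL records open with an ID line
        if line.startswith("CLUSTAL"):
            return "clustal"  # Clustal alignments open with a CLUSTAL header
        if line.strip().upper().startswith("#NEXUS"):
            return "nexus"  # NEXUS files open with the #NEXUS token
        return None  # first non-blank line is decisive in this sketch
    return None  # empty input


print(sniff_flatfile(["LOCUS       SCU49845     5028 bp    DNA"]))
```

Plain FASTA-like or free-text formats have no such leading keyword, which is part of why writing sniffers for every EMBOSS format is tedious.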

peterjc · 2015-04-23T22:03:21Z

I'd also not bother adding "ncbi" given it is just a means of telling EMBOSS to parse FASTA file identifier lines a particular way.

In fact, I would prefer not adding all these in one big commit. Rather, I would selectively add things gradually, focussing on file formats being used elsewhere (e.g. I would find "genbank" and "embl" useful).

bgruening · 2015-04-24T09:29:35Z

@peterjc yes, a lot of this seems to be very specific. But selectively adding formats would result in trouble for people using the emboss_datatypes package, wouldn't it?

bgruening · 2015-04-24T09:33:25Z

@blankenberg It's somewhere on my todo list and we have even built the binaries now. But writing an ACD Galaxy wrapper is not so easy.

ref: (https://docs.google.com/document/d/1kwuBZXAFnXOHBQp5XHqPLof9QJSsWMKBD-AYUeJiaGM/edit)

I see the datatypes_conf.xml.sample as an example for admins. So if you want to have emboss, enable it, tweak it.

peterjc · 2015-04-24T10:07:25Z

What would break if Galaxy has a datatype (e.g. "genbank") which is also (re)defined by a tool shed repository? Likewise competing definitions from two tool shed repositories?

This ought to work, and we should file bugs if not.

bgruening · 2015-04-24T11:38:42Z

If we really enhance the definitions of genbank or embl, this will conflict, and afaik we do not have a mechanism to handle datatype conflicts.
We could just merge this and file bugs afterwards, but I'm a little bit hesitant to do this because we also have no datatype deprecation guide (or do we?).
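
To make the conflict scenario concrete, the following is a minimal sketch that compares two datatype configuration fragments and reports extensions defined in both with differing types. The two XML fragments, the helper names, and the `example_shed_module:GenBank` type are made up for illustration; nothing here reflects how Galaxy or the Tool Shed actually resolves such conflicts.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment standing in for the core configuration.
CORE_XML = """
<datatypes>
  <datatype extension="genbank" type="galaxy.datatypes.data:Text" subclass="True"/>
  <datatype extension="embl" type="galaxy.datatypes.data:Text" subclass="True"/>
</datatypes>
"""

# Illustrative fragment standing in for a tool shed repository.
# "example_shed_module:GenBank" is a made-up placeholder type.
SHED_XML = """
<datatypes>
  <datatype extension="genbank" type="example_shed_module:GenBank"/>
  <datatype extension="embl" type="galaxy.datatypes.data:Text" subclass="True"/>
  <datatype extension="fitch" type="galaxy.datatypes.data:Text" subclass="True"/>
</datatypes>
"""


def definitions(xml_text):
    """Map each extension to its declared type (or type_extension)."""
    return {
        el.get("extension"): el.get("type") or el.get("type_extension")
        for el in ET.fromstring(xml_text).iter("datatype")
    }


def conflicting_extensions(core_xml, shed_xml):
    """Return extensions present in both sources whose declared types differ."""
    core, shed = definitions(core_xml), definitions(shed_xml)
    return {
        ext: (core[ext], shed[ext])
        for ext in sorted(core.keys() & shed.keys())
        if core[ext] != shed[ext]
    }


print(conflicting_extensions(CORE_XML, SHED_XML))
```

The sketch only compares the declared type string, so two sources that agree on the type but differ in converters or display applications would not be flagged.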

peterjc · 2015-04-24T11:44:02Z

Adding a whole bunch of extra datatypes to Galaxy, and then removing some of them would be bad. As @bgruening says we don't have a policy or mechanism for this kind of thing.

So I am -1 on this PR.

bgruening · 2015-04-24T12:52:26Z

@peterjc any idea how we can fix this? We want to have these datatypes in Galaxy. At least genbank and embl are highly requested by everyone dealing with genome annotation. Installing emboss datatypes only to have the clustalw datatype also seems bad: galaxyproject/tools-devteam#117

I once started to split the datatypes up into single repositories, but stopped because we somehow agreed to put new datatypes into Galaxy: https://github.com/bgruening/galaxytools/tree/master/datatypes/emboss_datatypes

Is putting every datatype (with comments) into the .sample file worse than using datatypes from the TS that are conflicting?
Would it work for you to put them into the .sample file but commented out? And enhance our documentation so that admins can simply activate certain datatypes?

I appreciate your '-1', because I'm also not sure. But we should find a solution here. This situation has bugged me for years now (I have a lot of chemical datatypes as well). As I see it, datatypes are easier to handle than tools (we do not need to version them) and I don't see them in the TS currently. I also think it's easy enough for administrators to enable them via the datatypes_conf.xml.

peterjc · 2015-04-24T13:09:56Z

I would support adding embl, genbank and clustal (note not clustalw) datatypes to the Galaxy core.

I guess adding all the EMBOSS derived datatypes into the sample configuration but commented out might be a way forward...

Another option is splitting out bits of emboss_datatypes into sub-repositories (which it can then depend on)?

What's best depends really on the ToolShed roadmap; CC @davebx

jmchilton · 2015-04-24T13:21:12Z

@peterjc Galaxy's definition of the datatype will be used I believe - so I don't believe bringing these in will break emboss.

Competing tool sheds, competing repositories with same definitions, multiple revisions of the same repository are all broken in my opinion and there really isn't an easy fix. The tool shed is designed to distribute decentralized, multi-versioned repositories and datatypes in Galaxy need to have a single authoritative version.

We have discussed this on the IUC mailing list and in team meetings and my position has been in the short and medium term that we should be bringing the useful stuff into core and work on metadata as an option to replace custom datatypes that are of limited scope. Longer term we need to have a datatype registry I think.

In light of #80 and related efforts I would particularly love to see any of these corresponding to EDAM datatypes in core.

peterjc · 2015-04-24T13:27:44Z

OK - so how many of the emboss_datatypes are "useful" enough to warrant immediate inclusion in the Galaxy core? I would suggest maybe only embl, genbank, clustal, phylip, nexus?

(There is a real usability cost to having too many datatypes listed)

peterjc · 2015-04-24T13:32:11Z

config/datatypes_conf.xml.sample

+    <datatype extension="nexus" type="galaxy.datatypes.data:Text" subclass="True"/>
+    <datatype extension="nexusnon" type="galaxy.datatypes.data:Text" subclass="True"/>
+    <datatype extension="phylip" type="galaxy.datatypes.data:Text" subclass="True"/>
+    <datatype extension="phylipnon" type="galaxy.datatypes.data:Text" subclass="True"/>


Should there be subclassing with nexus and nexusnon, and phylip and phylipnon?

See http://emboss.sourceforge.net/docs/themes/SequenceFormats.html - phylip is the base class covering both interlaced and non-interlaced PHYLIP files, phylipnon ought to be a subclass covering only non-interlaced PHYLIP files.

There is also a separate possible split between strict and relaxed interpretations of the PHYLIP format, something Biopython does (strict has a hard taxon name limit): http://biopython.org/wiki/AlignIO#File_Formats

Should there be subclassing with nexus and nexusnon, and phylip and phylipnon?

Does subclassing a subclass work?
What would this look like?

Not sure if/how to do this without defining Python classes (which we ought to do anyway in order to add sniffer methods, see discussion about only adding useful datatypes).

You can specify type_extension="ext" instead of type="module..."

example:

    <datatype extension="bgzip" type="galaxy.datatypes.binary:Binary" subclass="True" />
    <datatype extension="vcf_bgzip" type_extension="bgzip" subclass="True" >
        <display file="igv/vcf.xml" />
        <converter file="vcf_bgzip_to_tabix_converter.xml" target_datatype="tabix"/>
    </datatype>
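
As a rough illustration of how a type_extension chain could be followed back to a concrete type, here is a small Python sketch using only the standard library. The function `resolve_base_type` and the wrapping `<datatypes>` element are assumptions for the example; this is not Galaxy's registry code, only a model of the lookup that the configuration implies.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above, wrapped in a root element so it parses.
SAMPLE = """
<datatypes>
  <datatype extension="bgzip" type="galaxy.datatypes.binary:Binary" subclass="True"/>
  <datatype extension="vcf_bgzip" type_extension="bgzip" subclass="True"/>
</datatypes>
"""


def resolve_base_type(xml_text, extension):
    """Follow type_extension links until an entry declares a concrete type.

    Hypothetical helper: returns (type string, list of extensions visited).
    """
    entries = {
        el.get("extension"): el
        for el in ET.fromstring(xml_text).iter("datatype")
    }
    visited = []
    current = extension
    while True:
        if current is None or current not in entries:
            raise KeyError("no datatype entry for %r" % (current,))
        if current in visited:
            raise ValueError("type_extension cycle at %r" % (current,))
        visited.append(current)
        entry = entries[current]
        if entry.get("type"):
            return entry.get("type"), visited
        current = entry.get("type_extension")  # step to the parent extension


print(resolve_base_type(SAMPLE, "vcf_bgzip"))
```

Under this model, phylipnon could point at phylip via type_extension, and phylip in turn at a Text-based type, giving the subclass-of-a-subclass relationship asked about above.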

blankenberg · 2015-04-24T14:20:29Z

If we are going to be pulling new datatypes into Galaxy, I'd like to see them be really useful, and for the most part simply dynamically subclassing another datatype, and not adding at least sniffing (and good metadata, nice peeks, etc), in most cases doesn't really fit the bill.

Not that there aren't good use cases for only using subclass="True" (it's a great feature ;) ), just that we can do some neat things with well-defined datatypes beyond simplifying input option filtering.

I will admit though, that I was on the losing side of the "just pull all these datatypes (back) into Galaxy" arguments -- I think fixing the underlying issues re e.g. multiplicity would have been the better way to go.

bgruening · 2015-04-24T15:36:38Z

@peterjc

(There is a real usability cost to having too many datatypes listed)

Is this still valid with the new select2 and search-able drop-down lists?

bgruening · 2015-04-24T15:55:06Z

@blankenberg is there anything wrong in adding them as sub-class and subsequently adding classes around it? I think we all agree that we want to have more powerful datatype definitions.

blankenberg · 2015-04-24T15:59:08Z

If you want to have good metadata, you'll want to have it included in the original implementation. Otherwise, generally, when you decide to 'fix' the datatype and add metadata, you'll need to make each one 'optional', otherwise you'll make the previously existing datasets of that datatype end up with 'missing metadata' errors when run through tools.

jmchilton · 2015-04-24T16:10:26Z

@blankenberg I feel like arguing that general point... but for these particular datatypes can't we admit this is already a problem? They have existed for a long time and are popular and so tool authors would have to take potentially missing metadata into account anyway?

blankenberg · 2015-04-24T18:02:46Z

@jmchilton agree it would cause problems already anyway with these specific types on main and elsewhere, but wanted folks to be aware of specific longterm drawbacks of a stub-datatype vs. a full datatype, if the full datatype is the intended target.

jmchilton · 2015-06-11T17:24:53Z

Dan has expressed that he is -.5 on this - but not a -1. The consensus of the remaining devteam in the chat and the IUC was to merge this. I will merge this tomorrow - giving everyone one last day to review it (... also if someone wanted to add EDAM formats in that time as well :)).

galaxy-iuc/standards#13 (comment)

bgruening · 2015-06-12T13:51:02Z

@jmchilton I will try to get on this and add EDAM, but I would need 1 day more. I can do this in a second PR as well. As you wish.

add emboss datatypes

ce74ed5

blankenberg mentioned this pull request Apr 23, 2015

clustalw tool outputs in 2 unknown data format galaxyproject/tools-devteam#117

Closed

hexylena reviewed Apr 23, 2015

peterjc reviewed Apr 24, 2015

jmchilton added the wip label May 5, 2015

hexylena mentioned this pull request May 5, 2015

Including MSA Formats in PR to Galaxy galaxyproject/tools-iuc#125

Closed

bgruening mentioned this pull request May 31, 2015

Galaxy and datatype management galaxy-iuc/standards#13

Open

jmchilton removed the wip label Jun 11, 2015

Merge remote-tracking branch 'jmchilton/dev' into HEAD

bdf1738

Conflicts: config/datatypes_conf.xml.sample

jmchilton added a commit that referenced this pull request Jun 14, 2015

Merge pull request #148 from bgruening/emboss_datatypes

723c107

Add emboss datatypes [WIP]

jmchilton merged commit 723c107 into galaxyproject:dev Jun 14, 2015

jmchilton mentioned this pull request Jun 15, 2015

Allow specifying EDAM format for dynamic subclass datatypes. #342

Merged

jmchilton added the kind/enhancement label Aug 12, 2015

mvdbeek pushed a commit to mvdbeek/galaxy that referenced this pull request Jan 24, 2017

Merge pull request galaxyproject#148 from nsoranzo/master

874c230

Remove parameters deprecated in v0.5.3 .

nsoranzo deleted the emboss_datatypes branch January 8, 2021 13:38

nsoranzo added a commit to nsoranzo/galaxy that referenced this pull request Jan 8, 2021

Add minimal msf datatype

cbc19be

Missed in galaxyproject#148 Used as output format by some EMBOSS tools, and also as optional output by T-Coffee.

nsoranzo mentioned this pull request Jan 8, 2021

Add minimal msf datatype #11084

Merged
