Import mothur datatypes #2038

Merged
merged 19 commits into from Apr 27, 2016

6 participants
@shiltemann
Member

commented Mar 30, 2016

adds mothur datatypes to Galaxy

xRef: galaxyproject/tools-iuc/pull/449

There are some example files for each of the sniffers here: https://bioinf-galaxian.erasmusmc.nl/public/mothur/galaxy-sniffertest/

ping @IyadKandalaft @jj-umn @oxyko

@@ -0,0 +1,12 @@
<tool id="CONVERTER_ref_to_seq_taxomony" name="Convert Ref taxonomy to Seq Taxonomy" version="1.0.0">
<description>converts 2 or 3 column sequence taxonomy file to a 2 column mothur taxonomy_outline format</description>
<command interpreter="python">ref_to_seq_taxonomy_converter.py "$input" "$output"</command>

@bgruening

bgruening Mar 30, 2016

Member

The interpreter attribute is deprecated; please use python $__tool_directory__ instead.

file_ext = 'mothur.sabund'
def __init__(self, **kwd):
    """
    # http://www.mothur.org/wiki/Sabund_file

@bgruening

bgruening Mar 30, 2016

Member

hash is not needed

Determines whether the file is a otu (operational taxonomic unit) format
"""
try:
    with open( filename ) as fh:

def sniff( self, filename ):
    """
    Determines whether the file is a otu (operational taxonomic unit) format
    """

@bgruening

@shiltemann

shiltemann Mar 30, 2016

Author Member

ah, cool, will do, thanks :)

def __init__(self, **kwd):
    Otu.__init__( self, **kwd )
    # self.column_names[0] = ['label']
    # self.column_names[1] = ['group']

@bgruening

bgruening Mar 30, 2016

Member

you can remove obsolete stuff if you want

comment_lines = 0
ncols = 0
try:
    with open( dataset.file_name ) as fh:

@bgruening

bgruening Mar 30, 2016

Member

get_header()

The first line is column headings as of Mothur v 1.20
"""
try:
    with open( filename ) as fh:

@bgruening

bgruening Mar 30, 2016

Member

get_header()
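
For context, the get_header() note asks for the hand-rolled open()/readline loops to be replaced by a shared header-reading helper; later hunks in this PR call get_headers(filename, sep='\t', count=...). The sketch below is only a simplified stand-in to show the idea. The name read_headers, its defaults, and the sample rows are assumptions for illustration and are not Galaxy's implementation.

```python
import os
import tempfile


def read_headers(path, sep='\t', count=60):
    """Hypothetical stand-in for a header helper (NOT Galaxy's code).

    Returns the first `count` lines of the file, each split on `sep`.
    """
    rows = []
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i >= count:
                break
            rows.append(line.rstrip('\r\n').split(sep))
    return rows


# Demo on a throwaway file with made-up rows.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as fh:
    fh.write("label\tnumOtus\n0.03\t5\n0.05\t3\n")
print(read_headers(demo_path, count=2))
os.remove(demo_path)
```

A sniffer can then iterate over the returned rows instead of managing the file handle itself.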


class AlignReport(Tabular):
"""
QueryName QueryLength TemplateName TemplateLength SearchMethod SearchScore AlignmentMethod QueryStart QueryEnd TemplateStart TemplateEnd PairwiseAlignmentLength GapsInQuery GapsInTemplate LongestInsert SimBtwnQuery&Template

@bgruening

bgruening Mar 30, 2016

Member

indentation

@bgruening

Member

commented Mar 30, 2016

@shiltemann this is great! I guess you need to fix many PEP 8 warnings; using get_header() is recommended, as are doctests.

Thanks a lot!

@shiltemann

Member Author

commented Mar 30, 2016

@bgruening will do, thanks! :)

assert sys.version_info[:2] >= (2, 4)


def stop_err(msg):

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

stop_err is not used elsewhere

outfile = open(sys.argv[2], 'w')
for i, line in enumerate(file(infile_name)):
    line = line.rstrip()
    if not line or line.startswith('#'):

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

What's your opinion on:
if line and not line.startswith('#'):
and increasing the indentation of the following 3 lines?
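
A minimal sketch of that restructuring, assuming a hypothetical convert_lines helper that returns rows instead of writing to an output file as the PR's converter script does. The sample input is made up.

```python
def convert_lines(lines):
    """Keep only non-empty, non-comment lines, split on whitespace.

    Hypothetical helper for illustration; not the PR's converter.
    """
    rows = []
    for line in lines:
        line = line.rstrip()
        # Positive guard: the work sits inside the if, so no 'continue' branch is needed.
        if line and not line.startswith('#'):
            rows.append(line.split())
    return rows


sample = ["# comment\n", "\n", "seq1\tBacteria;Firmicutes;\n"]
print(convert_lines(sample))
```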


headers = get_headers(dataset.file_name, sep='\t')
for line in headers:
    try:

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

if len(line) > 1 allows you to get rid of the except + pass construction
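
As a rough illustration of this point, an explicit length check rules out the IndexError, so the silent except/pass is no longer needed. The function name and the tuple output are assumptions for the example, not code from the PR.

```python
def first_two_columns(headers):
    """Collect (col0, col1) from rows that have at least two fields."""
    pairs = []
    for line in headers:
        # Short or empty rows are skipped up front instead of being
        # swallowed by a try/except/pass.
        if len(line) > 1:
            pairs.append((line[0], line[1]))
    return pairs


print(first_two_columns([["only_one"], ["seqA", "groupX"], []]))
```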

count = 0
for line in headers:
    if not line[0].startswith('@') and not line[0].startswith('#'):
        if len(line) == 2 and re.match('forward|reverse', line[0]):

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

line[0] in ['forward','reverse'] ?

if len(line) == 2 and re.match('forward|reverse', line[0]):
    count += 1
    continue
elif len(line) == 3 and re.match('barcode', line[0]):

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

regex necessary?
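
One thing worth checking before swapping the regex for a membership test is that the two are not strictly equivalent: re.match anchors only at the start, so it also accepts longer strings that merely begin with the keyword. The helper names below are hypothetical and only contrast the two checks.

```python
import re


def is_primer_regex(field):
    # Matches any string that STARTS with 'forward' or 'reverse'.
    return re.match('forward|reverse', field) is not None


def is_primer_exact(field):
    # Matches only the exact keywords.
    return field in ('forward', 'reverse')


for value in ('forward', 'reverse', 'forward_primer', 'barcode'):
    print(value, is_primer_regex(value), is_primer_exact(value))
```

Which behaviour is wanted depends on whether the file format allows anything after the keyword in that column.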

"""
headers = get_headers(filename, sep='\t', count=300)
count = 0
pat = '^([^ \t\n\r\x0c\x0b;]+([(]\\d+[)])?(;[^ \t\n\r\x0c\x0b;]+([(]\\d+[)])?)*(;)?)$'

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

Could you make a static class member of the compiled regex (https://docs.python.org/2/library/re.html#re.compile) ?

E.g at line 744:
pat_prog = re.compile('^([^ \t\n\r\x0c\x0b;]+([(]\\d+[)])?(;[^ \t\n\r\x0c\x0b;]+([(]\\d+[)])?)*(;)?)$')

and at line 783: self.pat_prog.match(line[1])
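
A small sketch of the pattern being suggested, using the regex from the hunk above. The class name TaxonomyLike and the method looks_like_taxonomy are placeholders, not the datatype class from the PR.

```python
import re


class TaxonomyLike(object):
    # Compiled once when the class is defined and shared by all instances,
    # instead of being handed to re.match() on every call.
    pat_prog = re.compile('^([^ \t\n\r\x0c\x0b;]+([(]\\d+[)])?(;[^ \t\n\r\x0c\x0b;]+([(]\\d+[)])?)*(;)?)$')

    def looks_like_taxonomy(self, field):
        return self.pat_prog.match(field) is not None


checker = TaxonomyLike()
print(checker.looks_like_taxonomy('Bacteria(100);Firmicutes(99);'))
print(checker.looks_like_taxonomy('Bacteria; Firmicutes'))
```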

    return False
if not re.match(pat, line[1]):
    return False
if not found_semicolons and str(line[1]).count(';') > 0:

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

count will count all occurrences, while find should stop after the first. What do you think of:

str(line[1]).find(';') > -1

I also think str() is unnecessary; otherwise the regex would already have failed.
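
To make the comparison concrete, the three checks below give the same answer for a plain string. The in operator is an extra alternative that was not raised in this review; like find, it can stop at the first hit. The function name is an assumption for the example.

```python
def has_semicolon(field):
    # Equivalent to field.find(';') > -1 and to field.count(';') > 0,
    # but does not need to scan past the first match.
    return ';' in field


for value in ('Bacteria;Firmicutes;', 'unclassified', ''):
    print(repr(value), has_semicolon(value), value.find(';') > -1, value.count(';') > 0)
```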

col_cnt = None
all_integers = True
for line in headers:
    if count == 0:

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

How about:
if count != 0:

and then indent the block and remove the pass?

    flow_values = int(headers[0][0])
    dataset.metadata.flow_values = flow_values
except:
    pass

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

I think this pass should at least give a warning to the log file? You can take a look at the following example: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/binary.py#L512
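
A minimal sketch of what logging instead of silently passing could look like, using only the standard logging module. The function parse_flow_values and the choice of caught exceptions are assumptions for illustration; the linked binary.py example may do this differently.

```python
import logging

log = logging.getLogger(__name__)


def parse_flow_values(headers):
    """Return the integer in the first header cell, or None if it cannot be parsed."""
    try:
        return int(headers[0][0])
    except (IndexError, ValueError, TypeError) as exc:
        # Narrow exception list plus a log entry, instead of a bare except/pass.
        log.warning("Could not determine flow_values from header: %s", exc)
        return None


print(parse_flow_values([["800"]]))
print(parse_flow_values([["not_a_number"]]))
```

The caller can then set the metadata only when the result is not None.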


def make_html_table(self, dataset, skipchars=[]):
    """Create HTML table, used for displaying peek"""
    out = ['<table cellspacing="0" cellpadding="3">']

@yhoogstrate

yhoogstrate Apr 12, 2016

Member

I think using arrays or lists is unnecessary and you could use the following instead:

 def make_html_table(self, dataset, skipchars=[]):
     """Create HTML table, used for displaying peek"""
     try:
         out = '<table cellspacing="0" cellpadding="3">'

         # Generate column header
         out += '<tr>'
         out += '<th>%d. Name</th>' % 1
         out += '<th>%d. Flows</th>' % 2
         for i in range(3, dataset.metadata.columns + 1):
             base = dataset.metadata.flow_order[(i + 1) % 4]
             out += '<th>%d. %d %s</th>' % (i, i - 2, base)
         out += '</tr>'
         out += self.make_html_peek_rows(dataset, skipchars=skipchars)
         out += '</table>'
     except Exception as exc:
         out = "Can't create peek %s" % str(exc)
     return out

headers = get_headers(dataset.file_name, sep='\t', count=-1)
for line in headers:
    if len(line) >= 2:

@yhoogstrate

yhoogstrate Apr 13, 2016

Member

Don't comment lines in OTU files start with @? If so, I think it would be better to use startswith('@'), which is consistent with the other code.

dataset.metadata.columns = len(colnames)
if len(colnames) > 2:
    dataset.metadata.groups = colnames[2:]
    column_types = ['str']

@yhoogstrate

yhoogstrate Apr 13, 2016

Member

If I understand it correctly, if len(colnames) is 2, no column_types will be set?

You could use:

dataset.metadata.column_types = ['str'] + (['int'] * ( len(headers[0]) -1))

outside of an if statement.
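
To illustrate why the unconditional form covers the two-column case, here is the same expression as a small standalone function. The name build_column_types and the sample column names are hypothetical.

```python
def build_column_types(colnames):
    # One 'str' column followed by 'int' for every remaining column,
    # computed regardless of how many columns there are.
    return ['str'] + ['int'] * (len(colnames) - 1)


print(build_column_types(['col_a', 'col_b']))
print(build_column_types(['col_a', 'col_b', 'group_1', 'group_2']))
```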

shiltemann and others added some commits Apr 14, 2016

Add a sample sniff test for new Mothur datatype.
The existing sniffer tests with the mothur stuff tests the sniffers in isolation - this is more of an integration test that tests the sample datatype configuration for the newly added mothur stuff.
@jmchilton

Member

commented Apr 27, 2016

I love these big multi-author PRs. These are the same datatypes JJ added to our MSI instance like 5 years ago I think - they've been on a long journey 😄 and I'm excited for them to be in Galaxy proper. I've opened a PR with some small tweaks and I'll happily merge these after that is merged downstream. Thanks a bunch @shiltemann et al. - great work, great community effort.

Merge pull request #1 from jmchilton/mothur_datatypes
Small tweaks to mothur dataytpes.

@jmchilton jmchilton merged commit c72381c into galaxyproject:dev Apr 27, 2016

@martenson

Member

commented Apr 27, 2016

yay! thanks a bunch @shiltemann @yhoogstrate @bgruening

@bgruening

Member

commented Apr 27, 2016

This made my day! Thanks a bunch to all people involved. Good day for metagenomics!
