New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Phylip datatype #5301

Merged
merged 9 commits into from Jan 25, 2018

Conversation

Projects
None yet
6 participants
@khillion
Contributor

khillion commented Jan 11, 2018

Hi, we had to add a sniffer for the phylip format so we wrote a datatype for it. We thought it could be useful to the community.
Thanks!

@galaxybot galaxybot added the triage label Jan 11, 2018

@galaxybot galaxybot added this to the 18.01 milestone Jan 11, 2018

@martenson

This comment has been minimized.

Member

martenson commented Jan 17, 2018

Feel free to review and merge @blankenberg @bgruening @mvdbeek if you see it important to put into 18.01. Otherwise it will be pushed back to 18.05.

@mvdbeek

Looks good, thanks @khillion, just some minor comments. You'll get bonus points if you could include a minimal Phylip file in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/datatypes/test and add a small test for the sniffer as in

>>> dataset.file_name = get_test_fname( 'mothur_datatypetest_true.mothur.otu' )

"""
data_lines = 0
sequences = 0
for line in open(dataset.file_name):

This comment has been minimized.

@mvdbeek

mvdbeek Jan 17, 2018

Member

Could you replace this with

dataset.metadata.data_lines = self.count_data_lines()
dataset.metadata.sequences = int(get_headers(dataset.file_name, count=1)[0][0])

(you can import get_headers with from galaxy.datatypes.sniff import get_headers)

dataset.blurb = 'file purged from disk'
def __init__(self, **kwd):
"""Initialize datatype"""

This comment has been minimized.

@mvdbeek

mvdbeek Jan 17, 2018

Member

You can drop this

for line in f:
if not line.split():
break
count += 1

This comment has been minimized.

@mvdbeek

mvdbeek Jan 17, 2018

Member

Can we also return False if the count is already higher than nb_seq, to avoid iterating through the whole file ?

break
count += 1
if count == nb_seq:

This comment has been minimized.

@mvdbeek

mvdbeek Jan 17, 2018

Member

You can just

return count == nb_seq
@khillion

This comment has been minimized.

Contributor

khillion commented Jan 18, 2018

Thanks @mvdbeek for the suggestions. I have edited as you suggested with minor modifications as well as added an example file. I gave the test in the docstring of the sniffer, but is there any other file I need to fill in?

@martenson

This comment has been minimized.

Member

martenson commented Jan 18, 2018

@galaxybot test this

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Jan 19, 2018

@khillion nope, this is great, you can see the passing tests here

"""
dataset.metadata.data_lines = self.count_data_lines(dataset)
try:
dataset.metadata.sequences = int(get_headers(dataset.file_name, '\t', count=1)[0][0].split()[0])

This comment has been minimized.

@mvdbeek

mvdbeek Jan 19, 2018

Member

You can use int(get_headers(dataset.file_name, '\t', count=1)[0][0]), get_headers return a list of lists with the outer list corresponding to lines and the inner list fields as split by sep.

In [7]: int(get_headers('/Users/mvandenb/src/galaxy/t', '\t', count=1)[0][0])
Out[7]: 12

This comment has been minimized.

@khillion

khillion Jan 19, 2018

Contributor

This is what I intended to do first. But then I realized the header of the PHYLIP format was tricky and somehow separated by spaces. I am not sure the number of spaces is always the same so I am not able to be 100% to refer to the right element of the list (I am not sure to be 100% clear).

This comment has been minimized.

@mvdbeek

mvdbeek Jan 19, 2018

Member

That shouldn't matter:

In [5]: int('      \t  12    ')
Out[5]: 12

This comment has been minimized.

@khillion

khillion Jan 19, 2018

Contributor

It is just that I am not sure how to properly split the header. In my example it is separated by 3 spaces but there is no rule about it. Here it says that "The header consists of a single line describing the dimensions of the alignment. It must be the first line in the file. The header consists of optional spaces, followed by two positive integers (n and m) separated by one or more spaces."

This comment has been minimized.

@mvdbeek

mvdbeek Jan 19, 2018

Member

Alright, in that case I'd go back to int(open(dataset.file_name).readline().split()[0]) or similar. We don't gain anything by using get_headers, sorry about that.

with open(filename, "r") as f:
# Get number of sequence from first line
try:
nb_seq = int(f.readline().split()[0])

This comment has been minimized.

@mvdbeek

mvdbeek Jan 19, 2018

Member

No need for the try/except clause, if an exception is raised we move on to the next sniffer.

This comment has been minimized.

@khillion

khillion Jan 19, 2018

Contributor

I assumed Galaxy had this behaviour but I was not sure so I used try/except. I'll get rid of it.

@martenson martenson modified the milestones: 18.05, 18.01 Jan 25, 2018

@martenson martenson merged commit 3772912 into galaxyproject:dev Jan 25, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment