Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sniff fastqsanger and prefer it over fastq if the quality scores match #4237

Merged
merged 9 commits into from Jun 27, 2017

Conversation

Projects
None yet
4 participants
@dpryan79
Copy link
Contributor

commented Jun 27, 2017

This is one possible implementation of #3571. In short, the fastq sniffer now looks at the quality scores of the first 25000 sequences and sees if they're compatible with the fastqsanger format. If they are, then the fastq datatype won't be assigned, but instead the fastqsanger datatype. This should nicely prevent people from needing to switch datatypes or running fastqgroomer after upload.

One important note: The current format produced by Illumina machines goes up to Phred J. Since they increase that every few years, I currently have the quality score checking function allow up to M. This future proofs things, but it may be a bit too liberal.

@dpryan79

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2017

Guess I have some unit tests to fix

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Jun 27, 2017

It would be nice to add also a couple of unit tests to galaxy.datatypes.sniff.guess_ext() function to test that a fastqsanger and a non-fastqsanger files are correctly assigned the fastqsanger and fastq extension.

@dpryan79

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2017

Good idea, I'll just shrink down the ones I used for testing and add those.

if isinstance(self, FastqSanger) or isinstance(self, FastqSangerGz) or isinstance(self, FastqSangerBz2):
if not self.sangerQualities(headers):
return False
else:

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

I think this else: branch may prevent some non-Sanger FASTQ from being correctly sniffed, if all quality values used in the file are in the overlapping subset.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

Correct, but if there's going to be a preference for sangerfastq then there's no way around that.

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

Isn't the way the sniff order?

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

Which is what galaxy.datatypes.sniff.guess_ext() unit tests would actually check.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

I would have thought so, but for whatever reason that's not working for me when I actually test it unless I keep the else:. I'll try to figure that out.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

It took a while, but I think I found why the else was needed for binary files, it's because the sniffing order for them is set by the order of registering them. I'm changing that and will retest that it works.

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

Removed the else and it still works. Thanks for the motivation to track that down :)

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

Uh, cool! Maybe you can enforce a sniffing order for them in lib/galaxy/datatypes/registry.py as you've done for sequence.FastqSanger().

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

That should already be done. It was only the gzipped files that needed the else: stuff before.

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

Uh, you're right, the sniffing order for binaries is determined only by the order in which they're registered with Binary.register_sniffable_binary_format(), that looks super-fragile!
But that's not for this PR to fix.

"""Presuming lines are lines from a fastq file, return True if the qualities are compatible with sanger encoding"""
for line in lines[3::4]:
_ = [ord( c ) for c in line[0]]
if max( _ ) > ord( 'M' ):

This comment has been minimized.

Copy link
@nsoranzo

nsoranzo Jun 27, 2017

Member

What about replacing the previous 2 lines with:

            if not all(_ >= '!' and _ <= 'M'  for _ in line[0]):

?

This comment has been minimized.

Copy link
@dpryan79

dpryan79 Jun 27, 2017

Author Contributor

Yup, good idea!

@mvdbeek mvdbeek merged commit 6cd026b into galaxyproject:dev Jun 27, 2017

5 checks passed

api test Build finished. 279 tests run, 0 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 150 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 34 tests run, 0 skipped, 0 failed.
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details
@mvdbeek

This comment has been minimized.

Copy link
Member

commented Jun 27, 2017

Excellent stuff, this was about time! Thanks a lot @dpryan79 !

@dpryan79 dpryan79 deleted the dpryan79:fastqSniffer_implement3571 branch Jun 27, 2017

@jennaj

This comment has been minimized.

Copy link
Member

commented Sep 29, 2017

17.09 with this is on https://usegalaxy.org and passes a raft of tests, including those covered in test cases. This is great.

non-fastqsanger == fastq
fastqsanger == fastqsanger
any-fastq.gz == (via URL) fastq.gz
any-fastq.gz == (browser upload w/o datatype specified) uncompressed fastqsanger IF fastqsanger, IF NOT, just fastq

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.