InvalidFASTQFileFormat: sequence and quality scores length mismatch #249
Comments
Thank you @sjackman for your report. Looks like we need to improve that error message. Can you share that FASTQ.gz file via Globus or iPlant's DiscoveryEnvironment so I can take a look at it? Info on the second option: https://pods.iplantcollaborative.org/wiki/display/DEmanual/Using+Public+Links |
I corrected |
Luckily enough the first 500 reads are able to replicate the error. Here's the gist:
|
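For anyone who wants to cut a large FASTQ.gz down to a small test case in the same way, here is a minimal sketch. It assumes unwrapped FASTQ (exactly four lines per read); the function name `head_fastq` and the file paths are hypothetical and not part of khmer.

```python
import gzip
from itertools import islice


def head_fastq(src_path, dst_path, n_reads=500):
    """Copy the first n_reads records of a gzipped FASTQ file to dst_path.

    Assumption: every record is exactly four lines (no line wrapping).
    Returns the number of complete records written.
    """
    n_lines = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        # islice stops after 4 * n_reads lines, so the rest of the file is never read.
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            n_lines += 1
    return n_lines // 4
```

Because only the head of the file is read, this should stay fast even on a multi-gigabyte input.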
Confirmed, here is the trace
Aside: we need to document how to debug the C++ code. Here was my invocation as inspired by http://docs.python.org/2/faq/extending.html#how-do-i-debug-an-extension
|
This appears to be the edge case as documented at https://github.com/ged-lab/khmer/blob/master/lib/read_parsers.cc#L1541 :-( @sjackman For now you can get past this by not multithreading this file. |
@sjackman To clarify, don't use multithreading while processing this file (-T 1 will fix your problem) |
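A minimal sketch of the workaround, for anyone scripting it. Only the `-T 1` option comes from this thread; the positional order (output table first, then the reads file) and the file names are assumptions, so check the script's `--help` output before relying on it. The helper name `single_threaded_command` is hypothetical.

```python
def single_threaded_command(table_path, reads_path):
    """Build a load-into-counting.py command line that uses one thread.

    Assumption: the script takes the output table path first and the
    input reads path second; only "-T 1" is taken from the thread.
    """
    return ["load-into-counting.py", "-T", "1", table_path, reads_path]


# Hypothetical file names, for illustration only.
cmd = single_threaded_command("reads.kh", "reads.fastq.gz")
# To run it: import subprocess; subprocess.run(cmd, check=True)
```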
I'm a little confused by this comment.
It sounds as though you're allowing FASTQ quality scores to be split over multiple lines. I really don't think that's valid FASTQ. At least, I've never run into a FASTQ file with either the sequence or the quality scores broken over multiple lines. |
Disclaimer: I didn't write this code. Agreed, I read the comment the same way. Alas, according to On Thu, Jan 9, 2014 at 3:14 PM, Shaun Jackman notifications@github.com wrote:
|
"It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence. Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers." I vote for simplifying our parser and suggesting reformatting up front if needed. @ctb & @camillescott what say you two? |
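To make the rule in that quote concrete, here is a minimal Python sketch of a wrap-tolerant parser that decides where a record ends by comparing lengths rather than by looking for a leading '@'. It is an illustration only, not khmer's C++ parser; the name `parse_wrapped_fastq` is hypothetical, and it assumes the input is an iterable of text lines.

```python
def parse_wrapped_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines that may be wrapped."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got %r" % header)
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("record %r has no '+' separator" % header)
        seq = "".join(seq_parts)
        # Keep consuming quality lines until the quality string is as long
        # as the sequence. A line starting with '@' is still quality data
        # here, which is exactly the case the quote warns about.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(it)
            except StopIteration:
                raise ValueError("record %r is truncated" % header)
        if len(qual) != len(seq):
            raise ValueError("sequence and quality scores length mismatch")
        yield header[1:], seq, qual
```

The cost of this tolerance is that the parser cannot find a record boundary without having read the record from its start, which is awkward when a file is split between threads at arbitrary offsets.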
I'm sure I've seen a fair amount of sequence with line breaks; a lot of unwitting biologists do this to make their files prettier. The question is whether we want to support such silliness. |
I've often seen line breaks in FASTA files, but never in FASTQ files. |
On Thu, Jan 09, 2014 at 12:35:26PM -0800, Michael R. Crusoe wrote:
I think, as a broad policy, we should only support what comes out of the --titusC. Titus Brown, ctb@msu.edu |
Wild suggestion - what if we learn the first ~6 characters from the first line of the FASTQ and pick a line as SEQ header only if the |
Would this bug be fixed if the FASTQ parser were modified to accept only a single line of quality data? |
Yep, that's the plan.
|
IMHO, that's a tiny bit inconsistent. We should either allow wrapping in Ram On Thu, Jan 23, 2014 at 1:51 PM, Michael R. Crusoe <notifications@github.com
|
We will not allow wrapping for either in FASTQ
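As a rough illustration of what "no wrapping" buys, here is a minimal sketch of a strict four-line parser in Python. It is not the khmer implementation; the name `parse_strict_fastq` is hypothetical, and it assumes the input is an iterable of text lines.

```python
from itertools import zip_longest


def parse_strict_fastq(lines):
    """Yield (name, sequence, quality); accept only four-line records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Read the stream four lines at a time; a short final group is padded with None.
    groups = zip_longest(*[stripped] * 4)
    for recno, record in enumerate(groups, start=1):
        header, seq, plus, qual = record
        if None in record:
            raise ValueError("record %d is truncated" % recno)
        if not header.startswith("@"):
            raise ValueError("record %d: header does not start with '@'" % recno)
        if not plus.startswith("+"):
            raise ValueError("record %d: third line does not start with '+'" % recno)
        if len(seq) != len(qual):
            raise ValueError(
                "record %d: sequence and quality scores length mismatch" % recno
            )
        yield header[1:], seq, qual
```

A wrapped file fails fast with this approach, because the second sequence line lands where the '+' separator is expected, so the user can be told to reformat up front.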
|
So, ideally, we could just |
We operate on a streaming basis. The Python equivalent of This parser seems over-complicated in many ways. Simple fixes for this are Since there is a workaround and I'm trying to get the 0.8 release done I'm On Thu, Jan 23, 2014 at 2:45 PM, Ram RS notifications@github.com wrote:
|
I am still getting the "Invalid FASTQ" error from load-into-counting.py in khmer 1.0 when using --threads > 1. Sometimes it doesn't core dump but instead runs in some non-sequence-consuming, never-ending, CPU-chomping state. Is this still the expected behaviour? |
@tseemann Sadly this is still the expected, broken behavior as documented in https://github.com/ged-lab/khmer/blob/master/doc/known-issues.txt We know how to fix this but many things compete for our time. Our apologies for the inconvenience. |
Michael, thanks for the friendly RTFM pointer :) I somehow missed that file. Appreciate all the work you've put into cleaning up the khmer code base! |
Thanks, Michael. |
I converted a BAM file to FASTQ using bedtools bamtofastq:
I then ran load-into-counting.py and got an error message:
The FASTQ file appears to be okay:
The FASTQ.gz file is 47 GB, and I haven't yet tried reducing it to a smaller test set that reproduces the error.
The machine has 32 processors and 128 GB of RAM.