InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

sjackman · 2013-12-23T20:43:27Z

I converted a BAM file to FASTQ using bedtools bamtofastq:

bedtools bamtofastq -i D0G2RACXX_1.bam -fq /dev/stdout -fq2 /dev/stdout |gzip >D0G2RACXX_1.bam.fq.gz

I then ran load-into-counting.py and got an error message:

load-into-counting.py -k 32 -x 32e9 -T 32  D0G2RACXX_1.bam.kh D0G2RACXX_1.bam.fq.gz

PARAMETERS:
 - kmer size =    32        (-k)
 - n hashes =     4         (-N)
 - min hashsize = 3.2e+10   (-x)

Estimated memory usage is 1.3e+11 bytes (n_hashes x min_hashsize)
--------
terminate called after throwing an instance of 'khmer::read_parsers::InvalidFASTQFileFormat'
  what():  InvalidFASTQFileFormat: sequence and quality scores length mismatch
Saving hashtable to D0G2RACXX_1.bam.kh
Loading kmers from sequences in ['D0G2RACXX_1.bam.fq.gz']
making hashtable
consuming input D0G2RACXX_1.bam.fq.gz
Makefile:14: recipe for target 'D0G2RACXX_1.bam.kh' failed
make: *** [D0G2RACXX_1.bam.kh] Aborted

The FASTQ file appears to be okay:

bioawk -cfastx 'length($seq) != length($qual)' D0G2RACXX_1.bam.fq.gz |wc
      0       0       0

$ zcat D0G2RACXX_1.bam.fq.gz |head
@HS5_219:1:1101:10000:101286/1
TTGCTTGCAATTCTTCCCCCTTTTCACCATGATAGAGCCTTCTTATATAGGTTAAGAGAAGCTTAACCTCTTGAGTTATCTTTAGAACAAGTTGAGGAATTCCCTCCTTAGATCAGGGATCTTCTGATGGAACCAAGAATTCGAATGAAA
+
88?D??;DH?BFBEGBCFHGIAFH@BA@?B?E<C<CDFCHGH@FGD@4?9?BFGB<F:3B3=3=FEH@@=CH:47=.7AACAAEDDCE@C@;;6;A5;;5@CCCC989@(5@@AC>:?B(2>:>C>CAC:A@>8?B?8CCCC0:?@>AA#
@HS5_219:1:1101:10000:101286/2
CATAGCGATGACACTAAGGAGGCACTTAAGACATCTCATACACTTGATTTGGAGGAGGAAAAATGATTAGCTGAAGTCAAGGAACTTTTATCACTTGAGAAGAGTTCTTAAACAAAGGTTTTGAGTGTGAGCGCTCCCCTACATCAGCCT
+
+1?;++A@:0C2+<AG@:+<F6+AGC<G49C3D4?DFF?9B??GHGD@GGIG9B1C;A5@=E(.7?CCC33?)..7(..;;;?C5(;C>>::@(:(55:2(5+8(4:C@C@(:3(2>3:3@@<??(:+::>350<.<?BD##########
@HS5_219:1:1101:10000:104120/1
AAGTCCATGCAAATACATCCTTGAATTCTTTAAATAATTTAACAAAAGCAGCTTTTTCTTGTCTAGAACATCCCAAGCCTAGATTGATACACTCAGGATTTTCACTAGTCCCTAAATTGACTTTTTCATAAGACAATGAGGAACTGTTGG

The FASTQ.gz file is 47 GB, and I haven't yet tried reducing it to a smaller test set that reproduces the error.

The machine has 32 processors and 128 GB of RAM.

$ pip show khmer

---
Name: khmer
Version: 0.7.1
Location: /home/sjackman/local/lib/python2.7/site-packages
Requires: screed, argparse

The text was updated successfully, but these errors were encountered:

mr-c · 2013-12-23T21:01:51Z

Thank you @sjackman for your report. Looks like we need to improve that error message. Can you share that FASTQ.gz file via Globus or iPlant's DiscoveryEnvironment so I can take a look at it?

Info on the second option: https://pods.iplantcollaborative.org/wiki/display/DEmanual/Using+Public+Links

sjackman · 2013-12-23T21:24:39Z

I corrected length(seq) != length(qual) to the intended length($seq) != length($qual). The results remained the same. I fixed a bug in my own bug report. How meta.

sjackman · 2013-12-23T21:38:10Z

Luckily enough the first 500 reads are able to replicate the error. Here's the gist:
https://gist.github.com/sjackman/8105180

load-into-counting.py -k 32 -x 1e5 -T 32 khmer-test.kh khmer-test.fastq

PARAMETERS:
 - kmer size =    32        (-k)
 - n hashes =     4         (-N)
 - min hashsize = 1e+05     (-x)

Estimated memory usage is 4e+05 bytes (n_hashes x min_hashsize)
--------
Saving hashtable to khmer-test.kh
Loading kmers from sequences in ['khmer-test.fastq']
making hashtable
consuming input khmer-test.fastq
terminate called after throwing an instance of 'khmer::read_parsers::InvalidFASTQFileFormat'
  what():  InvalidFASTQFileFormat: sequence and quality scores length mismatch
Aborted

mr-c · 2014-01-09T19:34:23Z

Confirmed, here is the trace

(gdb) backtrace
#0  0x00007ffff782bf77 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff782f5e8 in __GI_abort () at abort.c:90
#2  0x00007ffff644e6e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff644c856 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff644c883 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff644caf6 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff6717be5 in khmer::read_parsers::IParser::imprint_next_read (this=0xb41550, the_read=...) at lib/read_parsers.cc:1705
#7  0x00007ffff6719fc1 in get_next_read (this=0xb41550) at lib/read_parsers.hh:423
#8  khmer::Hashtable::consume_fasta (this=this@entry=0xb413c0, parser=parser@entry=0xb41550, total_reads=@0x7ffff5892650: 195, n_consumed=@0x7ffff5892680: 23205, callback=callback@entry=0x7ffff670f120 <_report_fn(char const*, void*, unsigned long long, unsigned long long)>, 
    callback_data=<optimized out>) at lib/hashtable.cc:202
#9  0x00007ffff6708114 in hash_consume_fasta_with_reads_parser (self=<optimized out>, args=<optimized out>) at khmer/_khmermodule.cc:1508
#10 0x00000000005313b5 in PyEval_EvalFrameEx ()
#11 0x000000000052e672 in PyEval_EvalFrameEx ()
#12 0x000000000052e672 in PyEval_EvalFrameEx ()
#13 0x000000000050697a in ?? ()
#14 0x00000000004d53f4 in ?? ()
#15 0x000000000053bf0b in PyEval_CallObjectWithKeywords ()
#16 0x0000000000579a22 in _start ()

Aside: we need to document how to debug the C++ code. Here was my invocation as inspired by http://docs.python.org/2/faq/extending.html#how-do-i-debug-an-extension

(env)mcrusoe@athyra:~/khmer/gl-master/tests/test-data$ gdb ~/khmer/gl-master/env/bin/python
[...]
(gdb) br _PyImport_LoadDynamicModule
Breakpoint 1 at 0x4265ba
(gdb) run /home/mcrusoe/khmer/gl-master/env/bin/load-into-counting.py -k 32 -x 1e5 -T 32 invalidFASTQFileFormat-false-postive.kh invalidFASTQFileFormat-false-postive.fq
Starting program: /home/mcrusoe/khmer/gl-master/env/bin/python /home/mcrusoe/khmer/gl-master/env/bin/load-into-counting.py -k 32 -x 1e5 -T 32 invalidFASTQFileFormat-false-postive.kh invalidFASTQFileFormat-false-postive.fq
[...]
(gdb) c
Continuing.
[repeat until crash]
(gdb) backtrace

mr-c · 2014-01-09T20:02:58Z

This appears to be the edge case as documented at https://github.com/ged-lab/khmer/blob/master/lib/read_parsers.cc#L1541 :-(

@sjackman For now you can get past this by not multithreading this file.

mr-c · 2014-01-09T20:04:25Z

@sjackman To clarify, don't use multithreading while processing this file (-T 1 will fix your problem)

sjackman · 2014-01-09T20:14:02Z

I'm a little confused by this comment.

There are four or more lines of quality scores remaining.

It sounds as though you're allowing FASTQ quality scores to be split over multiple lines. I really don't think that's valid FASTQ. At least, I've never run into a FASTQ file with either the sequence or the quality scores broken over multiple lines.

mr-c · 2014-01-09T20:34:10Z

Disclaimer: I didn't write this code.

Agreed, I read the comment the same way.

Alas, according to
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/#__sec7title the
sequence or quality lines can indeed be wrapped

On Thu, Jan 9, 2014 at 3:14 PM, Shaun Jackman notifications@github.comwrote:

I'm a little confused by this comment.

There are four or more lines of quality scores remaining.

It sounds as though you're allowing FASTQ quality scores to be split over
multiple lines. I really don't think that's valid FASTQ. At least, I've
never run into a FASTQ file with either the sequence or the quality scores
broken over multiple lines.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-31971136
.

mr-c · 2014-01-09T20:35:26Z

"It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers."

I vote for simplifying our parser and suggesting reformating up front if needed.

@ctb & @camillescott what say you two?

camillescott · 2014-01-09T23:25:59Z

I'm sure I've seen a fair amount of sequence with line breaks; a lot of unwitting biologists do this to make their files prettier. The question is whether we want to support such silliness.

sjackman · 2014-01-10T00:05:14Z

I've often seen line breaks in FASTA files, but never in FASTQ files.

ctb · 2014-01-11T15:37:52Z

On Thu, Jan 09, 2014 at 12:35:26PM -0800, Michael R. Crusoe wrote:

"It is vital to note that the ???@??? marker character (ASCII 64) may occur anywhere in the quality string???including at the start of any of the quality lines. This means that any parser must not treat a line starting with ???@??? as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers."

I vote for simplifying our parser and suggesting reformating up front if needed.

@ctb & @camillescott what say you two?

I think, as a broad policy, we should only support what comes out of the
machine and maybe what ultra common tools do (although we should yell at people
that write stupid software). If people have reformatted things in their own
weird broken way then I would worry about all sorts of other things having gone
wrong.

--titus

C. Titus Brown, ctb@msu.edu

RamRS · 2014-01-23T18:10:48Z

Wild suggestion - what if we learn the first ~6 characters from the first line of the FASTQ and pick a line as SEQ header only if the [0]==@ and [1:7] =~ @XXXXXX?

sjackman · 2014-01-23T18:35:58Z

Would this bug be fixed if the FASTQ parser were modified to accept only a single line of quality data?
https://github.com/ged-lab/khmer/blob/master/lib/read_parsers.cc#L1610

mr-c · 2014-01-23T18:51:18Z

Yep, that's the plan.
On Jan 23, 2014 1:35 PM, "Shaun Jackman" notifications@github.com wrote:

Would this bug be fixed if the FASTQ parser were modified to accept only a
single line of quality data?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33153825
.

RamRS · 2014-01-23T19:02:00Z

IMHO, that's a tiny bit inconsistent. We should either allow wrapping in
both SEQ and QUAL lines, or disallow both.

Ram

On Thu, Jan 23, 2014 at 1:51 PM, Michael R. Crusoe <notifications@github.com

wrote:

Yep, that's the plan.
On Jan 23, 2014 1:35 PM, "Shaun Jackman" notifications@github.com
wrote:

Would this bug be fixed if the FASTQ parser were modified to accept only
a
single line of quality data?

—
Reply to this email directly or view it on GitHub<
https://github.com/ged-lab/khmer/issues/249#issuecomment-33153825>
.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33155354
.

mr-c · 2014-01-23T19:37:34Z

We will not allow wrapping for either in FASTQ
On Jan 23, 2014 2:02 PM, "Ram RS" notifications@github.com wrote:

IMHO, that's a tiny bit inconsistent. We should either allow wrapping in
both SEQ and QUAL lines, or disallow both.

Ram

On Thu, Jan 23, 2014 at 1:51 PM, Michael R. Crusoe <
notifications@github.com

wrote:

Yep, that's the plan.
On Jan 23, 2014 1:35 PM, "Shaun Jackman" notifications@github.com
wrote:

Would this bug be fixed if the FASTQ parser were modified to accept
only
a
single line of quality data?

—
Reply to this email directly or view it on GitHub<
https://github.com/ged-lab/khmer/issues/249#issuecomment-33153825>
.

—
Reply to this email directly or view it on GitHub<
https://github.com/ged-lab/khmer/issues/249#issuecomment-33155354>
.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33156424
.

RamRS · 2014-01-23T19:45:02Z

So, ideally, we could just wc -l % 4 == 0 ? continue : raise ?

mr-c · 2014-01-24T15:58:53Z

We operate on a streaming basis. The python equivalent of wc would
require us read the entire file first.

This parser seems over-complicated in many ways. Simple fixes for this are
welcome but I suspect it needs more attention than that.

Since there is a work around and I'm trying to get the 0.8 release done I'm
going to ignore this issue for a bit.

On Thu, Jan 23, 2014 at 2:45 PM, Ram RS notifications@github.com wrote:

So, ideally, we could just wc -l % 4 == 0 ? continue : raise ?

Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33160738
.

tseemann · 2014-04-19T00:54:29Z

I am still getting the "Invalid FASTQ" for load-into-counting.py from khmer 1.0 when using --threads > 1. Sometimes it doesn't core dump but runs in some non-sequence-consumin never-ending cpu-chomping state. Is this still the expected behaviour?

mr-c · 2014-04-19T01:01:42Z

@tseemann Sadly this is still the expected, broken behavior as documented in https://github.com/ged-lab/khmer/blob/master/doc/known-issues.txt

We know how to fix this but many things compete for our time. Our apologies for the inconvenience.

tseemann · 2014-04-19T02:03:55Z

Michael, thanks for the friendly RTFM pointer :) I somehow missed that file. Appreciate all the work you've put into cleaning up the khmer code base!

mr-c · 2014-12-15T18:42:03Z

Fixed by #642 / #643

sjackman · 2014-12-16T02:14:02Z

Thanks, Michael.

mr-c modified the milestones: 1.1+ Release, 1.0 release Apr 2, 2014

mr-c mentioned this issue Apr 6, 2014

Need: tool/API to autodetect fasta/fastq file type #68

Closed

ctb mentioned this issue Apr 20, 2014

load-graph.py aborts after filter-stoptags.py #369

Closed

mr-c modified the milestones: 1.1.1+ Release, unscheduled Aug 5, 2014

This was referenced Oct 29, 2014

first pass seqan impl #642

Merged

replacement C++ FAST[AQ] parser #643

Closed

mr-c closed this as completed Dec 15, 2014

dansteiert mentioned this issue Mar 2, 2024

Mismatch between read length and quals length writing read HiDiHlabs/umi_alignment#12

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

sjackman commented Dec 23, 2013

mr-c commented Dec 23, 2013

sjackman commented Dec 23, 2013

sjackman commented Dec 23, 2013

mr-c commented Jan 9, 2014

mr-c commented Jan 9, 2014

mr-c commented Jan 9, 2014

sjackman commented Jan 9, 2014

mr-c commented Jan 9, 2014

mr-c commented Jan 9, 2014

camillescott commented Jan 9, 2014

sjackman commented Jan 10, 2014

ctb commented Jan 11, 2014

RamRS commented Jan 23, 2014

sjackman commented Jan 23, 2014

mr-c commented Jan 23, 2014

RamRS commented Jan 23, 2014

mr-c commented Jan 23, 2014

RamRS commented Jan 23, 2014

mr-c commented Jan 24, 2014

tseemann commented Apr 19, 2014

mr-c commented Apr 19, 2014

tseemann commented Apr 19, 2014

mr-c commented Dec 15, 2014

sjackman commented Dec 16, 2014

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

Comments

sjackman commented Dec 23, 2013

mr-c commented Dec 23, 2013

sjackman commented Dec 23, 2013

sjackman commented Dec 23, 2013

mr-c commented Jan 9, 2014

mr-c commented Jan 9, 2014

mr-c commented Jan 9, 2014

sjackman commented Jan 9, 2014

mr-c commented Jan 9, 2014

mr-c commented Jan 9, 2014

camillescott commented Jan 9, 2014

sjackman commented Jan 10, 2014

ctb commented Jan 11, 2014

--titus

RamRS commented Jan 23, 2014

sjackman commented Jan 23, 2014

mr-c commented Jan 23, 2014

RamRS commented Jan 23, 2014

mr-c commented Jan 23, 2014

RamRS commented Jan 23, 2014

mr-c commented Jan 24, 2014

tseemann commented Apr 19, 2014

mr-c commented Apr 19, 2014

tseemann commented Apr 19, 2014

mr-c commented Dec 15, 2014

sjackman commented Dec 16, 2014