Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

Closed
sjackman opened this issue Dec 23, 2013 · 24 comments
Closed

InvalidFASTQFileFormat: sequence and quality scores length mismatch #249

sjackman opened this issue Dec 23, 2013 · 24 comments
Labels
Milestone

Comments

@sjackman
Copy link

I converted a BAM file to FASTQ using bedtools bamtofastq:

bedtools bamtofastq -i D0G2RACXX_1.bam -fq /dev/stdout -fq2 /dev/stdout |gzip >D0G2RACXX_1.bam.fq.gz

I then ran load-into-counting.py and got an error message:

load-into-counting.py -k 32 -x 32e9 -T 32  D0G2RACXX_1.bam.kh D0G2RACXX_1.bam.fq.gz

PARAMETERS:
 - kmer size =    32        (-k)
 - n hashes =     4         (-N)
 - min hashsize = 3.2e+10   (-x)

Estimated memory usage is 1.3e+11 bytes (n_hashes x min_hashsize)
--------
terminate called after throwing an instance of 'khmer::read_parsers::InvalidFASTQFileFormat'
  what():  InvalidFASTQFileFormat: sequence and quality scores length mismatch
Saving hashtable to D0G2RACXX_1.bam.kh
Loading kmers from sequences in ['D0G2RACXX_1.bam.fq.gz']
making hashtable
consuming input D0G2RACXX_1.bam.fq.gz
Makefile:14: recipe for target 'D0G2RACXX_1.bam.kh' failed
make: *** [D0G2RACXX_1.bam.kh] Aborted

The FASTQ file appears to be okay:

bioawk -cfastx 'length($seq) != length($qual)' D0G2RACXX_1.bam.fq.gz |wc
      0       0       0
$ zcat D0G2RACXX_1.bam.fq.gz |head
@HS5_219:1:1101:10000:101286/1
TTGCTTGCAATTCTTCCCCCTTTTCACCATGATAGAGCCTTCTTATATAGGTTAAGAGAAGCTTAACCTCTTGAGTTATCTTTAGAACAAGTTGAGGAATTCCCTCCTTAGATCAGGGATCTTCTGATGGAACCAAGAATTCGAATGAAA
+
88?D??;DH?BFBEGBCFHGIAFH@BA@?B?E<C<CDFCHGH@FGD@4?9?BFGB<F:3B3=3=FEH@@=CH:47=.7AACAAEDDCE@C@;;6;A5;;5@CCCC989@(5@@AC>:?B(2>:>C>CAC:A@>8?B?8CCCC0:?@>AA#
@HS5_219:1:1101:10000:101286/2
CATAGCGATGACACTAAGGAGGCACTTAAGACATCTCATACACTTGATTTGGAGGAGGAAAAATGATTAGCTGAAGTCAAGGAACTTTTATCACTTGAGAAGAGTTCTTAAACAAAGGTTTTGAGTGTGAGCGCTCCCCTACATCAGCCT
+
+1?;++A@:0C2+<AG@:+<F6+AGC<G49C3D4?DFF?9B??GHGD@GGIG9B1C;A5@=E(.7?CCC33?)..7(..;;;?C5(;C>>::@(:(55:2(5+8(4:C@C@(:3(2>3:3@@<??(:+::>350<.<?BD##########
@HS5_219:1:1101:10000:104120/1
AAGTCCATGCAAATACATCCTTGAATTCTTTAAATAATTTAACAAAAGCAGCTTTTTCTTGTCTAGAACATCCCAAGCCTAGATTGATACACTCAGGATTTTCACTAGTCCCTAAATTGACTTTTTCATAAGACAATGAGGAACTGTTGG

The FASTQ.gz file is 47 GB, and I haven't yet tried reducing it to a smaller test set that reproduces the error.

The machine has 32 processors and 128 GB of RAM.

$ pip show khmer

---
Name: khmer
Version: 0.7.1
Location: /home/sjackman/local/lib/python2.7/site-packages
Requires: screed, argparse
@mr-c
Copy link
Contributor

mr-c commented Dec 23, 2013

Thank you @sjackman for your report. Looks like we need to improve that error message. Can you share that FASTQ.gz file via Globus or iPlant's DiscoveryEnvironment so I can take a look at it?

Info on the second option: https://pods.iplantcollaborative.org/wiki/display/DEmanual/Using+Public+Links

@sjackman
Copy link
Author

I corrected length(seq) != length(qual) to the intended length($seq) != length($qual). The results remained the same. I fixed a bug in my own bug report. How meta.

@sjackman
Copy link
Author

Luckily enough the first 500 reads are able to replicate the error. Here's the gist:
https://gist.github.com/sjackman/8105180

load-into-counting.py -k 32 -x 1e5 -T 32 khmer-test.kh khmer-test.fastq

PARAMETERS:
 - kmer size =    32        (-k)
 - n hashes =     4         (-N)
 - min hashsize = 1e+05     (-x)

Estimated memory usage is 4e+05 bytes (n_hashes x min_hashsize)
--------
Saving hashtable to khmer-test.kh
Loading kmers from sequences in ['khmer-test.fastq']
making hashtable
consuming input khmer-test.fastq
terminate called after throwing an instance of 'khmer::read_parsers::InvalidFASTQFileFormat'
  what():  InvalidFASTQFileFormat: sequence and quality scores length mismatch
Aborted

@mr-c
Copy link
Contributor

mr-c commented Jan 9, 2014

Confirmed, here is the trace

(gdb) backtrace
#0  0x00007ffff782bf77 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff782f5e8 in __GI_abort () at abort.c:90
#2  0x00007ffff644e6e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff644c856 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff644c883 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff644caf6 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff6717be5 in khmer::read_parsers::IParser::imprint_next_read (this=0xb41550, the_read=...) at lib/read_parsers.cc:1705
#7  0x00007ffff6719fc1 in get_next_read (this=0xb41550) at lib/read_parsers.hh:423
#8  khmer::Hashtable::consume_fasta (this=this@entry=0xb413c0, parser=parser@entry=0xb41550, total_reads=@0x7ffff5892650: 195, n_consumed=@0x7ffff5892680: 23205, callback=callback@entry=0x7ffff670f120 <_report_fn(char const*, void*, unsigned long long, unsigned long long)>, 
    callback_data=<optimized out>) at lib/hashtable.cc:202
#9  0x00007ffff6708114 in hash_consume_fasta_with_reads_parser (self=<optimized out>, args=<optimized out>) at khmer/_khmermodule.cc:1508
#10 0x00000000005313b5 in PyEval_EvalFrameEx ()
#11 0x000000000052e672 in PyEval_EvalFrameEx ()
#12 0x000000000052e672 in PyEval_EvalFrameEx ()
#13 0x000000000050697a in ?? ()
#14 0x00000000004d53f4 in ?? ()
#15 0x000000000053bf0b in PyEval_CallObjectWithKeywords ()
#16 0x0000000000579a22 in _start ()

Aside: we need to document how to debug the C++ code. Here was my invocation as inspired by http://docs.python.org/2/faq/extending.html#how-do-i-debug-an-extension

(env)mcrusoe@athyra:~/khmer/gl-master/tests/test-data$ gdb ~/khmer/gl-master/env/bin/python
[...]
(gdb) br _PyImport_LoadDynamicModule
Breakpoint 1 at 0x4265ba
(gdb) run /home/mcrusoe/khmer/gl-master/env/bin/load-into-counting.py -k 32 -x 1e5 -T 32 invalidFASTQFileFormat-false-postive.kh invalidFASTQFileFormat-false-postive.fq
Starting program: /home/mcrusoe/khmer/gl-master/env/bin/python /home/mcrusoe/khmer/gl-master/env/bin/load-into-counting.py -k 32 -x 1e5 -T 32 invalidFASTQFileFormat-false-postive.kh invalidFASTQFileFormat-false-postive.fq
[...]
(gdb) c
Continuing.
[repeat until crash]
(gdb) backtrace

@mr-c
Copy link
Contributor

mr-c commented Jan 9, 2014

This appears to be the edge case as documented at https://github.com/ged-lab/khmer/blob/master/lib/read_parsers.cc#L1541 :-(

@sjackman For now you can get past this by not multithreading this file.

@mr-c
Copy link
Contributor

mr-c commented Jan 9, 2014

@sjackman To clarify, don't use multithreading while processing this file (-T 1 will fix your problem)

@sjackman
Copy link
Author

sjackman commented Jan 9, 2014

I'm a little confused by this comment.

There are four or more lines of quality scores remaining.

It sounds as though you're allowing FASTQ quality scores to be split over multiple lines. I really don't think that's valid FASTQ. At least, I've never run into a FASTQ file with either the sequence or the quality scores broken over multiple lines.

@mr-c
Copy link
Contributor

mr-c commented Jan 9, 2014

Disclaimer: I didn't write this code.

Agreed, I read the comment the same way.

Alas, according to
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/#__sec7title the
sequence or quality lines can indeed be wrapped

On Thu, Jan 9, 2014 at 3:14 PM, Shaun Jackman notifications@github.comwrote:

I'm a little confused by this comment.

There are four or more lines of quality scores remaining.

It sounds as though you're allowing FASTQ quality scores to be split over
multiple lines. I really don't think that's valid FASTQ. At least, I've
never run into a FASTQ file with either the sequence or the quality scores
broken over multiple lines.


Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-31971136
.

@mr-c
Copy link
Contributor

mr-c commented Jan 9, 2014

"It is vital to note that the ‘@’ marker character (ASCII 64) may occur anywhere in the quality string—including at the start of any of the quality lines. This means that any parser must not treat a line starting with ‘@’ as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers."

I vote for simplifying our parser and suggesting reformating up front if needed.

@ctb & @camillescott what say you two?

@camillescott
Copy link
Member

I'm sure I've seen a fair amount of sequence with line breaks; a lot of unwitting biologists do this to make their files prettier. The question is whether we want to support such silliness.

@sjackman
Copy link
Author

I've often seen line breaks in FASTA files, but never in FASTQ files.

@ctb
Copy link
Member

ctb commented Jan 11, 2014

On Thu, Jan 09, 2014 at 12:35:26PM -0800, Michael R. Crusoe wrote:

"It is vital to note that the ???@??? marker character (ASCII 64) may occur anywhere in the quality string???including at the start of any of the quality lines. This means that any parser must not treat a line starting with ???@??? as indicating the start of the next record, without additionally checking the length of the quality string thus far matches the length of the sequence.

Because of this complication, most tools output FASTQ files without line wrapping of the sequence and quality string. This means each read consists of exactly four lines (sometimes very long lines), ideal for a very simple parser to deal with. The OBF tools follow this convention on output, as does the MAQ conversion script. We recommend this for maximum compatibility with (simplistic) parsers."

I vote for simplifying our parser and suggesting reformating up front if needed.

@ctb & @camillescott what say you two?

I think, as a broad policy, we should only support what comes out of the
machine and maybe what ultra common tools do (although we should yell at people
that write stupid software). If people have reformatted things in their own
weird broken way then I would worry about all sorts of other things having gone
wrong.

--titus

C. Titus Brown, ctb@msu.edu

@RamRS
Copy link
Contributor

RamRS commented Jan 23, 2014

Wild suggestion - what if we learn the first ~6 characters from the first line of the FASTQ and pick a line as SEQ header only if the [0]==@ and [1:7] =~ @XXXXXX?

@sjackman
Copy link
Author

Would this bug be fixed if the FASTQ parser were modified to accept only a single line of quality data?
https://github.com/ged-lab/khmer/blob/master/lib/read_parsers.cc#L1610

@mr-c
Copy link
Contributor

mr-c commented Jan 23, 2014

Yep, that's the plan.
On Jan 23, 2014 1:35 PM, "Shaun Jackman" notifications@github.com wrote:

Would this bug be fixed if the FASTQ parser were modified to accept only a
single line of quality data?


Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33153825
.

@RamRS
Copy link
Contributor

RamRS commented Jan 23, 2014

IMHO, that's a tiny bit inconsistent. We should either allow wrapping in
both SEQ and QUAL lines, or disallow both.

Ram

On Thu, Jan 23, 2014 at 1:51 PM, Michael R. Crusoe <notifications@github.com

wrote:

Yep, that's the plan.
On Jan 23, 2014 1:35 PM, "Shaun Jackman" notifications@github.com
wrote:

Would this bug be fixed if the FASTQ parser were modified to accept only
a
single line of quality data?


Reply to this email directly or view it on GitHub<
https://github.com/ged-lab/khmer/issues/249#issuecomment-33153825>
.


Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33155354
.

@mr-c
Copy link
Contributor

mr-c commented Jan 23, 2014

We will not allow wrapping for either in FASTQ
On Jan 23, 2014 2:02 PM, "Ram RS" notifications@github.com wrote:

IMHO, that's a tiny bit inconsistent. We should either allow wrapping in
both SEQ and QUAL lines, or disallow both.

Ram

On Thu, Jan 23, 2014 at 1:51 PM, Michael R. Crusoe <
notifications@github.com

wrote:

Yep, that's the plan.
On Jan 23, 2014 1:35 PM, "Shaun Jackman" notifications@github.com
wrote:

Would this bug be fixed if the FASTQ parser were modified to accept
only
a
single line of quality data?


Reply to this email directly or view it on GitHub<
https://github.com/ged-lab/khmer/issues/249#issuecomment-33153825>
.


Reply to this email directly or view it on GitHub<
https://github.com/ged-lab/khmer/issues/249#issuecomment-33155354>
.


Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33156424
.

@RamRS
Copy link
Contributor

RamRS commented Jan 23, 2014

So, ideally, we could just wc -l % 4 == 0 ? continue : raise ?

@mr-c
Copy link
Contributor

mr-c commented Jan 24, 2014

We operate on a streaming basis. The python equivalent of wc would
require us read the entire file first.

This parser seems over-complicated in many ways. Simple fixes for this are
welcome but I suspect it needs more attention than that.

Since there is a work around and I'm trying to get the 0.8 release done I'm
going to ignore this issue for a bit.

On Thu, Jan 23, 2014 at 2:45 PM, Ram RS notifications@github.com wrote:

So, ideally, we could just wc -l % 4 == 0 ? continue : raise ?

Reply to this email directly or view it on GitHubhttps://github.com//issues/249#issuecomment-33160738
.

@mr-c mr-c modified the milestones: 1.1+ Release, 1.0 release Apr 2, 2014
@tseemann
Copy link

I am still getting the "Invalid FASTQ" for load-into-counting.py from khmer 1.0 when using --threads > 1. Sometimes it doesn't core dump but runs in some non-sequence-consumin never-ending cpu-chomping state. Is this still the expected behaviour?

@mr-c
Copy link
Contributor

mr-c commented Apr 19, 2014

@tseemann Sadly this is still the expected, broken behavior as documented in https://github.com/ged-lab/khmer/blob/master/doc/known-issues.txt

We know how to fix this but many things compete for our time. Our apologies for the inconvenience.

@tseemann
Copy link

Michael, thanks for the friendly RTFM pointer :) I somehow missed that file. Appreciate all the work you've put into cleaning up the khmer code base!

@mr-c mr-c modified the milestones: 1.1.1+ Release, unscheduled Aug 5, 2014
This was referenced Oct 29, 2014
@mr-c
Copy link
Contributor

mr-c commented Dec 15, 2014

Fixed by #642 / #643

@mr-c mr-c closed this as completed Dec 15, 2014
@sjackman
Copy link
Author

Thanks, Michael.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants