ReadParser fails tests with large files #641

camillescott · 2014-10-27T21:48:09Z

I've added a new test that covers a case I discovered during multithreaded testing. For larger files (50000+ reads), ReadParser does not deliver all reads when used from multiple threads. I'm not exactly sure where this comes from, but the existing multithreaded test exposes it if you give it a larger input file.

Testing output:

tests.test_read_parsers.test_read_properties ... ok
tests.test_read_parsers.test_with_default_arguments ... ok
tests.test_read_parsers.test_gzip_decompression ... ok
tests.test_read_parsers.test_bzip2_decompression ... ok
tests.test_read_parsers.test_badbzip2 ... ok
tests.test_read_parsers.test_with_multiple_threads ... ok
tests.test_read_parsers.test_with_multiple_threads_big ... FAIL

======================================================================
FAIL: tests.test_read_parsers.test_with_multiple_threads_big
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/camille/anaconda/envs/khmer_t/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/w/khmer/tests/test_read_parsers.py", line 120, in test_with_multiple_threads_big
    test_with_multiple_threads(testfile="test-large.fa")
  File "/w/khmer/tests/test_read_parsers.py", line 115, in test_with_multiple_threads
    assert reads_count_1thr == sum(reads_counts_per_thread)
AssertionError

The number of reads missed varies with the cache size given to the constructor [edit: cache size is determined by the khmer config object]. For example, testing with different cache sizes and thread counts:

input_buffer_size: ((nthreads=2) + 0) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [50095, 49902] 99997
**************************************** 

input_buffer_size: ((nthreads=2) + 1) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [50104, 49894] 99998
**************************************** 

input_buffer_size: ((nthreads=2) + 2) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [50101, 49898] 99999
**************************************** 

input_buffer_size: ((nthreads=2) + 3) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [50698, 49296] 99994
**************************************** 

input_buffer_size: ((nthreads=3) + 0) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [33403, 33398, 33196] 99997
**************************************** 

input_buffer_size: ((nthreads=3) + 1) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [33403, 33395, 33201] 99999
**************************************** 

input_buffer_size: ((nthreads=3) + 2) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [33807, 33396, 32796] 99999
**************************************** 

input_buffer_size: ((nthreads=3) + 3) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [33408, 33396, 33195] 99999
**************************************** 

input_buffer_size: ((nthreads=4) + 0) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [25052, 25048, 25043, 24854] 99997
**************************************** 

input_buffer_size: ((nthreads=4) + 1) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [25353, 25342, 24699, 24595] 99989
**************************************** 

input_buffer_size: ((nthreads=4) + 2) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [25058, 25047, 25046, 24847] 99998
**************************************** 

input_buffer_size: ((nthreads=4) + 3) * 64 * 1024

Total Input Reads, [Reads Parsed per Thread], Total Parsed
100000 [25064, 25047, 25040, 24848] 99999
****************************************

I probably have a poor understanding of how the caching / buffering is implemented, but it seems to me changing buffer sizes shouldn't affect correctness...
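
For reference, the invariant the failing assertion checks can be sketched without khmer at all. The snippet below is only an illustration under stated assumptions: `iter_read_names` and `count_reads_per_thread` are hypothetical helpers, the input is synthetic, and a lock-guarded Python iterator stands in for ReadParser. It says nothing about how ReadParser buffers internally. It only shows that the sum of per-thread counts should equal the single-thread count, whatever the buffer size.

```python
import threading


def iter_read_names(lines):
    """Yield the name of each record found in an iterable of FASTA lines."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def count_reads_per_thread(records, n_threads):
    """Drain one shared iterator from n_threads workers and return per-thread counts."""
    lock = threading.Lock()
    shared = iter(records)
    counts = [0] * n_threads

    def worker(idx):
        while True:
            with lock:  # only one thread may advance the shared iterator at a time
                try:
                    next(shared)
                except StopIteration:
                    return
            counts[idx] += 1  # each worker writes only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


# Synthetic input (hypothetical data): 1000 two-line FASTA records.
lines = []
for i in range(1000):
    lines.append(">read%d" % i)
    lines.append("ACGT")

reads_count_1thr = sum(1 for _ in iter_read_names(lines))
reads_counts_per_thread = count_reads_per_thread(iter_read_names(lines), 4)

# Same invariant as the failing test: no read may be lost between threads.
assert reads_count_1thr == sum(reads_counts_per_thread)
```

How the reads are split across threads is not deterministic, but the total is.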

ged-jenkins · 2014-10-27T21:50:10Z

Test FAILed.

ctb · 2014-10-31T10:05:48Z

Nice catch, @camillescott. Is there a test or set of tests that more generically would catch this kind of problem? I was thinking of something like running normalize-by-median with loose parameters and verifying that it produced the same output as input, for example.
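
One way the comparison step of such a round-trip check might look, as a rough sketch only: `read_names` and `dropped_reads` are hypothetical helpers, the FASTA text is made up, and actually running normalize-by-median (or any other script) to produce the output is left out. The idea is just to report which input reads never reached the output.

```python
from collections import Counter


def read_names(fasta_text):
    """Return the record names (header lines without '>') in FASTA text."""
    return [line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")]


def dropped_reads(input_text, output_text):
    """Return the names present in the input but missing from the output.

    Counter subtraction keeps duplicates, so a name that appears twice in
    the input and once in the output is reported once.
    """
    missing = Counter(read_names(input_text)) - Counter(read_names(output_text))
    return sorted(missing.elements())


# Hypothetical data: the output has lost read "r2".
input_text = ">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"
output_text = ">r1\nACGT\n>r3\nTTAA\n"

lost = dropped_reads(input_text, output_text)
assert lost == ["r2"]
```

Comparing names rather than only counts makes a failure point at the specific reads that were dropped.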

mr-c · 2014-12-01T20:06:57Z

These are a part of #642 now; closing.

Add new test, new data

88a26c9

This was referenced Oct 29, 2014

first pass seqan impl #642

Merged

replacement C++ FAST[AQ] parser #643

Closed

mr-c closed this Dec 1, 2014

mr-c mentioned this pull request Dec 8, 2014

Multithreading can lead to dropped reads #681

Closed

ctb deleted the fix/readparser_threaded branch January 21, 2017 16:09
