ReadParser fails tests with large files #641

Closed
wants to merge 1 commit into from

Conversation

camillescott
Member

I've added a new test covering a case I discovered during multithreaded testing. For larger files (50,000+ reads), ReadParser does not deliver all reads when run with multiple threads. I'm not sure what causes this, but an existing test reproduces it when given a larger input file.

Testing output:

tests.test_read_parsers.test_read_properties ... ok
tests.test_read_parsers.test_with_default_arguments ... ok
tests.test_read_parsers.test_gzip_decompression ... ok
tests.test_read_parsers.test_bzip2_decompression ... ok
tests.test_read_parsers.test_badbzip2 ... ok
tests.test_read_parsers.test_with_multiple_threads ... ok
tests.test_read_parsers.test_with_multiple_threads_big ... FAIL

======================================================================
FAIL: tests.test_read_parsers.test_with_multiple_threads_big
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/camille/anaconda/envs/khmer_t/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/w/khmer/tests/test_read_parsers.py", line 120, in test_with_multiple_threads_big
    test_with_multiple_threads(testfile="test-large.fa")
  File "/w/khmer/tests/test_read_parsers.py", line 115, in test_with_multiple_threads
    assert reads_count_1thr == sum(reads_counts_per_thread)
AssertionError
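
For readers without the test file at hand, the invariant that the failing assertion checks can be sketched roughly as below. This is only an illustration under stated assumptions: `count_reads_threaded` is a hypothetical helper, the list of placeholder strings stands in for `test-large.fa`, and a lock-guarded Python iterator stands in for ReadParser, so it shows what the test expects rather than how khmer implements it.

```python
import threading


def count_reads_threaded(records, n_threads):
    """Drain one shared iterator from n_threads workers; return per-thread counts.

    Hypothetical stand-in for the real test, which iterates a ReadParser.
    """
    it = iter(records)
    lock = threading.Lock()  # serializes access to the shared iterator
    counts = [0] * n_threads

    def worker(idx):
        while True:
            with lock:
                try:
                    next(it)
                except StopIteration:
                    return
            counts[idx] += 1  # each worker only writes its own slot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


# Placeholder records standing in for the reads in test-large.fa (assumption).
records = ["read%d" % i for i in range(100000)]

reads_count_1thr = sum(count_reads_threaded(records, 1))
reads_counts_per_thread = count_reads_threaded(records, 4)

# The same invariant the failing test asserts: no read is lost or duplicated.
assert reads_count_1thr == sum(reads_counts_per_thread)
```

With a correctly synchronized source the per-thread split may be uneven, but the sum should always equal the single-threaded count.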

The number of reads missed varies with the cache size given to the constructor [edit: cache size is determined by the khmer config object]. For example, testing with different cache sizes and thread counts:

input_buffer_size                 Total Input Reads   [Reads Parsed per Thread]         Total Parsed
((nthreads=2) + 0) * 64 * 1024    100000              [50095, 49902]                    99997
((nthreads=2) + 1) * 64 * 1024    100000              [50104, 49894]                    99998
((nthreads=2) + 2) * 64 * 1024    100000              [50101, 49898]                    99999
((nthreads=2) + 3) * 64 * 1024    100000              [50698, 49296]                    99994
((nthreads=3) + 0) * 64 * 1024    100000              [33403, 33398, 33196]             99997
((nthreads=3) + 1) * 64 * 1024    100000              [33403, 33395, 33201]             99999
((nthreads=3) + 2) * 64 * 1024    100000              [33807, 33396, 32796]             99999
((nthreads=3) + 3) * 64 * 1024    100000              [33408, 33396, 33195]             99999
((nthreads=4) + 0) * 64 * 1024    100000              [25052, 25048, 25043, 24854]      99997
((nthreads=4) + 1) * 64 * 1024    100000              [25353, 25342, 24699, 24595]      99989
((nthreads=4) + 2) * 64 * 1024    100000              [25058, 25047, 25046, 24847]      99998
((nthreads=4) + 3) * 64 * 1024    100000              [25064, 25047, 25040, 24848]      99999

I probably have a poor understanding of how the caching / buffering is implemented, but it seems to me changing buffer sizes shouldn't affect correctness...
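
To make the shortfall easier to compare across runs, the arithmetic in the results above can be reproduced with a short script. The per-thread counts are copied from the runs reported here; the names `runs`, `extra`, and `summarize` are made up for this sketch, and nothing in it touches khmer itself.

```python
TOTAL_INPUT_READS = 100000

# (nthreads, extra, reads parsed per thread), copied from the runs above.
# "extra" is the value added to nthreads in the input_buffer_size expression.
runs = [
    (2, 0, [50095, 49902]),
    (2, 1, [50104, 49894]),
    (2, 2, [50101, 49898]),
    (2, 3, [50698, 49296]),
    (3, 0, [33403, 33398, 33196]),
    (3, 1, [33403, 33395, 33201]),
    (3, 2, [33807, 33396, 32796]),
    (3, 3, [33408, 33396, 33195]),
    (4, 0, [25052, 25048, 25043, 24854]),
    (4, 1, [25353, 25342, 24699, 24595]),
    (4, 2, [25058, 25047, 25046, 24847]),
    (4, 3, [25064, 25047, 25040, 24848]),
]


def summarize(nthreads, extra, per_thread):
    """Return (buffer size in bytes, total reads parsed, reads missing)."""
    buffer_bytes = (nthreads + extra) * 64 * 1024
    parsed = sum(per_thread)
    return buffer_bytes, parsed, TOTAL_INPUT_READS - parsed


for nthreads, extra, per_thread in runs:
    buffer_bytes, parsed, missing = summarize(nthreads, extra, per_thread)
    print(nthreads, buffer_bytes, parsed, missing)
```

Across these twelve runs the shortfall ranges from 1 to 11 reads and does not rise or fall steadily with the buffer size.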

@ged-jenkins

Test FAILed.

This was referenced Oct 29, 2014
@ctb
Member

ctb commented Oct 31, 2014

Nice catch, @camillescott. Is there a test or set of tests that would catch this kind of problem more generically? I was thinking of something like running normalize-by-median with loose parameters and verifying that the output is the same as the input, for example.
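
One possible shape for such an output-equals-input check is sketched below. It is a rough illustration only: `read_fasta` and `missing_records` are hypothetical helpers written for this example, the tiny in-memory record lists are made up, and the step that would actually run normalize-by-median to produce the output is deliberately left out.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def missing_records(input_lines, output_lines):
    """Return the multiset of records in the input but not in the output."""
    return Counter(read_fasta(input_lines)) - Counter(read_fasta(output_lines))


# Made-up data: the "output" has silently dropped record r2.
inp = [">r1", "ACGT", ">r2", "GGCC", ">r3", "TTAA"]
out = [">r1", "ACGT", ">r3", "TTAA"]

lost = missing_records(inp, out)
assert list(lost) == [("r2", "GGCC")]

# A pass-through run should lose nothing.
assert not missing_records(inp, inp)
```

Comparing multisets rather than counts also reports which records went missing, and it is insensitive to the order in which threads emit them.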

@mr-c
Contributor

mr-c commented Dec 1, 2014

These are a part of #642 now; closing.

@mr-c mr-c closed this Dec 1, 2014
@ctb ctb deleted the fix/readparser_threaded branch January 21, 2017 16:09

4 participants