Gh62 badrecord mstate #64

Merged
merged 5 commits into from Oct 23, 2012

Conversation

Projects
None yet
2 participants
Contributor

slfritchie commented Oct 18, 2012

Rather than squash commits, I wanted to leave all 5 parts as the are for easier review, if that's what the reviewer wishes.

  1. Add an eunit test that exposes the bug.
  2. Fix the bug.
  3. Change behavior of CRC error: don't treat it as a fatal error, in hope that stuff later in the file (if any) are salvagable.
  4. Refactor using "after" to consolidate file handle closing.
  5. Fix the same bug elsewhere.

Item 3 is reasonable, I believe, though if a large part of a file has corruption, an unintended effect could be error message spam. However, for a large spam tsunami, I think you'd have to have corruption inside multiple objects but not affecting the inter-object headers ... which seems unlikely.

Item 4 makes the code a bit longer but is the pattern intended for try/catch/after use, which I hope shows clearer intent.

slfritchie added some commits Oct 18, 2012

Add test to trigger {badrecord,mstate} error.
Modify the truncated_merge_test() function to trigger
the error that we're looking for ... there are a couple
of other ways that we could possibly corrupt the fold
accumulator so that merge would be confused, but we
won't try to address other other methods here.

A successful test failure (heh!) is when a test fails
with this output:

    =ERROR REPORT==== 17-Oct-2012::23:50:29 ===
    bitcask_fileops:fold: /tmp/bc.test.truncmerge/2.bitcask.data: expected 14 bytes but got only 5 bytes, skipping
    *failed*
    in function bitcask:merge_files/1 (src/bitcask.erl, line 860)
    in call from bitcask:merge1/3 (src/bitcask.erl, line 510)
    in call from bitcask:truncated_merge_test/0 (src/bitcask.erl, line 1649)
    **error:{badrecord,mstate}
Avoid fold accumulator corruption caused by I/O & CRC errors
In case of I/O or CRC error, use `throw` to throw details
about the error and the partial accumulator all the way
back to the caller.
Change behavior on CRC error
The old behavior, return a seemingly-corrupted accumulator,
or the new behavior, throw an error and abort subsequent
processing of this data file, are both bad.  So, we do
something different: we continue.

If the CRC is good, we call the fold function and get a
new accumulator.  If the CRC is bad, we complain and
use the old accumulator.  In either case, we continue
processing (what we hope are valid contents of) the
rest of the file.
Contributor

kellymclaughlin commented Oct 23, 2012

Verified the added test works as expected and tried these changes out on a node that was suffering from this issue and it work as expected. +1 to merge.

@slfritchie slfritchie merged commit 68dae7d into master Oct 23, 2012

1 check failed

default The Travis build failed
Details

@engelsanchez engelsanchez deleted the gh62-badrecord-mstate branch Mar 28, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment