Stateful compaction checking in write ahead log recovery is incorrect #538

Closed
keith-turner opened this Issue Jun 21, 2018 · 1 comment

keith-turner commented Jun 21, 2018

When a tablet server dies, all of the tablets it was serving are assigned its active WALs. See #537 for a more detailed description. Therefore it's possible that a clean tablet (one that had no data in memory when the tserver died) is assigned WALs for recovery. It's also possible that the clean tablet has only a compaction finish event in the WAL. However, the recovery code checks for a lone compaction finish event and flags it as an error. This is not an error. The check made sense before 1.8.0, when WALs were tracked per tablet, because in that case a lone compaction finish event should not happen. After 1.8.0, this check needs to be reconsidered. I discovered this issue while looking into and discussing #535 with @ctubbsii.
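
To make the scenario concrete, here is a minimal model of the check described above. It is a sketch in Python, not Accumulo's actual Java recovery code: `EventType`, `LogEvent`, `has_only_compaction_finish`, and `validate_for_recovery` are hypothetical names, and the `wals_tracked_per_tablet` flag only stands in for the pre-1.8.0 versus post-1.8.0 behavior.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EventType(Enum):
    MUTATION = auto()
    COMPACTION_START = auto()
    COMPACTION_FINISH = auto()


@dataclass(frozen=True)
class LogEvent:
    seq: int
    type: EventType


def has_only_compaction_finish(events):
    """True when a tablet's WAL entries are non-empty and all finish events."""
    return bool(events) and all(
        e.type is EventType.COMPACTION_FINISH for e in events
    )


def validate_for_recovery(events, wals_tracked_per_tablet):
    """Hypothetical sanity check run before replaying a tablet's WAL entries.

    A lone finish event is only treated as an error when WALs are tracked
    per tablet (the situation before 1.8.0, per the description above).
    """
    if wals_tracked_per_tablet and has_only_compaction_finish(events):
        raise ValueError("lone compaction finish event in WAL")


# A clean tablet whose only entry in the dead tserver's WALs is a finish event.
clean_tablet_events = [LogEvent(seq=7, type=EventType.COMPACTION_FINISH)]

# WALs no longer tracked per tablet: the input is accepted.
validate_for_recovery(clean_tablet_events, wals_tracked_per_tablet=False)

# WALs tracked per tablet: the same input is flagged.
try:
    validate_for_recovery(clean_tablet_events, wals_tracked_per_tablet=True)
except ValueError as err:
    print(err)  # prints "lone compaction finish event in WAL"
```

In this model the false positive is simply the first call raising when it should not; applying the old assumption unconditionally is what the issue describes.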

@keith-turner keith-turner self-assigned this Jun 21, 2018

keith-turner added a commit to keith-turner/accumulo that referenced this issue Jun 22, 2018

fixes apache#538 fix WAL recovery code
There are two changes in this patch. First, removed a sanity check from the code
that resulted in false positives.  Second, changed recovery code to use the last
compaction finish event for the recovery seq #.
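
Reading the commit message literally, the second change could be pictured as below. This is an illustrative Python sketch, not the patch itself: events are modeled as hypothetical `(seq, kind)` tuples, and `last_compaction_finish_seq` is an invented helper. How the resulting sequence number is then compared against mutations is not stated in the message and is left out.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical event model: (sequence number, event kind).
Event = Tuple[int, str]


def last_compaction_finish_seq(events: Iterable[Event]) -> Optional[int]:
    """Return the seq of the latest COMPACTION_FINISH event, or None if absent."""
    last = None
    for seq, kind in events:
        if kind == "COMPACTION_FINISH" and (last is None or seq > last):
            last = seq
    return last


events = [
    (3, "COMPACTION_START"),
    (4, "COMPACTION_FINISH"),
    (5, "MUTATION"),
    (8, "COMPACTION_START"),
    (9, "COMPACTION_FINISH"),
]
print(last_compaction_finish_seq(events))  # prints 9

# A clean tablet with a lone finish event still yields a usable value.
print(last_compaction_finish_seq([(7, "COMPACTION_FINISH")]))  # prints 7
```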

keith-turner commented Jul 18, 2018

To test this, Accumulo's data loss test suite was modified to pause ingest. The
problems in this issue are unlikely to be seen with non-stop ingest, so the
ability to randomly pause ingest was added in apache/accumulo-testing#15.
