Stateful compaction checking in write ahead log recovery is incorrect #538

Closed
keith-turner opened this issue Jun 21, 2018 · 1 comment
Labels
blocker This issue blocks any release version labeled on it. bug This issue has been verified to be a bug.

@keith-turner
Contributor

When a tablet server dies, all of the tablets it was serving are assigned its active WALs. See #537 for a more detailed description. Therefore it's possible that a clean tablet (one that had no data in memory when the tserver died) is assigned WALs for recovery. It's also possible that the clean tablet only has a compaction finish event in the WAL. However, the recovery code checks for a lone compaction finish event and flags it as an error. This is not an error. The check made sense before 1.8.0, when WALs were tracked per tablet, because in that case a lone compaction finish event should not happen. After 1.8.0, this check needs to be reconsidered. Discovered this issue as a result of looking into and discussing #535 with @ctubbsii.

@keith-turner keith-turner added the v2.0.0, blocker, and bug labels Jun 21, 2018
@keith-turner keith-turner self-assigned this Jun 21, 2018
keith-turner added a commit to keith-turner/accumulo that referenced this issue Jun 22, 2018
There are two changes in this patch. First, removed a sanity check from the
code that resulted in false positives.  Second, changed recovery code to use
the last compaction finish event for recovery seq #.
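
For readers unfamiliar with the recovery code, the two changes can be pictured with a rough sketch. This is not the actual Accumulo implementation: `RecoverySketch`, `LogEvent`, `EventType`, `NO_FINISH`, and `findRecoverySeq` are hypothetical names, and the WAL is modelled as a plain in-memory list of events in log order. The sketch only shows that a lone compaction finish event is accepted rather than flagged, and that the last compaction finish event supplies the recovery seq #.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of WAL recovery; not Accumulo's real classes.
final class RecoverySketch {

  // Hypothetical event kinds for this sketch.
  enum EventType { COMPACTION_START, COMPACTION_FINISH, MUTATION }

  // One WAL entry for a tablet: its kind and its sequence number.
  static final class LogEvent {
    final EventType type;
    final long seq;

    LogEvent(EventType type, long seq) {
      this.type = type;
      this.seq = seq;
    }
  }

  // Hypothetical sentinel: no compaction finish event was seen.
  static final long NO_FINISH = -1;

  // Returns the seq of the last compaction finish event in log order.
  // A finish event with no matching start is NOT treated as an error,
  // since a clean tablet can be assigned WALs that contain only that event.
  static long findRecoverySeq(List<LogEvent> events) {
    long recoverySeq = NO_FINISH;
    for (LogEvent e : events) {
      if (e.type == EventType.COMPACTION_FINISH) {
        // Later events overwrite earlier ones, so the last finish wins.
        recoverySeq = e.seq;
      }
    }
    return recoverySeq;
  }

  public static void main(String[] args) {
    // A clean tablet whose WAL holds only a compaction finish event.
    List<LogEvent> cleanTablet =
        Arrays.asList(new LogEvent(EventType.COMPACTION_FINISH, 7));
    System.out.println(findRecoverySeq(cleanTablet)); // prints 7
  }
}
```

How the resulting seq # is then compared against mutation entries is left out here, because that detail is not described in this issue.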
@keith-turner
Contributor Author

To test this, Accumulo's data loss test suite was modified so that it can
randomly pause ingest; see apache/accumulo-testing#15. The problems in this
issue are unlikely to be seen with non-stop ingest.

@ctubbsii ctubbsii added this to the 1.9.2 milestone Jul 12, 2024