Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WAL race condition between zookeeper and metadata table #539

Merged
merged 1 commit into from Jun 29, 2018

Conversation

Projects
None yet
2 participants
@keith-turner
Copy link
Contributor

keith-turner commented Jun 22, 2018

I noticed this race condition while looking in to #535. This seems like a very low probability event because multiple conditions would need to be met in a small time window for data loss to occur as a result of this race.

  • Only tablet A references WAL_X in its otherLogs set
  • Thread T1 one clears otherLogs for tablet A
  • Thread T2 calls markUnusedWALs() removing WAL_X from zookeeper
  • Tablet server dies (tablet data that was in WAL_X is not in metadata table or recovery data)
  • Thread T1 would have updated the metadata table with a new file if tserver had not died

This change acquires a lock so that the call to markUnusedWALs() will block until the file is added to metadata table. This lock prevents zookeeper updates from happening before metadata table updates.

This bug only affects Accumulo 1.8.0 and later.

@keith-turner

This comment has been minimized.

Copy link
Contributor Author

keith-turner commented Jun 22, 2018

I am seeing some WAL and Replication ITs timeout, so the locking introduced in this PR may be causing a problem.

@ctubbsii
Copy link
Member

ctubbsii left a comment

LGTM, pending IT results

@keith-turner

This comment has been minimized.

Copy link
Contributor Author

keith-turner commented Jun 29, 2018

WriteAheadLogIT timed out once on Jenkins, it was automatically run again and it passed. I ran WriteAheadLogIT 6 or 7 times locally without issue.

@keith-turner keith-turner merged commit fa20939 into apache:1.9 Jun 29, 2018

1 of 2 checks passed

Jenkins Looks like there's a problem with this pull request
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@keith-turner keith-turner deleted the keith-turner:accumulo-535 branch Jun 29, 2018

@keith-turner

This comment has been minimized.

Copy link
Contributor Author

keith-turner commented Jul 18, 2018

This PR caused deadlock, which #545 attempted to fix. However #545 also caused deadlock, which #559 ultimately fixed. So #559 is really the fix for the problem described in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.