Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition when getting WALs for dead tserver #866

Merged
merged 1 commit into from Jan 2, 2019

Conversation

@keith-turner
Copy link
Contributor

commented Jan 2, 2019

When a tablet server dies the master gets its WALs from ZK. In ZK there
is a list of WALs per tserver. Each WAL in ZK has state that is either
OPEN, CLOSED, or UNREFERENCED. The master needs a list of OPEN and
CLOSED logs for dead tservers. While the master is trying to obtain
this list its possible that the Accumulo GC may delete an UNREFERENCED
WAL. If this happened then the code before this commit would return an
empty list of WALs. This could result in OPEN and CLOSED logs being
ignored for recovery which could result in data loss. This patch fixes
the race condition.

This bug was observed while looking into an exception I noticed while
writing test for #860. In the IT I was frequently calling another
WalStateManager function and noticed NoNodeExceptions. As a result I
examined how the entire class handled NoNode race conditions and found
this bug. I noticed some other possible NoNode race condition, but do
not think this these would occur with the current way the code is called.

Fix race condition when getting WALs for dead tserver
When a tablet server dies the master gets its WALs from ZK.  In ZK there
is a list of WALs per tserver. Each WAL in ZK has state that is either
OPEN, CLOSED, or UNREFERENCED.  The master needs a list of OPEN and
CLOSED logs for dead tservers.  While the master is trying to obtain
this list its possible that the Accumulo GC may delete an UNREFERENCED
WAL.  If this happened then the code before this commit would return an
empty list of WALs. This could result in OPEN and CLOSED logs being
ignored for recovery which could result in data loss.  This patch fixes
the race condition.

This bug was observed while looking into an exception I noticed while
writing test for #860.  In the IT I was frequently calling another
WalStateManager function and noticed NoNodeExceptions.  As a result I
examined how the entire class handled NoNode race conditions and found
this bug.  I noticed some other possible NoNode race condition, but do
not think this these would occur with the current way the code is called.
@keith-turner

This comment has been minimized.

Copy link
Contributor Author

commented Jan 2, 2019

I have been looking into this and I think it could only happen if the GC pulled info about an unreferenced WAL into memory before a tserver died AND then deleted the unreferenced WAL after the tserver died (while the master is trying to get the list). The GC seems to completely ignore WALs for any dead tservers that have tablets assigned to them.

@keith-turner keith-turner changed the base branch from master to 1.9 Jan 2, 2019

@keith-turner

This comment has been minimized.

Copy link
Contributor Author

commented Jan 2, 2019

Looking at the history of WalStateManager it seems like this may have been a bug since 1.8.0.

@keith-turner keith-turner merged commit 2c2840b into apache:1.9 Jan 2, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@keith-turner keith-turner deleted the keith-turner:wsm-race branch Jan 2, 2019

@ctubbsii ctubbsii added this to Done in 1.9.3 Jun 14, 2019

@ctubbsii ctubbsii added this to Done in 2.0.0 Jun 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
2 participants
You can’t perform that action at this time.