Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #854 handle many tablets referencing many WALs #860

Merged
merged 5 commits into from Jan 2, 2019

Conversation

@keith-turner
Copy link
Contributor

commented Dec 27, 2018

No description provided.

keith-turner added a commit to keith-turner/accumulo that referenced this pull request Jan 2, 2019
Fix race condition when getting WALs for dead tserver
When a tablet server dies the master gets its WALs from ZK.  In ZK there
is a list of WALs per tserver. Each WAL in ZK has state that is either
OPEN, CLOSED, or UNREFERENCED.  The master needs a list of OPEN and
CLOSED logs for dead tservers.  While the master is trying to obtain
this list its possible that the Accumulo GC may delete an UNREFERENCED
WAL.  If this happened then the code before this commit would return an
empty list of WALs. This could result in OPEN and CLOSED logs being
ignored for recovery which could result in data loss.  This patch fixes
the race condition.

This bug was observed while looking into an exception I noticed while
writing test for apache#860.  In the IT I was frequently calling another
WalStateManager function and noticed NoNodeExceptions.  As a result I
examined how the entire class handled NoNode race conditions and found
this bug.  I noticed some other possible NoNode race condition, but do
not think this these would occur with the current way the code is called.

@keith-turner keith-turner merged commit f3c8644 into apache:1.9 Jan 2, 2019

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details

@keith-turner keith-turner deleted the keith-turner:accumulo-854 branch Jan 2, 2019

keith-turner added a commit that referenced this pull request Jan 2, 2019
Fix race condition when getting WALs for dead tserver (#866)
When a tablet server dies the master gets its WALs from ZK.  In ZK there
is a list of WALs per tserver. Each WAL in ZK has state that is either
OPEN, CLOSED, or UNREFERENCED.  The master needs a list of OPEN and
CLOSED logs for dead tservers.  While the master is trying to obtain
this list its possible that the Accumulo GC may delete an UNREFERENCED
WAL.  If this happened then the code before this commit would return an
empty list of WALs. This could result in OPEN and CLOSED logs being
ignored for recovery which could result in data loss.  This patch fixes
the race condition.

This bug was observed while looking into an exception I noticed while
writing test for #860.  In the IT I was frequently calling another
WalStateManager function and noticed NoNodeExceptions.  As a result I
examined how the entire class handled NoNode race conditions and found
this bug.  I noticed some other possible NoNode race condition, but do
not think this these would occur with the current way the code is called.
keith-turner added a commit to keith-turner/accumulo that referenced this pull request Mar 1, 2019
Make max walogs a tserver property instead.
After the changes in apache#860 having a tablet property for max write ahead
logs no longer makes sense.  This should be a tserver property because
the behavior controls the max walogs per tserver.  This commit replaces
the table property with a tablet server property.
keith-turner added a commit that referenced this pull request Mar 4, 2019
Change max walogs from table to tserver property. (#1008)
After the changes in #860 having a tablet property for max write ahead
logs no longer makes sense.  This should be a tserver property because
the behavior controls the max walogs per tserver.  This commit replaces
the table property with a tablet server property.

@ctubbsii ctubbsii added this to Done in 1.9.3 Jun 14, 2019

@ctubbsii ctubbsii added this to Done in 2.0.0 Jun 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
4 participants
You can’t perform that action at this time.