Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-4203: Leader swallows the ZooKeeperServer.State.ERROR from Leader.LearnerCnxAcceptor in some concurrency condition #1596

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -189,9 +189,18 @@ public void dumpConf(PrintWriter pwriter) {
pwriter.print(self.getQuorumVerifier().toString());
}

private final Object stateChangeMutex = new Object();

@Override
protected void setState(State state) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a closer look I find that setState can be a single implementation in ZooKeeperServer. Just as is there. And you can apply this check as well as logging to verify as @eolivelli said.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @eolivelli that setState() has to be "careful," because ZooKeeperServerListenerImpl.notifyStopping calls it from "random" threads.

But @Tison's point still holds; it seems to me that this could be simplified by making setState() a synchronized method. Do you have a specific reason of using a separate object?

(The parent setState() does not need to be synchronized because the field is volatile.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But @Tison's point still holds; it seems to me that this could be simplified by making setState() a synchronized method. Do you have a specific reason of using a separate object?

(The parent setState() does not need to be synchronized because the field is volatile.)

@ztzg The rationale for not making setState() a synchronized method is that if a thread is running another synchronized method of this ZooKeeperServer object for some time, it may block the setState() invoked by another thread, and this invocation may be critical.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @eolivelli that setState() has to be "careful," because ZooKeeperServerListenerImpl.notifyStopping calls it from "random" threads.

Actually, the single-node ZooKeeperServer can handle this issue with the handler:

protected void setState(State state) {
this.state = state;
// Notify server state changes to the registered shutdown handler, if any.
if (zkShutdownHandler != null) {
zkShutdownHandler.handle(state);
} else {
LOG.debug(
"ZKShutdownHandler is not registered, so ZooKeeper server"
+ " won't take any action on ERROR or SHUTDOWN server state changes");
}
}

because it is registered at the very beginning:
zkServer.registerServerShutdownHandler(new ZooKeeperServerShutdownHandler(shutdownLatch));

This handling logic is:
public void handle(State state) {
if (state == State.ERROR || state == State.SHUTDOWN) {
shutdownLatch.countDown();
}
}

and then
// Watch status of ZooKeeper server. It will do a graceful shutdown
// if the server is not running or hits an internal error.
shutdownLatch.await();
shutdown();

However, in the case of QuorumZooKeeperServer, including Leader, Follower, etc., the aforementioned logic is not used.

this.state = state;
synchronized (stateChangeMutex) {
if (this.state == State.ERROR) {
if (state == State.RUNNING || state == State.INITIAL) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a relevant log in order to see if this case is happening.

Creating a test that reproduces the problem and demonstrates that this fix actually resolves the problem would be better

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add the test, do we need to consider using a fault injection tool? See ZOOKEEPER-3601 and #1135. I have provided Byteman injection script in ZOOKEEPER-4203 so it can be somehow translated into a test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree with @eolivelli on adding a LOG.warn (if not LOG.error).

For the test, I would suggest mocking and/or overriding some methods to explicitly generate the fault, if possible. I haven't tried doing so, but perhaps LearnerTest.java (which subclasses LearnerZooKeeperServer) can serve as inspiration?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree with @eolivelli on adding a LOG.warn (if not LOG.error).

Okay, I will add the log later.

For the test, I would suggest mocking and/or overriding some methods to explicitly generate the fault, if possible. I haven't tried doing so, but perhaps LearnerTest.java (which subclasses LearnerZooKeeperServer) can serve as inspiration?

Okay, I will take a try.
BTW, I think introducing a fault injection tool like #1135 may make it easier to write a test for it, and is helpful for the community. Another bug I propose (#1582) also require such tool to reproduce/test. I commented at #1135 but haven't got response. Do you have any suggestion or opinion? Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @functioner, I am very interested in this problem, I have written a test case, I don't konw if it can be used to test this problem.Maybe you can use it as a reference.
TestCase
Byteman btm script

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to use stateChangeMutex to handle concurrent access to state you have to use it at every read/write access.

Using AtomicReference is better and easier to understand.
I suggest to switch to AtomicReference and drop stateChangeMutex

}
}
this.state = state;
}
}

@Override
Expand Down