ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException#1798
ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException#1798eolivelli wants to merge 3 commits intoapache:masterfrom
Conversation
…age throws NullPointerException
anmolnar
left a comment
There was a problem hiding this comment.
Sorry for the delayed answer @eolivelli , I'm not actively monitoring ZK anymore unfortunately.
So, the problem has been introduced by this patch: b842cd4
It's only about a metric update if I understand the intention correctly. I think your patch is right, but I don't think we have to log this at the trace level, but doesn't do any harm.
Unit testing. I see some unit tests updated in the original patch which might cover this code part in one way or the other. Would you please take a very quick look at those tests and check if you're able to cover this scenario somehow with a test?
Thanks.
symat
left a comment
There was a problem hiding this comment.
LGTM. if you have time for a unit test (as Andor suggested) then I think it is a good idea to have it. But if it is complicated, then I'm happy with the patch as it is. It is harmless and fixing a trivial bug.
anmolnar
left a comment
There was a problem hiding this comment.
+1
I changed my mind reading @symat 's comment. Up to you @eolivelli .
Thanks!
|
thank you @anmolnar for your pointer b842cd4#diff-508b7388fc8bead40a3f6d652a9faeed4796d12fc84b9e8fbfe657a520398e3fR179 |
|
for reference, this is an alternative solution to #1770 |
|
my plan is to cherry-pick this patch to 3.7.x and to 3.6.x |
|
Great stuff! 👍 |
|
The test is passing for me locally |
|
C client unit test failed again: this test is using mocked ZooKeeper Server, not even starting the real Java server, so this is completely unrelated. The test is flaky or we have some environment issue on the CI nodes. Let's merge this PR regardless of this independent CI problem. |
…sage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - pravega/zookeeper-operator#393 Author: Enrico Olivelli <eolivelli@apache.org> Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org> Closes #1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc) Signed-off-by: Mate Szalay-Beko <symat@apache.org>
…sage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - pravega/zookeeper-operator#393 Author: Enrico Olivelli <eolivelli@apache.org> Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org> Closes #1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc) Signed-off-by: Mate Szalay-Beko <symat@apache.org>
|
PR merged on master, branch-3.7 and (with slight modification due to junit version differences) also on branch-3.6. Thanks for the fix @eolivelli ! |
|
On this PR: NettyServerCnxnFactory.java line 462 zkServer.serverStats().incrementAuthFailedCount(); Probably also could use an if (zkServer != null). There are uses of zks in NettyServerCnxn that I was also protecting from being null. A minimal fix is probably best but in testing just branch/PR I just get it stuck at launching Zookeeper 0: 14:34:25.541 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 25600 ms I suspect Zookeeper 0 is not responding with "I'm ok/up" so kubernetes never starts up the other two instances. I saw this before and I think I side stepped it by always allowing things to progress and reply "I'm ok/up". Not exactly sure why the NIO server context does not suffer this problem but just closing the connection may not be enough to fix it for Netty/Pulsar/Kubernetes. So not really sure how to progress this now unless we ignore SSL for Zookeeper/Pulsar/Kubernetes to keep the throttling hack and otherwise try with my original approach if throttling can't be avoided. Hopefully I've not messed up testing or something else in building your pull request. But probably we need some more testing with Pulsar/Kubernetes at the very least? |
|
I see this was merged just as I was writing up my testing. Fixes the NULL reference OK but may not be a fix for the original 3 node Zookeeper Pulsar in Kubernetes start up problem? #1770 |
|
@andrekramer1 thanks for testing this patch. This is interesting
I have checked again the NIO Server connection code and the behaviour should be the same as for Netty, if the zkServer is not available then it drops the connection.
I checked other accesses and it seemed to me that the other places should be safe (IIRC because the connection should have been dropped before reaching those parts, but I may be wrong) |
|
Possibly zkServer comes up for NIO and not for Netty? Zookeeper 0 seems to be running an election before replying "I'm ok/up" (new warnings from my test above). I waited a few minutes and it did not seem to recover or progress any further. This was just 3 nodes on one machine as I did not have our 3 node kubernetes environment to test on. Possibly the new release can be tested with the official Helm chart just to confirm where we are now. |
|
I did a simple test on current master (with this fix).
The result is "imok" with both NIO, but not with Netty so this fix is not enough and you are right !
|
|
I don't see the npe just a number of retries on the election and then the Zookeeper was restarted. So maybe it's a timing issue in that Kubernetes is not waiting long enough for Zookeeper node to say "imok"? Seems to be 5 mins between restarts. This is on a Pulsar 2.8.0 I built with the patched Zookeeper (from your branch). It says state is LOOKING and then there are UnknownHostExceptions maybe you are getting another networking error? Never seems to get further for me. So pod keeps restarting:Events: Warning FailedScheduling default-scheduler 0/1 nodes are available: 1 node(s) didn't find available persistent volumes to bind. |
…sage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - pravega/zookeeper-operator#393 Author: Enrico Olivelli <eolivelli@apache.org> Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org> Closes apache#1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc) Signed-off-by: Mate Szalay-Beko <symat@apache.org>
…sage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - pravega/zookeeper-operator#393 Author: Enrico Olivelli <eolivelli@apache.org> Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org> Closes apache#1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc) Signed-off-by: Mate Szalay-Beko <symat@apache.org>
Modifications:
Related to: