ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException by eolivelli · Pull Request #1798 · apache/zookeeper

eolivelli · 2022-01-20T11:05:23Z

Modifications:

prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests

Related to:

Zookeeper in 3 node Kubernetes cluster does not pass heath check in 2.8.0 pulsar#11070
All zK pods reports error "Zookeeper is not running" pravega/zookeeper-operator#393

…age throws NullPointerException

nicoloboschi

LGTM

anmolnar

Sorry for the delayed answer @eolivelli , I'm not actively monitoring ZK anymore unfortunately.

So, the problem has been introduced by this patch: b842cd4
It's only about a metric update if I understand the intention correctly. I think your patch is right, but I don't think we have to log this at the trace level, but doesn't do any harm.

Unit testing. I see some unit tests updated in the original patch which might cover this code part in one way or the other. Would you please take a very quick look at those tests and check if you're able to cover this scenario somehow with a test?

Thanks.

symat

LGTM. if you have time for a unit test (as Andor suggested) then I think it is a good idea to have it. But if it is complicated, then I'm happy with the patch as it is. It is harmless and fixing a trivial bug.

anmolnar

+1
I changed my mind reading @symat 's comment. Up to you @eolivelli .
Thanks!

eolivelli · 2022-01-20T12:59:44Z

thank you @anmolnar for your pointer
I have updated the test added by the patch that caused the problem

b842cd4#diff-508b7388fc8bead40a3f6d652a9faeed4796d12fc84b9e8fbfe657a520398e3fR179

eolivelli · 2022-01-20T13:04:31Z

for reference, this is an alternative solution to #1770

eolivelli · 2022-01-20T13:25:44Z

my plan is to cherry-pick this patch to 3.7.x and to 3.6.x

anmolnar · 2022-01-20T16:40:26Z

Great stuff! 👍

eolivelli · 2022-01-21T10:51:50Z

@symat @ztzg
CPP tests are always failing.
I am running them locally now.

the test that is failing looks unrelated to the patch

eolivelli · 2022-01-25T10:30:14Z

@symat @maoling would you have time to help here ?
the CPP tests are failing.

eolivelli · 2022-01-25T12:14:26Z

The test is passing for me locally

symat · 2022-01-25T12:26:10Z

C client unit test failed again:

  [exec] /home/runner/work/zookeeper/zookeeper/zookeeper-client/zookeeper-client-c/tests/TestOperations.cc:296: Assertion: equality assertion failed [Expected: 1, Actual  : 0]

this test is using mocked ZooKeeper Server, not even starting the real Java server, so this is completely unrelated. The test is flaky or we have some environment issue on the CI nodes. Let's merge this PR regardless of this independent CI problem.

…sage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - pravega/zookeeper-operator#393 Author: Enrico Olivelli <eolivelli@apache.org> Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org> Closes #1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc) Signed-off-by: Mate Szalay-Beko <symat@apache.org>

symat · 2022-01-25T13:13:56Z

PR merged on master, branch-3.7 and (with slight modification due to junit version differences) also on branch-3.6. Thanks for the fix @eolivelli !

andrekramer1 · 2022-01-25T14:47:13Z

On this PR: NettyServerCnxnFactory.java line 462 zkServer.serverStats().incrementAuthFailedCount(); Probably also could use an if (zkServer != null). There are uses of zks in NettyServerCnxn that I was also protecting from being null.
It would be better if the ServerCnxn flavours were more similar as you pointed out it does a close early on exceptions. Then it may not need the SSL handshake suppression for example.

A minimal fix is probably best but in testing just branch/PR I just get it stuck at launching Zookeeper 0:

14:34:25.541 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 25600 ms
14:34:39.674 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:34:47.441 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:34:51.147 [QuorumConnectionThread-[myid=1]-1] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 2 at election address pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:34:51.158 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 51200 ms
14:34:51.160 [QuorumConnectionThread-[myid=1]-1] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 3 at election address pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:35:09.729 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:35:17.480 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:35:39.785 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:35:42.371 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 60000 ms
14:35:42.371 [QuorumConnectionThread-[myid=1]-5] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 3 at election address pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:35:42.374 [QuorumConnectionThread-[myid=1]-4] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 2 at election address pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:35:47.425 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:36:09.738 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:36:17.448 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake

I suspect Zookeeper 0 is not responding with "I'm ok/up" so kubernetes never starts up the other two instances. I saw this before and I think I side stepped it by always allowing things to progress and reply "I'm ok/up". Not exactly sure why the NIO server context does not suffer this problem but just closing the connection may not be enough to fix it for Netty/Pulsar/Kubernetes.

So not really sure how to progress this now unless we ignore SSL for Zookeeper/Pulsar/Kubernetes to keep the throttling hack and otherwise try with my original approach if throttling can't be avoided.

Hopefully I've not messed up testing or something else in building your pull request. But probably we need some more testing with Pulsar/Kubernetes at the very least?

andrekramer1 · 2022-01-25T14:48:58Z

I see this was merged just as I was writing up my testing. Fixes the NULL reference OK but may not be a fix for the original 3 node Zookeeper Pulsar in Kubernetes start up problem? #1770

eolivelli · 2022-01-25T14:53:18Z

@andrekramer1 thanks for testing this patch.

This is interesting

I suspect Zookeeper 0 is not responding with "I'm ok/up" so kubernetes never starts up the other two instances. I saw this before and I think I side stepped it by always allowing things to progress and reply "I'm ok/up". Not exactly sure why the NIO server context does not suffer this problem but just closing the connection may not be enough to fix it for Netty/Pulsar/Kubernetes.

I have checked again the NIO Server connection code and the behaviour should be the same as for Netty, if the zkServer is not available then it drops the connection.

There are uses of zks in NettyServerCnxn that I was also protecting from being null

I checked other accesses and it seemed to me that the other places should be safe (IIRC because the connection should have been dropped before reaching those parts, but I may be wrong)

andrekramer1 · 2022-01-25T14:59:43Z

Possibly zkServer comes up for NIO and not for Netty? Zookeeper 0 seems to be running an election before replying "I'm ok/up" (new warnings from my test above). I waited a few minutes and it did not seem to recover or progress any further. This was just 3 nodes on one machine as I did not have our 3 node kubernetes environment to test on. Possibly the new release can be tested with the official Helm chart just to confirm where we are now.

eolivelli · 2022-01-25T16:20:09Z

I did a simple test on current master (with this fix).

Prepare a configuration file with 3 nodes
Start only the 1 node
the node is not able to create a quorum, but it is up and running
Send the 'ruok' command to the node
see the result

The result is "imok" with both NIO, but not with Netty

so this fix is not enough and you are right !

2022-01-25 17:20:48,289 [myid:1] - WARN [nioEventLoopGroup-4-14:NettyServerCnxnFactory$CnxnChannelHandler@231] - Zookeeper server is not running, close the connection before starting the TLS handshake

andrekramer1 · 2022-01-25T16:44:22Z

I don't see the npe just a number of retries on the election and then the Zookeeper was restarted. So maybe it's a timing issue in that Kubernetes is not waiting long enough for Zookeeper node to say "imok"? Seems to be 5 mins between restarts. This is on a Pulsar 2.8.0 I built with the patched Zookeeper (from your branch). It says state is LOOKING and then there are UnknownHostExceptions maybe you are getting another networking error? Never seems to get further for me. So pod keeps restarting:Events:
Type Reason Age From Message

Warning FailedScheduling default-scheduler 0/1 nodes are available: 1 node(s) didn't find available persistent volumes to bind.
Warning FailedScheduling default-scheduler 0/1 nodes are available: 1 node(s) didn't find available persistent volumes to bind.
Normal Scheduled default-scheduler Successfully assigned c8y-messaging-service/pulsar-zookeeper-0 to ip-172-18-252-196
Normal Started 14m kubelet, ip-172-18-252-196 Started container pulsar-zookeeper
Warning Unhealthy 14m kubelet, ip-172-18-252-196 Readiness probe failed: ZK server is not ok
localhost [127.0.0.1] 2181 (?) : Connection refused
Warning Unhealthy 9m58s (x9 over 13m) kubelet, ip-172-18-252-196 Readiness probe failed: ZK server is not ok
Normal Killing 9m38s kubelet, ip-172-18-252-196 Container pulsar-zookeeper failed liveness probe, will be restarted
Normal Created 9m37s (x2 over 14m) kubelet, ip-172-18-252-196 Created container pulsar-zookeeper
Normal Pulled 9m37s (x2 over 14m) kubelet, ip-172-18-252-196 Container image "pulsar:2.8.0" already present on machine
Warning Unhealthy 4m38s (x20 over 14m) kubelet, ip-172-18-252-196 Liveness probe failed: ZK server is not ok

…sage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - pravega/zookeeper-operator#393 Author: Enrico Olivelli <eolivelli@apache.org> Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org> Closes apache#1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc) Signed-off-by: Mate Szalay-Beko <symat@apache.org>

ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMess…

e734cea

…age throws NullPointerException

nicoloboschi approved these changes Jan 20, 2022

View reviewed changes

anmolnar requested changes Jan 20, 2022

View reviewed changes

symat approved these changes Jan 20, 2022

View reviewed changes

anmolnar approved these changes Jan 20, 2022

View reviewed changes

Add tests

d6fee8d

fix build

cea1a4a

asfgit closed this in 957f8fc Jan 25, 2022

eolivelli deleted the fix/ZOOKEEPER-3988-npe branch January 25, 2022 14:49

Comments

Conversation

eolivelli commented Jan 20, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nicoloboschi left a comment

Choose a reason for hiding this comment

Uh oh!

anmolnar left a comment

Choose a reason for hiding this comment

Uh oh!

symat left a comment

Choose a reason for hiding this comment

Uh oh!

anmolnar left a comment

Choose a reason for hiding this comment

Uh oh!

eolivelli commented Jan 20, 2022

Uh oh!

eolivelli commented Jan 20, 2022

Uh oh!

eolivelli commented Jan 20, 2022

Uh oh!

anmolnar commented Jan 20, 2022

Uh oh!

eolivelli commented Jan 21, 2022

Uh oh!

eolivelli commented Jan 25, 2022

Uh oh!

eolivelli commented Jan 25, 2022

Uh oh!

symat commented Jan 25, 2022

Uh oh!

symat commented Jan 25, 2022

Uh oh!

andrekramer1 commented Jan 25, 2022

Uh oh!

andrekramer1 commented Jan 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eolivelli commented Jan 25, 2022

Uh oh!

andrekramer1 commented Jan 25, 2022

Uh oh!

eolivelli commented Jan 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrekramer1 commented Jan 25, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

eolivelli commented Jan 20, 2022 •

edited

Loading

andrekramer1 commented Jan 25, 2022 •

edited

Loading

eolivelli commented Jan 25, 2022 •

edited

Loading