Skip to content

Comments

ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException#1798

Closed
eolivelli wants to merge 3 commits intoapache:masterfrom
eolivelli:fix/ZOOKEEPER-3988-npe
Closed

ZOOKEEPER-3988 rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException#1798
eolivelli wants to merge 3 commits intoapache:masterfrom
eolivelli:fix/ZOOKEEPER-3988-npe

Conversation

@eolivelli
Copy link
Contributor

@eolivelli eolivelli commented Jan 20, 2022

Modifications:

  • prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests

Related to:

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@anmolnar anmolnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed answer @eolivelli , I'm not actively monitoring ZK anymore unfortunately.

So, the problem has been introduced by this patch: b842cd4
It's only about a metric update if I understand the intention correctly. I think your patch is right, but I don't think we have to log this at the trace level, but doesn't do any harm.

Unit testing. I see some unit tests updated in the original patch which might cover this code part in one way or the other. Would you please take a very quick look at those tests and check if you're able to cover this scenario somehow with a test?

Thanks.

Copy link
Contributor

@symat symat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. if you have time for a unit test (as Andor suggested) then I think it is a good idea to have it. But if it is complicated, then I'm happy with the patch as it is. It is harmless and fixing a trivial bug.

Copy link
Contributor

@anmolnar anmolnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
I changed my mind reading @symat 's comment. Up to you @eolivelli .
Thanks!

@eolivelli
Copy link
Contributor Author

thank you @anmolnar for your pointer
I have updated the test added by the patch that caused the problem

b842cd4#diff-508b7388fc8bead40a3f6d652a9faeed4796d12fc84b9e8fbfe657a520398e3fR179

@eolivelli
Copy link
Contributor Author

for reference, this is an alternative solution to #1770

@eolivelli
Copy link
Contributor Author

my plan is to cherry-pick this patch to 3.7.x and to 3.6.x

@anmolnar
Copy link
Contributor

Great stuff! 👍

@eolivelli
Copy link
Contributor Author

@symat @ztzg
CPP tests are always failing.
I am running them locally now.

the test that is failing looks unrelated to the patch

@eolivelli
Copy link
Contributor Author

@symat @maoling would you have time to help here ?
the CPP tests are failing.

@eolivelli
Copy link
Contributor Author

The test is passing for me locally

@symat
Copy link
Contributor

symat commented Jan 25, 2022

C client unit test failed again:

  [exec] /home/runner/work/zookeeper/zookeeper/zookeeper-client/zookeeper-client-c/tests/TestOperations.cc:296: Assertion: equality assertion failed [Expected: 1, Actual  : 0]

this test is using mocked ZooKeeper Server, not even starting the real Java server, so this is completely unrelated. The test is flaky or we have some environment issue on the CI nodes. Let's merge this PR regardless of this independent CI problem.

@asfgit asfgit closed this in 957f8fc Jan 25, 2022
asfgit pushed a commit that referenced this pull request Jan 25, 2022
…sage throws NullPointerException

Modifications:
- prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests

Related to:
- apache/pulsar#11070
- pravega/zookeeper-operator#393

Author: Enrico Olivelli <eolivelli@apache.org>

Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org>

Closes #1798 from eolivelli/fix/ZOOKEEPER-3988-npe

(cherry picked from commit 957f8fc)
Signed-off-by: Mate Szalay-Beko <symat@apache.org>
asfgit pushed a commit that referenced this pull request Jan 25, 2022
…sage throws NullPointerException

Modifications:
- prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests

Related to:
- apache/pulsar#11070
- pravega/zookeeper-operator#393

Author: Enrico Olivelli <eolivelli@apache.org>

Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org>

Closes #1798 from eolivelli/fix/ZOOKEEPER-3988-npe

(cherry picked from commit 957f8fc)
Signed-off-by: Mate Szalay-Beko <symat@apache.org>
@symat
Copy link
Contributor

symat commented Jan 25, 2022

PR merged on master, branch-3.7 and (with slight modification due to junit version differences) also on branch-3.6. Thanks for the fix @eolivelli !

@andrekramer1
Copy link

On this PR: NettyServerCnxnFactory.java line 462 zkServer.serverStats().incrementAuthFailedCount(); Probably also could use an if (zkServer != null). There are uses of zks in NettyServerCnxn that I was also protecting from being null.
It would be better if the ServerCnxn flavours were more similar as you pointed out it does a close early on exceptions. Then it may not need the SSL handshake suppression for example.

A minimal fix is probably best but in testing just branch/PR I just get it stuck at launching Zookeeper 0:

14:34:25.541 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 25600 ms
14:34:39.674 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:34:47.441 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:34:51.147 [QuorumConnectionThread-[myid=1]-1] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 2 at election address pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:34:51.158 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 51200 ms
14:34:51.160 [QuorumConnectionThread-[myid=1]-1] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 3 at election address pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:35:09.729 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:35:17.480 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:35:39.785 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:35:42.371 [QuorumPeermyid=1(secure=disabled)] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification time out: 60000 ms
14:35:42.371 [QuorumConnectionThread-[myid=1]-5] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 3 at election address pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-2.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:35:42.374 [QuorumConnectionThread-[myid=1]-4] WARN org.apache.zookeeper.server.quorum.QuorumCnxManager - Cannot open channel to 2 at election address pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local:3888
java.net.UnknownHostException: pulsar-zookeeper-1.pulsar-zookeeper.c8y-messaging-service.svc.cluster.local
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:609) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:384) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:458) [org.apache.zookeeper-zookeeper-3.8.0-SNAPSHOT.jar:3.8.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
14:35:47.425 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:36:09.738 [epollEventLoopGroup-4-1] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake
14:36:17.448 [epollEventLoopGroup-4-2] WARN org.apache.zookeeper.server.NettyServerCnxnFactory - Zookeeper server is not running, close the connection before starting the TLS handshake

I suspect Zookeeper 0 is not responding with "I'm ok/up" so kubernetes never starts up the other two instances. I saw this before and I think I side stepped it by always allowing things to progress and reply "I'm ok/up". Not exactly sure why the NIO server context does not suffer this problem but just closing the connection may not be enough to fix it for Netty/Pulsar/Kubernetes.

So not really sure how to progress this now unless we ignore SSL for Zookeeper/Pulsar/Kubernetes to keep the throttling hack and otherwise try with my original approach if throttling can't be avoided.

Hopefully I've not messed up testing or something else in building your pull request. But probably we need some more testing with Pulsar/Kubernetes at the very least?

@andrekramer1
Copy link

andrekramer1 commented Jan 25, 2022

I see this was merged just as I was writing up my testing. Fixes the NULL reference OK but may not be a fix for the original 3 node Zookeeper Pulsar in Kubernetes start up problem? #1770

@eolivelli eolivelli deleted the fix/ZOOKEEPER-3988-npe branch January 25, 2022 14:49
@eolivelli
Copy link
Contributor Author

@andrekramer1 thanks for testing this patch.

This is interesting

I suspect Zookeeper 0 is not responding with "I'm ok/up" so kubernetes never starts up the other two instances. I saw this before and I think I side stepped it by always allowing things to progress and reply "I'm ok/up". Not exactly sure why the NIO server context does not suffer this problem but just closing the connection may not be enough to fix it for Netty/Pulsar/Kubernetes.

I have checked again the NIO Server connection code and the behaviour should be the same as for Netty, if the zkServer is not available then it drops the connection.

There are uses of zks in NettyServerCnxn that I was also protecting from being null

I checked other accesses and it seemed to me that the other places should be safe (IIRC because the connection should have been dropped before reaching those parts, but I may be wrong)

@andrekramer1
Copy link

Possibly zkServer comes up for NIO and not for Netty? Zookeeper 0 seems to be running an election before replying "I'm ok/up" (new warnings from my test above). I waited a few minutes and it did not seem to recover or progress any further. This was just 3 nodes on one machine as I did not have our 3 node kubernetes environment to test on. Possibly the new release can be tested with the official Helm chart just to confirm where we are now.

@eolivelli
Copy link
Contributor Author

eolivelli commented Jan 25, 2022

I did a simple test on current master (with this fix).

  1. Prepare a configuration file with 3 nodes
  2. Start only the 1 node
  3. the node is not able to create a quorum, but it is up and running
  4. Send the 'ruok' command to the node
  5. see the result

The result is "imok" with both NIO, but not with Netty

so this fix is not enough and you are right !

2022-01-25 17:20:48,289 [myid:1] - WARN [nioEventLoopGroup-4-14:NettyServerCnxnFactory$CnxnChannelHandler@231] - Zookeeper server is not running, close the connection before starting the TLS handshake

@andrekramer1
Copy link

I don't see the npe just a number of retries on the election and then the Zookeeper was restarted. So maybe it's a timing issue in that Kubernetes is not waiting long enough for Zookeeper node to say "imok"? Seems to be 5 mins between restarts. This is on a Pulsar 2.8.0 I built with the patched Zookeeper (from your branch). It says state is LOOKING and then there are UnknownHostExceptions maybe you are getting another networking error? Never seems to get further for me. So pod keeps restarting:Events:
Type Reason Age From Message


Warning FailedScheduling default-scheduler 0/1 nodes are available: 1 node(s) didn't find available persistent volumes to bind.
Warning FailedScheduling default-scheduler 0/1 nodes are available: 1 node(s) didn't find available persistent volumes to bind.
Normal Scheduled default-scheduler Successfully assigned c8y-messaging-service/pulsar-zookeeper-0 to ip-172-18-252-196
Normal Started 14m kubelet, ip-172-18-252-196 Started container pulsar-zookeeper
Warning Unhealthy 14m kubelet, ip-172-18-252-196 Readiness probe failed: ZK server is not ok
localhost [127.0.0.1] 2181 (?) : Connection refused
Warning Unhealthy 9m58s (x9 over 13m) kubelet, ip-172-18-252-196 Readiness probe failed: ZK server is not ok
Normal Killing 9m38s kubelet, ip-172-18-252-196 Container pulsar-zookeeper failed liveness probe, will be restarted
Normal Created 9m37s (x2 over 14m) kubelet, ip-172-18-252-196 Created container pulsar-zookeeper
Normal Pulled 9m37s (x2 over 14m) kubelet, ip-172-18-252-196 Container image "pulsar:2.8.0" already present on machine
Warning Unhealthy 4m38s (x20 over 14m) kubelet, ip-172-18-252-196 Liveness probe failed: ZK server is not ok

desaikomal pushed a commit to linkedin/zookeeper that referenced this pull request Jun 17, 2023
…sage throws NullPointerException

Modifications:
- prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests

Related to:
- apache/pulsar#11070
- pravega/zookeeper-operator#393

Author: Enrico Olivelli <eolivelli@apache.org>

Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org>

Closes apache#1798 from eolivelli/fix/ZOOKEEPER-3988-npe

(cherry picked from commit 957f8fc)
Signed-off-by: Mate Szalay-Beko <symat@apache.org>
desaikomal pushed a commit to linkedin/zookeeper that referenced this pull request Jun 27, 2023
…sage throws NullPointerException

Modifications:
- prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests

Related to:
- apache/pulsar#11070
- pravega/zookeeper-operator#393

Author: Enrico Olivelli <eolivelli@apache.org>

Reviewers: Nicolo² Boschi <boschi1997@gmail.com>, Andor Molnar <andor@apache.org>, Mate Szalay-Beko <symat@apache.org>

Closes apache#1798 from eolivelli/fix/ZOOKEEPER-3988-npe

(cherry picked from commit 957f8fc)
Signed-off-by: Mate Szalay-Beko <symat@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants