-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Describe the bug
I tried using Pulsar in Kubernetes and it's deployed using helm. After one broker pod crashed and came back, I found the whole Pulsar didn't work. I used pulsar-perf to publish message and the log showed as below. I can confirm 10.0.129.19 is the pulsar proxy IP exposed by kubernetes service. It's reachable from client. And the broker's log indicate there's a client connection connected.
2022-02-07T09:40:03,110
+0000 [pulsar-client-io-2-3] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [chenyltopic2] Failed to get partitioned topic metadata: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
2022-02-07T09:40:03,111+0000 [pulsar-client-io-2-3] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0xdb0f2346, L:/10.240.2.68:52414 - R:10.0.129.19/10.0.129.19:6650] Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
2022-02-07T09:40:03,111+0000 [pulsar-perf-producer-exec-1-1] ERROR org.apache.pulsar.testclient.PerformanceProducer - Got error
java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_312]
at org.apache.pulsar.testclient.PerformanceProducer.runProducer(PerformanceProducer.java:595) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
at org.apache.pulsar.testclient.PerformanceProducer.lambda$main$1(PerformanceProducer.java:425) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_312]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1204) ~[org.apache.pulsar-pulsar-client-original-2.9.1.jar:2.9.1]
at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[io.netty-netty-transport-classes-epoll-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
To Reproduce
The issue happened many times.
- Use pulsar-perf to publish some messages at first and it worked well.
- And then one broker OOM crashed as I'm sending message too quickly.
- I saw many errors and I terminated the perf process and start another one a few seconds later. Then I the issue happened and I can't publish any messages as topic lookup timeout. I tried using different topic name (name that never used before) but doesn't help.
Expected behavior
Everything can recover after broker crash.
Screenshots
If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
- OS: Ubuntu 18.04
- Pulsar: 2.9.1
Additional context
Add any other context about the problem here.