Skip to content

Topic lookup timeout and can't recover after broker crash #14183

@zackliu

Description

@zackliu

Describe the bug
I tried using Pulsar in Kubernetes and it's deployed using helm. After one broker pod crashed and came back, I found the whole Pulsar didn't work. I used pulsar-perf to publish message and the log showed as below. I can confirm 10.0.129.19 is the pulsar proxy IP exposed by kubernetes service. It's reachable from client. And the broker's log indicate there's a client connection connected.

2022-02-07T09:40:03,110
+0000 [pulsar-client-io-2-3] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [chenyltopic2] Failed to get partitioned topic metadata: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
2022-02-07T09:40:03,111+0000 [pulsar-client-io-2-3] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0xdb0f2346, L:/10.240.2.68:52414 - R:10.0.129.19/10.0.129.19:6650] Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
2022-02-07T09:40:03,111+0000 [pulsar-perf-producer-exec-1-1] ERROR org.apache.pulsar.testclient.PerformanceProducer - Got error
java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) ~[?:1.8.0_312]
at org.apache.pulsar.testclient.PerformanceProducer.runProducer(PerformanceProducer.java:595) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
at org.apache.pulsar.testclient.PerformanceProducer.lambda$main$1(PerformanceProducer.java:425) ~[org.apache.pulsar-pulsar-testclient-2.9.1.jar:2.9.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_312]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_312]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: Lookup request timeout {'durationMs': '30000', 'reqId':'1671309513597037281', 'remote':'10.0.129.19/10.0.129.19:6650', 'local':'/10.240.2.68:52414'}
at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1204) ~[org.apache.pulsar-pulsar-client-original-2.9.1.jar:2.9.1]
at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[io.netty-netty-transport-classes-epoll-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]

To Reproduce
The issue happened many times.

  1. Use pulsar-perf to publish some messages at first and it worked well.
  2. And then one broker OOM crashed as I'm sending message too quickly.
  3. I saw many errors and I terminated the perf process and start another one a few seconds later. Then I the issue happened and I can't publish any messages as topic lookup timeout. I tried using different topic name (name that never used before) but doesn't help.

Expected behavior
Everything can recover after broker crash.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
  • Pulsar: 2.9.1

Additional context
Add any other context about the problem here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions