
[Bug] Producer experienced TimeoutError with direct memory increased #22212

@YanshuoH

Description

Search before asking

  • I searched in the issues and found nothing similar.

Version

Pulsar broker (and other components): v3.1.2
Pulsar client: Go SDK v0.8.1 / v0.9.0

Minimal reproduce step

As the incident happened in a production environment, I tried to simulate the conditions based on two hypotheses:

  1. bookie slow response
  2. broker pending requests high

Unfortunately, the direct memory growth could not be reproduced with the following steps:

  1. set the broker config maxPendingPublishRequestsPerConnection to 1
  2. set the client OperationTimeout and other timeout-related parameters to a low value (e.g. 100ms)
  3. create a producer under high load and produce messages to persistent://public/default/t1
  4. to simulate a slow bookie response, either:
    • unload the topic, or
    • kill the bookie in charge of the topic
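For reference, step 1 corresponds to this broker.conf fragment (step 2 is set on the client side through the Go SDK's timeout options):

```properties
# broker.conf — cap pending publish requests so connection throttling kicks in quickly
maxPendingPublishRequestsPerConnection=1
```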

Then the following two kinds of errors occurred:

  1. send operation: 2024/03/06 08:12:33 sendMessage failed: producer.Send: message send timeout: TimeoutError
  2. create new producer operation using the same connection: ERRO[0012] Failed to create producer at send PRODUCER request error="request timed out" topic="persistent://public/default/t1-partition-0"

What did you expect to see?

  1. Direct memory should not spike
  2. The producer should not encounter TimeoutError

What did you see instead?

  1. Direct memory kept increasing
  2. Producer sends timed out and new producer creation on the connection failed

Anything else?

This issue is rather rare: we have used Pulsar for years, always keeping the version fresh (from v2.2.x to v3.2.0), and we have never encountered such a problem before.

I have dug into the source code, and my guess for the direct memory growth is that ledger.asyncAddEntry did not complete (its async callback never ran), so, as the code comment notes, the buffer was never released.

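The suspected leak can be modeled with a small self-contained sketch. This is not the actual Pulsar/BookKeeper code; the names (asyncAddEntry, addComplete) and the reference-counted buffer only mirror the shape of that path: the buffer is retained before the async add and released only in the completion callback, so a callback that never fires leaks the extra reference.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Minimal stand-in for a Netty ByteBuf-style reference-counted buffer.
class RefCountedBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    RefCountedBuffer retain() { refCnt.incrementAndGet(); return this; }
    boolean release() { return refCnt.decrementAndGet() == 0; }
    int refCnt() { return refCnt.get(); }
}

public class LeakSketch {
    // Models the asyncAddEntry shape: retain the buffer for the duration of
    // the async write; release happens only inside the completion callback.
    static void asyncAddEntry(RefCountedBuffer buf, boolean callbackFires,
                              Consumer<RefCountedBuffer> addComplete) {
        buf.retain();
        if (callbackFires) {
            addComplete.accept(buf); // normal path: callback releases
        }
        // If the callback never runs (e.g. the bookie never responds),
        // the extra reference is never dropped and memory accumulates.
    }

    public static void main(String[] args) {
        RefCountedBuffer ok = new RefCountedBuffer();
        asyncAddEntry(ok, true, RefCountedBuffer::release);
        System.out.println("completed entry refCnt = " + ok.refCnt());

        RefCountedBuffer stuck = new RefCountedBuffer();
        asyncAddEntry(stuck, false, RefCountedBuffer::release);
        System.out.println("stuck entry refCnt = " + stuck.refCnt());
    }
}
```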

Here are also some other findings that may help identify the root cause.

  1. Throttled connection

The code in charge of toggling pulsar_broker_throttled_connections is org.apache.pulsar.broker.service.ServerCnx#startSendOperation and org.apache.pulsar.broker.service.ServerCnx#enableCnxAutoRead.

That means that during that period, ServerCnx's pendingSendRequest must have reached maxPendingSendRequests, which is 1000 in our configuration.

I can also confirm that no publish rate limit is set on any of the topics.
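The throttling behavior can be sketched as a self-contained model (this is an assumption-laden simplification, not the ServerCnx source: the exact resume threshold and flag handling in the broker may differ): a pending-send counter disables socket auto-read when it hits the maximum, and re-enables it once enough sends drain.

```java
public class ThrottleSketch {
    // Assumed thresholds mirroring maxPendingSendRequests and a half-way
    // resume point; values chosen small so the model is easy to trace.
    static final int MAX_PENDING = 4;
    static final int RESUME_THRESHOLD = MAX_PENDING / 2;

    int pendingSendRequest = 0;
    boolean autoRead = true;
    int throttledConnectionsGauge = 0;

    void startSendOperation() {
        if (++pendingSendRequest == MAX_PENDING) {
            autoRead = false;            // stop reading from the socket
            throttledConnectionsGauge++; // pulsar_broker_throttled_connections goes up
        }
    }

    void completedSendOperation() {
        if (--pendingSendRequest == RESUME_THRESHOLD && !autoRead) {
            autoRead = true;             // resume reading once partially drained
            throttledConnectionsGauge--;
        }
    }

    public static void main(String[] args) {
        ThrottleSketch cnx = new ThrottleSketch();
        for (int i = 0; i < MAX_PENDING; i++) cnx.startSendOperation();
        System.out.println("autoRead after burst = " + cnx.autoRead);
        for (int i = 0; i < MAX_PENDING - RESUME_THRESHOLD; i++) cnx.completedSendOperation();
        System.out.println("autoRead after drain = " + cnx.autoRead);
    }
}
```

The key point for this incident: while autoRead is false, no new commands (including PRODUCER) are read from that connection.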

  2. Metadata store executor queue

While disableCnxAutoRead is in effect, the PRODUCER command on that connection cannot succeed, so we see many logs like the following:

2024-03-04T09:29:56,302+0000 [pulsar-io-3-1] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.120.159.82:59748] Closed producer before its creation was completed. producerId=61

Then the producer will try to reconnect, which results in tons of metadata store operations in org.apache.pulsar.broker.service.BrokerService#getOrCreateTopic.
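The fan-out effect can be illustrated with a toy model (the producer and retry counts are made up for illustration; the queue stands in for the metadata store's executor queue): every failed producer-create retry goes back through topic lookup, so one throttled connection multiplies into many queued metadata operations.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ReconnectStormSketch {
    public static void main(String[] args) {
        // Stand-in for the metadata store executor's work queue.
        Queue<String> metadataQueue = new ArrayDeque<>();
        int producers = 100;   // producers stuck behind a throttled connection (assumed)
        int retriesEach = 5;   // reconnect attempts during the incident (assumed)

        // Each failed CREATE_PRODUCER retry re-enters getOrCreateTopic,
        // enqueueing another metadata operation.
        for (int attempt = 0; attempt < retriesEach; attempt++) {
            for (int p = 0; p < producers; p++) {
                metadataQueue.add("getOrCreateTopic(producer-" + p + ", attempt " + attempt + ")");
            }
        }
        System.out.println("queued metadata operations = " + metadataQueue.size());
    }
}
```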

  3. For what it's worth, the broker / bookie / ZooKeeper logs show no exceptions or errors, and their CPU / memory / JVM metrics (except the broker's direct memory) look fine.

The peaks around 17:40 correspond to the time I restarted the brokers in question.

  4. As we have deployed our Pulsar clusters on AWS, my first guess was a networking issue, but after confirming with AWS technical support and looking into the CloudWatch metrics, both networking and disk metrics seem fine:
  • no packet drops or high network usage
  • disk queue depth is reasonable, as are the read / write metrics

Are you willing to submit a PR?

  • I'm willing to submit a PR!
