producer continually refreshing client metadata after broker disconnection #4577

Closed
jpiper opened this issue Jan 9, 2024 · 8 comments

@jpiper

jpiper commented Jan 9, 2024

Description

We are producing into a topic with 84 partitions spread over 8 brokers. We are seeing an issue where a recoverable error condition (difficult to reproduce, but one example is a transport failure that breaks a broker connection) triggers the underlying librdkafka library to continually refresh the metadata for this topic at the retry.backoff.max.ms interval, even though every metadata request succeeds and all partitions have leaders.

For example, here is the error condition occurring, triggered by a new connection to a broker being broken just ~8s after entering the UP state:

%7|1704816898.865|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected to ipv4#10.255.253.3:443
%7|1704816898.865|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state CONNECT -> SSL_HANDSHAKE
%7|1704816898.865|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.886|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.888|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.907|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.907|CONNECTED|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected (#13)
%7|1704816898.907|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
%7|1704816898.907|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 6 message(s) queued but broker not up
%7|1704816898.926|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state APIVERSION_QUERY -> UP
%7|1704816898.926|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Skipping metadata refresh of 1 topic(s): connected: already being requested
%7|1704816906.827|FAIL|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Disconnected (after 7900ms in state UP) (_TRANSPORT)
%6|1704816906.827|FAIL|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Disconnected (after 7900ms in state UP)
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state UP -> DOWN
%7|1704816906.827|REQERR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: ProduceRequest failed: Local: Broker transport failure: explicit actions Refresh,MsgPossiblyPersisted
%7|1704816906.827|BROKERUA|producer.host#producer-2| [thrd:ssl://broker-6:4]: produce_topic [51]: broker unavailable: produce: Local: Broker transport failure
%7|1704816906.827|FASTQUERY|producer.host#producer-2| [thrd:ssl://broker-6:4]: Starting fast leader query
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Requesting metadata for 1/1 topics: broker down
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-5:443/5: Request metadata for 1 topic(s): broker down
%7|1704816906.827|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: Request metadata for 1 topic(s): partition leader query
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state DOWN -> INIT
%7|1704816906.827|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state INIT -> TRY_CONNECT
%7|1704816906.827|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: broker in state TRY_CONNECT connecting
%7|1704816906.827|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state TRY_CONNECT -> CONNECT
%7|1704816906.843|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: ===== Received metadata (for 1 requested topics): broker down =====
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.859|CONNECT|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected to ipv4#10.255.253.3:443
%7|1704816906.859|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state CONNECT -> SSL_HANDSHAKE
%7|1704816906.859|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.878|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.880|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.897|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.897|CONNECTED|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Connected (#14)
%7|1704816906.897|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state SSL_HANDSHAKE -> APIVERSION_QUERY
%7|1704816906.897|TOPPAR|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: produce_topic [51] 4 message(s) queued but broker not up
%7|1704816906.912|STATE|producer.host#producer-2| [thrd:ssl://broker-6:4]: ssl://broker-6:443/6: Broker changed state APIVERSION_QUERY -> UP
%7|1704816906.912|METADATA|producer.host#producer-2| [thrd:ssl://broker-6:4]: Skipping metadata refresh of 1 topic(s): connected: already being requested
%7|1704816906.927|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query

And here we can see that even though every one of these metadata requests returns successfully (I've truncated the responses for readability) and all partitions have leaders, the client keeps refreshing the metadata as if something were broken. Note that during all of this the producer is working fine and messages are being delivered fine, but we have noticed on our servers that the number of metadata requests from our clients is huge.

%7|1704816906.843|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: ===== Received metadata (for 1 requested topics): partition leader query =====
%7|1704816906.858|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816906.927|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 100ms later
%7|1704816906.943|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816907.115|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 200ms later
%7|1704816907.131|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816907.475|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 400ms later
%7|1704816907.491|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816908.227|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 800ms later
%7|1704816908.243|METADATA|producer.host#producer-2| [thrd:main]: ssl://broker-5:443/5: 1/1 requested topic(s) seen in metadata
%7|1704816909.227|METADATA|producer.host#producer-2| [thrd:main]: Requesting metadata for 1/1 topics: partition leader query <-- 1s later

It looks to me like there could be some sort of race condition around broker disconnects/reconnects and metadata refreshing?
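
For reference, the intervals above look like the retry backoff doubling from retry.backoff.ms (default 100 ms) up to the retry.backoff.max.ms cap (default 1000 ms). A rough sketch of that arithmetic (my reading of the pattern, not librdkafka internals):

#include <stdio.h>

/* Sketch of the backoff pattern observed above: start at retry.backoff.ms
 * (default 100 ms) and double on each refresh until capped at
 * retry.backoff.max.ms (default 1000 ms): 100, 200, 400, 800, 1000, ... */
int main(void) {
    int backoff_ms = 100;      /* retry.backoff.ms default */
    const int max_ms = 1000;   /* retry.backoff.max.ms default */
    for (int i = 0; i < 8; i++) {
        printf("refresh #%d after ~%d ms\n", i + 1, backoff_ms);
        backoff_ms *= 2;
        if (backoff_ms > max_ms)
            backoff_ms = max_ms;
    }
    return 0;
}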

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 2.3.0
  • Apache Kafka version: 3.4.0
  • librdkafka client configuration: message.max.bytes=25165824;socket.timeout.ms=10000;socket.keepalive.enable=true;debug=metadata,broker,topic (see the sketch after this list)
  • Operating system: centos7
  • Provide logs (with debug=.. as necessary) from librdkafka
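
For completeness, a minimal sketch of how the configuration above is applied when the producer is created (the broker address is a placeholder and error handling is omitted):

#include <librdkafka/rdkafka.h>

/* Minimal sketch of applying the configuration reported in the checklist.
 * "broker-1:443" is a placeholder; error handling is omitted. */
static rd_kafka_t *create_producer(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    rd_kafka_conf_set(conf, "bootstrap.servers", "broker-1:443", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "message.max.bytes", "25165824", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "socket.timeout.ms", "10000", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "socket.keepalive.enable", "true", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "debug", "metadata,broker,topic", errstr, sizeof(errstr));

    return rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
}
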
@jpiper
Author

jpiper commented Jan 10, 2024

Reverting to 2.2.0 has fixed the issue, so I assume this is a regression introduced in 2.3.0.

@richardartoul

I can confirm this as well. Combined with the very low value of retry.backoff.max.ms (1s), the latest version of librdkafka can easily turn into a DDoS cannon the moment a retriable error is encountered.

@mensfeld

> I can confirm this as well. Combined with the very low value of retry.backoff.max.ms (1s), the latest version of librdkafka can easily turn into a DDoS cannon the moment a retriable error is encountered.

Wouldn't increasing retry.backoff.max.ms mitigate this?

@jpiper
Author

jpiper commented Jan 29, 2024

> I can confirm this as well. Combined with the very low value of retry.backoff.max.ms (1s), the latest version of librdkafka can easily turn into a DDoS cannon the moment a retriable error is encountered.

> Wouldn't increasing retry.backoff.max.ms mitigate this?

ehh... if by mitigate you mean "reduce the impact of the bug on the kafka cluster" - it doesn't actually solve the issue at hand.

@richardartoul

Yeah that's a viable work-around in the interim, but this behavior is clearly a bug and there is a lot of software out there that isn't gonna increase that value. It's really common for people to have thousands of Kafka clients/producers and one day they're going to upgrade and everything will work fine until one of their connections has an intermittent error in the middle of the night and their cluster gets melted :/
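
For anyone who needs the interim mitigation until a fix ships, a minimal sketch of raising the cap (the 10000 ms value is purely illustrative; this only slows the refresh loop, it doesn't fix it):

#include <librdkafka/rdkafka.h>

/* Interim mitigation sketch only: raising retry.backoff.max.ms slows the
 * runaway metadata refresh but does not fix the underlying bug.
 * The 10000 ms value is illustrative, not a recommendation. */
static void apply_backoff_workaround(rd_kafka_conf_t *conf) {
    char errstr[512];
    rd_kafka_conf_set(conf, "retry.backoff.ms", "100", errstr, sizeof(errstr));       /* default */
    rd_kafka_conf_set(conf, "retry.backoff.max.ms", "10000", errstr, sizeof(errstr)); /* raised from the 1000 ms default */
}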

@mensfeld

mensfeld commented Jan 29, 2024

@jpiper sorry, this is what I meant. I am a maintainer of the Ruby bindings and I usually do my best to ship intermediate fixes when stuff like this occurs. This is why I asked. I may consider changing the defaults for my lib until a patch is released.

// edit - I see the default for librdkafka is 1 second (1000 ms), so as long as it's not altered, it should be ok.

@jpiper
Author

jpiper commented Jan 29, 2024

> ... until one of their connections has an intermittent error in the middle of the night and their cluster gets melted :/

I feel seen

@emasab
Collaborator

emasab commented Feb 27, 2024

There is a problem in v2.3.0: when metadata is refreshed without any leader change, the fast metadata refresh is not stopped. We'll include a fix for this in the next release.
