Skip to content

Producer message reliability

Magnus Edenhill edited this page Jul 13, 2017 · 2 revisions

This page outlines common producer failure status and how they are handled.

Waiting for active partition leader

The following cases are handled identically in librdkafka:

  • entire cluster is down
  • no leader for partition
  • no active connection to the partition leader

In all these cases the message(s) remain in the partition's queue waiting for an active connection to the partition leader. The client will perform metadata refreshes at regular intervals to check if the leader has changed, and it will also try to reconnect (forever) to any brokers that are down.

Unresponsive brokers

When the producer sends a ProduceRequest to the broker it will put the message(s) on the wait-response queue (waitresp). The ProduceRequest's protocol timeout is set to the timeout of the first message in the ProduceRequest batch, i.e., the oldest message. (The message timeout is based on message.timeout.ms, the timeout scanner runs roughly once per second (sub-second timeouts are thus meaningless)). If no response is received from the broker within the request timeout the request fails and the message(s) in the request are failed, the RD_KAFKA_RESP_ERR__TIMED_OUT error will be propagated through the delivery report. No retries will be made at this point since the message.timeout.ms has been reached. On the other hand, if the connection is closed before a response is received and before the timeout hits, then metadata is refreshed (to find out if there is a new partition leader that will take over from the down broker) and the messages are put back on the partition queue for a future retransmission. Do note though that this retransmission does not increment the retry count; retries are incremented for temporary server-side errors, not connection losses (which might just be a sign of network instability).

Temporary server-side errors

For temporary errors returned by the broker, such asERR_REQUEST_TIMED_OUT, the request is retried in its entirety and the retry counter is incremented by one.