Produced messages get stuck in rkb_retrybufs until Local: Message timed out
#1432
Comments
Very good report! Do you have more than one broker in the cluster, and if so, are other brokers picking up leadership for the killed broker's partitions?
Thanks! The cluster has 3 brokers. Yep, it seems that other brokers are picking up the leadership. Leadership report before the broker stop:
And after the stop:
…1432, #1476, #1421) ProduceRequest retries are reworked to not retry the request itself, but put the messages back on the partition queue (while maintaining input order) and then have an upcoming ProduceRequest include the messages again. Retries are now calculated per message rather than ProduceRequest and the retry backoff is also enforced on a per-message basis. The input order of messages is retained during this whole process, which should guarantee ordered delivery if max.in.flight=1 but with retries > 0. The new behaviour is formalised through documentation (INTRODUCTION.md)
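The per-message retry scheme described in that changelog entry can be illustrated with a toy model. This is a pure-Python sketch with hypothetical names (`Message`, `produce_batch`, `flaky` sender), not librdkafka's actual code: failed messages go back to the head of the partition queue in input order, and the retry count is tracked per message rather than per ProduceRequest.

```python
import collections
import itertools

class Message:
    """Toy stand-in for a produced message (hypothetical, not rd_kafka_msg_t)."""
    _seq = itertools.count()

    def __init__(self, payload):
        self.payload = payload
        self.seq = next(Message._seq)  # original input order
        self.retries = 0               # per-message retry counter

def produce_batch(queue, send, max_retries):
    """One 'ProduceRequest': drain the partition queue, re-queue failures.

    Failed messages go back to the *head* of the queue in input order,
    and retries are counted per message, not per request.
    """
    batch = list(queue)
    queue.clear()
    delivered, failed = [], []
    for m in batch:
        (delivered if send(m) else failed).append(m)
    keep = []
    for m in failed:
        m.retries += 1
        if m.retries <= max_retries:
            keep.append(m)
    # extendleft reverses its argument, so sort descending by sequence
    # number to restore the original input order at the queue head
    queue.extendleft(sorted(keep, key=lambda m: m.seq, reverse=True))
    return delivered

queue = collections.deque(Message(p) for p in "abc")
calls = itertools.count(1)
send = lambda m: next(calls) > 3       # the first whole request fails

first = produce_batch(queue, send, max_retries=2)   # nothing delivered
second = produce_batch(queue, send, max_retries=2)  # retried, in order
```

With `max.in.flight=1` there is only ever one batch outstanding, so this re-queue-at-head scheme preserves ordering even across retries, which is the guarantee the changelog entry formalises.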
This is now fixed on master.
Hello, I think this should be re-opened. I ran the steps listed in the "How to reproduce" section again, but with librdkafka 0.11.4 and Kafka 0.11.0.2. I froze and then stopped the partition leader, and after 5 mins (the default `message.timeout.ms`) the messages timed out. Also notice that the retransmission happened right after the […].
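The behaviour being discussed here — retry buffers only being flushed back out once the broker returns to the UP state, and messages otherwise aging out at `message.timeout.ms` — can be sketched as a toy timeline. This is a pure-Python simulation with hypothetical names and timings, not librdkafka's actual implementation:

```python
# Toy model of the hypothesis: rkb_retrybufs is only drained back to the
# output queue while the broker state is UP; otherwise messages age out
# once they exceed message.timeout.ms.
MESSAGE_TIMEOUT_MS = 300_000  # librdkafka's default message.timeout.ms

def run(broker_up_at_ms, tick_ms=60_000):
    retrybufs = [{"msg": "m1", "enqueued": 0}]
    outbufs, failed, now = [], [], 0
    while retrybufs and now <= 2 * MESSAGE_TIMEOUT_MS:
        now += tick_ms
        state_up = broker_up_at_ms is not None and now >= broker_up_at_ms
        # messages past message.timeout.ms fail with a local timeout
        for buf in list(retrybufs):
            if now - buf["enqueued"] >= MESSAGE_TIMEOUT_MS:
                retrybufs.remove(buf)
                failed.append((buf["msg"], "Local: Message timed out"))
        if state_up:
            # analogue of rd_kafka_broker_retry_bufs_move(): retry
            # buffers move to the output queue only in the UP state
            outbufs.extend(retrybufs)
            retrybufs.clear()
    return outbufs, failed

# leader never comes back: the message times out instead of being resent
sent, failed = run(broker_up_at_ms=None)
# leader returns after 2 minutes: the retry buffer is flushed and resent
sent2, failed2 = run(broker_up_at_ms=120_000)
```

In this model a leader that stays down past the timeout always produces `Local: Message timed out`, matching the symptom in the issue title, while a leader that returns within the timeout gets its retry buffer flushed.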
Do you have any debug logs from this occurrence?
I re-ran the reproduction steps with […]. But still, most of the time I get the usual `Local: Message timed out`. The attached debug logs come from a high message loss occurrence:
I've got exactly the same error, with librdkafka 0.11.4 and Kafka 1.0.0.
Description
Hello,
The "Producer message reliability # Unresponsive brokers" wiki section states the following:
In practice, however, it seems that the above retransmission happens only if the old partition leader returns to the UP state. If we leave the old partition leader down for a period longer than `message.timeout.ms`, the messages seem to get stuck in the `rkb_retrybufs` queue until they hit the `message.timeout.ms` and then fail.

In the code, if we follow the producer-related caller chain of `rd_kafka_broker_retry_bufs_move()`, which moves the messages from the retry queue to the outbuf queue, we see that it is called only under a `case RD_KAFKA_BROKER_STATE_UP` (src/rdkafka_broker.c:3150), which probably proves the above hypothesis.

How to reproduce
While kafkacat produces the messages, we run the following on the partition leader broker:
kafkacat output:
After 5 mins (the default `message.timeout.ms`):

Checklist
Please provide the following information:
- logs (with `debug=..` as necessary) from librdkafka