Message stalled in queue when broker (partition leader) dies #1913
Comments
If you are going to lower …, try increasing …
@rnpridgeon I agree with you, but like I said, it's just to make the troubleshooting easier.
The issue seems to be that the ProduceRequests in rkb_outbufs (requests waiting to be sent) are left on the queue rather than having their error-checking callback triggered, which in the ProduceRequest case would move the messages back to the partition queue. This was introduced in a commit over a year ago: 70cf144
There are multiple parts to this fix (a configuration sketch follows the list):
* A request/response handler can now check whether a failing request was actually sent out on the wire or never made it past the output queue.
* The retry code now only increments the retry count for requests that were actually sent.
* When the broker connection goes down, requests in the output queue are purged so that their handler callbacks are called, which in turn triggers a (now free) retry.
* ProduceRequests are not retried, but their messages are put back on the partition queue (existing behaviour).
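For context, the client-side settings that interact with this retry behaviour are librdkafka's `message.send.max.retries`, `retry.backoff.ms`, and the overall `message.timeout.ms`. A minimal sketch in confluent-kafka-python (the broker address is hypothetical):

```python
from confluent_kafka import Producer

# Sketch only: the property names are real librdkafka settings, the broker
# address is hypothetical. After the fix described above, a request that
# never left the output queue no longer consumes one of these retries.
p = Producer({
    "bootstrap.servers": "broker1:9092",  # hypothetical address
    "message.send.max.retries": 2,        # retries counted per actually-sent request
    "retry.backoff.ms": 100,              # back-off between retries
    "message.timeout.ms": 300000,         # overall per-message delivery limit
})
```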
I confirm your commit corrects the issue, so I guess you can close it. Do you have an idea when it will be released in librdkafka and in confluent-kafka-python? In my opinion it's a critical issue because of the message loss.
Since the message is stuck in an internal queue and the application never gets a callback for a message it has produced, the messages are technically not lost: the application knows it tried to produce a message but never got a response back. A lost message would be one for which librdkafka triggered a successful delivery report but never actually delivered the message. We'll schedule a maintenance release soon.
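A minimal sketch of that delivery-report contract: every produced message eventually triggers exactly one callback, either success or an error such as _MSG_TIMED_OUT after message.timeout.ms expires. The topic name, broker address, and payload below are illustrative.

```python
from confluent_kafka import KafkaError, Producer

def delivery_cb(err, msg):
    if err is None:
        # Only now may the application treat the message as delivered.
        print(f"delivered to {msg.topic()}[{msg.partition()}]@{msg.offset()}")
    elif err.code() == KafkaError._MSG_TIMED_OUT:
        # Not delivered: the application must decide whether to re-produce.
        print(f"timed out, not delivered: {err}")
    else:
        print(f"failed: {err}")

p = Producer({"bootstrap.servers": "broker1:9092"})  # hypothetical address
p.produce("test-topic", b"payload", on_delivery=delivery_cb)
p.flush(30)  # serves callbacks until all outstanding messages are resolved
```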
Description
When a broker that is the leader of a partition goes down while a producer is producing messages at a continuous rate, some messages time out once message.timeout.ms expires, resulting in message loss.
It seems the messages stay in a queue somewhere and are not re-sent to the new leader.
I see repeated log messages like this:
Full log:
github_kafka-producer.log.gz
The consumer:
How to reproduce
Produce messages at a continuous rate, then kill -9 the broker that is the leader of the partition.
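A minimal reproduction sketch along those lines (broker addresses, topic name, and produce rate are hypothetical; it assumes a replicated topic on a multi-broker cluster):

```python
import time
from confluent_kafka import Producer

# Produce at a continuous rate; while this runs, `kill -9` the broker that
# leads the topic's partition and watch for message-timeout errors in the
# delivery callbacks.

def delivery_cb(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

p = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "request.required.acks": -1,  # wait for all in-sync replicas
    "max.in.flight": 1,           # as in the report, to keep the logs readable
})

while True:
    p.produce("test-topic", b"tick", on_delivery=delivery_cb)
    p.poll(0)        # serve delivery callbacks
    time.sleep(0.1)  # continuous, steady produce rate
```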
Checklist
Please provide the following information:
* confluent-kafka-python and librdkafka version: v0.11.5
* Broker version: 4.0.0
* Client configuration: api.version.request=True, socket.keepalive.enable=True, request.required.acks=-1, max.in.flight=1, error_cb=error_cb, on_delivery=delivery_cb, debug=all
* Operating system: RHEL 7 (x64)
PS: I use max.in.flight=1 only to avoid flooding the log; if you remove this parameter you will observe multiple message timeouts.
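For reference, the configuration above expressed as confluent-kafka-python code (a sketch; the bootstrap address and topic are hypothetical, and in this client error_cb goes in the config dict while on_delivery is passed per produce() call):

```python
from confluent_kafka import Producer

def error_cb(err):
    # Global (non-message) errors, e.g. broker connection failures.
    print(f"error_cb: {err}")

def delivery_cb(err, msg):
    print("ok" if err is None else f"failed: {err}")

conf = {
    "bootstrap.servers": "broker1:9092",  # hypothetical
    "api.version.request": True,
    "socket.keepalive.enable": True,
    "request.required.acks": -1,
    "max.in.flight": 1,
    "debug": "all",                       # verbose librdkafka debug logging
    "error_cb": error_cb,
}

p = Producer(conf)
p.produce("some-topic", b"msg", on_delivery=delivery_cb)
p.flush()
```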