Recovery takes too long after stopping and restarting a broker in a cluster #2508
examples/rdkafka_performance -P -t test -s 343 -c 500000000 -b <> -X linger.ms=1 -X queue.buffering.max.messages=10000000 -X batch.num.messages=1000000 -X message.max.bytes=20971520 -X queued.max.messages.kbytes=2097151 -X enable.idempotence=true -D
Set min.insync.replicas=1. After the second broker restarts, sending hangs for a while; the cause looks related to rebalancing: %7|1568019280.284|REQERR|rdkafka#producer-1| [thrd:192.168.4.2:9095/bootstrap]: 192.168.4.2:9095/1: ProduceRequest failed: Broker: Not leader for partition: actions Refresh,MsgNotPersisted
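For reference, a minimal C sketch of the equivalent producer configuration using librdkafka's conf API (the bootstrap address and the make_repro_producer helper are placeholders, not from the report):

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Build a producer with the settings from the repro command line.
 * "192.168.4.2:9095" is a placeholder bootstrap address. */
static rd_kafka_t *make_repro_producer(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        const char *opts[][2] = {
                { "bootstrap.servers",            "192.168.4.2:9095" },
                { "linger.ms",                    "1" },
                { "queue.buffering.max.messages", "10000000" },
                { "batch.num.messages",           "1000000" },
                { "message.max.bytes",            "20971520" },
                { "queued.max.messages.kbytes",   "2097151" },
                { "enable.idempotence",           "true" },
        };
        size_t i;

        for (i = 0; i < sizeof(opts) / sizeof(opts[0]); i++) {
                if (rd_kafka_conf_set(conf, opts[i][0], opts[i][1],
                                      errstr, sizeof(errstr)) !=
                    RD_KAFKA_CONF_OK) {
                        fprintf(stderr, "%s: %s\n", opts[i][0], errstr);
                        rd_kafka_conf_destroy(conf);
                        return NULL;
                }
        }

        /* rd_kafka_new() takes ownership of conf on success. */
        return rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
}
```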
At about this point in the logs it stops sending completely, with perf top showing that it is busy sorting in rd_kafka_msgq_enq_sorted0:
1568020658 -> somewhere around here it stops sending, and 1568021996 -> somewhere around here it starts sending again.
Backtrace:
Interrupted with debugger while sorting
Fantastic! 💯
Breakpoint before sorting and while sorting
After sorting (src and dest are identical but attached for completeness' sake)
Consecutive message ranges in the attached files:
Looking at *-before-sort we see that srcq has one missing range, which is perfectly filled by the destq. With the current insert sort method the naive approximate cost is roughly |srcq| × |destq| comparisons in the worst case, so there's clearly room for improvement.
I agree. But I'm even more worried about why the srcq grows together with the destq.
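To make the cost argument concrete, a toy model (not librdkafka's msgq code, just an illustration of the dump pattern above): srcq is missing one msgid range and destq holds exactly that range. Inserting each srcq message with a head-to-tail scan of destq costs on the order of N×M comparisons, while one merge pass over both sorted queues costs about N+M:

```c
#include <stdio.h>
#include <stdlib.h>

#define N_SRC 20000     /* retried messages (srcq)         */
#define N_DST 10000     /* messages already queued (destq) */

int main(void) {
        long *src = malloc(sizeof(*src) * N_SRC);
        long *dst = malloc(sizeof(*dst) * N_DST);
        long long naive = 0, merged = 0;
        size_t i, j;

        /* srcq: msgids 1..10000 and 20001..30000 (a gap in the middle). */
        for (i = 0; i < N_SRC / 2; i++)
                src[i] = (long)i + 1;
        for (i = N_SRC / 2; i < N_SRC; i++)
                src[i] = (long)i + 1 + N_DST;
        /* destq: msgids 10001..20000, exactly the missing range. */
        for (j = 0; j < N_DST; j++)
                dst[j] = (long)j + 1 + N_SRC / 2;

        /* Naive insert sort: each srcq msgid scans destq from the head. */
        for (i = 0; i < N_SRC; i++)
                for (j = 0; j < N_DST; j++) {
                        naive++;
                        if (src[i] < dst[j])
                                break;
                }

        /* Range-aware alternative: one merge pass with two cursors. */
        for (i = 0, j = 0; i < N_SRC && j < N_DST; ) {
                merged++;
                if (src[i] < dst[j])
                        i++;
                else
                        j++;
        }

        printf("naive insert scan: %lld comparisons\n", naive);
        printf("single merge pass: %lld comparisons\n", merged);
        free(src);
        free(dst);
        return 0;
}
```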
… retry (#2508) The msgq insert code now properly handles interleaved and overlapping message range inserts, which may occur during Producer retries for high-throughput applications.
… retry (confluentinc#2508) The msgq insert code now properly handles interleaved and overlapping message range inserts, which may occur during Producer retries for high-throughput applications. (cherry picked from commit 3e6c64d)
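A rough sketch of the idea behind such a fix, not the actual librdkafka implementation: rather than re-inserting each retried message with its own scan, merge the two msgid-sorted queues in a single pass, which handles interleaved ranges naturally:

```c
#include <stdio.h>
#include <stdlib.h>

struct msg {
        long msgid;
        struct msg *next;
};

/* Merge two msgid-sorted lists into one, in a single O(N+M) pass. */
static struct msg *msgq_merge(struct msg *src, struct msg *dst) {
        struct msg head = { 0, NULL };
        struct msg *tail = &head;

        while (src && dst) {
                if (src->msgid < dst->msgid) {
                        tail->next = src;
                        src = src->next;
                } else {
                        tail->next = dst;
                        dst = dst->next;
                }
                tail = tail->next;
        }
        tail->next = src ? src : dst;
        return head.next;
}

/* Build a list from an array of already-sorted msgids. */
static struct msg *mklist(const long *ids, size_t n) {
        struct msg *head = NULL, **pp = &head;
        size_t i;
        for (i = 0; i < n; i++) {
                *pp = calloc(1, sizeof(**pp));
                (*pp)->msgid = ids[i];
                pp = &(*pp)->next;
        }
        return head;
}

int main(void) {
        /* srcq with a gap, destq filling it: the pattern from the dumps. */
        const long src_ids[] = { 1, 2, 3, 7, 8, 9 };
        const long dst_ids[] = { 4, 5, 6 };
        struct msg *m = msgq_merge(mklist(src_ids, 6), mklist(dst_ids, 3));

        for (; m; m = m->next)
                printf("%ld ", m->msgid);   /* prints 1..9 in order */
        printf("\n");
        return 0;
}
```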
Description
Stopping a broker for a while (while producing at a high rate) and then restarting it leads to message loss when the broker rejoins the cluster (basically because librdkafka sees the brokers as DOWN). I don't think this should be the case.
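As a side note (a minimal sketch, not part of the original report): with enable.idempotence=true, the delivery report callback can distinguish messages the broker definitely did not persist from ones it possibly did, using rd_kafka_message_status(), which helps quantify the loss:

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Delivery report callback; register with
 * rd_kafka_conf_set_dr_msg_cb(conf, dr_msg_cb) before creating the producer. */
static void dr_msg_cb(rd_kafka_t *rk,
                      const rd_kafka_message_t *rkmessage, void *opaque) {
        (void)rk;
        (void)opaque;

        if (!rkmessage->err)
                return;  /* delivered and persisted */

        switch (rd_kafka_message_status(rkmessage)) {
        case RD_KAFKA_MSG_STATUS_NOT_PERSISTED:
                /* The broker never wrote this message: safe to re-produce. */
                fprintf(stderr, "NOT persisted: %s\n",
                        rd_kafka_err2str(rkmessage->err));
                break;
        case RD_KAFKA_MSG_STATUS_POSSIBLY_PERSISTED:
                /* Outcome unknown; idempotence normally prevents duplicates
                 * if the message is retried. */
                fprintf(stderr, "possibly persisted: %s\n",
                        rd_kafka_err2str(rkmessage->err));
                break;
        default:
                break;
        }
}
```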
How to reproduce
Stop a broker in a cluster for a while then restart it.
Checklist
Please provide the following information:
librdkafka version: 1.0.1
Apache Kafka version: 2.3.0
librdkafka client configuration: "linger.ms=1;queue.buffering.max.messages=10000000;batch.num.messages=100000;message.max.bytes=20971520;queued.max.messages.kbytes=2097151;enable.idempotence=true"
Operating system: CentOS 7 (x64)
Attaching stack trace and logs.
stacktrace.txt
I will try to provide more data as I figure out more details.