kafkaProducer.Events() stops sending callback events, then kafkaProducer.ProduceChannel() fills and becomes blocked indefinitely. #251
Comments
Interesting. And this is regardless of any error events in the cluster, such as a broker going down or a partition leader becoming unavailable?
If there is an issue within the Kafka cluster, it only affects a single client server at a time, and is remedied by a manual restart. There are multiple server instances running the exact same code sending messages to the same Kafka cluster. One will hit this issue and require a manual restart to resume sending to the Kafka cluster, while the others will not experience any hiccups and run completely smoothly.
@kurtostfeld have you had any luck debugging this issue? @edenhill I am seeing a similar issue, except more accelerated: the producer stops working after a few hours of running. I have a sample bit of code that reproduces the issue consistently. It's just a minimal webserver that writes the body of incoming requests into a topic. After a few million messages (5-6 million) the producer stops sending messages. Memory consumption looks fine (I'm monitoring the container this code runs in and there is plenty of memory free). I don't think I'm leaking goroutines: when I check pprof the number of goroutines is not growing, and in fact when the producer stops working, there are only 14 goroutines present according to pprof.
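For reference, a minimal sketch of the kind of server described (hypothetical code, not the actual repro; the topic name and listen address are illustrative; note that nothing here reads p.Events(), which turns out to matter below):

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	topic := "requests" // illustrative topic name

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, err := ioutil.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Write each incoming request body into the topic via the channel interface.
		p.ProduceChannel() <- &kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
			Value:          body,
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```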
Here is a stack trace of the active goroutines:
@trtg You don't seem to read from the producer's Events channel, which causes it to fill up with delivery reports.
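A minimal sketch of such a drain loop (assuming p is the *kafka.Producer and the standard confluent-kafka-go event types):

```go
// Drain the Events channel so delivery reports can't accumulate
// and eventually stall the producer.
go func() {
	for e := range p.Events() {
		switch ev := e.(type) {
		case *kafka.Message: // per-message delivery report
			if ev.TopicPartition.Error != nil {
				log.Printf("delivery failed: %v", ev.TopicPartition.Error)
			}
		case kafka.Error: // client/transport-level errors
			log.Printf("producer error: %v", ev)
		}
	}
}()
```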
@edenhill thanks for the clarification; I didn't realize consuming the delivery reports was required. I'll rework things to do that. Just out of curiosity, however, where would go.delivery.reports be set? It's not mentioned in CONFIGURATION.md
It is a Go-level config property; it is documented in the NewProducer docs.
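For example (a sketch; the broker address is a placeholder), delivery reports can be disabled at producer creation if you never inspect them:

```go
// go.delivery.reports is a Go-client-level property (hence absent from
// CONFIGURATION.md, which only covers librdkafka properties).
p, err := kafka.NewProducer(&kafka.ConfigMap{
	"bootstrap.servers":   "localhost:9092",
	"go.delivery.reports": false, // suppress per-message delivery reports on Events()
})
if err != nil {
	log.Fatal(err)
}
```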
I've modified my sending code to the following. This code will run for days to weeks before it hits the issue. When it does, the first log message below shows up but the second never does. What I was hoping to do, when this rare event occurs, was simply close and reopen the Kafka producer. Unfortunately, Close() blocks, so that strategy doesn't work. Is there anything else I can try to resolve this issue?

```go
select {
case p.kafkaProducer.ProduceChannel() <- &msg:
	// Success.
default:
	// There seems to be a relatively rare bug in the Go Kafka client, where the
	// ProduceChannel can fill up and the client needs to be manually reset.
	log.Println("Kafka ProduceChannel() full. Resetting channel...")
	// FYI, this is a Prometheus metric counter.
	kafkaClientResetsCounter.Inc()
	// Unfortunately, the code hangs here. This Close() never completes, and the
	// application has to be manually restarted at this point.
	p.kafkaProducer.Close()
	p.kafkaProducer = nil
	log.Println("Reset channel complete.")
}
```
@edenhill, any comment on the above? Is there anything else I can try? I guess I could start adding logs and breakpoints to the confluent-kafka-go code to debug into that?
@kurtostfeld we are facing a similar issue where we see the error message
@kurtostfeld The channel producer will block on its internal produce() call until there is room in the send queue, which unfortunately also blocks Close().
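One possible mitigation (my sketch, not something confirmed in this thread) is to use the function-based Produce() instead of ProduceChannel(); it fails fast with ErrQueueFull when the internal queue is full rather than blocking:

```go
// Produce() returns immediately with an error when the queue is full,
// so the caller can drop, buffer, or retry instead of blocking forever.
if err := p.Produce(&msg, nil); err != nil {
	if ke, ok := err.(kafka.Error); ok && ke.Code() == kafka.ErrQueueFull {
		log.Println("Kafka queue full; dropping message")
	} else {
		log.Printf("produce failed: %v", err)
	}
}
```

Note this only avoids blocking the caller; the Events() channel still needs to be drained as described above, or the queue will fill up regardless.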
Hi everyone,

We were apparently hitting the same issue after upgrading to librdkafka 0.11.6. Over the course of 3 days, 5 of our 300 consumers/producers experienced an unexpected hang (no further processing was done and one core was at 100% CPU usage). Some of these producers were also complaining of full queues. This happened at apparently random times. If you grabbed a profile from one of these stalled apps, it would look like the following:

We've tracked this down to these 2 issues:
Given that these fixes are very simple and have already been merged into master but haven't been released in a stable version, we patched them into the 0.11.6 release, rebuilt librdkafka with them and haven't had another stall for over 5 days now.
@AlexJF That's great! Thank you! We'll have a new v1.0.0 release (with the fixes) of both librdkafka and confluent-kafka-go within a week or two.
FWIW, I'm still seeing the issues in 1.0.0 and 1.10 on alpine3.10. Is there anything I can do to help debug this?
I too have the exact same issue. |
Description
Sporadic issue where the Kafka Go client works fine for days, then stops sending callback events to kafkaProducer.Events(); minutes later, kafkaProducer.ProduceChannel() becomes blocked and stops accepting new messages. The Go app will never send or accept more messages until it is manually restarted.

How to reproduce
It takes days of running in production for this issue to occur. It has happened multiple times. I don't have a way to reproduce it easily.
Checklist
Please provide the following information:
confluent-kafka-go and librdkafka version (LibraryVersion()): librdkafka1 version 0.11.6~1confluent5.0.1
Client configuration: &kafka.ConfigMap{ "bootstrap.servers": strings.Join(bootstrapServers, ","), }
Client logs (with "debug": ".." as necessary): None.