Producer memory leak with delivery.report.only.error #84
@fillest This gets tricky because the issue could be in Python or C, or even due to some bad interaction between the two. You can find a few strategies for starting to pin down heap issues on the Python side here: http://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html They'll probably require some code changes and a redeploy. Probably the most interesting thing to do initially would be to use heapy to get a diff over a substantial amount of time (e.g. at least a few minutes) and then find out what in that diff is eating the most memory. If the diff looks too small to account for the difference, that would suggest we're missing cleaning something up on the C side. Of course, the more relevant stats you can get for us, the better.
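For reference, a minimal heap-diff sketch with guppy/heapy (assuming the guppy package is installed in the workers; the sleep interval is arbitrary) might look like:

```python
# Hypothetical sketch: snapshot the heap, let the worker serve traffic for a
# while, then print what has been allocated since the snapshot.
from guppy import hpy
import time

hp = hpy()
hp.setrelheap()      # make the current heap the reference point

time.sleep(300)      # let the producer handle traffic for a few minutes

print(hp.heap())     # objects allocated since setrelheap(), largest consumers first
```

Running this from a background thread in one worker and comparing the reported growth with the RSS growth would show whether the leak is visible to Python at all.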
Could you monitor
I've added a thread printing that. I will try to debug using memory tools (thanks for the link) later (sometime in December, I think) because I don't have enough time now, unfortunately. As a band-aid, I'm turning on auto-restarting workers after every N requests for now.
Thank you for taking the time to investigate this.
Any progress on this @fillest?
I ran into this leak today as well. I created a test producer program that sends 20k messages to a Kafka topic and waits for them to be flushed. Running the test program under a debug Python build and valgrind helped me pinpoint the leak. Version of confluent-kafka-python: 0.9.1.2
It looks like
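A minimal reproduction along the lines of the test program described above might look like the sketch below (broker address and topic name are placeholders, not taken from the original report):

```python
# Hypothetical reproduction: produce 20k messages with
# delivery.report.only.error enabled, then flush. Under valgrind and a debug
# Python build, per-message state that never reaches a delivery report
# callback shows up as leaked memory.
from confluent_kafka import Producer

p = Producer({
    'bootstrap.servers': 'localhost:9092',   # placeholder broker
    'delivery.report.only.error': True,
})

def on_error(err, msg):
    # With this setting, only failed deliveries ever reach the callback.
    print('Delivery failed: {}'.format(err))

for i in range(20000):
    p.produce('test-topic', value='msg-{}'.format(i), callback=on_error)
    p.poll(0)   # serve any pending delivery reports

p.flush()       # block until all outstanding messages are delivered or fail
```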
Ah, I think I know what the problem is: since a msgstate is allocated for each produce, it needs to be freed in the delivery report handler, but setting delivery.report.only.error to True means that the delivery report handler isn't called for successfully delivered messages, just as you say, so those msgstates will leak. There is no obvious fix for this, so I must advise against the use of delivery.report.only.error and have your application check the error status in the delivery report handler instead.
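A sketch of that recommended pattern, leaving delivery.report.only.error unset and filtering inside the handler (broker and topic are placeholders):

```python
from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'localhost:9092'})  # placeholder broker

def delivery_report(err, msg):
    # Runs for every message, so the per-message state allocated by produce()
    # is always released; the application only reacts to failures.
    if err is not None:
        print('Delivery failed for {}: {}'.format(msg.topic(), err))

p.produce('test-topic', value='payload', callback=delivery_report)
p.poll(0)
p.flush()
```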
You are right, I am indeed setting delivery.report.only.error to True. I kinda like having that flag because it makes the callback implementation in Python easy. Does the throughput drop if the delivery report handler is called for each message instead of only for failed messages?
It is librdkafka skipping the callback, so the Python code really doesn't have any means of freeing its msgstate. There isn't really a viable solution for this, so we'll just have to disallow using delivery.report.only.error in Python, or abstract the functionality and filter out successful DRs in the Python bindings instead. I don't think performance is a big concern, but you'll have to try it to see; performance is really hard to talk about in general terms since it depends on so many environment-specific things.
@edenhill Seems like we don't actually have to disallow it. Rather, we might want to intercept this setting and force our own callback that delegates to the user callback in the case of an error and, in other cases, handles cleaning up the message state. We might want to avoid that since it's somewhat misleading (we will be executing a callback for every message, regardless of whether there is an error or not). If so, we should explicitly disable this option for Python, but if we think it's still useful even with some small (C callback) overhead, we might want to use the approach I described.
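In Python terms, the interception being described amounts to something like the wrapper below (purely illustrative; the actual fix lives in the C bindings):

```python
def wrap_only_error(user_callback):
    # Always run an internal callback so the per-message state can be cleaned
    # up, but forward the delivery report to the user's handler only when the
    # delivery actually failed.
    def internal_callback(err, msg):
        # ...binding-level cleanup of the per-message state happens here...
        if err is not None:
            user_callback(err, msg)
    return internal_callback
```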
Handle delivery.report.only.error in Python (#84)
This is fixed on master, can you give it a try @fillest?
Bug in 'confluent-kafka': confluentinc/confluent-kafka-python#84. Let's revisit when 0.11 is released.
Sorry, nope, we've switched from Python to Go.
confluent-kafka==0.9.2, librdkafka 0.9.2, Python 2.7.6, Ubuntu 14.04
I run a web app in 16 gunicorn workers (processes). Each has a producer, which works continually until the process is restarted (normally by the deploy system). This instance serves ~100-250 req/s (some of which produce to Kafka) and has ~3.8 GB of memory. After switching to confluent-kafka and running it for some time in production, I'm observing worker RSS memory only growing and growing (that was not the case before with kafka-python). Here is a screenshot of the memory monitoring graph:
The code is roughly like this (so I poll(0) after each produce). What should I do now? Do you have a suggestion for how to debug it? (Preferably right in production: it is kind of a testing production, so any overhead is currently affordable.)
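The original snippet isn't preserved here; a rough sketch of the pattern being described (one long-lived producer per worker, poll(0) after each produce; broker, topic and settings are assumptions based on the rest of the thread) would be:

```python
from confluent_kafka import Producer

# Created once per gunicorn worker and reused for the life of the process.
producer = Producer({
    'bootstrap.servers': 'localhost:9092',    # placeholder
    'delivery.report.only.error': True,       # the setting this issue is about
})

def handle_request(payload):
    producer.produce('events', value=payload)  # placeholder topic
    producer.poll(0)  # serve delivery report callbacks without blocking
```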