Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Producer shouldn't use multiprocessing #151

Closed
snaury opened this issue Apr 1, 2014 · 9 comments
Closed

Producer shouldn't use multiprocessing #151

snaury opened this issue Apr 1, 2014 · 9 comments
Labels

Comments

@snaury
Copy link
Contributor

snaury commented Apr 1, 2014

Producer, when in async mode, uses multiprocessing. Unfortunately this breaks when application itself already uses multiprocessing, because nested processes are not allowed by multiprocessing. What's even more important is that this forces messages to be synchronously pickled and sent over the channel (multiprocessing.Queue) which should be unnecessary. I wanted to use Producer with batch_send=True, but due to these two reasons I effectively can't. Since async and batching is probably more about I/O wouldn't it be better to use threads for async Producers?

@wizzat
Copy link
Collaborator

wizzat commented Apr 2, 2014

So are you talking about having multiple threads write to the same producer, or a producer per thread? IIRC the Connection class is not reentrant and doesn't play well with threads.

@snaury
Copy link
Contributor Author

snaury commented Apr 2, 2014

Neither. I'm saying that kafka/producer.py should use threading.Thread and Queue.Queue instead of multiprocessing.Process and multiprocessing.Queue.

@wizzat
Copy link
Collaborator

wizzat commented Apr 2, 2014

I see. I'm generally ok with the idea of a thread based async producer, but I'm not sure that it should come at the cost of the multiprocessing based one. I'd be interested in seeing any benchmarking that showed the cost multiprocessing.Queue exceeds the cost of sending via a thread.

@dpkp
Copy link
Owner

dpkp commented Apr 2, 2014

There was also some work a while back to make this pluggable between threading / multiprocessing / gevent / etc. I think the previous discussion was on #37 . Not sure whether any of those folks have made progress on this. No PR that I know of, other than the gevent support.

@snaury
Copy link
Contributor Author

snaury commented Apr 2, 2014

@wizzat the cost of sending over multiprocessing.Queue is huge. See my gist for an example benchmark:

https://gist.github.com/snaury/9939482

As you can see threading queues are not impacted by message sizes, while multiprocessing queues are. What's even more important is that sending via multiprocessing.Queue can never be faster than threading.Queue because multiprocessing.Queue actually uses threads under the hood (otherwise you'd find that multiprocessing.Pipe obviously has very small buffer size and thus is not asynchronous at all).

Performance, however, is not that big a problem (at least for me). Any performance improvements from not paying pickle/unpickle tax over pipes would be a plus (though, compared to network speed, it would of course be negligible), but the biggest problem is that by using multiprocessing kafka-python disallows uses of multiprocessing where it might give more gain. I might want to use multiprocessing where there's a lot of computation involved (it's not just numbers, e.g. text parsing is also computation), and where parallelism is really important. Imagine that during those computations I want to write results to Kafka, but I don't want to synchronously wait for any ACKs from Kafka, thus I want async=True. When kafka-python tries to use multiprocessing it, simply speaking, bans parallelism in my application, because nested processes are not allowed by multiprocessing.

I would be ok if there was e.g. MultiProcessProducer (similar to MultiProcessConsumer), but default shouldn't use multiprocessing, unless explicitly asked to.

@wizzat
Copy link
Collaborator

wizzat commented Apr 2, 2014

Thank you for the detailed response! I really want to try some performance benchmarking myself, but I'm 100% in favor of a threaded solution. Do you have a PR for this?

@snaury
Copy link
Contributor Author

snaury commented Apr 2, 2014

@wizzat not yet, will prepare one in a day or two.

@vshlapakov
Copy link
Contributor

I think we can close this issue.

@dpkp
Copy link
Owner

dpkp commented Mar 26, 2015

yes indeed -- thanks for all the help on this!

@dpkp dpkp closed this as completed Mar 26, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants