High CPU usage in KafkaConsumer.poll() when subscribed to many topics with no new messages (possibly SSL related) #1315
Just to clarify, it doesn't actually get stuck, it just spends a lot of time in the tight loop. The application behaves correctly; it's just the CPU usage that is the problem (it negatively impacts other applications).
What broker version are you using? Perhaps related: https://issues.apache.org/jira/browse/KAFKA-1563 . I don't think there is a way to disable TCP_NODELAY on the kafka broker, but I suspect that would help if the issue is that the broker is sending lots of very small packets. Otherwise, I'll think about how to improve performance in this scenario.
One alternate approach here might be to wrap each sock.recv() call in a configurable timeout (perhaps
Our broker version is 0.11.0, so it seems unlikely to be KAFKA-1563, but we can look into whether it might be that or something else on the server.
Thanks. I can reproduce this, but only if the poll timeout_ms is 0. Have you tried setting a larger timeout? Perhaps the much simpler solution here is to change the default timeout_ms to something like 100ms? |
Sorry for the delayed response. For us, the issue actually occurs with a non-zero timeout: it's when the consumer sits idle in a poll() call with no messages coming in. The tight loop with small reads probably happens with a 0 timeout too (I'll retry that as soon as I get a chance), but that's actually how we mitigate the issue right now: we do a poll() with timeout_ms=0, and if there are no messages we do a sleep(1). This brings the CPU way down, but probably only because it spreads the work into smaller bursts. In trying to reproduce this with a timeout, did you try subscribing to several topics? That magnifies the issue and makes it easier to notice.
Tested with timeout_ms=0, I see the same behaviour as I do with a bigger timeout. |
How many total partitions are you assigned? And what is the total leader count across these partitions? |
16 partitions per topic, and up to 40 topics. I tested with 1, 2, 10 and 40 topics; same behaviour in all cases, but it gets worse with the number of partitions (because the fetch response size goes up). Total leader count is 6 (i.e., I have 6 brokers and the leaders are spread among them).
Also, I traced with tcpdump and it looks like the broker is not sending small packets; rather, the client is breaking the response up into small reads. Here's a sample of the trace:
While network traffic was blocked at that point, the client continued with its slow read:
In this case I am in a loop where I call poll(timeout_ms=0) and then sleep for 1 second if there are no messages. So it seems to be taking multiple poll calls (over multiple seconds) to assemble the fetch responses. This was with 40 topics in order to highlight the effect.
BTW, I don't think I mentioned that I am using SSL |
Are you able to profile the process while it is in this state? Have you used vmprof?
It looks like the problem is with recv'ing from the SSL-wrapped socket. It seems to always return a small number of bytes (most of the time 18). We tried the same test without SSL and don't see the same behaviour. I put the following probe in the 1.3.4 code:
Here's some sample output with SSL enabled:
Here's a sample with a plaintext connection:
We have yet to figure out why the SSL socket is behaving the way it is. I did try putting the self._sock.recv(bytes_to_read) call in a loop and only breaking out when staged_bytes == self._next_payload_bytes. (Not sure it is safe to do that, but it was an experiment.) I still got the same small reads, but there was a big improvement in CPU usage (from 100% down to 10%).
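Standalone, the experiment amounts to something like this (a sketch, not the actual patch to conn.py; `drain_payload` is a hypothetical helper, and the buffer handling is schematic):

```python
import io
import ssl

def drain_payload(sock, payload_bytes):
    """Keep calling recv() until the full payload is staged, instead of
    returning to select() after every small SSL-record-sized read."""
    buf = io.BytesIO()
    staged = 0
    while staged < payload_bytes:
        try:
            data = sock.recv(payload_bytes - staged)  # still ~18 bytes per call over SSL
        except (ssl.SSLWantReadError, BlockingIOError):
            continue  # non-blocking socket has nothing buffered yet; busy-wait (unsafe, but an experiment)
        if not data:
            raise ConnectionError('socket closed mid-payload')
        buf.write(data)
        staged += len(data)
    return buf.getvalue()
```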
Note that when using SSL, recv() almost always returns a small number of bytes; it's not just a state that it gets into. I tried a similar probe in current master:
with the following sample result:
I would have expected the CPU usage to be better in this case, since it reads up to 4096 bytes in a loop, but it was still 100%.
I misread the code for _recv() in master: it doesn't keep looping until it gets 4096 bytes, it only loops if it reads the full 4096, which makes sense and also explains why the CPU is still high. So I'm at a loss as to why I'm getting these small reads. I've tried to reproduce with a simple client and server app running on the same machines as my Kafka client app and Kafka broker. The server responds to client requests with a 16K-byte response, and the client selects using the default selector from selectors34 and does a non-blocking recv()... and the recv() returns the full 16K bytes every time. It's a bit of a head-scratcher.
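The test client was along these lines (a simplified sketch using the stdlib selectors module, standing in for the selectors34 backport; host and port are placeholders):

```python
import selectors
import socket

sel = selectors.DefaultSelector()

# The test server replies to each request with a single 16K write.
sock = socket.create_connection(('testserver', 9999))
sock.sendall(b'request\n')
sock.setblocking(False)
sel.register(sock, selectors.EVENT_READ)

while True:
    for key, _events in sel.select():
        data = key.fileobj.recv(65536)   # non-blocking recv
        if not data:
            raise SystemExit('server closed connection')
        print('recv() returned %d bytes' % len(data))  # full 16K in one read
```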
OK, I think I know what's going on now. My test server was sending 16K in a single write. When I make it respond with multiple small writes of 50 bytes each, the client recv() returns 50 bytes at a time. What is probably happening here (and I am speculating a bit, as I am not intimately familiar with the SSL protocol) is that each write from the server is encoded as a separate SSL record, and on the client side recv() on an SSL socket will return one record at a time. It looks like the Kafka broker is sending the FetchResponse as a bunch of small writes. In fact, it looks like it is doing a write for each topic name, and a write for each (empty) message set. That explains the pattern of 18-byte reads (the message sets) and somewhat larger reads (the topics). The reason I see 30-byte reads when using the master code is that it is using the newer protocol; if I specify api_version=(0, 10), it goes back to 18 bytes.

So I'm not sure there is much that can be done about getting the small chunks of data from recv(). Do you think it might be possible to safely assemble the response in a tighter loop, without going back through a select() call every time, in order to make it more efficient?
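The record-at-a-time behaviour is easy to see from the client side (a sketch; the address and cert path are placeholders, and the server is assumed to do many 50-byte writes, each of which becomes its own SSL record):

```python
import socket
import ssl

ctx = ssl.create_default_context(cafile='ca.pem')  # placeholder CA cert
with socket.create_connection(('testserver', 9999)) as raw:
    with ctx.wrap_socket(raw, server_hostname='testserver') as tls:
        tls.sendall(b'request\n')
        while True:
            # No matter how large a buffer we ask for, each recv() returns
            # at most one decrypted SSL record (~50 bytes here).
            chunk = tls.recv(65536)
            if not chunk:
                break
            print('recv() returned %d bytes' % len(chunk))
```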
Very interesting. Exceptional investigation, @rmechler!
Thanks. It's when my colleague pointed out that my test server should do multiple small writes that things started to click. :) So it seems like we have a degenerate case: SSL + lots of topics and partitions + relatively low traffic with periods of idle. We can address this to a fair degree on our end by (a) consolidating some of our topics and reducing the number of partitions in some cases, and (b) reducing the frequency of fetches during idle time using fetch_min_bytes and fetch_max_wait_ms (see the sketch below). Nevertheless, it seems there is some efficiency to be gained in kafka-python by trying to read as many records as possible before doing another poll. Are you open to that idea?
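Concretely, (b) would look something like this (a sketch; the broker address and topic names are placeholders, and the values are illustrative, matching the ones tested later in this thread):

```python
from kafka import KafkaConsumer

# Trade a little latency for fewer, larger fetch responses during idle periods.
consumer = KafkaConsumer(
    bootstrap_servers='broker:9093',
    security_protocol='SSL',
    fetch_min_bytes=1000000,     # broker holds the fetch until ~1MB is ready...
    fetch_max_wait_ms=2000,      # ...or 2 seconds pass, whichever comes first
)
consumer.subscribe(['topic-%02d' % i for i in range(40)])
```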
Yes, very open.
I put up a PR that may help in this situation. Would love if you were able to test in your setup and see if you get any improvement! |
Cool, I'll definitely test it out. I'm on holiday without access to my environment at the moment though, so I won't be able to try it out until probably Sunday.
Finally got a chance to test, sorry for the delay. Looks good. I ran a test with ~40 topics that previously sent the CPU to 100%, and now it goes to ~15% (with the patch). That's without setting fetch_min_bytes / fetch_max_wait_ms. If I set fetch_min_bytes=1000000 / fetch_max_wait_ms=2000, CPU goes to < 5%. If I set fetch_max_wait_ms=5000, CPU goes to < 3% (without the patch, this config is still at 25% CPU). Note that I tested by applying the patch to a699f6a, because I couldn't get current HEAD to work properly; I keep getting an exception:
The problem happens here:
self.join_future.add_errback(self._handle_join_failure) actually ends up calling the errback immediately, because the future already has an exception set (NodeNotReadyError: 4); that results in self.join_future being set to None, and so future.failed() blows up.
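That is consistent with how kafka-python's Future behaves: add_errback() invokes the callback immediately when the future has already failed. A minimal standalone illustration (not the actual coordinator code; the error value is just the one from the log above):

```python
from kafka.future import Future
from kafka.errors import NodeNotReadyError

f = Future()
f.failure(NodeNotReadyError(4))   # future completes with an exception

# Because the future is already failed, add_errback() fires right away
# instead of storing the callback for later.
f.add_errback(lambda exc: print('errback fired immediately:', exc))
```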
Great. I'm going to close this. I've filed the NoneType exception as a separate issue. Thanks for testing!! |
Forgot that the PR hasn't landed yet. Will close when merged. |
Thanks, I probably should have opened a separate ticket myself. And thanks for getting a fix in for the CPU issue! |
Experiencing high CPU usage when sitting idle in poll() (i.e., waiting for a timeout when there are no new messages on the broker). It gets worse the more topics I am subscribed to (I have the CPU pegged at 100% with 40 topics). Note that I am using 1.3.4 with mostly default configs, and I repro'd it also in the current master.
There seem to be a couple of things at play here. One is that poll() will do fetch requests in a tight loop. The other, the one that really seems to be killing the CPU, is that when a fetch response is received, the low-level poll() gets into a relatively tight loop as the payload buffer fills, adding a relatively small number of bytes at a time. This explains the effect of adding more topics: the fetch responses are bigger, so more time is spent in this tight loop. Here's some debug output based on a couple of probes I put in the code:
In conn.py: _recv()
In consumer/group.py: _poll_once()
So, for one topic I get output like this while blocked in poll():
For 2 topics:
For 40 topics:
So it gets stuck spinning in this, and the CPU goes to 100%.
I tried mitigating this using consumer fetch config:
but that did nothing.
The only thing that gets the CPU down is to do a non-blocking poll() instead of using a timeout, and then do a short sleep when there are no result records (my application can tolerate that latency). It looks like poll() used to support something like this, i.e., there was a sleep parameter that caused a sleep for the remainder of the timeout period if there were no records on the first fetch. It looks like that was removed in 237bd73, not sure why.
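For reference, the workaround loop is roughly this (a sketch; the topic list, connection settings, and handle() are placeholders):

```python
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer('topic-a', 'topic-b',
                         bootstrap_servers='broker:9093',
                         security_protocol='SSL')

while True:
    records = consumer.poll(timeout_ms=0)   # non-blocking poll
    if not records:
        time.sleep(1)   # tolerate up to ~1s latency; CPU drops way down
        continue
    for tp, messages in records.items():
        for message in messages:
            handle(message)   # placeholder for the application's handler
```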
So... like I said, I can work around the continuous fetching with my own sleep. It would be good to understand the real problem, which is the tight _recv() loop, and whether anything can be done about it.