
[RFC] Performance improvements of ring buffer processing #1372

Merged 1 commit into dev on Apr 24, 2019

Conversation

@gianlucaborello (Contributor) commented Apr 19, 2019

Working on a feature that's particularly sensitive to event drops and truncated snaplens, I did some performance experiments.

My chosen workload is a sustained busy loop that does nothing more than writing 10KB at a time to /dev/null using write(). On my laptop, this gives a throughput of ~1.5M evt/s on a single core. Since I am explicitly running with a very high snaplen (65k), this workload would generate a sustained 7 GB/s (!) on a single core, assuming a ring buffer whose consumer could keep up. Since our buffer is 8MB, userspace has ~1ms (8MB / 7 GB/s ~= 1.1ms) to process all the data in the ring buffer before drops begin. This workload highlighted some potential low-hanging fruit that I'll describe here.
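
For reference, this is roughly the shape of the stress workload (a sketch of my assumption; the exact generator isn't part of this PR):

    /* Illustrative sketch: busy loop writing 10KB at a time to /dev/null. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[10 * 1024]; /* zero-initialized 10KB payload */
        int fd = open("/dev/null", O_WRONLY);
        if (fd < 0)
            exit(1);
        for (;;)
            (void)write(fd, buf, sizeof(buf)); /* ~1.5M calls/s on one core */
    }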

To start, this is how the whole thing behaves today, using eBPF (kmod is similar, but with less overhead):

Total events (including dropped): 11.3M
Dropped events: 10.6M

Pretty bad: we dropped ~94% of the events.

Using an off-CPU time analysis methodology, this is what I get during a 10s run:

$ sudo cpudist -O -m -P -p $(pgrep sysdig) 10 1
Tracing off-CPU time... Hit Ctrl-C to end.

pid = 27973 sysdig

     msecs               : count     distribution
         0 -> 1          : 16       |**                                      |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 315      |****************************************|
        32 -> 63         : 1        |                                        |

$ sudo offcputime -p $(pgrep sysdig) 10
Tracing off-CPU time (us) of PID 31663 by user + kernel stack for 10 secs.

    finish_task_switch
    __schedule
    schedule
    exit_to_usermode_loop
    prepare_exit_to_usermode
    swapgs_restore_regs_and_return_to_usermode
    scap_next
    -                sysdig (31663)
        50

    finish_task_switch
    __schedule
    schedule
    do_nanosleep
    hrtimer_nanosleep
    sys_nanosleep
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    __GI___nanosleep
    -                sysdig (31663)
        9824891

So, while sysdig was dropping buffers like crazy, it was also leisurely spending the vast majority of its time off CPU (315 * 30ms ~= 9.8s), as a result of the 30ms sleep we use when we find the buffer empty. This happens because the userspace consumer is not CPU-bound: it keeps up with the producer, goes to sleep when it catches up, and during that pause the producer writes hundreds of MB of data, overflowing the buffer.

The general approaches here are the obvious ones: enlarge the buffer and/or shorten the sleep. Shortening the sleep is tricky because it increases CPU utilization, and enlarging the ring buffer is tricky because it wastes a lot of kernel memory on multi-core machines: to absorb a 30ms pause at 7 GB/s, the per-core buffer would need to be 200+MB (30ms * 7 GB/s ~= 210MB).

These are the things I experimented with:

  • I replaced the fixed 30ms sleep with a simple adaptive algorithm: every time we find the buffer empty, we double the time we will sleep, up to a ceiling (30ms). If the buffer is not empty, we reset the sleep interval so that we start polling more aggressively the next time the buffer is empty (500us in this PR). A sketch follows this list.

  • I removed the "max consecutive wait" logic, since I couldn't see the point of it: regardless of whether we wait, at the end of each period we always refill the buffers and make them available to the caller, so multiple consecutive waits can't cause any harm (that I can see, at least) and don't need to be tracked.

  • The tail of each buffer was previously advanced only when the buffers were refilled, which could be several us/ms after a given buffer was consumed, since we have to wait for all the buffers to be emptied by userspace; this kept the ring buffer needlessly occupied. The tail is now advanced as soon as a buffer is completely consumed. The number of operations should be exactly the same; admittedly this is a micro-optimization, done more for code readability, so that check_scap_next_wait always looks at the new amount of data available rather than counting the already-consumed data, which I found odd.
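
To make the first bullet concrete, here is a sketch of the adaptive sleep in C. Only the constants (500us floor, 30ms ceiling, doubling on empty) come from the description above; the names and surrounding structure are illustrative, not the exact code in the patch.

    #include <unistd.h>

    #define BUF_EMPTY_WAIT_US_START 500          /* floor: poll aggressively */
    #define BUF_EMPTY_WAIT_US_MAX   (30 * 1000)  /* ceiling: 30ms */

    static useconds_t wait_us = BUF_EMPTY_WAIT_US_START;

    /* Called on every iteration of the consumer loop. */
    static void adaptive_wait(int buffer_empty)
    {
        if (buffer_empty) {
            usleep(wait_us);
            /* Double the sleep on every consecutive empty read,
             * saturating at the ceiling. */
            if (wait_us * 2 <= BUF_EMPTY_WAIT_US_MAX)
                wait_us *= 2;
            else
                wait_us = BUF_EMPTY_WAIT_US_MAX;
        } else {
            /* Found data: reset so the next empty read polls
             * aggressively again. */
            wait_us = BUF_EMPTY_WAIT_US_START;
        }
    }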

While this doesn't solve the bursty-workload case, it does wonders for a sustained workload, as these results for the same stress test after the patch show:

Total events (including dropped): 8M
Dropped events: 60k

Not bad at all for an 8MB ring buffer, < 1% dropped (the lower total throughput is because the kernel probe now spends much more time copying all those 10KB payloads into the ring buffer, since it finds it free).

And the CPU usage of sysdig remains pretty much the same. In particular, during idle time, the adaptive sleep quickly reaches the ceiling and stays there (note the relatively small number of sleeps under 30ms over the 10s period):

$ sudo cpudist -O -m -P -p $(pgrep sysdig) 10 1
Tracing off-CPU time... Hit Ctrl-C to end.

pid = 28645 sysdig

     msecs               : count     distribution
         0 -> 1          : 37       |****                                    |
         2 -> 3          : 11       |*                                       |
         4 -> 7          : 11       |*                                       |
         8 -> 15         : 11       |*                                       |
        16 -> 31         : 317      |****************************************|

Under a sustained workload, the adaptive sleep instead stays at the floor, allowing us to be very responsive to the load:

$ sudo cpudist -O -m -P -p $(pgrep sysdig) 10 1
Tracing off-CPU time... Hit Ctrl-C to end.

pid = 28645 sysdig

     msecs               : count     distribution
         0 -> 1          : 12746    |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1        |                                        |

This is rather simple but seems to do the job. Ideally I'd like to replace it with a better feedback loop, where the current throughput of the ring buffer is estimated (via timestamps and amounts read over a moving average) and the polling interval is calculated from that estimate; that would be even more robust against bursty workloads. A rough sketch of the idea follows.
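
A hypothetical sketch of that feedback loop (nothing like this is in the patch; the smoothing factor and the half-buffer sizing target are arbitrary assumptions of mine):

    #include <stdint.h>

    #define RING_BUF_SIZE  (8 * 1024 * 1024)     /* 8MB per-core buffer */
    #define MIN_SLEEP_NS   (500 * 1000)          /* 500us floor */
    #define MAX_SLEEP_NS   (30 * 1000 * 1000)    /* 30ms ceiling */
    #define EMA_ALPHA      0.2                   /* smoothing factor, assumed */

    static double est_bps; /* smoothed producer throughput, bytes/s */

    /* Given the bytes consumed over the last interval, estimate the producer
     * rate with an exponential moving average and sleep at most as long as
     * it would take the producer to fill half the buffer at that rate. */
    static uint64_t next_sleep_ns(uint64_t bytes_read, uint64_t elapsed_ns)
    {
        if (elapsed_ns == 0)
            return MIN_SLEEP_NS;

        double instant_bps = (double)bytes_read * 1e9 / (double)elapsed_ns;
        est_bps = EMA_ALPHA * instant_bps + (1.0 - EMA_ALPHA) * est_bps;

        if (est_bps < 1.0)
            return MAX_SLEEP_NS; /* effectively idle: back off fully */

        double fill_ns = (double)(RING_BUF_SIZE / 2) / est_bps * 1e9;
        if (fill_ns > MAX_SLEEP_NS)
            fill_ns = MAX_SLEEP_NS;
        if (fill_ns < MIN_SLEEP_NS)
            fill_ns = MIN_SLEEP_NS;
        return (uint64_t)fill_ns;
    }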

Also, a reader might ask why we don't move to a proper I/O-multiplexed model using poll(). Unfortunately, it's not obvious that it would help performance: in past experiments, the kernel-side overhead needed to maintain the state and wake up the consumer wasn't acceptable, even with generous overflow thresholds, once you're talking about millions of events per second. It's possible those tests weren't thorough enough, but that's certainly a much more significant piece of work.

This is just a checkpoint on the experiment, since I'm going to be focusing on something else for a while.

@nathan-b (Contributor) left a comment
Looks fine to me!

@gianlucaborello (Contributor, Author)
Thanks for reviewing. I'm going to test this a little more and wait to see whether there are other reviewers (considering how critical this part is), and will then merge.

@gianlucaborello (Contributor, Author)
I amended the commit message to be more meaningful, and after spending the whole day yesterday doing more performance tests and not finding any problems, I'm merging this. Hopefully if there are issues folks will run into them while on the dev branch.

@gianlucaborello merged commit be90e43 into dev on Apr 24, 2019
@gianlucaborello deleted the scap-perf branch on April 24, 2019