
[RFC] Performance improvements of ring buffer processing #1372

Merged 1 commit into dev on Apr 24, 2019

Conversation

@gianlucaborello (Contributor) commented Apr 19, 2019

Working on a feature that's particularly sensitive to event drops and truncated snaplens, I did some performance experiments.

My chosen workload is a sustained busy loop that does nothing more than writing 10KB at a time to /dev/null using write(). On my laptop, this gives a throughput of ~1.5M evt/s on a single core. Since I am explicitly running with a very high snaplen (65k), this workload would generate a sustained 7 GB/s (!) on a single core, assuming a ring buffer whose consumer could keep up. Since our buffer is 8MB, userspace has ~1ms (8MB / 7 GB/s ~= 1.1ms) to process all the data in the ring buffer before drops begin. This workload highlighted some potential low-hanging fruit that I'll describe here.
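
For reference, this is roughly the shape of the stress workload (a sketch of my assumption; the exact generator isn't part of this PR):

    /* Illustrative sketch: busy loop writing 10KB at a time to /dev/null. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[10 * 1024]; /* zero-initialized 10KB payload */
        int fd = open("/dev/null", O_WRONLY);
        if (fd < 0)
            exit(1);
        for (;;)
            (void)write(fd, buf, sizeof(buf)); /* ~1.5M calls/s on one core */
    }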

To start, this is how the whole thing behaves today, using eBPF (kmod is similar, but with less overhead):

Total events (including dropped): 11.3M
Dropped events: 10.6M

Pretty bad: we dropped ~94% of the events.

Using an off-CPU time analysis methodology, this is what I get during a 10s run:

$ sudo cpudist -O -m -P -p $(pgrep sysdig) 10 1
Tracing off-CPU time... Hit Ctrl-C to end.

pid = 27973 sysdig

     msecs               : count     distribution
         0 -> 1          : 16       |**                                      |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 315      |****************************************|
        32 -> 63         : 1        |                                        |

$ sudo offcputime -p $(pgrep sysdig) 10
Tracing off-CPU time (us) of PID 31663 by user + kernel stack for 10 secs.

    finish_task_switch
    __schedule
    schedule
    exit_to_usermode_loop
    prepare_exit_to_usermode
    swapgs_restore_regs_and_return_to_usermode
    scap_next
    -                sysdig (31663)
        50

    finish_task_switch
    __schedule
    schedule
    do_nanosleep
    hrtimer_nanosleep
    sys_nanosleep
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    __GI___nanosleep
    -                sysdig (31663)
        9824891

So, while sysdig was dropping buffers like crazy, it was also leisurely spending the vast majority of its time off CPU (315 * 30ms ~= 9.8s), as a result of the 30ms sleep we use when we find the buffer empty. This happens because the userspace consumer is not CPU-bound: it keeps up with the producer, goes to sleep when it catches up, and during that pause the producer writes hundreds of MB of data, overflowing the buffer.

The general approaches here are the obvious ones: enlarge the buffer and/or shorten the sleep. Shortening the sleep is tricky because it increases CPU utilization, and enlarging the ring buffer is tricky because it wastes a lot of kernel memory on multi-core machines: to absorb a 30ms pause at 7 GB/s, the per-core buffer would need to be 200+MB (30ms * 7 GB/s ~= 210MB).

These are the things I experimented with:

  • I replaced the fixed 30ms sleep with a simple adaptive algorithm: every time we find the buffer empty, we double the time we will sleep, up to a ceiling (30ms). If the buffer is not empty, we reset the sleep interval so that we start polling more aggressively the next time the buffer is empty (500us in this PR). A sketch follows this list.

  • I removed the "max consecutive wait" logic, since I couldn't see the point of it: regardless of whether we wait, at the end of each period we always refill the buffers and make them available to the caller, so multiple consecutive waits can't cause any harm (that I can see, at least) and don't need to be tracked.

  • The tail of each buffer was previously advanced only when the buffers were refilled, which could be several us/ms after a given buffer was consumed, since we have to wait for all the buffers to be emptied by userspace; this kept the ring buffer needlessly occupied. The tail is now advanced as soon as a buffer is completely consumed. The number of operations should be exactly the same; admittedly this is a micro-optimization, done more for code readability, so that check_scap_next_wait always looks at the new amount of data available rather than counting the already-consumed data, which I found odd.
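
To make the first bullet concrete, here is a sketch of the adaptive sleep in C. Only the constants (500us floor, 30ms ceiling, doubling on empty) come from the description above; the names and surrounding structure are illustrative, not the exact code in the patch.

    #include <unistd.h>

    #define BUF_EMPTY_WAIT_US_START 500          /* floor: poll aggressively */
    #define BUF_EMPTY_WAIT_US_MAX   (30 * 1000)  /* ceiling: 30ms */

    static useconds_t wait_us = BUF_EMPTY_WAIT_US_START;

    /* Called on every iteration of the consumer loop. */
    static void adaptive_wait(int buffer_empty)
    {
        if (buffer_empty) {
            usleep(wait_us);
            /* Double the sleep on every consecutive empty read,
             * saturating at the ceiling. */
            if (wait_us * 2 <= BUF_EMPTY_WAIT_US_MAX)
                wait_us *= 2;
            else
                wait_us = BUF_EMPTY_WAIT_US_MAX;
        } else {
            /* Found data: reset so the next empty read polls
             * aggressively again. */
            wait_us = BUF_EMPTY_WAIT_US_START;
        }
    }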

While this doesn't solve the bursty-workload case, it does wonders for a sustained workload, as these results for the same stress test after the patch show:

Total events (including dropped): 8M
Dropped events: 60k

Not bad at all for an 8MB ring buffer, < 1% dropped (the lower total throughput is because the kernel probe now spends much more time copying all those 10KB payloads into the ring buffer, since it finds it free).

And the CPU usage of sysdig remains pretty much the same. In particular, during idle time, the adaptive sleep quickly reaches the ceiling and stays there (note the relatively small number of sleeps under 30ms over the 10s period):

$ sudo cpudist -O -m -P -p $(pgrep sysdig) 10 1
Tracing off-CPU time... Hit Ctrl-C to end.

pid = 28645 sysdig

     msecs               : count     distribution
         0 -> 1          : 37       |****                                    |
         2 -> 3          : 11       |*                                       |
         4 -> 7          : 11       |*                                       |
         8 -> 15         : 11       |*                                       |
        16 -> 31         : 317      |****************************************|

Under a sustained workload, the adaptive sleep instead stays at the floor, allowing us to be very responsive to the load:

$ sudo cpudist -O -m -P -p $(pgrep sysdig) 10 1
Tracing off-CPU time... Hit Ctrl-C to end.

pid = 28645 sysdig

     msecs               : count     distribution
         0 -> 1          : 12746    |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1        |                                        |

This is rather simple but seems to do the job. Ideally I'd like to replace it with a better feedback loop, where the current throughput of the ring buffer is estimated (via timestamps and amounts read over a moving average) and the polling interval is calculated from that estimate; that would be even more robust against bursty workloads. A rough sketch of the idea follows.
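
A hypothetical sketch of that feedback loop (nothing like this is in the patch; the smoothing factor and the half-buffer sizing target are arbitrary assumptions of mine):

    #include <stdint.h>

    #define RING_BUF_SIZE  (8 * 1024 * 1024)     /* 8MB per-core buffer */
    #define MIN_SLEEP_NS   (500 * 1000)          /* 500us floor */
    #define MAX_SLEEP_NS   (30 * 1000 * 1000)    /* 30ms ceiling */
    #define EMA_ALPHA      0.2                   /* smoothing factor, assumed */

    static double est_bps; /* smoothed producer throughput, bytes/s */

    /* Given the bytes consumed over the last interval, estimate the producer
     * rate with an exponential moving average and sleep at most as long as
     * it would take the producer to fill half the buffer at that rate. */
    static uint64_t next_sleep_ns(uint64_t bytes_read, uint64_t elapsed_ns)
    {
        if (elapsed_ns == 0)
            return MIN_SLEEP_NS;

        double instant_bps = (double)bytes_read * 1e9 / (double)elapsed_ns;
        est_bps = EMA_ALPHA * instant_bps + (1.0 - EMA_ALPHA) * est_bps;

        if (est_bps < 1.0)
            return MAX_SLEEP_NS; /* effectively idle: back off fully */

        double fill_ns = (double)(RING_BUF_SIZE / 2) / est_bps * 1e9;
        if (fill_ns > MAX_SLEEP_NS)
            fill_ns = MAX_SLEEP_NS;
        if (fill_ns < MIN_SLEEP_NS)
            fill_ns = MIN_SLEEP_NS;
        return (uint64_t)fill_ns;
    }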

Also, a reader might ask why we don't move to a proper I/O-multiplexed model using poll(). Unfortunately, it's not obvious that it would help performance: in past experiments, the kernel-side overhead needed to maintain the state and wake up the consumer wasn't acceptable, even with generous overflow thresholds, once you're talking about millions of events per second. It's possible those tests weren't thorough enough, but that's certainly a much more significant piece of work.

This is just a checkpoint on the experiment, since I'm going to be focusing on something else for a while.

@nathan-b (Contributor) left a comment
Looks fine to me!

@gianlucaborello (Contributor, Author)
Thanks for reviewing. I'm going to test this a little more and wait to see whether there are other reviewers (considering how critical this part is), and will then merge.

@gianlucaborello (Contributor, Author)
I amended the commit message to be more meaningful, and after spending the whole day yesterday doing more performance tests and not finding any problems, I'm merging this. Hopefully if there are issues folks will run into them while on the dev branch.

@gianlucaborello merged commit be90e43 into dev on Apr 24, 2019
@gianlucaborello deleted the scap-perf branch on April 24, 2019