[RFC] Performance improvements of ring buffer processing #1372
Working on a feature that's particularly sensitive to event drops and truncated snaplens, I did some performance experiments.
My chosen workload is a sustained busy loop that does nothing more than write 10KB at a time to /dev/null.
To start, this is how the whole thing behaves today, using eBPF (kmod is similar, but with less overhead):
Pretty bad: we dropped 95% of the events.
Using an off-CPU analysis methodology, this is what I get during a 10s run:
So, while sysdig was dropping buffers like crazy, it was also leisurely spending the vast majority of its time off CPU (315 * 30ms ~= 9.8s), as a result of the 30ms sleep we use when we find the buffer empty. This happens because the userspace consumer is not CPU-bound: it's able to keep up with the producer, so as soon as it catches up it goes to sleep, and during that pause the producer writes hundreds of MB of data, overflowing the buffer.
The obvious general approach here is to enlarge the buffer and/or shorten the sleep. Shortening the sleep is tricky because it increases CPU utilization, and enlarging the ring buffer is tricky because it wastes a lot of kernel memory on multi-core machines: to absorb a 30ms pause at 7 GB/s, the per-core buffer needs to be 200+MB.
These are the things I experimented with:
While this doesn't solve a bursty workload, it does wonders for a sustained workload, as we can see here after the patch for the same stress test:
Not bad at all for an 8MB ring buffer: < 1% dropped (the lower throughput is because the kernel probe now spends much more time copying all those 10KB writes into the ring buffer, since it keeps finding free space in it).
And the CPU usage of sysdig remains pretty much the same. In particular, during idle time the adaptive sleep quickly reaches the ceiling and stays there (note the relatively small number of sleeps under 30ms over the 10s period):
Under sustained workload, the adaptive sleep instead stays at the floor, allowing us to be very responsive to the sustained load:
This is rather simple but seems to be doing the job. Ideally I'd like to replace it with a better feedback loop, where the current throughput of the ring buffer is estimated (via timestamps and the amounts read, over a moving average) and the polling interval is derived from that estimate. That would make it more robust against bursty workloads as well.
Also, a reader might ask why we don't move to a proper I/O-multiplexed model using poll(). Unfortunately, it's not obvious that it would help performance: in the past I've done experiments on this, and the kernel-side overhead needed to maintain the state and wake up the consumer wasn't acceptable, even with generous overflow thresholds, when dealing with millions of events per second. It's possible those tests weren't thorough, but that's certainly a much more significant piece of work.
This is just a checkpoint on the experiment, since I'm going to be focusing on something else for a while.
I amended the commit message to be more meaningful, and after spending the whole day yesterday doing more performance tests and not finding any problems, I'm merging this. Hopefully if there are issues folks will run into them while on the dev branch.