memcg: flush rstat for a single tree in memory.stat
Commit 11192d9 ("memcg: flush stats only if updated") made memory.stat flush the whole cgroup tree, which seems to be too expensive for use cases like cadvisor, where we iterate through all cgroups and read stats for each one. Since commit 2d146aa ("mm: memcontrol: switch to rstat"), memory accounting uses rstat, just like cpu accounting. In cpu.stat only one tree is flushed to read the cpu usage data. This means that for:

* cpu.stat: flush one tree (both cpu and memory)
* memory.stat: flush all cgroups (both cpu and memory)

When cadvisor iterates through every cgroup, this means a lot of flushing, which quickly adds up. Additionally, reading memory.stat and then cpu.stat for every cgroup is more expensive than reading memory.stat for all of them and then reading cpu.stat for all of them:

* https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13

completed:  9.55s [manual / cpu-stat + mem-stat]
completed:  1.52s [manual / mem-stat]
completed:  0.04s [manual / cpu-stat]

The exact reason for this behavior is unclear; it is probably a pathological case for our large servers (128 hardware threads) with deeper than usual cgroup trees.

Having different flushing behavior for reading memory.stat and cpu.stat doesn't seem right. Let's use the same mechanism for both. This speeds things up:

completed:  0.20s [manual / cpu-stat + mem-stat]
completed:  0.12s [manual / mem-stat]
completed:  0.07s [manual / cpu-stat]

When reading cpu.stat + memory.stat and measuring with bpftrace:

$ sudo bpftrace -e 'kfunc:cgroup_rstat_flush_locked { if (comm == "derp") { @cgroup_rstat_flush_locked++; } } kfunc:mem_cgroup_css_rstat_flush { if (comm == "derp") { @mem_cgroup_css_rstat_flush++; } }'

Before:

@cgroup_rstat_flush_locked: 1957
@mem_cgroup_css_rstat_flush: 1913631

After:

@cgroup_rstat_flush_locked: 2000
@mem_cgroup_css_rstat_flush: 3548

The counts are not precise due to data races in bpftrace, but it's clear that we are doing a lot less work with this patch, which translates into a much lower runtime.

Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
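For context, the idea can be sketched as below. This is a schematic illustration, not the actual diff (which is not shown on this page): the mem_cgroup_flush_stats() helper and the surrounding threshold/locking logic in mm/memcontrol.c are simplified here, and the real patch may differ in detail.

    /*
     * Schematic sketch only, not the actual patch. Before: reading
     * memory.stat flushes rstat for every cgroup by flushing the
     * whole hierarchy from the root.
     */
    static void mem_cgroup_flush_stats(void)
    {
            /* simplified: the real helper also checks an update threshold */
            cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
    }

    /*
     * After: flush only the rstat subtree of the cgroup being read,
     * mirroring how cpu.stat flushes a single tree. Readers such as
     * memory_stat_show() would pass the memcg they are reading.
     */
    static void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
    {
            cgroup_rstat_flush(memcg->css.cgroup);
    }

A subtree flush should be sufficient for this reader because a cgroup's memory.stat aggregates only its own and its descendants' counters, all of which live in the flushed subtree.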