mm/memcg: Free percpu stats memory of dying memcg's
JIRA: https://issues.redhat.com/browse/RHEL-67445
Upstream Status: RHEL-only
For systems with a large number of CPUs, the majority of the memory
consumed by the mem_cgroup structure is actually the percpu stats
memory. When a large number of memory cgroups are continuously created
and destroyed (as on a container host), more and more mem_cgroup
structures can remain in the dying state, holding up an increasing
amount of percpu memory.
We can't free the memory of the dying mem_cgroup structures because of
active references, mainly from pages in the page cache. However, the
percpu stats memory allocated to those mem_cgroups is a different story.
As of the v6.12 kernel, there are two main sets of percpu stat counters
in the mem_cgroup structure and the associated mem_cgroup_per_node
structure:
- vmstats_percpu (2424 bytes, in struct mem_cgroup)
- lruvec_stats_percpu (1920 bytes, in struct mem_cgroup_per_node)
When using cgroup v1, there is also a small events_percpu stat counter
(24 bytes).
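For reference, these allocations live behind the following fields (a
simplified sketch of the v6.12 layout; both structures contain many
more members than shown):

    struct mem_cgroup_per_node {
            /* ... */
            /* 1920 bytes per CPU, see above */
            struct lruvec_stats_percpu __percpu *lruvec_stats_percpu;
            /* ... */
    };

    struct mem_cgroup {
            /* ... */
            /* 2424 bytes per CPU, see above */
            struct memcg_vmstats_percpu __percpu *vmstats_percpu;
            /* ... */
    };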
Upstream hasn't yet decided on the best way to handle dying memory
cgroups; see https://lwn.net/Articles/932070/ for more information. It
looks like a final solution may still be some time away.
This patch is a workaround that frees the percpu stats memory (except
the small v1 events_percpu counter) associated with a dying memory
cgroup. This mostly eliminates the percpu memory increase, but we will
still see an increase in slab memory consumption associated with the
dying memory cgroups. As a workaround, it is unlikely to be accepted
upstream, but a lot of RHEL customers are hitting this percpu memory
increase problem.
A new percpu_stats_disabled variable is added to keep track of the
state of the percpu stats memory. When the variable is set, percpu
stats updates are disabled for that particular memcg and forwarded
to the nearest ancestor memcg that is still online. The only exception
is memcg_rstat_updated(), which is only called after the memcg has
been properly updated.
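A minimal sketch of what that forwarding could look like
(percpu_stats_disabled is the new field from this patch;
parent_mem_cgroup() is the existing upstream helper; the function
itself is illustrative, not the actual RHEL hunk):

    static inline struct mem_cgroup *
    memcg_stats_target(struct mem_cgroup *memcg)
    {
            /*
             * If this memcg is dying and its percpu stats have been
             * disabled, walk up the hierarchy to the nearest ancestor
             * that still has live percpu stats and account the update
             * there instead.
             */
            while (memcg && READ_ONCE(memcg->percpu_stats_disabled))
                    memcg = parent_mem_cgroup(memcg);
            return memcg;
    }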
The disabling, flushing and freeing of the percpu stats memory is a
multi-step process.
The percpu_stats_disabled variable is first set to
MEMCG_PERCPU_STATS_DISABLED when the memcg is being taken offline. At
this point, the cgroup filesystem control files corresponding to the
offline cgroup are being removed and will no longer be visible in user
space. After a grace period (arranged with rcu_work), no task should be
reading or updating the percpu stats any more. The percpu_stats_disabled
variable is then atomically set to PERCPU_STATS_FLUSHING before the
percpu stats are flushed out, and the state is changed to
PERCPU_STATS_FLUSHED afterwards. The percpu memory is then freed and
the state is changed to PERCPU_STATS_FREED.
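The sequence could look roughly like the sketch below, assuming an
rcu_work field (here called percpu_stats_rwork) added to struct
mem_cgroup. The state names follow the text above; free_percpu(),
queue_rcu_work(), mem_cgroup_flush_stats(), nodeinfo, vmstats_percpu
and lruvec_stats_percpu are existing upstream names, while the two
functions themselves are hypothetical and the plain WRITE_ONCE()
stores stand in for the atomic transitions described above:

    static void memcg_free_percpu_stats_workfn(struct work_struct *work)
    {
            struct mem_cgroup *memcg = container_of(to_rcu_work(work),
                                                    struct mem_cgroup,
                                                    percpu_stats_rwork);
            int nid;

            /* A grace period has elapsed; no one uses the stats now. */
            WRITE_ONCE(memcg->percpu_stats_disabled, PERCPU_STATS_FLUSHING);
            mem_cgroup_flush_stats(memcg);
            WRITE_ONCE(memcg->percpu_stats_disabled, PERCPU_STATS_FLUSHED);

            for_each_node(nid) {
                    struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];

                    free_percpu(pn->lruvec_stats_percpu);
                    pn->lruvec_stats_percpu = NULL;
            }
            free_percpu(memcg->vmstats_percpu);
            memcg->vmstats_percpu = NULL;
            WRITE_ONCE(memcg->percpu_stats_disabled, PERCPU_STATS_FREED);
    }

    /* Called from the memcg css_offline path. */
    static void memcg_offline_percpu_stats(struct mem_cgroup *memcg)
    {
            WRITE_ONCE(memcg->percpu_stats_disabled,
                       MEMCG_PERCPU_STATS_DISABLED);
            INIT_RCU_WORK(&memcg->percpu_stats_rwork,
                          memcg_free_percpu_stats_workfn);
            queue_rcu_work(system_wq, &memcg->percpu_stats_rwork);
    }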
This will greatly reduce the amount of memory held up by dying memory
cgroups.
For a compiled RHEL10 x86-64 kernel running cgroup v2 on a relatively
simple 2-socket system with 16 cores per socket and HT enabled, the
composite mem_cgroup structure consumes about 138,984 bytes, almost
136 kBytes. On a bigger 8-socket system with 32 cores per socket and
HT enabled, the consumption is about 2,606,056 bytes, almost 2.5
MBytes.
After getting rid of the percpu stats memory, the memory consumption
drops to about 5,864 and 17,384 bytes respectively. That is a lot of
memory saved (95.8% and 99.3%), especially on systems with a large
number of CPUs.
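(For reference, those percentages follow directly from the numbers
above: (138,984 - 5,864) / 138,984 ~= 95.8% on the 2-socket system and
(2,606,056 - 17,384) / 2,606,056 ~= 99.3% on the 8-socket system.)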
This patch does introduce a bit of performance overhead when doing
memcg stat updates, especially in __mod_memcg_lruvec_state().
This RHEL-only patch will be reverted once the upstream fix is finalized
and merged into RHEL10.
Signed-off-by: Waiman Long <longman@redhat.com>