Commit be64087
mm/memcg: Free percpu stats memory of dying memcg's
JIRA: https://issues.redhat.com/browse/RHEL-67445
Upstream Status: RHEL-only

For systems with a large number of CPUs, the majority of the memory
consumed by the mem_cgroup structure is actually the percpu stats
memory. When a large number of memory cgroups are continuously created
and destroyed (like in a container host), it is possible that more and
more mem_cgroup structures remain in the dying state, holding up an
increasing amount of percpu memory.

We can't free the memory of the dying mem_cgroup structures themselves
due to active references, mainly from pages in the page cache. The
percpu stats memory allocated to such a mem_cgroup, however, is a
different story.

As of the v6.12 kernel, there are 2 main sets of percpu stat counters
in the mem_cgroup structure and the associated mem_cgroup_per_node
structure:

 - vmstats_percpu (2424 bytes, in struct mem_cgroup)
 - lruvec_stats_percpu (1920 bytes, in struct mem_cgroup_per_node)

When using cgroup v1, there is also a small events_percpu stat counter
(24 bytes).

Upstream hasn't decided on the best way to handle dying memory cgroups
yet; see https://lwn.net/Articles/932070/ for more information. It
looks like a final solution may still need some more time. This patch
is a workaround that frees the percpu stats memory (except v1's small
events_percpu) associated with a dying memory cgroup. This mostly
eliminates the percpu memory increase problem, but we will still see
an increase in slab memory consumption associated with the dying
memory cgroups. As a workaround, it is not likely to be accepted
upstream, but a lot of RHEL customers are seeing this percpu memory
increase problem.

A new percpu_stats_disabled variable is added to keep track of the
state of the percpu stats memory. If the variable is set, percpu stats
updates are disabled for that particular memcg and forwarded to the
nearest ancestor memcg that is online. The only exception is
memcg_rstat_updated(), which is only called after the memcg has been
properly updated.

Disabling, flushing and freeing the percpu stats memory is a
multi-step process. The percpu_stats_disabled variable is first set to
PERCPU_STATS_DISABLED when the memcg is being taken offline. At that
point, the cgroup filesystem control files corresponding to the
offline cgroup are being removed and will no longer be visible in user
space. After an RCU grace period (via rcu_work), no task should still
be reading or updating the percpu stats. The percpu_stats_disabled
variable is then atomically set to PERCPU_STATS_FLUSHING before the
percpu stats are flushed out and the state changes to
PERCPU_STATS_FLUSHED. The percpu memory is then freed and the state
changes to PERCPU_STATS_FREED.

This greatly reduces the amount of memory held up by dying memory
cgroups. For a compiled RHEL10 x86-64 kernel running cgroup v2 on a
relatively simple 2-socket, 16-cores-per-socket system with HT on, the
memory consumption of the composite mem_cgroup structure is about
138,984 bytes, which is almost 136 kBytes. On a bigger 8-socket,
32-cores-per-socket system with HT on, the memory consumption is about
2,606,056 bytes, which is almost 2.5 MBytes. After getting rid of the
percpu stats memory, the memory consumption drops to about 5,864 and
17,384 bytes respectively, a saving of 95.8% and 99.3%. The saving is
especially large for systems with a large number of CPUs.

This patch does introduce a bit of performance overhead when updating
memcg stats, especially in __mod_memcg_lruvec_state().

This RHEL-only patch will be reverted once the upstream fix is
finalized and merged into RHEL10.

Signed-off-by: Waiman Long <longman@redhat.com>
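
As an illustration of the update forwarding described above, here is a
minimal user-space sketch of the ancestor walk. The toy_memcg,
stats_target() and mod_stat() names are invented for this sketch and
are not part of the patch; the real logic lives in percpu_stats_memcg()
in the diff below.

/*
 * Illustrative user-space sketch only; toy_memcg, stats_target() and
 * mod_stat() are invented stand-ins, not kernel code.
 */
#include <stdio.h>

struct toy_memcg {
	struct toy_memcg *parent;	/* NULL for the root memcg */
	int percpu_stats_disabled;	/* nonzero once the memcg is dying */
	long stat;			/* stand-in for a percpu counter */
};

/*
 * Walk up until a memcg with live percpu stats is found. The root
 * memcg is never taken offline, so the walk always terminates.
 */
static struct toy_memcg *stats_target(struct toy_memcg *memcg)
{
	while (memcg->percpu_stats_disabled)
		memcg = memcg->parent;
	return memcg;
}

/* An update against a dying memcg lands in its nearest live ancestor. */
static void mod_stat(struct toy_memcg *memcg, long val)
{
	stats_target(memcg)->stat += val;
}

int main(void)
{
	struct toy_memcg root = { 0 };
	struct toy_memcg leaf = { .parent = &root };

	mod_stat(&leaf, 1);		/* counted in leaf */
	leaf.percpu_stats_disabled = 1;	/* leaf goes offline */
	mod_stat(&leaf, 1);		/* forwarded to root */
	printf("leaf=%ld root=%ld\n", leaf.stat, root.stat);
	return 0;
}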
1 parent 89a6dfd

2 files changed, 105 insertions(+), 4 deletions(-)

include/linux/memcontrol.h (8 additions, 0 deletions)

@@ -23,6 +23,7 @@
 #include <linux/writeback.h>
 #include <linux/page-flags.h>
 #include <linux/shrinker.h>
+#include <linux/rh_kabi.h>
 
 struct mem_cgroup;
 struct obj_cgroup;
@@ -104,6 +105,7 @@ struct mem_cgroup_per_node {
 	unsigned long		usage_in_excess;/* Set to the value by which */
						/* the soft limit is exceeded*/
 	bool			on_tree;
+	RH_KABI_FILL_HOLE(unsigned short nid)
 #else
 	CACHELINE_PADDING(_pad1_);
 #endif
@@ -322,6 +324,12 @@ struct mem_cgroup {
 	struct list_head event_list;
 	spinlock_t event_list_lock;
 #endif /* CONFIG_MEMCG_V1 */
+	/*
+	 * Disable percpu stats when offline, flush and free them after one
+	 * grace period.
+	 */
+	RH_KABI_EXTEND(int percpu_stats_disabled)
+	RH_KABI_EXTEND(struct rcu_work percpu_stats_rwork)
 
 	struct mem_cgroup_per_node *nodeinfo[];
 };

mm/memcontrol.c (97 additions, 4 deletions)

@@ -95,6 +95,14 @@ static bool cgroup_memory_nobpf __ro_after_init;
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
 
+enum percpu_stats_state {
+	PERCPU_STATS_ACTIVE = 0,
+	PERCPU_STATS_DISABLED,
+	PERCPU_STATS_FLUSHING,
+	PERCPU_STATS_FLUSHED,
+	PERCPU_STATS_FREED
+};
+
 static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
@@ -666,6 +674,30 @@ static int memcg_state_val_in_pages(int idx, int val)
 	return max(val * unit / PAGE_SIZE, 1UL);
 }
 
+/*
+ * Return the active percpu stats memcg and optionally mem_cgroup_per_node.
+ *
+ * When percpu_stats_disabled, the percpu stats update is transferred to
+ * its parent.
+ */
+static __always_inline struct mem_cgroup *
+percpu_stats_memcg(struct mem_cgroup *memcg, struct mem_cgroup_per_node **pn)
+{
+	if (likely(!memcg->percpu_stats_disabled))
+		return memcg;
+
+	do {
+		memcg = parent_mem_cgroup(memcg);
+	} while (memcg->percpu_stats_disabled);
+
+	if (pn) {
+		unsigned int nid = (*pn)->nid;
+
+		*pn = memcg->nodeinfo[nid];
+	}
+	return memcg;
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -683,6 +715,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
+	memcg = percpu_stats_memcg(memcg, NULL);
 	__this_cpu_add(memcg->vmstats_percpu->state[i], val);
 	memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
 }
@@ -716,7 +749,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 		return;
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	memcg = pn->memcg;
+	memcg = percpu_stats_memcg(pn->memcg, &pn);
 
 	/*
 	 * The caller from rmap relies on disabled preemption because they never
@@ -831,6 +864,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
+	memcg = percpu_stats_memcg(memcg, NULL);
 	memcg_stats_lock();
 	__this_cpu_add(memcg->vmstats_percpu->events[i], count);
 	memcg_rstat_updated(memcg, count);
@@ -3437,6 +3471,7 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 
 	lruvec_init(&pn->lruvec);
 	pn->memcg = memcg;
+	pn->nid = node;
 
 	memcg->nodeinfo[node] = pn;
 	return true;
@@ -3453,7 +3488,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;
 
-	free_percpu(pn->lruvec_stats_percpu);
+	//free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn->lruvec_stats);
 	kfree(pn);
 }
@@ -3468,7 +3503,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 		free_mem_cgroup_per_node_info(memcg, node);
 	memcg1_free_events(memcg);
 	kfree(memcg->vmstats);
-	free_percpu(memcg->vmstats_percpu);
+	//free_percpu(memcg->vmstats_percpu);
 	kfree(memcg);
 }
 
@@ -3553,6 +3588,61 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	return ERR_PTR(error);
 }
 
+/*
+ * Flush and free the percpu stats
+ */
+static void percpu_stats_free_rwork_fn(struct work_struct *work)
+{
+	struct mem_cgroup *memcg = container_of(to_rcu_work(work),
+						struct mem_cgroup,
+						percpu_stats_rwork);
+	int node;
+
+	if (cmpxchg(&memcg->percpu_stats_disabled, PERCPU_STATS_DISABLED,
+		    PERCPU_STATS_FLUSHING) != PERCPU_STATS_DISABLED) {
+		static DEFINE_RATELIMIT_STATE(_rs,
+					      DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
+		if (__ratelimit(&_rs))
+			WARN(1, "%s called more than once!\n", __func__);
+		return;
+	}
+
+	cgroup_rstat_flush_hold(memcg->css.cgroup);
+	WRITE_ONCE(memcg->percpu_stats_disabled, PERCPU_STATS_FLUSHED);
+	cgroup_rstat_flush_release(memcg->css.cgroup);
+
+	for_each_node(node) {
+		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
+
+		if (pn)
+			free_percpu(pn->lruvec_stats_percpu);
+	}
+	free_percpu(memcg->vmstats_percpu);
+	WRITE_ONCE(memcg->percpu_stats_disabled, PERCPU_STATS_FREED);
+	css_put(&memcg->css);
+}
+
+static void memcg_percpu_stats_disable(struct mem_cgroup *memcg)
+{
+	/*
+	 * Block memcg from being freed before percpu_stats_free_rwork_fn()
+	 * is called. css_get() will succeed before a potential final
	 * css_put() in mem_cgroup_id_put().
+	 */
+	css_get(&memcg->css);
+	mem_cgroup_id_put(memcg);
+	memcg->percpu_stats_disabled = PERCPU_STATS_DISABLED;
+	INIT_RCU_WORK(&memcg->percpu_stats_rwork, percpu_stats_free_rwork_fn);
+	queue_rcu_work(system_wq, &memcg->percpu_stats_rwork);
+}
+
+static inline bool memcg_percpu_stats_flushed(struct mem_cgroup *memcg)
+{
+	return memcg->percpu_stats_disabled >= PERCPU_STATS_FLUSHED;
+}
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -3666,7 +3756,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 	drain_all_stock(memcg);
 
-	mem_cgroup_id_put(memcg);
+	memcg_percpu_stats_disable(memcg);
 }
 
 static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -3741,6 +3831,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	long delta, delta_cpu, v;
 	int i, nid;
 
+	if (memcg_percpu_stats_flushed(memcg))
+		return;
+
 	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
 
 	for (i = 0; i < MEMCG_VMSTAT_SIZE; i++) {
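
For reference, the offline -> flush -> free handoff implemented by
percpu_stats_free_rwork_fn() above can be reduced to the following
user-space sketch, with C11 atomics standing in for the kernel's
cmpxchg() and WRITE_ONCE(); the flushing and freeing steps are stubbed
out as comments, and the function and variable names are stand-ins.

/*
 * Illustrative user-space sketch only; C11 atomics replace the
 * kernel's cmpxchg()/WRITE_ONCE() and the real work is stubbed out.
 */
#include <stdatomic.h>
#include <stdio.h>

enum percpu_stats_state {
	PERCPU_STATS_ACTIVE = 0,
	PERCPU_STATS_DISABLED,	/* set when the memcg is taken offline */
	PERCPU_STATS_FLUSHING,	/* rcu_work is flushing the stats */
	PERCPU_STATS_FLUSHED,	/* flushed; rstat flush now skips the memcg */
	PERCPU_STATS_FREED	/* percpu memory returned to the system */
};

static _Atomic int state = PERCPU_STATS_ACTIVE;

static void percpu_stats_free_work(void)
{
	int expected = PERCPU_STATS_DISABLED;

	/* Only the first caller may advance DISABLED -> FLUSHING. */
	if (!atomic_compare_exchange_strong(&state, &expected,
					    PERCPU_STATS_FLUSHING)) {
		fprintf(stderr, "percpu_stats_free_work called more than once!\n");
		return;
	}

	/* ... flush the percpu stats into the cumulative counters ... */
	atomic_store(&state, PERCPU_STATS_FLUSHED);

	/* ... free the percpu stats memory ... */
	atomic_store(&state, PERCPU_STATS_FREED);
}

int main(void)
{
	atomic_store(&state, PERCPU_STATS_DISABLED);	/* memcg goes offline */
	percpu_stats_free_work();	/* advances the state to FREED */
	percpu_stats_free_work();	/* second call is rejected */
	printf("final state = %d\n", atomic_load(&state));
	return 0;
}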
