/
memcg-regression-5.15-rc.patch
265 lines (234 loc) · 10.8 KB
/
memcg-regression-5.15-rc.patch
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
* [PATCH] memcg: flush lruvec stats in the refault
@ 2021-09-22 22:49 Shakeel Butt
2021-09-23 10:02 ` Michael Larabel
0 siblings, 1 reply; 2+ messages in thread
From: Shakeel Butt @ 2021-09-22 22:49 UTC (permalink / raw)
To: Johannes Weiner
Cc: Roman Gushchin, Michael Larabel, Feng Tang, Michal Hocko,
Hillf Danton, Michal Koutný,
Andrew Morton, Linus Torvalds, cgroups, linux-mm, linux-kernel,
Shakeel Butt
Prior to the commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat")
and the commit aa48e47e3906 ("memcg: infrastructure to flush memcg
stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
32) at worst and for unbounded amount of time. The commit aa48e47e3906
moved the lruvec stats to rstat infrastructure and the commit
7e1c0d6f5820 bounded the error for all the lruvec stats to (nr_cpus *
32) at worst for at most 2 seconds. More specifically it decoupled the
number of stats and the number of cgroups from the error rate.
However this reduction in error comes with the cost of triggering the
slowpath of stats update more frequently. Previously in the slowpath the
kernel adds the stats up the memcg tree. After aa48e47e3906, the kernel
triggers the asyn lruvec stats flush through queue_work(). This causes
regression reports from 0day kernel bot [1] as well as from phoronix
test suite [2].
We tried two options to fix the regression:
1) Increase the threshold to trigger the slowpath in lruvec stats update
codepath from 32 to 512.
2) Remove the slowpath from lruvec stats update codepath and instead
flush the stats in the page refault codepath. The assumption is that the
kernel timely flush the stats, so, the update tree would be small in the
refault codepath to not cause the preformance impact.
Following are the results of will-it-scale/page_fault[1|2|3] benchmark
on four settings i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
aa48e47e3906 and 7e1c0d6f5820 reverted (3) 5.15-rc1 with option-1
(4) 5.15-rc1 with option-2.
test (1) (2) (3) (4)
pg_f1 368563 406277 (10.23%) 399693 (8.44%) 416398 (12.97%)
pg_f2 338399 372133 (9.96%) 369180 (9.09%) 381024 (12.59%)
pg_f3 500853 575399 (14.88%) 570388 (13.88%) 576083 (15.02%)
From the above result, it seems like the option-2 not only solves the
regression but also improves the performance for at least these
benchmarks.
Feng Tang (intel) ran the aim7 benchmark with these two options and
confirms that option-1 reduces the regression but option-2 removes the
regression.
Michael Larabel (phoronix) ran multiple benchmarks with these options
and reported the results at [3] and it shows for most benchmarks
option-2 removes the regression introduced by the commit aa48e47e3906
("memcg: infrastructure to flush memcg stats").
Based on the experiment results, this patch proposed the option-2 as the
solution to resolve the regression.
[1] https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020
[2] https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress
[3] https://openbenchmarking.org/result/2109226-DEBU-LINUX5104
Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
mm/memcontrol.c | 10 ----------
mm/workingset.c | 1 +
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b762215d73eb..6da5020a8656 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -106,9 +106,6 @@ static bool do_memsw_account(void)
/* memcg and lruvec stats flushing */
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static void flush_memcg_stats_work(struct work_struct *w);
-static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work);
-static DEFINE_PER_CPU(unsigned int, stats_flush_threshold);
static DEFINE_SPINLOCK(stats_flush_lock);
#define THRESHOLDS_EVENTS_TARGET 128
@@ -682,8 +679,6 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
/* Update lruvec */
__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
- if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
- queue_work(system_unbound_wq, &stats_flush_work);
}
/**
@@ -5361,11 +5356,6 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
}
-static void flush_memcg_stats_work(struct work_struct *w)
-{
- mem_cgroup_flush_stats();
-}
-
static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
diff --git a/mm/workingset.c b/mm/workingset.c
index d4268d8e9a82..d5b81e4f4cbe 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -352,6 +352,7 @@ void workingset_refault(struct page *page, void *shadow)
inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
+ mem_cgroup_flush_stats();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.33.0.464.g1972c5931b-goog
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [PATCH] memcg: flush lruvec stats in the refault
2021-09-22 22:49 [PATCH] memcg: flush lruvec stats in the refault Shakeel Butt
@ 2021-09-23 10:02 ` Michael Larabel
0 siblings, 0 replies; 2+ messages in thread
From: Michael Larabel @ 2021-09-23 10:02 UTC (permalink / raw)
To: Shakeel Butt, Johannes Weiner
Cc: Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
Michal Koutný,
Andrew Morton, Linus Torvalds, cgroups, linux-mm, linux-kernel
On 9/22/21 5:49 PM, Shakeel Butt wrote:
> Prior to the commit 7e1c0d6f5820 ("memcg: switch lruvec stats to rstat")
> and the commit aa48e47e3906 ("memcg: infrastructure to flush memcg
> stats"), each lruvec memcg stats can be off by (nr_cgroups * nr_cpus *
> 32) at worst and for unbounded amount of time. The commit aa48e47e3906
> moved the lruvec stats to rstat infrastructure and the commit
> 7e1c0d6f5820 bounded the error for all the lruvec stats to (nr_cpus *
> 32) at worst for at most 2 seconds. More specifically it decoupled the
> number of stats and the number of cgroups from the error rate.
>
> However this reduction in error comes with the cost of triggering the
> slowpath of stats update more frequently. Previously in the slowpath the
> kernel adds the stats up the memcg tree. After aa48e47e3906, the kernel
> triggers the asyn lruvec stats flush through queue_work(). This causes
> regression reports from 0day kernel bot [1] as well as from phoronix
> test suite [2].
>
> We tried two options to fix the regression:
>
> 1) Increase the threshold to trigger the slowpath in lruvec stats update
> codepath from 32 to 512.
>
> 2) Remove the slowpath from lruvec stats update codepath and instead
> flush the stats in the page refault codepath. The assumption is that the
> kernel timely flush the stats, so, the update tree would be small in the
> refault codepath to not cause the preformance impact.
>
> Following are the results of will-it-scale/page_fault[1|2|3] benchmark
> on four settings i.e. (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
> aa48e47e3906 and 7e1c0d6f5820 reverted (3) 5.15-rc1 with option-1
> (4) 5.15-rc1 with option-2.
>
> test (1) (2) (3) (4)
> pg_f1 368563 406277 (10.23%) 399693 (8.44%) 416398 (12.97%)
> pg_f2 338399 372133 (9.96%) 369180 (9.09%) 381024 (12.59%)
> pg_f3 500853 575399 (14.88%) 570388 (13.88%) 576083 (15.02%)
>
> From the above result, it seems like the option-2 not only solves the
> regression but also improves the performance for at least these
> benchmarks.
>
> Feng Tang (intel) ran the aim7 benchmark with these two options and
> confirms that option-1 reduces the regression but option-2 removes the
> regression.
>
> Michael Larabel (phoronix) ran multiple benchmarks with these options
> and reported the results at [3] and it shows for most benchmarks
> option-2 removes the regression introduced by the commit aa48e47e3906
> ("memcg: infrastructure to flush memcg stats").
>
> Based on the experiment results, this patch proposed the option-2 as the
> solution to resolve the regression.
>
> [1] https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020
> [2] https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress
> [3] https://openbenchmarking.org/result/2109226-DEBU-LINUX5104
>
> Fixes: aa48e47e3906 ("memcg: infrastructure to flush memcg stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
> mm/memcontrol.c | 10 ----------
> mm/workingset.c | 1 +
> 2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b762215d73eb..6da5020a8656 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -106,9 +106,6 @@ static bool do_memsw_account(void)
> /* memcg and lruvec stats flushing */
> static void flush_memcg_stats_dwork(struct work_struct *w);
> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> -static void flush_memcg_stats_work(struct work_struct *w);
> -static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work);
> -static DEFINE_PER_CPU(unsigned int, stats_flush_threshold);
> static DEFINE_SPINLOCK(stats_flush_lock);
>
> #define THRESHOLDS_EVENTS_TARGET 128
> @@ -682,8 +679,6 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>
> /* Update lruvec */
> __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> - if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> - queue_work(system_unbound_wq, &stats_flush_work);
> }
>
> /**
> @@ -5361,11 +5356,6 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
> queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
> }
>
> -static void flush_memcg_stats_work(struct work_struct *w)
> -{
> - mem_cgroup_flush_stats();
> -}
> -
> static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> diff --git a/mm/workingset.c b/mm/workingset.c
> index d4268d8e9a82..d5b81e4f4cbe 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -352,6 +352,7 @@ void workingset_refault(struct page *page, void *shadow)
>
> inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
>
> + mem_cgroup_flush_stats();
> /*
> * Compare the distance to the existing workingset size. We
> * don't activate pages that couldn't stay resident even if
This patch is still looking good in my additional testing so far.
With an Intel Core i9 11900K system where it was quite noticeably
regressing in various workloads, things now seem to be under control[1]
with this patch and inline with v5.14 / 5.15 with the memcg patches
reverted.
If desired:
Tested-by: Michael Larabel <Michael@phoronix.com>
Michael
[1]
https://openbenchmarking.org/result/2109221-DEBU-LINUX5143&sgm=1&hgv=flush-in-the-ref&sor