Commit cdc2fcf

mina authored and torvalds committed
hugetlb_cgroup: add hugetlb_cgroup reservation counter
These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter; following patches add the charging and uncharging of the counter.

This is patch 1 of a 9 patch series.

Problem:

Currently tasks attempting to reserve more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to reserve more hugetlb memory than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault in the excess memory.

We have users hitting their hugetlb_cgroup limits and thus we've been looking at this failure mode. We'd like to improve this behavior such that users violating the hugetlb_cgroup limits get an error at mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in. This gives the user an opportunity to fall back more gracefully to non-hugetlbfs memory, for example.

The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.

Proposed Solution:

A new page counter named 'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has slightly different semantics than 'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

- While usage_in_bytes tracks all *faulted* hugetlb memory, rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb memory faulted in without a prior reservation.

- If a task attempts to reserve more memory than limit_in_bytes allows, the kernel will allow it to do so. But if a task attempts to reserve more memory than rsvd.limit_in_bytes, the kernel will fail this reservation.

This proposal is implemented in this patch series, with tests to verify functionality and show the usage.

Alternatives considered:

1. A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via rsvd.limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach gives them the option to do so.

Testing:

- Added tests passing.
- Used libhugetlbfs for regression testing.

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
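The graceful fallback mentioned in the problem statement could look roughly like the userspace sketch below. It assumes a Linux system with glibc; `map_with_fallback` is a hypothetical helper name, not something this series provides. The point is only that a reservation failure reported by mmap() can be handled, whereas a SIGBUS at fault time cannot be handled this simply.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try to map `len` bytes backed by huge pages.
 * If the kernel refuses the reservation at mmap() time, fall back to
 * ordinary anonymous memory. Returns NULL only if both attempts fail.
 */
void *map_with_fallback(size_t len, bool *used_hugetlb)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p != MAP_FAILED) {
		*used_hugetlb = true;
		return p;
	}

	/* Reservation failed: retry without MAP_HUGETLB. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	*used_hugetlb = false;
	return p == MAP_FAILED ? NULL : p;
}
```

With only the fault-time limit in place, the first mmap() can succeed even when the cgroup limit would be exceeded, so the fallback branch is never taken and the task is killed later on first touch.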
1 parent 87bf91d commit cdc2fcf

2 files changed: +104 -15 lines changed

include/linux/hugetlb.h

Lines changed: 2 additions & 2 deletions
@@ -440,8 +440,8 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files_dfl[5];
-	struct cftype cgroup_files_legacy[5];
+	struct cftype cgroup_files_dfl[7];
+	struct cftype cgroup_files_legacy[9];
 #endif
 	char name[HSTATE_NAME_LEN];
 };

mm/hugetlb_cgroup.c

Lines changed: 102 additions & 13 deletions
@@ -36,6 +36,11 @@ struct hugetlb_cgroup {
 	 */
 	struct page_counter hugepage[HUGE_MAX_HSTATE];

+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter rsvd_hugepage[HUGE_MAX_HSTATE];
+
 	atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
 	atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];

@@ -55,6 +60,15 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter *
+hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx,
+				   bool rsvd)
+{
+	if (rsvd)
+		return &h_cg->rsvd_hugepage[idx];
+	return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
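To illustrate how the `rsvd` flag separates the two limits, here is a simplified userspace model of the selection helper. Everything in it is an assumption for illustration: `struct counter` is a stand-in for the kernel's `struct page_counter`, and `try_charge` is hypothetical, since this patch only adds the counter and the real charging arrives in later patches of the series.

```c
#include <stdbool.h>

#define MODEL_MAX_HSTATE 2	/* hypothetical number of huge page sizes */

/* Simplified stand-in for struct page_counter (values in pages). */
struct counter {
	unsigned long usage;
	unsigned long max;
	unsigned long failcnt;
};

struct hcg_model {
	struct counter hugepage[MODEL_MAX_HSTATE];	/* fault accounting */
	struct counter rsvd_hugepage[MODEL_MAX_HSTATE];	/* reservation accounting */
};

/* Same selection logic as hugetlb_cgroup_counter_from_cgroup(). */
struct counter *counter_from_model(struct hcg_model *cg, int idx, bool rsvd)
{
	if (rsvd)
		return &cg->rsvd_hugepage[idx];
	return &cg->hugepage[idx];
}

/* Hypothetical charge: 0 on success, -1 (and failcnt++) if over max. */
int try_charge(struct counter *c, unsigned long nr_pages)
{
	if (c->usage + nr_pages > c->max) {
		c->failcnt++;
		return -1;
	}
	c->usage += nr_pages;
	return 0;
}
```

In this model a cgroup with a fault limit of 8 pages and a reservation limit of 4 pages rejects a 6-page reservation while still accepting 6 faulted pages, because each request is checked against its own counter.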
@@ -294,28 +308,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,

 enum {
 	RES_USAGE,
+	RES_RSVD_USAGE,
 	RES_LIMIT,
+	RES_RSVD_LIMIT,
 	RES_MAX_USAGE,
+	RES_RSVD_MAX_USAGE,
 	RES_FAILCNT,
+	RES_RSVD_FAILCNT,
 };

 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
 	struct page_counter *counter;
+	struct page_counter *rsvd_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

 	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(cft->private)];

 	switch (MEMFILE_ATTR(cft->private)) {
 	case RES_USAGE:
 		return (u64)page_counter_read(counter) * PAGE_SIZE;
+	case RES_RSVD_USAGE:
+		return (u64)page_counter_read(rsvd_counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->max * PAGE_SIZE;
+	case RES_RSVD_LIMIT:
+		return (u64)rsvd_counter->max * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_RSVD_MAX_USAGE:
+		return (u64)rsvd_counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
 		return counter->failcnt;
+	case RES_RSVD_FAILCNT:
+		return rsvd_counter->failcnt;
 	default:
 		BUG();
 	}
@@ -337,10 +365,16 @@ static int hugetlb_cgroup_read_u64_max(struct seq_file *seq, void *v)
 			   1 << huge_page_order(&hstates[idx]));

 	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_RSVD_USAGE:
+		counter = &h_cg->rsvd_hugepage[idx];
+		/* Fall through. */
 	case RES_USAGE:
 		val = (u64)page_counter_read(counter);
 		seq_printf(seq, "%llu\n", val * PAGE_SIZE);
 		break;
+	case RES_RSVD_LIMIT:
+		counter = &h_cg->rsvd_hugepage[idx];
+		/* Fall through. */
 	case RES_LIMIT:
 		val = (u64)counter->max;
 		if (val == limit)
@@ -364,6 +398,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	int ret, idx;
 	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+	bool rsvd = false;

 	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
 		return -EINVAL;
@@ -377,9 +412,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
+	case RES_RSVD_LIMIT:
+		rsvd = true;
+		/* Fall through. */
 	case RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+		ret = page_counter_set_max(
+			hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
+			nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
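The `round_down()` in the context line above applies to both the fault limit and the new reservation limit, so a written value is truncated to a whole number of huge pages. The sketch below models that arithmetic in userspace. It assumes 4 KiB base pages and assumes the written byte value has already been converted to base pages before this point (that step is not visible in the hunk); `limit_bytes_to_pages` and `MODEL_PAGE_SHIFT` are hypothetical names.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/*
 * Convert a limit in bytes to base pages, then round down to a
 * multiple of the number of base pages per huge page (a power of two),
 * mirroring round_down(nr_pages, 1 << huge_page_order(h)).
 */
uint64_t limit_bytes_to_pages(uint64_t bytes, unsigned int huge_page_order)
{
	uint64_t nr_pages = bytes >> MODEL_PAGE_SHIFT;
	uint64_t per_huge = UINT64_C(1) << huge_page_order;

	return nr_pages & ~(per_huge - 1);
}
```

For example, a 5 MiB limit with 2 MiB huge pages (order 9 on 4 KiB base pages) becomes 1024 base pages, i.e. 4 MiB, and a 1 MiB limit rounds down to zero.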
@@ -405,18 +445,25 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
 	int ret = 0;
-	struct page_counter *counter;
+	struct page_counter *counter, *rsvd_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

 	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(of_cft(of)->private)];

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
 		page_counter_reset_watermark(counter);
 		break;
+	case RES_RSVD_MAX_USAGE:
+		page_counter_reset_watermark(rsvd_counter);
+		break;
 	case RES_FAILCNT:
 		counter->failcnt = 0;
 		break;
+	case RES_RSVD_FAILCNT:
+		rsvd_counter->failcnt = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -471,7 +518,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
 	cft = &h->cgroup_files_dfl[0];
@@ -481,23 +528,38 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->write = hugetlb_cgroup_write_dfl;
 	cft->flags = CFTYPE_NOT_ON_ROOT;

-	/* Add the current usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files_dfl[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT);
+	cft->seq_show = hugetlb_cgroup_read_u64_max;
+	cft->write = hugetlb_cgroup_write_dfl;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
+	/* Add the current usage file */
+	cft = &h->cgroup_files_dfl[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.current", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->seq_show = hugetlb_cgroup_read_u64_max;
 	cft->flags = CFTYPE_NOT_ON_ROOT;

+	/* Add the current reservation usage file */
+	cft = &h->cgroup_files_dfl[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.current", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE);
+	cft->seq_show = hugetlb_cgroup_read_u64_max;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
 	/* Add the events file */
-	cft = &h->cgroup_files_dfl[2];
+	cft = &h->cgroup_files_dfl[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_show;
 	cft->file_offset = offsetof(struct hugetlb_cgroup, events_file[idx]),
 	cft->flags = CFTYPE_NOT_ON_ROOT;

 	/* Add the events.local file */
-	cft = &h->cgroup_files_dfl[3];
+	cft = &h->cgroup_files_dfl[5];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_local_show;
@@ -506,7 +568,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->flags = CFTYPE_NOT_ON_ROOT;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_dfl[4];
+	cft = &h->cgroup_files_dfl[6];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -520,7 +582,7 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
 	cft = &h->cgroup_files_legacy[0];
@@ -529,28 +591,55 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 	cft->write = hugetlb_cgroup_write_legacy;

-	/* Add the usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files_legacy[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.limit_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+	cft->write = hugetlb_cgroup_write_legacy;
+
+	/* Add the usage file */
+	cft = &h->cgroup_files_legacy[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the reservation usage file */
+	cft = &h->cgroup_files_legacy[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the MAX usage file */
-	cft = &h->cgroup_files_legacy[2];
+	cft = &h->cgroup_files_legacy[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the MAX reservation usage file */
+	cft = &h->cgroup_files_legacy[5];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max_usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_MAX_USAGE);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the failcntfile */
-	cft = &h->cgroup_files_legacy[3];
+	cft = &h->cgroup_files_legacy[6];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation failcntfile */
+	cft = &h->cgroup_files_legacy[7];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.failcnt", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_FAILCNT);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_legacy[4];
+	cft = &h->cgroup_files_legacy[8];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
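As a rough guide to the file names these init functions produce, the sketch below rebuilds the legacy reservation names in userspace. Two things are assumptions here: `fmt_size` only approximates what `mem_fmt()` is expected to do (its body is not part of this diff), and the `hugetlb.` prefix is taken from the names quoted in the commit message rather than from code in this patch. Both helper names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed approximation of mem_fmt(): render a size as KB, MB or GB. */
void fmt_size(char *buf, size_t len, unsigned long long hsize)
{
	if (hsize >= (1ULL << 30))
		snprintf(buf, len, "%lluGB", hsize >> 30);
	else if (hsize >= (1ULL << 20))
		snprintf(buf, len, "%lluMB", hsize >> 20);
	else
		snprintf(buf, len, "%lluKB", hsize >> 10);
}

/*
 * Build a legacy reservation file name such as
 * "hugetlb.2MB.rsvd.limit_in_bytes". `attr` is one of the suffixes
 * registered in the hunk above (limit_in_bytes, usage_in_bytes,
 * max_usage_in_bytes, failcnt). Returns snprintf()'s result.
 */
int rsvd_file_name(char *out, size_t len, unsigned long long hsize,
		   const char *attr)
{
	char size[32];

	fmt_size(size, sizeof(size), hsize);
	return snprintf(out, len, "hugetlb.%s.rsvd.%s", size, attr);
}
```

Passing `sizeof(size)` rather than a literal 32 follows the same cleanup the patch makes to the `mem_fmt()` calls.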
