Commit 7d8faaf

zokeefe authored and akpm00 committed
mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a synchronous collapse of memory at their own expense. The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the THP
* Avoids the unpredictable timing of khugepaged collapse

Semantics

This call is independent of the system-wide THP sysfs settings, but will fail for memory marked VM_NOHUGEPAGE. If the ranges provided span multiple VMAs, the semantics of the collapse over each VMA are independent from the others. This implies a hugepage cannot cross a VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the operation may continue to attempt collapsing the remainder of the specified memory.

The memory ranges provided must be page-aligned, but are not required to be hugepage-aligned. If the ranges are not hugepage-aligned, the start/end of the range will be clamped to the first/last hugepage-aligned address covered by the range. The memory ranges must span at least one hugepage-sized region.

All non-resident pages covered by the range will first be swapped/faulted-in before being internally copied onto a freshly allocated hugepage. Unmapped pages will have their data directly initialized to 0 in the new hugepage. However, for every eligible hugepage-aligned/sized region to be collapsed, at least one page must currently be backed by memory (a PMD covering the address range must already exist). Allocation for the new hugepage may enter direct reclaim and/or compaction, regardless of VMA flags. When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages.

This operation acts on the current state of the specified process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future.

Return Value

If all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed or were already PMD-mapped THPs, this operation is deemed successful. On success, process_madvise(2) returns the number of bytes advised, and madvise(2) returns 0. Otherwise, -1 is returned and errno is set to indicate the error for the most recently attempted hugepage collapse. Note that many failures might have occurred, since the operation may continue to collapse in the event a single hugepage-sized/aligned region fails.

ENOMEM	Memory allocation failed or VMA not found
EBUSY	Memcg charging failed
EAGAIN	Required resource temporarily unavailable; trying again might succeed
EINVAL	Other error: no PMD found, subpage doesn't have its Present bit set, "special" page not backed by struct page, VMA incorrectly sized, address not page-aligned, ...

Most notable here are ENOMEM and EBUSY (new to madvise), which are intended to provide the caller with actionable feedback so they may take an appropriate fallback measure.
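A minimal userspace sketch of acting on these error codes; try_collapse() is an illustrative helper, not part of this commit, and MADV_COLLAPSE is defined by hand for older headers using the asm-generic value 25 added below (parisc uses 73):

#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* asm-generic value added by this commit */
#endif

/*
 * Illustrative helper: attempt a synchronous collapse of [addr, addr + len)
 * and map the new errno values onto a simple retry/fallback decision.
 */
static int try_collapse(void *addr, size_t len)
{
	if (!madvise(addr, len, MADV_COLLAPSE))
		return 0;		/* every aligned region is a THP now */

	switch (errno) {
	case EAGAIN:
		/* Transient contention (page lock/LRU); retry once. */
		return madvise(addr, len, MADV_COLLAPSE) ? -errno : 0;
	case ENOMEM:			/* hugepage allocation failed */
	case EBUSY:			/* memcg charge failed */
		return -errno;		/* back off; stay native-mapped */
	default:
		return -errno;		/* e.g. EINVAL: intrinsic to the range */
	}
}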
Use Cases

Immediate users of this new functionality are malloc() implementations that manage memory in hugepage-sized chunks, but sometimes subrelease memory back to the system in native-sized chunks via MADV_DONTNEED, zapping the PMD. Later, when the memory is hot, the implementation can madvise(MADV_COLLAPSE) to re-back the memory with THPs to regain hugepage coverage and dTLB performance. TCMalloc is such an implementation that could benefit from this[2]; a sketch of that cycle follows the tags below.

Only privately-mapped anon memory is supported for now, but additional support for file, shmem, and HugeTLB high-granularity mappings[2] is expected. File and tmpfs/shmem support would permit:

* Backing executable text by THPs. The current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which might keep services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevent page sharing and demand paging, both of which increase steady-state memory footprint. With MADV_COLLAPSE, we get the best of both worlds: peak upfront performance and lower RAM footprints.

* Backing guest memory by hugepages after the memory contents have been migrated in native-page-sized chunks to a new host, in a userfaultfd-based live-migration stack.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc

[jrdr.linux@gmail.com: avoid possible memory leak in failure path]
  Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
[zokeefe@google.com: add missing kfree() to madvise_collapse()]
  Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
  Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
[zokeefe@google.com: delay computation of hpage boundaries until use]
  Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Suggested-by: David Rientjes <rientjes@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
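To make the malloc() use case above concrete, a hedged sketch of the subrelease/re-collapse cycle; the helper names and the 2 MiB hugepage size are illustrative assumptions, not part of this commit:

#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* asm-generic value added by this commit */
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumed PMD-sized hugepage (x86-64) */

/* Cold chunk: return native pages to the kernel, zapping the PMD. */
static void subrelease(void *chunk, size_t len)
{
	madvise(chunk, len, MADV_DONTNEED);
}

/*
 * Hot again: once the allocator has touched the region (at least one
 * page per hugepage-sized region must be resident), ask the kernel to
 * synchronously re-back it with THPs, charging the CPU cost to us
 * instead of waiting for khugepaged.
 */
static int reback_with_thp(void *region, size_t len)
{
	/*
	 * The kernel clamps to hugepage-aligned boundaries; a range that
	 * covers no full aligned hugepage collapses nothing, so reject
	 * the obviously-too-small case up front.
	 */
	if (len < HPAGE_SIZE)
		return -EINVAL;
	return madvise(region, len, MADV_COLLAPSE) ? -errno : 0;
}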
1 parent 5072280 commit 7d8faaf

File tree

9 files changed (+147, -3 lines)

arch/alpha/include/uapi/asm/mman.h

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,8 @@
 
 #define MADV_DONTNEED_LOCKED 24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE 25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE 0

arch/mips/include/uapi/asm/mman.h

Lines changed: 2 additions & 0 deletions
@@ -103,6 +103,8 @@
 
 #define MADV_DONTNEED_LOCKED 24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE 25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE 0

arch/parisc/include/uapi/asm/mman.h

Lines changed: 2 additions & 0 deletions
@@ -70,6 +70,8 @@
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE 73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON 100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */

arch/xtensa/include/uapi/asm/mman.h

Lines changed: 2 additions & 0 deletions
@@ -111,6 +111,8 @@
 
 #define MADV_DONTNEED_LOCKED 24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE 25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE 0

include/linux/huge_mm.h

Lines changed: 12 additions & 2 deletions
@@ -218,6 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -361,9 +364,16 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
 {
-	BUG();
-	return 0;
+	return -EINVAL;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	return -EINVAL;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,

include/uapi/asm-generic/mman-common.h

Lines changed: 2 additions & 0 deletions
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED 24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE 25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE 0

mm/khugepaged.c

Lines changed: 118 additions & 1 deletion
@@ -982,7 +982,8 @@ static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			      struct collapse_control *cc)
 {
 	/* Only allocate from the target node */
-	gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
+	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
+		     GFP_TRANSHUGE) | __GFP_THISNODE;
 	int node = khugepaged_find_target_node(cc);
 
 	if (!khugepaged_alloc_page(hpage, gfp, node))
@@ -2362,3 +2363,119 @@ void khugepaged_min_free_kbytes_update(void)
 	set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	/*
+	 * MADV_COLLAPSE breaks from existing madvise(2) conventions to provide
+	 * actionable feedback to caller, so they may take an appropriate
+	 * fallback measure depending on the nature of the failure.
+	 */
+	switch (r) {
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+		return -ENOMEM;
+	case SCAN_CGROUP_CHARGE_FAIL:
+		return -EBUSY;
+	/* Resource temporarily unavailable - trying again might succeed */
+	case SCAN_PAGE_LOCK:
+	case SCAN_PAGE_LRU:
+		return -EAGAIN;
+	/*
+	 * Other: Trying again likely not to succeed / error intrinsic to
+	 * specified memory range. khugepaged likely won't be able to collapse
+	 * either.
+	 */
+	default:
+		return -EINVAL;
+	}
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control *cc;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long hstart, hend, addr;
+	int thps = 0, last_fail = SCAN_FAIL;
+	bool mmap_locked = true;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	/* TODO: Support file/shmem */
+	if (!vma->anon_vma || !vma_is_anonymous(vma))
+		return -EINVAL;
+
+	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+		return -EINVAL;
+
+	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
+	if (!cc)
+		return -ENOMEM;
+	cc->is_khugepaged = false;
+	cc->last_target_node = NUMA_NO_NODE;
+
+	mmgrab(mm);
+	lru_add_drain_all();
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+
+	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
+		int result = SCAN_FAIL;
+
+		if (!mmap_locked) {
+			cond_resched();
+			mmap_read_lock(mm);
+			mmap_locked = true;
+			result = hugepage_vma_revalidate(mm, addr, &vma, cc);
+			if (result != SCAN_SUCCEED) {
+				last_fail = result;
+				goto out_nolock;
+			}
+		}
+		mmap_assert_locked(mm);
+		memset(cc->node_load, 0, sizeof(cc->node_load));
+		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, cc);
+		if (!mmap_locked)
+			*prev = NULL;  /* Tell caller we dropped mmap_lock */
+
+		switch (result) {
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+			break;
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+		case SCAN_PAGE_LRU:
+			last_fail = result;
+			break;
+		default:
+			last_fail = result;
+			/* Other error, exit */
+			goto out_maybelock;
+		}
+	}
+
+out_maybelock:
+	/* Caller expects us to hold mmap_lock on return */
+	if (!mmap_locked)
+		mmap_read_lock(mm);
+out_nolock:
+	mmap_assert_locked(mm);
+	mmdrop(mm);
+	kfree(cc);
+
+	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
+			: madvise_collapse_errno(last_fail);
+}

mm/madvise.c

Lines changed: 5 additions & 0 deletions
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1057,6 +1058,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1150,6 +1153,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1339,6 +1343,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.

tools/include/uapi/asm-generic/mman-common.h

Lines changed: 2 additions & 0 deletions
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED 24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE 25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE 0
