
Commit 6af8cb8

davidhildenbrand authored and akpm00 committed
mm/rmap: basic MM owner tracking for large folios (!hugetlb)
For small folios, we traditionally use the mapcount to decide whether it was "certainly mapped exclusively" by a single MM (mapcount == 1) or whether it was "maybe mapped shared" by multiple MMs (mapcount > 1). For PMD-sized folios that were PMD-mapped, we were able to use a similar mechanism (single PMD mapping), but for PTE-mapped folios and, in the future, folios that span multiple PMDs, this does not work. So we need a different mechanism to handle large folios.

Let's add a new mechanism to detect whether a large folio is "certainly mapped exclusively" or whether it is "maybe mapped shared". We'll use this information next to optimize CoW reuse for PTE-mapped anonymous THP, and to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), independent of per-page mapcounts.

For each large folio, we'll have two slots, whereby a slot stores:
 (1) an MM id: a unique id assigned to each MM
 (2) a per-MM mapcount

If a slot is unoccupied, it can be taken by the next MM that maps a folio page.

In addition, we'll remember the current state -- "mapped exclusively" vs. "maybe mapped shared" -- and use a bit spinlock to sync on updates and to reduce the total number of atomic accesses on updates. In the future, it might be possible to squeeze a proper spinlock into "struct folio". For now, keep it simple, as we only require the whole thing with THP, which is incompatible with RT.

As we have to squeeze this information into the "struct folio" of even folios of order-1 (2 pages), and we generally want to reduce the required metadata, we'll assign each MM a unique ID that can fit into an int. In total, we can squeeze everything into 4x int (2x long) on 64bit.

32bit support is a bit challenging, because there we only have 2x long == 2x int in order-1 folios. But we can make it work for now, because we neither expect many MMs nor very large folios on 32bit.

We will reliably detect folios as "mapped exclusively" vs. "mapped shared" as long as only two MMs map pages of a folio at one point in time -- for example with fork() and short-lived child processes, or with apps that hand over state from one instance to another. As soon as three MMs are involved at the same time, we might detect "maybe mapped shared" although the folio is actually "mapped exclusively".

Example 1:

 (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
 (2) App2 faults in a folio page -> Tracked as MM1
 (3) App1 unmaps all folio pages

 -> We will detect "mapped exclusively".

Example 2:

 (1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
 (2) App2 faults in a folio page -> Tracked as MM1
 (3) App3 faults in a folio page -> No slot available, tracked as "unknown"
 (4) App1 and App2 unmap all folio pages

 -> We will detect "maybe mapped shared".

Make use of __always_inline to keep possible performance degradation when (un)mapping large folios to a minimum.

Note: by squeezing the two flags into the "unsigned long" that stores the MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic setting/clearing of the "maybe mapped shared" bit, effectively not adding any new atomics on the hot path when updating the large mapcount + new metadata, which further helps reduce the runtime overhead in micro-benchmarks.
Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
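As a minimal sketch (not from the commit; hypothetical names, counters simplified to start at 0 instead of -1), the two-slot bookkeeping and the outcomes of Example 1 and Example 2 above can be modeled in a small, self-contained userspace C program:

/* Hypothetical model of the two-slot MM-owner tracking; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define ID_NONE 0                       /* stands in for MM_ID_DUMMY */

struct folio_model {
        int slot_id[2];                 /* owning MM id per slot, ID_NONE if free */
        int slot_mapcount[2];           /* pages of the folio mapped by that MM */
        int large_mapcount;             /* pages of the folio mapped by anyone */
        bool maybe_shared;              /* sticky until one slot owns all mappings */
};

static void model_map(struct folio_model *f, int mm_id, int nr)
{
        f->large_mapcount += nr;
        for (int i = 0; i < 2; i++) {
                if (f->slot_id[i] == mm_id) {
                        f->slot_mapcount[i] += nr;
                        return;
                }
        }
        for (int i = 0; i < 2; i++) {
                if (f->slot_id[i] == ID_NONE) {
                        f->slot_id[i] = mm_id;
                        f->slot_mapcount[i] = nr;
                        /* other mappings already exist -> maybe shared */
                        if (f->large_mapcount != nr || i == 1)
                                f->maybe_shared = true;
                        return;
                }
        }
        f->maybe_shared = true;         /* third MM: no slot, tracked as "unknown" */
}

static void model_unmap(struct folio_model *f, int mm_id, int nr)
{
        f->large_mapcount -= nr;
        for (int i = 0; i < 2; i++) {
                if (f->slot_id[i] == mm_id) {
                        f->slot_mapcount[i] -= nr;
                        if (f->slot_mapcount[i] <= 0) {
                                f->slot_id[i] = ID_NONE;
                                f->slot_mapcount[i] = 0;
                        }
                        break;
                }
        }
        /* one slot (or nobody) owns all remaining mappings -> exclusive again */
        if (f->slot_mapcount[0] == f->large_mapcount ||
            f->slot_mapcount[1] == f->large_mapcount)
                f->maybe_shared = false;
}

int main(void)
{
        struct folio_model f = { 0 }, g = { 0 };

        /* Example 1: two MMs, the first one goes away -> exclusive again */
        model_map(&f, 1, 4);            /* App1 */
        model_map(&f, 2, 4);            /* App2 -> maybe shared */
        model_unmap(&f, 1, 4);          /* App1 unmaps -> exclusive */
        printf("example 1: maybe_shared=%d\n", f.maybe_shared);

        /* Example 2: a third MM can only be tracked as "unknown" */
        model_map(&g, 1, 4);
        model_map(&g, 2, 4);
        model_map(&g, 3, 4);            /* no slot available */
        model_unmap(&g, 1, 4);
        model_unmap(&g, 2, 4);          /* only App3 remains, but we cannot prove it */
        printf("example 2: maybe_shared=%d\n", g.maybe_shared);
        return 0;
}

Running this prints maybe_shared=0 for Example 1 and maybe_shared=1 for Example 2, matching the behavior described above.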
1 parent 4488544 commit 6af8cb8

8 files changed: +281 -0 lines changed

Documentation/mm/transhuge.rst

Lines changed: 8 additions & 0 deletions
@@ -120,11 +120,19 @@ pages:
    and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED
    when _entire_mapcount goes from -1 to 0 or 0 to -1.
 
+   We also maintain the two slots for tracking MM owners (MM ID and
+   corresponding mapcount), and the current status ("maybe mapped shared" vs.
+   "mapped exclusively").
+
  - map/unmap of individual pages with PTE entry increment/decrement
    page->_mapcount, increment/decrement folio->_large_mapcount and also
    increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
    from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.
 
+   We also maintain the two slots for tracking MM owners (MM ID and
+   corresponding mapcount), and the current status ("maybe mapped shared" vs.
+   "mapped exclusively").
+
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
 structures. It can be done easily for refcounts taken by page table

include/linux/mm_types.h

Lines changed: 49 additions & 0 deletions
@@ -292,6 +292,44 @@ typedef struct {
 #define NR_PAGES_IN_LARGE_FOLIO
 #endif
 
+/*
+ * On 32bit, we can cut the required metadata in half, because:
+ * (a) PID_MAX_LIMIT implicitly limits the number of MMs we could ever have,
+ *     so we can limit MM IDs to 15 bit (32767).
+ * (b) We don't expect folios where even a single complete PTE mapping by
+ *     one MM would exceed 15 bits (order-15).
+ */
+#ifdef CONFIG_64BIT
+typedef int mm_id_mapcount_t;
+#define MM_ID_MAPCOUNT_MAX              INT_MAX
+typedef unsigned int mm_id_t;
+#else /* !CONFIG_64BIT */
+typedef short mm_id_mapcount_t;
+#define MM_ID_MAPCOUNT_MAX              SHRT_MAX
+typedef unsigned short mm_id_t;
+#endif /* CONFIG_64BIT */
+
+/* We implicitly use the dummy ID for init-mm etc. where we never rmap pages. */
+#define MM_ID_DUMMY                     0
+#define MM_ID_MIN                       (MM_ID_DUMMY + 1)
+
+/*
+ * We leave the highest bit of each MM id unused, so we can store a flag
+ * in the highest bit of each folio->_mm_id[].
+ */
+#define MM_ID_BITS                      ((sizeof(mm_id_t) * BITS_PER_BYTE) - 1)
+#define MM_ID_MASK                      ((1U << MM_ID_BITS) - 1)
+#define MM_ID_MAX                       MM_ID_MASK
+
+/*
+ * In order to use bit_spin_lock(), which requires an unsigned long, we
+ * operate on folio->_mm_ids when working on flags.
+ */
+#define FOLIO_MM_IDS_LOCK_BITNUM        MM_ID_BITS
+#define FOLIO_MM_IDS_LOCK_BIT           BIT(FOLIO_MM_IDS_LOCK_BITNUM)
+#define FOLIO_MM_IDS_SHARED_BITNUM      (2 * MM_ID_BITS + 1)
+#define FOLIO_MM_IDS_SHARED_BIT         BIT(FOLIO_MM_IDS_SHARED_BITNUM)
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
@@ -318,6 +356,9 @@ typedef struct {
  * @_nr_pages_mapped: Do not use outside of rmap and debug code.
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
  * @_nr_pages: Do not use directly, call folio_nr_pages().
+ * @_mm_id: Do not use outside of rmap code.
+ * @_mm_ids: Do not use outside of rmap code.
+ * @_mm_id_mapcount: Do not use outside of rmap code.
  * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
  * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
  * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
@@ -390,6 +431,11 @@ struct folio {
                         atomic_t _entire_mapcount;
                         atomic_t _pincount;
 #endif /* CONFIG_64BIT */
+                        mm_id_mapcount_t _mm_id_mapcount[2];
+                        union {
+                                mm_id_t _mm_id[2];
+                                unsigned long _mm_ids;
+                        };
                 /* private: the union with struct page is transitional */
                 };
                 unsigned long _usable_1[4];
@@ -1114,6 +1160,9 @@ struct mm_struct {
 #endif
         } lru_gen;
 #endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_MM_ID
+        mm_id_t mm_id;
+#endif /* CONFIG_MM_ID */
 } __randomize_layout;
 
 /*
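For orientation, a brief sketch (not from the commit; names carry an EXAMPLE_ prefix to mark them as illustrative) of the resulting layout of folio->_mm_ids on a little-endian CONFIG_64BIT build, where mm_id_t is a 32-bit unsigned int and MM_ID_BITS is therefore 31:

/*
 * Illustration only: the 64-bit _mm_ids word, little-endian overlay.
 *
 *   bit 63        FOLIO_MM_IDS_SHARED_BIT ("maybe mapped shared")
 *   bits 62..32   MM id held in _mm_id[1]
 *   bit 31        FOLIO_MM_IDS_LOCK_BIT (bit spinlock)
 *   bits 30..0    MM id held in _mm_id[0]
 */
#include <assert.h>
#include <limits.h>

#define EXAMPLE_MM_ID_BITS      ((sizeof(unsigned int) * CHAR_BIT) - 1)
static_assert(EXAMPLE_MM_ID_BITS == 31, "31-bit MM ids on 64bit");
static_assert(2 * EXAMPLE_MM_ID_BITS + 1 == 63,
              "two ids + lock bit + shared bit exactly fill one 64-bit word");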

include/linux/page-flags.h

Lines changed: 4 additions & 0 deletions
@@ -1185,6 +1185,10 @@ static inline int folio_has_private(const struct folio *folio)
         return !!(folio->flags & PAGE_FLAGS_PRIVATE);
 }
 
+static inline bool folio_test_large_maybe_mapped_shared(const struct folio *folio)
+{
+        return test_bit(FOLIO_MM_IDS_SHARED_BITNUM, &folio->_mm_ids);
+}
 #undef PF_ANY
 #undef PF_HEAD
 #undef PF_NO_TAIL
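A possible caller sketch (hypothetical helper, not part of this commit) showing how the new test could sit next to the classic mapcount check; hugetlb folios are outside the scope of this tracking, as the subject line notes:

/* Hypothetical: decide "maybe mapped shared" for small vs. large (!hugetlb) folios. */
static inline bool example_folio_maybe_mapped_shared(struct folio *folio)
{
        if (!folio_test_large(folio))
                return folio_mapcount(folio) > 1;       /* a single mapping is exclusive */
        return folio_test_large_maybe_mapped_shared(folio);
}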

include/linux/rmap.h

Lines changed: 165 additions & 0 deletions
@@ -13,6 +13,7 @@
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/memremap.h>
+#include <linux/bit_spinlock.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -173,6 +174,169 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *folio_get_anon_vma(const struct folio *folio);
 
+#ifdef CONFIG_MM_ID
+static __always_inline void folio_lock_large_mapcount(struct folio *folio)
+{
+        bit_spin_lock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
+}
+
+static __always_inline void folio_unlock_large_mapcount(struct folio *folio)
+{
+        __bit_spin_unlock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
+}
+
+static inline unsigned int folio_mm_id(const struct folio *folio, int idx)
+{
+        VM_WARN_ON_ONCE(idx != 0 && idx != 1);
+        return folio->_mm_id[idx] & MM_ID_MASK;
+}
+
+static inline void folio_set_mm_id(struct folio *folio, int idx, mm_id_t id)
+{
+        VM_WARN_ON_ONCE(idx != 0 && idx != 1);
+        folio->_mm_id[idx] &= ~MM_ID_MASK;
+        folio->_mm_id[idx] |= id;
+}
+
+static inline void __folio_large_mapcount_sanity_checks(const struct folio *folio,
+                int diff, mm_id_t mm_id)
+{
+        VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+        VM_WARN_ON_ONCE(diff <= 0);
+        VM_WARN_ON_ONCE(mm_id < MM_ID_MIN || mm_id > MM_ID_MAX);
+
+        /*
+         * Make sure we can detect at least one complete PTE mapping of the
+         * folio in a single MM as "exclusively mapped". This is primarily
+         * a check on 32bit, where we currently reduce the size of the per-MM
+         * mapcount to a short.
+         */
+        VM_WARN_ON_ONCE(diff > folio_large_nr_pages(folio));
+        VM_WARN_ON_ONCE(folio_large_nr_pages(folio) - 1 > MM_ID_MAPCOUNT_MAX);
+
+        VM_WARN_ON_ONCE(folio_mm_id(folio, 0) == MM_ID_DUMMY &&
+                        folio->_mm_id_mapcount[0] != -1);
+        VM_WARN_ON_ONCE(folio_mm_id(folio, 0) != MM_ID_DUMMY &&
+                        folio->_mm_id_mapcount[0] < 0);
+        VM_WARN_ON_ONCE(folio_mm_id(folio, 1) == MM_ID_DUMMY &&
+                        folio->_mm_id_mapcount[1] != -1);
+        VM_WARN_ON_ONCE(folio_mm_id(folio, 1) != MM_ID_DUMMY &&
+                        folio->_mm_id_mapcount[1] < 0);
+        VM_WARN_ON_ONCE(!folio_mapped(folio) &&
+                        folio_test_large_maybe_mapped_shared(folio));
+}
+
+static __always_inline void folio_set_large_mapcount(struct folio *folio,
+                int mapcount, struct vm_area_struct *vma)
+{
+        __folio_large_mapcount_sanity_checks(folio, mapcount, vma->vm_mm->mm_id);
+
+        VM_WARN_ON_ONCE(folio_mm_id(folio, 0) != MM_ID_DUMMY);
+        VM_WARN_ON_ONCE(folio_mm_id(folio, 1) != MM_ID_DUMMY);
+
+        /* Note: mapcounts start at -1. */
+        atomic_set(&folio->_large_mapcount, mapcount - 1);
+        folio->_mm_id_mapcount[0] = mapcount - 1;
+        folio_set_mm_id(folio, 0, vma->vm_mm->mm_id);
+}
+
+static __always_inline void folio_add_large_mapcount(struct folio *folio,
+                int diff, struct vm_area_struct *vma)
+{
+        const mm_id_t mm_id = vma->vm_mm->mm_id;
+        int new_mapcount_val;
+
+        folio_lock_large_mapcount(folio);
+        __folio_large_mapcount_sanity_checks(folio, diff, mm_id);
+
+        new_mapcount_val = atomic_read(&folio->_large_mapcount) + diff;
+        atomic_set(&folio->_large_mapcount, new_mapcount_val);
+
+        /*
+         * If a folio is mapped more than once into an MM on 32bit, we
+         * can in theory overflow the per-MM mapcount (although only for
+         * fairly large folios), turning it negative. In that case, just
+         * free up the slot and mark the folio "mapped shared", otherwise
+         * we might be in trouble when unmapping pages later.
+         */
+        if (folio_mm_id(folio, 0) == mm_id) {
+                folio->_mm_id_mapcount[0] += diff;
+                if (!IS_ENABLED(CONFIG_64BIT) && unlikely(folio->_mm_id_mapcount[0] < 0)) {
+                        folio->_mm_id_mapcount[0] = -1;
+                        folio_set_mm_id(folio, 0, MM_ID_DUMMY);
+                        folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT;
+                }
+        } else if (folio_mm_id(folio, 1) == mm_id) {
+                folio->_mm_id_mapcount[1] += diff;
+                if (!IS_ENABLED(CONFIG_64BIT) && unlikely(folio->_mm_id_mapcount[1] < 0)) {
+                        folio->_mm_id_mapcount[1] = -1;
+                        folio_set_mm_id(folio, 1, MM_ID_DUMMY);
+                        folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT;
+                }
+        } else if (folio_mm_id(folio, 0) == MM_ID_DUMMY) {
+                folio_set_mm_id(folio, 0, mm_id);
+                folio->_mm_id_mapcount[0] = diff - 1;
+                /* We might have other mappings already. */
+                if (new_mapcount_val != diff - 1)
+                        folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT;
+        } else if (folio_mm_id(folio, 1) == MM_ID_DUMMY) {
+                folio_set_mm_id(folio, 1, mm_id);
+                folio->_mm_id_mapcount[1] = diff - 1;
+                /* Slot 0 certainly has mappings as well. */
+                folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT;
+        }
+        folio_unlock_large_mapcount(folio);
+}
+
+static __always_inline void folio_sub_large_mapcount(struct folio *folio,
+                int diff, struct vm_area_struct *vma)
+{
+        const mm_id_t mm_id = vma->vm_mm->mm_id;
+        int new_mapcount_val;
+
+        folio_lock_large_mapcount(folio);
+        __folio_large_mapcount_sanity_checks(folio, diff, mm_id);
+
+        new_mapcount_val = atomic_read(&folio->_large_mapcount) - diff;
+        atomic_set(&folio->_large_mapcount, new_mapcount_val);
+
+        /*
+         * There are valid corner cases where we might underflow a per-MM
+         * mapcount (some mappings added when no slot was free, some mappings
+         * added once a slot was free), so we always set it to -1 once we go
+         * negative.
+         */
+        if (folio_mm_id(folio, 0) == mm_id) {
+                folio->_mm_id_mapcount[0] -= diff;
+                if (folio->_mm_id_mapcount[0] >= 0)
+                        goto out;
+                folio->_mm_id_mapcount[0] = -1;
+                folio_set_mm_id(folio, 0, MM_ID_DUMMY);
+        } else if (folio_mm_id(folio, 1) == mm_id) {
+                folio->_mm_id_mapcount[1] -= diff;
+                if (folio->_mm_id_mapcount[1] >= 0)
+                        goto out;
+                folio->_mm_id_mapcount[1] = -1;
+                folio_set_mm_id(folio, 1, MM_ID_DUMMY);
+        }
+
+        /*
+         * If one MM slot owns all mappings, the folio is mapped exclusively.
+         * Note that if the folio is now unmapped (new_mapcount_val == -1), both
+         * slots must be free (mapcount == -1), and we'll also mark it as
+         * exclusive.
+         */
+        if (folio->_mm_id_mapcount[0] == new_mapcount_val ||
+            folio->_mm_id_mapcount[1] == new_mapcount_val)
+                folio->_mm_ids &= ~FOLIO_MM_IDS_SHARED_BIT;
+out:
+        folio_unlock_large_mapcount(folio);
+}
+#else /* !CONFIG_MM_ID */
+/*
+ * See __folio_rmap_sanity_checks(), we might map large folios even without
+ * CONFIG_TRANSPARENT_HUGEPAGE. We'll keep that working for now.
+ */
 static inline void folio_set_large_mapcount(struct folio *folio, int mapcount,
                 struct vm_area_struct *vma)
 {
@@ -191,6 +355,7 @@ static inline void folio_sub_large_mapcount(struct folio *folio,
 {
         atomic_sub(diff, &folio->_large_mapcount);
 }
+#endif /* CONFIG_MM_ID */
 
 #define folio_inc_large_mapcount(folio, vma) \
         folio_add_large_mapcount(folio, 1, vma)
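The flag updates in folio_add_large_mapcount()/folio_sub_large_mapcount() can be plain (non-atomic) read-modify-writes because the lock bit, the "maybe mapped shared" bit, and both MM ids all live in the single unsigned long protected by the bit spinlock. A sketch of that pattern with a hypothetical helper (the commit open-codes this inside the functions above); the expectation is that the unlock path provides release semantics, so the plain updates become visible no later than the cleared lock bit:

/* Hypothetical helper, illustration only. */
static inline void example_folio_mark_maybe_mapped_shared(struct folio *folio)
{
        bit_spin_lock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
        folio->_mm_ids |= FOLIO_MM_IDS_SHARED_BIT;      /* plain store, serialized by the lock */
        __bit_spin_unlock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
}

Readers such as folio_test_large_maybe_mapped_shared() use a plain test_bit() without taking the lock, which is the lockless-read side of this scheme.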

kernel/fork.c

Lines changed: 36 additions & 0 deletions
@@ -802,6 +802,36 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 #define mm_free_pgd(mm)
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_MM_ID
+static DEFINE_IDA(mm_ida);
+
+static inline int mm_alloc_id(struct mm_struct *mm)
+{
+        int ret;
+
+        ret = ida_alloc_range(&mm_ida, MM_ID_MIN, MM_ID_MAX, GFP_KERNEL);
+        if (ret < 0)
+                return ret;
+        mm->mm_id = ret;
+        return 0;
+}
+
+static inline void mm_free_id(struct mm_struct *mm)
+{
+        const mm_id_t id = mm->mm_id;
+
+        mm->mm_id = MM_ID_DUMMY;
+        if (id == MM_ID_DUMMY)
+                return;
+        if (WARN_ON_ONCE(id < MM_ID_MIN || id > MM_ID_MAX))
+                return;
+        ida_free(&mm_ida, id);
+}
+#else /* !CONFIG_MM_ID */
+static inline int mm_alloc_id(struct mm_struct *mm) { return 0; }
+static inline void mm_free_id(struct mm_struct *mm) {}
+#endif /* CONFIG_MM_ID */
+
 static void check_mm(struct mm_struct *mm)
 {
         int i;
@@ -905,6 +935,7 @@ void __mmdrop(struct mm_struct *mm)
 
         WARN_ON_ONCE(mm == current->active_mm);
         mm_free_pgd(mm);
+        mm_free_id(mm);
         destroy_context(mm);
         mmu_notifier_subscriptions_destroy(mm);
         check_mm(mm);
@@ -1289,6 +1320,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
         if (mm_alloc_pgd(mm))
                 goto fail_nopgd;
 
+        if (mm_alloc_id(mm))
+                goto fail_noid;
+
         if (init_new_context(p, mm))
                 goto fail_nocontext;
 
@@ -1308,6 +1342,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 fail_cid:
         destroy_context(mm);
 fail_nocontext:
+        mm_free_id(mm);
+fail_noid:
         mm_free_pgd(mm);
 fail_nopgd:
         free_mm(mm);
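mm_alloc_id() and mm_free_id() lean on the generic IDA allocator: ida_alloc_range() hands out the lowest free ID within [MM_ID_MIN, MM_ID_MAX] (or a negative errno), and ida_free() makes the ID available for reuse. A trimmed sketch of the same pattern (hypothetical names):

static DEFINE_IDA(example_ida);

static int example_alloc_id(void)
{
        /* the first caller typically receives MM_ID_MIN, i.e. 1 */
        return ida_alloc_range(&example_ida, MM_ID_MIN, MM_ID_MAX, GFP_KERNEL);
}

static void example_free_id(int id)
{
        ida_free(&example_ida, id);     /* id can now be handed out again */
}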

mm/Kconfig

Lines changed: 4 additions & 0 deletions
@@ -815,11 +815,15 @@ config ARCH_WANT_GENERAL_HUGETLB
 config ARCH_WANTS_THP_SWAP
         def_bool n
 
+config MM_ID
+        def_bool n
+
 menuconfig TRANSPARENT_HUGEPAGE
         bool "Transparent Hugepage Support"
         depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
         select COMPACTION
         select XARRAY_MULTI
+        select MM_ID
         help
           Transparent Hugepages allows the kernel to use huge pages and
           huge tlb transparently to the applications whenever possible.

mm/internal.h

Lines changed: 5 additions & 0 deletions
@@ -763,6 +763,11 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
         folio_set_order(folio, order);
         atomic_set(&folio->_large_mapcount, -1);
         atomic_set(&folio->_nr_pages_mapped, 0);
+        if (IS_ENABLED(CONFIG_MM_ID)) {
+                folio->_mm_ids = 0;
+                folio->_mm_id_mapcount[0] = -1;
+                folio->_mm_id_mapcount[1] = -1;
+        }
         if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
                 atomic_set(&folio->_pincount, 0);
                 atomic_set(&folio->_entire_mapcount, -1);
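An observation (not from the commit): because MM_ID_DUMMY is 0, this initialization leaves a freshly prepared large folio in exactly the state the sanity checks in include/linux/rmap.h expect for an unmapped folio:

/*
 * _mm_ids == 0 means both slots hold MM_ID_DUMMY, the lock bit is clear,
 * and the "maybe mapped shared" bit is clear; together with
 * _mm_id_mapcount[0] == _mm_id_mapcount[1] == -1 this satisfies
 * __folio_large_mapcount_sanity_checks() before the first mapping is added.
 */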
