
Commit a211c65

hnaz authored and akpm00 committed
mm: page_alloc: defrag_mode kswapd/kcompactd watermarks
The previous patch added pageblock_order reclaim to kswapd/kcompactd, which helps, but produces only one block at a time. Allocation stalls and THP failure rates are still higher than they could be.

To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the watermarking for kswapd & kcompactd: instead of targeting the high watermark in order-0 pages and checking for one suitable block, simply require that the high watermark is entirely met in pageblocks.

To this end, track the number of free pages within contiguous pageblocks, then change pgdat_balanced() and compact_finished() to check watermarks against this new value.

This further reduces THP latencies and allocation stalls, and improves THP success rates against the previous patch:

                                  DEFRAGMODE-ASYNC      DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean              34300.36 (  +0.00%)     28904.00 ( -15.73%)
Hugealloc Time stddev            36390.42 (  +0.00%)     33464.37 (  -8.04%)
Kbuild Real time                   196.13 (  +0.00%)       196.59 (  +0.23%)
Kbuild User time                  1234.74 (  +0.00%)      1231.67 (  -0.25%)
Kbuild System time                  62.62 (  +0.00%)        59.10 (  -5.54%)
THP fault alloc                  57054.53 (  +0.00%)     63223.67 ( +10.81%)
THP fault fallback               11581.40 (  +0.00%)      5412.47 ( -53.26%)
Direct compact fail                107.80 (  +0.00%)        59.07 ( -44.79%)
Direct compact success               4.53 (  +0.00%)         2.80 ( -31.33%)
Direct compact success rate %        3.20 (  +0.00%)         3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 (  +0.00%)   2267500.33 ( -58.48%)
Compact daemon scanned free    5824897.93 (  +0.00%)   2339773.00 ( -59.83%)
Compact direct scanned migrate   58336.93 (  +0.00%)     47659.93 ( -18.30%)
Compact direct scanned free      32791.87 (  +0.00%)     40729.67 ( +24.21%)
Compact total migrate scanned  5519370.87 (  +0.00%)   2315160.27 ( -58.05%)
Compact total free scanned     5857689.80 (  +0.00%)   2380502.67 ( -59.36%)
Alloc stall                       2424.60 (  +0.00%)       638.87 ( -73.62%)
Pages kswapd scanned           2657018.33 (  +0.00%)   4002186.33 ( +50.63%)
Pages kswapd reclaimed          559583.07 (  +0.00%)    718577.80 ( +28.41%)
Pages direct scanned            722094.07 (  +0.00%)    355172.73 ( -50.81%)
Pages direct reclaimed          107257.80 (  +0.00%)     31162.80 ( -70.95%)
Pages total scanned            3379112.40 (  +0.00%)   4357359.07 ( +28.95%)
Pages total reclaimed           666840.87 (  +0.00%)    749740.60 ( +12.43%)
Swap out                         77238.20 (  +0.00%)    110084.33 ( +42.53%)
Swap in                          11712.80 (  +0.00%)     24457.00 (+108.80%)
File refaults                   143438.80 (  +0.00%)    188226.93 ( +31.22%)

Also of note is that compaction work overall is reduced. The reason for this is that when free pageblocks are more readily available, allocations are also much more likely to get physically placed in LRU order, instead of being forced to scavenge free space here and there. This means that reclaim by itself has better chances of freeing up whole blocks, and the system relies less on compaction.
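As a condensed sketch (not the kernel code itself, which is in the diffs below), the defrag_mode balance test amounts to checking the high watermark against NR_FREE_PAGES_BLOCKS instead of NR_FREE_PAGES, using only helpers that appear in this patch:

/*
 * Sketch: a zone counts as balanced in defrag_mode only if the high
 * watermark is met by pages sitting in fully free pageblocks
 * (NR_FREE_PAGES_BLOCKS), not by order-0 pages scattered across
 * partially used blocks (NR_FREE_PAGES).
 */
static bool zone_balanced_in_blocks(struct zone *zone, int order,
				    int highest_zoneidx)
{
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);

	return __zone_watermark_ok(zone, order, high_wmark_pages(zone),
				   highest_zoneidx, 0, free);
}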
Comparing all changes to the vanilla kernel:

                                  VANILLA               DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean              52739.45 (  +0.00%)     28904.00 ( -45.19%)
Hugealloc Time stddev            56541.26 (  +0.00%)     33464.37 ( -40.81%)
Kbuild Real time                   197.47 (  +0.00%)       196.59 (  -0.44%)
Kbuild User time                  1240.49 (  +0.00%)      1231.67 (  -0.71%)
Kbuild System time                  70.08 (  +0.00%)        59.10 ( -15.45%)
THP fault alloc                  46727.07 (  +0.00%)     63223.67 ( +35.30%)
THP fault fallback               21910.60 (  +0.00%)      5412.47 ( -75.29%)
Direct compact fail                195.80 (  +0.00%)        59.07 ( -69.48%)
Direct compact success               7.93 (  +0.00%)         2.80 ( -57.46%)
Direct compact success rate %        3.51 (  +0.00%)         3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 (  +0.00%)   2267500.33 ( -32.71%)
Compact daemon scanned free    5075474.47 (  +0.00%)   2339773.00 ( -53.90%)
Compact direct scanned migrate  161787.27 (  +0.00%)     47659.93 ( -70.54%)
Compact direct scanned free     163467.53 (  +0.00%)     40729.67 ( -75.08%)
Compact total migrate scanned  3531388.53 (  +0.00%)   2315160.27 ( -34.44%)
Compact total free scanned     5238942.00 (  +0.00%)   2380502.67 ( -54.56%)
Alloc stall                       2371.07 (  +0.00%)       638.87 ( -73.02%)
Pages kswapd scanned           2160926.73 (  +0.00%)   4002186.33 ( +85.21%)
Pages kswapd reclaimed          533191.07 (  +0.00%)    718577.80 ( +34.77%)
Pages direct scanned            400450.33 (  +0.00%)    355172.73 ( -11.31%)
Pages direct reclaimed           94441.73 (  +0.00%)     31162.80 ( -67.00%)
Pages total scanned            2561377.07 (  +0.00%)   4357359.07 ( +70.12%)
Pages total reclaimed           627632.80 (  +0.00%)    749740.60 ( +19.46%)
Swap out                         47959.53 (  +0.00%)    110084.33 (+129.53%)
Swap in                           7276.00 (  +0.00%)     24457.00 (+236.10%)
File refaults                   138043.00 (  +0.00%)    188226.93 ( +36.35%)

THP allocation latencies and %sys time are down dramatically. THP allocation failures are down from nearly 50% to 8.5%. And to recall previous data points, the success rates are steady and reliable without the cumulative deterioration of fragmentation events.

Compaction work is down overall. Direct compaction work especially is drastically reduced. As an aside, its success rate of 4% indicates there is room for improvement. For now it's good to rely on it less.

Reclaim work is up overall; however, direct reclaim work is down. Part of the increase can be attributed to a higher use of THPs, which due to internal fragmentation increase the memory footprint. This is not necessarily an unexpected side-effect for users of THP.

However, taking both points together, there may well be some opportunities for fine tuning in the reclaim/compaction coordination.

[hannes@cmpxchg.org: fix squawks from rebasing]
Link: https://lkml.kernel.org/r/20250314210558.GD1316033@cmpxchg.org
Link: https://lkml.kernel.org/r/20250313210647.1314586-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 101f9d6

File tree

6 files changed: +73 −15 lines changed


include/linux/mmzone.h

Lines changed: 1 addition & 0 deletions

@@ -138,6 +138,7 @@ enum numa_stat_item {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
+	NR_FREE_PAGES_BLOCKS,
 	NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
 	NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
 	NR_ZONE_ACTIVE_ANON,

mm/compaction.c

Lines changed: 33 additions & 8 deletions

@@ -2328,6 +2328,22 @@ static enum compact_result __compact_finished(struct compact_control *cc)
 	if (!pageblock_aligned(cc->migrate_pfn))
 		return COMPACT_CONTINUE;
 
+	/*
+	 * When defrag_mode is enabled, make kcompactd target
+	 * watermarks in whole pageblocks. Because they can be stolen
+	 * without polluting, no further fallback checks are needed.
+	 */
+	if (defrag_mode && !cc->direct_compaction) {
+		if (__zone_watermark_ok(cc->zone, cc->order,
+					high_wmark_pages(cc->zone),
+					cc->highest_zoneidx, cc->alloc_flags,
+					zone_page_state(cc->zone,
+							NR_FREE_PAGES_BLOCKS)))
+			return COMPACT_SUCCESS;
+
+		return COMPACT_CONTINUE;
+	}
+
 	/* Direct compactor: Is a suitable page free? */
 	ret = COMPACT_NO_SUITABLE_PAGE;
 	for (order = cc->order; order < NR_PAGE_ORDERS; order++) {

@@ -2495,13 +2511,19 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 static enum compact_result
 compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 				 int highest_zoneidx, unsigned int alloc_flags,
-				 bool async)
+				 bool async, bool kcompactd)
 {
+	unsigned long free_pages;
 	unsigned long watermark;
 
+	if (kcompactd && defrag_mode)
+		free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+	else
+		free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
 	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
-			      alloc_flags))
+	if (__zone_watermark_ok(zone, order, watermark, highest_zoneidx,
+				alloc_flags, free_pages))
 		return COMPACT_SUCCESS;
 
 	/*

@@ -2557,7 +2579,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
 						       cc->highest_zoneidx,
 						       cc->alloc_flags,
-						       cc->mode == MIGRATE_ASYNC);
+						       cc->mode == MIGRATE_ASYNC,
+						       !cc->direct_compaction);
 		if (ret != COMPACT_CONTINUE)
 			return ret;
 	}

@@ -3051,6 +3074,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	struct zone *zone;
 	enum zone_type highest_zoneidx = pgdat->kcompactd_highest_zoneidx;
 	enum compact_result ret;
+	unsigned int alloc_flags = defrag_mode ?
+		ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN;
 
 	for (zoneid = 0; zoneid <= highest_zoneidx; zoneid++) {
 		zone = &pgdat->node_zones[zoneid];

@@ -3060,8 +3085,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
-				highest_zoneidx, ALLOC_WMARK_MIN,
-				false);
+				highest_zoneidx, alloc_flags,
+				false, true);
 		if (ret == COMPACT_CONTINUE)
 			return true;
 	}

@@ -3084,7 +3109,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		.mode = MIGRATE_SYNC_LIGHT,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
-		.alloc_flags = ALLOC_WMARK_MIN,
+		.alloc_flags = defrag_mode ? ALLOC_WMARK_HIGH : ALLOC_WMARK_MIN,
 	};
 	enum compact_result ret;

@@ -3104,7 +3129,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 					cc.order, zoneid, cc.alloc_flags,
-					false);
+					false, true);
 		if (ret != COMPACT_CONTINUE)
 			continue;

mm/internal.h

Lines changed: 1 addition & 0 deletions

@@ -536,6 +536,7 @@ extern char * const zone_names[MAX_NR_ZONES];
 DECLARE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);
 
 extern int min_free_kbytes;
+extern int defrag_mode;
 
 void setup_per_zone_wmarks(void);
 void calculate_min_free_kbytes(void);

mm/page_alloc.c

Lines changed: 23 additions & 6 deletions

@@ -273,7 +273,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
-static int defrag_mode;
+int defrag_mode;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;

@@ -660,16 +660,20 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 					bool tail)
 {
 	struct free_area *area = &zone->free_area[order];
+	int nr_pages = 1 << order;
 
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, 1 << order);
+		     get_pageblock_migratetype(page), migratetype, nr_pages);
 
 	if (tail)
 		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
 }
 
 /*

@@ -681,24 +685,34 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 				     unsigned int order, int old_mt, int new_mt)
 {
 	struct free_area *area = &zone->free_area[order];
+	int nr_pages = 1 << order;
 
 	/* Free page moving can fail, so it happens before the type update */
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), old_mt, 1 << order);
+		     get_pageblock_migratetype(page), old_mt, nr_pages);
 
 	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
 
-	account_freepages(zone, -(1 << order), old_mt);
-	account_freepages(zone, 1 << order, new_mt);
+	account_freepages(zone, -nr_pages, old_mt);
+	account_freepages(zone, nr_pages, new_mt);
+
+	if (order >= pageblock_order &&
+	    is_migrate_isolate(old_mt) != is_migrate_isolate(new_mt)) {
+		if (!is_migrate_isolate(old_mt))
+			nr_pages = -nr_pages;
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
+	}
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 					     unsigned int order, int migratetype)
 {
+	int nr_pages = 1 << order;
+
 	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
 		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
-		     get_pageblock_migratetype(page), migratetype, 1 << order);
+		     get_pageblock_migratetype(page), migratetype, nr_pages);
 
 	/* clear reported state and update reported page count */
 	if (page_reported(page))

@@ -708,6 +722,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
+		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
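To make the NR_FREE_PAGES_BLOCKS bookkeeping above concrete, here is a small standalone userspace demo (an illustration of the delta rules, not kernel code), assuming pageblock_order is 9, the common x86-64 value for 2MB blocks with 4K pages:

#include <stdbool.h>
#include <stdio.h>

#define PAGEBLOCK_ORDER 9	/* assumption: 2MB blocks, 4K pages */

/*
 * Mirrors the patch's rule: only non-isolated free buddies of at least
 * pageblock_order contribute to NR_FREE_PAGES_BLOCKS; smaller buddies
 * never count, even when they happen to be physically contiguous.
 */
static long block_pages_delta(unsigned int order, bool isolated, int sign)
{
	if (order < PAGEBLOCK_ORDER || isolated)
		return 0;
	return (long)sign * (1L << order);
}

int main(void)
{
	printf("add order-9 buddy:  %+ld pages\n", block_pages_delta(9, false, +1));
	printf("add order-3 buddy:  %+ld pages\n", block_pages_delta(3, false, +1));
	printf("del order-10 buddy: %+ld pages\n", block_pages_delta(10, false, -1));
	printf("add isolated block: %+ld pages\n", block_pages_delta(9, true, +1));
	return 0;
}

Under these assumptions, freeing a whole 2MB block adds 512 pages to the counter, an order-10 buddy spans two blocks and adds 1024, and isolated or sub-block buddies add nothing, which is why the counter tracks memory usable by ALLOC_NOFRAGMENT allocations.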

mm/vmscan.c

Lines changed: 14 additions & 1 deletion

@@ -6724,11 +6724,24 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 	 * meet watermarks.
 	 */
 	for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
+		unsigned long free_pages;
+
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
 			mark = promo_wmark_pages(zone);
 		else
 			mark = high_wmark_pages(zone);
-		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
+
+		/*
+		 * In defrag_mode, watermarks must be met in whole
+		 * blocks to avoid polluting allocator fallbacks.
+		 */
+		if (defrag_mode)
+			free_pages = zone_page_state(zone, NR_FREE_PAGES_BLOCKS);
+		else
+			free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+		if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
+					0, free_pages))
 			return true;
 	}

mm/vmstat.c

Lines changed: 1 addition & 0 deletions

@@ -1190,6 +1190,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item counters */
 	"nr_free_pages",
+	"nr_free_pages_blocks",
 	"nr_zone_inactive_anon",
 	"nr_zone_active_anon",
 	"nr_zone_inactive_file",
