Commit e341f9c

joshuahahn authored and akpm00 committed
mm/mempolicy: Weighted Interleave Auto-tuning
On machines with multiple memory nodes, interleaving page allocations across nodes allows for better utilization of each node's bandwidth. Previous work by Gregory Price [1] introduced weighted interleave, which allows pages to be allocated across nodes according to user-set ratios.

Ideally, these weights should be proportional to each node's bandwidth, so that under bandwidth pressure each node uses its maximal efficient bandwidth and latency is prevented from increasing exponentially.

Previously, weighted interleave's default weights were all 1, which is equivalent to the (unweighted) interleave mempolicy: it goes through the nodes in a round-robin fashion and ignores bandwidth information.

This patch has two main goals:

First, it makes weighted interleave easier to use for users who wish to relieve bandwidth pressure when using nodes with varying bandwidth (CXL). By providing a set of "real" default weights that just work out of the box, users who lack the capability (or the desire) to experiment to find the optimal weights for their system can still take advantage of bandwidth-informed weighted interleave.

Second, it allows weighted interleave to adjust dynamically to hotplugged memory with new bandwidth information. Instead of requiring users to manually update node weights every time bandwidth information is reported or removed, weighted interleave provides a new set of default weights whenever the bandwidth information changes.

To meet these goals, this patch introduces an auto-configuration mode for the interleave weights that provides a reasonable set of default weights, calculated using bandwidth data reported by the system. In auto mode, weights are adjusted dynamically based on the currently reported bandwidth information, including in response to hotplug events.

This patch still supports users manually writing weights into the nodeN sysfs interface by entering manual mode. When a user enters manual mode, the system stops dynamically updating any of the node weights, even during hotplug events that shift the optimal weight distribution.

A new sysfs interface "auto" is introduced, which allows users to switch between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The system also automatically enters manual mode when a nodeN interface is manually written to.

There is one functional change that this patch makes to the existing weighted_interleave ABI: previously, writing 0 directly to a nodeN interface was said to reset the weight to the system default. Before this patch, the default for all weights was 1, which meant that writing 0 and writing 1 were functionally equivalent. With this patch, writing 0 is invalid.

Link: https://lkml.kernel.org/r/20250520141236.2987309-1-joshua.hahnjy@gmail.com
[joshua.hahnjy@gmail.com: wordsmithing changes, simplification, fixes]
Link: https://lkml.kernel.org/r/20250511025840.2410154-1-joshua.hahnjy@gmail.com
[joshua.hahnjy@gmail.com: remove auto_kobj_attr field from struct sysfs_wi_group]
Link: https://lkml.kernel.org/r/20250512142511.3959833-1-joshua.hahnjy@gmail.com
https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/ [1]
Link: https://lkml.kernel.org/r/20250505182328.4148265-1-joshua.hahnjy@gmail.com
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 9e619cd commit e341f9c
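
The commit message says weights should ideally be proportional to each node's bandwidth. The following is a minimal userspace sketch of that idea, not the kernel's algorithm (the file that computes the weights is not shown on this page). The helper name `bw_to_weights`, the scale-to-the-fastest-node approach, and the GCD reduction are all assumptions made for illustration; only the [1, 255] weight range and the fallback weight of 1 for nodes without bandwidth data come from the patch's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WI_MIN_WEIGHT 1		/* minimum weight per the ABI document */
#define WI_MAX_WEIGHT 255	/* maximum weight per the ABI document */

/* Greatest common divisor; gcd_u(0, x) returns x. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: derive interleave weights proportional to the
 * per-node bandwidth in bw[]. Nodes with no bandwidth data (bw == 0)
 * fall back to weight 1. Assumes bw[i] * 255 does not overflow 64 bits.
 */
static void bw_to_weights(const uint64_t *bw, uint8_t *weights, size_t n)
{
	uint64_t max_bw = 0;
	unsigned int g = 0;
	size_t i;

	assert(bw && weights && n > 0);

	for (i = 0; i < n; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < n; i++) {
		uint64_t w = WI_MIN_WEIGHT;

		if (max_bw && bw[i]) {
			/* Scale so that the fastest node gets WI_MAX_WEIGHT. */
			w = bw[i] * WI_MAX_WEIGHT / max_bw;
			if (w < WI_MIN_WEIGHT)
				w = WI_MIN_WEIGHT;
		}
		weights[i] = (uint8_t)w;
		g = gcd_u(g, weights[i]);
	}

	/* Reduce by the common divisor so that ratios stay small. */
	for (i = 0; i < n; i++)
		weights[i] /= g;
}
```

With bandwidths of 300 and 100 (in any consistent unit), this sketch yields weights 3 and 1, so three pages go to the faster node for every page placed on the slower one.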

4 files changed, +311 -63 lines changed

Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

Lines changed: 32 additions & 3 deletions
@@ -20,6 +20,35 @@ Description: Weight configuration interface for nodeN
 		Minimum weight: 1
 		Maximum weight: 255

-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+		Changing the weight to a valid value will automatically
+		switch the system to manual mode as well.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
+Date:		May 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration mode for weighted interleave. 'true' indicates
+		that the system is in auto mode, and a 'false' indicates that
+		the system is in manual mode.
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available during either boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		Note that nodes that are onlined with previously set weights
+		will reuse those weights. If they were not previously set or
+		are onlined with missing bandwidth data, the weights will use
+		a default weight of 1.
+
+		Writing any true value string (e.g. Y or 1) will enable auto
+		mode, while writing any false value string (e.g. N or 0) will
+		enable manual mode. All other strings are ignored and will
+		return -EINVAL.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically switch the system to manual mode.
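
The write semantics documented above can be sketched as a small userspace model. This is an illustrative approximation rather than the kernel's sysfs store handlers: `struct wi_state`, `parse_weight`, `node_store`, and `auto_store` are hypothetical names, and the sketch accepts only the `1`/`Y` and `0`/`N` spellings that the document names, whereas the kernel may accept additional true/false strings.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define WI_NODES 4		/* hypothetical node count for the model */
#define WI_MIN_WEIGHT 1
#define WI_MAX_WEIGHT 255

struct wi_state {		/* hypothetical model of the sysfs state */
	bool auto_mode;
	unsigned char weights[WI_NODES];
};

/* Parse a nodeN write: only integers in [1, 255] are accepted. */
static int parse_weight(const char *buf, unsigned char *out)
{
	char *end;
	long val;

	assert(buf && out);
	if (*buf == '\0')
		return -EINVAL;	/* an empty string no longer resets the weight */

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')	/* tolerate the newline that echo appends */
		end++;
	if (*end != '\0' || val < WI_MIN_WEIGHT || val > WI_MAX_WEIGHT)
		return -EINVAL;

	*out = (unsigned char)val;
	return 0;
}

/* nodeN write: a valid value sets the weight and forces manual mode. */
static int node_store(struct wi_state *st, int nid, const char *buf)
{
	unsigned char w;
	int ret;

	assert(st && nid >= 0 && nid < WI_NODES);
	ret = parse_weight(buf, &w);
	if (ret)
		return ret;	/* invalid writes leave weight and mode untouched */

	st->weights[nid] = w;
	st->auto_mode = false;
	return 0;
}

/* auto write: 1/Y selects auto mode, 0/N selects manual mode. */
static int auto_store(struct wi_state *st, const char *buf)
{
	size_t len;

	assert(st && buf);
	len = strcspn(buf, "\n");	/* ignore a trailing newline */
	if (len != 1)
		return -EINVAL;

	switch (buf[0]) {
	case '1':
	case 'Y':
		st->auto_mode = true;
		return 0;
	case '0':
	case 'N':
		st->auto_mode = false;
		return 0;
	default:
		return -EINVAL;
	}
}
```

In this model, writing `0` or an empty string to a node returns `-EINVAL` and leaves auto mode on, while writing `255` succeeds and switches the state to manual mode, matching the behavior change the commit message describes.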

drivers/base/node.c

Lines changed: 9 additions & 0 deletions
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord)) {
+			pr_info("failed to set mempolicy attrs for node %d\n",
+				nid);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);

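
The hook above forwards only CPU-class access coordinates to mempolicy. A rough userspace mock of how such a hook could interact with auto and manual mode is sketched below. Everything in it is hypothetical: the `mock_` functions, `struct coord`, the `enum access_class` values, the choice to record the smaller of read and write bandwidth, and the simple ratio-to-slowest-node recalculation are assumptions for illustration, not the body of `mempolicy_set_node_perf()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

/* Hypothetical stand-ins for the kernel's access classes and coordinates. */
enum access_class { ACCESS_CPU, ACCESS_OTHER };

struct coord {
	unsigned int read_bw;
	unsigned int write_bw;
};

static unsigned int node_bw[MAX_NODES];
static unsigned char node_weight[MAX_NODES] = {1, 1, 1, 1};
static bool auto_mode = true;

/*
 * Mock of the mempolicy-side hook: always record the bandwidth, but
 * rewrite the weights only while in auto mode. Returns 0 on success.
 */
static int mock_set_node_perf(unsigned int nid, const struct coord *c)
{
	unsigned int i, min_bw = 0;

	if (nid >= MAX_NODES || !c)
		return -1;

	/* Assumption: use the smaller of read and write bandwidth. */
	node_bw[nid] = c->read_bw < c->write_bw ? c->read_bw : c->write_bw;

	if (!auto_mode)
		return 0;	/* manual mode: user-set weights are left alone */

	/* Find the slowest node that has reported any bandwidth. */
	for (i = 0; i < MAX_NODES; i++)
		if (node_bw[i] && (!min_bw || node_bw[i] < min_bw))
			min_bw = node_bw[i];

	/* Weight is the ratio to the slowest node, clamped to [1, 255]. */
	for (i = 0; i < MAX_NODES; i++) {
		unsigned int w = 1;

		if (min_bw && node_bw[i])
			w = node_bw[i] / min_bw;
		node_weight[i] = w > 255 ? 255 : (unsigned char)w;
	}
	return 0;
}

/* Mock of the driver-side caller: only CPU-class updates are forwarded. */
static void mock_node_set_perf_attrs(unsigned int nid, const struct coord *c,
				     enum access_class access)
{
	if (access == ACCESS_CPU && mock_set_node_perf(nid, c))
		fprintf(stderr, "failed to set mempolicy attrs for node %u\n",
			nid);
}
```

In this mock, a bandwidth update that arrives in manual mode is still recorded, so a later switch back to auto mode could recompute weights from current data; that ordering is a design choice of the sketch.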

include/linux/mempolicy.h

Lines changed: 4 additions & 0 deletions
@@ -11,6 +11,7 @@
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)

 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);

+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else

 struct mempolicy {};
