Support hardware with more than 1024 CPUs #126763

janvorli wants to merge 1 commit into dotnet:main
Conversation
A customer has reported that the .NET runtime fails to initialize on machines that have more than 1024 CPUs: `sched_getaffinity` was being passed the default instance of `cpu_set_t`, which supports at most 1024 CPUs and fails when the current machine has more. This change switches the `sched_getaffinity` calls to a dynamically allocated CPU set data structure so that any number of CPUs can be supported.

In the GC code, we keep the limit of at most 1024 heaps, but the CPU limit is now dynamic. The arrays `proc_no_to_heap_no` and `numa_node_to_heap_map` are now dynamically allocated based on the real number of processors configured on the system. The `AffinitySet` was also modified so it can hold affinities for a dynamic number of CPUs.

Several other arrays were originally sized by `MAX_SUPPORTED_CPUS`, but that was misleading, as they are really indexed by heap number. I've renamed the constant to `MAX_SUPPORTED_HEAPS` to make it clear that the number of supported CPUs is not limited.
Tagging subscribers to this area: @agocke, @dotnet/gc
Pull request overview
This PR updates CoreCLR (PAL, GC, and NativeAOT PAL) to correctly handle Linux machines with >1024 CPUs by avoiding fixed-size cpu_set_t usage and by making GC affinity-related data structures CPU-count-aware while keeping the GC heap limit at 1024.
Changes:
- Use dynamically-sized CPU affinity sets (`CPU_ALLOC`/`CPU_ALLOC_SIZE`) for `sched_getaffinity`/`sched_setaffinity` to support >1024 CPUs.
- Introduce `GCToOSInterface::GetMaxProcessorCount()` and make `AffinitySet` dynamically sized (plus rename `MAX_SUPPORTED_CPUS` → `MAX_SUPPORTED_HEAPS` for clarity).
- Allocate GC mapping tables based on actual processor capacity (e.g., `proc_no_to_heap_no`, `numa_node_to_heap_map`) while retaining the 1024-heap limit.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/coreclr/pal/src/thread/thread.cpp | Switch thread-start affinity reset to dynamically-sized cpu_set allocation. |
| src/coreclr/pal/src/misc/sysinfo.cpp | Make logical CPU count retrieval use dynamic cpu_set; add clamping for total CPU count. |
| src/coreclr/nativeaot/Runtime/unix/PalUnix.cpp | Update NativeAOT processor count initialization to use dynamic cpu_set. |
| src/coreclr/gc/windows/gcenv.windows.cpp | Initialize process affinity set dynamically; loop bounds updated to avoid 1024 CPU assumption. |
| src/coreclr/gc/unix/gcenv.unix.cpp | Initialize process affinity set based on configured CPU count; use dynamic cpu_set for affinity enumeration. |
| src/coreclr/gc/env/gcenv.os.h | Rename MAX_SUPPORTED_CPUS→MAX_SUPPORTED_HEAPS; make AffinitySet dynamically allocated; add GetMaxProcessorCount() API. |
| src/coreclr/gc/interface.cpp | Initialize config affinity set with max processor count; handle init failure for NUMA heap mapping. |
| src/coreclr/gc/gcconfig.cpp | Validate HeapAffinitizeRanges against dynamic max CPU count. |
| src/coreclr/gc/gc.cpp | Allocate CPU→heap and NUMA→heap maps dynamically; remove modulo mapping for processor→heap. |
| src/coreclr/gc/*.cpp / gcpriv.h | Replace array sizes indexed by heaps to use MAX_SUPPORTED_HEAPS and adjust signatures accordingly. |
```diff
  // Locate heap_number-th available processor
  uint16_t procIndex = 0;
  size_t cnt = heap_number;
- for (uint16_t i = 0; i < MAX_SUPPORTED_CPUS; i++)
+ for (uint16_t i = 0; i < g_totalCpuCount; i++)
  {
      if (g_processAffinitySet.Contains(i))
```
In GetProcessorForHeap, the loop counter is uint16_t but the upper bound is g_totalCpuCount (uint32_t). On systems with 65,536 processors, i will wrap and this loop becomes infinite. Use a wider type (e.g., uint32_t/size_t) for the iterator and only cast to uint16_t when storing the selected processor index.
| } | ||
|
|
||
| nrcpus = CPU_COUNT_S(CPU_ALLOC_SIZE(configuredCpuCount), pCpuSet); |
If sched_getaffinity fails, nrcpus is computed from the zeroed mask and may become 0, which can propagate into GetSystemInfo().dwNumberOfProcessors. Consider falling back to PAL_GetTotalCpuCount() (or at least clamping to 1) when st != 0 to avoid returning an invalid CPU count.
```diff
-     }
-     nrcpus = CPU_COUNT_S(CPU_ALLOC_SIZE(configuredCpuCount), pCpuSet);
+         nrcpus = PAL_GetTotalCpuCount();
+     }
+     else
+     {
+         nrcpus = CPU_COUNT_S(cpuSetSize, pCpuSet);
+         if (nrcpus < 1)
+         {
+             ASSERT("CPU_COUNT_S returned an invalid CPU count\n");
+             nrcpus = PAL_GetTotalCpuCount();
+         }
+     }
```
| } | ||
|
|
||
| count = (uint32_t)CPU_COUNT_S(CPU_ALLOC_SIZE(configuredCpuCount), pCpuSet); |
When sched_getaffinity fails, count is still computed from the (empty) cpu set and can become 0, which conflicts with the later _ASSERTE(count > 0) and can result in g_RhNumberOfProcessors being set to 0 in release builds. Consider falling back to GCToOSInterface::GetTotalProcessorCount() or clamping to 1 when st != 0.
```diff
-     }
-     count = (uint32_t)CPU_COUNT_S(CPU_ALLOC_SIZE(configuredCpuCount), pCpuSet);
+         count = GCToOSInterface::GetTotalProcessorCount();
+     }
+     else
+     {
+         count = (uint32_t)CPU_COUNT_S(cpuSetSize, pCpuSet);
+     }
```
```cpp
  int configuredCpuCount = sysconf(_SC_NPROCESSORS_CONF);
  if (configuredCpuCount == -1)
  {
      return false;
  }

  g_totalCpuCount = cpuCount;

  if (!g_processAffinitySet.Initialize(configuredCpuCount))
  {
      return false;
  }
```
GC init now hard-fails if sysconf(_SC_NPROCESSORS_CONF) returns -1. That introduces a new runtime-initialization failure mode; other call sites (e.g., PAL sysinfo/thread affinity) fall back to CPU_SETSIZE instead. Consider using a similar fallback here (and also validate configuredCpuCount >= 1) so GC initialization remains robust if sysconf returns an unexpected value.
```cpp
  #define __GCENV_OS_H__

  #include <new>
  using std::nothrow;
```
using std::nothrow; in a widely-included header injects a name into the global namespace for all includers and can cause subtle conflicts. Prefer removing the using-directive from the header and using new (std::nothrow) at the allocation sites instead.
```diff
- using std::nothrow;
```
```diff
  class AffinitySet
  {
      static const size_t BitsPerBitsetEntry = 8 * sizeof(uintptr_t);

-     uintptr_t m_bitset[MAX_SUPPORTED_CPUS / BitsPerBitsetEntry];
+     uintptr_t *m_bitset = nullptr;
+     size_t m_bitsetDataSize = 0;

      static uintptr_t GetBitsetEntryMask(size_t cpuIndex)
      {
          return (uintptr_t)1 << (cpuIndex & (BitsPerBitsetEntry - 1));
      }

      static size_t GetBitsetEntryIndex(size_t cpuIndex)
      {
          return cpuIndex / BitsPerBitsetEntry;
      }

  public:

-     static const size_t BitsetDataSize = MAX_SUPPORTED_CPUS / BitsPerBitsetEntry;
+     bool Initialize(int cpuCount)
+     {
+         m_bitsetDataSize = (cpuCount + BitsPerBitsetEntry - 1) / BitsPerBitsetEntry;
+         m_bitset = new (nothrow) uintptr_t[m_bitsetDataSize];
+         if (m_bitset == nullptr)
+         {
+             return false;
+         }
+
+         memset(m_bitset, 0, sizeof(uintptr_t) * m_bitsetDataSize);
+         return true;
+     }

-     AffinitySet()
+     ~AffinitySet()
      {
-         memset(m_bitset, 0, sizeof(m_bitset));
+         delete[] m_bitset;
      }
```
AffinitySet now owns a heap-allocated buffer and has a destructor, but it still has the compiler-generated copy/move operations. Any accidental copy (e.g., pass-by-value or assignment) would double-free m_bitset. Consider explicitly deleting copy/move ctor/assignment (or implementing deep copy / using a smart pointer) and also guarding Initialize against being called twice (to avoid leaking the previous buffer).
```diff
  {
      ASSERT("sched_getaffinity failed!\n");
      // The sched_getaffinity should never fail for getting affinity of the current process
      palError = ERROR_INTERNAL_ERROR;
      goto fail;
  }
+ int configuredCpuCount = sysconf(_SC_NPROCESSORS_CONF);
+ if (configuredCpuCount == -1)
+ {
+     // In the unlikely event that sysconf(_SC_NPROCESSORS_CONF) fails, just assume a reasonable default maximum number of CPUs to avoid failing thread creation.
+     configuredCpuCount = CPU_SETSIZE;
+ }

- st = sched_setaffinity(0, sizeof(cpu_set_t), &cpuSet);
- if (st != 0)
- {
-     if (errno == EPERM || errno == EACCES)
-     {
-         // Some sandboxed or restricted environments (snap strict confinement,
-         // vendor-modified Android kernels with strict SELinux policy) block
-         // sched_setaffinity even when passed a mask extracted via sched_getaffinity.
-         // Treat this as non-fatal — the thread will continue running on any
-         // available CPU rather than the originally affinitized one.
-         WARN("sched_setaffinity failed with EPERM/EACCES, ignoring\n");
-     }
-     else
+ cpu_set_t* pCpuSet = CPU_ALLOC(configuredCpuCount);
+ if (pCpuSet == nullptr)
+ {
+     ASSERT("CPU_ALLOC failed!\n");
+     palError = ERROR_OUTOFMEMORY;
+     goto fail;
+ }

+ size_t cpuSetSize = CPU_ALLOC_SIZE(configuredCpuCount);
+ CPU_ZERO_S(cpuSetSize, pCpuSet);
```
This change introduces a heap allocation (CPU_ALLOC/CPU_FREE) on every thread start to reset affinity. If thread creation is performance-sensitive in some workloads, consider caching the required cpu_set_t size (and possibly reusing a buffer) to avoid repeated malloc/free on the hot path, while still supporting >1024 CPUs.
```diff
-         ASSERT("sched_setaffinity failed!\n");
+         ASSERT("sched_getaffinity failed!\n");
+         CPU_FREE(pCpuSet);
          // The sched_getaffinity should never fail for getting affinity of the current process
```
We gracefully ignore sched_getaffinity failures in the GC in release builds. Should we do the same here?
Close #126747