
[GC] GetLogicalProcessorCacheSizeFromOS should fallback per cache level #83964

foriequal0 opened this issue Mar 27, 2023 · 4 comments

foriequal0 commented Mar 27, 2023

Description

static size_t GetLogicalProcessorCacheSizeFromOS()

Some Linux environments misreport sysconf(_SC_LEVEL3_CACHE_SIZE) as 0.
Other cache sizes returned by sysconf also differ from /sys/devices/system/cpu/cpu0/cache/index*/size.
This can make gen0_min_size smaller than it should be (e.g. 2 MB instead of 25 MB).

Reproduction Steps

We observed this in the following environment.

Cloud provider: Tencent Cloud.
Kubernetes: TKE 1.20.6-tke.27, containerd 1.4.3
Node: S3.8XLARGE64, Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz (cache sizes: L1 32K, L2 4096K, L3 36608K)
OS/Kernel: tlinux2.6(tkernel4)x86_64 / 5.4.119-1-tlinux4-0010.1
Pod image: aspnet:7.0.4

$ getconf -a | grep CACHE
LEVEL1_ICACHE_SIZE                 32768
LEVEL1_ICACHE_ASSOC                8
LEVEL1_ICACHE_LINESIZE             64
LEVEL1_DCACHE_SIZE                 32768
LEVEL1_DCACHE_ASSOC                8
LEVEL1_DCACHE_LINESIZE             64
LEVEL2_CACHE_SIZE                  2097152
LEVEL2_CACHE_ASSOC                 8
LEVEL2_CACHE_LINESIZE              64
LEVEL3_CACHE_SIZE                  0
LEVEL3_CACHE_ASSOC                 0
LEVEL3_CACHE_LINESIZE              0
LEVEL4_CACHE_SIZE                  0
LEVEL4_CACHE_ASSOC                 0
LEVEL4_CACHE_LINESIZE              0

getconf uses sysconf.

$ cat /sys/devices/system/cpu/cpu0/cache/index*/size
32K
32K
4096K
36608K

The L2/L3 sizes differ from what getconf reports.
I'm not sure which one is correct.

$ cat /proc/cpuinfo | grep 'cache size' | head -n1
cache size      : 36608 KB

Expected behavior

If sysconf(_SC_LEVEL3_CACHE_SIZE) returns 0, the value should be ignored and the runtime should fall back to another source for that cache level (e.g. /sys/devices/system/cpu/cpu0/cache/index*/size).
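
For illustration, here is a minimal C sketch of the kind of per-level fallback I have in mind (the helper name ReadSysfsCacheSize and the fixed sysconf-to-sysfs index mapping are assumptions for this sketch, not the actual coreclr code): for each level, prefer sysconf, and if it reports 0, read the corresponding /sys/devices/system/cpu/cpu0/cache/index<N>/size instead, then take the largest value found.

#include <stdio.h>
#include <stddef.h>
#include <unistd.h>

/* Parse values like "36608K" or "4M" from the sysfs "size" file. */
static size_t ReadSysfsCacheSize(int index)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cache/index%d/size", index);

    FILE* f = fopen(path, "r");
    if (f == NULL)
        return 0;

    unsigned long value = 0;
    char suffix = '\0';
    if (fscanf(f, "%lu%c", &value, &suffix) >= 1)
    {
        if (suffix == 'K') value *= 1024UL;
        else if (suffix == 'M') value *= 1024UL * 1024UL;
    }
    fclose(f);
    return (size_t)value;
}

static size_t GetLogicalProcessorCacheSizeFromOS(void)
{
    /* sysconf name per level, paired with the sysfs index that usually
       corresponds to it on x86 (index0/1 = L1d/L1i, index2 = L2, index3 = L3).
       A robust version would read index<N>/level and index<N>/type instead
       of assuming this mapping. */
    struct { int sc_name; int sysfs_index; } levels[] = {
        { _SC_LEVEL1_DCACHE_SIZE, 0 },
        { _SC_LEVEL2_CACHE_SIZE,  2 },
        { _SC_LEVEL3_CACHE_SIZE,  3 },
    };

    size_t best = 0;
    for (size_t i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
    {
        long fromSysconf = sysconf(levels[i].sc_name);
        size_t size = (fromSysconf > 0) ? (size_t)fromSysconf
                                        : ReadSysfsCacheSize(levels[i].sysfs_index);
        if (size > best)
            best = size;
    }
    return best;
}

On the node above, this sketch would pick up the 36608K L3 size from sysfs instead of settling for the 2 MB L2 value that sysconf reports.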

Actual behavior

GetLogicalProcessorCacheSizeFromOS ends up using the smaller L2 cache size rather than the L3 cache size.

Regression?

No response

Known Workarounds

We're considering an LD_PRELOAD trick to temporarily override sysconf for this peculiar environment.
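
A rough sketch of the kind of shim we have in mind, as a single C file (hypothetical; the hard-coded size is just the value sysfs reports on this particular node):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

/* Intercept sysconf, patch only the bogus L3 size, and delegate
   everything else to the real libc implementation. */
long sysconf(int name)
{
    static long (*real_sysconf)(int) = NULL;
    if (real_sysconf == NULL)
        real_sysconf = (long (*)(int))dlsym(RTLD_NEXT, "sysconf");

    if (name == _SC_LEVEL3_CACHE_SIZE)
        return 36608L * 1024L;   /* the L3 size reported by sysfs on this node */

    return real_sysconf(name);
}

Something like gcc -shared -fPIC -o libfixl3.so fixl3.c -ldl would build it, and it would be injected via LD_PRELOAD=/path/to/libfixl3.so in the pod spec.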

Configuration

No response

Other information

I haven't found any other OS that misreports sysconf(_SC_LEVEL3_CACHE_SIZE).

I discovered this while investigating an OutOfMemoryException issue in our service.
I want to share some context while we work on reproducing the OOM in a controlled environment.

Our service runs on ASP.NET servers and handles < 350 req/s (allocation rate: 180 MB/s) during the day and < 100 req/s (allocation rate: 35 MB/s) at night.
We also have a daily batch job. The job is simple: it fetches data from DBs (1,000,000 entries of small records), performs calculations, then saves the results back to DBs. It's single-threaded, allocates around 250 MB/s, and takes around 1 minute.
So we pick a random server and have it run the job during the minimum-load period. (It might be a bad idea to mix very different types of load on one server.)

The node has 64 GB of memory. We set a 32 GB memory limit on the pods, which has never been reached. The committed size is < 700 MB during peak time and never exceeds 3.5 GB during the batch job. We missed the working set size in our metrics, but the node's available memory never dropped below 59 GB. The virtual memory size is large, but we don't set a limit on it. Huge pages are not enabled.

However, we got OOMs 7 times during the last month. They occurred on .NET 7.0.2 and 7.0.3, but haven't recurred on 7.0.4 (it's been about a week).

We're attempting to reproduce the issue in a controlled environment, but we haven't been able to do so consistently.
We've generated load cycles with the following configurations:

  • .NET=7.0.3, L3=36608K
  • .NET=7.0.3, L3=0K
  • .NET=7.0.3, GCName=libclrgc.so, L3=36608K
  • .NET=7.0.3, GCName=libclrgc.so, L3=0K
  • .NET=7.0.4, L3=36608K
  • .NET=7.0.4, L3=0K

The tests ran on AWS (c5.9xlarge, Ubuntu 18.04, Linux 5.4 kernel, Docker 20.10.21). All L3 suppression was done with the LD_PRELOAD trick.
The load generator was tuned so the GC metrics behaved similarly to production's under the .NET=7.0.3, L3=0K configuration, and the same load was then applied to all configurations. Metrics were collected with Prometheus every 15 s.

Here's what we've found so far:

  • Only for L3=0K (except with GCName=libclrgc.so) does gen0 remain exactly 0 bytes for a long time, even during the simulated peak load. It is at least a few KB for the other configurations.
  • Only for .NET 7.0.3 with L3=0K does gen2 become very noisy after exposure to the high load; at that point, gen2 fragmentation is exactly 0 bytes. Gen2 remains stable for the other configurations throughout the load cycles.
  • For L3=0K, the GC rate is 3-5 times higher for gen0, 2-3 times higher for gen1, and 3 times higher for gen2, regardless of the .NET version and GCName.
@ghost added the untriaged label Mar 27, 2023

ghost commented Mar 27, 2023

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


EgorBo (Member) commented Mar 27, 2023

We have protection for when the OS gives us an invalid cache size, but it's only done for Arm64, where we noticed it. Presumably it should be extended to x64 as well.

#71029
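
For context, a rough sketch of that kind of guard (the numbers and the helper name are hypothetical; the actual Arm64 heuristic behind #71029 may differ): if the OS reports no usable cache size, derive an estimate from the logical core count instead of letting the GC budgets collapse to a tiny value.

#include <stddef.h>
#include <unistd.h>

static size_t ApplyCacheSizeGuard(size_t reportedCacheSize)
{
    if (reportedCacheSize != 0)
        return reportedCacheSize;

    /* The OS reported nothing usable; estimate from the core count. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1)
        cores = 1;

    /* Assume roughly 1.5 MB of last-level cache per logical core
       (an illustrative figure, not the value the runtime uses). */
    return (size_t)cores * 3 * 512 * 1024;
}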

@mangod9 removed the untriaged label Apr 3, 2023
@mangod9 added this to the 8.0.0 milestone Apr 3, 2023
mangod9 (Member) commented Apr 3, 2023

@janvorli ?

mrsharm (Member) commented Jun 22, 2023

@janvorli - any updates? We saw that this issue was recently updated during our GC sync.

@mangod9 modified the milestone: 8.0.0 → 9.0.0 Aug 3, 2023