[GC] GetLogicalProcessorCacheSizeFromOS should fallback per cache level #83964
Comments
Tagging subscribers to this area: @dotnet/gc
We have a protection for when the OS gives us an invalid cache size, but it's only done for Arm64, where we noticed it; presumably it should be extended to x64 as well.
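Not the actual coreclr code, but roughly the kind of per-level guard this suggests (the helper names and the simplified index-to-level mapping below are hypothetical): use the sysconf value when it is positive, otherwise fall back to the size reported by sysfs for the same level.

```cpp
// Hypothetical sketch of a per-cache-level fallback; not the real gcenv.unix.cpp code.
#include <unistd.h>
#include <cstdio>

// Read e.g. /sys/devices/system/cpu/cpu0/cache/index3/size ("36864K") as bytes.
// Real sysfs entries expose a separate "level" file; the index mapping here is simplified.
static long ReadSysfsCacheSizeBytes(int index)
{
    char path[128];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu0/cache/index%d/size", index);
    FILE* f = fopen(path, "r");
    if (f == nullptr)
        return 0;

    long value = 0;
    char unit = '\0';
    if (fscanf(f, "%ld%c", &value, &unit) >= 1)
    {
        if (unit == 'K') value *= 1024;
        else if (unit == 'M') value *= 1024 * 1024;
    }
    fclose(f);
    return value;
}

// Fall back per cache level: trust sysconf only when it reports a positive size.
static long GetCacheLevelSize(int sysconfName, int sysfsIndex)
{
    long size = sysconf(sysconfName);
    if (size > 0)
        return size;
    return ReadSysfsCacheSizeBytes(sysfsIndex);
}
```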
@janvorli - any updates? We saw that this issue was recently updated during our GC sync.
Description
runtime/src/coreclr/gc/unix/gcenv.unix.cpp, line 787 (commit 8d5f520)
Some Linux environments misreport sysconf(_SC_LEVEL3_CACHE_SIZE) as 0. Other cache sizes from sysconf also differ from /sys/devices/system/cpu/cpu0/cache/index*/size. This might make gen0_min_size smaller than it should be (e.g. 2M instead of 25M).
Reproduction Steps
We checked this in the following environment.
Cloud provider: Tencent Cloud.
Kubernetes: TKE 1.20.6-tke.27, containerd 1.4.3
Node: S3.8XLARGE64, Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
OS/Kernel: tlinux2.6(tkernel4)x86_64 / 5.4.119-1-tlinux4-0010.1
Pod image: aspnet:7.0.4
getconf uses sysconf.
The L2/L3 sizes in sysfs differ from what getconf reports.
I'm not sure which one is correct.
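For anyone who wants to compare the two sources on a given box, here is a small stand-alone check (nothing project-specific; _SC_LEVEL2_CACHE_SIZE and _SC_LEVEL3_CACHE_SIZE are glibc extensions):

```cpp
// Compare glibc sysconf cache sizes with the sysfs values for cpu0.
#include <unistd.h>
#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    printf("sysconf L2: %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("sysconf L3: %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));

    // sysfs reports one entry per cache (index0..indexN), each with its own level file.
    for (int i = 0; i < 8; i++)
    {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i) + "/";
        std::ifstream level(base + "level"), size(base + "size");
        std::string lvl, sz;
        if (level >> lvl && size >> sz)
            printf("sysfs index%d: level %s, size %s\n", i, lvl.c_str(), sz.c_str());
    }
    return 0;
}
```

On the machines described above, the expectation is that the sysconf L3 line prints 0 while the sysfs index for level 3 still shows a real size.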
Expected behavior
It should ignore sysconf(_SC_LEVEL3_CACHE_SIZE) when it returns 0 and fall back to the next source.
Actual behavior
It uses the smaller L2 cache size as the result of GetLogicalProcessorCacheSizeFromOS, rather than the L3 cache size.
Regression?
No response
Known Workarounds
We're considering an LD_PRELOAD trick to temporarily fix sysconf for this peculiar environment (a sketch follows).
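A minimal sketch of that kind of interposer, assuming the only problem is the zero L3 value (the 36 MB constant is an example only, not a value measured on these machines; in practice you would bake in the value read from /sys/devices/system/cpu/cpu0/cache/index3/size):

```cpp
// Sketch of an LD_PRELOAD shim that patches the bogus sysconf L3 value.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <unistd.h>

extern "C" long sysconf(int name)
{
    // Forward everything to the real libc sysconf...
    static auto real_sysconf =
        reinterpret_cast<long (*)(int)>(dlsym(RTLD_NEXT, "sysconf"));
    long value = real_sysconf(name);

    // ...but never let a zero L3 size through.
    if (name == _SC_LEVEL3_CACHE_SIZE && value == 0)
        return 36L * 1024 * 1024; // example L3 size, replace with the sysfs value
    return value;
}
```

Built with something like `g++ -shared -fPIC -O2 -o libfixl3.so fixl3.cpp -ldl` and loaded via LD_PRELOAD for the dotnet process.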
Configuration
No response
Other information
I haven't found any other OS that misreports sysconf(_SC_LEVEL3_CACHE_SIZE). I discovered it while investigating an OutOfMemoryException issue on our service.
I want to share some context before I reproduce the OOM in a controlled environment.
Our service runs on ASPNET servers and handles < 350req/s (allocation rate: 180MB/s) during the day, and < 100req/s (allocation rate: 35MB/s) at night.
We also have a daily batch job. The job is simple. It consists of fetching data from DBs (1,000,000 entries of small records) and performing calculations, then saving the result to DBs. It's a single-threaded job, and it allocates around 250MB/s. It takes around 1 min.
So we pick a random server and let it run the job during the minimum-load window. (It might be a bad idea to mix very different types of load on one server.)
The node has 64G of memory. We set a 32G memory limit on the pods, which has never been reached. The committed size is under 700MB during peak time and never over 3.5G during the batch job. We didn't capture the working set size in our metrics, but the node's available memory never dropped below 59G. The virtual memory size is large, but we don't set a limit on it. Huge pages are not enabled.
However, we hit OOM 7 times during the last month. It occurred on .NET 7.0.2 and 7.0.3, but it hasn't happened on 7.0.4 (which we've been running for about a week).
We're attempting to reproduce the issue in a controlled environment, but we haven't been able to do so consistently.
We've generated load cycles with the following configurations:
The test ran on AWS, c5.9xlarge, Ubuntu 18.04, Linux 5.4 Kernel, and Docker version 20.10.21. All L3 suppression is done with the LD_PRELOAD trick.
The load generator was tuned so that the GC metrics behaved similarly to production's under the .NET 7.0.3, L3=0K configuration; the same tuning was then applied to all configurations. The metrics were collected with Prometheus every 15s.
Here's what we've found so far: