GC does not correctly evaluate the memory load on Linux #13371

Closed
kevingosse opened this issue Sep 6, 2019 · 10 comments · Fixed by dotnet/coreclr#26764

@kevingosse
Contributor

On Linux, when running in an unrestricted environment, the GC uses sysconf(SYSCONF_PAGES) * sysconf(_SC_PAGE_SIZE) to evaluate the total memory consumption of the system (https://github.com/dotnet/coreclr/blob/master/src/pal/src/misc/sysinfo.cpp#L368).

SYSCONF_PAGES is mapped to _SC_AVPHYS_PAGES. Unfortunately, that value treats memory used by the page cache (which the OS frees automatically as needed) as in use, and therefore overestimates the system load.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         26G         30G        1.5M        6.1G         35G
Swap:            0B          0B          0B

$ getconf _AVPHYS_PAGES
7968847

$ getconf PAGESIZE
4096

We can see here that _AVPHYS_PAGES * PAGESIZE is 32 GB, even though only 26 GB of resident memory is actually used. We've seen instances where the GC incorrectly concludes that more than 90% of the memory is used and starts doing blocking collections even though they shouldn't be needed.

@sergiy-k
Contributor

sergiy-k commented Sep 6, 2019

/cc: @janvorli @Maoni0

@mattzink

mattzink commented Sep 6, 2019

Wow. Hope the fix for this can make 3.0.

@janvorli
Member

janvorli commented Sep 9, 2019

@kevingosse _SC_AVPHYS_PAGES represents the number of available pages, not the number of used pages. The free command above reported 30GB of free memory, and _AVPHYS_PAGES * PAGESIZE from your numbers is 30GB, not 32. So the value we are using seems correct.

Since used and free are close to each other in your case, I guess that misled you. Here is an example from my machine:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:            23G        3,4G         15G        5,0M        4,6G         19G
Swap:           23G        1,1G         22G

$ getconf _AVPHYS_PAGES
4074733

$ getconf PAGESIZE
4096

_AVPHYS_PAGES * PAGESIZE = 16690106368 B = 16298932 KB = 15917 MB = 15 GB
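
The 30-versus-32 disagreement above looks like a difference in units: the same byte count is roughly 32.6 GB in decimal units and roughly 30.4 GiB in binary units, and `free -h` is assumed here to report binary units, which is its default in procps. A minimal Python sketch of the conversion, using the two page counts quoted in this thread (the helper name `pages_to_units` is made up for illustration):

```python
def pages_to_units(pages: int, page_size: int) -> dict:
    """Convert a page count to bytes, decimal GB, and binary GiB."""
    total_bytes = pages * page_size
    return {
        "bytes": total_bytes,
        "GB": total_bytes / 10**9,   # decimal gigabytes (powers of 1000)
        "GiB": total_bytes / 2**30,  # binary gibibytes (powers of 1024)
    }

# Page counts and page size as reported by getconf in the two comments.
first = pages_to_units(7968847, 4096)
second = pages_to_units(4074733, 4096)

for label, result in (("first machine", first), ("second machine", second)):
    print(f"{label}: {result['bytes']} B = "
          f"{result['GB']:.2f} GB = {result['GiB']:.2f} GiB")
```

Either way, the figure tracks the "free" column rather than the "used" column.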

@kevingosse
Contributor Author

kevingosse commented Sep 9, 2019

My point is that the "available" column should be used instead of the "free" column (reported by _AVPHYS_PAGES). A large part of the memory reported in "buff/cache" is automatically freed by the OS when needed, and therefore shouldn't be counted by the GC when evaluating the system load.
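
To illustrate how much the choice of column matters, here is a rough sketch that computes a load percentage from the rounded `free -h` figures in the first comment, once from "free" and once from "available". The `memory_load_percent` helper is hypothetical and is not how the GC computes its memory load; it only shows the size of the gap.

```python
def memory_load_percent(total: float, unused: float) -> float:
    """Percentage of memory considered in use; both arguments share one unit."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - unused) / total

# Rounded values from the first `free -h` output in this thread.
total_gib, free_gib, available_gib = 62, 30, 35

load_from_free = memory_load_percent(total_gib, free_gib)
load_from_available = memory_load_percent(total_gib, available_gib)

print(f"load using 'free':      {load_from_free:.1f}%")
print(f"load using 'available': {load_from_available:.1f}%")
```

With these rounded inputs the "free"-based load comes out several points higher than the "available"-based one, and the gap grows with the size of the page cache.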

@janvorli
Member

janvorli commented Sep 9, 2019

Ah, got it. It seems we would need to parse the /proc/meminfo file to get the Buffers and Cached sizes then. I wonder if Slab should be included too. Based on a Red Hat article (https://access.redhat.com/solutions/406773), free reports buff/cache as the sum of Buffers, Cached and Slab from /proc/meminfo.

@kevingosse
Contributor Author

kevingosse commented Sep 9, 2019

In /proc/meminfo, MemAvailable was added precisely so that users wouldn't have to ask themselves this kind of question: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773

I don't think it's available on RHEL6 though, so if that is still supported for .NET Core 3.0 you may need to add a fallback path where you manually compute the value.

Edit: Never mind, it has been backported: https://access.redhat.com/solutions/776393
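
A minimal sketch of that approach: read `MemAvailable` from `/proc/meminfo`-style text and fall back to a manual estimate when the field is missing. The fallback used here (`MemFree + Buffers + Cached`) is an assumption for illustration only; it is cruder than the kernel's own estimate and is not taken from the actual fix. The function names and the sample values are made up.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a {field: bytes} mapping."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        value = int(parts[0])
        # Sizes carry a "kB" suffix (1024-byte units); a few fields are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def available_bytes(info: dict) -> int:
    """Prefer MemAvailable; otherwise use a rough fallback (an assumption)."""
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)


# Made-up sample values, not taken from a real machine.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         4000000 kB
MemAvailable:    9000000 kB
Buffers:          500000 kB
Cached:          5000000 kB
"""

info = parse_meminfo(SAMPLE)
print(available_bytes(info))
```

On a Linux machine the same functions can be fed the contents of `/proc/meminfo`, for example `parse_meminfo(open("/proc/meminfo").read())`.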

@agoretsky

Looks like it's an important one. Could you consider porting the fix to 3.0.x/3.1?

@sergiy-k
Contributor

Yes, we do want to try to get this change approved for 3.1.
Getting into 3.0.x servicing will be very hard.

@mattzink

Could you explain further why this miscalculation would be hard to address in a 3.0.x release?

@jkotas
Member

jkotas commented Sep 20, 2019

3.0 is a very short-lived release. It will be quickly superseded by 3.1 LTS in November. 3.1 LTS is where it is important to get bugs like this one fixed and it is what we are focused on.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020