Skip to content

High CPU usage in native code and application hang in GC #11587

@kripergvg

Description

@kripergvg

High CPU usage and application hang

We are hosting a .NET Core app with Server GC on Linux and have a very strange problem with CPU usage. The app is a combination of ASP.NET and TCP network work. The application now at idle state and there are no requests to it but it uses 300% CPU (3 of 8 cores).
htop
image

I started an investigation and found out that all time spent at native code
image
All of 3 threads have the same stack and stuck at
libcoreclr.so <<libcoreclr.so!_ZN7CorUnix10CPalThread11ThreadEntryEPv>>
We have a metric that shows when the last GC happened and I see from it that it happened like one day ago and full application hangs from this moment. It seems like GC is just triggiring infinity.

(lldb) process attach -p 29683
(lldb) plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.6/libsosplugin.so

(lldb) thread select 23
* thread dotnet/coreclr#23, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007f37cd120c33 libcoreclr.so`SVR::gc_heap::mark_steal() [inlined] SVR::partial_object_p(r=0x0000000000000000) at gc.cpp:17693
(lldb) dumpstack
OS Thread Id: 0x7402 (23)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr.so!SVR::gc_heap::mark_steal() + 0x113 [/root/coreclr/src/gc/gc.cpp:17693]
Child-SP         RetAddr          Caller, Callee
00007F37BBC85C80 00007f37cd11771c libcoreclr.so!SVR::gc_heap::mark_phase(int, int) + 0x7ac [/root/coreclr/src/gc/gc.cpp:19725], calling libcoreclr.so!SVR::gc_heap::mark_steal() [/root/coreclr/src/gc/gc.cpp:17944]
00007F37BBC85CA0 00007f37cdc6eea6 libc.so.6!__clock_gettime + 0x26
00007F37BBC85CD0 00007f37cd2932c7 libcoreclr.so!QueryPerformanceCounter + 0x27 [/root/coreclr/src/pal/src/misc/time.cpp:212], calling libcoreclr.so!clock_gettime
00007F37BBC85D20 00007f37cd113ef2 libcoreclr.so!SVR::gc_heap::gc1() + 0x1d2 [/root/coreclr/src/gc/gc.cpp:15390], calling libcoreclr.so!SVR::gc_heap::mark_phase(int, int) [/root/coreclr/src/gc/gc.cpp:19472]
00007F37BBC85D50 00007f37cd11b270 libcoreclr.so!SVR::gc_heap::descr_generations(int) + 0x20 [/root/coreclr/src/gc/gc.cpp:32133], calling libcoreclr.so!StressLog::StressLogOn(unsigned int, unsigned int) [/root/coreclr/src/utilcode/stresslog.cpp:594]
00007F37BBC85D90 00007f37cd106a8d libcoreclr.so!SVR::gc_heap::garbage_collect(int) + 0x101d [/root/coreclr/src/gc/gc.cpp:16973], calling libcoreclr.so!SVR::gc_heap::gc1() [/root/coreclr/src/gc/gc.cpp:15323]
00007F37BBC85E30 00007f37cd1054c8 libcoreclr.so!SVR::gc_heap::gc_thread_function() + 0x2b8 [/root/coreclr/src/gc/gc.cpp:5399], calling libcoreclr.so!SVR::gc_heap::garbage_collect(int) [/root/coreclr/src/gc/gc.cpp:16565]
00007F37BBC85E80 00007f37cd105202 libcoreclr.so + 0xffffffff [/root/coreclr/src/gc/gc.cpp:24933], calling libcoreclr.so!SVR::gc_heap::gc_thread_function() [/root/coreclr/src/gc/gc.cpp:5355]
00007F37BBC85EA0 00007f37cd175d78 libcoreclr.so!(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::__invoke(void*) + 0xa8 [/root/coreclr/src/vm/gcenv.ee.cpp:1309]
00007F37BBC85ED0 00007f37cd2b1b62 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 0x132 [/root/coreclr/src/pal/src/thread/thread.cpp:1685]
00007F37BBC85EF0 00007f37ce8756db libpthread.so.0!start_thread + 0xdb
00007F37BBC85FB0 00007f37cdc5f88f libc.so.6!clone + 0x3f

(lldb) thread select 23
* thread dotnet/coreclr#23, name = 'dotnet', stop reason = signal SIGSTOP
    frame #0: 0x00007f37cd120def libcoreclr.so`SVR::gc_heap::mark_steal() [inlined] SVR::gc_heap::check_next_mark_stack(this=<unavailable>, next_heap=<unavailable>) at gc.cpp:18146
(lldb) dumpstack
OS Thread Id: 0x7402 (23)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr.so!SVR::gc_heap::mark_steal() + 0x2cf [/root/coreclr/src/gc/gc.cpp:18146]
Child-SP         RetAddr          Caller, Callee
00007F37BBC85C80 00007f37cd11771c libcoreclr.so!SVR::gc_heap::mark_phase(int, int) + 0x7ac [/root/coreclr/src/gc/gc.cpp:19725], calling libcoreclr.so!SVR::gc_heap::mark_steal() [/root/coreclr/src/gc/gc.cpp:17944]
00007F37BBC85CA0 00007f37cdc6eea6 libc.so.6!__clock_gettime + 0x26
00007F37BBC85CD0 00007f37cd2932c7 libcoreclr.so!QueryPerformanceCounter + 0x27 [/root/coreclr/src/pal/src/misc/time.cpp:212], calling libcoreclr.so!clock_gettime
00007F37BBC85D20 00007f37cd113ef2 libcoreclr.so!SVR::gc_heap::gc1() + 0x1d2 [/root/coreclr/src/gc/gc.cpp:15390], calling libcoreclr.so!SVR::gc_heap::mark_phase(int, int) [/root/coreclr/src/gc/gc.cpp:19472]
00007F37BBC85D50 00007f37cd11b270 libcoreclr.so!SVR::gc_heap::descr_generations(int) + 0x20 [/root/coreclr/src/gc/gc.cpp:32133], calling libcoreclr.so!StressLog::StressLogOn(unsigned int, unsigned int) [/root/coreclr/src/utilcode/stresslog.cpp:594]
00007F37BBC85D90 00007f37cd106a8d libcoreclr.so!SVR::gc_heap::garbage_collect(int) + 0x101d [/root/coreclr/src/gc/gc.cpp:16973], calling libcoreclr.so!SVR::gc_heap::gc1() [/root/coreclr/src/gc/gc.cpp:15323]
00007F37BBC85E30 00007f37cd1054c8 libcoreclr.so!SVR::gc_heap::gc_thread_function() + 0x2b8 [/root/coreclr/src/gc/gc.cpp:5399], calling libcoreclr.so!SVR::gc_heap::garbage_collect(int) [/root/coreclr/src/gc/gc.cpp:16565]
00007F37BBC85E80 00007f37cd105202 libcoreclr.so + 0xffffffff [/root/coreclr/src/gc/gc.cpp:24933], calling libcoreclr.so!SVR::gc_heap::gc_thread_function() [/root/coreclr/src/gc/gc.cpp:5355]
00007F37BBC85EA0 00007f37cd175d78 libcoreclr.so!(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::__invoke(void*) + 0xa8 [/root/coreclr/src/vm/gcenv.ee.cpp:1309]
00007F37BBC85ED0 00007f37cd2b1b62 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 0x132 [/root/coreclr/src/pal/src/thread/thread.cpp:1685]
00007F37BBC85EF0 00007f37ce8756db libpthread.so.0!start_thread + 0xdb
00007F37BBC85FB0 00007f37cdc5f88f libc.so.6!clone + 0x3f

(lldb) thread  backtrace
* thread dotnet/coreclr#23, name = 'dotnet', stop reason = signal SIGSTOP
  * frame #0: 0x00007f37cd120def libcoreclr.so`SVR::gc_heap::mark_steal() [inlined] SVR::gc_heap::check_next_mark_stack(this=<unavailable>, next_heap=<unavailable>) at gc.cpp:18146
    frame dotnet/coreclr#1: 0x00007f37cd120ddd libcoreclr.so`SVR::gc_heap::mark_steal(this=<unavailable>) at gc.cpp:17969
    frame dotnet/coreclr#2: 0x00007f37cd11771c libcoreclr.so`SVR::gc_heap::mark_phase(this=<unavailable>, condemned_gen_number=2, mark_only_p=NO) at gc.cpp:19714
    frame dotnet/coreclr#3: 0x00007f37cd113ef2 libcoreclr.so`SVR::gc_heap::gc1(this=<unavailable>) at gc.cpp:15388
    frame dotnet/coreclr#4: 0x00007f37cd106a8d libcoreclr.so`SVR::gc_heap::garbage_collect(this=<unavailable>, n=2) at gc.cpp:16964
    frame dotnet/coreclr#5: 0x00007f37cd1054c8 libcoreclr.so`SVR::gc_heap::gc_thread_function(this=<unavailable>) at gc.cpp:5397
    frame dotnet/coreclr#6: 0x00007f37cd105202 libcoreclr.so`SVR::gc_heap::gc_thread_stub(arg=<unavailable>) at gc.cpp:24933
    frame dotnet/coreclr#7: 0x00007f37cd175d78 libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::__invoke(void*) [inlined] (anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::operator()(void*) const at gcenv.ee.cpp:1324
    frame dotnet/coreclr#8: 0x00007f37cd175cdd libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(argument=<unavailable>)(void*), void*, char const*)::$_1::__invoke(void*) at gcenv.ee.cpp:1309
    frame dotnet/coreclr#9: 0x00007f37cd2b1b62 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000000147d7f0) at thread.cpp:1682
    frame dotnet/coreclr#10: 0x00007f37ce8756db libpthread.so.0`start_thread + 219
    frame dotnet/coreclr#11: 0x00007f37cdc5f88f libc.so.6`clone + 63

dotnet --version 2.1.500


hostnamectl
           Chassis: vm
    Virtualization: kvm
  Operating System: Ubuntu 18.04.1 LTS
            Kernel: Linux 4.15.0-39-generic
      Architecture: x86-64

PerfCollect trace
sampleTrace1.trace.zip

New info
I have just looked at GC 2 Count. This is a picture before app hang
image

We have server GC and at the same time, we have a lot of GC 2 triggering count (300 000) although memory doesn't free up and memory traffic very low (100mb/5 hours). How this can happen? We don't have any limitation on process memory usage.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions