-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Description
High CPU usage and application hang
We are hosting a .NET Core app with Server GC on Linux and have a very strange problem with CPU usage. The app is a combination of ASP.NET and TCP network work. The application now at idle state and there are no requests to it but it uses 300% CPU (3 of 8 cores).
htop

I started an investigation and found out that all time spent at native code

All of 3 threads have the same stack and stuck at
libcoreclr.so <<libcoreclr.so!_ZN7CorUnix10CPalThread11ThreadEntryEPv>>
We have a metric that shows when the last GC happened and I see from it that it happened like one day ago and full application hangs from this moment. It seems like GC is just triggiring infinity.
(lldb) process attach -p 29683
(lldb) plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/2.1.6/libsosplugin.so
(lldb) thread select 23
* thread dotnet/coreclr#23, name = 'dotnet', stop reason = signal SIGSTOP
frame #0: 0x00007f37cd120c33 libcoreclr.so`SVR::gc_heap::mark_steal() [inlined] SVR::partial_object_p(r=0x0000000000000000) at gc.cpp:17693
(lldb) dumpstack
OS Thread Id: 0x7402 (23)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr.so!SVR::gc_heap::mark_steal() + 0x113 [/root/coreclr/src/gc/gc.cpp:17693]
Child-SP RetAddr Caller, Callee
00007F37BBC85C80 00007f37cd11771c libcoreclr.so!SVR::gc_heap::mark_phase(int, int) + 0x7ac [/root/coreclr/src/gc/gc.cpp:19725], calling libcoreclr.so!SVR::gc_heap::mark_steal() [/root/coreclr/src/gc/gc.cpp:17944]
00007F37BBC85CA0 00007f37cdc6eea6 libc.so.6!__clock_gettime + 0x26
00007F37BBC85CD0 00007f37cd2932c7 libcoreclr.so!QueryPerformanceCounter + 0x27 [/root/coreclr/src/pal/src/misc/time.cpp:212], calling libcoreclr.so!clock_gettime
00007F37BBC85D20 00007f37cd113ef2 libcoreclr.so!SVR::gc_heap::gc1() + 0x1d2 [/root/coreclr/src/gc/gc.cpp:15390], calling libcoreclr.so!SVR::gc_heap::mark_phase(int, int) [/root/coreclr/src/gc/gc.cpp:19472]
00007F37BBC85D50 00007f37cd11b270 libcoreclr.so!SVR::gc_heap::descr_generations(int) + 0x20 [/root/coreclr/src/gc/gc.cpp:32133], calling libcoreclr.so!StressLog::StressLogOn(unsigned int, unsigned int) [/root/coreclr/src/utilcode/stresslog.cpp:594]
00007F37BBC85D90 00007f37cd106a8d libcoreclr.so!SVR::gc_heap::garbage_collect(int) + 0x101d [/root/coreclr/src/gc/gc.cpp:16973], calling libcoreclr.so!SVR::gc_heap::gc1() [/root/coreclr/src/gc/gc.cpp:15323]
00007F37BBC85E30 00007f37cd1054c8 libcoreclr.so!SVR::gc_heap::gc_thread_function() + 0x2b8 [/root/coreclr/src/gc/gc.cpp:5399], calling libcoreclr.so!SVR::gc_heap::garbage_collect(int) [/root/coreclr/src/gc/gc.cpp:16565]
00007F37BBC85E80 00007f37cd105202 libcoreclr.so + 0xffffffff [/root/coreclr/src/gc/gc.cpp:24933], calling libcoreclr.so!SVR::gc_heap::gc_thread_function() [/root/coreclr/src/gc/gc.cpp:5355]
00007F37BBC85EA0 00007f37cd175d78 libcoreclr.so!(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::__invoke(void*) + 0xa8 [/root/coreclr/src/vm/gcenv.ee.cpp:1309]
00007F37BBC85ED0 00007f37cd2b1b62 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 0x132 [/root/coreclr/src/pal/src/thread/thread.cpp:1685]
00007F37BBC85EF0 00007f37ce8756db libpthread.so.0!start_thread + 0xdb
00007F37BBC85FB0 00007f37cdc5f88f libc.so.6!clone + 0x3f
(lldb) thread select 23
* thread dotnet/coreclr#23, name = 'dotnet', stop reason = signal SIGSTOP
frame #0: 0x00007f37cd120def libcoreclr.so`SVR::gc_heap::mark_steal() [inlined] SVR::gc_heap::check_next_mark_stack(this=<unavailable>, next_heap=<unavailable>) at gc.cpp:18146
(lldb) dumpstack
OS Thread Id: 0x7402 (23)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr.so!SVR::gc_heap::mark_steal() + 0x2cf [/root/coreclr/src/gc/gc.cpp:18146]
Child-SP RetAddr Caller, Callee
00007F37BBC85C80 00007f37cd11771c libcoreclr.so!SVR::gc_heap::mark_phase(int, int) + 0x7ac [/root/coreclr/src/gc/gc.cpp:19725], calling libcoreclr.so!SVR::gc_heap::mark_steal() [/root/coreclr/src/gc/gc.cpp:17944]
00007F37BBC85CA0 00007f37cdc6eea6 libc.so.6!__clock_gettime + 0x26
00007F37BBC85CD0 00007f37cd2932c7 libcoreclr.so!QueryPerformanceCounter + 0x27 [/root/coreclr/src/pal/src/misc/time.cpp:212], calling libcoreclr.so!clock_gettime
00007F37BBC85D20 00007f37cd113ef2 libcoreclr.so!SVR::gc_heap::gc1() + 0x1d2 [/root/coreclr/src/gc/gc.cpp:15390], calling libcoreclr.so!SVR::gc_heap::mark_phase(int, int) [/root/coreclr/src/gc/gc.cpp:19472]
00007F37BBC85D50 00007f37cd11b270 libcoreclr.so!SVR::gc_heap::descr_generations(int) + 0x20 [/root/coreclr/src/gc/gc.cpp:32133], calling libcoreclr.so!StressLog::StressLogOn(unsigned int, unsigned int) [/root/coreclr/src/utilcode/stresslog.cpp:594]
00007F37BBC85D90 00007f37cd106a8d libcoreclr.so!SVR::gc_heap::garbage_collect(int) + 0x101d [/root/coreclr/src/gc/gc.cpp:16973], calling libcoreclr.so!SVR::gc_heap::gc1() [/root/coreclr/src/gc/gc.cpp:15323]
00007F37BBC85E30 00007f37cd1054c8 libcoreclr.so!SVR::gc_heap::gc_thread_function() + 0x2b8 [/root/coreclr/src/gc/gc.cpp:5399], calling libcoreclr.so!SVR::gc_heap::garbage_collect(int) [/root/coreclr/src/gc/gc.cpp:16565]
00007F37BBC85E80 00007f37cd105202 libcoreclr.so + 0xffffffff [/root/coreclr/src/gc/gc.cpp:24933], calling libcoreclr.so!SVR::gc_heap::gc_thread_function() [/root/coreclr/src/gc/gc.cpp:5355]
00007F37BBC85EA0 00007f37cd175d78 libcoreclr.so!(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::__invoke(void*) + 0xa8 [/root/coreclr/src/vm/gcenv.ee.cpp:1309]
00007F37BBC85ED0 00007f37cd2b1b62 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 0x132 [/root/coreclr/src/pal/src/thread/thread.cpp:1685]
00007F37BBC85EF0 00007f37ce8756db libpthread.so.0!start_thread + 0xdb
00007F37BBC85FB0 00007f37cdc5f88f libc.so.6!clone + 0x3f
(lldb) thread backtrace
* thread dotnet/coreclr#23, name = 'dotnet', stop reason = signal SIGSTOP
* frame #0: 0x00007f37cd120def libcoreclr.so`SVR::gc_heap::mark_steal() [inlined] SVR::gc_heap::check_next_mark_stack(this=<unavailable>, next_heap=<unavailable>) at gc.cpp:18146
frame dotnet/coreclr#1: 0x00007f37cd120ddd libcoreclr.so`SVR::gc_heap::mark_steal(this=<unavailable>) at gc.cpp:17969
frame dotnet/coreclr#2: 0x00007f37cd11771c libcoreclr.so`SVR::gc_heap::mark_phase(this=<unavailable>, condemned_gen_number=2, mark_only_p=NO) at gc.cpp:19714
frame dotnet/coreclr#3: 0x00007f37cd113ef2 libcoreclr.so`SVR::gc_heap::gc1(this=<unavailable>) at gc.cpp:15388
frame dotnet/coreclr#4: 0x00007f37cd106a8d libcoreclr.so`SVR::gc_heap::garbage_collect(this=<unavailable>, n=2) at gc.cpp:16964
frame dotnet/coreclr#5: 0x00007f37cd1054c8 libcoreclr.so`SVR::gc_heap::gc_thread_function(this=<unavailable>) at gc.cpp:5397
frame dotnet/coreclr#6: 0x00007f37cd105202 libcoreclr.so`SVR::gc_heap::gc_thread_stub(arg=<unavailable>) at gc.cpp:24933
frame dotnet/coreclr#7: 0x00007f37cd175d78 libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::__invoke(void*) [inlined] (anonymous namespace)::CreateNonSuspendableThread(void (*)(void*), void*, char const*)::$_1::operator()(void*) const at gcenv.ee.cpp:1324
frame dotnet/coreclr#8: 0x00007f37cd175cdd libcoreclr.so`(anonymous namespace)::CreateNonSuspendableThread(argument=<unavailable>)(void*), void*, char const*)::$_1::__invoke(void*) at gcenv.ee.cpp:1309
frame dotnet/coreclr#9: 0x00007f37cd2b1b62 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000000000147d7f0) at thread.cpp:1682
frame dotnet/coreclr#10: 0x00007f37ce8756db libpthread.so.0`start_thread + 219
frame dotnet/coreclr#11: 0x00007f37cdc5f88f libc.so.6`clone + 63
dotnet --version 2.1.500
hostnamectl
Chassis: vm
Virtualization: kvm
Operating System: Ubuntu 18.04.1 LTS
Kernel: Linux 4.15.0-39-generic
Architecture: x86-64
PerfCollect trace
sampleTrace1.trace.zip
New info
I have just looked at GC 2 Count. This is a picture before app hang

We have server GC and at the same time, we have a lot of GC 2 triggering count (300 000) although memory doesn't free up and memory traffic very low (100mb/5 hours). How this can happen? We don't have any limitation on process memory usage.