dotnet core 3.1 __pv_queued_spin_lock_slowpath #34678

Open
egmkang opened this issue Apr 8, 2020 · 20 comments
egmkang commented Apr 8, 2020

perf record -F 99 -p 16317 -g -- sleep 120    # sample pid 16317 at 99 Hz with call graphs for 120 s
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg    # fold the recorded stacks into a flame graph

OS: CentOS 7.6 64bit
dotnet core: 3.1.201
hardware: Intel Xeon Cascade Lake 8255C (2.5 GHz), 32 cores, 64 GB cloud VM

My app is an MMOG game server; it handles about 5,000 input messages per second and about 100,000 output messages per second.

I profiled and optimized on Windows Server 2012 R2, and it runs very well there.

But on CentOS 7.6 it performs badly: the process spends about half of its time in __pv_queued_spin_lock_slowpath, and I don't know what is calling __pv_queued_spin_lock_slowpath or why. It does not seem to be my application code calling it.

What is the difference between the Windows and Linux strategies? How can I solve this problem?

Thanks a lot.

Dotnet-GitSync-Bot added the area-Meta and untriaged (new issue has not been triaged by the area owner) labels Apr 8, 2020

egmkang commented Apr 8, 2020

This is my flame graph:
flamegraph.zip

egmkang changed the title from "dotnet core 3.1 call __pv_queued_spin_lock_slowpath" to "dotnet core 3.1 __pv_queued_spin_lock_slowpath" Apr 8, 2020
jkotas added the area-System.Threading and tenet-performance (performance related issue) labels and removed the area-Meta label Apr 8, 2020

egmkang commented Apr 9, 2020

[root@VM_16_46_centos ~]# dotnet --info
.NET Core SDK (reflecting any global.json):
Version: 3.1.201
Commit: b1768b4ae7

Runtime Environment:
OS Name: centos
OS Version: 7
OS Platform: Linux
RID: centos.7-x64
Base Path: /usr/share/dotnet/sdk/3.1.201/

Host (useful for support):
Version: 3.1.3
Commit: 4a9f85e9f8

.NET Core SDKs installed:
3.1.201 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
Microsoft.AspNetCore.App 3.1.3 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 3.1.3 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
https://aka.ms/dotnet-download


egmkang commented Apr 9, 2020

CPU off time.zip

mangod9 removed the untriaged label Apr 10, 2020
mangod9 added this to the 5.0 milestone Apr 10, 2020
janvorli (Member) commented:

@egmkang the __pv_queued_spin_lock_slowpath is a spinlock implementation in the Linux kernel. In the flame graph, it seems to be called from the futex syscall, which in turn is called by pthread mutex locking. There are so many places in the runtime that use mutexes that I cannot guess which ones you are seeing on the flame graph. I am not sure why the flame graph doesn't show the native runtime functions that are calling it and just shows a large "dotnet" box.

Can you switch the profiler samples view to inclusive times? That might help highlight the actual runtime functions related to the problematic calls to the mutex functions.
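
(For illustration only - this sketch is not from the comment above - here is the shape of managed code that produces this pattern: a heavily contended monitor. When the runtime's spin-wait gives up, waiting threads block via pthread mutexes and the futex syscall, which is where kernel lock time such as __pv_queued_spin_lock_slowpath accumulates on a paravirtualized guest.)

    using System;
    using System.Threading.Tasks;

    class ContendedLockDemo
    {
        static readonly object Gate = new object();
        static long _counter;

        static void Main()
        {
            // Every worker hammers the same monitor; under contention the
            // waiters are parked in the kernel instead of spinning in user mode.
            var tasks = new Task[Environment.ProcessorCount];
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = Task.Run(() =>
                {
                    for (int j = 0; j < 1_000_000; j++)
                    {
                        lock (Gate) { _counter++; }
                    }
                });
            }
            Task.WaitAll(tasks);
            Console.WriteLine(_counter);
        }
    }

Profiling a run of this with the perf commands from the first comment should show futex time attributed to the dotnet process.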


egmkang commented Apr 14, 2020

@janvorli I'm sorry, English is not my mother language. Can you tell me how to switch the profiler samples view to inclusive times?
Is that a perf option, or a flamegraph parameter?

Is it perf timechart?

janvorli (Member) commented:

@egmkang I am sorry for the delayed response. I am actually not that familiar with the perf tool details, but I assume you've produced the second image in this issue using the "perf report" command, correct? From the docs it seems that from perf tool version 3.16 on, unless you've passed the --no-children option, it should show inclusive percentages. (Inclusive means that for each function shown, the percentage represents time spent in the function and in all of the functions it called, while exclusive means the percentage covers only the time spent in the function itself.) What perf tool version do you have, and which command line arguments did you use?
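
(For reference, a sketch of the two report views. These are standard perf flags, shown only as a pointer, not output from this issue:)

    perf report --children      # inclusive: a function's own time plus everything it calls (the default since 3.16)
    perf report --no-children   # exclusive: a function's own time only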


egmkang commented Apr 17, 2020

@janvorli I ran some tests and found out that only the Intel CPU reproduces this problem!

CPU: Intel and AMD
WKS: workstation GC
SVR: server GC
Number: online player count (robot count)

[flame graph screenshots: AMD_SVR_4000, AMD_WKS_4000, Intel_SVR_3600, Intel_WKS_2500]

On the Intel CPU, the spin lock costs more time in both GC modes; on the AMD CPU, only WKS mode spends noticeable time allocating objects.


egmkang commented Apr 17, 2020

processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
stepping : 5
microcode : 0x1
cpu MHz : 2499.998
cache size : 36608 KB
physical id : 1
siblings : 16
core id : 15
cpu cores : 16
apicid : 31
initial apicid : 31
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni
bogomips : 4999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:


egmkang commented Apr 17, 2020

processor : 31
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7K62 48-Core Processor
stepping : 0
microcode : 0x1000065
cpu MHz : 2595.124
cache size : 512 KB
physical id : 0
siblings : 32
core id : 59
cpu cores : 32
apicid : 31
initial apicid : 31
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid amd_dcm eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext retpoline_amd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips : 5190.24
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:


egmkang commented Apr 17, 2020

And these are the flame graphs:

[flame graph screenshots: flamegraph_Intel_WKS_2500, flamegraph_Intel_SVR_3600, flamegraph_AMD_WKS_4000, flamegraph_AMD_SVR_4000]


egmkang commented Apr 17, 2020

Here are all the flame graph SVG files:
all_flamegraph.zip


egmkang commented Apr 17, 2020

On the Intel cloud VM, only a few players can be supported online. On the AMD VM, WKS can support 4000+ players (only the frame rate is unstable), and SVR works as well as Windows does.

janvorli (Member) commented:

For the WKS GC, more locking is expected compared to SVR. But I'll defer further help here to the dotnet perf folks.

@dotnet/dotnet-perf, could someone please help @egmkang get to the bottom of this?

@Maoni0 - this may be of interest to you too - there seems to be a substantial difference in the time spent on locks for the GC between AMD and Intel CPUs running the same app, and also a substantial delta between the WKS / SVR differences on AMD and Intel.
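
(Aside, for anyone reproducing the WKS vs SVR comparison: the GC mode is selected with a standard project-file property. A minimal snippet, shown only for reference:)

    <!-- .csproj: opt the app into Server GC (omit or set to false for Workstation GC) -->
    <PropertyGroup>
      <ServerGarbageCollection>true</ServerGarbageCollection>
    </PropertyGroup>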

brianrob (Member) commented:

Interesting that this only happens on Intel CPUs. Unfortunately, it looks like the stack is broken, so we can't tell from the flame graph what's actually calling __pv_queued_spin_lock_slowpath. perf uses frame pointers to walk the stack, so the likely issue is that some library is compiled with -fomit-frame-pointer. There are a couple of options here:

  1. Try recompiling the possible set of libraries with -fno-omit-frame-pointer. There's a good chance the culprit is libc.
  2. Given the high percentage of time spent in this function, you can probably capture some stacks quite easily using a debugger.

Either of these options is reasonable to try, but I suspect you will get faster results with option 2. You'll just want to make sure that you're using LLDB so that you can load the SOS plugin and resolve both managed and native frames (a rough sketch follows below).

Here are some instructions on how to install SOS: https://github.com/dotnet/diagnostics/blob/master/documentation/installing-sos-instructions.md
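
(A rough sketch of what option 2 could look like in practice; the pid is a placeholder, and the exact SOS setup is in the link above:)

    dotnet tool install -g dotnet-sos   # one-time: install the SOS installer tool
    dotnet-sos install                  # hook SOS into lldb for this user
    lldb -p <pid>                       # attach while the process is busy in the lock
    (lldb) bt all                       # native stacks for all threads
    (lldb) clrstack -all                # SOS: managed stacks for all threads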


egmkang commented Apr 18, 2020

The 32-core Intel VM has been recycled. Another 8-core Intel 8255C VM looks normal. I tested several GameServer versions (spanning a month), and they are all normal too.

egmkang commented May 17, 2020

I reverted the code to the 2020-04-08 version, which had a great many New/NewArray operations:
[flame graph screenshot: Intel_WKS_4000]
And today's code, which reduces New/NewArray a lot:
[flame graph screenshot: Intel_SVR_4000]

All tests ran on one cloud VM with an Intel CPU.

And I found:

  1. .NET Core does not have escape analysis.
    C# abstractions such as Action, Func, and LINQ are not zero-cost; when the overhead is an object allocation, it is too expensive. The WKS GC amplifies this overhead. Server GC is better, but the key point is to reduce object allocation.
    could JIT can inline closure Action? #34634
  2. ThreadLocal is great.
    I used ThreadLocal to reuse collection containers and eliminate many allocations from LINQ and logic code. The code is very ugly, but useful (see the sketch after this list).
  3. Managed languages are not well suited to IO programming.
    IO usually involves high-frequency small-object allocation, and it is hard to write a high-performance, memory-safe IO program.
    DotNetty / ASP.NET Core Bedrock look good, but languages with RAII may be more advantageous.
    Still, .NET Core is suitable for writing computationally intensive code: if you don't allocate objects, anyone can write high-performance code. C++/Rust are hard to learn; C# is easy.
  4. MongoDB is very bad.
  5. Visual Studio 2017 Community's .NET Object Allocation Tracking is very good; I used it to find the real problem.
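
(A minimal sketch of point 2, with made-up names rather than the actual server code: a per-thread scratch list that is cleared and reused instead of allocated per message.)

    using System.Collections.Generic;
    using System.Threading;

    static class ScratchBuffers
    {
        // One List<int> per thread, created lazily on first use and then
        // reused, so the hot path never allocates a temporary list.
        private static readonly ThreadLocal<List<int>> _scratch =
            new ThreadLocal<List<int>>(() => new List<int>(capacity: 256));

        public static List<int> Rent()
        {
            List<int> list = _scratch.Value;
            list.Clear();   // keeps the backing array, drops the old items
            return list;
        }
    }

    // usage on a hot path: after the first call per thread, no allocation
    //   var tmp = ScratchBuffers.Rent();
    //   tmp.Add(42);

The trade-off is the ugliness mentioned above: the rented list must not escape the current unit of work, or two messages handled on the same thread will stomp on each other.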

reflectronic (Contributor) commented:

Managed language is not suitable for IO programming

Are you sure about that?

john-h-k (Contributor) commented:

Managed language is not suitable for IO programming
IO usually contains high frequency small object allocation

use struct
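
(A minimal sketch of that suggestion, using a hypothetical message type that is not from this issue: modeling small, short-lived messages as structs keeps them off the GC heap entirely.)

    using System;

    // A small message as a readonly struct: passing it copies a few bytes
    // on the stack instead of allocating an object for the GC to track.
    readonly struct InputMessage
    {
        public readonly int PlayerId;
        public readonly long Timestamp;
        public readonly short OpCode;

        public InputMessage(int playerId, long timestamp, short opCode)
        {
            PlayerId = playerId;
            Timestamp = timestamp;
            OpCode = opCode;
        }
    }

    class Demo
    {
        static void Handle(in InputMessage msg)   // 'in' passes by readonly reference
        {
            Console.WriteLine($"player {msg.PlayerId} op {msg.OpCode}");
        }

        static void Main()
        {
            Handle(new InputMessage(playerId: 1, timestamp: 0, opCode: 7));   // no heap allocation
        }
    }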

kouvel removed this from the 5.0.0 milestone Jul 22, 2020
kouvel modified the milestones: 6.0.0, Future Jul 22, 2020