dotnet core 3.1 __pv_queued_spin_lock_slowpath #34678

Open
egmkang opened this issue Apr 8, 2020 · 20 comments
egmkang commented Apr 8, 2020

perf record -F 99 -p 16317 -g -- sleep 120    # sample pid 16317 at 99 Hz with call graphs for 120 s
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg    # fold the recorded stacks into a flame graph

OS: CentOS 7.6 64bit
dotnet core: 3.1.201
hardware: Intel Xeon Cascade Lake 8255C (2.5 GHz), 32 cores, 64 GB cloud VM

My app is an MMOG game server; it handles about 5,000 input messages per second and about 100,000 output messages per second.

I profiled and optimized on Windows Server 2012 R2, and it runs very well there.

But on CentOS 7.6 it performs badly: the process spends about half of its time in __pv_queued_spin_lock_slowpath, and I don't know what is calling __pv_queued_spin_lock_slowpath or why. It does not seem to be my application code calling it.

What is the difference between the Windows and Linux strategies? How can I solve this problem?

Thanks a lot.

Dotnet-GitSync-Bot added the area-Meta and untriaged (new issue has not been triaged by the area owner) labels Apr 8, 2020

egmkang commented Apr 8, 2020

This is my flame graph:
flamegraph.zip

egmkang changed the title from "dotnet core 3.1 call __pv_queued_spin_lock_slowpath" to "dotnet core 3.1 __pv_queued_spin_lock_slowpath" Apr 8, 2020
jkotas added the area-System.Threading and tenet-performance (performance related issue) labels and removed the area-Meta label Apr 8, 2020

egmkang commented Apr 9, 2020

[root@VM_16_46_centos ~]# dotnet --info
.NET Core SDK (reflecting any global.json):
Version: 3.1.201
Commit: b1768b4ae7

Runtime Environment:
OS Name: centos
OS Version: 7
OS Platform: Linux
RID: centos.7-x64
Base Path: /usr/share/dotnet/sdk/3.1.201/

Host (useful for support):
Version: 3.1.3
Commit: 4a9f85e9f8

.NET Core SDKs installed:
3.1.201 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
Microsoft.AspNetCore.App 3.1.3 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 3.1.3 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
https://aka.ms/dotnet-download


egmkang commented Apr 9, 2020

CPU off time.zip

mangod9 removed the untriaged label Apr 10, 2020
mangod9 added this to the 5.0 milestone Apr 10, 2020
janvorli (Member) commented:

@egmkang the __pv_queued_spin_lock_slowpath is a spinlock implementation in the Linux kernel. In the flame graph, it seems to be called from the futex syscall, which in turn is called by pthread mutex locking. There are so many places in the runtime that use mutexes that I cannot guess which ones you are seeing on the flame graph. I am not sure why the flame graph doesn't show the native runtime functions that are calling it and just shows a large "dotnet" box.

Can you switch the profiler samples view to inclusive times? That might help highlight the actual runtime functions related to the problematic calls to the mutex functions.
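
(For illustration only - this sketch is not from the comment above - here is the shape of managed code that produces this pattern: a heavily contended monitor. When the runtime's spin-wait gives up, waiting threads block via pthread mutexes and the futex syscall, which is where kernel lock time such as __pv_queued_spin_lock_slowpath accumulates on a paravirtualized guest.)

    using System;
    using System.Threading.Tasks;

    class ContendedLockDemo
    {
        static readonly object Gate = new object();
        static long _counter;

        static void Main()
        {
            // Every worker hammers the same monitor; under contention the
            // waiters are parked in the kernel instead of spinning in user mode.
            var tasks = new Task[Environment.ProcessorCount];
            for (int i = 0; i < tasks.Length; i++)
            {
                tasks[i] = Task.Run(() =>
                {
                    for (int j = 0; j < 1_000_000; j++)
                    {
                        lock (Gate) { _counter++; }
                    }
                });
            }
            Task.WaitAll(tasks);
            Console.WriteLine(_counter);
        }
    }

Profiling a run of this with the perf commands from the first comment should show futex time attributed to the dotnet process.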


egmkang commented Apr 14, 2020

@janvorli I'm sorry, English is not my mother language. Can you tell me how to switch the profiler samples view to inclusive times?
Is that a perf option, or a flamegraph parameter?

Is it perf timechart?

janvorli (Member) commented:

@egmkang I am sorry for the delayed response. I am actually not that familiar with the perf tool details, but I assume you've produced the second image in this issue using the "perf report" command, correct? From the docs it seems that from perf tool version 3.16 on, unless you've passed the --no-children option, it should show inclusive percentages. (Inclusive means that for each function shown, the percentage represents time spent in the function and in all of the functions it called, while exclusive means the percentage covers only the time spent in the function itself.) What perf tool version do you have, and which command line arguments did you use?
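
(For reference, a sketch of the two report views. These are standard perf flags, shown only as a pointer, not output from this issue:)

    perf report --children      # inclusive: a function's own time plus everything it calls (the default since 3.16)
    perf report --no-children   # exclusive: a function's own time only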


egmkang commented Apr 17, 2020

@janvorli I ran some tests and found out that only the Intel CPU reproduces this problem!

CPU: Intel and AMD
WKS: workstation GC
SVR: server GC
Number: online player count (robot count)

[flame graph screenshots: AMD_SVR_4000, AMD_WKS_4000, Intel_SVR_3600, Intel_WKS_2500]

On the Intel CPU, the spin lock costs more time in both GC modes; on the AMD CPU, only WKS mode spends noticeable time allocating objects.


egmkang commented Apr 17, 2020

processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
stepping : 5
microcode : 0x1
cpu MHz : 2499.998
cache size : 36608 KB
physical id : 1
siblings : 16
core id : 15
cpu cores : 16
apicid : 31
initial apicid : 31
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni
bogomips : 4999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:


egmkang commented Apr 17, 2020

processor : 31
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7K62 48-Core Processor
stepping : 0
microcode : 0x1000065
cpu MHz : 2595.124
cache size : 512 KB
physical id : 0
siblings : 32
core id : 59
cpu cores : 32
apicid : 31
initial apicid : 31
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm art rep_good nopl extd_apicid amd_dcm eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext retpoline_amd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips : 5190.24
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:


egmkang commented Apr 17, 2020

And these are the flame graphs:

[flame graph screenshots: flamegraph_Intel_WKS_2500, flamegraph_Intel_SVR_3600, flamegraph_AMD_WKS_4000, flamegraph_AMD_SVR_4000]


egmkang commented Apr 17, 2020

Here are all the flame graph SVG files:
all_flamegraph.zip


egmkang commented Apr 17, 2020

On the Intel cloud VM, only a few players can be supported online. On the AMD VM, WKS can support 4000+ players (only the frame rate is unstable), and SVR works as well as Windows does.

janvorli (Member) commented:

For the WKS GC, more locking is expected compared to SVR. But I'll defer further help here to the dotnet perf folks.

@dotnet/dotnet-perf, could someone please help @egmkang get to the bottom of this?

@Maoni0 - this may be of interest to you too - there seems to be a substantial difference in the time spent on locks for the GC between AMD and Intel CPUs running the same app, and also a substantial delta between the WKS / SVR differences on AMD and Intel.
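
(Aside, for anyone reproducing the WKS vs SVR comparison: the GC mode is selected with a standard project-file property. A minimal snippet, shown only for reference:)

    <!-- .csproj: opt the app into Server GC (omit or set to false for Workstation GC) -->
    <PropertyGroup>
      <ServerGarbageCollection>true</ServerGarbageCollection>
    </PropertyGroup>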

brianrob (Member) commented:

Interesting that this only happens on Intel CPUs. Unfortunately, it looks like the stack is broken, so we can't tell from the flame graph what's actually calling __pv_queued_spin_lock_slowpath. perf uses frame pointers to walk the stack, so the likely issue is that some library is compiled with -fomit-frame-pointer. There are a couple of options here:

  1. Try recompiling the possible set of libraries with -fno-omit-frame-pointer. There's a good chance the culprit is libc.
  2. Given the high percentage of time spent in this function, you can probably capture some stacks quite easily using a debugger.

Either of these options is reasonable to try, but I suspect you will get faster results with option 2. You'll just want to make sure that you're using LLDB so that you can load the SOS plugin and resolve both managed and native frames (a rough sketch follows below).

Here are some instructions on how to install SOS: https://github.com/dotnet/diagnostics/blob/master/documentation/installing-sos-instructions.md
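
(A rough sketch of what option 2 could look like in practice; the pid is a placeholder, and the exact SOS setup is in the link above:)

    dotnet tool install -g dotnet-sos   # one-time: install the SOS installer tool
    dotnet-sos install                  # hook SOS into lldb for this user
    lldb -p <pid>                       # attach while the process is busy in the lock
    (lldb) bt all                       # native stacks for all threads
    (lldb) clrstack -all                # SOS: managed stacks for all threads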


egmkang commented Apr 18, 2020

The 32-core Intel VM has been recycled. Another 8-core Intel 8255C VM looks normal. I tested several GameServer versions (spanning a month), and they are all normal too.

egmkang commented May 17, 2020

I reverted the code to the 2020-04-08 version, which had a great many New/NewArray operations:
[flame graph screenshot: Intel_WKS_4000]
And today's code, which reduces New/NewArray a lot:
[flame graph screenshot: Intel_SVR_4000]

All tests ran on one cloud VM with an Intel CPU.

And I found:

  1. .NET Core does not have escape analysis.
    C# abstractions such as Action, Func, and LINQ are not zero-cost; when the overhead is an object allocation, it is too expensive. The WKS GC amplifies this overhead. Server GC is better, but the key point is to reduce object allocation.
    could JIT can inline closure Action? #34634
  2. ThreadLocal is great.
    I used ThreadLocal to reuse collection containers and eliminate many allocations from LINQ and logic code. The code is very ugly, but useful (see the sketch after this list).
  3. Managed languages are not well suited to IO programming.
    IO usually involves high-frequency small-object allocation, and it is hard to write a high-performance, memory-safe IO program.
    DotNetty / ASP.NET Core Bedrock look good, but languages with RAII may be more advantageous.
    Still, .NET Core is suitable for writing computationally intensive code: if you don't allocate objects, anyone can write high-performance code. C++/Rust are hard to learn; C# is easy.
  4. MongoDB is very bad.
  5. Visual Studio 2017 Community's .NET Object Allocation Tracking is very good; I used it to find the real problem.
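
(A minimal sketch of point 2, with made-up names rather than the actual server code: a per-thread scratch list that is cleared and reused instead of allocated per message.)

    using System.Collections.Generic;
    using System.Threading;

    static class ScratchBuffers
    {
        // One List<int> per thread, created lazily on first use and then
        // reused, so the hot path never allocates a temporary list.
        private static readonly ThreadLocal<List<int>> _scratch =
            new ThreadLocal<List<int>>(() => new List<int>(capacity: 256));

        public static List<int> Rent()
        {
            List<int> list = _scratch.Value;
            list.Clear();   // keeps the backing array, drops the old items
            return list;
        }
    }

    // usage on a hot path: after the first call per thread, no allocation
    //   var tmp = ScratchBuffers.Rent();
    //   tmp.Add(42);

The trade-off is the ugliness mentioned above: the rented list must not escape the current unit of work, or two messages handled on the same thread will stomp on each other.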

reflectronic (Contributor) commented:

Managed language is not suitable for IO programming

Are you sure about that?

john-h-k (Contributor) commented:

Managed language is not suitable for IO programming
IO usually contains high frequency small object allocation

use struct
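
(A minimal sketch of that suggestion, using a hypothetical message type that is not from this issue: modeling small, short-lived messages as structs keeps them off the GC heap entirely.)

    using System;

    // A small message as a readonly struct: passing it copies a few bytes
    // on the stack instead of allocating an object for the GC to track.
    readonly struct InputMessage
    {
        public readonly int PlayerId;
        public readonly long Timestamp;
        public readonly short OpCode;

        public InputMessage(int playerId, long timestamp, short opCode)
        {
            PlayerId = playerId;
            Timestamp = timestamp;
            OpCode = opCode;
        }
    }

    class Demo
    {
        static void Handle(in InputMessage msg)   // 'in' passes by readonly reference
        {
            Console.WriteLine($"player {msg.PlayerId} op {msg.OpCode}");
        }

        static void Main()
        {
            Handle(new InputMessage(playerId: 1, timestamp: 0, opCode: 7));   // no heap allocation
        }
    }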

kouvel removed this from the 5.0.0 milestone Jul 22, 2020
kouvel modified the milestones: 6.0.0, Future Jul 22, 2020