[RyuJIT/ARM32] low performance compared to amd64 investigation, data memory barrier usage. #13482

Open
viewizard opened this issue Sep 25, 2019 · 0 comments
Labels: arch-arm32, area-CodeGen-coreclr, JitUntriaged, tenet-performance

Initial thread was started by @alpencolt #12361
Initial performance test results generated by @alpencolt
https://gist.github.com/alpencolt/0580af0be86e49bb9d89508dabcd8615

During the arm32 performance investigation, we found that one source of performance degradation is data memory barrier (dmb) usage. Note that on arm32 the barrier is emitted for accesses to volatile variables, and it is also present in the atomic memory access functions.
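To make the volatile case concrete, here is a minimal C11 sketch, not the JIT's actual code. It assumes that a .NET volatile read behaves like a load-acquire and a volatile write like a store-release. The names `shared_flag`, `volatile_read_sketch`, `volatile_write_sketch` and `volatile_round_trip_check` are hypothetical. When built for armv7, compilers such as GCC typically lower these two operations to a plain `ldr` followed by `dmb ish` and to `dmb ish` followed by a plain `str`, which is where the barrier cost comes from.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared variable standing in for a C# volatile field. */
static atomic_int shared_flag;

/* Assumed equivalent of a volatile read: load with acquire ordering.
   On armv7 this is usually emitted as "ldr" followed by "dmb ish". */
int volatile_read_sketch(void)
{
    return atomic_load_explicit(&shared_flag, memory_order_acquire);
}

/* Assumed equivalent of a volatile write: store with release ordering.
   On armv7 this is usually emitted as "dmb ish" followed by "str". */
void volatile_write_sketch(int v)
{
    atomic_store_explicit(&shared_flag, v, memory_order_release);
}

/* Small single-threaded self-check: a written value is read back. */
void volatile_round_trip_check(void)
{
    volatile_write_sketch(42);
    assert(volatile_read_sketch() == 42);
}
```

On x86-64 the same two functions compile to ordinary `mov` instructions, because that architecture's memory model already provides acquire and release ordering for plain loads and stores.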

For example, __sync_val_compare_and_swap(value, comp_val, new_val) implementation for armv7 looks like:

  sub sp, sp, #8
  movs r3, #1
  add r1, sp, #4
  str r3, [sp, #4]
  movs r3, #0
  dmb ish
.L2:
  ldrex r2, [r1]
  cmp r2, #5
  bne .L3
  strex r0, r3, [r1]
  cmp r0, #0
  bne .L2
.L3:
  dmb ish
  add sp, sp, #8
  bx lr

At the same time, for amd64 we have:

  mov QWORD PTR [rsp-8], 1
  xor edx, edx
  mov eax, 5
  lock cmpxchg QWORD PTR [rsp-8], rdx
  ret
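For reference, both listings are consistent with a C function along the following lines. This is a reconstruction from the constants visible in the assembly (initial value 1, comparand 5, new value 0), not the exact source that was compiled. The function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical reconstruction of the compiled function.
   __sync_val_compare_and_swap is a GCC/Clang builtin that atomically
   stores new_val into *ptr if *ptr == comp_val, and returns the value
   that *ptr held before the operation. It implies a full barrier,
   which on armv7 shows up as "dmb ish" before and after the
   ldrex/strex loop. */
long cas_listing_sketch(void)
{
    long value = 1;
    /* 1 != 5, so no store happens and the old value (1) is returned. */
    return __sync_val_compare_and_swap(&value, 5L, 0L);
}

/* Companion case where the comparison succeeds and the swap is done. */
long cas_success_sketch(void)
{
    long value = 5;
    long old = __sync_val_compare_and_swap(&value, 5L, 0L);
    assert(value == 0); /* the new value was stored */
    return old;         /* the previous value, 5 */
}
```

On amd64 the whole operation is a single `lock cmpxchg`, while on armv7 it needs the retry loop plus two barriers, so every atomic operation pays for two `dmb ish` instructions.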

We also compared the results of test runs with and without the COMPlus_JitNoMemoryBarriers flag set.
For example:
https://github.com/dotnet/performance/tree/master/src/benchmarks/micro/corefx/System.Collections/Concurrent

System.Collections.Concurrent.Count

Results running with COMPlus_JitNoMemoryBarriers = "":

[2019/08/08 11:33:37][INFO] | Method | Size |     Mean |     Error |    StdDev |   Median |      Min |      Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/08/08 11:33:37][INFO] |------- |----- |---------:|----------:|----------:|---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
[2019/08/08 11:33:37][INFO] |  Stack |  512 | 1.103 us | 0.0014 us | 0.0012 us | 1.103 us | 1.102 us | 1.106 us |           - |           - |           - |                   - |

Results running with COMPlus_JitNoMemoryBarriers = 1:

[2019/08/08 11:49:19][INFO] | Method | Size |     Mean |     Error |    StdDev |   Median |      Min |      Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/08/08 11:49:19][INFO] |------- |----- |---------:|----------:|----------:|---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
[2019/08/08 11:49:19][INFO] |  Stack |  512 | 1.047 us | 0.0005 us | 0.0005 us | 1.046 us | 1.046 us | 1.047 us |           - |           - |           - |                   - |

System.Collections.Concurrent.IsEmpty

Results running with COMPlus_JitNoMemoryBarriers = "":

[2019/08/08 12:01:06][INFO] | Method | Size |     Mean |     Error |    StdDev |   Median |      Min |      Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/08/08 12:01:06][INFO] |------- |----- |---------:|----------:|----------:|---------:|---------:|---------:|------------:|------------:|------------:|--------------------:|
[2019/08/08 12:01:06][INFO] |  Stack |    0 | 62.26 ns | 0.8814 ns | 0.7360 ns | 61.98 ns | 61.78 ns | 63.90 ns |           - |           - |           - |                   - |
[2019/08/08 12:01:06][INFO] |  Stack |  512 | 67.73 ns | 4.5348 ns | 5.2223 ns | 65.76 ns | 63.02 ns | 76.57 ns |           - |           - |           - |                   - |

Results running with COMPlus_JitNoMemoryBarriers = 1:

[2019/08/08 12:08:37][INFO] | Method | Size |      Mean |     Error |    StdDev |    Median |       Min |      Max | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
[2019/08/08 12:08:37][INFO] |------- |----- |----------:|----------:|----------:|----------:|----------:|---------:|------------:|------------:|------------:|--------------------:|
[2019/08/08 12:08:37][INFO] |  Stack |    0 | 0.9811 ns | 0.0621 ns | 0.0581 ns | 0.9774 ns | 0.8880 ns | 1.080 ns |           - |           - |           - |                   - |
[2019/08/08 12:08:37][INFO] |  Stack |  512 | 0.9913 ns | 0.0864 ns | 0.0809 ns | 0.9951 ns | 0.8675 ns | 1.124 ns |           - |           - |           - |                   - |
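To put the tables above in relative terms, the following small sketch computes the ratio of the mean times with barriers to the mean times without them. The helper names `speedup` and `print_speedups` are hypothetical. The input numbers are copied from the Mean columns above.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of the mean time with barriers to the mean time without them. */
double speedup(double with_barriers, double without_barriers)
{
    assert(without_barriers > 0.0);
    return with_barriers / without_barriers;
}

/* Prints the ratios for the mean values reported in the tables. */
void print_speedups(void)
{
    /* Count, Size 512: 1.103 us vs 1.047 us */
    printf("Count   (512): %.2fx\n", speedup(1.103, 1.047));
    /* IsEmpty, Size 0: 62.26 ns vs 0.9811 ns */
    printf("IsEmpty (0):   %.2fx\n", speedup(62.26, 0.9811));
    /* IsEmpty, Size 512: 67.73 ns vs 0.9913 ns */
    printf("IsEmpty (512): %.2fx\n", speedup(67.73, 0.9913));
}
```

The Count benchmark changes by roughly 5%, while IsEmpty differs by a factor of more than 60. This suggests the barrier dominates the cost of very short operations.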

category:cq
theme:barriers
skill-level:expert
cost:medium

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020