
[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #52312

Closed
DrewScoggins opened this issue May 5, 2021 · 13 comments
Labels: arch-x64, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), os-windows, tenet-performance (Performance related issue), tenet-performance-benchmarks (Issue from performance benchmark)
Milestone: 6.0.0

Comments

@DrewScoggins (Member)

Run Information

Architecture: x64
OS: Windows 10.0.18362
Baseline: ddaa1c3c503f989db48102f50a693765e4e6ae2a
Compare: 8c3180d932146b1a76600601c03082c6a81e1a53

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

| Benchmark | Baseline | Test | Test/Base | Test Quality |
| --- | --- | --- | --- | --- |
| FirstSpanTenSegments - Duration of single invocation | 21.75 ns | 22.89 ns | 1.05 | 0.00 |
| FirstSpanArray - Duration of single invocation | 10.22 ns | 14.64 ns | 1.43 | 0.01 |
| FirstSpanMemory - Duration of single invocation | 11.59 ns | 16.02 ns | 1.38 | 0.01 |

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Buffers.Tests.ReadOnlySequenceTests<Char>*'

Payloads

Baseline
Compare

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments


System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray


System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels May 5, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Threading untriaged New issue has not been triaged by the area owner labels May 5, 2021
@ghost commented Jul 6, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.


@GrabYourPitchforks (Member)

I checked the history for https://github.com/dotnet/runtime/commits/main/src/libraries/System.Memory and don't see any changes between the baseline & compare commits.

Could this be caused by #51593? ReadOnlySequence<T> goes a bit wild with inlining, so anything that affects how inlining works may have an outsized impact on this set of APIs. Moving to the codegen area for additional investigation.

@GrabYourPitchforks GrabYourPitchforks added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Buffers labels Jul 6, 2021
@kunalspathak (Member)

It might be worth checking if #51190 affects any of the generated code for these benchmarks.

@GrabYourPitchforks (Member)

#51190 would definitely affect codegen, but it was about removing movsx instructions in favor of zero-latency mov instructions (or even being able to elide them entirely). I'll double-check the codegen to make sure nothing strange got introduced.

@DrewScoggins (Member, Author)

All of these tests improved with this change (#52708); you can see this by clicking the links in the table above. I am going to go ahead and close this since we are no longer seeing the regression.

@kunalspathak (Member)

#52708 did improve many benchmarks, but that does not mean it fixed the problem that caused the regression in this issue. We should understand what caused the regression in the first place, because it might still be present and merely covered up by the inlining changes.


@kunalspathak kunalspathak reopened this Jul 7, 2021
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 7, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Jul 7, 2021
@kunalspathak (Member)

@GrabYourPitchforks - I am assigning this to you for now. Once you have data and you think it is a codegen issue, feel free to assign back to me.

@GrabYourPitchforks (Member)

Using a baseline of 1983573 and a compare of de591a8, there's something strange going on with the method prolog for this function.

I changed that function to vary the number of consume += sequence.First.Length; statements which appear. (The original benchmark has 16 such statements.) The number of stack bytes carved out in the method prolog (via sub rsp, <xxx>) is as follows.
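For context, here is a rough sketch of the shape of the method being varied (a hypothetical reconstruction, not the exact dotnet/performance source; the class and method declarations are illustrative):

```csharp
using System.Buffers;

public class ReadOnlySequenceTests<T>
{
    // Hypothetical reconstruction: the body is N copies of the same statement,
    // and each statement inlines ReadOnlySequence<T>'s first-segment accessor.
    public int FirstSpan(ReadOnlySequence<T> sequence)
    {
        int consume = 0;
        consume += sequence.First.Length;   // statement 1
        consume += sequence.First.Length;   // statement 2
        consume += sequence.First.Length;   // statement 3
        // ... the original benchmark repeats this 16 times; the table below
        // varies the count from 1 to 7.
        return consume;
    }
}
```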

| Statement count | Baseline | Compare |
| --- | --- | --- |
| 1 | 0x40 | 0x40 |
| 2 | 0x68 | 0x68 |
| 3 | 0x88 | 0x88 |
| 4 | 0xB8 | 0xB8 |
| 5 | 0xD8 | 0xD8 |
| 6 | 0xF8 | 0xF8 |
| 7 | 0x118 | 0xF8 |

A few interesting items here. First, there's a jump of 0x30 instead of 0x20 between statement counts 3 and 4. Second, I think there's an opportunity for the JIT to reuse temp locals between invocations of the inlined method ReadOnlySequence<T>.GetFirstSpan(), which would further reduce the required stack space across the board. Finally, at statement count = 7, there's a weird discrepancy between the baseline and compare.

;; baseline
00007ff8`73c00230 4157            push    r15
00007ff8`73c00232 4156            push    r14
00007ff8`73c00234 4155            push    r13
00007ff8`73c00236 4154            push    r12
00007ff8`73c00238 57              push    rdi
00007ff8`73c00239 56              push    rsi
00007ff8`73c0023a 55              push    rbp
00007ff8`73c0023b 53              push    rbx
00007ff8`73c0023c 4881ec18010000  sub     rsp,118h
00007ff8`73c00243 33c0            xor     eax,eax
00007ff8`73c00245 4889442428      mov     qword ptr [rsp+28h],rax
00007ff8`73c0024a c5d857e4        vxorps  xmm4,xmm4,xmm4
00007ff8`73c0024e c5f97f642430    vmovdqa xmmword ptr [rsp+30h],xmm4
00007ff8`73c00254 c5f97f642440    vmovdqa xmmword ptr [rsp+40h],xmm4
00007ff8`73c0025a 48b840ffffffffffffff mov rax,0FFFFFFFFFFFFFF40h
00007ff8`73c00264 c5f97fa40410010000 vmovdqa xmmword ptr [rsp+rax+110h],xmm4
00007ff8`73c0026d c5f97fa40420010000 vmovdqa xmmword ptr [rsp+rax+120h],xmm4
00007ff8`73c00276 c5f97fa40430010000 vmovdqa xmmword ptr [rsp+rax+130h],xmm4
00007ff8`73c0027f 4883c030        add     rax,30h
00007ff8`73c00283 75df            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4424 (00007ff8`73c00264)
00007ff8`73c00285 4889842410010000 mov     qword ptr [rsp+110h],rax
00007ff8`73c0028d 488bf2          mov     rsi,rdx
00007ff8`73c00290 4c8b06          mov     r8,qword ptr [rsi]
00007ff8`73c00293 4d85c0          test    r8,r8
00007ff8`73c00296 7519            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4471 (00007ff8`73c002b1)

;; compare
00007ff8`73e10220 4157            push    r15
00007ff8`73e10222 4156            push    r14
00007ff8`73e10224 4155            push    r13
00007ff8`73e10226 4154            push    r12
00007ff8`73e10228 57              push    rdi
00007ff8`73e10229 56              push    rsi
00007ff8`73e1022a 55              push    rbp
00007ff8`73e1022b 53              push    rbx
00007ff8`73e1022c 4881ecf8000000  sub     rsp,0F8h
00007ff8`73e10233 33c0            xor     eax,eax
00007ff8`73e10235 4889442428      mov     qword ptr [rsp+28h],rax
00007ff8`73e1023a c5d857e4        vxorps  xmm4,xmm4,xmm4
00007ff8`73e1023e 48b840ffffffffffffff mov rax,0FFFFFFFFFFFFFF40h
00007ff8`73e10248 c5f97fa404f0000000 vmovdqa xmmword ptr [rsp+rax+0F0h],xmm4
00007ff8`73e10251 c5f97fa40400010000 vmovdqa xmmword ptr [rsp+rax+100h],xmm4
00007ff8`73e1025a c5f97fa40410010000 vmovdqa xmmword ptr [rsp+rax+110h],xmm4
00007ff8`73e10263 4883c030        add     rax,30h
00007ff8`73e10267 75df            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4408 (00007ff8`73e10248)
00007ff8`73e10269 48898424f0000000 mov     qword ptr [rsp+0F0h],rax
00007ff8`73e10271 488bf2          mov     rsi,rdx
00007ff8`73e10274 4c8b06          mov     r8,qword ptr [rsi]
00007ff8`73e10277 4d85c0          test    r8,r8
00007ff8`73e1027a 7519            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4455 (00007ff8`73e10295)

Up until statement count = 7, the codegen looks fine for both the baseline and the compare. The only difference between the two methods is that some movsxd instructions in the baseline were converted into mov instructions in the compare, saving 1 byte of codegen per conversion.

Kunal, punting this back your way because I don't know how to investigate this further.

@kunalspathak (Member)

I cherry-picked each commit in 508e560...466deef and compared the generated disassembly. It turns out that there are a lot of regressions in FirstSpan from @GrabYourPitchforks's de591a8 changes. The diffs are here. I will investigate further.

@GrabYourPitchforks (Member)

@kunalspathak Right, this matches what I saw as well. The codegen for ReadOnlySequence<T>.GetFirstSpan() itself is nearly identical, modulo some movsx / mov differences. (This is expected since that was the entire point of the commit.) That method is being inlined into its caller in the benchmark class, which executes it over and over again. Things still look fine until the method is inlined 7 or more times, at which point something goes amiss and the codegen diverges considerably. That part I don't know how to debug.

@kunalspathak (Member)

Since most of the Unsafe.Add(byref, int) calls were converted to Unsafe.Add(byref, nint), we no longer do the constant folding that was possible before.
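As a hedged illustration of the overload change in question (shapes only; these are not the actual CoreLib call sites touched by #51190):

```csharp
using System.Runtime.CompilerServices;

static class OffsetShapes
{
    static char ReadSecondChar(ref char start)
    {
        // Old shape: int offset. The widening to native int happens inside
        // Unsafe.Add and shows up as a conv.i after inlining, but the JIT
        // could constant-fold these small offsets freely.
        char viaInt = Unsafe.Add(ref start, 1);

        // New shape: nint offset. This avoids the conv.i in the hot path,
        // but per the analysis above it also defeated some constant folding
        // and introduced extra temps once inlined.
        char viaNint = Unsafe.Add(ref start, (nint)1);

        return viaInt == viaNint ? viaInt : '\0';  // both read the same element
    }
}
```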


For such variables, when we inline the methods, we start introducing extra temps (18 extra, to be precise) for the variable assignments.


Because of these, we hit the limit on tracked local variables and stop promoting the structs past a certain point, causing the inefficient codegen.


The regression is amplified because there are 16 calls to sequence.FirstSpan.Length, which is not something a real-world app would do. I also verified the benchmark's disasm and dumps on the latest main, and these problems are no longer present because we do not create as many locals. My guess is that, thanks to Egor's inlining change, we do a better job of inlining and eliminating redundant variables (or other optimizations).

To summarize:

  • We definitely create more temps because of "Avoid conv.i opcodes in hot paths in CoreLib" (#51190), but it is not clear how much this would impact real-world code.
  • There is a known limit on how many locals can be tracked, after which the JIT gives up. @dotnet/jit-contrib - is there any plan to increase the limit?

I will close this issue tomorrow unless anyone else has any other comments.

@BruceForstall (Member)

The default tracked variable limit is 1024, and it is configurable with COMPlus_JitMaxLocalsToTrack. Extensive measurements were made when it was upped to this (from something much smaller -- 64, I think); beyond this there was little CQ benefit but a growing throughput cost.

@EgorBo (Member) commented Jul 13, 2021

> The default tracked variable limit is 1024, and it is configurable with COMPlus_JitMaxLocalsToTrack. Extensive measurements were made when it was upped to this (from something much smaller -- 64, I think); beyond this there was little CQ benefit but a growing throughput cost.

I incremented it in #52708.
It's not difficult to hit the limit of 512 (e.g. on sharplab.io); even some TE benchmark hit it, but I don't remember which one exactly - I can check if it's important.
Also, I've noticed this issue several times in SIMD-related code written by the community - e.g. lots of Unsafe.X here and there and lots of large methods marked AggressiveInlining led to those terrible spills.
I didn't measure the TP impact of this change by itself, but I did measure it for the whole PR (mainly via "startup" and "time-to-first-request" metrics), and it was mostly tolerable.

Btw, setting the limit to 1024 decreases the size of prejitted CoreLib by 20kb (I can share the diffs)
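As an illustration of that pattern (a hypothetical sketch; the type and method names are made up, and real SIMD code would use vector types), a handful of AggressiveInlining helpers built on Unsafe can multiply a caller's local count quickly once they are all inlined:

```csharp
using System.Runtime.CompilerServices;

static class UnsafeHeavyHelpers
{
    // Hypothetical helper (not from any real library): aggressively inlined,
    // so every offset and intermediate value becomes another local in the
    // caller once the helper is inlined there.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int SumFour(ref int start)
    {
        int a = Unsafe.Add(ref start, 0);
        int b = Unsafe.Add(ref start, 1);
        int c = Unsafe.Add(ref start, 2);
        int d = Unsafe.Add(ref start, 3);
        return a + b + c + d;
    }

    // Each call site below inlines SumFour and brings its temps along; stack
    // up enough call sites like this and the tracked-locals budget (512
    // before #52708, 1024 after, per the discussion above) runs out, leaving
    // the remaining locals untracked and prone to spilling.
    public static int SumSixteen(ref int start)
    {
        return SumFour(ref start)
             + SumFour(ref Unsafe.Add(ref start, 4))
             + SumFour(ref Unsafe.Add(ref start, 8))
             + SumFour(ref Unsafe.Add(ref start, 12));
    }
}
```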

@ghost ghost locked as resolved and limited conversation to collaborators Aug 12, 2021