[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #52312

DrewScoggins · 2021-05-05T17:22:34Z

Run Information

Architecture	x64
OS	Windows 10.0.18362
Baseline	ddaa1c3c503f989db48102f50a693765e4e6ae2a
Compare	8c3180d932146b1a76600601c03082c6a81e1a53
Diff	Diff

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

Benchmark	Baseline	Test	Test/Base	Test Quality
FirstSpanTenSegments - Duration of single invocation	21.75 ns	22.89 ns	1.05	0.00
FirstSpanArray - Duration of single invocation	10.22 ns	14.64 ns	1.43	0.01
FirstSpanMemory - Duration of single invocation	11.59 ns	16.02 ns	1.38	0.01

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Buffers.Tests.ReadOnlySequenceTests&lt;Char&gt;*'

Payloads

Baseline
Compare

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

The text was updated successfully, but these errors were encountered:

ghost · 2021-07-06T05:47:32Z

Tagging subscribers to this area: @GrabYourPitchforks, @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture	x64
OS	Windows 10.0.18362
Baseline	ddaa1c3c503f989db48102f50a693765e4e6ae2a
Compare	8c3180d932146b1a76600601c03082c6a81e1a53
Diff	Diff

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

Benchmark	Baseline	Test	Test/Base	Test Quality
FirstSpanTenSegments - Duration of single invocation	21.75 ns	22.89 ns	1.05	0.00
FirstSpanArray - Duration of single invocation	10.22 ns	14.64 ns	1.43	0.01
FirstSpanMemory - Duration of single invocation	11.59 ns	16.02 ns	1.38	0.01

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Buffers.Tests.ReadOnlySequenceTests&lt;Char&gt;*'

Payloads

Baseline
Compare

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author:	DrewScoggins
Assignees:	-
Labels:	`arch-x64`, `area-System.Buffers`, `area-System.Threading`, `os-windows`, `tenet-performance`, `tenet-performance-benchmarks`, `untriaged`
Milestone:	-

GrabYourPitchforks · 2021-07-06T17:26:24Z

I checked the history for https://github.com/dotnet/runtime/commits/main/src/libraries/System.Memory and don't see any changes between the baseline & compare commits.

Could this be caused by #51593? ReadOnlySequence<T> goes a bit wild with inlining, so anything that affects how inlining works may have outsized impact on this set of APIs. Moving to codegen area for additional investigation.

kunalspathak · 2021-07-06T18:54:18Z

It might be worth checking if #51190 affects any of the generated code for these benchmarks.

GrabYourPitchforks · 2021-07-06T19:05:25Z

#51190 would definitely affect codegen, but it was about removing movsx instructions in favor of zero-latency mov instructions (or even being able to elide them entirely). I'll double-check the codegen to make sure nothing strange got introduced.

DrewScoggins · 2021-07-07T00:17:47Z

All of these tests have improved with this change, #52708. You can look at the tests by clicking on the links in the table, and see. But I am going to go ahead and close this since we are no longer seeing the regression.

kunalspathak · 2021-07-07T01:59:43Z

#52708 did improve many benchmarks and that does not mean it fixed the problem that caused the regression in this issue. We should understand what caused the regression in first place because it might be still present, but just that it got covered up by the inlining changes.

kunalspathak · 2021-07-07T21:12:28Z

@GrabYourPitchforks - I am assigning this to you for now. Once you have data and you think it is a codegen issue, feel free to assign back to me.

GrabYourPitchforks · 2021-07-07T22:46:54Z

Using a baseline of 1983573 and a compare of de591a8, there's something strange going on with the method prolog for this function.

I changed that function to vary the number of consume += sequence.First.Length; statements which appear. (The original benchmark has 16 such statements.) The number of stack bytes carved out in the method prolog (via sub rsp, <xxx>) is as follows.

Statement count	baseline	compare
1	`0x40`	`0x40`
2	`0x68`	`0x68`
3	`0x88`	`0x88`
4	`0xB8`	`0xB8`
5	`0xD8`	`0xD8`
6	`0xF8`	`0xF8`
7	`0x118`	`0xF8`

A few interesting items here. First, there's a jump of 0x30 instead of 0x20 between statement counts 3 & 4. Second, I think there's opportunity for the JIT to reuse temp locals between invocations of the inlined method ReadOnlySequence<T>.GetFirstSpan(), which would further reduce the required stack space across the board. Finally, at statement count = 7, there's a weird discrepancy between the baseline and compare.

;; baseline
00007ff8`73c00230 4157            push    r15
00007ff8`73c00232 4156            push    r14
00007ff8`73c00234 4155            push    r13
00007ff8`73c00236 4154            push    r12
00007ff8`73c00238 57              push    rdi
00007ff8`73c00239 56              push    rsi
00007ff8`73c0023a 55              push    rbp
00007ff8`73c0023b 53              push    rbx
00007ff8`73c0023c 4881ec18010000  sub     rsp,118h
00007ff8`73c00243 33c0            xor     eax,eax
00007ff8`73c00245 4889442428      mov     qword ptr [rsp+28h],rax
00007ff8`73c0024a c5d857e4        vxorps  xmm4,xmm4,xmm4
00007ff8`73c0024e c5f97f642430    vmovdqa xmmword ptr [rsp+30h],xmm4
00007ff8`73c00254 c5f97f642440    vmovdqa xmmword ptr [rsp+40h],xmm4
00007ff8`73c0025a 48b840ffffffffffffff mov rax,0FFFFFFFFFFFFFF40h
00007ff8`73c00264 c5f97fa40410010000 vmovdqa xmmword ptr [rsp+rax+110h],xmm4
00007ff8`73c0026d c5f97fa40420010000 vmovdqa xmmword ptr [rsp+rax+120h],xmm4
00007ff8`73c00276 c5f97fa40430010000 vmovdqa xmmword ptr [rsp+rax+130h],xmm4
00007ff8`73c0027f 4883c030        add     rax,30h
00007ff8`73c00283 75df            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4424 (00007ff8`73c00264)
00007ff8`73c00285 4889842410010000 mov     qword ptr [rsp+110h],rax
00007ff8`73c0028d 488bf2          mov     rsi,rdx
00007ff8`73c00290 4c8b06          mov     r8,qword ptr [rsi]
00007ff8`73c00293 4d85c0          test    r8,r8
00007ff8`73c00296 7519            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4471 (00007ff8`73c002b1)

;; compare
00007ff8`73e10220 4157            push    r15
00007ff8`73e10222 4156            push    r14
00007ff8`73e10224 4155            push    r13
00007ff8`73e10226 4154            push    r12
00007ff8`73e10228 57              push    rdi
00007ff8`73e10229 56              push    rsi
00007ff8`73e1022a 55              push    rbp
00007ff8`73e1022b 53              push    rbx
00007ff8`73e1022c 4881ecf8000000  sub     rsp,0F8h
00007ff8`73e10233 33c0            xor     eax,eax
00007ff8`73e10235 4889442428      mov     qword ptr [rsp+28h],rax
00007ff8`73e1023a c5d857e4        vxorps  xmm4,xmm4,xmm4
00007ff8`73e1023e 48b840ffffffffffffff mov rax,0FFFFFFFFFFFFFF40h
00007ff8`73e10248 c5f97fa404f0000000 vmovdqa xmmword ptr [rsp+rax+0F0h],xmm4
00007ff8`73e10251 c5f97fa40400010000 vmovdqa xmmword ptr [rsp+rax+100h],xmm4
00007ff8`73e1025a c5f97fa40410010000 vmovdqa xmmword ptr [rsp+rax+110h],xmm4
00007ff8`73e10263 4883c030        add     rax,30h
00007ff8`73e10267 75df            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4408 (00007ff8`73e10248)
00007ff8`73e10269 48898424f0000000 mov     qword ptr [rsp+0F0h],rax
00007ff8`73e10271 488bf2          mov     rsi,rdx
00007ff8`73e10274 4c8b06          mov     r8,qword ptr [rsi]
00007ff8`73e10277 4d85c0          test    r8,r8
00007ff8`73e1027a 7519            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.ReadOnlySequenceTests`1[[System.Char, System.Private.CoreLib]].FirstSpan(System.Buffers.ReadOnlySequence`1<Char>)+0x4455 (00007ff8`73e10295)

Up until sequence count = 7, the codegen looks fine for both the baseline and the compare. The only difference between the two methods is that some movsxd statements in the baseline got converted into mov statements in the compare, saving 1 byte of codegen each time such a conversion was replaced.

Kunul, punting this back your way because I don't know how to investigate this further.

kunalspathak · 2021-07-09T22:09:08Z

I cherry-picked each commit in 508e560...466deef and compared the disassembly generated. It turns out that there are lot of regressions in FirstSpan from @GrabYourPitchforks 's de591a8 changes. The diffs are here. I will further investigate.

GrabYourPitchforks · 2021-07-12T18:18:06Z

@kunalspathak Right, this matches what I saw as well. The codegen for ReadOnlySequence<T>.GetFirstSpan() itself is nearly identical, modulo some movsx / mov differences. (This is expected since that was the entire point of the commit.) That method is being inlined into its caller in the benchmark class, which executes it over and over again. Things still look fine until the method is inlined 7 or more times, at which point something goes amiss and the codegen diverges considerably. That part I don't know how to debug.

kunalspathak · 2021-07-13T07:44:26Z

Since most of the Unsafe.Add(byref,int) were converted to Unsafe.Add(byref,nint), we did not do constant folding which was earlier possible.

For such variables, when we inlined the methods, we started introducing extra temps (18 extras, to be precise) as seen in the variable assignments screenshot below:

Because of these, we hit the local variable limit and stop promoting the structs after one point causing the inefficient code gen.

The regression is amplified because there are 16 calls to sequence.FirstSpan.Length which is not something that a real-world app would do. I also verified the benchmark's disasm and dumps on latest main and they are no longer present because we do not create as many locals - My guess is because of the Egor's inline change, we do better job in inlining/eliminating redundant variables or other optimizations.

To summarize:

We definitely create more temps because of Avoid conv.i opcodes in hot paths in CoreLib #51190 but not sure how much it would impact real world.
There is a known limit on how many locals can be present after which JIT gives up. @dotnet/jit-contrib - Is there any plan to increase the limit?

I will close this issue tomorrow unless anyone else have any other comments.

BruceForstall · 2021-07-13T18:43:37Z

The default tracked variable limit is 1024. It turns out this is configurable with COMPlus_JitMaxLocalsToTrack. Extensive measurements were made when it was upped to this (from something much smaller -- 64 I think), and larger than this there was little CQ benefit but growing throughput cost.

EgorBo · 2021-07-13T19:09:42Z

The default tracked variable limit is 1024. It turns out this is configurable with COMPlus_JitMaxLocalsToTrack. Extensive measurements were made when it was upped to this (from something much smaller -- 64 I think), and larger than this there was little CQ benefit but growing throughput cost.

I incremented it in #52708
It's not difficult to hit the limit of 512, e.g. sharplab.io, even some TE benchmark hit that, but I don't remember which exactly - I can check if it's important.
Also, I noticed this issue several times in SIMD-related code written by community - e.g. lots of Unsafe.X here and there, lots of large methods with AggressiveInlining on them led to those terrible spills.
I didn't measure TP impact for this change, but I did measure it for the whole PR (mainly via "startup" and "time-to-first request" metrics and it was mostly tolerable).

Btw, setting the limit to 1024 decreases the size of prejitted CoreLib by 20kb (I can share the diffs)

DrewScoggins added os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels May 5, 2021

dotnet-issue-labeler bot added area-System.Threading untriaged New issue has not been triaged by the area owner labels May 5, 2021

mangod9 added the area-System.Buffers label Jul 6, 2021

mangod9 removed the area-System.Threading label Jul 6, 2021

GrabYourPitchforks added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Buffers labels Jul 6, 2021

DrewScoggins closed this as completed Jul 7, 2021

kunalspathak reopened this Jul 7, 2021

JulieLeeMSFT assigned kunalspathak Jul 7, 2021

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 7, 2021

JulieLeeMSFT added this to the 6.0.0 milestone Jul 7, 2021

kunalspathak assigned GrabYourPitchforks and unassigned kunalspathak Jul 7, 2021

GrabYourPitchforks assigned kunalspathak and unassigned GrabYourPitchforks Jul 7, 2021

kunalspathak closed this as completed Jul 13, 2021

ghost locked as resolved and limited conversation to collaborators Aug 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #52312

[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #52312

DrewScoggins commented May 5, 2021

Payloads

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory

Docs

ghost commented Jul 6, 2021

Run Information

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

Repro

Payloads

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory

Docs

GrabYourPitchforks commented Jul 6, 2021

kunalspathak commented Jul 6, 2021

GrabYourPitchforks commented Jul 6, 2021

DrewScoggins commented Jul 7, 2021

kunalspathak commented Jul 7, 2021

kunalspathak commented Jul 7, 2021

GrabYourPitchforks commented Jul 7, 2021

kunalspathak commented Jul 9, 2021

GrabYourPitchforks commented Jul 12, 2021

kunalspathak commented Jul 13, 2021

BruceForstall commented Jul 13, 2021

EgorBo commented Jul 13, 2021

[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #52312

[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #52312

Comments

DrewScoggins commented May 5, 2021

Run Information

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

Repro

Payloads

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory

Docs

ghost commented Jul 6, 2021

Run Information

Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char>

Repro

Payloads

Histogram

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanTenSegments

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanArray

System.Buffers.Tests.ReadOnlySequenceTests<Char>.FirstSpanMemory

Docs

GrabYourPitchforks commented Jul 6, 2021

kunalspathak commented Jul 6, 2021

GrabYourPitchforks commented Jul 6, 2021

DrewScoggins commented Jul 7, 2021

kunalspathak commented Jul 7, 2021

kunalspathak commented Jul 7, 2021

GrabYourPitchforks commented Jul 7, 2021

kunalspathak commented Jul 9, 2021

GrabYourPitchforks commented Jul 12, 2021

kunalspathak commented Jul 13, 2021

BruceForstall commented Jul 13, 2021

EgorBo commented Jul 13, 2021