Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf -30%] System.Memory.Span<Byte>.BinarySearch #39721

Closed
DrewScoggins opened this issue Jul 21, 2020 · 6 comments
Closed

[Perf -30%] System.Memory.Span<Byte>.BinarySearch #39721

DrewScoggins opened this issue Jul 21, 2020 · 6 comments
Assignees
Labels
Milestone

Comments

@DrewScoggins
Copy link
Member

Run Information

Architecture x64
OS Windows 10.0.18362
Changes diff

Regressions in System.Memory.Span

Benchmark Baseline Test Test/Base Modality Baseline Outlier
BinarySearch 14.86 ns 19.26 ns 1.30 Bimodal True

Related Issue on x86 Windows

[Perf -25%] System.Memory.Span.IndexOfAnyThreeValues

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span<Byte>*';

Histogram

System.Memory.Span.BinarySearch(Size: 512)

[13.136 ; 14.227) | @@@@@@@@
[14.227 ; 14.616) | 
[14.616 ; 15.707) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[15.707 ; 16.840) | @@@@
[16.840 ; 17.931) | 
[17.931 ; 18.355) | 
[18.355 ; 19.446) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels Jul 21, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels Jul 21, 2020
@kunalspathak
Copy link
Member

I pulled out latest graph and if I compared it with the graph in this issue, it looks like we see regression once in a while, although in latest graph below, the regression continues.

image

On my local machine, I came up with standalone test to verify it.

Stopwatch st = new Stopwatch();
st.Start();
BinarySearchTest<char> test = new BinarySearchTest<char>();
for (int i = 0; i < 6 * 1000 * 1000 * 200; i++)
{
  test.BinarySearch();
}
st.Stop();
Console.WriteLine("Time: " + st.ElapsedMilliseconds + " msec.");
Base: 16725 msec.
Test: 19845 msec.

Base = 4f703e6
Test = 4c18fb2

I can clearly see the regression on my local machine as well, but I couldn't see why it regressed. The JIT code produced during entire run is identical. The JIT throughput itself was better in test compared to base as seen below:

(left is base and right is test).
image

Drilling down further doesn't tell much. The time is just spent in executing the assembly code of BinarySearch.

image

Perhaps, one of @GrabYourPitchforks or @adamsitnik should investigate further.

@adamsitnik
Copy link
Member

The JIT code produced during entire run is identical.

Have you used the latest 5.0 bits? I've seen the same assembly code and no regression when using .NET Core SDK=5.0.100-preview.8.20362.3 but when I use the python script to download the latest SDK (RC1) the there is a diff and a regression.

https://www.diffchecker.com/OfXOVRte

obraz

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Memory.Span<Byte>.BinarySearch' --bdn-arguments "--disasm true"

3.1

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.18363.959 (1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.1.20411.10
[Host] : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT
Job-IUPLXQ : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 3.1 Arguments=/p:DebugType=portable
Toolchain=netcoreapp3.1 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated Code Size
BinarySearch 512 11.87 ns 0.119 ns 0.105 ns 11.84 ns 11.68 ns 12.01 ns - - - - 116 B

.NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), X64 RyuJIT

; System.Memory.Span`1[[System.Byte, System.Private.CoreLib]].BinarySearch()
       sub       rsp,28
       mov       rdx,[rcx+18]
       test      rdx,rdx
       jne       short M00_L00
       xor       r8d,r8d
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       r8,[rdx+10]
       mov       edx,[rdx+8]
M00_L01:
       movzx     eax,byte ptr [rcx+2C]
       mov       rcx,r8
       mov       r8d,eax
       call      System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       nop
       add       rsp,28
       ret
; Total bytes of code 48
; System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       mov       [rsp+18],r8d
       xor       eax,eax
       dec       edx
       test      edx,edx
       jl        short M01_L03
M01_L00:
       lea       r8d,[rdx+rax]
       shr       r8d,1
       movsxd    r9,r8d
       movzx     r9d,byte ptr [rcx+r9]
       movzx     r10d,byte ptr [rsp+18]
       sub       r10d,r9d
       test      r10d,r10d
       je        short M01_L04
       test      r10d,r10d
       jle       short M01_L01
       lea       eax,[r8+1]
       jmp       short M01_L02
M01_L01:
       lea       edx,[r8+0FFFF]
M01_L02:
       cmp       eax,edx
       jle       short M01_L00
M01_L03:
       not       eax
       ret
M01_L04:
       mov       eax,r8d
       ret
; Total bytes of code 68

5.0

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.18363.959 (1909/November2019Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.1.20411.10
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT
Job-PXLKUC : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated Code Size
BinarySearch 512 16.56 ns 0.089 ns 0.075 ns 16.57 ns 16.42 ns 16.68 ns - - - - 113 B

.NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT

; System.Memory.Span`1[[System.Byte, System.Private.CoreLib]].BinarySearch()
       sub       rsp,28
       mov       rdx,[rcx+18]
       test      rdx,rdx
       jne       short M00_L00
       xor       r8d,r8d
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       r8,[rdx+10]
       mov       edx,[rdx+8]
M00_L01:
       movzx     eax,byte ptr [rcx+2C]
       mov       rcx,r8
       mov       r8d,eax
       call      System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       nop
       add       rsp,28
       ret
; Total bytes of code 48
; System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       mov       [rsp+18],r8d
       xor       eax,eax
       dec       edx
       test      edx,edx
       jl        short M01_L03
M01_L00:
       lea       r8d,[rdx+rax]
       shr       r8d,1
       movsxd    r9,r8d
       movzx     r9d,byte ptr [rcx+r9]
       movzx     r10d,byte ptr [rsp+18]
       sub       r10d,r9d
       je        short M01_L04
       test      r10d,r10d
       jle       short M01_L01
       lea       eax,[r8+1]
       jmp       short M01_L02
M01_L01:
       lea       edx,[r8+0FFFF]
M01_L02:
       cmp       eax,edx
       jle       short M01_L00
M01_L03:
       not       eax
       ret
M01_L04:
       mov       eax,r8d
       ret
; Total bytes of code 65

@adamsitnik adamsitnik added this to the 5.0.0 milestone Aug 14, 2020
@kunalspathak
Copy link
Member

The diff is related to #38586 and shouldn't introduce regression.

@GrabYourPitchforks
Copy link
Member

How important is this scenario? I think we can get substantial perf wins for primitive types like byte, char, int, etc. But I don't want to risk introducing churn this late in the release cycle unless this is a critical scenario.

@adamsitnik
Copy link
Member

The diff is related to #38586 and shouldn't introduce regression.

@kunalspathak you are right, it should not. I've dig a little bit more into that and confirmed that removal of extra test instruction has affected the code alignment and has caused the regression. Full details below:

.AddDiagnoser(new DisassemblyDiagnoser(new DisassemblyDiagnoserConfig(printInstructionAddresses: true)))
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span<Byte>.BinarySearch' --bdn-arguments "--envVars COMPlus_JitAlignLoops:0"
Method Size Mean Error StdDev Median Min Max Code Size
BinarySearch 512 17.55 ns 0.313 ns 0.278 ns 17.56 ns 17.23 ns 18.13 ns 113 B

.NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT

; System.Memory.Span`1[[System.Byte, System.Private.CoreLib]].BinarySearch()
       7FFB797C58B0 sub       rsp,28
       7FFB797C58B4 mov       rdx,[rcx+18]
       7FFB797C58B8 test      rdx,rdx
       7FFB797C58BB jne       short M00_L00
       7FFB797C58BD xor       r8d,r8d
       7FFB797C58C0 xor       edx,edx
       7FFB797C58C2 jmp       short M00_L01
M00_L00:
       7FFB797C58C4 lea       r8,[rdx+10]
       7FFB797C58C8 mov       edx,[rdx+8]
M00_L01:
       7FFB797C58CB movzx     eax,byte ptr [rcx+2C]
       7FFB797C58CF mov       rcx,r8
       7FFB797C58D2 mov       r8d,eax
       7FFB797C58D5 call      System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       7FFB797C58DA nop
       7FFB797C58DB add       rsp,28
       7FFB797C58DF ret
; Total bytes of code 48
; System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       7FFB797C5980 mov       [rsp+18],r8d
       7FFB797C5985 xor       eax,eax
       7FFB797C5987 dec       edx
       7FFB797C5989 test      edx,edx
       7FFB797C598B jl        short M01_L03
M01_L00:
       7FFB797C598D lea       r8d,[rdx+rax]
       7FFB797C5991 shr       r8d,1
       7FFB797C5994 movsxd    r9,r8d
       7FFB797C5997 movzx     r9d,byte ptr [rcx+r9]
       7FFB797C599C movzx     r10d,byte ptr [rsp+18]
       7FFB797C59A2 sub       r10d,r9d
       7FFB797C59A5 je        short M01_L04
       7FFB797C59A7 test      r10d,r10d
       7FFB797C59AA jle       short M01_L01
       7FFB797C59AC lea       eax,[r8+1]
       7FFB797C59B0 jmp       short M01_L02
M01_L01:
       7FFB797C59B2 lea       edx,[r8+0FFFF]
M01_L02:
       7FFB797C59B6 cmp       eax,edx
       7FFB797C59B8 jle       short M01_L00
M01_L03:
       7FFB797C59BA not       eax
       7FFB797C59BC ret
M01_L04:
       7FFB797C59BD mov       eax,r8d
       7FFB797C59C0 ret
; Total bytes of code 65
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Memory.Span<Byte>.BinarySearch' --bdn-arguments "--envVars COMPlus_JitAlignLoops:1"
Method Size Mean Error StdDev Median Min Max Code Size
BinarySearch 512 11.80 ns 0.172 ns 0.153 ns 11.77 ns 11.61 ns 12.12 ns 116 B

.NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT

; System.Memory.Span`1[[System.Byte, System.Private.CoreLib]].BinarySearch()
       7FFB797D5930 sub       rsp,28
       7FFB797D5934 mov       rdx,[rcx+18]
       7FFB797D5938 test      rdx,rdx
       7FFB797D593B jne       short M00_L00
       7FFB797D593D xor       r8d,r8d
       7FFB797D5940 xor       edx,edx
       7FFB797D5942 jmp       short M00_L01
M00_L00:
       7FFB797D5944 lea       r8,[rdx+10]
       7FFB797D5948 mov       edx,[rdx+8]
M00_L01:
       7FFB797D594B movzx     eax,byte ptr [rcx+2C]
       7FFB797D594F mov       rcx,r8
       7FFB797D5952 mov       r8d,eax
       7FFB797D5955 call      System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       7FFB797D595A nop
       7FFB797D595B add       rsp,28
       7FFB797D595F ret
; Total bytes of code 48
; System.SpanHelpers.BinarySearch[[System.Byte, System.Private.CoreLib],[System.Byte, System.Private.CoreLib]](Byte ByRef, Int32, Byte)
       7FFB797D5A00 mov       [rsp+18],r8d
       7FFB797D5A05 xor       eax,eax
       7FFB797D5A07 dec       edx
       7FFB797D5A09 test      edx,edx
       7FFB797D5A0B jl        short M01_L03
       7FFB797D5A0D nop       dword ptr [rax]
M01_L00:
       7FFB797D5A10 lea       r8d,[rdx+rax]
       7FFB797D5A14 shr       r8d,1
       7FFB797D5A17 movsxd    r9,r8d
       7FFB797D5A1A movzx     r9d,byte ptr [rcx+r9]
       7FFB797D5A1F movzx     r10d,byte ptr [rsp+18]
       7FFB797D5A25 sub       r10d,r9d
       7FFB797D5A28 je        short M01_L04
       7FFB797D5A2A test      r10d,r10d
       7FFB797D5A2D jle       short M01_L01
       7FFB797D5A2F lea       eax,[r8+1]
       7FFB797D5A33 jmp       short M01_L02
M01_L01:
       7FFB797D5A35 lea       edx,[r8+0FFFF]
M01_L02:
       7FFB797D5A39 cmp       eax,edx
       7FFB797D5A3B jle       short M01_L00
M01_L03:
       7FFB797D5A3D not       eax
       7FFB797D5A3F ret
M01_L04:
       7FFB797D5A40 mov       eax,r8d
       7FFB797D5A43 ret
; Total bytes of code 68

Since this #38586 is something that we definitely want and we have very little control of code alignment, I am closing this issue.

@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Aug 17, 2020
@AndyAyersMS
Copy link
Member

Linking to #8108.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

6 participants