System.Collections.Sort<BigStruct>.LinqQuery has regressed on all configs except Windows 64 bit #66776

Closed
adamsitnik opened this issue Mar 17, 2022 · 17 comments
Labels: area-CodeGen-coreclr, tenet-performance

@adamsitnik
Member

This regression appears to affect every config except Windows x64 (the Arm and Arm64 rows in the table below are also unchanged).

Repro:

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net6.0 net7.0 --filter 'System.Collections.Sort<BigStruct>.LinqQuery'

Ubuntu Historical results

image

The diff points to #55604 (cc @alexcovington) and #59287 (cc @AndyAyersMS)

Windows Historical results

image

| Result | Base | Diff | Ratio | Operating System | Bit |
| --- | --- | --- | --- | --- | --- |
| Same | 47068.40 | 46996.12 | 1.00 | Windows 11 | X64 |
| Same | 25061.00 | 25213.92 | 0.99 | Windows 11 | X64 |
| Same | 81332.07 | 82470.68 | 0.99 | Windows 11 | X64 |
| Same | 48471.02 | 49394.98 | 0.98 | Windows 10 | X64 |
| Same | 61753.97 | 65909.26 | 0.94 | Windows 11 | X64 |
| Same | 79322.94 | 78292.41 | 1.01 | Windows 11 | X64 |
| Slower | 33152.41 | 48551.85 | 0.68 | ubuntu 18.04 | X64 |
| Slower | 33670.35 | 49233.18 | 0.68 | ubuntu 20.04 | X64 |
| Slower | 65475.08 | 84542.75 | 0.77 | ubuntu 18.04 | X64 |
| Same | 102906.71 | 95691.63 | 1.08 | ubuntu 18.04 | X64 |
| Slower | 78941.99 | 99516.66 | 0.79 | pop 20.04 | X64 |
| Slower | 58025.14 | 76420.12 | 0.76 | alpine 3.13 | X64 |
| Slower | 58358.38 | 87952.92 | 0.66 | debian 11 | X64 |
| Same | 39738.00 | 38447.43 | 1.03 | macOS Monterey 12.2.1 | Arm64 |
| Same | 81077.12 | 83539.94 | 0.97 | Windows 10 | Arm64 |
| Same | 84261.45 | 85918.34 | 0.98 | Windows 11 | Arm64 |
| Slower | 51385.76 | 75022.36 | 0.68 | Windows 11 | X86 |
| Slower | 68915.32 | 91940.60 | 0.75 | Windows 10 | X86 |
| Slower | 61701.11 | 79972.87 | 0.77 | Windows 10 | X86 |
| Slower | 57559.08 | 70356.86 | 0.82 | Windows 10 | X86 |
| Same | 151162.22 | 145951.59 | 1.04 | Windows 10 | Arm |
| Slower | 90819.89 | 108997.55 | 0.83 | macOS Big Sur 11.6.3 | X64 |
| Slower | 73211.06 | 98121.94 | 0.75 | macOS Monterey 12.2.1 | X64 |
| Slower | 79186.88 | 106613.19 | 0.74 | macOS Monterey 12.2.1 | X64 |
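
For anyone reading the table: the Ratio column appears to be Base divided by Diff, so values below 1.00 mean the Diff run (presumably net7.0) took longer. A minimal Python sketch that checks this against a few rows copied from the table. The `ratio` helper is a hypothetical name, and the sketch does not try to reproduce the Same/Slower classification, whose criteria are not stated in this issue.

```python
# Rows copied from the table above: (base, diff, reported_ratio, config).
ROWS = [
    (47068.40, 46996.12, 1.00, "Windows 11 X64"),
    (33152.41, 48551.85, 0.68, "ubuntu 18.04 X64"),
    (102906.71, 95691.63, 1.08, "ubuntu 18.04 X64"),
    (58358.38, 87952.92, 0.66, "debian 11 X64"),
    (68915.32, 91940.60, 0.75, "Windows 10 X86"),
]


def ratio(base: float, diff: float) -> float:
    """Assumed definition of the Ratio column: Base / Diff, two decimals."""
    return round(base / diff, 2)


for base, diff, reported, config in ROWS:
    # Positive percentage means the Diff run was slower than Base.
    change_pct = (diff / base - 1.0) * 100.0
    print(f"{config:18} ratio={ratio(base, diff):.2f} "
          f"(reported {reported:.2f}), diff vs base: {change_pct:+.1f}%")
```

Under this reading, a ratio of 0.68 corresponds to the Diff run taking roughly 46% longer than Base.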
@ghost

ghost commented Mar 17, 2022

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member

#59287 is locked, so it doesn't get cross-linked. That seems unfortunate.

That change should have purely impacted jit diagnostics, so it's unlikely to have caused regressions.

@jeffhandley jeffhandley added this to the 7.0.0 milestone Jul 10, 2022
@eiriktsarpalis eiriktsarpalis added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Collections labels Aug 5, 2022
@ghost

ghost commented Aug 5, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member

Digging through, it looks like we expected this to be resolved -- see dotnet/perf-autofiling-issues#1501 (comment)

But that only fixed the issue on Windows; Ubuntu did not benefit, so we still have a regression.

newplot - 2022-08-08T081954 386

(Windows is slightly worse off too)

newplot - 2022-08-08T082051 573

@AndyAyersMS
Member

Looks like this is still unassigned. I'll take it for now.

@AndyAyersMS AndyAyersMS self-assigned this Aug 8, 2022
@AndyAyersMS
Member

Can reproduce when running locally (via WSL2):

BenchmarkDotNet=v0.13.1.1823-nightly, OS=ubuntu 20.04
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22408.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT
  Job-CFAJOE : .NET 5.0.1 (5.0.120.57516), X64 RyuJIT
  Job-JPHJBC : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  Job-KPSCOL : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  InvocationCount=5000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  MinWarmupIterationCount=6
UnrollFactor=1  WarmupCount=-1
| Method | Job | Runtime | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LinqQuery | Job-CFAJOE | .NET 5.0 | net5.0 | 512 | 56.80 us | 0.729 us | 0.682 us | 56.79 us | 55.69 us | 58.34 us | 1.00 | 0.00 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
| LinqQuery | Job-JPHJBC | .NET 6.0 | net6.0 | 512 | 58.25 us | 0.707 us | 0.662 us | 57.97 us | 57.41 us | 59.44 us | 1.03 | 0.02 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
| LinqQuery | Job-KPSCOL | .NET 7.0 | net7.0 | 512 | 72.44 us | 1.321 us | 1.235 us | 72.28 us | 70.45 us | 74.83 us | 1.28 | 0.03 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
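
Note that the Ratio column here runs the opposite way from the Base/Diff table in the issue description: it is relative to the baseline job (net5.0), so values above 1.00 mean slower. A rough Python sketch that approximates it as mean divided by baseline mean, using the Mean values from the results above. The helper name `ratio_to_baseline` is hypothetical, and BenchmarkDotNet's own calculation may differ in detail, so treat this as an approximation.

```python
# Mean times in microseconds, copied from the results above.
MEANS_US = {"net5.0": 56.80, "net6.0": 58.25, "net7.0": 72.44}


def ratio_to_baseline(means: dict, baseline: str) -> dict:
    """Approximate the Ratio column as mean / baseline mean (two decimals)."""
    base = means[baseline]
    return {name: round(mean / base, 2) for name, mean in means.items()}


# Relative to net5.0, as in the table.
ratios = ratio_to_baseline(MEANS_US, "net5.0")
print(ratios)

# Relative to net6.0, which is the comparison this issue is about.
print(ratio_to_baseline(MEANS_US, "net6.0")["net7.0"])
```

Relative to net6.0, the net7.0 mean comes out at about 1.24, which is in line with the regressions reported for the Linux x64 configs.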

@AndyAyersMS
Member

@adamsitnik is it expected that with -p EP I won't get cpu sample events? If so, any way to enable these via the command line?

@AndyAyersMS
Member

Hmm, I guess there are sample events but not ones that perfview recognizes?

image

@AndyAyersMS
Member

From the above I can get a crude profile of sorts. But not sure it is helping me spot which method(s) have regressed.

@adamsitnik
Member Author

> I guess there are sample events but not ones that perfview recognizes?

In the case of EventPipe, we just get different CPU samples (events emitted by the .NET Runtime, not the OS). In PerfView, you need to open the "Thread Time" view (not "CPU Stacks" as usual):

image

image

Or you can take the .speedscope file generated by BDN:

Exported 1 trace file(s). Example:
D:\projects\performance\artifacts\bin\MicroBenchmarks\Release\net7.0\BenchmarkDotNet.Artifacts\System.Collections.Sort_BigStruct_.LinqQuery(Size_ 512)-20220809-091754.speedscope.json

and open it with speedscope

@AndyAyersMS
Member

Still didn't find that very helpful. But here's perf (via WSL2) on the two:

image

If this is credible then the issue is in this bit of code.

;; 6.0 

; Assembly listing for method GenericComparer`1:Compare(BigStruct,BigStruct):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;  V01 arg1         [V01,T03] (  2,  1.36)  struct (32) [rbp+10H]   do-not-enreg[SF] ld-addr-op single-def
;  V02 arg2         [V02,T04] (  1,  1   )  struct (32) [rbp+30H]   do-not-enreg[SB] single-def
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T01] (  2,  4   )  struct (32) [rbp-20H]   do-not-enreg[SFB] "Inlining Arg"
;  V05 tmp2         [V05,T02] (  4,  1.50)     int  ->  rax         "Inline return value spill temp"
;  V06 tmp3         [V06,T00] (  3,  4.71)     int  ->  rax         "Inlining Arg"
;
; Lcl frame size = 32

G_M25642_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2420           lea      rbp, [rsp+20H]
						;; bbWeight=1    PerfScore 2.75
G_M25642_IG02:              ;; offset=000DH
       C5FA6F4530           vmovdqu  xmm0, xmmword ptr [rbp+30H]
       C5FA7F45E0           vmovdqu  xmmword ptr [rbp-20H], xmm0
       C5FA6F4540           vmovdqu  xmm0, xmmword ptr [rbp+40H]
       C5FA7F45F0           vmovdqu  xmmword ptr [rbp-10H], xmm0
       8B45EC               mov      eax, dword ptr [rbp-14H]
       39451C               cmp      dword ptr [rbp+1CH], eax
       7C14                 jl       SHORT G_M25642_IG07
						;; bbWeight=1    PerfScore 7.00
G_M25642_IG03:              ;; offset=0029H
       39451C               cmp      dword ptr [rbp+1CH], eax
       7F08                 jg       SHORT G_M25642_IG06
						;; bbWeight=0.36 PerfScore 0.71
G_M25642_IG04:              ;; offset=002EH
       33C0                 xor      eax, eax
						;; bbWeight=0.26 PerfScore 0.06
G_M25642_IG05:              ;; offset=0030H
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=1    PerfScore 1.75
G_M25642_IG06:              ;; offset=0036H
       B801000000           mov      eax, 1
       EBF3                 jmp      SHORT G_M25642_IG05
						;; bbWeight=0.10 PerfScore 0.22
G_M25642_IG07:              ;; offset=003DH
       B8FFFFFFFF           mov      eax, -1
       EBEC                 jmp      SHORT G_M25642_IG05
						;; bbWeight=0.14 PerfScore 0.32

versus

;; 7.0

; Assembly listing for method GenericComparer`1:Compare(BigStruct,BigStruct):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;  V01 arg1         [V01,T03] (  2,  1.35)  struct (32) [rbp+10H]   do-not-enreg[SF] ld-addr-op single-def
;  V02 arg2         [V02,T04] (  1,  1   )  struct (32) [rbp+30H]   do-not-enreg[S] single-def
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T01] (  2,  4   )  struct (32) [rbp-20H]   do-not-enreg[SF] "Inlining Arg"
;  V05 tmp2         [V05,T02] (  4,  1.50)     int  ->  rax         "Inline return value spill temp"
;  V06 tmp3         [V06,T00] (  3,  4.70)     int  ->  rax         "Inlining Arg"
;
; Lcl frame size = 32

G_M25642_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2420           lea      rbp, [rsp+20H]
						;; size=13 bbWeight=1    PerfScore 2.75
G_M25642_IG02:              ;; offset=000DH
       C5FE6F4530           vmovdqu  ymm0, ymmword ptr[rbp+30H]
       C5FE7F45E0           vmovdqu  ymmword ptr[rbp-20H], ymm0
       8B45EC               mov      eax, dword ptr [rbp-14H]
       39451C               cmp      dword ptr [rbp+1CH], eax
       7C17                 jl       SHORT G_M25642_IG07
						;; size=18 bbWeight=1    PerfScore 9.00
G_M25642_IG03:              ;; offset=001FH
       39451C               cmp      dword ptr [rbp+1CH], eax
       7F0B                 jg       SHORT G_M25642_IG06
						;; size=5 bbWeight=0.35 PerfScore 1.06
G_M25642_IG04:              ;; offset=0024H
       33C0                 xor      eax, eax
						;; size=2 bbWeight=0.25 PerfScore 0.06
G_M25642_IG05:              ;; offset=0026H
       C5F877               vzeroupper 
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret      
						;; size=9 bbWeight=1    PerfScore 2.75
G_M25642_IG06:              ;; offset=002FH
       B801000000           mov      eax, 1
       EBF0                 jmp      SHORT G_M25642_IG05
						;; size=7 bbWeight=0.10 PerfScore 0.22
G_M25642_IG07:              ;; offset=0036H
       B8FFFFFFFF           mov      eax, -1
       EBE9                 jmp      SHORT G_M25642_IG05
						;; size=7 bbWeight=0.15 PerfScore 0.33

@AndyAyersMS
Member

Note that with AVX/AVX2 disabled, 6.0 and 7.0 have matching perf (and both match 6.0 with AVX enabled).

BenchmarkDotNet=v0.13.1.1823-nightly, OS=ubuntu 20.04
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22408.1
[Host] : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
Job-KAQRRV : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
Job-SXOIEW : .NET 7.0.0 (7.0.22.40308), X64 RyuJIT

EnvironmentVariables=COMPlus_EnableAVX2=0,COMPlus_EnableAVX=0 PowerPlanMode=00000000-0000-0000-0000-000000000000 InvocationCount=5000
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
MinWarmupIterationCount=6 UnrollFactor=1 WarmupCount=-1

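| Method | Job | Runtime | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LinqQuery | Job-KAQRRV | .NET 6.0 | net6.0 | 512 | 55.18 us | 0.762 us | 0.675 us | 55.05 us | 54.16 us | 56.73 us | 1.00 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |
| LinqQuery | Job-SXOIEW | .NET 7.0 | net7.0 | 512 | 57.40 us | 0.461 us | 0.409 us | 57.28 us | 56.93 us | 58.04 us | 1.04 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |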

Going to modify the jit so I can do this per-method and see if just disabling AVX for the comparer explains the perf loss.
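
For reference, the AVX-disabled run above sets `COMPlus_EnableAVX=0` and `COMPlus_EnableAVX2=0` for the benchmark process. A minimal Python sketch of one way to launch the repro command from the issue description with those variables set. It assumes the benchmark processes inherit the parent environment, which is not confirmed in this thread; `env_with_avx_disabled` and `run_repro` are hypothetical helper names.

```python
import os
import subprocess

# The two settings shown in the EnvironmentVariables line above.
AVX_OFF = {"COMPlus_EnableAVX": "0", "COMPlus_EnableAVX2": "0"}

# Repro command from the issue description.
REPRO_CMD = [
    "python3", "./performance/scripts/benchmarks_ci.py",
    "-f", "net6.0", "net7.0",
    "--filter", "System.Collections.Sort<BigStruct>.LinqQuery",
]


def env_with_avx_disabled(base_env=None):
    """Return a copy of the environment with both AVX knobs turned off."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(AVX_OFF)
    return env


def run_repro():
    """Run the repro with AVX disabled (assumes child processes inherit env)."""
    subprocess.run(REPRO_CMD, env=env_with_avx_disabled(), check=True)
```

Calling `run_repro()` from the directory that contains the cloned `performance` repo would start the benchmark; nothing is executed on import.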

@AndyAyersMS
Member

Looks like the regression comes from the use of YMM registers in the two hottest methods above:

  • System.Linq.EnumerableSorter`2[BigStruct,BigStruct][System.Collections.BigStruct,System.Collections.BigStruct]:CompareAnyKeys(int,int)
  • System.Collections.Generic.GenericComparer`1[BigStruct][System.Collections.BigStruct]::Compare

In both cases there is a YMM store closely followed by a narrower load:

;; Compare

       C5FE7F45E0           vmovdqu  ymmword ptr[rbp-20H], ymm0
       8B45EC               mov      eax, dword ptr [rbp-14H]

;; CompareAnyKeys

       C5FE7F45C8           vmovdqu  ymmword ptr[rbp-38H], ymm0
       C5FA6F45C8           vmovdqu  xmm0, qword ptr [rbp-38H]
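
To make the overlap concrete, here is a small Python sketch that models each memory access as a byte range relative to `rbp` and checks that the narrow load falls entirely inside the bytes just written by the 32-byte store. The `Access` type and helper names are hypothetical. The load in CompareAnyKeys is assumed to be 16 bytes wide because its destination is `xmm0`, even though the listing prints `qword ptr`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    offset: int  # signed displacement from rbp, in bytes
    size: int    # access width in bytes

    @property
    def end(self) -> int:
        return self.offset + self.size


def contained_in(load: Access, store: Access) -> bool:
    """True if every byte read by the load was written by the store."""
    return store.offset <= load.offset and load.end <= store.end


def offset_within(load: Access, store: Access) -> int:
    """Byte offset of the load relative to the start of the store."""
    return load.offset - store.offset


# Compare: 32-byte store to [rbp-20H], 4-byte load from [rbp-14H].
compare_store = Access(-0x20, 32)
compare_load = Access(-0x14, 4)

# CompareAnyKeys: 32-byte store to [rbp-38H], 16-byte load from [rbp-38H].
keys_store = Access(-0x38, 32)
keys_load = Access(-0x38, 16)

print(contained_in(compare_load, compare_store), offset_within(compare_load, compare_store))
print(contained_in(keys_load, keys_store), offset_within(keys_load, keys_store))
```

In Compare the load starts 12 bytes into the stored struct, which matches the `[+12]` field offset on `V04 tmp1` in the morph dump later in this thread.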

@AndyAyersMS
Member

On Windows, there is similar codegen in Compare but not in CompareAnyKeys -- the latter because of ABI differences.

;; (windows) Compare

       C5FE7F442408         vmovdqu  ymmword ptr[rsp+08H], ymm0
       8B442414             mov      eax, dword ptr [rsp+14H]

Despite this, perf on Windows generally seems better (around 53us). Note that the store above is misaligned (as is the store in Linux's CompareAnyKeys), if that matters.
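
A quick way to see why those stores cannot be 32-byte aligned, sketched in Python. It assumes the base register (`rsp` on Windows, `rbp` on Linux) is 16-byte aligned at that point, which this thread does not show for CompareAnyKeys; `possible_residues` is a hypothetical helper.

```python
def possible_residues(disp: int, base_align: int = 16, access_align: int = 32) -> list:
    """Residues of (base + disp) mod access_align over all base_align-aligned bases."""
    return sorted({(base + disp) % access_align
                   for base in range(0, access_align, base_align)})


def can_be_aligned(disp: int) -> bool:
    """True if some 16-byte-aligned base makes base + disp 32-byte aligned."""
    return 0 in possible_residues(disp)


print(possible_residues(0x08))   # Windows Compare store: [rsp+08H]
print(possible_residues(-0x38))  # Linux CompareAnyKeys store: [rbp-38H]
print(possible_residues(-0x20))  # Linux Compare store: [rbp-20H]
```

Under that assumption the first two addresses are always 8 or 24 bytes past a 32-byte boundary, while the Linux Compare store at `[rbp-20H]` may or may not be aligned depending on the frame.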

Also note that in Compare the struct copy is really not needed. Seems like forward sub (or morph's copy prop) should get this case, but neither one sees the use:

;; tmp1 is single use
***** BB03
STMT00003 ( 0x010[E-] ... ??? )
               [000027] -A---------                         *  ASG       struct (copy)
               [000025] D------N---                         +--*  LCL_VAR   struct<System.Collections.BigStruct, 32> V04 tmp1         
               [000013] n----------                         \--*  OBJ       struct<System.Collections.BigStruct, 32>
               [000012] -----------                            \--*  ADDR      byref 
               [000010] -------N---                               \--*  LCL_VAR   struct<System.Collections.BigStruct, 32> V02 arg2         

***** BB03
STMT00009 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x010[E-]
               [000058] -A---------                         *  ASG       int   
               [000057] D------N---                         +--*  LCL_VAR   int    V06 tmp3         
               [000022] -----------                         \--*  FIELD     int    _int1
               [000021] -----------                            \--*  ADDR      byref 
               [000020] -------N---                               \--*  LCL_VAR   struct<System.Collections.BigStruct, 32> V04 tmp1         

;; fwd sub

    [000027]:  no next stmt use

;; morph

In BB01 New Local Copy     Assertion: V04 == V02, index = #01

fgMorphTree BB01, STMT00009 (before)
               [000058] -A---------                         *  ASG       int   
               [000057] D------N---                         +--*  LCL_VAR   int    V06 tmp3         
               [000022] -----------                         \--*  LCL_FLD   int    V04 tmp1         [+12]


@AndyAyersMS
Member

Verified this is mitigated with the preliminary changes from #73719.

This is beyond the scope of what we can fix for .net7, so I think we're going to have to live with this regression.

| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LinqQuery | Job-XBERYB | .net7 | 512 | 70.09 us | 0.513 us | 0.455 us | 70.11 us | 69.06 us | 70.89 us | 1.21 | 0.02 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
| LinqQuery | Job-WMUOPH | #73719 | 512 | 56.01 us | 0.644 us | 0.602 us | 55.81 us | 55.26 us | 57.21 us | 0.97 | 0.02 | 5.6000 | 0.6000 | 34.33 KB | 1.00 |
| LinqQuery | Job-NGYOUF | .net6 | 512 | 57.87 us | 1.093 us | 1.023 us | 57.66 us | 56.53 us | 59.62 us | 1.00 | 0.00 | 5.4000 | 0.4000 | 34.33 KB | 1.00 |

@AndyAyersMS
Member

This should be fixed by #74384.

@AndyAyersMS
Member

(ubuntu x64)

newplot - 2022-09-02T152343 171

@ghost ghost locked as resolved and limited conversation to collaborators Oct 3, 2022