Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multidimensional array performance #5481

Closed
colgreen opened this issue Apr 3, 2016 · 4 comments
Closed

Multidimensional array performance #5481

colgreen opened this issue Apr 3, 2016 · 4 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@colgreen
Copy link
Contributor

colgreen commented Apr 3, 2016

There appears to be a long running issue with the performance of multidimensional arrays compared to a comparable jagged array. Are there any plans to look into this?

Some existing discussion here:

http://stackoverflow.com/questions/468832/why-are-multi-dimensional-arrays-in-net-slower-than-normal-arrays
http://stackoverflow.com/questions/5476000/c-multi-dimensional-array-vs-one-dimensional

I ran a quick test using .Net 4.6.1 in x64 mode (i.e. using ryuJIT), and the issue appears to still be present.

category:cq
theme:md-arrays
skill-level:expert
cost:large

@cmckinsey
Copy link
Contributor

Comparing the performance of various array representations is complex and depends on a lot of workload specifics. I won't attempt to spend time to cover it in this thread. But for this specific example, each implementation should each largely be able to perform the same speed for the specific dimensions sizes.

I'm going to use this issue to cover what is missing work in RyuJIT handling rectangular arrays to match .NET 4.5.3 (JIT64) performance for the sum(double[,], d) method.

Timings with RyuJIT:
sum took 00:00:01.7960414
sum took 00:00:03.2205602 // over 2x slower than vector version
sum took 00:00:01.7834780
sum took 00:00:01.7827718
sum took 00:00:03.2271725 // over 2x slower than vector version
sum took 00:00:01.7818470

Timings with JIT64:
sum took 00:00:01.4961448
sum took 00:00:01.6203301
sum took 00:00:01.5992218
sum took 00:00:01.4797479
sum took 00:00:01.6191445
sum took 00:00:01.5965326

The main reason for the deficiency is the failure of the d[I,..] calculation to be hoisted out of the inner loop, the array reference only varies on the inner-most dimension

G_M11570_IG04:        ; offs=000039H, size=0035H, gcrefRegs=00000040 {rsi}, byrefRegs=00000000 {}, byref, isz

IN0010: 000039 mov      r8d, edx
IN0011: 00003C sub      r8d, dword ptr [rsi+24]
IN0012: 000040 cmp      r8d, dword ptr [rsi+16]
IN0013: 000044 jae      SHORT G_M11570_IG08
IN0014: 000046 mov      r9d, ecx
IN0015: 000049 sub      r9d, dword ptr [rsi+28]
IN0016: 00004D cmp      r9d, dword ptr [rsi+20]
IN0017: 000051 jae      SHORT G_M11570_IG08
IN0018: 000053 mov      r10d, dword ptr [rsi+20]
IN0019: 000057 imul     r10, r8
IN001a: 00005B mov      r8, r9
IN001b: 00005E add      r8, r10
IN001c: 000061 addsd    xmm6, qword ptr [rsi+8*r8+32]
IN001d: 000068 inc      ecx
IN001e: 00006A cmp      ecx, eax
IN001f: 00006C jl       SHORT G_M11570_IG04

The internal representation of the rectangular array is done through the arrMD&[] operator and is not decomposed until the Lower phase which runs after the global and loop optimizers. It is only some of these individual addressing calculations that are invariant, but not the entire operator hence it cannot be fully optimized unless it is lowered earlier.

From the dumps you can see this happening during the Lowering phase in lowerxarch:


Lowering ArrElem
============
N001 (  1,  1) [000042] ------------             /--*  lclVar    ref    V00 arg0         u:2 $80
N002 (  1,  1) [000043] ------------             +--*  lclVar    int    V04 loc3         u:4 $380
N003 (  1,  1) [000044] ------------             +--*  lclVar    int    V05 loc4         u:4 $381
N004 ( 13,  9) [000045] ---X--------             *  arrMD&[,] byref  $3c0

Results of lowering ArrElem:
     (  1,  1) [000140] ------------                   /--*  const     long   0
N001 (  1,  1) [000042] ------------                   |  /--*  lclVar    ref    V00 arg0         u:2 $80
N002 (  1,  1) [000043] ------------                   |  +--*  lclVar    int    V04 loc3         u:4 $380
     (  3,  3) [000141] ---X--------                   +--*  arrMDIdx[i, ] int   
     (  1,  1) [000142] ------------                   +--*  lclVar    ref    V00 arg0         
     (  5,  5) [000143] ---X--------                /--*  arrMDOffs[i, ] long  
     (  1,  1) [000144] ------------                |  /--*  lclVar    ref    V00 arg0         
N003 (  1,  1) [000044] ------------                |  +--*  lclVar    int    V05 loc4         u:4 $381
     (  3,  3) [000145] ---X--------                +--*  arrMDIdx[*,j] int   
     (  1,  1) [000146] ------------                +--*  lclVar    ref    V00 arg0         
     (  9,  9) [000147] ---X--------             /--*  arrMDOffs[*,j] long  
     (  1,  1) [000148] ------------             +--*  lclVar    ref    V00 arg0         
     ( 11, 11) [000149] ---X----R---             *  lea(b+(i*8)+32) byref 

Resulting statement:
     ( 22, 20) [000050] ------------             *  stmtExpr  void  (top level) (IL 0x023...0x02E)
     (  1,  1) [000140] ------------             |                 /--*  const     long   0
N001 (  1,  1) [000042] ------------             |                 |  /--*  lclVar    ref    V00 arg0         u:2 $80
N002 (  1,  1) [000043] ------------             |                 |  +--*  lclVar    int    V04 loc3         u:4 $380
     (  3,  3) [000141] ---X--------             |                 +--*  arrMDIdx[i, ] int   
     (  1,  1) [000142] ------------             |                 +--*  lclVar    ref    V00 arg0         
     (  5,  5) [000143] ---X--------             |              /--*  arrMDOffs[i, ] long  
     (  1,  1) [000144] ------------             |              |  /--*  lclVar    ref    V00 arg0         
N003 (  1,  1) [000044] ------------             |              |  +--*  lclVar    int    V05 loc4         u:4 $381
     (  3,  3) [000145] ---X--------             |              +--*  arrMDIdx[*,j] int   
     (  1,  1) [000146] ------------             |              +--*  lclVar    ref    V00 arg0         
     (  9,  9) [000147] ---X--------             |           /--*  arrMDOffs[*,j] long  
     (  1,  1) [000148] ------------             |           +--*  lclVar    ref    V00 arg0         
     ( 11, 11) [000149] ---X----R---             |        /--*  lea(b+(i*8)+32) byref 
N005 ( 15, 13) [000046] ---X--------             |     /--*  indir     double $141
N006 (  1,  2) [000041] ------------             |     +--*  lclVar    double V01 loc0         u:7 (last use) $341
N007 ( 21, 19) [000047] ---X--------             |  /--*  +         double $302
N009 ( 22, 20) [000049] DA-X--------             \--*  st.lclVar double V01 loc0         d:8

The work item here is to move the rectangular array lowering to run before the optimizer so that the invariant components of the array calculation are exposed to invariant code motion. There may be additional work in invariant code motion to properly hoist the arrMDIdx operator correctly.

@colgreen
Copy link
Contributor Author

Is this still an issue in dotnet core 3? If so is there a process I can follow to get it triaged/tagged appropriately? Thanks.

@djsell
Copy link

djsell commented Aug 23, 2019

The benchmark is inaccurate for Release mode. What is actually slow is the call to multidimensional.GetLength(i). If you change that code to use the constant size 1024x1024, multidimensional is on par with the others (in .NET Core 2.2.6 on my MBP).

Release with d.GetLength(0) and d.GetLength(1)

sum took 00:00:01.3568572
sum took 00:00:03.0242274
sum took 00:00:01.4116762
sum took 00:00:01.3788118
sum took 00:00:03.0524923
sum took 00:00:01.3533231

Release with constant 1024x1024

sum took 00:00:01.3522575
sum took 00:00:01.3834065
sum took 00:00:01.3678978
sum took 00:00:01.3548391
sum took 00:00:01.3873318
sum took 00:00:01.3667302

Multidimensional array access is still considerably slower in Debug mode even with that change, though.

@BruceForstall
Copy link
Member

This is addressed in #70271. There are follow-up items there to continue to improve MD array performance.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants