Faster List Add #9539

Merged
merged 1 commit on Feb 23, 2017

Conversation

@benaadams
Collaborator

benaadams commented Feb 12, 2017

List Add and Clear are warm spots in Kestrel.

This is a mild tweak to Add; however, because what's being cleared is a list of GCHandle structs, the change to Clear is a significant win.

@jkotas is this a valid use of JitHelpers.ContainsReferences<T>()?

/cc @stephentoub

@jkotas

Member

jkotas commented Feb 12, 2017

@jkotas is this a valid use of JitHelpers.ContainsReferences<T>()?

Yes, it is what it is meant for.
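
To make the Clear optimization discussed above concrete, here is a minimal sketch. It is not the code from this PR: `MiniList<T>` is a hypothetical stand-in for `List<T>`, and because `JitHelpers.ContainsReferences<T>()` is internal to the runtime, the sketch assumes the public `RuntimeHelpers.IsReferenceOrContainsReferences<T>()` is an acceptable analogue. The idea is that zeroing the backing array only matters when it can hold references the GC would otherwise keep alive.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for List<T>; all names here are illustrative.
public sealed class MiniList<T>
{
    private T[] _items = new T[4];
    private int _size;
    private int _version;

    public int Count => _size;

    public void Add(T item)
    {
        if (_size == _items.Length)
            Array.Resize(ref _items, _items.Length * 2);
        _items[_size++] = item;
        _version++;
    }

    public void Clear()
    {
        _version++;
        // Assumption: public analogue of the internal JitHelpers.ContainsReferences<T>().
        if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
        {
            int size = _size;
            _size = 0;
            if (size > 0)
                Array.Clear(_items, 0, size); // drop references so the GC can reclaim them
        }
        else
        {
            // No references to release (e.g. long or GCHandle): skip zeroing the array.
            _size = 0;
        }
    }
}
```

For a `MiniList<GCHandle>` the check is false, so Clear reduces to resetting the count, which is where the win described above would come from.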

@stephentoub

Member

stephentoub commented Feb 12, 2017

I'd made a similar change to Add in #9323. (The PR shows the EnsureCapacity call separated out, but I'd reverted that part locally and never pushed it, so it looked basically identical to this PR.) It showed benefits on one machine, but on another it actually showed a slowdown. We should just make sure it's consistently better before changing it.

@benaadams

Collaborator

benaadams commented Feb 12, 2017

Going to submit a separate PR for Clear and Remove as they are quite simple. Add ends up with a lot going on in the asm, so I'll see if I can take care of it separately.

@jamesqo

Contributor

jamesqo commented Feb 14, 2017

@benaadams Awesome, nice work!

@benaadams

Collaborator

benaadams commented Feb 14, 2017

Clear+Remove were picked up and merged in #9540 so this will just become Add

@benaadams

Collaborator

benaadams commented Feb 17, 2017

Will close and reopen a different PR for Add

@benaadams benaadams closed this Feb 17, 2017

@benaadams benaadams reopened this Feb 22, 2017

@benaadams benaadams changed the title from Faster List Add & Clear to Faster List Add Feb 22, 2017

@benaadams

Collaborator

benaadams commented Feb 22, 2017

Trims the asm by 10 bytes and doesn't get it permanently branded as no-inline.

Pre

; ============================================================
Marking List`1:Add(long):this as NOINLINE because of unprofitable inline
**************** Inline Tree
Inlines into 060034D4 List`1:Add(long):this
  [0 IL=0025 TR=000050 060034E3] [FAILED: noinline per IL/cached result] List`1:EnsureCapacity(int):this
Budget: initialTime=282, finalTime=282, initialBudget=2820, currentBudget=2820
Budget: initialSize=1818, finalSize=1818
; Assembly listing for method List`1:Add(long):this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] ( 20,  18  )     ref  ->  rsi         this
;  V01 arg1         [V01,T03] (  3,   3  )    long  ->  rdi        
;  V02 loc0         [V02,T02] (  6,   6  )     int  ->  rdx        
;  V03 tmp0         [V03,T01] (  4,   8  )     ref  ->  rax        
;  V04 OutArgs      [V04    ] (  1,   1  )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40

G_M46198_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx
       488BFA               mov      rdi, rdx

G_M46198_IG02:
       8B5618               mov      edx, dword ptr [rsi+24]
       488B4E08             mov      rcx, gword ptr [rsi+8]
       3B5108               cmp      edx, dword ptr [rcx+8]
       750D                 jne      SHORT G_M46198_IG03
       8B5618               mov      edx, dword ptr [rsi+24]
       FFC2                 inc      edx
       488BCE               mov      rcx, rsi
       E800000000           call     List`1:EnsureCapacity(int):this

G_M46198_IG03:
       488B4608             mov      rax, gword ptr [rsi+8]
       8B5618               mov      edx, dword ptr [rsi+24]
       8D4A01               lea      ecx, [rdx+1]
       894E18               mov      dword ptr [rsi+24], ecx
       3B5008               cmp      edx, dword ptr [rax+8]
       7312                 jae      SHORT G_M46198_IG05
       4863D2               movsxd   rdx, edx
       48897CD010           mov      qword ptr [rax+8*rdx+16], rdi
       FF461C               inc      dword ptr [rsi+28]

G_M46198_IG04:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

G_M46198_IG05:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 79, prolog size 6 for method List`1:Add(long):this

Post

**************** Inline Tree
Inlines into 060034D4 List`1:Add(long):this
  [0 IL=0054 TR=000029 060034D5] [FAILED: noinline per IL/cached result] List`1:AddWithResize(long):this
Budget: initialTime=240, finalTime=240, initialBudget=2400, currentBudget=2400
Budget: initialSize=1499, finalSize=1499
; Assembly listing for method List`1:Add(long):this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] ( 12,  10.5)     ref  ->  rcx         this
;  V01 arg1         [V01,T02] (  4,   3  )    long  ->  rdx        
;  V02 loc0         [V02,T03] (  4,   3  )     ref  ->  rax        
;  V03 loc1         [V03,T01] (  7,   5  )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,   1  )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40

G_M46198_IG01:
       4883EC28             sub      rsp, 40
       90                   nop      

G_M46198_IG02:
       488B4108             mov      rax, gword ptr [rcx+8]
       448B4118             mov      r8d, dword ptr [rcx+24]
       FF411C               inc      dword ptr [rcx+28]
       44394008             cmp      dword ptr [rax+8], r8d
       761C                 jbe      SHORT G_M46198_IG04
       458D4801             lea      r9d, [r8+1]
       44894918             mov      dword ptr [rcx+24], r9d
       443B4008             cmp      r8d, dword ptr [rax+8]
       731C                 jae      SHORT G_M46198_IG06
       4963C8               movsxd   rcx, r8d
       488954C810           mov      qword ptr [rax+8*rcx+16], rdx

G_M46198_IG03:
       4883C428             add      rsp, 40
       C3                   ret      

G_M46198_IG04:
       488D0500000000       lea      rax, [(reloc)]

G_M46198_IG05:
       4883C428             add      rsp, 40
       48FFE0               rex.jmp  rax

G_M46198_IG06:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 69, prolog size 5 for method List`1:Add(long):this

PTAL @stephentoub @jkotas
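
For readers who don't want to decode the listings, the following is a rough C# reconstruction of the shape the Post listing implies: a small fast path in Add, with growth moved into a separate non-inlined AddWithResize. It is a sketch inferred from the asm, not the merged source. `SketchList<T>`, the growth policy, and the indexer are assumptions added to keep the example self-contained.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for List<T>, reconstructed from the Post listing.
public sealed class SketchList<T>
{
    private T[] _items = Array.Empty<T>();
    private int _size;
    private int _version;

    public int Count => _size;

    public T this[int index] =>
        (uint)index < (uint)_size ? _items[index] : throw new ArgumentOutOfRangeException(nameof(index));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(T item)
    {
        T[] array = _items;   // mov rax, [rcx+8]
        int size = _size;     // mov r8d, [rcx+24]
        _version++;           // inc dword ptr [rcx+28]
        if ((uint)size < (uint)array.Length)
        {
            _size = size + 1; // fast path: room left, store and return
            array[size] = item;
        }
        else
        {
            AddWithResize(item); // slow path, reached via the tail jump in IG04/IG05
        }
    }

    // Kept out of line so the fast path stays small.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AddWithResize(T item)
    {
        int size = _size;
        // Assumed growth policy: start at 4, then double.
        int newCapacity = _items.Length == 0 ? 4 : _items.Length * 2;
        Array.Resize(ref _items, newCapacity);
        _size = size + 1;
        _items[size] = item;
    }
}
```

Because the slow path is the last call in Add, it can be emitted as a jump rather than a call, which is consistent with the Post listing no longer saving rsi and rdi in the prolog.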

@benaadams

Collaborator

benaadams commented Feb 22, 2017

Can't eliminate the range check, so there is a double check on length; see #9707

@benaadams

Collaborator

benaadams commented Feb 22, 2017

Updated to an aggressively inlined method; the asm is reduced and cleaner, though it still has both

cmp      dword ptr [rax+8], r8d
jbe      SHORT G_M46198_IG04

and

cmp      r8d, dword ptr [rax+8]
jae      SHORT G_M46198_IG06
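
Both compares above use unsigned condition codes (`jbe`, `jae`): the first is the capacity test in Add, the second is the array bounds check on the store. As a small illustration of why a single unsigned comparison is enough to reject both negative and too-large indices, here is a sketch; `RangeCheck` and `InRange` are hypothetical names, not runtime APIs.

```csharp
using System;

public static class RangeCheck
{
    // Casting to uint maps negative values to very large ones, so one
    // comparison covers both "index < 0" and "index >= length".
    public static bool InRange(int index, int length) => (uint)index < (uint)length;
}
```

Since the two checks in the listing test the same condition on the same values, the second is redundant whenever the first has already passed.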

@jkotas jkotas merged commit 7d9e017 into dotnet:master Feb 23, 2017

13 checks passed

CentOS7.1 x64 Debug Build and Test: Build finished.
FreeBSD x64 Checked Build: Build finished.
Linux ARM Emulator Cross Debug Build: Build finished.
Linux ARM Emulator Cross Release Build: Build finished.
OSX x64 Checked Build and Test: Build finished.
Ubuntu x64 Checked Build and Test: Build finished.
Ubuntu x64 Formatting: Build finished.
Windows_NT arm Cross Debug Build: Build finished.
Windows_NT arm Cross Release Build: Build finished.
Windows_NT x64 Debug Build and Test: Build finished.
Windows_NT x64 Formatting: Build finished.
Windows_NT x64 Release Priority 1 Build and Test: Build finished.
Windows_NT x86 Checked Build and Test: Build finished.

@benaadams

Collaborator

benaadams commented Mar 21, 2017

Second check now elided by #9773

@jamesqo

Contributor

jamesqo commented Mar 21, 2017

@benaadams Great, then other collections can take advantage of this.

@omariom

Contributor

omariom commented Mar 21, 2017

@benaadams
If you change the order..

array[size] = item;
_size = size + 1;

it may reuse r8d.
Like

inc      r8d
mov      dword ptr [rcx+24], r8d

A couple of bytes less :)
Though not sure about perf.

@benaadams

Collaborator

benaadams commented Mar 21, 2017

@omariom for structs containing refs and for classes, array[size] = item; becomes an assignment that goes through the GC write barrier, so I was approaching it as getting the int assignment into the CPU pipeline before the barrier. Might not make much difference though.

jorive added a commit to guhuro/coreclr that referenced this pull request May 4, 2017

@karelz karelz added this to the 2.0.0 milestone Aug 28, 2017

@benaadams benaadams deleted the benaadams:list-clear branch Mar 27, 2018