Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential CQ improvements in Dictionary<K,T> #8953

Closed
AndyAyersMS opened this issue Sep 15, 2017 · 17 comments
Closed

Potential CQ improvements in Dictionary<K,T> #8953

AndyAyersMS opened this issue Sep 15, 2017 · 17 comments

Comments

@AndyAyersMS
Copy link
Member

Dictionary<K,T> has some inefficient CQ in key inner loops, in part because the jit has to assume that calls to the comparer may side effect the dictionary itself. For instance in FindEntry's search loop:

for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = entries[i].next)
{
   if (entries[i].hashCode == hashCode && comparer.Equals(entries[i].key, key)) return i;
}

the jit must assume that both comparer and entries are modified by the call to comparer.Equals. This yields native code like the following:

G_M51456_IG03:
       mov      rcx, gword ptr [rsi+16]         // reload entries
       cmp      ebp, dword ptr [rcx+8]          // entries.Length bounds check
       jae      SHORT G_M51456_IG09
       movsxd   rdx, ebp
       lea      r14, [rdx+2*rdx]
       lea      rcx, bword ptr [rcx+8*r14+16]   // & entries[i]
       cmp      dword ptr [rcx+16], ebx         // entries[i].key == key?
       jne      SHORT G_M51456_IG04             // nope, try next
       mov      rdx, gword ptr [rsi+24]         // reload comparer
       mov      r8, qword ptr [rcx+8]           // entries[i].hashCode
       mov      rcx, rdx
       mov      rdx, r8
       mov      r8, rdi                         // key
       lea      r11, [(reloc)]                  // why not CSE'd?
       cmp      dword ptr [rcx], ecx            // null check on comparer (quasi redundant)
       call     qword ptr [r11]System.Collections.Generic.IEqualityComparer`1[Int64][System.Int64]:Equals(long,long):bool:this
       test     eax, eax
       jne      SHORT G_M51456_IG07
G_M51456_IG04:
       mov      rax, gword ptr [rsi+16]          // reload entries
       cmp      ebp, dword ptr [rax+8]           // redundant bounds check for i = entries[i].next
       jae      SHORT G_M51456_IG09
       mov      ebp, dword ptr [rax+8*r14+36]    // update i 
       test     ebp, ebp                         // continue if i >= 0
       jge      SHORT G_M51456_IG03

So at IG04, the jit re-fetches entries, then its length, and then does a redundant bounds check.

There is also some register shuffling before the call to the comparer; this is caused in part by the fact that the comparer object is also be re-fetched from the dictionary before the call.

These issues are unlikely to go away with proposed/conceptual changes to allow default comparer dictionaries to devirtualize, since the possibility of a non-default comparer with unknown effects will still be present. One could perhaps avoid this by unswitching on the "is default" across the entire loop.

Manually caching the comparer and equals array into locals as in the following:

// Cache these two fields in locals so that it is clear to the jit
// that they are not modified by the calls to the comparer.
IEqualityComparer<TKey> lComparer = comparer;
Entry[] lEntries = entries;

int hashCode = lComparer.GetHashCode(key) & 0x7FFFFFFF;
for (int i = buckets[hashCode % buckets.Length]; i >= 0; i = lEntries[i].next)
{
    if (lEntries[i].hashCode == hashCode && lComparer.Equals(lEntries[i].key, key)) return i;
}

avoids these issues and produces tighter inner loop code (statically, 6 instructions less all told)

G_M51456_IG03:
       cmp      esi, r15d
       jae      SHORT G_M51456_IG09
       movsxd   rdx, esi
       lea      rdx, [rdx+2*rdx]
       lea      r12, bword ptr [rbp+8*rdx+16]
       cmp      dword ptr [r12+16], r14d
       jne      SHORT G_M51456_IG04
       mov      rdx, qword ptr [r12+8]
       mov      rcx, rbx
       mov      r8, rdi
       lea      r11, [(reloc)]
       cmp      dword ptr [rcx], ecx    /// seems like this is now fully redundant
       call     qword ptr [r11]System.Collections.Generic.IEqualityComparer`1[Int64][System.Int64]:Equals(long,long):bool:this
       test     eax, eax
       jne      SHORT G_M51456_IG07
G_M51456_IG04:
       mov      esi, dword ptr [r12+20]
       test     esi, esi
       jge      SHORT G_M51456_IG03

However this alters dictionary behavior in cases where the comparer side effects the dictionary. I am not sure if such behaviors are worth preserving, and it's not clear they are at all useful, but they are currently permitted.

If this sort of change seems reasonable, I'll go and apply it to the other methods that use the comparer, and do some benchmarking to see what the overall impact is on performance.

Still need to look at why the null check on the comparer remains; it should now be fully redundant.

@stephentoub @jkotas @benaadams @dotnet/jit-contrib thoughts?

@jkotas
Copy link
Member

jkotas commented Sep 15, 2017

However this alters dictionary behavior in cases where the comparer side effects the dictionary

I do not think you need to worry about this. It is unpredictable behavior.

@stephentoub
Copy link
Member

However this alters dictionary behavior in cases where the comparer side effects the dictionary. I am not sure if such behaviors are worth preserving, and it's not clear they are at all useful, but they are currently permitted.

I think it's an acceptable change. Yes, it's technically breaking, but such could would be extremely brittle and terribly suspect.

@mikedn
Copy link
Contributor

mikedn commented Sep 15, 2017

and then does a redundant bounds check

Nice, didn't notice that one before. FWIW I have a plan to get rid of the remaining 2 range checks.

@benaadams
Copy link
Member

A related item that may help with devirtualization is the default comparer can be a virtual type rather than an interface type?

e.g.

private static EqualityComparer<TKey> s_defaultComparer;
private IEqualityComparer<TKey> _comparer;
EqualityComparer<TKey> defaultComparer = s_defaultComparer;
IEqualityComparer<TKey> comparer = _comparer;
int hashCode = (comparer?.GetHashCode(key) ?? defaultComparer.GetHashCode(key)) & 0x7FFFFFFF;

Note: default string comparer is a bit of a special case as it switches type in use.

@AndyAyersMS
Copy link
Member Author

Will also need to look more closely at the case where key is a ref type -- the above helps it somewhat but the loop still looks overly complex (at least when prejitting). Possibly the helper call is killing too much. Also odd that the loop body ends up non-contiguous (in both before & after).

; Assembly listing for method System.Collections.Generic.Dictionary`2[__Canon,__Canon][System.__Canon,System.__Canon]:FindEntry(ref):int:this
; AFTER
G_M14721_IG05:
       mov      ecx, dword ptr [rbp+8]
       cmp      r13d, ecx
       jae      G_M14721_IG14
       movsxd   rcx, r13d
       lea      rcx, [rcx+2*rcx]
       lea      rax, bword ptr [rbp+8*rcx+16]
       mov      bword ptr [rsp+20H], rax
       cmp      dword ptr [rax+16], r12d
       jne      SHORT G_M14721_IG07
       mov      r8, gword ptr [rbp+8*rcx+16]
       mov      gword ptr [rsp+28H], r8
       mov      rcx, r14
       mov      r11, qword ptr [r15+136]
       test     r11, r11
       jne      SHORT G_M14721_IG10
       lea      rdx, [(reloc)]
       call     CORINFO_HELP_RUNTIMEHANDLE_CLASS
       mov      r11, rax
       mov      r8, gword ptr [rsp+28H]
G_M14721_IG06:
       mov      rcx, rbx
       mov      rdx, r8
       mov      r8, rdi
       cmp      dword ptr [rcx], ecx
       call     qword ptr [r11]
       test     eax, eax
       jne      SHORT G_M14721_IG11
G_M14721_IG07:
       mov      rax, bword ptr [rsp+20H]
       mov      r13d, dword ptr [rax+20]
       test     r13d, r13d
       jge      SHORT G_M14721_IG05
...
G_M14721_IG10:
       mov      r8, gword ptr [rsp+28H]
       jmp      SHORT G_M14721_IG06

Seems like the helper call should be hoistable

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Sep 16, 2017

Have added caching to FindEntry, TryInsert, and both flavors of Remove. Diffs not always as good as they should be; seeing what look like unnecessary spills, shuffles, and poor instruction selection. For instance:

; Assembly listing for method System.Collections.Generic.Dictionary`2[Char,__Canon][System.Char,System.__Canon]:TryInsert(char,ref,ubyte):bool:this
; AFTER
G_M62716_IG04:
       cmp      r10d, dword ptr [r15+8]
       jae      G_M62716_IG18
       mov      dword ptr [rsp+3CH], r10d
       movsxd   rdx, r10d
       lea      rdx, [rdx+2*rdx]
       lea      r11, bword ptr [r15+8*rdx+16]
       cmp      dword ptr [r11+8], r13d
       jne      SHORT G_M62716_IG05
       mov      bword ptr [rsp+20H], r11
       movzx    rdx, word  ptr [r11+16]
       mov      r8d, r12d
       mov      rcx, r14
       lea      r11, [(reloc)]
       cmp      dword ptr [rcx], ecx
       call     qword ptr [r11]System.Collections.Generic.IEqualityComparer`1[Char][System.Char]:Equals(char,char):bool:this
       test     eax, eax
       mov      r11, bword ptr [rsp+20H]
       jne      SHORT G_M62716_IG07
G_M62716_IG05:
       mov      r9d, dword ptr [rsp+40H]
       inc      r9d
       mov      dword ptr [rsp+40H], r9d
       mov      r10d, dword ptr [r11+12]
       mov      edx, r10d
       test     edx, edx
       mov      r10d, edx
       jge      SHORT G_M62716_IG04

r11 was a poor choice for &entries[i] and forces a spill/reload. The compiler can form inc [reg+disp] but seems to choose not to here. The allocator could have avoided the subsequent back and forth with edx.

So may just look at updating FindEntry for now, since that seems to fare somewhat better.

@AndyAyersMS
Copy link
Member Author

Changes in case anyone else is interested in poking at this: master...AndyAyersMS:DictionaryHacks

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Sep 16, 2017

Null check comes from this bit of code in lower.cpp:

GenTree* Lowering::LowerVirtualStubCall(GenTreeCall* call)
{
    ...
#ifdef _TARGET_64BIT_
    // Non-tail calls: Jump Stubs are not taken into account by VM for mapping an AV into a NullRef
    // exception. Therefore, JIT needs to emit an explicit null check.  Note that Jit64 too generates
    // an explicit null check.
    //
    // Tail calls: fgMorphTailCall() materializes null check explicitly and hence no need to emit
    // null check.

    // Non-64-bit: No need to null check the this pointer - the dispatch code will deal with this.
    // The VM considers exceptions that occur in stubs on 64-bit to be not managed exceptions and
    // it would be difficult to change this in a way so that it affects only the right stubs.

    if (!call->IsTailCallViaHelper())
    {
        call->gtFlags |= GTF_CALL_NULLCHECK;
    }
#endif

Seems like the kind of thing that should be set in the importer or morph rather than implied by the call type, so that if it ever gets cleared we actually know it's not needed. As things are now the non-null assertion from the initial comparer call doesn't propagate here because assertion prop thinks this subsequent call doesn't need a null check.

@jkotas
Copy link
Member

jkotas commented Sep 16, 2017

it would be difficult to change this in a way so that it affects only the right stubs.

This is not accurate statement. There is nothing special about x64 that makes this hard. We have it for every other platform. It should be a matter of copying AdjustContextForVirtualStub from say arm or arm64 to amd64 and adjusting it to shape of the VSD stubs or adjusting the code of VSD stubs to match AdjustContextForVirtualStub.

@benaadams
Copy link
Member

As in-function constraints don't currently exist dotnet/csharplang#905

Using the pattern that Vector uses

if (typeof(TKey) == typeof(byte))
{
    return (byte)(object)key0 == (byte)(object)key1;
}
else if (typeof(TKey) == typeof(sbyte))
{
    return (sbyte)(object)key0 == (sbyte)(object)key1;
}
else if (typeof(TKey) == typeof(ushort))
{
    return (ushort)(object)key0 == (ushort)(object)key1;
}
else if (typeof(TKey) == typeof(short))
{
    return (short)(object)key0 == (short)(object)key1;
}
else if (typeof(TKey) == typeof(uint))
{
    return (uint)(object)key0 == (uint)(object)key1;
}
else if (typeof(TKey) == typeof(int))
{
    return (int)(object)key0 == (int)(object)key1;
}
else if (typeof(TKey) == typeof(ulong))
{
    return (ulong)(object)key0 == (ulong)(object)key1;
}
else if (typeof(TKey) == typeof(long))
{
    return (long)(object)key0 == (long)(object)key1;
}
else if (typeof(TKey) == typeof(IntPtr))
{
    return (IntPtr)(object)key0 == (IntPtr)(object)key1;
}
else if (typeof(TKey) == typeof(UIntPtr))
{
    return (UIntPtr)(object)key0 == (UIntPtr)(object)key1;
}
else if (typeof(TKey) == typeof(Guid))
{
    return (Guid)(object)key0 == (Guid)(object)key1;
}
// ...

to get no virtual calls for default comparer and the BCL valuetype primitives master...benaadams:dictionary; won't apply to generic structs though.

FindEntry for a char key doesn't look too bad..?

G_M38479_IG01:
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC58             sub      rsp, 88
       488BF1               mov      rsi, rcx

G_M38479_IG02:
       BFFFFFFFFF           mov      edi, -1
       488B5E08             mov      rbx, gword ptr [rsi+8]
       4885DB               test     rbx, rbx
       0F84AE000000         je       G_M38479_IG10
       488B4E20             mov      rcx, gword ptr [rsi+32]
       4885C9               test     rcx, rcx
       7512                 jne      SHORT G_M38479_IG03
       0FB7EA               movzx    rbp, dx
       8BD5                 mov      edx, ebp
       8BCA                 mov      ecx, edx
       C1E110               shl      ecx, 16
       448BF2               mov      r14d, edx
       440BF1               or       r14d, ecx
       EB14                 jmp      SHORT G_M38479_IG04 ; non-default comparer

;G_M38479_IG03: ; non-default comparer
;       0FB7EA               movzx    rbp, dx
;       8BD5                 mov      edx, ebp
;       4C8D1D00000000       lea      r11, [(reloc)]
;       3909                 cmp      dword ptr [rcx], ecx
;       41FF13               call     qword ptr [r11]IEqualityComparer`1:GetHashCode(char):int:this
;       448BF0               mov      r14d, eax

G_M38479_IG04:
       4181E6FFFFFF7F       and      r14d, 0x7FFFFFFF
       4C8B7E10             mov      r15, gword ptr [rsi+16]
       448B4308             mov      r8d, dword ptr [rbx+8]
       418BC6               mov      eax, r14d
       99                   cdq      
       41F7F8               idiv     edx:eax, r8d
       413BD0               cmp      edx, r8d
       7377                 jae      SHORT G_M38479_IG12
       4C63C2               movsxd   r8, edx
       428B5C8310           mov      ebx, dword ptr [rbx+4*r8+16]
       85DB                 test     ebx, ebx
       7C58                 jl       SHORT G_M38479_IG10

G_M38479_IG05:
       413B5F08             cmp      ebx, dword ptr [r15+8]
       7365                 jae      SHORT G_M38479_IG12
       4C63C3               movsxd   r8, ebx
       4F8D0440             lea      r8, [r8+2*r8]
       4F8D64C710           lea      r12, bword ptr [r15+8*r8+16]
       4539742408           cmp      dword ptr [r12+8], r14d
       7532                 jne      SHORT G_M38479_IG08
       410FB7542410         movzx    rdx, word  ptr [r12+16]
       488B4E20             mov      rcx, gword ptr [rsi+32]
       4885C9               test     rcx, rcx
       750C                 jne      SHORT G_M38479_IG06 ; non-default comparer
       3BD5                 cmp      edx, ebp
       410F94C5             sete     r13b
       450FB6ED             movzx    r13, r13b
       EB12                 jmp      SHORT G_M38479_IG07

;G_M38479_IG06: ; non-default comparer
;       448BC5               mov      r8d, ebp
;       4C8D1D00000000       lea      r11, [(reloc)]
;       3909                 cmp      dword ptr [rcx], ecx
;       41FF13               call     qword ptr [r11]IEqualityComparer`1:Equals(char,char):bool:this
;       448BE8               mov      r13d, eax

G_M38479_IG07:
       4585ED               test     r13d, r13d
       750B                 jne      SHORT G_M38479_IG09

G_M38479_IG08:
       418B5C240C           mov      ebx, dword ptr [r12+12]
       85DB                 test     ebx, ebx
       7DAC                 jge      SHORT G_M38479_IG05
       EB02                 jmp      SHORT G_M38479_IG10

G_M38479_IG09:
       8BFB                 mov      edi, ebx

G_M38479_IG10:
       8BC7                 mov      eax, edi

G_M38479_IG11:
       4883C458             add      rsp, 88
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      

G_M38479_IG12:
       E800000000           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

@benaadams
Copy link
Member

benaadams commented Sep 17, 2017

TIL: You break dictionary, you break every test 🙀

@AndyAyersMS
Copy link
Member Author

Been looking for a good set of perf tests. The ones from CoreFX are a plausible starting point but they may be too noisy to use in low-level CQ tuning. Need to play around with them some more.

On that note, consecutive integer keys are probably best-case for collision avoidance. Would be nice to have test inputs that produce "representative" collision chains, if such a thing can be defined.

benchmarkgames/k-nucleotide is ostensibly a hash table perf test so also warrants a look.

Also want to look deeper at what the actual cost of the virtual stub dispatch is relative to everything else to see whether there's any hope that shaving off a handful of instructions (which is what we'd see on a first-bucket hit/miss w/o being able to inline away the hash & equals) will show up as a perf improvement. Since VSD is stateful, this requires some care.

@benaadams
Copy link
Member

Also want to look deeper at what the actual cost of the virtual stub dispatch is relative to everything else to see whether there's any hope that shaving off a handful of instructions

VSD looks like a double call for monomorphic sites and could be reduced to a single call? https://github.com/dotnet/coreclr/issues/8819
(Though this is just an armchair opinion)

@benaadams
Copy link
Member

benaadams commented Mar 3, 2018

@AndyAyersMS is this resolved with dotnet/coreclr#15419?

@AndyAyersMS
Copy link
Member Author

Could very well be.

It is probably worth looking at the new code gen to see if the minor issues mentioned above are cleared up (or now perhaps more prominent ). I'm be happy to do it, if nobody else gets to it in the next few days.

@danmoseley
Copy link
Member

@AndyAyersMS is this fixed now do you know?

@AndyAyersMS
Copy link
Member Author

Perf is definitely improved from when I opened the issue.

There are still things here we can improve further but we should take a fresh look. So we might as well close this issue and open a new one when we do the analysis and come up with ideas.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants