Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: optimize away overflow check in some span non portable casts #9426

Open
AndyAyersMS opened this issue Dec 12, 2017 · 11 comments
Open

JIT: optimize away overflow check in some span non portable casts #9426

AndyAyersMS opened this issue Dec 12, 2017 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@AndyAyersMS
Copy link
Member

When casting a span to a larger type via NonPortableCast, the result should not be able to overflow and the jit should be able to eliminate the check and throw.

This came up in some span conversion work over in CoreFx.

It might be hard to catch all cases of this since the current code for NonPortableCast<FromType, ToType> first multiplies the length by sizeof<FromType> and then divides by sizeof<ToType> but when the from type is char the resulting length will not overflow an int.
For instance:

ReadOnlySpan<byte> span = ...;
ReadOnlySpan<long> longSpan = span.NonPortableCast<byte, long>();

Currently this does a fairly expensive and overflow check;

G_M41841_IG01:
       4883EC28             sub      rsp, 40

G_M41841_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       4863D2               movsxd   rdx, edx
       4C8BC2               mov      r8, rdx
       49C1F83F             sar      r8, 63
       4983E007             and      r8, 7
       4C03C2               add      r8, rdx
       49C1F803             sar      r8, 3
       4981F8FFFFFF7F       cmp      r8, 0x7FFFFFFF
       7F1A                 jg       SHORT G_M41841_IG04
       4981F800000080       cmp      r8, 0xFFFFFFFF80000000
       7C11                 jl       SHORT G_M41841_IG04
       498BD0               mov      rdx, r8
       488901               mov      bword ptr [rcx], rax
       895108               mov      dword ptr [rcx+8], edx
       488BC1               mov      rax, rcx

G_M41841_IG03:
       4883C428             add      rsp, 40
       C3                   ret

G_M41841_IG04:
       E8895EC75F           call     CORINFO_HELP_OVERFLOW
       CC                   int3

We could probably also fix this in the source code by comparing sizes and only using checked math for the cases where size increases.

category:cq
theme:basic-cq
skill-level:expert
cost:small

@mikedn
Copy link
Contributor

mikedn commented Dec 19, 2017

We could probably also fix this in the source code by comparing sizes and only using checked math for the cases where size increases.

It should also use unsigned arithmetic, division by const (especially power of 2) is cheaper. And checked cast from ulong to int is also cheaper since it only needs to check the upper bound.

I'll give it a try, anyway I don't think the JIT will learn to optimize this code properly anytime soon...

@mikedn
Copy link
Contributor

mikedn commented Dec 19, 2017

Hmm, something like this:

if (Unsafe.SizeOf<TFrom>() == Unsafe.SizeOf<TTo>())
{
    newLength = source.Length;
}
else if (Unsafe.SizeOf<TFrom>() == 1)
{
    newLength = (int)((uint)source.Length / (uint)Unsafe.SizeOf<TTo>());
}
else
{
    ulong newLongLength = (ulong)(uint)source.Length * (ulong)Unsafe.SizeOf<TFrom>() / (ulong)Unsafe.SizeOf<TTo>();

    if (Unsafe.SizeOf<TFrom>() < Unsafe.SizeOf<TTo>())
    {
        newLength = (int)newLongLength;
    }
    else
    {
        newLength = checked((int)newLongLength);
    }
}

A bit large but fortunately it is still inlined. The entire NonPortableCast from your example reduces to shr ecx, 3. The reverse, long to byte, generates

8BC9                 mov      ecx, ecx
48C1E103             shl      rcx, 3
48F7C100000080       test     rcx, 0xFFFFFFFF80000000
7517                 jne      SHORT G_M13600_IG07

@omariom
Copy link
Contributor

omariom commented Dec 20, 2017

@mikedn Looks like you don't need the second if

@mikedn
Copy link
Contributor

mikedn commented Dec 20, 2017

else if (Unsafe.SizeOf<TFrom>() == 1)?

Could be, need to check how well those long operations are handled on 32 bit.

It may also be possible to improve the JIT certain areas to reduce the number of special cases. One significant CQ issues is related to signed arithmetic and that may be difficult to deal with in the JIT. But other issues this code exposes might be more manageable.

@mikedn
Copy link
Contributor

mikedn commented Dec 30, 2017

Could be, need to check how well those long operations are handled on 32 bit.

Yeah, it looks like JIT's handling of 64 bit mul.un and div.un on 32 bit targets sucks. At the very least it should remove x div.un 1 but it does not 😕

Without special cases for byte sized to/from NonPortableCast<int, byte> produces something like:

G_M61089_IG01:
       55           push     ebp
       8BEC         mov      ebp, esp
       57           push     edi
       56           push     esi
       83EC10       sub      esp, 16
       8BF1         mov      esi, ecx
G_M61089_IG02:
       8D4508       lea      eax, bword ptr [ebp+08H]
       8B38         mov      edi, dword ptr [eax]
       8B4004       mov      eax, dword ptr [eax+4]
       33D2         xor      edx, edx
       52           push     edx
       50           push     eax
       6A00         push     0
       6A04         push     4
       E8F93D5A0A   call     CORINFO_HELP_LMUL
       8945F0       mov      dword ptr [ebp-10H], eax
       8955F4       mov      dword ptr [ebp-0CH], edx
       8B45F0       mov      eax, dword ptr [ebp-10H]
       8B55F4       mov      edx, dword ptr [ebp-0CH]
       52           push     edx
       50           push     eax
       6A00         push     0
       6A01         push     1
       E8E268390A   call     CORINFO_HELP_ULDIV
       8945E8       mov      dword ptr [ebp-18H], eax
       8955EC       mov      dword ptr [ebp-14H], edx
       8B45E8       mov      eax, dword ptr [ebp-18H]
       8B55EC       mov      edx, dword ptr [ebp-14H]
       85C0         test     eax, eax
       7812         js       SHORT G_M61089_IG04
       85D2         test     edx, edx
       750E         jne      SHORT G_M61089_IG04
       893E         mov      dword ptr [esi], edi
       894604       mov      dword ptr [esi+4], eax
G_M61089_IG03:
       8D65F8       lea      esp, [ebp-08H]
       5E           pop      esi
       5F           pop      edi
       5D           pop      ebp
       C20800       ret      8
G_M61089_IG04:
       E82BA7390A   call     CORINFO_HELP_OVERFLOW
       CC           int3

@mikedn
Copy link
Contributor

mikedn commented Dec 31, 2017

Attempting to navigate around all the problems the JIT has with this code (and boy, does it have a few problems...) risks making the code too complicated. I think that the best option is to handle only those cases that are likely to be useful.

  • sizeof(From) == sizeof(To) - treating float as int or int as RGBA struct containing 4 bytes seems useful. This cast should practically be a no op because the length does not change.
  • sizeof(From) == 1 - treating bytes as ints, floats, RGBAs can be useful, provided that alignment and endianness issues are accounted for. This cast requires only a integer division by constant.
  • sizeof(To) == 1 - AsBytes should cover most needs. Casting to sbyte or a byte sized struct should be a rather rare occurrence. Don't bother special casing this.
  • sizeof(From) < sizeof(To) - with sizeof(From) == 1 already handled there doesn't seem to be much value in handling this? Casting from short to int?!?
  • sizeof(From) > sizeof(To) - this one too doesn't seem valuable. Anyway there's not a lot that can be done about it, this needs 64 bit mul/div and checked cast to int no matter what. What helps here is changing all the computation to use unsigned integers.

So, we can do something like:

int newLength;
if (Unsafe.SizeOf<TFrom>() == Unsafe.SizeOf<TTo>())
{
    newLength = source.Length;
}
else if (Unsafe.SizeOf<TFrom>() == 1)
{
    newLength = (int)((uint)source.Length / (uint)Unsafe.SizeOf<TTo>());
}
else
{
    ulong newUInt64Length = UInt64MulDiv((uint)source.Length, (uint)Unsafe.SizeOf<TFrom>(), (uint)Unsafe.SizeOf<TTo>());
    newLength = checked((int)newUInt64Length);
}
...

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong UInt64MulDiv(uint x, uint y, uint z) => (ulong)x * (ulong)y / (ulong)z;

The main purpose of else if (Unsafe.SizeOf<TFrom>() == 1) is to avoid 64 bit division on 32 bit targets. It could probably be ifdefed out on 64 bit targets to keep the IL size down.

UInt64MulDiv too caters to the needs of 32 bit targets, arranging code this way allows the JIT to generate a "big mul" multiply (a single MUL instruction instead of a helper call).

@jkotas Opinions?

@mikedn
Copy link
Contributor

mikedn commented Dec 31, 2017

A list of the various problems the JIT has with this (original and proposed) code could be useful:

  • A significant problem in the original code is the use of signed integer operations. Both the checked cast to int and the division by constant (especially when the constant is a power of 2) are more expensive when signed arithmetic is used. The JIT does next to nothing in this area and improving things in the general case isn't easy. Perhaps it could handle just the special case of array and span lengths that are known to be positive.
  • Once the signedness issue is avoided another expression optimization issue appears. The JIT doesn't know that x * c / c is x because x * c cannot overflow so for sizeof(From) = sizeof(To) = 4 the following code is generated:
       8BD2                 mov      edx, edx
       48C1E202             shl      rdx, 2
       48C1EA02             shr      rdx, 2
       48F7C200000080       test     rdx, 0xFFFFFFFF80000000
       750E                 jne      SHORT G_M30356_IG04
  • Some 64 to 32 bit narrowing seems to be missing. When sizeof(From) = 1 the JIT generates:
       8BD2                 mov      edx, edx ; zero extend
       48C1EA02             shr      rdx, 2
       48F7C200000080       test     rdx, 0xFFFFFFFF80000000
       750E                 jne      SHORT G_M49564_IG04
  • On 32 bit targets there are various issues arising from early morphing of 64 bit operations into helper calls. You can end up with things like CORINFO_HELP_LMUL_OVF(x, 1) or CORINFO_HELP_LDIV(x, 1). This also prevents optimization of mul/div by power of 2 using shld/shrd instructions.

@jkotas
Copy link
Member

jkotas commented Dec 31, 2017

Sounds good to me.

UInt64MulDiv too caters to the needs of 32 bit targets

Is there no way to do this without a helper method?

@mikedn
Copy link
Contributor

mikedn commented Dec 31, 2017

Is there no way to do this without a helper method?

Anything that ensures that the calls happen before the int->long casts will do. For example, something like:

uint length = (uint)source.Length;
uint fromSize = (uint)Unsafe.SizeOf<TFrom>();
uint toSize = ...
ulong newUInt64Length = (ulong)length * (ulong)fromSize ...

The problem is that calls tends tend to spill the stack to local variables and then you end up multiplying 2 long local variables. The JIT has no idea that those variables were initialized from int values converted to long and fails to perform the "big mul" optimization.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@TIHan
Copy link
Contributor

TIHan commented Feb 2, 2023

The API NonPortableCast does not exist anymore. @AndyAyersMS - maybe this has been renamed?

@AndyAyersMS
Copy link
Member Author

Per dotnet/coreclr#15941 this is now called MemoryMarshal.Cast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants