Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic Type Check Inlining #7651

Closed
tarekgh opened this issue Mar 18, 2017 · 9 comments · Fixed by #52708
Closed

Generic Type Check Inlining #7651

tarekgh opened this issue Mar 18, 2017 · 9 comments · Fixed by #52708
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@tarekgh
Copy link
Member

tarekgh commented Mar 18, 2017

@manixrock commented on Sat Mar 18 2017

When writing performance-critical code it often leads to code duplication.

Let's say we wanted to make a method that applies an effect on an image, in our case we want to apply a gray-scale and an optional invert. The code could look like this:

public class Effect
{
    public static void Apply(Bitmap bmp, GreyscaleMethod grayscaleMethod, bool invert)
    {
        // read bitmap data
        int w = bmp.Width, h = bmp.Height;
        var data = bmp.LockBits(new Rectangle(0, 0, w, h), ImageLockMode.ReadWrite, bmp.PixelFormat);
        if (bmp.PixelFormat != PixelFormat.Format32bppArgb)
            throw new InvalidOperationException($"Unsupported pixel format: {bmp.PixelFormat}");
        var s = data.Stride;

        unsafe
        {
            var ptr = (byte*)data.Scan0;
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    // read RGB (not quite optimized, but that's not the point)
                    int offset = y * s + x;
                    int r = ptr[offset + 1];
                    int g = ptr[offset + 2];
                    int b = ptr[offset + 3];

                    // apply effects per pixel
                    if (grayscaleMethod == GreyscaleMethod.Average) {
                        r = g = b = (r + g + b) / 3;
                    } else if (grayscaleMethod == GreyscaleMethod.Luminance) {
                        r = g = b = (int)(r * 0.2126 + g * 0.7152 + b * 0722);
                    }
                    if (invert) {
                        r = 255 - r;
                        g = 255 - g;
                        b = 255 - b;
                    }

                    // write RGB
                    ptr[offset + 1] = (byte)r;
                    ptr[offset + 2] = (byte)g;
                    ptr[offset + 3] = (byte)b;
                }
            }
        }

        bmp.UnlockBits(data);
    }
}

public enum GreyscaleMethod
{
    None,
    Average,
    Luminance,
}

However if we expect the invert to be only rarely used, that code is slower than it can be because of the constant if (invert) check inside the performance-critical inner loop. We could of course create another method that gets called when invert is false, but that leads to code duplication, is harder to maintain, etc.

What we would need to have both optimal performance and code reuse is a way to get the compiler to generate 2 methods at compile time depending on the value of invert. Without any new syntax the code might look like this:

public class Effect
{
    private static void Apply<invert>(Bitmap bmp, GreyscaleMethod grayscaleMethod)
        where invert : Bool
    {
        // [...] read bitmap data

        unsafe
        {
            var ptr = (byte*)data.Scan0;
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    // [...] read RGB

                    // apply effects per pixel
                    if (grayscaleMethod == GreyscaleMethod.Average) {
                        r = g = b = (r + g + b) / 3;
                    } else if (grayscaleMethod == GreyscaleMethod.Luminance) {
                        r = g = b = (int)(r * 0.2126 + g * 0.7152 + b * 0722);
                    }
                    if (typeof(invert) == typeof(True)) { // type check
                        r = 255 - r;
                        g = 255 - g;
                        b = 255 - b;
                    }

                    // [...] write RGB
                }
            }
        }

        bmp.UnlockBits(data);
    }
}

public class False : Bool { }
public class True : Bool { }
public class Bool { }

Now that check if a compile-time constant, so the compiler could remove the type-condition and its block away when invert is False, and remove the type-condition but leave its block when True, leading to performance optimal code in both cases without code duplication.

However does the compiler (or even the JIT) do that? According to this stackoverflow answer it currently does not.

This is a proposal to improve the compiler (or JIT) to do that sort of code inlining (through method duplication) for compile-time constant checks.

If this were implemented, we can optimize the code even further by doing the same with the grayscaleMethod parameter:

public class Effect
{
    private static void Apply<invert, greyscaleMethod>(Bitmap bmp)
        where invert : Bool
        where greyscaleMethod : GreyscaleMethodEnum
    {
        // [...] read bitmap data

        unsafe
        {
            var ptr = (byte*)data.Scan0;
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    // [...] read RGB

                    // apply effects per pixel
                    if (typeof(greyscaleMethod) == typeof(GreyscaleMethod_Average)) {
                        r = g = b = (r + g + b) / 3;
                    } else if (typeof(greyscaleMethod) == typeof(GreyscaleMethod_Luminance)) {
                        r = g = b = (int)(r * 0.2126 + g * 0.7152 + b * 0722);
                    }
                    if (typeof(invert) == typeof(True)) {
                        r = 255 - r;
                        g = 255 - g;
                        b = 255 - b;
                    }

                    // [...] write RGB
                }
            }
        }

        bmp.UnlockBits(data);
    }
}

public class GreyscaleMethod_None : GreyscaleMethodEnum { }
public class GreyscaleMethod_Average : GreyscaleMethodEnum { }
public class GreyscaleMethod_Luminance : GreyscaleMethodEnum { }
public class GreyscaleMethodEnum { }

Doing the same optimization through code duplication would require 6 methods, and the number would increase exponentially with the number of parameters. However the compiler would know to only generate the methods which are actually used in the code.


@mikedn commented on Sat Mar 18 2017

This is a runtime/compiler/language issue (coreclr/roslyn repositories), the framework (corefx) doesn't have much to do with this.

However does the compiler (or even the JIT) do that? According to this stackoverflow answer it currently does not.

The .NET JIT usually does that if you use value types. As in:

public struct False : IBool { }
public struct True : IBool { }
public interface IBool { }

When generic arguments are value types the JIT has to generate specialized code for each value type. That enables some optimization including recognizing that typeof(invert) == typeof(True) is always true when invert = True.

Though there recently a bug was introduced that prevented this optimization from working. It's fixed now in the latest CoreCLR builds but it's still present in some .NET Framework builds (e.g. the one that comes with the current Win 10 Preview).

Doing the same optimization through code duplication would require 6 methods, and the number would increase exponentially with the number of parameters.

That's why when the code is shared between instantiations when reference types are used.

However the compiler would know the only generate the methods which are actually used in the code.

Well, if you call all variants it will still have to generate code for all of them. It is what it is, a trade off between code size and performance.

category:cq
theme:generics
skill-level:expert
cost:large

@svick
Copy link
Contributor

svick commented Mar 18, 2017

As already mentioned by @mikedn, CoreCLR already does this kind of optimization for value types and intentionally doesn't do it for reference types.

So, what is actually your proposal? To stop sharing implementations of generic methods for reference types? The implementation does that for a reason (to avoid code explosion) and I doubt such change would happen, especially considering that what you're doing is rare and can be worked around easily (by using value types).

To show that the optimization is done, consider this code:

public interface IBool { }
public struct True : IBool { }
public struct False : IBool { }

[MethodImpl(MethodImplOptions.NoInlining)]
static bool IsTrue<T>() where T : struct, IBool => typeof(T) == typeof(True);

Its disassembly shows it is optimized the way you want:

; IsTrue<True>
mov         eax,1  
ret  

; IsTrue<False>
xor         eax,eax  
ret  

(This is on .Net Core 1.1, I guess I didn't encounter the bug mentioned by @mikedn for some reason?)

Also, the same code could look somewhat better using the potential future feature Shapes dotnet/csharplang#164:

shape SBool
{
    bool IsTrue { get; }
}

struct True : SBool
{
    bool IsTrue => true;
}

struct False : SBool
{
    bool IsTrue => false;
}

static bool IsTrue<T>() where T : SBool => T.IsTrue;

@AndyAyersMS
Copy link
Member

While it's true that in a shared generic method the exact type for shared parameters or locals is not known to the jit, there are still opportunities to optimize.

There are various ways this can happen:

  • if the function applied to the type does not depend on the exact type then the jit may be able to simplify things. For instance sizeof(T) for a ref type is known to be 4 or 8 depending on target architecture, and since today sharing only happens for ref types, this optimization can kick in for shared methods.
  • if the method is inlined into an unshared method then the jit can specialize for the types it knows about. This is what happens for the typeof(T) == typeof(...) case though as noted above this idiomatic optimization was broken by Jit: spill stack for non-candidate calls coreclr#7923 and repaired with Jit: fix for broken type equality optimization coreclr#9562 -- neither of these is in 1.0 or 1.1 and both are going to be in 2.0 so no CoreCLR release is impacted. I'll have check if we somehow only got the first of these in some desktop build.

The challenge in depending on specialization via inlining is that the inliner is trying to balance several concerns -- code size, jit time, and performance impact -- and so may not be able to tell that a large method with lots of type-specialization opportunities will end up having only a small code size impact and big performance boost once inlined and optimized. These are more or less the same issues you'd struggle with trying to do this by hand, but the inliner can't be as certain as you can be about where the bets will pay off, so it tends to be conservative. We are working on improving the ability of the inliner to spot such opportunities but for now in perf sensitive cases you may need to use the aggressive inline attribute. Also there still some cases where methods can't be inlined at all (eg if the method has EH) or the idiomatic type check optimization doesn't kick in. Please let me know if you run into these.

@manixrock
Copy link

manixrock commented Mar 18, 2017

@svick The shape idea is very interesting. It seems to go in the direction of using types as objects with properties.

For example this would be very useful in imposing compile-time constraints on object types. When working with 3D scenes, it's very important to have a camera which has a position and view direction.

public class Camera
{
    public Vector<Bool, Bool> position;
    public Vector<True, Bool> lookDirection;
    public Vector<True, True> upDirection;
}

public struct Vector<IsNormalized, IsAxisAligned>
    where IsNormalized : Bool
    where IsAxisAligned : Bool
{
    public float X;
    public float Y;
    public float Z;
}

public static class VectorMethods
{
    public static Vector<True, Bool> normalize(Vector<Bool, Bool> vector)
    {
        var len = length(vector);
        return new Vector<True, Bool> { X = vector.X / len, Y = vector.Y / len, Z = vector.Z / len };
    }

    public static float length(Vector<Bool, Bool> vector)
    {
        return (float)Math.Sqrt(vector.X * vector.X + vector.Y * vector.Y + vector.Z * vector.Z);
    }
}

This way trying to apply a vector without first normalizing it will generate a compile time error, which is better than a runtime error.

Of course we could just have 3 structs Vector with NormalizedVector and AxisAlignedVector extending it, but then we can't have a vector that is both normalized and axis-aligned at the same time, that also is an instance of the 2 extending types. It also requires extra types, and is hard to add more properties to the type.

@manixrock
Copy link

manixrock commented Mar 18, 2017

@AndyAyersMS what about inlining Action's and Func's?

For example if we wrote the code above as:

public class Effect
{
    private static void Apply(Bitmap bmp, Func<Color, Color> pixelMapper)
    {
        // read bitmap data
        int w = bmp.Width, h = bmp.Height;
        var data = bmp.LockBits(new Rectangle(0, 0, w, h), ImageLockMode.ReadWrite, bmp.PixelFormat);
        if (bmp.PixelFormat != PixelFormat.Format32bppArgb)
            throw new InvalidOperationException($"Unsupported pixel format: {bmp.PixelFormat}");
        var s = data.Stride;

        unsafe
        {
            var ptr = (byte*)data.Scan0;
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    // read RGB (not quite optimized, but that's not the point)
                    int offset = y * s + x;
                    byte a = ptr[offset + 1];
                    byte r = ptr[offset + 1];
                    byte g = ptr[offset + 2];
                    byte b = ptr[offset + 3];

                    // apply effects per pixel
                    var mappedColor = pixelMapper(new Color(a, r, g, b));

                    // write RGB
                    ptr[offset + 0] = mappedColor.A;
                    ptr[offset + 1] = mappedColor.R;
                    ptr[offset + 2] = mappedColor.G;
                    ptr[offset + 3] = mappedColor.B;
                }
            }
        }

        bmp.UnlockBits(data);
    }
}

Will it inline the passed method? An image editing program might have dozens of such methods, and it would save a lot of code to be able to re-use this code and only pass the pixel-mapping code. Code duplication is very ok since the alternative would be to manually copy the code anyway.

However since this is a performance-critical code it's extremely important to have some kind of guarantee that the code is inlined in all cases, or it will run very slowly at weird times. AggressiveInlining might help, but it sounds like it's not a guarantee, if it even works at all here.

@AlgorithmsAreCool
Copy link
Contributor

AlgorithmsAreCool commented Mar 19, 2017

I don't think the JIT does anywhere near that level of dataflow analysis (to inline a delegate) and I'd be shocked if it ever did due to the throughput/perf impact.

You could fallback to dynamic code generation to generate specialized code if you really needed to.

@AndyAyersMS
Copy link
Member

Right now we can't do much with either Action or Func; delegate creation is opaque to the jit. But it is on my list of things to look into.

@manixrock
Copy link

Often speed-focused developers are forced to choose between proper code separation and code performance.

This could be solved by a ForceInlining option on an Action or Func, I wonder how hard it would be to implement.

@AndyAyersMS
Copy link
Member

At the "call sites" for Action and Func the jit currently has no idea what method is going to be called. So force inlining doesn't help.

@ayende
Copy link
Contributor

ayende commented Mar 23, 2017

@manixrock the struct trick will work as well here. You have a struct that implements an interface and you call it with the struct implementing the behavior you want.
At the very least, it will not do a virtual dispatch, and sometimes, it will inline the call directly (especially on 2.0)

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 21, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 1, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 31, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants