Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace (val / 2) with (val * 0.5) in Jit #24584

Merged
merged 33 commits into from Sep 24, 2019
Merged

Conversation

@EgorBo
Copy link

@EgorBo EgorBo commented May 15, 2019

vmulss/vmulsd has better both latency and throughput than vdivss/vdivsd at least for the hardware I have. e.g. on my MacBook's Haswell:

vdivss (Latency: 10-20,  R.Throughput: 7-14)
vmulss (Latency:     5,  R.Throughput:  0.5)

So if a divisor is a constant power of two we can optimize it, e.g.:

a = b / 2;
becomes:
a = b * 0.5;

See https://godbolt.org/z/rz9h4E (clang, gcc, msvc, x86, AMD64, AArch64 - everywhere this optimization is applied. Btw, LLVM also helps Mono to optimize this case for C#)

I wrote a small benchmark:

[Benchmark]
[Arguments(MathF.PI)]
public float Div_old(float value)
{
    for (int i = 0; i < 10; i++)
        value = value / 2f;
    return value;
}

[Benchmark(Baseline = true)]
[Arguments(MathF.PI)]
public float Div_new(float value)
{
    for (int i = 0; i < 10; i++)
        value = value * 0.5f;
    return value;
}

and the results are (Haswell):

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14 (18A391) [Darwin 18.0.0]
Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host]   : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
  ShortRun : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

|  Method |     value |      Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |---------- |----------:|----------:|----------:|------:|--------:|
| Div_old | 3.1415927 | 16.848 ns | 1.6733 ns | 0.0917 ns |  3.14 |    0.02 |
| Div_new | 3.1415927 |  5.359 ns | 0.5572 ns | 0.0305 ns |  1.00 |    0.00 |

/cc: @tannergooding

src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/jit/utils.h Outdated Show resolved Hide resolved
Loading
src/jit/utils.h Outdated Show resolved Hide resolved
Loading
src/jit/utils.h Outdated Show resolved Hide resolved
Loading
@mikedn
Copy link

@mikedn mikedn commented May 19, 2019

FWIW I tried to do this or a similar FP optimization in the past but didn't bother with it because it wasn't very clear how useful it is, the framework doesn't have a lot of FP code. But now we have WPF in .NET Core - its PresentationCore.dll and PresentationFramework.dll contain a bit more than 100 hits. So it's quite useful, while many developers do this optimization manually there are still enough opportunities.

I haven't noticed any regressions. In theory this can block CSE because VN doesn't understand that x * 0.5 == x / 2.0. That can be fixed later if necessary, VN has other similar issues anyway.

Loading

src/jit/utils.cpp Outdated Show resolved Hide resolved
Loading
@EgorBo
Copy link
Author

@EgorBo EgorBo commented May 23, 2019

@mikedn @tannergooding sorry for the delayed response. I've refactored it to frexp (I need to double check if it's 100% portable) based on your HasInverse
the float / const pattern can be found in game engines too, already found several places.

Loading

@EgorBo
Copy link
Author

@EgorBo EgorBo commented May 23, 2019

Also, I did some testing locally, e.g. https://gist.github.com/EgorBo/866a49334291c1ac3b108eb9341681ae (similar for double and for non-power-of-two constants)

Loading

src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/pal/src/cruntime/math.cpp Outdated Show resolved Hide resolved
Loading
@EgorBo
Copy link
Author

@EgorBo EgorBo commented May 23, 2019

@mikedn @tannergooding could you please take a look one more time? I think I handled all the cases. Also added a test.
isNormal implementation was copied from the managed double.IsNormal() impl.

Also: should I take care about Big Endian?

Loading

src/jit/morph.cpp Outdated Show resolved Hide resolved
Loading
src/jit/utils.cpp Show resolved Hide resolved
Loading
tests/src/JIT/opt/InstructionCombining/DivToMul.cs Outdated Show resolved Hide resolved
Loading
@EgorBo
Copy link
Author

@EgorBo EgorBo commented Jun 19, 2019

A small Roslyn-based script to find places where this optimization can be applied (found some in various math/graphics related C# repositories): https://gist.github.com/EgorBo/74b034fe1936c43fcd0b42934322557c
Also, there are places in CoreFX where * 0.5 is used (I guess authors knew it's better than / 2.0 😄) e.g.: https://github.com/dotnet/corefx/blob/master/src/System.Runtime.Numerics/src/System/Numerics/Complex.cs#L418

Loading

@mikedn
Copy link

@mikedn mikedn commented Jun 19, 2019

A small Roslyn-based script to find places where this optimization can be applied (found some in various math/graphics related C# repositories):

Erm, why do you need such a contraption? Perhaps you're not aware how run JIT diffs?

Loading

@EgorBo
Copy link
Author

@EgorBo EgorBo commented Jun 19, 2019

@mikedn I wanted to quickly find such places without even downloading repositories (via HttpClient) 🙂. Also I made a list of patterns that LLVM is able to optimize (InstCombine transforms) and looking for them in those repos.

Loading

@mikedn
Copy link

@mikedn mikedn commented Jun 19, 2019

I wanted to quickly find such places without even downloading repositories

That's not going to find the interesting case, where expressions like x / 2.0 appear as a result of other JIT optimizations (inlining, constant folding & propagation etc.). Cases where x / 2.0 literally appears in the code aren't very interesting, developers can easily change that to x * 0.5 if they wish so. Adding a zillion of such expression transformations to the JIT might not be the best thing to do, especially in the current JIT state (morph is really a nightmare).

Anyway, here's a x64 FX diff:

Summary:
(Lower is better)
Total bytes of diff: 0 (0.00% of base)
0 total files with size differences (0 improved, 0 regressed), 129 unchanged.
0 total methods with size differences (0 improved, 0 regressed), 146772 unchanged.
3 files had text diffs but not size diffs.
Microsoft.Diagnostics.Tracing.TraceEvent.dasm had 12 diffs
System.Private.CoreLib.dasm had 8 diffs
Microsoft.DotNet.Cli.Utils.dasm had 4 diffs
Completed analysis in 28.14s

As I mentioned in a previous post, it's a bit of a strange case because divss and mulss instructions have the same size so nothing shows up in terms of size differences. Text diff shows that there are only 6 diffs in the entire framework. Not surprising as there's not a lot of floating point code in corelib and corefx.

Loading

@EgorBo
Copy link
Author

@EgorBo EgorBo commented Jun 19, 2019

@mikedn yeah, but as you mentioned earlier there are cases in dotnet/wpf repo:

ReachFramework\AlphaFlattener\BrushProxy.cs:
        (-B + root) / A / 2

ReachFramework\AlphaFlattener\BrushProxy.cs:
        (-B - root) / A / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (rRadSquared + rDot) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (xEnd - xStart) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (yEnd - yStart) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (xEnd + xStart) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (yEnd + yStart) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        - (dstwidth - width) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        (dstwidth - width) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        - (dstheight - height) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        (dstheight - height) / 2

PresentationCore\MS\internal\Ink\StrokeRenderer.cs:
        Math.PI * 3.0 / 2.0

PresentationCore\MS\internal\Shaping\CompositeTypefaceMetrics.cs:
        (-UnderlineOffsetDefaultInEm) / 2

PresentationCore\System\Windows\Media\CompositionTarget.cs:
        Double.MinValue / 2.0

PresentationCore\System\Windows\Media\CompositionTarget.cs:
        Double.MinValue / 2.0

PresentationCore\System\Windows\Media\GlyphRun.cs:
        (firstStopAdvance + currentAdvance) / 2.0

PresentationFramework\MS\Internal\documents\DocumentGrid.cs:
        (centerWidth - ExtentWidth) / 2.0

PresentationFramework\MS\Internal\documents\DocumentGrid.cs:
        (ViewportHeight - ExtentHeight) / 2.0

PresentationFramework\MS\Internal\documents\DocumentGrid.cs:
        (ExtentWidth - _lastRowChangeExtentWidth) / 2.0

PresentationFramework\MS\Internal\documents\TextBoxView.cs:
        (width - lineWidth) / 2

PresentationFramework\MS\Internal\Ink\PenCursorManager.cs:
        (width - originalWidth) / 2

PresentationFramework\MS\Internal\Ink\PenCursorManager.cs:
        (height - originalHeight) / 2

PresentationFramework\System\Windows\Controls\Grid.cs:
        (minPower + maxPower) / 2.0

PresentationFramework\System\Windows\Documents\CaretElement.cs:
        double.MaxValue/2

PresentationFramework\System\Windows\Documents\CaretElement.cs:
        double.MaxValue/2

PresentationCore\System\Windows\Media\Animation\KeySpline.cs:
        (_parameter + top) / 2

PresentationCore\System\Windows\Media\Animation\KeySpline.cs:
        (_parameter + bottom) / 2

PresentationCore\System\Windows\Media\Animation\KeySpline.cs:
        (bottom + top) / 2

Loading

@mikedn
Copy link

@mikedn mikedn commented Jun 19, 2019

Unfortunately running diffs on wpf assemblies is a bit more tricky at the moment so I haven't done it again. The ~100 hits estimation from my previous post on the matter likely still stands.

Loading

@EgorBo EgorBo closed this Jun 28, 2019
@EgorBo EgorBo reopened this Jun 28, 2019
Copy link

@sandreenko sandreenko left a comment

The change looks good, thanks @EgorBo.

However, I am not 100% sure that it is worth taking (with the current morph state and the lack of a separated expression transformation optimizer), @BruceForstall?

Loading

return (bits < 0x7FF0000000000000) && (bits != 0) && ((bits & 0x7FF0000000000000) != 0);
}

bool FloatingPointUtils::isNormal(float x)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like this bit checks, but without C++ isnormal I do not see any better solution.

Loading

tests/src/JIT/opt/InstructionCombining/DivToMul.cs Outdated Show resolved Hide resolved
Loading
tests/src/JIT/opt/InstructionCombining/DivToMul.csproj Outdated Show resolved Hide resolved
Loading
@briansull briansull mentioned this pull request Sep 5, 2019
@briansull
Copy link

@briansull briansull commented Sep 5, 2019

For the non power of two float (32-bit) divides it would also always be a win to perform a reciprocal multiply operation using a (64-bit) multiply and conversion back to float (32-bit)

Turns out it isn't a win :-(

Loading

@tannergooding
Copy link
Member

@tannergooding tannergooding commented Sep 5, 2019

For float (32-bit) divides it would also always be a win to perform a reciprocal multiply operation using a (64-bit) multiply and conversion back to float (32-bit)

Why would you do that rather than just doing the 32-bit reciprocal multiplication? For cases where it is known to be equivalent, just keeping it as a single-precision float would be more efficient..

Loading

@briansull
Copy link

@briansull briansull commented Sep 5, 2019

It was to cover the non power of two divide by constant case:

        public static float DivGen_new(float value)
        {
            for (int i = 0; i < 10; i++)
            {
                double valueD  = (double) value;
	        double resultD = valueD * (1.0 / 3.0);
	        value = (float) resultD;
            }
            return value;
        }

However it turns out the the convert instructions are pretty slow, so I believe that this transformation loses:

G_M2669_IG03:		;; bbWeight=10   
       F30F5AC0             cvtss2sd  xmm0, xmm0
       F20F590500000000     mulsd    xmm0, qword ptr [reloc @RWD00]
       F20F5AC0             cvtsd2ss  xmm0, xmm0
       FFC0                 inc      eax
       83F80A               cmp      eax, 10
       7CE9                 jl       SHORT G_M2669_IG03

RWD00  dq	3FD5555555555555h

Loading

@EgorBo
Copy link
Author

@EgorBo EgorBo commented Sep 5, 2019

@briansull nice try anyway 🙂

Loading

Copy link

@briansull briansull left a comment

LGTM

Loading

{
oper = GT_MUL;
tree->ChangeOper(oper);
op2->AsDblCon()->gtDconVal = 1.0 / divisor;
Copy link
Member

@tannergooding tannergooding Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth noting that this is safe and doing the single operation in single precision isn't required.

The paper Innocuous Double Rounding of Basic Arithmetic Operations provides a proof that a single primitive operation done to at least twice the precision of the target format does not incur error due to double-rounding (hence for float divisor doing (float)(1.0 / divisor) is the same as (1.0f / divisor)).

This is not a safe thing to do across multiple operations (you must downcast back to float after each individual operation) nor is it safe if one of the inputs could not be exactly represented as a float (e.g. if you have double divisor and (float)divisor != divisor).

Loading

@sandreenko sandreenko merged commit 40faef6 into dotnet:master Sep 24, 2019
54 checks passed
Loading
@sandreenko
Copy link

@sandreenko sandreenko commented Sep 24, 2019

Thanks @EgorBo.

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment