Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Replace (val / 2) with (val * 0.5) in Jit #24584

Merged
merged 33 commits into from
Sep 24, 2019
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented May 15, 2019

vmulss/vmulsd has better both latency and throughput than vdivss/vdivsd at least for the hardware I have. e.g. on my MacBook's Haswell:

vdivss (Latency: 10-20,  R.Throughput: 7-14)
vmulss (Latency:     5,  R.Throughput:  0.5)

So if a divisor is a constant power of two we can optimize it, e.g.:

a = b / 2;
becomes:
a = b * 0.5;

See https://godbolt.org/z/rz9h4E (clang, gcc, msvc, x86, AMD64, AArch64 - everywhere this optimization is applied. Btw, LLVM also helps Mono to optimize this case for C#)

I wrote a small benchmark:

[Benchmark]
[Arguments(MathF.PI)]
public float Div_old(float value)
{
    for (int i = 0; i < 10; i++)
        value = value / 2f;
    return value;
}

[Benchmark(Baseline = true)]
[Arguments(MathF.PI)]
public float Div_new(float value)
{
    for (int i = 0; i < 10; i++)
        value = value * 0.5f;
    return value;
}

and the results are (Haswell):

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14 (18A391) [Darwin 18.0.0]
Intel Core i7-4980HQ CPU 2.80GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host]   : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
  ShortRun : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3

|  Method |     value |      Mean |     Error |    StdDev | Ratio | RatioSD |
|-------- |---------- |----------:|----------:|----------:|------:|--------:|
| Div_old | 3.1415927 | 16.848 ns | 1.6733 ns | 0.0917 ns |  3.14 |    0.02 |
| Div_new | 3.1415927 |  5.359 ns | 0.5572 ns | 0.0305 ns |  1.00 |    0.00 |

/cc: @tannergooding

src/jit/morph.cpp Outdated Show resolved Hide resolved
src/jit/morph.cpp Outdated Show resolved Hide resolved
src/jit/morph.cpp Outdated Show resolved Hide resolved
src/jit/morph.cpp Outdated Show resolved Hide resolved
src/jit/morph.cpp Outdated Show resolved Hide resolved
src/jit/utils.h Outdated Show resolved Hide resolved
src/jit/utils.h Outdated Show resolved Hide resolved
src/jit/utils.h Outdated Show resolved Hide resolved
@mikedn
Copy link

mikedn commented May 19, 2019

FWIW I tried to do this or a similar FP optimization in the past but didn't bother with it because it wasn't very clear how useful it is, the framework doesn't have a lot of FP code. But now we have WPF in .NET Core - its PresentationCore.dll and PresentationFramework.dll contain a bit more than 100 hits. So it's quite useful, while many developers do this optimization manually there are still enough opportunities.

I haven't noticed any regressions. In theory this can block CSE because VN doesn't understand that x * 0.5 == x / 2.0. That can be fixed later if necessary, VN has other similar issues anyway.

src/jit/utils.cpp Outdated Show resolved Hide resolved
@EgorBo
Copy link
Member Author

EgorBo commented May 23, 2019

@mikedn @tannergooding sorry for the delayed response. I've refactored it to frexp (I need to double check if it's 100% portable) based on your HasInverse
the float / const pattern can be found in game engines too, already found several places.

@EgorBo
Copy link
Member Author

EgorBo commented May 23, 2019

Also, I did some testing locally, e.g. https://gist.github.com/EgorBo/866a49334291c1ac3b108eb9341681ae (similar for double and for non-power-of-two constants)

src/jit/morph.cpp Outdated Show resolved Hide resolved
@EgorBo
Copy link
Member Author

EgorBo commented May 23, 2019

@mikedn @tannergooding could you please take a look one more time? I think I handled all the cases. Also added a test.
isNormal implementation was copied from the managed double.IsNormal() impl.

Also: should I take care about Big Endian?

src/jit/morph.cpp Outdated Show resolved Hide resolved
src/jit/utils.cpp Show resolved Hide resolved
tests/src/JIT/opt/InstructionCombining/DivToMul.cs Outdated Show resolved Hide resolved
@EgorBo
Copy link
Member Author

EgorBo commented Jun 19, 2019

A small Roslyn-based script to find places where this optimization can be applied (found some in various math/graphics related C# repositories): https://gist.github.com/EgorBo/74b034fe1936c43fcd0b42934322557c
Also, there are places in CoreFX where * 0.5 is used (I guess authors knew it's better than / 2.0 😄) e.g.: https://github.com/dotnet/corefx/blob/master/src/System.Runtime.Numerics/src/System/Numerics/Complex.cs#L418

@mikedn
Copy link

mikedn commented Jun 19, 2019

A small Roslyn-based script to find places where this optimization can be applied (found some in various math/graphics related C# repositories):

Erm, why do you need such a contraption? Perhaps you're not aware how run JIT diffs?

@EgorBo
Copy link
Member Author

EgorBo commented Jun 19, 2019

@mikedn I wanted to quickly find such places without even downloading repositories (via HttpClient) 🙂. Also I made a list of patterns that LLVM is able to optimize (InstCombine transforms) and looking for them in those repos.

@mikedn
Copy link

mikedn commented Jun 19, 2019

I wanted to quickly find such places without even downloading repositories

That's not going to find the interesting case, where expressions like x / 2.0 appear as a result of other JIT optimizations (inlining, constant folding & propagation etc.). Cases where x / 2.0 literally appears in the code aren't very interesting, developers can easily change that to x * 0.5 if they wish so. Adding a zillion of such expression transformations to the JIT might not be the best thing to do, especially in the current JIT state (morph is really a nightmare).

Anyway, here's a x64 FX diff:

Summary:
(Lower is better)
Total bytes of diff: 0 (0.00% of base)
0 total files with size differences (0 improved, 0 regressed), 129 unchanged.
0 total methods with size differences (0 improved, 0 regressed), 146772 unchanged.
3 files had text diffs but not size diffs.
Microsoft.Diagnostics.Tracing.TraceEvent.dasm had 12 diffs
System.Private.CoreLib.dasm had 8 diffs
Microsoft.DotNet.Cli.Utils.dasm had 4 diffs
Completed analysis in 28.14s

As I mentioned in a previous post, it's a bit of a strange case because divss and mulss instructions have the same size so nothing shows up in terms of size differences. Text diff shows that there are only 6 diffs in the entire framework. Not surprising as there's not a lot of floating point code in corelib and corefx.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 19, 2019

@mikedn yeah, but as you mentioned earlier there are cases in dotnet/wpf repo:

ReachFramework\AlphaFlattener\BrushProxy.cs:
        (-B + root) / A / 2

ReachFramework\AlphaFlattener\BrushProxy.cs:
        (-B - root) / A / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (rRadSquared + rDot) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (xEnd - xStart) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (yEnd - yStart) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (xEnd + xStart) / 2

ReachFramework\Serialization\DrawingContextFlattener.cs:
        (yEnd + yStart) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        - (dstwidth - width) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        (dstwidth - width) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        - (dstheight - height) / 2

ReachFramework\Serialization\VisualSerializer.cs:
        (dstheight - height) / 2

PresentationCore\MS\internal\Ink\StrokeRenderer.cs:
        Math.PI * 3.0 / 2.0

PresentationCore\MS\internal\Shaping\CompositeTypefaceMetrics.cs:
        (-UnderlineOffsetDefaultInEm) / 2

PresentationCore\System\Windows\Media\CompositionTarget.cs:
        Double.MinValue / 2.0

PresentationCore\System\Windows\Media\CompositionTarget.cs:
        Double.MinValue / 2.0

PresentationCore\System\Windows\Media\GlyphRun.cs:
        (firstStopAdvance + currentAdvance) / 2.0

PresentationFramework\MS\Internal\documents\DocumentGrid.cs:
        (centerWidth - ExtentWidth) / 2.0

PresentationFramework\MS\Internal\documents\DocumentGrid.cs:
        (ViewportHeight - ExtentHeight) / 2.0

PresentationFramework\MS\Internal\documents\DocumentGrid.cs:
        (ExtentWidth - _lastRowChangeExtentWidth) / 2.0

PresentationFramework\MS\Internal\documents\TextBoxView.cs:
        (width - lineWidth) / 2

PresentationFramework\MS\Internal\Ink\PenCursorManager.cs:
        (width - originalWidth) / 2

PresentationFramework\MS\Internal\Ink\PenCursorManager.cs:
        (height - originalHeight) / 2

PresentationFramework\System\Windows\Controls\Grid.cs:
        (minPower + maxPower) / 2.0

PresentationFramework\System\Windows\Documents\CaretElement.cs:
        double.MaxValue/2

PresentationFramework\System\Windows\Documents\CaretElement.cs:
        double.MaxValue/2

PresentationCore\System\Windows\Media\Animation\KeySpline.cs:
        (_parameter + top) / 2

PresentationCore\System\Windows\Media\Animation\KeySpline.cs:
        (_parameter + bottom) / 2

PresentationCore\System\Windows\Media\Animation\KeySpline.cs:
        (bottom + top) / 2

@mikedn
Copy link

mikedn commented Jun 19, 2019

Unfortunately running diffs on wpf assemblies is a bit more tricky at the moment so I haven't done it again. The ~100 hits estimation from my previous post on the matter likely still stands.

@EgorBo EgorBo closed this Jun 28, 2019
@EgorBo EgorBo reopened this Jun 28, 2019
Copy link

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good, thanks @EgorBo.

However, I am not 100% sure that it is worth taking (with the current morph state and the lack of a separated expression transformation optimizer), @BruceForstall?

return (bits < 0x7FF0000000000000) && (bits != 0) && ((bits & 0x7FF0000000000000) != 0);
}

bool FloatingPointUtils::isNormal(float x)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like this bit checks, but without C++ isnormal I do not see any better solution.

tests/src/JIT/opt/InstructionCombining/DivToMul.cs Outdated Show resolved Hide resolved
tests/src/JIT/opt/InstructionCombining/DivToMul.csproj Outdated Show resolved Hide resolved
@briansull briansull mentioned this pull request Sep 5, 2019
@briansull
Copy link

briansull commented Sep 5, 2019

For the non power of two float (32-bit) divides it would also always be a win to perform a reciprocal multiply operation using a (64-bit) multiply and conversion back to float (32-bit)

Turns out it isn't a win :-(

@tannergooding
Copy link
Member

For float (32-bit) divides it would also always be a win to perform a reciprocal multiply operation using a (64-bit) multiply and conversion back to float (32-bit)

Why would you do that rather than just doing the 32-bit reciprocal multiplication? For cases where it is known to be equivalent, just keeping it as a single-precision float would be more efficient..

@briansull
Copy link

briansull commented Sep 5, 2019

It was to cover the non power of two divide by constant case:

        public static float DivGen_new(float value)
        {
            for (int i = 0; i < 10; i++)
            {
                double valueD  = (double) value;
	        double resultD = valueD * (1.0 / 3.0);
	        value = (float) resultD;
            }
            return value;
        }

However it turns out the the convert instructions are pretty slow, so I believe that this transformation loses:

G_M2669_IG03:		;; bbWeight=10   
       F30F5AC0             cvtss2sd  xmm0, xmm0
       F20F590500000000     mulsd    xmm0, qword ptr [reloc @RWD00]
       F20F5AC0             cvtsd2ss  xmm0, xmm0
       FFC0                 inc      eax
       83F80A               cmp      eax, 10
       7CE9                 jl       SHORT G_M2669_IG03

RWD00  dq	3FD5555555555555h

@EgorBo
Copy link
Member Author

EgorBo commented Sep 5, 2019

@briansull nice try anyway 🙂

Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

{
oper = GT_MUL;
tree->ChangeOper(oper);
op2->AsDblCon()->gtDconVal = 1.0 / divisor;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth noting that this is safe and doing the single operation in single precision isn't required.

The paper Innocuous Double Rounding of Basic Arithmetic Operations provides a proof that a single primitive operation done to at least twice the precision of the target format does not incur error due to double-rounding (hence for float divisor doing (float)(1.0 / divisor) is the same as (1.0f / divisor)).

This is not a safe thing to do across multiple operations (you must downcast back to float after each individual operation) nor is it safe if one of the inputs could not be exactly represented as a float (e.g. if you have double divisor and (float)divisor != divisor).

@sandreenko sandreenko merged commit 40faef6 into dotnet:master Sep 24, 2019
@sandreenko
Copy link

Thanks @EgorBo.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
9 participants