
Use FMA instruction in CpuMath for .NET Core 3 #1292

Merged
4 commits merged into dotnet:master on Oct 19, 2018

Conversation

@helloguo commented Oct 18, 2018

Fix #832

Test with ..\..\Tools\dotnetcli\dotnet.exe run -c Release-Intrinsics --allCategories=Fma

@eerhardt @tannergooding PTAL

Vector256<float> x21 = Avx.LoadVector256(pMatTemp += ccol);
Vector256<float> x31 = Avx.LoadVector256(pMatTemp += ccol);

res0 = Fma.MultiplyAdd(vector, x01, res0);
Member

It would be nice to validate the codegen here and assert that the various loads are being folded into the FMA operation.

Member

If this was "normal code", I would normally say we should pull all the calls to Avx.LoadVector256 above the if-else statements, so they aren't duplicated between the two. We are doing that in other methods below.

If we do that, does the codegen still look "acceptable"?



Member

We can check, but I've seen the best "experience" for folding these loads when it is done directly as part of the method call (skipping the local entirely).
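For illustration, the contrast being described is roughly the following (a hypothetical sketch with names borrowed from the snippet above; the actual codegen should be confirmed against a disassembly):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class FoldingSketch
{
    // Hypothetical method showing the two shapes being compared.
    public static Vector256<float> Accumulate(
        float* pMatTemp, int ccol, Vector256<float> vector, Vector256<float> res0)
    {
        // Load passed directly as an argument: the JIT can fold it into the
        // FMA as a memory operand (e.g. vfmadd231ps ymm0, ymm1, [mem]).
        res0 = Fma.MultiplyAdd(vector, Avx.LoadVector256(pMatTemp += ccol), res0);

        // Load assigned to a named local first: the fold may not happen,
        // leaving a separate vmovups instruction before the FMA.
        Vector256<float> x01 = Avx.LoadVector256(pMatTemp += ccol);
        res0 = Fma.MultiplyAdd(vector, x01, res0);

        return res0;
    }
}
```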

Author

Good points. I will check the codegen.

Vector256<float> x11 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
Vector256<float> x21 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
Vector256<float> x31 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
if (Fma.IsSupported)
Member

It might be nice to abstract this into a "MultiplyAdd" helper, which itself contains the Fma.IsSupported check and either calls Fma.MultiplyAdd or does the individual Multiply and Add operations.

That would make this outer code cleaner, as we could just have four calls to the helper method (and hopefully the JIT would do the right thing for codegen here).
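A minimal sketch of such a helper (hypothetical name and shape; the PR's final version may differ) could look like this. Note that Fma.IsSupported is treated as a JIT-time constant, so only one branch survives in the generated code:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class CpuMathHelpers
{
    // Hypothetical helper: the IsSupported check is resolved at JIT time,
    // so the dead branch is dropped entirely from the generated code.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<float> MultiplyAdd(
        Vector256<float> a, Vector256<float> b, Vector256<float> c)
    {
        if (Fma.IsSupported)
        {
            return Fma.MultiplyAdd(a, b, c);
        }
        return Avx.Add(Avx.Multiply(a, b), c);
    }
}
```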

Member

Good idea. I also wonder whether the repeated checks of Fma.IsSupported produce efficient code. Just looking at the code, a single check per invocation seems possible or, even better, one per process instantiation. Then again, the JIT might do that for us?

Member

The JIT treats the IsSupported checks as constant and will completely drop the irrelevant path from the generated code.

Member

My only concern with creating a helper method here is that the JIT will enforce the evaluation order and will no longer fold the respective LoadVector256 into the Fma.MultiplyAdd or Avx.Multiply calls.

There are probable workarounds if that is the case (such as passing the pointer we want to load from), but it is worth checking (and possibly logging a bug if it doesn't "just work").

Member

The JIT treats the IsSupported checks as constant and will completely drop the irrelevant path from the generated code.

Thanks for explaining! This is cool :)

Author

It might be nice to abstract this into a "MultiplyAdd" helper, which itself contains the Fma.IsSupported check and either calls Fma.MultiplyAdd or does the individual Multiply and Add operations.

That would make this outer code cleaner, as we could just have four calls to the helper method (and hopefully the JIT would do the right thing for codegen here).

Your concerns are right. The sample code MyMulAdd does not fold the load into FMA. We need to use something like MyUnsafeMulAdd. I will change the code accordingly.

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
        private static Vector256<float> MyMulAdd(Vector256<float> a, Vector256<float> b, Vector256<float> c)
        {
            // The caller evaluates Avx.LoadVector256 before this call, so the
            // load lands in a register and is not folded into the FMA.
            return Fma.MultiplyAdd(a, b, c);
        }

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
        private unsafe static Vector256<float> MyUnsafeMulAdd(float* ptra, Vector256<float> b, Vector256<float> c)
        {
            // Passing the pointer and loading inside the helper lets the JIT
            // fold the load into the FMA instruction after inlining.
            return Fma.MultiplyAdd(Avx.LoadVector256(ptra), b, c);
        }

@tannergooding (Member)

Intel Core i9-7980XE CPU 2.60GHz (Max: 2.59GHz), 1 CPU, 36 logical and 18 physical cores

It would be nice to test this on some hardware that normal users would be more likely to have (I believe we generally test on a 4-core i7).

@helloguo (Author)

Intel Core i9-7980XE CPU 2.60GHz (Max: 2.59GHz), 1 CPU, 36 logical and 18 physical cores

It would be nice to test this on some hardware that normal users would be more likely to have (I believe we generally test on a 4-core i7).

Which generation(s) of platforms (e.g. Coffee Lake, Skylake, etc.) do you use for testing?

@tannergooding (Member)

Which generation(s) of platforms (e.g. Coffee Lake, Skylake, etc.) do you use for testing?

My current work box is an i7 Kaby Lake (i7 7700 @ 3.60GHz). I am still attempting to determine the exact micro-architecture (or range of micro-architectures) we are using for our official benchmarks. -- CC. @adamsitnik who might have some idea.

@adamsitnik (Member)

@jorive what is our hardware setup for the perf lab? and what machines did we order some time ago?

@jorive (Member) commented Oct 18, 2018

@adamsitnik We have been running .NET Core benchmarks on Haswell machines (i7-4790).
Now, with respect to the hardware we ordered a while back, I do not know what we are getting. @brianrob and @maririos might have more information.

@brianrob (Member)

I'm not sure what we're getting, but I think that @maririos was looking into this.

}
else
{
return Avx.Add(Avx.Multiply(Avx.LoadVector256(psrc1), src2), src3);
Member

nit: this might be more readable (and I believe the codegen should be the same) if we do:

Vector256<float> product = Avx.Multiply(src2, Avx.LoadVector256(pSrc1));
return Avx.Add(product, src3);

@tannergooding (Member) left a comment

LGTM, although getting some perf numbers on some lower-end hardware might also be nice.

@maririos (Member)

what machines did we order some time ago?

We are in the process of revisiting the orders and the current need. Ping me if you want to be involved in the conversation for that new hardware.

@adamsitnik (Member) left a comment

LGTM!

Thank you @helloguo !

@@ -28,14 +28,17 @@ public void ScaleAddU()
=> AvxIntrinsics.ScaleAddU(DefaultScale, DefaultScale, new Span<float>(dst, 0, Length));

[Benchmark]
[BenchmarkCategory("Fma")]
Member

good idea with adding the Fma category! 👍

@adamsitnik (Member)

@tannergooding do you know if there are any env vars to make Fma.IsSupported return false for the purpose of testing?

@tannergooding (Member)

@tannergooding do you know if there are any env vars to make Fma.IsSupported return false for the purpose of testing?

We have one, but it is currently only enabled for 'checked' builds of the Runtime and is not available as a "retail" flag. It has been brought up before that we may need it as a retail flag for cases like this.
CC. @CarolEidt, @danmosemsft, @eerhardt, @fiigii

@helloguo (Author)

Before: [benchmark screenshot: fma-before-1019]

After: [benchmark screenshot: fma-after-1019]

@tannergooding The data is updated. I believe all the feedback is addressed.

@fiigii commented Oct 19, 2018

Opened an issue at https://github.com/dotnet/coreclr/issues/20498, I will expose these knobs in release build once it gets approved.

@tannergooding tannergooding merged commit 06b5ea6 into dotnet:master Oct 19, 2018
@tannergooding (Member) commented Oct 19, 2018

Thanks @helloguo

LittleLittleCloud added a commit that referenced this pull request Jan 28, 2022
* enable forecasting scenario

* fix build

* update

* fix codegen

* fix codegen test

* fix automl service test

* fix try it out page

* clean up

* add program generator for local image

* fix local image console app generator

* fix azure code gen

* fix azure image classification bug

* fix tests

* fix tests

* fix tests

* fix test error

* fix tests

* fix tests

* fix tests

* fix tests

* update

* update

* update

* use sort

* use array.sort

* enable nls

* fix build

* fix build

* fix tests

* fix typo

* bump up version

* fix code snippet

* remove label from sampled data

* fix web api code gen

* Update ResampleStrategyProposer.cs

* fix typo
dotnet locked as resolved and limited conversation to collaborators Mar 27, 2022
9 participants