Use FMA instruction in CpuMath for .NET Core 3 #1292

helloguo · 2018-10-18T00:29:32Z

Test with ..\..\Tools\dotnetcli\dotnet.exe run -c Release-Intrinsics --allCategories=Fma

tannergooding · 2018-10-18T00:36:10Z

src/Microsoft.ML.CpuMath/AvxIntrinsics.cs

+                                Vector256<float> x21 = Avx.LoadVector256(pMatTemp += ccol);
+                                Vector256<float> x31 = Avx.LoadVector256(pMatTemp += ccol);
+
+                                res0 = Fma.MultiplyAdd(vector, x01, res0);


It would be nice to validate the codegen here and assert that the various loads are being folded into the FMA operation.

If this was "normal code", I would normally say we should pull all the calls to Avx.LoadVector256 above the if-else statements, so they aren't duplicated between the two. We are doing that in other methods below.

If we do that, does the codegen still look "acceptable"?

We can check, but I've seen the best "experience" for folding these loads when it is done directly as part of the method call (skipping the local entirely).

Good points. I will check the codegen.
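
For reference, the two forms being compared above can be sketched roughly as follows. This is only an illustration, not code from the PR: the names `LoadFoldingSketch`, `InlineLoad`, `HoistedLoad` and `Compare` are hypothetical, and the sketch assumes .NET Core 3.0 or later with unsafe code enabled. Both forms compute the same value; whether the JIT folds the load into the FMA instruction in either form still has to be confirmed by inspecting the generated assembly.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class LoadFoldingSketch
{
    // Form 1: the load is written directly as an argument of the FMA call.
    private static unsafe Vector256<float> InlineLoad(float* pMat, Vector256<float> vector, Vector256<float> res)
        => Fma.MultiplyAdd(vector, Avx.LoadVector256(pMat), res);

    // Form 2: the load is stored in a local first, then passed to the FMA call.
    private static unsafe Vector256<float> HoistedLoad(float* pMat, Vector256<float> vector, Vector256<float> res)
    {
        Vector256<float> x = Avx.LoadVector256(pMat);
        return Fma.MultiplyAdd(vector, x, res);
    }

    // Runs both forms on the same input and reports whether the results match.
    // Callers must check Fma.IsSupported before calling this.
    public static unsafe bool Compare()
    {
        float* data = stackalloc float[8];
        for (int i = 0; i < 8; i++)
        {
            data[i] = i;
        }

        Vector256<float> two = Vector256.Create(2f);
        Vector256<float> one = Vector256.Create(1f);
        return InlineLoad(data, two, one).Equals(HoistedLoad(data, two, one));
    }

    public static void Main()
    {
        if (!Fma.IsSupported)
        {
            Console.WriteLine("FMA is not supported on this machine.");
            return;
        }

        Console.WriteLine(Compare());
    }
}
```

The sketch only shows that the two forms are functionally equivalent; it says nothing about which one produces better codegen.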

tannergooding · 2018-10-18T00:38:14Z

src/Microsoft.ML.CpuMath/AvxIntrinsics.cs

-                            Vector256<float> x11 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
-                            Vector256<float> x21 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
-                            Vector256<float> x31 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
+                            if (Fma.IsSupported)


It might be nice to abstract this into a "MultiplyAdd" helper, which itself contains the Fma.IsSupported check and either calls Fma.MultiplyAdd or does the individual Multiply and Add operations.

That would make this outer code cleaner, as we could just have four calls to the helper method (and hopefully the JIT would do the right thing for codegen here).

Good idea. I also wonder whether the repeated checks of Fma.IsSupported create efficient code. Just looking at the code, a single check per invocation seems possible, or, even better, one per process instantiation. Then again, the JIT might do that for us?

The JIT treats the IsSupported checks as constant and will completely drop the irrelevant path from the generated code.

My only concern with creating a helper method here is that the JIT will enforce the evaluation order and will no longer fold the respective LoadVector256 into the Fma.MultiplyAdd or Avx.Multiply calls....

There are probable workarounds if that is the case (such as passing the pointer we want to load from), but it is worth checking (and possibly logging a bug if it doesn't "just work").

The JIT treats the IsSupported checks as constant and will completely drop the irrelevant path from the generated code.

Thanks for explaining! This is cool :)

It might be nice to abstract this into a "MultiplyAdd" helper, which itself contains the Fma.IsSupported check and either calls Fma.MultiplyAdd or does the individual Multiply and Add operations.

That would make this outer code cleaner, as we could just have four calls to the helper method (and hopefully the JIT would do the right thing for codegen here).

Your concern is right. The sample helper MyMulAdd does not fold the load into the FMA instruction; we need something like MyUnsafeMulAdd, which takes the pointer and performs the load inside the call. I will change the code accordingly.

    [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
    private static Vector256<float> MyMulAdd(Vector256<float> a, Vector256<float> b, Vector256<float> c)
    {
        return Fma.MultiplyAdd(a, b, c);
    }

    [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
    private unsafe static Vector256<float> MyUnsafeMulAdd(float* ptra, Vector256<float> b, Vector256<float> c)
    {
        return Fma.MultiplyAdd(Avx.LoadVector256(ptra), b, c);
    }
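
Putting the suggestions in this thread together, a pointer-taking helper with the Fma.IsSupported check inside it might look roughly like the sketch below. This is a hedged illustration rather than the exact code that was merged: the names `MultiplyAddSketch` and `ScaleAndOffset` are hypothetical, the scalar path is only there so the example runs on machines without AVX, and it assumes .NET Core 3.0 or later with unsafe code enabled.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class MultiplyAddSketch
{
    // Hypothetical helper: returns (*psrc1 * src2) + src3 for 8 floats.
    // The load is written inside the intrinsic call so the JIT has the
    // opportunity to fold it into the instruction.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe Vector256<float> MultiplyAdd(float* psrc1, Vector256<float> src2, Vector256<float> src3)
    {
        if (Fma.IsSupported)
        {
            return Fma.MultiplyAdd(Avx.LoadVector256(psrc1), src2, src3);
        }
        else
        {
            Vector256<float> product = Avx.Multiply(src2, Avx.LoadVector256(psrc1));
            return Avx.Add(product, src3);
        }
    }

    // Computes src[i] * scale + offset for the first 8 elements of src.
    public static unsafe float[] ScaleAndOffset(float[] src, float scale, float offset)
    {
        if (src.Length < 8)
        {
            throw new ArgumentException("Expected at least 8 elements.", nameof(src));
        }

        float[] dst = new float[8];

        if (!Avx.IsSupported)
        {
            // Scalar path so the sketch still runs without AVX.
            for (int i = 0; i < 8; i++)
            {
                dst[i] = src[i] * scale + offset;
            }
            return dst;
        }

        fixed (float* psrc = src)
        fixed (float* pdst = dst)
        {
            Vector256<float> result = MultiplyAdd(psrc, Vector256.Create(scale), Vector256.Create(offset));
            Avx.Store(pdst, result);
        }

        return dst;
    }

    public static void Main()
    {
        float[] src = { 1, 2, 3, 4, 5, 6, 7, 8 };
        Console.WriteLine(string.Join(", ", ScaleAndOffset(src, 3f, 0.5f)));
    }
}
```

With a helper of this shape, the calling code needs only one call per accumulator and no Fma.IsSupported branches of its own.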

tannergooding · 2018-10-18T00:40:07Z

Intel Core i9-7980XE CPU 2.60GHz (Max: 2.59GHz), 1 CPU, 36 logical and 18 physical cores

It would be nice to test this on some hardware that normal users would be more likely to have (I believe we generally test on a 4-core i7).

helloguo · 2018-10-18T16:27:24Z

Intel Core i9-7980XE CPU 2.60GHz (Max: 2.59GHz), 1 CPU, 36 logical and 18 physical cores

It would be nice to test this on some hardware that normal users would be more likely to have (I believe we generally test on a 4-core i7).

Which generation(s) of platforms (e.g. Coffee Lake, Sky Lake etc.) do you use for testing?

tannergooding · 2018-10-18T16:57:53Z

Which generation(s) of platforms (e.g. Coffee Lake, Sky Lake etc.) do you use for testing?

My current work box is an i7 Kaby Lake (i7 7700 @ 3.60GHz). I am still attempting to determine the exact micro-architecture (or range of micro-architectures) we are using for our official benchmarks. -- CC. @adamsitnik who might have some idea.

adamsitnik · 2018-10-18T17:32:41Z

@jorive what is our hardware setup for the perf lab? and what machines did we order some time ago?

jorive · 2018-10-18T18:46:00Z

@adamsitnik We have been running .NET Core benchmarks on Haswell machines (i7-4790).
Now, with respect to the hardware we ordered a while back I do not know what we are getting. @brianrob and @maririos might have more information.

brianrob · 2018-10-18T21:01:11Z

I'm not sure what we're getting, but I think that @maririos was looking into this.

tannergooding · 2018-10-18T23:54:02Z

src/Microsoft.ML.CpuMath/AvxIntrinsics.cs

+            }
+            else
+            {
+                return Avx.Add(Avx.Multiply(Avx.LoadVector256(psrc1), src2), src3);


nit: this might be more readable (and I believe the codegen should be the same) if we do:

    Vector256<float> product = Avx.Multiply(src2, Avx.LoadVector256(pSrc1));
    return Avx.Add(product, src3);

tannergooding

LGTM, although getting some perf numbers on some lower level hardware might also be nice.

maririos · 2018-10-19T01:14:48Z

what machines did we order some time ago?

We are in the process of revisiting the orders and the current need. Ping me if you want to be involved in the conversation for that new hardware.

adamsitnik

LGTM!

Thank you @helloguo !

adamsitnik · 2018-10-19T09:47:45Z

test/Microsoft.ML.CpuMath.PerformanceTests/AvxPerformanceTests.cs

@@ -28,14 +28,17 @@ public void ScaleAddU()
            => AvxIntrinsics.ScaleAddU(DefaultScale, DefaultScale, new Span<float>(dst, 0, Length));

        [Benchmark]
+        [BenchmarkCategory("Fma")]


good idea with adding the Fma category! 👍

adamsitnik · 2018-10-19T09:50:03Z

@tannergooding do you know if there are any env vars to make Fma.IsSupported return false for the purpose of testing?

tannergooding · 2018-10-19T15:32:08Z

@tannergooding do you know if there are any env vars to make Fma.IsSupported return false for the purpose of testing?

We have one, but it is currently only enabled for 'checked' builds of the Runtime and is not available as a "retail" flag. It has been brought up before that we may need it as a retail flag for cases like this.
CC. @CarolEidt, @danmosemsft, @eerhardt, @fiigii

helloguo · 2018-10-19T21:24:20Z

Before:

After:

@tannergooding The data is updated. I believe all the feedback is addressed.

fiigii · 2018-10-19T21:40:41Z

Opened an issue at https://github.com/dotnet/coreclr/issues/20498, I will expose these knobs in release build once it gets approved.

tannergooding · 2018-10-19T22:37:05Z

Thanks @helloguo

helloguo added 2 commits October 17, 2018 17:00

use FMA

485e842

revert cli version back

60d7c35

tannergooding reviewed Oct 18, 2018

fold the load into fma

336e12b

tannergooding reviewed Oct 18, 2018

tannergooding approved these changes Oct 18, 2018

refactor the code to be more readable

e13caf6

adamsitnik approved these changes Oct 19, 2018

tannergooding merged commit 06b5ea6 into dotnet:master Oct 19, 2018

tannergooding mentioned this pull request Oct 22, 2018

Same implementation for Sparse Multiplication for aligned and unaligned arrays #1274

Merged

fiigii mentioned this pull request Jan 31, 2020

Expose EnableISA knobs in release build of .NET Core 3.0 dotnet/runtime#11270

Closed

dotnet locked as resolved and limited conversation to collaborators Mar 27, 2022
