Use FMA instruction in CpuMath for .NET Core 3 #1292
Vector256<float> x21 = Avx.LoadVector256(pMatTemp += ccol);
Vector256<float> x31 = Avx.LoadVector256(pMatTemp += ccol);

res0 = Fma.MultiplyAdd(vector, x01, res0);
It would be nice to validate the codegen here and assert that the various loads are being folded into the FMA operation.
If this were "normal code", I would normally say we should pull all the calls to Avx.LoadVector256 above the if-else statements, so they aren't duplicated between the two. We are doing that in other methods below.
If we do that, does the codegen still look "acceptable"?
We can check, but I've seen the best "experience" for folding these loads when it is done directly as part of the method call (skipping the local entirely).
Good points. I will check the codegen.
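For readers following the thread, the two source shapes being compared could be sketched as below. This is an illustration only: the class and method names are hypothetical, it assumes `Fma.IsSupported` is true on the machine and that unsafe code is enabled, and it makes no claim about which shape a given JIT version actually folds. That has to be checked in the disassembly.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical names; both methods compute (vector * *pMat) + res.
internal static class LoadFoldingSketch
{
    // Shape A: the load is stored in a local before it is used.
    internal static unsafe Vector256<float> ViaLocal(
        float* pMat, Vector256<float> vector, Vector256<float> res)
    {
        Vector256<float> x = Avx.LoadVector256(pMat);
        return Fma.MultiplyAdd(vector, x, res);
    }

    // Shape B: the load is passed directly as an argument, skipping the local.
    internal static unsafe Vector256<float> Direct(
        float* pMat, Vector256<float> vector, Vector256<float> res)
    {
        return Fma.MultiplyAdd(vector, Avx.LoadVector256(pMat), res);
    }
}
```

Both shapes return the same values; the only open question is whether the JIT emits a separate load instruction or a memory operand on the FMA instruction itself.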
Vector256<float> x11 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
Vector256<float> x21 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
Vector256<float> x31 = Avx.Multiply(vector, Avx.LoadVector256(pMatTemp += ccol));
if (Fma.IsSupported)
It might be nice to abstract this into a "MultiplyAdd" helper, which itself contains the Fma.IsSupported check and either calls Fma.MultiplyAdd or does the individual Multiply and Add operations.
That would make this outer code cleaner, as we could just have four calls to the helper method (and hopefully the JIT would do the right thing for codegen here).
Good idea. I also wonder whether the repeated checks of Fma.IsSupported create efficient code. Just looking at the code, a single check per invocation seems possible, or, even better, one per process instantiation. Then again, the JIT might do that for us?
The JIT treats the IsSupported checks as constant and will completely drop the irrelevant path from the generated code.
My only concern with creating a helper method here is that the JIT will enforce the evaluation order and will no longer fold the respective LoadVector256 into the Fma.MultiplyAdd or Avx.Multiply calls.
There are probably workarounds if that is the case (such as passing the pointer we want to load from), but it is worth checking (and possibly logging a bug if it doesn't "just work").
The JIT treats the IsSupported checks as constant and will completely drop the irrelevant path from the generated code.
Thanks for explaining! This is cool :)
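As a rough illustration of that point, the check can sit inside the hot loop without being hoisted by hand. The sketch below is hypothetical (the class and method names are invented and it is not code from this PR); it assumes the .NET Core 3.0 API surface (`Vector256.Create`) and that unsafe code is enabled.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical example: data[i] = data[i] * scale + add for every element.
internal static class IsSupportedSketch
{
    internal static unsafe void MulAddInPlace(float[] data, float scale, float add)
    {
        int i = 0;
        if (Avx.IsSupported)
        {
            Vector256<float> s = Vector256.Create(scale);
            Vector256<float> a = Vector256.Create(add);
            fixed (float* p = data)
            {
                for (; i + 8 <= data.Length; i += 8)
                {
                    Vector256<float> d = Avx.LoadVector256(p + i);
                    // Written as a per-iteration check in source; per the
                    // comment above, the JIT is expected to treat it as a
                    // constant and keep only one of the two branches.
                    if (Fma.IsSupported)
                        d = Fma.MultiplyAdd(d, s, a);
                    else
                        d = Avx.Add(Avx.Multiply(d, s), a);
                    Avx.Store(p + i, d);
                }
            }
        }
        // Scalar tail, and the full fallback when AVX is unavailable.
        for (; i < data.Length; i++)
            data[i] = data[i] * scale + add;
    }
}
```

Note that the fused path rounds once while the multiply-then-add path rounds twice, so results can differ in the last bit for inputs that are not exactly representable.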
It might be nice to abstract this into a "MultiplyAdd" helper, which itself contains the Fma.IsSupported check and either calls Fma.MultiplyAdd or does the individual Multiply and Add operations. That would make this outer code cleaner, as we could just have four calls to the helper method (and hopefully the JIT would do the right thing for codegen here).
Your concern is valid. The sample helper MyMulAdd does not fold the load into the FMA. We need to use something like MyUnsafeMulAdd, which takes the pointer and performs the load inside the helper. I will change the code accordingly.
// Takes an already-loaded vector; in this experiment the load was not folded into the FMA.
[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static Vector256<float> MyMulAdd(Vector256<float> a, Vector256<float> b, Vector256<float> c)
{
    return Fma.MultiplyAdd(a, b, c);
}

// Takes the pointer and loads inside the call expression, so the load can be folded into the FMA.
[MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
private static unsafe Vector256<float> MyUnsafeMulAdd(float* ptra, Vector256<float> b, Vector256<float> c)
{
    return Fma.MultiplyAdd(Avx.LoadVector256(ptra), b, c);
}
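Combining the pointer-taking shape with the Fma.IsSupported check requested earlier, a helper along these lines is one possibility. It is a minimal sketch rather than the code that was merged: the class name, the `Demo` wrapper, and its parameters are hypothetical, and it assumes the .NET Core 3.0 API surface (`Vector256.Create`), unsafe code enabled, and a caller that has already verified `Avx.IsSupported`.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class FmaHelperSketch
{
    // Computes (*psrc1 * src2) + src3 over 8 floats.
    // The load stays inside each call expression so it remains a folding candidate.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static unsafe Vector256<float> MultiplyAdd(
        float* psrc1, Vector256<float> src2, Vector256<float> src3)
    {
        if (Fma.IsSupported)
        {
            return Fma.MultiplyAdd(Avx.LoadVector256(psrc1), src2, src3);
        }
        else
        {
            return Avx.Add(Avx.Multiply(Avx.LoadVector256(psrc1), src2), src3);
        }
    }

    // Hypothetical demo wrapper: returns a[i] * scale + bias for the first
    // 8 elements, or null when AVX is unavailable or the input is too short.
    internal static unsafe float[] Demo(float[] a, float scale, float bias)
    {
        if (!Avx.IsSupported || a.Length < 8)
        {
            return null;
        }

        float[] result = new float[8];
        fixed (float* pa = a)
        fixed (float* pr = result)
        {
            Vector256<float> r = MultiplyAdd(pa, Vector256.Create(scale), Vector256.Create(bias));
            Avx.Store(pr, r);
        }
        return result;
    }
}
```

With a helper of this shape, the outer loop reduces to one call per accumulator, and the FMA-versus-fallback decision lives in a single place.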
It would be nice to test this on some hardware that normal users would be more likely to have (I believe we generally test on a 4-core i7).
Which generation(s) of platforms (e.g. Coffee Lake, Sky Lake etc.) do you use for testing?
My current work box is an
@jorive what is our hardware setup for the perf lab? and what machines did we order some time ago?
@adamsitnik We have been running .NET Core benchmarks on Haswell machines (i7-4790).
I'm not sure what we're getting, but I think that @maririos was looking into this.
}
else
{
    return Avx.Add(Avx.Multiply(Avx.LoadVector256(psrc1), src2), src3);
nit: this might be more readable (and I believe the codegen should be the same) if we do:
Vector256<float> product = Avx.Multiply(src2, Avx.LoadVector256(pSrc1));
return Avx.Add(product, src3);
LGTM, although getting some perf numbers on some lower-end hardware might also be nice.
We are in the process of revisiting the orders and the current need. Ping me if you want to be involved in the conversation for that new hardware.
LGTM!
Thank you @helloguo !
@@ -28,14 +28,17 @@ public void ScaleAddU()
=> AvxIntrinsics.ScaleAddU(DefaultScale, DefaultScale, new Span<float>(dst, 0, Length));

[Benchmark]
[BenchmarkCategory("Fma")]
good idea with adding the Fma category! 👍
@tannergooding do you know if there are any env vars to make
We have one, but it is currently only enabled for 'checked' builds of the Runtime and is not available as a "retail" flag. It has been brought up before that we may need it as a retail flag for cases like this.
@tannergooding The data is updated. I believe all the feedback is addressed.
Opened an issue at https://github.com/dotnet/coreclr/issues/20498. I will expose these knobs in release builds once it gets approved.
Thanks @helloguo
Fix #832
Test with
..\..\Tools\dotnetcli\dotnet.exe run -c Release-Intrinsics --allCategories=Fma
@eerhardt @tannergooding PTAL