-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JIT] Fold some bitwise operations to vpternlog #91227
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsDescriptionThis PR is trying to solve #84534. We implemented the optimization by tracking the use-def chain during lowering, and trying to fold 2 adjacent binary bitwise operations on the same chain into a single ternary node when AVX512 is available. As we tested internally, we observed some code size reduction in superpmi asmdiff tests, and no tp regression. Moreover, based on the tests where code gen difference is detected, we ran the related micros, and the results will be attached below.
|
Micros performance collected on
|
src/coreclr/jit/hwintrinsic.cpp
Outdated
case NI_SSE_AndNot: | ||
case NI_SSE2_AndNot: | ||
case NI_AVX_AndNot: | ||
case NI_AVX2_AndNot: | ||
case NI_AVX512F_AndNot: | ||
case NI_AVX512DQ_AndNot: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if ANDNOT logic should be included or not. The ANDNOT case won't be used by my PR, while it seems better to include it considering the semantics this function indicates.
@tannergooding @dotnet/avx512-contrib trying to kindly ask if we could have more reviews on this PR, thanks. |
src/coreclr/jit/gentree.cpp
Outdated
#if defined(TARGET_XARCH) | ||
NamedIntrinsic intrinsicId = GetHWIntrinsicId(); | ||
return Compiler::gtIsBitwiseIntrinsic(intrinsicId, GT_AND) || Compiler::gtIsBitwiseIntrinsic(intrinsicId, GT_OR) || | ||
Compiler::gtIsBitwiseIntrinsic(intrinsicId, GT_XOR) || | ||
Compiler::gtIsBitwiseIntrinsic(intrinsicId, GT_AND_NOT); | ||
#else // !TARGET_XARCH | ||
return false; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fine to be general and not XARCH only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing, sharing the long term plan!
Trying to confirm, if I understand it correctly, we will stick with the current implementation and continue the review process, right?
ANDNOT cannot be optimized by the ternary node in a standard way.
be63b0c
to
83dd601
Compare
src/coreclr/jit/gentree.cpp
Outdated
// | ||
bool GenTreeHWIntrinsic::OperIsBitwiseHWIntrinsic() | ||
{ | ||
return HWOperGet() == GT_AND || HWOperGet() == GT_OR || HWOperGet() == GT_XOR || HWOperGet() == GT_AND_NOT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It'd be nice to explicitly cache HWOperGet()
. The compiler "should" be doing it for us, but its not a trivial call and is better to be safe.
case NI_AVX2_Or: | ||
case NI_AVX512F_Or: | ||
case NI_AVX512DQ_Or: | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to either include the { return GT_OR; }
as part of the ifdef
-or- more ideally we add the relevant:
#elif defined(TARGET_ARM64)
case NI_AdvSimd_Or:
#endif
src/coreclr/jit/gentree.cpp
Outdated
case NI_AVX2_AndNot: | ||
case NI_AVX512F_AndNot: | ||
case NI_AVX512DQ_AndNot: | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. We need to either include the braces as part of the ifdef or cover NI_AdvSimd_AndNot:
src/coreclr/jit/gentree.cpp
Outdated
NamedIntrinsic firstLogic = GetHWIntrinsicId(); | ||
NamedIntrinsic secondLogic = second->GetHWIntrinsicId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look like dead locals now.
src/coreclr/jit/gentree.cpp
Outdated
const uint8_t A = 240; // 0xF0 | ||
const uint8_t B = 204; // 0xCC | ||
const uint8_t C = 170; // 0xAA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just make these directly hex:
const uint8_t A = 0xF0;
const uint8_t B = 0xCC;
const uint8_t C = 0xAA;
src/coreclr/jit/gentree.h
Outdated
@@ -6249,11 +6249,13 @@ struct GenTreeHWIntrinsic : public GenTreeJitIntrinsic | |||
bool OperIsEmbBroadcastCompatible() const; | |||
bool OperIsBroadcastScalar() const; | |||
bool OperIsCreateScalarUnsafe() const; | |||
bool OperIsBitwiseHWIntrinsic(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be bool OperIsBitwiseHWIntrinsic() const;
to indicate it doesn't mutate the instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah Yes, you are right, and do we consider also making HWOperGet()
constant? it would make sense to me based on its semantic, and it would save some const_cast
statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should be annotated as such since it doesn't mutate (and never should)
src/coreclr/jit/gentree.h
Outdated
|
||
bool OperRequiresAsgFlag() const; | ||
bool OperRequiresCallFlag() const; | ||
|
||
unsigned GetResultOpNumForRmwIntrinsic(GenTree* use, GenTree* op1, GenTree* op2, GenTree* op3); | ||
uint8_t GetTernaryControlByte(GenTreeHWIntrinsic* second); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, shouldn't this be uint8_t GetTernaryControlByte(GenTreeHWIntrinsic* second) const;
?
@Ruihan-Yin Is this ready for another review? @tannergooding @EgorBo |
Yes, this PR hasn't changed after I resolved the comments from Tanner last time. |
case NI_AVX_Xor: | ||
case NI_AVX2_Xor: | ||
case NI_AVX512F_Xor: | ||
case NI_AVX512DQ_Xor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any concerns around this becoming out of sync from OperIsBitwiseHWIntrinsic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this list, we have the intrinsics that can be folded into ternary logic, while ANDNOT related intrinsics cannot be folded currently. On the other hand, OperIsBitwiseHWIntrinsic
should be consistent with its name from my view, so I included ANDNOT there.
I could leave some comments there to specify this issue, if this is the better way to make thing more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Thanks everyone for the review and help! |
Description
This PR is trying to solve #84534.
We implemented the optimization by tracking the use-def chain during lowering, and trying to fold 2 adjacent binary bitwise operations on the same chain into a single ternary node when AVX512 is available.
As we tested internally, we observed some code size reduction in superpmi asmdiff tests, and no tp regression. Moreover, based on the tests where code gen difference is detected, we ran the related micros, and the results will be attached below.
Asmdiff collected locally
Diffs are based on 1,809,204 contexts (477,269 MinOpts, 1,331,935 FullOpts).
MISSED contexts: 6 (0.00%)
Overall (-1,192 bytes)
MinOpts (-31 bytes)
FullOpts (-1,161 bytes)
Example diffs
aspnet.run.windows.x64.checked.mch
-5 (-2.38%) : 17633.dasm - Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.StringUtilities+<>c:<.cctor>b__25_3(System.Span`1[ushort],System.ValueTuple`3[System.String,ushort,uint]):this (Instrumented Tier1)
-28 (-0.66%) : 51072.dasm - System.SpanHelpers:IndexOfAnyValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,short,short,short,short,int):int (Instrumented Tier0)
-1 (-0.65%) : 105.dasm - System.Guid:FormatGuidVector128Utf8(System.Guid,bool):System.ValueTuple`3[System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]] (FullOpts)
+10 (+5.32%) : 40640.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_512(ulong,ulong,ulong):ulong (Tier1-OSR)
+10 (+5.32%) : 11481.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_512(ulong,ulong,ulong):ulong (Tier1-OSR)
+10 (+5.32%) : 31137.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_512(ulong,ulong,ulong):ulong (Tier1-OSR)
benchmarks.run.windows.x64.checked.mch
-16 (-2.96%) : 17307.dasm - System.HexConverter:TryDecodeFromUtf16_Vector128(System.ReadOnlySpan`1[ushort],System.Span`1[ubyte]):bool (FullOpts)
-11 (-1.95%) : 11088.dasm - System.Runtime.Serialization.DataContracts.DataContract:CheckExplicitDataContractNamespaceUri(System.String,System.Type) (FullOpts)
-4 (-0.79%) : 757.dasm - System.Text.Ascii:ChangeCase[ushort,ushort,System.Text.Ascii+ToUpperConversion](ulong,ulong,ulong):ulong (FullOpts)
+14 (+1.65%) : 22121.dasm - System.SpanHelpers:IndexOfAnyValueType[ubyte,System.SpanHelpers+DontNegate`1[ubyte]](byref,ubyte,ubyte,ubyte,ubyte,int):int (FullOpts)
+9 (+1.97%) : 3798.dasm - System.Text.Ascii:IsValidCore[ushort](byref,int):bool (FullOpts)
+3 (+3.53%) : 22146.dasm - System.Numerics.Tests.Perf_VectorConvert:Widen[float,double](float[]):System.Numerics.Vector`1[double] (FullOpts)
benchmarks.run_pgo.windows.x64.checked.mch
-22 (-2.55%) : 81880.dasm - System.SpanHelpers:NonPackedIndexOfAnyValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,short,short,int):int (Tier1)
-22 (-2.55%) : 88557.dasm - System.SpanHelpers:NonPackedIndexOfAnyValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,short,short,int):int (Tier1)
-4 (-0.47%) : 24666.dasm - System.SpanHelpers:NonPackedIndexOfAnyValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,short,short,int):int (Tier1)
+9 (+1.62%) : 88451.dasm - System.Text.Ascii:IsValidCore[ushort](byref,int):bool (Tier1)
+3 (+4.11%) : 64196.dasm - System.Numerics.VectorMath:ConditionalSelectBitwise(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[float]):System.Runtime.Intrinsics.Vector128`1[float] (Tier0)
+3 (+4.11%) : 48668.dasm - System.Numerics.VectorMath:ConditionalSelectBitwise(System.Runtime.Intrinsics.Vector128`1[double],System.Runtime.Intrinsics.Vector128`1[double],System.Runtime.Intrinsics.Vector128`1[double]):System.Runtime.Intrinsics.Vector128`1[double] (Tier0)
benchmarks.run_tiered.windows.x64.checked.mch
-4 (-0.79%) : 41558.dasm - System.Text.Ascii:ChangeCase[ushort,ushort,System.Text.Ascii+ToUpperConversion](ulong,ulong,ulong):ulong (Tier1)
-1 (-0.65%) : 45177.dasm - System.Guid:FormatGuidVector128Utf8(System.Guid,bool):System.ValueTuple`3[System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]] (Tier1)
-4 (-0.52%) : 17597.dasm - System.SpanHelpers:NonPackedIndexOfAnyValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,short,short,int):int (Tier1)
+9 (+1.97%) : 17569.dasm - System.Text.Ascii:IsValidCore[ushort](byref,int):bool (Tier1)
+3 (+4.11%) : 41810.dasm - System.Numerics.VectorMath:ConditionalSelectBitwise(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[float]):System.Runtime.Intrinsics.Vector128`1[float] (Tier0)
+3 (+4.11%) : 32889.dasm - System.Numerics.VectorMath:ConditionalSelectBitwise(System.Runtime.Intrinsics.Vector128`1[double],System.Runtime.Intrinsics.Vector128`1[double],System.Runtime.Intrinsics.Vector128`1[double]):System.Runtime.Intrinsics.Vector128`1[double] (Tier0)
coreclr_tests.run.windows.x64.checked.mch
-11 (-10.00%) : 222531.dasm - Tests_len34_13:Test_tst_21(System.String):bool (Tier1)
-11 (-10.00%) : 222537.dasm - Tests_len34_13:Test_tst_24(System.String):bool (Tier1)
-11 (-10.00%) : 479726.dasm - Tests_len34_13:Test_tst_21(System.String):bool (FullOpts)
+3 (+12.50%) : 114841.dasm - System.Double:CopySign(double,double):double (Instrumented Tier1)
+3 (+12.50%) : 119618.dasm - System.Single:CopySign(float,float):float (Instrumented Tier1)
+3 (+12.50%) : 119622.dasm - System.Single:CopySign(float,float):float (Tier1)
libraries.pmi.windows.x64.checked.mch
-11 (-7.48%) : 251209.dasm - System.SR:get_RangeAttribute_ValidationError_MinExclusive_MaxExclusive():System.String (FullOpts)
-11 (-7.48%) : 251175.dasm - System.SR:get_CustomValidationAttribute_Method_Must_Return_ValidationResult():System.String (FullOpts)
-11 (-7.48%) : 251211.dasm - System.SR:get_RegularExpressionAttribute_Empty_Pattern():System.String (FullOpts)
+3 (+10.34%) : 34948.dasm - System.Numerics.VectorMath:ConditionalSelectBitwise(System.Runtime.Intrinsics.Vector128`1[double],System.Runtime.Intrinsics.Vector128`1[double],System.Runtime.Intrinsics.Vector128`1[double]):System.Runtime.Intrinsics.Vector128`1[double] (FullOpts)
+3 (+11.11%) : 34504.dasm - System.Numerics.Vector:AndNot[ubyte](System.Numerics.Vector`1[ubyte],System.Numerics.Vector`1[ubyte]):System.Numerics.Vector`1[ubyte] (FullOpts)
+5 (+12.50%) : 34539.dasm - System.Numerics.Vector:ConditionalSelect[ubyte](System.Numerics.Vector`1[ubyte],System.Numerics.Vector`1[ubyte],System.Numerics.Vector`1[ubyte]):System.Numerics.Vector`1[ubyte] (FullOpts)
libraries_tests.pmi.windows.x64.checked.mch
-10 (-3.22%) : 230310.dasm - System.SpanTests.ReadOnlySpanTests:CtorArray2() (FullOpts)
-10 (-3.22%) : 230864.dasm - System.SpanTests.SpanTests:CtorArray2() (FullOpts)
-5 (-2.76%) : 230342.dasm - System.SpanTests.ReadOnlySpanTests:EndsWithMatchDifferentSpans_Long() (FullOpts)
+3 (+12.50%) : 130554.dasm - System.NumberHelper`1[double]:CopySign(double,double):double (FullOpts)
+3 (+12.50%) : 327844.dasm - System.NumberHelper`1[double]:CopySign(double,double):double (FullOpts)
+3 (+12.50%) : 330292.dasm - System.NumberHelper`1[double]:CopySign(double,double):double (FullOpts)
realworld.run.windows.x64.checked.mch
-5 (-3.45%) : 13178.dasm - FSharp.Compiler.Lexer+Ranges+isInt32BadMax@42:Invoke(System.String):bool:this (FullOpts)
-5 (-3.38%) : 14927.dasm - FSharp.Compiler.Lexer+Ranges+isInt64BadMax@45:Invoke(System.String):bool:this (FullOpts)
-23 (-2.06%) : 1501.dasm - BepuPhysics.CollisionDetection.CollisionTasks.BoxTriangleTester:ClipTriangleEdgeAgainstBoxFace(byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,int) (FullOpts)
+9 (+1.97%) : 19339.dasm - System.Text.Ascii:IsValidCore[ushort](byref,int):bool (FullOpts)
+47 (+2.68%) : 1504.dasm - BepuPhysics.CollisionDetection.CollisionTasks.TrianglePairTester:ClipBEdgeAgainstABounds(byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,byref,System.Numerics.Vector`1[int],byref,byref,int) (FullOpts)
+14 (+2.82%) : 19110.dasm - Microsoft.ML.Internal.CpuMath.AvxIntrinsics:Scale(float,System.Span`1[float]) (FullOpts)
Details
Improvements/regressions per collection
Context information
jit-analyze output
Selected micros
System.Text.Perf_Ascii.IsValid*:
+9 (+1.97%) : 17569.dasm - System.Text.Ascii:IsValidCore[ushort](byref,int):bool (Tier1)
System.Text.Perf_Ascii.ToUpper_Chars:
-4 (-0.79%) : 41558.dasm - System.Text.Ascii:ChangeCase[ushort,ushort,System.Text.Ascii+ToUpperConversion](ulong,ulong,ulong):ulong (Tier1)
System.Text.Perf_Ascii.FromUtf16:
+10 (+5.32%) : 40640.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified_512(ulong,ulong,ulong):ulong (Tier1-OSR)
System.Memory.Span.IndexOf* :
-22 (-2.55%) : 81880.dasm - System.SpanHelpers:NonPackedIndexOfAnyValueType[short,System.SpanHelpers+DontNegate1[short]](byref,short,short,short,int):int (Tier1)