JIT: Rewrite BlendVariableMask when mask is created from vector #126062
saucecontrol wants to merge 2 commits into `dotnet:main`
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR updates JIT rationalization for xarch NI_AVX512_BlendVariableMask to avoid keeping the mask-form blend when the mask operand originates from a vector-to-mask conversion, preventing a deoptimization where a mask is created only to be embedded.
Changes:
- Extend `RewriteHWIntrinsicBlendv` to detect when the blend mask is produced via `NI_AVX512_ConvertVectorToMask`.
- Avoid the "keep embedded mask" early-return in that scenario so the blend can be rewritten back to the non-mask form.
Comments suppressed due to low confidence (1)
src/coreclr/jit/rationalize.cpp:685

`op3` is now a local `GenTree*`, but it is later passed by address to `RewriteHWIntrinsicToNonMask(&op3, ...)`. `RewriteHWIntrinsicToNonMask` expects `use` to be the actual operand edge so it can replace/remove nodes (e.g., it removes `NI_AVX512_ConvertVectorToMask` and updates the parent via `ReplaceOperand`). Passing a local pointer means `node->Op(3)` will not be updated, leaving the blend node still pointing at the removed intrinsic (dangling operand / miscompile). Use `GenTree*& op3 = node->Op(3);` (or otherwise pass `&node->Op(3)` / an operand reference) when calling `RewriteHWIntrinsicToNonMask`.
```cpp
GenTree* op2 = node->Op(2);
GenTree* op3 = node->Op(3);

// We're in the post-order visit and are traversing in execution order, so
// everything between op2 and node will have already been rewritten to LIR
// form and doing the IsInvariantInRange check is safe. This allows us to
// catch cases where something is embedded masking compatible but where we
// could never actually contain it and so we want to rewrite it to the non-mask
// variant

SideEffectSet scratchSideEffects;

if (scratchSideEffects.IsLirInvariantInRange(m_compiler, op2, node))
{
    unsigned  tgtMaskSize     = simdSize / genTypeSize(simdBaseType);
    var_types tgtSimdBaseType = TYP_UNDEF;

    if (op2->isEmbeddedMaskingCompatible(m_compiler, tgtMaskSize, tgtSimdBaseType))
    {
        // Make sure we had a mask to begin with. We don't want to create a mask
        // solely for the purpose of embedding it.
        if (!op3->OperIsHWIntrinsic() ||
            (op3->AsHWIntrinsic()->GetHWIntrinsicId() != NI_AVX512_ConvertVectorToMask))
        {
            // We are going to utilize the embedded mask, so we don't need to rewrite. However,
            // we want to fixup the simdBaseType here since it simplifies lowering and allows
            // both embedded broadcast and the mask to be live simultaneously.
            if (tgtSimdBaseType != TYP_UNDEF)
            {
                op2->AsHWIntrinsic()->SetSimdBaseType(tgtSimdBaseType);
            }
            return;
        }
    }
}

if (!ShouldRewriteToNonMaskHWIntrinsic(op3))
{
    return;
}

parents.Push(op3);
RewriteHWIntrinsicToNonMask(&op3, parents);
(void)parents.Pop();
```
cc @dotnet/jit-contrib SPMI doesn't show any diffs, but this does fix up my motivating case. Something like:

```csharp
static Vector128<float> AddToNegative(Vector128<float> v1, Vector128<float> v2)
    => Sse41.BlendVariable(v1, v1 + v2, v1);
```

```diff
  vmovups xmm0, xmmword ptr [rdx]
- vpmovd2m k1, xmm0
- vaddps xmm0 {k1}, xmm0, xmmword ptr [r8]
+ vaddps xmm1, xmm0, xmmword ptr [r8]
+ vblendvps xmm0, xmm0, xmm1, xmm0
  vmovups xmmword ptr [rcx], xmm0
  mov rax, rcx
  ret
-; Total bytes of code: 24
+; Total bytes of code: 23
```
@tannergooding, @kg PTAL.
```cpp
if (!op3->OperIsHWIntrinsic() ||
    (op3->AsHWIntrinsic()->GetHWIntrinsicId() != NI_AVX512_ConvertVectorToMask))
```
The `ConvertVectorToMask` check can be simplified/standardized by using the existing helper `OperIsConvertVectorToMask()` instead of checking `OperIsHWIntrinsic()` + `GetHWIntrinsicId()`. This avoids duplicating intrinsic IDs and reads more clearly.
```diff
- if (!op3->OperIsHWIntrinsic() ||
-     (op3->AsHWIntrinsic()->GetHWIntrinsicId() != NI_AVX512_ConvertVectorToMask))
+ if (!op3->OperIsConvertVectorToMask())
```
This changes when `BlendVariableMask` is rewritten back to `BlendVariable` (and can eliminate a `ConvertVectorToMask`), but I couldn't find an existing JIT codegen test covering this scenario (no `BlendVariableMask` hits under src/tests). Consider adding an asm pattern test to prevent regressions for `Vector128`/`Vector256` cases.
```cpp
// Make sure we had a mask to begin with. We don't want to create a mask
// solely for the purpose of embedding it.
```
This needs an elaboration covering why, as it's operating under the presumption that `vpmov*2m` is more expensive than `vblendvps`, which can depend on the hardware and may change in the future.

For example, this is what we have today, where it may be slower or may be similar perf (using XMM/YMM/ZMM and `vpmov*2m` vs `vblendvps`):
- AMD Zen4: 3/4/5 vs 1
- AMD Zen5: 3/4/6 vs 2
- Intel Skylake-X: 3 vs 1-2
- Intel Emerald Rapids: 3 vs 2-3
Then there is also the case where we don't have a `vpblendvw`, and `vpblendvb` has slightly different behavior here, so it might not be valid to switch back. Consider that a mask of `0x8000` selects a whole word under word-granular blending, but needs to be `0x8080` if using `vpblendvb`.
This catches cases where `BlendVariable` is 'upgraded' to `BlendVariableMask` on import but the blend mask was not `TYP_MASK`.

There is logic in place that checks whether the blend could be used as an EVEX embedded mask and rewrites back to `BlendVariable` if not. However, it misses cases where the mask is created from a vector anyway, and creating the mask just to embed it is a deoptimization.