-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Add JIT intrinsics support for vector conversion on AMD64 and x86 #10662
Conversation
bb76ccf
to
ab20f5f
Compare
@dotnet-bot test OSX10.12 x64 Checked Build and Test |
src/jit/emitxarch.cpp
Outdated
@@ -214,7 +217,8 @@ bool TakesRexWPrefix(instruction ins, emitAttr attr) | |||
// cased here. | |||
// | |||
// Rex_jmp = jmp with rex prefix always requires rex.w prefix. | |||
if (ins == INS_movsx || ins == INS_rex_jmp) | |||
// vpermq requires w bit to set to 1 | |||
if (ins == INS_movsx || ins == INS_rex_jmp || ins == INS_vpermq) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct. vpermq is encoded with a VEX prefix (with w set to 1), not with a REX.W prefix. And this function is only asking about REX.W.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out. Yes, vpermq needs VEX w1 bit. How about creating two functions named TakesVexWPrefix() and AddVexWPrefix() to handle VEX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have TakesVexPrefix()
and AddVexPrefix()
. Maybe what you want is a function that returns true if the VEX prefix W bit needs to be set (I assume nobody required this until now). Perhaps that should be called RequiresVexPrefixWBit(instruction ins)
? Then, AddVexPrefix()
could have code like:
if (RequiresVexPrefixWBit(ins))
{
code |= ... <magic W bit mask> ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right to me. The VEX.W prefix, I believe, is generally the AVX encoding of the corresponding REX.W prefix. We do have encodings that use the VEX.W prefix, and it is set based on the opcode size. I think it may be that the vpermq encoding needs to have the correct operand size, in which case I believe that the W bit would be set appropriately.
In reply to: 109974462 [](ancestors = 109974462)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VEX.W prefix, I believe, is generally the AVX encoding of the corresponding REX.W prefix.
I agree. From Intel® 64 and IA-32 Architectures Software Developer’s Manual section 2.3.5, "Three-byte form of the VEX prefix provides the functionality of REX.W only to specific instructions that need to override default 32-bit operand size for a general purpose register to 64-bit size in 64-bit mode. For those applicable instructions, VEX.W field provides the same functionality as REX.W. VEX.W field can provide completely different functionality for other instructions."
The function AddRexWPrefix(instruction ins, code_t code)
https://github.com/dotnet/coreclr/blob/master/src/jit/emitxarch.cpp#L350, takes care of both VEX.W bit and REX.W bit. From this aspect, it's OK to let TakesRexWPrefix
return true
for vpermq
and use AddRexWPrefix
to set VEX.W bit.
We do have encodings that use the VEX.W prefix, and it is set based on the opcode size.
For non-integer register, VEX.W bit is just a general opcode extension bit. So I think we need to set the VEX.W bit explicitly for vpermq
because vpermq
uses ymm or xmm.
I think we have two options here:
(1) reuse TakesRexWPrefix
and AddRexWPrefix
by letting TakesRexWPrefix
return true
for vpermq
.
(2) define two new functions (e.g. TakesVexWPrefix
and AddVexWPrefix
) to handle VEX.W bit explicitly. Meanwhile, clean the code in AddRexWPrefix
so that AddRexWPrefix
only handles REX.W bit.
Any suggestions?
src/jit/emitxarch.cpp
Outdated
@@ -3786,7 +3790,7 @@ void emitter::emitIns_R_R_I(instruction ins, emitAttr attr, regNumber reg1, regN | |||
instrDesc* id = emitNewInstrSC(attr, ival); | |||
|
|||
// REX prefix | |||
if (IsExtendedReg(reg1, attr) || IsExtendedReg(reg2, attr)) | |||
if (IsExtendedReg(reg1, attr) || IsExtendedReg(reg2, attr) || TakesRexWPrefix(ins, attr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|| TakesRexWPrefix(ins, attr) [](start = 63, length = 29)
Did you add this because of vpermq? Why is that necessary? It wouldn't work on x86.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's because vpermq. Will find a better way to handle vpermq.
src/jit/emitxarch.cpp
Outdated
* Arguments: | ||
* ins - the instruction to add | ||
* targetReg - the target (destination) register | ||
* reg1 - the first source register |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include all arguments, including attr
{ | ||
code = AddRexWPrefix(ins, code); | ||
} | ||
|
||
if (TakesVexPrefix(ins)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this change necessary? It doesn't look like you added support for any instructions that take a REX prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also because of vpermq. Will address vpermq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceForstall - for AVX instructions that encode REX-equivalent bits (e.g. the W bit), we handle them as for the REX encoding. In fact, for instructions that have both AVX and SSE encodings, we simply generate the AVX encoding from the same encoding bits when AVX encodings are enabled. So, this is the universal way to set 'W' bits, whether they wind up in an AVX or VEX prefix.
I find this approach preferable to adding a separate method. It is what I did for the vgather variants when I was experimenting with that, which also require both the W and L bits to be set appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@helloguo - thanks for making this change; I think it is a better approach.
src/jit/simd.cpp
Outdated
case SIMDIntrinsicConvertToUInt32: | ||
case SIMDIntrinsicConvertToInt64: | ||
case SIMDIntrinsicConvertToUInt64: | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look the same as SIMDIntrinsicCast. Why not share that code, above?
src/jit/simd.cpp
Outdated
case SIMDIntrinsicNarrow: | ||
{ | ||
#ifdef _TARGET_AMD64_ | ||
assert(!instMethod); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this (and the other ones you've added) under _TARGET_AMD64_
? I would remove the #ifdef
and make it all platform (namely, x64 and x86, which are the only ones defining FEATURE_SIMD
today).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does x64 first. It's a good point to make it all platform. We may create an issue to follow it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to make it all platform before this change is merged. We don't want to advance the two platforms independently; we want them to support the same functionality. Plus, you need to change the VectorConvert tests to check for actually generating accelerated code, and we don't want per-architecture tests.
GenTree* dupOp1 = fgInsertCommaFormTemp(&op1, gtGetStructHandleForSIMD(simdType, baseType)); | ||
|
||
// Widen the lower half and assign it to dstAddrLo. | ||
simdTree = gtNewSIMDNode(simdType, op1, nullptr, SIMDIntrinsicWidenLo, baseType, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
op1 [](start = 47, length = 3)
Should this refer to dupOp1 instead of op1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/jit/simd.cpp
Outdated
hiAsg->gtFlags |= ((simdTree->gtFlags | dstAddrHi->gtFlags) & GTF_ALL_EFFECT); | ||
|
||
retVal = gtNewOperNode(GT_COMMA, simdType, loAsg, hiAsg); | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks like we never create a SIMDIntrinsicWiden tree node, but instead "decompose" it into a high and low part that we put into a GT_COMMA
like:
GT_COMMA( *dstHi = SIMDIntrinsicWidenHi(op1) , *dstLo = SIMDIntrinsicWidenLow(op1) )
This seems a bit weird to me, as we normally use commas to create temps in the tree, e.g.:
GT_COMMA( tmp = ... , tmp )
where the "result" of the comma is the tmp
on the right-hand-side (second operand) of the GT_COMMA
. That would imply the "value" represented by your tree is the *dstLo
assignment, and the left side is "just" side-effect.
I wonder if you should instead create some kind of "merge" node instead, e.g.:
SIMDIntrinsicWiden( *dstHi = SIMDIntrinsicWidenHi(op1) , *dstLo = SIMDIntrinsicWidenLow(op1) )
where the SIMDIntrinsicWiden codegen would be a nop, since the operands would do the work for it.
Comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Will make the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to do something like:
SIMDIntrinsicWiden( *dstHi = SIMDIntrinsicWidenHi(op1) , *dstLo = SIMDIntrinsicWidenLow(op1) )
But it's not that straightforward. I define a new node named SIMDIntrinsicWidenRet
, which takes loAsg
and hiAsg
as inputs. However, before it gets lowered and emitted, assertion happens at lir.cpp line 1503 https://github.com/dotnet/coreclr/blob/master/src/jit/lir.cpp#L1503. I guess because the root node is SIMDIntrinsicWidenRet
, it checks the IR from root to leaf. Can you give me any suggestions?
Since both loAsg
and hiAsg
are assignment already, it may be reasonable to put comma there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I withdraw my suggestion. Go ahead and stick with your original GT_COMMA
implementation.
src/jit/lsraxarch.cpp
Outdated
info->srcCount = 1; | ||
info->internalIntCount = 1; | ||
if (comp->getSIMDInstructionSet() == InstructionSet_AVX || simdTree->gtSIMDBaseType == TYP_ULONG) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coding style: please parenthesize sub-expressions (here and below)
src/jit/simdcodegenxarch.cpp
Outdated
//---------------------------------------------------------------------------------- | ||
// genSIMDLo64BitConvert: "Generate code to convert lower-most 64-bit item (long <--> double) | ||
// | ||
// Arguments: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fill this in.
nit: extra double quote in title line
Why are these not enabled for x86? |
src/jit/simdcodegenxarch.cpp
Outdated
regNumber tmpReg2 = genRegNumFromMask(tmpRegsMask); | ||
assert(tmpReg != op1Reg && tmpReg2 != op1Reg); | ||
|
||
// tmpReg: mask, which contains multiple 0x53000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which contains multiple 0x53000000 [](start = 30, length = 34)
What does "which contains multiple 0x53000000" mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be helpful to have a comment showing a sample code sequence that you intend to generate.
In reply to: 109797093 [](ancestors = 109797093)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will add comment.
src/jit/simdcodegenxarch.cpp
Outdated
// | ||
// Notes: | ||
// There are no instructions for converting to/from 64-bit integers, so for these we | ||
// do the convertion an element at a time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convertion [](start = 13, length = 10)
typo: convertion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you address my questions?
@BruceForstall Thanks for your feedback. Will make the change. |
@helloguo You should rewrite the VectorConvert.cs tests to use the JitLog test infrastructure that ensures that the given functions are accelerated, and make sure every intrinsic you've created is accelerated. Check VectorAdd.cs for example. You might also be able to use CheckValue<> instead of your own value comparisons. |
@BruceForstall, @helloguo Just looping back to this. Are we in shape to get the AMD64 changes merged? Has the follow up item for the X86 work been opened? |
@russellhadley I haven't re-reviewed. I don't want it merged until x86 is fully supported equal to amd64. |
Why do we need to wait? Doing it in two steps doesn't seem problematic to me. |
@russellhadley Why do we need to rush it and not do it right at the first merge? |
@BruceForstall I don't see it as a rush, just more bake time and checking in early and often. I'm not getting how the big bang is more "right" than in stages. |
@russellhadley As a matter of principle, x86 and x64 have feature parity, including w.r.t. SIMD intrinsics. We don't want to regress that, even temporarily. (I can't be assured that any resources devoted to adding intrinsics don't disappear after merging 1/2 of the work.) As a practical matter, the VectorConvert test here needs to be rewritten to use JitLog to verify that intrinsics were used. We don't want to disable this test for x86 -- we want all tests running on all architectures. |
@BruceForstall There will be feature parity, just one will be slower than the other (i.e. X86 not use the intrinsics initially). In terms of resource allocation I don't think PRs are the appropriate place for that discussion. From my side I'm willing to approve the merge if the amd64 changes if they are correct and meet the design direction. |
src/jit/simdcodegenxarch.cpp
Outdated
// tmpReg = 0 | ||
// tmpReg = (0 > targetReg) // (If signed) Get the sign bits | ||
// punpck[l|h]dq targetReg, tmpReg // Interleave the sign bits | ||
regNumber tmpReg = genRegNumFromMask(simdNode->gtRsvdRegs & RBM_ALLFLOAT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not correct - it needs to reflect the code being generated below.
src/jit/lsraxarch.cpp
Outdated
} | ||
else if ((comp->getSIMDInstructionSet() == InstructionSet_AVX) || (simdTree->gtSIMDBaseType == TYP_ULONG)) | ||
#endif | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this be:
#ifdef _TARGET_X86_
if (simdTree->gtSIMDBaseType == TYP_LONG)
{
info->internalFloatCount = 3;
}
else
#endif
if ((comp->getSIMDInstructionSet() == InstructionSet_AVX) || (simdTree->gtSIMDBaseType == TYP_ULONG))
?
src/jit/simd.cpp
Outdated
retVal = simdTree; | ||
#else | ||
JITDUMP("SIMD Conversion is not supported on this platform\n"); | ||
return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more specific here, since ConvertToSingle/Double/Int32/UInt32 is supported. E.g.,
JITDUMP("SIMD Conversion to Int64/UInt64 is not supported on this platform\n");
src/jit/simdcodegenxarch.cpp
Outdated
if (intrinsicID == SIMDIntrinsicConvertToSingle && baseType == TYP_UINT) | ||
{ | ||
regNumber tmpIntReg = genRegNumFromMask(simdNode->gtRsvdRegs & RBM_ALLINT); | ||
regMaskTP tmpRegsMask = (simdNode->gtRsvdRegs & RBM_ALLFLOAT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change all your code that finds/extracts registers from the gtRsvdRegs mask to instead call the ExtractTempReg()
or GetSingleTempReg()
APIs. You can use AvailableTempRegCount()
for asserts, if desired, although generally just using the new APIs will give you the asserts you need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion. Because these APIs (ExtractTempReg()
or GetSingleTempReg()
) were not there when this PR was first made, it's not very convenient to fix it in this PR. I will submit a fix once this PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you just rebase to upstream/master, make the change, then push your branch again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks.
src/jit/simdcodegenxarch.cpp
Outdated
} | ||
else if (iset == InstructionSet_AVX || (baseType == TYP_ULONG)) | ||
#endif | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be:
#ifdef _TARGET_X86_
if (baseType == TYP_LONG)
{
...
}
#else
else if (iset == InstructionSet_AVX || (baseType == TYP_ULONG))
Yes, I believe that's what I suggested. Note that |
@CarolEidt if we just set |
@helloguo - Sorry for the delay. I spent some time looking into this, and you are right; it doesn't seem that there is a great way to cleanly support both register size specification and operand size specification in the current instruction formats and descriptors. I looked at what I did for the vgather variants (experimental work for https://github.com/dotnet/corefx/issues/1608), which have a similar encoding issue - the 'W' bit encodes single vs. double, while the 'L' bit (which in this case may take on both 0/1) encodes 128 vs. 256. Since the opcode implicitly encodes single vs. double, I had taken an approach similar to what you suggest above in I suggest adding a very explicit comment along these lines:
Does that seem reasonable? |
@pgavlin Can you add a note on the complexity of keeping the register and operand size specification separate? |
The test changes look good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
{ | ||
code = AddRexWPrefix(ins, code); | ||
} | ||
|
||
if (TakesVexPrefix(ins)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceForstall - for AVX instructions that encode REX-equivalent bits (e.g. the W bit), we handle them as for the REX encoding. In fact, for instructions that have both AVX and SSE encodings, we simply generate the AVX encoding from the same encoding bits when AVX encodings are enabled. So, this is the universal way to set 'W' bits, whether they wind up in an AVX or VEX prefix.
I find this approach preferable to adding a separate method. It is what I did for the vgather variants when I was experimenting with that, which also require both the W and L bits to be set appropriately.
{ | ||
code = AddRexWPrefix(ins, code); | ||
} | ||
|
||
if (TakesVexPrefix(ins)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@helloguo - thanks for making this change; I think it is a better approach.
@dotnet-bot test Windows_NT x64 Debug Build and Test |
Given the dates for 2.0 (we're going to fork imminently), let's hold off on this until master opens for 2.1. |
@helloguo I expect we will be able to merge this soon. In the meantime, can you update the change so it doesn't have any merge conflicts? |
24148af
to
14a2812
Compare
@helloguo Looks like GitHub still thinks there is a merge conflict in codegenlinear.h? Maybe you need to rebase to upstream/master? |
… and x86, except double->long/ulong conversion on x86
14a2812
to
965d5ee
Compare
@dotnet-bot test Ubuntu arm Cross Release Build |
@BruceForstall Done. |
@dotnet-bot test this please |
@dotnet-bot test Windows_NT x64 Release Priority 1 Build and Test |
@helloguo Thanks for all the work! |
This PR adds JIT intrinsic support for vector conversion/narrow/widen on AMD64 SSE2, SSE34 and AVX. It is built on #9920 and #9318. The intrinsics will be tested by #10467.
The following APIs are accelerated by JIT intrinsics provided in this PR:
The semantics of above APIs can be found at https://github.com/dotnet/corefx/issues/15957. The C# implementation of above APIs can be found at dotnet/corefx#16276.
Fix https://github.com/dotnet/coreclr/issues/9317.