-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup some xarch emit logic #85536
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsI was working on a In particular:
Most of these were straightfoward fixes. However, we also had but were not using some "scheduling" information that was defined as part of the
|
INST3(cmpps, "cmpps", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0xC2), INS_TT_FULL, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare packed singles | ||
INST3(cmpss, "cmpss", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0xC2), INS_TT_TUPLE1_SCALAR, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare scalar singles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EVEX versions of these instructions produce a K
register, not a XMM/YMM/ZMM
register.
Since they need entirely different handling (and cannot be freely converted between the two forms), they need to be their own instructions.
INST3(psubd, "psubd", IUM_WR, BAD_CODE, BAD_CODE, PCKDBL(0xFA), INS_TT_FULL_MEM, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // Subtract packed double-word (32-bit) integers | ||
INST3(psubq, "psubq", IUM_WR, BAD_CODE, BAD_CODE, PCKDBL(0xFB), INS_TT_FULL_MEM, Input_64Bit | REX_W1_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // subtract packed quad-word (64-bit) integers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b
/w
instructions are typically FULL_MEM
while d
/q
are typically FULL
(with a few exceptions)
In this case, they should've been FULL
@@ -445,14 +445,14 @@ INST3(pmovzxwd, "pmovzxwd", IUM_WR, BAD_CODE, BAD_CODE, | |||
INST3(pmovzxwq, "pmovzxwq", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x34), INS_TT_QUARTER_MEM, Input_16Bit | REX_WIG | Encoding_VEX | Encoding_EVEX) // Packed zero extend short to long | |||
INST3(pmuldq, "pmuldq", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x28), INS_TT_FULL, Input_32Bit | REX_W1_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // packed multiply 32-bit signed integers and store 64-bit result | |||
INST3(pmulld, "pmulld", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x40), INS_TT_FULL, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // Packed multiply 32 bit unsigned integers and store lower 32 bits of each result | |||
INST3(ptest, "ptest", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x17), INS_TT_NONE, REX_WIG | Encoding_VEX | Resets_OF | Resets_SF | Writes_ZF | Resets_AF | Resets_PF | Writes_CF) // Packed logical compare |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The various test
instructions don't mutate either input; the operate in a temp and set flags.
@@ -510,66 +510,66 @@ INST3(vpsrlvq, "psrlvq", IUM_WR, BAD_CODE, BAD_CODE, | |||
|
|||
INST3(FIRST_FMA_INSTRUCTION, "FIRST_FMA_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None) | |||
// id nm um mr mi rm flags | |||
INST3(vfmadd132pd, "fmadd132pd", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x98), INS_TT_FULL, Input_64Bit | REX_W1 | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // Fused Multiply-Add of Packed Double-Precision Floating-Point Values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FMA instructions are all RMW
INST3(vbroadcastf64x2, "broadcastf64x2", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x1A), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed float values read from memory to entire register | ||
INST3(vbroadcasti64x2, "broadcasti64x2", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x5A), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed integer values read from memory to entire register | ||
INST3(vbroadcastf64x4, "broadcastf64x4", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x1B), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed float values read from memory to entire register | ||
INST3(vbroadcasti64x4, "broadcasti64x4", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x5B), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed integer values read from memory to entire register | ||
INST3(vcvtpd2udq, "cvtpd2udq", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0x79), INS_TT_FULL, Input_64Bit | REX_W1_EVEX | Encoding_EVEX) // cvt packed doubles to unsigned DWORDs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EVEX only instructions don't need to specify REX_W1_EVEX
since they don't have a different encoding for non-EVEX
.
They can just specify REX_W1
directly and go down the faster path
|
||
INST3(LAST_AVX512_INSTRUCTION, "LAST_AVX512_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None) | ||
|
||
// Scalar instructions in SSE4.2 | ||
INST3(crc32, "crc32", IUM_WR, BAD_CODE, BAD_CODE, PSSE38(0xF2, 0xF0), INS_TT_NONE, INS_FLAGS_None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRC32 is an RMW instruction
cd9fdd2
to
d13305d
Compare
4088fa2
to
992a3f2
Compare
CC. @dotnet/jit-contrib We see good TP wins of up to We also see some smaller assembly both from consistent use of the VEX aware path and from using the correct |
/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn |
Azure Pipelines successfully started running 2 pipeline(s). |
Fuzzlyn failure is pre-existing and repros on .NET 8 Preview 3: #85602 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes make sense to me assuming CI is green
Merged in dotnet/main to resolve the conflicts caused by #85594 |
/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn |
Azure Pipelines successfully started running 2 pipeline(s). |
JitStress failures are #85608 |
Do you think this PR could explain this improvement? |
It's possible, but hard to say for certain without seeing the codegen. This fixed up a number of code paths dealing with vectors to be VEX aware and emit more optimal codegen. |
I was working on a
emitValidateIns
method (debug-only, run as part ofemitDispIns
) which verifies that the variousinstrDesc
we encounter are "correct". As part of this, I found a few places where things are currently incorrect and should be updated. -- TheemitValidateIns
will go up eventually, but as its own PR given size/complexity and because there are still more validation that needs to be done.In particular:
emitAttr
(they passed down a scalar size to a vector instruction)insFormat
Most of these were straightfoward fixes. However, we also had but were not using some "scheduling" information that was defined as part of the
emitfmt
data. I added functions to be able to query that information and utilize it in a few places rather than checking giant switch tables. -- There are more places this could be used as well, such as in the GC liveness update checks and to remove code duplication in places likeemitDispIns
. However, those can be done in their own PRs.