Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup some xarch emit logic #85536

Merged
merged 25 commits into from
May 2, 2023
Merged

Conversation

tannergooding
Copy link
Member

I was working on a emitValidateIns method (debug-only, run as part of emitDispIns) which verifies that the various instrDesc we encounter are "correct". As part of this, I found a few places where things are currently incorrect and should be updated. -- The emitValidateIns will go up eventually, but as its own PR given size/complexity and because there are still more validation that needs to be done.

In particular:

  • Several instructions were not going down the VEX path and were giving potentially incorrect disassembly
  • Several instructions were not passing down a valid emitAttr (they passed down a scalar size to a vector instruction)
  • Several instructions were using an incorrect insFormat

Most of these were straightfoward fixes. However, we also had but were not using some "scheduling" information that was defined as part of the emitfmt data. I added functions to be able to query that information and utilize it in a few places rather than checking giant switch tables. -- There are more places this could be used as well, such as in the GC liveness update checks and to remove code duplication in places like emitDispIns. However, those can be done in their own PRs.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 28, 2023
@ghost ghost assigned tannergooding Apr 28, 2023
@ghost
Copy link

ghost commented Apr 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

I was working on a emitValidateIns method (debug-only, run as part of emitDispIns) which verifies that the various instrDesc we encounter are "correct". As part of this, I found a few places where things are currently incorrect and should be updated. -- The emitValidateIns will go up eventually, but as its own PR given size/complexity and because there are still more validation that needs to be done.

In particular:

  • Several instructions were not going down the VEX path and were giving potentially incorrect disassembly
  • Several instructions were not passing down a valid emitAttr (they passed down a scalar size to a vector instruction)
  • Several instructions were using an incorrect insFormat

Most of these were straightfoward fixes. However, we also had but were not using some "scheduling" information that was defined as part of the emitfmt data. I added functions to be able to query that information and utilize it in a few places rather than checking giant switch tables. -- There are more places this could be used as well, such as in the GC liveness update checks and to remove code duplication in places like emitDispIns. However, those can be done in their own PRs.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

Comment on lines -198 to -199
INST3(cmpps, "cmpps", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0xC2), INS_TT_FULL, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare packed singles
INST3(cmpss, "cmpss", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0xC2), INS_TT_TUPLE1_SCALAR, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // compare scalar singles
Copy link
Member Author

@tannergooding tannergooding Apr 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EVEX versions of these instructions produce a K register, not a XMM/YMM/ZMM register.

Since they need entirely different handling (and cannot be freely converted between the two forms), they need to be their own instructions.

Comment on lines -338 to -339
INST3(psubd, "psubd", IUM_WR, BAD_CODE, BAD_CODE, PCKDBL(0xFA), INS_TT_FULL_MEM, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // Subtract packed double-word (32-bit) integers
INST3(psubq, "psubq", IUM_WR, BAD_CODE, BAD_CODE, PCKDBL(0xFB), INS_TT_FULL_MEM, Input_64Bit | REX_W1_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // subtract packed quad-word (64-bit) integers
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b/w instructions are typically FULL_MEM while d/q are typically FULL (with a few exceptions)

In this case, they should've been FULL

@@ -445,14 +445,14 @@ INST3(pmovzxwd, "pmovzxwd", IUM_WR, BAD_CODE, BAD_CODE,
INST3(pmovzxwq, "pmovzxwq", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x34), INS_TT_QUARTER_MEM, Input_16Bit | REX_WIG | Encoding_VEX | Encoding_EVEX) // Packed zero extend short to long
INST3(pmuldq, "pmuldq", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x28), INS_TT_FULL, Input_32Bit | REX_W1_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // packed multiply 32-bit signed integers and store 64-bit result
INST3(pmulld, "pmulld", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x40), INS_TT_FULL, Input_32Bit | REX_W0_EVEX | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // Packed multiply 32 bit unsigned integers and store lower 32 bits of each result
INST3(ptest, "ptest", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x17), INS_TT_NONE, REX_WIG | Encoding_VEX | Resets_OF | Resets_SF | Writes_ZF | Resets_AF | Resets_PF | Writes_CF) // Packed logical compare
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The various test instructions don't mutate either input; the operate in a temp and set flags.

@@ -510,66 +510,66 @@ INST3(vpsrlvq, "psrlvq", IUM_WR, BAD_CODE, BAD_CODE,

INST3(FIRST_FMA_INSTRUCTION, "FIRST_FMA_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None)
// id nm um mr mi rm flags
INST3(vfmadd132pd, "fmadd132pd", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x98), INS_TT_FULL, Input_64Bit | REX_W1 | Encoding_VEX | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // Fused Multiply-Add of Packed Double-Precision Floating-Point Values
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FMA instructions are all RMW

INST3(vbroadcastf64x2, "broadcastf64x2", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x1A), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed float values read from memory to entire register
INST3(vbroadcasti64x2, "broadcasti64x2", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x5A), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed integer values read from memory to entire register
INST3(vbroadcastf64x4, "broadcastf64x4", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x1B), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed float values read from memory to entire register
INST3(vbroadcasti64x4, "broadcasti64x4", IUM_WR, BAD_CODE, BAD_CODE, SSE38(0x5B), INS_TT_TUPLE2, Input_64Bit | REX_W1 | Encoding_EVEX) // Broadcast packed integer values read from memory to entire register
INST3(vcvtpd2udq, "cvtpd2udq", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0x79), INS_TT_FULL, Input_64Bit | REX_W1_EVEX | Encoding_EVEX) // cvt packed doubles to unsigned DWORDs
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EVEX only instructions don't need to specify REX_W1_EVEX since they don't have a different encoding for non-EVEX.

They can just specify REX_W1 directly and go down the faster path


INST3(LAST_AVX512_INSTRUCTION, "LAST_AVX512_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None)

// Scalar instructions in SSE4.2
INST3(crc32, "crc32", IUM_WR, BAD_CODE, BAD_CODE, PSSE38(0xF2, 0xF0), INS_TT_NONE, INS_FLAGS_None)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRC32 is an RMW instruction

@tannergooding tannergooding marked this pull request as ready for review April 30, 2023 22:58
@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib

Diffs

We see good TP wins of up to -0.20% in minopts and up to -0.03% in fullopts.

We also see some smaller assembly both from consistent use of the VEX aware path and from using the correct insFormat for various instructions as it allows various instruction peepholes to light-up better.

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@tannergooding
Copy link
Member Author

tannergooding commented May 1, 2023

Fuzzlyn failure is pre-existing and repros on .NET 8 Preview 3: #85602

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes make sense to me assuming CI is green

@tannergooding
Copy link
Member Author

Merged in dotnet/main to resolve the conflicts caused by #85594

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@tannergooding
Copy link
Member Author

JitStress failures are #85608

@sebastienros
Copy link
Member

Do you think this PR could explain this improvement?

@tannergooding
Copy link
Member Author

It's possible, but hard to say for certain without seeing the codegen.

This fixed up a number of code paths dealing with vectors to be VEX aware and emit more optimal codegen.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants