Cleanup some xarch emit logic #85536

tannergooding · 2023-04-28T18:13:32Z

I was working on an emitValidateIns method (debug-only, run as part of emitDispIns) which verifies that the various instrDesc we encounter are "correct". As part of this, I found a few places where things are currently incorrect and should be updated. The emitValidateIns method will go up eventually, but as its own PR given its size and complexity, and because there is still more validation that needs to be done.

In particular:

Several instructions were not going down the VEX path and were giving potentially incorrect disassembly
Several instructions were not passing down a valid emitAttr (they passed down a scalar size to a vector instruction)
Several instructions were using an incorrect insFormat

Most of these were straightforward fixes. However, we also had, but were not using, some "scheduling" information that was defined as part of the emitfmt data. I added functions to query that information and used them in a few places rather than checking giant switch tables. There are more places this could be used as well, such as in the GC liveness update checks and to remove code duplication in places like emitDispIns. However, those can be done in their own PRs.
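
As a rough illustration of the "query the scheduling information instead of a switch table" idea, here is a minimal sketch. Every name in it (`SchedFlags`, `Format`, `kSched`, `formatReadsReg1`, `formatWritesReg1`) is hypothetical and chosen for the example; it is not the JIT's actual emitfmt data or API, only the general shape of a per-format flag table with small query helpers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scheduling flags; NOT the JIT's actual definitions.
enum SchedFlags : uint8_t
{
    SF_NONE  = 0,
    SF_R1_RD = 1 << 0, // first register operand is read
    SF_R1_WR = 1 << 1, // first register operand is written
    SF_R2_RD = 1 << 2, // second register operand is read
};

// Hypothetical instruction formats, named after their operand access pattern.
enum Format : uint8_t
{
    FMT_RRD_RRD, // reg1 read,       reg2 read
    FMT_RWR_RRD, // reg1 written,    reg2 read
    FMT_RRW_RRD, // reg1 read+write, reg2 read
    FMT_COUNT
};

// One table entry per format, so queries become a lookup instead of a switch.
constexpr uint8_t kSched[FMT_COUNT] = {
    SF_R1_RD | SF_R2_RD,            // FMT_RRD_RRD
    SF_R1_WR | SF_R2_RD,            // FMT_RWR_RRD
    SF_R1_RD | SF_R1_WR | SF_R2_RD, // FMT_RRW_RRD
};

inline bool formatReadsReg1(Format f)
{
    assert(f < FMT_COUNT);
    return (kSched[f] & SF_R1_RD) != 0;
}

inline bool formatWritesReg1(Format f)
{
    assert(f < FMT_COUNT);
    return (kSched[f] & SF_R1_WR) != 0;
}
```

With a table like this, a check such as "does this format write its first register?" is a single masked lookup, and adding a new format only requires adding one table row.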

ghost · 2023-04-28T18:13:43Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-CodeGen-coreclr`
Milestone:	-

tannergooding · 2023-04-28T18:19:41Z

src/coreclr/jit/instrsxarch.h

-INST3(cmpps,            "cmpps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0xC2),                            INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // compare packed singles
-INST3(cmpss,            "cmpss",            IUM_WR, BAD_CODE,     BAD_CODE,     SSEFLT(0xC2),                            INS_TT_TUPLE1_SCALAR,                Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // compare scalar singles


The EVEX versions of these instructions produce a K register, not a XMM/YMM/ZMM register.

Since they need entirely different handling (and cannot be freely converted between the two forms), they need to be their own instructions.

tannergooding · 2023-04-28T18:21:25Z

src/coreclr/jit/instrsxarch.h

-INST3(psubd,            "psubd",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xFA),                            INS_TT_FULL_MEM,                     Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Subtract packed double-word (32-bit) integers
-INST3(psubq,            "psubq",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xFB),                            INS_TT_FULL_MEM,                     Input_64Bit    | REX_W1_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // subtract packed quad-word (64-bit) integers


b/w instructions are typically FULL_MEM, while d/q instructions are typically FULL (with a few exceptions).

In this case, they should have been FULL.
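
To make the practical difference between the two tuple types concrete, the sketch below computes the EVEX disp8*N compression factor for each, as described in the Intel SDM: a Full tuple supports embedded broadcast (in which case N is the element size), while a Full Mem tuple does not and always scales by the vector length. The `TupleType` enum and `disp8Scale` helper are hypothetical names for this example, not the JIT's own code.

```cpp
#include <cassert>

// Hypothetical mirror of the two tuple types discussed above.
enum class TupleType
{
    Full,    // full vector, embedded broadcast allowed (32/64-bit elements)
    FullMem, // full vector, no embedded broadcast
};

// Returns N, the factor an 8-bit displacement is multiplied by under
// EVEX disp8*N compression.
//   vectorBytes: 16, 32, or 64 (128/256/512-bit vector length)
//   elemBytes:   4 or 8 (only consulted when broadcasting)
//   broadcast:   whether embedded broadcast (EVEX.b) is in use
inline unsigned disp8Scale(TupleType tt, unsigned vectorBytes, unsigned elemBytes, bool broadcast)
{
    assert(vectorBytes == 16 || vectorBytes == 32 || vectorBytes == 64);
    assert(elemBytes == 4 || elemBytes == 8);
    assert(!(tt == TupleType::FullMem && broadcast)); // Full Mem has no broadcast form

    if (tt == TupleType::Full && broadcast)
    {
        return elemBytes; // a single element is loaded and broadcast
    }
    return vectorBytes; // the whole vector is loaded
}
```

Without broadcast the two tuple types scale identically, so a wrong tuple type only shows up once the broadcast form is involved; that is consistent with byte/word instructions, which have no embedded broadcast, typically being Full Mem.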

tannergooding · 2023-04-28T18:23:43Z

src/coreclr/jit/instrsxarch.h

@@ -445,14 +445,14 @@ INST3(pmovzxwd,         "pmovzxwd",         IUM_WR, BAD_CODE,     BAD_CODE,
 INST3(pmovzxwq,         "pmovzxwq",         IUM_WR, BAD_CODE,     BAD_CODE,     SSE38(0x34),                             INS_TT_QUARTER_MEM,                  Input_16Bit    | REX_WIG      | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // Packed zero extend short to long
 INST3(pmuldq,           "pmuldq",           IUM_WR, BAD_CODE,     BAD_CODE,     SSE38(0x28),                             INS_TT_FULL,                         Input_32Bit    | REX_W1_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // packed multiply 32-bit signed integers and store 64-bit result
 INST3(pmulld,           "pmulld",           IUM_WR, BAD_CODE,     BAD_CODE,     SSE38(0x40),                             INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Packed multiply 32 bit unsigned integers and store lower 32 bits of each result
-INST3(ptest,            "ptest",            IUM_WR, BAD_CODE,     BAD_CODE,     SSE38(0x17),                             INS_TT_NONE,                                          REX_WIG      | Encoding_VEX                                                         | Resets_OF    | Resets_SF    | Writes_ZF    | Resets_AF    | Resets_PF    | Writes_CF)    // Packed logical compare


The various test instructions don't mutate either input; they operate on a temporary and set flags.

tannergooding · 2023-04-28T18:23:54Z

src/coreclr/jit/instrsxarch.h

@@ -510,66 +510,66 @@ INST3(vpsrlvq,          "psrlvq",           IUM_WR, BAD_CODE,     BAD_CODE,

 INST3(FIRST_FMA_INSTRUCTION, "FIRST_FMA_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None)
 //    id                nm                  um      mr            mi            rm                                       flags
-INST3(vfmadd132pd,      "fmadd132pd",       IUM_WR, BAD_CODE,     BAD_CODE,     SSE38(0x98),                             INS_TT_FULL,                         Input_64Bit    | REX_W1       | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Fused Multiply-Add of Packed Double-Precision Floating-Point Values


FMA instructions are all RMW

tannergooding · 2023-04-28T18:25:34Z

src/coreclr/jit/instrsxarch.h

 INST3(vbroadcastf64x2,  "broadcastf64x2",   IUM_WR, BAD_CODE,               BAD_CODE,     SSE38(0x1A),                   INS_TT_TUPLE2,                       Input_64Bit    | REX_W1                       | Encoding_EVEX)                                                                                                                                  // Broadcast packed float values read from memory to entire register
 INST3(vbroadcasti64x2,  "broadcasti64x2",   IUM_WR, BAD_CODE,               BAD_CODE,     SSE38(0x5A),                   INS_TT_TUPLE2,                       Input_64Bit    | REX_W1                       | Encoding_EVEX)                                                                                                                                  // Broadcast packed integer values read from memory to entire register
 INST3(vbroadcastf64x4,  "broadcastf64x4",   IUM_WR, BAD_CODE,               BAD_CODE,     SSE38(0x1B),                   INS_TT_TUPLE2,                       Input_64Bit    | REX_W1                       | Encoding_EVEX)                                                                                                                                  // Broadcast packed float values read from memory to entire register
 INST3(vbroadcasti64x4,  "broadcasti64x4",   IUM_WR, BAD_CODE,               BAD_CODE,     SSE38(0x5B),                   INS_TT_TUPLE2,                       Input_64Bit    | REX_W1                       | Encoding_EVEX)                                                                                                                                  // Broadcast packed integer values read from memory to entire register
-INST3(vcvtpd2udq,       "cvtpd2udq",        IUM_WR, BAD_CODE,               BAD_CODE,     PCKFLT(0x79),                  INS_TT_FULL,                         Input_64Bit    | REX_W1_EVEX                  | Encoding_EVEX)                                                                                                                                  // cvt packed doubles to unsigned DWORDs


EVEX-only instructions don't need to specify REX_W1_EVEX, since they have no non-EVEX encoding that could differ.

They can just specify REX_W1 directly and go down the faster path.
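
One way to picture why the unconditional flag is the cheaper path: if a flag means "W is 1 only under EVEX", the emitter has to consult the chosen encoding, whereas an unconditional flag needs no such check. The sketch below is an assumption about the general shape of that logic; `WFlags`, `W1_ALWAYS`, `W1_IF_EVEX`, and `needsW1` are hypothetical names and do not reproduce the JIT's actual flag values or helpers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits, loosely modelled on the REX_W1 / REX_W1_EVEX idea.
enum WFlags : uint8_t
{
    W_NONE     = 0,
    W1_ALWAYS  = 1 << 0, // W is 1 for every encoding of the instruction
    W1_IF_EVEX = 1 << 1, // W is 1 only when the EVEX encoding is chosen
};

// Decides whether the W bit must be set for an instruction with the given
// flags, given whether the EVEX encoding was selected for it.
inline bool needsW1(uint8_t flags, bool usesEvexEncoding)
{
    // The two flags are mutually exclusive in this sketch.
    assert((flags & (W1_ALWAYS | W1_IF_EVEX)) != (W1_ALWAYS | W1_IF_EVEX));

    if ((flags & W1_ALWAYS) != 0)
    {
        return true; // no need to look at the encoding at all
    }
    return ((flags & W1_IF_EVEX) != 0) && usesEvexEncoding;
}
```

For an instruction that only exists in EVEX form, both flags would produce the same answer, so the unconditional one avoids the extra encoding check.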

tannergooding · 2023-04-28T18:25:59Z

src/coreclr/jit/instrsxarch.h


 INST3(LAST_AVX512_INSTRUCTION, "LAST_AVX512_INSTRUCTION", IUM_WR, BAD_CODE, BAD_CODE, BAD_CODE, INS_TT_NONE, INS_FLAGS_None)

 // Scalar instructions in SSE4.2
-INST3(crc32,            "crc32",            IUM_WR, BAD_CODE,     BAD_CODE,     PSSE38(0xF2, 0xF0),                      INS_TT_NONE,    INS_FLAGS_None)


CRC32 is an RMW instruction
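
The update-mode comments in this thread (ptest only reads its operands, while FMA and CRC32 are read-modify-write) boil down to classifying how an instruction treats its first operand. A minimal sketch of that classification follows; `UpdateMode`, `Ins`, `updateModeOf`, and the two helpers are hypothetical names for illustration, and the `Movaps` entry is included only as an example of a pure write.

```cpp
#include <cassert>

// Hypothetical stand-in for an instruction's update mode.
enum class UpdateMode
{
    Read,      // first operand is only read (e.g. flag-setting tests)
    Write,     // first operand is only written
    ReadWrite, // first operand is read and then overwritten (RMW)
};

// A handful of instructions, for illustration only.
enum class Ins
{
    Ptest,
    Fmadd,
    Crc32,
    Movaps,
};

inline UpdateMode updateModeOf(Ins ins)
{
    switch (ins)
    {
        case Ins::Ptest:
            return UpdateMode::Read; // sets flags, mutates neither input
        case Ins::Fmadd:
        case Ins::Crc32:
            return UpdateMode::ReadWrite; // accumulates into the first operand
        case Ins::Movaps:
            return UpdateMode::Write; // plain register move
    }
    assert(false && "unknown instruction");
    return UpdateMode::Write;
}

inline bool firstOperandIsRead(UpdateMode m)
{
    return m != UpdateMode::Write;
}

inline bool firstOperandIsWritten(UpdateMode m)
{
    return m != UpdateMode::Read;
}
```

Getting this classification wrong matters for anything that reasons about register liveness: treating an RMW instruction as write-only would suggest the old value of its first operand is dead when it is in fact an input.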

tannergooding · 2023-04-30T23:00:39Z

CC. @dotnet/jit-contrib

Diffs

We see good TP wins of up to -0.20% in minopts and up to -0.03% in fullopts.

We also see some smaller assembly, both from consistent use of the VEX-aware path and from using the correct insFormat for various instructions, which allows various instruction peepholes to light up better.

tannergooding · 2023-05-01T14:30:24Z

/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn

azure-pipelines · 2023-05-01T14:30:44Z

Azure Pipelines successfully started running 2 pipeline(s).

tannergooding · 2023-05-01T17:19:09Z

Fuzzlyn failure is pre-existing and repros on .NET 8 Preview 3: #85602

src/coreclr/jit/simdcodegenxarch.cpp

EgorBo

The changes make sense to me assuming CI is green

tannergooding · 2023-05-01T20:33:01Z

Merged in dotnet/main to resolve the conflicts caused by #85594

tannergooding · 2023-05-01T20:33:21Z

/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn

azure-pipelines · 2023-05-01T20:33:34Z

Azure Pipelines successfully started running 2 pipeline(s).

tannergooding · 2023-05-02T00:34:30Z

JitStress failures are #85608

sebastienros · 2023-05-08T19:34:44Z

Do you think this PR could explain this improvement?

tannergooding · 2023-05-08T19:42:17Z

It's possible, but hard to say for certain without seeing the codegen.

This fixed up a number of code paths dealing with vectors to be VEX aware and emit more optimal codegen.

Ensure floating-point codegen uses the VEX aware path

295a300

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 28, 2023

ghost assigned tannergooding Apr 28, 2023

tannergooding added 10 commits April 28, 2023 12:07

Fix IF_RRW_RRW_CNS to be IF_RWR_RRD_CNS

cd59adb

Fixup emitfmtsxarch.h to have a more consistent layout

30a0ff9

Allow querying the scheduling info for an insFormat

9f62010

Ensure the new insFormats are handled

4b42794

Ensure we consistently use emitInsModeFormat

e7cfb3e

Ensure instructions which write to a mask register are EVEX only

d39df96

Improve REX.W handling for EVEX only instructions

820e7fc

Ensure that instructions use the right update mode and tuple type

5dbc5af

Apply formatting patch

28cdc15

Ensure DstSrcSrc is still handled correctly

d13305d

tannergooding force-pushed the xarch-improv branch from cd9fdd2 to d13305d Compare April 28, 2023 19:50

tannergooding added 5 commits April 28, 2023 12:57

Ensure BLSI/BLSR are still handled in emitOutputAM

03814a2

Use static_assert_no_msg

379b4ab

Fixing the disassembly for IF_RRW_SHF

9befa4f

Fixing the IF check for shld/shrd on x86

d4f6a95

Use the correct name: inst_RV_TT_IV

992a3f2

tannergooding force-pushed the xarch-improv branch from 4088fa2 to 992a3f2 Compare April 28, 2023 21:51

tannergooding added 2 commits April 29, 2023 06:43

Ensure the 4 operand insFormats include the necessary constant

312c3e0

Resolve an insFormat check on x86

fa8677b

Ensure emitSizeOfInsDsc_CNS is used for RWR_RRD_*RD_CNS

8884ec7

tannergooding marked this pull request as ready for review April 30, 2023 22:58

This was referenced May 1, 2023

Infra improvements for Helix #68176

Closed

Methodical_others test JIT/Methodical/Coverage/copy_prop_byref_to_native_int crashing #69832

Open

Long Running Test: Interop/MonoAPI/MonoMono/PInvokeDetach/PInvokeDetach.sh #73040

Closed

tannergooding mentioned this pull request May 1, 2023

Ensure that IF_*WR_RRD formats support 4-byte SIMD instructions #85594

Merged

EgorBo reviewed May 1, 2023

EgorBo approved these changes May 1, 2023

tannergooding added 2 commits May 1, 2023 11:14

Ensure genSimd12UpperClear uses andps for the pre-SSE4.1 path

a318b17

Merge remote-tracking branch 'dotnet/main' into xarch-improv

1c60535

tannergooding merged commit da0aa0c into dotnet:main May 2, 2023

tannergooding deleted the xarch-improv branch May 2, 2023 00:34

This was referenced May 2, 2023

Failures in System.Net.Mail.Tests.SmtpClientTest tests #85637

Closed

System.IO.Tests.RandomAccess_NoBuffering.ReadUsingSingleBuffer timing out #85659

Closed

This was referenced May 2, 2023

System.Net.Http.Tests.WarningHeaderValueTest.GetWarningLength_DifferentInvalidScenarios_AllReturnZero failing in CI #85687

Closed

Fix AreFlagsSetToZeroCmp to not consider unsupported formats #85714

Merged

tannergooding mentioned this pull request May 9, 2023

Performance regression on Fortunes Windows #85930

Closed

kunalspathak mentioned this pull request May 16, 2023

[Perf] Windows/x64: 6 Improvements on 5/2/2023 12:34:35 AM dotnet/perf-autofiling-issues#17620

Closed

ghost locked as resolved and limited conversation to collaborators Jun 8, 2023
