
Handle addressing modes for HW intrinsics #22944

Merged: 5 commits merged into dotnet:master from the Issue19521 branch on Mar 26, 2019

Conversation

CarolEidt

Contributes to #19550
Fix #19521

@@ -1333,6 +1333,25 @@ void CodeGen::genConsumeRegs(GenTree* tree)
// Update the life of the lcl var.
genUpdateLife(tree);
}
#ifdef FEATURE_HW_INTRINSICS
else if (tree->OperIs(GT_HWIntrinsic))

GT_HWIntrinsic is already handled a few lines below.

Author

Thanks, I missed that. I don't believe that we have (or will have) any Arm64 intrinsics that can be contained, so I'll remove the one further down.

Member

Did this get addressed? I don't see the matching removal

Author

It's a bit hard to see, with the way the diffs are shown, but if you look below, this line was deleted:

        else if (tree->OperIsHWIntrinsic())

and the genConsumeReg below it is actually the one for the OperIsInitVal case.

genConsumeAddress(op1);
// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);

Meh, and I just got rid of the last usage of indirForm in XARCH codegen in a recent PR 😞 . I'll clean this up once the 3.0 release storm passes.

Author

Yes, after looking at your change here: mikedn@af3df85#diff-04ea18d1ee0bf7efd98adb6840a48369 I seriously considered waiting for and/or incorporating those changes, but decided to take the expedient path so that this stands a better chance of getting into 3.0.

// rmOp -- The second operand, which may be a memory node or a node producing a register
// ival -- The immediate operand
//
void emitter::emitIns_R_RM_I(instruction ins, emitAttr attr, regNumber reg1, GenTree* rmOp, int ival)

I wouldn't put this in the emitter. Ideally the emitter interface should not deal with GenTree, only instructions, registers and immediates/displacements.

Member

You could probably just update genHWIntrinsic_R_RM_I():

void CodeGen::genHWIntrinsic_R_RM_I(GenTreeHWIntrinsic* node, instruction ins, int8_t ival)

Member

Hmm, it actually looks like most of this logic was pulled from the previous impl of genHWIntrinsic_R_RM_I

Author

I extracted this from genHWIntrinsic_R_RM_I because it is not specific to HW intrinsics. And historically the emitter has dealt with GenTree nodes for addresses. That said, I take the point that it should be in CodeGen instead of in emitter. I'll make that change.

Member

I think most of the genHWIntrinsic_R_... methods are probably more "general-purpose" than HWIntrinsic specific.

That is, outside some asserts and the explicit code path for containedOp->OperIsHWIntrinsic(), they are just taking a node and determining which emitter call to make (R, AR, C, A, S, etc).

It might be worth refactoring all of these to be more generally usable?
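
For readers less familiar with this code, here is a rough sketch of the shared pattern being described: take an operand that may be contained and pick the matching emitter form. The method name is made up for illustration, and the real helpers also handle spill temps, locals, and class-variable addresses.

void CodeGen::genInstrWithRMOperand(instruction ins, emitAttr attr, regNumber targetReg, GenTree* rmOp)
{
    emitter* emit = getEmitter();
    if (rmOp->isContained() && rmOp->OperIsIndir())
    {
        // Contained memory operand: emit the address ("A") form.
        emit->emitIns_R_A(ins, attr, targetReg, rmOp->AsIndir());
    }
    else
    {
        // Operand produced into a register: emit the register ("R") form.
        emit->emitIns_R_R(ins, attr, targetReg, rmOp->gtRegNum);
    }
}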

@mikedn Mar 4, 2019

The various "RM" handling functions and the relevant emitter interface functions are pretty messed up as is. It will take a significant effort to address all that and it certainly doesn't belong in this PR. For now just avoiding adding more stuff to the emitter interface should be enough. It already suffers from duplication, bad function and parameter names etc.

Member

It will take a significant effort to address all that and it certainly doesn't belong in this PR.

Completely right, I was more asking as a potential future work item.

Author

As it turns out, extracting the HWIntrinsic dependencies from this was a bit problematic, so although I moved it to CodeGen::inst_RV_TT_IV(), it is still under #ifdef FEATURE_HW_INTRINSICS.

return;
}
if (op1->OperIs(GT_LIST))
{
Member

nit: assert(node->gtGetOp2() == nullptr) when op1 is a list?

Author

Will do.

assert(node->gtGetOp2() == nullptr);
return;
}
if (op1->OperIs(GT_LIST))
Member

nit: OperIsList()?

Author

I don't think we have a convention for this, but the OperIsXXX methods pre-date @mikedn's addition of the OperIs method, which I tend to prefer since it makes it clear that you are checking for one or more specific operators. I prefer the OperIsXXX when it is used to check for a "class" of operators. cc @dotnet/jit-contrib for opinions on usage of these.
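
For illustration, the two styles in question look like this (hypothetical fragment):

bool isList   = op1->OperIs(GT_LIST);         // OperIs: check for one specific operator...
bool isAddSub = op1->OperIs(GT_ADD, GT_SUB);  // ...or any of several specific operators
bool isList2  = op1->OperIsList();            // OperIsXXX: a named predicate, typically for a "class" of operators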


I wouldn't bother too much with this. Hopefully we'll just get rid of GT_LIST. I have ~75% of it done, just waiting for intrinsics to finally stabilize so I don't have to keep fixing conflicts.

// None.
//

void CodeGen::genConsumeHWIntrinsicOperands(GenTreeHWIntrinsic* node)
Member

Might be nice to centralize the assert that numArgs matches the number of node operands here as well?

Author

Makes sense.
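
A sketch of what that could look like, assuming the op1/GT_LIST operand shape discussed elsewhere in this PR (the helper names exist in the JIT, but the body here is written for illustration and may not match the final code):

void CodeGen::genConsumeHWIntrinsicOperands(GenTreeHWIntrinsic* node)
{
    int      numArgs = HWIntrinsicInfo::lookupNumArgs(node);
    GenTree* op1     = node->gtGetOp1();

    if (op1 == nullptr)
    {
        assert(numArgs == 0);
        assert(node->gtGetOp2() == nullptr);
        return;
    }

    if (op1->OperIsList())
    {
        assert(node->gtGetOp2() == nullptr);
        int count = 0;
        for (GenTreeArgList* list = op1->AsArgList(); list != nullptr; list = list->Rest())
        {
            genConsumeRegs(list->Current());
            count++;
        }
        assert(count == numArgs);
        return;
    }

    genConsumeRegs(op1);
    GenTree* op2 = node->gtGetOp2();
    if (op2 == nullptr)
    {
        assert(numArgs == 1);
    }
    else
    {
        genConsumeRegs(op2);
        assert(numArgs == 2);
    }
}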

emit->emitIns_AR_R(ins, simdSize, op2Reg, op1Reg, 0);
}
else if ((ival != -1) && varTypeIsFloating(baseType))
if ((ival != -1) && varTypeIsFloating(baseType))
{
assert((ival >= 0) && (ival <= 127));
genHWIntrinsic_R_R_RM_I(node, ins, ival);
}
else if (category == HW_Category_MemoryLoad)
Member

Why isn't this case genConsumeAddress like the other loads?

Author

It's sort of arbitrary. genConsumeRegs will correctly handle an address. genConsumeAddress is useful if you already know you have an address, because it does fewer checks. This is distinct from genConsumeReg which is only used if a node is known to produce a single register.
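
To make the distinction concrete, a comparison sketch (these calls would not appear together; each line shows the helper used in the corresponding situation):

genConsumeAddress(op1); // op1 is known to be an address: consume its base/index registers directly
genConsumeRegs(op1);    // general case: handles contained addresses, contained memory operands, or a register
genConsumeReg(op1);     // fast path: op1 is known to produce exactly one register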

op1Reg = op1->gtRegNum;
genConsumeOperands(node);
}
genConsumeHWIntrinsicOperands(node);
Member

Shouldn't there be some additional removals of genConsumeReg from the below (for this and similar edits)?

Author

I didn't transform all the methods in this file to use genConsumeHWIntrinsicOperands - only the ones where it was specifically beneficial to abstract out the possibility that they might be contained memory ops/addresses. That may be a useful future change, but I wouldn't necessarily do that with this PR.

CarolEidt changed the title from "[WIP] Handle addressing modes for HW intrinsics" to "Handle addressing modes for HW intrinsics" on Mar 4, 2019
@CarolEidt

test OSX10.12 x64 Checked Innerloop Build and Test
test Windows_NT arm Cross Checked Innerloop Build and Test

@CarolEidt

@dotnet/jit-contrib PTAL

// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);

The old code was using emitIns_R_AR; that one has an Is4ByteSSEInstruction check that increases the instruction size by 2. emitInsLoadInd does not seem to have such a check.

Author

Right - I took a quick look at that when you made that comment earlier, and I wasn't sure it was actually required. I need to take another look.


That 4-byte check appeared in #16249 and was updated in #21528. It's not clear to me why that was done in the emitIns functions instead of the emitInsSize functions.

Author

So far I believe that this 2-byte increment should actually be 1 byte. That said, there is definitely inconsistency in how the sizes are determined, and the code allows the size to be over-estimated, so those errors go undetected. I'm trying to decide whether to make a more general cleanup here or just go with the 1 byte increment for this case.

Member

I thought we had a work item tracking the cleanup of emitInsSize, but I can't seem to find it.

From what I remember, these one offs were done because it was non-trivial to update the existing emitInsSize checks due to other complications at the time. I think it is +2 to handle the non-VEX encoded case but only needs to be +1 for the VEX encoding.

Author

I believe it's the case that for these addressing-mode instructions, the SSE encodings require one additional byte (and the VEX encodings don't require any additional bytes). I've run all the hw intrinsic tests without AVX & AVX2 with the most recent set of changes.
The size analysis is so convoluted that I am mostly relying on empirical results (i.e. testing).

Author

I added a note to https://github.com/dotnet/coreclr/issues/23006 that this should also be addressed.

else
{
regNumber rmOpReg = rmOp->gtRegNum;
getEmitter()->emitIns_SIMD_R_R_I(ins, attr, reg1, rmOpReg, ival);
Author

This isn't really SIMD-specific (and should probably be moved), but it's the method that will do a copy to the target if needed for the RMW form.
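
For context, the pattern being referred to is roughly the following (a simplified sketch of the xarch emitter helper, not the exact implementation):

void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int ival)
{
    if (UseVEXEncoding())
    {
        // VEX encodings take a separate destination register, so no copy is needed.
        emitIns_R_R_I(ins, attr, targetReg, op1Reg, ival);
    }
    else
    {
        // Legacy SSE encodings are read-modify-write on the first operand:
        // if the target differs from the source, copy the source into the target first.
        if (op1Reg != targetReg)
        {
            emitIns_R_R(INS_movaps, attr, targetReg, op1Reg);
        }
        emitIns_R_I(ins, attr, targetReg, ival);
    }
}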

@CarolEidt

@dotnet-bot test Ubuntu x64 Checked CoreFX Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Formatting

@CarolEidt

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests

@CarolEidt

@dotnet/jit-contrib @tannergooding @fiigii PTAL

GenTree* op1 = node->gtGetOp1();
if (op1 == nullptr)
{
assert((numArgs == 0) && (node->gtGetOp2() == nullptr));
Member

nit: It is generally nice to break && asserts into two separate ones. That makes it much clearer which assert is the problematic one when they are hit.

Author

I generally disagree. Asserts already clutter the code, and though they are useful, they should impact the readability & flow of the code as little as possible, IMO.
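
The trade-off under discussion, side by side:

assert((numArgs == 0) && (node->gtGetOp2() == nullptr)); // compact, but a failure doesn't say which condition fired
assert(numArgs == 0);                                    // two statements: more clutter,
assert(node->gtGetOp2() == nullptr);                     // but the failing assert is unambiguous in the dump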


And this is another bit of code that will go away when I get rid of GT_LIST...

{
assert((ival >= 0) && (ival <= 127));
genHWIntrinsic_R_RM_I(node, ins, (int8_t)ival);
genConsumeAddress(op1);
Member

I believe you mentioned that genConsumeOperands will also handle the address case, but genConsumeAddress is more efficient.

My question then would be: Is it actually worth doing explicit genConsumeAddress calls here? It looks like it makes the overall logic slightly more complex...

Author

In what way does it make the logic more complex? It's probably a matter of taste, but this case needs special handling anyway, and it seems clearest to explicitly consume an address.

@tannergooding Mar 13, 2019

It's probably a matter of taste

Possibly. I would have thought that the existing genConsumeOperands(node) call would have handled things, in which case the changes could have been minimized to just changing emit->emitIns_R_AR(ins, simdSize, targetReg, op1Reg, 0); to:

// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);

This would have made the delta smaller and kept the nesting of the if blocks smaller. It also gives a single spot that operand consumption happens so future changes don't need to worry about it.

Likewise, down on L144, just the if block could have been added and the existing genConsumeOperands(node) would have handled both the memory store and non-memory store cases (rather than splitting this into 4 separate calls).


Please avoid genConsumeOperands in SIMD and HWIntrinsic code, it's incompatible with my pending GT_LIST changes.

@fiigii left a comment

LGTM, just one suggestion.

// This is the aligned case.
Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
// This is the unaligned case.
Sse2.StoreScalar(p + alignmentOffset + 1, Sse2.Subtract(v, Sse2.LoadVector128(p + offset + alignmentOffset + 1)));

The test cases don't look sufficient. This PR also changes several Is4ByteSSEInstruction points, so I suggest adding an Sse41.LoadAlignedVector128NonTemporal test and a Vector256<T> load/store test.

In the future, it would be nice to update the intrinsic test template to cover more addressing modes.

@CarolEidt

@redknightlois - I plan to get this in for 3.0, though while waiting for the CI issues to resolve I've gotten side-tracked on other things. I should have an updated PR by the end of this week.

Also, eliminate some places where the code size estimates were over-estimating.

Contributes to #19550
Fix #19521
@CarolEidt

@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0)

@CarolEidt

@dotnet-bot test Windows_NT arm Cross Checked Innerloop Build and Test


// Now do a non-temporal load from the aligned location.
// That should give us {0, 0, 1, 0}.
Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();

Sse41.IsSupported?

Author

Right - of course! Forgot all about the checks!

{
try
{
Avx.Store(p + 1, Avx.Subtract(v, Avx.LoadVector256(p + offset + 1)));

Avx.IsSupported?

@CarolEidt

@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test

@CarolEidt

@dotnet/jit-contrib @tannergooding @fiigii PTAL

@fiigii commented Mar 22, 2019

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitsse2only

@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitsse2only

@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic


// Now do a non-temporal load from the aligned location.
// That should give us {0, 0, 1, 0}.
Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();
@fiigii Mar 23, 2019

We need if (Sse41.IsSupported) for this case.

Author

Excellent, thanks. I split this out so that we do an Sse2 load when Sse41 isn't available.

@CarolEidt

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitsse2only

@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitsse2only

@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

@CarolEidt

@dotnet/jit-contrib @tannergooding ping - I addressed @fiigii's feedback on the test case, and fixed an error in the guard for it. I think this is ready to merge.

@tannergooding left a comment

LGTM

CarolEidt merged commit da6ed11 into dotnet:master on Mar 26, 2019
CarolEidt deleted the Issue19521 branch on March 26, 2019 at 23:13
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Handle addressing modes for HW intrinsics

Also, eliminate some places where the code size estimates were over-estimating.

Contributes to dotnet/coreclr#19550
Fix dotnet/coreclr#19521


Commit migrated from dotnet/coreclr@da6ed11