
Handle addressing modes for HW intrinsics #22944

Merged: 5 commits merged into dotnet:master from the Issue19521 branch on Mar 26, 2019

Conversation

CarolEidt

Contributes to #19550
Fix #19521

@@ -1333,6 +1333,25 @@ void CodeGen::genConsumeRegs(GenTree* tree)
// Update the life of the lcl var.
genUpdateLife(tree);
}
#ifdef FEATURE_HW_INTRINSICS
else if (tree->OperIs(GT_HWIntrinsic))

GT_HWIntrinsic is already handled a few lines below.

Author

Thanks, I missed that. I don't believe that we have (or will have) any Arm64 intrinsics that can be contained, so I'll remove the one further down.

Member

Did this get addressed? I don't see the matching removal

Author

It's a bit hard to see, with the way the diffs are shown, but if you look below, this line was deleted:

        else if (tree->OperIsHWIntrinsic())

and the genConsumeReg below it is actually the one for the OperIsInitVal case.

genConsumeAddress(op1);
// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);

Meh, and I just got rid of the last usage of indirForm in XARCH codegen in a recent PR 😞 . I'll clean this up once the 3.0 release storm passes.

Author

Yes, after looking at your change here: mikedn@af3df85#diff-04ea18d1ee0bf7efd98adb6840a48369 I seriously considered waiting for and/or incorporating those changes, but decided to take the expedient path so that this stands a better chance of getting into 3.0.

// rmOp -- The second operand, which may be a memory node or a node producing a register
// ival -- The immediate operand
//
void emitter::emitIns_R_RM_I(instruction ins, emitAttr attr, regNumber reg1, GenTree* rmOp, int ival)

I wouldn't put this in the emitter. Ideally the emitter interface should not deal with GenTree, only instructions, registers and immediates/displacements.

Member

You could probably just update genHWIntrinsic_R_RM_I():

void CodeGen::genHWIntrinsic_R_RM_I(GenTreeHWIntrinsic* node, instruction ins, int8_t ival)

Member

Hmm, it actually looks like most of this logic was pulled from the previous impl of genHWIntrinsic_R_RM_I

Author

I extracted this from genHWIntrinsic_R_RM_I because it is not specific to HW intrinsics. And historically the emitter has dealt with GenTree nodes for addresses. That said, I take the point that it should be in CodeGen instead of in emitter. I'll make that change.

Member

I think most of the genHWIntrinsic_R_... methods are probably more "general-purpose" than HWIntrinsic specific.

That is, outside some asserts and the explicit code path for containedOp->OperIsHWIntrinsic(), they are just taking a node and determining which emitter call to make (R, AR, C, A, S, etc).

It might be worth refactoring all of these to be more generally usable?
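
For readers less familiar with this code, here is a rough sketch of the shared pattern being described: take an operand that may be contained and pick the matching emitter form. The method name is made up for illustration, and the real helpers also handle spill temps, locals, and class-variable addresses.

void CodeGen::genInstrWithRMOperand(instruction ins, emitAttr attr, regNumber targetReg, GenTree* rmOp)
{
    emitter* emit = getEmitter();
    if (rmOp->isContained() && rmOp->OperIsIndir())
    {
        // Contained memory operand: emit the address ("A") form.
        emit->emitIns_R_A(ins, attr, targetReg, rmOp->AsIndir());
    }
    else
    {
        // Operand produced into a register: emit the register ("R") form.
        emit->emitIns_R_R(ins, attr, targetReg, rmOp->gtRegNum);
    }
}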

@mikedn Mar 4, 2019

The various "RM" handling functions and the relevant emitter interface functions are pretty messed up as is. It will take a significant effort to address all that and it certainly doesn't belong in this PR. For now just avoiding adding more stuff to the emitter interface should be enough. It already suffers from duplication, bad function and parameter names etc.

Member

It will take a significant effort to address all that and it certainly doesn't belong in this PR.

Completely right, I was more asking as a potential future work item.

Author

As it turns out, extracting the HWIntrinsic dependencies from this was a bit problematic, so although I moved it to CodeGen::inst_RV_TT_IV(), it is still under #ifdef FEATURE_HW_INTRINSICS.

return;
}
if (op1->OperIs(GT_LIST))
{
Member

nit: assert(node->gtGetOp2() == nullptr) when op1 is a list?

Author

Will do.

assert(node->gtGetOp2() == nullptr);
return;
}
if (op1->OperIs(GT_LIST))
Member

nit: OperIsList()?

Author

I don't think we have a convention for this, but the OperIsXXX methods pre-date @mikedn's addition of the OperIs method, which I tend to prefer since it makes it clear that you are checking for one or more specific operators. I prefer the OperIsXXX when it is used to check for a "class" of operators. cc @dotnet/jit-contrib for opinions on usage of these.
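
For illustration, the two styles in question look like this (hypothetical fragment):

bool isList   = op1->OperIs(GT_LIST);         // OperIs: check for one specific operator...
bool isAddSub = op1->OperIs(GT_ADD, GT_SUB);  // ...or any of several specific operators
bool isList2  = op1->OperIsList();            // OperIsXXX: a named predicate, typically for a "class" of operators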


I wouldn't bother too much with this. Hopefully we'll just get rid of GT_LIST. I have ~75% of it done, just waiting for intrinsics to finally stabilize so I don't have to keep fixing conflicts.

// None.
//

void CodeGen::genConsumeHWIntrinsicOperands(GenTreeHWIntrinsic* node)
Member

Might be nice to centralize the assert that numArgs matches the number of node operands here as well?

Author

Makes sense.
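
A sketch of what that could look like, assuming the op1/GT_LIST operand shape discussed elsewhere in this PR (the helper names exist in the JIT, but the body here is written for illustration and may not match the final code):

void CodeGen::genConsumeHWIntrinsicOperands(GenTreeHWIntrinsic* node)
{
    int      numArgs = HWIntrinsicInfo::lookupNumArgs(node);
    GenTree* op1     = node->gtGetOp1();

    if (op1 == nullptr)
    {
        assert(numArgs == 0);
        assert(node->gtGetOp2() == nullptr);
        return;
    }

    if (op1->OperIsList())
    {
        assert(node->gtGetOp2() == nullptr);
        int count = 0;
        for (GenTreeArgList* list = op1->AsArgList(); list != nullptr; list = list->Rest())
        {
            genConsumeRegs(list->Current());
            count++;
        }
        assert(count == numArgs);
        return;
    }

    genConsumeRegs(op1);
    GenTree* op2 = node->gtGetOp2();
    if (op2 == nullptr)
    {
        assert(numArgs == 1);
    }
    else
    {
        genConsumeRegs(op2);
        assert(numArgs == 2);
    }
}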

emit->emitIns_AR_R(ins, simdSize, op2Reg, op1Reg, 0);
}
else if ((ival != -1) && varTypeIsFloating(baseType))
if ((ival != -1) && varTypeIsFloating(baseType))
{
assert((ival >= 0) && (ival <= 127));
genHWIntrinsic_R_R_RM_I(node, ins, ival);
}
else if (category == HW_Category_MemoryLoad)
Member

Why isn't this case genConsumeAddress like the other loads?

Author

It's sort of arbitrary. genConsumeRegs will correctly handle an address. genConsumeAddress is useful if you already know you have an address, because it does fewer checks. This is distinct from genConsumeReg which is only used if a node is known to produce a single register.
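
To make the distinction concrete, a comparison sketch (these calls would not appear together; each line shows the helper used in the corresponding situation):

genConsumeAddress(op1); // op1 is known to be an address: consume its base/index registers directly
genConsumeRegs(op1);    // general case: handles contained addresses, contained memory operands, or a register
genConsumeReg(op1);     // fast path: op1 is known to produce exactly one register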

op1Reg = op1->gtRegNum;
genConsumeOperands(node);
}
genConsumeHWIntrinsicOperands(node);
Member

Shouldn't there be some additional removals of genConsumeReg from the below (for this and similar edits)?

Author

I didn't transform all the methods in this file to use genConsumeHWIntrinsicOperands - only the ones where it was specifically beneficial to abstract out the possibility that they might be contained memory ops/addresses. That may be a useful future change, but I wouldn't necessarily do that with this PR.

CarolEidt changed the title from "[WIP] Handle addressing modes for HW intrinsics" to "Handle addressing modes for HW intrinsics" on Mar 4, 2019
@CarolEidt

test OSX10.12 x64 Checked Innerloop Build and Test
test Windows_NT arm Cross Checked Innerloop Build and Test

@CarolEidt

@dotnet/jit-contrib PTAL

// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);

The old code was using emitIns_R_AR; that one has an Is4ByteSSEInstruction check that increases the instruction size by 2. emitInsLoadInd does not seem to have such a check.

Author

Right - I took a quick look at that when you made that comment earlier, and I wasn't sure it was actually required. I need to take another look.


That 4-byte check appeared in #16249 and was updated in #21528. It's not clear to me why that was done in the emitIns functions instead of the emitInsSize functions.

Author

So far I believe that this 2-byte increment should actually be 1 byte. That said, there is definitely inconsistency in how the sizes are determined, and the code allows the size to be over-estimated, so those errors go undetected. I'm trying to decide whether to make a more general cleanup here or just go with the 1 byte increment for this case.

Member

I thought we had a work item tracking the cleanup of emitInsSize, but I can't seem to find it.

From what I remember, these one offs were done because it was non-trivial to update the existing emitInsSize checks due to other complications at the time. I think it is +2 to handle the non-VEX encoded case but only needs to be +1 for the VEX encoding.

Author

I believe it's the case that for these addressing-mode instructions, the SSE encodings require one additional byte (and the VEX encodings don't require any additional bytes). I've run all the hw intrinsic tests without AVX & AVX2 with the most recent set of changes.
The size analysis is so convoluted that I am mostly relying on empirical results (i.e. testing).

Author

I added a note to https://github.com/dotnet/coreclr/issues/23006 that this should also be addressed.

else
{
regNumber rmOpReg = rmOp->gtRegNum;
getEmitter()->emitIns_SIMD_R_R_I(ins, attr, reg1, rmOpReg, ival);
Author

This isn't really SIMD-specific (and should probably be moved), but it's the method that will do a copy to the target if needed for the RMW form.
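
For context, the pattern being referred to is roughly the following (a simplified sketch of the xarch emitter helper, not the exact implementation):

void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int ival)
{
    if (UseVEXEncoding())
    {
        // VEX encodings take a separate destination register, so no copy is needed.
        emitIns_R_R_I(ins, attr, targetReg, op1Reg, ival);
    }
    else
    {
        // Legacy SSE encodings are read-modify-write on the first operand:
        // if the target differs from the source, copy the source into the target first.
        if (op1Reg != targetReg)
        {
            emitIns_R_R(INS_movaps, attr, targetReg, op1Reg);
        }
        emitIns_R_I(ins, attr, targetReg, ival);
    }
}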

@CarolEidt

@dotnet-bot test Ubuntu x64 Checked CoreFX Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Formatting

@CarolEidt

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests

@CarolEidt

@dotnet/jit-contrib @tannergooding @fiigii PTAL

GenTree* op1 = node->gtGetOp1();
if (op1 == nullptr)
{
assert((numArgs == 0) && (node->gtGetOp2() == nullptr));
Member

nit: It is generally nice to break && asserts into two separate ones. That makes it much clearer which assert is the problematic one when they are hit.

Author

I generally disagree. Asserts already clutter the code, and though they are useful, they should impact the readability & flow of the code as little as possible, IMO.
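
The trade-off under discussion, side by side:

assert((numArgs == 0) && (node->gtGetOp2() == nullptr)); // compact, but a failure doesn't say which condition fired
assert(numArgs == 0);                                    // two statements: more clutter,
assert(node->gtGetOp2() == nullptr);                     // but the failing assert is unambiguous in the dump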


And this is another bit of code that will go away when I get rid of GT_LIST...

{
assert((ival >= 0) && (ival <= 127));
genHWIntrinsic_R_RM_I(node, ins, (int8_t)ival);
genConsumeAddress(op1);
Member

I believe you mentioned that genConsumeOperands will also handle the address case, but genConsumeAddress is more efficient.

My question then would be: Is it actually worth doing explicit genConsumeAddress calls here? It looks like it makes the overall logic slightly more complex...

Author

In what way does it make the logic more complex? It's probably a matter of taste, but this case needs special handling anyway, and it seems clearest to explicitly consume an address.

@tannergooding Mar 13, 2019

It's probably a matter of taste

Possibly. I would have thought that the existing genConsumeOperands(node) call would have handled things, in which case the changes could have been minimized to just changing emit->emitIns_R_AR(ins, simdSize, targetReg, op1Reg, 0); to:

// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);

This would have made the delta smaller and kept the nesting of the if blocks smaller. It also gives a single spot that operand consumption happens so future changes don't need to worry about it.

Likewise, down on L144, just the if block could have been added and the existing genConsumeOperands(node) would have handled both the memory store and non-memory store cases (rather than splitting this into 4 separate calls).


Please avoid genConsumeOperands in SIMD and HWIntrinsic code, it's incompatible with my pending GT_LIST changes.

@fiigii left a comment

LGTM, just one suggestion.

// This is the aligned case.
Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
// This is the unaligned case.
Sse2.StoreScalar(p + alignmentOffset + 1, Sse2.Subtract(v, Sse2.LoadVector128(p + offset + alignmentOffset + 1)));

The test cases don't look sufficient. This PR also changes several Is4ByteSSEInstruction points, so I suggest adding an Sse41.LoadAlignedVector128NonTemporal test and a Vector256<T> load/store test.

In the future, it would be nice to update the intrinsic test template to cover more addressing modes.

@CarolEidt

@redknightlois - I plan to get this in for 3.0, though while waiting for the CI issues to resolve I've gotten side-tracked on other things. I should have an updated PR by the end of this week.

Also, eliminate some places where the code size estimates were over-estimating.

Contributes to #19550
Fix #19521
@CarolEidt

@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0)

@CarolEidt

@dotnet-bot test Windows_NT arm Cross Checked Innerloop Build and Test


// Now do a non-temporal load from the aligned location.
// That should give us {0, 0, 1, 0}.
Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();

Sse41.IsSupported?

Author

Right - of course! Forgot all about the checks!

{
try
{
Avx.Store(p + 1, Avx.Subtract(v, Avx.LoadVector256(p + offset + 1)));

Avx.IsSupported?

@CarolEidt

@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test

@CarolEidt

@dotnet/jit-contrib @tannergooding @fiigii PTAL

@fiigii commented Mar 22, 2019

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitsse2only

@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitsse2only

@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic


// Now do a non-temporal load from the aligned location.
// That should give us {0, 0, 1, 0}.
Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();
@fiigii Mar 23, 2019

We need if (Sse41.IsSupported) for this case.

Author

Excellent, thanks. I split this out so that we do an Sse2 load when Sse41 isn't available.

@CarolEidt

@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitsse2only

@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitsse2only

@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitsse2only
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

@CarolEidt

@dotnet/jit-contrib @tannergooding ping - I addressed @fiigii's feedback on the test case, and fixed an error in the guard for it. I think this is ready to merge.

@tannergooding left a comment

LGTM

CarolEidt merged commit da6ed11 into dotnet:master on Mar 26, 2019
CarolEidt deleted the Issue19521 branch on March 26, 2019 at 23:13
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Handle addressing modes for HW intrinsics

Also, eliminate some places where the code size estimates were over-estimating.

Contributes to dotnet/coreclr#19550
Fix dotnet/coreclr#19521


Commit migrated from dotnet/coreclr@da6ed11