JIT ARM64-SVE: Allow LCL_VARs to store as mask #99608

Merged
merged 28 commits into dotnet:main on Mar 21, 2024

Conversation

@a74nh (Contributor) commented Mar 12, 2024:

Currently all mask variables are being converted to a vector before being stored
to memory. They are then converted from vector to mask after loading from
memory.

This patch allows LCL_VARs to have a type of mask, and skip the conversions.

When creating the convert-to-mask node, there is no way of knowing what the parent
node is. This means the convert must always be created and can then be removed
when importing the LCL_VAR store. This is done in importer.cpp (there may be a better
place to do this).

Other changes are required to allow the LCL_VAR to have the type TYP_MASK.

I suspect the changes in codegenarm64.cpp and emitarm64.cpp introduce a TP
regression. These are removable once predicates have register allocation.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 12, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 12, 2024
@a74nh a74nh marked this pull request as ready for review March 12, 2024 16:22
@a74nh (Contributor Author) commented Mar 12, 2024:

@dotnet/arm64-contrib @kunalspathak @tannergooding
This is ready, but if TP is an issue it may need to wait until predicate register allocation is done.

@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Mar 12, 2024
Comment on lines 2969 to 2972
if (ins == INS_sve_str && !varTypeUsesMaskReg(targetType))
{
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, INS_SCALABLE_OPTS_UNPREDICATED);
}
Member:

Why does this pass down UNPREDICATED rather than doing the inverse?

That is, I imagine most instructions we encounter will end up unpredicated (or effectively unpredicated by using a TrueMask), so I'd expect we'd end up with fewer checks overall if we simply said if (varTypeUsesMaskReg(targetType)) { insOpts |= INS_SCALABLE_OPTS_PREDICATED; }

Otherwise, we end up having to special-case every single instruction that has a predicated and an unpredicated form, and additionally check whether it uses a mask register or not.
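For illustration, a minimal sketch of the inversion being suggested, reusing the emitIns_S_R overload from the snippet above; INS_SCALABLE_OPTS_PREDICATED is taken from the comment and assumed to exist alongside the UNPREDICATED flag:

```cpp
// Sketch only: default to the unpredicated form and opt in to the predicated
// form when the target type lives in a mask register, rather than
// special-casing each instruction that has both forms.
insScalableOpts sopt = INS_SCALABLE_OPTS_UNPREDICATED;
if (varTypeUsesMaskReg(targetType))
{
    sopt = INS_SCALABLE_OPTS_PREDICATED;
}
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, sopt);
```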

Contributor Author:

That's because the emit function is inconveniently the wrong way around. I was going to fix the emit function up in this PR, but once register allocation is working we can get rid of the enum and all of these checks.

Given that the register allocation work is going to take some time, maybe I should fix up the emit code in this PR.

Member:

Sounds good to me. Just wanted to get clarification, as it did seem backwards.

Contributor Author:

> That's because the emit function is inconveniently the wrong way around

Switched this around. It no longer matches some of the other emit_R_R_etc functions, but that's ok because it'll all vanish eventually.

Comment on lines 6423 to 6430
// Masks must be converted to vectors before being stored to memory.
// But, for local stores we can optimise away the conversion
if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
op1 = op1->AsHWIntrinsic()->Op(1);
lvaTable[lclNum].lvType = TYP_MASK;
lclTyp = lvaGetActualType(lclNum);
}
Member:

Why do they need to be converted before being stored to memory?

Shouldn't this be entirely dependent on the target type, just with a specialization for locals since we want to allow efficient consumption in the typical case?

Contributor Author:

Is this just a suggestion to change the comments, or a code change too?

Member:

Just a change to the comment.

I think we only need to clarify that we're optimizing masks stored to locals to allow better consumption in the expected typical case and that we have the handling in place to ensure that consumers which actually need a vector get the ConvertMaskToVector inserted back in.


Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored, then it would require a ConvertVectorToMask to be inserted, but that would be a lossy conversion and therefore unsafe.

So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user-defined code, so that we only do this optimization when all values being stored are masks.

Contributor Author:

> Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored, then it would require a ConvertVectorToMask to be inserted, but that would be a lossy conversion and therefore unsafe.
>
> So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user-defined code, so that we only do this optimization when all values being stored are masks.

I'm not sure this is the right place either. Originally I wanted to avoid creating the conversion in the first place, but realised it has to be created and then removed after/during creation of the local. That's how I ended up putting it in the importer.

I can't see an obvious later phase where this would be done. Morph? Or maybe during lowering? Anything that already walks local vars would be better?

Member:

I believe the correct location for this would be lclmorph, which is the same place we do sequencing, mark address exposed locals, and even combine field stores to SIMD.

However, @EgorBo or @jakobbotsch may have a better idea on where the code should go.

The consideration in general is really just that we need to find non-address-exposed TYP_SIMD locals where all stores are ConvertMaskToVector so we can replace them with TYP_MASK locals instead. There are of course other optimizations that could be done, including splitting a local into two if some stores are masks and some are vectors, but those will be less common than the first.

This work needs to happen after import since that's the only code that would be using pre-existing locals. Any locals introduced by CSE or other phases involving a mask will already be TYP_MASK themselves.
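For illustration, a rough sketch of the candidate test being described; lvaGetDesc, varTypeIsSIMD and IsAddressExposed are the usual JIT accessors, and the surrounding pass is hypothetical:

```cpp
// Sketch only: a local qualifies for retyping to TYP_MASK when it is a
// non-address-exposed TYP_SIMD local; the pass must additionally prove that
// every value stored to it is an NI_Sve_ConvertMaskToVector node.
bool isMaskLocalCandidate(Compiler* comp, unsigned lclNum)
{
    LclVarDsc* varDsc = comp->lvaGetDesc(lclNum);

    if (!varTypeIsSIMD(varDsc->TypeGet()) || varDsc->IsAddressExposed())
    {
        return false;
    }

    // Still to be checked by the caller: all stores to this local are
    // ConvertMaskToVector (and, if not, whether splitting is worthwhile).
    return true;
}
```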

Member:

Indeed, this kind of brute-force changing of a local's type is not safe. It will break all sorts of IR invariants we want to be able to rely on if you don't also do a full pass to fix up other uses of the local.

It seems like this kind of optimization should be its own separate pass after local morph, when we know whether or not the local is address exposed. We have linked locals at that point, so finding the occurrences is relatively efficient. You can leave some breadcrumb around to only run the pass when there are opportunities.

Contributor Author:

> Indeed, this kind of brute-force changing of a local's type is not safe.

I'll add a new pass then. For now I've removed the importer changes.

Technically, this PR could be merged as is. It won't make any JIT difference by itself, but it's quite a lot of code that is a blocker for other HW intrinsic work I'm doing (the embedded masks).

Contributor Author:

> We have linked locals at that point

Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

Member:

Doing it in a separate PR sounds good to me. Presumably it needs some heuristics to figure out whether it's profitable to make the replacements as well.

> Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

See Statement::LocalsTreeList. It allows you to quickly check whether a statement contains a local you are interested in.
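For illustration, a rough sketch of how a later pass might use this list, assuming the range-based iteration LocalsTreeList provides and a hypothetical candidateLclNum from the analysis above:

```cpp
// Sketch only: walk each statement's locals list so statements that never
// mention the candidate local can be skipped cheaply.
for (BasicBlock* block : Blocks())
{
    for (Statement* stmt : block->Statements())
    {
        for (GenTreeLclVarCommon* lcl : stmt->LocalsTreeList())
        {
            if (lcl->GetLclNum() != candidateLclNum)
            {
                continue;
            }
            // Inspect this appearance: is it a store whose value is
            // ConvertMaskToVector, or a use that would need a conversion
            // re-inserted if the local were retyped to TYP_MASK?
        }
    }
}
```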

if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
op1 = op1->AsHWIntrinsic()->Op(1);
lvaTable[lclNum].lvType = TYP_MASK;
Member:

I see we're doing this retyping here. But I don't see where we're doing any fixups to ensure that something which reads the TYP_MASK local but expects a TYP_SIMD will get the ConvertMaskToVector inserted back.

Imagine for example, something like:

Vector<int> mask = Vector.GreaterThan(x, y);
return mask + Vector.Create(5);

Where Vector.GreaterThan will produce a TYP_MASK, but the latter + consumes it as a vector. Noting that this is an abnormal pattern, but it is something we still need to account for and ensure works correctly.

Member:

I might expect such logic, inserting a general ConvertMaskToVector, to exist as part of impSimdPopStack and/or as part of the general import helpers we have in hwintrinsic.cpp.
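For illustration, a rough sketch of the kind of use-site fixup being described. gtNewConvertMaskToVector is a hypothetical stand-in for whatever helper builds the NI_Sve_ConvertMaskToVector node, and expectedType/simdBaseJitType/simdSize stand for context available at the call site:

```cpp
// Sketch only: when a consumer expects a vector but the popped value is a
// TYP_MASK local, wrap it so the vector-typed use remains valid.
GenTree* op = impSimdPopStack(); // per the comment above; exact helper may differ
if (op->TypeIs(TYP_MASK) && (expectedType != TYP_MASK))
{
    // Hypothetical helper name; the real JIT helper may be named differently.
    op = gtNewConvertMaskToVector(op, expectedType, simdBaseJitType, simdSize);
}
```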

@a74nh (Contributor Author) commented Mar 18, 2024:

TP issues should now be resolved.

My addition of a sopt parameter to emitIns_R_S/emitIns_S_R means potentially having to change every call to emitIns_R_S/emitIns_S_R, and there are lots of them (I fell over this in my follow-on PR).

Instead, I've removed all that code and added new instructions INS_sve_ldr_mask and INS_sve_str_mask, defined directly in instr.h in the same way that INS_lea is defined out of line. They are returned from ins_Load()/ins_Store(), which means none of the callers need fixing up.

All of this is removable once register allocation is done.
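For illustration, a rough sketch of the selection being described; the signature here mirrors the usual ins_Store(var_types, aligned) shape, but the real change in the PR may differ in detail:

```cpp
// Sketch only: returning a pseudo-instruction for mask locals means none of
// the existing callers of ins_Load()/ins_Store() need to change; the emitter
// expands the pseudo into a real SVE predicate ldr/str later.
instruction ins_Store(var_types dstType, bool aligned = false)
{
    if (varTypeUsesMaskReg(dstType))
    {
        // TODO-SVE: removable once predicate registers get register allocation.
        return INS_sve_str_mask;
    }

    // ... existing scalar/vector selection ...
    return INS_str;
}
```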

@a74nh (Contributor Author) commented Mar 18, 2024:

I'm happy again with the PR, everything is fixed up and all tests look good.

@kunalspathak (Member):

> TP issues should now be resolved.

(screenshot of TP results)

@kunalspathak (Member) left a comment:

Given that this PR is kind of adding support for future scenarios, I wouldn't expect any TP impact from this, but there is some regression. The changes in emitIns_R_S might be the cause. I am wondering: is the switching of the format when isScalable correct? Basically, we still want to execute code like codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm); while still retaining the scalable format?

@@ -66,6 +66,10 @@ enum instruction : uint32_t

INS_lea, // Not a real instruction. It is used for load the address of stack locals

// TODO-SVE: Removable once REG_V0 and REG_P0 are distinct
INS_sve_str_mask, // Not a real instruction. It is used to load masks from the stack
Member:

This should go in instrsarm64.h. We already have one for "align", for example.

@a74nh (Contributor Author) commented Mar 19, 2024:

> Given that this PR is kind of adding support for future scenarios, I wouldn't expect any TP impact from this, but there is some regression. The changes in emitIns_R_S might be the cause.

Looking through the PR, everything that could affect the TP is:

  • Changes to emitIns_R_S/emitIns_S_R: an extra if (!isScalable) { fmt = scalarfmt; } (see the sketch below).
  • Changes to ins_Move_Extend/ins_Load/ins_Store/ins_Copy/ins_StoreFromSrc: extra calls to varTypeUsesMaskReg(), which just does a TypeGet(vt) == TYP_MASK check.

> I am wondering: is the switching of the format when isScalable correct? Basically, we still want to execute code like codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm); while still retaining the scalable format?

Yes, that's right. I've refactored this a little now. There should be no overall functional change, but it should be clearer and hopefully give better throughput.
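For illustration, a rough sketch of the shape of that emitIns_R_S/emitIns_S_R change; scalableFmt and scalarFmt are stand-in names rather than the emitter's actual identifiers:

```cpp
// Sketch only: pick the scalable (SVE) format by default and fall back to the
// pre-existing scalar format when the access is not scalable, so the non-SVE
// path keeps its original behaviour (including the reserved-register
// immediate handling via instGen_Set_Reg_To_Imm).
insFormat fmt = scalableFmt;
if (!isScalable)
{
    fmt = scalarFmt;
}
```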

@kunalspathak (Member) left a comment:

LGTM

@kunalspathak (Member):

/azp run runtime-coreclr superpmi-diffs

@kunalspathak (Member):

/azp run runtime-coreclr superpmi-replay

Azure Pipelines successfully started running 1 pipeline(s).

(1 similar comment)

@a74nh (Contributor Author) commented Mar 21, 2024:

Tests were being cancelled again, so rebased.
Before doing so, the TP results were:

(screenshot of TP results, taken 2024-03-21 09:24)

Which is better than the overall 0.11% we were seeing before.

@kunalspathak (Member):

The superpmi-replay failures seem to be timeouts on arm32.

@kunalspathak kunalspathak merged commit 12d96cc into dotnet:main Mar 21, 2024
105 of 110 checks passed
@a74nh a74nh deleted the lcl_var_mask_github branch March 22, 2024 10:20
@github-actions github-actions bot locked and limited conversation to collaborators Apr 23, 2024