JIT ARM64-SVE: Allow LCL_VARs to store as mask #99608

Merged
merged 28 commits into dotnet:main on Mar 21, 2024

Conversation

@a74nh (Contributor) commented Mar 12, 2024:

Currently all mask variables are being converted to a vector before being stored
to memory. They are then converted from vector to mask after loading from
memory.

This patch allows LCL_VARs to have a type of mask, and skip the conversions.

When creating the convert-to-mask node, there is no way of knowing what the parent
node is. This means the convert must always be created and can then be removed
when importing the LCL_VAR store. This is done in importer.cpp (there may be a better
place to do this).

Other changes are required to allow the LCL_VAR to have the type TYP_MASK.

I suspect the changes in codegenarm64.cpp and emitarm64.cpp introduce a TP
regression. These are removable once predicates have register allocation.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 12, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 12, 2024
@a74nh a74nh marked this pull request as ready for review March 12, 2024 16:22
@a74nh (Contributor Author) commented Mar 12, 2024:

@dotnet/arm64-contrib @kunalspathak @tannergooding
This is ready, but if TP is an issue it may need to wait until predicate register allocation is done.

@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Mar 12, 2024
Comment on lines 2969 to 2972
if (ins == INS_sve_str && !varTypeUsesMaskReg(targetType))
{
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, INS_SCALABLE_OPTS_UNPREDICATED);
}
Member:

Why does this pass down UNPREDICATED rather than doing the inverse?

That is, I imagine most instructions we encounter will end up unpredicated (or effectively unpredicated by using a TrueMask), so I'd expect we'd end up with fewer checks overall if we simply said if (varTypeUsesMaskReg(targetType)) { insOpts |= INS_SCALABLE_OPTS_PREDICATED; }

Otherwise, we end up having to special-case every single instruction that has a predicated and an unpredicated form, and additionally check whether it uses a mask register or not.
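For illustration, a minimal sketch of the inversion being suggested, reusing the emitIns_S_R overload from the snippet above; INS_SCALABLE_OPTS_PREDICATED is taken from the comment and assumed to exist alongside the UNPREDICATED flag:

```cpp
// Sketch only: default to the unpredicated form and opt in to the predicated
// form when the target type lives in a mask register, rather than
// special-casing each instruction that has both forms.
insScalableOpts sopt = INS_SCALABLE_OPTS_UNPREDICATED;
if (varTypeUsesMaskReg(targetType))
{
    sopt = INS_SCALABLE_OPTS_PREDICATED;
}
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, sopt);
```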

Contributor Author:

That's because the emit function is inconveniently the wrong way around. I was going to fix the emit function up in this PR, but once register allocation is working we can get rid of the enum and all of these checks.

Given that the register allocation work is going to take some time, maybe I should fix up the emit code in this PR.

Member:

Sounds good to me. Just wanted to get clarification, as it did seem backwards.

Contributor Author:

> That's because the emit function is inconveniently the wrong way around

Switched this around. It no longer matches some of the other emit_R_R_etc functions, but that's ok because it'll all vanish eventually.

Comment on lines 6423 to 6430
// Masks must be converted to vectors before being stored to memory.
// But, for local stores we can optimise away the conversion
if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
op1 = op1->AsHWIntrinsic()->Op(1);
lvaTable[lclNum].lvType = TYP_MASK;
lclTyp = lvaGetActualType(lclNum);
}
Member:

Why do they need to be converted before being stored to memory?

Shouldn't this be entirely dependent on the target type, just with a specialization for locals since we want to allow efficient consumption in the typical case?

Contributor Author:

Is this just a suggestion to change the comments, or a code change too?

Member:

Just a change to the comment.

I think we only need to clarify that we're optimizing masks stored to locals to allow better consumption in the expected typical case and that we have the handling in place to ensure that consumers which actually need a vector get the ConvertMaskToVector inserted back in.


Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored, then it would require a ConvertVectorToMask to be inserted, but that would be a lossy conversion and therefore unsafe.

So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user-defined code, so that we only do this optimization when all values being stored are masks.

Contributor Author:

> Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored, then it would require a ConvertVectorToMask to be inserted, but that would be a lossy conversion and therefore unsafe.
>
> So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user-defined code, so that we only do this optimization when all values being stored are masks.

I'm not sure this is the right place either. Originally I wanted to avoid creating the conversion in the first place, but realised it has to be created and then removed after/during creation of the local. That's how I ended up putting it in the importer.

I can't see an obvious later phase where this would be done. Morph? Or maybe during lowering? Anything that already walks local vars would be better?

Member:

I believe the correct location for this would be lclmorph, which is the same place we do sequencing, mark address exposed locals, and even combine field stores to SIMD.

However, @EgorBo or @jakobbotsch may have a better idea on where the code should go.

The consideration in general is really just that we need to find non-address-exposed TYP_SIMD locals where all stores are ConvertMaskToVector so we can replace them with TYP_MASK locals instead. There are of course other optimizations that could be done, including splitting a local into two if some stores are masks and some are vectors, but those will be less common than the first.

This work needs to happen after import since that's the only code that would be using pre-existing locals. Any locals introduced by CSE or other phases involving a mask will already be TYP_MASK themselves.
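For illustration, a rough sketch of the candidate test being described; lvaGetDesc, varTypeIsSIMD and IsAddressExposed are the usual JIT accessors, and the surrounding pass is hypothetical:

```cpp
// Sketch only: a local qualifies for retyping to TYP_MASK when it is a
// non-address-exposed TYP_SIMD local; the pass must additionally prove that
// every value stored to it is an NI_Sve_ConvertMaskToVector node.
bool isMaskLocalCandidate(Compiler* comp, unsigned lclNum)
{
    LclVarDsc* varDsc = comp->lvaGetDesc(lclNum);

    if (!varTypeIsSIMD(varDsc->TypeGet()) || varDsc->IsAddressExposed())
    {
        return false;
    }

    // Still to be checked by the caller: all stores to this local are
    // ConvertMaskToVector (and, if not, whether splitting is worthwhile).
    return true;
}
```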

Member:

Indeed, this kind of brute-force changing of a local's type is not safe. It will break all sorts of IR invariants we want to be able to rely on if you don't also do a full pass to fix up other uses of the local.

It seems like this kind of optimization should be its own separate pass after local morph, when we know whether or not the local is address exposed. We have linked locals at that point, so finding the occurrences is relatively efficient. You can leave some breadcrumb around to only run the pass when there are opportunities.

Contributor Author:

> Indeed, this kind of brute-force changing of a local's type is not safe.

I'll add a new pass then. For now I've removed the importer changes.

Technically, this PR could be merged as is. It won't make any JIT difference by itself, but it's quite a lot of code that is a blocker for other HW intrinsic work I'm doing (the embedded masks).

Contributor Author:

> We have linked locals at that point

Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

Member:

Doing it in a separate PR sounds good to me. Presumably it needs some heuristics to figure out whether it's profitable to make the replacements as well.

> Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

See Statement::LocalsTreeList. It allows you to quickly check whether a statement contains a local you are interested in.
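For illustration, a rough sketch of how a later pass might use this list, assuming the range-based iteration LocalsTreeList provides and a hypothetical candidateLclNum from the analysis above:

```cpp
// Sketch only: walk each statement's locals list so statements that never
// mention the candidate local can be skipped cheaply.
for (BasicBlock* block : Blocks())
{
    for (Statement* stmt : block->Statements())
    {
        for (GenTreeLclVarCommon* lcl : stmt->LocalsTreeList())
        {
            if (lcl->GetLclNum() != candidateLclNum)
            {
                continue;
            }
            // Inspect this appearance: is it a store whose value is
            // ConvertMaskToVector, or a use that would need a conversion
            // re-inserted if the local were retyped to TYP_MASK?
        }
    }
}
```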

if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
op1 = op1->AsHWIntrinsic()->Op(1);
lvaTable[lclNum].lvType = TYP_MASK;
Member:

I see we're doing this retyping here. But I don't see where we're doing any fixups to ensure that something which reads the TYP_MASK local but expects a TYP_SIMD will get the ConvertMaskToVector inserted back.

Imagine for example, something like:

Vector<int> mask = Vector.GreaterThan(x, y);
return mask + Vector.Create(5);

Where Vector.GreaterThan will produce a TYP_MASK, but the latter + consumes it as a vector. Noting that this is an abnormal pattern, but it is something we still need to account for and ensure works correctly.

Member:

I might expect such logic, inserting a general ConvertMaskToVector, to exist as part of impSimdPopStack and/or as part of the general import helpers we have in hwintrinsic.cpp.
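For illustration, a rough sketch of the kind of use-site fixup being described. gtNewConvertMaskToVector is a hypothetical stand-in for whatever helper builds the NI_Sve_ConvertMaskToVector node, and expectedType/simdBaseJitType/simdSize stand for context available at the call site:

```cpp
// Sketch only: when a consumer expects a vector but the popped value is a
// TYP_MASK local, wrap it so the vector-typed use remains valid.
GenTree* op = impSimdPopStack(); // per the comment above; exact helper may differ
if (op->TypeIs(TYP_MASK) && (expectedType != TYP_MASK))
{
    // Hypothetical helper name; the real JIT helper may be named differently.
    op = gtNewConvertMaskToVector(op, expectedType, simdBaseJitType, simdSize);
}
```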

@a74nh (Contributor Author) commented Mar 18, 2024:

TP issues should now be resolved.

My addition of a sopt parameter to emitIns_R_S/emitIns_S_R means potentially having to change every call to emitIns_R_S/emitIns_S_R, and there are lots of them (I fell over this in my follow-on PR).

Instead, I've removed all that code and added new instructions INS_sve_ldr_mask and INS_sve_str_mask, defined directly in instr.h in the same way that INS_lea is defined out of line. They are returned from ins_Load()/ins_Store(), which means none of the callers need fixing up.

All of this is removable once register allocation is done.
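For illustration, a rough sketch of the selection being described; the signature here mirrors the usual ins_Store(var_types, aligned) shape, but the real change in the PR may differ in detail:

```cpp
// Sketch only: returning a pseudo-instruction for mask locals means none of
// the existing callers of ins_Load()/ins_Store() need to change; the emitter
// expands the pseudo into a real SVE predicate ldr/str later.
instruction ins_Store(var_types dstType, bool aligned = false)
{
    if (varTypeUsesMaskReg(dstType))
    {
        // TODO-SVE: removable once predicate registers get register allocation.
        return INS_sve_str_mask;
    }

    // ... existing scalar/vector selection ...
    return INS_str;
}
```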

@a74nh (Contributor Author) commented Mar 18, 2024:

I'm happy again with the PR, everything is fixed up and all tests look good.

@kunalspathak (Member):

> TP issues should now be resolved.

(screenshot of TP results)

@kunalspathak (Member) left a comment:

Given that this PR is kind of adding support for future scenarios, I wouldn't expect any TP impact from this, but there is some regression. The changes in emitIns_R_S might be the cause. I am wondering: is the switching of the format when isScalable correct? Basically, we still want to execute code like codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm); while still retaining the scalable format?

@@ -66,6 +66,10 @@ enum instruction : uint32_t

INS_lea, // Not a real instruction. It is used for load the address of stack locals

// TODO-SVE: Removable once REG_V0 and REG_P0 are distinct
INS_sve_str_mask, // Not a real instruction. It is used to load masks from the stack
Member:

This should go in instrsarm64.h. We already have one for "align", for example.

@a74nh (Contributor Author) commented Mar 19, 2024:

> Given that this PR is kind of adding support for future scenarios, I wouldn't expect any TP impact from this, but there is some regression. The changes in emitIns_R_S might be the cause.

Looking through the PR, everything that could affect the TP is:

  • Changes to emitIns_R_S/emitIns_S_R: an extra if (!isScalable) { fmt = scalarfmt; } (see the sketch below).
  • Changes to ins_Move_Extend/ins_Load/ins_Store/ins_Copy/ins_StoreFromSrc: extra calls to varTypeUsesMaskReg(), which just does a TypeGet(vt) == TYP_MASK check.

> I am wondering: is the switching of the format when isScalable correct? Basically, we still want to execute code like codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm); while still retaining the scalable format?

Yes, that's right. I've refactored this a little now. There should be no overall functional change, but it should be clearer and hopefully give better throughput.
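For illustration, a rough sketch of the shape of that emitIns_R_S/emitIns_S_R change; scalableFmt and scalarFmt are stand-in names rather than the emitter's actual identifiers:

```cpp
// Sketch only: pick the scalable (SVE) format by default and fall back to the
// pre-existing scalar format when the access is not scalable, so the non-SVE
// path keeps its original behaviour (including the reserved-register
// immediate handling via instGen_Set_Reg_To_Imm).
insFormat fmt = scalableFmt;
if (!isScalable)
{
    fmt = scalarFmt;
}
```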

@kunalspathak (Member) left a comment:

LGTM

@kunalspathak (Member):

/azp run runtime-coreclr superpmi-diffs

@kunalspathak (Member):

/azp run runtime-coreclr superpmi-replay

Azure Pipelines successfully started running 1 pipeline(s).

(1 similar comment)

@a74nh (Contributor Author) commented Mar 21, 2024:

Tests were being cancelled again, so rebased.
Before doing so, the TP results were:

(screenshot of TP results, taken 2024-03-21 09:24)

Which is better than the overall 0.11% we were seeing before.

@kunalspathak (Member):

The superpmi-replay failures seem to be timeouts on arm32.

@kunalspathak kunalspathak merged commit 12d96cc into dotnet:main Mar 21, 2024
105 of 110 checks passed
@a74nh a74nh deleted the lcl_var_mask_github branch March 22, 2024 10:20
@github-actions github-actions bot locked and limited conversation to collaborators Apr 23, 2024