JIT ARM64-SVE: Allow LCL_VARs to store as mask #99608

Merged
merged 28 commits into from
Mar 21, 2024
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
6628904
JIT ARM64-SVE: Allow LCL_VARs to store as mask
a74nh Mar 8, 2024
ed574f9
Remove FEATURE_MASKED_SIMD
a74nh Mar 13, 2024
02fa227
More generic ifdefs
a74nh Mar 13, 2024
2e2e174
Add varTypeIsSIMDOrMask
a74nh Mar 13, 2024
fcdb18a
Add extra type checks
a74nh Mar 13, 2024
687af37
Merge main
a74nh Mar 13, 2024
1fc8d5b
Fix use of isValidSimm9, and add extra uses
a74nh Mar 13, 2024
9dbfe63
Rename mask conversion functions to gtNewSimdConvert*
a74nh Mar 13, 2024
85f09bf
Add OperIs functions
a74nh Mar 13, 2024
7945d51
Mark untested uses of mov
a74nh Mar 14, 2024
bd5d951
Add INS_SCALABLE_OPTS_PREDICATE_DEST
a74nh Mar 14, 2024
ce61a40
Valuenum fixes for tier 1
a74nh Mar 14, 2024
b5502a6
Remove importer changes
a74nh Mar 14, 2024
39c02d0
XARCH versions of OperIsConvertMaskToVector
a74nh Mar 14, 2024
d8dea0e
Revert "Remove importer changes"
a74nh Mar 14, 2024
8ec8e38
Add tests for emitIns_S_R and emitIns_R_S
a74nh Mar 14, 2024
3ec441c
Fix formatting
a74nh Mar 15, 2024
f569512
Reapply "Remove importer changes"
a74nh Mar 15, 2024
0110170
Use dummy mask ldr and str
a74nh Mar 18, 2024
ec05e34
Refactor emitIns_S_R and emitIns_R_S
a74nh Mar 19, 2024
71bcb48
Move str_mask/ldr_mask
a74nh Mar 19, 2024
24cd68b
Fix formatting
a74nh Mar 19, 2024
5b995ae
Set imm
a74nh Mar 19, 2024
3a82d5d
fix assert
a74nh Mar 19, 2024
8baee38
Fix assert (2)
a74nh Mar 20, 2024
b22755a
Fix assert (3)
a74nh Mar 20, 2024
bd8db6e
nop
a74nh Mar 20, 2024
e359c93
Merge branch 'main' into lcl_var_mask_github
a74nh Mar 21, 2024
2 changes: 2 additions & 0 deletions src/coreclr/jit/CMakeLists.txt
@@ -75,12 +75,14 @@ function(create_standalone_jit)
if ((TARGETDETAILS_ARCH STREQUAL "x64") OR (TARGETDETAILS_ARCH STREQUAL "arm64") OR ((TARGETDETAILS_ARCH STREQUAL "x86") AND NOT (TARGETDETAILS_OS STREQUAL "unix")))
target_compile_definitions(${TARGETDETAILS_TARGET} PRIVATE FEATURE_SIMD)
target_compile_definitions(${TARGETDETAILS_TARGET} PRIVATE FEATURE_HW_INTRINSICS)
target_compile_definitions(${TARGETDETAILS_TARGET} PRIVATE FEATURE_MASKED_HW_INTRINSICS)
endif ()
endfunction()

if (CLR_CMAKE_TARGET_ARCH_AMD64 OR CLR_CMAKE_TARGET_ARCH_ARM64 OR (CLR_CMAKE_TARGET_ARCH_I386 AND NOT CLR_CMAKE_HOST_UNIX))
add_compile_definitions($<$<NOT:$<BOOL:$<TARGET_PROPERTY:IGNORE_DEFAULT_TARGET_ARCH>>>:FEATURE_SIMD>)
add_compile_definitions($<$<NOT:$<BOOL:$<TARGET_PROPERTY:IGNORE_DEFAULT_TARGET_ARCH>>>:FEATURE_HW_INTRINSICS>)
add_compile_definitions($<$<NOT:$<BOOL:$<TARGET_PROPERTY:IGNORE_DEFAULT_TARGET_ARCH>>>:FEATURE_MASKED_HW_INTRINSICS>)
endif ()

# JIT_BUILD disables certain PAL_TRY debugging features
21 changes: 19 additions & 2 deletions src/coreclr/jit/codegenarm64.cpp
@@ -2771,7 +2771,16 @@ void CodeGen::genCodeForLclVar(GenTreeLclVar* tree)
emitAttr attr = emitActualTypeSize(targetType);

emitter* emit = GetEmitter();
emit->emitIns_R_S(ins, attr, tree->GetRegNum(), varNum, 0);

if (ins == INS_sve_ldr && !varTypeUsesMaskReg(targetType))
{
emit->emitIns_R_S(ins, attr, tree->GetRegNum(), varNum, 0, INS_SCALABLE_OPTS_UNPREDICATED);
}
else
{
emit->emitIns_R_S(ins, attr, tree->GetRegNum(), varNum, 0);
}

genProduceReg(tree);
}
}
@@ -2956,7 +2965,15 @@ void CodeGen::genCodeForStoreLclVar(GenTreeLclVar* lclNode)
instruction ins = ins_StoreFromSrc(dataReg, targetType);
emitAttr attr = emitActualTypeSize(targetType);

emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0);
// TODO-SVE: Removable once REG_V0 and REG_P0 are distinct
if (ins == INS_sve_str && !varTypeUsesMaskReg(targetType))
{
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, INS_SCALABLE_OPTS_UNPREDICATED);
}

Member:
Why does this pass down UNPREDICATED rather than doing the inverse?

That is, I imagine most instructions we encounter will end up unpredicated (or effectively unpredicated by using a TrueMask), so I'd expect we end up with fewer checks overall if we simply said if (varTypeUsesMaskReg(targetType)) { insOpts |= INS_SCALABLE_OPTS_PREDICATED; }

Otherwise, we end up having to special case every single instruction that has a predicated and unpredicated form and additionally check if they use a mask register or not.
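
A minimal sketch of that inverted approach, using names from this PR (the exact flag value and plumbing here are assumptions for illustration, not the final code):

// Default to the unpredicated path and only opt into the predicated form
// when the local actually lives in a mask register.
insScalableOpts sopt = INS_SCALABLE_OPTS_NONE;
if (varTypeUsesMaskReg(targetType))
{
    sopt = INS_SCALABLE_OPTS_PREDICATED; // hypothetical "predicated" flag
}
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0, sopt);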

Contributor Author:
That's because the emit function is inconveniently the wrong way around. I was going to fix the emit function up in this PR, but once register allocation is working we can get rid of the enum and all of these checks.

Given the register allocation work is going to take some time, then maybe I should fix up the emit code in this PR.

Member:

Sounds good to me. Just wanted to get clarification, as it did seem backwards.

Contributor Author:

That's because the emit function is inconveniently the wrong way around

Switched this around. It no longer matches some of the other emit_R_R_etc functions, but that's ok because it'll all vanish eventually.

else
{
emit->emitIns_S_R(ins, attr, dataReg, varNum, /* offset */ 0);
}
}
else // store into register (i.e move into register)
{
129 changes: 103 additions & 26 deletions src/coreclr/jit/emitarm64.cpp
@@ -17311,13 +17311,19 @@ void emitter::emitIns_S(instruction ins, emitAttr attr, int varx, int offs)
*
* Add an instruction referencing a register and a stack-based local variable.
*/
void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int varx, int offs)
void emitter::emitIns_R_S(instruction ins,
emitAttr attr,
regNumber reg1,
int varx,
int offs,
insScalableOpts sopt /* = INS_SCALABLE_OPTS_NONE */)
{
emitAttr size = EA_SIZE(attr);
insFormat fmt = IF_NONE;
int disp = 0;
unsigned scale = 0;
bool isLdrStr = false;
emitAttr size = EA_SIZE(attr);
insFormat fmt = IF_NONE;
int disp = 0;
unsigned scale = 0;
bool isLdrStr = false;
bool isScalable = false;

assert(offs >= 0);

@@ -17353,16 +17359,42 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va
scale = 0;
break;

case INS_sve_ldr:
assert(isVectorRegister(reg1) || isPredicateRegister(reg1));
isScalable = true;

// TODO-SVE: This should probably be set earlier in the caller
size = EA_SCALABLE;
attr = size;

// TODO-SVE: Use register number instead of enum
if (sopt == INS_SCALABLE_OPTS_UNPREDICATED)
{
fmt = IF_SVE_IE_2A;
// TODO-SVE: Don't assume 128bit vectors
scale = NaturalScale_helper(EA_16BYTE);
}
else
{
assert(insScalableOptsNone(sopt));
fmt = IF_SVE_ID_2A;
// TODO-SVE: Don't assume 128bit vectors
// Predicate size is vector length / 8
scale = NaturalScale_helper(EA_2BYTE);
}
break;

default:
NYI("emitIns_R_S"); // FP locals?
return;

} // end switch (ins)

/* Figure out the variable's frame position */
ssize_t imm;
int base;
bool FPbased;
ssize_t imm;
int base;
bool FPbased;
insFormat scalarfmt = fmt;

base = emitComp->lvaFrameAddress(varx, &FPbased);
disp = base + offs;
@@ -17387,13 +17419,13 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va

if (imm <= 0x0fff)
{
fmt = IF_DI_2A; // add reg1,reg2,#disp
scalarfmt = IF_DI_2A; // add reg1,reg2,#disp
}
else
{
regNumber rsvdReg = codeGen->rsGetRsvdReg();
codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm);
fmt = IF_DR_3A; // add reg1,reg2,rsvdReg
scalarfmt = IF_DR_3A; // add reg1,reg2,rsvdReg
}
}
else
@@ -17402,13 +17434,13 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va
imm = disp;
if (imm == 0)
{
fmt = IF_LS_2A;
scalarfmt = IF_LS_2A;
}
else if ((imm < 0) || ((imm & mask) != 0))
{
if ((imm >= -256) && (imm <= 255))
{
fmt = IF_LS_2C;
scalarfmt = IF_LS_2C;
}
else
{
@@ -17417,11 +17449,13 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va
}
else if (imm > 0)
{
// TODO: We should be able to scale values <0 for all variants.

if (((imm & mask) == 0) && ((imm >> scale) < 0x1000))
{
imm >>= scale; // The immediate is scaled by the size of the ld/st

fmt = IF_LS_2B;
scalarfmt = IF_LS_2B;
}
else
{
@@ -17433,10 +17467,15 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va
{
regNumber rsvdReg = codeGen->rsGetRsvdReg();
codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm);
fmt = IF_LS_3A;
scalarfmt = IF_LS_3A;
}
}

// Set the format based on the immediate encoding
if (!isScalable)
{
fmt = scalarfmt;
}
assert(fmt != IF_NONE);

// Try to optimize a load/store with an alternative instruction.
@@ -17564,7 +17603,12 @@ void emitter::emitIns_R_R_S_S(
*
* Add an instruction referencing a stack-based local variable and a register
*/
void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int varx, int offs)
void emitter::emitIns_S_R(instruction ins,
emitAttr attr,
regNumber reg1,
int varx,
int offs,
insScalableOpts sopt /* = INS_SCALABLE_OPTS_NONE */)
{
assert(offs >= 0);
emitAttr size = EA_SIZE(attr);
Expand All @@ -17573,6 +17617,7 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
unsigned scale = 0;
bool isVectorStore = false;
bool isStr = false;
bool isScalable = false;

// TODO-ARM64-CQ: use unscaled loads?
/* Figure out the encoding format of the instruction */
@@ -17604,6 +17649,31 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
isStr = true;
break;

case INS_sve_str:
assert(isVectorRegister(reg1) || isPredicateRegister(reg1));
isScalable = true;

// TODO-SVE: This should probably be set earlier in the caller
size = EA_SCALABLE;
attr = size;

// TODO-SVE: Use register number instead of enum
if (sopt == INS_SCALABLE_OPTS_UNPREDICATED)
{
fmt = IF_SVE_JH_2A;
// TODO-SVE: Don't assume 128bit vectors
scale = NaturalScale_helper(EA_16BYTE);
}
else
{
assert(insScalableOptsNone(sopt));
fmt = IF_SVE_JG_2A;
// TODO-SVE: Don't assume 128bit vectors
// Predicate size is vector length / 8
scale = NaturalScale_helper(EA_2BYTE);
}
break;

default:
NYI("emitIns_S_R"); // FP locals?
return;
@@ -17617,7 +17687,7 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
base = emitComp->lvaFrameAddress(varx, &FPbased);
disp = base + offs;
assert(scale >= 0);
if (isVectorStore)
if (isVectorStore || isScalable)
{
assert(scale <= 4);
}
@@ -17630,18 +17700,19 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
regNumber reg2 = FPbased ? REG_FPBASE : REG_SPBASE;
reg2 = encodingSPtoZR(reg2);

bool useRegForImm = false;
ssize_t imm = disp;
ssize_t mask = (1 << scale) - 1; // the mask of low bits that must be zero to encode the immediate
bool useRegForImm = false;
ssize_t imm = disp;
ssize_t mask = (1 << scale) - 1; // the mask of low bits that must be zero to encode the immediate
insFormat scalarfmt = fmt;
if (imm == 0)
{
fmt = IF_LS_2A;
scalarfmt = IF_LS_2A;
}
else if ((imm < 0) || ((imm & mask) != 0))
{
if ((imm >= -256) && (imm <= 255))
if (isValidSimm9(imm))
{
fmt = IF_LS_2C;
scalarfmt = IF_LS_2C;
}
else
{
@@ -17650,11 +17721,12 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
}
else if (imm > 0)
{
// TODO: We should be able to scale values <0 for all variants.

if (((imm & mask) == 0) && ((imm >> scale) < 0x1000))
{
imm >>= scale; // The immediate is scaled by the size of the ld/st

fmt = IF_LS_2B;
scalarfmt = IF_LS_2B;
}
else
{
@@ -17668,9 +17740,14 @@ void emitter::emitIns_S_R(instruction ins, emitAttr attr, regNumber reg1, int va
// It is instead implicit when idSetIsLclVar() is set, with this encoding format.
regNumber rsvdReg = codeGen->rsGetRsvdReg();
codeGen->instGen_Set_Reg_To_Imm(EA_PTRSIZE, rsvdReg, imm);
fmt = IF_LS_3A;
scalarfmt = IF_LS_3A;
}

// Set the format based on the immediate encoding
if (!isScalable)
{
fmt = scalarfmt;
}
assert(fmt != IF_NONE);

// Try to optimize a store with an alternative instruction.
6 changes: 4 additions & 2 deletions src/coreclr/jit/emitarm64.h
@@ -1782,7 +1782,8 @@ void emitIns_C(instruction ins, emitAttr attr, CORINFO_FIELD_HANDLE fdlHnd, int

void emitIns_S(instruction ins, emitAttr attr, int varx, int offs);

void emitIns_S_R(instruction ins, emitAttr attr, regNumber ireg, int varx, int offs);
void emitIns_S_R(
instruction ins, emitAttr attr, regNumber ireg, int varx, int offs, insScalableOpts sopt = INS_SCALABLE_OPTS_NONE);

void emitIns_S_S_R_R(
instruction ins, emitAttr attr, emitAttr attr2, regNumber ireg, regNumber ireg2, int varx, int offs);
@@ -1800,7 +1801,8 @@ void emitIns_R_R_R_I_LdStPair(instruction ins,
int offs2 = -1 DEBUG_ARG(unsigned var1RefsOffs = BAD_IL_OFFSET)
DEBUG_ARG(unsigned var2RefsOffs = BAD_IL_OFFSET));

void emitIns_R_S(instruction ins, emitAttr attr, regNumber ireg, int varx, int offs);
void emitIns_R_S(
instruction ins, emitAttr attr, regNumber ireg, int varx, int offs, insScalableOpts sopt = INS_SCALABLE_OPTS_NONE);

void emitIns_R_R_S_S(
instruction ins, emitAttr attr, emitAttr attr2, regNumber ireg, regNumber ireg2, int varx, int offs);
13 changes: 8 additions & 5 deletions src/coreclr/jit/hwintrinsic.cpp
@@ -778,7 +778,7 @@ GenTree* Compiler::getArgForHWIntrinsic(var_types argType,
{
arg = impSIMDPopStack();
}
assert(varTypeIsSIMD(arg));
assert(varTypeIsSIMDOrMask(arg));
}
else
{
@@ -1593,17 +1593,20 @@ GenTree* Compiler::impHWIntrinsic(NamedIntrinsic intrinsic,
#if defined(TARGET_ARM64)
if (HWIntrinsicInfo::IsMaskedOperation(intrinsic))
{
// Op1 input is a vector. HWInstrinsic requires a mask, so convert to a mask.
assert(numArgs > 0);
GenTree* op1 = retNode->AsHWIntrinsic()->Op(1);
op1 = convertHWIntrinsicToMask(retType, op1, simdBaseJitType, simdSize);
retNode->AsHWIntrinsic()->Op(1) = op1;
GenTree* op1 = retNode->AsHWIntrinsic()->Op(1);
if (!varTypeIsMask(op1))
{
// Op1 input is a vector. HWInstrinsic requires a mask.
retNode->AsHWIntrinsic()->Op(1) = convertHWIntrinsicToMask(retType, op1, simdBaseJitType, simdSize);
}
}

if (retType != nodeRetType)
{
// HWInstrinsic returns a mask, but all returns must be vectors, so convert mask to vector.
assert(HWIntrinsicInfo::ReturnsPerElementMask(intrinsic));
assert(nodeRetType == TYP_MASK);
retNode = convertHWIntrinsicFromMask(retNode->AsHWIntrinsic(), retType);
}
#endif // defined(TARGET_ARM64)
6 changes: 5 additions & 1 deletion src/coreclr/jit/hwintrinsicarm64.cpp
@@ -2219,6 +2219,8 @@ GenTree* Compiler::convertHWIntrinsicToMask(var_types type,
CorInfoType simdBaseJitType,
unsigned simdSize)
{
assert(varTypeIsSIMD(node));

// ConvertVectorToMask uses cmpne which requires an embedded mask.
GenTree* embeddedMask = gtNewSimdHWIntrinsicNode(TYP_MASK, NI_Sve_CreateTrueMaskAll, simdBaseJitType, simdSize);
return gtNewSimdHWIntrinsicNode(TYP_MASK, embeddedMask, node, NI_Sve_ConvertVectorToMask, simdBaseJitType,
@@ -2237,7 +2239,9 @@
//
GenTree* Compiler::convertHWIntrinsicFromMask(GenTreeHWIntrinsic* node, var_types type)
{
assert(node->TypeGet() == TYP_MASK);
assert(varTypeIsMask(node));
assert(varTypeIsSIMD(type));

return gtNewSimdHWIntrinsicNode(type, node, NI_Sve_ConvertMaskToVector, node->GetSimdBaseJitType(),
node->GetSimdSize());
}
11 changes: 11 additions & 0 deletions src/coreclr/jit/importer.cpp
@@ -6419,6 +6419,17 @@ void Compiler::impImportBlockCode(BasicBlock* block)
impSpillSideEffects(false, CHECK_SPILL_ALL DEBUGARG("Spill before store to pinned local"));
}

#if defined(TARGET_ARM64) && defined(FEATURE_MASKED_HW_INTRINSICS)
// Masks must be converted to vectors before being stored to memory.
// But, for local stores we can optimise away the conversion
if (op1->OperIsHWIntrinsic() && op1->AsHWIntrinsic()->GetHWIntrinsicId() == NI_Sve_ConvertMaskToVector)
{
op1 = op1->AsHWIntrinsic()->Op(1);
lvaTable[lclNum].lvType = TYP_MASK;

Member:
I see we're doing this retyping here. But I don't see where we're doing any fixups to ensure that something which reads the TYP_MASK local but expects a TYP_SIMD will get the ConvertMaskToVector inserted back.

Imagine for example, something like:

Vector<int> mask = Vector.GreaterThan(x, y);
return mask + Vector.Create(5);

Where Vector.GreaterThan will produce a TYP_MASK, but the latter + consumes it as a vector. -- Noting that this is an abnormal pattern, but something we still need to account for and ensure works correctly.

Member:

I might expect such logic, which would insert a general ConvertMaskToVector, to exist as part of impSimdPopStack and/or as part of the general import helpers we have in hwintrinsic.cpp.
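
For illustration only, a rough sketch of what that could look like at the point where the importer pops a SIMD operand (the surrounding context, and how simdBaseJitType/simdSize would be obtained, are assumptions rather than code from this PR):

GenTree* op = impSIMDPopStack();
if (varTypeIsMask(op))
{
    // A consumer expecting TYP_SIMD received a mask (e.g. a retyped local),
    // so wrap it back through ConvertMaskToVector before it is used.
    op = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op, NI_Sve_ConvertMaskToVector,
                                  simdBaseJitType, simdSize);
}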

lclTyp = lvaGetActualType(lclNum);
}

Member:
Why do they need to be converted before being stored to memory?

Shouldn't this be entirely dependent on the target type, just with a specialization for locals since we want to allow efficient consumption in the typical case?

Contributor Author:

Is this just a suggestion to change the comments, or a code change too?

Member:

Just a change to the comment.

I think we only need to clarify that we're optimizing masks stored to locals to allow better consumption in the expected typical case and that we have the handling in place to ensure that consumers which actually need a vector get the ConvertMaskToVector inserted back in.


Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored then it would require a ConvertVectorToMask to be inserted but it would also be a lossy conversion and therefore unsafe.

So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user defined code, so that we only do this optimization when all values being stored are masks.

Contributor Author:

Noting, however, that this could be an incorrect optimization in some cases. For example, if it's a user-defined local where actual vectors are also stored then it would require a ConvertVectorToMask to be inserted but it would also be a lossy conversion and therefore unsafe.

So I'm not entirely certain this is the "right" place to be doing this either. We might need to actually do this in a later phase where we can observe all uses of a local from the perspective of user defined code, so that we only do this optimization when all values being stored are masks.

I'm not sure this is the right place either. Originally I wanted to avoid creating the conversion in the first place, but realised it has to be created and then removed after/during creation of the local. That's how I ended up putting it in the importer.

I can't see an obvious later phase this would be done in. Morph? Or maybe during lowering? Anything that already parses local vars would be better?

Member:

I believe the correct location for this would be lclmorph, which is the same place we do sequencing, mark address exposed locals, and even combine field stores to SIMD.

However, @EgorBo or @jakobbotsch may have a better idea on where the code should go.

The consideration in general is really just that we need to find non-address exposed TYP_SIMD locals where all stores are ConvertMaskToVector so we can replace it with a TYP_MASK local instead. There are of course other optimizations that could be done, including splitting a local into two if some stores are masks and some are vectors, but those will be less common than the first.

This work needs to happen after import since that's the only code that would be using pre-existing locals. Any locals introduced by CSE or other phases involving a mask will already be TYP_MASK themselves.

Member:
Indeed this kind of brute force changing of a local's type is not safe. It will break all sorts of IR invariants we want to be able to rely on if you don't also do a full pass to fix up other uses of the local.

It seems like this kind of optimization should be its own separate pass after local morph when we know whether or not it is address exposed. We have linked locals at that point, so finding the occurrences is relatively efficient. You can leave some breadcrumb around to only run the pass when there are opportunities.

Contributor Author:

Indeed this kind of brute force changing of a local's type is not safe.

I'll add a new pass then. For now I've removed the importer changes.

Technically, this PR could be merged as is. It won't make any jit difference by itself, but it's quite a lot of code that is a blocker for other hw intrinsic work I'm doing (the embedded masks).

Contributor Author:

We have linked locals at that point

Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

Member:

Doing it in a separate PR sounds good to me. Presumably it needs some heuristics to figure out if it's profitable to make the replacements as well.

Can you expand a little on what you mean here? I want to make sure I'm parsing the right data in the pass.

See Statement::LocalsTreeList. It lets you quickly check whether a statement contains a local you are interested in.
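
A rough sketch of the kind of pass being discussed, running after local morph (every name and detail below is an assumption for illustration, not code from this PR):

// Retype a local to TYP_MASK when it is not address exposed and every store
// to it is a ConvertMaskToVector, so the conversions can be dropped at the
// stores and re-inserted only at uses that still need a vector.
void Compiler::optRetypeMaskLocals()
{
    for (unsigned lclNum = 0; lclNum < lvaCount; lclNum++)
    {
        LclVarDsc* dsc = lvaGetDesc(lclNum);
        if (!varTypeIsSIMD(dsc->TypeGet()) || dsc->IsAddressExposed())
        {
            continue;
        }

        // Walk each statement's LocalsTreeList to find occurrences of lclNum;
        // give up unless every store's value is a ConvertMaskToVector node.
        bool allStoresAreMasks = false; /* ... scan occurrences ... */
        if (!allStoresAreMasks)
        {
            continue;
        }

        dsc->lvType = TYP_MASK;
        // Drop ConvertMaskToVector at each store; insert ConvertMaskToVector
        // at each use that still expects a vector.
    }
}
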

#endif // TARGET_ARM64 && FEATURE_MASKED_HW_INTRINSICS

op1 = gtNewStoreLclVarNode(lclNum, op1);

// TODO-ASG: delete this zero-diff quirk. Requires some forward substitution work.