Handle > 64 registers + predicate registers for Arm64 #98258

Open
wants to merge 212 commits into
base: main

kunalspathak
Member

@kunalspathak kunalspathak commented Feb 10, 2024

Review guide

Predicate Registers

  • registerarm64.h adds the 16 new predicate registers, numbered 64 through 79. Their masks range from 0x1 through 0x8000. On arm64, we now need 7 bits to represent the register number, hence REGNUM_BITS has changed from 6 to 7 bits (targetarm64.h).
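
For illustration only, the 6 -> 7 bit change follows from the register count: 64 register numbers fit in 6 bits, while 80 (numbers 0 through 79) do not. A minimal sketch of that arithmetic, where `bitsNeeded` and `predicateMask` are hypothetical helpers and not JIT functions:

```cpp
#include <cstdint>

// Hypothetical helper (not part of the JIT): number of bits needed to
// encode the register numbers 0 .. count-1.
constexpr unsigned bitsNeeded(unsigned count)
{
    unsigned bits = 0;
    while ((1u << bits) < count)
    {
        bits++;
    }
    return bits;
}

// Before: register numbers 0 through 63.
static_assert(bitsNeeded(64) == 6, "64 register numbers fit in 6 bits");
// After: 16 predicate registers numbered 64 through 79.
static_assert(bitsNeeded(80) == 7, "80 register numbers need 7 bits");

// Assumed mask of the i-th predicate register within its own field:
// bit 0 for the first predicate register, bit 15 for the last.
constexpr uint32_t predicateMask(unsigned i)
{
    return 1u << i;
}
```

The `constexpr` loop needs C++14 or later.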

AllRegsMask

  • A new struct is introduced to represent the register mask. If HAS_MORE_THAN_64_REGISTERS is not defined (all non-arm64 platforms), it contains a single 64-bit field. For arm64, where HAS_MORE_THAN_64_REGISTERS is defined, the struct contains an extra 4-byte field to represent the predicate registers. The definition of this struct is in target.h and is very similar to how I mentioned it here.
typedef struct _regMaskAll
{
private:
#ifdef HAS_MORE_THAN_64_REGISTERS
  union
  {
      RegBitSet32 _registers[3];
      struct
      {
          union
          {
              // Represents combined registers bitset including gpr/float
              RegBitSet64 _combinedRegisters;
              struct
              {
                  RegBitSet32 _gprRegs;
                  RegBitSet32 _floatRegs;
              };
          };
          RegBitSet32 _predicateRegs;
      };
  };
#else
  // Represents combined registers bitset including gpr/float and on some platforms
  // mask or predicate registers
  RegBitSet64 _combinedRegisters;
#endif

A few things are worth explaining about this struct:

  • As seen, for non-arm64 platforms, where HAS_MORE_THAN_64_REGISTERS is not defined, the struct continues to operate on a single 64-bit field, so the TP of these platforms is not impacted.
  • For arm64, there are essentially three 4-byte fields, one each for gpr, float and predicate registers, in that order. They are defined under a union so that AllRegsMask can access the relevant field without any branches or conditions. For example, if a float register d2 has to be added to the mask, I did not want to have something like:
if (regNum < 32) { _gprRegs = ... ;}
else if (regNum < 64) {_floatRegs = ... ;}
else { _predicateRegs = ... ;}

Instead, a mapping is added in all register*.h files to map all gpr registers -> 0, float registers -> 1 and predicate registers -> 2. Having that, I could rewrite the above code as:

_registers[regIndexForRegNum(regNum)] = ...;
  • Using predicate registers from the mask is uncommon; mostly, the consumers of AllRegsMask are interested in gpr/float registers. Hence a _combinedRegisters field is added to the union to give easy access to them.

  • This design also provides an easy way to retrieve just gprRegs(), floatRegs() or predicateRegs().
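
For illustration only, here is a minimal sketch of the three-field layout and the index mapping described above. It is not the code in target.h: it uses a plain array with accessors instead of anonymous unions, assumes the numbering 0..31 gpr, 32..63 float, 64..79 predicate, and computes the index arithmetically where the PR adds a mapping in the register*.h files. `AllRegsMaskSketch` and `REG_COUNT_SKETCH` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed register numbering: 0..31 gpr, 32..63 float, 64..79 predicate.
constexpr unsigned REG_COUNT_SKETCH = 80;

// Hypothetical stand-in for the mapping in the register*.h files
// (gpr -> 0, float -> 1, predicate -> 2).
inline unsigned regIndexForRegNum(unsigned regNum)
{
    assert(regNum < REG_COUNT_SKETCH);
    return regNum >> 5;
}

// Portable variant of the layout: three 32-bit fields in one array.
struct AllRegsMaskSketch
{
    uint32_t registers[3] = {0, 0, 0};

    // Branch-free: the index selects the field, the low 5 bits select the bit.
    void addReg(unsigned regNum)
    {
        registers[regIndexForRegNum(regNum)] |= 1u << (regNum & 31u);
    }

    uint32_t gprRegs() const { return registers[0]; }
    uint32_t floatRegs() const { return registers[1]; }
    uint32_t predicateRegs() const { return registers[2]; }

    // gpr in the low half and float in the high half, playing the role of
    // the _combinedRegisters view.
    uint64_t combinedRegs() const
    {
        return (static_cast<uint64_t>(registers[1]) << 32) | registers[0];
    }
};
```

Adding x3, d2 (register number 34) and the first predicate register (64) touches `registers[0]`, `registers[1]` and `registers[2]` respectively, with no condition on the register class.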

Until now, the manipulation of registers (adding/removing registers) in the mask was trivial and was done using bit manipulation. The AllRegsMask struct instead offers new methods to do such manipulation.

  • Firstly, various operators are implemented to give seamless manipulation of the underlying fields throughout the code base, without having to make changes at all those places.
  • Most methods that directly take a regNumber as input already know which field of the AllRegsMask struct they operate on. Methods that take a set of registers also need to know the register class of that mask, to determine which field of the AllRegsMask to update. The definitions of these methods are in compiler.hpp.
  • Two methods worth mentioning are encodeForIndex() and decodeForIndex(). Imagine adding the mask of a register (1 << regNumber) to the relevant field. Here is how we could write it:
mask = genRegMask(regNumber);
if (regNumber < 64) { _combinedRegisters |= mask; }
else { _predicateRegs |= mask; }

Alternatively, since both the gpr and predicate register masks start at bit 0 and hence can be added directly to _registers[0] or _registers[2], we could rewrite it as follows:

index = regIndexForRegNum(regNumber);
mask = genRegMask(regNumber);
if (regNumber < 32 || regNumber > 63) { _registers[index] |= mask; }
else { _combinedRegisters |= mask; /* float register mask */ }

Either way, we need a branch to add the mask to the relevant field. To make this code branch-free, encodeForIndex() right-shifts a float register mask by 32 bits so that it fits in 4 bytes, and decodeForIndex() left-shifts it by 32 bits when returning the float registers. With that, we can just do something like this:

index = regIndexForRegNum(regNumber);
mask = genRegMask(regNumber);
_registers[index] |= encodeForIndex(index, mask);
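
For illustration only, a self-contained sketch of the encode/decode idea. The shift computation `(index & 1) << 5` is my assumption about one way to make it branch-free (0 for gpr and predicate, 32 for float), and `genRegMask` here is a simplified stand-in in which predicate masks restart at bit 0. The actual definitions in compiler.hpp may differ.

```cpp
#include <cstdint>

// Assumed numbering: 0..31 gpr, 32..63 float, 64..79 predicate.
inline unsigned regIndexForRegNum(unsigned regNum)
{
    return regNum >> 5; // gpr -> 0, float -> 1, predicate -> 2
}

// Simplified stand-in: gpr/float masks use bits 0..63 of a 64-bit value,
// predicate masks restart at bit 0.
inline uint64_t genRegMask(unsigned regNum)
{
    return 1ull << (regNum & 63u);
}

// Shift is 32 only for the float field (index 1), 0 otherwise.
inline uint32_t encodeForIndex(unsigned index, uint64_t mask)
{
    unsigned shift = (index & 1u) << 5;
    return static_cast<uint32_t>(mask >> shift);
}

inline uint64_t decodeForIndex(unsigned index, uint32_t bits)
{
    unsigned shift = (index & 1u) << 5;
    return static_cast<uint64_t>(bits) << shift;
}

struct MaskSketch
{
    uint32_t registers[3] = {0, 0, 0};

    // No branch on the register class: every class takes the same path.
    void addReg(unsigned regNum)
    {
        unsigned index = regIndexForRegNum(regNum);
        registers[index] |= encodeForIndex(index, genRegMask(regNum));
    }

    // Float registers are handed back in their original bit positions.
    uint64_t floatRegs() const
    {
        return decodeForIndex(1, registers[1]);
    }
};
```

With this, adding d2 (register number 34) stores bit 2 in the float field, and `floatRegs()` returns it as bit 34 again.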

In LSRA, until now, we used a single 64-bit primitive to represent a register set. Whenever we wanted to extract each register corresponding to an ON bit in the set, we would iterate through it, return the next ON bit and toggle it. The following new methods are added to handle that aspect:

  • genFirstRegNumFromMaskAndToggle() : With AllRegsMask, once we run out of gpr/float field, we need to iterate over the _predicateRegs field.
  • genRegNumFromMask() : This method now also takes the type of register it is expected to extract and accordingly adds 64 to the ON bit extracted from the mask.
  • genFirstRegNumFromMask() : This too first scans the gpr/float fields to see if anything is set and if not, will look into _predicateRegs field.
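
For illustration only, a sketch of the "take the lowest ON bit and toggle it" iteration once a predicate field exists. The struct and function names are hypothetical and the bit scan is a portable loop rather than whatever intrinsic the JIT uses; the point is only the fallback from the gpr/float bits to the predicate bits, and the +64 adjustment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical two-field mask: 64 gpr/float bits plus predicate bits.
struct MaskSketch
{
    uint64_t combined  = 0; // registers 0..63
    uint32_t predicate = 0; // registers 64..79, starting at bit 0

    bool isEmpty() const { return (combined == 0) && (predicate == 0); }
};

// Index of the lowest set bit; the caller guarantees v != 0.
inline unsigned lowestSetBit(uint64_t v)
{
    unsigned n = 0;
    while ((v & 1u) == 0)
    {
        v >>= 1;
        n++;
    }
    return n;
}

// Returns the lowest register number present in the mask and clears it.
inline unsigned firstRegNumFromMaskAndToggle(MaskSketch& mask)
{
    assert(!mask.isEmpty());
    if (mask.combined != 0)
    {
        unsigned reg = lowestSetBit(mask.combined);
        mask.combined ^= (1ull << reg);
        return reg;
    }
    // gpr/float bits are exhausted, so continue with the predicate field.
    unsigned bit = lowestSetBit(mask.predicate);
    mask.predicate ^= (1u << bit);
    return 64 + bit;
}

// Drains a copy of the mask into a list of register numbers.
inline std::vector<unsigned> drainRegs(MaskSketch mask)
{
    std::vector<unsigned> regs;
    while (!mask.isEmpty())
    {
        regs.push_back(firstRegNumFromMaskAndToggle(mask));
    }
    return regs;
}
```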

Lsra

  • lsrabuild.cpp
    • General renaming of types; mostly things are renamed from regMaskTP to regMaskOnlyOne.
    • Various methods (e.g. addRefsForPhysRegMask) that take a mask containing registers of different register classes (gpr/float/predicate) now take AllRegsMask. They either pass the mask through to other methods, or iterate over all the registers present in the mask (and toggle them) using the newly added genFirstRegNumFromMaskAndToggle().
    • Use AllRegsMask to save the killMask.
    • Certain methods that take regMaskOnlyOne as a parameter are passed an extra type parameter to identify the register class the mask represents, so that the right field of AllRegsMask is updated.
    • BuildDef* methods: Added a few new methods to group some of the logic around building definitions for calls, BuildCallDefs()/BuildCallDefsWithKills(), and one that just builds RefPositions for kills, BuildKills(). Most of them take AllRegsMask as the killMask.
  • lsra.h
    • The RegisterType field is removed from RegRecord and Interval and moved into the parent class Referenceable. This makes it easy to query the type to determine the register class for a given RefPosition. Without this, we would first have to check whether the RefPosition represents a virtual register (Interval) or a physical register (RegRecord).
    • General renaming of regMaskTP to regMaskGpr or regMaskFloat.
    • A new overload has been added for certain methods like freeRegisters(), verifyFreeRegisters(), updateDeadCandidatesAtBlockStart() and inActivateRegisters(). The original method continues to operate on the 64-bit mask (regMaskTP), and the overloaded method operates on AllRegsMask. The only difference between the two is how the bits are iterated and toggled inside the method.
    • The type of certain fields like m_AvailableRegs, placedArgRegs, registersToDump, m_RegistersWithConstants, fixedRegs, regsBusyUntilKill, regsInUseThisLocation, regsInUseNextLocation is changed from regMaskTP to AllRegsMask. The relevant methods that read/write these fields are updated to take the registerType as parameter. Based on that, it will add/remove the given register mask from the AllRegsMask.
  • lsra.cpp
    • Most of the methods that touch fields like m_AvailableRegs, etc. now have to use the methods from AllRegsMask to add/remove/update the register/register mask. For that, we need to pass the registerType to those methods.
    • Methods that previously defined 64-bit register mask variables like regsToFree, delayRegsToFree, regsToMakeInactive, delayRegsToMakeInactive, copyRegsToFree, targetRegsToDo, targetRegsReady, targetRegsFromStack, etc. to track the registers during allocation/resolution now use AllRegsMask, and the way they add/remove registers changes accordingly. They now use methods from AllRegsMask and sometimes need to pass registerType to identify the register class of the mask being added/removed.
  • lsraxarch.cpp
  • lsraarmarch.cpp
  • lsraarm64.cpp
    • General renaming of regMaskTP type.
    • Uses the new methods created for building the killMask.

Codegen

  • codegenarmarch.cpp
  • codegencommon.cpp
  • codegenxarch.cpp
    • Registers are added to and retrieved from regSet using new methods on regSet, based on the type.
    • General renaming of types
    • Use AllRegsMask instead to save the killMask
  • codegenarm64.cpp
    • gen(Save|Restore)CalleeSavedRegisterGroup: the type of the register mask is renamed from regMaskTP to regMaskOnlyOne, and it takes the type as an additional parameter so it can pass it along to other methods like genBuildRegPairsStack, etc.
    • gen(Save|Restore)CalleeSavedRegistersHelp() now takes AllRegsMask instead of regMaskTP, because it also has to save/restore predicate registers. This method passes the individual register class masks to gen(Save|Restore)CalleeSavedRegisterGroup. Callers of this method basically extract the AllRegsMask from the regSet field to send all the callee saved registers. I might simplify some of this to erase some of the changes.

Compiler

  • compiler.cpp
    • Most of the RBM_* masks that are defined today in the various target*.h files need a corresponding AllRegsMask_* equivalent. Most of them rely on the availability of float registers, which for AVX512 is not known until we initialize the compiler object with CPU features. Hence these fields are defined after that initialization, so they contain the accurate active register set, especially the float registers, needed for compiling the method.

Misc

Most of the other changes mentioned below are minimal and are needed because the regMaskTP is renamed to one of regMaskGpr, regMaskFloat, etc.

  • The following files just have changes to the type names

    • abi.cpp
    • abi.h
    • block.h
    • codegen.h (along with using AllRegsMask in FuncletInfo)
    • codegeninterface.h
    • codegenarm.cpp
    • codegenlinear.cpp
    • emit.cpp (along with some new methods to display the AllRegsMask)
    • emit.h (along with ID_EXTRA_BITFIELD_BITS increased from 21 to 23 bits because we have two REGNUM_BITS in instrDesc)
    • emitarm.cpp
    • emitarm.h
    • emitarm64.cpp
    • emitarm64.h
    • emitxarch.cpp
    • emitxarch.h
    • emitinl.h
    • emitpub.h
    • gcinfo.cpp
    • gcencode.cpp
    • instr.cpp
    • jitgcinfo.h
    • lclvars.cpp
    • morph.cpp
    • optimizer.cpp
    • regalloc.cpp
    • registerargconvention.cpp
    • registerargconvention.h
    • targetamd64.cpp
    • targetarm.cpp
    • targetarm64.cpp
    • targetx86.cpp
    • typelist.h
    • unwindarmarch.cpp
  • The following files add a new parameter to categorize the register class

    • register.h
    • registerarm.h
    • registerriscv64.h
    • registerloongarch64.h
    • emitloongarch64.cpp
    • emitriscv64.cpp
  • Added REG_FP_COUNT, REG_MASK_COUNT and RBM_ALLGPR

    • targetamd64.h
    • targetarm.h
    • targetx86.h
  • We have a RegSet class that tracks the registers touched and is used during codegen to return unused registers or spill registers. To track all the different types of registers, I converted rsModifiedRegsMask from regMaskTP to AllRegsMask. All the other changed methods save a particular register or set of registers (depending on the type) to rsModifiedRegsMask and return the register set for the given type.

    • regset.h
    • regset.cpp
  • Some of the methods in GenTree* now need to return AllRegsMask because the ABI might require it (I am certain that this can be just float/gpr, but kept it for now):

    • gentree.h
    • gentree.cpp
  • Refactoring

    • lsraarm.cpp
  • Handling of mask registers

    • unwind.cpp
Old TODO

This is just a prototype to see if the asserts added are hit or not.

TODO:

  • Make superpmi-replay pass
  • Make superpmi-asmdiff to be zero diffs
  • Make the regMask* types that this PR introduces 4 bytes instead of 8 bytes. This will happen in a follow-up PR.
  • Report the memory savings from changing 8 bytes -> 4 bytes NA
    • Convert all the RBM_* to represent 4 bytes i.e. VMASK will change to be no-op
    • RBM_ALLFLOAT/RBM_ALLDOUBLE might not be relevant anymore. So will need to remove/tweak code that relies on it.
  • Improve TP
    • condition free alteration of AllRegsMask() depending on if method has just int, int/float, int/float/predicate
    • Make non-predicate registers scenarios almost zero TP difference. Edit: Done
    • See if the changes in this PR except AllRegsMask() can be merged without TP impact. Edit: It should go all together.
    • Would converting AllRegsMask() to an array improve things? We can have a prepopulated map from each register to the index of AllRegsMask() that the register should touch when it is added/removed/tracked
    • We can probably have AllRegsMask that contains a single 64-bit mask field and all the operations on it would be done either on low 32-bits (gpr) or high 32-bits. There can be a struct AllRegsMaskWithPredicate that inherits from AllRegsMask and just has a 32-bits field for predicate registers. Places that mostly deal with gpr/float can use AllRegsMask. Edit: We changed the design to use AllRegsMask instead.
  • Extend support for risc/loongarch
  • Make all jitstress* pipelines pass
  • Code cleanup
    • Remove some redundant asserts to make sure that checked performance is not too much impacted. Most likely when we convert the regMaskGpr, etc. to 4 bytes, all of the asserts will be gone.
  • Replace HAS_PREDICATE_REGS with FEATURE_MASKED_HW_INTRINSICS

Fixes: #99658

@kunalspathak
Member Author

This is ready for review. Please go through it and let me know what you think. It looks like we are taking a one-time cost of around 10% regression for MinOpts and around 4% regression for FullOpts. The regression is currently just on the arm64 platform; the non-arm64 platforms are mostly untouched.

I have not yet updated the names of following typedefs and would like some suggestions. Here is my proposal:

  • regMaskGpr : Should be called GprRegs
  • regMaskFloat: Should be called FloatRegs
  • regMaskPredicate: Should be called PredicateRegs
  • regMaskOnlyOne: Should be called SingleTypeRegs
  • singleRegMask: Should be called SingleReg
  • AllRegsMask: Should be called AllRegs
  • RegBitSet64: Should be called _64Regs
  • RegBitSet32: Should be called _32Regs

@kunalspathak kunalspathak marked this pull request as ready for review April 10, 2024 23:36
@kunalspathak kunalspathak changed the title Predicate registers Handle > 64 registers + predicate registers for Arm64 Apr 10, 2024
@AndyAyersMS
Member

@kunalspathak you should nominate some people specifically for review.

Also seems like you ought to remove the "NO" labels on the PR.

@kunalspathak kunalspathak removed the NO-MERGE and NO-REVIEW labels Apr 12, 2024
@jakobbotsch
Member

I have not yet updated the names of following typedefs and would like some suggestions. Here is my proposal:

  • regMaskGpr : Should be called GprRegs
  • regMaskFloat: Should be called FloatRegs
  • regMaskPredicate: Should be called PredicateRegs
  • regMaskOnlyOne: Should be called SingleTypeRegs
  • singleRegMask: Should be called SingleReg
  • AllRegsMask: Should be called AllRegs
  • RegBitSet64: Should be called _64Regs
  • RegBitSet32: Should be called _32Regs

My two cents: Set in the name seems nice to indicate whenever something is a set. r in "gpr" already stands for register, so GprRegs is a bit redundant. My proposal:

  • GpRegSet
  • FloatRegSet
  • PredicateRegSet
  • SingleTypeRegSet
  • SingleReg (what is the difference between this and regNumber?)
  • AnyTypeRegSet
  • RegSet64, RegSet32 (what is the intended use case for these?)

On a related note I think we shouldn't mix naming conventions for some of the new fields, like AllRegsMask_CALLEE_TRASH_NOGC. Perhaps some new prefix can be used (e.g. instead of RBM_ it could be ALLREGS_, or whatever is appropriate for the final name of the set type that we end up with...)

Comment on lines +2022 to +2216
AllRegsMask_STOP_FOR_GC_TRASH =
AllRegsMask((RBM_INT_CALLEE_TRASH & ~RBM_INTRET), (RBM_FLT_CALLEE_TRASH & ~RBM_FLOATRET), RBM_MSK_CALLEE_TRASH);
AllRegsMask_PROFILER_ENTER_TRASH = AllRegsMask_CALLEE_TRASH;
#endif // UNIX_AMD64_ABI

AllRegsMask_PROFILER_LEAVE_TRASH = AllRegsMask_STOP_FOR_GC_TRASH;
AllRegsMask_PROFILER_TAILCALL_TRASH = AllRegsMask_PROFILER_LEAVE_TRASH;

// The registers trashed by the CORINFO_HELP_INIT_PINVOKE_FRAME helper.
AllRegsMask_INIT_PINVOKE_FRAME_TRASH = AllRegsMask_CALLEE_TRASH;
AllRegsMask_VALIDATE_INDIRECT_CALL_TRASH = GprRegsMask(RBM_VALIDATE_INDIRECT_CALL_TRASH);

#elif defined(TARGET_ARM)

AllRegsMask_CALLEE_TRASH_NOGC = GprRegsMask(RBM_CALLEE_TRASH_NOGC);
AllRegsMask_PROFILER_ENTER_TRASH = AllRegsMask_NONE;

// Registers killed by CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
AllRegsMask_CALLEE_TRASH_WRITEBARRIER = GprRegsMask(RBM_R0 | RBM_R3 | RBM_LR | RBM_DEFAULT_HELPER_CALL_TARGET);

// Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER = AllRegsMask_CALLEE_TRASH_WRITEBARRIER;

// Registers killed by CORINFO_HELP_ASSIGN_BYREF.
AllRegsMask_CALLEE_TRASH_WRITEBARRIER_BYREF =
GprRegsMask(RBM_WRITE_BARRIER_DST_BYREF | RBM_WRITE_BARRIER_SRC_BYREF | RBM_CALLEE_TRASH_NOGC);

// Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_BYREF.
// Note that r0 and r1 are still valid byref pointers after this helper call, despite their value being changed.
AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER_BYREF = AllRegsMask_CALLEE_TRASH_NOGC;
AllRegsMask_PROFILER_RET_SCRATCH = GprRegsMask(RBM_R2);
// While REG_PROFILER_RET_SCRATCH is not trashed by the method, the register allocator must
// consider it killed by the return.
AllRegsMask_PROFILER_LEAVE_TRASH = AllRegsMask_PROFILER_RET_SCRATCH;
AllRegsMask_PROFILER_TAILCALL_TRASH = AllRegsMask_NONE;
// The registers trashed by the CORINFO_HELP_STOP_FOR_GC helper (JIT_RareDisableHelper).
// See vm\arm\amshelpers.asm for more details.
AllRegsMask_STOP_FOR_GC_TRASH =
AllRegsMask((RBM_INT_CALLEE_TRASH & ~(RBM_LNGRET | RBM_R7 | RBM_R8 | RBM_R11)),
(RBM_FLT_CALLEE_TRASH & ~(RBM_DOUBLERET | RBM_F2 | RBM_F3 | RBM_F4 | RBM_F5 | RBM_F6 | RBM_F7)));
// The registers trashed by the CORINFO_HELP_INIT_PINVOKE_FRAME helper.
AllRegsMask_INIT_PINVOKE_FRAME_TRASH =
(AllRegsMask_CALLEE_TRASH | GprRegsMask(RBM_PINVOKE_TCB | RBM_PINVOKE_SCRATCH));

AllRegsMask_VALIDATE_INDIRECT_CALL_TRASH = GprRegsMask(RBM_INT_CALLEE_TRASH);

#elif defined(TARGET_ARM64)

AllRegsMask_CALLEE_TRASH_NOGC = GprRegsMask(RBM_CALLEE_TRASH_NOGC);
AllRegsMask_PROFILER_ENTER_TRASH = AllRegsMask((RBM_INT_CALLEE_TRASH & ~(RBM_ARG_REGS | RBM_ARG_RET_BUFF | RBM_FP)),
(RBM_FLT_CALLEE_TRASH & ~RBM_FLTARG_REGS), RBM_MSK_CALLEE_TRASH);
// Registers killed by CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
AllRegsMask_CALLEE_TRASH_WRITEBARRIER = GprRegsMask(RBM_R14 | RBM_CALLEE_TRASH_NOGC);

// Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_REF and CORINFO_HELP_CHECKED_ASSIGN_REF.
AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER = AllRegsMask_CALLEE_TRASH_NOGC;

// Registers killed by CORINFO_HELP_ASSIGN_BYREF.
AllRegsMask_CALLEE_TRASH_WRITEBARRIER_BYREF =
GprRegsMask(RBM_WRITE_BARRIER_DST_BYREF | RBM_WRITE_BARRIER_SRC_BYREF | RBM_CALLEE_TRASH_NOGC);

// Registers no longer containing GC pointers after CORINFO_HELP_ASSIGN_BYREF.
// Note that x13 and x14 are still valid byref pointers after this helper call, despite their value being changed.
AllRegsMask_CALLEE_GCTRASH_WRITEBARRIER_BYREF = AllRegsMask_CALLEE_TRASH_NOGC;

AllRegsMask_PROFILER_LEAVE_TRASH = AllRegsMask_PROFILER_ENTER_TRASH;
AllRegsMask_PROFILER_TAILCALL_TRASH = AllRegsMask_PROFILER_ENTER_TRASH;

// The registers trashed by the CORINFO_HELP_STOP_FOR_GC helper
AllRegsMask_STOP_FOR_GC_TRASH = AllRegsMask_CALLEE_TRASH;
// The registers trashed by the CORINFO_HELP_INIT_PINVOKE_FRAME helper.
AllRegsMask_INIT_PINVOKE_FRAME_TRASH = AllRegsMask_CALLEE_TRASH;
AllRegsMask_VALIDATE_INDIRECT_CALL_TRASH = GprRegsMask(RBM_VALIDATE_INDIRECT_CALL_TRASH);
#endif

#if defined(TARGET_ARM)
// profiler scratch remains gc live
AllRegsMask_PROF_FNC_LEAVE = AllRegsMask_PROFILER_LEAVE_TRASH & ~AllRegsMask_PROFILER_RET_SCRATCH;
#else
AllRegsMask_PROF_FNC_LEAVE = AllRegsMask_PROFILER_LEAVE_TRASH;
#endif // TARGET_ARM

#ifdef TARGET_XARCH

// Make sure we copy the register info and initialize the
// trash regs after the underlying fields are initialized

const regMaskTP vtCalleeTrashRegs[TYP_COUNT]{
#define DEF_TP(tn, nm, jitType, sz, sze, asze, st, al, regTyp, regFld, csr, ctr, tf) ctr,
#include "typelist.h"
#undef DEF_TP
};
memcpy(varTypeCalleeTrashRegs, vtCalleeTrashRegs, sizeof(regMaskTP) * TYP_COUNT);

if (codeGen != nullptr)
{
codeGen->CopyRegisterInfo();
}
#endif // TARGET_XARCH
}
Member

You may have said this earlier, but why can these not be static variables with values baked into the .dll?

Member Author

Because compiler object has ISA information that we use to determine the float/mask registers to include. #98258 (comment)

Member

Do all of these sets need dynamic creation? Isn't it just the ones defined here that do?

#if defined(TARGET_AMD64)
regMaskTP rbmAllFloat;
regMaskTP rbmFltCalleeTrash;
FORCEINLINE regMaskTP get_RBM_ALLFLOAT() const
{
return this->rbmAllFloat;
}
FORCEINLINE regMaskTP get_RBM_FLT_CALLEE_TRASH() const
{
return this->rbmFltCalleeTrash;
}
#endif // TARGET_AMD64
#if defined(TARGET_XARCH)
regMaskTP rbmAllMask;
regMaskTP rbmMskCalleeTrash;
// Call this function after the equivalent fields in Compiler have been initialized.
void CopyRegisterInfo();
FORCEINLINE regMaskTP get_RBM_ALLMASK() const
{
return this->rbmAllMask;
}
FORCEINLINE regMaskTP get_RBM_MSK_CALLEE_TRASH() const
{
return this->rbmMskCalleeTrash;
}
#endif // TARGET_XARCH

@kunalspathak
Member Author

SingleReg (what is the difference between this and regNumber?)

Probably the name should be SingleRegBitSet, which basically says that the mask has just 1 bit set. It is usually the type returned from genRegMask(regNumber).

RegSet64, RegSet32 (what is the intended use case for these?)

RegSet64 is basically just regMaskTP to indicate that the entity represents 64 registers and likewise for RegSet32. I introduced RegSet32 so I can make the GPR to that type, but I will do it in a follow-up PR.

@jakobbotsch jakobbotsch self-requested a review April 15, 2024 13:01

// Represents that the mask in this type is from one of the register type - gpr/float/predicate
// but not more than 1.
typedef unsigned __int64 regMaskOnlyOne;
Member

Is it possible to have this type come with a tag that can ensure (in DEBUG only) that we don't try to do operations with mismatched register types? Currently I assume that would result in bogus results.
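
For illustration only, one way such a DEBUG-only tag could look. This is a hypothetical sketch, not something in the PR: `TaggedRegMask` and `RegClass` are made-up names, and the real regMaskOnlyOne is a plain typedef. The tag is stored and checked only when DEBUG is defined, so other builds keep the size of a bare 64-bit mask.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register classes for the tag.
enum class RegClass : uint8_t
{
    Gpr,
    Float,
    Predicate
};

// Hypothetical wrapper around a single-class register mask.
class TaggedRegMask
{
public:
    TaggedRegMask(uint64_t bits, RegClass cls) : m_bits(bits)
    {
#ifdef DEBUG
        m_class = cls;
#else
        (void)cls; // the tag is not stored outside DEBUG builds
#endif
    }

    TaggedRegMask operator|(const TaggedRegMask& other) const
    {
        checkSameClass(other);
        TaggedRegMask result = *this;
        result.m_bits |= other.m_bits;
        return result;
    }

    TaggedRegMask operator&(const TaggedRegMask& other) const
    {
        checkSameClass(other);
        TaggedRegMask result = *this;
        result.m_bits &= other.m_bits;
        return result;
    }

    uint64_t bits() const { return m_bits; }

private:
    // In DEBUG builds, mixing masks of different register classes asserts.
    void checkSameClass(const TaggedRegMask& other) const
    {
#ifdef DEBUG
        assert(m_class == other.m_class);
#else
        (void)other;
#endif
    }

    uint64_t m_bits;
#ifdef DEBUG
    RegClass m_class;
#endif
};
```

Combining two gpr masks works in every build flavor; combining a gpr mask with a float mask would assert only when DEBUG is defined.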

@AndyAyersMS
Member

TP cost still seems awfully high...

@jakobbotsch had an idea which might give us the freedom to explore how to control costs better, and also unblock dependent work, with (we suspect) very little or no downside.

What if we restrict arm64 for the time being to only be able to allocate 24 FP registers? Then we can fit 32 GPR + 24 FP + 8 Mask into 64 bits, presumably with fairly minimal TP impact. Likely we have no cases where we really need more than 24 FP regs, so there won't be much CQ impact either.

We will still need to solve the > 64 allocatable things problem but, we'll have time to work on it independently.
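
For illustration only, the bit budget of that suggestion can be written down as follows. The field order (gpr low, then FP, then mask) and all names here are assumptions made for the sketch; the comment does not specify a layout.

```cpp
#include <cstdint>

// Proposed budget: 32 GPR + 24 FP + 8 mask registers in one 64-bit set.
constexpr unsigned GPR_COUNT  = 32;
constexpr unsigned FP_COUNT   = 24;
constexpr unsigned MASK_COUNT = 8;
static_assert(GPR_COUNT + FP_COUNT + MASK_COUNT == 64, "must fit in 64 bits");

// Assumed field order: gpr in the low bits, then FP, then mask registers.
constexpr unsigned FP_SHIFT   = GPR_COUNT;            // 32
constexpr unsigned MASK_SHIFT = GPR_COUNT + FP_COUNT; // 56

constexpr uint64_t gprBit(unsigned i)  { return 1ull << i; }
constexpr uint64_t fpBit(unsigned i)   { return 1ull << (FP_SHIFT + i); }
constexpr uint64_t maskBit(unsigned i) { return 1ull << (MASK_SHIFT + i); }

constexpr uint64_t ALL_GPR  = (1ull << GPR_COUNT) - 1;
constexpr uint64_t ALL_FP   = ((1ull << FP_COUNT) - 1) << FP_SHIFT;
constexpr uint64_t ALL_MASK = ((1ull << MASK_COUNT) - 1) << MASK_SHIFT;

// The three fields tile the 64-bit word with no overlap and no gap.
static_assert((ALL_GPR & ALL_FP) == 0, "gpr and FP fields overlap");
static_assert((ALL_FP & ALL_MASK) == 0, "FP and mask fields overlap");
static_assert((ALL_GPR | ALL_FP | ALL_MASK) == ~0ull, "unused bits remain");
```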

@a74nh
Contributor

a74nh commented Apr 18, 2024

What if we restrict arm64 for the time being to only be able to allocate 24 FP registers? Then we can fit 32 GPR + 24 FP + 8 Mask into 64 bits, presumably with fairly minimal TP impact. Likely we have no cases where we really need more than 24 FP regs, so there won't be much CQ impact either.

I suspect it'll also be rare to require more than 8 mask registers.

There is a large group of instructions (at least 200, see uses of isLowPredicateRegister()) where only predicates p0 to p7 are allowed. There are 10 instructions where only predicates p8 to p15 are allowed (see uses of isHighPredicateRegister()). We don't have any APIs which can directly access the high predicate instructions and I doubt we'll need to generate them indirectly. If we did need to later, then we could offset the 8 values we get from the allocator by 2, giving predicates p2 to p9?

We might want to reserve two mask registers for all zeros and all ones, as that is very common usage. Due to the low predicate instructions, these would have to come from the low predicate registers. Maybe this is a good use for predicates p0 and p1. There would still need to be a mechanism to keep track of whether these had been set for the current function, but it can be separate from the standard register mask?

Thinking wider, how many other registers are fixed? LR, FP and SP won't ever be directly allocated? Are there any others always in use (thread local storage etc?). If so these don't need to be in the register mask either, freeing up more space?

@tannergooding
Member

I expect we're going to end up paying more in terms of actual execution cost by trying to play funny tricks with bitpacking (i.e. using only 29 fp registers) than we would be just using the extra space.

I expect there is going to be a short term higher cost to getting the support in here, no matter what route we take, and we're ultimately going to need the same work done for the x64 APX feature. While many functions may not need the full register set, there are many instructions which have special allocation requirements (like 4 sequential registers or having to start at a register divisible by n) where having a smaller set impacts codegen. We also can win a lot of this back longer term. We will have more opportunities for cleanup, refactorings, and simplification to get more out of it.

I'd also like to call out again that it's very hard to gauge actual cost by TP numbers alone. SPMI doesn't factor in the overhead of the VM calls/token resolution, it doesn't really factor in the difference between debug vs release costs, it doesn't factor in that methods that are substantially slower may be infrequently compiled (excluding crossgen), it doesn't factor in that instructions can be pipelined, fused, or that some single instructions (division or multiplication) can have the cost of dozens of other instructions.

We've seen this nuance in the actual perf numbers we track and triage weekly, such as https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html, where the single biggest jump in that was the enabling of DPGO and the steady increase since is due to the organic growth of S.P.Corelib otherwise. The increase in time for changes like this (even being "10% throughput impact") are incredibly minimal in comparison to simply adding new APIs to the BCL or broadly enabling new optimizations.

I think we should worry more about pushing this in the right direction and ensuring the code is understandable/maintainable, and then longer term get it optimized around what we land on from that perspective.

@kunalspathak
Member Author

Can't agree more on what @tannergooding says.

I expect we're going to end up paying more in terms of actual execution cost by trying to play funny tricks with bitpacking (i.e. using only 29 fp registers) than we would be just using the extra space.

Yes, and we might introduce bugs in the longer run and would have to add more workarounds to deal with them.

I expect there is going to be a short term higher cost to getting the support in here, no matter what route we take, and we're ultimately going to need the same work done for the x64 APX feature.

Yes, this is precisely the thinking I expressed somewhere above. The entire code base assumed that we would never have more than 64 registers, and now we do. So we need to add that functionality in various places. It is similar to how arm32 has a higher TP cost just because of the special handling needed for even-odd register pairs, which is not present on other platforms.

no matter what route we take

And there were several routes (4~5 prototypes) that I explored in the last couple of months just in pursuit of bringing the TP numbers (reported by superpmi) down, after which I settled on the current solution, which is simpler and, more importantly, maintainable.

While many functions may not need the full register set, there are many instructions which have special allocation requirements (like 4 sequential registers or having to start at a register divisible by n) where having a smaller set impacts codegen.

Back when I added consecutive registers support in #80297, I had to disable some special stress register modes just because that wouldn't satisfy the register requirements for the method, given that they needed consecutive registers at multiple places within the same method.

We also can win a lot of this back longer term. We will have more opportunities for cleanup, refactorings, and simplification to get more out of it.

For sure. I have a work item in mind to reduce the size of float/vector registers field from 8 bytes to 4 bytes. With that, all the register masks will be reduced to 4 bytes, which will reduce the size of common data structures like GenTree, RefPosition, Interval by 4 bytes each.

I'd also like to call out again that it's very hard to gauge actual cost by TP numbers alone. SPMI doesn't factor in the overhead of the VM calls/token resolution, it doesn't really factor in the difference between debug vs release costs, it doesn't factor in that methods that are substantially slower may be infrequently compiled (excluding crossgen), it doesn't factor in that instructions can be pipelined, fused, or that some single instructions (division or multiplication) can have the cost of dozens of other instructions.

The numbers I collected in #99658 (comment) prove that there was no impact seen on crossgen2 throughput. Even my previous TP improvements in #96386, #85144, #87424 and #85842, which combined improved TP numbers by around 15%, did not show up in https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html.

We've seen this nuance in the actual perf numbers we track and triage weekly, such as https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html, where the single biggest jump in that was the enabling of DPGO and the steady increase since is due to the organic growth of S.P.Corelib otherwise. The increase in time for changes like this (even being "10% throughput impact") are incredibly minimal in comparison to simply adding new APIs to the BCL or broadly enabling new optimizations.

I think we should worry more about pushing this in the right direction and ensuring the code is understandable/maintainable, and then longer term get it optimized around what we land on from that perspective.

Yes, I would like to get more feedback on the "understandable/maintainable" and "code readability" parts.

What if we restrict arm64 for the time being to only be able to allocate 24 FP registers? Then we can fit 32 GPR + 24 FP + 8 Mask into 64 bits, presumably with fairly minimal TP impact. Likely we have no cases where we really need more than 24 FP regs, so there won't be much CQ impact either.
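As a rough illustration of that packing, the sketch below lays out 32 GPR, 24 FP and 8 mask (predicate) registers in a single 64-bit set. All names (`RegKind`, `regBit`, the `k*` constants) and the ordering of the three ranges are assumptions made for this example, not identifiers from the PR.

```cpp
#include <cstdint>

// Hypothetical layout: bits 0-31 GPR, bits 32-55 FP, bits 56-63 predicate.
enum class RegKind { Gpr, Float, Predicate };

constexpr unsigned kGprCount  = 32;
constexpr unsigned kFpCount   = 24; // restricted from 32, as suggested above
constexpr unsigned kPredCount = 8;
static_assert(kGprCount + kFpCount + kPredCount == 64,
              "all three register classes must fit in one 64-bit mask");

constexpr unsigned kFpShift   = kGprCount;
constexpr unsigned kPredShift = kGprCount + kFpCount;

constexpr uint64_t kGprMask  = (uint64_t{1} << kGprCount) - 1;
constexpr uint64_t kFpMask   = ((uint64_t{1} << kFpCount) - 1) << kFpShift;
constexpr uint64_t kPredMask = ((uint64_t{1} << kPredCount) - 1) << kPredShift;

// Single-bit mask for register `index` within its class.
// The caller must keep `index` below the class's register count.
constexpr uint64_t regBit(RegKind kind, unsigned index)
{
    return kind == RegKind::Gpr     ? uint64_t{1} << index
           : kind == RegKind::Float ? uint64_t{1} << (kFpShift + index)
                                    : uint64_t{1} << (kPredShift + index);
}
```

Under this layout every existing 64-bit mask operation keeps working unchanged; the cost is that FP registers 24 through 31 and predicate registers 8 through 15 have no bit and so cannot be allocated.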

@AndyAyersMS - I assume you are talking about having these restrictions on methods that need mask registers, but continue to have 32 fp registers otherwise. We will still not know about it until we get past importer, but by then, we populate most of the register masks like callee-save and callee-trash masks, etc. and will need to reset.

@jakobbotsch
Member

jakobbotsch commented Apr 21, 2024

The idea would be to just treat some FP registers as universally nonexistent, so the change would be simple. From my side it was merely a suggestion on how to unblock work if we were unhappy about taking the TP regressions. The wall clock measurements above clearly show MinOpts impact in clrjit.dll (what it translates to on actual startup scenarios is another question).

I had to settle on the current solution, which is simpler and, more importantly, maintainable.

IMO the current solution seems complex and less maintainable. It adds several thousand lines of code and makes it possible to silently get register set operations wrong (like a union between two regMaskOnlyOne values representing different register types). There are a bunch of different set types, and you have to decide which one to use when and how to convert between them.
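For illustration only, one way a type system can rule out that class of mistake is to tag each mask with its register class, so that mixing classes fails to compile. The `TypedRegMask` template and tag types below are hypothetical and are not what the PR implements.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical tag types, one per register class.
struct GprTag {};
struct FloatTag {};
struct PredicateTag {};

// A mask that only combines with masks of the same register class.
template <typename Tag>
class TypedRegMask
{
public:
    constexpr explicit TypedRegMask(uint32_t bits = 0) : m_bits(bits) {}

    constexpr uint32_t bits() const { return m_bits; }

    constexpr TypedRegMask operator|(TypedRegMask other) const
    {
        return TypedRegMask(m_bits | other.m_bits);
    }

    constexpr TypedRegMask operator&(TypedRegMask other) const
    {
        return TypedRegMask(m_bits & other.m_bits);
    }

private:
    uint32_t m_bits;
};

using GprMask       = TypedRegMask<GprTag>;
using FloatMask     = TypedRegMask<FloatTag>;
using PredicateMask = TypedRegMask<PredicateTag>;

// There is no conversion between classes, so an expression such as
// `GprMask(1) | FloatMask(1)` is rejected at compile time.
static_assert(!std::is_convertible<FloatMask, GprMask>::value,
              "masks of different register classes must not mix");
```

The trade-off is the one raised above: every additional mask type is one more thing callers must choose between and convert explicitly.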

It's still surprising to me that just bumping regMaskTP to a 12 or 16 byte struct had such large measured throughput impact throughout the JIT. I wonder if there is a simple explanation for some of the cost or if it truly boils down to the more expensive bit set operations throughout the JIT.
One thing we've seen previously is that expanding the size of some types can have disproportionately large impact in number of instructions executed because multiplying by the size of the type can start using different patterns of instructions. These are the kind of costs that we should definitely feel free to ignore.

I have a work item in mind to reduce the size of float/vector registers field from 8 bytes to 4 bytes. With that, all the register masks will be reduced to 4 bytes, which will reduce the size of common data structures like GenTree, RefPosition, Interval by 4 bytes each.

I agree it would be great to have these optimizations.

The numbers I collected in #99658 (comment) show that there was no impact on crossgen2 throughput. Even my previous TP improvements in #96386, #85144, #87424 and #85842, which together improved TP numbers by around 15%, did not show up in https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html.

Does this measure a significant number of MinOpts compilations? I assume crossgen2 is compiling everything in FullOpts, and we know that crossgen2 itself interacts poorly with tiered compilation (#83112), so it might just be dominated by other costs than jitting.

@jakobbotsch
Member

I tried your PR at #96196 and see the following for benchmarks.run_pgo. This is arm64 cross compiled from x64 where I use the intrinsic for BitOperations::PopCount (which reduced the SPMI tp impact from +5.6% to +4.07% initially -- I believe we should be able to use an intrinsic for popcount on arm64 host?)

Base: 141063333131, Diff: 146798692822, +4.0658%

1339309854 : +33.69%     : 20.61% : +0.9494% : public: unsigned __int64 __cdecl LinearScan::RegisterSelection::select<0>(class Interval *, class RefPosition *)                                                                                                                                                                                                      
790662416  : +12966.72%  : 12.16% : +0.5605% : public: void __cdecl Interval::mergeRegisterPreferences(unsigned __int64)                                                                                                                                                                                                                                             
668963185  : +22.84%     : 10.29% : +0.4742% : public: void __cdecl LinearScan::allocateRegisters<0>(void)                                                                                                                                                                                                                                                           
534637331  : +26.96%     : 8.23%  : +0.3790% : private: void __cdecl LinearScan::processBlockStartLocations(struct BasicBlock *)                                                                                                                                                                                                                                     
358948739  : +31.97%     : 5.52%  : +0.2545% : public: void __cdecl LinearScan::allocateRegistersMinimal(void)                                                                                                                                                                                                                                                       
318294318  : +25.55%     : 4.90%  : +0.2256% : private: class RefPosition * __cdecl LinearScan::newRefPosition(class Interval *, unsigned int, enum RefType, struct GenTree *, unsigned __int64, unsigned int)                                                                                                                                                       
256068471  : +39.17%     : 3.94%  : +0.1815% : protected: enum _regNumber_enum __cdecl CodeGen::genConsumeReg(struct GenTree *)                                                                                                                                                                                                                                      
230314932  : +21.92%     : 3.54%  : +0.1633% : private: void __cdecl LinearScan::associateRefPosWithInterval(class RefPosition *)                                                                                                                                                                                                                                    
153840898  : +34.77%     : 2.37%  : +0.1091% : private: void __cdecl LinearScan::addRefsForPhysRegMask(unsigned __int64, unsigned int, enum RefType, bool)                                                                                                                                                                                                           
153099855  : +40.08%     : 2.36%  : +0.1085% : private: void __cdecl LinearScan::freeRegisters(unsigned __int64)                                                                                                                                                                                                                                                     
124799070  : +122.19%    : 1.92%  : +0.0885% : public: void __cdecl GCInfo::gcMarkRegPtrVal(enum _regNumber_enum, enum var_types)                                                                                                                                                                                                                                    
104909768  : +10.68%     : 1.61%  : +0.0744% : protected: void __cdecl CodeGen::genCodeForBBlist(void)                                                                                                                                                                                                                                                               
85800424   : NA          : 1.32%  : +0.0608% : public: void __cdecl emitter::emitUpdateLiveGCregs(enum GCtype, unsigned __int64, unsigned char *)                                                                                                                                                                                                                    
72840134   : +8.71%      : 1.12%  : +0.0516% : private: int __cdecl LinearScan::BuildNode(struct GenTree *)                                                                                                                                                                                                                                                          

So the vast majority of the TP impact is coming from a small number of functions within LSRA. I wonder if it would be worth trying to restrict the introduction of the segregated register sets to LSRA only, so that the rest of the JIT doesn't need to learn about the differences.

@kunalspathak
Member Author

kunalspathak commented Apr 26, 2024

Startup impact

I did some measurements on TE benchmarks and measured startup and first request time. I see barely a ~1.5% regression.
Edit: Added results from the Orchard benchmark, which JITs around 34,125 methods.

| Benchmarks | Avg. First request (Base) | Avg. First request (Diff) | % diff | # of Tier 0 methods |
| --- | --- | --- | --- | --- |
| Fortune | 324.1 | 325.9 | 0.56% | 6899 |
| Json-minimal | 193 | 195.8 | 1.45% | 3353 |
| Json | 204.5 | 208.3 | 1.86% | 5339 |
| Orchard | 5220 | 5293 | 1.40% | 34,125 |

| Benchmarks | Avg. Startup time (Base) | Avg. Startup time (Diff) | % diff |
| --- | --- | --- | --- |
| Fortune | 313 | 315.3 | 0.73% |
| Json-minimal | 248.2 | 249.8 | 0.64% |
| Json | 347.9 | 350.4 | 0.72% |
| Orchard | 502 | 510 | 1.59% |

10 iterations data

Fortunes

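| First Request Base | First Request Diff | Startup Base | Startup Diff |
| --- | --- | --- | --- |
| 332 | 341 | 309 | 312 |
| 321 | 348 | 315 | 311 |
| 348 | 328 | 322 | 311 |
| 323 | 320 | 313 | 322 |
| 323 | 321 | 310 | 317 |
| 326 | 320 | 318 | 316 |
| 319 | 322 | 313 | 310 |
| 325 | 314 | 308 | 328 |
| 313 | 326 | 306 | 312 |
| 311 | 319 | 316 | 314 |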

Json-Minimal

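| First Request Base | First Request Diff | Startup Base | Startup Diff |
| --- | --- | --- | --- |
| 189 | 197 | 249 | 249 |
| 192 | 196 | 244 | 256 |
| 196 | 199 | 256 | 248 |
| 191 | 192 | 244 | 248 |
| 193 | 195 | 248 | 258 |
| 194 | 196 | 248 | 244 |
| 193 | 193 | 247 | 250 |
| 194 | 196 | 256 | 251 |
| 194 | 198 | 244 | 244 |
| 194 | 196 | 246 | 250 |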

Json Mvc

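| First Request Base | First Request Diff | Startup Base | Startup Diff |
| --- | --- | --- | --- |
| 209 | 210 | 354 | 353 |
| 211 | 212 | 355 | 350 |
| 210 | 200 | 340 | 347 |
| 204 | 215 | 346 | 352 |
| 193 | 205 | 348 | 351 |
| 199 | 206 | 352 | 348 |
| 204 | 207 | 342 | 341 |
| 206 | 205 | 334 | 344 |
| 201 | 213 | 349 | 361 |
| 208 | 210 | 359 | 357 |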

Orchard

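| First Request Base | First Request Diff | Startup Base | Startup Diff |
| --- | --- | --- | --- |
| 5,130 | 5,295 | 514 | 523 |
| 5,103 | 5,389 | 513 | 515 |
| 5,389 | 5,303 | 506 | 501 |
| 5,152 | 5,256 | 505 | 510 |
| 5,250 | 5,266 | 513 | 519 |
| 5,314 | 5,261 | 495 | 502 |
| 5,148 | 5,201 | 483 | 505 |
| 5,224 | 5,213 | 500 | 506 |
| 5,368 | 5,353 | 493 | 504 |
| 5,119 | 5,388 | 497 | 513 |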

Crossgen2 throughput impact

Crossgen2 Throughput data: #99658 (comment)

TP impact

Now, let's take a look at TP impact reported by superpmi. Looking at the TP difference for benchmarks.run_tiered collection for windows/arm64, I see 6% regression for MinOpts, which contains around 37,089 method contexts.

| Category | Collection | PDIFF |
| --- | --- | --- |
| Overall | benchmarks.run_tiered.windows.arm64.checked.mch | +4.25% |
| MinOpts | benchmarks.run_tiered.windows.arm64.checked.mch | +6.47% |
| FullOpts | benchmarks.run_tiered.windows.arm64.checked.mch | +2.57% |

?allocateRegistersMinimal@LinearScan@@QEAAXXZ                                                                                       : 340392599  : +45.58%  : 25.24% : +2.9085%
?addRefsForPhysRegMask@LinearScan@@AEAAXAEBU_regMaskAll@@IW4RefType@@_N@Z                                                           : 184181130  : NA       : 13.66% : +1.5737%
?freeRegisters@LinearScan@@AEAAXU_regMaskAll@@@Z                                                                                    : 86201120   : NA       : 6.39%  : +0.7365%
?allocateRegMinimal@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z                                          : 81623577   : +13.48%  : 6.05%  : +0.6974%
?freeRegister@LinearScan@@AEAAXPEAVRegRecord@@@Z                                                                                    : 70217892   : NA       : 5.21%  : +0.6000%
?gtGetGprRegMask@GenTree@@QEBA_KXZ                                                                                                  : 45216669   : NA       : 3.35%  : +0.3863%
?writeRegisters@LinearScan@@QEAAXPEAVRefPosition@@PEAUGenTree@@@Z                                                                   : 30848120   : NA       : 2.29%  : +0.2636%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                            : 28280466   : +40.08%  : 2.10%  : +0.2416%
?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z                                       : 23259358   : +8.59%   : 1.72%  : +0.1987%
?unassignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVRefPosition@@@Z                                                                : 17337530   : +21.69%  : 1.29%  : +0.1481%
?assignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                                     : 16785216   : +27.27%  : 1.24%  : +0.1434%
?buildKillPositionsForNode@LinearScan@@AEAA_NPEAUGenTree@@IAEBU_regMaskAll@@@Z                                                      : 11895844   : NA       : 0.88%  : +0.1016%
??0LinearScan@@QEAA@PEAVCompiler@@@Z                                                                                                : 11645946   : +50.81%  : 0.86%  : +0.0995%
?PopCount@BitOperations@@SAI_K@Z                                                                                                    : 10508274   : +134.96% : 0.78%  : +0.0898%
?gcMarkRegPtrVal@GCInfo@@QEAAXW4_regNumber_enum@@W4var_types@@@Z                                                                    : 9900926    : +40.80%  : 0.73%  : +0.0846%
?genConsumeReg@CodeGen@@IEAA?AW4_regNumber_enum@@PEAUGenTree@@@Z                                                                    : 8915253    : +7.52%   : 0.66%  : +0.0762%
?BuildNode@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                         : 8813598    : +5.66%   : 0.65%  : +0.0753%
?newRefPositionRaw@LinearScan@@AEAAPEAVRefPosition@@IPEAUGenTree@@W4RefType@@@Z                                                     : 7806564    : +1.89%   : 0.58%  : +0.0667%
?buildPhysRegRecords@LinearScan@@AEAAXXZ                                                                                            : 7529067    : +16.06%  : 0.56%  : +0.0643%
?BuildCall@LinearScan@@AEAAHPEAUGenTreeCall@@@Z                                                                                     : 5614378    : +14.60%  : 0.42%  : +0.0480%
?genCodeForTreeNode@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                   : 5226233    : +3.53%   : 0.39%  : +0.0447%
?ins_Copy@CodeGen@@QEAA?AW4instruction@@W4_regNumber_enum@@W4var_types@@@Z                                                          : 4644888    : NA       : 0.34%  : +0.0397%
?genProduceReg@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                        : 3660030    : +3.24%   : 0.27%  : +0.0313%
?allocateMemory@ArenaAllocator@@QEAAPEAX_K@Z                                                                                        : 3084940    : +0.89%   : 0.23%  : +0.0264%
?genSetRegToConst@CodeGen@@IEAAXW4_regNumber_enum@@W4var_types@@PEAUGenTree@@@Z                                                     : 3041553    : +30.06%  : 0.23%  : +0.0260%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                                                                   : 2856060    : +1.26%   : 0.21%  : +0.0244%
?instGen_Set_Reg_To_Imm@CodeGen@@QEAAXW4emitAttr@@W4_regNumber_enum@@_JW4insFlags@@@Z                                               : 2522843    : +6.25%   : 0.19%  : +0.0216%
?genRestoreCalleeSavedRegistersHelp@CodeGen@@IEAAXAEBU_regMaskAll@@HH@Z                                                             : 2100312    : NA       : 0.16%  : +0.0179%
?emitOutputInstr@emitter@@IEAA_KPEAUinsGroup@@PEAUinstrDesc@1@PEAPEAE@Z                                                             : 1693293    : +0.44%   : 0.13%  : +0.0145%
?genSaveCalleeSavedRegistersHelp@CodeGen@@IEAAXAEBU_regMaskAll@@HH@Z                                                                : 1689330    : NA       : 0.13%  : +0.0144%
?compCompileHelper@Compiler@@QEAAHPEAUCORINFO_MODULE_STRUCT_@@PEAVICorJitInfo@@PEAUCORINFO_METHOD_INFO@@PEAPEAXPEAIPEAVJitFlags@@@Z : 1594827    : +17.09%  : 0.12%  : +0.0136%
?HasMultiRegRetVal@GenTreeCall@@QEBA_NXZ                                                                                            : -1430094   : -12.99%  : 0.11%  : -0.0122%
?BuildDefsWithKills@LinearScan@@AEAAXPEAUGenTree@@H_K1@Z                                                                            : -1555166   : -100.00% : 0.12%  : -0.0133%
?inst_Mov@CodeGen@@QEAAXW4var_types@@W4_regNumber_enum@@1_NW4emitAttr@@W4insFlags@@@Z                                               : -1927004   : -13.42%  : 0.14%  : -0.0165%
?UpdateLifeVar@?$TreeLifeUpdater@$00@@AEAAXPEAUGenTree@@PEAUGenTreeLclVarCommon@@@Z                                                 : -2246490   : -8.57%   : 0.17%  : -0.0192%
?resetAllRegistersState@LinearScan@@AEAAXXZ                                                                                         : -3366369   : -6.09%   : 0.25%  : -0.0288%
?BuildUse@LinearScan@@AEAAPEAVRefPosition@@PEAUGenTree@@_KH@Z                                                                       : -5405424   : -3.81%   : 0.40%  : -0.0462%
?updateMaxSpill@LinearScan@@QEAAXPEAVRefPosition@@@Z                                                                                : -7422214   : -9.99%   : 0.55%  : -0.0634%
??$resolveRegisters@$0A@@LinearScan@@QEAAXXZ                                                                                        : -10955233  : -4.12%   : 0.81%  : -0.0936%
?buildKillPositionsForNode@LinearScan@@AEAA_NPEAUGenTree@@I_K@Z                                                                     : -11299476  : -100.00% : 0.84%  : -0.0965%
?BuildDefs@LinearScan@@AEAAXPEAUGenTree@@H_K@Z                                                                                      : -12740041  : -100.00% : 0.94%  : -0.1089%
?gtGetRegMask@GenTree@@QEBA_KXZ                                                                                                     : -30758149  : -100.00% : 2.28%  : -0.2628%
?freeRegisters@LinearScan@@AEAAX_K@Z                                                                                                : -94590308  : -100.00% : 7.02%  : -0.8082%
?addRefsForPhysRegMask@LinearScan@@AEAAX_KIW4RefType@@_N@Z                                                                          : -104673921 : -100.00% : 7.76%  : -0.8944%

Most of the regression is coming from allocateRegistersMinimal (which mostly operates on AllRegsMask instead of regMaskTP) and addRefsForPhysRegMask (which now iterates over every set register bit in AllRegsMask to create a RefPosition for each). I will take a deeper look at what can be optimized here.
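
To show why that iteration gets more expensive, here is a minimal sketch of walking the set bits of a register set wider than 64 bits. `WideRegMask`, `lowestSetBit` and `setRegisters` are hypothetical stand-ins, not the PR's `AllRegsMask` API, and a production version would use a hardware bit-scan intrinsic instead of the portable loop.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a register set wider than 64 bits:
// registers 0-63 live in `low`, registers 64 and up in `high`.
struct WideRegMask
{
    uint64_t low;
    uint32_t high;
};

// Index of the lowest set bit. Precondition: v != 0.
inline unsigned lowestSetBit(uint64_t v)
{
    unsigned index = 0;
    while ((v & 1) == 0)
    {
        v >>= 1;
        ++index;
    }
    return index;
}

// Collects the register numbers of all set bits, lowest first.
inline std::vector<unsigned> setRegisters(WideRegMask mask)
{
    std::vector<unsigned> regs;
    for (uint64_t bits = mask.low; bits != 0; bits &= bits - 1) // clear lowest set bit
    {
        regs.push_back(lowestSetBit(bits));
    }
    for (uint64_t bits = mask.high; bits != 0; bits &= bits - 1)
    {
        regs.push_back(64 + lowestSetBit(bits));
    }
    return regs;
}
```

Compared with a single 64-bit mask, every walk now needs a second loop (or an extra emptiness check) for the upper word, even in methods that never touch a predicate register.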

@kunalspathak
Member Author

Added a Review guide in the PR description.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LSRA: Add support to track more than 64 registers
6 participants