Enable AVX512 Additional 16 SIMD Registers #79544
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. Issue Details: PR is a draft for testing purposes. Will update with a full description of the changes when ready for review.
Force-pushed from d169ba4 to 317d180.
```
{
    case IF_RWR_RRD_RRD:
    case IF_RWR_RRD_RRD_CNS:
    case IF_RWR_RRD_RRD_RRD:
```
Is it expected that this one is here and in `idHasReg4`? Could we maybe simplify it to `default: return idHasReg4()`?
```
@@ -78,10 +78,16 @@
#endif // !UNIX_AMD64_ABI
#define CSE_CONSTS 1 // Enable if we want to CSE constants

#define RBM_ALLFLOAT  (RBM_XMM0 | RBM_XMM1 | RBM_XMM2 | RBM_XMM3 | RBM_XMM4 | RBM_XMM5 | RBM_XMM6 | RBM_XMM7 | RBM_XMM8 | RBM_XMM9 | RBM_XMM10 | RBM_XMM11 | RBM_XMM12 | RBM_XMM13 | RBM_XMM14 | RBM_XMM15)
#define RBM_LOWFLOAT  (RBM_XMM0 | RBM_XMM1 | RBM_XMM2 | RBM_XMM3 | RBM_XMM4 | RBM_XMM5 | RBM_XMM6 | RBM_XMM7 | RBM_XMM8 | RBM_XMM9 | RBM_XMM10 | RBM_XMM11 | RBM_XMM12 | RBM_XMM13 | RBM_XMM14 | RBM_XMM15)
#define RBM_HIGHFLOAT (RBM_XMM16 | RBM_XMM17 | RBM_XMM18 | RBM_XMM19 | RBM_XMM20 | RBM_XMM21 | RBM_XMM22 | RBM_XMM23 | RBM_XMM24 | RBM_XMM25 | RBM_XMM26 | RBM_XMM27 | RBM_XMM28 | RBM_XMM29 | RBM_XMM30 | RBM_XMM31)
```
I wonder if we could come up with a better name than "HIGHFLOAT". From the initial 8 SIMD registers, we've had 2 extensions so far and it's conceivable a 3rd could happen in the future.
I've used "high" to refer to the 16 AVX512 registers mostly because that is the term referenced in the Intel manual, with "next" being used to reference the 8 AVX registers.
I'm in favor of a better naming scheme, but not certain what to choose. We could select `RBM_AVX512FLOAT` or something along those lines...
@tannergooding any more thoughts on this?
I don't have a particular preference and I don't think it's "strictly" necessary to update. More a passing thought about how it might be confusing.
@anthonycanino Looks like you need to fix the formatting. Diffs: no diffs, but reasonably significant TP regressions (~0.5%) on x64.
Do you want to include https://github.com/dotnet/runtime/compare/main...BruceForstall:EnableJitStressHighAvx512Regs?expand=1 in this change?
Yes, I think that would be a good idea. Do I merge it in this PR or do we do that separately?
Include the changes as a new commit. Doesn't need to be a branch merge from mine or anything like that.
Looks like this is likely to conflict with #79478
Force-pushed from c2ee20b to 4011e25.
I've rebased and pushed to test now.
src/coreclr/jit/compiler.cpp (Outdated)
```
#if defined(TARGET_AMD64)
    rbmAllFloat                       = RBM_ALLFLOAT_INIT;
    rbmFltCalleeTrash                 = RBM_FLT_CALLEE_TRASH_INIT;
    rbmCntCalleeTrashFloat            = CNT_CALLEE_TRASH_FLOAT_INIT;

    if (DoJitStressEvexEncoding())
    {
        rbmAllFloat |= RBM_HIGHFLOAT;
        rbmFltCalleeTrash |= RBM_HIGHFLOAT;
        rbmCntCalleeTrashFloat += 16;
    }
    rbmCalleeTrash                    = RBM_CALLEE_TRASH_INIT;
    rbmCalleeTrashNoGC                = RBM_CALLEE_TRASH_NOGC_INIT;
    rbmCalleeTrashWriteBarrier        = RBM_CALLEE_TRASH_WRITEBARRIER_INIT;
    rbmCalleeGCTrashWriteBarrier      = RBM_CALLEE_GCTRASH_WRITEBARRIER_INIT;
    rbmCalleeTrashWriteBarrierByref   = RBM_CALLEE_TRASH_WRITEBARRIER_BYREF_INIT;
    rbmCalleeGCTrashWriteBarrierByref = RBM_CALLEE_GCTRASH_WRITEBARRIER_BYREF_INIT;
    rbmStopForGCTrash                 = RBM_STOP_FOR_GC_TRASH_INIT;
    rbmInitPInvokeFrameTrash          = RBM_INIT_PINVOKE_FRAME_TRASH_INIT;
    rbmProfilerEnterTrash             = RBM_PROFILER_ENTER_TRASH_INIT;
    rbmProfilerLeaveTrash             = RBM_PROFILER_LEAVE_TRASH_INIT;
    rbmProfilerTailcallTrash          = RBM_PROFILER_TAILCALL_TRASH_INIT;
#endif // TARGET_AMD64
```
So these are global variables now, that are initialized every time the JIT is initialized?
Can we assume the memory write is atomic? Should they be initialized using InterlockedExchange64? Using Interlocked, we could assert they either have the same value as written, or zero (uninitialized).
I presume they are not on Compiler because either (1) it was inconvenient to plumb a Compiler* everywhere they are needed, (2) throughput, or (3) you couldn't/didn't want to use the TLS Compiler* (maybe only available in DEBUG?)
Could there be a scenario where these change for a single clrjit.dll instance? E.g., what if crossgen2/AOT compilation loads the JIT and compiles both an AVX-512 targeting function and a non-AVX-512 targeting function?
Currently, they are only changed with the JitStress mode. They will presumably be set based on ISA. Does this initialization location have all the appropriate ISA information?
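(A minimal sketch of the Interlocked-guarded initialization being suggested here, assuming the Windows `InterlockedCompareExchange64`; the global and function names are illustrative, not the PR's actual code:)

```cpp
#include <assert.h>
#include <windows.h>

// Illustrative global; stands in for rbmAllFloat etc. from the diff above.
static volatile LONG64 g_rbmAllFloat = 0;

void PublishRegMask(LONG64 computedMask)
{
    // Atomically publish the mask. If another JIT initialization won the
    // race, the slot must hold either zero or the identical value.
    LONG64 prev = InterlockedCompareExchange64(&g_rbmAllFloat, computedMask, 0);
    assert((prev == 0) || (prev == computedMask));
}
```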
> So these are global variables now, that are initialized every time the JIT is initialized?
> Can we assume the memory write is atomic? Should they be initialized using InterlockedExchange64? Using Interlocked, we could assert they either have the same value as written, or zero (uninitialized).
> I presume they are not on Compiler because either (1) it was inconvenient to plumb a Compiler* everywhere they are needed, (2) throughput, or (3) you couldn't/didn't want to use the TLS Compiler* (maybe only available in DEBUG?)

Yes, this is why: passing `compiler` around was both clunky and also incurred its own overhead, because it needed to be checked on each access.

> Could there be a scenario where these change for a single clrjit.dll instance? E.g., what if crossgen2/AOT compilation loads the JIT and compiles both an AVX-512 targeting function and a non-AVX-512 targeting function?

I could use some help here, because honestly I'm not familiar enough with the internals to understand this. As it's done now, I am assuming that these are unchanged and compiling for a single arch/ISA.

> Currently, they are only changed with the JitStress mode. They will presumably be set based on ISA. Does this initialization location have all the appropriate ISA information?

The compiler object has this information as far as I know. Essentially, I would envision replacing this with a `compOpportunisticallyDependsOn(InstructionSet_AVX512F_VL)` check.
@tannergooding @kunalspathak are my assumptions here correct? Will this be an issue with crossgen?
I think it needs to be filtered on `compOpportunisticallyDependsOn(InstructionSet_AVX512F)`, but I agree that I would expect crossgen to work correctly presuming that.
I do think this needs to be "per compiler instance" and not part of the global state, as we can have AltJits or other scenarios where one compiler supports an ISA but another does not.
If we could cheapen this further, such as pulling the relevant state into a local where the register allocator needs it, that would presumably be "better".
If the TP regression is "too much" even with that, I could imagine a few ways we could use a template or inlining to ensure the LSRA code still uses a constant and is primarily filtered from a centralized high-level check (of course, this comes at the cost of increased deployment size for clrjit.dll).
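(A hedged sketch of the template idea: the typedef, bit layouts, and function names below are illustrative, not from the PR; the point is that each instantiation sees a compile-time constant mask while a single runtime check picks the instantiation:)

```cpp
#include <cstdint>

using regMaskTP = std::uint64_t;                            // widened xarch mask type
constexpr regMaskTP RBM_LOWFLOAT  = 0x000000000000FFFFull;  // illustrative bit layout
constexpr regMaskTP RBM_HIGHFLOAT = 0x00000000FFFF0000ull;  // illustrative bit layout

// Specialize the hot LSRA path on EVEX availability so the float register
// mask is a compile-time constant inside each instantiation.
template <bool hasEvexRegs>
void allocateRegistersImpl()
{
    constexpr regMaskTP allFloat =
        hasEvexRegs ? (RBM_LOWFLOAT | RBM_HIGHFLOAT) : RBM_LOWFLOAT;
    // ... hot allocation loop uses allFloat directly as a constant ...
    (void)allFloat;
}

void allocateRegisters(bool evexSupported)
{
    // One centralized runtime check; both specializations get compiled into
    // clrjit.dll, which is the deployment-size cost mentioned above.
    if (evexSupported)
        allocateRegistersImpl<true>();
    else
        allocateRegistersImpl<false>();
}
```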
> If we could cheapen this further, such as pulling the relevant state into a local where the register allocator needs it

The problem is that it is not just LSRA that depends on them, but the optimizer and codegen as well. Hence it is a little hard to extract that out just at the LSRA level.
> I presume they are not on Compiler because either (1) it was inconvenient to plumb a Compiler* everywhere they are needed, (2) throughput, or (3) you couldn't/didn't want to use the TLS Compiler* (maybe only available in DEBUG?)

The reason was #1, but what you have brought up is a legitimate concern. The only way to avoid it is to have them on the compiler object and then pass the `compiler` instance around. Depending on which areas are using these macros, we could do something like the following: you have a macro (per file, since there are not that many) which takes the compiler object, but at all the places where it is needed, you use a version of the macro that actually calls the one taking the compiler object. That way, at the callsite, you don't have to pass the `compiler` object. It's ugly :(

```
#define ABC_WITH_COMPILER(compiler) compiler->isEvex ? x : y
#define ABC ABC_WITH_COMPILER(this)
= ABC;
```
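(To make the sketch above concrete, a hypothetical version using one of the masks from this PR; the member name `rbmAllFloat` is taken from the diff earlier, but both macro names here are illustrative:)

```cpp
// Hypothetical concrete form of the two-macro pattern: the _WITH_COMPILER
// variant takes an explicit Compiler*, while the short form assumes the
// call site is a Compiler member function where `this` is a Compiler*.
#define RBM_ALLFLOAT_WITH_COMPILER(comp) ((comp)->rbmAllFloat)
#define RBM_ALLFLOAT RBM_ALLFLOAT_WITH_COMPILER(this)

// Inside a Compiler member function, usage then stays unchanged:
//     regMaskTP mask = RBM_ALLFLOAT;
```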
@kunalspathak @tannergooding do you have any more thoughts on how to adjust the PR for this?
Did you try the approach in #79544 (comment) and #79544 (comment)?
Ok, let me try this approach #79544 (comment) then and see how the numbers look.
@kunalspathak I have a draft of this up at 9c386bc. Currently waiting for test data / throughput.
To be honest, I do not like it; it seems really messy (even if we move the macro definitions to the call sites). It also seems the alternatives aren't great either.
FYI, with the change 9c386bc ...
It seems like most approaches have been pretty similar in throughput impact.
Triggered avx-512 pipeline: https://dev.azure.com/dnceng-public/public/_build/results?buildId=112900&view=results
Added some comments for code around LSRA.
src/coreclr/jit/emitinl.h (Outdated)
```
@@ -214,7 +214,8 @@ inline ssize_t emitter::emitGetInsAmdAny(instrDesc* id)

/*static*/ inline void emitter::emitEncodeCallGCregs(regMaskTP regmask, instrDesc* id)
{
    assert((regmask & RBM_CALLEE_TRASH) == 0);
    // TODO-XARCH-AVX512 global defined in compiler.h, not in scope here
```
fix and uncomment?
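(One hedged guess at what the fixed, uncommented assert could look like if the emitter reaches the mask through its compiler reference; `emitComp` exists on the emitter, but the accessor name is an assumption, not this PR's code:)

```cpp
// Hypothetical: route the check through the per-compiler mask instead of
// the old RBM_CALLEE_TRASH macro that is no longer in scope here.
assert((regmask & emitComp->get_RBM_CALLEE_TRASH()) == 0);
```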
src/coreclr/jit/lsraxarch.cpp (Outdated)
```
@@ -1900,21 +1905,30 @@ int LinearScan::BuildIntrinsic(GenTree* tree)
        break;
    }
    assert(tree->gtGetOp2IfPresent() == nullptr);

    // TODO-XARCH-AVX512 this is overly constraining register available as NI_System_Math_Abs
```
Can we have this extracted out into a separate method (along with the comment) instead of having this check in multiple places?
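(A sketch of the kind of helper being asked for, assuming the low-register constraint only matters for floating/SIMD nodes; the method name and the `canUseEvexEncoding` check are hypothetical stand-ins, not the PR's actual code:)

```cpp
// Hypothetical helper: centralize the "constrain to XMM0-XMM15 when the
// instruction has no EVEX form" decision instead of repeating it inline.
regMaskTP LinearScan::BuildEvexIncompatibleMask(GenTree* tree)
{
#if defined(TARGET_AMD64)
    if (!compiler->canUseEvexEncoding() &&
        (varTypeIsFloating(tree) || varTypeIsSIMD(tree)))
    {
        return lowSIMDRegs(); // an RBM_LOWFLOAT-equivalent set
    }
#endif
    return RBM_NONE;
}
```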
Also, shouldn't this be under an `isEvexCompatible` check instead of just using `RBM_LOWSIMD` for AMD64?
In this case, it's the NamedIntrinsics and not the HWIntrinsics, e.g., those defined in namedintrinsiclist.h.
I suppose I can try and add a similar marking for these, but as far as I know there is not a set of flags for these named intrinsics?
```
srcCount += isRMW ? BuildDelayFreeUses(op3, op1) : BuildOperandUses(op3);
regMaskTP op3RegCandidates = RBM_NONE;
#if defined(TARGET_AMD64)
if (!isEvexCompatible)
```
This is how I would expect the above code to look, too.
src/coreclr/jit/lsrabuild.cpp (Outdated)
```
if (mask == RBM_LOWSIMD)
{
    // Constrain if we have to for float/simd types
    if (varTypeIsFloating(theInterval->registerType) || varTypeIsSIMD(theInterval->registerType))
```
This is a heavy check to add in `newRefPosition`. Is there a way to move this check (you can have this logic in a helper method or something) to the relevant callers, which would definitely pass `mask == RBM_LOWSIMD`?
At the callsite, since you already have the `GenTree*` for the `RefPosition`, you can do something like this:

```cpp
regMaskTP mask = RBM_NONE;
if (varTypeIsFloating(node) || varTypeIsSIMD(node))
{
    mask = lowSIMDRegs();
}
BuildUse(node, mask);
```
I believe there was a case that specifying the mask at a `BuildUse` callsite did not cover: something related to a contained `GT_INDIR` node whose register set was not the same as its containing node's.
Let me see if I can find the example.
With this change 6cd4d43 we can see the following example cause some potential issues...

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
static bool Test_Sse41_TestZ_Normal(in Vector128<Int32> x, in Vector128<Int32> y)
{
    return Sse41.TestZ(x, y);
}
```

which lowers to...
```
N001 (  1,  1) [000000] -----------        t0 = LCL_VAR  byref  V00 arg0 u:1 (last use) $80
                                                /--* t0 byref
N002 (  7,  5) [000001] ---XG------        t1 =    * IND  simd16
N003 (  1,  1) [000002] -----------        t2 = LCL_VAR  byref  V01 arg1 u:1 (last use) $81
                                                /--* t2 byref
N004 (  7,  5) [000003] -c-XG------        t3 =    * IND  simd16
                                                /--* t1 simd16
                                                +--* t3 simd16
N005 ( 15, 11) [000004] ---XG------             * HWINTRINSIC void int PTEST <l:$243, c:$242>
```
Since TestZ's associated lowered instruction `ptest` does not have an EVEX form, we restrict its operands (which are both float variables that are contained) to the lower SIMD registers. I believe this creates an issue, though, since if we specify constraints on `t3 = IND` they also propagate to its child node `t2 = LCL_VAR byref ...`, but the nodes have different variable types. Previously, `RBM_NONE` would have been passed down, which would let `newRefPosition` select a set of registers based on the ref position's type.
Does this make sense? Essentially, `RBM_LOWSIMD` was a version of `RBM_NONE` created to get around this issue.
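(Putting that together, a hedged sketch of how the `RBM_LOWSIMD` sentinel behaves inside `LinearScan::newRefPosition`, extrapolated from the lsrabuild.cpp diff quoted earlier; the `else` branch is an assumption about the fallback:)

```cpp
// Sketch: RBM_LOWSIMD acts like RBM_NONE except for float/SIMD intervals,
// which get pinned to the low 16 SIMD registers (no EVEX form available).
if (mask == RBM_LOWSIMD)
{
    if (varTypeIsFloating(theInterval->registerType) ||
        varTypeIsSIMD(theInterval->registerType))
    {
        mask = lowSIMDRegs(); // XMM0-XMM15 only
    }
    else
    {
        mask = RBM_NONE; // assumed fallback: default selection by interval type
    }
}
```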
Hmm, I thought you were planning to remove it. Did you forget to push?
@dotnet/avx512-contrib can I get a rerun of the avx512 stress pipelines?
Requeued the failing jobs.
Just worth noting that between November and now we've gotten multiple TP improvements in, including some that were ... There are likely more TP improvements to be had for this code of course, but it's potentially something we can iterate on separately.
I agree.
So are we ok with the changes now?
Can you remind me why there is any TP impact on x86? [Edit] I suppose that could be due to the ...
Yes, this is what I think.
It looks like on Linux (and Mac?) we'll need additional updates in src\coreclr\pal\src\thread\context.cpp, specifically ... But maybe that's not causing problems in this change, and can be deferred to a separate change?
Convert from macros to accessor functions for RBM_ALLFLOAT, RBM_FLT_CALLEE_TRASH, CNT_CALLEE_TRASH_FLOAT. Convert LSRA use of ACTUAL_REG_COUNT to AVAILABLE_REG_COUNT, and create an accessor for that value for AMD64 as well.
The macros defining the AVX512-specific masks and counts were bothering me, so I wrote #81818 to implement them as accessor functions in the various classes that need them. I think it looks better this way. Comments? [Edit] I created a PR against this overall change if you decide to take it: anthonycanino#6
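(A minimal sketch of the macro-to-accessor shape described in the commit message above, assuming the mask lives on the Compiler instance; the exact form in #81818 may differ, and the typedef here is illustrative:)

```cpp
using regMaskTP = unsigned long long; // widened 64-bit xarch mask type

// Sketch: the mask is a per-compilation member initialized once, and
// callers go through an inline accessor instead of a bare macro.
class Compiler
{
    regMaskTP rbmAllFloat = 0;

public:
    regMaskTP get_RBM_ALLFLOAT() const
    {
        return rbmAllFloat;
    }
};
```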
Register allocator portion looks good to me modulo the changes suggested by @BruceForstall in #81818.
```
static const regNumber lsraRegOrder[] = {REG_VAR_ORDER};
const unsigned lsraRegOrderSize = ArrLen(lsraRegOrder);
// TODO-XARCH-AVX512 we might want to move this to be configured with the rbm variables too
```
Should this happen in this PR?
Just noting: will do in a follow-up PR.
Yes, I will pull it in.
Merged …egs-with-reg-accessors: Use inline accessor functions instead of macros
No diffs, TP regressions as before. Failures look like infra or unrelated.
LGTM. Thanks for your patience and efforts!
Thanks all!
PR enables the AVX512 additional 16 SIMD registers for use with the register allocator and codegen emitter. The PR largely consists of the following:

1. Extends the xarch `regMask` type from `uint32` to `uint64` to make room for the additional registers.
2. Adds `MM16`-`MM31` to the register definitions (`targetamd.h`).
3. Converts the `RBM` macros to refer to globally defined variables (float register availability, caller/callee save register sets, etc.) that can be initialized at runtime (so as to add the upper SIMD registers to the definitions if AVX512 is available). Previously, the bank of registers was entirely determined by the architecture, whereas it is now determined by arch and ISA availability.
4. Adds `JitStressRegs=4`, which forces an upper SIMD register to be selected during register allocation.
5. Uses the `instrDesc` struct instead of a pure `instruction` to determine when an EVEX prefix is needed. Previously, the instruction itself was enough to determine if SSE/VEX is needed; now, we require more information about how the instruction is used, e.g., does it require an upper SIMD register.

Of all the points above, (4) requires a little more elaboration. When the LSRA constraints are built up for a node, typically `RBM_NONE` is passed as the constraint mask for a ref interval, which eventually leads to the LSRA selecting the set of available registers for the ref interval type. For nodes whose lowering does not have an available EVEX form (see the commit message), one would like to set the constraint mask to `RBM_LOWFLOAT` or an equivalent set of registers; however, this came into conflict with the way contained nodes are handled during lowering (and comments even state that the constraint mask feature will not work if contained nodes do not have the same type as the containing node).

To mitigate a lot of rewriting, I created a special `RBM_LOWSIMD` tag in the `regMask` type which functions much like `RBM_NONE`, except that during the creation of a new ref position (`LinearScan::newRefPosition`) it will select the low SIMD registers if the interval is a floating/simd type; otherwise it will default to `RBM_NONE` functionality.
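(For point (5), a hedged illustration of why the `instrDesc` is needed: only the descriptor knows which concrete registers an instruction uses. The helper and accessor names below are assumptions for the sketch, not necessarily this PR's exact code:)

```cpp
// Sketch: the EVEX decision consults the instrDesc (how the instruction is
// used) rather than the bare instruction opcode, because the opcode alone
// cannot tell whether an operand lives in the upper bank XMM16-XMM31.
bool TakesEvexPrefix(const instrDesc* id)
{
    // IsExtendedReg: assumed predicate for "register is XMM16 or higher".
    return IsExtendedReg(id->idReg1()) || IsExtendedReg(id->idReg2());
}
```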