
Fix DEBUG/non-DEBUG asm diffs due to instruction group size estimation #49947

Merged
merged 2 commits into dotnet:main on Mar 22, 2021

Conversation

BruceForstall
Member

This change fixes a long-standing difference between DEBUG and non-DEBUG codegen
that recently manifested as asm diffs in instruction encoding in a Checked/Release
SuperPMI run, mostly in the tests\JIT\Directed\arglist\vararg.cs test, specifically:

```
NativeVarargTest.VarArg:TestPassingManyFloats(System.Double[]):bool
NativeVarargTest.VarArg:TestPassingManyDoublesManaged(System.Double[]):bool
NativeVarargTest.VarArg:TestPassingManyDoubles(System.Double[]):bool
NativeVarargTest.VarArg:TestPassingManyFloatsManaged(System.Single[]):bool
double1:ProcessJagged3DArray(System.Double[][])
```

Example diff:
```
215c7758454: 76 6a                      jbe     106 // checked
215c77b8ab4: 0f 86 6a 00 00 00          jbe     106 // release
```

Note that the branch target offset (0x6A, i.e., 106) ends up exactly the same, but the encoding differs: the Checked build uses the short form with an 8-bit displacement, while the Release build uses the long form with a 32-bit displacement.

Some context on the emitter:

The emitter collects generated machine instructions as a set of variable-sized
`instrDesc` structures in instruction groups (IG, or `insGroup`). An IG is filled
with a variable number of instrDescs into a fixed-sized buffer, which, when full
or a new label (e.g., branch target) is encountered, is copied to dynamically
allocated, precisely sized memory for the set of instructions collected. If a BasicBlock
requires more instructions than fit in a single IG, multiple IGs are generated, and the
ones after the first are marked with the IGF_EXTEND flag and are called "emitAdd" or
overflow IGs. After codegen, all the IGs are filled, and we have an estimated
size for each instruction, and each IG has an estimated code offset from the
beginning of the function. The estimated sizes could be wrong if we incorrectly estimated
the size of an instruction (which in most cases could be considered a bug, but is currently
particularly common with our Intel vector instructions) or the offset required for a branch
(especially a forward branch). The JIT does only a single pass of branch tightening, so there
can be cases of non-optimal estimates. Note that estimates may only err on the large side;
it is fatal for an estimate to be too small. During emission, we walk the IG list and
instructions, encoding them into the memory buffer provided by the VM. At this point, we
might shrink an instruction encoding, for example because we can use a smaller branch
encoding than expected.

Notably, the size of instrDescs and insGroups varies between DEBUG and non-DEBUG builds.
That means that the number of instrDescs that fit in an IG differs. This should have
no effect on generated code, as long as the code properly handles overflow IGs and doesn't
affect codegen based on IG size or number of contained instructions.
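
As a rough sketch of the layout described above (field and type names here are illustrative assumptions, not the JIT's actual `insGroup`/`instrDesc` definitions):

```
// Minimal sketch of the emitter layout described above; names are illustrative.
struct instrDescSketch
{
    unsigned estCodeSize; // estimated encoding size in bytes (never an under-estimate)
#ifdef DEBUG
    // DEBUG-only bookkeeping would live here, which is why the struct size (and
    // therefore the number of instrDescs that fit in one IG buffer) differs
    // between DEBUG and non-DEBUG builds.
#endif
};

struct insGroupSketch
{
    insGroupSketch* next;     // IGs form a list that is walked during emission
    unsigned        estOffs;  // estimated code offset of this group from the function start
    unsigned        estSize;  // sum of the estimated sizes of its instructions
    unsigned short  insCount; // number of instrDescs copied into this group
    bool            isExtend; // overflow ("emitAdd") group continuing the previous BasicBlock
};
```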

The problem:

During emission, we calculate an "offset adjustment" (in `emitOffsAdj`) that is used to
estimate the distance from a forward jump instruction to its target. This determines the
size of jump instruction encodings that we emit. For example, on x86/x64, a relative
conditional branch can have a 1-byte or a 4-byte target offset. A 1-byte offset can be
used if the target instruction is 127 bytes or fewer away. So, we estimate the distance
using the jump instruction's known offset and the estimated offset of the target, which
is the estimated offset of the instruction group (the label) calculated before emission,
minus the "shrinkage" of instruction sizes that we have seen so far. This "shrinkage" is
due to over-estimation of instruction encoding sizes and estimated branch sizes.
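
In other words, the short-vs-long decision for a forward branch is driven by an estimated distance computed roughly as follows (a sketch of the logic described above with made-up names, not the actual emitter code):

```
// Sketch: choosing between a 1-byte (short) and a 4-byte (long) target offset
// for a forward conditional branch on x86/x64. Names are illustrative.
static bool canUseShortForwardBranch(unsigned srcOffs,       // precise: bytes actually emitted so far
                                     unsigned targetEstOffs, // pre-emission estimate of the label's offset
                                     unsigned offsAdj)       // total shrinkage accounted for so far
{
    unsigned estTargetOffs = targetEstOffs - offsAdj; // pull the estimate back by the known shrinkage
    unsigned estDistance   = estTargetOffs - srcOffs; // estimated forward distance to the target

    // A 1-byte displacement reaches a target up to 127 bytes ahead. If offsAdj is
    // stale (too small), estDistance is over-stated, and we may emit the 4-byte
    // form even though the 1-byte form would have sufficed.
    return estDistance <= 127;
}
```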

The shrinkage in `emitOffsAdj` is computed only at the beginning of an IG. As additional
instructions in an IG shrink, we don't take that into account when we reach a forward jump
instruction and try to determine if a "small" branch encoding can be used. Thus, it's
possible for a branch to be estimated to require a large 4-byte offset instead of a 1-byte
offset because the forward branch is barely over the limit, but wouldn't be if we considered
the already-shrunk instructions in the current IG. In fact, the more instructions that
shrink in an IG, the more the estimated branch distance is over-stated, because the source
instruction address is precise (we know exactly how many bytes of instructions we have
emitted), while the target offset estimate stays fixed.

Because an IG holds a different number of instructions in DEBUG and non-DEBUG builds, the
target offset estimate we use for any particular branch can differ between the two flavors,
given enough instruction shrinkage and enough overflow IGs (such that the number of IGs
differs between DEBUG and non-DEBUG). This leads to different instruction encodings between
the flavors.

The solution:

The solution is simple: track instruction shrinkage as we emit instructions, and use the
latest, most accurate shrinkage adjustment when estimating branches. This gives no difference
between DEBUG and non-DEBUG because it removes the indirect dependence on IG instruction
count in recomputing `emitOffsAdj`.
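
Conceptually, the fix keeps that adjustment current after every emitted instruction instead of only at IG boundaries; something along these lines (a sketch under assumed names, not the actual change):

```
// Sketch: fold each instruction's shrinkage into the running adjustment as soon
// as the instruction is encoded, so subsequent forward-branch estimates see it.
static unsigned updateOffsAdj(unsigned offsAdj, unsigned estSize, unsigned actualSize)
{
    // Estimates are never smaller than the actual encoding, so this never underflows.
    return offsAdj + (estSize - actualSize);
}
```

With the adjustment maintained per instruction, the short-vs-long decision no longer depends on where IG boundaries happen to fall, which is exactly what differed between DEBUG and non-DEBUG builds.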

There are asm diffs (comparing just a Checked x64 JIT against the baseline Checked JIT) because
some branches are now SHORT where previously they were LONG.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 20, 2021
@BruceForstall
Member Author

/azp run runtime-coreclr outerloop

@BruceForstall
Member Author

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Member Author

@kunalspathak @dotnet/jit-contrib PTAL

@kunalspathak
Member

Possibly, it will also fix #8748?

@BruceForstall
Member Author

JitStress failures are #49954

@BruceForstall
Member Author

> Possibly, it will also fix #8748?

Yes, it will. (I looked for this issue earlier and couldn't find it.)

Of course, it won't fix the mis-estimated instruction sizes. Is there a separate issue for that?

Member

@AndyAyersMS AndyAyersMS left a comment


LGTM.

```
    printf("Block predicted offs = %08X, actual = %08X -> size adj = %d\n", ig->igOffs, emitCurCodeOffs(cp),
           newOffsAdj);
}
if (emitOffsAdj != newOffsAdj)
```
Member


Should this be an assert?

Member Author


There's a `noway_assert` below for the fatal condition. We could make the non-fatal condition an assert, to ensure we notice it.

@BruceForstall
Member Author

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall BruceForstall merged commit 7012cd7 into dotnet:main Mar 22, 2021
@BruceForstall BruceForstall deleted the FixDebugReleaseDiffs2 branch March 22, 2021 17:55
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Mar 23, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 22, 2021