
Fix DEBUG/non-DEBUG asm diffs due to instruction group size estimation #49947

Merged
merged 2 commits into dotnet:main on Mar 22, 2021

Conversation

BruceForstall
Member

This change fixes a long-standing difference between DEBUG and non-DEBUG codegen
that recently manifested as asm diffs in instruction encoding in a Checked/Release
SuperPMI run, mostly in the tests\JIT\Directed\arglist\vararg.cs test, specifically:

```
NativeVarargTest.VarArg:TestPassingManyFloats(System.Double[]):bool
NativeVarargTest.VarArg:TestPassingManyDoublesManaged(System.Double[]):bool
NativeVarargTest.VarArg:TestPassingManyDoubles(System.Double[]):bool
NativeVarargTest.VarArg:TestPassingManyFloatsManaged(System.Single[]):bool
double1:ProcessJagged3DArray(System.Double[][])
```

Example diff:
```
215c7758454: 76 6a                      jbe     106 // checked
215c77b8ab4: 0f 86 6a 00 00 00          jbe     106 // release
```

Note that the branch target offset (0x6A, i.e., 106) ends up exactly the same, but the encoding differs: the Checked build uses the short form with an 8-bit displacement, while the Release build uses the long form with a 32-bit displacement.

Some context on the emitter:

The emitter collects generated machine instructions as a set of variable-sized
`instrDesc` structures in instruction groups (IG, or `insGroup`). An IG is filled
with a variable number of instrDescs into a fixed-sized buffer, which, when full
or a new label (e.g., branch target) is encountered, is copied to dynamically
allocated, precisely sized memory for the set of instructions collected. If a BasicBlock
requires more instructions than fit in a single IG, multiple IGs are generated, and the
ones after the first are marked with the IGF_EXTEND flag and are called "emitAdd" or
overflow IGs. After codegen, all the IGs are filled, and we have an estimated
size for each instruction, and each IG has an estimated code offset from the
beginning of the function. The estimated sizes could be wrong if we incorrectly estimated
the size of an instruction (which in most cases could be considered a bug, but is currently
particularly common with our Intel vector instructions) or the offset required for a branch
(especially a forward branch). The JIT does only a single pass of branch tightening, so there
can be cases of non-optimal estimates. Note that estimates may only err on the large side;
it is fatal for an estimate to be too small. During emission, we walk the IG list and
instructions, encoding them into the memory buffer provided by the VM. At this point, we
might shrink an instruction encoding, for example because we can use a smaller branch
encoding than expected.

Notably, the size of instrDescs and insGroups varies between DEBUG and non-DEBUG builds.
That means that the number of instrDescs that fit in an IG differs. This should have
no effect on generated code, as long as the code properly handles overflow IGs and doesn't
affect codegen based on IG size or number of contained instructions.
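
As a rough sketch of the layout described above (field and type names here are illustrative assumptions, not the JIT's actual `insGroup`/`instrDesc` definitions):

```
// Minimal sketch of the emitter layout described above; names are illustrative.
struct instrDescSketch
{
    unsigned estCodeSize; // estimated encoding size in bytes (never an under-estimate)
#ifdef DEBUG
    // DEBUG-only bookkeeping would live here, which is why the struct size (and
    // therefore the number of instrDescs that fit in one IG buffer) differs
    // between DEBUG and non-DEBUG builds.
#endif
};

struct insGroupSketch
{
    insGroupSketch* next;     // IGs form a list that is walked during emission
    unsigned        estOffs;  // estimated code offset of this group from the function start
    unsigned        estSize;  // sum of the estimated sizes of its instructions
    unsigned short  insCount; // number of instrDescs copied into this group
    bool            isExtend; // overflow ("emitAdd") group continuing the previous BasicBlock
};
```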

The problem:

During emission, we calculate an "offset adjustment" (in `emitOffsAdj`) that is used to
estimate the distance from a forward jump instruction to its target. This determines the
size of jump instruction encodings that we emit. For example, on x86/x64, a relative
conditional branch can have a 1-byte or a 4-byte target offset. A 1-byte offset can be
used if the target instruction is 127 bytes or fewer away. So, we estimate the distance
using the jump instruction's known offset and the estimated offset of the target, which
is the estimated offset of the instruction group (the label) calculated before emission,
minus the "shrinkage" of instruction sizes that we have seen so far. This "shrinkage" is
due to over-estimation of instruction encoding sizes and estimated branch sizes.
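
In other words, the short-vs-long decision for a forward branch is driven by an estimated distance computed roughly as follows (a sketch of the logic described above with made-up names, not the actual emitter code):

```
// Sketch: choosing between a 1-byte (short) and a 4-byte (long) target offset
// for a forward conditional branch on x86/x64. Names are illustrative.
static bool canUseShortForwardBranch(unsigned srcOffs,       // precise: bytes actually emitted so far
                                     unsigned targetEstOffs, // pre-emission estimate of the label's offset
                                     unsigned offsAdj)       // total shrinkage accounted for so far
{
    unsigned estTargetOffs = targetEstOffs - offsAdj; // pull the estimate back by the known shrinkage
    unsigned estDistance   = estTargetOffs - srcOffs; // estimated forward distance to the target

    // A 1-byte displacement reaches a target up to 127 bytes ahead. If offsAdj is
    // stale (too small), estDistance is over-stated, and we may emit the 4-byte
    // form even though the 1-byte form would have sufficed.
    return estDistance <= 127;
}
```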

The shrinkage in `emitOffsAdj` is computed only at the beginning of an IG. As additional
instructions in an IG shrink, we don't take that into account when we reach a forward jump
instruction and try to determine if a "small" branch encoding can be used. Thus, it's
possible for a branch to be estimated to require a large 4-byte offset instead of a 1-byte
offset because the forward branch is barely over the limit, but wouldn't be if we considered
the already-shrunk instructions in the current IG. In fact, the more instructions that
shrink in an IG, the more the estimated branch distance is over-stated, because the source
instruction address is precise (we know exactly how many bytes of instructions we have
emitted), while the target offset estimate stays fixed.

Because an IG holds a different number of instructions in DEBUG and non-DEBUG builds, the
target offset estimate we use for any particular branch can differ between the two flavors,
given enough instruction shrinkage and enough overflow IGs (such that the number of IGs
differs between DEBUG and non-DEBUG). This leads to different instruction encodings between
the flavors.

The solution:

The solution is simple: track instruction shrinkage as we emit instructions, and use the
latest, most accurate shrinkage adjustment when estimating branches. This gives no difference
between DEBUG and non-DEBUG because it removes the indirect dependence on IG instruction
count in recomputing `emitOffsAdj`.
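
Conceptually, the fix keeps that adjustment current after every emitted instruction instead of only at IG boundaries; something along these lines (a sketch under assumed names, not the actual change):

```
// Sketch: fold each instruction's shrinkage into the running adjustment as soon
// as the instruction is encoded, so subsequent forward-branch estimates see it.
static unsigned updateOffsAdj(unsigned offsAdj, unsigned estSize, unsigned actualSize)
{
    // Estimates are never smaller than the actual encoding, so this never underflows.
    return offsAdj + (estSize - actualSize);
}
```

With the adjustment maintained per instruction, the short-vs-long decision no longer depends on where IG boundaries happen to fall, which is exactly what differed between DEBUG and non-DEBUG builds.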

There are asm diffs (comparing just a Checked x64 JIT against the baseline Checked JIT) because
some branches are now SHORT where previously they were LONG.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 20, 2021
@BruceForstall
Member Author

/azp run runtime-coreclr outerloop

@BruceForstall
Member Author

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Member Author

@kunalspathak @dotnet/jit-contrib PTAL

@kunalspathak
Member

Possibly, it will also fix #8748?

@BruceForstall
Member Author

JitStress failures are #49954

@BruceForstall
Member Author

> Possibly, it will also fix #8748?

Yes, it will. (I looked for this issue earlier and couldn't find it.)

Of course, it won't fix the mis-estimated instruction sizes. Is there a separate issue for that?

Member

@AndyAyersMS AndyAyersMS left a comment


LGTM.

```
    printf("Block predicted offs = %08X, actual = %08X -> size adj = %d\n", ig->igOffs, emitCurCodeOffs(cp),
           newOffsAdj);
}
if (emitOffsAdj != newOffsAdj)
```
Member


Should this be an assert?

Member Author


There's a `noway_assert` below for the fatal condition. We could make the non-fatal condition an assert, to ensure we notice it.

@BruceForstall
Member Author

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall BruceForstall merged commit 7012cd7 into dotnet:main Mar 22, 2021
@BruceForstall BruceForstall deleted the FixDebugReleaseDiffs2 branch March 22, 2021 17:55
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Mar 23, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 22, 2021