Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift: add support for cold blocks. #3698

Merged
merged 1 commit into from
Jan 19, 2022

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Jan 19, 2022

This PR adds a flag to each block that can be set via the frontend/builder
interface that indicates that the block will not be frequently
executed. As such, the compiler backend should place the block "out of
line" in the final machine code, so that the ordinary, more frequent
execution path that excludes the block does not have to jump around it.

This is useful for adding handlers for exceptional conditions
(slow-paths, guard violations) in a way that minimizes performance cost.

Fixes #2747.

@cfallin cfallin requested a review from fitzgen January 19, 2022 01:33
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Jan 19, 2022
@bjorn3
Copy link
Contributor

bjorn3 commented Jan 19, 2022

Can you add "Fixes #2747" to the PR description?

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I do have a couple nitpicks/suggestions below. I hope it can be forgiven since this is a public-facing API.

cranelift/codegen/src/ir/layout.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/ir/layout.rs Outdated Show resolved Hide resolved
cranelift/frontend/src/frontend.rs Outdated Show resolved Hide resolved
This PR adds a flag to each block that can be set via the frontend/builder
interface that indicates that the block will not be frequently
executed. As such, the compiler backend should place the block "out of
line" in the final machine code, so that the ordinary, more frequent
execution path that excludes the block does not have to jump around it.

This is useful for adding handlers for exceptional conditions
(slow-paths, guard violations) in a way that minimizes performance cost.

Fixes bytecodealliance#2747.
@cfallin cfallin merged commit ae476fd into bytecodealliance:main Jan 19, 2022
@cfallin cfallin deleted the cold-blocks branch January 19, 2022 20:58
@bjorn3
Copy link
Contributor

bjorn3 commented Jan 20, 2022

For the simple-raytracer benchmark of cg_clif the perf effect of using cold blocks seems to be within noise.

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 21, 2022
If a block is marked cold but has side-effect-free code that is only
used by side-effectful code in non-cold blocks, we will erroneously fail
to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely
on block ordering to visit uses before defs (except for backedges) so
that we can effectively do an inline liveness analysis and skip lowering
operations that are not used anywhere. This "inline DCE" is needed
because instruction lowering can pattern-match and merge one instruction
into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was
oblivious to this -- it just changed the block sort order. For
efficiency reasons, we generate code in its final order directly, so it
would not be tenable to generate it in e.g. RPO first and then reorder
cold blocks to the bottom; we really do want to visit in the same order
as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk
to emission-time instead. This is cheaper than either trying to visit
blocks during lowering in RPO but add to VCode out-of-order, or trying
to do some expensive analysis to recover proper liveness. It's not clear
that the latter would be possible anyway -- the need to lower some
instructions depends on other instructions' isel results/merging
success, so we really do need to visit in RPO, and we can't simply lower
all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has
cold blocks inline; so in the text format (and hence compile-tests) it's
not possible to see the sinking. This PR adds a test for cold-block
sinking that actually verifies the machine code. (The test also includes
an add-instruction in the cold path that would have been incorrectly
skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current
use of cold blocks in bytecodealliance#3699, because there the only operation in the
cold block was an (always effectful) call instruction. The worst-case
effect of the bug in other code would be a regalloc panic; no silent
miscompilations could result.

Depends on bytecodealliance#3708.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 21, 2022
If a block is marked cold but has side-effect-free code that is only
used by side-effectful code in non-cold blocks, we will erroneously fail
to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely
on block ordering to visit uses before defs (except for backedges) so
that we can effectively do an inline liveness analysis and skip lowering
operations that are not used anywhere. This "inline DCE" is needed
because instruction lowering can pattern-match and merge one instruction
into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was
oblivious to this -- it just changed the block sort order. For
efficiency reasons, we generate code in its final order directly, so it
would not be tenable to generate it in e.g. RPO first and then reorder
cold blocks to the bottom; we really do want to visit in the same order
as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk
to emission-time instead. This is cheaper than either trying to visit
blocks during lowering in RPO but add to VCode out-of-order, or trying
to do some expensive analysis to recover proper liveness. It's not clear
that the latter would be possible anyway -- the need to lower some
instructions depends on other instructions' isel results/merging
success, so we really do need to visit in RPO, and we can't simply lower
all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has
cold blocks inline; so in the text format (and hence compile-tests) it's
not possible to see the sinking. This PR adds a test for cold-block
sinking that actually verifies the machine code. (The test also includes
an add-instruction in the cold path that would have been incorrectly
skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current
use of cold blocks in bytecodealliance#3699, because there the only operation in the
cold block was an (always effectful) call instruction. The worst-case
effect of the bug in other code would be a regalloc panic; no silent
miscompilations could result.

Depends on bytecodealliance#3708.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 21, 2022
If a block is marked cold but has side-effect-free code that is only
used by side-effectful code in non-cold blocks, we will erroneously fail
to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely
on block ordering to visit uses before defs (except for backedges) so
that we can effectively do an inline liveness analysis and skip lowering
operations that are not used anywhere. This "inline DCE" is needed
because instruction lowering can pattern-match and merge one instruction
into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was
oblivious to this -- it just changed the block sort order. For
efficiency reasons, we generate code in its final order directly, so it
would not be tenable to generate it in e.g. RPO first and then reorder
cold blocks to the bottom; we really do want to visit in the same order
as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk
to emission-time instead. This is cheaper than either trying to visit
blocks during lowering in RPO but add to VCode out-of-order, or trying
to do some expensive analysis to recover proper liveness. It's not clear
that the latter would be possible anyway -- the need to lower some
instructions depends on other instructions' isel results/merging
success, so we really do need to visit in RPO, and we can't simply lower
all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has
cold blocks inline; so in the text format (and hence compile-tests) it's
not possible to see the sinking. This PR adds a test for cold-block
sinking that actually verifies the machine code. (The test also includes
an add-instruction in the cold path that would have been incorrectly
skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current
use of cold blocks in bytecodealliance#3699, because there the only operation in the
cold block was an (always effectful) call instruction. The worst-case
effect of the bug in other code would be a regalloc panic; no silent
miscompilations could result.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jan 21, 2022
If a block is marked cold but has side-effect-free code that is only
used by side-effectful code in non-cold blocks, we will erroneously fail
to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely
on block ordering to visit uses before defs (except for backedges) so
that we can effectively do an inline liveness analysis and skip lowering
operations that are not used anywhere. This "inline DCE" is needed
because instruction lowering can pattern-match and merge one instruction
into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was
oblivious to this -- it just changed the block sort order. For
efficiency reasons, we generate code in its final order directly, so it
would not be tenable to generate it in e.g. RPO first and then reorder
cold blocks to the bottom; we really do want to visit in the same order
as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk
to emission-time instead. This is cheaper than either trying to visit
blocks during lowering in RPO but add to VCode out-of-order, or trying
to do some expensive analysis to recover proper liveness. It's not clear
that the latter would be possible anyway -- the need to lower some
instructions depends on other instructions' isel results/merging
success, so we really do need to visit in RPO, and we can't simply lower
all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has
cold blocks inline; so in the text format (and hence compile-tests) it's
not possible to see the sinking. This PR adds a test for cold-block
sinking that actually verifies the machine code. (The test also includes
an add-instruction in the cold path that would have been incorrectly
skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current
use of cold blocks in bytecodealliance#3699, because there the only operation in the
cold block was an (always effectful) call instruction. The worst-case
effect of the bug in other code would be a regalloc panic; no silent
miscompilations could result.
alexcrichton pushed a commit to alexcrichton/wasmtime that referenced this pull request Feb 2, 2022
If a block is marked cold but has side-effect-free code that is only
used by side-effectful code in non-cold blocks, we will erroneously fail
to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely
on block ordering to visit uses before defs (except for backedges) so
that we can effectively do an inline liveness analysis and skip lowering
operations that are not used anywhere. This "inline DCE" is needed
because instruction lowering can pattern-match and merge one instruction
into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was
oblivious to this -- it just changed the block sort order. For
efficiency reasons, we generate code in its final order directly, so it
would not be tenable to generate it in e.g. RPO first and then reorder
cold blocks to the bottom; we really do want to visit in the same order
as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk
to emission-time instead. This is cheaper than either trying to visit
blocks during lowering in RPO but add to VCode out-of-order, or trying
to do some expensive analysis to recover proper liveness. It's not clear
that the latter would be possible anyway -- the need to lower some
instructions depends on other instructions' isel results/merging
success, so we really do need to visit in RPO, and we can't simply lower
all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has
cold blocks inline; so in the text format (and hence compile-tests) it's
not possible to see the sinking. This PR adds a test for cold-block
sinking that actually verifies the machine code. (The test also includes
an add-instruction in the cold path that would have been incorrectly
skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current
use of cold blocks in bytecodealliance#3699, because there the only operation in the
cold block was an (always effectful) call instruction. The worst-case
effect of the bug in other code would be a regalloc panic; no silent
miscompilations could result.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a way to mark blocks as cold
3 participants