Cranelift: add support for cold blocks. #3698

cfallin · 2022-01-19T01:33:49Z

This PR adds a per-block flag, settable via the frontend/builder
interface, that indicates that the block will not be frequently
executed. The compiler backend should therefore place the block "out of
line" in the final machine code, so that the ordinary, more frequent
execution path that excludes the block does not have to jump around it.

This is useful for adding handlers for exceptional conditions
(slow-paths, guard violations) in a way that minimizes performance cost.
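
To make the intended layout concrete, here is a minimal sketch in plain Rust. It does not use Cranelift's API: `ToyBlock` and `layout_order` are hypothetical names invented for illustration, and the model only captures the idea that cold blocks keep their relative order but move to the end of the function.

```rust
/// Hypothetical toy model of a basic block; not one of Cranelift's real types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToyBlock {
    pub name: &'static str,
    /// Set by the (imagined) frontend when the block is rarely executed.
    pub cold: bool,
}

/// Returns block names in final layout order: hot blocks first, cold blocks
/// last. The relative order inside each group is preserved.
pub fn layout_order(blocks: &[ToyBlock]) -> Vec<&'static str> {
    // `partition` puts elements for which the predicate is true in the
    // first collection, so the cold blocks come out first here.
    let (cold, hot): (Vec<&ToyBlock>, Vec<&ToyBlock>) =
        blocks.iter().partition(|b| b.cold);
    hot.into_iter().chain(cold).map(|b| b.name).collect()
}
```

With blocks `entry`, `trap_handler` (cold), `fast_path`, `exit`, the hot path becomes contiguous and falls through without jumping over the handler.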

Fixes #2747.

bjorn3 · 2022-01-19T14:18:48Z

Can you add "Fixes #2747" to the PR description?

fitzgen

LGTM!

I do have a couple nitpicks/suggestions below. I hope it can be forgiven since this is a public-facing API.

cranelift/codegen/src/ir/layout.rs

cranelift/frontend/src/frontend.rs


bjorn3 · 2022-01-20T14:18:47Z

For the simple-raytracer benchmark of cg_clif the perf effect of using cold blocks seems to be within noise.

Description of the follow-up fix, #3709 (Cranelift: Fix cold-blocks-related lowering bug.):

If a block is marked cold but has side-effect-free code that is only used by side-effectful code in non-cold blocks, we will erroneously fail to emit it, causing a regalloc failure.

This is due to the interaction of block ordering and lowering: we rely on block ordering to visit uses before defs (except for backedges) so that we can effectively do an inline liveness analysis and skip lowering operations that are not used anywhere. This "inline DCE" is needed because instruction lowering can pattern-match and merge one instruction into another, removing the need to generate the source instruction.

Unfortunately, the way that I added cold-block support in bytecodealliance#3698 was oblivious to this -- it just changed the block sort order. For efficiency reasons, we generate code in its final order directly, so it would not be tenable to generate it in e.g. RPO first and then reorder cold blocks to the bottom; we really do want to visit in the same order as the final code.

This PR fixes the bug by moving the point at which cold blocks are sunk to emission-time instead. This is cheaper than either trying to visit blocks during lowering in RPO but add to VCode out-of-order, or trying to do some expensive analysis to recover proper liveness. It's not clear that the latter would be possible anyway -- the need to lower some instructions depends on other instructions' isel results/merging success, so we really do need to visit in RPO, and we can't simply lower all instructions as side-effecting roots (some can't be toplevel nodes).

The one downside of this approach is that the VCode itself still has cold blocks inline; so in the text format (and hence compile-tests) it's not possible to see the sinking. This PR adds a test for cold-block sinking that actually verifies the machine code. (The test also includes an add-instruction in the cold path that would have been incorrectly skipped prior to this fix.)

Fortunately this bug would not have been triggered by the one current use of cold blocks in bytecodealliance#3699, because there the only operation in the cold block was an (always effectful) call instruction. The worst-case effect of the bug in other code would be a regalloc panic; no silent miscompilations could result.

Depends on bytecodealliance#3708.
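
The interaction described above can be reproduced in a small toy model. This is only a sketch under stated assumptions: `ToyInst`, `ToyBlock`, `lower`, and `sink_cold` are hypothetical names, values are plain integers, and "lowering" just records which defining instructions would be emitted. It is not Cranelift's lowering code.

```rust
use std::collections::HashSet;

/// Hypothetical toy instruction: defines at most one value and uses some.
#[derive(Clone)]
pub struct ToyInst {
    pub def: Option<u32>,
    pub uses: Vec<u32>,
    pub side_effect: bool,
}

/// Hypothetical toy block with a cold flag.
#[derive(Clone)]
pub struct ToyBlock {
    pub cold: bool,
    pub insts: Vec<ToyInst>,
}

/// Toy "inline DCE": walk blocks in reverse of `order` so that uses are
/// seen before defs, and lower a pure instruction only if its result has
/// already been marked used. Returns the values whose defs were lowered.
pub fn lower(blocks: &[ToyBlock], order: &[usize]) -> HashSet<u32> {
    let mut used = HashSet::new();
    let mut lowered = HashSet::new();
    for &bi in order.iter().rev() {
        for inst in blocks[bi].insts.iter().rev() {
            let needed =
                inst.side_effect || inst.def.map_or(false, |v| used.contains(&v));
            if needed {
                used.extend(inst.uses.iter().copied());
                if let Some(v) = inst.def {
                    lowered.insert(v);
                }
            }
        }
    }
    lowered
}

/// Move cold blocks to the end of `order`, keeping relative order.
pub fn sink_cold(blocks: &[ToyBlock], order: &[usize]) -> Vec<usize> {
    let (cold, hot): (Vec<usize>, Vec<usize>) =
        order.iter().copied().partition(|&i| blocks[i].cold);
    hot.into_iter().chain(cold).collect()
}
```

In this model, take three blocks in the order `[0, 1, 2]` where block 1 is cold and holds a pure `v1 = add`, and block 2 holds a store that uses `v1`. Calling `lower` on the sunk order `[0, 2, 1]` visits the cold block before the store and skips the add, which corresponds to the missing definition described above. Calling `lower` on the original order and applying `sink_cold` only to the emission order keeps the add.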


cfallin requested a review from fitzgen January 19, 2022 01:33

cfallin mentioned this pull request Jan 19, 2022

Add epoch-based interruption for cooperative async timeslicing. #3699

Merged

github-actions bot added the cranelift (Issues related to the Cranelift code generator) and cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend.) labels Jan 19, 2022

fitzgen approved these changes Jan 19, 2022


cfallin force-pushed the cold-blocks branch from 9f9be99 to f489b83 Compare January 19, 2022 20:17

cfallin merged commit ae476fd into bytecodealliance:main Jan 19, 2022

cfallin deleted the cold-blocks branch January 19, 2022 20:58

bjorn3 mentioned this pull request Jan 20, 2022

Cranelift: Print and parse cold flag for blocks #3701

Closed

cfallin mentioned this pull request Jan 21, 2022

Cranelift: Fix cold-blocks-related lowering bug. #3709

Merged
