Improve performance by binning together opaque items instead of sorting them. #12453
Conversation
Today, we sort all entities added to all phases, even the phases that don't strictly need sorting, such as the opaque and shadow phases. This results in a performance loss because our `PhaseItem`s are rather large in memory, so sorting is slow. Additionally, determining the boundaries of batches is an O(n) process.

This commit makes Bevy instead place phase items into *bins* keyed by *bin keys*, which have the invariant that everything in the same bin is potentially batchable. This makes determining batch boundaries O(1), because everything in the same bin can be batched. Instead of sorting each entity, we now sort only the bin keys. This drops the sorting time to near zero on workloads with few bins, like `many_cubes --no-frustum-culling`. Memory usage is improved too, with batch boundaries and dynamic indices now implicit instead of explicit. The improved memory usage results in a significant win even on unbatchable workloads like `many_cubes --no-frustum-culling --vary-material-data-per-instance`, presumably due to cache effects.

Not all phases can be binned; some, such as the transparent and transmissive phases, must still be sorted. To handle this, this commit splits `PhaseItem` into `BinnedPhaseItem` and `SortedPhaseItem`. Most of the logic that today deals with `PhaseItem`s has been moved to `SortedPhaseItem`; `BinnedPhaseItem` has the new logic.

FPS results are as follows:

| Benchmark                | `binning` | `main`  | Speedup |
| ------------------------ | --------- | ------- | ------- |
| `many_cubes -nfc -vmdpi` | 4.608     | 3.373   | 36.61%  |
| `many_cubes -nfc`        | 42.161    | 35.437  | 18.97%  |
| `many_foxes`             | 308.943   | 303.321 | 1.85%   |

(`-nfc` is short for `--no-frustum-culling`; `-vmdpi` is short for `--vary-material-data-per-instance`.)
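The binning idea above can be sketched as follows. This is a minimal illustration, not Bevy's actual `BinnedPhaseItem` API: the `BinKey` fields and the entity representation are assumptions made for the example.

```rust
use std::collections::BTreeMap;

// A bin key: any two items with equal keys are batchable together.
// The fields here (pipeline and mesh identifiers) are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct BinKey {
    pipeline: u32,
    mesh: u32,
}

#[derive(Default)]
struct BinnedPhase {
    // A BTreeMap keeps its keys sorted, so "sort only the bin keys" is
    // implicit here; Bevy keeps a separate sorted key list, but the effect
    // is the same. The per-entity items themselves are never sorted.
    bins: BTreeMap<BinKey, Vec<u32 /* entity id */>>,
}

impl BinnedPhase {
    fn add(&mut self, key: BinKey, entity: u32) {
        self.bins.entry(key).or_default().push(entity);
    }

    // Batch boundaries are O(1) per batch: each bin *is* a batch, so no
    // scan over the items is needed to find where one batch ends.
    fn batches(&self) -> Vec<(BinKey, usize)> {
        self.bins.iter().map(|(k, v)| (*k, v.len())).collect()
    }
}

fn main() {
    let mut phase = BinnedPhase::default();
    phase.add(BinKey { pipeline: 0, mesh: 7 }, 1);
    phase.add(BinKey { pipeline: 0, mesh: 7 }, 2); // same bin: batchable
    phase.add(BinKey { pipeline: 1, mesh: 3 }, 3);
    let batches = phase.batches();
    assert_eq!(batches.len(), 2);
    assert_eq!(batches[0], (BinKey { pipeline: 0, mesh: 7 }, 2));
    println!("{batches:?}");
}
```

Note how adding the third item with a different key simply opens a second bin; no O(n) boundary detection is ever performed.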
```rust
{
    fn default() -> Self {
        Self {
            batchable_keys: default(),
```
I'd try to use `const` values where possible, since `Default` can't be `const`.
I'm not sure what this means.
`Vec::new()` and the other equivalent constructors can be `const`, but `Default::default` needs to at least be inlined in some way. It may be preferable to use `const` construction where possible.
meshes, 3D meshes, and lights, for performance. This commit splits `VisibleEntities::entities` into three separate lists: one for lights, one for 2D meshes, and one for 3D meshes. This allows `queue_material_meshes` and similar methods to avoid examining entities that are obviously irrelevant. In particular, this separation helps scenes with many skinned meshes, as the individual bones are considered visible entities but have no rendered appearance.

Internally, `VisibleEntities::entities` is a `HashMap` from the `TypeId` representing a `QueryFilter` to the appropriate `Entity` list. I had to do this because `VisibleEntities` is located in a crate upstream of the crates that provide lights (`bevy_pbr`) and 2D meshes (`bevy_sprite`). As an added benefit, this setup allows apps to provide their own types of renderable components, by simply adding a specialized `check_visibility` to the schedule.

Currently, this provides minimal performance benefit in `many_foxes`, but on top of binning (bevyengine#12453) it provides a 20% end-to-end speedup on that benchmark.

* `check_visibility` and `VisibleEntities` now store the three types of renderable entities (2D meshes, 3D meshes, and lights) separately. If your custom rendering code examines `VisibleEntities`, it will now need to specify which type of entity it's interested in, using the `WithMesh2d`, `WithMesh`, and `WithLight` types respectively. If your app introduces a new type of renderable entity, you'll need to add an explicit call to `check_visibility` to the schedule to accommodate your new component or components.
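The `TypeId`-keyed storage described above can be sketched as follows. This is a self-contained illustration, not Bevy's actual `VisibleEntities` implementation: the `Entity` type, the marker structs, and the method names are assumptions for the example.

```rust
use std::any::TypeId;
use std::collections::HashMap;

// Stand-in for Bevy's `Entity`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Entity(u64);

// Marker types standing in for the query filters named in the PR.
struct WithMesh;
struct WithMesh2d;
struct WithLight;

#[derive(Default)]
struct VisibleEntities {
    // One list of visible entities per renderable class, keyed by the
    // `TypeId` of the query filter that produced it. Keying by `TypeId`
    // lets downstream crates (and apps) register new classes without the
    // upstream crate knowing about them.
    entities: HashMap<TypeId, Vec<Entity>>,
}

impl VisibleEntities {
    fn push<QF: 'static>(&mut self, entity: Entity) {
        self.entities.entry(TypeId::of::<QF>()).or_default().push(entity);
    }

    fn get<QF: 'static>(&self) -> &[Entity] {
        self.entities
            .get(&TypeId::of::<QF>())
            .map(|v| v.as_slice())
            .unwrap_or(&[])
    }
}

fn main() {
    let mut visible = VisibleEntities::default();
    visible.push::<WithMesh>(Entity(1));
    visible.push::<WithLight>(Entity(2));
    // Queueing systems only examine the class they care about, so the
    // 3D-mesh queue never iterates over lights or bones.
    assert_eq!(visible.get::<WithMesh>(), &[Entity(1)]);
    assert!(visible.get::<WithMesh2d>().is_empty());
    println!("ok");
}
```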
Addressed review comments.
This commit introduces a new component, `GpuCulling`, which, when present on a camera, skips the CPU visibility check in favor of doing the frustum culling on the GPU. This trades off potentially increased CPU work and draw calls in favor of cheaper culling, and doesn't improve the performance of any workloads that I know of today. However, it opens the door to significant optimizations in the future by taking the necessary first step toward *GPU-driven rendering*.

Enabling GPU culling for a view puts the rendering for that view into *indirect mode*. In indirect mode, CPU-level visibility checks are skipped, and all entities are considered potentially visible. Bevy's batching logic still runs as usual, but it doesn't directly generate mesh instance indices. Instead, it generates *instance handles*, which are indices into an array of real instance indices. Before any rendering is done, for each view, a compute shader, `cull.wgsl`, maps instance handles to instance indices, discarding any instance handles that represent meshes outside the visible frustum. Draws are then done using the *indirect draw* feature of `wgpu`, which instructs the GPU to read the number of actual instances from the output of that compute shader. Essentially, GPU culling works by adding a new level of indirection between the CPU's notion of instances (instance handles) and the GPU's notion of instances.

A new `--gpu-culling` flag has been added to the `many_foxes`, `many_cubes`, and `3d_shapes` examples.

Potential follow-ups include:

* Split up `RenderMeshInstances` into CPU-driven and GPU-driven parts. The former, which contains fields like the transform, won't be initialized at all when GPU culling is enabled. Instead, the transform will be written directly to the GPU in `extract_meshes`, as `extract_skins` does for joint matrices.
* Implement GPU culling for shadow maps.
  - Following that, we can treat all cascades as one as far as the CPU is concerned, simply replaying the final draw commands with different view uniforms, which should reduce the CPU overhead considerably.
* Retain bins from frame to frame so that they don't have to be rebuilt. This is a longer-term project that will build on top of bevyengine#12453 and several of the tasks in bevyengine#12590, such as main-world pipeline specialization.
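The handle-to-index compaction that the PR's `cull.wgsl` compute shader performs on the GPU can be sketched on the CPU like this. The function and type names are illustrative assumptions, not the actual shader or Bevy API; the point is the indirection: instance handles index into a compacted array of real instance indices, and culled instances simply never appear in the output that the indirect draw consumes.

```rust
struct Sphere {
    center: [f32; 3],
    radius: f32,
}

// A frustum plane in the form normal . p + d >= 0 for points inside.
struct Plane {
    normal: [f32; 3],
    d: f32,
}

fn sphere_inside_plane(s: &Sphere, p: &Plane) -> bool {
    let dot = s.center[0] * p.normal[0]
        + s.center[1] * p.normal[1]
        + s.center[2] * p.normal[2];
    // The sphere survives if any part of it is on the inside of the plane.
    dot + p.d + s.radius >= 0.0
}

/// `handles` are the CPU's instance handles; `bounds[i]` is the bounding
/// sphere of the mesh instance with index `i`. Returns the compacted list
/// of real instance indices; an indirect draw then reads its length as the
/// actual instance count.
fn cull(handles: &[u32], bounds: &[Sphere], frustum: &[Plane]) -> Vec<u32> {
    handles
        .iter()
        .copied()
        .filter(|&h| {
            frustum
                .iter()
                .all(|pl| sphere_inside_plane(&bounds[h as usize], pl))
        })
        .collect()
}

fn main() {
    let bounds = [
        Sphere { center: [0.0, 0.0, 0.0], radius: 1.0 },   // inside
        Sphere { center: [100.0, 0.0, 0.0], radius: 1.0 }, // far outside
    ];
    // A single clipping plane x <= 1: normal (-1, 0, 0), d = 1.
    let frustum = [Plane { normal: [-1.0, 0.0, 0.0], d: 1.0 }];
    let visible = cull(&[0, 1], &bounds, &frustum);
    assert_eq!(visible, vec![0]);
    println!("{visible:?}");
}
```

On the GPU this filter runs as one compute invocation per handle, writing survivors through an atomic counter rather than collecting into a `Vec`, but the data flow is the same.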
Aside from those two small comments, still looks good to me. And I’m assuming it appears to work as intended to the best of our knowledge. I haven’t had time to test. Thanks to @IceSentry for spotting a shadow bug. :)
Review comments addressed. I updated the benchmarks with the new numbers. The differences from the previous revision are in the noise.
This "broke" deterministic rendering (http://dev-docs.bevyengine.org/bevy/render/deterministic/struct.DeterministicRenderingConfig.html). There is no z-fighting with or without it, but it doesn't render the same colors between runs. See example.
# Objective

- Since #12453, `DeterministicRenderingConfig` doesn't do anything

## Solution

- Remove it

---

## Migration Guide

- Removed `DeterministicRenderingConfig`. There shouldn't be any z-fighting in the rendering anymore, even without setting `stable_sort_z_fighting`
## Changelog
### Changed

## Migration Guide

- `PhaseItem` has been split into `BinnedPhaseItem` and `SortedPhaseItem`. If your code has custom `PhaseItem`s, you will need to migrate them to one of these two types. `SortedPhaseItem` requires the fewest code changes, but you may want to pick `BinnedPhaseItem` if your phase doesn't require sorting, as that enables higher performance.

## Tracy graphs
`many_cubes --no-frustum-culling`, `main` branch: *(Tracy graph)*

`many_cubes --no-frustum-culling`, this branch: *(Tracy graph)*

You can see that `batch_and_prepare_binned_render_phase` is a much smaller fraction of the time. Zooming in on that function, with yellow being this branch and red being `main`, we see: *(Tracy graph)*

The binning happens in `queue_material_meshes`. Again with yellow being this branch and red being `main`: *(Tracy graph)*

We can see that there is a small regression in `queue_material_meshes` performance, but it's not nearly enough to outweigh the large gains in `batch_and_prepare_binned_render_phase`.