Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance by binning together opaque items instead of sorting them. #12453

Merged
merged 25 commits into from Mar 30, 2024

Conversation

pcwalton
Copy link
Contributor

@pcwalton pcwalton commented Mar 13, 2024

Today, we sort all entities added to all phases, even the phases that don't strictly need sorting, such as the opaque and shadow phases. This results in a performance loss because our PhaseItems are rather large in memory, so sorting is slow. Additionally, determining the boundaries of batches is an O(n) process.

This commit makes Bevy instead applicable place phase items into bins keyed by bin keys, which have the invariant that everything in the same bin is potentially batchable. This makes determining batch boundaries O(1), because everything in the same bin can be batched. Instead of sorting each entity, we now sort only the bin keys. This drops the sorting time to near-zero on workloads with few bins like many_cubes --no-frustum-culling. Memory usage is improved too, with batch boundaries and dynamic indices now implicit instead of explicit. The improved memory usage results in a significant win even on unbatchable workloads like many_cubes --no-frustum-culling --vary-material-data-per-instance, presumably due to cache effects.

Not all phases can be binned; some, such as transparent and transmissive phases, must still be sorted. To handle this, this commit splits PhaseItem into BinnedPhaseItem and SortedPhaseItem. Most of the logic that today deals with PhaseItems has been moved to SortedPhaseItem. BinnedPhaseItem has the new logic.

Frame time results (in ms/frame) are as follows:

Benchmark binning main Speedup
many_cubes -nfc -vpi 232.179 312.123 34.43%
many_cubes -nfc 25.874 30.117 16.40%
many_foxes 3.276 3.515 7.30%

(-nfc is short for --no-frustum-culling; -vpi is short for --vary-per-instance.)


Changelog

Changed

  • Render phases have been split into binned and sorted phases. Binned phases, such as the common opaque phase, achieve improved CPU performance by avoiding the sorting step.

Migration Guide

  • PhaseItem has been split into BinnedPhaseItem and SortedPhaseItem. If your code has custom PhaseItems, you will need to migrate them to one of these two types. SortedPhaseItem requires the fewest code changes, but you may want to pick BinnedPhaseItem if your phase doesn't require sorting, as that enables higher performance.

Tracy graphs

many-cubes --no-frustum-culling, main branch:
Screenshot 2024-03-12 180037

many-cubes --no-frustum-culling, this branch:
Screenshot 2024-03-12 180011

You can see that batch_and_prepare_binned_render_phase is a much smaller fraction of the time. Zooming in on that function, with yellow being this branch and red being main, we see:

Screenshot 2024-03-12 175832

The binning happens in queue_material_meshes. Again with yellow being this branch and red being main:
Screenshot 2024-03-12 175755

We can see that there is a small regression in queue_material_meshes performance, but it's not nearly enough to outweigh the large gains in batch_and_prepare_binned_render_phase.

them.

Today, we sort all entities added to all phases, even the phases that
don't strictly need sorting, such as the opaque and shadow phases. This
results in a performance loss because our `PhaseItem`s are rather large
in memory, so sorting is slow. Additionally, determining the boundaries
of batches is an O(n) process.

This commit makes Bevy instead applicable place phase items into *bins*
keyed by *bin keys*, which have the invariant that everything in the
same bin is potentially batchable. This makes determining batch
boundaries O(1), because everything in the same bin can be batched.
Instead of sorting each entity, we now sort only the bin keys. This
drops the sorting time to near-zero on workloads with few bins like
`many_cubes --no-frustum-culling`. Memory usage is improved too, with
batch boundaries and dynamic indices now implicit instead of explicit.
The improved memory usage results in a significant win even on
unbatchable workloads like `many_cubes --no-frustum-culling
--vary-material-data-per-instance`, presumably due to cache effects.

Not all phases can be binned; some, such as transparent and transmissive
phases, must still be sorted. To handle this, this commit splits
`PhaseItem` into `BinnedPhaseItem` and `SortedPhaseItem`. Most of the
logic that today deals with `PhaseItem`s has been moved to
`SortedPhaseItem`. `BinnedPhaseItem` has the new logic.

FPS results are as follows:

| Benchmark                | `binning` | `main`  | Speedup |
| ------------------------ | --------- | ------- | ------- |
| `many_cubes -nfc -vmdpi` | 4.608     | 3.373   | 36.61%  |
| `many_cubes -nfc`        | 42.161    | 35.437  | 18.97%  |
| `many_foxes`             | 308.943   | 303.321 | 1.85%   |

(`-nfc` is short for `--no-frustum-culling`; `-vmdpi` is short for
`--vary-material-data-per-instance`.)
@pcwalton pcwalton requested a review from superdump March 13, 2024 01:08
@james7132 james7132 added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times C-Breaking-Change A breaking change to Bevy's public API that needs to be noted in a migration guide labels Mar 13, 2024
@james7132 james7132 self-requested a review March 13, 2024 01:19
@alice-i-cecile alice-i-cecile added the C-Needs-Release-Note Work that should be called out in the blog due to impact label Mar 13, 2024
crates/bevy_render/src/render_phase/mod.rs Outdated Show resolved Hide resolved
{
fn default() -> Self {
Self {
batchable_keys: default(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try to use the const values where possible since Default can't be const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this means.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec::new() and the other equivalent constructors can be const, but Default::default need to at least be inlined in some way. It may be preferable to use const construction where possible.

crates/bevy_render/src/render_phase/mod.rs Show resolved Hide resolved
crates/bevy_render/src/render_phase/mod.rs Outdated Show resolved Hide resolved
pcwalton added a commit to pcwalton/bevy that referenced this pull request Mar 19, 2024
meshes, 3D meshes, and lights, for performance.

This commit splits `VisibleEntities::entities` into three separate
lists: one for lights, one for 2D meshes, and one for 3D meshes. This
allows `queue_material_meshes` and similar methods to avoid examining
entities that are obviously irrelevant. In particular, this separation
helps scenes with many skinned meshes, as the individual bones are
considered visible entities but have no rendered appearance.

Internally, `VisibleEntities::entities` is a `HashMap` from the `TypeId`
representing a `QueryFilter` to the appropriate `Entity` list. I had to
do this because `VisibleEntities` is located within an upstream crate
from the crates that provide lights (`bevy_pbr`) and 2D meshes
(`bevy_sprite`). As an added benefit, this setup allows apps to provide
their own types of renderable components, by simply adding a specialized
`check_visibility` to the schedule.

Currently, this provides minimal performance benefit in `many_foxes`,
but on top of binning (bevyengine#12453) it provides a 20% end-to-end speedup on
that benchmark.

* `check_visibility` and `VisibleEntities` now store the three types of
  renderable entities--2D meshes, 3D meshes, and lights--separately. If
  your custom rendering code examines `VisibleEntities`, it will now
  need to specify which type of entity it's interested in using the
  `WithMesh2d`, `WithMesh`, and `WithLight` types respectively. If your
  app introduces a new type of renderable entity, you'll need to add an
  explicit call to `check_visibility` to the schedule to accommodate
  your new component or components.
@pcwalton
Copy link
Contributor Author

Addressed review comments.

pcwalton added a commit to pcwalton/bevy that referenced this pull request Mar 23, 2024
This commit introduces a new component, `GpuCulling`, which, when
present on a camera, skips the CPU visibility check in favor of doing
the frustum culling on the GPU. This trades off potentially-increased
CPU work and drawcalls in favor of cheaper culling and doesn't improve
the performance of any workloads that I know of today. However, it opens
the door to significant optimizations in the future by taking the
necessary first step toward *GPU-driven rendering*.

Enabling GPU culling for a view puts the rendering for that view into
*indirect mode*. In indirect mode, CPU-level visibility checks are
skipped, and all visible entities are considered potentially visible.
Bevy's batching logic still runs as usual, but it doesn't directly
generate mesh instance indices. Instead, it generates *instance
handles*, which are indices into an array of real instance indices.
Before any rendering is done, for each view, a compute shader,
`cull.wgsl`, maps instance handles to instance indices, discarding any
instance handles that represent meshes that are outside the visible
frustum. Draws are then done using the *indirect draw* feature of
`wgpu`, which instructs the GPU to read the number of actual instances
from the output of that compute shader.

Essentially, GPU culling works by adding a new level of indirection
between the CPU's notion of instances (known as instance handles) and
the GPU's notion of instances.

A new `--gpu-culling` flag has been added to the `many_foxes`,
`many_cubes`, and `3d_shapes` examples.

Potential follow-ups include:

* Split up `RenderMeshInstances` into CPU-driven and GPU-driven parts.
  The former, which contain fields like the transform, won't be
  initialized at all in when GPU culling is enabled. Instead, the
  transform will be directly written to the GPU in `extract_meshes`,
  like `extract_skins` does for joint matrices.

* Implement GPU culling for shadow maps.

  - Following that, we can treat all cascades as one as far as the CPU
    is concerned, simply replaying the final draw commands with
    different view uniforms, which should reduce the CPU overhead
    considerably.

* Retain bins from frame to frame so that they don't have to be rebuilt.
  This is a longer term project that will build on top of bevyengine#12453 and
  several of the tasks in bevyengine#12590, such as main-world pipeline
  specialization.
Copy link
Contributor

@superdump superdump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from those two small comments, still looks good to me. And I’m assuming it appears to work as intended to the best of our knowledge. I haven’t had time to test. Thanks to @IceSentry for spotting a shadow bug. :)

@pcwalton
Copy link
Contributor Author

Review comments addressed.

I updated the benchmarks with the new numbers. The differences from the previous revision are in the noise.

@superdump
Copy link
Contributor

error: unresolved link to `BinKey`
   --> crates/bevy_render/src/render_phase/mod.rs:518:11
    |
518 |     /// [`BinKey`] by the order of binding for best performance. For example,
    |           ^^^^^^ no item named `BinKey` in scope
    |
    = help: to escape `[` and `]` characters, add '\' before them like `\[` or `\]`
    = note: `-D rustdoc::broken-intra-doc-links` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(rustdoc::broken_intra_doc_links)]`

@pcwalton pcwalton added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 29, 2024
crates/bevy_render/src/batching/mod.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/batching/mod.rs Outdated Show resolved Hide resolved
@james7132 james7132 added this pull request to the merge queue Mar 30, 2024
Merged via the queue into bevyengine:main with commit 4dadebd Mar 30, 2024
27 checks passed
@mockersf
Copy link
Member

mockersf commented Mar 30, 2024

this "broke" deterministic rendering (http://dev-docs.bevyengine.org/bevy/render/deterministic/struct.DeterministicRenderingConfig.html). There is no z-fighting with or without it, but it doesn't render the same color between run.

See example determistic. Is it because that feature doesn't make sense anymore and it should be removed?

github-merge-queue bot pushed a commit that referenced this pull request Apr 1, 2024
# Objective

- Since #12453, `DeterministicRenderingConfig` doesn't do anything

## Solution

- Remove it

---

## Migration Guide

- Removed `DeterministicRenderingConfig`. There shouldn't be any z
fighting anymore in the rendering even without setting
`stable_sort_z_fighting`
chompaa pushed a commit to chompaa/bevy that referenced this pull request Apr 5, 2024
# Objective

- Since bevyengine#12453, `DeterministicRenderingConfig` doesn't do anything

## Solution

- Remove it

---

## Migration Guide

- Removed `DeterministicRenderingConfig`. There shouldn't be any z
fighting anymore in the rendering even without setting
`stable_sort_z_fighting`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Breaking-Change A breaking change to Bevy's public API that needs to be noted in a migration guide C-Needs-Release-Note Work that should be called out in the blog due to impact C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

8 participants