Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use instancing for sprites #9597

Merged
merged 5 commits into from Sep 2, 2023
Merged

Conversation

superdump
Copy link
Contributor

@superdump superdump commented Aug 27, 2023

Objective

Solution

  • Use an instance-rate vertex buffer to store per-instance data.
  • Store color, UV offset and scale, and a transform per instance.
  • Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine 3x4 matrix and store the transpose of that in the per-instance data. This is similar to how MeshUniform uses transpose affine matrices.
  • Use a special index buffer that has batches of 6 indices referencing 4 vertices. The lower 2 bits indicate the x and y of a quad such that the corners are:
    10    11
    
    00    01
    
    UVs are implicit but get modified by UV offset and scale The remaining upper bits contain the instance index.

Benchmarks

I will compare versus main before #9236 because the results should be as good as or faster than that. Running bevymark -- 10000 16 on an M1 Max with main at e8b38925 in yellow, this PR in red:

Screenshot 2023-08-27 at 18 44 10

Looking at the median frame times, that's a 37% reduction from before.


Changelog

  • Changed: Improved sprite rendering performance by leveraging an instance-rate vertex buffer.

@superdump superdump requested a review from robtfm August 27, 2023 16:47
@superdump
Copy link
Contributor Author

Need to look through all the UI stuff and see if that needs improving too.

@alice-i-cecile alice-i-cecile added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times labels Aug 27, 2023
@alice-i-cecile alice-i-cecile added this to the 0.12 milestone Aug 27, 2023
@Elabajaba
Copy link
Contributor

Did a quick benchmark comparing this to before and after reordering
amd 6800xt, dx12
image

@superdump
Copy link
Contributor Author

I'm applying the same changes to UI rendering. Just got to figure out the clipping stuff.

Comment on lines 775 to 780
sprite_meta.sprite_index_buffer.push(base + 2);
sprite_meta.sprite_index_buffer.push(base);
sprite_meta.sprite_index_buffer.push(base + 1);
sprite_meta.sprite_index_buffer.push(base + 1);
sprite_meta.sprite_index_buffer.push(base + 3);
sprite_meta.sprite_index_buffer.push(base + 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much context of how the sprite rendering works, but is the index buffer needed if it just contains the same offsets over and over again?

i.e. would it be possible to base the indices off of @builtin(vertex_index) inside the vertex shader and read the coordinates from a buffer? (or generate the index buffer on the GPU if one is strictly needed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is something called a 'post-transform cache' where indexed vertices that use the same index have their vertex shader results cached and reused. I think if you used unique built-in indices just specifying a range to the draw command, then you would invoke the vertex shader 6 times per quad instead of 4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks for elaborating. I had something along the lines of vertex pulling in mind, but I'm fairly certain the caching benefits are lost in that case as you mentioned.

@superdump superdump force-pushed the instanced-sprites branch 2 times, most recently from 75159ed to 32ed3e0 Compare August 28, 2023 19:48
@superdump
Copy link
Contributor Author

The UI stuff is independent and was a bit too complicated for me to figure out quickly, so I'm going to leave it for another time. This is up for review.

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
array_stride: 80,
step_mode: VertexStepMode::Instance,
attributes: vec![
// @location(0) i_model_x: vec4<f32>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// @location(0) i_model_x: vec4<f32>,
// @location(1) i_col0_tx: vec4<f32>,

just for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were actually meant to match their names in sprite.wgsl. Fixed.

crates/bevy_sprite/src/render/sprite.wgsl Show resolved Hide resolved
@superdump superdump added this pull request to the merge queue Sep 2, 2023
Merged via the queue into bevyengine:main with commit 4fdea02 Sep 2, 2023
21 checks passed
@cart cart mentioned this pull request Oct 13, 2023
43 tasks
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective

- Supercedes bevyengine#8872 
- Improve sprite rendering performance after the regression in bevyengine#9236 

## Solution

- Use an instance-rate vertex buffer to store per-instance data.
- Store color, UV offset and scale, and a transform per instance.
- Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine
3x4 matrix and store the transpose of that in the per-instance data.
This is similar to how MeshUniform uses transpose affine matrices.
- Use a special index buffer that has batches of 6 indices referencing 4
vertices. The lower 2 bits indicate the x and y of a quad such that the
corners are:
  ```
  10    11

  00    01
  ```
UVs are implicit but get modified by UV offset and scale The remaining
upper bits contain the instance index.

## Benchmarks

I will compare versus `main` before bevyengine#9236 because the results should be
as good as or faster than that. Running `bevymark -- 10000 16` on an M1
Max with `main` at `e8b38925` in yellow, this PR in red:

![Screenshot 2023-08-27 at 18 44
10](https://github.com/bevyengine/bevy/assets/302146/bdc5c929-d547-44bb-b519-20dce676a316)

Looking at the median frame times, that's a 37% reduction from before.

---

## Changelog

- Changed: Improved sprite rendering performance by leveraging an
instance-rate vertex buffer.

---------

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 23, 2024
# Objective

Remove color specialization from `SpritePipeline` after it became
useless in #9597

## Solution

Removed the `COLORED` flag from the pipeline key and removed the
specializing the pipeline over it.

---

## Changelog

### Removed
- `SpritePipelineKey` no longer contains the `COLORED` flag. The flag
has had no effect on how the pipeline operates for a while.

## Migration Guide

- The raw values for the `HDR`, `TONEMAP_IN_SHADER` and `DEBAND_DITHER`
flags have changed, so if you were constructing the pipeline key from
raw `u32`s you'll have to account for that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants