Pack multiple meshes into vertex and index buffers. #13218
base: main
Conversation
The underlying allocation algorithm is [`offset-allocator`], which is a port of [Sebastian Aaltonen's `OffsetAllocator`]. It's a fast, simple hard-real-time allocator in the two-level segregated fit family.

Allocations are divided into two categories: *regular* and *large*. Regular allocations go into one of the shared slabs managed by an allocator. Large allocations get their own individual slabs. Due to platform limitations, on WebGL 2 all vertex buffers are considered large allocations that get their own slabs; however, index buffers can still be packed together. The slab size is 32 MB by default, but the developer can adjust it manually.

The mesh bin key and compare data have been reworked so that the slab IDs are compared first. That way, meshes that share the same vertex and index buffers tend to be drawn together. Note that this only works well for opaque meshes; transparent meshes must be sorted into draw order, so there's less opportunity for grouping.

The purpose of packing meshes together is to reduce the number of times vertex and index buffers have to be re-bound, which is expensive. In the future, we'd like to use *multi-draw*, which allows us to draw multiple meshes with a single drawcall, as long as they're in the same buffers. Thus, this patch paves the way toward multi-draw, and with it a GPU-driven pipeline.

Even without multi-draw, this patch results in significant performance improvements. For me, the command submission time (i.e. GPU time plus driver and `wgpu` overhead) for Bistro goes from 4.07 ms to 1.42 ms without shadows (a 2.8x speedup); with shadows it goes from 6.91 ms to 2.62 ms (a 2.45x speedup). The number of vertex and index buffer switches in Bistro is reduced from approximately 3,600 to 927, with the vast majority of the remaining switches due to the transparent pass.

[`offset-allocator`]: https://github.com/pcwalton/offset-allocator/
[Sebastian Aaltonen's `OffsetAllocator`]: https://github.com/sebbbi/OffsetAllocator/
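The regular-vs-large split described above can be illustrated with a minimal standalone sketch. Note that the names, the `classify` function, and the spill threshold here are all hypothetical stand-ins for illustration, not the PR's actual code (the real decision lives inside the allocator):

```rust
// Hypothetical sketch of the regular-vs-large allocation split; names and
// the spill threshold are illustrative, not Bevy's actual implementation.
const SLAB_SIZE: u64 = 32 * 1024 * 1024; // 32 MB default slab size

#[derive(Debug, PartialEq)]
enum AllocationKind {
    /// Packed into a shared slab alongside other allocations.
    Regular,
    /// Too big (or platform-restricted) to share: gets a dedicated slab.
    Large,
}

fn classify(size: u64, is_webgl2_vertex_buffer: bool) -> AllocationKind {
    // On WebGL 2, every vertex buffer is treated as a large allocation;
    // index buffers can still be packed into shared slabs.
    if is_webgl2_vertex_buffer || size > SLAB_SIZE {
        AllocationKind::Large
    } else {
        AllocationKind::Regular
    }
}

fn main() {
    assert_eq!(classify(64 * 1024, false), AllocationKind::Regular);
    assert_eq!(classify(64 * 1024 * 1024, false), AllocationKind::Large);
    // WebGL 2 vertex buffers are always large, regardless of size.
    assert_eq!(classify(64 * 1024, true), AllocationKind::Large);
}
```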
Awesome! I will try to read through and test this over the weekend. Will this also give us a second perf improvement when order-independent transparency lands? That should mean we can basically stop sorting, right?
@NthTensor Yes, I would think so.
```rust
/// position*, we must only group meshes with identical vertex buffer layouts
/// into the same buffer.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub enum GpuAllocationClass {
```
Curious if it makes sense to add other `GpuAllocationClass`es: textures, samplers, pipeline layouts, etc.? Mostly looking at rerun's code as an example.
I think we should add those as we come to them. YAGNI :)
```rust
/// The mesh.
///
/// Although we don't have multidraw capability yet, we place this at the
/// end to maximize multidraw opportunities in the future.
```
Why does order matter here? It's not `repr(C)` or anything.
`PartialOrd` compares from the first struct field to the last struct field, so items matching on the fields at the top of the struct will generally be placed together more often.
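The derived ordering is lexicographic in field declaration order, which a small standalone sketch can demonstrate (the `BinKey` type here is hypothetical, not Bevy's actual bin key):

```rust
// Derived comparison traits on structs compare fields in declaration order:
// `slab_id` is checked first, and `mesh_id` only breaks ties.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct BinKey {
    slab_id: u32, // compared first, so keys sharing a slab sort adjacently
    mesh_id: u32, // compared only when slab_id is equal
}

fn main() {
    let a = BinKey { slab_id: 1, mesh_id: 99 };
    let b = BinKey { slab_id: 2, mesh_id: 0 };
    // slab_id dominates: a < b even though a.mesh_id > b.mesh_id.
    assert!(a < b);

    // Equal slab_ids fall through to the next field.
    let c = BinKey { slab_id: 1, mesh_id: 100 };
    assert!(a < c);
}
```

This is why placing the mesh field last maximizes the chance that sorted keys group meshes from the same slab next to each other.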
TIL, Thanks.
Some of the code involving the allocator itself is a bit beyond me, but I don't see any issues reading it over. Very pleased with the performance trying this out on my fairly low quality hardware.
```rust
impl GpuAllocator {
    /// Returns the slab that the given allocation is stored in.
    pub fn buffer(&self, allocation: &GpuAllocation) -> &Buffer {
        &self.slabs[&allocation.slab_id]
    }
}
```
Instead of panicking when encountering an invalid allocation, we could return an `Option<&Buffer>` here.
In general I prefer not to panic, but I'm not sure how we would handle the `Option` here in the render system. It looks like this only panics if there's an (unrecoverable) memory error in this code or the allocator. Perhaps we could unwrap for a more useful error message?
If it fails it always indicates a bug, because we should be spilling into large allocations if a small allocation can't handle it. So I'm not sure returning `None` would be helpful.
Updated to main and fixed the 2D meshes problem. It was a simple mistake when porting the logic from 3D over: in the indexed path, for the

Comments addressed.

Blocking until 0.14 is shipped.

This is ready to go, but I think it would be best to wait until 0.15 and not merge for 0.14. The reason is that the memory usage heuristics haven't been well tuned yet. We'll only know what the best heuristics are through a cycle of testing.

I would like to nominate this for the release notes. I think the performance gains are significant enough that users would enjoy reading about it. (Not to mention I've been watching this in the background ever since

Agreed :) In the future, feel free to just add the label yourselves: it's easy to make the editorial call to split or lump things during the final release notes process.
[Graph] Bistro, without shadows. Yellow is this PR; red is `main`.

[Graph] Bistro, with shadows. Yellow is this PR; red is `main`.
Changelog

Added

- Multiple meshes can now be packed into shared vertex and index buffers (slabs).

Migration Guide

- The vertex and index buffers in `GpuMesh` are now `GpuAllocation`s instead of `Buffer`s, to facilitate packing multiple meshes in the same buffer. To fetch the buffer corresponding to a `GpuAllocation`, use the `buffer()` method in the new `GpuAllocator` resource. Note that the allocation may be located anywhere in the buffer; use the `offset()` method to determine its location.