
Meshlet rendering (initial feature) #10164

Merged
merged 467 commits into from Mar 25, 2024

JMS55
Contributor

@JMS55 JMS55 commented Oct 17, 2023

Objective

  • Implements a more efficient, GPU-driven rendering pipeline based on meshlets (see #1342, "Gpu Driven Rendering By Default").
  • Meshes are split into small clusters of triangles called meshlets, each of which acts as a mini index buffer into the larger mesh data. Meshlets can be compressed, streamed, culled, and batched much more efficiently than monolithic meshes.
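To make the "mini index buffer" idea concrete, here is a minimal Rust sketch of splitting an indexed triangle list into meshlets. It is a naive greedy split in input order, not the meshoptimizer algorithm this PR relies on, and the `Meshlet` struct, `build_meshlets`, and the 64-vertex / 64-triangle limits are hypothetical choices for illustration. Each meshlet stores the global vertex indices it touches plus a compact list of `u8` local indices into that list.

```rust
/// Hypothetical meshlet layout: a small cluster of triangles acting as a mini index buffer.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Meshlet {
    /// Global vertex indices (into the full mesh's vertex buffer) used by this meshlet.
    pub vertices: Vec<u32>,
    /// Triangle list of local indices into `vertices`; `u8` suffices while MAX_VERTICES <= 256.
    pub triangles: Vec<u8>,
}

/// Hypothetical per-meshlet limits, chosen only for illustration.
pub const MAX_VERTICES: usize = 64;
pub const MAX_TRIANGLES: usize = 64;

/// Greedily splits an indexed triangle list into meshlets, in input order.
/// Unlike meshoptimizer, this makes no attempt to optimize spatial locality.
pub fn build_meshlets(indices: &[u32]) -> Vec<Meshlet> {
    assert!(indices.len() % 3 == 0, "expected a triangle list");
    let mut meshlets = Vec::new();
    let mut current = Meshlet::default();

    for tri in indices.chunks_exact(3) {
        // Vertices of this triangle not yet in the current meshlet
        // (may over-count degenerate triangles, which is conservative).
        let new_vertices = tri
            .iter()
            .filter(|&&v| !current.vertices.contains(&v))
            .count();
        let full = current.vertices.len() + new_vertices > MAX_VERTICES
            || current.triangles.len() / 3 == MAX_TRIANGLES;
        if full {
            meshlets.push(std::mem::take(&mut current));
        }
        for &v in tri {
            // Reuse the local slot if this vertex is already in the meshlet.
            let local = match current.vertices.iter().position(|&x| x == v) {
                Some(i) => i,
                None => {
                    current.vertices.push(v);
                    current.vertices.len() - 1
                }
            };
            current.triangles.push(local as u8);
        }
    }
    if !current.triangles.is_empty() {
        meshlets.push(current);
    }
    meshlets
}
```

Because local indices only need to address a meshlet's own vertices, they fit in a byte, which is part of why meshlets compress well compared to one monolithic 32-bit index buffer.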


Misc

@JMS55 JMS55 added A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective Ask for help! labels Oct 17, 2023
@JMS55 JMS55 added this to the 0.13 milestone Oct 17, 2023
@superdump
Contributor

Awesome!

Two-pass occlusion culling should not be tied to meshlets specifically. We should just do that separately.

@JMS55
Contributor Author

JMS55 commented Oct 18, 2023

I don't see how they could be separated.

The meshlet system is basically already set up to do occlusion culling. I already upload bounding sphere information per meshlet, and have a culling and indirect draw pipeline in place.
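As a rough illustration of the per-meshlet bounding sphere data mentioned above, the sketch below computes one sphere per meshlet on the CPU. The exact bounds used in the PR are not shown in this thread; this is just one simple (non-minimal) choice, centered on the meshlet's axis-aligned bounding box, and `BoundingSphere` / `meshlet_bounding_sphere` are hypothetical names.

```rust
/// Hypothetical bounding sphere in mesh-local space.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BoundingSphere {
    pub center: [f32; 3],
    pub radius: f32,
}

/// Hypothetical helper: a (non-minimal) bounding sphere for one meshlet, centered on
/// the meshlet's axis-aligned bounding box, with a radius reaching its farthest vertex.
pub fn meshlet_bounding_sphere(positions: &[[f32; 3]], meshlet_vertices: &[u32]) -> BoundingSphere {
    assert!(!meshlet_vertices.is_empty(), "meshlet has no vertices");
    let mut min = [f32::INFINITY; 3];
    let mut max = [f32::NEG_INFINITY; 3];
    // Only the vertices referenced by this meshlet contribute to its bounds.
    for &v in meshlet_vertices {
        let p = positions[v as usize];
        for axis in 0..3 {
            min[axis] = min[axis].min(p[axis]);
            max[axis] = max[axis].max(p[axis]);
        }
    }
    let center = [
        (min[0] + max[0]) * 0.5,
        (min[1] + max[1]) * 0.5,
        (min[2] + max[2]) * 0.5,
    ];
    let radius = meshlet_vertices
        .iter()
        .map(|&v| {
            let p = positions[v as usize];
            let d = [p[0] - center[0], p[1] - center[1], p[2] - center[2]];
            (d[0] * d[0] + d[1] * d[1] + d[2] * d[2]).sqrt()
        })
        .fold(0.0_f32, f32::max);
    BoundingSphere { center, radius }
}
```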

I don't see how we could share occlusion culling passes between meshlet meshes and regular meshes without essentially duplicating the meshlet system's indirect draw setup, bounding info, etc., changed to operate on whole meshes instead of meshlets, which is less efficient. If we went that route, there would be no point in doing all the work for meshlets in the first place.

@BeastLe9enD
Contributor

Gate meshoptimizer dependency behind a cargo feature, or rewrite it in Rust.

If you are looking for a pure-Rust implementation, have a look at https://github.com/yzsolt/meshopt-rs

@JMS55
Contributor Author

JMS55 commented Oct 21, 2023

Unfortunately meshopt-rs does not support the meshlet APIs (it's ported from an older version of meshoptimizer), which is why I'm using a different crate.

@JMS55
Contributor Author

JMS55 commented Oct 25, 2023

Plan for materials (assuming opaque meshes only, will extend to other passes later):

  1. In extract_meshlet_meshes, create a Vec of all meshlet mesh entities
  2. In a new Queue system, as part of MaterialPlugin for each material: for each entity from step 1, check whether it uses the material via RenderMaterialInstances, and if so, push the material index (taken from the HashMap in step 3) to a new Vec in MeshletGpuScene
  3. In a new PrepareAssets system that runs after prepare_materials::<M>, as part of MaterialPlugin for each material, upload changed opaque materials to MeshletGpuScene
    3a. Queue a new render pipeline for the material, and store the pipeline ID + material bind group in a Vec
    3b. Use a HashMap to store index of material in the Vec, hashing based on material AssetId
    3c. Could skip materials not used by any meshlet mesh, at the cost of extra queries
  4. Optionally (may improve GPU coherence, at the cost of CPU time) in prepare_meshlet_per_frame_resources, sort per-frame-buffers by the material index before uploading
  5. In prepare_meshlet_per_frame_resources, instead of a single DrawIndexedIndirect item, make draw_command_buffer hold M items, where M is the total material count
    5a. Each DrawIndexedIndirect::base_index needs to be offset by the total count of possible indices (after instancing) of all preceding materials, so that each material gets its own region of the index buffer
    5b. We can skip draws for materials that aren't used by any meshlet mesh entities
  6. In cull_meshlets, for each meshlet, index into the draw command array based on the material index to get the DrawIndexedIndirect for the meshlet's material, and then write to location draw_index_buffer_start + draw_indirect_command.base_index + offset in the index buffer
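The draw-command layout in steps 3b, 5, and 6 can be sketched on the CPU as follows. This is a hypothetical Rust model, not Bevy code: `DrawIndexedIndirect` here is a local stand-in struct, `u64` stands in for a material AssetId, and each material's `base_index` is assumed to be the start of its own index-buffer region (the sum of the maximum index counts of the materials before it). On the GPU, advancing the write cursor in `append_meshlet` would be an atomic add in the culling shader.

```rust
use std::collections::HashMap;

/// Hypothetical local stand-in for an indexed indirect draw command (not Bevy's type).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct DrawIndexedIndirect {
    pub index_count: u32,
    pub instance_count: u32,
    pub base_index: u32,
    pub vertex_offset: i32,
    pub base_instance: u32,
}

/// Step 3b: give each distinct material a dense index, in first-seen order.
/// `u64` stands in for a material AssetId.
pub fn material_indices(material_ids: &[u64]) -> HashMap<u64, usize> {
    let mut indices = HashMap::new();
    for &id in material_ids {
        let next = indices.len();
        indices.entry(id).or_insert(next);
    }
    indices
}

/// Step 5: one draw command per material. Each base_index is the sum of the
/// maximum index counts of all preceding materials, so regions never overlap.
pub fn build_draw_commands(max_indices_per_material: &[u32]) -> Vec<DrawIndexedIndirect> {
    let mut next_base = 0;
    max_indices_per_material
        .iter()
        .map(|&max_indices| {
            let command = DrawIndexedIndirect {
                instance_count: 1,
                base_index: next_base,
                ..Default::default()
            };
            next_base += max_indices;
            command
        })
        .collect()
}

/// Step 6: a visible meshlet appends its indices to its material's region.
/// index_count doubles as the write cursor.
pub fn append_meshlet(
    commands: &mut [DrawIndexedIndirect],
    index_buffer: &mut [u32],
    material_index: usize,
    meshlet_indices: &[u32],
) {
    let command = &mut commands[material_index];
    let start = (command.base_index + command.index_count) as usize;
    index_buffer[start..start + meshlet_indices.len()].copy_from_slice(meshlet_indices);
    command.index_count += meshlet_indices.len() as u32;
}
```

Reserving each material's worst-case index count up front means the culling pass never needs to know how many meshlets of other materials survived, at the cost of a larger index buffer.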

@JMS55
Copy link
Contributor Author

JMS55 commented Oct 28, 2023

This PR is usable at the moment and has large chunks of code ready. Once 0.12 is released, the plan is to start opening smaller PRs with parts of these changes, so that meshlet rendering is merged incrementally in chunks instead of as one big PR with all the changes.

github-merge-queue bot pushed a commit that referenced this pull request Oct 31, 2023
# Objective
- Work towards GPU-driven culling (#10164)

## Solution
- Pass the view frustum to the shader view uniform

---

## Changelog
- View Frustums are now extracted to the render world and made available to shaders
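With the frustum available in the view uniform, a culling shader can test each meshlet's bounding sphere against it. The sketch below mirrors that test on the CPU in Rust; the plane representation (`[a, b, c, d]` with unit-length, inward-facing normals) and the `Frustum` type are assumptions for illustration, not Bevy's actual shader bindings.

```rust
/// Hypothetical frustum: six planes `[a, b, c, d]` with unit-length, inward-facing
/// normals. A point is on the inner side of a plane when a*x + b*y + c*z + d >= 0.
pub struct Frustum {
    pub planes: [[f32; 4]; 6],
}

impl Frustum {
    /// Conservative sphere test: returns false only if the sphere lies entirely
    /// outside at least one plane, so visible meshlets are never culled.
    pub fn intersects_sphere(&self, center: [f32; 3], radius: f32) -> bool {
        self.planes.iter().all(|p| {
            // Signed distance from the sphere's center to this plane.
            let distance = p[0] * center[0] + p[1] * center[1] + p[2] * center[2] + p[3];
            distance >= -radius
        })
    }
}
```

For example, with the six planes of the box [-1, 1]^3, a sphere centered at x = 1.5 with radius 1 is kept, while one centered at x = 3 with radius 1 is culled.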
@JMS55
Copy link
Contributor Author

JMS55 commented Nov 1, 2023

Plan to support (occlusion culling, visibility buffer, shadows, forward + prepass, deferred):

  • Each view will get a visibility buffer (not to be confused with the vbuffer texture)
    • Stores visible/not visible in view as a boolean for each meshlet, persisted across two frames (previous and current)
    • Instances of meshlets (thread_meshlet) can get their previous visibility via previous_visibility[previous_visibility_id[thread_id]], where previous_visibility_id is uploaded on the CPU during the current frame, using a resource holding the data from the previous frame (might be a better way to do this without the extra indirection and previous_visibility_id buffer...)
  • MeshletVBufferNode - Render depth, vbuffer (thread_id + triangle_id), and material depth
    • PreviousOccluderPreparePass - Take meshlets visible last frame, as indicated by previous_visibility, write index buffer to render them
    • VBufferRenderPass1 - Render the 3 node outputs using a single draw_indexed_indirect()
    • GenerateHzbPass - Take the depth buffer generated so far and downscale it several times to create a hierarchical depth buffer
    • CullPass - Take all meshlets, frustum cull, occlusion cull against the HZB, write visibility to the visibility buffer for next frame, and, if visible, write the index buffer to render them
    • VBufferRenderPass2 - Render the 3 node outputs using a single draw_indexed_indirect()
  • MeshletShadowMapNode/MeshletPrepassNode/MeshletOpaque3dMainPassNode
    • For each material, draw a single triangle/quad using the material depth trick, reconstruct vertex properties, and then shade the fragment
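The GenerateHzbPass step can be sketched on the CPU as a reduction over 2x2 blocks. This Rust version is illustrative only (the real pass would run on the GPU), and it assumes reverse-Z depth, as Bevy uses, so the farthest depth in a block is the smallest value; with a conventional depth range you would take the maximum instead. `Mip` and `build_hzb` are hypothetical names.

```rust
/// Hypothetical representation of one level of a hierarchical depth buffer (HZB).
pub struct Mip {
    pub depth: Vec<f32>,
    pub width: usize,
    pub height: usize,
}

/// Builds an HZB mip chain down to 1x1. Each texel keeps the farthest depth of the
/// (up to) 2x2 texels it covers, so occlusion tests against it stay conservative.
/// Assumes reverse-Z (1.0 = near, 0.0 = far), so farthest = minimum.
pub fn build_hzb(depth: &[f32], width: usize, height: usize) -> Vec<Mip> {
    assert!(width > 0 && height > 0 && depth.len() == width * height);
    let mut mips = vec![Mip { depth: depth.to_vec(), width, height }];
    loop {
        let prev = mips.last().unwrap();
        let (w, h) = (prev.width, prev.height);
        if w == 1 && h == 1 {
            break;
        }
        let (nw, nh) = ((w + 1) / 2, (h + 1) / 2);
        let mut next = vec![0.0; nw * nh];
        for y in 0..nh {
            for x in 0..nw {
                let mut farthest = f32::INFINITY;
                for (dx, dy) in [(0, 0), (1, 0), (0, 1), (1, 1)] {
                    // Clamp so odd-sized levels never read out of bounds.
                    let sx = (2 * x + dx).min(w - 1);
                    let sy = (2 * y + dy).min(h - 1);
                    farthest = farthest.min(prev.depth[sy * w + sx]);
                }
                next[y * nw + x] = farthest;
            }
        }
        mips.push(Mip { depth: next, width: nw, height: nh });
    }
    mips
}
```

A meshlet can then be occlusion culled by picking the mip level whose texels cover its projected bounds, and culling it when even its nearest point is farther away than the farthest depth stored for that region.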

The messiest part is having to use previous/next visibility buffers with separate indices, instead of being able to assume that the [object/instance/entity] index is constant across frames and using a single read_write visibility buffer. I'll have to think more about how to do this.
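One way to picture the previous/next indirection is the following Rust sketch, assuming each meshlet instance has a key that stays stable across frames (for example an entity plus a meshlet index). The CPU builds `previous_visibility_id` by looking up each current instance's slot from the previous frame, and the shader-side lookup becomes `previous_visibility[previous_visibility_id[thread_id]]`. Everything other than those three names is hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical stable identity of a meshlet instance, e.g. (entity bits, meshlet index).
pub type InstanceKey = (u64, u32);

/// Marks instances that did not exist last frame.
pub const NO_PREVIOUS: u32 = u32::MAX;

/// CPU side: for each of this frame's instances (in thread_id order), find the
/// slot it occupied last frame, producing the previous_visibility_id buffer.
pub fn build_previous_visibility_ids(
    previous_slots: &HashMap<InstanceKey, u32>,
    current: &[InstanceKey],
) -> Vec<u32> {
    current
        .iter()
        .map(|key| previous_slots.get(key).copied().unwrap_or(NO_PREVIOUS))
        .collect()
}

/// Records this frame's slots so the next frame can build its own remap.
pub fn slots_for_next_frame(current: &[InstanceKey]) -> HashMap<InstanceKey, u32> {
    current.iter().enumerate().map(|(slot, &key)| (key, slot as u32)).collect()
}

/// Shader-side lookup, written in Rust: previous_visibility[previous_visibility_id[thread_id]],
/// treating new instances as not visible last frame.
pub fn was_visible_last_frame(
    previous_visibility: &[bool],
    previous_visibility_ids: &[u32],
    thread_id: usize,
) -> bool {
    match previous_visibility_ids[thread_id] {
        NO_PREVIOUS => false,
        previous_slot => previous_visibility[previous_slot as usize],
    }
}
```

Treating brand-new instances as not visible last frame means they are not drawn in the first pass and are instead picked up by the CullPass against the HZB, which keeps the scheme conservative.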

@JMS55
Contributor Author

JMS55 commented Nov 1, 2023

Additional complication: vertex data can't be passed to the fragment shader via the vertex output. The fragment shader will need to read the vbuffer pixel, load the meshlet data, and then load the vertex data. This means we need to modify the fragment shader.

I'll probably have to rewrite the shaders somehow using naga_oil, so that they load the vbuffer data to construct the VertexOutput instead of reading it directly as the fragment input.
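The "read the vbuffer pixel, then load the meshlet and vertex data" step might look like the Rust sketch below. The bit layout (7 low bits for the triangle index, 25 high bits for `thread_id`), the `thread_meshlet` lookup table, and all function names are assumptions for illustration; the plan earlier in this PR only specifies that the vbuffer stores thread_id + triangle_id.

```rust
/// Hypothetical vbuffer packing: the low 7 bits hold the triangle index within
/// its meshlet (up to 128 triangles), the upper 25 bits hold the thread_id.
pub const TRIANGLE_ID_BITS: u32 = 7;
const TRIANGLE_ID_MASK: u32 = (1 << TRIANGLE_ID_BITS) - 1;

pub fn pack_vbuffer(thread_id: u32, triangle_id: u32) -> u32 {
    debug_assert!(triangle_id <= TRIANGLE_ID_MASK, "triangle_id out of range");
    debug_assert!(thread_id < (1 << (32 - TRIANGLE_ID_BITS)), "thread_id out of range");
    (thread_id << TRIANGLE_ID_BITS) | triangle_id
}

pub fn unpack_vbuffer(packed: u32) -> (u32, u32) {
    (packed >> TRIANGLE_ID_BITS, packed & TRIANGLE_ID_MASK)
}

/// Hypothetical meshlet layout: global vertex indices plus local u8 triangle indices.
pub struct Meshlet {
    pub vertices: Vec<u32>,
    pub triangles: Vec<u8>,
}

/// Resolves a vbuffer pixel to the three global vertex indices of the triangle
/// covering it. `thread_meshlet` (hypothetical) maps each thread_id to its meshlet.
pub fn resolve_triangle(packed: u32, thread_meshlet: &[u32], meshlets: &[Meshlet]) -> [u32; 3] {
    let (thread_id, triangle_id) = unpack_vbuffer(packed);
    let meshlet = &meshlets[thread_meshlet[thread_id as usize] as usize];
    let first = triangle_id as usize * 3;
    // Local (per-meshlet) index -> global vertex index.
    [0, 1, 2].map(|corner| meshlet.vertices[meshlet.triangles[first + corner] as usize])
}
```

From the three global vertex indices, the fragment shader can then fetch positions, normals, and UVs and interpolate them with barycentrics it computes itself, which is what would replace the usual VertexOutput input.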

@JMS55
Copy link
Contributor Author

JMS55 commented Nov 5, 2023

Useful reference for visbuffer barycentrics, partial derivatives, and the other complicated stuff: https://github.com/JuanDiegoMontoya/Frogfood/blob/main/data/shaders/visbuffer/VisbufferResolve.frag.glsl

@JMS55
Copy link
Contributor Author

JMS55 commented Nov 5, 2023

The next step is to emit a "material ID" to a depth texture. The visbuffer fragment shader will output the material ID to an R16Uint color attachment, and then an extra fullscreen-triangle render pass will read that texture and write to a depth target.
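The material depth trick can be modeled on the CPU as below. This Rust sketch assumes a 16-bit unorm depth target and a simple `material_id / 65535` encoding (both assumptions, not necessarily what the PR ships); the point is that each material's pass draws one fullscreen triangle at its own depth with an Equal depth test, so the depth hardware rejects every pixel that belongs to another material.

```rust
/// Hypothetical encoding: material IDs map to evenly spaced depths in [0, 1].
pub fn material_id_to_depth(material_id: u16) -> f32 {
    material_id as f32 / u16::MAX as f32
}

/// Quantization performed by a 16-bit unorm depth target.
pub fn quantize_unorm16(depth: f32) -> u16 {
    (depth.clamp(0.0, 1.0) * u16::MAX as f32).round() as u16
}

/// Fullscreen pass: read each pixel's material ID (the R16Uint attachment)
/// and write the matching depth into the material depth target.
pub fn write_material_depth(material_ids: &[u16]) -> Vec<u16> {
    material_ids
        .iter()
        .map(|&id| quantize_unorm16(material_id_to_depth(id)))
        .collect()
}

/// One material's shading pass: a fullscreen triangle at that material's depth
/// with an Equal depth test, so only that material's pixels run its fragment shader.
pub fn pixels_passing_equal_test(material_depth: &[u16], material_id: u16) -> Vec<usize> {
    let test_depth = quantize_unorm16(material_id_to_depth(material_id));
    material_depth
        .iter()
        .enumerate()
        .filter(|&(_, &stored)| stored == test_depth)
        .map(|(pixel, _)| pixel)
        .collect()
}
```

Because every 16-bit ID survives the round trip through this unorm encoding exactly, the Equal test never confuses two materials.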

ameknite pushed a commit to ameknite/bevy that referenced this pull request Nov 6, 2023
Contributor

@IceSentry IceSentry left a comment


I definitely didn't check that all the meshlet code is correct, but it works. I've tried, to the best of my ability, to make sure it doesn't affect unrelated code when meshlets aren't used.

I'm approving this with the context that it is an experimental feature that will still be iterated on because I think merging this now will make collaboration easier.

crates/bevy_pbr/src/material.rs Outdated
examples/3d/meshlet.rs Outdated
JMS55 and others added 3 commits March 17, 2024 23:26
Co-authored-by: vero <email@atlasdostal.com>
Member

@cart cart left a comment


This is definitely exciting work! The approach (while complicated) is surprisingly straightforward given what it's doing and what we get. I'm very curious to see how this performs, if you've already done testing (not blocking: I know there are a lot of changes planned).

Makes me want to consider "fragment shader abstractions" so people can write custom fragment shaders that "just work" with meshlets / deferred / prepass / normal forward without needing a bunch of ifdefs + import knowledge. That doesn't seem like a particularly challenging problem to solve, given that it's mostly import differences and a bit of setup code (obviously not something we should handle in this PR).

crates/bevy_pbr/src/meshlet/meshlet_bindings.wgsl Outdated
crates/bevy_pbr/src/lib.rs
@JMS55
Contributor Author

JMS55 commented Mar 22, 2024

Performance varies wildly depending on the scene. There's considerable base overhead, but it should scale to much denser scenes, especially ones with lots of occlusion. It's really meant for rendering extremely dense, high-resolution scenes where the goal is to render one triangle per pixel, so you're never limited by a lack of geometric detail. We're not there yet, though, due to the lack of LODs and software rasterization. So at the moment, performance can be quite good in certain scenes with lots of occlusion, but it can't render higher-resolution content as intended, and on simpler scenes the base overhead dominates compared to Bevy's standard renderer. I don't have specific numbers to give at the moment; I'll do much more in-depth comparisons once LODs and software raster are in.

I agree that the material shaders need some abstraction in the future. You can kind of do that already with our existing ifdefs, but it's not ideal. One complication is that derivatives need to be calculated manually when using the visbuffer. We'd probably need some kind of automatic differentiation that applies the chain rule to functions, like Slang has.
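To show what "calculated manually" involves, here is a Rust sketch that computes a pixel's barycentrics from 2D edge functions, along with their screen-space x/y derivatives, and then applies the chain rule to get UV derivatives for mip selection. It deliberately ignores perspective correction (so the derivatives are constant across the triangle), and all names are hypothetical; it is not the shader code from this PR.

```rust
/// Hypothetical result type: screen-space barycentrics of a pixel and their x/y derivatives.
#[derive(Debug, Clone, Copy)]
pub struct Barycentrics {
    pub value: [f32; 3],
    pub ddx: [f32; 3],
    pub ddy: [f32; 3],
}

/// 2D cross product of (a - q) and (b - q).
fn edge(a: [f32; 2], b: [f32; 2], q: [f32; 2]) -> f32 {
    (a[0] - q[0]) * (b[1] - q[1]) - (a[1] - q[1]) * (b[0] - q[0])
}

/// Barycentrics from edge functions. The derivatives are constant over the
/// triangle because this ignores perspective correction (affine interpolation).
pub fn barycentrics(v: [[f32; 2]; 3], pixel: [f32; 2]) -> Barycentrics {
    let area = edge(v[1], v[2], v[0]); // twice the signed triangle area
    let mut out = Barycentrics { value: [0.0; 3], ddx: [0.0; 3], ddy: [0.0; 3] };
    for i in 0..3 {
        // Weight i comes from the edge opposite vertex i.
        let (a, b) = (v[(i + 1) % 3], v[(i + 2) % 3]);
        out.value[i] = edge(a, b, pixel) / area;
        // d(edge)/dx = a.y - b.y and d(edge)/dy = b.x - a.x.
        out.ddx[i] = (a[1] - b[1]) / area;
        out.ddy[i] = (b[0] - a[0]) / area;
    }
    out
}

/// Interpolates a 2D attribute (e.g. UV) and applies the chain rule to get its
/// screen-space derivatives, as texture sampling needs for mip selection.
pub fn interpolate_uv(bary: &Barycentrics, uv: [[f32; 2]; 3]) -> ([f32; 2], [f32; 2], [f32; 2]) {
    let mut value = [0.0; 2];
    let mut ddx = [0.0; 2];
    let mut ddy = [0.0; 2];
    for i in 0..3 {
        for c in 0..2 {
            value[c] += bary.value[i] * uv[i][c];
            ddx[c] += bary.ddx[i] * uv[i][c];
            ddy[c] += bary.ddy[i] * uv[i][c];
        }
    }
    (value, ddx, ddy)
}
```

A real visbuffer resolve also needs perspective-correct barycentrics, whose derivatives vary across the triangle; the Frogfood shader linked earlier in this thread handles that case.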

Contributor

You added a new feature but didn't update the readme. Please run cargo run -p build-templated-pages -- update features to update it, and commit the file change.

@JMS55 JMS55 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 23, 2024
@alice-i-cecile alice-i-cecile added this pull request to the merge queue Mar 25, 2024
Merged via the queue into bevyengine:main with commit 4f20faa Mar 25, 2024
29 of 30 checks passed
@JMS55 JMS55 mentioned this pull request Mar 26, 2024
39 tasks
@atlv24 atlv24 mentioned this pull request Apr 15, 2024
6 tasks
Labels
A-Rendering Drawing game state to the screen C-Needs-Release-Note Work that should be called out in the blog due to impact C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective Ask for help! S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
Status: Responded