Optimize render asset extraction by reusing heap allocations#23758
Optimize render asset extraction by reusing heap allocations#23758alice-i-cecile merged 2 commits intobevyengine:mainfrom
Conversation
allocations in asset extraction
andriyDev
left a comment
There was a problem hiding this comment.
Wow this makes quite the positive improvement in that benchmark! Nice!
As an aside, I would love a new system param which is like Scratch<T> which is a local that automatically clears when the system runs. It feels so good to reuse those allocations, and this is exactly the motivation!
Would love this - I've been wanting something like this for a long time! Can't remember if I have an open issue for it. It should ideally also keep track of occupancy, and shrink when it's over-allocated for several frames. Iirc I have something like this implemented in the meshlet code, but manually, instead of a system param. |
|
Won't this increase memory usage? Extracting assets seems to be low-frequency and modifying a large number of assets per frame appears to be uncommon, so I think reusing memory here isn't very important. |
|
We could put it behind a flag in the plugin? It's not super low frequency, but it's not high frequency either - it really depends on the game. And asset streaming later on will further complicate things (no idea what that's going to look like). |
|
I'd prefer to avoid more feature flags here unless there's a really strong tradeoff to be made here. My intuition is that reusing allocations is the right choice here, but I wanted to discuss things a bit rather than quietly ignoring the objection. IMO we should do this because it makes rendering performance more stable, both between games and across frames. |
|
If memory usage is a concern we could shrink the collections to some reasonable limit after use. |
|
…ine#23758) # Objective - Address CPU overhead in the render asset extraction phase. - Prior to this change, `extract_render_asset` and `extract_erased_render_asset` created several fresh `HashSet` and `Vec` collections every frame. - Furthermore, these collections were being sent to the render world via `Commands::insert_resource`, which causes the previous frame's resource (and its allocated capacity) to be dropped and deallocated. - This results in unnecessary pressure on the global allocator and lower CPU cache efficiency due to constant memory re-initialization. ## Solution - Converted the temporary `needs_extracting` `HashSet` into a `Local` system parameter. This allows the collection to retain its capacity between system runs. - Switched from using `Commands::insert_resource` to using `ResMut<ExtractedAssets<A>>`. By calling `.clear()` on the existing resource instead of replacing it, we reuse the heap-allocated buffers for the `extracted` Vec and the `removed`, `modified`, and `added` HashSets. ## Testing - Introduced a micro-benchmark simulating N unique asset modifications per frame to trigger the extraction logic. ``` cargo bench -p benches --bench render -- extract_render_asset ``` --- ## Showcase ### Criterion Table ``` extract_render_asset/allocations/10 time: [8.6633 µs 8.7004 µs 8.7432 µs] thrpt: [1.1437 Melem/s 1.1494 Melem/s 1.1543 Melem/s] change: time: [−0.3467% +0.2334% +0.8050%] (p = 0.42 > 0.05) thrpt: [−0.7986% −0.2329% +0.3479%] No change in performance detected. extract_render_asset/allocations/100 time: [11.256 µs 11.297 µs 11.340 µs] thrpt: [8.8183 Melem/s 8.8519 Melem/s 8.8838 Melem/s] change: time: [−13.487% −12.672% −11.536%] (p = 0.00 < 0.05) thrpt: [+13.041% +14.511% +15.590%] Performance has improved. extract_render_asset/allocations/1000 time: [34.867 µs 35.037 µs 35.222 µs] thrpt: [28.392 Melem/s 28.541 Melem/s 28.681 Melem/s] change: time: [−42.567% −41.359% −40.454%] (p = 0.00 < 0.05) thrpt: [+67.938% +70.530% +74.116%] Performance has improved. extract_render_asset/allocations/10000 time: [302.54 µs 305.30 µs 308.22 µs] thrpt: [32.445 Melem/s 32.755 Melem/s 33.053 Melem/s] change: time: [−39.411% −37.491% −35.597%] (p = 0.00 < 0.05) thrpt: [+55.272% +59.976% +65.047%] Performance has improved. extract_render_asset/allocations/100000 time: [4.7822 ms 4.9347 ms 5.1028 ms] thrpt: [19.597 Melem/s 20.265 Melem/s 20.911 Melem/s] change: time: [−9.8539% −6.8588% −3.4015%] (p = 0.00 < 0.05) thrpt: [+3.5212% +7.3639% +10.931%] Performance has improved. ```
Objective
extract_render_assetandextract_erased_render_assetcreated several freshHashSetandVeccollections every frame.Commands::insert_resource, which causes the previous frame's resource (and its allocated capacity) to be dropped and deallocated.Solution
needs_extractingHashSetinto aLocalsystem parameter. This allows the collection to retain its capacity between system runs.Commands::insert_resourceto usingResMut<ExtractedAssets<A>>. By calling.clear()on the existing resource instead of replacing it, we reuse the heap-allocated buffers for theextractedVec and theremoved,modified, andaddedHashSets.Testing
Showcase
Criterion Table