Amortize the cost of freeing entities #22658
Conversation
Ah, this is playing with the order entities are allocated in, which some tests still depend on. I'll fix that real quick.
chescock left a comment
Looks good! We really need doc comments for `quick_free`, and I'm going to wait for those before approving, but the rest of my comments are just nits and possibilities.
```rust
/// This is in contrast to the [`RemoteAllocator`], which may be cloned freely.
pub(super) struct Allocator {
    shared: Arc<SharedAllocator>,
    quick_free: ArrayVec<Entity, 64>,
```
Why not use an ordinary `Vec`? You can use `Vec::with_capacity(64)` and then check `quick_free.len() == quick_free.capacity()`. Storing these on a separate heap allocation instead of inline in `World` might even be better for cache locality, since most uses of `World` won't need this data.
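For illustration, a rough sketch of the `Vec`-based variant being suggested here. The types mirror the diff above; the `flush_to_shared` helper is hypothetical, standing in for whatever the shared allocator actually exposes:

```rust
// Sketch of the suggestion: an ordinary Vec, allocated once and never grown.
// `flush_to_shared` is a hypothetical name for the shared-allocator handoff.
pub(super) struct Allocator {
    shared: Arc<SharedAllocator>,
    quick_free: Vec<Entity>, // created with Vec::with_capacity(64)
}

impl Allocator {
    fn free(&mut self, entity: Entity) {
        // Since the Vec is never grown past its initial capacity,
        // len() == capacity() plays the role of ArrayVec::is_full.
        if self.quick_free.len() == self.quick_free.capacity() {
            // drain(..) empties the Vec but keeps its capacity.
            self.shared.flush_to_shared(self.quick_free.drain(..));
        }
        self.quick_free.push(entity);
    }
}
```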
I was concerned about cache locality, but you're not wrong. IMO, we either do a bigger heap `Vec` or a smaller/similarly sized `ArrayVec`. I'm totally fine with either; it just depends on what we're more interested in optimizing.
Yeah, I'm also fine with either! This is just another "prefer the simpler thing" case, where `std` types like `Vec` feel simpler than crates like `arrayvec`.
I just tried boxing it and doubling its capacity:
- Much worse perf for freeing fewer than 64 entities.
- Much better perf for freeing between 64 and 128 entities.
- Broke even at 1,000 entities.

So I don't see a strong case against boxing it, so it's now boxed. I did also try a plain `Vec`, but that made performance ~50% worse, unfortunately.
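For anyone following along, the difference being benchmarked here is only where the buffer lives. A rough, hypothetical sketch of the two layouts (the 128 capacity is the "doubled" size from this comment, not necessarily what was merged, and `u64` stands in for `Entity`):

```rust
use arrayvec::ArrayVec;

// Inline: the whole 512-byte buffer lives directly inside `World`.
struct InlineFreeList {
    quick_free: ArrayVec<u64, 64>, // stand-in for `Entity`
}

// Boxed: `World` only holds a pointer; the buffer lives on the heap.
struct BoxedFreeList {
    quick_free: Box<ArrayVec<u64, 128>>,
}

fn main() {
    println!("inline: {} bytes", std::mem::size_of::<InlineFreeList>());
    println!("boxed:  {} bytes", std::mem::size_of::<BoxedFreeList>());
}
```

Boxing shrinks the field in `World` to a single pointer at the cost of an extra indirection on the free path.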
alice-i-cecile left a comment
No objections in principle and this is generally well-made. More docs are always good though, and @chescock's review suggestions are unsurprisingly excellent.
# Objective

The biggest drawback of bevyengine#18670 was that it made freeing `Entity`s back to the allocator 4x slower. That meant a 20% regression in despawn performance. This PR vastly improves the performance of the entity allocator for freeing entities.

## Solution

Add a local free list in place in the main entity allocator. This is an `ArrayVec` called `quick_free`. When an entity is freed, add it to `quick_free`. If it is full, flush the array to the full shared allocator.

Currently the array has a capacity of 64, taking 512 bytes. Since this is directly included in the already massive `World` type, I don't think this is an issue, and I would guess boxing it would hurt performance here. It also means that there will be at most 64 freed entities that simply can't be allocated. This reduces the worst-case maximum entity count from 4,294,967,296 to 4,294,967,232 (big deal).

This also adds a new `free_many` function that is very fast compared to freeing entities one by one.

## Testing

- CI and benches.

---

## Showcase

Here are some rough benchmarks on my M2 Max:

```txt
group                                       post_quick_free_list           pre_quick_free_list            pre_remote_reservation
-----                                       --------------------           -------------------            ----------------------
entity_allocator_free/10000_entities        1.00    29.7±0.48µs ? ?/sec    1.31    38.9±0.97µs ? ?/sec    1.00    29.8±0.85µs ? ?/sec
entity_allocator_free/100_entities          1.00  393.3±26.21ns ? ?/sec    1.35  531.8±26.34ns ? ?/sec    1.14  446.7±11.32ns ? ?/sec
entity_allocator_free/1_entities            1.00     4.6±2.17ns ? ?/sec    42.27 195.3±32.49ns ? ?/sec    4.25    19.6±8.67ns ? ?/sec
entity_allocator_free_bulk/10000_entities   1.00     8.7±0.36µs ? ?/sec
entity_allocator_free_bulk/100_entities     1.00  240.9±31.01ns ? ?/sec
entity_allocator_free_bulk/1_entities       1.00  206.8±39.95ns ? ?/sec
```

Looking at the cost of freeing 10,000 entities, this makes the new allocator exactly as fast as the pre-bevyengine#18670 one, and 30% faster than main. The new `free_many` takes 8.7µs to free 10,000 entities where the optimized `free` takes 29.7µs, so another big win there.

This should make up for the 20% regression in despawn performance. It might be even faster than pre-bevyengine#18670 if we increase 64 to 128 or so, but I think that's unnecessary. This could also greatly improve performance for despawning scenes if we can find a way to make use of `free_many`, but that's a different task.
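For readers skimming, a minimal sketch of the quick-free pattern described in the Solution section above. The `SharedAllocator::free_many` call and the exact buffer handling are assumptions for illustration, not the code that was merged:

```rust
// Sketch only: amortizing frees through a small local buffer, as described
// in the Solution section. `SharedAllocator::free_many` and the buffer
// details are assumptions for illustration.
impl Allocator {
    /// Buffer a freed entity locally; only touch the shared allocator when
    /// the buffer fills up, amortizing its cost over many frees.
    pub fn free(&mut self, entity: Entity) {
        if self.quick_free.is_full() {
            // One flush pays the shared-allocator cost for 64 frees.
            self.shared.free_many(self.quick_free.drain(..));
        }
        self.quick_free.push(entity);
    }

    /// Free a whole batch at once, handing it straight to the shared
    /// allocator instead of going through the local buffer one by one.
    pub fn free_many(&mut self, entities: impl IntoIterator<Item = Entity>) {
        self.shared.free_many(entities);
    }
}
```

The "at most 64 freed entities that simply can't be allocated" caveat in the description comes from entities parked in this local buffer between flushes.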