Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Extract parallel queue abstraction #7348

Merged
merged 13 commits into from Feb 19, 2024

Conversation

james7132
Copy link
Member

@james7132 james7132 commented Jan 24, 2023

Objective

There's a repeating pattern of ThreadLocal<Cell<Vec<T>>> which is very useful for low overhead, low contention multithreaded queues that have cropped up in a few places in the engine. This pattern is surprisingly useful when building deferred mutation across multiple threads, as noted by it's use in ParallelCommands.

However, ThreadLocal<Cell<Vec<T>>> is not only a mouthful, it's also hard to ensure the thread-local queue is replaced after it's been temporarily removed from the Cell.

Solution

Wrap the pattern into bevy_utils::Parallel<T> which codifies the entire pattern and ensures the user follows the contract. Instead of fetching indivdual cells, removing the value, mutating it, and replacing it, Parallel::get returns a ParRef<'a, T> which contains the temporarily removed value and a reference back to the cell, and will write the mutated value back to the cell upon being dropped.

I would like to use this to simplify the remaining part of #4899 that has not been adopted/merged.


Changelog

TODO

@james7132 james7132 added C-Code-Quality A section of code that is hard to understand or change A-Utils Utility functions and types labels Jan 24, 2023
@alice-i-cecile alice-i-cecile added the D-Complex Quite challenging from either a design or technical perspective. Ask for help! label Jan 24, 2023
@alice-i-cecile
Copy link
Member

#6817 similarly abstracts the deferred mutation pattern, but seems to do so in a fully compatible way.

@hymm hymm self-requested a review January 24, 2023 16:31
@james7132
Copy link
Member Author

#6817 similarly abstracts the deferred mutation pattern, but seems to do so in a fully compatible way.

The main difference here is that this is a standalone type. It's used in check_visibility for internal parallelism, and I want to use it inside a component.

@TheRawMeatball
Copy link
Member

Note: Parallel::get can still be called multiple times, producing multiple aliased ParRefs and still produce improper results, which may be a bit of a footgun.

This could be avoided by replacing the get method with a scope method that takes a impl FnOnce(&mut T) similar to how the existing ParallelCommands impl works. I personally think this would be worth the extra boilerplate, especially considering how it also lets us avoid having yet another odd smart pointer with subpar ergonomics on top of removing this footgun.

@james7132
Copy link
Member Author

james7132 commented Jan 24, 2023

This could be avoided by replacing the get method with a scope method that takes a impl FnOnce(&mut T) similar to how the existing ParallelCommands impl works. I personally think this would be worth the extra boilerplate, especially considering how it also lets us avoid having yet another odd smart pointer with subpar ergonomics on top of removing this footgun.

This makes it difficult to work with several of these together, with a triple nested scope in that case. We can have both, but only having the scope is probably not 100% viable.

@TheRawMeatball
Copy link
Member

Fair enough, didn't consider such a case.

Copy link
Member

@JoJoJet JoJoJet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a very useful abstraction, it's a nice way of making thread locals usable. I like the name too -- it feels like a very natural way to describe a set of values stored in parallel across threads.

I am a bit wary of ParRef<>, though. Requiring the drop impl to apply changes to a smart pointer feels strange and overengineered, and it is potentially footgunny as Nile mentioned. Perhaps we should replace .get() with .scope(), and just take the L on nested scopes.

Copy link
Contributor

@hymm hymm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally in favor of this. I was considering a pr to do this exact thing. But I'd like to see some benches before approving.

crates/bevy_utils/src/parallel_queue.rs Outdated Show resolved Hide resolved
@james7132 james7132 added this to the 0.11 milestone Feb 27, 2023
@james7132 james7132 requested a review from JoJoJet April 11, 2023 00:32
Copy link
Member

@JoJoJet JoJoJet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much nicer now that the API is based on scope. I'm on board now, but there's a few API details I'm not sure about.

crates/bevy_utils/src/parallel_queue.rs Outdated Show resolved Hide resolved
crates/bevy_utils/src/parallel_queue.rs Outdated Show resolved Hide resolved
@hymm hymm self-requested a review April 16, 2023 02:27
@cart cart modified the milestones: 0.11, 0.12 Jul 7, 2023
@cart
Copy link
Member

cart commented Jul 7, 2023

Moving to 0.12 as there are still open conversations.

crates/bevy_utils/src/parallel_queue.rs Show resolved Hide resolved
crates/bevy_utils/src/parallel_queue.rs Outdated Show resolved Hide resolved
crates/bevy_ecs/src/system/commands/parallel_scope.rs Outdated Show resolved Hide resolved
crates/bevy_utils/src/parallel_queue.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@hymm hymm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main thing for me to approve is just to see a bench that check_visibility still performs the same.

github-merge-queue bot pushed a commit that referenced this pull request Oct 1, 2023
# Objective
`extract_meshes` can easily be one of the most expensive operations in
the blocking extract schedule for 3D apps. It also has no fundamentally
serialized parts and can easily be run across multiple threads. Let's
speed it up by parallelizing it!

## Solution
Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by #7348 in
conjunction with `Query::par_iter` to build a set of thread-local
queues, and collect them after going wide.

## Performance
Using `cargo run --profile stress-test --features trace_tracy --example
many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:


![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including #9950's changes, so this may actually
result in even faster speedups once that's merged in.
regnarock pushed a commit to regnarock/bevy that referenced this pull request Oct 13, 2023
# Objective
`extract_meshes` can easily be one of the most expensive operations in
the blocking extract schedule for 3D apps. It also has no fundamentally
serialized parts and can easily be run across multiple threads. Let's
speed it up by parallelizing it!

## Solution
Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in
conjunction with `Query::par_iter` to build a set of thread-local
queues, and collect them after going wide.

## Performance
Using `cargo run --profile stress-test --features trace_tracy --example
many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:


![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including bevyengine#9950's changes, so this may actually
result in even faster speedups once that's merged in.
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective
`extract_meshes` can easily be one of the most expensive operations in
the blocking extract schedule for 3D apps. It also has no fundamentally
serialized parts and can easily be run across multiple threads. Let's
speed it up by parallelizing it!

## Solution
Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in
conjunction with `Query::par_iter` to build a set of thread-local
queues, and collect them after going wide.

## Performance
Using `cargo run --profile stress-test --features trace_tracy --example
many_cubes`. Yellow is this PR. Red is main.

`extract_meshes`:


![image](https://github.com/bevyengine/bevy/assets/3137680/9d45aa2e-3cfa-4fad-9c08-53498b51a73b)

An average reduction from 1.2ms to 770us is seen, a 41.6% improvement.

Note: this is still not including bevyengine#9950's changes, so this may actually
result in even faster speedups once that's merged in.
Co-authored-by: Joseph <21144246+JoJoJet@users.noreply.github.com>
@james7132 james7132 added this to the 0.14 milestone Feb 12, 2024
@james7132
Copy link
Member Author

james7132 commented Feb 12, 2024

main thing for me to approve is just to see a bench that check_visibility still performs the same.

Comparing many_cubes on main vs this PR, it does seem like we're not losing anything. There shouldn't be a performance gain of any kind from this PR, so I'm chalking up the difference to a change in execution environment between the runs. Either way, the difference isn't too big to drive suspicion here IMO. (yellow is this PR, red is main)

image

@hymm
Copy link
Contributor

hymm commented Feb 14, 2024

Ran it myself just to check, since the gap seems sort of large. The 2 runs ended up almost identical for me.
image

crates/bevy_utils/src/parallel_queue.rs Outdated Show resolved Hide resolved
let cell = self.locals.get_or_default();
let mut value = cell.take();
let ret = f(&mut value);
cell.set(value);
Copy link
Member

@JoJoJet JoJoJet Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a panic occurs in the scope and gets caught, any commands added during the scope will be lost. This is probably desirable but I think it's worth mentioning in a comment.

james7132 and others added 2 commits February 18, 2024 01:08
Co-authored-by: Joseph <21144246+JoJoJet@users.noreply.github.com>
@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Feb 19, 2024
Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice little bit of abstraction :)

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Feb 19, 2024
Merged via the queue into bevyengine:main with commit e34fb68 Feb 19, 2024
27 of 28 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Feb 25, 2024
# Objective
#7348 added `bevy_utils::Parallel` and replaced the usage of the
`ThreadLocal<Cell<Vec<...>>>` in `check_visibility`, but we were also
using it in `extract_meshes`.

## Solution
Refactor the system to use `Parallel` instead.
msvbg pushed a commit to msvbg/bevy that referenced this pull request Feb 26, 2024
# Objective
There's a repeating pattern of `ThreadLocal<Cell<Vec<T>>>` which is very
useful for low overhead, low contention multithreaded queues that have
cropped up in a few places in the engine. This pattern is surprisingly
useful when building deferred mutation across multiple threads, as noted
by it's use in `ParallelCommands`.

However, `ThreadLocal<Cell<Vec<T>>>` is not only a mouthful, it's also
hard to ensure the thread-local queue is replaced after it's been
temporarily removed from the `Cell`.

## Solution
Wrap the pattern into `bevy_utils::Parallel<T>` which codifies the
entire pattern and ensures the user follows the contract. Instead of
fetching indivdual cells, removing the value, mutating it, and replacing
it, `Parallel::get` returns a `ParRef<'a, T>` which contains the
temporarily removed value and a reference back to the cell, and will
write the mutated value back to the cell upon being dropped.

I would like to use this to simplify the remaining part of bevyengine#4899 that
has not been adopted/merged.

---

## Changelog
TODO

---------

Co-authored-by: Joseph <21144246+JoJoJet@users.noreply.github.com>
msvbg pushed a commit to msvbg/bevy that referenced this pull request Feb 26, 2024
# Objective
bevyengine#7348 added `bevy_utils::Parallel` and replaced the usage of the
`ThreadLocal<Cell<Vec<...>>>` in `check_visibility`, but we were also
using it in `extract_meshes`.

## Solution
Refactor the system to use `Parallel` instead.
msvbg pushed a commit to msvbg/bevy that referenced this pull request Feb 26, 2024
# Objective
There's a repeating pattern of `ThreadLocal<Cell<Vec<T>>>` which is very
useful for low overhead, low contention multithreaded queues that have
cropped up in a few places in the engine. This pattern is surprisingly
useful when building deferred mutation across multiple threads, as noted
by it's use in `ParallelCommands`.

However, `ThreadLocal<Cell<Vec<T>>>` is not only a mouthful, it's also
hard to ensure the thread-local queue is replaced after it's been
temporarily removed from the `Cell`.

## Solution
Wrap the pattern into `bevy_utils::Parallel<T>` which codifies the
entire pattern and ensures the user follows the contract. Instead of
fetching indivdual cells, removing the value, mutating it, and replacing
it, `Parallel::get` returns a `ParRef<'a, T>` which contains the
temporarily removed value and a reference back to the cell, and will
write the mutated value back to the cell upon being dropped.

I would like to use this to simplify the remaining part of bevyengine#4899 that
has not been adopted/merged.

---

## Changelog
TODO

---------

Co-authored-by: Joseph <21144246+JoJoJet@users.noreply.github.com>
msvbg pushed a commit to msvbg/bevy that referenced this pull request Feb 26, 2024
# Objective
bevyengine#7348 added `bevy_utils::Parallel` and replaced the usage of the
`ThreadLocal<Cell<Vec<...>>>` in `check_visibility`, but we were also
using it in `extract_meshes`.

## Solution
Refactor the system to use `Parallel` instead.
@james7132 james7132 deleted the parallel branch March 10, 2024 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Utils Utility functions and types C-Code-Quality A section of code that is hard to understand or change D-Complex Quite challenging from either a design or technical perspective. Ask for help! S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants