
Run the multi-threaded executor at the end of each system task. #11906

Merged
11 commits merged into bevyengine:main on Feb 26, 2024

Conversation

chescock
Contributor

Objective

The multi-threaded executor currently runs in a dedicated task on a single thread. When a system finishes running, it needs to notify that task and wait for the thread to be available and running before the executor can process the completion.

See #8304

Solution

Run the multi-threaded executor at the end of each system task. This allows it to run immediately instead of needing to wait for the main thread to wake up. Move the mutable executor state into a separate struct and wrap it in a mutex so it can be shared among the worker threads.

While this should be faster in theory, I don't actually know how to measure the performance impact myself.
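
As a rough, self-contained sketch of the idea (placeholder types and names, nothing like the real executor's complexity): each worker locks the shared state and ticks the executor itself as soon as its system finishes, instead of signaling a dedicated executor task and waiting for its thread to wake up.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Placeholder for the mutable executor state that this PR moves behind a mutex.
struct ExecutorState {
    num_completed: usize,
    num_systems: usize,
}

// Called at the end of each system task, on whichever worker thread just
// finished: lock the shared state and process the completion immediately.
fn run_system_task(state: &Mutex<ExecutorState>, system_index: usize) {
    // ... the system itself would run here ...
    println!("system {system_index} finished");

    let mut state = state.lock().unwrap();
    state.num_completed += 1;
    // The real executor would mark dependents ready and spawn new system
    // tasks at this point; this sketch only tracks completion counts.
    if state.num_completed == state.num_systems {
        println!("all systems finished");
    }
}

fn main() {
    let state = Arc::new(Mutex::new(ExecutorState { num_completed: 0, num_systems: 4 }));
    let workers: Vec<_> = (0..4)
        .map(|i| {
            let state = Arc::clone(&state);
            thread::spawn(move || run_system_task(&state, i))
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
}
```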

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Feb 16, 2024
@NthTensor
Contributor

I will profile this later today. Are you sure the formatting of your SAFETY comments is correct?

@james7132 james7132 added the D-Complex Quite challenging from either a design or technical perspective. Ask for help! label Feb 16, 2024
@hymm
Contributor

hymm commented Feb 16, 2024

CI failure on run-examples looks real.

@chescock
Contributor Author

I think I understand the bug that caused the `assertion failed: state.ready_systems.is_clear()` CI failure. When a system is skipped by its run conditions and marks its dependents as ready, those dependent systems aren't spawned by that call to `spawn_system_tasks`. If that was the last ready system, the executor stops spawning and ends early.

The existing code handled that case by running `spawn_system_tasks` again immediately if the `if self.num_running_systems > 0` check failed. I hadn't realized that was the purpose of that check.
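
As a toy illustration of that re-check (made-up names and a heavily simplified model, not the actual executor code): if a spawning pass produced no running systems but work remains, because a skipped system only released its dependents during the pass, the executor has to spawn again immediately.

```rust
// Toy model only: a linear chain where system i+1 depends on system i.
struct Executor {
    deps: Vec<usize>, // remaining unfinished dependencies per system
    skip: Vec<bool>,  // whether a run condition skips the system
    done: Vec<bool>,
    num_running_systems: usize,
    num_completed_systems: usize,
}

impl Executor {
    fn spawn_system_tasks(&mut self) {
        // Work from a snapshot of the systems that are ready right now;
        // dependents released during this pass are NOT spawned here.
        let ready: Vec<usize> = (0..self.deps.len())
            .filter(|&i| !self.done[i] && self.deps[i] == 0)
            .collect();
        for i in ready {
            self.done[i] = true;
            if self.skip[i] {
                // Skipped by its run condition: counts as completed and
                // releases its dependent (toy model: the next system).
                self.num_completed_systems += 1;
                if i + 1 < self.deps.len() {
                    self.deps[i + 1] -= 1;
                }
            } else {
                self.num_running_systems += 1; // pretend we spawned a task
            }
        }
    }

    fn tick(&mut self) {
        loop {
            self.spawn_system_tasks();
            if self.num_running_systems > 0 {
                break; // running tasks will drive the executor when they finish
            }
            if self.num_completed_systems == self.deps.len() {
                break; // everything ran or was skipped
            }
            // No tasks running but work remains: a skipped system released its
            // dependents during the call, so spawn again immediately.
        }
    }
}

fn main() {
    // System 1 depends on system 0, and system 0 is skipped by its run condition.
    let mut executor = Executor {
        deps: vec![0, 1],
        skip: vec![true, false],
        done: vec![false, false],
        num_running_systems: 0,
        num_completed_systems: 0,
    };
    executor.tick();
    assert_eq!(executor.num_running_systems, 1);
}
```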

I'm not sure what the cleanest way to fix that is. I should have time to work on it tonight, but not before then. Sorry!

@chescock
Contributor Author

@NthTensor

> I will profile this later today.

Thanks, but it's not actually working yet, so no need just yet. The CI caught a race condition I had missed.

> Are you sure the formatting of your SAFETY comments is correct?

Nope, not at all!

... Yup, the ones in should_run are definitely wrong. I put the new code in between the old SAFETY comment and the unsafe function call. I was confused because I didn't see an unsafe keyword there, but I guess it was all in an unsafe function.

I'll fix those. Were those the ones you meant, or are there other issues, too?

@chescock
Contributor Author

Okay, I fixed the bug the CI caught, and added a unit test for it!

("Dependencies / check-bans" seems to be failing for every PR and doesn't mention anything I changed, so I plan to ignore that failure.)

@hymm
Contributor

hymm commented Feb 17, 2024

Seems promising. Ran the many_foxes example with `RUST_LOG=bevy_ecs::schedule::schedule=info,bevy_app=info,bevy_render=info`. This gets rid of most of the spans smaller than the schedule level.

The Main schedule is saving 200us, Extract about 70us, and Render around 70us, out of an 8ms frame time.

[screenshots: Main, Extract, and Render schedule traces]

I'm going to run more of the examples and see how things look. I think running 3d_scene will be interesting to see how the overhead is reduced. The Render schedule seeing less reduction makes me think that this helped mostly by reducing some context switching; rendering is more bottlenecked by long-running systems.

edit: Note to myself to check CPU usage, and also panic handling.

@james7132 james7132 added this to the 0.14 milestone Feb 17, 2024
@@ -202,7 +202,7 @@ impl FakeTask {
/// For more information, see [`TaskPool::scope`].
#[derive(Debug)]
pub struct Scope<'scope, 'env: 'scope, T> {
executor: &'env async_executor::LocalExecutor<'env>,
Contributor

Why did you need to change these lifetimes?

Contributor Author

The futures passed to scope.spawn may run the executor and spawn more futures, so they need &scope. That only lives for 'scope, which is shorter than 'env.

The change to this file makes the API match the one in task_pool.rs.

(The difference between the API in task_pool and single_threaded_task_pool led to some fun and confusing compiler errors. I somehow had rust-analyzer using a configuration with task_pool but cargo build using a configuration with single_threaded_task_pool, so it worked in the IDE but failed on the command line. And the lifetime errors it reports don't point at code in single_threaded_task_pool, so it took me a while to figure out that this file even existed.)

Contributor

Might be worth pulling this out of this PR, then. This is a riskier change, and if for some reason it gets reverted we should still keep the changes to this file.

Member

@james7132 james7132 left a comment

A more complete review is forthcoming; I'm particularly concerned about the soundness of the changes, so I may need some time to go through it more thoroughly.

With that said, I am seeing similar performance gains to what @hymm is seeing, so this is definitely looking pretty promising by itself.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
@james7132 james7132 removed the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Feb 19, 2024
Member

@james7132 james7132 left a comment

Generally looks good to me, though the potential for aliasing on Conditions is a bit concerning.

Ideally, we wouldn't yield back to the async executor or the OS, but we can leave that for another PR.

@@ -483,12 +555,14 @@ impl MultiThreadedExecutor {
}

// Evaluate the system set's conditions.
// SAFETY: We have exclusive access to ExecutorState, so no other thread is touching these conditions
Member

This is not guaranteed given the function signature and the safety invariants; this likely needs to be propagated through the invariants of this function.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
@chescock
Contributor Author

I added two commits to try to address @james7132's review comments, then rebased the whole thing to resolve merge conflicts.

Member

@james7132 james7132 left a comment

This likely requires another check to ensure the performance hasn't degraded from the extra mutex. We can come back and replace it with the old SyncUnsafeCells if we can reasonably prove out the safety invariants, but this otherwise looks good to me.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
systems,
mut conditions,
} = SyncUnsafeSchedule::new(schedule);
let environment = &Environment::new(self, schedule, world);
Member

Suggested change:
- let environment = &Environment::new(self, schedule, world);
+ let environment = Environment::new(self, schedule, world);

Is this borrow needed?

Contributor Author

`Context` needs a borrow because `Environment` owns the `Mutex<Conditions>`, and so that we only copy one pointer into each task instead of the whole environment. And doing the borrow here avoids having to spell out `environment: &environment` when constructing the `Context`.
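
A small sketch of that shape, with made-up minimal types rather than the real ones: `Context` stores a shared reference, so each spawned task copies a single pointer, and because `environment` is already a reference the struct-literal field shorthand works.

```rust
// Made-up minimal types for illustration only.
struct Environment {
    // In the real code this owns things like the Mutex<Conditions>.
    schedule_name: String,
}

#[derive(Clone, Copy)]
struct Context<'a> {
    environment: &'a Environment,
}

fn main() {
    // Borrowing here lets the field shorthand below work; with an owned value
    // we would have to write `environment: &environment` instead.
    let environment = &Environment { schedule_name: "Main".to_string() };
    let context = Context { environment };

    // Each worker task copies `context`, i.e. one pointer-sized value.
    let task = move || println!("ticking the executor for {}", context.environment.schedule_name);
    task();
}
```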

Member

@james7132 james7132 left a comment

Checked this against main; the new mutex does seem to eat into the improvements we saw earlier. Tested on many_foxes. Yellow is this PR, red is main.

[screenshots: Main and Render schedule comparisons]

There are still notable gains however. As @hymm noted, this most heavily impacts schedules with primarily small systems with little work to do.

[screenshots: First, PreUpdate, and Last schedule traces]

What's interesting here is that there are almost 4x the number of calls into the multi-threaded executor, but the overall distribution of time spent is significantly improved: on average the time spent is cut by 50%, and most of the long-tail instances that took 100+us per run are eliminated.

[screenshot: distribution of executor run times]


Overall, LGTM. We can try to remove the mutex and see if we can make any improvements in a follow-up PR, but I'm interested in trying to minimize yielding to the task executor.

system: &BoxedSystem,
) {
// tell the executor that the system finished
self.environment
Member

This doesn't have to be done in this PR, but we have the SystemResult here and may be able to lock the executor directly. Sending it through the completion queue when we could just pass it in via a function argument may be wasteful and may contribute to contention on the queue.

Perhaps we could add the result to the queue only if we failed to get the lock on the executor state.

Member

Nevermind, just tried this myself, seems to deadlock.

Contributor Author

Yeah, there's a race condition if one thread fails to get the lock, then the other thread exits the executor, and then the first thread pushes to the queue.

chescock and others added 6 commits February 22, 2024 13:32
This allows it to run immediately instead of needing to wait for the main thread to wake up.
Move the mutable executor state into a separate struct and wrap it in a mutex so it can be shared among the worker threads.
Co-authored-by: James Liu <contact@jamessliu.com>
@chescock
Contributor Author

Rebased to fix merge conflicts.

> the new mutex does seem to eat into the improvements we saw earlier.

Oh, wow, I wasn't expecting an uncontended mutex to actually cost anything! One option to get back to one mutex with only safe code would be to create a `Mutex<(&mut ExecutorState, Conditions<'_>)>`, but you'd have to pull the rest of `ExecutorState` into another struct to split the borrow. (It doesn't sound like that's worth doing in this PR, though.)
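
A rough sketch of that shape, with placeholder types standing in for `ExecutorState` and `Conditions` (the real borrow-splitting would be more involved):

```rust
use std::sync::Mutex;

// Placeholder types; the real ExecutorState and Conditions are more involved.
struct ExecutorState {
    num_completed: usize,
}

struct Conditions<'a> {
    names: &'a mut [String],
}

fn run_schedule(state: &mut ExecutorState, condition_names: &mut [String]) {
    // Put the mutable state and the borrowed conditions behind one mutex, so
    // worker tasks can lock both together using only safe code.
    let shared = Mutex::new((state, Conditions { names: condition_names }));

    // A worker task would do something like this when a system finishes:
    let mut guard = shared.lock().unwrap();
    let (state, conditions) = &mut *guard;
    state.num_completed += 1;
    println!("evaluated {} conditions", conditions.names.len());
}

fn main() {
    let mut state = ExecutorState { num_completed: 0 };
    let mut conditions = vec!["is_paused".to_string()];
    run_schedule(&mut state, &mut conditions);
    println!("completed: {}", state.num_completed);
}
```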

@james7132
Member

> Oh, wow, I wasn't expecting an uncontended mutex to actually cost anything!

Even if it's not under contention, there is still going to be a syscall to make sure that on the OS side there is no issue.

> It doesn't sound like that's worth doing in this PR, though.

Agreed. These are wins even with the impact from the extra mutex, and we can incrementally improve the results in a later PR.

@alice-i-cecile alice-i-cecile added the M-Needs-Release-Note Work that should be called out in the blog due to impact label Feb 22, 2024
Contributor

@hymm hymm left a comment

Nice job. The changes are pretty conservative since it's mostly just yeeting most of the logic behind a mutex. I took the opportunity to do a pass over the safety comments. Some of them weren't quite correct before.

One thing to note is that I think this will deadlock when there aren't any taskpool threads now, since we're not awaiting in the ticking code anymore. Not sure it's relevant, but should maybe be added to the migration guide.

I want to do a little more perf testing with varying thread counts before approving. Should get to that this weekend.

sender: Sender<SystemResult>,
/// Receives system completion events.
receiver: Receiver<SystemResult>,
/// The running state, protected by a mutex so that the .
Contributor

This sentence is incomplete.

let systems = environment.systems;

let state = self.state.get_mut().unwrap();
if state.apply_final_deferred {
Contributor

I don't think apply_final_deferred needs to be in the state. We should just be able to keep it on MultiThreadedExecutor.

Contributor Author

Sure, I'll move that.

(I'm surprised that's the only one; I wasn't looking at whether the fields were used mutably, just whether they were read outside the lock. I figured there would be something else immutable in there, but even num_systems gets changed.)

*panic_payload = Some(payload);
}
}
self.tick_executor();
Contributor

Is this recursive? If it is, we should maybe have a comment, since it's not obvious. I think it might be possible to hit the recursion limit in a large enough schedule, though that's probably unlikely.

Contributor Author

No, it spawns new tasks for each system and returns.

if self.exclusive_running {
return;
}

let mut conditions = context
Contributor

Would it work to add a safety condition that no other borrows of the conditions exist, instead of using a mutex? I think that's trivially true because the conditions are only evaluated behind the mutex.

It should also be noted that evaluate_and_fold_conditions is missing a safety comment along the lines of "The conditions that are run cannot have any active borrows."

Contributor Author

That should be possible. @james7132 said "We can try to remove the mutex and see if we can make any improvements in a follow-up PR", so I'm not planning to change that in this PR.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
self.ready_systems.set(system_index, false);

// SAFETY: `can_run` returned true, which means that:
// - It must have called `update_archetype_component_access` for each run condition.
Contributor

Is calling update_archetype_component_access really a safety condition? It seems more like it just needs to be done for correctness, i.e. if you don't call it the run conditions won't evaluate correctly, but they also won't access data they're not allowed to.

This doesn't need to be fixed in this PR. There are a bunch of safety comments like this.

Member

update_archetype_component_access is used to validate that the world matches, so it needs to be called before run_unsafe if you haven't already verified that the world is the same one used to initialize the system.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
let system_meta = &self.system_task_metadata[system_index];

#[cfg(feature = "trace")]
let system_span = system_meta.system_task_span.clone();
Contributor

I wonder if we even need this span anymore. We'd see this task take longer if there was contention on the channel, but would there still be significant overhead here? What does ConcurrentQueue do if there's contention? Will it park the thread like a channel would? We should check this in tracy.

Contributor Author

Oh, I see, I made this span less useful by pulling the sending of the completion message out of it. Sorry; I was trying to exclude tick_executor() and I had put those in the same function.

The existing code was doing a non-blocking try_send on the channel, which calls ConcurrentQueue::push and then does some bookkeeping. So the thread wouldn't have parked before and won't park now. It looks like there is a busy-wait in push, though, so that may be slow under contention.
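
For reference, a tiny illustration of that non-blocking send, using the async-channel crate as a stand-in (an assumption for this example; whichever channel type the executor actually uses, the point is that try_send pushes and returns without parking the thread):

```rust
fn main() {
    // async_channel::unbounded gives an MPMC channel whose try_send/try_recv
    // never block; on an unbounded channel, try_send only fails if it's closed.
    let (sender, receiver) = async_channel::unbounded::<u32>();
    sender.try_send(42).expect("channel should be open");
    assert_eq!(receiver.try_recv().ok(), Some(42));
}
```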

Do you want any changes here? I could take the tick_executor() call out of system_completed() and then put system_completed() back in the tracing span. Or I could remove the span.

Contributor

@hymm hymm Feb 23, 2024

I think it's fine to leave as is for now. If we're not seeing much difference here versus the system run time, we should probably remove it later. I originally added the span because there were sometimes gaps between systems running, so adding this showed what was happening in those gaps.

&mut self,
scope: &Scope<'_, 'scope, ()>,
context: &Context<'_, '_, '_>,
Contributor

Stuffing all of this into Context is a nice cleanup of the function signature. If for some reason we don't merge this PR, we should still do this.


github-merge-queue bot pushed a commit that referenced this pull request Feb 24, 2024
…tch the multi-threaded version. (#12073)

# Objective

`Scope::spawn`, `Scope::spawn_on_external`, and `Scope::spawn_on_scope`
have different signatures depending on whether the `multi-threaded`
feature is enabled. The single-threaded version has a stricter signature
that prevents sending the `Scope` itself to spawned tasks.

## Solution

Changed the lifetime constraints in the single-threaded signatures from
`'env` to `'scope` to match the multi-threaded version.

This was split off from #11906.
@hymm
Contributor

hymm commented Feb 25, 2024

Ran the 3d_scene example and saw a somewhat ambiguous result for the Render schedule. The other schedules seemed faster, but rendering is often the bottleneck, so I want to be careful here.

[screenshot: Render schedule comparison]

I think the regression is just due to there no longer being the weird fast path hump, so we're probably ok here.

Tested with `RUST_LOG=bevy_ecs::schedule::schedule=info,bevy_app=info,bevy_render=info cargo run --profile stress-test --example 3d_scene -F trace_tracy`, which removes the system-level tracing spans.

Looks like I was wrong about it deadlocking when there aren't any compute threads. I added

        .add_plugins(DefaultPlugins.set(TaskPoolPlugin {
            task_pool_options: TaskPoolOptions {
                min_total_threads: 0,
                max_total_threads: 0,
                compute: TaskPoolThreadAssignmentPolicy {
                    min_threads: 0,
                    max_threads: 0,
                    percent: 0.0,
                },
                ..default()
            }
        }))

and it ran ok. I confirmed with tracy that there weren't any compute threads.

@hymm hymm added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Feb 25, 2024
@james7132 james7132 added this pull request to the merge queue Feb 25, 2024
@james7132 james7132 removed this pull request from the merge queue due to a manual request Feb 25, 2024
@james7132 james7132 added this pull request to the merge queue Feb 26, 2024
Merged via the queue into bevyengine:main with commit c4caebb Feb 26, 2024
27 of 28 checks passed
msvbg pushed commits to msvbg/bevy that referenced this pull request Feb 26, 2024 (cherry-picks of bevyengine#12073 and bevyengine#11906; the commit messages repeat the descriptions above)
NiseVoid pushed a commit to NiseVoid/bevy that referenced this pull request Jul 8, 2024 (cherry-pick of bevyengine#11906)