
Run the multi-threaded executor at the end of each system task. #11906

Merged
11 commits merged into bevyengine:main on Feb 26, 2024

Conversation

chescock
Contributor

Objective

The multi-threaded executor currently runs in a dedicated task on a single thread. When a system finishes running, it needs to notify that task and wait for the thread to be available and running before the executor can process the completion.

See #8304

Solution

Run the multi-threaded executor at the end of each system task. This allows it to run immediately instead of needing to wait for the main thread to wake up. Move the mutable executor state into a separate struct and wrap it in a mutex so it can be shared among the worker threads.

While this should be faster in theory, I don't actually know how to measure the performance impact myself.
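
As a rough, self-contained sketch of the idea (placeholder types and names, nothing like the real executor's complexity): each worker locks the shared state and ticks the executor itself as soon as its system finishes, instead of signaling a dedicated executor task and waiting for its thread to wake up.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Placeholder for the mutable executor state that this PR moves behind a mutex.
struct ExecutorState {
    num_completed: usize,
    num_systems: usize,
}

// Called at the end of each system task, on whichever worker thread just
// finished: lock the shared state and process the completion immediately.
fn run_system_task(state: &Mutex<ExecutorState>, system_index: usize) {
    // ... the system itself would run here ...
    println!("system {system_index} finished");

    let mut state = state.lock().unwrap();
    state.num_completed += 1;
    // The real executor would mark dependents ready and spawn new system
    // tasks at this point; this sketch only tracks completion counts.
    if state.num_completed == state.num_systems {
        println!("all systems finished");
    }
}

fn main() {
    let state = Arc::new(Mutex::new(ExecutorState { num_completed: 0, num_systems: 4 }));
    let workers: Vec<_> = (0..4)
        .map(|i| {
            let state = Arc::clone(&state);
            thread::spawn(move || run_system_task(&state, i))
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
}
```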

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times A-Tasks Tools for parallel and async work S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Feb 16, 2024
@NthTensor
Contributor

I will profile this later today. Are you sure the formatting of your SAFETY comments is correct?

@james7132 james7132 added the D-Complex Quite challenging from either a design or technical perspective. Ask for help! label Feb 16, 2024
@hymm
Contributor

hymm commented Feb 16, 2024

CI failure on run-examples looks real.

@chescock
Contributor Author

I think I understand the bug that caused the `assertion failed: state.ready_systems.is_clear()` CI failure. When a system is skipped by its run conditions and marks its dependents as ready, those dependent systems aren't spawned by that call to `spawn_system_tasks`. If that was the last ready system, the executor stops spawning and ends early.

The existing code handled that case by running `spawn_system_tasks` again immediately if the `if self.num_running_systems > 0` check failed. I hadn't realized that was the purpose of that check.
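
As a toy illustration of that re-check (made-up names and a heavily simplified model, not the actual executor code): if a spawning pass produced no running systems but work remains, because a skipped system only released its dependents during the pass, the executor has to spawn again immediately.

```rust
// Toy model only: a linear chain where system i+1 depends on system i.
struct Executor {
    deps: Vec<usize>, // remaining unfinished dependencies per system
    skip: Vec<bool>,  // whether a run condition skips the system
    done: Vec<bool>,
    num_running_systems: usize,
    num_completed_systems: usize,
}

impl Executor {
    fn spawn_system_tasks(&mut self) {
        // Work from a snapshot of the systems that are ready right now;
        // dependents released during this pass are NOT spawned here.
        let ready: Vec<usize> = (0..self.deps.len())
            .filter(|&i| !self.done[i] && self.deps[i] == 0)
            .collect();
        for i in ready {
            self.done[i] = true;
            if self.skip[i] {
                // Skipped by its run condition: counts as completed and
                // releases its dependent (toy model: the next system).
                self.num_completed_systems += 1;
                if i + 1 < self.deps.len() {
                    self.deps[i + 1] -= 1;
                }
            } else {
                self.num_running_systems += 1; // pretend we spawned a task
            }
        }
    }

    fn tick(&mut self) {
        loop {
            self.spawn_system_tasks();
            if self.num_running_systems > 0 {
                break; // running tasks will drive the executor when they finish
            }
            if self.num_completed_systems == self.deps.len() {
                break; // everything ran or was skipped
            }
            // No tasks running but work remains: a skipped system released its
            // dependents during the call, so spawn again immediately.
        }
    }
}

fn main() {
    // System 1 depends on system 0, and system 0 is skipped by its run condition.
    let mut executor = Executor {
        deps: vec![0, 1],
        skip: vec![true, false],
        done: vec![false, false],
        num_running_systems: 0,
        num_completed_systems: 0,
    };
    executor.tick();
    assert_eq!(executor.num_running_systems, 1);
}
```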

I'm not sure what the cleanest way to fix that is. I should have time to work on it tonight, but not before then. Sorry!

@chescock
Contributor Author

@NthTensor

> I will profile this later today.

Thanks, but it's not actually working yet, so no need just yet. The CI caught a race condition I had missed.

> Are you sure the formatting of your SAFETY comments is correct?

Nope, not at all!

... Yup, the ones in should_run are definitely wrong. I put the new code in between the old SAFETY comment and the unsafe function call. I was confused because I didn't see an unsafe keyword there, but I guess it was all in an unsafe function.

I'll fix those. Were those the ones you meant, or are there other issues, too?

@chescock
Contributor Author

Okay, I fixed the bug the CI caught, and added a unit test for it!

("Dependencies / check-bans" seems to be failing for every PR and doesn't mention anything I changed, so I plan to ignore that failure.)

@hymm
Contributor

hymm commented Feb 17, 2024

Seems promising. Ran the many_foxes example with `RUST_LOG=bevy_ecs::schedule::schedule=info,bevy_app=info,bevy_render=info`. This gets rid of most of the spans smaller than the schedule level.

The Main schedule is saving 200us, Extract about 70us, and Render around 70us, out of an 8ms frame time.

[screenshots: Main, Extract, and Render schedule traces]

I'm going to run more of the examples and see how things look. I think running 3d_scene will be interesting to see how the overhead is reduced. The Render schedule seeing less reduction makes me think that this helped mostly by reducing some context switching; rendering is more bottlenecked by long-running systems.

edit: Note to myself to check CPU usage, and also panic handling.

@james7132 james7132 added this to the 0.14 milestone Feb 17, 2024
@@ -202,7 +202,7 @@ impl FakeTask {
/// For more information, see [`TaskPool::scope`].
#[derive(Debug)]
pub struct Scope<'scope, 'env: 'scope, T> {
executor: &'env async_executor::LocalExecutor<'env>,
Contributor

Why did you need to change these lifetimes?

Contributor Author

The futures passed to scope.spawn may run the executor and spawn more futures, so they need &scope. That only lives for 'scope, which is shorter than 'env.

The change to this file makes the API match the one in task_pool.rs.

(The difference between the API in task_pool and single_threaded_task_pool led to some fun and confusing compiler errors. I somehow had rust-analyzer using a configuration with task_pool but cargo build using a configuration with single_threaded_task_pool, so it worked in the IDE but failed on the command line. And the lifetime errors it reports don't point at code in single_threaded_task_pool, so it took me a while to figure out that this file even existed.)

Contributor

Might be worth pulling this out of this PR, then. This is a riskier change, and if for some reason it gets reverted we should still keep the changes to this file.

Member

@james7132 james7132 left a comment

A more complete review is forthcoming; I'm particularly concerned about the soundness of the changes, so I may need some time to go through it more thoroughly.

With that said, I am seeing similar performance gains to what @hymm is seeing, so this is definitely looking pretty promising by itself.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
@james7132 james7132 removed the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Feb 19, 2024
Member

@james7132 james7132 left a comment

Generally looks good to me, though the potential for aliasing on Conditions is a bit concerning.

Ideally, we wouldn't yield back to the async executor or the OS, but we can leave that for another PR.

@@ -483,12 +555,14 @@ impl MultiThreadedExecutor {
}

// Evaluate the system set's conditions.
// SAFETY: We have exclusive access to ExecutorState, so no other thread is touching these conditions
Member

This is not guaranteed given the function signature and the safety invariants; this likely needs to be propagated through the invariants of this function.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
@chescock
Contributor Author

I added two commits to try to address @james7132's review comments, then rebased the whole thing to resolve merge conflicts.

Member

@james7132 james7132 left a comment

This likely requires another check to ensure the performance hasn't degraded from the extra mutex. We can come back and replace it with the old SyncUnsafeCells if we can reasonably prove out the safety invariants, but this otherwise looks good to me.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
systems,
mut conditions,
} = SyncUnsafeSchedule::new(schedule);
let environment = &Environment::new(self, schedule, world);
Member

Suggested change:
- let environment = &Environment::new(self, schedule, world);
+ let environment = Environment::new(self, schedule, world);

Is this borrow needed?

Contributor Author

`Context` needs a borrow because `Environment` owns the `Mutex<Conditions>`, and so that we only copy one pointer into each task instead of the whole environment. And doing the borrow here avoids having to spell out `environment: &environment` when constructing the `Context`.
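
A small sketch of that shape, with made-up minimal types rather than the real ones: `Context` stores a shared reference, so each spawned task copies a single pointer, and because `environment` is already a reference the struct-literal field shorthand works.

```rust
// Made-up minimal types for illustration only.
struct Environment {
    // In the real code this owns things like the Mutex<Conditions>.
    schedule_name: String,
}

#[derive(Clone, Copy)]
struct Context<'a> {
    environment: &'a Environment,
}

fn main() {
    // Borrowing here lets the field shorthand below work; with an owned value
    // we would have to write `environment: &environment` instead.
    let environment = &Environment { schedule_name: "Main".to_string() };
    let context = Context { environment };

    // Each worker task copies `context`, i.e. one pointer-sized value.
    let task = move || println!("ticking the executor for {}", context.environment.schedule_name);
    task();
}
```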

Member

@james7132 james7132 left a comment

Checked this against main; the new mutex does seem to eat into the improvements we saw earlier. Tested on many_foxes. Yellow is this PR, red is main.

[screenshots: Main and Render schedule comparisons]

There are still notable gains however. As @hymm noted, this most heavily impacts schedules with primarily small systems with little work to do.

[screenshots: First, PreUpdate, and Last schedule traces]

What's interesting here is that there are almost 4x the number of calls into the multi-threaded executor, but the overall distribution of time spent is significantly improved: on average the time spent is cut by 50%, and most of the long-tail instances that took 100+us per run are eliminated.

[screenshot: distribution of executor run times]


Overall, LGTM. We can try to remove the mutex and see if we can make any improvements in a follow-up PR, but I'm interested in trying to minimize yielding to the task executor.

system: &BoxedSystem,
) {
// tell the executor that the system finished
self.environment
Member

This doesn't have to be done in this PR, but we have the SystemResult here and may be able to lock the executor directly. Sending it through the completion queue when we could just pass it in via a function argument may be wasteful and may contribute to contention on the queue.

Perhaps we could add the result to the queue only if we failed to get the lock on the executor state.

Member

Nevermind, just tried this myself, seems to deadlock.

Contributor Author

Yeah, there's a race condition if one thread fails to get the lock, then the other thread exits the executor, and then the first thread pushes to the queue.

chescock and others added 6 commits February 22, 2024 13:32
This allows it to run immediately instead of needing to wait for the main thread to wake up.
Move the mutable executor state into a separate struct and wrap it in a mutex so it can be shared among the worker threads.
Co-authored-by: James Liu <contact@jamessliu.com>
@chescock
Contributor Author

Rebased to fix merge conflicts.

> the new mutex does seem to eat into the improvements we saw earlier.

Oh, wow, I wasn't expecting an uncontended mutex to actually cost anything! One option to get back to one mutex with only safe code would be to create a `Mutex<(&mut ExecutorState, Conditions<'_>)>`, but you'd have to pull the rest of `ExecutorState` into another struct to split the borrow. (It doesn't sound like that's worth doing in this PR, though.)
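
A rough sketch of that shape, with placeholder types standing in for `ExecutorState` and `Conditions` (the real borrow-splitting would be more involved):

```rust
use std::sync::Mutex;

// Placeholder types; the real ExecutorState and Conditions are more involved.
struct ExecutorState {
    num_completed: usize,
}

struct Conditions<'a> {
    names: &'a mut [String],
}

fn run_schedule(state: &mut ExecutorState, condition_names: &mut [String]) {
    // Put the mutable state and the borrowed conditions behind one mutex, so
    // worker tasks can lock both together using only safe code.
    let shared = Mutex::new((state, Conditions { names: condition_names }));

    // A worker task would do something like this when a system finishes:
    let mut guard = shared.lock().unwrap();
    let (state, conditions) = &mut *guard;
    state.num_completed += 1;
    println!("evaluated {} conditions", conditions.names.len());
}

fn main() {
    let mut state = ExecutorState { num_completed: 0 };
    let mut conditions = vec!["is_paused".to_string()];
    run_schedule(&mut state, &mut conditions);
    println!("completed: {}", state.num_completed);
}
```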

@james7132
Member

> Oh, wow, I wasn't expecting an uncontended mutex to actually cost anything!

Even if it's not under contention, there is still going to be a syscall to make sure that on the OS side there is no issue.

> It doesn't sound like that's worth doing in this PR, though.

Agreed. These are wins even with the impact from the extra mutex, and we can incrementally improve the results in a later PR.

@alice-i-cecile alice-i-cecile added the M-Needs-Release-Note Work that should be called out in the blog due to impact label Feb 22, 2024
Contributor

@hymm hymm left a comment

Nice job. The changes are pretty conservative since it's mostly just yeeting most of the logic behind a mutex. I took the opportunity to do a pass over the safety comments. Some of them weren't quite correct before.

One thing to note is that I think this will deadlock when there aren't any taskpool threads now, since we're not awaiting in the ticking code anymore. Not sure it's relevant, but should maybe be added to the migration guide.

I want to do a little more perf testing with varying thread counts before approving. Should get to that this weekend.

sender: Sender<SystemResult>,
/// Receives system completion events.
receiver: Receiver<SystemResult>,
/// The running state, protected by a mutex so that the .
Contributor

This sentence is incomplete.

let systems = environment.systems;

let state = self.state.get_mut().unwrap();
if state.apply_final_deferred {
Contributor

I don't think apply_final_deferred needs to be in the state. We should just be able to keep it on MultiThreadedExecutor.

Contributor Author

Sure, I'll move that.

(I'm surprised that's the only one; I wasn't looking at whether the fields were used mutably, just whether they were read outside the lock. I figured there would be something else immutable in there, but even num_systems gets changed.)

*panic_payload = Some(payload);
}
}
self.tick_executor();
Contributor

Is this recursive? If it is, we should maybe have a comment, since it's not obvious. I think it might be possible to hit the recursion limit in a large enough schedule, though that's probably unlikely.

Contributor Author

No, it spawns new tasks for each system and returns.

if self.exclusive_running {
return;
}

let mut conditions = context
Contributor

Would it work to add a safety condition that no other borrows of the conditions exist, instead of using a mutex? I think that's trivially true because the conditions are only evaluated behind the mutex.

It should also be noted that evaluate_and_fold_conditions is missing a safety comment along the lines of "The conditions that are run cannot have any active borrows."

Contributor Author

That should be possible. @james7132 said "We can try to remove the mutex and see if we can make any improvements in a follow-up PR", so I'm not planning to change that in this PR.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
self.ready_systems.set(system_index, false);

// SAFETY: `can_run` returned true, which means that:
// - It must have called `update_archetype_component_access` for each run condition.
Contributor

Is calling update_archetype_component_access really a safety condition? It seems more like it just needs to be done for correctness, i.e. if you don't call it the run conditions won't evaluate correctly, but they also won't access data they're not allowed to.

This doesn't need to be fixed in this PR. There are a bunch of safety comments like this.

Member

update_archetype_component_access is used to validate that the world matches, so it needs to be called before run_unsafe if you haven't already verified that the world is the same one used to initialize the system.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs (outdated review thread, resolved)
let system_meta = &self.system_task_metadata[system_index];

#[cfg(feature = "trace")]
let system_span = system_meta.system_task_span.clone();
Contributor

I wonder if we even need this span anymore. We'd see this task take longer if there was contention on the channel, but would there still be significant overhead here? What does ConcurrentQueue do if there's contention? Will it park the thread like a channel would? We should check this in tracy.

Contributor Author

Oh, I see, I made this span less useful by pulling the sending of the completion message out of it. Sorry; I was trying to exclude tick_executor() and I had put those in the same function.

The existing code was doing a non-blocking try_send on the channel, which calls ConcurrentQueue::push and then does some bookkeeping. So the thread wouldn't have parked before and won't park now. It looks like there is a busy-wait in push, though, so that may be slow under contention.
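
For reference, a tiny illustration of that non-blocking send, using the async-channel crate as a stand-in (an assumption for this example; whichever channel type the executor actually uses, the point is that try_send pushes and returns without parking the thread):

```rust
fn main() {
    // async_channel::unbounded gives an MPMC channel whose try_send/try_recv
    // never block; on an unbounded channel, try_send only fails if it's closed.
    let (sender, receiver) = async_channel::unbounded::<u32>();
    sender.try_send(42).expect("channel should be open");
    assert_eq!(receiver.try_recv().ok(), Some(42));
}
```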

Do you want any changes here? I could take the tick_executor() call out of system_completed() and then put system_completed() back in the tracing span. Or I could remove the span.

Contributor

@hymm hymm Feb 23, 2024

I think it's fine to leave as is for now. If we're not seeing much difference here versus the system run time, we should probably remove it later. I originally added the span because there were sometimes gaps between systems running, so adding this showed what was happening in those gaps.

&mut self,
scope: &Scope<'_, 'scope, ()>,
context: &Context<'_, '_, '_>,
Contributor

Stuffing all of this into Context is a nice cleanup of the function signature. If for some reason we don't merge this PR, we should still do this.


github-merge-queue bot pushed a commit that referenced this pull request Feb 24, 2024
…tch the multi-threaded version. (#12073)

# Objective

`Scope::spawn`, `Scope::spawn_on_external`, and `Scope::spawn_on_scope`
have different signatures depending on whether the `multi-threaded`
feature is enabled. The single-threaded version has a stricter signature
that prevents sending the `Scope` itself to spawned tasks.

## Solution

Changed the lifetime constraints in the single-threaded signatures from
`'env` to `'scope` to match the multi-threaded version.

This was split off from #11906.
@hymm
Contributor

hymm commented Feb 25, 2024

Ran the 3d_scene example and saw a somewhat ambiguous result for the Render schedule. The other schedules seemed faster, but rendering is often the bottleneck, so I want to be careful here.

[screenshot: Render schedule comparison]

I think the regression is just due to there no longer being the weird fast path hump, so we're probably ok here.

Tested with `RUST_LOG=bevy_ecs::schedule::schedule=info,bevy_app=info,bevy_render=info cargo run --profile stress-test --example 3d_scene -F trace_tracy`, which removes the system-level tracing spans.

Looks like I was wrong about it deadlocking when there aren't any compute threads. I added

        .add_plugins(DefaultPlugins.set(TaskPoolPlugin {
            task_pool_options: TaskPoolOptions {
                min_total_threads: 0,
                max_total_threads: 0,
                compute: TaskPoolThreadAssignmentPolicy {
                    min_threads: 0,
                    max_threads: 0,
                    percent: 0.0,
                },
                ..default()
            }
        }))

and it ran ok. I confirmed with tracy that there weren't any compute threads.

@hymm hymm added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Feb 25, 2024
@james7132 james7132 added this pull request to the merge queue Feb 25, 2024
@james7132 james7132 removed this pull request from the merge queue due to a manual request Feb 25, 2024
@james7132 james7132 added this pull request to the merge queue Feb 26, 2024
Merged via the queue into bevyengine:main with commit c4caebb Feb 26, 2024
27 of 28 checks passed
msvbg pushed commits to msvbg/bevy that referenced this pull request Feb 26, 2024 (cherry-picks of bevyengine#12073 and bevyengine#11906; the commit messages repeat the descriptions above)
NiseVoid pushed a commit to NiseVoid/bevy that referenced this pull request Jul 8, 2024 (cherry-pick of bevyengine#11906)