Task System for Bevy #318

aclysma · 2020-08-24T05:07:40Z

Task System for Bevy

Currently bevy depends on rayon for multithreaded dispatching of systems. @lachlansneff and @aclysma have iterated on a prototype with feedback from @kabergstrom and @cart. This issue is meant to:

Capture some of the discussion in discord in #rendering - some of it was about the prototype and some of it was longer term
Invite further feedback on the near-term and long-term plans

Why Replace Rayon?

Rayon has long-standing performance issues with smaller workloads on machines with more than a few cores. (cpu usage #111)
Rayon is not async-friendly
Rayon is somewhat of a closed box, it owns the threads and it would be more difficult to upstream changes that are tuned for games as it’s a general purpose library
Rayon has a lot more in it than we really need. Our alternative has fewer dependencies, compiles faster, and is substantially less code

What Would the Alternative Be?

@lachlansneff and @aclysma implemented a prototype using multitask. It’s a small executor based on async-task, which is used within async-std. The dependencies are:

├── multitask
│   ├── async-task
│   ├── concurrent-queue
│   │   └── cache-padded
│   └── fastrand
├── num_cpus
│   └── libc
├── parking
└── pollster

The API does three things:

Fire-and-forget `static tasks
Fork-join for non-static tasks
Chunked parallel iteration of slices

We have a prototype of ParallelExecutor using it instead of rayon, allowing us to remove rayon as a dependency from bevy. This repo is a prototype and we intend to add it as a module within bevy directly. (bevy_tasks) This will allow us to do more sophisticated instrumentation and profiling than if we were using an externally-managed thread pool.

Prototype integrated with bevy is here (🚧still under construction🚧): https://github.com/lachlansneff/bevy/tree/bevy-tasks/crates/bevy_tasks/src
Our original prototype is also available if anyone wants to play with this outside of bevy: https://github.com/lachlansneff/small-task/tree/master/src

Advantages/Disadvantages

Advantages:

Less code, less dependencies
More async friendly
Allows us to control and customize it more for bevy
Solves issue cpu usage #111

Disadvantages:

Rayon has more features

Tradeoffs:

Rayon has a large community of maintainers, users, and downstream crates. This can be both good and bad

Short Term Plan

Finish a PR to add bevy_tasks as a module, remove rayon as a dependency, and update ParallelExecutor to work with bevy_tasks. In principle these steps are already done, but we may want to polish a bit first.

We have a feature branch for this underway here: https://github.com/lachlansneff/bevy/tree/bevy-tasks

Long Term Plan

Thread management is clearly a key problem that needs to be solved in a scalable game engine. In my opinion there are three main uses of threads in a modern, ECS-based game engine, from high-level to low-level:

Large systems that are relatively long running that own their own thread, pipelined with the “main” simulation thread. (See https://github.com/aclysma/renderer_prototype/blob/master/docs/pipelining.png)
Dispatching individual ECS systems
Jobs being triggered by the systems
a. Some systems might wait for tasks to complete before returning
b. Other systems might start jobs to be finished later in the frame or in a future frame

We plan to apply this solution to #2 now, and longer term expose the same underlying system to solve #3. (#1 is out of scope for now, but might also be able to use this system)

We discussed #3 in the #rendering channel in discord:

@lachlansneff suggested creating a single global threadpool as an atomic static so that it’s always easy and fast to access
@aclysma suggested adding it as a resource as this is more consistent with the rest of the engine
@cart: “my default answer is "lets use a resource", but I’m flexible if the context dictates that something static is better”
@kabergstrom suggested separating IO tasks from compute tasks. The rationale being that IO tasks do not occupy a thread for long as they generally have short wake/sleep cycles. Additionally IO sometimes has latency requirements
@aclysma suggested binning tasks as:
- IO: High priority tasks that are expected to spend very little time “awake” (example: feeding an audio buffer)
- Compute: Tasks that must be completed to render the current frame (example: visibility)
- Async Compute: Tasks that may span multiple frames and don’t have latency requirements (example: pathfinding)
@kabergstrom suggested that generally we would partition threads to process specific bins, not oversubscribing threads
- Example: A system with 16 logical cores might give 4 to IO, 4 to async compute, and 8 to (same-frame) compute
- There was general agreement that this needs to be tunable as games will have different requirements (i.e. games that are streaming in the environment vs. games that can load everything up front)
@aclysma suggested that when we do have binned thread pools, we will want to set affinities to physical cores
Consensus was that for now we will put thread pools in resources as we can new type each specific pool as separate resources
Having an atomic static method of accessing the thread pool may prove to be a valuable option in the future, but if we add it we will probably want to retain the ability to have separate buckets for different types of tasks (i.e. it would be more like an atomic static global table of task pools rather than a single task pool - or maybe 3 atomic static task pools)

Configuration of Task Buckets

@lachlansneff and @aclysma also discussed the need to assign threads to the proposed buckets (IO, async compute, and compute). We considered several approaches:

Some sort of callback that puts the problem completely on the end-user. We would probably want this to be optional - which would mean having some sort of reasonable default if they did not specify a callback
Some sort of explicit configuration that can be targeted towards particular hardware devices (i.e. a config file that explicitly lists devices (i.e. iphone12, pixel4) and an exact distribution of threads to use)
- In general, we may need a solution for other systems in the future to tune performance based on the device (examples: LOD distances, pool sizes, limits on number of things spawned, disabling/scaling rendering features, etc. If we add something like this in the future, policy on task/thread distribution would be good to add)
Could provide a rough policy
- Stupid simple default: 1 IO thread, 1 async compute thread, max(1, N-2) compute threads
- Slightly more advanced: %/min/max IO threads, %/min/max async compute threads, min compute threads
- Example:
  - N = number of logical cores
  - NumIOCores = Clamp(N * 0.1, 1, 4)
    - 10%, at least 1 no more than 4
  - NumAsyncComputeCores = Clamp(N * 0.1, 1, 4)
    - 10%, at least 1, no more than 4
  - NumComputeCores = Max(1, N - NumIOCores - NumAsyncComputeCores)
    - Implicitly 80% of cores, at least 1
If threads get oversubscribed because there are <=2 cores, we will rely on the OS to fairly timeslice them.
- We actually don’t want to try to do this ourselves because the OS is able to preempt a task even if it is long running - this approach should be more resilient against any of the pools being starved
@lachlansneff and @aclysma agreed that we should go for the “slightly more advanced” policy now with the option to implement a callback for custom configuration later. We will implement in a way that would allow both methods to coexist.

Potential Future Improvements to bevy_tasks

Better ergonomics with Tasks/TaskPools/Scopes in multithreaded code
- A few things require &mut that we might not want to use &mut. For example Scope::spawn. Would be nice if Scope was cloneable and could be passed into futures being spawned.
- Might also be nice to have TaskPool have an Arc which would make that easier to pass around too.
More doc comments
Improved Panic Handling: If a task panics in scope() and spawn(), we want well-thought-out ways of surfacing the panic to the caller. We may be getting this somewhat for free by using multitask, but it’s worth having an intentional design around it (i.e. awaiting a panicked task will panic)

Next Steps

Gather feedback for the short term plan and find consensus on if we will proceed (PR bevy_tasks and use it to replace rayon)
Gather additional feedback for the longer term plan (a general approach to multithreaded tasks and how bevy_tasks might help us with it)

The text was updated successfully, but these errors were encountered:

GabLotus · 2020-08-24T19:47:16Z

This is awesome I'm so glad we have people like you contributing to bevy

aclysma · 2020-08-25T07:50:58Z

It seems like there is consensus to move this forward. I think we are quite close to having a PR ready :)

We had a bit more discussion on discord tonight from a number of participants, mainly about how this would work within a browser:

We did not have a clear answer for if this would work out-of-the-box, but did have a general sense that async support might be helpful.
@lachlansneff suggested that it would be helpful to gather all the restrictions the web would impose and if/how bevy can be designed around them without hobbling the engine for non-web platforms.
- We might be able to get advice from members of the amethyst community on this
We should be prepared to change the task pool implementation depending on the platform. Partly in case a web-compatible approach puts too many restrictions on native platforms, and partly to accommodate the shifting requirements of the web and potential multiple browser implementations and quirks.

chemicstry · 2020-08-25T11:16:49Z

You can see an example here how amethyst integrated web workers into rayon. Rayon allows specifying spawn_handler which we use to execute rayon inside web worker.

The main blocker for amethyst was join (or install, or anything that waits for other threads) calls on the main thread, which are not allowed on the web. Main usage of the join was waiting for all the dispatched systems to finish executing. The only way to solve this is making join an async function. There are multiple ways to implement this:

Wait for async atomics to land in browsers. There is also a polyfill, but it involves spawning web worker threads for each await. Although it could work as a temporary solution until browsers implement the real thing.
Implement a similar polyfill like async atomics, but directly into executor. This would involve web worker postMessage to the main thread to wake async task. It would be faster than the polyfill as no new threads are spawned, but is incompatible with the upcoming async atomics proposal.

The implications of async join are quite outreaching. The whole call stack up until executor join has to be async. In amethyst this involved application, whole gamestate code including user code and then finally ECS dispatcher. The shallower this stack is the less code needs to be async.

lachlansneff · 2020-08-25T17:38:58Z

@chemicstry I looked into the proposal for waitAsync and I'm not sure it'll actually be useful here. I was thinking that it would close over a function that calls atomics.wait at some point, and then turn that into a promise, so you could use atomics.wait on the main thread. However, it looks like it's just a way of calling atomics.wait in javascript in an async context, so not too useful as far as I can tell.

One way of completely sidestepping the "no blocking the main thread" problem is to only run bevy in workers when on the web. Those can block no problem. We can render with an offscreen canvas.

chemicstry · 2020-08-25T21:40:13Z

@lachlansneff async is the only way to achieve "blocking" execution in web context. Any wasm execution must immediately return to prevent blocking main thread and this inevitably loses stack, so you can't resume execution. winit hacks this by throwing exception and leaking all stack, afterwards all execution is carried by browser events.

You can emulate this async behavior like I mentioned in 2nd solution, but it would be less performant than async atomics because of worker message overhead.

We explored the option of running everything in web workers, but it has its downsides:

winit does not support OffscreenCanvas. See: Web worker support rust-windowing/winit#1518
There is still no audio support from web workers.
Most of the window API does not exist, even though some browsers provide it. Hence this is not available in wasm-bindgen. This includes things like performance.now().
No window, input events.

The list is probably longer. It would require a lot of glue to make everything run in web workers, but if this could be implemented as a separate glue logic instead of a bunch of cfg macros, then it may very well be a desired solution. Although, I would still say that avoiding locks in main thread and relying on ECS for ensuring safe shared access is a good pattern from performance and code quality perspective.

lwansbrough · 2020-08-25T22:06:39Z

@chemicstry I think most people are in agreement that Bevy should be focusing on targeting the web as it stands to exist sometime around when Bevy hits 1.0 (or more generally: targeting some future state of the web.) This means avoiding development of suboptimal solutions in an attempt to make something work today.

So with that in mind, and recognizing that you did a lot of research in this area:

winit not supporting offscreen canvas today is not a big deal. By the time WebGPU is available in offscreen canvas I’ll bet some things have changed. Yes events will have to be propagated to the worker until there is support for that, which is a bit of a bummer.
AudioWorklets may be a solution to web worker audio. The architecture for this may be outlined here (not sure if it’s an exact fit.) Probably worth examining the state of things today/near future. https://developers.google.com/web/updates/2018/06/audio-worklet-design-pattern#webaudio_powerhouse_audio_worklet_and_sharedarraybuffer

No disagreement here regarding solutions that avoid this problem altogether!

GabLotus added C-Enhancement A new feature A-Rendering Drawing game state to the screen labels Aug 24, 2020

lachlansneff mentioned this issue Aug 28, 2020

Task System for Bevy #384

Merged

cart closed this as completed in #384 Aug 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Task System for Bevy #318

Task System for Bevy #318

aclysma commented Aug 24, 2020

GabLotus commented Aug 24, 2020

aclysma commented Aug 25, 2020

chemicstry commented Aug 25, 2020 •

edited

lachlansneff commented Aug 25, 2020 •

edited

chemicstry commented Aug 25, 2020

lwansbrough commented Aug 25, 2020

Task System for Bevy #318

Task System for Bevy #318

Comments

aclysma commented Aug 24, 2020

Task System for Bevy

Why Replace Rayon?

What Would the Alternative Be?

Advantages/Disadvantages

Advantages:

Disadvantages:

Tradeoffs:

Short Term Plan

Long Term Plan

Configuration of Task Buckets

Potential Future Improvements to bevy_tasks

Next Steps

GabLotus commented Aug 24, 2020

aclysma commented Aug 25, 2020

chemicstry commented Aug 25, 2020 • edited

lachlansneff commented Aug 25, 2020 • edited

chemicstry commented Aug 25, 2020

lwansbrough commented Aug 25, 2020

chemicstry commented Aug 25, 2020 •

edited

lachlansneff commented Aug 25, 2020 •

edited