
Remove Locking From Hot Paths #2710

Closed
cwfitzgerald opened this issue Jun 2, 2022 · 20 comments · Fixed by #3626
Labels
area: performance · type: enhancement

Comments

@cwfitzgerald
Member

Is your feature request related to a problem? Please describe.

This formalizes #2272 into an issue.

Our current system takes global locks per resource. This means that if one method needs mutable access to a particular type, it needs to lock all resources of that type as mutable. Because every method needs read or write access to some resource, it's very easy to write multithreaded code that is essentially serialized.

The worst case of this is render pass encoding. In an ideal world this would be fully parallelizable, but currently, because we need a write lock, it is serial.

This is one of the ways in which wgpu is at a major disadvantage compared to the underlying APIs.
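
Roughly, the current shape of the problem (names here are illustrative, not wgpu's real types):

```rust
use std::sync::RwLock;

// One global lock guards every resource of a given kind.
struct Buffer {
    label: String,
}

struct Storage<T> {
    items: Vec<T>,
}

struct Hub {
    buffers: RwLock<Storage<Buffer>>,
}

fn encode_work(hub: &Hub) {
    // Mutating any single buffer requires the write lock over *all* buffers,
    // so two threads doing unrelated encoding serialize on this one lock.
    let mut buffers = hub.buffers.write().unwrap();
    buffers.items[0].label.push('!');
    // ...long encoding work happens while every other thread waits...
}
```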

Describe the solution you'd like

We should switch to a wait-free data structure on the hot path, with per-object or more fine-grained locking.

Describe alternatives you've considered
The arcinization was another alternative, but we have shifted away from that style.

@cwfitzgerald added the type: enhancement and area: performance labels on Jun 2, 2022
@cwfitzgerald added this to the Release 0.13 milestone on Jun 2, 2022
@Imberflur
Contributor

Related issues: #1485, #2023, #2394, #2395

@John-Nagle

That identifies the problem better. See my test case for this here: https://github.com/John-Nagle/render-bench

@John-Nagle

It's an ongoing problem. The overall effect is that WGPU is S-L-O-W for anything that loads content while rendering.
A major claimed advantage of Vulkan over OpenGL is that one CPU thread can load content into the GPU while another manages the rendering. WGPU loses that advantage.

@John-Nagle

Discussion on Reddit: https://www.reddit.com/r/rust_gamedev/comments/11b0brr/were_not_really_game_yet/

If someone needs render-bench updated to use later versions of Rend3 and WGPU, please let me know. That's a good test for this problem.

@gents83
Contributor

gents83 commented Feb 26, 2023

@cwfitzgerald if someone would like to take this on, is it suggested to start from your work or to restart from scratch? (You mentioned that you had new ideas; maybe it would be worth writing down your intentions.)
I could try to take this, but I didn't want to do it in a way that differs from what you had in mind 😁

PS: this is how I tackled the same issue in my engine: an Arc<RwLock> around each single resource, keeping the Storage itself simple.
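
Roughly, the shape of it (identifiers here are illustrative, not my engine's actual types):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ResourceId = u64;

// The storage stays a plain map; each resource carries its own lock.
struct Storage<T> {
    resources: HashMap<ResourceId, Arc<RwLock<T>>>,
}

impl<T> Storage<T> {
    // Cloning the Arc is cheap; the caller then locks only the one resource
    // it touches, leaving all other resources accessible in parallel.
    fn get(&self, id: ResourceId) -> Option<Arc<RwLock<T>>> {
        self.resources.get(&id).cloned()
    }
}
```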

@John-Nagle

Nice. I'm glad to see someone interested in tackling this.

@cwfitzgerald
Member Author

cwfitzgerald commented Feb 27, 2023

@gents83 I'm glad to hear other people are too! So, a bit of a brain dump on intentions, history, and the previous PR.

History

This problem has been around (and something we've been aware of) since the very beginning of wgpu. It is complicated in part because it involves very core structures in wgpu. It is also complicated by our users' use cases:

  • Firefox has very steep safety requirements due to the potentially massive attack surface. They will take performance losses in exchange for increased safety.
  • All native apps are generally the opposite: they want performance above all else, as long as the code works.

The first manifested solution to this was the "arcinization", which converted all resources to Arc. This effort died when the author of the PR disappeared without publishing their changes in any really salvageable way; #2272 was what we could find of it.

The second manifested solution was a rewrite of the storages in #2791. This replaced the storage container with an overly clever lock-free, wait-free-access data structure. That PR actually worked (mostly), but the core data structure was deeply unsafe, and a lot of stuff was going on in my personal life at the same time, so it fell far enough behind that it basically needed rewriting.

What to Do

I've been thinking about this, and I believe my resistance to arcinization (hence my solution with the fancy data structure) was unfounded. I was very worried about cache locality, but we don't do things like "iterate over every texture" in any meaningful way. Given that, and the many downsides of such a fancy data structure, I now think that arcinization is the way to go.

As such, instead of having IDs be an index, epoch, and backend all packed into a u64, we should have them be an Arc<T>. In short: we should continue arcinizing.
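
Roughly the contrast (bit widths here are illustrative, not our exact layout):

```rust
use std::sync::Arc;

// Today: index, epoch, and backend packed into one u64.
#[derive(Clone, Copy)]
struct PackedId(u64);

impl PackedId {
    fn new(index: u32, epoch: u32, backend: u8) -> Self {
        // index in bits 0..32, epoch in bits 32..61, backend in bits 61..64
        Self((index as u64)
            | (((epoch as u64) & 0x1FFF_FFFF) << 32)
            | (((backend as u64) & 0b111) << 61))
    }

    fn index(self) -> u32 {
        self.0 as u32
    }
}

// After arcinization, an ID is instead shared ownership of the resource.
struct Texture; // placeholder
struct TextureId(Arc<Texture>);
```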

There are many problems to face though. We rely on the global locks we have to provide the ability to mutate stuff. Most of our internal datastructures are immutable once created, with some exceptions:

  • Device has multiple locks inside it.
  • Buffers and Textures have their initialization state tracked mutably. (This will need to go behind a lock)
  • Command buffers are completely mutable and need a lock around them.

Additionally, the trackers use the current index to store state about resources in a dense array, relying on the fact that the index is allocated from a freelist to keep the numbers dense. Getting rid of this index would pose significant problems. As such, I think we need to preserve the index, but use it only for this kind of dense data structure.
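
Something like this (illustrative, not the real tracker types):

```rust
// Per-resource state lives in a plain Vec addressed by the resource's index,
// which stays small only because freed indices are recycled via a freelist.
struct Tracker<State> {
    // Slot i holds the state for the resource whose index is i.
    states: Vec<Option<State>>,
}

impl<State> Tracker<State> {
    fn set(&mut self, index: usize, state: State) {
        if index >= self.states.len() {
            self.states.resize_with(index + 1, || None);
        }
        self.states[index] = Some(state);
    }

    fn get(&self, index: usize) -> Option<&State> {
        self.states.get(index).and_then(|s| s.as_ref())
    }
}
```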

Warning: Scope Creep

When working on this problem it is very tempting to try to roll other changes into the same PR. Both previous PRs suffered from this, so I just want to extend a warning: resist the urge to solve even more problems in the first PR.

Epilogue

This is necessarily a big task with widespread ramifications. As such, I strongly encourage you to keep in close contact with us on Matrix so we can discuss direction and make sure you stay unblocked on codebase knowledge. I'm also sure this post is missing a ton of information you'll need, and that channel will be a good way to fill in the gaps.

@John-Nagle

John-Nagle commented Feb 27, 2023

It's encouraging to see this. From my perspective, safety is important. I'm writing a metaverse client, which is a browser for virtual worlds. Like a web browser, it loads content created by thousands of different people. Some of that content is malformed, and some is hostile. I would encourage keeping this simple and safe, even if it costs some CPU time. If we have solid concurrency, we can get the performance back through parallelism.

Debugging race conditions and leaks in storage management is difficult for all concerned. The whole point of Rust is to avoid that class of problem.

@gents83
Contributor

gents83 commented Mar 4, 2023

@cwfitzgerald I finally had the time to take a look at the code.

At first look, in order to avoid changing too much at once, what can be done is:

  • Keep indexes
  • Remove the RwLock from the Storage
  • Have the Storage hold Arc<T>

Then we have to manage locks, and we can:

  • Use the lifeguard/token mechanism (not on the Storage but on T itself this time), or
  • Use RwLock around whatever needs locking

Does that sound correct? Or is your plan to completely remove all IDs, epochs, etc. immediately?

And, another question: since CommandBuffers are completely mutable, Arcanizing them would make them unable to do move operations like into_baked(). How did you imagine handling this case?

@cwfitzgerald
Member Author

At first look, in order to avoid changing too much at once, what can be done is:
- Keep indexes
- Remove the RwLock from the Storage
- Have the Storage hold Arc<T>

Now that I'm thinking about it, the storage needs to be Mutex<Vec<Arc<T>>>. When we need to access a given index T, we lock the mutex, but only long enough to clone the Arc. Then we can release the mutex and let everyone else through. Basically, we only need to lock while the storage itself is being actively changed. There are still critical sections, but they're on the order of cycles rather than milliseconds.
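
Something like this (simplified, illustrative types):

```rust
use std::sync::{Arc, Mutex};

struct Storage<T> {
    items: Mutex<Vec<Arc<T>>>,
}

impl<T> Storage<T> {
    // Hold the mutex only long enough to clone the Arc; callers then work
    // on the resource without blocking anyone else.
    fn get(&self, index: usize) -> Arc<T> {
        // Critical section: a few cycles to clone one Arc.
        let guard = self.items.lock().unwrap();
        Arc::clone(&guard[index])
        // The guard drops here, unblocking other threads immediately.
    }
}
```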

Then we have to manage locks, and we can:
- Use the lifeguard/token mechanism (not on the Storage but on T itself this time), or
- Use RwLock around whatever needs locking

Tbh I'm not positive what lifeguards are for, but the latter sounds right. Do note that if there isn't a significant read/write dichotomy, or the critical sections are very short, prefer a regular Mutex.

  • And, another question: since CommandBuffers are completely mutable, Arcanizing them would make them unable to do move operations like into_baked(). How did you imagine handling this case?

Command buffers are a mess in general. I believe they're removed from the storages while being baked, so Arc::try_unwrap should succeed.
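
Something like this should work (stand-in types, not the real definitions):

```rust
use std::sync::Arc;

struct CommandBuffer;
struct BakedCommands;

impl CommandBuffer {
    fn into_baked(self) -> BakedCommands {
        BakedCommands
    }
}

// Once the command buffer has been removed from the storage, ours should be
// the last Arc, so try_unwrap hands back the owned value for the by-value
// conversion.
fn bake(cmd_buf: Arc<CommandBuffer>) -> BakedCommands {
    match Arc::try_unwrap(cmd_buf) {
        Ok(owned) => owned.into_baked(),
        // A surviving clone here would be a logic bug: baking requires
        // exclusive ownership.
        Err(_) => panic!("command buffer still shared while baking"),
    }
}
```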

@gents83
Contributor

gents83 commented Mar 10, 2023

@cwfitzgerald I didn't get whether you want to get rid of lifeguards or not :)
Since we are using Arc, some checks are automatically tracked by the Arc itself, and those will not allow the Arc::try_unwrap() or the resource destroy.
Here is an example where I do not find it useful (screenshot omitted), but I want to get your POV on this topic in general.

@gents83
Contributor

gents83 commented Mar 13, 2023

@cwfitzgerald & @kvark: in https://github.com/gents83/wgpu you can find the first rough iteration of Arcanization
(mainly everything in one CL, except for a lock fix)

In this first part I just:

  • Added Arc and Mutex/RwLock where needed
  • Used Arc::try_unwrap() and Mutex::into_inner() to release stuff that should be released and moved around
  • Removed Token

I haven't yet done any optimization, nor removed the global "array guards" that are used everywhere and lock storages more than needed.
This first step is just to verify with you whether the direction is correct, or whether you have anything different in mind :)

I would like to do as next steps:

  • Listen to your feedback/suggestions/changes needed
  • Access storages only to get the Arc and then let others access them (no more guards, unless the content must stay locked for the entire function)
  • Understand if we still need lifeguards and refcounts, or if we can get rid of them now that we're using Arc
  • Test, re-test, stress-test
  • Profile
  • Go through @John-Nagle's test scenario

@John-Nagle

When there's a branch of Rend3 that uses this I'd like to test it.

@cwfitzgerald
Member Author

@gents83 alright, took a look.

By and large this looks good; all the Mutexes are in the right place. We might consider adjusting them, but nothing concerning there. You did, however, Arc a different thing than I expected. The construction in those commits is Arc<A::Texture> inside wgpu_core::Texture; what I was expecting was Arc<wgpu_core::Texture>. The main reason for this was to facilitate locking the storages for as short a time as possible. Most of the operations we do are on the wgpu_core type, as we often need the entirety of the structure.
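
To illustrate the difference (RawTexture stands in for A::Texture; neither struct is the real definition):

```rust
use std::sync::Arc;

struct RawTexture; // placeholder for the backend texture

// What the commits did: the Arc sits on the backend object, *inside* the
// wgpu_core-level type, so storage lookups still deal in the outer struct.
struct TextureArcInside {
    raw: Arc<RawTexture>,
}

// What was expected: the Arc wraps the whole wgpu_core-level type, so a
// storage lookup only has to clone this one handle before releasing the
// storage lock, e.g. storage: Mutex<Vec<Arc<TextureArcOutside>>>.
struct TextureArcOutside {
    raw: RawTexture,
}
```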

@gents83
Contributor

gents83 commented Mar 16, 2023

@cwfitzgerald I was sort of forced to do both, actually 😂 The Arc<wgpu_core::Texture> is there in order to get stuff from the Storage and lock it only when needed, while the Arc<A::Texture> is needed because those are sometimes moved before the actual resource is released, while it is still alive and not yet unregistered from the Storage itself.
If you're OK with it, I can continue anyway, trying to clean it up and optimize it and then stress-test it. Or do you want me to make some core changes?

@John-Nagle

Encouraged by all this effort.

My own experience in my metaverse viewer is that I can get 60 FPS, and I can download content at 200Mb/s, but not both at once. If the content loader and decompressor, which run in multiple background threads, are allowed to run at high speed, the process of putting content into the GPU stalls out the main thread and drops the frame rate to 20 FPS; worst case, 12 FPS. So I have a live use case for this.

@gents83
Contributor

gents83 commented Apr 15, 2023

@John-Nagle it would be nice if you could give https://github.com/gents83/wgpu a try, to start testing Arcanization in your use-case scenario 👍

@John-Nagle

@John-Nagle it would be nice if you could give https://github.com/gents83/wgpu a try, to start testing Arcanization in your use-case scenario 👍

I'm using WGPU via Rend3. I need a Rend3 that uses the new version.

My test program is here: https://github.com/John-Nagle/render-bench
It requires Rend3.

Did https://github.com/gents83/wgpu get merged back into main wgpu?

@gents83
Contributor

gents83 commented Apr 22, 2023

Did https://github.com/gents83/wgpu get merged back into main wgpu?

Nope, I haven't had time to fully close it yet, and I still have to debug some tests where stuff is not correctly released on close due to lock contention.
Hopefully I will continue on this starting from mid/end of next week.

@John-Nagle

OK, good to know. I'm looking forward to trying this in Rend3. If the system can sustain 60 FPS while content loading is in progress, we finally have a solution for big worlds that won't fit in the GPU all at once.
