Deadlock #152

Closed
rukai opened this issue May 7, 2019 · 5 comments · Fixed by #154
Labels
type: bug Something isn't working

Comments

@rukai
Contributor

rukai commented May 7, 2019

OS: Arch Linux
GPU: GTX 960
Driver: Nvidia proprietary driver, version 418

I am hitting deadlocks when I take the compute example and run it multiple times in parallel.
https://github.com/rukai/wgpu/blob/217a76eaa6bdf1468319c0b7147acb2a8a594aae/examples/deadlock/main.rs
It will only complete 1-4 iterations before reaching a deadlock.

However, if I run the same code sequentially, the device fails to initialize on the 64th iteration every time.
This can be reduced down to just the device initialization code.
https://github.com/rukai/wgpu/blob/217a76eaa6bdf1468319c0b7147acb2a8a594aae/examples/initialization_failed/main.rs

You can easily run these examples by cloning https://github.com/rukai/wgpu/tree/bugs
and then doing:

cargo run --release --features vulkan --bin deadlock
cargo run --release --features vulkan --bin initialization_failed
@m4b
Contributor

m4b commented May 7, 2019

Confirmed: both of these are problematic for me as well, on Arch Linux with an Intel GPU, Intel(R) HD Graphics 620 (Kaby Lake GT2) (type: IntegratedGpu).

Running initialization_failed, I get some validation errors, and the device is lost on the first iteration:

m4b@efrit ::  [ /tmp/wgpu/examples ] cargo run --features vulkan --bin initialization_failed
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `/tmp/wgpu/target/debug/initialization_failed`
iteration: 0
Xlib:  extension "NV-GLX" missing on display ":0".
ERROR 2019-05-07T05:10:16Z: gfx_backend_vulkan: [Validation]  [ VUID_Undefined ] Object: VK_NULL_HANDLE (Type = 0) | vkWaitForFences: parameter fenceCount must be greater than 0.
VUID_Undefined(ERROR / SPEC): msgNum: 0 - vkWaitForFences: parameter fenceCount must be greater than 0.
    Objects: 1
        [0] 0, type: 0, name: NULL
ERROR 2019-05-07T05:10:16Z: gfx_backend_vulkan: [anv] ../mesa-18.3.1/src/intel/vulkan/anv_queue.c:538: drm_syncobj_wait failed: Invalid argument (VK_ERROR_DEVICE_LOST)
INTEL-MESA: error: ../mesa-18.3.1/src/intel/vulkan/anv_queue.c:538: drm_syncobj_wait failed: Invalid argument (VK_ERROR_DEVICE_LOST)
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DeviceLost(DeviceLost)', src/libcore/result.rs:997:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

And confirmed, the deadlock example appears to deadlock after some time:

m4b@efrit ::  [ /tmp/wgpu/examples ] cargo run --features vulkan --bin deadlock
   Compiling examples v0.1.0 (/tmp/wgpu/examples)
    Finished dev [unoptimized + debuginfo] target(s) in 3.16s
     Running `/tmp/wgpu/target/debug/deadlock`
0
500
250
375
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
Times: [25, 25, 25]
376
1
Times: [25, 25, 25]
501
Times: [25, 25, 25]
251
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
377
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
252
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
502
Times: [25, 25, 25]
2
Times: [25, 25, 25]
378
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
253
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
503
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
3
Times: [25, 25, 25]
379
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
254
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
504
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
4
Times: [25, 25, 25]
380
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
255
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
505
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
5
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
381
Times: [25, 25, 25]
256
Times: [25, 25, 25]
506
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
6
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
382
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
257
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
507
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
7
Times: [25, 25, 25]
383
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
258
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
508
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
8
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
384
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
259
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
509
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
9
Times: [25, 25, 25]
Times: [25, 25, 25]
Times: [25, 25, 25]
260
510
385
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
10
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
511
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
11
Times: [25, 25, 25]
261
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
386
Times: [25, 25, 25]
512
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
12
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
262
Times: [25, 25, 25]
387
Times: [25, 25, 25]
513
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
13
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
263
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
514
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
388
Times: [25, 25, 25]
14
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
264
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
515
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
389
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
15
Times: [25, 25, 25]
265
Times: [25, 25, 25]
516
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
390
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
16
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
266
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
517
Times: [25, 25, 25]
391
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
17
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
267
Times: [25, 25, 25]
518
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
392
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
18
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
268
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
519
Xlib:  extension "NV-GLX" missing on display ":0".
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
393
Times: [25, 25, 25]
19
Xlib:  extension "NV-GLX" missing on display ":0".
Times: [25, 25, 25]
Xlib:  extension "NV-GLX" missing on display ":0".

kvark added the "type: bug" label May 7, 2019
@kvark
Member

kvark commented May 7, 2019

Thank you for the testcase! It looks to be exposing 2 different issues:

  1. Initializing the device for the 64th time fails. This likely indicates that we aren't de-initializing the device properly on destruction. There is still work to do in order to properly release all resources.
  2. A deadlock sometimes happens, related to the buffer unmap callback. This is a fairly recent regression, happening because the callbacks are issued during device maintain(), which has the device locked for writing, so the callbacks aren't able to lock it for reading. It's still unclear to me why this works even in a single thread, but it is clear that we are doing it wrong here, and we can do better.
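
To illustrate the second point, here is a minimal sketch of one way to avoid that kind of deadlock: take the pending callbacks out while holding the write lock, release the lock, and only then run them. The `Device` struct, its fields, and the `maintain` and `run_example` functions are hypothetical stand-ins built on `std::sync::RwLock`; they are not wgpu's actual types or its actual fix.

```rust
use std::sync::RwLock;

// Hypothetical callback type: it receives the lock so it can re-lock the device.
type Callback = Box<dyn FnOnce(&RwLock<Device>) + Send>;

// Hypothetical stand-in for the device state; not wgpu's real type.
struct Device {
    pending_callbacks: Vec<Callback>,
    mapped_buffers: u32,
}

/// Drains the pending callbacks under the write lock, drops the lock,
/// then invokes them. Returns how many callbacks were run.
fn maintain(device: &RwLock<Device>) -> usize {
    let callbacks = {
        let mut guard = device.write().unwrap();
        std::mem::take(&mut guard.pending_callbacks)
    }; // the write guard is dropped here, before any callback runs

    let count = callbacks.len();
    for callback in callbacks {
        // The callback is free to take read or write locks itself.
        callback(device);
    }
    count
}

/// Registers one callback that "unmaps" a buffer, runs maintain,
/// and returns the number of buffers still mapped afterwards.
fn run_example() -> u32 {
    let device = RwLock::new(Device {
        pending_callbacks: Vec::new(),
        mapped_buffers: 1,
    });

    device
        .write()
        .unwrap()
        .pending_callbacks
        .push(Box::new(|dev: &RwLock<Device>| {
            // Locks for reading, then for writing, inside the callback.
            let mapped = dev.read().unwrap().mapped_buffers;
            if mapped > 0 {
                dev.write().unwrap().mapped_buffers -= 1;
            }
        }));

    maintain(&device);
    let remaining = device.read().unwrap().mapped_buffers;
    remaining
}
```

If `maintain` instead called the callbacks while still holding the write guard, the `dev.read()` inside the callback would block on (or panic in) the same thread, which is the shape of the problem described above.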

@kvark
Member

kvark commented May 7, 2019

@rukai the deadlock problem is addressed in #154. It would be great to have that logic upstreamed, perhaps as a test that is only enabled when a real backend is enabled. Would you be willing to try making such a PR?

The other problem is now moved out into #155.

@rukai
Contributor Author

rukai commented May 7, 2019

Are you asking me to add a unit test consisting of https://github.com/rukai/wgpu/blob/217a76eaa6bdf1468319c0b7147acb2a8a594aae/examples/deadlock/main.rs?

@kvark
Member

kvark commented May 7, 2019 via email

bors bot added a commit that referenced this issue May 9, 2019
157: Add multithreaded_compute test r=kvark a=rukai

As requested in #152 I have opened a PR to add the repro as a test case.

I used [rusty fork](https://github.com/AltSysrq/rusty-fork) to allow setting a timeout.
Rusty fork also runs each test in a separate process.
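
For readers unfamiliar with why a timeout matters for a deadlock test, here is a rough std-only sketch of the idea. It is not rusty-fork and does not use its API: the hypothetical `run_with_timeout` helper runs the work on a thread and gives up waiting after a deadline, whereas a process-based approach such as rusty-fork can also terminate the stuck test, which a thread-based sketch cannot do.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `work` on a helper thread and returns
/// `Some(result)` if it finishes within `timeout`, or `None` otherwise.
/// A hung worker thread is left running; it cannot be killed from here.
fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Option<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already gave up, the send fails; ignore that.
        let _ = sender.send(work());
    });
    // Also returns None if the worker panicked and dropped the sender.
    receiver.recv_timeout(timeout).ok()
}
```

A test built on this would treat `None` as a failure, for example `assert!(run_with_timeout(|| 1 + 1, Duration::from_secs(10)).is_some())`, so that a deadlocked body fails the test instead of hanging the whole run.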

Open to any suggestions on how to organize tests etc.
I could add #156 if you want?
Maybe name the tests by issue number?
If we add a test for every issue, breaking changes would become really annoying :/

Co-authored-by: Rukai <rubickent@gmail.com>
bors bot added a commit that referenced this issue May 10, 2019
154: Move callbacks out of the locking path r=kvark a=kvark

Fixes #152 
This change fixes the deadlocks discovered by @rukai. It enforces the following invariants throughout the code:
  1. If we enter Rust code from FFI, we assume nothing is locked. This invariant previously did not hold when we unmapped a buffer in a mapping callback.
  2. The HUB storages are always locked in the same order. This was not followed in a few places, and it still needs to be enforced by #66 later down the road.
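
The second invariant can be sketched as follows, assuming a hypothetical `Hub` with two mutex-protected storages. The struct, its field names, and the `lock_both` helper are illustrative only and do not reflect wgpu's real hub layout; the point is that routing every multi-storage access through one helper fixes the acquisition order in a single place.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

// Hypothetical stand-in for the hub; field names are illustrative only.
struct Hub {
    devices: Mutex<Vec<u32>>,
    buffers: Mutex<Vec<u32>>,
}

impl Hub {
    /// The only place where both storages are locked together:
    /// always `devices` first, then `buffers`.
    fn lock_both(&self) -> (MutexGuard<'_, Vec<u32>>, MutexGuard<'_, Vec<u32>>) {
        let devices = self.devices.lock().unwrap();
        let buffers = self.buffers.lock().unwrap();
        (devices, buffers)
    }
}

/// Spawns `threads` workers that each touch both storages, then
/// returns the final number of entries in each storage.
fn run_workers(threads: u32) -> (usize, usize) {
    let hub = Arc::new(Hub {
        devices: Mutex::new(Vec::new()),
        buffers: Mutex::new(Vec::new()),
    });

    let handles: Vec<_> = (0..threads)
        .map(|id| {
            let hub = Arc::clone(&hub);
            thread::spawn(move || {
                // Every worker acquires the locks in the same order,
                // so no two workers can wait on each other in a cycle.
                let (mut devices, mut buffers) = hub.lock_both();
                devices.push(id);
                buffers.push(id);
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let (devices, buffers) = hub.lock_both();
    let counts = (devices.len(), buffers.len());
    counts
}
```

If one code path locked `buffers` before `devices` while another did the reverse, two threads could each hold one lock and wait forever for the other, which is why a consistent order matters.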

Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
@bors bors bot closed this as completed in #154 May 10, 2019
mitchmindtree pushed a commit to mitchmindtree/wgpu that referenced this issue Feb 23, 2020
152: Update to latest winit (0.20.0-alpha6) r=kvark a=grovesNL

Updated to latest winit to fix examples:
- `RedrawRequested` is now used for rendering
- `EventsCleared` replaced by `MainEventsCleared` which requests redraw
- Removed Metal auto-capture and escape key handling from hello-triangle to simplify the example slightly (it doesn't use `framework`, so we should try to make it as simple as possible IMO)

Should we also remove the `feature = gl` parts of the examples to simplify them until GL is reenabled?

Co-authored-by: Joshua Groves <josh@joshgroves.com>
kvark pushed a commit to kvark/wgpu that referenced this issue Jun 3, 2021
152: Update to latest winit (0.20.0-alpha6) r=kvark a=grovesNL

Updated to latest winit to fix examples:
- `RedrawRequested` is now used for rendering
- `EventsCleared` replaced by `MainEventsCleared` which requests redraw
- Removed Metal auto-capture and escape key handling from hello-triangle to simplify the example slightly (it doesn't use `framework`, so we should try to make it as simple as possible IMO)

Should we also remove the `feature = gl` parts of the examples to simplify them until GL is reenabled?

Co-authored-by: Joshua Groves <josh@joshgroves.com>
RandyMcMillan pushed a commit to RandyMcMillan/wgpu that referenced this issue Jun 19, 2024
Fix flex layout cross-alignment when not filled