Soundness issue (2) #160

Closed · Narsil opened this issue Jul 3, 2023 · 3 comments · Fixed by #166

Comments

@Narsil (Contributor) commented Jul 3, 2023

Hey, I discovered another potentially serious issue:

    // Create device 0 and copy a host buffer onto it.
    let dev0 = CudaDevice::new(0).unwrap();
    let slice = dev0.htod_copy(vec![1.0; 10]).unwrap();
    // Create a second device, then drop everything: dev1, dev0, and
    // finally the slice that still points at dev0's memory.
    let dev1 = CudaDevice::new(1).unwrap();
    drop(dev1);
    drop(dev0);
    drop(slice);

This panics with the following stack trace:

thread 'tests::dummy' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_CONTEXT_IS_DESTROYED, "context is destroyed")', src/driver/safe/core.rs:152:72
stack backtrace:
   0: rust_begin_unwind
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
   1: core::panicking::panic_fmt
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/result.rs:1687:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/result.rs:1089:23
   4: <cudarc::driver::safe::core::CudaSlice<T> as core::ops::drop::Drop>::drop
             at ./src/driver/safe/core.rs:152:13
   5: core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice<f64>>
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ptr/mod.rs:490:1
   6: core::mem::drop
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/mem/mod.rs:979:24
   7: cudarc::tests::dummy
             at ./src/lib.rs:107:9
   8: cudarc::tests::dummy::{{closure}}
             at ./src/lib.rs:99:16
   9: core::ops::function::FnOnce::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5
  10: core::ops::function::FnOnce::call_once
             at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/ops/function.rs:250:5

IIUC, this is because the stream here: https://github.com/coreylowman/cudarc/blob/main/src/driver/safe/core.rs#L189 is already gone.

Somehow this doesn't fail if I don't call CudaDevice::new(1).

Might be linked to: #108

Narsil added a commit to Narsil/cudarc that referenced this issue on Jul 4, 2023:

- Adds a test that runs on a multi-GPU setup.
- Destroying the stream on drop is optional.
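For illustration, a minimal, self-contained sketch of what "destroying the stream on drop is optional" could look like (this is not the actual PR code; StreamHandle, DriverError, and destroy_stream are all stand-ins): instead of unwrap()ing inside Drop, a destroyed-context error is tolerated, since the context going away already implies the stream is gone.

    // Sketch only: all of these names are stand-ins, not cudarc's API.
    #[derive(Debug)]
    #[allow(dead_code)]
    enum DriverError {
        ContextIsDestroyed,
        Other(&'static str),
    }

    struct StreamHandle {
        raw: usize, // stand-in for a raw CUstream pointer
    }

    // Stand-in for the raw driver call; here it pretends the owning context
    // has already been torn down, which is the situation in this issue.
    fn destroy_stream(_raw: usize) -> Result<(), DriverError> {
        Err(DriverError::ContextIsDestroyed)
    }

    impl Drop for StreamHandle {
        fn drop(&mut self) {
            match destroy_stream(self.raw) {
                Ok(()) => {}
                // The context is already gone, so the stream is gone too;
                // treat the destroy as a best-effort no-op instead of panicking.
                Err(DriverError::ContextIsDestroyed) => {}
                // Any other driver error still indicates a real bug.
                Err(e) => panic!("failed to destroy stream: {e:?}"),
            }
        }
    }

    fn main() {
        let s = StreamHandle { raw: 0 };
        drop(s); // no panic, even though the "driver" reports a destroyed context
    }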
@coreylowman (Owner) commented Jul 5, 2023

I wonder if this is a misuse of the primary context API? Currently, the Drop impl for CudaDevice calls cuDevicePrimaryCtxRelease. However, in a multi-device setting, the CudaDevice's cu_primary_ctx (which maybe should just be named cu_ctx) won't necessarily be the primary context anymore?

edit: Actually, I guess each device (0 and 1) would have its own separate primary context? 🤔
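For what it's worth, a quick check against the raw driver API suggests exactly that: each device ordinal gets its own primary context. The sketch below is hand-written FFI (not cudarc's bindings) and assumes libcuda is linked and two GPUs are visible.

    use std::ffi::c_void;
    use std::os::raw::{c_int, c_uint};

    type CUresult = c_int;          // CUDA_SUCCESS == 0
    type CUdevice = c_int;
    type CUcontext = *mut c_void;

    #[link(name = "cuda")]
    extern "C" {
        fn cuInit(flags: c_uint) -> CUresult;
        fn cuDeviceGet(device: *mut CUdevice, ordinal: c_int) -> CUresult;
        fn cuDevicePrimaryCtxRetain(pctx: *mut CUcontext, dev: CUdevice) -> CUresult;
        fn cuDevicePrimaryCtxRelease(dev: CUdevice) -> CUresult;
    }

    fn main() {
        unsafe {
            assert_eq!(cuInit(0), 0);

            let (mut d0, mut d1): (CUdevice, CUdevice) = (0, 0);
            assert_eq!(cuDeviceGet(&mut d0, 0), 0);
            assert_eq!(cuDeviceGet(&mut d1, 1), 0);

            let mut ctx0: CUcontext = std::ptr::null_mut();
            let mut ctx1: CUcontext = std::ptr::null_mut();
            assert_eq!(cuDevicePrimaryCtxRetain(&mut ctx0, d0), 0);
            assert_eq!(cuDevicePrimaryCtxRetain(&mut ctx1, d1), 0);

            // Each device ordinal has its own, distinct primary context.
            assert_ne!(ctx0, ctx1);

            assert_eq!(cuDevicePrimaryCtxRelease(d0), 0);
            assert_eq!(cuDevicePrimaryCtxRelease(d1), 0);
        }
    }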

@coreylowman (Owner)

Noting that CudaDevice::new(0).unwrap() will return an Arc<CudaDevice>, and when you first call drop(dev0), the underlying CudaDevice won't be dropped yet, because the Arc is cloned into the CudaSlice (so the refcount is 2 before the first drop and goes down to 1 afterwards).

So I believe the stream should still exist?
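To spell that out with plain std types (a stand-in model, not cudarc's actual types):

    use std::sync::Arc;

    struct Device;                      // stand-in for CudaDevice
    struct Slice { dev: Arc<Device> }   // CudaSlice keeps a clone of the Arc

    fn main() {
        let dev0 = Arc::new(Device);
        let slice = Slice { dev: Arc::clone(&dev0) };

        // Two strong references: `dev0` itself plus the clone inside the slice.
        assert_eq!(Arc::strong_count(&dev0), 2);

        drop(dev0);
        // The underlying device is still alive; the slice holds the last reference.
        assert_eq!(Arc::strong_count(&slice.dev), 1);

        drop(slice); // only here does the Device (and its stream) get dropped
    }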

@coreylowman (Owner)

Okay, yeah, I think this is related to #161: once result::ctx::set_current() is called after creating the second device, the free_async in drop(slice) runs against the most recently set CUDA context, which will be the second device's.
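A CUDA-free model of that failure mode (all names here are illustrative, not cudarc's API): the driver's per-thread current context is global mutable state, so a free that targets "whatever is current" picks up the second device's context, while re-binding the slice's owning context first targets the right one.

    use std::cell::Cell;

    thread_local! {
        // Stand-in for the driver's per-thread "current context".
        static CURRENT_CTX: Cell<usize> = Cell::new(0);
    }

    struct Slice { owning_ctx: usize }

    impl Slice {
        // Buggy shape: free against whatever context happens to be current.
        fn free_against_current(&self) -> usize {
            CURRENT_CTX.with(|c| c.get())
        }
        // Fixed shape: re-bind the owning context, then free.
        fn free_against_owner(&self) -> usize {
            CURRENT_CTX.with(|c| c.set(self.owning_ctx));
            CURRENT_CTX.with(|c| c.get())
        }
    }

    fn main() {
        let (ctx_dev0, ctx_dev1) = (100, 200); // pretend context handles

        // CudaDevice::new(0) makes device 0's context current; the slice is
        // allocated against it.
        CURRENT_CTX.with(|c| c.set(ctx_dev0));
        let slice = Slice { owning_ctx: ctx_dev0 };

        // CudaDevice::new(1) makes device 1's context current.
        CURRENT_CTX.with(|c| c.set(ctx_dev1));

        // The naive free now targets device 1's context, not the slice's owner...
        assert_eq!(slice.free_against_current(), ctx_dev1);
        // ...while re-binding first targets the correct context.
        assert_eq!(slice.free_against_owner(), ctx_dev0);
    }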
