[Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state #16980

Merged

Commits on May 8, 2024

  1. [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

    Prior to this commit, the RAII handler in `NDArray` would always
    attempt to free a cuda memory allocation on destruction.  However, the
    call to `cudaFree` may throw an exception.  If this happens during
    stack unwinding due to a previously-thrown exception, this causes the
    program to immediately terminate, making it difficult to identify the
    source of the original error.
    
    This can commonly occur if an async compute kernel performs an illegal
    memory access.  An exception is thrown from the next cuda API call
    following the asynchronous error, causing the stack to unwind.  If the
    stack contains any `NDArray` instances which reference cuda
    allocations, the destructor of these `NDArray` instances will attempt
    to free memory, triggering the immediate termination described above.
    
    This commit updates the `CUDADeviceAPI::FreeDataSpace` function to
    check if the program is currently unwinding the stack due to a thrown
    exception, while the cuda driver has been left in an unrecoverable
    state.  If this occurs, no attempt to free memory is made, as all cuda
    API calls will result in an error, and the original exception is
    allowed to propagate.
    
    If the cuda driver is in an unrecoverable state, but no exception is
    currently unwinding the stack, then this may be the first cuda API
    call to occur after the asynchronous error.  In this case, the
    `cudaFree` call is still performed, which throws the initial
    exception.
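
The two cases described above can be illustrated with a minimal, CUDA-free sketch. It is not the code from this commit: `FakeDriver`, `FakeArray`, and `ShouldSkipFree` are hypothetical stand-ins, and using `std::uncaught_exceptions()` (C++17) to detect stack unwinding is an assumption about how such a check could be written. The fake `Free` throws whenever the driver is in an error state, mimicking a checked `cudaFree` call after an asynchronous failure.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the CUDA driver: once `sticky_error` is set,
// every "API call" fails, as after an illegal memory access in a kernel.
struct FakeDriver {
  bool sticky_error = false;
  int free_calls = 0;

  // Mimics a checked cudaFree: throws if the driver is in an error state.
  void Free() {
    ++free_calls;
    if (sticky_error) throw std::runtime_error("illegal memory access");
  }
};

// Assumed policy: skip the free only when an exception is already
// unwinding the stack AND the driver is in an error state.
bool ShouldSkipFree(const FakeDriver& driver) {
  return std::uncaught_exceptions() > 0 && driver.sticky_error;
}

// Hypothetical RAII owner, playing the role of an array holding device memory.
struct FakeArray {
  FakeDriver* driver;
  int* skipped;

  FakeArray(FakeDriver* d, int* s) : driver(d), skipped(s) {}

  // noexcept(false) so that the "first call after the error" case can
  // report the error by throwing from the destructor.
  ~FakeArray() noexcept(false) {
    if (ShouldSkipFree(*driver)) {
      ++*skipped;  // leave the allocation alone; let the original error propagate
      return;
    }
    driver->Free();
  }
};

// Case 1: an exception is already in flight, so the free is skipped and
// the original exception reaches the handler.
bool UnwindingSkipsFree() {
  FakeDriver driver;
  int skipped = 0;
  std::string caught;
  try {
    FakeArray arr(&driver, &skipped);
    driver.sticky_error = true;  // simulated asynchronous kernel failure
    throw std::runtime_error("original error");
  } catch (const std::runtime_error& e) {
    caught = e.what();
  }
  return caught == "original error" && skipped == 1 && driver.free_calls == 0;
}

// Case 2: no exception is in flight, so the free is attempted and is the
// first call to surface the asynchronous error.
bool FirstCallReportsError() {
  FakeDriver driver;
  int skipped = 0;
  std::string caught;
  try {
    FakeArray arr(&driver, &skipped);
    driver.sticky_error = true;  // error occurred, but nothing has thrown yet
  } catch (const std::runtime_error& e) {
    caught = e.what();
  }
  return caught == "illegal memory access" && skipped == 0 && driver.free_calls == 1;
}
```

Calling `UnwindingSkipsFree()` and `FirstCallReportsError()` from a `main` function should both return `true`. Without the `ShouldSkipFree` check, the first case would throw a second exception while the first is still unwinding, which results in a call to `std::terminate`.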
    Lunderberg committed May 8, 2024
    e1f63ea