CUDA 11.2: Support the built-in Stream Ordered Memory Allocator #4537
Conversation
Bingo. It's due to the NVRTC program name being empty. UPDATE: #4538 fixes it.
@leofang thanks for working on this. I've raised the question about destroying the stream before all memory allocated on it is freed; I'll let you know as soon as I get an answer about that. As for …
@leofang the following is allowed:

```cpp
cudaMallocAsync(&ptr, size, stream);
cudaStreamDestroy(stream);
cudaFreeAsync(ptr, stream);  // or cudaFree(ptr);
```

I believe you're fine implementing it however fits CuPy best.
cc @jrhemstad @harrism @nsakharnykh (who may have thoughts on this 🙂)
@pentschev @jrhemstad I don't think this works. In Python I got

```python
>>> import cupy as cp
>>> s = cp.cuda.Stream()
>>> s.ptr
94049875546000
>>> ptr = cp.cuda.runtime.mallocAsync(100, s.ptr)
>>> del s
>>> cp.cuda.runtime.freeAsync(ptr, 94049875546000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cupy_backends/cuda/api/runtime.pyx", line 638, in cupy_backends.cuda.api.runtime.freeAsync
    cpdef freeAsync(intptr_t ptr, intptr_t stream):
  File "cupy_backends/cuda/api/runtime.pyx", line 643, in cupy_backends.cuda.api.runtime.freeAsync
    check_status(status)
  File "cupy_backends/cuda/api/runtime.pyx", line 253, in cupy_backends.cuda.api.runtime.check_status
    raise CUDARuntimeError(status)
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidResourceHandle: invalid resource handle
```

In C:

```cpp
// nvcc -std=c++11 -arch=sm_75 test_mallocAsync.cu -o test_mallocAsync
#include <cstdio>
#include <cstdlib>

int main() {
    double* a;
    cudaStream_t s;

    int error = cudaStreamCreate(&s);
    if (error != 0) {
        printf("failed!\n");
        exit(1);
    }
    printf("stream ok!\n");
    cudaDeviceSynchronize();

    error = cudaMallocAsync(&a, 100, s);
    if (error != 0) {
        printf("failed!\n");
        exit(1);
    }
    printf("malloc ok!\n");
    cudaDeviceSynchronize();

    error = cudaStreamDestroy(s);
    if (error != 0) {
        printf("failed!\n");
        exit(1);
    }
    printf("stream destroy ok!\n");
    cudaDeviceSynchronize();

    error = cudaFreeAsync(a, s);
    if (error != 0) {
        printf("failed: %i ", error);
        printf("%s\n", cudaGetErrorString((cudaError_t)error));
        exit(1);
    }
    printf("free ok!\n");
    cudaDeviceSynchronize();
    return 0;
}

/* output: */
// stream ok!
// malloc ok!
// stream destroy ok!
// failed: 709 context is destroyed
```
(If the stream is held valid before calling …)
This isn't valid. You can't use a stream after it's been destroyed; there's nothing unique about `cudaFreeAsync` here. That's a use-after-free error. It's like trying to use a pointer after it's been freed.

cc: @maxpkatz
Thanks for the quick reply, @jrhemstad! So, the only legitimate way to free stream-ordered memory after the stream is destroyed (see Issue No. 2 in the PR description) is to call …?
@leofang sorry, my original answer in #4537 (comment) is incomplete at best. In addition to … EDIT: added a comment on synchronization of both streams.
No. You can free on another stream so long as it was somehow ordered with the original stream the pointer was allocated on. There are a number of ways you can do this. Here's one example using events:
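(The example code didn't survive the copy here. Below is a minimal sketch of the pattern being described, written with the CuPy runtime bindings used earlier in this thread; the stream and event names are mine.)

```python
import cupy as cp

s_alloc = cp.cuda.Stream()  # stream the memory is allocated on
s_free = cp.cuda.Stream()   # a different stream that will do the free

ptr = cp.cuda.runtime.mallocAsync(1024, s_alloc.ptr)
# ... enqueue work that uses ptr on s_alloc ...

# Record an event on the allocating stream and make the freeing stream
# wait on it, so the free is ordered after all prior work on s_alloc.
e = cp.cuda.Event(block=False, disable_timing=True)
e.record(s_alloc)
s_free.wait_event(e)

# s_alloc may now be destroyed; the free below is already ordered
# with respect to it through the event.
del s_alloc
cp.cuda.runtime.freeAsync(ptr, s_free.ptr)
```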
No, because you lose a lot of the benefit of using `cudaMallocAsync`.
Ah... OK, thanks for the speedy replies @pentschev @jrhemstad!!! So to follow up further: if at some point all other non-default streams are destroyed, I can always call `cudaFreeAsync` on the default stream?
You technically can with the legacy default stream, as it is implicitly synchronous with other streams (except non-blocking streams!), but I don't suggest going this route. You cannot with the per-thread default stream, as it is not implicitly synchronous with other streams.
OK @jrhemstad, I think this is the final question (I hope!): so does it mean the driver does not guarantee the correct stream order, and it's the user's responsibility to ensure it (via …)?

For example, for the case I just described (no other streams are alive), if before I destroy the stream on which the memory was allocated I add an event to wait on the PTDS, I can then free it on the PTDS, right? Something like

```cpp
cudaMallocAsync(&x, 1024, s);
// do something with x on s...
cudaEventCreate(&e);
cudaEventRecord(e, s);
cudaStreamDestroy(s);
cudaStreamWaitEvent((cudaStream_t)2, e, 0);  // make the PTDS (handle 2) wait on e
cudaFreeAsync(x, (cudaStream_t)2);           // free on the PTDS (UPDATE: use 2 for PTDS...)
```
Yes. The per-thread default stream is effectively the same as any other stream.

Yep, that works. That said, event creation and even recording events is relatively expensive, so it's not something you want to do all the time if you can avoid it. @harrism wrote a nice benchmark of event overheads recently: https://github.com/harrism/cuda_event_benchmark
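One way to act on that advice is to create the event once and re-record it for each free, instead of paying the creation cost every time. A minimal sketch, assuming a hypothetical `ordered_free` helper built on the bindings above (`cudaStreamWaitEvent` captures the event's state at call time, so re-recording afterwards is safe):

```python
import cupy as cp

# Hypothetical helper: reuse a single cached event rather than creating
# a fresh one per deallocation, since event creation is relatively costly.
_cached_event = cp.cuda.Event(block=False, disable_timing=True)

def ordered_free(ptr, alloc_stream, free_stream):
    # Order free_stream after all work queued so far on alloc_stream,
    # then enqueue the free on free_stream.
    _cached_event.record(alloc_stream)
    free_stream.wait_event(_cached_event)
    cp.cuda.runtime.freeAsync(ptr, free_stream.ptr)
```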
Thanks for the pointers @jrhemstad @harrism! It is very nice to see these performance benchmarks. I think we should add RMM to CuPy's interoperability list 🙂

@emcastillo I split the PR into two, so this PR only exposes the necessary APIs (async malloc & free). In the next PR (#4592) the handling of …
Jenkins, test this please
Jenkins CI test (for commit 325571a, target branch master) failed with status FAILURE.
CI errored out due to …
Jenkins, test this please
Jenkins CI test (for commit 325571a, target branch master) failed with status FAILURE.
Jenkins, test this please
Jenkins CI test (for commit f21fe82, target branch master) failed with status FAILURE.
Jenkins, test this please
Jenkins CI test (for commit f2f8f4f, target branch master) failed with status FAILURE.
Jenkins, test this please
Jenkins never started....? Jenkins, test this please
pfnCI, test this please
Jenkins CI test (for commit f2f8f4f, target branch master) failed with status FAILURE.
CI failures are known and unrelated.
Jenkins, test this please
Jenkins CI test (for commit f2f8f4f, target branch master) succeeded!
Thanks @emcastillo and all!
While this is working, I'm marking it as Work in Progress, as there are some issues to be discussed with our NVIDIA friends 🙂
May be blocked by #4443 (?)
This PR exposes CUDA's new Stream Ordered Memory Allocator, added in CUDA 11.2, to CuPy. A new memory type, `MemoryAsync`, is added, which is backed by `cudaMallocAsync()` and `cudaFreeAsync()`. To use this feature, one simply sets the allocator to `malloc_async`, similar to what's done for managed memory (see the sketch below). On older CUDA (<11.2) or unsupported devices, using this new allocator will raise an error at runtime.
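The usage snippet from the description was not preserved; the following is a minimal sketch of what's meant, assuming the `malloc_async` allocator this PR adds under `cupy.cuda`:

```python
import cupy as cp

# Route CuPy's allocations through cudaMallocAsync/cudaFreeAsync,
# analogous to cp.cuda.set_allocator(cp.cuda.malloc_managed) for
# managed memory.
cp.cuda.set_allocator(cp.cuda.malloc_async)

a = cp.arange(10)  # now backed by stream-ordered memory
```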
(I didn't add support for a customized mempool, `cudaMemPool*()`/`cudaMallocFromPoolAsync()` -- which could be the next PR -- as the benefit of using non-default mempools is unclear to me. Also, note that there is no API to expose any current information of the mempool, so it wouldn't be compatible with CuPy's `MemoryPool` API, such as `used_bytes()`, etc.)

Currently observed issues
I think nothing is wrong with my implementation; most likely these are from CUDA 😁

- In `MemoryAsync` we will also need to hold a reference to the stream (object), not just its pointer.
- `nvprof python my_script.py` will fail if `malloc_async` is used in the workload: … We need to confirm if this is nvprof's problem/limitation (very likely it is), as it could be annoying to our users.
TODO

- `docs/source/reference/memory.rst`?

cc: @jakirkham @pentschev @maxpkatz Could you help address the three observed issues? 🙂