CUDA 11 Test: TestFftAllocate
#3777
I built the latest master and fixed #3757 with #3775, and the only error I got from all FFT tests we have is this:

Comments
Oddly, the CI in chainer/chainer-test#593 seems to be ok with this particular test... |
I couldn't reproduce with P100/V100. Could you try |
Yeah tried that too, still reproducible. |
@leofang is that failing for you on master? I can't reproduce that on a V100 either. |
@pentschev @kmaehashi I switched to another machine with one RTX 2080 Ti installed (the previous machine has two), and I still get the same error:
I suspect CUDA 11 raises CUFFT_INTERNAL_ERROR in cases where CUDA 10 did not. |
As a quick check, I changed the loop count here (cupy/tests/cupy_tests/fft_tests/test_fft.py, line 333 at edad3ab) from 10 to 7, and the test passes. CUFFT_INTERNAL_ERROR is raised in the second attempt to allocate a plan (cupy/cuda/cufft.pyx, lines 301 to 305 at edad3ab). |
Were you able to confirm if that's because the GPU goes OOM? |
@kmaehashi @pentschev I have a cleaner reproducer. It is triggered by OOM: before the memory is released, CUFFT_ALLOC_FAILED is raised; after releasing it (del a below), the retry fails with CUFFT_INTERNAL_ERROR:

In [1]: import cupy as cp
In [2]: a = cp.cuda.memory.alloc(10*1024**3) # this occupies more than 90% of memory on a 2080 Ti
In [3]: plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
---------------------------------------------------------------------------
CuFFTError Traceback (most recent call last)
<ipython-input-3-2b1dcda66b19> in <module>
----> 1 plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d.__init__()
275 # set plan, work_area, gpus, streams, and events
276 if not use_multi_gpus:
--> 277 self._single_gpu_get_plan(plan, nx, fft_type, batch)
278 else:
279 self._multi_gpu_get_plan(
~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d._single_gpu_get_plan()
304 result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
305 &work_size)
--> 306 check_result(result)
307
308 work_area = memory.alloc(work_size)
~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.check_result()
145 cpdef inline check_result(int result):
146 if result != 0:
--> 147 raise CuFFTError(result)
148
149
CuFFTError: CUFFT_ALLOC_FAILED
In [4]: del a # <------ key!
In [5]: plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
---------------------------------------------------------------------------
CuFFTError Traceback (most recent call last)
<ipython-input-5-2b1dcda66b19> in <module>
----> 1 plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d.__init__()
275 # set plan, work_area, gpus, streams, and events
276 if not use_multi_gpus:
--> 277 self._single_gpu_get_plan(plan, nx, fft_type, batch)
278 else:
279 self._multi_gpu_get_plan(
~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d._single_gpu_get_plan()
304 result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
305 &work_size)
--> 306 check_result(result)
307
308 work_area = memory.alloc(work_size)
~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.check_result()
145 cpdef inline check_result(int result):
146 if result != 0:
--> 147 raise CuFFTError(result)
148
149
CuFFTError: CUFFT_INTERNAL_ERROR

Note that in both cases the error is raised at line 306 of cupy/cuda/cufft.pyx, as seen in the tracebacks above. |
Thanks for the reproducer! Should we do something like a synchronization or wait after freeing the memory? |
I tested and it would not help. I also noticed that the same plan takes 3 GB in CUDA 10.0 and 3.74 GB in CUDA 11.0. Almost certainly a regression. |
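One way to compare those work-area sizes across toolkits is cuFFT's size-query API. A minimal sketch (assuming the documented cufftEstimate1d, which gives an upper-bound estimate; cufftGetSize1d on a created handle gives the refined number):

// Sketch: query the work area cuFFT wants for the failing plan shape.
// Build against each CUDA toolkit to compare: nvcc plan_size.c -lcufft -o plan_size
#include <stdio.h>
#include <cufft.h>

int main() {
    size_t work_size = 0;
    // same transform as the reproducers below: large prime-length R2C, batch 1
    int result = cufftEstimate1d(100000007, CUFFT_R2C, 1, &work_size);
    printf("result = %d, estimated work area = %.2f GiB\n",
           result, work_size / (1024.0 * 1024.0 * 1024.0));
    return 0;
}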
@pentschev Are you able to reproduce this if you increase the loop count (since you're on V100)? |
Sorry for the late reply. Yes, I'm able to reproduce by increasing the count. I incremented in steps of 10, and it starts failing at 40:

==================================================================================================================== FAILURES ====================================================================================================================
_______________________________________________________________________________________________________ TestFftAllocate.test_fft_allocate ________________________________________________________________________________________________________
self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>
def test_fft_allocate(self):
# Check CuFFTError is not raised when the GPU memory is enough.
# See https://github.com/cupy/cupy/issues/1063
# TODO(mizuno): Simplify "a" after memory compaction is implemented.
a = []
for i in range(40):
a.append(cupy.empty(100000000))
del a
b = cupy.empty(100000007, dtype=cupy.float32)
> cupy.fft.fft(b)
tests/cupy_tests/fft_tests/test_fft.py:337:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:578: in fft
return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:187: in _fft
a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:157: in _fft_c2c
a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:114: in _exec_fft
plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> raise CuFFTError(result)
E cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
cupy/cuda/cufft.pyx:147: CuFFTError
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

But as seen above, it fails on the actual FFT computation, which is different from increasing it to 50, where it fails during allocation:

==================================================================================================================== FAILURES ====================================================================================================================
_______________________________________________________________________________________________________ TestFftAllocate.test_fft_allocate ________________________________________________________________________________________________________
self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>
def test_fft_allocate(self):
# Check CuFFTError is not raised when the GPU memory is enough.
# See https://github.com/cupy/cupy/issues/1063
# TODO(mizuno): Simplify "a" after memory compaction is implemented.
a = []
for i in range(50):
> a.append(cupy.empty(100000000))
tests/cupy_tests/fft_tests/test_fft.py:334:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/_creation/basic.py:22: in empty
return cupy.ndarray(shape, dtype, order=order)
cupy/core/core.pyx:193: in cupy.core.core.ndarray.__init__
self.data = memory.alloc(self.size * itemsize)
cupy/cuda/memory.pyx:569: in cupy.cuda.memory.alloc
return get_allocator()(size)
cupy/cuda/memory.pyx:1241: in cupy.cuda.memory.MemoryPool.malloc
cpdef MemoryPointer malloc(self, size_t size):
cupy/cuda/memory.pyx:1262: in cupy.cuda.memory.MemoryPool.malloc
return mp.malloc(size)
cupy/cuda/memory.pyx:930: in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
return self._malloc(rounded_size)
cupy/cuda/memory.pyx:950: in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
mem = self._try_malloc(size)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> raise OutOfMemoryError(
E cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 800,000,000 bytes (allocated so far: 33,600,000,000 bytes).
cupy/cuda/memory.pyx:1201: OutOfMemoryError
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 800,000,000 bytes (allocated so far: 33,600,000,000 bytes). |
Thanks, @pentschev! Could you file a ticket internally? Would it help if I posted a reproducer in C? |
Sorry @leofang , I was OOO but am back now. A C reproducer would be of immense help if you have the time to write one! :) |
@pentschev Here you go. This is the C version of #3777 (comment).

// compile with "nvcc cufft_bug.c -lcufft -o cufft_bug"
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>
int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);
    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of the memory on an RTX 2080 Ti
    result = cudaMalloc((void**)(&a), bytes);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    cudaFree(a);

    // on CUDA 10.0, this works (result = 0);
    // on CUDA 11.0, the return code is 5 (CUFFT_INTERNAL_ERROR)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);
    return 0;
} |
@leofang did you try this on CUDA 10.2 or just CUDA 10.0? I don't have access to any 10.0 machines to try right now, but it fails for me on 10.2 as well. |
@pentschev Just 10.0. I could try to find a 10.2 machine and confirm, but not immediately. |
@leofang I just installed CUDA 10.0 on one of our machines -- although with the 11.0 driver, version 450.51.06 -- and tested CUDA 10.0, 10.2, and 11.0 on that machine. For me it fails for both 10.0 and 10.2 but passes for 11.0; this is on a 32 GB V100. It seems the results I get are the exact opposite of yours, or am I interpreting something wrong? |
@pentschev I think you got exactly the opposite results indeed -- I am using an RTX 2080 Ti with driver 450.57. |
I'm really not sure how to proceed here; these results are too conflicting to provide a good picture of what's going on. Do you have access to any other GPUs to see what kind of results you get? I unfortunately only have access to V100s, so I can't see what happens anywhere else. I can also try filing the bug report, but it will be difficult to then provide any further information beyond the results we see today. |
@pentschev Can we file the report and state our (conflicting) test outcomes? I might be able to access different GPUs (V100), but I don't have control over their driver versions, so I can only test a single CUDA version on each. |
I also get the failure on CUDA 11.0.3 on a GTX 1080 Ti with driver 450.51.06. For CUDA 10.2 with the 440.100 driver, I get a failure as well, with the same output from @leofang's C example as on 11.0. |
I went ahead and installed the Ubuntu packages
|
I directly installed 10.2 via the runfile on Ubuntu and tested on the same machine (RTX 2080 Ti with driver 450.57), and also got error code 5. Looks pretty consistent. |
Sorry for the late reply here, and thank you both for the additional data points. I just filed a bug report for this; the internal reference number is 3125482. I'll keep you posted once I know more about it. |
I heard back from the cuFFT developers, and the issue has been prioritized for the CUDA 11.2 release; I don't have further details at this time, though. |
Thanks, Peter! Keep us posted. |
I was provided with a preview build of cuFFT 11.2 and verified the issue has been fixed there. Thanks @leofang for the report and reproducer, it has been very helpful! |
Thanks, Peter! If I catch that on 11.2 I'll reopen then. |
1. Skip the allocate test due to a cuFFT bug (cupy#3777)
2. Fix @multi_gpu
3. Split callback tests
@pentschev I tested on CUDA 11.2 and it seems to be fixed 🙂 However, if I instead use the async mempool from CUDA 11.2, the very same code with a simple change (cudaMalloc/cudaFree replaced by cudaMallocAsync/cudaFreeAsync) fails again:

// nvcc test_cufft_bug_112.cu -lcufft -o test_cufft_bug_112
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>
int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);
    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of the memory on an RTX 2080 Ti
    result = cudaMallocAsync((void**)(&a), bytes, 0);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    cudaFreeAsync(a, 0);

    // with cudaMalloc/cudaFree this now works on CUDA 11.2 (result = 0);
    // with cudaMallocAsync/cudaFreeAsync the return code is again 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);
    return 0;
}

This was discovered when testing #4537. |
When you say it fails again, do you mean you see the exact same issue as we used to see with cudaMalloc/cudaFree? |
Yes, I think this is more or less the same issue, but this time with cudaMallocAsync/cudaFreeAsync. |
I was able to upgrade my workstation to 11.2 and test your code, @leofang. I think what you're missing is a stream synchronization after cudaFreeAsync. |
Thanks @pentschev! I can confirm this eliminates the error, but why is it needed?
|
Hmm, that's a good point. I actually thought that was what was missing, because I noticed that using the |
Can someone summarize what I should be looking at? |
Hi @jrhemstad, sorry for the confusion. This was an old thread I'm revisiting to test the new async malloc support. The test code in question is here: #3777 (comment). It seems @pentschev agrees with me that this is valid code, but we don't know if the problem is cuFFT not trying harder to allocate memory, or the async mempool not releasing memory correctly. |
I'm not familiar with cuFFT or what |
The thing is that in that sample |
No, that's the primary difference in cudaFreeAsync: it does not synchronize. The implicit sync performed by cudaFree is what makes the freed memory immediately available again. |
Thank you @jrhemstad @pentschev. OK, I think it's reasonable that even with stream 0 or 1, cudaFreeAsync does not sync. It's kinda like the order of precedence matters here (a sync stream vs an async API 😁), which would be best to be documented (so is the fact that cudaFree syncs -- it's like a grapevine that we learn from SO, the NV developer forum, etc., but that never gets explicitly acknowledged in the official docs). However, this brings me to my next question: for memory that has already been returned via cudaFreeAsync, how do we make it available to the synchronous allocators (e.g. cudaMalloc)?

The context with cuFFT is this: we noticed cufftMakePlan* internally allocates some scratch memory even if we call cufftSetAutoAllocation to manage the working area ourselves. This is why we tested it (and discovered a bug in CUDA 11): creating a cuFFT plan could lead to OOM. Now, the "bug" here is that even if I return all memory back to the (async) pool, cuFFT still has difficulty allocating memory (which should not happen), which leads to my question above. |
Sorry, silly mobile UI with my fat thumb |
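For context, this is the manual work-area flow being referred to above, as a minimal sketch (assuming the documented cufftSetAutoAllocation/cufftSetWorkArea APIs; this is not CuPy's exact code):

// Sketch: managing the cuFFT work area manually. Note that cufftMakePlan*
// may still allocate internal scratch memory on its own, which is what
// triggers the OOM discussed in this thread.
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    cufftHandle h;
    size_t work_size = 0;
    void* work_area = NULL;

    assert(cufftCreate(&h) == CUFFT_SUCCESS);
    assert(cufftSetAutoAllocation(h, 0) == CUFFT_SUCCESS);   // we own the work area
    assert(cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size) == CUFFT_SUCCESS);
    assert(cudaMalloc(&work_area, work_size) == cudaSuccess);
    assert(cufftSetWorkArea(h, work_area) == CUFFT_SUCCESS); // attach our buffer
    // ... run transforms with cufftExecR2C(...) here ...
    cufftDestroy(h);
    cudaFree(work_area);
    return 0;
}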
Would you have expected a
You can use cudaMemPoolTrimTo to release unused memory held by the pool back to the driver. |
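A minimal sketch of that pool-trimming approach (assuming the documented cudaDeviceGetDefaultMemPool / cudaMemPoolTrimTo / cudaMemPoolAttrReleaseThreshold APIs; the reproducer further down does the trim inline):

#include <cuda_runtime.h>

// Sketch: return unused memory cached by the default stream-ordered pool
// to the driver so that synchronous allocators can see it again.
void trim_default_pool(int device) {
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);
    // release all unused memory held by the pool (keep 0 bytes cached)
    cudaMemPoolTrimTo(pool, 0);
    // with a release threshold of 0 (the default), unused memory is also
    // released back to the driver at the next stream/event synchronization
    unsigned long long threshold = 0;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);
}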
You're right. I confused myself with device sync vs stream sync.
Though I still think it's an important implementation detail (like the legacy default stream) that's worth documenting. People ask about this all the time.
@jrhemstad I got |
|
Thanks for the pointer, @jrhemstad! This really should have been documented/mentioned on the same page as the Stream Ordered Memory Allocator... 😢 For example, a quick mention under that section would go a long way.

Unfortunately, though, it still doesn't fix the error after shrinking the default mempool. Looks like a stream synchronization is the only way to expose the unused memory, at least for cuFFT's use:

// nvcc test_cufft_bug_112.cu -lcufft -o test_cufft_bug_112
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>
int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);
    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of the memory on an RTX 2080 Ti
    result = cudaMallocAsync((void**)(&a), bytes, 0);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    result = cudaFreeAsync(a, 0);
    assert(result == 0);

    cudaMemPool_t pool;
    result = cudaDeviceGetDefaultMemPool(&pool, 0);
    assert(result == 0);
    result = cudaMemPoolTrimTo(pool, 0);
    assert(result == 0);

    // without syncing the stream, the next call (which allocates memory
    // internally) would fail with return code 2 (CUFFT_ALLOC_FAILED)
    // cudaStreamSynchronize(0);

    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);
    return 0;
} |
I'm guessing the memory allocated by cufftMakePlan1d uses cudaMalloc rather than cudaMallocAsync, so it can't draw from memory still held by the async pool. Note that the synchronization wouldn't be necessary if the internal allocation were also stream-ordered. A simpler example is something like this:
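A minimal sketch along those lines (reconstructed, not the original snippet; assumes stream 0 and the default mempool with its default release threshold of 0):

// cudaFreeAsync only enqueues the free in stream order; until the stream
// reaches that point and the pool releases memory at a synchronization,
// an immediate cudaMalloc may not find the memory available.
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    size_t big = 10*1024*1024*1024UL;  // most of the device's memory
    void *a = NULL, *b = NULL;
    cudaMallocAsync(&a, big, 0);       // stream-ordered allocation on stream 0
    cudaFreeAsync(a, 0);               // enqueued; may not have executed yet
    if (cudaMalloc(&b, big) != cudaSuccess) {  // can fail: pool still holds the memory
        cudaStreamSynchronize(0);      // pending free runs; pool releases to driver
        printf("retry: %d\n", (int)cudaMalloc(&b, big));  // expected 0 (cudaSuccess)
    }
    cudaFree(b);
    return 0;
}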
|
@jrhemstad @pentschev I think this does not explain the "bug", because as I said the memory isn't in use after cudaFreeAsync is called. But if the point is that the free hasn't actually executed until the stream is synchronized, this would explain it (and is also what I was asking and wanted to know). Thank you, Jake, for confirming that we need a stream synchronization to actually release the memory. I wouldn't have thought of it; at least the SOMA documentation didn't say this is expected... 😢 |
Sorry @leofang, I dropped the ball here. Is there anything still blocking you w.r.t. cuFFT with cudaMallocAsync? |
Thanks for checking back, @pentschev. I think everything is fine with the new async allocator (except that its documentation needs to be clearer 😂). |