CUDA 11 Test: TestFftAllocate #3777

Closed

leofang opened this issue Aug 13, 2020 · 54 comments

@leofang
Member

leofang commented Aug 13, 2020

I built the latest master with #3775 applied (which fixes #3757), and the only error I got from all the FFT tests we have is this:

$ pytest tests/cupy_tests/fft_tests/test_fft.py
========================================================================= test session starts =========================================================================
platform linux -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/leofang/cupy_cuda11, configfile: setup.cfg
collected 717 items                                                                                                                                                   

tests/cupy_tests/fft_tests/test_fft.py ........................................................................................................................ [ 16%]
............................................................................................................................................................... [ 38%]
............................................................................................................................................................... [ 61%]
............................................................................................................................................................... [ 83%]
.....................................................................................................................F..                                        [100%]

============================================================================== FAILURES ===============================================================================
__________________________________________________________________ TestFftAllocate.test_fft_allocate __________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(10):
            a.append(cupy.empty(100000000))
        del a
        b = cupy.empty(100000007, dtype=cupy.float32)
>       cupy.fft.fft(b)

tests/cupy_tests/fft_tests/test_fft.py:336: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cupy/fft/fft.py:567: in fft
    return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
cupy/fft/fft.py:182: in _fft
    a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
cupy/fft/fft.py:152: in _fft_c2c
    a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
cupy/fft/fft.py:109: in _exec_fft
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
    self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
    check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CuFFTError(result)
E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

cupy/cuda/cufft.pyx:147: CuFFTError
======================================================================= short test summary info =======================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
=================================================================== 1 failed, 716 passed in 13.16s ====================================================================
@leofang
Member Author

leofang commented Aug 13, 2020

Oddly, the CI in chainer/chainer-test#593 seems to be ok with this particular test...

@kmaehashi
Member

I couldn't reproduce with P100/V100. Could you try python -m pytest tests/cupy_tests/fft_tests/test_fft.py -k test_fft_allocate to isolate the issue? (maybe it depends on test order?)

@leofang
Member Author

leofang commented Aug 14, 2020

Yeah tried that too, still reproducible.

@pentschev
Member

@leofang is that failing for you on master? I can't reproduce that on a V100 either.

@leofang
Member Author

leofang commented Aug 14, 2020

@pentschev @kmaehashi I switched to another machine, with one GTX 2080 Ti installed (the previous machine has two), and I still get the same error:

$ pytest tests/cupy_tests/fft_tests/test_fft.py 
========================================================================= test session starts =========================================================================
platform linux -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/leo/dev/cupy_cuda11, configfile: setup.cfg
collected 717 items                                                                                                                                                   

tests/cupy_tests/fft_tests/test_fft.py ........................................................................................................................ [ 16%]
................................................ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss............... [ 38%]
..........................................................................................................................................................sssss [ 61%]
sssssssssssssssssss............................................................................................................................................ [ 83%]
.....................................................................................................................F..                                        [100%]

============================================================================== FAILURES ===============================================================================
__________________________________________________________________ TestFftAllocate.test_fft_allocate __________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(10):
            a.append(cupy.empty(100000000))
        del a
        b = cupy.empty(100000007, dtype=cupy.float32)
>       cupy.fft.fft(b)

tests/cupy_tests/fft_tests/test_fft.py:337: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cupy/fft/fft.py:567: in fft
    return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
cupy/fft/fft.py:182: in _fft
    a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
cupy/fft/fft.py:152: in _fft_c2c
    a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
cupy/fft/fft.py:109: in _exec_fft
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
    self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
    check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CuFFTError(result)
E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

cupy/cuda/cufft.pyx:147: CuFFTError
======================================================================= short test summary info =======================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
============================================================= 1 failed, 596 passed, 120 skipped in 45.28s =============================================================

I suspect CUDA 11 raises CUFFT_INTERNAL_ERROR instead of CUFFT_ALLOC_FAILED for OOM, but I need to do more tests to confirm.
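
For reference when reading the return codes below, the relevant cuFFT status codes (abridged from the cufftResult enum in cufft.h) are:

typedef enum cufftResult_t {
    CUFFT_SUCCESS        = 0x0,
    CUFFT_INVALID_PLAN   = 0x1,
    CUFFT_ALLOC_FAILED   = 0x2,  // cuFFT failed to allocate GPU or CPU memory
    CUFFT_INVALID_TYPE   = 0x3,
    CUFFT_INVALID_VALUE  = 0x4,
    CUFFT_INTERNAL_ERROR = 0x5,  // internal driver or library error
    /* ... */
} cufftResult;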

@leofang
Member Author

leofang commented Aug 14, 2020

As a quick check, I changed the loop here

for i in range(10):

from 10 to 7, and the test passes. CUFFT_INTERNAL_ERROR is raised in the second attempt to create the plan:

cupy/cupy/cuda/cufft.pyx

Lines 301 to 305 in edad3ab

        if result == 2:
            cupy.get_default_memory_pool().free_all_blocks()
            with nogil:
                result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
                                         &work_size)

@pentschev
Member

from 10 to 7, and the test passes. CUFFT_INTERNAL_ERROR is raised in the second attempt to create the plan:

Were you able to confirm if that's because the GPU goes OOM?

@leofang
Member Author

leofang commented Aug 19, 2020

@kmaehashi @pentschev I have a cleaner reproducer. It is triggered by OOM, but before the memory is released (del a), cuFFT behaves correctly (it throws CUFFT_ALLOC_FAILED).

In [1]: import cupy as cp

In [2]: a = cp.cuda.memory.alloc(10*1024**3)  # this occupies more than 90% of memory on a 2080 Ti

In [3]: plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
---------------------------------------------------------------------------
CuFFTError                                Traceback (most recent call last)
<ipython-input-3-2b1dcda66b19> in <module>
----> 1 plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d.__init__()
    275             # set plan, work_area, gpus, streams, and events
    276             if not use_multi_gpus:
--> 277                 self._single_gpu_get_plan(plan, nx, fft_type, batch)
    278             else:
    279                 self._multi_gpu_get_plan(

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d._single_gpu_get_plan()
    304                 result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
    305                                          &work_size)
--> 306         check_result(result)
    307 
    308         work_area = memory.alloc(work_size)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.check_result()
    145 cpdef inline check_result(int result):
    146     if result != 0:
--> 147         raise CuFFTError(result)
    148 
    149 

CuFFTError: CUFFT_ALLOC_FAILED

In [4]: del a   # <------ key!

In [5]: plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
---------------------------------------------------------------------------
CuFFTError                                Traceback (most recent call last)
<ipython-input-5-2b1dcda66b19> in <module>
----> 1 plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d.__init__()
    275             # set plan, work_area, gpus, streams, and events
    276             if not use_multi_gpus:
--> 277                 self._single_gpu_get_plan(plan, nx, fft_type, batch)
    278             else:
    279                 self._multi_gpu_get_plan(

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d._single_gpu_get_plan()
    304                 result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
    305                                          &work_size)
--> 306         check_result(result)
    307 
    308         work_area = memory.alloc(work_size)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.check_result()
    145 cpdef inline check_result(int result):
    146     if result != 0:
--> 147         raise CuFFTError(result)
    148 
    149 

CuFFTError: CUFFT_INTERNAL_ERROR

Note that in both cases the error is raised at line 306 as seen in TestFftAllocate.

@kmaehashi
Member

kmaehashi commented Aug 19, 2020

Thanks for the reproducer! Should we add some kind of synchronization or wait after calling free_all_blocks() and before creating the plan...?

@leofang
Member Author

leofang commented Aug 19, 2020

I tested that and it does not help. I also noticed that the same plan takes 3 GB in CUDA 10.0 but 3.74 GB in CUDA 11.0. I'm almost certain this is a regression.
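
To compare how much work area the plan requests across CUDA versions, a minimal sketch (mine, not from this thread; it uses the standard cufftEstimate1d/cufftMakePlan1d calls) is:

// compile with "nvcc plan_size.c -lcufft -o plan_size"
#include <stdio.h>
#include <cufft.h>

int main() {
    size_t estimated = 0, planned = 0;
    cufftHandle h;

    // rough upper bound, no handle required
    cufftEstimate1d(100000007, CUFFT_R2C, 1, &estimated);

    // actual work-area size reported at plan creation
    cufftCreate(&h);
    cufftSetAutoAllocation(h, 0);   // manage the work area ourselves
    cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &planned);
    cufftDestroy(h);

    printf("estimated: %zu bytes, planned: %zu bytes\n", estimated, planned);
    return 0;
}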

leofang mentioned this issue Aug 20, 2020
@leofang
Member Author

leofang commented Aug 26, 2020

@pentschev Are you able to reproduce this if you increase the loop count (since you're on V100)?

@pentschev
Member

Sorry for the late reply. Yes, I'm able to reproduce it by increasing the count. I incremented in steps of 10, and it starts failing at 40:

==================================================================================================================== FAILURES ====================================================================================================================
_______________________________________________________________________________________________________ TestFftAllocate.test_fft_allocate ________________________________________________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(40):
            a.append(cupy.empty(100000000))
        del a
        b = cupy.empty(100000007, dtype=cupy.float32)
>       cupy.fft.fft(b)

tests/cupy_tests/fft_tests/test_fft.py:337:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:578: in fft
    return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:187: in _fft
    a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:157: in _fft_c2c
    a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:114: in _exec_fft
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
    self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
    check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CuFFTError(result)
E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

cupy/cuda/cufft.pyx:147: CuFFTError
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

But as seen above, it fails in the actual FFT call, which is different from increasing it to 50, where it fails during the array allocations:

==================================================================================================================== FAILURES ====================================================================================================================
_______________________________________________________________________________________________________ TestFftAllocate.test_fft_allocate ________________________________________________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(50):
>           a.append(cupy.empty(100000000))

tests/cupy_tests/fft_tests/test_fft.py:334:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/_creation/basic.py:22: in empty
    return cupy.ndarray(shape, dtype, order=order)
cupy/core/core.pyx:193: in cupy.core.core.ndarray.__init__
    self.data = memory.alloc(self.size * itemsize)
cupy/cuda/memory.pyx:569: in cupy.cuda.memory.alloc
    return get_allocator()(size)
cupy/cuda/memory.pyx:1241: in cupy.cuda.memory.MemoryPool.malloc
    cpdef MemoryPointer malloc(self, size_t size):
cupy/cuda/memory.pyx:1262: in cupy.cuda.memory.MemoryPool.malloc
    return mp.malloc(size)
cupy/cuda/memory.pyx:930: in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
    return self._malloc(rounded_size)
cupy/cuda/memory.pyx:950: in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
    mem = self._try_malloc(size)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise OutOfMemoryError(
E   cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 800,000,000 bytes (allocated so far: 33,600,000,000 bytes).

cupy/cuda/memory.pyx:1201: OutOfMemoryError
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 800,000,000 bytes (allocated so far: 33,600,000,000 bytes).

@leofang
Member Author

leofang commented Aug 31, 2020

Thanks, @pentschev! Could you file a ticket internally? Would it help if I posted a reproducer in C?

@pentschev
Member

Sorry @leofang, I was OOO but am back now. A C reproducer would be of immense help if you have the time to write one! :)

@leofang
Member Author

leofang commented Sep 8, 2020

@pentschev Here you go. This is the C version of #3777 (comment).

// compile with "nvcc cufft_bug.c -lcufft -o cufft_bug"

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>


int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);

    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of memory on GTX 2080 Ti
    result = cudaMalloc((void**)(&a), bytes);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    cudaFree(a);

    // on CUDA 10.0, this works (result = 0)
    // on CUDA 11.0, the return code is 5 (CUFFT_INTERNAL_ERROR)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);

    return 0;
}

@pentschev
Member

@leofang did you try this on CUDA 10.2 or just CUDA 10.0? I don't have access to any 10.0 machines right now to try, but it fails for me on 10.2 as well.

@leofang
Member Author

leofang commented Sep 10, 2020

@pentschev Just 10.0. I could try to find a 10.2 machine and confirm, but not immediately.

@pentschev
Member

@leofang I just installed CUDA 10.0 on one of our machines -- although with the 11.0 driver, version 450.51.06 -- and tested CUDA 10.0, 10.2, and 11.0 on that machine. For me it fails for both 10.0 and 10.2 but passes for 11.0; this is on a 32 GB V100. It seems the results I get are the exact opposite of yours, or am I interpreting something wrong?

@leofang
Member Author

leofang commented Sep 10, 2020

@pentschev I think you got exactly the opposite results indeed -- I am using a GTX 2080 Ti with driver 450.57.

@pentschev
Member

@pentschev I think you got exactly the opposite results indeed -- I am using a GTX 2080 Ti with driver 450.57.

I'm really not sure how to proceed here; it seems these results are too conflicting to provide a good picture of what's going on. Do you have access to any other GPUs to see what kind of results you get? Unfortunately I only have access to V100s, so I can't see what happens anywhere else. I can also try filing the bug report, but it will then be difficult to provide any further information given the results we see today.

@leofang
Member Author

leofang commented Sep 11, 2020

@pentschev Can we file the report and state our (conflicting) test outcomes? I might be able to access different GPUs (V100), but I don't have control over their driver versions, so I can only test single CUDA versions.

@grlee77
Contributor

grlee77 commented Sep 14, 2020

I also get the failure on CUDA 11.0.3 on a GTX 1080 Ti with driver 450.51.06.

For CUDA 10.2 with the 440.100 driver, I would get a failure with an OutOfMemoryError when running this test case, but after switching to 11.0 I see the message about CUFFT_INTERNAL_ERROR instead.

Output of the C example provided by @leofang on 11.0:

return code is 2
after free: return code is 5
cufft_bug: fft_alloc.cpp:37: int main(): Assertion `result == 0' failed.
Aborted (core dumped)

@grlee77
Contributor

grlee77 commented Sep 14, 2020

I went ahead and installed the Ubuntu packages cuda-minimal-build-10-0, cuda-minimal-build-10-1, and cuda-minimal-build-10-2, as well as the corresponding cuFFT packages. When using these along with the recent driver 450.51.06 and the C example above, I got the same "return code is 5" error on 10.1 and 10.2 that was seen on 11.0, but the example succeeds on 10.0 with:

return code is 2
after free: return code is 0

@leofang
Member Author

leofang commented Sep 14, 2020

I directly installed 10.2 via the runfile on Ubuntu and tested on the same machine (GTX 2080 Ti with driver 450.57), and also got error code 5. Looks pretty consistent.

@pentschev
Member

Sorry for the late reply here, and thank you both for the additional data points. I just filed a bug report for this; the internal reference number is 3125482. I'll keep you posted once I know more about it.

@pentschev
Member

I heard back from the cuFFT developers, and the issue has been prioritized for the CUDA 11.2 release; I don't have further details at this time, though.

@leofang
Member Author

leofang commented Sep 15, 2020

Thanks, Peter! Keep us posted.

@pentschev
Member

I was provided with a preview build of cuFFT 11.2 and verified the issue has been fixed there. Thanks @leofang for the report and reproducer, it has been very helpful!

@leofang
Member Author

leofang commented Nov 13, 2020

Thanks, Peter! If I catch that on 11.2 I'll reopen then.

leofang closed this as completed Nov 13, 2020
leofang added a commit to leofang/cupy that referenced this issue Dec 2, 2020
1. Skip the allocate test due to a cuFFT bug (cupy#3777)
2. Fix @multi_gpu
3. Split callback tests
@leofang
Member Author

leofang commented Jan 11, 2021

@pentschev I tested on CUDA 11.2 and it seems to be fixed 🙂

However, if I instead use the async mempool from CUDA 11.2, the very same code with a simple change (cudaMalloc -> cudaMallocAsync and cudaFree -> cudaFreeAsync) fails again...

// nvcc test_cufft_bug_112.cu -lcufft -o test_cufft_bug_112
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>


int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);

    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of memory on GTX 2080 Ti
    result = cudaMallocAsync((void**)(&a), bytes, 0);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    cudaFreeAsync(a, 0);

    // on CUDA 11.2 with cudaMalloc/cudaFree, this now works (result = 0),
    // but with cudaMallocAsync/cudaFreeAsync the return code is again 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);

    return 0;
}

This was discovered while testing #4537.

@pentschev
Member

When you say it fails again, do you mean you see the exact same issue as we used to see with cudaMalloc/cudaFree on 11.1 and earlier versions?

@leofang
Member Author

leofang commented Jan 11, 2021

Yes, I think this is more or less the same issue, but this time with cudaMallocAsync/cudaFreeAsync on 11.2. Another minor difference is that the error code is now CUFFT_ALLOC_FAILED instead of CUFFT_INTERNAL_ERROR. Perhaps I didn't use the async allocation correctly?

@pentschev
Member

I was able to upgrade my workstation to 11.2 and test your code, @leofang. I think what you're missing is a cudaStreamSynchronize(0) after cudaFreeAsync(a, 0); that has resolved the issue for me. Could you check on your end too?
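
Relative to the reproducer above, the suggested change is just this fragment (a sketch of the lines around the free; everything else stays the same):

    result = cudaFreeAsync(a, 0);
    assert(result == 0);

    // let the stream-ordered free take effect before cuFFT allocates internally
    result = cudaStreamSynchronize(0);
    assert(result == 0);

    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);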

@leofang
Member Author

leofang commented Jan 11, 2021

I think what you're missing is a cudaStreamSynchronize(0) after cudaFreeAsync(a, 0); that has resolved the issue for me. Could you check on your end too?

Thanks @pentschev! I can confirm this eliminates the error, but why is it needed?

@pentschev
Member

Hmm, that's a good point. I actually thought that was what was missing because I noticed the cuda{Malloc,Free}Async version was much faster than the sync counterparts, and I instinctively tried adding the sync step without even considering that it was stream 0 and should already be synchronous -- but now I totally agree with you that it shouldn't be necessary. I don't have an answer as to why that resolves the issue; I hope @jrhemstad sees something we're overlooking, perhaps?

@jrhemstad

Can someone summarize what I should be looking at?

@leofang
Member Author

leofang commented Jan 11, 2021

Hi @jrhemstad, sorry for the confusion. This is an old thread I'm revisiting to test the new async malloc support. The test code in question is here: #3777 (comment). It seems @pentschev agrees with me that this is valid code, but we don't know whether the problem is cuFFT not trying harder to allocate memory, or the async mempool not releasing memory correctly.

@jrhemstad

I'm not familiar with cuFFT or what cufftMakePlan1d does. It doesn't look like the pointer a interacts with the cuFFT calls at all, so I'm not sure what is expected here.

@pentschev
Member

The thing is that in that sample, cudaFree(a) implied a synchronization, which doesn't happen if we switch to cudaFreeAsync(a, 0). Shouldn't cudaFreeAsync imply a synchronization as long as it's on the default stream, as Leo used in that sample?

@jrhemstad

jrhemstad commented Jan 11, 2021

The thing is that in that sample, cudaFree(a) implied a synchronization, which doesn't happen if we switch to cudaFreeAsync(a, 0). Shouldn't cudaFreeAsync imply a synchronization as long as it's on the default stream, as Leo used in that sample?

No, that's the primary difference: cudaFreeAsync doesn't do a device sync.

The implicit sync performed by cudaFree isn't documented anywhere. It's implementation-defined behavior that people have come to rely on, which has prevented removing the sync from cudaFree. This is (partly) why cudaFreeAsync was added: to allow a free that doesn't sync the whole device.

@leofang
Member Author

leofang commented Jan 12, 2021

Thank you @jrhemstad @pentschev. OK, I think it's reasonable that even with stream 0 or 1, cudaFreeAsync does not sync. It's kind of like the order of precedence matters here (a sync stream vs. an async API 😁), which would best be documented (as would the fact that cudaFree syncs -- it's like a grapevine we learn from all the time on SO, the NVIDIA developer forum, etc., but it never gets explicitly acknowledged in the official docs).

However, this brings me to my next question: For memory that is already returned via cudaFreeAsync, how to make it available to the sync-version of allocators? (ex: cudaMalloc)

The context with cuFFT is this: we noticed cufftMakePlan* internally allocates some scratch memory even if we call cufftSetAutoAllocation to manage the work area ourselves. This is why we test it (and how we discovered the bug in CUDA 11): creating a cuFFT plan could lead to OOM.

Now, the "bug" here is that even I return all memory back to the (async) pool, it seems cuFFT still has difficulty in allocating memory (which should not happen), which leads to my question above.

leofang reopened this Jan 12, 2021
leofang closed this as completed Jan 12, 2021
@leofang
Member Author

leofang commented Jan 12, 2021

Sorry, silly mobile UI with my fat thumb

@jrhemstad

OK, I think it's reasonable that even with stream 0 or 1, cudaFreeAsync does not sync. It's kind of like the order of precedence matters here (a sync stream vs. an async API), which would best be documented (as would the fact that cudaFree syncs -- it's like a grapevine we learn from all the time on SO, the NVIDIA developer forum, etc., but it never gets explicitly acknowledged in the official docs).

Would you have expected a cudaMemcpyAsync on the legacy default stream to sync the device? For any stream-ordered CUDA operation, doing work on the legacy default stream doesn't sync the device; it just implicitly orders with other streams, i.e., work on streams before/after an operation on the legacy default stream cannot overlap. That doesn't mean the legacy default stream operation blocks the device until all previous work has completed.

cudaFree happened to do a device synchronize, but this was never documented because it was, and is, an implementation detail. It has become something that many have knowingly or unknowingly come to rely on, but it remains an implementation detail, which is why it is undocumented.

For memory that is already returned via cudaFreeAsync, how to make it available to the sync-version of allocators?

You can use cudaMemPoolTrimTo(0) to release any unused memory held by the pool.

@leofang
Member Author

leofang commented Jan 12, 2021

Would you have expected a cudaMemcpyAsync on the legacy default stream to sync the device? For any stream-ordered CUDA operation, doing work on the legacy default stream doesn't sync the device; it just implicitly orders with other streams, i.e., work on streams before/after an operation on the legacy default stream cannot overlap. That doesn't mean the legacy default stream operation blocks the device until all previous work has completed.

You're right. I confused myself with device sync vs stream sync.

cudaFree happened to do a device synchronize, but this was never documented because it was, and is, an implementation detail. It has become something that many have knowingly or unknowingly come to rely on, but it remains an implementation detail, which is why it is undocumented.

Though I still think it's an important implementation detail (like the legacy default stream) that's worth documenting. People ask about this all the time.

For memory that is already returned via cudaFreeAsync, how to make it available to the sync-version of allocators?

You can use cudaMemPoolTrimTo(0) to release any unused memory held by the pool.

@jrhemstad I got error: too few arguments in function call. Also, does this work on the default mempool that backs cudaMallocAsync and cudaFreeAsync? Is (cudaMemPool_t)0 the correct handle to that?

@jrhemstad

@jrhemstad I got error: too few arguments in function call. Also, does this work on the default mempool that backs cudaMallocAsync and cudaFreeAsync? Is (cudaMemPool_t)0 the correct handle to that?

cudaDeviceGetDefaultMemPool will get you the handle to the default memory pool.
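
For reference, a minimal sketch of that call sequence (the helper name trim_default_pool is mine, just for illustration); note that cudaMemPoolTrimTo takes (pool, minBytesToKeep), hence the "too few arguments" error above:

#include <cuda_runtime.h>

// Trim the default stream-ordered pool on device 0, keeping zero unused bytes.
int trim_default_pool(void) {
    cudaMemPool_t pool;
    cudaError_t err = cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    if (err != cudaSuccess) return (int)err;
    return (int)cudaMemPoolTrimTo(pool, /*minBytesToKeep=*/0);
}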

@leofang
Member Author

leofang commented Jan 12, 2021

@jrhemstad I got error: too few arguments in function call. Also, does this work on the default mempool that backs cudaMallocAsync and cudaFreeAsync? Is (cudaMemPool_t)0 the correct handle to that?

cudaDeviceGetDefaultMemPool will get you the handle to the default memory pool.

Thanks for the pointer, @jrhemstad! This really should be documented/mentioned on the same page as the Stream Ordered Memory Allocator...😢 For example, a quick mention under cudaMallocAsync/cudaFreeAsync would be fine.

Unfortunately, though, it still doesn't fix the error after shrinking the default mempool. It looks like a stream synchronization is the only way to expose the unused memory, at least for cuFFT's use:

// nvcc test_cufft_bug_112.cu -lcufft -o test_cufft_bug_112
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>


int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);

    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of memory on GTX 2080 Ti
    result = cudaMallocAsync((void**)(&a), bytes, 0);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    result = cudaFreeAsync(a, 0);
    assert(result == 0);

    cudaMemPool_t pool;
    result = cudaDeviceGetDefaultMemPool(&pool, 0);
    assert(result == 0);

    result = cudaMemPoolTrimTo(pool, 0);
    assert(result == 0);

    // without syncing the stream, the next call (which allocates memory internally) would fail
    // cudaStreamSynchronize(0);

    // with the stream synchronized, this works (result = 0);
    // without it, the return code is again 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);

    return 0;
}

@jrhemstad

jrhemstad commented Jan 12, 2021

Unfortunately, though, it still doesn't fix the error after shrinking the size of the default mempool. Looks like a stream synchronization is the only way to expose the unused memory, at least for cuFFT's use.

cudaMemPoolTrimTo will only release memory if it's no longer in use. It doesn't do any synchronization.

I'm guessing the memory allocated by cudaMallocAsync is still in use by the first cufftMakePlan1d call by the time you get to the second call. So it makes sense you'd need the synchronization to ensure that memory is actually free.

Note that the synchronization wouldn't be necessary if cufftMakePlan1d were internally calling cudaMallocAsync (on the same stream the memory was freed by cudaFreeAsync) instead of cudaMalloc.

A simpler example is something like this:

cudaMallocAsync(&p0, /* 100% of available memory */, s0);
kernel<<<..., s0>>>(p0);
cudaFreeAsync(p0, s0);
cudaMalloc(&p1, /* 100% of available memory */);

The cudaMalloc will likely fail without any synchronization between the cudaFreeAsync and cudaMalloc.
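
A runnable version of that sketch might look like this (a minimal illustration under my own assumptions: no kernel launch, and the allocation sized to ~90% of the currently free memory):

// nvcc stream_ordered_vs_sync_alloc.cu -o stream_ordered_vs_sync_alloc
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    size_t bytes = (size_t)(0.9 * free_b);   // most of the currently free memory

    cudaStream_t s0;
    cudaStreamCreate(&s0);

    void *p0 = NULL, *p1 = NULL;
    cudaMallocAsync(&p0, bytes, s0);   // stream-ordered allocation on s0
    cudaFreeAsync(p0, s0);             // stream-ordered free: takes effect in s0's order

    // Without this synchronization, the freed memory may not yet be visible
    // to the synchronous allocator below, and cudaMalloc can fail with OOM.
    // cudaStreamSynchronize(s0);

    cudaError_t err = cudaMalloc(&p1, bytes);
    printf("cudaMalloc after cudaFreeAsync: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess) cudaFree(p1);
    cudaStreamDestroy(s0);
    return 0;
}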

@leofang
Member Author

leofang commented Jan 12, 2021

I'm guessing the memory allocated by cudaMallocAsync is still in use by the first cufftMakePlan1d call by the time you get to the second call. So it makes sense you'd need the synchronization to ensure that memory is actually free.

The first cufftMakePlan1d call fails due to OOM, so it's reasonable to expect that it cleans up before returning and that the only place holding memory is the cudaMallocAsync allocation. (The allocated memory is not actually used anywhere.)

cudaMemPoolTrimTo will only release memory if it's no longer in use. It doesn't do any synchronization.

@jrhemstad @pentschev I think this does not explain the "bug", because, as I said, the memory isn't in use after cudaFreeAsync and so should be released. But,

The cudaMalloc will likely fail without any synchronization between the cudaFreeAsync and cudaMalloc.

this would explain it (and is also what I was asking about). Thank you, Jake, for confirming that we need a stream synchronization to actually release the memory. I wouldn't have thought of it; at least, the SOMA documentation doesn't say this is expected...😢

@pentschev
Member

Sorry @leofang, I dropped the ball here. Is there something that's still blocking you w.r.t. cuFFT with cudaFreeAsync?

@leofang
Member Author

leofang commented Feb 1, 2021

Thanks for checking back @pentschev, I think everything is fine with the new async allocator (except that its documentation needs to be clearer 😂).
