CUDA 11 Test: TestFftAllocate #3777

Closed

leofang opened this issue Aug 13, 2020 · 54 comments

@leofang
Member

leofang commented Aug 13, 2020

I built the latest master with #3775 applied (which fixes #3757), and the only error I got from all the FFT tests we have is this:

$ pytest tests/cupy_tests/fft_tests/test_fft.py
========================================================================= test session starts =========================================================================
platform linux -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/leofang/cupy_cuda11, configfile: setup.cfg
collected 717 items                                                                                                                                                   

tests/cupy_tests/fft_tests/test_fft.py ........................................................................................................................ [ 16%]
............................................................................................................................................................... [ 38%]
............................................................................................................................................................... [ 61%]
............................................................................................................................................................... [ 83%]
.....................................................................................................................F..                                        [100%]

============================================================================== FAILURES ===============================================================================
__________________________________________________________________ TestFftAllocate.test_fft_allocate __________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(10):
            a.append(cupy.empty(100000000))
        del a
        b = cupy.empty(100000007, dtype=cupy.float32)
>       cupy.fft.fft(b)

tests/cupy_tests/fft_tests/test_fft.py:336: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cupy/fft/fft.py:567: in fft
    return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
cupy/fft/fft.py:182: in _fft
    a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
cupy/fft/fft.py:152: in _fft_c2c
    a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
cupy/fft/fft.py:109: in _exec_fft
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
    self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
    check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CuFFTError(result)
E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

cupy/cuda/cufft.pyx:147: CuFFTError
======================================================================= short test summary info =======================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
=================================================================== 1 failed, 716 passed in 13.16s ====================================================================
@leofang
Member Author

leofang commented Aug 13, 2020

Oddly, the CI in chainer/chainer-test#593 seems to be ok with this particular test...

@kmaehashi
Member

I couldn't reproduce with P100/V100. Could you try python -m pytest tests/cupy_tests/fft_tests/test_fft.py -k test_fft_allocate to isolate the issue? (maybe it depends on test order?)

@leofang
Member Author

leofang commented Aug 14, 2020

Yeah tried that too, still reproducible.

@pentschev
Member

@leofang is that failing for you on master? I can't reproduce that on a V100 either.

@leofang
Member Author

leofang commented Aug 14, 2020

@pentschev @kmaehashi I switched to another machine, with one GTX 2080 Ti installed (the previous machine has two), and I still get the same error:

$ pytest tests/cupy_tests/fft_tests/test_fft.py 
========================================================================= test session starts =========================================================================
platform linux -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/leo/dev/cupy_cuda11, configfile: setup.cfg
collected 717 items                                                                                                                                                   

tests/cupy_tests/fft_tests/test_fft.py ........................................................................................................................ [ 16%]
................................................ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss............... [ 38%]
..........................................................................................................................................................sssss [ 61%]
sssssssssssssssssss............................................................................................................................................ [ 83%]
.....................................................................................................................F..                                        [100%]

============================================================================== FAILURES ===============================================================================
__________________________________________________________________ TestFftAllocate.test_fft_allocate __________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(10):
            a.append(cupy.empty(100000000))
        del a
        b = cupy.empty(100000007, dtype=cupy.float32)
>       cupy.fft.fft(b)

tests/cupy_tests/fft_tests/test_fft.py:337: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cupy/fft/fft.py:567: in fft
    return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
cupy/fft/fft.py:182: in _fft
    a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
cupy/fft/fft.py:152: in _fft_c2c
    a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
cupy/fft/fft.py:109: in _exec_fft
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
    self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
    check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CuFFTError(result)
E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

cupy/cuda/cufft.pyx:147: CuFFTError
======================================================================= short test summary info =======================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR
============================================================= 1 failed, 596 passed, 120 skipped in 45.28s =============================================================

I suspect CUDA 11 raises CUFFT_INTERNAL_ERROR instead of CUFFT_ALLOC_FAILED for OOM, but I need to do more tests to confirm.
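
For reference when reading the return codes below, the relevant cuFFT status codes (abridged from the cufftResult enum in cufft.h) are:

typedef enum cufftResult_t {
    CUFFT_SUCCESS        = 0x0,
    CUFFT_INVALID_PLAN   = 0x1,
    CUFFT_ALLOC_FAILED   = 0x2,  // cuFFT failed to allocate GPU or CPU memory
    CUFFT_INVALID_TYPE   = 0x3,
    CUFFT_INVALID_VALUE  = 0x4,
    CUFFT_INTERNAL_ERROR = 0x5,  // internal driver or library error
    /* ... */
} cufftResult;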

@leofang
Member Author

leofang commented Aug 14, 2020

As a quick check, I changed the loop here

for i in range(10):

from 10 to 7, and the test passes. CUFFT_INTERNAL_ERROR is raised in the second attempt to create the plan:

cupy/cupy/cuda/cufft.pyx

Lines 301 to 305 in edad3ab

        if result == 2:
            cupy.get_default_memory_pool().free_all_blocks()
            with nogil:
                result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
                                         &work_size)

@pentschev
Member

from 10 to 7, and the test passes. CUFFT_INTERNAL_ERROR is raised in the second attempt to create the plan:

Were you able to confirm if that's because the GPU goes OOM?

@leofang
Member Author

leofang commented Aug 19, 2020

@kmaehashi @pentschev I have a cleaner reproducer. It is triggered by OOM, but before the memory is released (del a), cuFFT behaves correctly (it throws CUFFT_ALLOC_FAILED).

In [1]: import cupy as cp

In [2]: a = cp.cuda.memory.alloc(10*1024**3)  # this occupies more than 90% of memory on a 2080 Ti

In [3]: plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
---------------------------------------------------------------------------
CuFFTError                                Traceback (most recent call last)
<ipython-input-3-2b1dcda66b19> in <module>
----> 1 plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d.__init__()
    275             # set plan, work_area, gpus, streams, and events
    276             if not use_multi_gpus:
--> 277                 self._single_gpu_get_plan(plan, nx, fft_type, batch)
    278             else:
    279                 self._multi_gpu_get_plan(

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d._single_gpu_get_plan()
    304                 result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
    305                                          &work_size)
--> 306         check_result(result)
    307 
    308         work_area = memory.alloc(work_size)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.check_result()
    145 cpdef inline check_result(int result):
    146     if result != 0:
--> 147         raise CuFFTError(result)
    148 
    149 

CuFFTError: CUFFT_ALLOC_FAILED

In [4]: del a   # <------ key!

In [5]: plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)
---------------------------------------------------------------------------
CuFFTError                                Traceback (most recent call last)
<ipython-input-5-2b1dcda66b19> in <module>
----> 1 plan = cp.cuda.cufft.Plan1d(100000007, cp.cuda.cufft.CUFFT_R2C, 1)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d.__init__()
    275             # set plan, work_area, gpus, streams, and events
    276             if not use_multi_gpus:
--> 277                 self._single_gpu_get_plan(plan, nx, fft_type, batch)
    278             else:
    279                 self._multi_gpu_get_plan(

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.Plan1d._single_gpu_get_plan()
    304                 result = cufftMakePlan1d(plan, nx, <Type>fft_type, batch,
    305                                          &work_size)
--> 306         check_result(result)
    307 
    308         work_area = memory.alloc(work_size)

~/dev/cupy_cuda11/cupy/cuda/cufft.pyx in cupy.cuda.cufft.check_result()
    145 cpdef inline check_result(int result):
    146     if result != 0:
--> 147         raise CuFFTError(result)
    148 
    149 

CuFFTError: CUFFT_INTERNAL_ERROR

Note that in both cases the error is raised at line 306 as seen in TestFftAllocate.

@kmaehashi
Member

kmaehashi commented Aug 19, 2020

Thanks for the reproducer! Should we add some kind of synchronization or wait after calling free_all_blocks() and before creating the plan...?

@leofang
Member Author

leofang commented Aug 19, 2020

I tested that and it does not help. I also noticed that the same plan takes 3 GB in CUDA 10.0 but 3.74 GB in CUDA 11.0. I'm almost certain this is a regression.
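
To compare how much work area the plan requests across CUDA versions, a minimal sketch (mine, not from this thread; it uses the standard cufftEstimate1d/cufftMakePlan1d calls) is:

// compile with "nvcc plan_size.c -lcufft -o plan_size"
#include <stdio.h>
#include <cufft.h>

int main() {
    size_t estimated = 0, planned = 0;
    cufftHandle h;

    // rough upper bound, no handle required
    cufftEstimate1d(100000007, CUFFT_R2C, 1, &estimated);

    // actual work-area size reported at plan creation
    cufftCreate(&h);
    cufftSetAutoAllocation(h, 0);   // manage the work area ourselves
    cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &planned);
    cufftDestroy(h);

    printf("estimated: %zu bytes, planned: %zu bytes\n", estimated, planned);
    return 0;
}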

leofang mentioned this issue Aug 20, 2020
@leofang
Member Author

leofang commented Aug 26, 2020

@pentschev Are you able to reproduce this if you increase the loop count (since you're on V100)?

@pentschev
Member

Sorry for the late reply. Yes, I'm able to reproduce it by increasing the count. I incremented in steps of 10, and it starts failing at 40:

==================================================================================================================== FAILURES ====================================================================================================================
_______________________________________________________________________________________________________ TestFftAllocate.test_fft_allocate ________________________________________________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(40):
            a.append(cupy.empty(100000000))
        del a
        b = cupy.empty(100000007, dtype=cupy.float32)
>       cupy.fft.fft(b)

tests/cupy_tests/fft_tests/test_fft.py:337:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:578: in fft
    return _fft(a, (n,), (axis,), norm, cupy.cuda.cufft.CUFFT_FORWARD)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:187: in _fft
    a = _fft_c2c(a, direction, norm, axes, overwrite_x, plan=plan)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:157: in _fft_c2c
    a = _exec_fft(a, direction, 'C2C', norm, axis, overwrite_x, plan=plan)
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/fft/fft.py:114: in _exec_fft
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
cupy/cuda/cufft.pyx:277: in cupy.cuda.cufft.Plan1d.__init__
    self._single_gpu_get_plan(plan, nx, fft_type, batch)
cupy/cuda/cufft.pyx:306: in cupy.cuda.cufft.Plan1d._single_gpu_get_plan
    check_result(result)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CuFFTError(result)
E   cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

cupy/cuda/cufft.pyx:147: CuFFTError
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.cufft.CuFFTError: CUFFT_INTERNAL_ERROR

But as seen above, it fails in the actual FFT call, which is different from increasing it to 50, where it fails during the array allocations:

==================================================================================================================== FAILURES ====================================================================================================================
_______________________________________________________________________________________________________ TestFftAllocate.test_fft_allocate ________________________________________________________________________________________________________

self = <cupy_tests.fft_tests.test_fft.TestFftAllocate testMethod=test_fft_allocate>

    def test_fft_allocate(self):
        # Check CuFFTError is not raised when the GPU memory is enough.
        # See https://github.com/cupy/cupy/issues/1063
        # TODO(mizuno): Simplify "a" after memory compaction is implemented.
        a = []
        for i in range(50):
>           a.append(cupy.empty(100000000))

tests/cupy_tests/fft_tests/test_fft.py:334:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../miniconda3/envs/rn-102-0.15.0b200820/lib/python3.7/site-packages/cupy/_creation/basic.py:22: in empty
    return cupy.ndarray(shape, dtype, order=order)
cupy/core/core.pyx:193: in cupy.core.core.ndarray.__init__
    self.data = memory.alloc(self.size * itemsize)
cupy/cuda/memory.pyx:569: in cupy.cuda.memory.alloc
    return get_allocator()(size)
cupy/cuda/memory.pyx:1241: in cupy.cuda.memory.MemoryPool.malloc
    cpdef MemoryPointer malloc(self, size_t size):
cupy/cuda/memory.pyx:1262: in cupy.cuda.memory.MemoryPool.malloc
    return mp.malloc(size)
cupy/cuda/memory.pyx:930: in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
    return self._malloc(rounded_size)
cupy/cuda/memory.pyx:950: in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
    mem = self._try_malloc(size)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise OutOfMemoryError(
E   cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 800,000,000 bytes (allocated so far: 33,600,000,000 bytes).

cupy/cuda/memory.pyx:1201: OutOfMemoryError
============================================================================================================ short test summary info =============================================================================================================
FAILED tests/cupy_tests/fft_tests/test_fft.py::TestFftAllocate::test_fft_allocate - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 800,000,000 bytes (allocated so far: 33,600,000,000 bytes).

@leofang
Member Author

leofang commented Aug 31, 2020

Thanks, @pentschev! Could you file a ticket internally? Would it help if I posted a reproducer in C?

@pentschev
Member

Sorry @leofang, I was OOO but am back now. A C reproducer would be of immense help if you have the time to write one! :)

@leofang
Member Author

leofang commented Sep 8, 2020

@pentschev Here you go. This is the C version of #3777 (comment).

// compile with "nvcc cufft_bug.c -lcufft -o cufft_bug"

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>


int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);

    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of memory on GTX 2080 Ti
    result = cudaMalloc((void**)(&a), bytes);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    cudaFree(a);

    // on CUDA 10.0, this works (result = 0)
    // on CUDA 11.0, the return code is 5 (CUFFT_INTERNAL_ERROR)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);

    return 0;
}

@pentschev
Member

@leofang did you try this on CUDA 10.2 or just CUDA 10.0? I don't have access to any 10.0 machines right now to try, but it fails for me on 10.2 as well.

@leofang
Member Author

leofang commented Sep 10, 2020

@pentschev Just 10.0. I could try to find a 10.2 machine and confirm, but not immediately.

@pentschev
Member

@leofang I just installed CUDA 10.0 on one of our machines -- although with the 11.0 driver, version 450.51.06 -- and tested CUDA 10.0, 10.2, and 11.0 on that machine. For me it fails for both 10.0 and 10.2 but passes for 11.0; this is on a 32 GB V100. It seems the results I get are the exact opposite of yours, or am I interpreting something wrong?

@leofang
Member Author

leofang commented Sep 10, 2020

@pentschev I think you got exactly the opposite results indeed -- I am using a GTX 2080 Ti with driver 450.57.

@pentschev
Member

@pentschev I think you got exactly the opposite results indeed -- I am using a GTX 2080 Ti with driver 450.57.

I'm really not sure how to proceed here; it seems these results are too conflicting to provide a good picture of what's going on. Do you have access to any other GPUs to see what kind of results you get? Unfortunately I only have access to V100s, so I can't see what happens anywhere else. I can also try filing the bug report, but it will then be difficult to provide any further information given the results we see today.

@leofang
Member Author

leofang commented Sep 11, 2020

@pentschev Can we file the report and state our (conflicting) test outcomes? I might be able to access different GPUs (V100), but I don't have control over their driver versions, so I can only test single CUDA versions.

@grlee77
Contributor

grlee77 commented Sep 14, 2020

I also get the failure on CUDA 11.0.3 on a GTX 1080 Ti with driver 450.51.06.

For CUDA 10.2 with the 440.100 driver, I would get a failure with an OutOfMemoryError when running this test case, but after switching to 11.0 I see the message about CUFFT_INTERNAL_ERROR instead.

Output of the C example provided by @leofang on 11.0:

return code is 2
after free: return code is 5
cufft_bug: fft_alloc.cpp:37: int main(): Assertion `result == 0' failed.
Aborted (core dumped)

@grlee77
Contributor

grlee77 commented Sep 14, 2020

I went ahead and installed the Ubuntu packages cuda-minimal-build-10-0, cuda-minimal-build-10-1, and cuda-minimal-build-10-2, as well as the corresponding cuFFT packages. When using these along with the recent driver 450.51.06 and the C example above, I got the same "return code is 5" error on 10.1 and 10.2 that was seen on 11.0, but the example succeeds on 10.0 with:

return code is 2
after free: return code is 0

@leofang
Member Author

leofang commented Sep 14, 2020

I directly installed 10.2 via the runfile on Ubuntu and tested on the same machine (GTX 2080 Ti with driver 450.57), and also got error code 5. Looks pretty consistent.

@pentschev
Member

Sorry for the late reply here, and thank you both for the additional data points. I just filed a bug report for this; the internal reference number is 3125482. I'll keep you posted once I know more about it.

@pentschev
Member

I heard back from the cuFFT developers, and the issue has been prioritized for the CUDA 11.2 release; I don't have further details at this time, though.

@leofang
Member Author

leofang commented Sep 15, 2020

Thanks, Peter! Keep us posted.

@pentschev
Member

I was provided with a preview build of cuFFT 11.2 and verified the issue has been fixed there. Thanks @leofang for the report and reproducer, it has been very helpful!

@leofang
Member Author

leofang commented Nov 13, 2020

Thanks, Peter! If I catch that on 11.2 I'll reopen then.

leofang closed this as completed Nov 13, 2020
leofang added a commit to leofang/cupy that referenced this issue Dec 2, 2020
1. Skip the allocate test due to a cuFFT bug (cupy#3777)
2. Fix @multi_gpu
3. Split callback tests
@leofang
Member Author

leofang commented Jan 11, 2021

@pentschev I tested on CUDA 11.2 and it seems to be fixed 🙂

However, if I instead use the async mempool from CUDA 11.2, the very same code with a simple change (cudaMalloc -> cudaMallocAsync and cudaFree -> cudaFreeAsync) fails again...

// nvcc test_cufft_bug_112.cu -lcufft -o test_cufft_bug_112
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>


int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);

    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of memory on GTX 2080 Ti
    result = cudaMallocAsync((void**)(&a), bytes, 0);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    cudaFreeAsync(a, 0);

    // on CUDA 11.2 with cudaMalloc/cudaFree, this now works (result = 0),
    // but with cudaMallocAsync/cudaFreeAsync the return code is again 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);

    return 0;
}

This was discovered while testing #4537.

@pentschev
Member

When you say it fails again, do you mean you see the exact same issue as we used to see with cudaMalloc/cudaFree on 11.1 and earlier versions?

@leofang
Member Author

leofang commented Jan 11, 2021

Yes, I think this is more or less the same issue, but this time with cudaMallocAsync/cudaFreeAsync on 11.2. Another minor difference is that the error code is now CUFFT_ALLOC_FAILED instead of CUFFT_INTERNAL_ERROR. Perhaps I didn't use the async allocation correctly?

@pentschev
Member

I was able to upgrade my workstation to 11.2 and test your code, @leofang. I think what you're missing is a cudaStreamSynchronize(0) after cudaFreeAsync(a, 0); that has resolved the issue for me. Could you check on your end too?
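
Relative to the reproducer above, the suggested change is just this fragment (a sketch of the lines around the free; everything else stays the same):

    result = cudaFreeAsync(a, 0);
    assert(result == 0);

    // let the stream-ordered free take effect before cuFFT allocates internally
    result = cudaStreamSynchronize(0);
    assert(result == 0);

    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);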

@leofang
Member Author

leofang commented Jan 11, 2021

I think what you're missing is a cudaStreamSynchronize(0) after cudaFreeAsync(a, 0); that has resolved the issue for me. Could you check on your end too?

Thanks @pentschev! I can confirm this eliminates the error, but why is it needed?

@pentschev
Member

Hmm, that's a good point. I actually thought that was what was missing because I noticed the cuda{Malloc,Free}Async version was much faster than the sync counterparts, and I instinctively tried adding the sync step without even considering that it was stream 0 and should already be synchronous -- but now I totally agree with you that it shouldn't be necessary. I don't have an answer as to why that resolves the issue; I hope @jrhemstad sees something we're overlooking, perhaps?

@jrhemstad

Can someone summarize what I should be looking at?

@leofang
Member Author

leofang commented Jan 11, 2021

Hi @jrhemstad, sorry for the confusion. This is an old thread I'm revisiting to test the new async malloc support. The test code in question is here: #3777 (comment). It seems @pentschev agrees with me that this is valid code, but we don't know whether the problem is cuFFT not trying harder to allocate memory, or the async mempool not releasing memory correctly.

@jrhemstad

I'm not familiar with cuFFT or what cufftMakePlan1d does. It doesn't look like the pointer a interacts with the cuFFT calls at all, so I'm not sure what is expected here.

@pentschev
Member

The thing is that in that sample, cudaFree(a) implied a synchronization, which doesn't happen if we switch to cudaFreeAsync(a, 0). Shouldn't cudaFreeAsync imply a synchronization as long as it's on the default stream, as Leo used in that sample?

@jrhemstad

jrhemstad commented Jan 11, 2021

The thing is that in that sample, cudaFree(a) implied a synchronization, which doesn't happen if we switch to cudaFreeAsync(a, 0). Shouldn't cudaFreeAsync imply a synchronization as long as it's on the default stream, as Leo used in that sample?

No, that's the primary difference: cudaFreeAsync doesn't do a device sync.

The implicit sync performed by cudaFree isn't documented anywhere. It's implementation-defined behavior that people have come to rely on, which has prevented removing the sync from cudaFree. This is (partly) why cudaFreeAsync was added: to allow a free that doesn't sync the whole device.

@leofang
Member Author

leofang commented Jan 12, 2021

Thank you @jrhemstad @pentschev. OK, I think it's reasonable that even with stream 0 or 1, cudaFreeAsync does not sync. It's kind of like the order of precedence matters here (a sync stream vs. an async API 😁), which would best be documented (as would the fact that cudaFree syncs -- it's like a grapevine we learn from all the time on SO, the NVIDIA developer forum, etc., but it never gets explicitly acknowledged in the official docs).

However, this brings me to my next question: For memory that is already returned via cudaFreeAsync, how to make it available to the sync-version of allocators? (ex: cudaMalloc)

The context with cuFFT is this: we noticed cufftMakePlan* internally allocates some scratch memory even if we call cufftSetAutoAllocation to manage the work area ourselves. This is why we test it (and how we discovered the bug in CUDA 11): creating a cuFFT plan could lead to OOM.

Now, the "bug" here is that even I return all memory back to the (async) pool, it seems cuFFT still has difficulty in allocating memory (which should not happen), which leads to my question above.

leofang reopened this Jan 12, 2021
leofang closed this as completed Jan 12, 2021
@leofang
Member Author

leofang commented Jan 12, 2021

Sorry, silly mobile UI with my fat thumb

@jrhemstad

OK, I think it's reasonable that even with stream 0 or 1, cudaFreeAsync does not sync. It's kind of like the order of precedence matters here (a sync stream vs. an async API), which would best be documented (as would the fact that cudaFree syncs -- it's like a grapevine we learn from all the time on SO, the NVIDIA developer forum, etc., but it never gets explicitly acknowledged in the official docs).

Would you have expected a cudaMemcpyAsync on the legacy default stream to sync the device? For any stream-ordered CUDA operation, doing work on the legacy default stream doesn't sync the device; it just implicitly orders with other streams, i.e., work on streams before/after an operation on the legacy default stream cannot overlap. That doesn't mean the legacy default stream operation blocks the device until all previous work has completed.

cudaFree happened to do a device synchronize, but this was never documented because it was, and is, an implementation detail. It has become something that many have knowingly or unknowingly come to rely on, but it remains an implementation detail, which is why it is undocumented.

For memory that is already returned via cudaFreeAsync, how to make it available to the sync-version of allocators?

You can use cudaMemPoolTrimTo(0) to release any unused memory held by the pool.

@leofang
Member Author

leofang commented Jan 12, 2021

Would you have expected a cudaMemcpyAsync on the legacy default stream to sync the device? For any stream-ordered CUDA operation, doing work on the legacy default stream doesn't sync the device; it just implicitly orders with other streams, i.e., work on streams before/after an operation on the legacy default stream cannot overlap. That doesn't mean the legacy default stream operation blocks the device until all previous work has completed.

You're right. I confused myself with device sync vs stream sync.

cudaFree happened to do a device synchronize, but this was never documented because it was, and is, an implementation detail. It has become something that many have knowingly or unknowingly come to rely on, but it remains an implementation detail, which is why it is undocumented.

Though I still think it's an important implementation detail (like the legacy default stream) that's worth documenting. People ask about this all the time.

For memory that is already returned via cudaFreeAsync, how to make it available to the sync-version of allocators?

You can use cudaMemPoolTrimTo(0) to release any unused memory held by the pool.

@jrhemstad I got error: too few arguments in function call. Also, does this work on the default mempool that backs cudaMallocAsync and cudaFreeAsync? Is (cudaMemPool_t)0 the correct handle to that?

@jrhemstad

@jrhemstad I got error: too few arguments in function call. Also, does this work on the default mempool that backs cudaMallocAsync and cudaFreeAsync? Is (cudaMemPool_t)0 the correct handle to that?

cudaDeviceGetDefaultMemPool will get you the handle to the default memory pool.
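
For reference, a minimal sketch of that call sequence (the helper name trim_default_pool is mine, just for illustration); note that cudaMemPoolTrimTo takes (pool, minBytesToKeep), hence the "too few arguments" error above:

#include <cuda_runtime.h>

// Trim the default stream-ordered pool on device 0, keeping zero unused bytes.
int trim_default_pool(void) {
    cudaMemPool_t pool;
    cudaError_t err = cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    if (err != cudaSuccess) return (int)err;
    return (int)cudaMemPoolTrimTo(pool, /*minBytesToKeep=*/0);
}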

@leofang
Member Author

leofang commented Jan 12, 2021

@jrhemstad I got error: too few arguments in function call. Also, does this work on the default mempool that backs cudaMallocAsync and cudaFreeAsync? Is (cudaMemPool_t)0 the correct handle to that?

cudaDeviceGetDefaultMemPool will get you the handle to the default memory pool.

Thanks for the pointer, @jrhemstad! This really should be documented/mentioned on the same page as the Stream Ordered Memory Allocator...😢 For example, a quick mention under cudaMallocAsync/cudaFreeAsync would be fine.

Unfortunately, though, it still doesn't fix the error after shrinking the default mempool. It looks like a stream synchronization is the only way to expose the unused memory, at least for cuFFT's use:

// nvcc test_cufft_bug_112.cu -lcufft -o test_cufft_bug_112
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>


int main() {
    size_t bytes = 10*1024*1024*1024UL;
    size_t work_size = 0;
    float* a = NULL;
    int result = 0;
    cufftHandle h;

    result = cufftCreate(&h);
    assert(result == 0);

    result = cufftSetAutoAllocation(h, 0);
    assert(result == 0);

    // allocate >90% of memory on GTX 2080 Ti
    result = cudaMallocAsync((void**)(&a), bytes, 0);
    assert(result == 0);

    // expect to get result = 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("return code is %i\n", result);

    result = cudaFreeAsync(a, 0);
    assert(result == 0);

    cudaMemPool_t pool;
    result = cudaDeviceGetDefaultMemPool(&pool, 0);
    assert(result == 0);

    result = cudaMemPoolTrimTo(pool, 0);
    assert(result == 0);

    // without syncing the stream, the next call (which allocates memory internally) would fail
    // cudaStreamSynchronize(0);

    // with the stream synchronized, this works (result = 0);
    // without it, the return code is again 2 (CUFFT_ALLOC_FAILED)
    result = cufftMakePlan1d(h, 100000007, CUFFT_R2C, 1, &work_size);
    printf("after free: return code is %i\n", result);
    assert(result == 0);

    return 0;
}

@jrhemstad

jrhemstad commented Jan 12, 2021

Unfortunately, though, it still doesn't fix the error after shrinking the size of the default mempool. Looks like a stream synchronization is the only way to expose the unused memory, at least for cuFFT's use.

cudaMemPoolTrimTo will only release memory if it's no longer in use. It doesn't do any synchronization.

I'm guessing the memory allocated by cudaMallocAsync is still in use by the first cufftMakePlan1d call by the time you get to the second call. So it makes sense you'd need the synchronization to ensure that memory is actually free.

Note that the synchronization wouldn't be necessary if cufftMakePlan1d were internally calling cudaMallocAsync (on the same stream the memory was freed by cudaFreeAsync) instead of cudaMalloc.

A simpler example is something like this:

cudaMallocAsync(&p0, /* 100% of available memory */, s0);
kernel<<<..., s0>>>(p0);
cudaFreeAsync(p0, s0);
cudaMalloc(&p1, /* 100% of available memory */);

The cudaMalloc will likely fail without any synchronization between the cudaFreeAsync and cudaMalloc.
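
A runnable version of that sketch might look like this (a minimal illustration under my own assumptions: no kernel launch, and the allocation sized to ~90% of the currently free memory):

// nvcc stream_ordered_vs_sync_alloc.cu -o stream_ordered_vs_sync_alloc
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    size_t bytes = (size_t)(0.9 * free_b);   // most of the currently free memory

    cudaStream_t s0;
    cudaStreamCreate(&s0);

    void *p0 = NULL, *p1 = NULL;
    cudaMallocAsync(&p0, bytes, s0);   // stream-ordered allocation on s0
    cudaFreeAsync(p0, s0);             // stream-ordered free: takes effect in s0's order

    // Without this synchronization, the freed memory may not yet be visible
    // to the synchronous allocator below, and cudaMalloc can fail with OOM.
    // cudaStreamSynchronize(s0);

    cudaError_t err = cudaMalloc(&p1, bytes);
    printf("cudaMalloc after cudaFreeAsync: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess) cudaFree(p1);
    cudaStreamDestroy(s0);
    return 0;
}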

@leofang
Member Author

leofang commented Jan 12, 2021

I'm guessing the memory allocated by cudaMallocAsync is still in use by the first cufftMakePlan1d call by the time you get to the second call. So it makes sense you'd need the synchronization to ensure that memory is actually free.

The first cufftMakePlan1d call fails due to OOM, so it's reasonable to expect that it cleans up before returning and that the only place holding memory is the cudaMallocAsync allocation. (The allocated memory is not actually used anywhere.)

cudaMemPoolTrimTo will only release memory if it's no longer in use. It doesn't do any synchronization.

@jrhemstad @pentschev I think this does not explain the "bug", because, as I said, the memory isn't in use after cudaFreeAsync and so should be released. But,

The cudaMalloc will likely fail without any synchronization between the cudaFreeAsync and cudaMalloc.

this would explain it (and is also what I was asking about). Thank you, Jake, for confirming that we need a stream synchronization to actually release the memory. I wouldn't have thought of it; at least, the SOMA documentation doesn't say this is expected...😢

@pentschev
Member

Sorry @leofang, I dropped the ball here. Is there something that's still blocking you w.r.t. cuFFT with cudaFreeAsync?

@leofang
Member Author

leofang commented Feb 1, 2021

Thanks for checking back @pentschev, I think everything is fine with the new async allocator (except that its documentation needs to be clearer 😂).
