Add a cuFFT plan cache #3730
Conversation
Jenkins, test this please
Successfully created a job for commit 62f5699:
Jenkins CI test (for commit 62f5699, target branch master) succeeded!
Stupid me...🤦🏻♂️
HUGE reduction in CPU time (and some improvements in GPU time as well):

```python
import cupy as cp
from cupyx.time import repeat

# choosing non-prime numbers could make plan generation time longer?
a = cp.random.random((198, 597, 418)).astype(cp.complex128)

# typical workload of ours
def fft_roundtrip(a, axis=None):
    out = cp.fft.fftn(a, axes=axis)
    out = cp.fft.ifftn(out, axes=axis)
    return out

n_repeat = 10
print("with cache:")
print(repeat(fft_roundtrip, (a, (1, 2)), n_repeat=n_repeat))
print(repeat(fft_roundtrip, (a, (0, 1, 2)), n_repeat=n_repeat))
print(cp.fft.cache.plan_cache, '\n\n')

cp.fft.cache.plan_cache.set_size(0)  # disable the cache
print("without cache:")
print(repeat(fft_roundtrip, (a, (1, 2)), n_repeat=n_repeat))
print(repeat(fft_roundtrip, (a, (0, 1, 2)), n_repeat=n_repeat))
print(cp.fft.cache.plan_cache)
```

output:
With the cache enabled (its size defaults to 16), it saves 25% of the end-to-end runtime for running the benchmark above.
Jenkins, test this please
Successfully created a job for commit 011222c:
Jenkins CI test (for commit 011222c, target branch master) succeeded!
```python
if plan is None:
    devices = None if not config.use_multi_gpus else config._devices
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
    # TODO(leofang): do we need to add the current stream to keys?
```
It looks like currently the plans are associated with the default stream on creation, but when calling the FFT with a plan provided, the plan will be associated with the current stream.
Lines 782 to 783 in 97573d7:

```python
with nogil:
    result = cufftSetStream(plan, <driver.Stream>stream)
```
In that case, it seems like we shouldn't need to cache the stream as the plan will be reassociated with whatever stream is currently active. Is there a scenario where one wants to do FFTs from separate streams on a single device? If not, then I don't think we would need to redesign anything.
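The design implied here can be sketched in plain Python (hypothetical classes, not CuPy's or cuFFT's actual API): the cache key omits the stream, and each execution rebinds the cached plan to whatever stream is current, analogous to calling cufftSetStream before execution.

```python
# Sketch of the design being discussed: the stream is NOT part of the
# cache key; instead the cached plan is re-associated with the current
# stream right before each execution.
class Plan:
    def __init__(self, key):
        self.key = key
        self.stream = None

    def set_stream(self, stream):      # analogous to cufftSetStream
        self.stream = stream

cache = {}

def exec_fft(key, current_stream):
    # One plan per transform key, regardless of stream.
    plan = cache.setdefault(key, Plan(key))
    plan.set_stream(current_stream)    # rebind to whatever stream is active
    return plan

p = exec_fft(('c2c', 512), 'stream0')
q = exec_fft(('c2c', 512), 'stream1')
assert p is q and q.stream == 'stream1'  # one cached plan, rebound per call
```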
> Is there a scenario where one wants to do FFTs from separate streams on a single device?
I feel this is uncharted territory, as running the same FFT plan on multiple streams might drive the plan's internal state and work area nuts. The cuFFT documentation doesn't say anything about this, only that overlapping plan executions with data transfers is possible. I don't even know how we could set up a "plan mutex" to prevent the same plan from being executed on multiple streams, and I'd rather leave this burden to end users...
> running the same FFT plan on multiple streams
I am not suggesting one would ever use the same plan simultaneously from separate streams, as I don't see how that could possibly work. It seems like a user could potentially try and do that though with the current design, but agree that it is probably a rare scenario and we don't need to try and guard against it here.
I was asking more if there is a usage scenario for adding the stream attribute on plan creation (and thus also to the plan cache keys), so that multiple plans that are identical aside from stream could be used simultaneously? I agree that we don't need to add this now even if that is a plausible scenario.
Hi Gregory, yes I think we are talking about different scenarios:
1. Same transform, multiple plan handles, running in parallel on multiple streams (but each handle on a unique stream);
2. Same transform, same plan handle, running in parallel on multiple streams.

By "plan handle" I mean cufftHandle.
I agree both scenarios are complex enough that they should be considered in a separate PR, but just out of curiosity I re-read the doc, and I feel there might be a solution to address both scenarios. Quoting Streamed cuFFT Transforms:

> Please note that in order to overlap plans using single plan handle user needs to manage work area buffers. Each concurrent plan execution needs it's exclusive work area. Work area can be set by cufftSetWorkArea function.

If I parse this sentence correctly, it is saying that scenario (2) is possible, as long as the plan is executing on distinct pairs of (stream, work_area). What do you think? I feel it's too good to be true; for example, when Run 1 is being executed midway and we set the work area and stream of the same plan handle for Run 2, wouldn't Run 1 write to the work area of Run 2 and scramble everything?

If an internal lock mechanism makes this possible, then in principle scenario (1) is also fixed (by reusing the same handle), as we still need to allocate multiple work areas anyway, although getting distinct handles is easier to manage.
There's also a third case here: Same transform, same plan handle, same stream, with multiple FFT launches being able to be overlapped just like any other kernel launch. That's the case that the quoted documentation section is primarily addressing.
The following common pattern in user code works pretty well:
- create plan, once
- call cufftSetWorkArea, exec FFT N times
- call cudaStreamSynchronize() to wait until all of the FFTs are complete.
I haven't personally tested with setting a separate stream on each invocation, but it is very possible that this would work.
Furthermore, the majority of the memory used by an allocated plan is the work area that gets allocated by default (when cufftSetAutoAllocation is true). I think it would be a better implementation for the cache if this memory was not allocated at plan creation time -- i.e., always call cufftSetAutoAllocation(handle, false) when generating plans, then explicitly assign a work area for each fft() invocation using a normal cupy memory pool as you describe above. The messy part of this is that you do need an event/callback to handle releasing the work area once the FFT completes.

Probably a separate PR to clean this up? But it's pretty important for a cache implementation: with large batch FFTs I've worked on systems where this work area ended up being hundreds of megabytes of GPU memory.
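The per-execution work-area idea can be modeled in plain Python (a rough stand-in, with bytearrays in place of GPU buffers and string placeholders for handles and streams): one plan handle is shared, while each concurrent (plan, stream) pair gets its own exclusive scratch buffer, mirroring the cufftSetWorkArea pattern quoted from the docs.

```python
# Hypothetical sketch: one plan handle overlapped across streams, where
# each concurrent execution gets its own exclusive work area (cf.
# cufftSetWorkArea with auto-allocation turned off).
work_areas = {}

def exec_on_stream(plan_handle, stream, work_size):
    """Run 'plan_handle' on 'stream' with a work area exclusive to that pair."""
    key = (plan_handle, stream)
    if key not in work_areas:
        work_areas[key] = bytearray(work_size)  # stand-in for a GPU allocation
    buf = work_areas[key]
    # ... here the real code would call cufftSetWorkArea(plan_handle, buf),
    # cufftSetStream(plan_handle, stream), and then execute the FFT ...
    return buf

a = exec_on_stream('handle0', 'stream1', 1024)
b = exec_on_stream('handle0', 'stream2', 1024)
assert a is not b  # distinct work areas for overlapping streams
```

Releasing each buffer once its FFT completes (the event/callback part) is the messy piece this sketch deliberately leaves out.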
Hi @dicta, thanks for your comments. I have a few questions:

> There's also a third case here: Same transform, same plan handle, same stream, with multiple FFT launches being able to be overlapped just like any other kernel launch.
If multiple such FFTs are launched on the same stream, how can they be overlapped with each other? Did you mean to overlap FFT with other user kernels and/or data transfer? Isn't that already possible / being done?
> The following common pattern in user code works pretty well:
> - create plan, once
> - call cufftSetWorkArea, exec FFT N times
> - call cudaStreamSynchronize() to wait until all of the FFTs are complete.
On the same stream, yes it will, and this is what will be enabled by the cache. (FYI, it's already possible by explicitly reusing a plan returned by get_fft_plan().)

On different streams, this is the Scenario 2 I mentioned above, and I feel we will need x unique buffers for x overlapped streams (x <= N).
> I haven't personally tested with setting a separate stream on each invocation, but it is very possible that this would work.
I guess so too.
> Furthermore, the majority of the memory being used by an allocated plan is the work area that gets allocated by default (when cufftSetAutoAllocation is true). I think it would be a better implementation for the cache if this memory was not allocated at plan creation time -- i.e., always call cufftSetAutoAllocation(handle, false) when generating plans, then explicitly assign a work area for each fft() invocation using a normal cupy memory pool as you describe above. The messy part of this is that you do need an event/callback to handle releasing the work area once the FFT completes.
This memory pool thing is actually what @grlee77 mentioned in his #3730 (comment). But I think it's too messy and hard to get right: you need a mutex protecting every unique pair of (stream, buffer) and also a callback. I don't think it's worth the effort.

Furthermore, CuPy already has its own memory pool, and we should just reuse it instead of layering on top. The lines of code in this simple module would increase significantly, and it's hard to predict/optimize the mempool for every use case.

Finally, with the recent pains we are experiencing with newer cuFFT (search "fft" in our issues/PRs), I'd rather not count on cuFFT behaving stably and predictably, which we would need for such a mempool. The cuFFT team never provides a strong guarantee for cuFFT's behaviors, nor do they do a good job of documenting the changes when there is one.
> Probably a separate PR to clean this up?
Clean up what?
> But it's pretty important for a cache implementation, with large batch FFTs I've worked on systems where this work area ended up being hundreds of megabytes of GPU memory.
That's one of the main scenarios I have in mind in which a plan cache will be super useful! Another main issue is the significant slowdown at plan creation time for certain FFTs (for example, #3556).
@grlee77 @peterbell10 You folks must have much more experience in caching than I do. What do we need to add to the test suite for the plan cache? @peterbell10 If you have time to add tests to scipy/scipy#12512, it'd be lovely as I can steal things from there 😇
Hi @leofang. Thanks for working on this! The other idea I wanted to look into, but never tried, was to modify the plan-creation code at Lines 752 to 759 in 97573d7.
Then we could have a method to dynamically allocate and deallocate the work area as needed (allowing many plans to be stored without substantial memory overhead). Obviously, this would only make sense if the memory allocation is not a substantial portion of the required planning overhead. As far as I can tell, the work area (plan.work_area.mem.size) is equal in size to the data itself. For data that is a substantial fraction of the GPU memory, this can severely limit the number of plans that can be stored. If implemented, the default behavior should stay as it is currently for performance reasons, but it seems like it could be of interest in memory-limited scenarios.
I don't have any fancy ideas, just basic stuff like verifying that the cache size and memory limits add up to the expected values for some simple 1D and nD plan cases. Probably also a couple of things to test the LRU aspect, e.g.
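For instance, the LRU aspect could be exercised with a test shaped roughly like this (using an OrderedDict-based stand-in rather than the real PlanCache, whose API may differ):

```python
from collections import OrderedDict

# Stand-in LRU cache to illustrate what an LRU-eviction test could check;
# the real PlanCache is implemented in Cython with its own API.
class LRU:
    def __init__(self, size):
        self.size = size
        self.d = OrderedDict()

    def get(self, key, make):
        if key in self.d:
            self.d.move_to_end(key)         # touching a key makes it most-recent
        else:
            if len(self.d) >= self.size:
                self.d.popitem(last=False)  # evict the least-recently-used entry
            self.d[key] = make()
        return self.d[key]

cache = LRU(2)
cache.get('plan_a', object)
cache.get('plan_b', object)
cache.get('plan_a', object)   # 'plan_a' is now most recently used
cache.get('plan_c', object)   # evicts 'plan_b', not 'plan_a'
assert 'plan_b' not in cache.d and 'plan_a' in cache.d and 'plan_c' in cache.d
```

The key property tested: a lookup refreshes an entry's recency, so the entry evicted on overflow is the one untouched longest.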
@mnicely and I have been closely tracking this PR, as having fast caching features for FFT plans is critical for online signal processing applications and completely applicable to our work on cusignal. The work you've done here has been fantastic, @leofang, and I'm very excited about the simple API. Let us know how we can help.
The whole team thinks this PR is great, and we really want to merge it! Thanks for all the hard work you put into this, @leofang
I came across this PR last week via a Google search (just a recent user of CuPy, not a team member) and really hope any remaining issues can be addressed and this PR makes it into the code base.
Hi @asi1024 Sorry for my outburst. I took a few days off and am now ready to revisit this. I should not have replied at 5am when I was already in a very bad state (for other reasons). I misinterpreted your question as a demand to overturn the current approach. I should have asked for your clarification either here or offline. It wasn't professional... Though I do hope such design discussions could happen earlier; being able to receive early feedback is one of the advantages of developing in the open. Also, thanks @grlee77 @mnicely @awthomp @kmaehashi @emcastillo @VolkerH for your kind replies and support. So, yes, as @grlee77 helped clarify:
I chose to implement it in Cython because I wanted to minimize the overhead (and it looks like we made it), though I didn't benchmark how large the overhead could be if the cache were implemented in Python. I think once cuFFT improves, or when running on AMD GPUs (#3896 (comment)), the CPU overhead could become a potential concern, so I'd rather make this the least of our worries. Let me know if you have any questions or thoughts!
Jenkins CI test (for commit 13b2686, target branch master) succeeded!
Jenkins, test this please
Jenkins CI test (for commit bfd1de7, target branch master) succeeded!
This caching mechanism looks useful for other purposes as well, for example cuTENSOR descriptors and other objects that hold GPU memory!
Jenkins, test this please.
Jenkins CI test (for commit 8d0705c, target branch master) succeeded!
Jenkins, test this please
Jenkins CI test (for commit 3784845, target branch master) succeeded!
LGTM!
Add a cuFFT plan cache
Thanks a lot, @asi1024 and everyone!
Interesting! Do cuTENSOR descriptors use a lot of memory? If so, in the future we might consider making this cache work globally, i.e., storing all kinds of plans (cuFFT, cuTENSOR, etc.). This would also make it easier to combat OOM, as we could just clear the cache all at once. Note, though, that a cache like this could also be useful when the object itself takes little memory but a long time to generate (cuFFT plans have both weaknesses).
Just a follow-up: It seems implementing the cache in Cython is further justified: @mnicely showed in rapidsai/cusignal#254 that the performance difference with a Python-based cache can be as high as 1.6x, which is very surprising! |
UPDATE: Close #3588.

This PR implements a least recently used (LRU) cache for cuFFT plans. The implementation is done in Cython to minimize the Python overhead; yet, I still use cdef classes (instead of pointers to structs) to avoid managing memory myself, and cdef as much as I can.

Properties of this cache:
- It works around slow plan generation for certain transforms (cupy.fft.irfft is slow #3556): see Add a cuFFT plan cache #3730 (comment)
- New APIs in cupy.fft.config: get_plan_cache(), show_plan_cache_info()
- (Docs question) How do we document PlanCache without explicitly referencing it in the autosummary?

What is NOT done in this PR (see the discussions in the replies below): I think it's out of scope, requires careful planning, and the performance gain, if any, is unknown.

~~Work in progress. Description to follow. All tests passed locally. Aim to address #3588 and follow scipy/scipy#12512.~~ TODO:
- add_multi_gpu_plan() and __setitem__()
- Add get_fft_plan outcomes to the cache?
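For reference, the combination of an entry-count limit and a memory limit discussed in this thread can be sketched in plain Python; all names here are illustrative, since the actual implementation is a Cython cdef class:

```python
from collections import OrderedDict

# Plain-Python sketch of an LRU cache with both an entry-count limit and a
# total-memory limit, the two knobs discussed for the plan cache.
class PlanCacheSketch:
    def __init__(self, size, memsize):
        self.size, self.memsize = size, memsize
        self.cur_mem = 0
        self.d = OrderedDict()          # key -> (plan, mem)

    def insert(self, key, plan, mem):
        # Evict LRU entries until both limits can accommodate the new plan.
        while self.d and (len(self.d) >= self.size
                          or self.cur_mem + mem > self.memsize):
            _, (_, old_mem) = self.d.popitem(last=False)
            self.cur_mem -= old_mem
        self.d[key] = (plan, mem)
        self.cur_mem += mem

    def lookup(self, key):
        if key in self.d:
            self.d.move_to_end(key)     # refresh recency on a hit
            return self.d[key][0]
        return None

c = PlanCacheSketch(size=16, memsize=100)
c.insert('k1', 'plan1', 60)
c.insert('k2', 'plan2', 60)   # would exceed memsize, so 'k1' is evicted
assert c.lookup('k1') is None and c.lookup('k2') == 'plan2'
assert c.cur_mem == 60
```

This matches the test ideas above: after a sequence of inserts, the entry count and the tracked memory total should add up to the expected values.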