Add a cuFFT plan cache #3730
Conversation
Jenkins, test this please
Successfully created a job for commit 62f5699:
Jenkins CI test (for commit 62f5699, target branch master) succeeded!
Stupid me...🤦🏻♂️
HUGE reduction in CPU time (and some improvements in GPU time as well):

```python
import cupy as cp
from cupyx.time import repeat

# choosing non-prime numbers could make plan generation time longer?
a = cp.random.random((198, 597, 418)).astype(cp.complex128)

# typical workload of ours
def fft_roundtrip(a, axis=None):
    out = cp.fft.fftn(a, axes=axis)
    out = cp.fft.ifftn(out, axes=axis)
    return out

n_repeat = 10
print("with cache:")
print(repeat(fft_roundtrip, (a, (1, 2)), n_repeat=n_repeat))
print(repeat(fft_roundtrip, (a, (0, 1, 2)), n_repeat=n_repeat))
print(cp.fft.cache.plan_cache, '\n\n')

cp.fft.cache.plan_cache.set_size(0)  # disable the cache
print("without cache:")
print(repeat(fft_roundtrip, (a, (1, 2)), n_repeat=n_repeat))
print(repeat(fft_roundtrip, (a, (0, 1, 2)), n_repeat=n_repeat))
print(cp.fft.cache.plan_cache)
```

output:
With the cache enabled (its size defaults to 16), it saves 25% of the end-to-end runtime for running the benchmark above.
Jenkins, test this please
Successfully created a job for commit 011222c:
Jenkins CI test (for commit 011222c, target branch master) succeeded!
```python
if plan is None:
    devices = None if not config.use_multi_gpus else config._devices
    plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
    # TODO(leofang): do we need to add the current stream to keys?
```
It looks like currently the plans are associated with the default stream on creation, but when calling the FFT with a plan provided, the plan will be associated with the current stream.
Lines 782 to 783 in 97573d7:

```python
with nogil:
    result = cufftSetStream(plan, <driver.Stream>stream)
```
In that case, it seems like we shouldn't need to cache the stream as the plan will be reassociated with whatever stream is currently active. Is there a scenario where one wants to do FFTs from separate streams on a single device? If not, then I don't think we would need to redesign anything.
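The design implied here can be sketched in plain Python (hypothetical classes, not CuPy's or cuFFT's actual API): the cache key omits the stream, and each execution rebinds the cached plan to whatever stream is current, analogous to calling cufftSetStream before execution.

```python
# Sketch of the design being discussed: the stream is NOT part of the
# cache key; instead the cached plan is re-associated with the current
# stream right before each execution.
class Plan:
    def __init__(self, key):
        self.key = key
        self.stream = None

    def set_stream(self, stream):      # analogous to cufftSetStream
        self.stream = stream

cache = {}

def exec_fft(key, current_stream):
    # One plan per transform key, regardless of stream.
    plan = cache.setdefault(key, Plan(key))
    plan.set_stream(current_stream)    # rebind to whatever stream is active
    return plan

p = exec_fft(('c2c', 512), 'stream0')
q = exec_fft(('c2c', 512), 'stream1')
assert p is q and q.stream == 'stream1'  # one cached plan, rebound per call
```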
> Is there a scenario where one wants to do FFTs from separate streams on a single device?
I feel this is uncharted territory, as running the same FFT plan on multiple streams might drive the plan's internal state and work area nuts. The cuFFT documentation doesn't say anything about this, only that overlapping plan executions with data transfers is possible. I don't even know how we could set up a "plan mutex" to prevent the same plan from being executed on multiple streams, and I'd rather leave this burden to end users...
> running the same FFT plan on multiple streams
I am not suggesting one would ever use the same plan simultaneously from separate streams, as I don't see how that could possibly work. It seems like a user could potentially try and do that though with the current design, but agree that it is probably a rare scenario and we don't need to try and guard against it here.
I was asking more if there is a usage scenario for adding the stream attribute on plan creation (and thus also to the plan cache keys), so that multiple plans that are identical aside from stream could be used simultaneously? I agree that we don't need to add this now even if that is a plausible scenario.
Hi Gregory, yes I think we are talking about different scenarios:
1. Same transform, multiple plan handles, running in parallel on multiple streams (but each handle on a unique stream);
2. Same transform, same plan handle, running in parallel on multiple streams.

By "plan handle" I mean cufftHandle.
I agree both scenarios are complex enough that they should be considered in a separate PR, but just out of curiosity I re-read the doc, and I feel there might be a solution to address both scenarios. Quoting Streamed cuFFT Transforms:

> Please note that in order to overlap plans using single plan handle user needs to manage work area buffers. Each concurrent plan execution needs it's exclusive work area. Work area can be set by cufftSetWorkArea function.

If I parse this sentence correctly, it is saying that scenario (2) is possible, as long as the plan is executing on distinct pairs of (stream, work_area). What do you think? I feel it's too good to be true; for example, when Run 1 is being executed midway and we set the work area and stream of the same plan handle for Run 2, wouldn't Run 1 write to the work area of Run 2 and scramble everything?

If an internal lock mechanism makes this possible, then in principle scenario (1) is also fixed (by reusing the same handle), as we still need to allocate multiple work areas anyway, although getting distinct handles is easier to manage.
There's also a third case here: Same transform, same plan handle, same stream, with multiple FFT launches being able to be overlapped just like any other kernel launch. That's the case that the quoted documentation section is primarily addressing.
The following common pattern in user code works pretty well:
- create plan, once
- call cufftSetWorkArea, exec FFT N times
- call cudaStreamSynchronize() to wait until all of the FFTs are complete.
I haven't personally tested with setting a separate stream on each invocation, but it is very possible that this would work.
Furthermore, the majority of the memory used by an allocated plan is the work area that gets allocated by default (when cufftSetAutoAllocation is true). I think it would be a better implementation for the cache if this memory was not allocated at plan creation time -- i.e., always call cufftSetAutoAllocation(handle, false) when generating plans, then explicitly assign a work area for each fft() invocation using a normal cupy memory pool as you describe above. The messy part of this is that you do need an event/callback to handle releasing the work area once the FFT completes.

Probably a separate PR to clean this up? But it's pretty important for a cache implementation: with large batch FFTs I've worked on systems where this work area ended up being hundreds of megabytes of GPU memory.
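The per-execution work-area idea can be modeled in plain Python (a rough stand-in, with bytearrays in place of GPU buffers and string placeholders for handles and streams): one plan handle is shared, while each concurrent (plan, stream) pair gets its own exclusive scratch buffer, mirroring the cufftSetWorkArea pattern quoted from the docs.

```python
# Hypothetical sketch: one plan handle overlapped across streams, where
# each concurrent execution gets its own exclusive work area (cf.
# cufftSetWorkArea with auto-allocation turned off).
work_areas = {}

def exec_on_stream(plan_handle, stream, work_size):
    """Run 'plan_handle' on 'stream' with a work area exclusive to that pair."""
    key = (plan_handle, stream)
    if key not in work_areas:
        work_areas[key] = bytearray(work_size)  # stand-in for a GPU allocation
    buf = work_areas[key]
    # ... here the real code would call cufftSetWorkArea(plan_handle, buf),
    # cufftSetStream(plan_handle, stream), and then execute the FFT ...
    return buf

a = exec_on_stream('handle0', 'stream1', 1024)
b = exec_on_stream('handle0', 'stream2', 1024)
assert a is not b  # distinct work areas for overlapping streams
```

Releasing each buffer once its FFT completes (the event/callback part) is the messy piece this sketch deliberately leaves out.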
Hi @dicta, thanks for your comments. I have a few questions:

> There's also a third case here: Same transform, same plan handle, same stream, with multiple FFT launches being able to be overlapped just like any other kernel launch.
If multiple such FFTs are launched on the same stream, how can they be overlapped with each other? Did you mean to overlap FFT with other user kernels and/or data transfer? Isn't that already possible / being done?
> The following common pattern in user code works pretty well:
> - create plan, once
> - call cufftSetWorkArea, exec FFT N times
> - call cudaStreamSynchronize() to wait until all of the FFTs are complete.
On the same stream, yes it will, and this is what will be enabled by the cache. (FYI, it's already possible by explicitly reusing a plan returned by get_fft_plan().)

On different streams, this is the Scenario 2 I mentioned above, and I feel we will need x unique buffers for x overlapped streams (x <= N).
> I haven't personally tested with setting a separate stream on each invocation, but it is very possible that this would work.
I guess so too.
> Furthermore, the majority of the memory being used by an allocated plan is the work area that gets allocated by default (when cufftSetAutoAllocation is true). I think it would be a better implementation for the cache if this memory was not allocated at plan creation time -- i.e., always call cufftSetAutoAllocation(handle, false) when generating plans, then explicitly assign a work area for each fft() invocation using a normal cupy memory pool as you describe above. The messy part of this is that you do need an event/callback to handle releasing the work area once the FFT completes.
This memory pool thing is actually what @grlee77 mentioned in his #3730 (comment). But I think it's too messy and hard to get right: you need a mutex protecting every unique pair of (stream, buffer) and also a callback. I don't think it's worth the effort.

Furthermore, CuPy already has its own memory pool, and we should just reuse it instead of layering on top. The lines of code in this simple module would increase significantly, and it's hard to predict/optimize the mempool for every use case.

Finally, with the recent pains we are experiencing with newer cuFFT (search "fft" in our issues/PRs), I'd rather not count on cuFFT behaving stably and predictably, which we would need for such a mempool. The cuFFT team never provides a strong guarantee for cuFFT's behaviors, nor do they do a good job of documenting the changes when there is one.
> Probably a separate PR to clean this up?
Clean up what?
> But it's pretty important for a cache implementation, with large batch FFTs I've worked on systems where this work area ended up being hundreds of megabytes of GPU memory.
That's one of the main scenarios I have in mind in which a plan cache will be super useful! Another main issue is the significant slowdown at plan creation time for certain FFTs (for example, #3556).
@grlee77 @peterbell10 You folks must have much more experience in caching than I do. What do we need to add to the test suite for the plan cache? @peterbell10 If you have time to add tests to scipy/scipy#12512, it'd be lovely as I can steal things from there 😇
Hi @leofang. Thanks for working on this! The other idea I wanted to look into, but never tried, was to modify the plan-creation code at Lines 752 to 759 in 97573d7.
Then we could have a method to dynamically allocate and deallocate the work area as needed (allowing many plans to be stored without substantial memory overhead). Obviously, this would only make sense if the memory allocation is not a substantial portion of the required planning overhead. As far as I can tell, the work area (plan.work_area.mem.size) is equal in size to the data itself. For data that is a substantial fraction of the GPU memory, this can severely limit the number of plans that can be stored. If implemented, the default behavior should stay as it is currently for performance reasons, but it seems like it could be of interest in memory-limited scenarios.
I don't have any fancy ideas, just basic stuff like verifying that the cache size and memory limits add up to the expected values for some simple 1D and nD plan cases. Probably also a couple of things to test the LRU aspect, e.g.
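For instance, the LRU aspect could be exercised with a test shaped roughly like this (using an OrderedDict-based stand-in rather than the real PlanCache, whose API may differ):

```python
from collections import OrderedDict

# Stand-in LRU cache to illustrate what an LRU-eviction test could check;
# the real PlanCache is implemented in Cython with its own API.
class LRU:
    def __init__(self, size):
        self.size = size
        self.d = OrderedDict()

    def get(self, key, make):
        if key in self.d:
            self.d.move_to_end(key)         # touching a key makes it most-recent
        else:
            if len(self.d) >= self.size:
                self.d.popitem(last=False)  # evict the least-recently-used entry
            self.d[key] = make()
        return self.d[key]

cache = LRU(2)
cache.get('plan_a', object)
cache.get('plan_b', object)
cache.get('plan_a', object)   # 'plan_a' is now most recently used
cache.get('plan_c', object)   # evicts 'plan_b', not 'plan_a'
assert 'plan_b' not in cache.d and 'plan_a' in cache.d and 'plan_c' in cache.d
```

The key property tested: a lookup refreshes an entry's recency, so the entry evicted on overflow is the one untouched longest.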
@mnicely and I have been closely tracking this PR, as having fast caching features for FFT plans is critical for online signal processing applications and completely applicable to our work on cusignal. The work you've done here has been fantastic, @leofang, and I'm very excited about the simple API. Let us know how we can help.
The whole team thinks this PR is great, and we really want to merge it! Thanks for all the hard work you put into this, @leofang
I came across this PR last week via a Google search (just a recent user of CuPy, not a team member) and really hope any remaining issues can be addressed and this PR makes it into the code base.
Hi @asi1024 Sorry for my outburst. I took a few days off and am now ready to revisit this. I should not have replied at 5am when I was already in a very bad state (for other reasons). I misinterpreted your question as a demand to overturn the current approach. I should have asked for your clarification either here or offline. It wasn't professional... Though I do hope such design discussions could happen earlier; being able to receive early feedback is one of the advantages of developing in the open. Also, thanks @grlee77 @mnicely @awthomp @kmaehashi @emcastillo @VolkerH for your kind replies and support. So, yes, as @grlee77 helped clarify:
I chose to implement it in Cython because I wanted to minimize the overhead (and it looks like we made it), though I didn't benchmark how large the overhead could be if the cache were implemented in Python. I think once cuFFT improves, or when running on AMD GPUs (#3896 (comment)), the CPU overhead could become a potential concern, so I'd rather make this the least of our worries. Let me know if you have any questions or thoughts!
Jenkins CI test (for commit 13b2686, target branch master) succeeded!
Jenkins, test this please
Jenkins CI test (for commit bfd1de7, target branch master) succeeded!
This caching mechanism looks useful for other purposes as well, for example cuTENSOR descriptors and other objects that hold GPU memory!
Jenkins, test this please.
Jenkins CI test (for commit 8d0705c, target branch master) succeeded!
Jenkins, test this please
Jenkins CI test (for commit 3784845, target branch master) succeeded!
LGTM!
Add a cuFFT plan cache
Thanks a lot, @asi1024 and everyone!
Interesting! Do cuTENSOR descriptors use a lot of memory? If so, in the future we might consider making this cache work globally, i.e., storing all kinds of plans (cuFFT, cuTENSOR, etc.). This would also make it easier to combat OOM, as we could just clear the cache all at once. Note, though, that a cache like this could also be useful when the object itself takes little memory but a long time to generate (cuFFT plans have both weaknesses).
Just a follow-up: It seems implementing the cache in Cython is further justified: @mnicely showed in rapidsai/cusignal#254 that the performance difference with a Python-based cache can be as high as 1.6x, which is very surprising! |
UPDATE: Close #3588.

This PR implements a least recently used (LRU) cache for cuFFT plans. The implementation is done in Cython to minimize the Python overhead; yet, I still use cdef classes (instead of pointers to structs) to avoid managing memory myself, and cdef as much as I can.

Properties of this cache:
- It works around slow plan generation for certain transforms (cupy.fft.irfft is slow #3556): see Add a cuFFT plan cache #3730 (comment)
- New APIs in cupy.fft.config: get_plan_cache(), show_plan_cache_info()
- (Docs question) How do we document PlanCache without explicitly referencing it in the autosummary?

What is NOT done in this PR (see the discussions in the replies below): I think it's out of scope, requires careful planning, and the performance gain, if any, is unknown.

~~Work in progress. Description to follow. All tests passed locally. Aim to address #3588 and follow scipy/scipy#12512.~~ TODO:
- add_multi_gpu_plan() and __setitem__()
- Add get_fft_plan outcomes to the cache?
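For reference, the combination of an entry-count limit and a memory limit discussed in this thread can be sketched in plain Python; all names here are illustrative, since the actual implementation is a Cython cdef class:

```python
from collections import OrderedDict

# Plain-Python sketch of an LRU cache with both an entry-count limit and a
# total-memory limit, the two knobs discussed for the plan cache.
class PlanCacheSketch:
    def __init__(self, size, memsize):
        self.size, self.memsize = size, memsize
        self.cur_mem = 0
        self.d = OrderedDict()          # key -> (plan, mem)

    def insert(self, key, plan, mem):
        # Evict LRU entries until both limits can accommodate the new plan.
        while self.d and (len(self.d) >= self.size
                          or self.cur_mem + mem > self.memsize):
            _, (_, old_mem) = self.d.popitem(last=False)
            self.cur_mem -= old_mem
        self.d[key] = (plan, mem)
        self.cur_mem += mem

    def lookup(self, key):
        if key in self.d:
            self.d.move_to_end(key)     # refresh recency on a hit
            return self.d[key][0]
        return None

c = PlanCacheSketch(size=16, memsize=100)
c.insert('k1', 'plan1', 60)
c.insert('k2', 'plan2', 60)   # would exceed memsize, so 'k1' is evicted
assert c.lookup('k1') is None and c.lookup('k2') == 'plan2'
assert c.cur_mem == 60
```

This matches the test ideas above: after a sequence of inserts, the entry count and the tracked memory total should add up to the expected values.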