
Add a cuFFT plan cache #3730

Merged
merged 50 commits into cupy:master on Sep 18, 2020

Conversation

leofang
Member

@leofang leofang commented Aug 6, 2020

UPDATE: Close #3588.

This PR implements a least recently used (LRU) cache for cuFFT plans. The implementation is done in Cython to minimize Python overhead; that said, I still use cdef classes (instead of pointers to structs) to avoid managing memory myself, while cdef'ing as much as possible.

Properties of this cache:

What is NOT done in this PR (see the discussions in the replies below):

  • Lock stream on to work area
  • Manage a work area pool by the cache

I think this is out of scope; it requires careful planning, and the performance gain, if any, is unknown.
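For reference, a minimal sketch of how the cache can be inspected and controlled, using the names that appear in the benchmarks below (set_memsize is my assumption based on the "max memsize" field in the printed summary; the exact module path may still change):

import cupy as cp

cache = cp.fft.cache.plan_cache       # per-thread, per-device LRU cache object
cache.set_size(8)                     # cap the number of cached plans
# cache.set_memsize(2 * 1024**3)      # (assumed) cap the total bytes held by cached plans
print(cache)                          # prints a summary like the ones shown below
cache.set_size(0)                     # size 0 disables the cache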


Work in progress. Description to follow. All tests passed locally.
Aims to address #3588 and follows scipy/scipy#12512.

TODO:

@leofang
Member Author

leofang commented Aug 6, 2020

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 62f5699:

@chainer-ci
Member

Jenkins CI test (for commit 62f5699, target branch master) succeeded!

@leofang
Member Author

leofang commented Aug 6, 2020

Stupid me...🤦🏻‍♂️

@leofang
Member Author

leofang commented Aug 6, 2020

HUGE reduction in CPU time (and some improvements in GPU time as well):

import cupy as cp
from cupyx.time import repeat

# choosing non-prime numbers could make plan generation time longer?
a = cp.random.random((198, 597, 418)).astype(cp.complex128)

# typical workload of ours
def fft_roundtrip(a, axis=None):
    out = cp.fft.fftn(a, axes=axis)
    out = cp.fft.ifftn(out, axes=axis)
    return out

n_repeat = 10

print("with cache:")
print(repeat(fft_roundtrip, (a, (1,2)), n_repeat=n_repeat))
print(repeat(fft_roundtrip, (a, (0,1,2)), n_repeat=n_repeat))
print(cp.fft.cache.plan_cache, '\n\n')

cp.fft.cache.plan_cache.set_size(0)  # disable the cache
print("without cache:")
print(repeat(fft_roundtrip, (a, (1,2)), n_repeat=n_repeat))
print(repeat(fft_roundtrip, (a, (0,1,2)), n_repeat=n_repeat))
print(cp.fft.cache.plan_cache)

output:

with cache:
/home/leofang/cupy/cupyx/time.py:56: FutureWarning: cupyx.time.repeat is experimental. The interface can change in the future.
  util.experimental('cupyx.time.repeat')
fft_roundtrip       :    CPU:  251.862 us   +/-45.510 (min:  221.953 / max:  380.593) us     GPU-0:294705.273 us   +/-66.591 (min:294637.634 / max:294866.943) us
fft_roundtrip       :    CPU:  256.680 us   +/-42.280 (min:  230.437 / max:  381.849) us     GPU-0:462417.114 us   +/-671.659 (min:460680.573 / max:462825.470) us
-------------- cuFFT plan cache --------------
cache enabled? True
current / max size: 2 / 16 (counts)
current / max memsize: 7141533696 / (unlimited) (bytes)

cached plans (least used first):
key: ((597, 418), (597, 418), 1, 249546, (597, 418), 1, 249546, 105, 198, 'C', 2, None), plan type: PlanNd, memory usage: 3873374208
key: ((198, 597, 418), (198, 597, 418), 1, 1, (198, 597, 418), 1, 1, 105, 1, 'C', 2, None), plan type: PlanNd, memory usage: 3268159488 


without cache:
fft_roundtrip       :    CPU:297780.415 us   +/-304.553 (min:297310.701 / max:298156.878) us     GPU-0:297769.476 us   +/-301.531 (min:297322.083 / max:298143.738) us
fft_roundtrip       :    CPU:466226.026 us   +/-3733.645 (min:464298.991 / max:476941.334) us     GPU-0:466205.548 us   +/-3734.419 (min:464284.668 / max:476923.767) us
-------------- cuFFT plan cache --------------
cache enabled? False
current / max size: 0 / 0 (counts)
current / max memsize: 0 / (unlimited) (bytes)

cached plans (least used first):

cc: @peterbell10 @grlee77 @mnicely @emcastillo

@pfn-ci-bot
Collaborator

  [INVALID_ARGUMENT] unknown command: home/leofang/cupy/cupyx/time.py:56: (686f6d652f6c656f66616e672f637570792f63757079782f74696d652e70793a35363a)
@leofang
Member Author

leofang commented Aug 6, 2020

With the cache enabled (its size defaults to 16), the end-to-end runtime of running pytest tests/cupyx_tests/scipy_tests/fft_tests/test_fft.py is reduced by 25%.

@leofang
Member Author

leofang commented Aug 6, 2020

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 011222c:

@mnicely
Contributor

mnicely commented Aug 6, 2020

> (quoting @leofang's benchmark comment above in full)

@awthomp

@chainer-ci
Member

Jenkins CI test (for commit 011222c, target branch master) succeeded!

if plan is None:
devices = None if not config.use_multi_gpus else config._devices
plan = cufft.Plan1d(out_size, fft_type, batch, devices=devices)
# TODO(leofang): do we need to add the current stream to keys?
Contributor

It looks like currently, the plans are associated with the default stream on creation, but when calling the FFT with a plan provided, the plan will be associated with the current stream.

cupy/cupy/cuda/cufft.pyx

Lines 782 to 783 in 97573d7

with nogil:
result = cufftSetStream(plan, <driver.Stream>stream)

In that case, it seems like we shouldn't need to cache the stream as the plan will be reassociated with whatever stream is currently active. Is there a scenario where one wants to do FFTs from separate streams on a single device? If not, then I don't think we would need to redesign anything.

Member Author

@leofang leofang Aug 6, 2020

Is there a scenario where one wants to do FFTs from separate streams on a single device?

I feel this is uncharted territory, as running the same FFT plan on multiple streams might drive the plan's internal state and work area nuts. The cuFFT documentation doesn't say anything about this, only that overlapping plan execution with data transfer is possible. I don't even know how we could set up a "plan mutex" to prevent the same plan from being executed on multiple streams, and I'd rather put this burden on end users...

Contributor

running the same FFT plan on multiple streams

I am not suggesting one would ever use the same plan simultaneously from separate streams, as I don't see how that could possibly work. It seems like a user could potentially try to do that with the current design, though, but I agree that it is probably a rare scenario and we don't need to guard against it here.

I was asking more whether there is a usage scenario for adding the stream attribute at plan creation (and thus also to the plan cache keys), so that multiple plans that are identical aside from the stream could be used simultaneously. I agree that we don't need to add this now, even if that is a plausible scenario.

Member Author

Hi Gregory, yes I think we are talking about different scenarios:

  1. Same transform, multiple plan handles, running in parallel on multiple streams (but each handle on a unique stream),
  2. Same transform, same plan handle, running in parallel on multiple streams,

by "plan handle" I mean cufftHandle.

I agree both scenarios are complex enough that they should be considered in a separate PR, but out of curiosity I re-read the docs, and I feel there might be a solution that addresses both. Quoting Streamed cuFFT Transforms:

Please note that in order to overlap plans using single plan handle user needs to manage work area buffers. Each concurrent plan execution needs it's exclusive work area. Work area can be set by cufftSetWorkArea function.

If I parse this sentence correctly, it is saying that situation (2) is possible, as long as the plan is executed on distinct pairs of (stream, work_area). What do you think? I feel it's too good to be true: for example, when Run 1 is executing midway and we set the work area and stream of the same plan handle for Run 2, wouldn't Run 1 write into Run 2's work area and scramble everything?

If an internal lock mechanism makes this possible, then in principle situation (1) is also addressed (by reusing the same handle), since we still need to allocate multiple work areas anyway, although getting distinct handles is easier to manage.

@dicta

There's also a third case here: Same transform, same plan handle, same stream, with multiple FFT launches being able to be overlapped just like any other kernel launch. That's the case that the quoted documentation section is primarily addressing.

The following common pattern in user code works pretty well:

  • create plan, once
  • call cufftSetWorkArea, exec FFT N times
  • call cudaStreamSynchronize() to wait until all of the FFTs are complete.

I haven't personally tested with setting a separate stream on each invocation, but it is very possible that this would work.

Furthermore, the majority of the memory used by an allocated plan is the work area that gets allocated by default (when cufftSetAutoAllocation is true). I think it would be a better implementation for the cache if this memory were not allocated at plan creation time -- i.e., always call cufftSetAutoAllocation(handle, false) when generating plans, then explicitly assign a work area for each fft() invocation using a normal CuPy memory pool as you describe above. The messy part of this is that you do need an event/callback to handle releasing the work area once the FFT completes.

Probably a separate PR to clean this up? But it's pretty important for a cache implementation; with large-batch FFTs, I've worked on systems where this work area ended up being hundreds of megabytes of GPU memory.

Member Author

@leofang leofang Aug 11, 2020

Hi @dicta Thanks for your comments. I have a few questions:

There's also a third case here: Same transform, same plan handle, same stream, with multiple FFT launches being able to be overlapped just like any other kernel launch.

If multiple such FFTs are launched on the same stream, how can they overlap with each other? Did you mean overlapping FFTs with other user kernels and/or data transfers? Isn't that already possible / being done?

The following common pattern in user code works pretty well:

  • create plan, once
  • call cufftSetWorkArea, exec FFT N times
  • call cudaStreamSynchronize() to wait until all of the FFTs are complete.

On the same stream, yes it will work, and that is what the cache enables. (FYI, it's already possible by explicitly reusing a plan returned by get_fft_plan(), as sketched below.)
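For reference, a rough sketch of that explicit reuse with the existing get_fft_plan() context-manager interface (details here are illustrative, using the array shape from the benchmark above):

import cupy as cp
from cupyx.scipy.fftpack import get_fft_plan

a = cp.random.random((198, 597, 418)).astype(cp.complex128)

# Build the plan once, then reuse it for repeated transforms of the same shape.
plan = get_fft_plan(a, axes=(1, 2), value_type='C2C')
with plan:                            # cp.fft picks up the active plan
    for _ in range(10):
        out = cp.fft.fftn(a, axes=(1, 2))
        out = cp.fft.ifftn(out, axes=(1, 2))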

On different streams, this is Scenario 2 mentioned above, and I feel we would need x unique buffers for x overlapped streams (x <= N).

I haven't personally tested with setting a separate stream on each invocation, but it is very possible that this would work.

I guess so too.

Furthermore, the majority of the memory used by an allocated plan is the work area that gets allocated by default (when cufftSetAutoAllocation is true). I think it would be a better implementation for the cache if this memory were not allocated at plan creation time -- i.e., always call cufftSetAutoAllocation(handle, false) when generating plans, then explicitly assign a work area for each fft() invocation using a normal CuPy memory pool as you describe above. The messy part of this is that you do need an event/callback to handle releasing the work area once the FFT completes.

This memory pool idea is actually what @grlee77 mentioned in his #3730 (comment). But I think it's too messy and hard to get right: you need a mutex protecting every unique pair of (stream, buffer) and also a callback. I don't think it's worth the effort.

Furthermore, CuPy already has its own memory pool, and we should just reuse it instead of layering on top of it. The lines of code in this simple module would increase significantly, and it's hard to predict/optimize the mempool for every use case.

Finally, with the recent pains we have experienced with newer cuFFT versions (search "fft" in our issues/PRs), I'd rather not count on cuFFT behaving stably and predictably, which we would need for such a mempool. The cuFFT team never provides strong guarantees for cuFFT's behavior, nor do they do a good job of documenting changes when there are any.

Probably a separate PR to clean this up?

Clean up what?

But it's pretty important for a cache implementation; with large-batch FFTs, I've worked on systems where this work area ended up being hundreds of megabytes of GPU memory.

That's one of the main scenarios I have in mind where a plan cache will be super useful! Another main issue is the significant slowdown at plan creation time for certain FFTs (for example, #3556).

@leofang
Member Author

leofang commented Aug 6, 2020

@grlee77 @peterbell10 You folks must have much more experience with caching than I do. What do we need to add to the test suite for the plan cache? @peterbell10 If you have time to add tests to scipy/scipy#12512, that would be lovely, as I can steal things from there 😇

@grlee77
Contributor

grlee77 commented Aug 6, 2020

Hi @leofang. Thanks for working on this!
I like that you can control the cache based on the total memory size. I had tried that previously with a basic pure-Python implementation, just storing the cache in a Python dict, but never got around to finishing it. This Cython-based LRU cache implementation should be efficient and looks good to me.

The other idea I wanted to look into, but never tried, was to modify Plan1d and PlanNd to store the work_size (which they already compute as a temporary), e.g.

cupy/cupy/cuda/cufft.pyx

Lines 752 to 759 in 97573d7

work_area = memory.alloc(work_size)
with nogil:
result = cufftSetWorkArea(plan, <void *>(work_area.ptr))
check_result(result)
self.shape = tuple(shape)
self.fft_type = fft_type
self.work_area = work_area

Then we could have a method to dynamically allocate and deallocate the work area as needed (allowing many plans to be stored without substantial memory overhead). Obviously, this would only make sense if the memory allocation is not a substantial portion of the required planning overhead. As far as I can tell, the work area (plan.work_area.mem.size) is equal in size to the data itself; for data that is a substantial fraction of the GPU memory, this can severely limit the number of plans that can be stored. If implemented, the default behavior should stay as it is currently for performance reasons, but it seems like it could be of interest in memory-limited scenarios.
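A very rough sketch of that idea (hypothetical wrapper and method names, not part of this PR), just to illustrate the allocate/release-on-demand pattern:

import cupy

class LazyWorkAreaPlan:
    # Hypothetical wrapper that holds a plan but only allocates its work
    # area while a transform is actually being executed.

    def __init__(self, plan, work_size):
        self.plan = plan              # the wrapped Plan1d/PlanNd object
        self.work_size = work_size    # work size computed at plan creation
        self.work_area = None

    def acquire(self):
        # Allocate from CuPy's memory pool; a real implementation would then
        # re-attach the buffer to the plan via cufftSetWorkArea.
        if self.work_area is None:
            self.work_area = cupy.cuda.alloc(self.work_size)

    def release(self):
        # Drop the reference; the buffer returns to CuPy's memory pool.
        self.work_area = None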

@grlee77
Contributor

grlee77 commented Aug 6, 2020

What do we need to add to the test suite for the plan cache?

I don't have any fancy ideas, just basic stuff like verifying that the cache size and memory limits add up to the expected values for some simple 1d and nd plan cases.

Probably also a couple of things to test the LRU aspect (see the sketch after this list), e.g.:

  • limit to 2 plans, adding a 3rd plan results in the first plan that was added being removed
  • adding a plan that already exists in the cache moves it to the "most recent" position
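A rough sketch of what such tests might look like (using the plan_cache object from the benchmarks above; clear() is assumed here just to reset the cache):

import cupy as cp

def test_lru_eviction():
    cache = cp.fft.cache.plan_cache
    cache.clear()                     # assumed helper to start from an empty cache
    cache.set_size(2)

    a = cp.random.random(64).astype(cp.complex128)
    b = cp.random.random(128).astype(cp.complex128)
    c = cp.random.random(256).astype(cp.complex128)

    cp.fft.fft(a)                     # plan A cached
    cp.fft.fft(b)                     # plan B cached
    cp.fft.fft(a)                     # plan A becomes most recently used
    cp.fft.fft(c)                     # cache full: plan B (LRU) should be evicted

    # The summary should now list only the plans for `a` and `c`,
    # with the current size reported as 2 / 2.
    print(cache)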

@awthomp

awthomp commented Sep 11, 2020

@leofang, please don't get frustrated. Let's continue the discussion here as I think the work you have done so far is valuable and I can tell a lot of work went into this PR.

re: @asi1024

Why not use cupy._util.memoize or functools.lru_cache?

A useful feature added here is being able to control the maximum size of the cache in terms of total GPU memory usage. I think functools.lru_cache only allows controlling the number of items cached.

The only argument I see in favor of memoize or lru_cache is lower maintenance cost (and lower development cost, but in this case the development of the custom cache has already been completed!). Given that the custom cache here is pretty extensively tested, I am not too worried that it has bugs and will need substantial revisions going forward.
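A minimal illustration of the point above: functools.lru_cache only bounds how many items are kept, not how much GPU memory they hold (get_plan here is a placeholder, not CuPy code):

import functools

@functools.lru_cache(maxsize=16)      # bounds the *count* of cached plans...
def get_plan(shape, fft_type):
    ...                               # ...but not the total GPU memory they occupy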

@mnicely and I have been closely tracking this PR, as having fast caching features for FFT plans is critical for online signal processing applications and completely applicable to our work on cusignal. The work you've done here has been fantastic, @leofang, and I'm very excited about the simple API.

Let us know how we can help.

@emcastillo
Member

The whole team thinks that this PR is great and we really want to merge it! Thanks for all the hard work you put into this, @leofang.

@VolkerH

VolkerH commented Sep 14, 2020

I came across this PR last week via a Google search (I'm just a recent user of CuPy, not a team member) and really hope any remaining issues can be addressed so this PR makes it into the code base.

@leofang
Member Author

leofang commented Sep 14, 2020

Hi @asi1024, sorry for my outburst. I took a few days off and am now ready to revisit this. I should not have replied at 5 am when I was already in a very bad state (for other reasons). I misinterpreted your question as a demand to overturn the current approach; I should have asked you for clarification either here or offline. It wasn't professional... Though I do hope such design discussions could happen earlier: being able to receive early feedback is one of the advantages of developing in the open.

Also, thanks @grlee77 @mnicely @awthomp @kmaehashi @emcastillo @VolkerH for your kind replies and support.

So, yes, as @grlee77 helped clarify, functools.lru_cache is limited if we want to impose extra constraints such as memory usage. In particular, given that functools.lru_cache is implemented as a function decorator (as is cupy.memoize) rather than an inheritable class, I do not have a clear picture of how to reuse it while dealing with all the complexities (Plan1d vs PlanNd, single-GPU vs multi-GPU, per-thread and per-device caches, etc.) that require special logic for insertion and removal; if it were inheritable, I could override/wrap some of its methods. (I actually peeked at its internals while developing the cache here to gain confidence.)

cupy.memoize has an additional obstacle: I do not know how to remove cached plans from it when needed. Perhaps that's simply because I haven't given it enough thought.

I chose to implement it in Cython because I wanted to minimize the overhead (and it looks like we did), though I didn't benchmark how large the overhead would be if the cache were implemented in Python. I think once cuFFT improves, or when running on AMD GPUs (#3896 (comment)), the CPU overhead could become a real concern, so I'd rather make this the least of our worries.

Let me know if you have any questions or thoughts!

@leofang leofang reopened this Sep 14, 2020
@chainer-ci
Member

Jenkins CI test (for commit 13b2686, target branch master) succeeded!

@leofang
Member Author

leofang commented Sep 14, 2020

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit bfd1de7, target branch master) succeeded!

@asi1024
Member

asi1024 commented Sep 16, 2020

This caching mechanism looks useful also for other purposes, for example cuTENSOR descriptors and other objects that have GPU memory spaces!

@asi1024
Member

asi1024 commented Sep 17, 2020

Jenkins, test this please.

@chainer-ci
Member

Jenkins CI test (for commit 8d0705c, target branch master) succeeded!

@leofang leofang marked this pull request as draft September 17, 2020 19:19
@leofang
Member Author

leofang commented Sep 17, 2020

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit 3784845, target branch master) succeeded!

@leofang leofang marked this pull request as ready for review September 17, 2020 21:56
@asi1024
Member

asi1024 commented Sep 18, 2020

LGTM!

@asi1024 asi1024 merged commit dd2fb09 into cupy:master Sep 18, 2020
@asi1024 asi1024 added this to the v9.0.0a1 milestone Sep 18, 2020
asi1024 added a commit to asi1024/cupy that referenced this pull request Sep 18, 2020
@leofang leofang deleted the cufft_cache branch September 18, 2020 13:33
@leofang
Member Author

leofang commented Sep 18, 2020

Thanks a lot, @asi1024 and everyone!

This caching mechanism looks useful also for other purposes, for example cuTENSOR descriptors and other objects that have GPU memory spaces!

Interesting! Do cuTENSOR descriptors use a lot of memory? If so, in the future we might consider making this cache work globally, i.e., storing all kinds of plans (cuFFT, cuTENSOR, etc.). That would also make it easier to combat OOM, as we could just clear the cache all at once.

Note, though, that a cache like this can also be useful when the object itself takes little memory but a long time to generate (cuFFT plans have both weaknesses).

@leofang
Member Author

leofang commented Oct 5, 2020

Just a follow-up: it seems implementing the cache in Cython is further justified. @mnicely showed in rapidsai/cusignal#254 that the performance difference relative to a Python-based cache can be as high as 1.6x, which is very surprising!

Labels
cat:enhancement (Improvements to existing features), to-be-backported (Pull-requests to be backported to stable branch)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reuse cufft plan objects