-
-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling runtime choice for mallocAsync or sync version. #174
Conversation
This looks fine to me, I like the runtime detection better than adding an additional compiler flag TBH. @Narsil it looks like this deletes some examples, can you restore those? They are probably juts from out-dated fork |
@coreylowman The example is actually because this examples always fails to build. I started just kicking it out whenever I'm using tests. I'll restore for this PR, but would be nice to have a real fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@Narsil new version up on crates.io |
Initial issue: huggingface/candle#353
Basically cudarc fails on Windows even on recent cards, but also on older cards.
The core reason is that
cuMemAllocAsync
is not supported on those devices/OS.This PR enables it by choosing between sync and async at runtime in the intended way by CUDA https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html (Check supported platform).
Alternatives:
I would have preferred to use a compile time flag for this, since it seems always preferable to use async where available and sync elsewhere.
Previous work along that line.
#106
I think we could have had a feature enabled in
build.rs
(rust-lang/cargo#5499) for that particular thing.However I was unable to find an easy way to detect that capacity using
nvidia-smi
(or any other expected CLI).Meaning in order to do that at compile time I think we would have to compile first
cudaDeviceGetAttribute
small program in order to be able to call it withinbuild.rs
. This seems way overengineered and therefore I didn't do that.I haven't measure performance with the if everywhere.
Another variant of this PR would be to keep track of wether the Slice was initialized with Async or not, to remove the if within
Drop
. (We could also store it directly onCudaDevice
to remote the if withinalloc
.)I deemed this as a performance feature and therefore didn't do it in this PR.