Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling runtime choice for mallocAsync or sync version. #174

Merged
merged 10 commits into from
Aug 10, 2023

Conversation

Narsil
Copy link
Contributor

@Narsil Narsil commented Aug 10, 2023

Initial issue: huggingface/candle#353

Basically cudarc fails on Windows even on recent cards, but also on older cards.

The core reason is that cuMemAllocAsync is not supported on those devices/OS.
This PR enables it by choosing between sync and async at runtime in the intended way by CUDA https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html (Check supported platform).

Alternatives:

I would have preferred to use a compile time flag for this, since it seems always preferable to use async where available and sync elsewhere.

Previous work along that line.
#106

I think we could have had a feature enabled in build.rs (rust-lang/cargo#5499) for that particular thing.
However I was unable to find an easy way to detect that capacity using nvidia-smi (or any other expected CLI).
Meaning in order to do that at compile time I think we would have to compile first cudaDeviceGetAttribute small program in order to be able to call it within build.rs. This seems way overengineered and therefore I didn't do that.

I haven't measure performance with the if everywhere.

Another variant of this PR would be to keep track of wether the Slice was initialized with Async or not, to remove the if within Drop. (We could also store it directly on CudaDevice to remote the if within alloc.)
I deemed this as a performance feature and therefore didn't do it in this PR.

@Narsil
Copy link
Contributor Author

Narsil commented Aug 10, 2023

@coreylowman

@coreylowman
Copy link
Owner

This looks fine to me, I like the runtime detection better than adding an additional compiler flag TBH.

@Narsil it looks like this deletes some examples, can you restore those? They are probably juts from out-dated fork

@Narsil
Copy link
Contributor Author

Narsil commented Aug 10, 2023

@coreylowman The example is actually because this examples always fails to build. I started just kicking it out whenever I'm using tests.

I'll restore for this PR, but would be nice to have a real fix.
Same goes for the 2 multi device tests they are failing on single GPU machines which does not feel super great :)

src/driver/safe/core.rs Outdated Show resolved Hide resolved
Copy link
Owner

@coreylowman coreylowman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@coreylowman coreylowman merged commit 7fd4ba7 into coreylowman:main Aug 10, 2023
3 checks passed
@coreylowman
Copy link
Owner

@Narsil new version up on crates.io

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants