Add support for asynchronous memcpy #8
Comments
bheisler added the New CUDA Feature and Bigger Project labels on Nov 25, 2018

I'll take a stab at this.

My current plan is to add the trait

@bheisler Thoughts?

Additionally,

Actually, spinning off

I would split

As for

Hmmm, I forgot how tricky async safety is. To make sure the arguments stay valid, maybe returning a promise bound to the lifetime of the passed references is the way to go?

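The lifetime-bound promise idea can be sketched in plain Rust. Everything here is hypothetical (the `AsyncCopyPromise` name, the `async_copy` function), and a synchronous host copy stands in for the device transfer; the point is only that the borrow checker blocks buffer access until `wait` consumes the promise.

```rust
use std::marker::PhantomData;

// Hypothetical promise returned by an async copy. It holds the mutable
// borrow of the buffers, so the borrow checker prevents any other access
// until `wait` is called.
struct AsyncCopyPromise<'a> {
    _borrow: PhantomData<&'a mut [u8]>,
}

impl<'a> AsyncCopyPromise<'a> {
    // Block until the (stubbed) transfer completes, releasing the borrows.
    fn wait(self) {}
}

// Stand-in for the real async memcpy: in RustaCUDA this would enqueue the
// copy on a stream. Here it copies synchronously and ties lifetimes together.
fn async_copy<'a>(dst: &'a mut [u8], src: &'a [u8]) -> AsyncCopyPromise<'a> {
    dst.copy_from_slice(src); // synchronous stand-in for the device copy
    AsyncCopyPromise { _borrow: PhantomData }
}

fn main() {
    let src = [1u8, 2, 3];
    let mut dst = [0u8; 3];
    let promise = async_copy(&mut dst, &src);
    // dst[0] = 9; // would not compile: dst is still borrowed by the promise
    promise.wait();
    assert_eq!(dst, [1, 2, 3]); // safe to read after the wait
}
```

The catch, as discussed below, is that dropping (or leaking) the promise without waiting must also be handled somehow.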
Yeah, this will be tricky alright. I haven't planned out a design for this. The only time we can be sure that it's safe to drop either the host-side or the device-side half of the transfer is after a

I've been thinking about using the Futures API to handle asynchronous stuff safely (though I'm still fuzzy on the details), so it might be necessary to hold off on this until we figure that out some more.

My current thought is something similar to this code. This would also require bookkeeping in the buffers themselves to panic if the promise is dropped and the buffers are then used. Alternatively, we could wait longer for async/await, the futures book, and all the other async goodies, and then go for the implementation, but I think that would require the same panic bookkeeping.

AndrewGaspar commented Dec 8, 2018

Unfortunately you can't do this. Forgetting a value is safe in Rust, so you could forget the promise while the buffers are still borrowed:

In rsmpi we solve this using a

We also currently have an outstanding PR (that I still need to finish

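The `mem::forget` hazard and the scoped workaround can both be sketched in plain Rust. All names here (`StreamScope`, `stream_scope`, `enqueue_copy`) are hypothetical stand-ins, not rsmpi or RustaCUDA API, and the "stream" does its work synchronously:

```rust
use std::cell::Cell;

// Hypothetical scope type; in RustaCUDA it would wrap a CUDA stream.
struct StreamScope {
    pending: Cell<usize>,
}

impl StreamScope {
    // Stand-in for enqueueing an async copy on the scope's stream.
    fn enqueue_copy(&self, dst: &mut [u8], src: &[u8]) {
        dst.copy_from_slice(src); // synchronous stand-in
        self.pending.set(self.pending.get() + 1);
    }

    // Stand-in for stream synchronization.
    fn synchronize(&self) {
        self.pending.set(0);
    }
}

// Because `std::mem::forget` is safe Rust, a guard that synchronizes in
// `Drop` can always be skipped by the caller. A scoped API avoids this:
// the synchronization runs after the closure returns, in library code that
// the user cannot `forget`.
fn stream_scope<R>(f: impl FnOnce(&StreamScope) -> R) -> R {
    let scope = StreamScope { pending: Cell::new(0) };
    let result = f(&scope);
    scope.synchronize(); // always runs before buffers can be reused
    result
}

fn main() {
    let src = [7u8; 4];
    let mut dst = [0u8; 4];
    stream_scope(|scope| scope.enqueue_copy(&mut dst, &src));
    assert_eq!(dst, [7; 4]); // safe to read: the scope has synchronized
}
```

A real version would need to ensure borrows handed to the scope cannot escape the closure, which takes more lifetime machinery than shown here.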
Yeah, I didn't really explain the ideas for bookkeeping around what happens if the promise is dropped. My bad on that. Anyways, this scope approach looks very promising!

Yeah, that fits really well with how I was planning to handle futures. See, it's not zero-cost to create a Future tied to a CUDA stream - you have to add a

Then the

If we add non-futures-based async functions, that can just be a different

Now that I think about it, this would probably help solve the safety problems with Contexts as well.

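The stream-callback idea might look something like this in miniature, with a worker thread standing in for the CUDA stream. All names are hypothetical and no CUDA API is used; the point is that completing a Future requires enqueueing an extra host callback on the stream, which is why it isn't zero-cost:

```rust
use std::sync::mpsc;
use std::thread;

// Simulated stream: a worker thread executing queued host operations in
// FIFO order, the way a CUDA stream orders its work.
struct Stream {
    queue: mpsc::Sender<Box<dyn FnOnce() + Send>>,
}

impl Stream {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
        thread::spawn(move || {
            for op in rx {
                op();
            }
        });
        Stream { queue: tx }
    }

    // Stand-in for enqueueing an async memcpy on the stream.
    fn enqueue(&self, op: impl FnOnce() + Send + 'static) {
        self.queue.send(Box::new(op)).unwrap();
    }

    // Stand-in for a host callback that runs once all previously queued
    // work has finished. A stream-backed Future would be completed (its
    // waker called) from here; the callback itself occupies a slot in the
    // stream, which is the extra cost being discussed.
    fn add_callback(&self, cb: impl FnOnce() + Send + 'static) {
        self.enqueue(cb);
    }
}

fn main() {
    let stream = Stream::new();
    let (done_tx, done_rx) = mpsc::channel();
    let (data_tx, data_rx) = mpsc::channel();

    // "Copy" some data on the stream, then register a completion callback.
    stream.enqueue(move || data_tx.send(vec![1u8, 2, 3]).unwrap());
    stream.add_callback(move || done_tx.send(()).unwrap());

    done_rx.recv().unwrap(); // a Future's waker would fire here instead
    assert_eq!(data_rx.recv().unwrap(), vec![1, 2, 3]);
}
```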
Ah, I think I understand what you're saying now, and I think that should work.

Cool, it works. Link. Will need to sprinkle in some unsafe black magic, so that the data can be copied back from the mutable buffer by future async_memcpy calls.

Slight problem with that: Link. Scheduling multiple copies using the same buffer is completely safe as long as they're all on the same stream, but this implementation disallows it.

Yeah, that's what I was getting at with the second part of my comment. My current solution is to return the references wrapped such that later async_copy calls can consume them, but they can't be dereferenced by other things.

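A rough sketch of that wrapper idea, using hypothetical names (`InFlight`, `copy_to`, `synchronize`) and a synchronous host copy standing in for the device transfer. The wrapper deliberately exposes no `Deref`, so the only things you can do with an in-flight buffer are chain another copy or synchronize:

```rust
// Hypothetical wrapper around a buffer involved in an async copy: it holds
// the borrow until the stream is synchronized. It intentionally does not
// implement Deref, so the data cannot be read, but it can be consumed by a
// later copy on the same stream.
struct InFlight<'a, T> {
    buf: &'a mut [T],
}

impl<'a, T: Copy> InFlight<'a, T> {
    // Consume the wrapper to chain another copy on the same stream.
    fn copy_to(self, dst: &'a mut [T]) -> InFlight<'a, T> {
        dst.copy_from_slice(self.buf); // synchronous stand-in for the device copy
        InFlight { buf: dst }
    }

    // Stand-in for stream synchronization: releases the borrow.
    fn synchronize(self) -> &'a mut [T] {
        self.buf
    }
}

fn main() {
    let mut a = [1u8, 2, 3];
    let mut b = [0u8; 3];
    let in_flight = InFlight { buf: &mut a };
    // Chaining a second copy on the "stream" is allowed...
    let chained = in_flight.copy_to(&mut b);
    // ...but reading b here would not compile: it's still borrowed.
    let b_ref = chained.synchronize();
    assert_eq!(b_ref, [1, 2, 3]);
}
```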
I'd be very wary of unsafe black magic in this case - we could end up introducing undefined behavior while trying to hide other undefined behavior. Anyway, this is kinda what I was thinking. If you can find an ergonomic way to make it unsafe to modify the buffers while they're used in an async copy, that's great. If not, I'd be OK with just doing this even if it is slightly vulnerable to data races.

How is pinned host memory done right now? Is that what the DeviceCopy trait indicates?

Additionally, the unsafe wrapper layer is done now, save for the test

After solving that issue, next up will be trying to wrap this all safely as futures, based on our earlier discussion.

Page-locked memory is all handled by the driver. You call a certain CUDA API function to allocate and free page-locked memory. The driver tracks which memory ranges are locked and uses a fast path for copies to/from those ranges.

DeviceCopy is for structures that can safely be copied to the device (i.e., they don't manage host-side resources or contain pointers that are only valid on the host). It has nothing to do with page-locking, pinning, or anything else.

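The contract DeviceCopy expresses can be illustrated with a local marker trait. This is a sketch of the idea only, not RustaCUDA's actual definition, and `DeviceCopyLike`/`upload` are invented names:

```rust
// Sketch of the idea: an unsafe marker trait asserting that a type is plain
// data which remains valid when byte-copied to the device. It is `unsafe`
// because the implementor, not the compiler, guarantees there are no
// host-only pointers or resources inside.
unsafe trait DeviceCopyLike: Copy {}

#[derive(Clone, Copy)]
#[repr(C)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// Fine: Vec3 is plain data, meaningful on both host and device.
unsafe impl DeviceCopyLike for Vec3 {}

// By contrast, `String` must never get this impl: it holds a heap pointer
// that is only valid in host address space (and it isn't `Copy` anyway).

// Stand-in for a host-to-device transfer: byte-copying is all that's needed
// for a DeviceCopyLike type, so the bound makes such APIs safe to expose.
fn upload<T: DeviceCopyLike>(value: T) -> T {
    value
}

fn main() {
    let v = upload(Vec3 { x: 1.0, y: 2.0, z: 3.0 });
    assert_eq!(v.z, 3.0);
}
```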
Alright, so AsyncMemcpy requires pinned memory, but it also fails with a proper runtime error if given memory that isn't page-locked, so we don't necessarily need to mark that in the wrapper.

The error I was mentioning only appears when multiple tests are run at the same time. EDIT: Nevermind, it appears rarely even when run alone.

Alright, to sum up my current thoughts on this:

Could you elaborate more on this? Why not?

Previously, I thought

khyperia commented Dec 23, 2018

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda.) I'd be completely fine with really low-tech solutions to this problem, just to get the feature out there:

Something that I can't seem to find any documentation on is the behavior of the driver when a buffer is freed in the middle of work. The driver may already take care of the hard parts of this -

(I'd be happy to write a PR for option 1, and if you like it, a PR for option 2 given a bit of time)

Let me finish up 1. It's pretty much done, with a PR up right now; I just need to rebase it and clean it up a bit more, but I've been slow on that because of the holidays. I'll schedule some time to finish it up by tomorrow. I'll defer to you on doing 2, since I'll be busy for a while. I think you probably want something more of the form

See #20 for the PR I'm writing.

Thanks for your interest, and thanks for trying RustaCUDA! Yeah, I'd be interested in pull requests, though rusch95 has already submitted a WIP PR to add an unsafe interface for async memcpy. We may have to iterate a few times to find a good balance of safety, ergonomics, and performance for the safe interface.

bheisler commented Nov 25, 2018

Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.