Multi stream support #6

Closed
coreylowman opened this issue Sep 18, 2022 · 10 comments · Fixed by #82

Comments

@coreylowman
Owner

Currently CudaDevice only supports a single stream. Look into how multiple streams should be supported.

@coreylowman
Owner Author

coreylowman commented Sep 24, 2022

A possible approach to this is to use type states on CudaRc:

// Zero-sized marker types tracking which stream (if any) the data is tied to.
struct OffStream;
struct OnStream<const I: usize>;

struct CudaRc<T, State> { ... }

impl<T> CudaRc<T, OffStream> {
    // Move the data onto stream I.
    fn on_stream<const I: usize>(self) -> CudaRc<T, OnStream<I>> { ... }
}

impl<T, const I: usize> CudaRc<T, OnStream<I>> {
    // Synchronize stream I and move the data back off the stream.
    fn sync(self) -> CudaRc<T, OffStream> { ... }
}
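
For illustration, a call site under this scheme might look like the following sketch; dev.alloc and the element type are made up for the example, not existing API:

// Hypothetical usage of the typestate API above.
let a: CudaRc<[f32; 16], OffStream> = dev.alloc()?;

// Move the data onto stream 1; from here it can only be passed to kernels
// launched on stream 1 (enforced by the trait bounds sketched below).
let a: CudaRc<[f32; 16], OnStream<1>> = a.on_stream::<1>();

// ... launch kernels on stream 1 that read or write `a` ...

// Synchronize stream 1 and move the data back off the stream.
let a: CudaRc<[f32; 16], OffStream> = a.sync();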

Then, to ensure data can only be used on its current stream or moved onto a stream, IntoKernelParam and LaunchCudaFunction should have a const stream index associated with them:

trait LaunchCudaFunction<const I: usize> { ... }
trait IntoKernelParam<const I: usize> { ... }

// Off-stream data may be passed to a kernel launched on any stream I...
impl<T, const I: usize> IntoKernelParam<I> for CudaRc<T, OffStream> { ... }
// ...while on-stream data may only be passed to kernels on its own stream I.
impl<T, const I: usize> IntoKernelParam<I> for CudaRc<T, OnStream<I>> { ... }

As for which stream to actually use when launching, it could be device.streams[I % device.num_streams].
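
A minimal sketch of that lookup, assuming CudaDevice grows a pool of raw driver streams (the fields and the stream_for helper are assumptions, not the current layout):

// Hypothetical: the device owns a small, fixed pool of raw driver streams.
type CUstream = *mut std::ffi::c_void; // stand-in for the driver's stream handle

struct CudaDevice {
    streams: Vec<CUstream>,
    num_streams: usize,
}

impl CudaDevice {
    // Map the compile-time stream index onto one of the real streams.
    fn stream_for<const I: usize>(&self) -> CUstream {
        self.streams[I % self.num_streams]
    }
}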

@coreylowman
Owner Author

Interesting details here in the streams & freeing memory section:

There are no such ordering guarantees between streams, so extra care has to happen when a tensor A is allocated on one stream creator but used on another stream user. In these cases, users have to call A.record_stream(user) to let the allocator know about A’s use on user. During free, the allocator will only consider the block ready for reuse when all work that has been scheduled up to the point A became free on user is complete. This is done by recording an event on the user stream and only handing out A’s memory after that event has elapsed from the perspective of the CPU:

https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html
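
A rough Rust sketch of that record_stream bookkeeping, purely for illustration; Block, Stream, Event, and their methods are placeholders here, not cudarc (or PyTorch) APIs:

struct Stream;                                       // placeholder for a CUstream handle
struct Event;                                        // placeholder for a CUevent handle

impl Event {
    fn record_on(_user: &Stream) -> Event { Event }  // would call cuEventRecord on `user`
    fn is_complete(&self) -> bool { true }           // would call cuEventQuery
}

// A block of device memory tracked by the caching allocator.
struct Block {
    extra_uses: Vec<Stream>, // streams the tensor was used on besides its creator
}

impl Block {
    // The user calls this whenever the allocation is used on a stream other
    // than the one it was allocated on.
    fn record_stream(&mut self, user: Stream) {
        self.extra_uses.push(user);
    }

    // On free: record an event on every "user" stream, capturing all work
    // scheduled there so far. The allocator may only hand this block out
    // again once each of these events has completed from the CPU's view.
    fn events_to_wait_on(&mut self) -> Vec<Event> {
        self.extra_uses.drain(..).map(|s| Event::record_on(&s)).collect()
    }
}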

@coreylowman
Owner Author

Another thing to think about: slices of tensors on different streams. E.g. if I have a batch of data, each item in the batch could be computed on a different stream.
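
For example, a hypothetical helper that round-robins batch items over a pool of streams (Stream, Item, and process_on are placeholders, not existing API):

struct Stream;
struct Item;

// Placeholder for launching the per-item kernel on the given stream.
fn process_on(_item: &Item, _stream: &Stream) {}

fn process_batch(batch: &[Item], streams: &[Stream]) {
    for (i, item) in batch.iter().enumerate() {
        // Each batch item goes to a different stream so per-item work can overlap.
        process_on(item, &streams[i % streams.len()]);
    }
}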

@M1ngXU
Contributor

M1ngXU commented Dec 16, 2022

Why would that be advantageous? Wouldn't this cause some synchronization issues, as one would have to synchronize the device and not only a stream?

@coreylowman
Owner Author

@M1ngXU
Contributor

M1ngXU commented Jan 23, 2023

Wouldn't this also require the CudaSlices to be on the same device (in general)? So some const generic with the device ordinal is required anyway.

@coreylowman
Owner Author

Ooooh that is a great call-out. I think that could even apply to CudaSlice already? Imagine I create a slice on device 0 and then try to use that slice on a different device. I have no idea if that is even valid.

Will open a separate issue for that!

@M1ngXU
Contributor

M1ngXU commented Jan 23, 2023

I have no idea if that is even valid.

I don't think so, but there might be some virtual memory mapping that NVIDIA does 😅

@M1ngXU
Contributor

M1ngXU commented Jan 23, 2023

Also, I can't test this; maybe someone with 2 CUDA GPUs can? When adding these const generics, we could probably add const generics for streams too (basically the same, I guess).

@coreylowman
Owner Author

Okay, a different direction for this: don't use type states, as they complicate things a bit too much and are probably hard to get right.

Instead:

  1. Add a struct Stream that holds a new stream and a reference to an Arc<CudaDevice>
  2. In the impl Drop for Stream, record an event on the dropped stream and force the device's cu_stream to wait on it via cuStreamWaitEvent(self.dev.cu_stream, event, 0) (see the sketch below)
  3. Add a method to LaunchAsync that takes a Stream object as the last argument
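
A minimal sketch of steps 1 and 2, assuming a simplified CudaDevice with a cu_stream field and hand-rolled driver declarations in a local sys module; the real cudarc types, error handling, and the LaunchAsync change from step 3 are omitted:

use std::sync::Arc;

// Minimal raw driver declarations for this sketch only.
mod sys {
    pub type CUstream = *mut std::ffi::c_void;
    pub type CUevent = *mut std::ffi::c_void;
    pub type CUresult = i32;
    extern "C" {
        pub fn cuEventCreate(event: *mut CUevent, flags: u32) -> CUresult;
        pub fn cuEventRecord(event: CUevent, stream: CUstream) -> CUresult;
        pub fn cuStreamWaitEvent(stream: CUstream, event: CUevent, flags: u32) -> CUresult;
        pub fn cuEventDestroy_v2(event: CUevent) -> CUresult;
        pub fn cuStreamDestroy_v2(stream: CUstream) -> CUresult;
    }
}

// Assumed shape of the device handle for this sketch.
pub struct CudaDevice {
    cu_stream: sys::CUstream, // the device's default stream
}

pub struct Stream {
    dev: Arc<CudaDevice>,
    cu_stream: sys::CUstream, // the newly created side stream
}

impl Drop for Stream {
    fn drop(&mut self) {
        unsafe {
            // Record an event capturing all work queued on this stream so far...
            let mut event: sys::CUevent = std::ptr::null_mut();
            sys::cuEventCreate(&mut event, 0);
            sys::cuEventRecord(event, self.cu_stream);
            // ...and make the device's default stream wait on it, so anything
            // launched later on dev.cu_stream is ordered after this stream's work.
            sys::cuStreamWaitEvent(self.dev.cu_stream, event, 0);
            // Both destroys are deferred by the driver until pending work completes.
            sys::cuEventDestroy_v2(event);
            sys::cuStreamDestroy_v2(self.cu_stream);
        }
    }
}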
