Multi stream support #6

Closed
coreylowman opened this issue Sep 18, 2022 · 10 comments · Fixed by #82

Comments

@coreylowman
Owner

Currently CudaDevice only supports a single stream. Look into how multiple streams should be supported.

@coreylowman
Owner Author

coreylowman commented Sep 24, 2022

A possible approach to this is to use type states on CudaRc:

// Zero-sized marker types tracking which stream (if any) the data is tied to.
struct OffStream;
struct OnStream<const I: usize>;

struct CudaRc<T, State> { ... }

impl<T> CudaRc<T, OffStream> {
    // Move the data onto stream I.
    fn on_stream<const I: usize>(self) -> CudaRc<T, OnStream<I>> { ... }
}

impl<T, const I: usize> CudaRc<T, OnStream<I>> {
    // Synchronize stream I and move the data back off the stream.
    fn sync(self) -> CudaRc<T, OffStream> { ... }
}
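
For illustration, a call site under this scheme might look like the following sketch; dev.alloc and the element type are made up for the example, not existing API:

// Hypothetical usage of the typestate API above.
let a: CudaRc<[f32; 16], OffStream> = dev.alloc()?;

// Move the data onto stream 1; from here it can only be passed to kernels
// launched on stream 1 (enforced by the trait bounds sketched below).
let a: CudaRc<[f32; 16], OnStream<1>> = a.on_stream::<1>();

// ... launch kernels on stream 1 that read or write `a` ...

// Synchronize stream 1 and move the data back off the stream.
let a: CudaRc<[f32; 16], OffStream> = a.sync();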

Then, to ensure data can only be used on its current stream or moved onto a stream, IntoKernelParam and LaunchCudaFunction should have a const stream index associated with them:

trait LaunchCudaFunction<const I: usize> { ... }
trait IntoKernelParam<const I: usize> { ... }

// Off-stream data may be passed to a kernel launched on any stream I...
impl<T, const I: usize> IntoKernelParam<I> for CudaRc<T, OffStream> { ... }
// ...while on-stream data may only be passed to kernels on its own stream I.
impl<T, const I: usize> IntoKernelParam<I> for CudaRc<T, OnStream<I>> { ... }

As for which stream to actually use when launching, it could be device.streams[I % device.num_streams].
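
A minimal sketch of that lookup, assuming CudaDevice grows a pool of raw driver streams (the fields and the stream_for helper are assumptions, not the current layout):

// Hypothetical: the device owns a small, fixed pool of raw driver streams.
type CUstream = *mut std::ffi::c_void; // stand-in for the driver's stream handle

struct CudaDevice {
    streams: Vec<CUstream>,
    num_streams: usize,
}

impl CudaDevice {
    // Map the compile-time stream index onto one of the real streams.
    fn stream_for<const I: usize>(&self) -> CUstream {
        self.streams[I % self.num_streams]
    }
}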

@coreylowman
Owner Author

Interesting details here in the streams & freeing memory section:

There are no such ordering guarantees between streams, so extra care has to happen when a tensor A is allocated on one stream creator but used on another stream user. In these cases, users have to call A.record_stream(user) to let the allocator know about A’s use on user. During free, the allocator will only consider the block ready for reuse when all work that has been scheduled up to the point A became free on user is complete. This is done by recording an event on the user stream and only handing out A’s memory after that event has elapsed from the perspective of the CPU:

https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html
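
A rough Rust sketch of that record_stream bookkeeping, purely for illustration; Block, Stream, Event, and their methods are placeholders here, not cudarc (or PyTorch) APIs:

struct Stream;                                       // placeholder for a CUstream handle
struct Event;                                        // placeholder for a CUevent handle

impl Event {
    fn record_on(_user: &Stream) -> Event { Event }  // would call cuEventRecord on `user`
    fn is_complete(&self) -> bool { true }           // would call cuEventQuery
}

// A block of device memory tracked by the caching allocator.
struct Block {
    extra_uses: Vec<Stream>, // streams the tensor was used on besides its creator
}

impl Block {
    // The user calls this whenever the allocation is used on a stream other
    // than the one it was allocated on.
    fn record_stream(&mut self, user: Stream) {
        self.extra_uses.push(user);
    }

    // On free: record an event on every "user" stream, capturing all work
    // scheduled there so far. The allocator may only hand this block out
    // again once each of these events has completed from the CPU's view.
    fn events_to_wait_on(&mut self) -> Vec<Event> {
        self.extra_uses.drain(..).map(|s| Event::record_on(&s)).collect()
    }
}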

@coreylowman
Owner Author

Another thing to think about: slices of tensors on different streams. E.g. if I have a batch of data, each item in the batch could be computed on a different stream.
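
For example, a hypothetical helper that round-robins batch items over a pool of streams (Stream, Item, and process_on are placeholders, not existing API):

struct Stream;
struct Item;

// Placeholder for launching the per-item kernel on the given stream.
fn process_on(_item: &Item, _stream: &Stream) {}

fn process_batch(batch: &[Item], streams: &[Stream]) {
    for (i, item) in batch.iter().enumerate() {
        // Each batch item goes to a different stream so per-item work can overlap.
        process_on(item, &streams[i % streams.len()]);
    }
}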

@M1ngXU
Contributor

M1ngXU commented Dec 16, 2022

Why would that be advantageous? Wouldn't this cause some synchronization issues, as one would have to synchronize the device and not only a stream?

@coreylowman
Owner Author

@M1ngXU
Contributor

M1ngXU commented Jan 23, 2023

Wouldn't this also require the CudaSlices to be on the same device (in general)? So some const generic with the device ordinal is required anyway.

@coreylowman
Owner Author

Ooooh that is a great call-out. I think that could even apply to CudaSlice already? Imagine I create a slice on device 0 and then try to use that slice on a different device. I have no idea if that is even valid.

Will open a separate issue for that!

@M1ngXU
Contributor

M1ngXU commented Jan 23, 2023

I have no idea if that is even valid.

I don't think so, but there might be some virtual memory mapping that NVIDIA does 😅

@M1ngXU
Contributor

M1ngXU commented Jan 23, 2023

Also, I can't test this; maybe someone with 2 CUDA GPUs can? When adding these const generics, we could probably add const generics for streams too (basically the same, I guess).

@coreylowman
Owner Author

Okay, a different direction for this: don't use type states, as they complicate things a bit too much and are probably hard to get right.

Instead:

  1. Add a struct Stream that holds a new stream and a reference to an Arc<CudaDevice>
  2. In the impl Drop for Stream, record an event on the dropped stream and force the device's cu_stream to wait on it via cuStreamWaitEvent(self.dev.cu_stream, event, 0) (see the sketch below)
  3. Add a method to LaunchAsync that takes a Stream object as the last argument
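
A minimal sketch of steps 1 and 2, assuming a simplified CudaDevice with a cu_stream field and hand-rolled driver declarations in a local sys module; the real cudarc types, error handling, and the LaunchAsync change from step 3 are omitted:

use std::sync::Arc;

// Minimal raw driver declarations for this sketch only.
mod sys {
    pub type CUstream = *mut std::ffi::c_void;
    pub type CUevent = *mut std::ffi::c_void;
    pub type CUresult = i32;
    extern "C" {
        pub fn cuEventCreate(event: *mut CUevent, flags: u32) -> CUresult;
        pub fn cuEventRecord(event: CUevent, stream: CUstream) -> CUresult;
        pub fn cuStreamWaitEvent(stream: CUstream, event: CUevent, flags: u32) -> CUresult;
        pub fn cuEventDestroy_v2(event: CUevent) -> CUresult;
        pub fn cuStreamDestroy_v2(stream: CUstream) -> CUresult;
    }
}

// Assumed shape of the device handle for this sketch.
pub struct CudaDevice {
    cu_stream: sys::CUstream, // the device's default stream
}

pub struct Stream {
    dev: Arc<CudaDevice>,
    cu_stream: sys::CUstream, // the newly created side stream
}

impl Drop for Stream {
    fn drop(&mut self) {
        unsafe {
            // Record an event capturing all work queued on this stream so far...
            let mut event: sys::CUevent = std::ptr::null_mut();
            sys::cuEventCreate(&mut event, 0);
            sys::cuEventRecord(event, self.cu_stream);
            // ...and make the device's default stream wait on it, so anything
            // launched later on dev.cu_stream is ordered after this stream's work.
            sys::cuStreamWaitEvent(self.dev.cu_stream, event, 0);
            // Both destroys are deferred by the driver until pending work completes.
            sys::cuEventDestroy_v2(event);
            sys::cuStreamDestroy_v2(self.cu_stream);
        }
    }
}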
