Rewrite GpuFuture to avoid blocking and to use less space #214
Conversation
Nice!
src/backend/native_gpu_future.rs
Outdated
None => {
    inner.waker_or_result.replace(WakerOrResult::Result(value));
}
Some(WakerOrResult::Result(_)) => unreachable!(),
is it truly unreachable? or just unexpected?
I think it's truly unreachable. The `Result` case will only be set once `complete` is called, and since `complete` can only be called once (it's consuming), that case will never happen.
but this is a public function of a public type, no?
Yes, but the invariants are checked here. There's no way that `waker_or_result` can be `Some(WakerOrResult::Result(...))` before `complete` is called, so that case will never be reached.
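To make the invariant concrete, here is a minimal self-contained sketch of the state machine being described. The type shapes are reconstructed from the snippet above for illustration; they are not the PR's exact code (which uses `replace` and a different lock):

```rust
use std::sync::{Arc, Mutex};
use std::task::Waker;

enum WakerOrResult<T> {
    Waker(Waker),
    Result(T),
}

struct Inner<T> {
    waker_or_result: Mutex<Option<WakerOrResult<T>>>,
}

struct GpuFutureCompletion<T>(Arc<Inner<T>>);

impl<T> GpuFutureCompletion<T> {
    // Consumes `self`, so this can only ever run once per future.
    fn complete(self, value: T) {
        let mut guard = self.0.waker_or_result.lock().unwrap();
        match guard.take() {
            // A poll already registered a waker: store the result, wake the task.
            Some(WakerOrResult::Waker(waker)) => {
                *guard = Some(WakerOrResult::Result(value));
                waker.wake();
            }
            // Nothing has polled yet: just store the result.
            None => {
                *guard = Some(WakerOrResult::Result(value));
            }
            // Only `complete` writes a Result, and it consumed `self`,
            // so this arm really is unreachable.
            Some(WakerOrResult::Result(_)) => unreachable!(),
        }
    }
}
```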
I think we could probably switch out the mutex for a spinlock; it should be faster, smaller, and won't allocate, unlike `std::sync::Mutex`.

This PR is probably blocked on support for some sort of a wgpu event loop. Since the futures are actually asynchronous now, they need the device to be polled somewhere, which may be difficult with the way that wgpu is set up right now, especially if you don't already have an event loop built into your application.

I'm glad you've replaced the spin lock with `parking_lot`. You might find these interesting if you haven't already seen them: https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html

Yeah, I found those when I was doing some research earlier.
src/backend/native_gpu_future.rs
Outdated
};

    // polling the device should trigger the callback
-   wgn::wgpu_device_poll(device_id, true);
+   wgn::wgpu_device_poll(device_id, false);
I've been thinking about this a lot lately and these are some of the conclusions I've been arriving at:

- `Future::poll` should not call `wgpu_device_poll` at all.
- `wgpu_device_poll` should not fire the user-provided callback (at least in the scope of `wgpu-rs`).
- `wgpu_device_poll` should execute some kind of shim/wrapper callback that only executes `Waker::wake_by_ref` on the current waker associated with the future. The async runtime will then call `Future::poll` again, which would then call the user-provided callback.
- `wgpu_device_poll` should be called externally (one way or another, be it on another thread or something cranking in the event loop). With this setup it wouldn't matter if `wgpu_device_poll` was called from this thread or another thread, as it's only triggering the waker, not actually calling the user callback. (See the sketch after this list.)
This might even allow the lifetime constraint on the callback to be relaxed.
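A hypothetical sketch of the shim/waker design described above, under the assumption that the callback handed to `wgpu_device_poll` only wakes the task; the names (`Shared`, `MapFuture`, `shim_complete`) are made up for illustration:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Shared state between the future and the shim callback.
struct Shared<T> {
    state: Mutex<(Option<T>, Option<Waker>)>,
}

struct MapFuture<T> {
    shared: Arc<Shared<T>>,
}

impl<T> Future for MapFuture<T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let mut state = self.shared.state.lock().unwrap();
        if let Some(value) = state.0.take() {
            // The mapping finished; any user-visible work happens here,
            // on the task that issued `.await`.
            Poll::Ready(value)
        } else {
            // Not ready: remember the current waker for the shim.
            state.1 = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

// The shim that wgpu_device_poll would invoke: it only records the
// result and wakes the task, so it's safe to call from any thread.
fn shim_complete<T>(shared: &Shared<T>, value: T) {
    let mut state = shared.state.lock().unwrap();
    state.0 = Some(value);
    if let Some(waker) = state.1.take() {
        waker.wake();
    }
}
```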
I might be misunderstanding, but the user doesn't provide their own callbacks to wgpu-rs to be called by `wgpu_device_poll`. This is our own callback (internal to wgpu-rs) that we're passing to wgpu-native, so we could easily change it to call `wake_by_ref` if we'd prefer for it to do that. The user just sees futures for the mapped buffers.

(`wgpu_device_poll` will probably be called externally regardless, but ideally we'll add some kind of wgpu event loop to do that.)
> This is our own callback (internal to wgpu-rs) that we're passing to wgpu-native, so we could easily change it to call `wake_by_ref` if we'd prefer for it to do that.

Yes, this is what I'm saying.
@aloucks Any particular reason why using a waker internally is better than a callback?
As I said before:

> With this setup it wouldn't matter if `wgpu_device_poll` was called from this thread or another thread as it's only triggering the waker, not actually calling the user callback.

And this (maybe):

> This might even allow the lifetime constraint on the callback to be relaxed.

The callback would be executed wherever you have issued `.await` on the future, whereas if you run the user callback directly from `device_poll`, it could be executed from wherever `device_poll` was called.

It's also how Rust futures are designed in general. The `Waker` exists to notify the runtime that the task is ready to do more work (i.e. to be polled again).
Consider how async socket IO works. When you make a socket "asynchronous", somewhere in the bowels of tokio/mio/node.js/etc., that socket is registered with `epoll`, `iocp`, `kqueue`, etc. Also deep in the bowels, there's a thread blocking on `epoll_wait`. `epoll_wait` returns the event data that the caller can use to determine which sockets have data ready for read/write.
The "IO driver" example that I pasted below was trying to emulate how this mechanism might work when there is no "real" IO. In our case, the notification system will have to be triggered by a GPU synchronization primitive (e.g. a fence or timeline semaphore) or something more course like the triggering from queue submission - or simply spinning in a loop with some kind of artificial throttle/delay. Having the "device poller" be a future itself that's spawned into the runtime may work too (I think this is what you have in the examples).
EDIT:

> Having the "device poller" be a future itself that's spawned into the runtime may work too (I think this is what you have in the examples).

I'm still not sold that this is a good idea, because `wgpu_device_poll` calls `Device::maintain`, which does quite a bit of housecleaning. This may not be technically "blocking" in the truest sense, but it may have enough overhead to be considered potentially blocking based on the characteristics outlined in the docs for `Future`:

> An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking awhile, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly.
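One way to honor that guidance is to offload the maintain work to a dedicated thread. A sketch, where `poll_device` is a stand-in closure for the real `wgn::wgpu_device_poll` call:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Spawn a dedicated thread that cranks the device until told to stop.
fn spawn_poller(
    poll_device: impl Fn() + Send + 'static,
) -> (Arc<AtomicBool>, thread::JoinHandle<()>) {
    let running = Arc::new(AtomicBool::new(true));
    let flag = Arc::clone(&running);
    let handle = thread::spawn(move || {
        while flag.load(Ordering::Acquire) {
            // The heavy maintain/housecleaning work lands here,
            // far away from any Future::poll.
            poll_device();
            // Artificial throttle, as mentioned above.
            thread::sleep(Duration::from_millis(1));
        }
    });
    (running, handle)
}

fn main() {
    // Stand-in for something like wgn::wgpu_device_poll(device_id, false).
    let (running, poller) = spawn_poller(|| { /* device poll goes here */ });
    // ... run the application / await mapping futures ...
    running.store(false, Ordering::Release);
    poller.join().unwrap();
}
```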
So, replacing the internal implementation with `Waker` is very doable, I think, but that'd still give us the option to fire callbacks from where `device_poll` is called, or in response to the future. Interestingly, I think firing the future without a special callback waker would be more efficient in Rust's case.
Here's a bare-bones work-in-progress idea that I was experimenting with to get a better understanding of how an IO driver might work: https://gist.github.com/aloucks/b83f2101f530471e1d802305d5265264
I'm fairly certain that this design will in fact let us relax that lifetime restriction. Dropping…
Could somebody summarize the state of the problem/solution here?
I think we should stick with how this PR does it. The future is woken when the device poll finds that the buffer is ready to map. I think that two more functions that take callbacks instead of returning futures would be useful, but that can wait.
Hmm, now I see that the callback was removed entirely in a previous iteration. I think this is a good path forward, but the end state will need further coordination with…
@aloucks can you clarify why it is a problem to `poll()` at `submit()` time?
src/backend/native_gpu_future.rs
Outdated
(
    GpuFuture {
        inner: inner.clone(),
        data: data.clone(),
Please use `Arc::clone` for these cases.
-   wgn::wgpu_device_poll(self.id, force_wait);
+   pub fn poll(&self, maintain: Maintain) {
+       wgn::wgpu_device_poll(self.id, match maintain {
+           Maintain::Poll => false,
What if we have `Maintain` in wgpu-core under `#[repr(C)]` with defined values? Bools are never nice anyway.
This sounds like a good idea. I'll make a PR.
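For illustration, the suggested enum might look something like this; the discriminant values and the `Wait` variant name here are assumptions, not the final API:

```rust
// #[repr(C)] with defined values makes the variants stable across FFI,
// unlike a bare bool.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Maintain {
    Poll = 0, // check for completed work without blocking
    Wait = 1, // block until outstanding work is done
}

// The existing bool-based call could then be bridged as in the diff above:
fn force_wait(maintain: Maintain) -> bool {
    match maintain {
        Maintain::Poll => false,
        Maintain::Wait => true,
    }
}
```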
@kvark It's primarily an issue in an async context. I guess first let's clarify the definition of "blocking". Traditionally, we think of "blocking" as IO related. For example, let's say you want to read from disk. You issue a system call and your thread gets parked until the kernel wakes you up with data available. If you checked `iostat`, you'd be sitting in iowait and your thread would be consuming zero CPU. In this context, where you're using a separate thread for every task, this definition of "blocking" is fine.

However, in an asynchronous context, this definition needs some revision. Note that "asynchronous" is not "parallel": these tasks may all be running on a single thread and require cooperative scheduling. Any task that consumes more than its fair share of CPU is potentially blocking. Again, here is the note from the `Future` docs:

> An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking awhile, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly.
Based on this, `Device::maintain` (via `wgpu_device_poll`) should be considered potentially blocking. Right now, if I try to map a buffer asynchronously and await the future on an async runtime, nothing resolves it unless something keeps polling the device. To wrap things up: we shouldn't call `wgpu_device_poll` from `Future::poll`. I built a toy event loop driver here just to demonstrate the IO driver concept: https://gist.github.com/aloucks/b83f2101f530471e1d802305d5265264
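A small demonstration of that cooperative-scheduling point, assuming the `futures` crate's single-threaded `LocalPool`: a task that burns CPU inside its poll delays every other task on the executor, which is the sense in which maintain work inside `Future::poll` is "potentially blocking".

```rust
use futures::executor::LocalPool;
use futures::task::SpawnExt;
use std::time::{Duration, Instant};

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();

    // Analogous to calling Device::maintain from inside Future::poll:
    // this task does real work before yielding.
    spawner
        .spawn(async {
            let start = Instant::now();
            while start.elapsed() < Duration::from_millis(100) {
                // busy "housecleaning"
            }
            println!("heavy task done");
        })
        .unwrap();

    // This task is ready immediately, but still waits for the one above.
    spawner
        .spawn(async {
            println!("light task finally runs");
        })
        .unwrap();

    pool.run();
}
```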
@aloucks thank you for describing these semantics! It's much clearer now what you mean by blocking. We do some work in it; however, I don't think your description of the work is accurate:

This used to be the case. Now it only considers the "suspects", i.e. resources that have been released by something since the last poll.

Polling never waits for fences; it only checks their status without blocking.

I generally agree with this, but only under the assumption that there is some event loop running. Perhaps, we could tell…
- Remove odd pattern matching
- Replace `std::sync::Mutex` with `spin::Mutex` in GpuFuture
- Reduce GpuFuture usage to one explicit allocation instead of two
- Fix examples to poll the device in the background when using `map_read` or `map_write`
- Remove `device.poll` from `GpuFuture::poll` and document future invariants
- Massively simplify examples
- Use `Arc::clone(...)` instead of `arc.clone()`
- Switch `println` to `log::info`
Thank you!
Since `GpuFuture` doesn't do a blocking wait for the mapping to resolve anymore, we need to poll the device for it to actually work. I haven't added that to the `hello-compute` example, so it doesn't work anymore.