Rewrite GpuFuture to avoid blocking and to use less space #214

lachlansneff · 2020-03-23T14:14:31Z

Since GpuFuture doesn't blocking wait for the mapping to resolve anymore, we need to poll the device for it to actually work. ~~I haven't added that to the hello-compute example, so it doesn't work anymore.~~

kvark

Nice!

src/backend/native_gpu_future.rs

kvark · 2020-03-23T14:24:17Z

src/backend/native_gpu_future.rs

+            None => {
+                inner.waker_or_result.replace(WakerOrResult::Result(value));
+            }
+            Some(WakerOrResult::Result(_)) => unreachable!(),


is it truly unreachable? or just unexpected?

I think it's truly unreachable. The Result case will only be set once complete is called, and since it can only be called once because it's consuming, that case will never happen.

but this is a public function of a public type, no?

Yes, but the invariants are checked here. There's no way that waker_or_result can be Some(WakerOrResult::Result(...)) before complete is called, so that case will never be reached.

lachlansneff · 2020-03-23T14:36:59Z

I think we could probably switch out the mutex with a spinlock, should be faster, smaller, and won't allocate, unlike `std::sync::Mutex.

lachlansneff · 2020-03-23T15:35:54Z

This pr is probably blocked on support for some sort of a wgpu event loop. Since the futures are actually asynchronous now, they need the device to be polled somewhere, which may be difficult with the way that wgpu is set up right now, especially if you don't already have an event loop built into your application.

aloucks · 2020-03-23T21:21:27Z

I think we could probably switch out the mutex with a spinlock, should be faster, smaller, and won't allocate, unlike `std::sync::Mutex.

I'm glad you've replaced spin lock with parking_lot. You might find these interesting if you haven't already seen them:

https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html

https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html

lachlansneff · 2020-03-23T21:33:27Z

Yeah, I found those when I was doing some research earlier.

aloucks · 2020-03-23T21:37:57Z

src/backend/native_gpu_future.rs

        };

        // polling the device should trigger the callback
-        wgn::wgpu_device_poll(device_id, true);
+        wgn::wgpu_device_poll(device_id, false);


I've been thinking about this a lot lately and these are some of the conclusions I've been arriving at:

Future::poll should not call wgpu_device_poll at all.

wgpu_device_poll should not fire the user provided callback (at least in the scope of wgpu-rs).

wgpu_device_poll should execute some kind of shim/wrapper callback that only executes Waker::wake_by_ref on the current waker associated with the future. The async runtime will then call Future::poll again which would then call the user provided callback.

wgpu_device_poll should be called externally (one way or another, be it on another thread or something cranking in the event loop). With this setup it wouldn't matter if wgpu_device_poll was called from this thread or another thread as it's only triggering the waker, not actually calling the user callback.

This might even allow the lifetime constraint on the callback to be relaxed.

I might be misunderstanding, but the user doesn't provide their own callbacks to wgpu-rs to be called by wgpu_device_poll. This is our own callback (internal to wgpu-rs) that we're passing to wgpu-native, so we could easily change it to call wake_by_ref if we'd prefer for it to do that. The user just sees futures for the mapped buffers.

(wgpu_device_poll will probably be called externally regardless, but ideally we'll add some kind of wgpu event loop to do that)

This is our own callback (internal to wgpu-rs) that we're passing to wgpu-native, so we could easily change it to call wake_by_ref if we'd prefer for it to do that.

Yes, this is what I'm saying.

@aloucks Any particular reason why using a waker internally is better than a callback?

As I said before:

With this setup it wouldn't matter if wgpu_device_poll was called from this thread or another thread as it's only triggering the waker, not actually calling the user callback.

And this (maybe):

This might even allow the lifetime constraint on the callback to be relaxed.

The callback would be executed wherever you have issued .await on the future. Where as if you run the user callback directly from device_poll it could be executed from wherever device_poll was called.

It's also how rust Futures are designed in general. The Waker exists so to notify the the runtime that the task is ready to do more work (i.e. polled again).

Consider how async socket IO works. When you make a socket "asynchronous", somewhere in the bowels of tokio/mio/node.js/etc.., that socket is registered with epoll, iocp, kqueue, etc. Also deep in the bowels, there's a thread blocking on epoll_wait. epoll_wait returns the event data that the caller can use to determine which sockets have data ready for read/write.

The "IO driver" example that I pasted below was trying to emulate how this mechanism might work when there is no "real" IO. In our case, the notification system will have to be triggered by a GPU synchronization primitive (e.g. a fence or timeline semaphore) or something more course like the triggering from queue submission - or simply spinning in a loop with some kind of artificial throttle/delay. Having the "device poller" be a future itself that's spawned into the runtime may work too (I think this is what you have in the examples).

EDIT:

Having the "device poller" be a future itself that's spawned into the runtime may work too (I think this is what you have in the examples).

I'm still not sold that this is a good idea because wgpu_device_poll calls Device::maintain which quite a bit of housecleaning. This may not be technically "blocking" in the truest sense, but it may have enough overhead to be considered potentially blocking based on the characteristics outlined in the docs for Future

An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking awhile, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly.

So, replacing the internal implementation with Waker is very doable I think, but that'd still give us the option to fire callbacks from where device_poll is called, or in response to the future. Interestingly, I think firing the future without a special callback waker would be more efficient in Rust's case.

aloucks · 2020-03-23T21:48:37Z

Here's a bare-bones work-in-progress idea that I was experimenting with to get a better understanding of how an IO driver might work.

https://gist.github.com/aloucks/b83f2101f530471e1d802305d5265264

aloucks · 2020-03-25T01:36:27Z

This might even allow the lifetime constraint on the callback to be relaxed.

I'm fairly certain that this design will in fact let us relax that lifetime restriction. Dropping 'static on the closures will be a pretty big win.

kvark · 2020-03-26T22:59:42Z

Could somebody summarize the state of a problem/solution here?

lachlansneff · 2020-03-26T23:42:36Z

I think we should stick with how this pr does it. The future is woken when the device polls that the buffer is ready to map. I think that two more functions that take callbacks instead of returning futures would be useful, but that can wait.

aloucks · 2020-03-26T23:56:51Z

Hmm, now I see that the callback was removed entirely in a previous iteration. I think this is a good path forward but the end state will need further coordination with wgpu-core. There needs to be only a single place that Device::maintain is called so the user has more control over that. It shouldn't be executed in an async context. Right now it can be triggered from wgpu_device_poll, but it's also triggered from queue submissions as well (or at least it was, I haven't checked again recently).

kvark · 2020-03-27T00:19:15Z

@aloucks can you clarify why it is a problem to poll() at submit() time?

kvark · 2020-03-27T02:23:50Z

src/backend/native_gpu_future.rs


    (
        GpuFuture {
-            inner: inner.clone(),
+            data: data.clone(),


please use Arc::clone for these cases

kvark · 2020-03-27T02:24:44Z

src/lib.rs

-        wgn::wgpu_device_poll(self.id, force_wait);
+    pub fn poll(&self, maintain: Maintain) {
+        wgn::wgpu_device_poll(self.id, match maintain {
+            Maintain::Poll => false,


What if we have Maintain in wgpu-core under [repr(C)] with defined values? bools are never nice anyway

This sounds like a good idea. I'll make a pr.

aloucks · 2020-03-27T02:51:01Z

@kvark It's primarily an issue in an async context.

I guess first let's clarify the definition of "blocking". Traditionally, we think of "blocking" as IO related. For example, let's say you want to read from disk. You issue a system call and your thread gets parked until the kernel wakes you up with data available. If you checked iostat, you'd be sitting in iowait and your thread would be consuming zero cpu.

In this context, where you're using a separate thread for every task, this definition of "blocking" is fine.

However, in an asynchronous context, this definition needs some revision. Note that "asynchronous" is not "parallel". These tasks may be all running on a single thread and require cooperative scheduling. Any task that consumes more than a fair share of CPU is potentially blocking. Device::maintain analyzes the state of all in-flight objects and is certainly not O(1). In some cases it even calls vkWaitForFences with a significantly large timeout.

Again, here is a note from the Future docs (see comments above for more context):

An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking awhile, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly.

Based on this, Device::maintain should probably not be called in an asynchronous context. It should rather be called in a thread pool or a dedicated thread so that blocking (be it true "blocking" or just "slow") does not impede the asynchronous event loop.

Furthermore, Device::maintain (which I sometimes refer to with wgpu_device_poll even though it's not 100% the same), should probably be driving the event loop from another thread - not running inside the event loop. (wgpu_device_poll should be the C interface for Device::maintain - right now they are similar but not 1:1)

Right now if I try to map a buffer asynchronously and use .await on that future, it's going to immediately deadlock because there isn't anything that can wake the waker (and we should not call Device::maintain / wgpu_device_poll from the future for the reasons described above).

To wrap things up - we shouldn't call Device::maintain from the queue submission because at that point we'd be creating extra contention on resources if maintain is being executed in parallel somewhere else.

I built a toy event loop driver here just to demonstrate the IO driver concept:

https://gist.github.com/aloucks/b83f2101f530471e1d802305d5265264

examples/hello-compute/main.rs

kvark · 2020-03-27T04:09:37Z

@aloucks thank you for describing this semantics! It's much cleaner now what you mean by blocking. We do some work in it, however I don't think your description of the work is accurate:

Device::maintain analyzes the state of all in-flight objects

This used to be the case. Now it only considers the "suspects", i.e. resources that have been released by something since the last maintain() call. Thus, the total sum of work is much less dependent on the number of times you call maintain().

In some cases it even calls vkWaitForFences with a significantly large timeout.

Polling never waits for fences, it only calls get_fence_status. No timeouts involved.

we shouldn't call Device::maintain from the queue submission because at that point we'd be creating extra contention on resources if maintain is being executed in parallel somewhere else.

I generally agree with this, but only with an assumption that there is some event loop running. Perhaps, we could tell wgpu instance/device if it needs to maintain on submit?

Remove odd patching matching Replace std::sync::Mutex with spin::Mutex in GpuFuture Reduce GpuFuture usage to one explicit allocation instead of two Fix examples to poll the device in the background when using map_read or map_write Remove device.poll from GpuFuture::poll and document future invariants Massively simplify examples Use Arc::clone(...) instead of arc.clone() Switch println to log::info

kvark · 2020-03-27T04:18:13Z

Thank you!
bors r+

bors · 2020-03-27T04:35:30Z

Build succeeded

continuous-integration/travis-ci/push

kvark reviewed Mar 23, 2020

View reviewed changes

lachlansneff marked this pull request as ready for review March 23, 2020 21:34

aloucks reviewed Mar 23, 2020

View reviewed changes

kvark approved these changes Mar 27, 2020

View reviewed changes

grovesNL reviewed Mar 27, 2020

View reviewed changes

examples/hello-compute/main.rs Outdated Show resolved Hide resolved

bors bot merged commit 6fd4bf2 into gfx-rs:master Mar 27, 2020

lachlansneff deleted the fix-future-poll branch March 27, 2020 04:53

kvark mentioned this pull request Apr 5, 2020

Compute test hangs #237

Closed

calebwin mentioned this pull request Apr 9, 2020

Important internal optimizations, potential bug calebwin/emu#39

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite GpuFuture to avoid blocking and to use less space #214

Rewrite GpuFuture to avoid blocking and to use less space #214

lachlansneff commented Mar 23, 2020 •

edited

kvark left a comment

kvark Mar 23, 2020

lachlansneff Mar 23, 2020

kvark Mar 23, 2020

lachlansneff Mar 23, 2020

lachlansneff commented Mar 23, 2020

lachlansneff commented Mar 23, 2020

aloucks commented Mar 23, 2020

lachlansneff commented Mar 23, 2020

aloucks Mar 23, 2020

grovesNL Mar 23, 2020

aloucks Mar 23, 2020

lachlansneff Mar 23, 2020

aloucks Mar 24, 2020 •

edited

lachlansneff Mar 24, 2020

aloucks commented Mar 23, 2020

aloucks commented Mar 25, 2020

kvark commented Mar 26, 2020

lachlansneff commented Mar 26, 2020

aloucks commented Mar 26, 2020

kvark commented Mar 27, 2020

kvark Mar 27, 2020

kvark Mar 27, 2020

lachlansneff Mar 27, 2020

aloucks commented Mar 27, 2020

kvark commented Mar 27, 2020

kvark commented Mar 27, 2020

bors bot commented Mar 27, 2020

Rewrite GpuFuture to avoid blocking and to use less space #214

Rewrite GpuFuture to avoid blocking and to use less space #214

Conversation

lachlansneff commented Mar 23, 2020 • edited

kvark left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lachlansneff commented Mar 23, 2020

lachlansneff commented Mar 23, 2020

aloucks commented Mar 23, 2020

lachlansneff commented Mar 23, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aloucks Mar 24, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aloucks commented Mar 23, 2020

aloucks commented Mar 25, 2020

kvark commented Mar 26, 2020

lachlansneff commented Mar 26, 2020

aloucks commented Mar 26, 2020

kvark commented Mar 27, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aloucks commented Mar 27, 2020

kvark commented Mar 27, 2020

kvark commented Mar 27, 2020

bors bot commented Mar 27, 2020

Build succeeded

lachlansneff commented Mar 23, 2020 •

edited

aloucks Mar 24, 2020 •

edited