
Rewrite GpuFuture to avoid blocking and to use less space #214

Merged
merged 1 commit into gfx-rs:master from fix-future-poll on Mar 27, 2020

Conversation

lachlansneff
Contributor

@lachlansneff lachlansneff commented Mar 23, 2020

Since GpuFuture no longer does a blocking wait for the mapping to resolve, we need to poll the device for it to actually work. I haven't added that to the hello-compute example, so it doesn't work anymore.

Member

@kvark kvark left a comment

Nice!

src/backend/native_gpu_future.rs (outdated)
None => {
inner.waker_or_result.replace(WakerOrResult::Result(value));
}
Some(WakerOrResult::Result(_)) => unreachable!(),
Member

is it truly unreachable? or just unexpected?

Contributor Author

I think it's truly unreachable. The Result case is only set when complete is called, and since complete can only be called once (it consumes self), that case will never happen.

Member

but this is a public function of a public type, no?

Contributor Author

Yes, but the invariants are checked here. There's no way that waker_or_result can be Some(WakerOrResult::Result(...)) before complete is called, so that case will never be reached.
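To make the invariant concrete, here is a minimal sketch of the pattern under discussion (type and method names are illustrative, not the actual wgpu-rs internals): because complete takes self by value, it can run at most once, and it is the only code path that ever stores a Result.

use std::sync::{Arc, Mutex};
use std::task::Waker;

enum WakerOrResult<T> {
    Waker(Waker),
    Result(T),
}

struct Inner<T> {
    waker_or_result: Mutex<Option<WakerOrResult<T>>>,
}

struct Completer<T>(Arc<Inner<T>>);

impl<T> Completer<T> {
    // Takes `self` by value: the completer is consumed, so this can
    // only ever run once per future.
    fn complete(self, value: T) {
        let mut slot = self.0.waker_or_result.lock().unwrap();
        match slot.take() {
            // A waker was parked by an earlier poll; store the result and wake it.
            Some(WakerOrResult::Waker(waker)) => {
                *slot = Some(WakerOrResult::Result(value));
                waker.wake();
            }
            // Never polled yet; just store the result for the next poll.
            None => {
                *slot = Some(WakerOrResult::Result(value));
            }
            // Unreachable: only `complete` stores a Result, and it
            // consumed the only completer.
            Some(WakerOrResult::Result(_)) => unreachable!(),
        }
    }
}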

@lachlansneff
Contributor Author

I think we could probably switch out the mutex with a spinlock; it should be faster and smaller, and won't allocate, unlike `std::sync::Mutex`.

@lachlansneff
Contributor Author

This pr is probably blocked on support for some sort of a wgpu event loop. Since the futures are actually asynchronous now, they need the device to be polled somewhere, which may be difficult with the way that wgpu is set up right now, especially if you don't already have an event loop built into your application.

@aloucks
Contributor

aloucks commented Mar 23, 2020

I think we could probably switch out the mutex with a spinlock; it should be faster and smaller, and won't allocate, unlike `std::sync::Mutex`.

I'm glad you've replaced the spin lock with parking_lot. You might find these interesting if you haven't already seen them:

https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html

https://matklad.github.io/2020/01/04/mutexes-are-faster-than-spinlocks.html

@lachlansneff
Contributor Author

Yeah, I found those when I was doing some research earlier.

@lachlansneff lachlansneff marked this pull request as ready for review March 23, 2020 21:34
};

// polling the device should trigger the callback
wgn::wgpu_device_poll(device_id, true);
wgn::wgpu_device_poll(device_id, false);
Contributor

I've been thinking about this a lot lately and these are some of the conclusions I've been arriving at:

  • Future::poll should not call wgpu_device_poll at all.

  • wgpu_device_poll should not fire the user provided callback (at least in the scope of wgpu-rs).

  • wgpu_device_poll should execute some kind of shim/wrapper callback that only executes Waker::wake_by_ref on the current waker associated with the future. The async runtime will then call Future::poll again which would then call the user provided callback.

  • wgpu_device_poll should be called externally (one way or another, be it on another thread or something cranking in the event loop). With this setup it wouldn't matter if wgpu_device_poll was called from this thread or another thread as it's only triggering the waker, not actually calling the user callback.

This might even allow the lifetime constraint on the callback to be relaxed.
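For concreteness, a rough sketch of the wake-only shim described in the third bullet above (all names here are hypothetical, not the wgpu-rs implementation): Future::poll parks the current waker, and the shim that device polling runs only stores the result and wakes the task, so user code only ever executes at the .await site.

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Shared state between the future and the shim callback.
struct Shared<T> {
    result: Option<T>,
    waker: Option<Waker>,
}

struct MapFuture<T> {
    shared: Arc<Mutex<Shared<T>>>,
}

impl<T> Future for MapFuture<T> {
    type Output = T;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let mut shared = self.shared.lock().unwrap();
        match shared.result.take() {
            Some(value) => Poll::Ready(value),
            None => {
                // Park the current waker; the shim will wake it later.
                shared.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

// The shim that device polling would run: it stores the result and
// wakes the task, but never executes user code itself.
fn shim_complete<T>(shared: &Arc<Mutex<Shared<T>>>, value: T) {
    let waker = {
        let mut guard = shared.lock().unwrap();
        guard.result = Some(value);
        guard.waker.take()
    };
    if let Some(waker) = waker {
        waker.wake();
    }
}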

Collaborator

I might be misunderstanding, but the user doesn't provide their own callbacks to wgpu-rs to be called by wgpu_device_poll. This is our own callback (internal to wgpu-rs) that we're passing to wgpu-native, so we could easily change it to call wake_by_ref if we'd prefer for it to do that. The user just sees futures for the mapped buffers.

(wgpu_device_poll will probably be called externally regardless, but ideally we'll add some kind of wgpu event loop to do that)

Contributor

This is our own callback (internal to wgpu-rs) that we're passing to wgpu-native, so we could easily change it to call wake_by_ref if we'd prefer for it to do that.

Yes, this is what I'm saying.

Contributor Author

@aloucks Any particular reason why using a waker internally is better than a callback?

Contributor

@aloucks aloucks Mar 24, 2020

As I said before:

With this setup it wouldn't matter if wgpu_device_poll was called from this thread or another thread as it's only triggering the waker, not actually calling the user callback.

And this (maybe):

This might even allow the lifetime constraint on the callback to be relaxed.

The callback would be executed wherever you have issued .await on the future, whereas if you run the user callback directly from device_poll, it could be executed from wherever device_poll was called.

It's also how Rust Futures are designed in general. The Waker exists to notify the runtime that the task is ready to do more work (i.e., be polled again).

Consider how async socket IO works. When you make a socket "asynchronous", somewhere in the bowels of tokio/mio/node.js/etc.., that socket is registered with epoll, iocp, kqueue, etc. Also deep in the bowels, there's a thread blocking on epoll_wait. epoll_wait returns the event data that the caller can use to determine which sockets have data ready for read/write.

The "IO driver" example that I pasted below was trying to emulate how this mechanism might work when there is no "real" IO. In our case, the notification system will have to be triggered by a GPU synchronization primitive (e.g. a fence or timeline semaphore) or something more course like the triggering from queue submission - or simply spinning in a loop with some kind of artificial throttle/delay. Having the "device poller" be a future itself that's spawned into the runtime may work too (I think this is what you have in the examples).

EDIT:

Having the "device poller" be a future itself that's spawned into the runtime may work too (I think this is what you have in the examples).

I'm still not sold that this is a good idea, because wgpu_device_poll calls Device::maintain, which does quite a bit of housecleaning. This may not be technically "blocking" in the truest sense, but it may have enough overhead to be considered potentially blocking based on the characteristics outlined in the docs for Future:

An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking awhile, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly.
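Following that guidance, a minimal sketch of offloading the maintenance work to a dedicated thread (this assumes the Maintain-based Device::poll signature that appears later in this PR's diff; the 1 ms throttle is arbitrary, and a real driver would block on a fence or event instead):

use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Sketch only: assumes `wgpu::Device::poll(Maintain)` and an
// Arc-shared device handle.
fn spawn_device_poller(device: Arc<wgpu::Device>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        // Non-blocking maintenance pass: completed mappings run their
        // wake-only shims here, off the async executor's threads.
        device.poll(wgpu::Maintain::Poll);
        thread::sleep(Duration::from_millis(1));
    })
}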

Contributor Author

So, replacing the internal implementation with Waker is very doable, I think, but that'd still give us the option to fire callbacks from where device_poll is called, or in response to the future. Interestingly, I think waking the future without a special callback waker would be more efficient in Rust's case.

@aloucks
Contributor

aloucks commented Mar 23, 2020

Here's a bare-bones work-in-progress idea that I was experimenting with to get a better understanding of how an IO driver might work.

https://gist.github.com/aloucks/b83f2101f530471e1d802305d5265264

@aloucks
Contributor

aloucks commented Mar 25, 2020

This might even allow the lifetime constraint on the callback to be relaxed.

I'm fairly certain that this design will in fact let us relax that lifetime restriction. Dropping 'static on the closures will be a pretty big win.

@kvark
Member

kvark commented Mar 26, 2020

Could somebody summarize the state of the problem/solution here?

@lachlansneff
Contributor Author

I think we should stick with how this PR does it. The future is woken when polling the device finds that the buffer is ready to map. I think that two more functions that take callbacks instead of returning futures would be useful, but that can wait.

@aloucks
Contributor

aloucks commented Mar 26, 2020

Hmm, now I see that the callback was removed entirely in a previous iteration. I think this is a good path forward, but the end state will need further coordination with wgpu-core. There needs to be only a single place that Device::maintain is called, so the user has more control over that. It shouldn't be executed in an async context. Right now it can be triggered from wgpu_device_poll, but it's also triggered from queue submissions (or at least it was; I haven't checked again recently).

@kvark
Member

kvark commented Mar 27, 2020

@aloucks can you clarify why it is a problem to poll() at submit() time?


GpuFuture {
    inner: inner.clone(),
    data: data.clone(),
Member

please use Arc::clone for these cases
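That is, something like:

GpuFuture {
    inner: Arc::clone(&inner),
    data: Arc::clone(&data),
}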

wgn::wgpu_device_poll(self.id, force_wait);
pub fn poll(&self, maintain: Maintain) {
    wgn::wgpu_device_poll(self.id, match maintain {
        Maintain::Poll => false,
Member

What if we have Maintain in wgpu-core under #[repr(C)] with defined values? Bools are never nice anyway.

Contributor Author

This sounds like a good idea. I'll make a PR.
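For reference, the suggested wgpu-core type could look something like this (variant names and discriminant values are assumptions; only Poll appears in the diff above):

// Sketch of the proposed wgpu-core enum with defined values,
// suitable for a C interface.
#[repr(C)]
pub enum Maintain {
    Poll = 0,
    Wait = 1,
}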

@aloucks
Contributor

aloucks commented Mar 27, 2020

@kvark It's primarily an issue in an async context.

I guess first let's clarify the definition of "blocking". Traditionally, we think of "blocking" as IO related. For example, let's say you want to read from disk. You issue a system call and your thread gets parked until the kernel wakes you up with data available. If you checked iostat, you'd be sitting in iowait and your thread would be consuming zero cpu.

In this context, where you're using a separate thread for every task, this definition of "blocking" is fine.

However, in an asynchronous context, this definition needs some revision. Note that "asynchronous" is not "parallel". These tasks may be all running on a single thread and require cooperative scheduling. Any task that consumes more than a fair share of CPU is potentially blocking. Device::maintain analyzes the state of all in-flight objects and is certainly not O(1). In some cases it even calls vkWaitForFences with a significantly large timeout.

Again, here is a note from the Future docs (see comments above for more context):

An implementation of poll should strive to return quickly, and should not block. Returning quickly prevents unnecessarily clogging up threads or event loops. If it is known ahead of time that a call to poll may end up taking awhile, the work should be offloaded to a thread pool (or something similar) to ensure that poll can return quickly.

Based on this, Device::maintain should probably not be called in an asynchronous context. It should rather be called in a thread pool or a dedicated thread so that blocking (be it true "blocking" or just "slow") does not impede the asynchronous event loop.

Furthermore, Device::maintain (which I sometimes refer to as wgpu_device_poll, even though it's not 100% the same) should probably be driving the event loop from another thread - not running inside the event loop. (wgpu_device_poll should be the C interface for Device::maintain - right now they are similar but not 1:1.)

Right now if I try to map a buffer asynchronously and use .await on that future, it's going to immediately deadlock because there isn't anything that can wake the waker (and we should not call Device::maintain / wgpu_device_poll from the future for the reasons described above).

To wrap things up - we shouldn't call Device::maintain from the queue submission because at that point we'd be creating extra contention on resources if maintain is being executed in parallel somewhere else.

I built a toy event loop driver here just to demonstrate the IO driver concept:

https://gist.github.com/aloucks/b83f2101f530471e1d802305d5265264
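To illustrate the deadlock scenario concretely, a sketch against the post-PR API (the Arc-shared device and the exact map_read call are assumptions based on the examples):

// Background thread drives the device so mapped-buffer futures can resolve.
let device = std::sync::Arc::new(device);
let poller = std::sync::Arc::clone(&device);
std::thread::spawn(move || loop {
    poller.poll(wgpu::Maintain::Poll);
    std::thread::sleep(std::time::Duration::from_millis(1));
});

// Without the thread above, this await would deadlock: nothing would
// ever run the maintenance pass, so nothing would wake the future's waker.
let mapping = buffer.map_read(0, size).await;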

@kvark
Member

kvark commented Mar 27, 2020

@aloucks thank you for describing these semantics! It's much clearer now what you mean by blocking. We do some work in maintain(); however, I don't think your description of the work is accurate:

Device::maintain analyzes the state of all in-flight objects

This used to be the case. Now it only considers the "suspects", i.e. resources that have been released by something since the last maintain() call. Thus, the total sum of work is much less dependent on the number of times you call maintain().

In some cases it even calls vkWaitForFences with a significantly large timeout.

Polling never waits for fences, it only calls get_fence_status. No timeouts involved.

we shouldn't call Device::maintain from the queue submission because at that point we'd be creating extra contention on resources if maintain is being executed in parallel somewhere else.

I generally agree with this, but only with the assumption that there is some event loop running. Perhaps we could tell the wgpu instance/device if it needs to maintain on submit?

Remove odd pattern matching

Replace std::sync::Mutex with spin::Mutex in GpuFuture

Reduce GpuFuture usage to one explicit allocation instead of two

Fix examples to poll the device in the background when using map_read or map_write

Remove device.poll from GpuFuture::poll and document future invariants

Massively simplify examples

Use Arc::clone(...) instead of arc.clone()

Switch println to log::info
@kvark
Member

kvark commented Mar 27, 2020

Thank you!
bors r+

@bors
Contributor

bors bot commented Mar 27, 2020

Build succeeded

@bors bors bot merged commit 6fd4bf2 into gfx-rs:master Mar 27, 2020
@lachlansneff lachlansneff deleted the fix-future-poll branch March 27, 2020 04:53
@kvark kvark mentioned this pull request Apr 5, 2020