Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let's make surface creation safe! #1463

Closed
pythonesque opened this issue Dec 13, 2020 · 17 comments · Fixed by #4597
Closed

Let's make surface creation safe! #1463

pythonesque opened this issue Dec 13, 2020 · 17 comments · Fixed by #4597
Labels
area: wsi Issues with swapchain management or windowing help required We need community help to make this happen. type: enhancement New feature or request

Comments

@pythonesque
Copy link
Contributor

There has been some discussion in the past over how to make create_surface safe:

gfx-rs/wgpu-rs#78

gfx-rs/wgpu-rs#281 (comment)

From these comments, my impression was that the only issue was not being able to trust the pointers provided by the type implementing https://docs.rs/raw-window-handle/0.3.3/raw_window_handle/trait.HasRawWindowHandle.html. However, this is an unsafe trait and the requirements it puts on its implementors are actually quite strong:

Safety guarantees

Users can safely assume that non-null/0 fields are valid handles, and it is up to the implementer of this trait to ensure that condition is upheld. However, It is entirely valid behavior for fields within each platform-specific RawWindowHandle variant to be null or 0, and appropriate checking should be done before the handle is used.*

Despite that qualification, implementers should still make a best-effort attempt to fill in all available fields. If an implementation doesn't, and a downstream user needs the field, it should try to derive the field from other fields the implementer does provide via whatever methods the platform provides.

The exact handles returned by raw_window_handle must remain consistent between multiple calls to raw_window_handle, and must be valid for at least the lifetime of the HasRawWindowHandle implementer.

(emphasis mine).

We can instantly see that just by taking &W where W: HasRawWindowHandle, we do not need to worry about arbitrary pointers or integers--we can rely on the guarantees of HasRawWindowHandle alone, which ensures that as long as the underlying W instance is alive:

  • The handles won't change.
  • All handles will either be valid, or null/0 (depending on platform).

Since wgpu already needs to deal with handles in a backend-sensitive way, it makes perfect sense for it to just check all the individual handles. Problem solved, right? Unfortunately, it turns out that while this is the only documented source of unsafety, it is not the only source of unsafety when using a surface created by create_surface. @cwfitzgerald gave the following code as an example, when performing rendering using a task system off the main thread:

frame = None;
loop {
    MainEventsCleared => {
        wait_for_frame(frame);

        frame = Some(render());
    }
}

When trying to close a window, this event loop would segfault on some Linux systems, but moving wait_for_frame below the frame = _ portion (at the end of the MainEventsCleaared arm) would prevent the segfault. Here, wait_for_frame waits for all commands to be submitted from the CPU to the GPU (but does not wait for the GPU). This suggests that after exiting the event loop (dropping the window), the window surface may be freed on the CPU on some platforms while there are still in-flight commands being sent to the surface from another thread; presumably, someone (the window manager or graphics driver) is able to make sure device resources are not reused until rendering is complete, but the same is not true of host resources.

The implication here is clear: in general, we can't treat a surface as valid unless the underlying window from which it was created is valid. Here, "window" doesn't necessarily have to be a window (although in practice it usually will be), it just has to be the object that owns whatever resource is referred to by the handle it passes to wgpu. Note that in a wgpu context, we only care about "real" surfaces with valid handles.

While on some platforms (e.g. Android), it seems that a window is always kept alive as long as the surface is, on others (e.g. X11 on Linux), it seems like the surface may be deallocated together with the Window (we could not inspect the actual implementation of https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCreateXlibSurfaceKHR.html so we could not verify this for certain, but either way the Vulkan documentation doesn't indicate that there's any requirement that the surface hold onto a reference to the window). So our first conclusion is that if we want to be safe, we should make sure--within Rust--that a window always outlives the surface from which it was created.

We discussed several approaches here, but the most obvious is this one:

  1. Surface is parameterized by its underlying window: Surface<W>.
  2. The safe version of create_surface (in wgpu) takes W: HasRawWindowHandle by value rather than by reference.
  3. The safe version of create_surface (in wgpu) validates any handles it uses are non-0 / non-null (depending on platform), runs all the checks and resource creation steps it currently does, and then stores W within the Surface structure.
  4. Surface<W> exposes a few methods to get back the window data: fn window(&self) -> &W, and fn into_window(self) -> W. Note that we do not provide fn window_mut(&mut self) -> &mut W, for reasons explained below.

First, safety arguments for why this makes it okay to use the window handle associated with the surface:

  • From the definition of HasRawWindowHandle, as long as the underlying W is alive, the window handles will not change after repeated calls, so we know our checks from (3) will continue to be valid.
  • From the definition of HasRawWindowHandle, as long as the underlying W is alive, the handle returned from a call will refer to a live window, so we know the window is valid as long as the instance of W is; which is at least as long as the Surface itself, provided that we don't provide any means of partially moving out of the Window.
  • And, because we don't provide &mut W access or other partial window move methods, there is no way to alter the stored window without dropping the whole Surface (this is why we can't implement deref_mut()).

These may seem like uncomfortably strong restrictions on the window used for gfx-rs. We address these concerns in two ways:

  • Technically: We propose adding several implementations of HasRawWindowHandle upstream in the raw-window-handle crate (all of which we believe are justified by the requirements on HasRawWindowHandle):
impl<'a, T: HasRawWindowHandle> HasRawWindowHandle for &'a T { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Rc<T> { /* .. */ }
impl<T: HasRawWindowHandle> HasRawWindowHandle for Arc<T> { /* .. */ }

This means that in order to share access to the Window with the rest of an application, one can simply pass Arc<W>, Rc<W>, or &W (and possibly other implementations, should they prove useful) to create_surface. Due to the nature of how windows work (all meaningful operations on them must work in the context of a separate device being able to query them through a duplicable handle), we know that the important, shared component of the window (the part that needs to be alive when the surface is) must have interior mutability of some sort if it can be mutated at all; for actually existing window APIs, this interior mutability is made safe dynamically rather than within the type system, so we don't lose anything by preventing &mut access to the window while wgpu may have references to it. In exchange, we get a powerful safety argument that relies solely on the correct implementation of HasRawWindowHandle and create_surface, rather than requiring detailed investigation of every possible window management API.

We also examine practical, large wgpu projects to see if these restrictons would be onerous. From my own work on Veloren I know that these restrictions would be no problem for us. We also looked at Bevy, Dotrix, and Iced, and manually verified that they could move to this API with minimal changes:

So we can see that our theoretical expectations translate well to practice. We can also see that in all known big examples, winit::Window is what's providing the RawWindowHandle, which makes sense since it works to abstract window creation across as many platforms as possible. As such, verifying that winit::Window's implementation of HasRawWindowHandle satisfies the trait requirements, together with correctness of wgpu's implementation of functions on Surface, will in practice be sufficient to verify the safety of real world code using this API (which is important because it means that we are not simply hiding an added unsafety burden by moving it somewhere else in the stack).

However, this is still not enough! This is because we still haven't addressed the thing you actually do with surfaces, which is use them with swap chains. Swap chains are currently produced from the combination of a surface and a compatible device. When get_next_frame() is called on a swap chain, we acquire a texture resource on the device; the resource is compatible with the target surface. If successful, we can then render to the texture resource from our device, and then finally (on dropping the swap frame) "release" the texture, and it is queued for presentation to the surface. Clearly, the safety of all of this depends crucially on both the device and surface remaining alive for the duration of the swap chain, but this is not guaranteed by the current API!

As discussed on Matrix, the swap chain API is still very much in flux, with no major players having made a firm commitment to any particular API. However, wgpu itself has been learning from how people use it (together with constraints imposed by cross-platform requirements), and the current thought is that surfaces should own the queue currently associated with swap chains; a surface will own one swap chain at a time, and (from our discussion) wgpu has no interest in surfaces that are not part of a swap chain. This takes care of one safety issue: if a surface replaces the functionality of the swap chain, then clearly the swap chain cannot outlive the surface!

Since the swap chain needs to allocate device memory, this also implies that the swap chain should maintain a strong reference to the backing Device. It is possible that given that Window is alive, the existing gfx-rs Context mechanism will be sufficient to prevent Device host memory from being deallocated, since unlike with Window we started out with ownership over the device data; however, it's possible that the current complex dependency graph around swap chains, swap chain frames, devices, surfaces, etc. could be greatly simplified by maintaining a direct Arc<Device>. Either of these mechanisms should be sufficient to ensure safety of creating a swap frame while being reasonably flexible.

To make sure that a swapchain frame is still alive, the most straightforward method would be to add a lifetime to SwapChainFrame; get_next_frame(&'a self) -> SwapChainFrame<'a> would be the new signature defined on Surface, and SwapChainFrame would maintain a reference to the Surface that instantiated it. If this is too inflexible for some purposes (although I suspect most could be made to fit into this pattern), we could define on the method as get_next_frame<S: Borrow<Self>>(surface: S) -> SwapChainFrame<S>, which would allow using Rc<Surface>, Arc<Surface>, Surface, &Surface, etc. as desired (we woul also want to provide an into_surface() and surface() method as usual); though we might not be able to get away with this safely for arbitrary Borrow<Self>, we could at least hardcode the stuff we know works. Alternately, we could again rely on reference counting / pointer following in the shared context, and avoid any explicit references (as Surface would already take care of them).


I believe this plan closes all the potential soundness holes in the current API around windows and surfaces. There may still be implementation issues, of course, but the safety argument should hold as long as HasRawWindowHandle is implemented properly.


So, to recap:

  • create_surface needs to check for null and zero handles as appropriate (in a cross-platform way).
  • The Surface will own the window W and provide &W access and into_window(self), but not &mut W.
  • Surface will own the resources needed for the actual swap chain (queue etc.) and those resources in turn should hold onto the Device.
  • Swap chain frames will hold references to the Surface (and therefore Device) that produced them when they are acquired (exact mechanism TBD).
  • From manual inspection, all known major users of wgpu can easily accommodate this API change, as long as it comes together with HasRawWindowHandle for Arc<W: AsRawWindowHandle> and related instances implemented upstream (technically I think all of them can actually own the Window, but the Arc instance makes it easier).
  • Jointly, this all makes creating a surface and acquiring a swap chain (which are the two main operations on surfaces) safe regardless of what safe code does with the window; all the safety requirements rely solely on a correct implementation of unsafe trait AsRawWindowHandle (needs to satisfy the requirements it already has).
@kvark kvark transferred this issue from gfx-rs/wgpu-rs Jun 3, 2021
@kvark kvark added area: api Issues related to API surface help required We need community help to make this happen. labels Jun 3, 2021
@kvark
Copy link
Member

kvark commented Jun 3, 2021

@pythonesque just a heads up: the issue is moved into wgpu, but we are still very interested to get this fixed!

ecton added a commit to ecton/wgpu that referenced this issue Aug 12, 2021
This clarifies that the window that a surface is created on must be kept
alive for the lifetime of the surface. This requirement and a proposal
to change it are described in gfx-rs#1463
@cwfitzgerald cwfitzgerald added area: wsi Issues with swapchain management or windowing type: enhancement New feature or request and removed area: api Issues related to API surface labels Jun 5, 2022
Patryk27 pushed a commit to Patryk27/wgpu that referenced this issue Nov 23, 2022
parasyte added a commit to parasyte/wgpu that referenced this issue Jan 27, 2023
@parasyte
Copy link
Contributor

So, I decided to see what it would take to implement the "make Surface own the Window" part of this. The experimental patch is available in parasyte@73fbb52

I had to use #![feature(trait_upcasting)] to downcast a winit::window::Window from the reference returned by Surface::window(), but it is all working. Any other ideas to address this would be nice!

The reason for using a trait object instead of generic type parameters is that the type parameters quickly started to infect almost all APIs. E.g.

pub type RequestAdapterOptions<'a> = RequestAdapterOptionsBase<&'a Surface>;
And all types which use RequestAdapterOptions and so on.

@cwfitzgerald
Copy link
Member

It's a cool proof of concept - definitely worth pursuing with a PR. Can't you emulate trait_upcasting with your own trait? Unless I misunderstand its use.

@pythonesque
Copy link
Contributor Author

pythonesque commented Feb 2, 2023

IIRC, the main reason not to use a trait object here is that some platforms have Send windows and some have !Send windows (there were probably other reasons too). Binding it up in a trait object makes a few of the APIs simpler, but makes it almost impossible to write library code that supports both kinds of platforms without making the window related stuff !Send on every platform. Considering that the window type doesn't actually infect most of the objects people are regularly using (only surface related stuff which is pretty minor), and the fact that you can just write a type alias for the dyn case that removes the generic within user code, I think this is an acceptable tradeoff (especially since most people use wgpu through frameworks that already try to handle the window / surface stuff for you). However it looks like whether windows should just all be !Send is still unresolved...

@madsmtm
Copy link
Contributor

madsmtm commented Feb 2, 2023

IIRC there's been a lot of discussion in winit about how to do this safely on Android, since owning the window is not enough for the surface to remain valid (e.g. at any point there could be a Suspended/Resumed event that invalidates the surface).

This is also noted in HasRawWindowHandle (emphasis mine)

The exact handles returned by raw_window_handle must remain consistent between multiple calls to raw_window_handle as long as not indicated otherwise by platform specific events.

Tagging @rib and @MarijnS95 and linking a few related issues:

@rib
Copy link
Contributor

rib commented Feb 2, 2023

Regarding winit integration for surface creation I also tend to think there should be a clearer definition of a "surface size" that winit could convey, instead of the "inner" size of windows which doesn't conceptually (or technically) match the size of the render target/surface across all backends.

Please see: rust-windowing/winit#2308

@MarijnS95
Copy link
Contributor

MarijnS95 commented Feb 2, 2023

(e.g. at any point there could be a Suspended/Resumed event that invalidates the surface).

Being more specific: this is only the case after returning from that callback on the Android side. Hence ndk-glue wraps these in Mutexes, and requests (by means of documentation) its users to keep that Mutex locked for as long as they (implicitly, in e.g. Vulkan Surfaces) are using it. Whenever this callback arrives, any implementation should first clean up its uses before unlocking the Mutex, allowing us to return back to the Android system.

Not sure, but I think @rib implements a very similar thing (albeit with clearer API). I think the goal was for winit to own this surface via its Window, and actually close/destroy the window in this scenario instead of pretending an Android app always has a single(ton) window/screen (and pave the way to multi Activity instances straight away).

@rib
Copy link
Contributor

rib commented Feb 2, 2023

I think it would also be good to think through the details of how to support frame pacing effectively, since there can be a fiddly separation of responsibilities across the driver, window system integration and application, and they need to share/coordinate timing information with respect to surfaces and displays. (and so the questions around who owns what exactly may affect this)

E.g. see frame pacing library overview for Android here: https://developer.android.com/games/sdk/frame-pacing/

@parasyte
Copy link
Contributor

parasyte commented Feb 2, 2023

After reflecting on this more, it really feels more intuitive for the surface to be generic over the lifetime of ... something. Apparently not the RawWindowHandle implementer, due to comments above wrt Android.

The one major drawback I am aware of (and I haven't tried this anyway) is that it requires window T to be Sync because we will need &'surf T: Send to send shared references to the event loop with the surface bound to 'surf. Another way to say this is that we could hypothetically exploit the fact that shared references are trivially copyable. It will avoid taking ownership over the window to make surface creation safe without resorting to Arc or another smart pointer.

The Sync bound is still a blocker for !Send windows, as described in #1463 (comment). I believe winit's web platform is an example, but also note that we offer constructors just for this platform: create_surface_from_canvas and create_surface_from_offscreen_canvas. Does that mean 'surf: 'static for web targets? It should be safe, even if the canvas is removed from the DOM (because garbage collector). If that is true, then maybe that just leaves Android as the unknown if we're only considering winit for simplicity.

@pythonesque
Copy link
Contributor Author

The (newly added) caveat in HasRawWindowHandle makes it a functionally useless unsafe trait. HasRawWindowHandle needs to be able to synchronize with any handles that are currently accessing it. If refcounting on Android can't be made to work by itself, and this means the interface has to be altered to (for example) only allow accessing the raw window handle in a closure that performs synchronization on platforms where that's required (presumably in this case we need to add a lifetime to the raw window handle to prevent it from escaping), so be it... with the surface owning the window there's probably less reason why you need to store the raw pointer for an extended period of time, anyway.

@pythonesque
Copy link
Contributor Author

pythonesque commented Feb 3, 2023

(I believe that the surface owning (a reference to) the window is still the only sane way to make this API usable without essentially giving up the ability for the user to choose when to present entirely, and instead having the trait work "the other way around", where winit asks for an application like wgpu to provide hooks to access the currently queued presentable stuff. Which could totally work--it's a bit more like the browser, arguably--but it would be a dramatic API change).

(Also if it worked "the other way around" you run into a related problem where the part of the wgpu context that records queued commands to surfaces needs to be shared-owned by winit... which isn't really that bad, mostly just pointing out that the problem doesn't go away. This also creates a bit more friction for someone who wants to build their own wgpu-like layer because now they need to be the ones implementing the unsafe trait, but that could probably be resolved by having some sort of fully safe reusable library for queueing up commands to surfaces that implement such a trait, since that part of the implementation isn't really platform-specific. Sorry if this is a bit rambly).

@MarijnS95
Copy link
Contributor

If refcounting on Android can't be made to work by itself

@pythonesque Can you clarify?

@pythonesque
Copy link
Contributor Author

MarijnS95

When I was last working on this, people still weren't 100% sure whether if you explicitly incremented / decremented a reference count to the Android surface, all the objects were still made invalid on those lifecycle events. If they weren't then you wouldn't necessarily need to synchronize with the events for safety.

@MarijnS95
Copy link
Contributor

@pythonesque if you increment the refcount the Surface will stay alive, but it's not very useful as the window is gone (will not be presented to the screen) and Android will give you a new (afaik) hardware buffer when the window is shown again. The docs seem to recommend against doing this too.

Indeed, Surface drop has to be synchronized with Android's onNativeWindowDestroyed callback, or at the very least Surface liveliness has to prevent returning from that Android callback.

@pythonesque
Copy link
Contributor Author

Indeed it is probably wrong, but it sounds like it is memory safe, right? For the purposes of the current discussion, that should be sufficient to make the API safe everywhere. That seems like a worthwhile motivation in itself. Moreover, it would allow winit to pass events on to the user to notify them that they needed to refresh their surfaces. Not that I'm saying this situation is ideal or anything, just that I think anything that works better on Android will probably require breaking changes to the AsRawWindowHandle API (or in general require a different control flow path) so getting a safety win seems like a good idea in the meantime.

@MarijnS95
Copy link
Contributor

I described this more clearly in rust-windowing/raw-window-handle#84 (comment): if anything this is only a representation of the Android version that I tested this on and exempt from vendor changes. The documentation specifically mentions to not touch the hardware buffer after (returning from) that onNativeWindowDestroyed callback.

@pythonesque
Copy link
Contributor Author

pythonesque commented Feb 3, 2023

Right. Well, if we don't want to rely on vendor specific workarounds to support Android (but are there any vendors that actually break this?), I think the simplest thing to do would be to replace AsRawWindowHandle with a method that takes a closure and provides you the raw window handle for use only within the closure (there are a lot of ways to do that). The closure would just be responsible for making sure to acquire / release a lock on Android. Window ownership by the surface would still be needed because that's still the safety requirement needed for the window handle to be valid; Android just has a way of invalidating the window that doesn't exist on other systems.

Like I alluded to earlier, the alternative would be to totally ditch AsRawWindowHandle and instead reverse control flow so that the window manager is exclusively responsible for both processing lifecycle events and drawing to the surface, where it gets communicated with asynchronously by libraries like wgpu leaving work for it to do. But that seems like a much larger / more ambitious potential change that would need a lot of evaluation, whereas "acquire/release in a closure on Android" (and do nothing special in the closure on other platforms) is very straightforward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: wsi Issues with swapchain management or windowing help required We need community help to make this happen. type: enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants