Skip to content

fix(gui-client): mitigate deadlock when shutting down TUN device#8268

Merged
thomaseizinger merged 17 commits intomainfrom
fix/windows-dead-lock-shutdown
Feb 26, 2025
Merged

fix(gui-client): mitigate deadlock when shutting down TUN device#8268
thomaseizinger merged 17 commits intomainfrom
fix/windows-dead-lock-shutdown

Conversation

@thomaseizinger
Copy link
Member

@thomaseizinger thomaseizinger commented Feb 25, 2025

In #8159, we introduced a regression that could lead to a deadlock when shutting down the TUN device. Whilst we did close the channel prior to awaiting the thread to exit, we failed to notice that another instance of the sender could be alive as part of an internally stored "sending permit" with the PollSender in case another packet is queued for sending. We need to explicitly call abort_send to free that.

Judging from the comment and a prior bug, this shutdown logic has been buggy before. To further avoid this deadlock, we introduce two changes:

  • The worker threads only receive a Weak reference to the wintun::Session
  • We move all device-related state into a dedicated TunState struct that we can drop prior to joining the threads

The combination of these features means that all strong references to channels and the session are definitely dropped without having to wait for anything. To provide a clean and synchronous shutdown, we wait for at most 5s on the worker-threads. If they don't exit until then, we log a warning and exit anyway.

This should greatly reduce the risk of future bugs here because the session (and thus the WinTUN device) gets shutdown in any case and so at worst, we have a few zombie threads around.

Resolves: #8265

@vercel
Copy link

vercel bot commented Feb 25, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
firezone ✅ Ready (Inspect) Visit Preview 💬 Add feedback Feb 26, 2025 0:29am

@thomaseizinger thomaseizinger changed the title fix(gui-client): mitigate deadlock when shutting down TUN device fix(gui-client): mitigate dead-lock when shutting down TUN device Feb 25, 2025
@thomaseizinger thomaseizinger changed the title fix(gui-client): mitigate dead-lock when shutting down TUN device fix(gui-client): mitigate deadlock when shutting down TUN device Feb 25, 2025
@thomaseizinger thomaseizinger marked this pull request as ready for review February 25, 2025 22:24
@thomaseizinger
Copy link
Member Author

@jamilbk Waiting for the worker threads in another thread caused a race condition because the WinTUN device wasn't actually shut down by the time we dropped Tun, thus failing a test that tried to create multiple Tuns in a row. I've moved it back to the main thread but now everything is wrapped in another object that we can drop prior to joining the threads.

Comment on lines -28 to -36
#[tokio::test]
#[ignore = "Needs admin / sudo and Internet"]
async fn tunnel() {
let _guard = firezone_logging::test("debug");

no_packet_loops_tcp().await;
no_packet_loops_udp().await;
tunnel_drop();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved these tests into a separate test binaries. Those are executed serially by cargo as individual processes so we don't have to sequence them here manually.

@thomaseizinger thomaseizinger added this pull request to the merge queue Feb 26, 2025
Merged via the queue into main with commit 96170be Feb 26, 2025
116 checks passed
@thomaseizinger thomaseizinger deleted the fix/windows-dead-lock-shutdown branch February 26, 2025 01:01
@jamilbk
Copy link
Member

jamilbk commented Feb 26, 2025

@thomaseizinger Tested this on Windows 10. Sign out and sign in work fine, but now I can't access any Resources. Are you seeing the same thing?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR addresses a deadlock regression during the TUN device shutdown by refactoring the internal state management. Key changes include:

  • Introducing new integration tests for UDP, TCP, and tunnel drop scenarios to validate the correct behavior.
  • Refactoring the Windows-specific TUN device code by encapsulating device-related state into a dedicated TunState struct and passing only weak references into the worker threads.
  • Removing inline tests from the tun_device_manager module following their migration into separate test files.

Reviewed Changes

File Description
rust/bin-shared/tests/no_packet_loops_udp.rs Adds a UDP integration test to ensure proper packet handling outside the tunnel.
rust/bin-shared/tests/no_packet_loops_tcp.rs Adds a TCP integration test to verify external connection behavior.
rust/bin-shared/tests/tunnel_drop.rs Adds a test to verify proper cleanup of the TUN device to avoid zombie threads.
rust/bin-shared/src/tun_device_manager/windows.rs Updates shutdown logic by moving device state to TunState and using Weak references.
rust/bin-shared/src/tun_device_manager.rs Removes duplicated tests now that they are relocated to integration test files.

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

rust/bin-shared/src/tun_device_manager/windows.rs:401

  • In start_send_thread, breaking out of the loop on an allocation error now terminates the send thread immediately, whereas the previous behavior continued looping. Confirm that this change in error handling is intentional and that it does not lead to premature termination.
tracing::error!("Failed to allocate WinTUN packet: {e}"); break;

rust/bin-shared/src/tun_device_manager/windows.rs:229

  • The new approach drops the entire state to signal worker thread shutdown instead of explicitly closing channels as before. Verify that this implicit mechanism is sufficient to avoid deadlocks in all scenarios.
let _ = self.state.take(); // Drop all channel / tunnel state, allowing the worker threads to exit gracefully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GUI app hangs after signing out and back in

3 participants