Skip to content

android: detach connect() scope so withTimeout actually unblocks the UI#8689

Merged
myleshorton merged 2 commits intogarmr/radiance-daemon-refactorfrom
fisk/vpn-start-timeout-refactor
Apr 23, 2026
Merged

android: detach connect() scope so withTimeout actually unblocks the UI#8689
myleshorton merged 2 commits intogarmr/radiance-daemon-refactorfrom
fisk/vpn-start-timeout-refactor

Conversation

@myleshorton
Copy link
Copy Markdown
Contributor

@myleshorton myleshorton commented Apr 22, 2026

Follow-up to #8688 (which was merged to main and has since landed on garmr/radiance-daemon-refactor via the main-merge, making the cherry-pick commits this PR originally carried redundant — those were dropped during rebase).

Problem

#8688's final shape wraps connect() in:

coroutineScope {
    val deferred = async(Dispatchers.IO) { connect() }
    try {
        withTimeout(VPN_START_TIMEOUT_MS) { deferred.await() }
    } catch (e: TimeoutCancellationException) {
        deferred.cancel()
        throw e
    }
}

Copilot correctly flagged that this still hangs in the exact scenario the timeout is meant to guard against. coroutineScope { } is a structured scope: when the block throws, it cancels its children and then waits for them to finish. deferred.cancel() only signals cancellation cooperatively, but connect() is a blocking JNI call with no suspension points — it ignores the signal. So coroutineScope blocks on the still-running child, the caller stays blocked, and the UI freezes just like before (Freshdesk #173507).

Fix

Switch to a detached CoroutineScope(SupervisorJob() + Dispatchers.IO). Its Job is not a child of the enclosing coroutine, so cancel() signals but doesn't join. The orphan coroutine keeps running the JNI call in the background until Go returns (or the process exits), but the caller is unblocked and runCatching.onFailure fires the ${errorCode}_timeout state for the UI.

Test plan

  • Normal VPN start: timeout never fires, connect completes, status goes to Connected
  • Simulated deadlock (e.g. time.Sleep(90 * time.Second) inside the Go /service/start handler): UI unfreezes after 60s with a *_timeout error instead of staying stuck
  • Logs show VPN operation (...) timed out after 60000ms when the timeout fires

Related

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 22, 2026 17:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Cherry-picks Android service changes to add a wall-clock timeout around Go/JNI VPN start/connect calls, aiming to prevent UI freezes when Mobile.startVPN() / Mobile.connectToServer() deadlock.

Changes:

  • Introduces a 60s VPN_START_TIMEOUT_MS ceiling for VPN start/connect operations.
  • Wraps the blocking Go/JNI call in async + withTimeout { await() } and propagates timeout failures through the existing error path.
  • Adds timeout-specific logging and errorCode suffixing (*_timeout) for clearer operator visibility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

myleshorton added a commit that referenced this pull request Apr 23, 2026
Copilot flagged on #8689 that the existing coroutineScope { ... } still
hangs in exactly the scenario this change is meant to protect against.
Structured coroutineScope cancels its children on exception but then
waits for them to complete — so when withTimeout fires, we cancel the
deferred (which the JNI call ignores, since it has no suspension
points) and then block on it finishing anyway. Net effect: the UI is
still frozen, which is the symptom we're trying to prevent.

Switch to a DETACHED CoroutineScope(SupervisorJob() + Dispatchers.IO).
Its Job is not a child of the enclosing coroutine, so cancelling it
doesn't join — the orphan coroutine keeps running the JNI call in the
background until Go returns or the process exits, but the caller is
unblocked and the runCatching.onFailure path fires the timeout error
state for the UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot flagged on #8689 that the existing coroutineScope { ... } still
hangs in exactly the scenario this change is meant to protect against.
Structured coroutineScope cancels its children on exception but then
waits for them to complete — so when withTimeout fires, we cancel the
deferred (which the JNI call ignores, since it has no suspension
points) and then block on it finishing anyway. Net effect: the UI is
still frozen, which is the symptom we're trying to prevent.

Switch to a DETACHED CoroutineScope(SupervisorJob() + Dispatchers.IO).
Its Job is not a child of the enclosing coroutine, so cancelling it
doesn't join — the orphan coroutine keeps running the JNI call in the
background until Go returns or the process exits, but the caller is
unblocked and the runCatching.onFailure path fires the timeout error
state for the UI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@myleshorton myleshorton force-pushed the fisk/vpn-start-timeout-refactor branch from 79e3b9f to 4c583e6 Compare April 23, 2026 15:11
@myleshorton myleshorton changed the title android: bound Mobile.startVPN with withTimeout (cherry-pick of #8688 to refactor) android: detach connect() scope so withTimeout actually unblocks the UI Apr 23, 2026
@myleshorton myleshorton requested a review from Copilot April 23, 2026 15:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot correctly pointed out on #8689 that the detached-scope approach
can accumulate orphan coroutines if the user retries while a previous
connect() is still stuck in JNI. Each orphan pins a Dispatchers.IO
thread; enough retries against a truly deadlocked Go side could
pressure the IO pool.

Their suggested fix (Dispatchers.IO.limitedParallelism(1)) would
serialize retries behind the orphan, turning the 2nd retry into
another 60s hang. A simple single-flight AtomicBoolean gate with fast
rejection is the cleaner mitigation:

- compareAndSet rejects concurrent attempts with IllegalStateException
  (surfaces via the existing runCatching.onFailure → error state).
- The flag clears in a try/finally inside the async block, which runs
  when the JNI call eventually returns — cancellation alone can't
  break it out, but once Go completes the finally runs and a future
  retry is admitted.
- Process death (reboot, force-stop) resets the flag naturally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@myleshorton myleshorton merged commit 0540dc9 into garmr/radiance-daemon-refactor Apr 23, 2026
@myleshorton myleshorton deleted the fisk/vpn-start-timeout-refactor branch April 23, 2026 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants