fix: iOS worker TCP connection failures (issue #79) #84

Merged
evilsocket merged 4 commits into main from copilot/verify-debug-issue-79
Apr 24, 2026

Contributor

Copilot AI commented Apr 24, 2026

Summary

Investigated and fixed the three root causes of the iOS worker TCP connection failures reported in issue #79.

Root cause analysis

The issue manifests as EHOSTUNREACH / ECONNREFUSED when the master tries to TCP-connect to the iOS worker, even though UDP discovery succeeds and nc -z (quick connect-then-close) works. Three compounding bugs were identified:

  1. No retry on setup connection – the master made exactly one TCP connect attempt; any transient iOS network condition (brief socket initialisation delay after UDP advertisement, app briefly suspended by iOS, etc.) caused immediate failure with no recovery.

  2. Context::from_args blocked the Tokio runtime – in run_zero_config_worker (and run_direct_worker), the model-weight loading call Context::from_args(args) ran synchronously inside an async fn. Even on a multi-thread runtime this ties up a runtime worker thread for the whole load and can starve the async reactor, so the pre-bound TcpListener is not polled for incoming inference reconnections in a timely manner.

  3. iOS network socket lifecycle – iOS can suspend TCP sockets when the app is briefly in the background. Info.plist did not declare UIBackgroundModes: [voip], which is meant to keep sockets alive in that state. NSBonjourServices was also missing the _cake._tcp entry, so iOS's Local Network permission did not explicitly cover the TCP inference port.

Changes

cake-core/src/cake/sharding/mod.rs

  • Retry + timeout on setup TCP connect: up to 3 attempts with 1 s / 2 s / 4 s exponential backoff and a 10 s per-attempt timeout. Each failure is logged at WARN level with attempt count.
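
The retry policy described above can be sketched roughly as follows. This is a std-only, synchronous illustration, not the code from sharding/mod.rs (which is async); the names retry_with_backoff and connect_with_retry are hypothetical. It assumes the backoff sleep happens only between attempts, so with 3 attempts the 1 s and 2 s delays are used; whether the real code also waits after the final attempt is not stated in this PR.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::Duration;

/// Runs `op` up to `max_attempts` times. After the n-th failure (1-based)
/// it sleeps `base_delay * 2^(n-1)` before trying again, and it returns
/// the last error if every attempt fails.
fn retry_with_backoff<T, F>(max_attempts: u32, base_delay: Duration, mut op: F) -> io::Result<T>
where
    F: FnMut(u32) -> io::Result<T>,
{
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Stand-in for a WARN-level log line that includes the attempt count.
                eprintln!("WARN: attempt {}/{} failed: {}", attempt, max_attempts, err);
                last_err = Some(err);
                if attempt < max_attempts {
                    thread::sleep(base_delay * 2u32.pow(attempt - 1));
                }
            }
        }
    }
    Err(last_err.unwrap_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "max_attempts must be at least 1")
    }))
}

/// Hypothetical wrapper using the values quoted in this PR:
/// 3 attempts, 1 s base delay, 10 s timeout per attempt.
fn connect_with_retry(addr: &SocketAddr) -> io::Result<TcpStream> {
    retry_with_backoff(3, Duration::from_secs(1), |_attempt| {
        TcpStream::connect_timeout(addr, Duration::from_secs(10))
    })
}
```

Keeping the retry loop generic over the operation makes the policy testable without a network: a closure that fails twice and then succeeds exercises the same path as a worker whose listener comes up late.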

cake-mobile/src/lib.rs

  • run_zero_config_worker: wrap Context::from_args with tokio::task::spawn_blocking so model loading runs on a dedicated blocking-thread pool, keeping the Tokio async reactor unblocked while weights load.
  • run_direct_worker: same spawn_blocking fix for consistency.
  • JoinError (task panic) is handled gracefully and surfaced as an error status.
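
As a rough, std-only analogy for the change above: the sketch below moves a blocking load onto its own thread and turns a panic in that thread into an ordinary error value. It uses std::thread in place of tokio::task::spawn_blocking, whose awaited handle likewise yields a Result that is Err (a JoinError) when the task panics. The Context struct, load_context_blocking and load_off_thread are hypothetical stand-ins, not the types in cake-mobile.

```rust
use std::thread;

/// Hypothetical stand-in for the loaded model context.
#[derive(Debug, PartialEq)]
struct Context {
    layers: usize,
}

/// Hypothetical stand-in for the synchronous `Context::from_args` call.
/// It panics on invalid input to simulate a crash during weight loading.
fn load_context_blocking(layers: usize) -> Result<Context, String> {
    if layers == 0 {
        panic!("no layers to load");
    }
    Ok(Context { layers })
}

/// Runs the blocking load on a separate thread and maps a panic in that
/// thread to an `Err`, instead of letting it take down the caller.
fn load_off_thread(layers: usize) -> Result<Context, String> {
    let handle = thread::spawn(move || load_context_blocking(layers));

    // In the async version the runtime keeps polling other tasks (such as
    // the TCP listener) at this point; here the caller simply waits.
    match handle.join() {
        Ok(result) => result,
        Err(payload) => {
            // A panic payload is usually a `&str` or a `String`.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            Err(format!("model loading panicked: {}", msg))
        }
    }
}
```

The key point is the same in both versions: the join result has to be checked separately from the loader's own Result, so that a panic is reported as an error status rather than silently lost.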

cake-mobile-app/iosApp/iosApp/Info.plist

  • Add UIBackgroundModes: [voip] – prevents iOS from suspending the TCP listener when the app is briefly backgrounded during inference.
  • Add _cake._tcp to NSBonjourServices – ensures the iOS Local Network permission covers the TCP inference connection in addition to the UDP discovery service.

Testing

All 826 existing tests pass (cargo test -p cake-core --lib --test unit --test protocol). No new Clippy warnings. CodeQL scan: 0 alerts.

Copilot AI and others added 2 commits April 24, 2026 11:46
…cking, UIBackgroundModes

Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/272a2117-aebc-4a10-a2ab-0348f3040921

Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
…MEOUT value

Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/272a2117-aebc-4a10-a2ab-0348f3040921

Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
@evilsocket evilsocket marked this pull request as ready for review April 24, 2026 13:51
@evilsocket evilsocket marked this pull request as draft April 24, 2026 13:51
@evilsocket evilsocket marked this pull request as ready for review April 24, 2026 13:51
@evilsocket evilsocket self-requested a review April 24, 2026 13:51
Owner

@evilsocket evilsocket left a comment

CI failed, check why and fix

…iles field on Windows

Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/4cd3dc02-5334-464d-8e1d-e9a2135e8792

Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 24, 2026

CI failed, check why and fix

Found two pre-existing CI failures in tensor_storage.rs (not touched by the iOS fix commits) — both fixed in e58cb3a:

  1. Android (Test (Android aarch64) failure): libc::posix_madvise/POSIX_MADV_SEQUENTIAL aren't defined in Android's libc. Replaced with libc::madvise/MADV_SEQUENTIAL, which are available on all Unix platforms including Android.

  2. Windows (Test (windows-latest) failure): SafetensorsStorage had a dead #[cfg(not(unix))] files: Vec<File> field that wasn't included in the struct initializers (Ok(Self { index, shards })), causing a compile error on Windows. Removed the unused field since file handles are already owned by MappedShard on non-Unix.
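
To illustrate the Windows failure, here is a minimal sketch of the struct shape after the fix. Only the names SafetensorsStorage, MappedShard, index, shards and files come from the comment above; the field types, the constructor and shard_for are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real shard type, which owns the file handle.
#[derive(Debug)]
struct MappedShard {
    path: String,
}

#[derive(Debug)]
struct SafetensorsStorage {
    index: HashMap<String, usize>,
    shards: Vec<MappedShard>,
    // Before the fix, a field like this existed on non-Unix targets only:
    //
    //     #[cfg(not(unix))]
    //     files: Vec<std::fs::File>,
    //
    // A cfg-gated field must be named in every `Self { .. }` literal under
    // the same cfg. `Ok(Self { index, shards })` compiles on Unix, where the
    // field does not exist, but fails on Windows with a missing-field error
    // (E0063).
}

impl SafetensorsStorage {
    /// Builds the storage from shard paths (assumed signature).
    fn new(paths: &[&str]) -> Result<Self, String> {
        let mut index = HashMap::new();
        let mut shards = Vec::new();
        for (i, path) in paths.iter().enumerate() {
            index.insert(path.to_string(), i);
            shards.push(MappedShard { path: path.to_string() });
        }
        // With the dead field removed, this literal is valid on every target.
        Ok(Self { index, shards })
    }

    /// Looks up the shard registered under `name`.
    fn shard_for(&self, name: &str) -> Option<&MappedShard> {
        self.index.get(name).and_then(|&i| self.shards.get(i))
    }
}
```

Because the mismatch only exists on targets where the cfg is active, a Unix-only local build cannot catch it, which is why it surfaced only in the windows-latest CI job.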

Copilot AI requested a review from evilsocket April 24, 2026 14:04

github-actions Bot commented Apr 24, 2026

📊 Code Coverage by Platform & Acceleration

| Platform | Acceleration | Line Coverage | Function Coverage | Region Coverage |
|----------|--------------|---------------|-------------------|-----------------|
| Linux | CPU (Android host) | 57.3% (12802/22333) | 55.0% (1341/2439) | 53.3% (24172/45350) |
| macOS | Metal (iOS host) | 56.8% (13153/23153) | 53.9% (1383/2564) | 52.5% (25254/48099) |
| Linux | CUDA | 54.3% (12807/23576) | 53.2% (1344/2526) | 50.5% (24210/47935) |
| Linux | Flash Attention | 54.2% (12813/23622) | 53.1% (1344/2529) | 50.4% (24220/48093) |
| Linux | ROCm | 56.5% (12790/22622) | 54.2% (1345/2482) | 52.2% (24187/46292) |
| Linux | Vulkan | 54.8% (12794/23354) | 53.4% (1344/2516) | 51.0% (24199/47446) |
| macOS | CPU | 57.3% (12808/22364) | 54.9% (1342/2443) | 53.3% (24183/45408) |
| macOS | Metal | 56.8% (13153/23153) | 53.9% (1383/2564) | 52.5% (25254/48099) |
| Linux | CPU | 57.3% (12802/22333) | 55.0% (1341/2439) | 53.3% (24172/45350) |
| Windows | CPU | 57.4% (12799/22287) | 55.0% (1338/2432) | 53.3% (24123/45280) |

Owner

@evilsocket evilsocket left a comment

CI still failing

…r.rs for Android compat

Agent-Logs-Url: https://github.com/evilsocket/cake/sessions/8c7587c9-58e7-4276-80dd-82a6b595cbef

Co-authored-by: evilsocket <86922+evilsocket@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 24, 2026

CI still failing

Fixed in 0f1eebb. The Android build was still failing in disk_expert_provider.rs (a second occurrence of the same libc::posix_madvise/POSIX_MADV_WILLNEED issue that Android's libc doesn't define). Replaced with libc::madvise/MADV_WILLNEED.

Copilot AI requested a review from evilsocket April 24, 2026 14:19
@evilsocket evilsocket merged commit be522af into main Apr 24, 2026
24 checks passed
@evilsocket evilsocket deleted the copilot/verify-debug-issue-79 branch April 24, 2026 15:11