Problem
CI flakes frequently during iOS simulator / Android emulator boot and early command startup. Today failures often surface as generic timeouts, which makes root-cause unclear and hard to act on.
Goal
Design a lightweight, reusable timeout/retry system for CLI command execution and boot paths, with clear user-facing failure reasons (especially in CI).
Proposal
1) Reusable timeout/retry primitives
Add shared utilities (small surface area):
Deadline: top-level command budget with remainingMs() propagation
TimeoutProfile: per command family defaults (startup, operation, total)
RetryPolicy: transient-only retries with exponential backoff + full jitter
FailureClassifier: map errors into retryable vs terminal + normalized reason codes
Constraints:
- keep it minimal and composable (no new framework)
- use across iOS runner, Android transport, and future command families
2) Boot-specific resilience and diagnostics
Introduce explicit boot phases + reasoned failures:
- iOS: simulator boot status checks + runner startup checks
- Android: emulator readiness checks + ADB transport readiness
Emit phase-aware failure reasons, e.g.:
IOS_BOOT_TIMEOUT
IOS_RUNNER_CONNECT_TIMEOUT
ANDROID_BOOT_TIMEOUT
ADB_TRANSPORT_UNAVAILABLE
CI_RESOURCE_STARVATION_SUSPECTED
Each failure should include actionable hints (what to retry, what to inspect, likely CI cause).
3) Optional explicit boot commands (design discussion)
Consider adding commands (or subcommands) for deterministic preflight in CI:
agent-device boot --platform ios
agent-device boot --platform android
Use cases:
- warm up devices/emulators before test steps
- fail early with clearer diagnostics than first interactive command
Question to settle: expose as new public commands vs internal preflight hooks only.
4) Telemetry/logging
Add structured retry/timeout logs:
- attempt number
- phase (
boot, connect, execute)
- backoff delay
- elapsed and remaining deadline
- normalized reason code
This should improve CI triage and enable future tuning from real data.
Acceptance criteria
- Shared timeout/retry utilities integrated in at least one iOS and one Android path
- Boot failures report normalized reason codes + actionable hints
- At least one end-to-end CI-flake scenario yields a more specific error than generic timeout
- Docs updated with timeout profiles and failure code meanings
Out of scope (first pass)
- full auto-recovery orchestration across all commands
- complex policy DSL/config language
Problem
CI flakes frequently during iOS simulator / Android emulator boot and early command startup. Today failures often surface as generic timeouts, which makes root-cause unclear and hard to act on.
Goal
Design a lightweight, reusable timeout/retry system for CLI command execution and boot paths, with clear user-facing failure reasons (especially in CI).
Proposal
1) Reusable timeout/retry primitives
Add shared utilities (small surface area):
Deadline: top-level command budget withremainingMs()propagationTimeoutProfile: per command family defaults (startup, operation, total)RetryPolicy: transient-only retries with exponential backoff + full jitterFailureClassifier: map errors into retryable vs terminal + normalized reason codesConstraints:
2) Boot-specific resilience and diagnostics
Introduce explicit boot phases + reasoned failures:
Emit phase-aware failure reasons, e.g.:
IOS_BOOT_TIMEOUTIOS_RUNNER_CONNECT_TIMEOUTANDROID_BOOT_TIMEOUTADB_TRANSPORT_UNAVAILABLECI_RESOURCE_STARVATION_SUSPECTEDEach failure should include actionable hints (what to retry, what to inspect, likely CI cause).
3) Optional explicit boot commands (design discussion)
Consider adding commands (or subcommands) for deterministic preflight in CI:
agent-device boot --platform iosagent-device boot --platform androidUse cases:
Question to settle: expose as new public commands vs internal preflight hooks only.
4) Telemetry/logging
Add structured retry/timeout logs:
boot,connect,execute)This should improve CI triage and enable future tuning from real data.
Acceptance criteria
Out of scope (first pass)