Design reusable timeout/retry system + boot diagnostics for iOS/Android CI flakiness #39

@thymikee

Problem

CI flakes frequently during iOS simulator / Android emulator boot and early command startup. Today these failures often surface as generic timeouts, which leaves the root cause unclear and hard to act on.

Goal

Design a lightweight, reusable timeout/retry system for CLI command execution and boot paths, with clear user-facing failure reasons (especially in CI).

Proposal

1) Reusable timeout/retry primitives

Add shared utilities (small surface area):

  • Deadline: top-level command budget with remainingMs() propagation
  • TimeoutProfile: per command family defaults (startup, operation, total)
  • RetryPolicy: transient-only retries with exponential backoff + full jitter
  • FailureClassifier: map errors into retryable vs terminal + normalized reason codes

Constraints:

  • keep it minimal and composable (no new framework)
  • use across iOS runner, Android transport, and future command families
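To make the discussion concrete, here is a rough sketch of how the four primitives could fit together. Every name and signature below (`Deadline`, `stepTimeoutMs`, `backoffDelayMs`, `withRetry`, `CommandFailure`) is a placeholder for illustration, not existing code, and the clock, sleep, and random sources are injectable only so the behavior can be tested deterministically.

```typescript
// Sketch only: all names here are hypothetical, nothing below exists in the repo yet.
type Clock = () => number;

/** Top-level command budget; callees ask how much time is left. */
class Deadline {
  private readonly expiresAt: number;
  private readonly now: Clock;
  constructor(budgetMs: number, now: Clock = Date.now) {
    this.now = now;
    this.expiresAt = now() + budgetMs;
  }
  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.now());
  }
}

/** Per command family defaults (startup, operation, total), in milliseconds. */
interface TimeoutProfile {
  startupMs: number;
  operationMs: number;
  totalMs: number;
}

/** A step never gets more time than the overall deadline still allows. */
function stepTimeoutMs(profileMs: number, deadline: Deadline): number {
  return Math.min(profileMs, deadline.remainingMs());
}

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

/** Full jitter: uniform delay in [0, min(maxDelayMs, baseDelayMs * 2^(attempt - 1))). */
function backoffDelayMs(
  policy: RetryPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

/** Maps a raw error to retryable vs terminal plus a normalized reason code. */
type FailureClassifier = (error: unknown) => { retryable: boolean; reason: string };

class CommandFailure extends Error {
  readonly reason: string;
  constructor(reason: string, message: string) {
    super(message);
    this.reason = reason;
  }
}

async function withRetry<T>(
  op: (deadline: Deadline) => Promise<T>,
  deadline: Deadline,
  policy: RetryPolicy,
  classify: FailureClassifier,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
  random: () => number = Math.random,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op(deadline);
    } catch (error) {
      const { retryable, reason } = classify(error);
      const delay = backoffDelayMs(policy, attempt, random);
      // Give up on terminal errors, exhausted attempts, or when the backoff
      // would use up the remaining budget.
      const outOfBudget = delay >= deadline.remainingMs();
      if (!retryable || attempt >= policy.maxAttempts || outOfBudget) {
        throw new CommandFailure(reason, `failed after ${attempt} attempt(s): ${reason}`);
      }
      await sleep(delay);
    }
  }
}
```

Passing the `Deadline` into `op` is what lets nested calls derive their own timeouts via `stepTimeoutMs` instead of each layer hard-coding one.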

2) Boot-specific resilience and diagnostics

Introduce explicit boot phases + reasoned failures:

  • iOS: simulator boot status checks + runner startup checks
  • Android: emulator readiness checks + ADB transport readiness

Emit phase-aware failure reasons, e.g.:

  • IOS_BOOT_TIMEOUT
  • IOS_RUNNER_CONNECT_TIMEOUT
  • ANDROID_BOOT_TIMEOUT
  • ADB_TRANSPORT_UNAVAILABLE
  • CI_RESOURCE_STARVATION_SUSPECTED

Each failure should include actionable hints (what to retry, what to inspect, likely CI cause).
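One possible shape for such a failure, as a sketch: the reason codes are the ones listed above, while the `BootFailure` type, the `formatBootFailure` helper, the phase names, and the hint wording are placeholders to be replaced by whatever the real checks end up observing.

```typescript
// Reason codes come from the list above; types, helpers, and hint text are illustrative only.
type BootReason =
  | "IOS_BOOT_TIMEOUT"
  | "IOS_RUNNER_CONNECT_TIMEOUT"
  | "ANDROID_BOOT_TIMEOUT"
  | "ADB_TRANSPORT_UNAVAILABLE"
  | "CI_RESOURCE_STARVATION_SUSPECTED";

interface BootFailure {
  reason: BootReason;
  phase: string;
  elapsedMs: number;
  hint: string;
}

// Placeholder hints: what to retry, what to inspect, likely CI cause.
const HINTS: Record<BootReason, string> = {
  IOS_BOOT_TIMEOUT:
    "Simulator did not finish booting. Inspect `xcrun simctl list devices` and retry the boot.",
  IOS_RUNNER_CONNECT_TIMEOUT:
    "Simulator booted but the runner did not respond. Inspect the runner logs and retry the command.",
  ANDROID_BOOT_TIMEOUT:
    "Emulator did not report boot completion. Inspect `adb shell getprop sys.boot_completed`.",
  ADB_TRANSPORT_UNAVAILABLE:
    "ADB cannot reach the device. Inspect `adb devices` and consider restarting the ADB server.",
  CI_RESOURCE_STARVATION_SUSPECTED:
    "Boot was unusually slow. The CI host may be short on CPU or memory; consider a larger runner.",
};

function bootFailure(reason: BootReason, phase: string, elapsedMs: number): BootFailure {
  return { reason, phase, elapsedMs, hint: HINTS[reason] };
}

function formatBootFailure(f: BootFailure): string {
  const seconds = Math.round(f.elapsedMs / 1000);
  return `[${f.reason}] phase=${f.phase} elapsed=${seconds}s\n  hint: ${f.hint}`;
}
```

Keeping the hint lookup in a `Record<BootReason, string>` means the compiler flags any new reason code that ships without a hint.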

3) Optional explicit boot commands (design discussion)

Consider adding commands (or subcommands) for deterministic preflight in CI:

  • agent-device boot --platform ios
  • agent-device boot --platform android

Use cases:

  • warm up devices/emulators before test steps
  • fail early, with clearer diagnostics than the first interactive command would give

Question to settle: expose as new public commands vs internal preflight hooks only.

4) Telemetry/logging

Add structured retry/timeout logs:

  • attempt number
  • phase (boot, connect, execute)
  • backoff delay
  • elapsed and remaining deadline
  • normalized reason code

This should improve CI triage and enable future tuning from real data.
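As a starting point for the log shape, the fields above could be emitted as one JSON object per line. The `RetryLogEvent` type, its field names, and `formatRetryLog` are assumptions for illustration, not an agreed schema.

```typescript
// Hypothetical schema: field names mirror the list above but are not final.
interface RetryLogEvent {
  event: "retry" | "timeout";
  attempt: number;
  phase: "boot" | "connect" | "execute";
  backoffMs: number;
  elapsedMs: number;
  remainingMs: number;
  reason: string;
}

/** One JSON object per line, so CI logs can be filtered and aggregated later. */
function formatRetryLog(e: RetryLogEvent): string {
  return JSON.stringify(e);
}

const line = formatRetryLog({
  event: "retry",
  attempt: 2,
  phase: "connect",
  backoffMs: 350,
  elapsedMs: 12400,
  remainingMs: 47600,
  reason: "ADB_TRANSPORT_UNAVAILABLE",
});
console.log(line);
```

A flat, line-oriented format keeps the records easy to grep in raw CI output while still being machine-readable for later tuning.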

Acceptance criteria

  • Shared timeout/retry utilities integrated in at least one iOS and one Android path
  • Boot failures report normalized reason codes + actionable hints
  • At least one end-to-end CI-flake scenario yields a more specific error than a generic timeout
  • Docs updated with timeout profiles and failure code meanings

Out of scope (first pass)

  • full auto-recovery orchestration across all commands
  • complex policy DSL/config language
