Design reusable timeout/retry system + boot diagnostics for iOS/Android CI flakiness #39

## Problem
CI flakes frequently during iOS simulator / Android emulator boot and early command startup. Today these failures often surface as generic timeouts, which obscures the root cause and makes them hard to act on.

## Goal
Design a lightweight, reusable timeout/retry system for CLI command execution and boot paths, with clear user-facing failure reasons (especially in CI).

## Proposal

### 1) Reusable timeout/retry primitives
Add shared utilities (small surface area):
- `Deadline`: top-level command budget with `remainingMs()` propagation
- `TimeoutProfile`: per-command-family defaults (startup, operation, total)
- `RetryPolicy`: transient-only retries with exponential backoff + full jitter
- `FailureClassifier`: map errors into retryable vs terminal + normalized reason codes
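
As a rough illustration of the first two primitives, a minimal sketch might look like the following. Only the names `Deadline`, `TimeoutProfile`, and `remainingMs()` come from this proposal; the `Clock` type, the `clamp()` helper, the field names, and the profile values are assumptions made for the example.

```typescript
// Hypothetical sketch: shapes and values are assumptions, not an existing API.
type Clock = () => number;

class Deadline {
  private readonly expiresAt: number;
  private readonly now: Clock;

  // The clock is injectable so the budget can be tested without real waiting.
  constructor(budgetMs: number, now: Clock = () => Date.now()) {
    this.now = now;
    this.expiresAt = now() + budgetMs;
  }

  /** Milliseconds left in the overall command budget, never negative. */
  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  /** True once the whole budget has been spent. */
  expired(): boolean {
    return this.remainingMs() === 0;
  }

  /** Timeout for one step: its own limit, capped by what is left overall. */
  clamp(stepTimeoutMs: number): number {
    return Math.min(stepTimeoutMs, this.remainingMs());
  }
}

// Per-command-family defaults; the numbers below are placeholders only.
interface TimeoutProfile {
  startupMs: number;
  operationMs: number;
  totalMs: number;
}

const exampleBootProfile: TimeoutProfile = {
  startupMs: 60000,
  operationMs: 30000,
  totalMs: 120000,
};

// One top-level deadline per command; each step asks it for a capped timeout.
const exampleDeadline = new Deadline(exampleBootProfile.totalMs);
const exampleStartupTimeoutMs = exampleDeadline.clamp(exampleBootProfile.startupMs);
```

Passing the same `Deadline` down the call chain means nested steps cannot collectively exceed the top-level budget, even if each step's own default is generous.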

Constraints:
- keep it minimal and composable (no new framework)
- use across iOS runner, Android transport, and future command families
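
To make the retry behaviour concrete, here is one possible sketch of transient-only retries with exponential backoff and full jitter (the delay is drawn uniformly between 0 and the capped exponential ceiling). The field names on `RetryPolicy`, the `Classification` shape, and the `withRetry` and `fullJitterDelayMs` helpers are hypothetical; the classifier and remaining-budget callbacks are passed in as plain functions so the helper stays composable.

```typescript
// Hypothetical sketch: names and signatures are assumptions for illustration.
interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

interface Classification {
  retryable: boolean;
  reason: string;
}

/** Full jitter: uniform in [0, min(maxDelayMs, baseDelayMs * 2^(attempt - 1))). */
function fullJitterDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

/** Retries `op` only while the error is transient and budget remains. */
async function withRetry<T>(
  op: (attempt: number) => Promise<T>,
  policy: RetryPolicy,
  classify: (err: unknown) => Classification,
  remainingMs: () => number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op(attempt);
    } catch (err) {
      const delay = fullJitterDelayMs(attempt, policy);
      const giveUp =
        !classify(err).retryable ||       // terminal errors are never retried
        attempt >= policy.maxAttempts ||  // attempt cap reached
        delay >= remainingMs();           // not enough budget left to wait
      if (giveUp) throw err;
      await sleep(delay);
    }
  }
}
```

Checking the remaining budget before sleeping keeps a retry loop from outliving the top-level deadline, which is the main reason to thread the deadline through rather than give each retry its own timer.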

### 2) Boot-specific resilience and diagnostics
Introduce explicit boot phases + reasoned failures:
- iOS: simulator boot status checks + runner startup checks
- Android: emulator readiness checks + ADB transport readiness

Emit phase-aware failure reasons, e.g.:
- `IOS_BOOT_TIMEOUT`
- `IOS_RUNNER_CONNECT_TIMEOUT`
- `ANDROID_BOOT_TIMEOUT`
- `ADB_TRANSPORT_UNAVAILABLE`
- `CI_RESOURCE_STARVATION_SUSPECTED`

Each failure should include actionable hints (what to retry, what to inspect, likely CI cause).
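
One way to pair reason codes with hints is a small lookup keyed by boot phase, sketched below. The reason codes are the ones listed above; the `BootPhase` names, the `BootFailure` shape, the `hostLooksStarved` signal, and the hint wording are all assumptions for illustration, and how starvation would actually be detected is left open.

```typescript
// Hypothetical sketch: phase names, hint text, and the starvation signal are assumptions.
type BootPhase = "ios-boot" | "ios-runner-connect" | "android-boot" | "adb-transport";

type ReasonCode =
  | "IOS_BOOT_TIMEOUT"
  | "IOS_RUNNER_CONNECT_TIMEOUT"
  | "ANDROID_BOOT_TIMEOUT"
  | "ADB_TRANSPORT_UNAVAILABLE"
  | "CI_RESOURCE_STARVATION_SUSPECTED";

interface BootFailure {
  reason: ReasonCode;
  phase: BootPhase;
  hint: string;
}

// Default reason for a failure observed in each phase.
const REASON_BY_PHASE: Record<BootPhase, ReasonCode> = {
  "ios-boot": "IOS_BOOT_TIMEOUT",
  "ios-runner-connect": "IOS_RUNNER_CONNECT_TIMEOUT",
  "android-boot": "ANDROID_BOOT_TIMEOUT",
  "adb-transport": "ADB_TRANSPORT_UNAVAILABLE",
};

// Placeholder hints: what to retry, what to inspect, likely CI cause.
const HINT_BY_REASON: Record<ReasonCode, string> = {
  IOS_BOOT_TIMEOUT: "Simulator did not finish booting; inspect its boot status and retry the boot step.",
  IOS_RUNNER_CONNECT_TIMEOUT: "Simulator booted but the runner did not respond; inspect runner startup logs.",
  ANDROID_BOOT_TIMEOUT: "Emulator did not report ready; inspect emulator logs and retry the boot step.",
  ADB_TRANSPORT_UNAVAILABLE: "ADB could not reach the device; check that the device is listed and online.",
  CI_RESOURCE_STARVATION_SUSPECTED: "Host looks overloaded; check CI runner CPU/memory and job concurrency.",
};

/** Builds a phase-aware failure; a starvation signal overrides the phase default. */
function describeBootFailure(phase: BootPhase, hostLooksStarved: boolean): BootFailure {
  const reason = hostLooksStarved ? "CI_RESOURCE_STARVATION_SUSPECTED" : REASON_BY_PHASE[phase];
  return { reason, phase, hint: HINT_BY_REASON[reason] };
}
```

Keeping the phase on the failure even when starvation is suspected preserves where the boot stalled, which is usually the first thing needed during triage.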

### 3) Optional explicit boot commands (design discussion)
Consider adding commands (or subcommands) for deterministic preflight in CI:
- `agent-device boot --platform ios`
- `agent-device boot --platform android`

Use cases:
- warm up devices/emulators before test steps
- fail early, with clearer diagnostics than the first interactive command would give

Question to settle: expose as new public commands vs internal preflight hooks only.

### 4) Telemetry/logging
Add structured retry/timeout logs:
- attempt number
- phase (`boot`, `connect`, `execute`)
- backoff delay
- elapsed and remaining deadline
- normalized reason code

This should improve CI triage and enable future tuning from real data.
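
A structured log entry carrying the fields above could be as simple as one JSON object per line, sketched here. The `RetryLogEvent` field names, the `event` tag, and the `formatRetryLog` helper are hypothetical; the sample values are made up.

```typescript
// Hypothetical sketch: field names and the JSON-lines format are assumptions.
type LogPhase = "boot" | "connect" | "execute";

interface RetryLogEvent {
  attempt: number;
  phase: LogPhase;
  backoffMs: number;
  elapsedMs: number;
  remainingMs: number;
  reason: string;
}

/** Serializes one retry/timeout event as a single JSON line. */
function formatRetryLog(event: RetryLogEvent): string {
  return JSON.stringify({ event: "retry", ...event });
}

const sampleLine = formatRetryLog({
  attempt: 2,
  phase: "boot",
  backoffMs: 350,
  elapsedMs: 41200,
  remainingMs: 78800,
  reason: "IOS_BOOT_TIMEOUT",
});
// sampleLine is:
// {"event":"retry","attempt":2,"phase":"boot","backoffMs":350,"elapsedMs":41200,"remainingMs":78800,"reason":"IOS_BOOT_TIMEOUT"}
```

One object per line keeps the output easy to grep in CI logs and easy to aggregate later when tuning timeout profiles.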

## Acceptance criteria
- Shared timeout/retry utilities integrated into at least one iOS and one Android path
- Boot failures report normalized reason codes + actionable hints
- At least one end-to-end CI-flake scenario yields a more specific error than a generic timeout
- Docs updated with timeout profiles and failure code meanings

## Out of scope (first pass)
- full auto-recovery orchestration across all commands
- complex policy DSL/config language

