
Agent UI: Playwright + Strix Halo E2E suite (validate every UI feature on real AMD hardware) #883

@kovtcharov

Description

Part of #875 (Tier 3). Flagship E2E initiative.

Goal

Stand up a Strix Halo self-hosted GHA runner and a Playwright suite that drives a headed Chromium against the running Agent UI to validate every user-facing feature on every PR.

Why

The Agent UI is GAIA's flagship UX surface. Today it has zero browser automation, zero accessibility checks, and zero end-to-end validation. Strix Halo's 40-CU iGPU + large unified memory pool can run real Lemonade with the larger 4B/8B models alongside a headed browser — perfect for full-stack validation on real AMD hardware.

Scope — phased delivery

Phase 1 — runner + workflow skeleton (single PR)

  • Register Strix Halo machine as a self-hosted GHA runner with labels [self-hosted, strix-halo, gaia-ui]. Distinct from existing stx runners (Strix Point, NPU correctness focus).
  • Add a heartbeat job to .github/workflows/runner_heartbeat.yml matrix entry for strix-halo
  • Add a monitoring entry to .github/workflows/monitor_selfhosted_runners.yml so we get a Teams alert if the heartbeat is stale
  • Create .github/workflows/test_agent_ui_e2e.yml:
    on:
      pull_request:
        paths:
          - 'src/gaia/apps/webui/**'
          - 'src/gaia/ui/**'
          - 'src/gaia/cli.py'
          - 'src/gaia/agents/chat/**'     # brace expansion is not supported in paths filters
          - 'src/gaia/agents/routing/**'
          - 'src/gaia/llm/**'
          - 'tests/e2e/**'
          - '.github/workflows/test_agent_ui_e2e.yml'
      push:
        branches: [main]
    jobs:
      e2e:
        runs-on: [self-hosted, strix-halo]
        timeout-minutes: 30
        concurrency:
          group: agent-ui-e2e-${{ github.ref }}
          cancel-in-progress: true
        steps:
          - uses: actions/checkout@v4
          - run: uv venv && uv pip install -e ".[dev,e2e,ui]"
          - run: gaia init --non-interactive --model Gemma-3-4B-Instruct
          - run: cd src/gaia/apps/webui && npm install && npm run build
          - run: npx playwright install --with-deps chromium
          - run: gaia chat --ui --ui-port 4200 &  # NEVER 4001
          - run: ./scripts/wait_for_health.sh http://localhost:4200/health 60
          - run: npx playwright test
          - if: failure()
            uses: actions/upload-artifact@v4
            with:
              name: playwright-failure-${{ github.run_id }}
              path: |
                src/gaia/apps/webui/playwright-report/
                src/gaia/apps/webui/test-results/
                ~/.gaia/logs/
                lemonade.log
          - if: always()
            run: gaia kill
  • Add e2e extra to setup.py: ["playwright>=1.40", "pytest-playwright>=0.5", "axe-playwright-python"]
  • Add playwright.config.ts to src/gaia/apps/webui/ with: trace on first retry, video on failure, base URL, headed mode in CI for visual debugging
  • Add Playwright scripts to src/gaia/apps/webui/package.json:
    "test:e2e": "playwright test",
    "test:e2e:headed": "playwright test --headed",
    "test:e2e:trace": "playwright test --trace on"
  • One smoke test (tests/e2e/smoke.spec.ts): launch UI, assert window title contains "GAIA"
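The config bullet above could look roughly like this (a sketch only: the `testDir` path is an assumption about where the specs land; the port, retry cap, trace, video, and headed settings follow the plan in this issue):

```typescript
// src/gaia/apps/webui/playwright.config.ts (illustrative sketch)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: '../../../../tests/e2e',     // assumption: specs live in repo-root tests/e2e
  retries: 1,                           // hardware-flake rule: one retry, never more
  timeout: 30_000,                      // per-test; never bumped globally to paper over races
  use: {
    baseURL: 'http://localhost:4200',   // NEVER 4001
    headless: false,                    // headed in CI for visual debugging
    trace: 'on-first-retry',            // trace on first retry
    video: 'retain-on-failure',         // video on failure
  },
});
```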

Phase 1 acceptance: Strix Halo runner appears in the runner list; smoke test passes on PR.

Phase 2 — chat golden path + RAG (one PR per group)

Build groups incrementally, one PR each. Every test lives under tests/e2e/. Use the existing tests/fixtures/agent_ui/ data (sample_code.py, expenses.csv, sales_data.csv, employee_records.csv, config_with_emails.yaml, empty.txt).

Group 1 — first-run / setup

  • T-001 First launch shows setup wizard if ~/.gaia/config doesn't exist
  • T-002 Wizard installs Lemonade and pulls default model (mock long install)
  • T-003 Post-setup, chat panel renders

Group 2 — chat golden path

  • T-010 Send message → SSE stream produces tokens → final bubble appears
  • T-011 Stop generation mid-stream → UI cancels, partial response saved
  • T-012 Multi-turn (3 messages) → history preserved
  • T-013 Switch model in settings → next message uses new model
  • T-014 New session → list session → delete session
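T-010's core assertion is that the SSE stream folds into the final bubble text. The folding logic the test would check against can be sketched as follows (the `data:`-line format and `[DONE]` sentinel are assumptions about the wire format, not confirmed GAIA behavior):

```typescript
// Sketch: fold raw SSE lines into the final assistant message.
// Assumes OpenAI-style `data: {json}` events terminated by `data: [DONE]`.
export function accumulateSse(lines: string[]): string {
  let text = '';
  for (const line of lines) {
    if (!line.startsWith('data:')) continue;        // skip comments / keep-alives
    const payload = line.slice('data:'.length).trim();
    if (payload === '[DONE]') break;                // end-of-stream sentinel
    const event = JSON.parse(payload) as { token?: string };
    if (event.token) text += event.token;           // append each streamed token
  }
  return text;
}
```

In the spec, the test would intercept the stream, run this accumulation, and assert the rendered bubble text matches it.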

Group 3 — RAG / documents

  • T-020 Upload sample_code.py
  • T-021 Upload folder (expenses.csv + sales_data.csv)
  • T-022 RAG query → citations panel renders
  • T-023 Delete indexed document → next query doesn't cite it
  • T-024 Folder-index progress reporter

Phase 3 — voice, agents, MCP (one PR per group)

Group 4 — voice

  • T-030 Mic without permission → friendly error
  • T-031 With granted mic + WAV fixture → ASR transcript appears
  • T-032 TTS playback → audio plays (<audio> events fire)

Group 5 — agent switching / routing

  • T-040 Default agent is GaiaAgent; switch to CodeAgent
  • T-041 Routing — "draw me a sunset" → routes to SDAgent
  • T-042 Each registered agent shows correct tool list

Group 6 — MCP

  • T-050 List installed MCP servers
  • T-051 Add new server → appears in list
  • T-052 Disable server → tools no longer offered to agents
  • T-053 Remove server

Phase 4 — vision, settings, tunnel, error, observability (one PR per group)

Group 7 — image / vision

  • T-060 Upload image + question → VLM response cites image
  • T-061 SDAgent — "a cat" → image renders
  • T-062 Camera capture (skip cleanly if no webcam)

Group 8 — settings / persistence

  • T-070 Theme toggle persists across reload
  • T-071 Telemetry toggle persists
  • T-072 Per-agent custom system prompt persists

Group 9 — tunnel (active feature; tunnel-friendly-error.png exists in repo root)

  • T-080 Enable tunnel → public URL appears
  • T-081 Disable tunnel → URL goes away
  • T-082 Friendly-error path (the screenshot scenario)

Group 10 — error / resilience

  • T-090 Kill Lemonade mid-stream → "not reachable" surfaces; retry works after restart
  • T-091 Send during model-load → clear queue/error
  • T-092 External provider 429 → clear error UI
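These three cases boil down to mapping a transport failure onto a clear user-facing state. A sketch of that mapping (the state shape and error copy here are hypothetical, not GAIA's actual UI strings):

```typescript
// Sketch: map backend/stream failures to a user-visible error state (T-090..092).
type ChatState = { status: 'streaming' | 'error' | 'idle'; banner?: string };

export function onStreamFailure(code: number | 'ECONNREFUSED'): ChatState {
  if (code === 'ECONNREFUSED') {
    // Lemonade killed mid-stream: surface "not reachable", keep retry possible
    return { status: 'error', banner: 'Lemonade is not reachable. Retry after restarting it.' };
  }
  if (code === 429) {
    // External provider rate limit: name the cause explicitly
    return { status: 'error', banner: 'Provider rate limit hit (429). Try again shortly.' };
  }
  return { status: 'error', banner: `Stream failed (${code}).` };
}
```

The specs then assert on the banner text rather than on incidental DOM details.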

Group 11 — accessibility / visual

  • T-100 axe-core on every panel → 0 critical/serious violations
  • T-101 Lighthouse a11y score ≥ 95 on chat page
  • T-102 Snapshot chat panel light + dark; diff vs baseline (use @playwright/test's toHaveScreenshot)
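The T-100 gate reduces to filtering axe results by impact so minor/moderate findings never flake the build; a sketch assuming axe-core's standard result shape (`violations` entries with an optional `impact` field):

```typescript
// Sketch: keep only the violation impacts that should fail the build (T-100).
type AxeViolation = { id: string; impact?: 'minor' | 'moderate' | 'serious' | 'critical' };

export function blockingViolations(violations: AxeViolation[]): AxeViolation[] {
  // Only critical/serious gate the PR; minor/moderate are reported, not fatal.
  return violations.filter(v => v.impact === 'critical' || v.impact === 'serious');
}
```

In the spec, assert `blockingViolations(results.violations)` is empty on every panel.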

Group 12 — export / observability

  • T-110 Export session JSON → reimport → messages match
  • T-111 Open observability dashboard → tool-call traces render

Failure-mode rules (per CLAUDE.md "fail loudly")

  • Strix Halo runner offline → workflow FAILS (use runs-on, not if: + continue-on-error)
  • Trace upload failure → fail loudly, not continue-on-error: true
  • Hardware-flake mitigation: test.describe.configure({ retries: 1 }) only; if 2 retries fail, real bug
  • Never bump global Playwright timeout — fix the underlying race instead
  • Never use page.waitForTimeout(N) — always wait for a real signal
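The last two rules generalize to "poll a real signal with a deadline". Where Playwright's built-in expect.poll or waitForFunction doesn't fit, a helper like this sketch keeps fixed sleeps out of specs:

```typescript
// Sketch: wait for a real condition with a deadline, never a blind sleep.
export async function waitForSignal(
  check: () => Promise<boolean> | boolean,
  timeoutMs = 10_000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return;                        // the real signal fired
    await new Promise(r => setTimeout(r, intervalMs)); // short poll, bounded by deadline
  }
  throw new Error(`condition not met within ${timeoutMs}ms`); // fail loudly
}
```

A retry that still misses the deadline throws, so a race surfaces as a failure instead of being absorbed by a longer global timeout.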

CI cost

~12 groups × ~3 tests × ~10s each ≈ 6 minutes runtime; +2 min for backend warmup. With concurrency cancellation, ~8 min per push.

Caching

  • Cache Lemonade model dir keyed on model id
  • Cache ~/.cache/ms-playwright keyed on playwright package version
  • Cache node_modules keyed on package-lock.json
  • Cache uv virtualenv keyed on setup.py + pyproject.toml
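Each entry amounts to hashing the keying input into the cache key, as `${{ hashFiles(...) }}` does for actions/cache. An illustrative sketch of the derivation:

```typescript
// Sketch: derive a cache key from a label plus the SHA-256 of the keying
// file's contents, analogous to `${{ hashFiles('package-lock.json') }}`.
import { createHash } from 'node:crypto';

export function cacheKey(label: string, fileContents: string): string {
  const digest = createHash('sha256').update(fileContents).digest('hex');
  return `${label}-${digest.slice(0, 16)}`;  // short prefix keeps keys readable
}
```

Any change to the keying file (lockfile, setup.py, model id) changes the digest and therefore busts the cache.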

Acceptance criteria (per phase)

  • Phase 1: Strix Halo runner is healthy; smoke test passes; logs upload on failure
  • Phase 2: Chat golden path + RAG groups pass on every PR touching apps/webui or ui/
  • Phase 3: Voice + agent + MCP groups pass; routing test catches a deliberate mis-routing
  • Phase 4: All 12 groups pass; axe-core finds 0 critical/serious violations on rendered panels; tunnel test catches the friendly-error scenario; deliberate Lemonade kill is surfaced clearly in UI

Out of scope

  • Windows-side Playwright (could follow once Strix Halo flow is stable)
  • Headless-only fast smoke (the headed run on Strix Halo IS our integration coverage)
  • Mobile responsive testing (Agent UI is desktop-first per docs/plans/agent-ui.mdx)

Dependencies / blockers

Labels

  • consumer (Blocks consumer adoption; must ship for the v0.20.0 consumer launch window)
  • devops (DevOps/infrastructure changes)
  • domain:quality (Tests, CI/CD, security, performance, evals)
  • electron (Electron app changes)
  • p0 (high priority)
  • p1 (medium priority)
  • tests (Test changes)
  • tier-0 (Tier 0; must ship before parallel agent work begins: test/CI prerequisites)
  • track:platform (Foundation that both consumer-app and oem-pc tracks consume)
