
Agent UI: Playwright + Strix Halo E2E suite (validate every UI feature on real AMD hardware) #883

@kovtcharov

Description

Part of #875 (Tier 3). Flagship E2E initiative.

Goal

Stand up a Strix Halo self-hosted GHA runner and a Playwright suite that drives a headed Chromium against the running Agent UI to validate every user-facing feature on every PR.

Why

The Agent UI is GAIA's flagship UX surface. Today it has zero browser automation, zero accessibility checks, and zero end-to-end validation. Strix Halo's 40-CU iGPU + large unified memory pool can run real Lemonade with the larger 4B/8B models alongside a headed browser — perfect for full-stack validation on real AMD hardware.

Scope — phased delivery

Phase 1 — runner + workflow skeleton (single PR)

  • Register Strix Halo machine as a self-hosted GHA runner with labels [self-hosted, strix-halo, gaia-ui]. Distinct from existing stx runners (Strix Point, NPU correctness focus).
  • Add a heartbeat job to .github/workflows/runner_heartbeat.yml matrix entry for strix-halo
  • Add a monitoring entry to .github/workflows/monitor_selfhosted_runners.yml so we get a Teams alert if the heartbeat is stale
  • Create .github/workflows/test_agent_ui_e2e.yml:
    on:
      pull_request:
        paths:
          - 'src/gaia/apps/webui/**'
          - 'src/gaia/ui/**'
          - 'src/gaia/cli.py'
          - 'src/gaia/agents/chat/**'     # brace expansion is not supported in paths filters
          - 'src/gaia/agents/routing/**'
          - 'src/gaia/llm/**'
          - 'tests/e2e/**'
          - '.github/workflows/test_agent_ui_e2e.yml'
      push:
        branches: [main]
    jobs:
      e2e:
        runs-on: [self-hosted, strix-halo]
        timeout-minutes: 30
        concurrency:
          group: agent-ui-e2e-${{ github.ref }}
          cancel-in-progress: true
        steps:
          - uses: actions/checkout@v4
          - run: uv venv && uv pip install -e ".[dev,e2e,ui]"
          - run: gaia init --non-interactive --model Gemma-3-4B-Instruct
          - run: cd src/gaia/apps/webui && npm install && npm run build
          - run: npx playwright install --with-deps chromium
          - run: gaia chat --ui --ui-port 4200 &  # NEVER 4001
          - run: ./scripts/wait_for_health.sh http://localhost:4200/health 60
          - run: npx playwright test
          - if: failure()
            uses: actions/upload-artifact@v4
            with:
              name: playwright-failure-${{ github.run_id }}
              path: |
                src/gaia/apps/webui/playwright-report/
                src/gaia/apps/webui/test-results/
                ~/.gaia/logs/
                lemonade.log
          - if: always()
            run: gaia kill
  • Add e2e extra to setup.py: ["playwright>=1.40", "pytest-playwright>=0.5", "axe-playwright-python"]
  • Add playwright.config.ts to src/gaia/apps/webui/ with: trace on first retry, video on failure, base URL, headed mode in CI for visual debugging
  • Add Playwright scripts to src/gaia/apps/webui/package.json:
    "test:e2e": "playwright test",
    "test:e2e:headed": "playwright test --headed",
    "test:e2e:trace": "playwright test --trace on"
  • One smoke test (tests/e2e/smoke.spec.ts): launch UI, assert window title contains "GAIA"
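The config bullet above could look roughly like this (a sketch only: the `testDir` path is an assumption about where the specs land; the port, retry cap, trace, video, and headed settings follow the plan in this issue):

```typescript
// src/gaia/apps/webui/playwright.config.ts (illustrative sketch)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: '../../../../tests/e2e',     // assumption: specs live in repo-root tests/e2e
  retries: 1,                           // hardware-flake rule: one retry, never more
  timeout: 30_000,                      // per-test; never bumped globally to paper over races
  use: {
    baseURL: 'http://localhost:4200',   // NEVER 4001
    headless: false,                    // headed in CI for visual debugging
    trace: 'on-first-retry',            // trace on first retry
    video: 'retain-on-failure',         // video on failure
  },
});
```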

Phase 1 acceptance: Strix Halo runner appears in the runner list; smoke test passes on PR.

Phase 2 — chat golden path + RAG (one PR per group)

Build groups incrementally, one PR each. Every test lives under tests/e2e/. Use the existing tests/fixtures/agent_ui/ data (sample_code.py, expenses.csv, sales_data.csv, employee_records.csv, config_with_emails.yaml, empty.txt).

Group 1 — first-run / setup

  • T-001 First launch shows setup wizard if ~/.gaia/config doesn't exist
  • T-002 Wizard installs Lemonade and pulls default model (mock long install)
  • T-003 Post-setup, chat panel renders

Group 2 — chat golden path

  • T-010 Send message → SSE stream produces tokens → final bubble appears
  • T-011 Stop generation mid-stream → UI cancels, partial response saved
  • T-012 Multi-turn (3 messages) → history preserved
  • T-013 Switch model in settings → next message uses new model
  • T-014 New session → list session → delete session
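T-010's core assertion is that the SSE stream folds into the final bubble text. The folding logic the test would check against can be sketched as follows (the `data:`-line format and `[DONE]` sentinel are assumptions about the wire format, not confirmed GAIA behavior):

```typescript
// Sketch: fold raw SSE lines into the final assistant message.
// Assumes OpenAI-style `data: {json}` events terminated by `data: [DONE]`.
export function accumulateSse(lines: string[]): string {
  let text = '';
  for (const line of lines) {
    if (!line.startsWith('data:')) continue;        // skip comments / keep-alives
    const payload = line.slice('data:'.length).trim();
    if (payload === '[DONE]') break;                // end-of-stream sentinel
    const event = JSON.parse(payload) as { token?: string };
    if (event.token) text += event.token;           // append each streamed token
  }
  return text;
}
```

In the spec, the test would intercept the stream, run this accumulation, and assert the rendered bubble text matches it.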

Group 3 — RAG / documents

  • T-020 Upload sample_code.py
  • T-021 Upload folder (expenses.csv + sales_data.csv)
  • T-022 RAG query → citations panel renders
  • T-023 Delete indexed document → next query doesn't cite it
  • T-024 Folder-index progress reporter

Phase 3 — voice, agents, MCP (one PR per group)

Group 4 — voice

  • T-030 Mic without permission → friendly error
  • T-031 With granted mic + WAV fixture → ASR transcript appears
  • T-032 TTS playback → audio plays (<audio> events fire)

Group 5 — agent switching / routing

  • T-040 Default agent is GaiaAgent; switch to CodeAgent
  • T-041 Routing — "draw me a sunset" → routes to SDAgent
  • T-042 Each registered agent shows correct tool list

Group 6 — MCP

  • T-050 List installed MCP servers
  • T-051 Add new server → appears in list
  • T-052 Disable server → tools no longer offered to agents
  • T-053 Remove server

Phase 4 — vision, settings, tunnel, error, observability (one PR per group)

Group 7 — image / vision

  • T-060 Upload image + question → VLM response cites image
  • T-061 SDAgent — "a cat" → image renders
  • T-062 Camera capture (skip cleanly if no webcam)

Group 8 — settings / persistence

  • T-070 Theme toggle persists across reload
  • T-071 Telemetry toggle persists
  • T-072 Per-agent custom system prompt persists

Group 9 — tunnel (active feature; tunnel-friendly-error.png exists in repo root)

  • T-080 Enable tunnel → public URL appears
  • T-081 Disable tunnel → URL goes away
  • T-082 Friendly-error path (the screenshot scenario)

Group 10 — error / resilience

  • T-090 Kill Lemonade mid-stream → "not reachable" surfaces; retry works after restart
  • T-091 Send during model-load → clear queue/error
  • T-092 External provider 429 → clear error UI
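These three cases boil down to mapping a transport failure onto a clear user-facing state. A sketch of that mapping (the state shape and error copy here are hypothetical, not GAIA's actual UI strings):

```typescript
// Sketch: map backend/stream failures to a user-visible error state (T-090..092).
type ChatState = { status: 'streaming' | 'error' | 'idle'; banner?: string };

export function onStreamFailure(code: number | 'ECONNREFUSED'): ChatState {
  if (code === 'ECONNREFUSED') {
    // Lemonade killed mid-stream: surface "not reachable", keep retry possible
    return { status: 'error', banner: 'Lemonade is not reachable. Retry after restarting it.' };
  }
  if (code === 429) {
    // External provider rate limit: name the cause explicitly
    return { status: 'error', banner: 'Provider rate limit hit (429). Try again shortly.' };
  }
  return { status: 'error', banner: `Stream failed (${code}).` };
}
```

The specs then assert on the banner text rather than on incidental DOM details.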

Group 11 — accessibility / visual

  • T-100 axe-core on every panel → 0 critical/serious violations
  • T-101 Lighthouse a11y score ≥ 95 on chat page
  • T-102 Snapshot chat panel light + dark; diff vs baseline (use @playwright/test's toHaveScreenshot)
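The T-100 gate reduces to filtering axe results by impact so minor/moderate findings never flake the build; a sketch assuming axe-core's standard result shape (`violations` entries with an optional `impact` field):

```typescript
// Sketch: keep only the violation impacts that should fail the build (T-100).
type AxeViolation = { id: string; impact?: 'minor' | 'moderate' | 'serious' | 'critical' };

export function blockingViolations(violations: AxeViolation[]): AxeViolation[] {
  // Only critical/serious gate the PR; minor/moderate are reported, not fatal.
  return violations.filter(v => v.impact === 'critical' || v.impact === 'serious');
}
```

In the spec, assert `blockingViolations(results.violations)` is empty on every panel.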

Group 12 — export / observability

  • T-110 Export session JSON → reimport → messages match
  • T-111 Open observability dashboard → tool-call traces render

Failure-mode rules (per CLAUDE.md "fail loudly")

  • Strix Halo runner offline → workflow FAILS (use runs-on, not if: + continue-on-error)
  • Trace upload failure → fail loudly, not continue-on-error: true
  • Hardware-flake mitigation: test.describe.configure({ retries: 1 }) only; if 2 retries fail, real bug
  • Never bump global Playwright timeout — fix the underlying race instead
  • Never use page.waitForTimeout(N) — always wait for a real signal
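The last two rules generalize to "poll a real signal with a deadline". Where Playwright's built-in expect.poll or waitForFunction doesn't fit, a helper like this sketch keeps fixed sleeps out of specs:

```typescript
// Sketch: wait for a real condition with a deadline, never a blind sleep.
export async function waitForSignal(
  check: () => Promise<boolean> | boolean,
  timeoutMs = 10_000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return;                        // the real signal fired
    await new Promise(r => setTimeout(r, intervalMs)); // short poll, bounded by deadline
  }
  throw new Error(`condition not met within ${timeoutMs}ms`); // fail loudly
}
```

A retry that still misses the deadline throws, so a race surfaces as a failure instead of being absorbed by a longer global timeout.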

CI cost

~12 groups × ~3 tests × ~10s each ≈ 6 minutes runtime; +2 min for backend warmup. With concurrency cancellation, ~8 min per push.

Caching

  • Cache Lemonade model dir keyed on model id
  • Cache ~/.cache/ms-playwright keyed on playwright package version
  • Cache node_modules keyed on package-lock.json
  • Cache uv virtualenv keyed on setup.py + pyproject.toml
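Each entry amounts to hashing the keying input into the cache key, as `${{ hashFiles(...) }}` does for actions/cache. An illustrative sketch of the derivation:

```typescript
// Sketch: derive a cache key from a label plus the SHA-256 of the keying
// file's contents, analogous to `${{ hashFiles('package-lock.json') }}`.
import { createHash } from 'node:crypto';

export function cacheKey(label: string, fileContents: string): string {
  const digest = createHash('sha256').update(fileContents).digest('hex');
  return `${label}-${digest.slice(0, 16)}`;  // short prefix keeps keys readable
}
```

Any change to the keying file (lockfile, setup.py, model id) changes the digest and therefore busts the cache.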

Acceptance criteria (per phase)

  • Phase 1: Strix Halo runner is healthy; smoke test passes; logs upload on failure
  • Phase 2: Chat golden path + RAG groups pass on every PR touching apps/webui or ui/
  • Phase 3: Voice + agent + MCP groups pass; routing test catches a deliberate mis-routing
  • Phase 4: All 12 groups pass; axe-core finds 0 critical/serious violations on rendered panels; tunnel test catches the friendly-error scenario; deliberate Lemonade kill is surfaced clearly in UI

Out of scope

  • Windows-side Playwright (could follow once Strix Halo flow is stable)
  • Headless-only fast smoke (the headed run on Strix Halo IS our integration coverage)
  • Mobile responsive testing (Agent UI is desktop-first per docs/plans/agent-ui.mdx)

Dependencies / blockers

Labels

  • consumer (Blocks consumer adoption; must ship for the v0.20.0 consumer launch window)
  • devops (DevOps/infrastructure changes)
  • domain:quality (Tests, CI/CD, security, performance, evals)
  • electron (Electron app changes)
  • p0 (high priority)
  • p1 (medium priority)
  • tests (Test changes)
  • tier-0 (Tier 0; must ship before parallel agent work begins: test/CI prerequisites)
  • track:platform (Foundation that both consumer-app and oem-pc tracks consume)
