Stand up a Strix Halo self-hosted GHA runner and a Playwright suite that drives a headed Chromium against the running Agent UI to validate every user-facing feature on every PR.
Why
The Agent UI is GAIA's flagship UX surface. Today it has zero browser automation, zero accessibility checks, and zero end-to-end validation. Strix Halo's 40-CU iGPU + large unified memory pool can run real Lemonade with the larger 4B/8B models alongside a headed browser — perfect for full-stack validation on real AMD hardware.
Scope — phased delivery
Phase 1 — runner + workflow skeleton (single PR)
Register the Strix Halo machine as a self-hosted GHA runner with labels [self-hosted, strix-halo, gaia-ui]. Distinct from the existing stx runners (Strix Point, NPU correctness focus).
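Registration could follow the standard self-hosted runner flow; a sketch, where the repo URL and token are placeholders and the runner name is hypothetical:

```shell
# Download/extract the actions-runner package first (per GitHub's runner docs),
# then configure with the labels above. URL and token are placeholders.
./config.sh \
  --url https://github.com/<org>/<repo> \
  --token <REGISTRATION_TOKEN> \
  --name strix-halo-01 \
  --labels self-hosted,strix-halo,gaia-ui \
  --unattended
# Install and start as a service so the runner survives reboots:
sudo ./svc.sh install && sudo ./svc.sh start
```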
Add a strix-halo matrix entry to the heartbeat job in .github/workflows/runner_heartbeat.yml
Add a monitoring entry to .github/workflows/monitor_selfhosted_runners.yml so we get a Teams alert if the heartbeat goes stale
Add a new workflow, .github/workflows/test_agent_ui_e2e.yml, to run the Playwright suite
Add an e2e extra to setup.py: ["playwright>=1.40", "pytest-playwright>=0.5", "axe-playwright-python"]
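Locally, the extra would be exercised with something like this (editable install from the repo root assumed):

```shell
pip install -e ".[e2e]"
python -m playwright install chromium   # download the Chromium build Playwright drives
```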
Add playwright.config.ts to src/gaia/apps/webui/ with: trace on first retry, video on failure, base URL, headed mode in CI for visual debugging
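A minimal sketch of that config; the port, testDir, and env var name are assumptions, not confirmed repo values:

```typescript
// playwright.config.ts — sketch only; baseURL and testDir are assumptions.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests/e2e',
  retries: 1, // matches the flake policy: one retry, then it's a real bug
  use: {
    baseURL: process.env.GAIA_UI_URL ?? 'http://localhost:3000',
    trace: 'on-first-retry',
    video: 'retain-on-failure',
    headless: false, // headed in CI for visual debugging
  },
});
```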
Add Playwright scripts to src/gaia/apps/webui/package.json:
"test:e2e": "playwright test",
"test:e2e:headed": "playwright test --headed",
"test:e2e:trace": "playwright test --trace on"
One smoke test (tests/e2e/smoke.spec.ts): launch UI, assert window title contains "GAIA"
Phase 1 acceptance: Strix Halo runner appears in the runner list; smoke test passes on PR.
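The Phase 1 smoke test could be as small as the following (assumes baseURL is set in the Playwright config):

```typescript
// tests/e2e/smoke.spec.ts — launch the UI and check the window title.
import { test, expect } from '@playwright/test';

test('smoke: window title contains GAIA', async ({ page }) => {
  await page.goto('/'); // resolved against baseURL from playwright.config.ts
  await expect(page).toHaveTitle(/GAIA/);
});
```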
Phase 2 — chat golden path + RAG (one PR per group)
Build groups incrementally, one PR each. Every test lives under tests/e2e/. Use the existing tests/fixtures/agent_ui/ data (sample_code.py, expenses.csv, sales_data.csv, employee_records.csv, config_with_emails.yaml, empty.txt).
Group 1 — first-run / setup
T-001 First launch shows setup wizard if ~/.gaia/config doesn't exist
T-002 Wizard installs Lemonade and pulls default model (mock the long install)
Group 2 — chat golden path
Group 3 — RAG / documents (uses sample_code.py, expenses.csv + sales_data.csv)
Phase 3 — voice, agents, MCP (one PR per group)
Group 4 — voice
Group 5 — agent switching / routing
Group 6 — MCP
Phase 4 — vision, settings, tunnel, error, observability (one PR per group)
Group 7 — image / vision
Group 8 — settings / persistence
Group 9 — tunnel (active feature; tunnel-friendly-error.png exists in repo root)
Group 10 — error / resilience
Group 11 — accessibility / visual (screenshot comparison via @playwright/test's toHaveScreenshot)
Group 12 — export / observability
Failure-mode rules (per CLAUDE.md "fail loudly")
Trace upload failure → fail loudly, not continue-on-error: true
Hardware-flake mitigation: test.describe.configure({ retries: 1 }) only; if the retry also fails, treat it as a real bug
Never bump global Playwright timeout — fix the underlying race instead
Never use page.waitForTimeout(N) — always wait for a real signal
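As an example of waiting on a real signal instead of a sleep (the test IDs here are hypothetical, not confirmed Agent UI selectors):

```typescript
// BAD: await page.waitForTimeout(3000);
// GOOD: wait for the observable effect of the action.
await page.getByTestId('chat-input').fill('hello');
await page.getByTestId('send-button').click();
// The assertion auto-retries until the reply renders or the test times out.
await expect(page.getByTestId('assistant-message').last()).toContainText(/\S/);
```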
CI cost
~12 groups × ~3 tests × ~10s each ≈ 6 minutes runtime; +2 min for backend warmup. With concurrency cancellation, ~8 min per push.
Caching
Cache Lemonade model dir keyed on model id
Cache ~/.cache/ms-playwright keyed on playwright package version
Cache node_modules keyed on package-lock.json
Cache uv virtualenv keyed on setup.py + pyproject.toml
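In the workflow, those caches might look like the following sketch; the paths and key expressions are assumptions (the Playwright cache key uses the lockfile hash as a stand-in for the package version):

```yaml
- name: Cache Playwright browsers
  uses: actions/cache@v4
  with:
    path: ~/.cache/ms-playwright
    key: pw-${{ hashFiles('src/gaia/apps/webui/package-lock.json') }}
- name: Cache node_modules
  uses: actions/cache@v4
  with:
    path: src/gaia/apps/webui/node_modules
    key: node-${{ hashFiles('src/gaia/apps/webui/package-lock.json') }}
- name: Cache uv virtualenv
  uses: actions/cache@v4
  with:
    path: .venv
    key: venv-${{ hashFiles('setup.py', 'pyproject.toml') }}
```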
Acceptance criteria (per phase)
Phase 1: Strix Halo runner is healthy; smoke test passes; logs upload on failure
Phase 2: Chat golden path + RAG groups pass on every PR touching apps/webui or ui/
Phase 3: Voice + agent + MCP groups pass; routing test catches a deliberate mis-routing
Phase 4: All 12 groups pass; axe-core finds 0 critical/serious violations on rendered panels; tunnel test catches the friendly-error scenario; deliberate Lemonade kill is surfaced clearly in UI
Out of scope
Windows-side Playwright (could follow once Strix Halo flow is stable)
Headless-only fast smoke (the headed run on Strix Halo IS our integration coverage)
Mobile responsive testing (Agent UI is desktop-first per docs/plans/agent-ui.mdx)
Dependencies / blockers
Need physical access to the Strix Halo machine to register the runner (assignee task)
Part of #875 (Tier 3). Flagship E2E initiative.