flightdeckdev · Gsbreddy · May 5, 2026 · May 4, 2026 · May 4, 2026 · May 4, 2026
diff --git a/docs/ui-roadmap.md b/docs/ui-roadmap.md
@@ -0,0 +1,148 @@
+# FlightDeck Serve UI roadmap
+
+This document turns the strict UI review into sequenced work. Scope is the checked-in React app under `web/` (served from `flightdeck serve`). Goal: move from **ledger viewer** to a **release-centric control plane** without changing core product boundaries (CLI-first, local ledger).
+
+## Principles
+
+1. **One spine**: a release under judgment → diff verdict → promotion action (evidence on demand).
+2. **URL is state**: deep links prefill forms so operators can share “this comparison” and “this promotion.”
+3. **Verdict before detail**: policy outcome and blockers must dominate; tables and JSON stay secondary.
+4. **Boring over flashy**: prefer clear hierarchy and high-contrast failure states over decorative chrome.
+
+## Phase 0 — Done in repo (this slice)
+
+| Item | Outcome |
+|------|---------|
+| Release-centric hero | Overview highlights a focused release when `?release=` is set; row shortcuts jump to Diff / Runs / Promote with params. |
+| Wire navigation to state | `Diff`, `Runs`, and `Promote` read `baseline`, `candidate`, `release_id`, `window`, `environment` from the URL search string. |
+| Blocked / pass unavoidable | Diff page shows a full-width **verdict banner** (alert on FAIL) above the result card stack. |
+| Bridge Diff → Promote | After a computed diff, a primary **Continue to promote** action links to Promote with release + environment + window prefilled (read-only builds omit). |
+
+## Phase 1 — Hierarchy and differentiation
+
+| Priority | Work | Status |
+|----------|------|--------|
+| P1 | Collapse or relocate **Ledger metrics** on Overview so the releases + promoted story leads. | Done — metrics in collapsible panel below tables (collapsed by default). |
+| P1 | **Reorder Diff result**: top fold = verdict + key deltas; pricing/catalog in collapsed sections or tabs. | Done — verdict banner; samples + rollups; pricing summary inline with expandable detail. |
+| P1 | **Promoted vs candidate** narrative per `agent + environment` (e.g. inline summary above tables). | Done — promoted table first with version column; releases show Live vs Registered. |
+| P1 | Reduce reliance on **manual checksum scanning** — surface version + agent + env as the human keys. | Done — Primary column on releases table; hero leads with agent/version/env. |
+
+## Phase 2 — Polish and operator UX
+
+| Priority | Work | Status |
+|----------|------|--------|
+| P2 | Typography scale for page vs card titles; consistent vertical rhythm. | Done — `fd-page-sub--tight` / `--meta`, wider page header measure. |
+| P2 | Table ergonomics: row hover, optional filters, copy-to-clipboard for release IDs. | Done — filter row on releases; copy buttons; hover accent on `fd-table--hover`. |
+| P2 | Tone down gradient accents for a more **infra / audit** aesthetic (keep accessible contrast). | Done — solid primary buttons; flat logo tile; nav indicator unchanged. |
+| P2 | Copy pass: each primary page answers *What changed?* *Is it safe?* *Can I ship?* in one short block. | Done — Overview, Diff, Runs, Actions, Settings intros. |
+
+## Non-goals (near term)
+
+- Embedded orchestration or graph execution.
+- Chart-heavy analytics dashboards (prefer summary metrics tied to gates).
+- Replacing the CLI registration / ingest workflow.
+
+## Verification
+
+After `web/` changes: from `web/`, `npm ci && npm run build`; commit `src/flightdeck/server/static/` updates; run `npm run test:e2e` when navigation or forms behavior changes.
+
+On Unix hosts where `python` is not on `PATH`, set `FLIGHTDECK_E2E_PYTHON` to a Python that has FlightDeck installed (for example the repo venv: `FLIGHTDECK_E2E_PYTHON=/path/to/.venv/bin/python npm run test:e2e`). The default is `python3`.
+
+## Blueprint alignment (external product IA review)
+
+This section maps a fuller “control plane” blueprint to FlightDeck’s **current** CLI-first ledger and HTTP surface. Use it to avoid building UI that implies APIs or workflows we do not ship yet.
+
+### Adopted from the blueprint
+
+- **Page litmus**: each primary screen should answer at least one of — *What changed?* · *What happened because of it?* · *Can I ship?*
+- **Cross-page consistency**: shared status semantics (pass / fail / warn / neutral), fixed vocabulary (**Release**, **Diff**, **Policy**, **Evidence**), repeated rhythm (**header → summary → detail → actions**).
+- **Sparse chrome**: summary metrics and tables over chart-heavy dashboards (matches roadmap non-goals).
+- **Diff as differentiator**: structured comparison and policy outcome stay central; layout can evolve toward “baseline vs candidate” twin + verdict-first fold (Phase 1).
+- **Evidence as ground truth**: runs + rollups remain the forensic surface; avoid Langfuse-style analytics_scope creep.
+- **Component direction**: prefer one reusable set (`ReleaseHeader`, `StatusBadge`, `MetricCard`, etc.) over one-off page styling.
+
+### Merged information architecture (near term)
+
+Avoid exploding to eight top-level nav items before contracts exist. Practical sequencing:
+
+1. **Overview** — situational awareness; add promoted / last-action strip before burying operators in ledger counters (Phase 1).
+2. **Releases** — table-first browsing (today: Overview table; later: dedicated route if needed).
+3. **Release detail** — evolve `?release=` hero into `/release/:id` when we want a stable bookmark per artifact.
+4. **Diff** — deep dive; expand “change → impact → policy” **only** when diff payloads expose comparable structure (prompt/tools/model deltas as data, not copy).
+5. **Evidence** — Runs page (rename in nav only if it helps operators).
+6. **Promote** — Actions; surface approval flow when `promotion_requires_approval` is on (today: request / confirm API).
+
+Defer standalone **Policies** (rule catalog with thresholds), **multi-role approval chains**, and **rich audit timeline filters** until read APIs and persistence match those stories.
+
+### Deferred / backend-gated (do not imply in UI yet)
+
+- **Per-release row status** (“Blocked”, “Live”, “Rolled back”) with sortable **cost Δ / latency Δ**: “Live” can align with promoted pointers; “blocked” is **evaluation-scoped** (depends on baseline, window, environment)—not a global attribute unless we store or cache last evaluation per release.
+- **Policies page** listing rules with “expected vs actual”: needs a stable **rule listing** or workspace-backed contract; today policy output is **evaluated reasons**, not necessarily a browsable catalog.
+- **Approvals** as org chart (Platform → ML → Security): requires identity, roles, and workflow beyond optional promotion request/confirm.
+- **Risk score** / composite **HIGH** labels: needs a defined server-side aggregate or explicit mapping from existing fields (e.g. sample confidence alone is not a full risk model).
+- **Release twin** lines such as “system prompt +N tokens” unless those deltas exist on the wire from release/diff payloads.
+
+### Terminology note
+
+Treat **policy FAIL** as **do not promote this candidate under this evaluation context** (baseline + window + environment), not “this release ID is permanently blocked everywhere.”
+
+## Production wireframe direction (external — change → impact → policy → decision)
+
+This section folds **final wireframe** feedback into the same constraints as **Blueprint alignment**: useful as **layout and component targets**, not as a promise that every block exists on the wire today.
+
+### Thesis (keep)
+
+The UI should reinforce **change → impact → policy → decision**, not generic dashboards. Prefer **deepening diff causality and decision clarity** over charts and vanity metrics (already in **Non-goals**).
+
+### Target section stack (conceptual)
+
+| Section | Role | FlightDeck today (serve UI) | Next evolution |
+|--------|------|----------------------------|----------------|
+| Sidebar | Stable nav | `AppShell` | Optional rename **Runs → Evidence** if it helps operators without splitting routes. |
+| Release header | Human anchor for the release under review | Overview `?release=` hero; Diff form IDs | Dedicated **`/release/:id`** or shared **`ReleaseHeader`** component fed by timeline + focused release. |
+| Block reason banner | Unmissable “why stop” | Diff verdict banner (policy FAIL + reasons) | Optional **single-line primary reason** when server ranks or summarizes reasons. |
+| Release twin (OLD vs NEW) | At-a-glance identity change | Pricing model line + rollups (Diff) | Explicit **baseline vs candidate** strip (version/agent/env + model/provider) once data is stable in **`POST /v1/diff`**. |
+| Change impact analysis (expandable) | Causal / drill-down | Collapsible pricing/catalog + metric grid | **Structured change list** only when diff payload exposes comparable artifacts (prompt/tools deltas)—no invented causality. |
+| Policy evaluation | Gate outcome | Verdict banner + policy reasons | Optional **`PolicyPanel`** extracting banner + evaluated_at for reuse on Actions outcomes. |
+| Approvals | Human layer | **Actions** when `promotion_requires_approval` | Not multi-role org charts until backend supports it; keep **request / confirm** truthy UI. |
+| Decision | Readable outcome | PASS/FAIL copy + promote CTA | **`DecisionCard`** summarizing verdict + next step (promote / fix / widen evidence). |
+| Actions | Mutations | Promote / rollback / request / confirm | Same page; ensure cross-links from Diff retain window/env. |
+
+### Suggested components (map to repo gradually)
+
+Names from feedback are **targets** for extraction/refactor—not required file renames in one PR:
+
+- **`ReleaseHeader`** — consolidate Overview hero + future release route header.
+- **`ReleaseTwin`** — thin summary row for baseline vs candidate (model/pricing/version IDs).
+- **`DiffList` / change rows** — defer until **`changes[]`** (or equivalent) exists on the API.
+- **`PolicyPanel`** — wrapper around policy PASS/FAIL + reasons + timestamp.
+- **`ApprovalPanel`** — pending requests + confirm flow (today on Actions).
+- **`DecisionCard`** — verdict + recommended action line.
+
+### Illustrative data shape (not current wire contract)
+
+A unified front-end model such as:
+
+```ts
+// Illustrative only — do not treat as implemented HTTP schema.
+type Release = {
+  id: string;
+  status: "blocked" | "ready";
+  changes: Change[];
+  policies: PolicyResult[];
+  approvals: Approval[];
+};
+```
+
+…only makes sense after the server can compute **`blocked` vs `ready`** for a **specific evaluation context** (baseline, window, environment) and optionally expose **`changes[]`**. Until then, compose views from **`TimelinePayload`**, **`POST /v1/diff`**, **`GET /v1/runs`**, and promotion APIs **without** implying a single merged **`Release`** document.
+
+### Hard “don’t” (reasserted)
+
+- Do **not** add chart-heavy dashboards or random metric walls.
+- Do **not** fake approval chains or policy catalogs without API backing.
+
+### Relation to open UI work (e.g. PR #53 trajectory)
+
+Recent UI slices move toward this wireframe: **verdict-first Diff**, **collapsed deep pricing**, **promoted-first Overview**, **copy/filters**, **decision-litmus copy**. On **Diff**, the **Release twin** (baseline vs candidate + resolved model line), **blocked strip** (first policy reason), **policy evaluation card**, **decision card** (promote when PASS), and **Change impact** section align layout with **change → impact → policy → decision** without inventing API fields.
+
+Remaining gap is mostly **component extraction** (`ReleaseHeader`, shared panels) and **release route**, gated on contracts above.
diff --git a/src/flightdeck/server/static/assets/index-BPDMrxvX.js b/src/flightdeck/server/static/assets/index-BPDMrxvX.js
diff --git a/src/flightdeck/server/static/assets/index-Cmx_W8JU.js b/src/flightdeck/server/static/assets/index-Cmx_W8JU.js
diff --git a/src/flightdeck/server/static/assets/index-Dr1ovfXv.css b/src/flightdeck/server/static/assets/index-Dr1ovfXv.css
diff --git a/src/flightdeck/server/static/assets/index-DrCTr-qj.css b/src/flightdeck/server/static/assets/index-DrCTr-qj.css
diff --git a/src/flightdeck/server/static/index.html b/src/flightdeck/server/static/index.html
@@ -8,8 +8,8 @@
       content="FlightDeck web UI: release diffs, run evidence, policy gates, and promote or rollback actions against a local flightdeck serve instance."
     />
     <title>FlightDeck</title>
-    <script type="module" crossorigin src="/assets/index-BPDMrxvX.js"></script>
-    <link rel="stylesheet" crossorigin href="/assets/index-Dr1ovfXv.css">
+    <script type="module" crossorigin src="/assets/index-Cmx_W8JU.js"></script>
+    <link rel="stylesheet" crossorigin href="/assets/index-DrCTr-qj.css">
   </head>
   <body>
     <div id="root"></div>

diff --git a/web/e2e/diff-ui.spec.ts b/web/e2e/diff-ui.spec.ts
@@ -0,0 +1,139 @@
+import { expect, test } from "@playwright/test";
+
+const mockReleaseRow = {
+  release_id: "rel_e2e_wireframe",
+  agent_id: "support-bot",
+  version: "1.0.0",
+  environment: "local",
+  checksum: "sha256deadbeef",
+  created_at: "2026-01-01T12:00:00Z",
+};
+
+const mockDiffPass = {
+  policy: {
+    passed: true,
+    reasons: [] as string[],
+    evaluated_at: "2026-01-02T00:00:00Z",
+  },
+  samples: {
+    baseline_runs: 10,
+    candidate_runs: 12,
+    confidence: "medium",
+  },
+  metrics: {
+    baseline_cost_per_run_usd: 0.012,
+    candidate_cost_per_run_usd: 0.015,
+    delta_cost_per_run_usd: 0.003,
+    delta_cost_per_run_pct: 0.25,
+    baseline_latency_ms_avg: 100,
+    candidate_latency_ms_avg: 110,
+    delta_latency_ms_avg: 10,
+    baseline_error_rate: 0.01,
+    candidate_error_rate: 0.02,
+    delta_error_rate: 0.01,
+  },
+  pricing: {
+    baseline_provider: "openai",
+    baseline_version: "2026-01",
+    baseline_model: "gpt-4.1",
+    candidate_provider: "openai",
+    candidate_version: "2026-01",
+    candidate_model: "gpt-4.1-mini",
+    pricing_or_model_changed: true,
+    warnings: ["Synthetic pricing warning for e2e."],
+    hints: [] as string[],
+    prices: {
+      baseline_input_usd_per_1k_tokens: 0.005,
+      baseline_output_usd_per_1k_tokens: 0.015,
+      candidate_input_usd_per_1k_tokens: 0.002,
+      candidate_output_usd_per_1k_tokens: 0.008,
+    },
+  },
+};
+
+test.describe("overview copy & mocked diff interactions", () => {
+  test.beforeEach(async ({ page }) => {
+    await page.route("**/v1/releases", async (route) => {
+      await route.fulfill({
+        status: 200,
+        contentType: "application/json",
+        body: JSON.stringify({ releases: [mockReleaseRow] }),
+      });
+    });
+    await page.route("**/v1/promoted", async (route) => {
+      await route.fulfill({
+        status: 200,
+        contentType: "application/json",
+        body: JSON.stringify({ promoted: [] }),
+      });
+    });
+    await page.route("**/v1/actions", async (route) => {
+      await route.fulfill({
+        status: 200,
+        contentType: "application/json",
+        body: JSON.stringify({ actions: [] }),
+      });
+    });
+    await page.route("**/v1/diff", async (route) => {
+      if (route.request().method() !== "POST") {
+        await route.continue();
+        return;
+      }
+      const postData = route.request().postDataJSON() as { baseline_release_id?: string } | null;
+      const baseline = postData?.baseline_release_id ?? "";
+      let body: typeof mockDiffPass & { policy: { passed: boolean; reasons: string[]; evaluated_at: string } };
+      if (baseline.includes("fail_gate")) {
+        body = {
+          ...mockDiffPass,
+          policy: {
+            passed: false,
+            reasons: ["cost regression exceeds threshold", "secondary reason"],
+            evaluated_at: "2026-01-02T00:00:00Z",
+          },
+        };
+      } else {
+        body = mockDiffPass;
+      }
+      await route.fulfill({
+        status: 200,
+        contentType: "application/json",
+        body: JSON.stringify(body),
+      });
+    });
+  });
+
+  test("overview copy release ID shows transient copied state", async ({ page }) => {
+    await page.goto("/");
+    await expect(page.getByRole("columnheader", { name: "Primary" })).toBeVisible({ timeout: 30_000 });
+    const copyBtn = page.getByTestId("overview-copy-release-row");
+    await expect(copyBtn).toBeVisible();
+    await copyBtn.click();
+    await expect(copyBtn).toHaveText("Copied");
+    await expect(copyBtn).toHaveText("Copy", { timeout: 4000 });
+  });
+
+  test("diff with mocked POST shows twin, policy PASS, decision CTA, pricing expand", async ({ page }) => {
+    await page.goto("/#/diff?baseline=rel_a&candidate=rel_b&environment=local&window=7d");
+    await page.getByRole("button", { name: "Compute diff" }).click();
+    await expect(page.getByRole("heading", { name: "Policy evaluation", level: 3 })).toBeVisible();
+    await expect(page.locator(".fd-policy-panel").getByText("PASS", { exact: true })).toBeVisible();
+    await expect(page.getByRole("heading", { name: "Decision", level: 3 })).toBeVisible();
+    await expect(page.getByRole("link", { name: "Continue to promote" })).toBeVisible();
+    const expand = page.getByTestId("diff-pricing-expand");
+    await expect(expand).toBeVisible();
+    await expect(page.getByTestId("diff-per-1k-prices-title")).not.toBeVisible();
+    await expand.click();
+    await expect(page.getByTestId("diff-per-1k-prices-title")).toBeVisible();
+    await expect(page.getByText("Synthetic pricing warning for e2e.")).toBeVisible();
+  });
+
+  test("diff with mocked FAIL shows blocked strip and no promote CTA", async ({ page }) => {
+    await page.goto("/#/diff?baseline=rel_fail_gate&candidate=rel_b&environment=local&window=7d");
+    await page.getByRole("button", { name: "Compute diff" }).click();
+    await expect(page.getByText(/^Blocked:/)).toBeVisible();
+    await expect(page.locator(".fd-diff-block-strip").getByText("cost regression exceeds threshold")).toBeVisible();
+    await expect(page.getByRole("heading", { name: "Policy evaluation", level: 3 })).toBeVisible();
+    await expect(page.locator(".fd-policy-panel").getByText("FAIL", { exact: true })).toBeVisible();
+    await expect(page.getByRole("link", { name: "Continue to promote" })).not.toBeVisible();
+  });
+});
diff --git a/web/e2e/smoke.spec.ts b/web/e2e/smoke.spec.ts
@@ -7,8 +7,10 @@ test("home loads FlightDeck shell and overview tables", async ({ page }) => {
   await expect(page.getByRole("link", { name: "Overview" })).toBeVisible();
   await expect(page.getByRole("heading", { name: "Overview", level: 2 })).toBeVisible();
   await expect(page.getByRole("navigation", { name: "Release governance workflow" })).toBeVisible();
-  await expect(page.getByRole("region", { name: "Ledger metrics" })).toBeVisible({ timeout: 30_000 });
-  await expect(page.getByRole("columnheader", { name: "Release ID" })).toBeVisible({ timeout: 30_000 });
+  await expect(page.getByTestId("ledger-metrics-toggle")).toBeVisible({ timeout: 30_000 });
+  await page.getByTestId("ledger-metrics-toggle").click();
+  await expect(page.getByRole("region", { name: "Ledger metrics" })).toBeVisible();
+  await expect(page.getByRole("columnheader", { name: "Primary" })).toBeVisible({ timeout: 30_000 });
   await expect(page.getByText("No releases yet.")).toBeVisible();
 });
 
@@ -30,6 +32,24 @@ test("runs page requires release id before query", async ({ page }) => {
   await expect(page.getByText("Release ID is required.")).toBeVisible();
 });
 
+test("deep links prefill diff, runs, and promote forms from query params", async ({ page }) => {
+  await page.goto("/#/diff?baseline=rel_base&candidate=rel_cand&environment=staging&window=14d");
+  await expect(page.getByRole("textbox", { name: /baseline release id/i })).toHaveValue("rel_base");
+  await expect(page.getByRole("textbox", { name: /candidate release id/i })).toHaveValue("rel_cand");
+  await expect(page.getByRole("textbox", { name: /^environment$/i })).toHaveValue("staging");
+  await expect(page.getByRole("textbox", { name: /^window$/i })).toHaveValue("14d");
+
+  await page.goto("/#/runs?release_id=rel_run&environment=prod&window=30d");
+  await expect(page.getByLabel(/release id/i)).toHaveValue("rel_run");
+  await expect(page.getByLabel(/environment \(optional\)/i)).toHaveValue("prod");
+  await expect(page.getByLabel(/^window$/i)).toHaveValue("30d");
+
+  await page.goto("/#/actions?release_id=rel_act&environment=qa&window=1d");
+  await expect(page.getByLabel(/^release id$/i)).toHaveValue("rel_act");
+  await expect(page.getByLabel(/^environment$/i)).toHaveValue("qa");
+  await expect(page.getByLabel(/^window$/i)).toHaveValue("1d");
+});
+
 test("GET /v1/workspace returns WorkspacePublic", async ({ request }) => {
   const res = await request.get("/v1/workspace");
   expect(res.ok()).toBeTruthy();