0.7.0

@wu-sheng wu-sheng released this 25 Jun 01:23

Browser errors & source maps

  • New "Browser Logs" tab on the BROWSER layer — lists the JS error logs the browser agent reports (message, category, page, app version, time, and the minified line:col), filterable by category and time window. Expanding a row shows the raw stack alongside a de-obfuscated view.
  • Source-map de-obfuscation (issue #6784). Upload a .map file from the tab and resolve any error's minified stack back to the original source — file, line, column, symbol name, and a source snippet — by picking which map to apply. Maps are held in the BFF's memory only (no backend storage): they're surfaced as temporary, evicted least-recently-used when the configured budget is hit, and lost on restart. For durable provisioning, mount .map files into the server's static source-map directory (HORIZON_SOURCEMAPS_DIR, /app/sourcemaps in the image) — those reload automatically and can't be deleted from the UI. Budgets are configurable via the new sourceMaps block in horizon.yaml (per-file and total in-memory caps; defaults 64 MiB / 512 MiB). Upload/delete require the new source-map:write permission; viewing + resolving ride on browser-errors:read.
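
The in-memory eviction behaviour described above can be illustrated with a minimal sketch. It assumes a plain least-recently-used store with a per-file cap and a total cap; the class name `SourceMapCache` and its methods are hypothetical and are not taken from the Horizon code base.

```python
from collections import OrderedDict

MIB = 1024 * 1024


class SourceMapCache:
    """Hypothetical in-memory store: LRU eviction under a total byte budget."""

    def __init__(self, per_file_cap=64 * MIB, total_cap=512 * MIB):
        self.per_file_cap = per_file_cap
        self.total_cap = total_cap
        self._maps = OrderedDict()  # name -> bytes, least recently used first
        self._used = 0

    def put(self, name, data):
        if len(data) > self.per_file_cap:
            raise ValueError(f"{name} exceeds the per-file cap")
        # Re-uploading a name replaces the old copy.
        if name in self._maps:
            self._used -= len(self._maps.pop(name))
        # Evict least-recently-used maps until the new one fits the budget.
        while self._maps and self._used + len(data) > self.total_cap:
            _, evicted = self._maps.popitem(last=False)
            self._used -= len(evicted)
        self._maps[name] = data
        self._used += len(data)

    def get(self, name):
        data = self._maps.get(name)
        if data is not None:
            self._maps.move_to_end(name)  # mark as most recently used
        return data
```

Because the store lives only in process memory, a restart empties it, which is why the notes recommend the mounted source-map directory for durable provisioning.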

Layers

  • Split a layer's menu by service group. A new per-layer Split menu by service group toggle (Layer dashboards admin, right after Alias; default off) fans the layer out into one level-0 sidebar entry per OAP Service.group — the <group>:: prefix. The entry's display name leads with the group so it reads everywhere (sidebar, page header, KPI tile) and survives narrow-sidebar truncation — e.g. agent · General Service. Each entry is scoped to its group: the service header + its picker, the topology map + its in-box selector, the dashboards, and the service roster all show only that group's services. A layer's group entries stay contiguous in the sidebar (sorted by group), and the cross-group view returns by turning the toggle back off (off = one combined entry holding all groups). The group value is OAP-supplied data and is shown verbatim (not translated). Travels with template export/import like every other layer setting.
  • The navigation sidebar is now resizable — drag the divider between the sidebar and the page to widen or narrow it (double-click the divider to reset to the default width); the chosen width persists per browser. Useful when long entries — like group-split names (agent · General Service) or deep namespaces — would otherwise truncate.
  • The per-layer service picker shows each service's group. When a layer has no topology-cluster naming rule, the service-list rows now surface the OAP <group>:: prefix (e.g. agent) as the group chip — so the group is visible there as it already is on the topology map.
  • Every layer OAP reports now appears in the sidebar, including ones with no Horizon template (they render with default capabilities — a plain Service page). The previous hard-coded hidden-layer list (which dropped BanyanDB) is gone; a layer is hidden only when an admin explicitly disables its template, or when it is listed in the new config-driven layers.excluded block in horizon.yaml (defaults: FAAS and VIRTUAL_GATEWAY; clear the list to surface every reported layer).
  • The admin Layer dashboards page is now layer-list-oriented. It lists every available layer — not just the ones shipping a bundled JSON or living on OAP. A layer with no template yet opens on a blank default you can configure (components, metric columns, widgets, topology) and Save, which publishes the template to OAP on first save. No per-layer JSON has to be shipped for a layer to be configurable. The picker gains a Not configured filter (beside Diverged / Local) and the sync banner spells out "N templates match bundled defaults · M layers not configured yet".
  • Removed the legacy per-layer overview block from every bundled layer template (and its translation overlays). It no longer rendered anything — the standalone Overview Dashboards replaced the old per-layer Overview tile — so it was dead config; the per-layer KPI strip is driven by the layer-header columns.

Airflow monitoring layer (SWIP-7)

  • New Airflow layer under Workflow Scheduler — service dashboard (scheduler / executor / pool KPIs and trends), Components dashboard (per-host scheduler and triggerer metrics for Airflow 3.x native OTel), and a 3D Infra Map load ring for Tasks Executable. Pairs with OAP backend SWIP-7 (meter_airflow_* / meter_airflow_instance_*).

BanyanDB self-observability layer (SWIP-15)

  • New BanyanDB layer under Self-Observability, modeling a clustered, role- and tier-aware BanyanDB deployment scraped through its FODC proxy. The cluster is one Cluster (service), each container is one Container (instance, carrying its container_name role and node_type tier as attributes), and each storage Group is an endpoint:
    • Cluster dashboard — write / query / error-rate KPIs, CPU / memory / disk capacity, a throughput + errors trend, and a Containers by Role table.
    • Container dashboard adapts to the selected container's role: every container shows CPU / memory / Go-runtime resources (and, where the system collector runs, uptime / disk / network); a liaison adds ingestion, query, gRPC errors, the tier-2 publish pipeline and write-queue depth; a data node adds storage totals, merge/compaction, inverted index, subscribe queue and retention; a lifecycle sidecar shows migration cycles and last-run time / status. Liaison and data panels are gated on the container's role attribute; resource panels the lifecycle sidecar doesn't emit, and the lifecycle migration panels themselves, self-gate on data presence so they surface only once that container actually reports them (the lifecycle panels stay hidden until the first migration cycle runs). This template targets the clustered model; a single-process standalone BanyanDB (container_name=standalone) shows the shared resource / Go panels but not the role-specific ingestion / storage panels — those extend to standalone once the entity-gate membership operator (SWIP-15 §6) lands.
    • Group dashboard — metrics split per data-model (measure / stream / trace / property): each model gets write rate, query latency, stored data, merge rate / latency / partitions, series write + term-search and total series, plus the type-agnostic subscribe / publish queue (throughput, p99, batch + message rate, publish bytes). Because a BanyanDB group stores one catalog, only the matching model's panels render for a given group — gated by the model's series-count flag — so a measure group shows the measure panels, a property group its index-write / merge / term-search / series panels, and so on.
    • A Deployment tab renders the cluster's container inventory — every container grouped into its node's role/tier box (liaison, data hot/warm/cold) and its pod, with per-role health metrics (liaison query rate + gRPC errors, data ingest rate + disk usage, lifecycle migration cycles + last-run status). The node health-ring legend names each role's own ring metric and its colour-band thresholds (driven by the layer template) instead of a single shared, hard-coded one. Container-to-container call edges carry role-pair-specific metrics off the SWIP-15 instance-relation families: a liaison → data edge shows write / query / part-sync throughput + p99 (one per queue operation), a liaison → liaison edge shows write-forward + control, and a lifecycle → data edge shows tier-migration volume / rate / p99. The edge prints up to 3 of the pair's metrics inline (short aliases like W / R, flowing onto one line or stacking by edge length); the selected-edge panel keeps the full client | server breakdown, and the Flows sub-tab tables every edge per role-pair. Edges render once the OAP build includes the SERVICE_INSTANCE_RELATION scope (the migration_* family also needs the lifecycle sidecar reporting); until then the tab shows the inventory without edges.
    • The whole deployment model — clustering / grouping rules, per-role node metrics, and role-pair edge metrics — is editable from the Layer dashboards admin → Deployment scope.
  • Pairs with OAP backend SWIP-15 (meter_banyandb_* cluster / meter_banyandb_instance_* container / meter_banyandb_endpoint_* group). Queue-batch and lifecycle last-run panels appear once the cluster runs a BanyanDB build that emits those metrics.

Dashboard widget value formatting

  • Card widgets gain an enum format with a value→label map: a coded metric (e.g. a 1/0 success gauge) renders a readable label (1 → OK, 0 → Failed) instead of the raw number. Labels are translatable per locale (BFF-side template i18n overlay) and the map is editable in the Layer dashboards admin. BanyanDB's lifecycle Last Sync card uses it.
  • New duration format renders a SECONDS metric as a human time-ago (5m 20s ago; compact 5m / 2h on axes) — used by BanyanDB's Time Since Last Sync card.
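
As a rough illustration of the two formats above, the sketch below maps a coded value to a label and renders a seconds value as a time-ago string. The function names, the fallback to the raw number, and the choice to show the two largest units are assumptions for illustration, not the shipped formatter.

```python
def format_enum(value, labels):
    """Map a coded metric value to its label; fall back to the raw number."""
    return labels.get(value, str(value))


def format_duration(seconds, compact=False):
    """Render a SECONDS value as a time-ago string, or a compact axis label."""
    remaining = int(seconds)
    parts = []
    for suffix, size in (("d", 86400), ("h", 3600), ("m", 60), ("s", 1)):
        count, remaining = divmod(remaining, size)
        if count:
            parts.append(f"{count}{suffix}")
    if not parts:
        parts = ["0s"]
    if compact:
        return parts[0]  # largest unit only, e.g. for chart axes
    return " ".join(parts[:2]) + " ago"


status_labels = {1: "OK", 0: "Failed"}  # the value -> label map from the text
```

With these definitions, `format_enum(1, status_labels)` gives `OK` and `format_duration(320)` gives `5m 20s ago`, matching the examples in the notes.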

Record widgets — jump to trace & copy

  • Record widgets now drill into the originating trace. Each sampled row gets a jump-to-trace icon at the row head that opens the trace waterfall in the global popout; the icon is shown only when the sample actually carries a trace id (records are sampled, so the id can be absent). It resolves the trace by id, not by layer, so it works even though the trace belongs to the calling service on a different layer (a virtual-target layer has no traces tab of its own). The statement text itself is click-to-copy. One example is the Slow Statements record widget on a Virtual Database / Cache / MQ service.

Instance-list badge

  • The badge on each row of the instance list (Containers / Pods / Nodes / …) is now configurable per layer (instances.badge on the layer template) — it can show any instance attribute instead of the fixed agent language. BanyanDB shows container_name (liaison / data / lifecycle), the role that actually distinguishes a container; agent-traced layers keep language (Java / Go / …). The badge is now hidden when the value is empty or UNKNOWN, so OpenTelemetry-scraped layers (which report no agent language) drop the meaningless UNKNOWN chip across the board.

Dashboard widget visibility

  • Layer-dashboard widgets gain a structured Visible when gate (Layer dashboards admin → widget drawer) so a widget only renders when it's relevant to the selected entity. Two kinds:
    • MQE metric — show the widget only when an expression has value, or when any value is > / < a threshold. Naming the widget's own metric self-gates it (the JVM widgets appear only on JVM instances, the MQ widgets only on MQ producers, …); naming a different metric gates a whole group on one shared signal — that metric is checked once and the entire group's queries are skipped when it's empty, so e.g. a non-JVM instance no longer runs the JVM widget queries at all.
    • Entity attribute — on the Instance scope, gate on the selected instance's attributes, e.g. language equals JAVA (case-insensitive) or an attribute simply being present. Service / Endpoint entities carry no attributes, so entity gates are ignored on those scopes.
  • Gates are evaluated server-side; gated-out widgets just don't appear in the grid. Note: a layer dashboard saved before this release that used the old free-text predicate loses its gate (the widget renders ungated) until you re-set the gate in the new editor and save the dashboard.
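
The two gate kinds above can be sketched as small predicates. This is only an illustration of the described semantics: the function names, the operator names (`has_value`, `gt`, `lt`), and the choice to treat an ignored entity gate as "widget renders" are assumptions, not the server's actual data model.

```python
def metric_gate_passes(values, op="has_value", threshold=None):
    """MQE-metric gate: pass when the expression has a value, or when any
    value is above / below the threshold."""
    present = [v for v in values if v is not None]
    if op == "has_value":
        return bool(present)
    if op == "gt":
        return any(v > threshold for v in present)
    if op == "lt":
        return any(v < threshold for v in present)
    raise ValueError(f"unknown operator: {op}")


def attribute_gate_passes(scope, attributes, name, equals=None):
    """Entity-attribute gate: only meaningful on the Instance scope."""
    if scope != "instance":
        return True  # assumption: an ignored gate leaves the widget visible
    value = attributes.get(name)
    if value is None:
        return False  # attribute absent
    if equals is None:
        return True  # presence-only gate
    return value.lower() == equals.lower()  # case-insensitive equality
```

A shared gate would call `metric_gate_passes` once for the group's signal metric and skip every query in the group when it returns `False`.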

Topology node filter & component icons

  • The per-layer Topology map (and the embedded topology widget on the Services / Mesh overview dashboards) gains a Filter control to hide the conjectured peers that clutter a dense map. One auto-derived facet — by layer, presented exactly as the sidebar shows it: each row carries the layer's own icon and its localized display name (General Service, Virtual Database, Java Agent, …), plus an Others bucket for nodes OAP couldn't resolve, alongside a standalone User toggle. The layer rows self-populate from whatever the map currently shows and re-derive on every refresh / depth / time change. Unchecking a row hides those nodes and their now-dangling edges; the Others bucket is where uninstrumented "undefined" peers (e.g. a bare rcmd:80) land, so one click clears them, while your real databases / queues / caches — separated by their own VIRTUAL_* layer rows — stay on the map. Filtering is client-side and defaults to showing everything.
  • Technology component icons on the nodes. Service-map nodes now render the icon for their detected component — the same icon set the trace waterfall uses, so a PostgreSQL node looks like PostgreSQL — falling back to the generic service / external / user glyph when the component ships no icon or couldn't be resolved.
  • The topology's service selector (the "All services" picker) now groups its list by service group — OAP's Service.group (the <group>:: prefix, e.g. agent) shown under a value-first <name> [GROUP] header — so a layer whose services share a group reads grouped instead of as one flat list. This is a per-service attribute and needs no per-layer naming-rule setup; services with no group stay in a single header-less section. Clicking a group header batch-selects or unselects every service in that group — the header carries a filled / half / hollow marker for all / some / none of its services focused.
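
The grouping by the `<group>::` prefix can be sketched as follows. The helper names are hypothetical, and the sketch assumes the group is everything before the first `::` in the service name; services without that prefix fall into a header-less bucket keyed by `None`.

```python
def split_service_group(service_name):
    """Split 'group::name' into (group, name); no prefix yields (None, name)."""
    group, separator, name = service_name.partition("::")
    if not separator or not group:
        return None, service_name
    return group, name


def group_services(service_names):
    """Bucket service names by group, preserving the input order per group."""
    grouped = {}
    for full_name in service_names:
        group, name = split_service_group(full_name)
        grouped.setdefault(group, []).append(name)
    return grouped
```

A conjectured peer such as `rcmd:80` contains a single colon rather than `::`, so it stays ungrouped under this rule.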

Instance topology

  • The per-layer Topology map gains an instance map drill-down on layers that enable instance topology. Click a call between two services and then Instance map → to open it: the instances of each service as two columns (left = client, right = server) with the instance-level calls between them — pan/zoom, animated client→server flow, the same node health-ring + per-call client/server metric sidebar the service map uses, and a node popover with Open instance dashboard. A back button returns to the service map; a toolbar pair-picker swaps the two services. The two service pickers are relationship-aware, drawn from the service-topology call graph (including conjectured / cross-layer callees like rcmd:80, named the same as on the service map): the server list is the chosen client's callees and the client list is the chosen server's callers, each re-deriving when the other changes without resetting your current pick. A side the graph leaves no real choice for (e.g. a single caller) shows as plain text instead of a one-option dropdown. Each service's instances sit inside a labelled grouping box — named with the service, using the same <group>:: prefix handling as the service map so a name reads identically on both — and a ring-colour legend explains what the node health bands (green → red) mean for the configured ring metric. Labels follow the layer's own terms (e.g. Pods on Kubernetes, Sidecars on the data plane).
  • Configurable like the service map. The Layer-dashboards admin → Topology scope now has an Enable instance topology toggle and its own node / server-edge / client-edge metric editors, kept visually separate from the service-topology metrics so the two are never confused. Enabled out of the box on General, Service Mesh, Kubernetes Service, and Cilium Service; the config rides each layer's topology template (so it travels with template export/import).
  • When OAP's template store is unreachable, the instance map now shows the same empty + connectivity-banner state as the service map, rather than a misleading "not supported": the blocked and unsupported states are no longer conflated.
  • Localized across all eight UI languages. The instance-map UI, the template-store-unreachable banner, and the remaining alarm / live-debugger strings are now translated in zh-CN, ja, ko, es, pt, de and fr (English stays the source) — no feature renders English-only for non-English operators.

Lock & compare entities on a layer dashboard

  • Lock several services, instances, or endpoints — including ones from different services — and compare them in place. Compare is standard on every service / instance / endpoint layer dashboard — no flag, nothing to enable. Pin entities from the service picker or the instance / endpoint list; instance and endpoint pins are cross-service, so instances belonging to different services can be compared side by side. A persistent, scope-aware comparison bar shows the cohort regardless of how the underlying list paginates or which entity is currently selected.
  • The entity you're viewing is always part of the comparison — it appears first, tagged CURRENT in the accent color (and still drives the header KPIs); pinned entities add to it, each in its own stable hue (up to six pins). The comparison-bar chips are display-only: clicking a chip never changes what you're viewing (no disruptive reload) and × unpins — switch the focused entity from the top selector / list as usual.
  • Each widget compares inline in its own tile — line widgets overlay one hued series per entity; card widgets show one row per entity; top-N and record widgets get per-entity tabs plus a merged "All" tab; table widgets gain an Entity column that groups rows by entity and folds each entity's long tail into one (others) row (summed for counts, count-only for latencies / percentiles where a sum would mislead). With nothing locked, every page renders exactly as before.
  • Labeled series lead with the meaningful dimension — a multi-label series reads <label> · <service> · <instance> so the label (e.g. a JVM thread state) survives when a long instance / endpoint id has to truncate; entities stay distinguishable by color. A widget that only some compared entities expose (e.g. JVM widgets when a Java service is pinned alongside a non-JVM one) shows whenever any compared entity has it.
  • Progressive, per-entity loading — each entity loads as its own request; tiles fill in as entities arrive and one slow or failed entity never blanks the others.
  • The Topology, Deployment, trace, and log pages are unaffected — comparison applies to the service / instance / endpoint dashboards only.
  • Widget tooltips and legends show the widget title, never the raw MQE expression, for un-labeled single-series line widgets; the multi-series tooltip is a fixed, aligned table — the entity name truncates and the values form one clean right-aligned column with the unit in the header.

Charts & bundled metrics

  • Large numbers on chart axes and tooltips now use compact SI suffixes (45.1k, 1.34M, 2.5G) instead of scientific notation (4.51e4), which operators found hard to read.
  • Go runtime "Metadata Mspan" and "Metadata Mcache" widgets now report KB, fixing values that were displayed as a mislabeled "MB" of raw bytes (≈1000× too large) — they now read in the same KB scale as the other Go metadata-size widgets.
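
The compact SI formatting described in this section can be sketched like this. The function name and the rule of keeping roughly three significant digits are assumptions chosen to reproduce the examples in the notes (45.1k, 1.34M, 2.5G); the real chart formatter may round differently.

```python
def compact_si(value):
    """Format a number with a k / M / G suffix and about three significant digits."""
    sign = "-" if value < 0 else ""
    magnitude = abs(value)
    for suffix, scale in (("G", 1e9), ("M", 1e6), ("k", 1e3)):
        if magnitude >= scale:
            scaled = magnitude / scale
            # Fewer decimals as the integer part grows: 1.34, 45.1, 451
            decimals = 2 if scaled < 10 else 1 if scaled < 100 else 0
            text = f"{scaled:.{decimals}f}"
            if "." in text:
                text = text.rstrip("0").rstrip(".")
            return f"{sign}{text}{suffix}"
    return f"{sign}{magnitude:g}"  # below 1000: no suffix
```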

Deployment

  • New per-layer Deployment tab — the deployment topology of all of a service's instances: the instance-to-instance call graph within a single service. Where the instance map drills into the instances between two services, this shows how one service's own instances are deployed and talk to each other (e.g. a clustered store's nodes calling each other). Pick a service from the layer's Service header and the tab draws its instances as health-ring nodes with the intra-service calls between them — pan/zoom, animated edge flow, the per-call client/server metric sidebar, and a node popover that shows the instance's attributes and an Open instance dashboard link. Self-calls and back-and-forth pairs are drawn distinctly.
  • Node clustering. Instances can group into labelled boxes by a single instance attribute (e.g. role / tier), by several attributes combined into one key (e.g. node_role + node_type, where an attribute absent on a node drops out — so a BanyanDB cluster splits its data nodes into hot / warm / cold boxes while the liaison nodes, which carry no tier, stay one box), or by a name regex run on the instance name — so a fleet of mixed-role nodes reads as one box per role instead of a flat cloud. The boxes lay out left→right along the calls between them, so an upstream→downstream chain reads in order.
  • Optional + configurable. Off by default for every layer; a layer opts in from the Layer-dashboards admin → Deployment scope, which has its own node / server-edge / client-edge metric editors (instance scope) plus the clustering-rule picker. The config is a self-contained block on the layer template, so it travels with template export/import and is independent of the service-map topology config.
  • Pod / sibling model. Instances render as hexagons and can bundle into pods: a pod's main container is a full hex with its sibling containers attached as smaller hexes around its edges. Three independent rules drive it — cluster (the dashed boxes), sibling (which containers form one pod), and role (per-container-type metrics + which container is the main). Edges resolve to the exact container, so cross-pod sidecar links (e.g. a lifecycle agent calling its peer in another pod) connect the small hexes. The model can be previewed before real data exists via the admin's draft Preview flow — edit the Deployment scope and preview the live page without publishing.
  • Tiered layout + draggable pods. Each cluster box lays its pods out by call depth — sources on the left, the pods they call to the right — so a hot → warm → cold lifecycle chain reads as left-to-right tiers. Pods stack vertically within a tier, and a tier with more than four pods wraps into additional stacked columns of four. Drag any pod to rearrange; its cluster box re-flows to keep every node enclosed.
  • Role-to-role edge metrics + a Flows view. Deployment edge metrics are keyed by the (source-role → target-role) pair, so each kind of link shows its own metrics rather than one flat set — a liaison → data edge can surface write / query / part-sync throughput while a lifecycle → data edge surfaces migration volume. Pairs match most-specific-first with a * wildcard fallback. Each pair names a primary metric that prints inline on the edge in the map, so the headline number reads at a glance without opening anything; the selected-edge sidebar shows that pair's full metric set. A new Flows sub-tab (next to Topology) lays the edges out as one aligned table per role-pair — click a row to jump to that edge in the graph.
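
The role-pair lookup with a `*` wildcard fallback might look like the sketch below. The function name, the dictionary layout, and in particular the order in which the two half-wildcard pairs are tried are assumptions; the notes only state that pairs match most-specific-first.

```python
def resolve_edge_metrics(source_role, target_role, pair_metrics):
    """Return the metric list for the most specific matching role pair."""
    candidates = (
        (source_role, target_role),  # exact pair
        (source_role, "*"),          # assumption: source-specific beats target-specific
        ("*", target_role),
        ("*", "*"),                  # catch-all
    )
    for key in candidates:
        if key in pair_metrics:
            return pair_metrics[key]
    return []


# Hypothetical configuration keyed by (source role, target role).
example_pairs = {
    ("liaison", "data"): ["write", "query", "part-sync"],
    ("lifecycle", "data"): ["migration volume"],
    ("*", "*"): ["calls"],
}
```

Here a liaison → data edge resolves to its own metric set, while an unlisted pair such as data → liaison falls through to the `("*", "*")` entry.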

API dependency

  • The per-layer API dependency tab renders an endpoint's caller → callee chain as a graph. Pick an endpoint and it lays out in columns by direction — callers on the left, the focus endpoint in the centre, callees on the right — with the same node health-ring border, SLA-coloured RPM, and latency you read on the service map; edges animate the call direction and label the heaviest by RPM.
  • Expand to walk the chain. A selected endpoint shows a single + handle that pulls in its own callers and callees in one click (new callers land left, callees right). The handle spins while the dependency query is in flight; when an endpoint is a leaf with nothing further to load it fades and a brief banner says so — a silent "nothing happened" never reads as a bug.
  • Rearrange freely. Drag any node box to pull a dense graph apart — edges follow live. Pan, wheel-zoom, and a fit button act on the whole canvas, and a node holds a steady on-screen size whether or not the detail sidebar is open.
  • Drill straight out, in a new tab. The node detail's Open endpoint and Service →, and the service-map node/edge jumps (Open service, API map →, Instance map →), now open in a new browser tab — so you keep the graph you're exploring while the drill-down opens alongside it.
  • Nodes share the service-map's visual vocabulary (SLA-band border, an agent badge on instrumented endpoints, the focus star), and the tab is localized across all eight UI languages.

Dashboard template portability

  • Every template admin page — Overview templates, Layer dashboards, and the 3D-map config — now has Export and Import actions. Export downloads the in-use version (what end users render: the version live on OAP, or the bundled default when OAP has none) as a JSON file, for backup, sharing, or moving a dashboard to another OAP. Import reads a JSON file, validates it, and loads it as a local draft in this browser — preview it, then publish with “Check diff & push” as usual. Importing never writes OAP directly. Overview import can recreate a deleted dashboard or seed a brand-new one; layer import targets a layer already present on this deployment.
  • The Translations page has matching Export / Import, scoped to the current language: export the in-use translation for a template + locale as a JSON file, or import one as a local draft to review and push. (Source templates and their translations are edited on separate pages, so their import/export are separate too — each on its own page.)

Template store reliability

  • Runtime config is strictly what's on OAP. Layer dashboards, overviews, and topology now render only the version published to OAP's UI-template store (or the in-code minimal default for a layer that has none). The disk-bundled templates reach a running UI only by being synced to OAP (first boot / admin reset) or through the admin Preview button — they are never a silent live fallback. So an operator always sees the live published config, not a stale bundled copy masquerading as current.
  • Unreachable template store is a visible block, not a quiet fallback. When OAP's UI-template host can't be reached, a banner (same red treatment as the OAP-query-unreachable strip) reports it, and the dashboard / overview / topology surfaces stay empty rather than back-filling bundled defaults that could be read as real. The sidebar still navigates so the rest of the app is reachable.
  • The admin Preview button now drives every template-rendered page — the overview detail view and the per-layer topology (incl. the instance map), API dependency, traces, and network-profiling pages — not just the layer dashboards. Previewing renders the draft's metrics/config against live OAP, so an edit to topology or dependency metrics is visible before you publish. Preview and the absent-remote path stay strictly separate: a draft renders only in ?mode=preview; normal reads never carry one.
  • Editors no longer silently fall back to the bundled default. When a layer / overview / translation has no version published to OAP, the editor shows a "No published version on OAP" panel instead of quietly loading the shipped bundled copy as if it were live. Bundled now reaches the editor only when you click Reset to bundled — matching the runtime, which renders the published version or blocks, never the bundle.

Layer landing & service list

  • The layer landing now shows your services, not just an arbitrary 25. It used to cap the metric fan-out at the first 25 services by list order, so larger layers hid the rest, and the "top" services weren't even the true top (the cap happened before the ranking). Now it probes all services up to a configurable cap and, when a layer exceeds it, runs a cheap single-metric ranking pass to pick the true top-N by the landing's order-by column. The service picker surfaces "top N of M" so the trim is never silent. Queries drain through a bounded-concurrency pool, so a big layer fans out in controlled waves rather than a thundering herd.
  • New query.landingServiceCap in horizon.yaml (default 100) tunes how many services a landing probes per request — raise it if your OAP + storage can take the larger fan-out, lower it to protect a modest deployment.
  • The service picker now lists the whole layer, not only the metric-probed top-N. Services that ranked below the metric cap on the order-by column now appear as their own rows with "low" in that column (and for the others, which were never probed) instead of being hidden, so every service stays browsable, searchable, and selectable regardless of the cap. The header chip reads "metrics: top N" to make the metric trim explicit.
  • Removed the stale "Landing KPI tile" controls (Headline / Trend line) from the Layer-dashboards admin. They no longer matched the rendered layer header — which shows every configured metric column as its own KPI with its own trend line — so editing them changed nothing on screen. The header is driven entirely by the service-list columns + default sort; the preview now reflects that.
  • Selecting a low-traffic (below-cap) service now works on every tab, not just the dashboard. Logs, traces, and endpoint-dependency resolved the picked service's name from the landing sample only, so a tail service queried as blank (and Logs even snapped the pick back to the top service). All per-layer tabs now resolve the name from the full roster, so a low-traffic service drills in everywhere.
  • Profiling scopes no longer show an editor grid that goes nowhere. Trace / eBPF / async profiling are built-in runtime views with nothing to author, so the admin now shows a "configured at runtime" note for them instead of a widget grid whose widgets never rendered.
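
The capped landing fan-out described in this section (rank first, then trim, with bounded concurrency) can be sketched with `asyncio`. The function name, the `rank_metric` callback, the default concurrency of 8, and the "higher value ranks first" ordering are assumptions for illustration; only the default cap of 100 comes from the notes.

```python
import asyncio


async def pick_landing_services(services, rank_metric, cap=100, concurrency=8):
    """Return the services to probe: all of them, or the true top-N by one metric."""
    if len(services) <= cap:
        return list(services)  # small layer: no ranking pass needed

    gate = asyncio.Semaphore(concurrency)  # bounds in-flight ranking queries

    async def ranked(service):
        async with gate:
            return await rank_metric(service), service

    scored = await asyncio.gather(*(ranked(s) for s in services))
    # Rank before trimming, so the kept services are the real top-N.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [service for _, service in scored[:cap]]
```

The semaphore is what keeps a large layer from issuing every ranking query at once: at most `concurrency` calls to `rank_metric` run at any moment.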

Access control & permissions

  • The Roles & Permissions board now lists infra-3d:read — the permission to view the 3D Infrastructure Map — under the data-catalog group, with a matching "3D infrastructure map" row in the menu-visibility matrix. It was already enforced and granted to every built-in role (viewer and up), but it never appeared on the board, so an admin couldn't see who held it.
  • Editing a layer dashboard template is now gated on the dashboard:write permission the editor already advertises; publishing overview, alert, and 3D-map configs stays on overview:write. The required permission is resolved per template kind at save time. Built-in roles are unaffected (operator and admin hold both), but a custom role granted only dashboard:write can now save layer dashboards.
  • The Cluster Status debug view (/api/debug/status) now requires only live-debug:read. It previously also demanded cluster:read, so a role granted live-debug access but not cluster-read was wrongly blocked.
  • Saving a local draft of a template (the "Save local" action) now enforces the same per-kind permission as publishing — a layer draft needs dashboard:write, other kinds overview:write — instead of a blanket overview:write.
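
The per-kind permission rule above reduces to a small lookup. In this sketch the kind names (`layer`, `overview`, `alert`, `infra-3d`) and the function names are hypothetical; only the two permission strings come from the notes.

```python
def required_permission(template_kind):
    """Layer templates need dashboard:write; every other kind needs overview:write."""
    return "dashboard:write" if template_kind == "layer" else "overview:write"


def can_save(role_permissions, template_kind):
    """Check a role's permission set against the kind being saved or published."""
    return required_permission(template_kind) in role_permissions
```

Under this rule a custom role holding only `dashboard:write` can save a layer dashboard but not an overview, alert, or 3D-map config.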

Performance hardening

  • Layer dashboards skip a redundant service lookup on every load. The dashboard route used to issue its own listServices to auto-pick the service and carry its entity-scope flags; it now reads the shared per-layer service catalog the sidebar already keeps warm, so a dashboard's first paint costs one fewer OAP round-trip in the common case. It still falls back to a live lookup for a just-registered service, a cold snapshot, or to surface an OAP outage (the "OAP unreachable" state now follows the actual widget fetch, so a warm cache can't mask a backend that's gone away).
  • The alarms list and count fire their two startup probes in parallel. The server-time offset and backend-capability probes that precede every alarms query now run concurrently instead of one-after-the-other.
  • The 3D Infra Map loads its metrics in parallel. Per-node metric values used to fetch one batch at a time; they now load in bounded-concurrency batches, so the load rings and traffic values fill in sooner on large layers. A new metricConcurrency setting in the Infra Map config (default 4) caps how many metric batches run at once.
  • Oversized layer topologies fail with a clear message instead of an unreadable map. When a layer's service graph exceeds the render ceiling (5,000 services or 15,000 calls), the service map shows a "Topology too large to render" notice with the live counts and a hint to pick a specific service or lower the depth, rather than attempting to lay out a graph too dense to read.
  • Partial metric-load failures are now surfaced on every topology map. If some metric batches fail to load (a transient OAP error) on the service map, instance topology, deployment, or endpoint-dependency map, a banner now explains that blank values may be unavailable rather than zero — and on the endpoint-dependency map, that some endpoints or links may be missing — so a backend hiccup isn't misread as real "no traffic" data.
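
The render-ceiling check can be sketched as a guard that runs before layout. The thresholds come from the notes; the function name and the exact notice wording beyond the quoted title are assumptions.

```python
MAX_SERVICES = 5_000
MAX_CALLS = 15_000


def topology_render_notice(service_count, call_count):
    """Return a notice when the graph exceeds the render ceiling, else None."""
    if service_count > MAX_SERVICES or call_count > MAX_CALLS:
        return (
            "Topology too large to render "
            f"({service_count} services, {call_count} calls). "
            "Pick a specific service or lower the depth."
        )
    return None
```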

Fixes

  • Metrics Inspect — the crosshair value tooltip is no longer clipped behind the navigation sidebar when you hover near a widget's left edge; it now renders above the page chrome.
  • 3D Infrastructure Map config — "Check diff & push" now requires saving a local draft first (it was selectable while edits were still unsaved), and the push dialog renders the side-by-side before/after JSON diff instead of an empty panel.
  • The API-dependency tab now honors the topbar time picker. It was pinned to the last hour regardless of the selected range; changing the range (and expanding a node) now re-queries the chosen window, like the service map and instance map already did.
  • A dashboard no longer blanks entirely when one metric group fails. A transient backend error (timeout / 5xx / query-complexity limit) on a single batch of widgets now marks only those widgets as failed; the rest of the dashboard renders normally instead of every cell going blank.
  • Trace list rows pick the correct root span on BanyanDB. A multi-service trace could surface a downstream span's endpoint / duration / start time in the list; the row now reliably reflects the trace's true entry span.
  • Correct timestamps right after repointing OAP. The server-timezone offset is now cached per OAP URL, so a configuration reload that switches to a different-timezone OAP re-probes immediately instead of serving the previous server's offset for up to a minute.
  • Baseline security response headers (X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: no-referrer) are now sent on every response.
  • Removed an internal ?mockTop= debug query parameter that padded top-N widgets with synthetic rows; it no longer ships in release builds.
  • The profiling pages now use more of the page height. The Trace / eBPF / Async / pprof / network profiling layouts were sized off a viewport offset that over-counted the chrome above them, leaving dead space at the bottom on taller screens; they now extend closer to the bottom of the view.
  • Overview dashboard templates are validated before save. A malformed overview (missing a required field, an unknown widget type) is now rejected with a clear field-level error instead of being written to OAP — restoring a guard that was lost when overview editing moved to the OAP-backed save path.
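
The per-URL caching behind the "repointing OAP" fix can be sketched as a small expiring cache keyed by the OAP URL. This is a minimal illustration under stated assumptions, not the shipped implementation: OffsetCache and its method names are hypothetical, and the 60-second lifetime is inferred from the note that a stale offset could previously be served "for up to a minute".

```typescript
// Hypothetical sketch: cache the server-timezone offset per OAP URL, so that
// switching to a different URL never reuses another server's offset.
class OffsetCache {
  private readonly entries = new Map<string, { offsetMinutes: number; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // ttlMs of 60000 is an assumption; `now` is injectable so tests can control time.
  constructor(ttlMs: number = 60000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached offset for this URL, or undefined when a fresh probe is needed.
  get(oapUrl: string): number | undefined {
    const entry = this.entries.get(oapUrl);
    if (entry === undefined || entry.expiresAt <= this.now()) {
      this.entries.delete(oapUrl);
      return undefined;
    }
    return entry.offsetMinutes;
  }

  // Stores the offset obtained from probing the given OAP URL.
  set(oapUrl: string, offsetMinutes: number): void {
    this.entries.set(oapUrl, { offsetMinutes, expiresAt: this.now() + this.ttlMs });
  }
}
```

Because the key is the URL, a lookup for a newly configured OAP misses immediately and triggers a probe, even while the previous server's entry is still within its lifetime.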

Documentation & release tooling

  • A further accuracy pass corrected the Cluster Status page (three panes — Query, Admin, Zipkin/OTLP — and no per-node member list), the Kubernetes readiness-probe guidance (point it at the public /api/health, not the authenticated /api/oap/info), the layer-template components default (only the service dashboard is on when a key is omitted) and the aliases authoring key, the removed visibleWhen free-text and embedded-i18n template shapes, and the data-retention cold-stage controls.
  • The website docs were brought current with the 0.6.0 build and the configuration pages restructured around the admin UI — the JSON shape is now a reference appendix, not an authoring surface (these admin pages are structured editors, not raw-JSON editors). Accuracy fixes span the RBAC verbs (incl. infra-3d:read), the audit-log action set, the Metrics Inspect API paths, the layer-template component flags, and the redesigned 3D-map config + loading stages. A new docs/CLAUDE.md records the doc-writing principles, and the i18n docs gain a language × scope coverage matrix plus a translation step in the add-a-layer recipe.
  • The container image is published to Docker Hub by CI on every v* tag; the post-vote finalize script now only verifies the published tags (the manual local-push fallback and Docker Hub login preflight were removed).

Layer drill-down fixes

  • The per-layer Instance and Endpoint pages now honor the layer's configured aliases in their section headers and in the service-picker's name column — e.g. ActiveMQ reads Brokers / Destinations and Virtual MQ reads Topics / MQ clusters, matching the sidebar — instead of the generic "Instance" / "Endpoint" / "Service" labels. Layers that define no alias still read the generic words.
  • A layer's Instance or Endpoint page no longer hangs on a perpetual "Reading data…" when the selected service reports no instances or endpoints (or a search matches nothing). It now shows the empty picker and renders the metric widgets in their normal "no data" state, so the layout stays visible and ready for services that do report them.
  • Clearer cluster boundaries on every topology view. The dashed grouping boxes — namespaces on the service map, per-service boxes on the instance map, role/tier clusters on the Deployment tab — now draw with a bolder, brighter dashed border and a fully transparent background, so the boundary reads clearly on every theme (light themes included) instead of fading into the canvas. The Deployment tab also packs its cluster boxes evenly: boxes sit at a uniform spacing with no dead corridor between tiers and no blank strip before the first box.

Live debugger

  • MAL sample groups. A captured step that fans out to many samples no longer dumps every label set on screen: the samples are grouped by metric name into a one-line summary — <metric> · N samples · values=… — and you expand only the groups you care about to see each sample's full labels. Groups are collapsed by default.
  • Diff is the default when a group is expanded. A multi-sample group opens straight into diff view: the labels shared by every sample collapse into a dimmed "common" block and only the labels that differ are highlighted per sample, so it is immediately clear which label distinguishes each one (e.g. node_role / pod_name) and what value it maps to. A diff toggle beside the group's header switches back to the full per-sample label list. The "common" set is computed across the whole group, not just the rendered rows.
  • Multiple output entities collapse the same way. When a record materialises one metric for several entities (e.g. a per-endpoint write rate over sw_metricsMinute / sw_metricsHour / sw_metricsDay), the repeated meter cards fold into one block: a shared header (metric / function / time bucket), an N outputs · values=… summary, and a diff that surfaces only the entity fields that actually differ (whichever they are, not a fixed field), with each output's value beside it.
  • Readable sample values. Long fractional values from rate() / avg() (e.g. 57.0333333333…) are trimmed to a few significant digits for display so they stop overflowing the value column; integer counters still render exact, and the precise value stays available on hover.
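
The "common" versus differing split used by the diff view can be illustrated with a short function. This is a sketch of the general technique only: diffLabels and the Labels type are hypothetical names, and the sample label values below are made up for the example (only node_role and pod_name come from the notes above).

```typescript
type Labels = Record<string, string>;

// Hypothetical sketch: split a group's samples into the labels every sample
// shares and the labels that differ per sample.
function diffLabels(samples: Labels[]): { common: Labels; perSample: Labels[] } {
  const common: Labels = {};
  if (samples.length > 0) {
    for (const [key, value] of Object.entries(samples[0])) {
      // A label is "common" only if every sample in the group has the same value.
      if (samples.every((s) => s[key] === value)) {
        common[key] = value;
      }
    }
  }
  const commonKeys = new Set(Object.keys(common));
  // For each sample, keep only the labels that are not shared by the whole group.
  const perSample = samples.map((s) => {
    const differing: Labels = {};
    for (const [key, value] of Object.entries(s)) {
      if (!commonKeys.has(key)) differing[key] = value;
    }
    return differing;
  });
  return { common, perSample };
}

const result = diffLabels([
  { job: "k8s", node_role: "master", pod_name: "a" },
  { job: "k8s", node_role: "worker", pod_name: "b" },
]);
// result.common holds { job: "k8s" }; each perSample entry keeps node_role and pod_name.
```

Computing the common set over every sample passed in, rather than over a visible subset, matches the note that the "common" block reflects the whole group.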

DSL management — live apply progress & recovery

  • A structural rule change now shows live apply progress. Saving an edit that moves a metric's storage shape (scope, downsampling, or the metric set) no longer just flashes "submitted" — the editor tracks the apply across the cluster through a phase stepper (Compiled → Confirming across the cluster → Committing → Done) and reports success only once OAP confirms the change is durable. Revert to bundled (also a schema change) goes through the same stepper. Body- and filter-only edits still apply instantly with no stepper. You can navigate away mid-apply; reloading the editor resumes the progress.
  • "Applied — cluster propagation unconfirmed" is a warning, not an error. When a structural change is committed and durable but one or more nodes hadn't confirmed the new schema within OAP's fence budget, the editor names the lagging nodes and explains they self-converge on their next scan — the rule is applied, not rolled back. Reloading the editor reads it back as applied (from the stored rule).
  • A failed apply is called out as rolled back — the cluster stays on the previous rule, the failure reason is shown inline, and your edit is kept in the editor so you can fix and save again. A compile error now surfaces as an inline diagnostic under the editor instead of a transient toast.
  • Force re-apply to recover. A degraded or transiently-failed apply offers a one-click Force re-apply (recover) that re-runs the apply across the cluster to re-confirm the schema and un-stick any waiting node — gated behind a confirm that spells out it briefly pauses collection for that rule's metrics, even when the content is unchanged. This subsumes the old Advanced force toggle for the recovery case.
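
The three apply outcomes described above (applied, applied with propagation unconfirmed, and rolled back) map naturally onto a tagged union. The sketch below is purely illustrative: ApplyOutcome, Notice, describeOutcome, and every field name are hypothetical and do not reflect OAP's or Horizon's real response shape.

```typescript
// Hypothetical outcome shapes; the kind and field names are illustrative only.
type ApplyOutcome =
  | { kind: "applied" }
  | { kind: "applied-unconfirmed"; laggingNodes: string[] }
  | { kind: "rolled-back"; reason: string };

interface Notice {
  severity: "success" | "warning" | "error";
  message: string;
}

// Maps an apply outcome to the severity and wording the editor might show.
function describeOutcome(outcome: ApplyOutcome): Notice {
  switch (outcome.kind) {
    case "applied":
      return { severity: "success", message: "Applied" };
    case "applied-unconfirmed":
      // The rule is durable; lagging nodes converge on their next scan.
      return {
        severity: "warning",
        message:
          "Applied, cluster propagation unconfirmed. Lagging nodes: " +
          outcome.laggingNodes.join(", "),
      };
    case "rolled-back":
      // The cluster stays on the previous rule; the reason is shown inline.
      return { severity: "error", message: "Rolled back: " + outcome.reason };
  }
}
```

Modelling the unconfirmed case as its own variant, rather than as a failure flag, keeps it from being rendered as an error by accident.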

Source & binary releases (with signatures and checksums):

Container image: docker pull apache/skywalking-ui:horizon-0.7.0