Add Cluster Usage admin page with discovery-driven resource display #15

@lexfrei

Description

Problem

The console has no view for cluster-wide resource consumption. To answer questions like "how busy is the cluster right now", "how many GPUs are allocated", or "where can I schedule the next pod" — the operator has to drop to kubectl top nodes, kubectl describe node, or an external Grafana. There is no in-dashboard answer.

This issue adds a new admin page that surfaces cluster-scoped resource usage, including arbitrary extended resources (GPUs and any other accelerators), discovered at runtime rather than hardcoded.

Visibility

The page is gated to cluster-level operators only. Tenant users who lack permission to list nodes must not see the menu item at all — not "see it and then hit a 403".

Implementation: a new conditional item in Console → Administration in apps/console/src/routes/sidebar-sections.tsx, rendered only when a SelfSubjectAccessReview returns allowed: true for nodes list. This is a new pattern in the console; once introduced, the same hook can later be retrofitted to other cluster-scoped items (Tenants, External IPs, Modules) without further design work.
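
The gate described above could be sketched roughly as follows. The request and response shape follows the authorization.k8s.io/v1 SelfSubjectAccessReview API; the helper names buildNodesListReview and canShowClusterUsage are hypothetical, and how the console actually issues the POST is not specified here.

```typescript
// Hypothetical helper names; only the object shape comes from the
// authorization.k8s.io/v1 SelfSubjectAccessReview API.
interface SelfSubjectAccessReview {
  apiVersion: "authorization.k8s.io/v1";
  kind: "SelfSubjectAccessReview";
  spec: {
    resourceAttributes: { group: string; resource: string; verb: string };
  };
  status?: { allowed: boolean; denied?: boolean; reason?: string };
}

// Body to POST to /apis/authorization.k8s.io/v1/selfsubjectaccessreviews.
function buildNodesListReview(): SelfSubjectAccessReview {
  return {
    apiVersion: "authorization.k8s.io/v1",
    kind: "SelfSubjectAccessReview",
    spec: {
      // Nodes live in the core API group, which is the empty string.
      resourceAttributes: { group: "", resource: "nodes", verb: "list" },
    },
  };
}

// Fail closed: anything other than an explicit allowed: true hides the item,
// including a missing response (request still loading or failed).
function canShowClusterUsage(
  review: SelfSubjectAccessReview | undefined,
): boolean {
  return review?.status?.allowed === true;
}
```

Failing closed while the review is still in flight means the menu item never flashes for a tenant user and then disappears.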

Numbers the page distinguishes

Any resource on a node has up to four distinct numbers, and the page surfaces them as separate concepts because they answer different questions:

  • Capacity (node.status.capacity): what is physically on the node. Does not change without nodes joining or leaving.
  • Allocatable (node.status.allocatable): capacity minus the kubelet/system reservation. What the scheduler can actually hand out.
  • Requested: sum of pod.spec.containers[].resources.requests over pods bound to the node. Answers "how much room is left for scheduling".
  • Used (metrics.k8s.io/v1beta1/nodes): what is actually consumed right now. Answers "how loaded is the node in reality".

Requested and Used are different numbers and both have to be visible. A node can sit at 10% CPU used and 95% CPU requested — it looks free, but it is unschedulable. Hiding either column produces a misleading view.
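
To make the 10% used / 95% requested case concrete, here is a minimal sketch that parses Kubernetes quantity strings and computes both percentages against Allocatable. parseQuantity and percentOf are hypothetical names, and the parser is a simplification that covers the common suffix forms only, not a full implementation of the quantity grammar.

```typescript
// Multipliers for the decimal and binary suffixes used by Kubernetes quantities.
const SUFFIX: Record<string, number> = {
  n: 1e-9, u: 1e-6, m: 1e-3, "": 1,
  k: 1e3, M: 1e6, G: 1e9, T: 1e12, P: 1e15, E: 1e18,
  Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30,
  Ti: 2 ** 40, Pi: 2 ** 50, Ei: 2 ** 60,
};

const QUANTITY_RE =
  /^([+-]?\d*\.?\d+(?:[eE][+-]?\d+)?)(n|u|m|k|M|G|T|P|E|Ki|Mi|Gi|Ti|Pi|Ei)?$/;

// Converts "250m", "2", "16Gi", "129e6" and similar to a plain number in
// base units (cores for CPU, bytes for memory).
function parseQuantity(q: string): number {
  const match = QUANTITY_RE.exec(q.trim());
  if (!match) throw new Error(`unparseable quantity: ${q}`);
  return Number(match[1]) * SUFFIX[match[2] ?? ""];
}

// Whole-number percentage of allocatable; 0 when allocatable is zero so the
// caller never divides by zero.
function percentOf(value: string, allocatable: string): number {
  const denominator = parseQuantity(allocatable);
  if (denominator === 0) return 0;
  return Math.round((parseQuantity(value) / denominator) * 100);
}

// A 4-core node that looks idle but is nearly full from the scheduler's view.
const usedPct = percentOf("400m", "4");       // 10
const requestedPct = percentOf("3800m", "4"); // 95
```

Both numbers share the same denominator, which is what lets the two bars be compared at a glance.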

For extended resources, only Capacity / Allocatable / Requested apply. metrics.k8s.io does not report extended resources. True utilization for accelerators (DCGM SM utilization, HAMi fractional usage, Prometheus-backed metrics) is a deeper layer and is explicitly out of scope for the first iteration.

Layout

Aggregates panel (top)

A grid of cards summarizing the cluster.

A header line shows total node count split by Ready / NotReady / SchedulingDisabled.

Fixed cards for standard resources: CPU, Memory, Storage (ephemeral), Pods. Each card uses Allocatable as the denominator and overlays Used (if metrics.k8s.io is available) and Requested as two separate progress bars. If metrics.k8s.io is not discovered, the Used line is omitted without any "metrics-server not installed" warning — the rest of the page works regardless.
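
Whether the Used line is drawn at all could be decided from the discovery response along these lines. The APIGroupList shape matches what GET /apis returns; hasMetricsApi is a hypothetical helper, and the interface is trimmed to the fields the check needs.

```typescript
// Subset of the APIGroupList object returned by GET /apis.
interface APIGroupList {
  groups: {
    name: string;
    versions: { groupVersion: string; version: string }[];
  }[];
}

// True only when the metrics.k8s.io group advertises the v1beta1 version
// that the page would query for NodeMetrics.
function hasMetricsApi(apis: APIGroupList): boolean {
  return apis.groups.some(
    (group) =>
      group.name === "metrics.k8s.io" &&
      group.versions.some((v) => v.groupVersion === "metrics.k8s.io/v1beta1"),
  );
}
```

A false result simply switches the Used line off; nothing else on the page depends on it.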

Below the standard cards, a section "Extended resources (discovered)" with one card per extended-resource prefix found in the cluster (everything in status.capacity except cpu, memory, ephemeral-storage, pods, hugepages-*). Each card shows Capacity, Allocatable, Requested for the aggregated prefix. Card titles use the prefix verbatim — nvidia.com/gpu, amd.com/gpu, hami.io/vgpu, acme.io/fpga — no vendor mapping. Cards are sorted alphabetically by prefix for stability. If zero extended resources exist in the cluster, the section is not rendered.

Progress-bar color: ≤70% neutral, >70% and ≤90% warning, >90% danger. The same scale applies to both Used and Requested bars.
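
As a small sketch of the color scale, reading the boundaries so that exactly 70% is still neutral and exactly 90% is still warning. BarTone and barTone are hypothetical names, not existing console code.

```typescript
type BarTone = "neutral" | "warning" | "danger";

// Maps a utilization percentage to a tone; shared by Used and Requested bars.
function barTone(percent: number): BarTone {
  if (percent > 90) return "danger";
  if (percent > 70) return "warning";
  return "neutral";
}
```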

Per-node table (bottom)

Columns, left to right:

  • Name — node name. Plain text in the first iteration (a future per-node detail page is out of scope).
  • Status — Ready / NotReady / SchedulingDisabled. A warning icon appears when any status.conditions of type MemoryPressure, DiskPressure, PIDPressure, or NetworkUnavailable is True; tooltip lists which.
  • Roles — labels with the node-role.kubernetes.io/* prefix. If the node is cordoned, an indicator is shown. If the node has taints, a +tainted N chip appears with a tooltip listing the taints.
  • CPU — two-line cell. Top line: used / allocatable with a progress bar (only if metrics.k8s.io is present, otherwise omitted). Bottom line: requested / allocatable with a progress bar.
  • Memory — same structure as CPU.
  • One column per full extended-resource key found anywhere in the cluster. Cell shows requested / allocatable / capacity. Cells on nodes that do not expose the key render an empty-value placeholder. Columns are per full key: nvidia.com/gpu and nvidia.com/gpu.shared are separate columns, even though the aggregate panel above groups by prefix.
  • Age: formatAge(creationTimestamp), the same helper currently used by TenantsPage.
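
The Status column could be derived from the Node object roughly as below. This is a sketch under one assumption the issue does not settle: a cordoned node is labelled SchedulingDisabled regardless of its Ready condition. NodeLike and nodeStatus are hypothetical names.

```typescript
// Only the Node fields the Status column needs.
interface NodeLike {
  spec?: { unschedulable?: boolean };
  status?: { conditions?: { type: string; status: string }[] };
}

// Condition types that trigger the warning icon when their status is "True".
const WARNING_CONDITIONS = [
  "MemoryPressure",
  "DiskPressure",
  "PIDPressure",
  "NetworkUnavailable",
];

function nodeStatus(node: NodeLike): { label: string; warnings: string[] } {
  const conditions = node.status?.conditions ?? [];
  const ready = conditions.some(
    (c) => c.type === "Ready" && c.status === "True",
  );
  // The warnings array doubles as the tooltip content.
  const warnings = conditions
    .filter((c) => WARNING_CONDITIONS.indexOf(c.type) !== -1 && c.status === "True")
    .map((c) => c.type);
  // Assumption: cordon takes precedence over readiness for the label.
  const label = node.spec?.unschedulable
    ? "SchedulingDisabled"
    : ready
      ? "Ready"
      : "NotReady";
  return { label, warnings };
}
```

A node with no Ready condition at all falls through to NotReady, which is the conservative reading for a node that has only just registered.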

Sorting: click any column header. Default sort is by Name.

Filter: a text input above the table filters by Name and Roles substring.
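
A minimal sketch of the filter and the default sort, assuming case-insensitive matching (the issue only says "substring"). Row, nodeRoles, filterRows and sortByName are hypothetical names.

```typescript
const ROLE_PREFIX = "node-role.kubernetes.io/";

// Extracts role names from node labels, e.g.
// "node-role.kubernetes.io/control-plane" becomes "control-plane".
function nodeRoles(labels: Record<string, string> | undefined): string[] {
  return Object.keys(labels ?? {})
    .filter((key) => key.indexOf(ROLE_PREFIX) === 0)
    .map((key) => key.slice(ROLE_PREFIX.length))
    .sort();
}

interface Row {
  name: string;
  roles: string[];
}

// Keeps rows whose name or any role contains the query.
// An empty query keeps everything.
function filterRows(rows: Row[], query: string): Row[] {
  const q = query.trim().toLowerCase();
  if (q === "") return rows;
  return rows.filter(
    (row) =>
      row.name.toLowerCase().indexOf(q) !== -1 ||
      row.roles.some((role) => role.toLowerCase().indexOf(q) !== -1),
  );
}

// Default sort: by Name, on a copy so the source list is not mutated.
function sortByName(rows: Row[]): Row[] {
  return rows.slice().sort((a, b) => a.name.localeCompare(b.name));
}
```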

Pagination: none in the first iteration. The list comes whole from the K8s API; clusters with far more than 100 nodes already do not render well in any current console page, and adding pagination is a separate piece of work that affects multiple pages.

Horizontal scroll is acceptable when the number of dynamic extended-resource columns exceeds what fits in the viewport — preferred over collapsing distinct resources into a single column.

Discovery flow

On page mount the following queries run in parallel:

  1. GET /apis to determine whether metrics.k8s.io is registered. One request, cached for the page lifetime.
  2. useK8sList<Node> with watch enabled — feeds both panels with capacity/allocatable/conditions. Never polls.
  3. useK8sList<Pod> cluster-wide with watch enabled — used to compute Requested per node (group by spec.nodeName, sum containers[].resources.requests). Never polls.
  4. Conditional on step 1 finding metrics.k8s.io: useK8sList<NodeMetrics> from metrics.k8s.io/v1beta1, no watch (the API does not support it), refetchInterval: 30000. Thirty seconds is of the same order as the metrics-server scrape interval (--metric-resolution, which varies by deployment); polling faster than the scrape interval returns identical values and wastes requests.
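
Step 3 can be sketched as a plain reduction over the Pod list. To stay short the sketch handles CPU only, in millicores. It also skips pods in the Succeeded or Failed phase, which is an assumption not stated in this issue (it mirrors what kubectl describe node does), and it ignores init containers and pod overhead. PodLike, cpuMillicores and requestedCpuByNode are hypothetical names.

```typescript
// Only the Pod fields needed to compute Requested.
interface PodLike {
  spec?: {
    nodeName?: string;
    containers?: { resources?: { requests?: Record<string, string> } }[];
  };
  status?: { phase?: string };
}

// Handles the two common CPU forms: "250m" (millicores) and "2" or "0.5" (cores).
function cpuMillicores(quantity: string): number {
  return quantity.slice(-1) === "m"
    ? Number(quantity.slice(0, -1))
    : Number(quantity) * 1000;
}

// Groups pods by spec.nodeName and sums container CPU requests per node.
function requestedCpuByNode(pods: PodLike[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const pod of pods) {
    const nodeName = pod.spec?.nodeName;
    if (!nodeName) continue; // not yet scheduled, occupies no node
    const phase = pod.status?.phase;
    // Assumption: terminated pods no longer hold their requests.
    if (phase === "Succeeded" || phase === "Failed") continue;
    for (const container of pod.spec?.containers ?? []) {
      const cpu = container.resources?.requests?.["cpu"];
      if (cpu !== undefined) {
        totals[nodeName] = (totals[nodeName] ?? 0) + cpuMillicores(cpu);
      }
    }
  }
  return totals;
}
```

The same loop generalizes to memory and extended-resource keys by iterating over every entry in requests instead of reading cpu alone.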

Extended-resource keys are derived from step 2, not requested separately. After the Node list arrives, walk every status.capacity map and filter out the standard keys; the remaining unique keys define the dynamic columns and the extended-resource aggregate cards. A new accelerator vendor surfaces in the UI automatically the moment a node exposing it joins the cluster, with no code change and no release.
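
The walk described above might look like the following sketch. The excluded keys are the ones listed in the Layout section (cpu, memory, ephemeral-storage, pods, hugepages-*); the function names and the alphabetical ordering of the result are assumptions made for illustration.

```typescript
const STANDARD_KEYS = ["cpu", "memory", "ephemeral-storage", "pods"];

// A key is "extended" when it is neither a standard resource nor a hugepages size.
function isExtendedKey(key: string): boolean {
  return STANDARD_KEYS.indexOf(key) === -1 && key.indexOf("hugepages-") !== 0;
}

// Collects the unique extended-resource keys across all nodes, sorted so that
// column order stays stable between renders.
function extendedResourceKeys(
  nodes: { status?: { capacity?: Record<string, string> } }[],
): string[] {
  const seen: Record<string, true> = {};
  for (const node of nodes) {
    for (const key of Object.keys(node.status?.capacity ?? {})) {
      if (isExtendedKey(key)) seen[key] = true;
    }
  }
  return Object.keys(seen).sort();
}
```

Because nothing in the filter names a vendor, a key such as acme.io/fpga becomes a column as soon as it appears in any node's capacity map.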

Edge cases

  • nodes list denied: the sidebar gate should already hide the entry, but on direct URL navigation the page renders a single-line "You do not have permission to view cluster nodes" with a link back to the console — not a browser 403.
  • pods list cluster-wide denied: Requested cells render empty with a tooltip "Requires cluster-wide pod read access". Used is unaffected.
  • metrics.k8s.io registered but the NodeMetrics request returns 403: Used cells show an empty-value placeholder, no full-page error.
  • NodeMetrics partially absent (node just joined, metrics-server has not yet collected it): Used shows an empty-value placeholder for that node only; the rest render normally.
  • NotReady node: CPU/Memory/extended cells show an empty-value placeholder. Status and Age still render.
  • Single-node cluster: aggregate cards equal the single row — visually redundant but does not break.

RBAC (companion PR in cozystack/cozystack)

A new ClusterRole and ClusterRoleBinding ship with the dashboard chart under packages/system/dashboard/templates/:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cozystack-dashboard-cluster-usage
rules:
- apiGroups: [""]
  resources: ["nodes", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["nodes"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cozystack-dashboard-cluster-usage
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cozystack-dashboard-cluster-usage
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: cozystack-cluster-admin

Rationale for the cluster-wide pods list grant: cozystack-cluster-admin is currently bound to the built-in edit ClusterRole via cozystack-dashboard-cluster-admin, and edit only grants pod read inside namespaces. Cluster-wide listing requires this explicit grant. Without it, Requested cannot be computed across the whole cluster.

The chart intentionally does not add permissions for deeper sources (HAMi CRs, Prometheus, DCGM). Those belong in their respective packages or in operator-provided RBAC; the page detects them via discovery and gates each section with a per-source SelfSubjectAccessReview. If an operator wants those sections to surface, they add the bindings themselves.

The RBAC change must be covered by a helm-unittest fixture under packages/system/dashboard/tests/ so future refactors of the chart cannot silently drop the rules.

Out of scope (first iteration)

  • True utilization for accelerators (DCGM SM utilization, HAMi fractional usage, Prometheus-backed metrics).
  • Per-tenant resource breakdown (this page is cluster-scoped only).
  • Historical time-series / charts.
  • Alerting and threshold configuration UI.
  • A node-detail page reached by clicking a node name in the table.
  • Multi-select filters on Status / Roles.
  • Pagination of the per-node table.

Acceptance criteria

  • Console → Administration → Cluster Usage appears for users who can list nodes and is invisible to users who cannot.
  • The page renders correctly on a cluster with metrics.k8s.io (Used cells populated) and on a cluster without it (Used omitted, no errors anywhere).
  • Aggregate cards correctly sum across all nodes for CPU, Memory, ephemeral-storage, Pods.
  • Every extended-resource prefix discovered in the cluster gets its own aggregate card, with the prefix shown verbatim — no hardcoded vendor names anywhere in the code.
  • The per-node table contains one dynamic column per full extended-resource key found in the cluster.
  • A new ClusterRole cozystack-dashboard-cluster-usage and its ClusterRoleBinding ship in the dashboard chart and are covered by helm-unittest fixtures.
  • The sidebar item is hidden when the current user fails SelfSubjectAccessReview for nodes list.

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature
