Move ironcore, pods, and hypervisor RBAC ClusterRole/ClusterRoleBinding templates out of the shared library chart (helm/library/cortex/) into their respective bundle charts (cortex-ironcore, cortex-nova, cortex-pods). These resources were gated by conditional values toggles in the library chart but only ever applied to specific bundles, so they belong in those bundles directly. This simplifies the library chart, removes unused values toggles, and resolves the TODO comment in hypervisor_role.yaml that noted it should live in the nova bundle. Assisted-by: Claude Code:claude-opus-4-20250514 [Bash] [Read]
The library chart contained webhook Kubernetes resources (ValidatingWebhookConfiguration, Service, Certificate) that only apply to cortex-nova's scheduling controllers. Other bundles (ironcore, cinder, manila, pods) had webhook enabled but don't actually implement the webhook endpoint, resulting in unnecessary resources being deployed. This moves the webhook Service, ValidatingWebhookConfiguration, and Certificate into a new helm/bundles/cortex-nova/templates/webhook.yaml, deletes the helm/library/cortex/templates/webhook/ directory, removes the webhook Certificate block from the library cert-manager template, and removes webhook/certmanager enablement from bundles that don't need it. The webhook flag remains in cortex-nova's scheduling-controllers values where it drives the manager container configuration (args, ports, volume mounts). Additionally, the Tiltfile's cert-manager setup is wrapped in a `setup_certmanager()` function that is only called when cortex-nova is in the active deployments, since it's the only bundle that needs it. Assisted-by: Claude Code:claude-opus-4-20250514 [Bash] [Read]
Go 1.26 extended the built-in `new` function to accept an expression as its operand, which can replace our testlib.Ptr.
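A minimal sketch of the kind of helper this replaces. The `Ptr` function below is an assumed stand-in for `testlib.Ptr` (its real signature is not shown here), and the `new(3)` form in the comment requires a Go 1.26 or newer toolchain.

```go
package main

import "fmt"

// Ptr is a hypothetical stand-in for testlib.Ptr:
// it returns a pointer to a copy of v.
func Ptr[T any](v T) *T {
	return &v
}

func main() {
	replicas := Ptr(3)
	// With Go 1.26 or newer the helper is unnecessary: new(3) yields
	// a *int pointing at a fresh variable initialised to 3.
	fmt.Println(*replicas)
}
```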
## Changes
- Include domain_id as label in vmware project capacity metrics
- Add join to domain table to also include the name as a label for better usability
After PR #797 moved hypervisor RBAC from the per-subchart library into the cortex-nova bundle chart, deploying fails because the existing ClusterRoleBindings in the cluster still reference the old per-subchart ClusterRoles (e.g. cortex-nova-scheduling-manager-role-hypervisor). Kubernetes does not allow changing roleRef on an existing ClusterRoleBinding, so the deployment errors out. This renames both ClusterRoleBindings to new unique names so Helm creates them fresh, pointing to the shared cortex-nova-manager-role-hypervisor ClusterRole. Assisted-by: Claude Code:claude-opus-4-20250514 [Bash] [Read]
Commit 33cb060 moved webhook Kubernetes resources (Service, Certificate, ValidatingWebhookConfiguration) from the shared library chart into only the cortex-nova bundle, assuming non-nova bundles don't implement webhooks. In reality all bundles (manila, cinder, ironcore, pods) register pipeline validation webhooks in the Go code and need these resources deployed. Without them the manager crashes on startup trying to load TLS certs from /tmp/k8s-webhook-server/serving-certs/tls.crt. This reverts that commit to restore the webhook templates in the library chart and re-enables webhook/certmanager in all bundle values. Assisted-by: Claude Code:claude-opus-4-20250514 [Bash] [Read]
## Changes
- Add project utilization KPI for kvm
- Minor adjustments to how the usage is calculated for vmware
The scheduling pipeline's human-readable explanation previously only described which filter steps removed which hosts. Operators had no insight into why the weighing stage ranked hosts the way it did, especially when negative multipliers are involved. This adds a weighing explainer that recovers per-weigher multipliers via least-squares on the normal equations, performs counterfactual analysis to identify pivotal weighers whose removal would change the #1 host, and decomposes pairwise score gaps to surface the leading contributor and any opposing weighers. The result is appended to the existing generateExplanation output so operators can see actionable detail about ranking causality without any additional API surface. Assisted-by: Claude Code:claude-opus-4-20250514 [Bash] [Read]
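The multiplier recovery step can be sketched as follows. This is an illustration under stated assumptions, not the code from this change: it assumes each host's combined score is the sum of multiplier × activation over all weighers, and the name `recoverMultipliers` and its input layout are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// recoverMultipliers solves min ||A*m - s||^2 through the normal equations
// (A^T A) m = A^T s. A[i][j] is the activation of weigher j on host i and
// s[i] is the combined score of host i (assumed data layout).
func recoverMultipliers(activations [][]float64, scores []float64) ([]float64, error) {
	if len(activations) == 0 || len(activations) != len(scores) {
		return nil, errors.New("need one score per host")
	}
	n := len(activations[0])
	// Build the augmented system [A^T A | A^T s].
	aug := make([][]float64, n)
	for j := range aug {
		aug[j] = make([]float64, n+1)
	}
	for i, row := range activations {
		if len(row) != n {
			return nil, errors.New("ragged activation matrix")
		}
		for j := 0; j < n; j++ {
			for k := 0; k < n; k++ {
				aug[j][k] += row[j] * row[k]
			}
			aug[j][n] += row[j] * scores[i]
		}
	}
	// Gauss-Jordan elimination with partial pivoting.
	for col := 0; col < n; col++ {
		pivot := col
		for r := col + 1; r < n; r++ {
			if math.Abs(aug[r][col]) > math.Abs(aug[pivot][col]) {
				pivot = r
			}
		}
		if math.Abs(aug[pivot][col]) < 1e-12 {
			return nil, errors.New("weigher activations are linearly dependent")
		}
		aug[col], aug[pivot] = aug[pivot], aug[col]
		for r := 0; r < n; r++ {
			if r == col {
				continue
			}
			f := aug[r][col] / aug[col][col]
			for k := col; k <= n; k++ {
				aug[r][k] -= f * aug[col][k]
			}
		}
	}
	m := make([]float64, n)
	for j := range m {
		m[j] = aug[j][n] / aug[j][j]
	}
	return m, nil
}

func main() {
	// Three hosts, two weighers; scores were built with multipliers 2 and -1.
	activations := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	scores := []float64{2, -1, 1}
	m, err := recoverMultipliers(activations, scores)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f\n", m)
}
```

The sketch returns an error when the weigher activations are linearly dependent, since the multipliers are then not uniquely determined. A negative recovered multiplier corresponds to a weigher that pushes hosts down the ranking.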
## Changes
- Rename KPI from `Resource` to `Project`
- Add `project_name`, `domain_id` and `domain_name` as labels to the metrics
## Changes
- Move kvm host capacity metric to infrastructure package
- Refactor metric to align with infrastructure KPI patterns
## Changes
- Remove KPIs that are replaced by the project utilization KPIs
## Changes
- Add the OS version of the hypervisor CRD as a label to the kvm host metrics
Introduces a `FlavorGroupCapacity` CRD and a background capacity controller that pre-computes per-flavor VM slot capacity for each (flavor group × AZ) pair on a fixed interval. This allows future capacity API endpoints to serve data from cache instead of probing the scheduler on every request.

## CRD (`api/v1alpha1/FlavorGroupCapacity`)
One cluster-scoped resource per (flavor group × AZ). Status holds per-flavor fields:
- `totalCapacityVmSlots` / `totalCapacityHosts` (empty-datacenter scenario via the `kvm-report-capacity` pipeline)
- `placeableVms` / `placeableHosts` (current state via `kvm-general-purpose-load-balancing`)
- `totalInstances` and `committedCapacity`, aggregated from `CommittedResource` CRDs

## Controller (`internal/scheduling/reservations/capacity/`)
Runs as a `manager.Runnable` on a configurable interval (default 5 min). For each (group × AZ) pair it issues two scheduler probes, one per pipeline. VM slots are computed as `floor(host.effectiveCapacity.memory / flavor.memoryBytes)`. Stale per-flavor values are preserved on partial probe failure; the `Ready` condition reflects whether all probes succeeded.

## Wiring
`capacity-controller` added to `enabledControllers` in the Helm chart values. `FlavorGroupCapacity` and `FlavorGroupCapacityList` registered as home-cluster GVKs in the multicluster client config.
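The slot formula above can be illustrated with a short sketch. The function name `vmSlots`, the byte-valued `uint64` parameters, and the zero-flavor guard are assumptions made for the example and are not taken from the controller.

```go
package main

import "fmt"

// vmSlots returns floor(hostMemoryBytes / flavorMemoryBytes).
// Unsigned integer division already rounds down. Returning 0 for a
// zero-sized flavor is an assumption of this sketch.
func vmSlots(hostMemoryBytes, flavorMemoryBytes uint64) uint64 {
	if flavorMemoryBytes == 0 {
		return 0
	}
	return hostMemoryBytes / flavorMemoryBytes
}

func main() {
	const gib = uint64(1) << 30
	// A host with 100 GiB of effective memory and a 32 GiB flavor.
	fmt.Println(vmSlots(100*gib, 32*gib))
}
```

In this example the host fits three such VMs, and the remaining 4 GiB does not count as a slot.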
…us, for API usage and Quota usage (#800) Moves CR usage computation out of the API handler and into a dedicated reconciler that persists results in CRD status, making usage data available to both the LIQUID API and quota controller.
The setup-claude-code-action was failing with "ERROR: No matching distribution found for litellm==1.83.10" because litellm 1.83.10 declares requires_python >=3.10,<3.14, but the action was using Python 3.14. This downgrades to Python 3.13 and adds a python constraint to the renovate config so it does not get bumped back up. Assisted-by: Claude Code:claude-opus-4-20250514 [Bash] [Read]
The admission webhook in staging runs the old code and rejects the field with "json: unknown field ignoreAllocations". Removing it here unblocks pipeline deployment. A follow-up PR will re-add it once the new webhook has rolled through all regions.
@claude run command /release 814
Co-authored-by: mblos <mblos@users.noreply.github.com>
Test Coverage Report
Test Coverage 📊: 69.1%
PhilippMatthes left a comment
Needs to bump the cortex-nova chart and cortex library chart.
Release digest — 2026-05-07 — #814
cortex v0.0.47 (sha-7d1745d8)
New features:
- `ProjectQuota` CRD with per-resource, per-AZ quota breakdown and PAYG calculation (#796)
- `FlavorGroupCapacity` CRD + background capacity controller for pre-computed per-flavor VM slot capacity per (flavor group × AZ) (#728)
- `POST /commitments/v1/report-capacity` now uses real `FlavorGroupCapacity` CRD values (replaces placeholder zeros)
- `domain_id` + name on vmware project capacity metrics (#802)
- `domain_id` in vmware project commitment KPI (#806)

Refactors:
- … (`flavor_running_vms`, `host_running_vms`, `resource_capacity_kvm`) (#807)
- … `cortex-ironcore`/`cortex-pods` bundles (#797)
- … `cortex-nova` (#805)
- `testlib.Ptr` replaced with native `new()` (#801)

Fixes:
- … `ignoreAllocations` from kvm-report-capacity pipeline to unblock older admission webhook (#812)
- … `no such host` DNS errors
- … `identity-domains` as KPI dependency
- … `ClusterRoleBinding` to avoid `roleRef` conflict on redeploy (#804)

cortex-nova v0.0.60 (sha-7d1745d8)
Includes cortex v0.0.47. Adds Prometheus datasources and KPI CRD templates for KVM project usage/utilization, and updated RBAC for `FlavorGroupCapacity` + `ProjectQuota` CRDs.