feat: CR usage controller - moving CR usage calculation into CRD status, for API usage and Quota usage #800
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit review settings.
📝 Walkthrough
Extended CommittedResource status with AcceptedSpec and per-resource usage fields; replaced VM-centric usage with AssignedInstances and UsedResources; added a UsageReconciler plus Prometheus monitor; made change-commitments polling generation-aware with conflict-safe rollback and backoff; added Nova Glance image sync and Image model, migrated server persistence to v3, updated extractors, helm, CLI, manager wiring, and many tests.

Changes
- CommittedResource Controller, Usage, API & Tests
- Nova Images Datasource, Server v3 Migration & Extractors
Sequence Diagram(s)
sequenceDiagram
participant API as ChangeCommitmentsAPI
participant CR as CommittedResourceCR
participant Controller as CommittedResourceController
participant Reconciler as UsageReconciler
participant DB as UsageDB
API->>CR: Create/Update CommittedResource (spec)
API->>API: append crWatch{name,generation}
API->>CR: Poll CR (cache) until generation observed
alt Ready observed for expected generation
CR-->>API: Ready=True (ObservedGeneration matches) -> API returns success
else Timeout or Rejected
API->>CR: Rollback spec (RetryOnConflict)
end
Controller->>Controller: Try place Reservations
alt Placement fails and AllowRejection=false
Controller->>Controller: compute retryDelay/exponential backoff
else Placement succeeds
Controller->>CR: set AcceptedSpec and reset retry state
end
Reconciler->>DB: Query project VMs (join images -> os_type)
Reconciler->>CR: Patch Status.AssignedInstances and Status.UsedResources
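For illustration, a minimal sketch of the rollback step from the diagram above (the Timeout or Rejected branch), assuming the CommittedResource type from api/v1alpha1 and a controller-runtime client; the function name, import path, and plumbing are illustrative, not the actual handler code.

```go
package commitmentsapi

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.com/cortex/api/v1alpha1" // placeholder import path
)

// rollbackSpec restores the previously accepted spec after a timeout or
// rejection, retrying on resourceVersion conflicts so concurrent status
// updates by the controller do not make the rollback fail.
func rollbackSpec(ctx context.Context, c client.Client, key client.ObjectKey, previous v1alpha1.CommittedResourceSpec) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		var cr v1alpha1.CommittedResource
		// Re-read the latest object on every attempt so the update uses the
		// current resourceVersion.
		if err := c.Get(ctx, key, &cr); err != nil {
			return err
		}
		cr.Spec = previous
		return c.Update(ctx, &cr)
	})
}
```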
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 7
🧹 Nitpick comments (5)
internal/knowledge/datasources/plugins/openstack/nova/nova_api.go (2)
89-97: 💤 Low value | Implicit precondition: GetAllImages panics if the datasource type is not Images.
api.glance is only assigned when api.conf.Type == NovaDatasourceTypeImages. Any code path that calls GetAllImages on a non-Images instance will hit a nil-pointer dereference inside glanceimages.List(api.glance, ...). Today the sync router enforces this, but it's an easy regression. Consider either initializing Glance unconditionally (cheap — same provider/availability), or returning a clear error from GetAllImages when api.glance == nil.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go` around lines 89 - 97, GetAllImages can panic because api.glance is only set when api.conf.Type == v1alpha1.NovaDatasourceTypeImages; add an explicit nil-check at the start of GetAllImages that returns a clear error (e.g., "glance client not initialized: datasource type is not Images") if api.glance == nil, and/or alternatively initialize api.glance unconditionally in the constructor code shown (the openstack.NewImageV2 call) so api.glance is always non-nil; reference GetAllImages, api.glance, NovaDatasourceTypeImages, and the glanceimages.List call when locating where to add the guard or the unconditional initialization.
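As a sketch of the suggested guard, assuming only that api.glance stays nil for non-Images datasources; the Image and novaAPI types below are pared-down stand-ins for the real nova_api.go definitions, not the actual patch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for the real nova_api.go types.
type glanceClient struct{}

type Image struct {
	ID     string
	OSType string
}

type novaAPI struct {
	glance *glanceClient // only initialized when the datasource type is Images
}

// GetAllImages fails fast with a descriptive error instead of a nil-pointer
// panic when the Glance client was never initialized.
func (api *novaAPI) GetAllImages(ctx context.Context) ([]Image, error) {
	if api.glance == nil {
		return nil, errors.New("glance client not initialized: datasource type is not Images")
	}
	// ...list and convert Glance images here...
	return nil, nil
}

func main() {
	api := &novaAPI{} // simulates a non-Images datasource
	if _, err := api.GetAllImages(context.Background()); err != nil {
		fmt.Println(err)
	}
}
```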
459-479: ⚡ Quick win | Add request-timer instrumentation for parity with other Get* methods.
Every other Get* method in this file wraps the API call(s) with api.mon.RequestTimer.WithLabelValues(label) for Prometheus tracing. GetAllImages skips it, leaving Glance pagination unmonitored. Adding the same wrapper keeps observability uniform across datasources.
♻️ Proposed change
 func (api *novaAPI) GetAllImages(ctx context.Context) ([]Image, error) {
+	label := Image{}.TableName()
+	slog.Info("fetching nova data", "label", label)
+	if api.mon.RequestTimer != nil {
+		hist := api.mon.RequestTimer.WithLabelValues(label)
+		timer := prometheus.NewTimer(hist)
+		defer timer.ObserveDuration()
+	}
 	var result []Image
 	opts := glanceimages.ListOpts{Limit: 1000}
 	err := glanceimages.List(api.glance, opts).EachPage(ctx, func(_ context.Context, page pagination.Page) (bool, error) {
 		imgs, err := glanceimages.ExtractImages(page)
 		if err != nil {
 			return false, err
 		}
 		for _, img := range imgs {
 			result = append(result, Image{
 				ID:     img.ID,
 				OSType: deriveOSType(img.Properties, img.Tags),
 			})
 		}
 		return true, nil
 	})
 	if err != nil {
 		return nil, fmt.Errorf("failed to list Glance images: %w", err)
 	}
+	slog.Info("fetched", "label", label, "count", len(result))
 	return result, nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go` around lines 459 - 479, GetAllImages is missing Prometheus request-timer instrumentation; wrap the Glance call with the same timer pattern used by other Get* methods: call api.mon.RequestTimer.WithLabelValues("glance_images").StartTimer() (or the existing label naming convention used elsewhere), assign to a timer variable and defer timer.ObserveDuration(), then run the glanceimages.List(...).EachPage call as before inside that timed scope; update function novaAPI.GetAllImages to include this StartTimer/ObserveDuration pairing.

internal/knowledge/datasources/plugins/openstack/nova/nova_types.go (1)
161-168: 💤 Low value | Silent failure when image JSON has unexpected shape.
If the image field is a JSON object but unmarshaling into imageMap fails (e.g., a future Nova response with a non-string id), the error is silently dropped and ImageRef is left empty. That makes such servers indistinguishable from volume-booted servers downstream and harder to diagnose. Consider returning the error (consistent with the flavor/fault handling above) or at least logging it.
♻️ Proposed change
-	if err := json.Unmarshal(aux.Image, &imageMap); err == nil {
-		s.ImageRef = imageMap.ID
-	}
+	if err := json.Unmarshal(aux.Image, &imageMap); err != nil {
+		return err
+	}
+	s.ImageRef = imageMap.ID
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_types.go` around lines 161 - 168, The code currently swallows JSON unmarshal errors for aux.Image, leaving s.ImageRef empty; instead, when json.Unmarshal(aux.Image, &imageMap) returns an error you should not drop it—either return the error from the surrounding UnmarshalJSON (or the equivalent parsing function) or log it consistently with the earlier flavor/fault handling. Modify the block that parses aux.Image so that on err != nil you return fmt.Errorf("failed to parse image JSON: %w", err) (or call the same logger used elsewhere) rather than ignoring the error, keeping the existing assignment to s.ImageRef when successful (symbols: aux.Image, imageMap, s.ImageRef).

internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go (1)
59-61: ⚡ Quick win | Add a dedicated TestNovaSyncer_SyncImages test to match the pattern of every other sync method.
Every existing sync method has a corresponding named test (TestNovaSyncer_SyncServers, SyncHypervisors, SyncFlavors, etc.), but SyncAllImages has none. The mock is already wired up — the test would be a one-to-one copy of the existing pattern.
✅ Proposed test for `SyncAllImages`
+func TestNovaSyncer_SyncImages(t *testing.T) {
+	dbEnv := testlibDB.SetupDBEnv(t)
+	testDB := db.DB{DbMap: dbEnv.DbMap}
+	defer dbEnv.Close()
+	mon := datasources.Monitor{}
+	syncer := &NovaSyncer{
+		DB:   testDB,
+		Mon:  mon,
+		Conf: v1alpha1.NovaDatasource{Type: v1alpha1.NovaDatasourceTypeImages},
+		API:  &mockNovaAPI{},
+	}
+
+	ctx := t.Context()
+	if err := syncer.Init(ctx); err != nil {
+		t.Fatalf("failed to init nova syncer: %v", err)
+	}
+	n, err := syncer.SyncAllImages(ctx)
+	if err != nil {
+		t.Fatalf("expected no error, got %v", err)
+	}
+	if n != 1 {
+		t.Fatalf("expected 1 image, got %d", n)
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go` around lines 59 - 61, Add a new unit test named TestNovaSyncer_SyncImages that mirrors the pattern used by TestNovaSyncer_SyncServers/SyncHypervisors/SyncFlavors: instantiate the NovaSyncer with the mockNovaAPI (which implements GetAllImages returning the sample image), call the sync method (e.g., NovaSyncer.SyncAllImages or the syncer method that triggers image syncing), and assert that the returned/side-effect results match the expected image list and no error occurred; use the same setup/teardown, mock injection, and assertion style as the other sync tests so it fits the existing test suite conventions.

internal/scheduling/reservations/commitments/usage_reconciler_monitor.go (1)
18-55: ⚡ Quick win | NewUsageReconcilerMonitor returns a value type, but Describe/Collect have pointer receivers — only *UsageReconcilerMonitor satisfies prometheus.Collector.
Any caller that tries to use the returned value directly as a prometheus.Collector (e.g., prometheus.Register(monitor) where monitor := NewUsageReconcilerMonitor()) will get a compile error. The caller is forced to take the address explicitly (&monitor), which is non-obvious given a constructor name.
The idiomatic fix is to return *UsageReconcilerMonitor:
♻️ Proposed fix
-func NewUsageReconcilerMonitor() UsageReconcilerMonitor {
-	m := UsageReconcilerMonitor{
+func NewUsageReconcilerMonitor() *UsageReconcilerMonitor {
+	m := &UsageReconcilerMonitor{
 		reconcileDuration: prometheus.NewHistogramVec(prometheus.HistogramOpts{
 			...
 		}, []string{"result"}),
 		...
 	}
 	// Pre-initialize result labels so metrics appear before the first reconcile.
 	m.reconcileDuration.WithLabelValues("success")
 	m.reconcileDuration.WithLabelValues("error")
 	return m
 }
Please verify how cmd/manager/main.go uses the returned value — if it already passes &monitor to both prometheus.Register and NewUsageReconciler, this is a style concern; if not, it's a compile error waiting to happen.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/usage_reconciler_monitor.go` around lines 18 - 55, NewUsageReconcilerMonitor currently returns a value (UsageReconcilerMonitor) but the methods Describe and Collect have pointer receivers, so only *UsageReconcilerMonitor implements prometheus.Collector; change NewUsageReconcilerMonitor to return *UsageReconcilerMonitor and allocate/return a pointer (e.g., &UsageReconcilerMonitor{...}) so callers can pass it directly to prometheus.Register and NewUsageReconciler without taking the address; then verify usages in cmd/manager/main.go (prometheus.Register(...) and NewUsageReconciler(...)) to ensure they accept the pointer and update any call sites that expect a value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/manager/main.go`:
- Around line 562-571: The code registers a UsageReconciler even when
commitmentsUsageDB is nil (i.e. no datasource configured), leaving the feature
enabled but non-functional; add a guard before creating/registering
UsageReconciler and its UsageReconcilerMonitor that checks
commitmentsConfig.DatasourceName and commitmentsUsageDB and fail fast (return an
error or log.Fatalf) with a clear message if the datasource is not configured;
specifically, before calling commitments.NewUsageReconcilerMonitor(), verify
commitmentsConfig.DatasourceName != "" and commitmentsUsageDB != nil and only
then create the monitor and call
(&commitments.UsageReconciler{...}).SetupWithManager(mgr, multiclusterClient).
In `@internal/scheduling/reservations/commitments/api/report_usage_test.go`:
- Around line 906-923: The helper toCommittedResourceCRD on UsageTestReservation
drops the commitment StartTime causing ordering-sensitive tests to break; update
toCommittedResourceCRD to copy the reservation's StartTime into the
CommittedResource spec (set Spec.StartTime = tr.StartTime or the appropriate
metav1.Time wrapper used by v1alpha1.CommittedResourceSpec) so synthetic CRDs
preserve the original commitment start time used by the reconciler (reference
UsageTestReservation and toCommittedResourceCRD).
In `@internal/scheduling/reservations/commitments/api/usage_test.go`:
- Around line 335-346: The synthetic CommittedResource created as cr lacks the
reservation time window, so add the reservation's StartTime and EndTime into
cr.Spec to mirror production activation windows; set cr.Spec.StartTime and
cr.Spec.EndTime from the test reservation (e.g.,
r.Spec.CommittedResourceReservation.StartTime and
r.Spec.CommittedResourceReservation.EndTime or r.Spec.StartTime/EndTime as
appropriate) so the reconciler sees the same activation window when exercising
CommittedResource handling.
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`:
- Around line 352-363: The legacy fallback currently calls
FromCommittedResource(*cr) which uses the possibly-mutated current spec;
instead, when cr.Status.AcceptedSpec == nil build a temporary CommittedResource
spec that preserves placement/selector/amount from the last-accepted status:
create tempCR := v1alpha1.CommittedResource{Spec: cr.Spec} then overwrite the
placement-related fields (AZ/availability zone, project, domain, flavor/group
and the resource amount) from the corresponding status fields (e.g. use
cr.Status.AcceptedAmount.Value() and the status fields that record accepted
AZ/project/domain/flavor) before calling state, err =
FromCommittedResource(tempCR); keep the existing state.TotalMemoryBytes =
cr.Status.AcceptedAmount.Value() assignment after successful
FromCommittedResource. This ensures FromCommittedResource uses the last-accepted
placement rather than the current mutated CR.
- Around line 471-477: The slowdown gate should only apply to the generation
that actually failed: change the condition in the
committed_resource_controller.go block that currently checks
cr.Status.ConsecutiveFailures >= maxConsecutiveFailuresForSlowdown to also
require the failure is for the current generation (e.g.
cr.Status.FailureGeneration == cr.Generation or
cr.Status.FailedObservedGeneration == cr.ObjectMeta.Generation). Also update the
code that increments/reset ConsecutiveFailures (the failure-recording logic) to
set FailureGeneration = cr.Generation when you increment failures and to reset
ConsecutiveFailures and FailureGeneration when the spec/generation advances (so
new generations start with a fresh count). This ensures suppression only occurs
for the generation that produced the failures.
In `@internal/scheduling/reservations/commitments/usage_reconciler.go`:
- Around line 46-49: Early-return on non-active commitment states (checked via
cr.Spec.State against CommitmentStatusConfirmed and CommitmentStatusGuaranteed)
leaves stale cr.Status.AssignedInstances and cr.Status.UsedResources intact;
instead, when the commitment is not active (i.e., states
planned/pending/superseded/expired), clear/reset cr.Status.AssignedInstances and
cr.Status.UsedResources to empty/zero and persist the change via the status
update path (e.g., r.Status().Update(ctx, cr)) before returning so stale
assignments are not reported as committed usage. Ensure this logic is placed in
the same reconciliation branch that currently returns early after the
cr.Spec.State check.
---
Nitpick comments:
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go`:
- Around line 89-97: GetAllImages can panic because api.glance is only set when
api.conf.Type == v1alpha1.NovaDatasourceTypeImages; add an explicit nil-check at
the start of GetAllImages that returns a clear error (e.g., "glance client not
initialized: datasource type is not Images") if api.glance == nil, and/or
alternatively initialize api.glance unconditionally in the constructor code
shown (the openstack.NewImageV2 call) so api.glance is always non-nil; reference
GetAllImages, api.glance, NovaDatasourceTypeImages, and the glanceimages.List
call when locating where to add the guard or the unconditional initialization.
- Around line 459-479: GetAllImages is missing Prometheus request-timer
instrumentation; wrap the Glance call with the same timer pattern used by other
Get* methods: call
api.mon.RequestTimer.WithLabelValues("glance_images").StartTimer() (or the
existing label naming convention used elsewhere), assign to a timer variable and
defer timer.ObserveDuration(), then run the glanceimages.List(...).EachPage call
as before inside that timed scope; update function novaAPI.GetAllImages to
include this StartTimer/ObserveDuration pairing.
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go`:
- Around line 59-61: Add a new unit test named TestNovaSyncer_SyncImages that
mirrors the pattern used by
TestNovaSyncer_SyncServers/SyncHypervisors/SyncFlavors: instantiate the
NovaSyncer with the mockNovaAPI (which implements GetAllImages returning the
sample image), call the sync method (e.g., NovaSyncer.SyncAllImages or the
syncer method that triggers image syncing), and assert that the
returned/side-effect results match the expected image list and no error
occurred; use the same setup/teardown, mock injection, and assertion style as
the other sync tests so it fits the existing test suite conventions.
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_types.go`:
- Around line 161-168: The code currently swallows JSON unmarshal errors for
aux.Image, leaving s.ImageRef empty; instead, when json.Unmarshal(aux.Image,
&imageMap) returns an error you should not drop it—either return the error from
the surrounding UnmarshalJSON (or the equivalent parsing function) or log it
consistently with the earlier flavor/fault handling. Modify the block that
parses aux.Image so that on err != nil you return fmt.Errorf("failed to parse
image JSON: %w", err) (or call the same logger used elsewhere) rather than
ignoring the error, keeping the existing assignment to s.ImageRef when
successful (symbols: aux.Image, imageMap, s.ImageRef).
In `@internal/scheduling/reservations/commitments/usage_reconciler_monitor.go`:
- Around line 18-55: NewUsageReconcilerMonitor currently returns a value
(UsageReconcilerMonitor) but the methods Describe and Collect have pointer
receivers, so only *UsageReconcilerMonitor implements prometheus.Collector;
change NewUsageReconcilerMonitor to return *UsageReconcilerMonitor and
allocate/return a pointer (e.g., &UsageReconcilerMonitor{...}) so callers can
pass it directly to prometheus.Register and NewUsageReconciler without taking
the address; then verify usages in cmd/manager/main.go (prometheus.Register(...)
and NewUsageReconciler(...)) to ensure they accept the pointer and update any
call sites that expect a value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: be511876-1fdb-427f-959d-63e23f2e391a
📒 Files selected for processing (33)
- api/v1alpha1/committed_resource_types.go
- api/v1alpha1/datasource_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- cmd/manager/main.go
- helm/bundles/cortex-nova/templates/datasources.yaml
- helm/bundles/cortex-nova/values.yaml
- helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml
- internal/knowledge/datasources/plugins/openstack/controller_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_api.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_types.go
- internal/knowledge/extractor/plugins/compute/libvirt_domain_cpu_steal_pct.sql
- internal/knowledge/extractor/plugins/compute/vm_host_residency.sql
- internal/knowledge/extractor/plugins/compute/vm_life_span.sql
- internal/knowledge/extractor/plugins/compute/vrops_hostsystem_resolver.sql
- internal/knowledge/extractor/plugins/compute/vrops_project_noisiness.sql
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/config.go
- internal/scheduling/reservations/commitments/field_index.go
- internal/scheduling/reservations/commitments/reservation_manager.go
- internal/scheduling/reservations/commitments/reservation_manager_test.go
- internal/scheduling/reservations/commitments/syncer.go
- internal/scheduling/reservations/commitments/usage.go
- internal/scheduling/reservations/commitments/usage_internals_test.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
- internal/scheduling/reservations/commitments/usage_reconciler_monitor.go
- tools/visualize-committed-resources/main.go
Actionable comments posted: 4
🧹 Nitpick comments (2)
internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go (1)
276-306: ⚡ Quick win | Cover the Sync() dispatch for image datasources too.
This test proves SyncAllImages() works, but the controller path goes through NovaSyncer.Sync(). Switching this assertion to syncer.Sync(ctx) would also protect the new NovaDatasourceTypeImages branch while keeping the DB checks below unchanged.
Minimal change
-	n, err := syncer.SyncAllImages(ctx)
+	n, err := syncer.Sync(ctx)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go` around lines 276 - 306, Replace the direct call to SyncAllImages with the controller dispatch path by invoking NovaSyncer.Sync(ctx) in TestNovaSyncer_SyncImages so the NovaDatasourceTypeImages branch is exercised; keep the same assertions (expect no error, expect count n == 1) and the subsequent DB checks for Image table unchanged, referencing the test function TestNovaSyncer_SyncImages and the NovaSyncer methods Sync and SyncAllImages and the NovaDatasource{Type: v1alpha1.NovaDatasourceTypeImages} configuration so the test still validates the DB contents for img-1 and OSType windows8Server64Guest.

internal/knowledge/datasources/plugins/openstack/nova/nova_api.go (1)
494-523: ⚡ Quick win | Add focused tests for deriveOSType.
This helper now defines the contract for vmware_ostype, ostype: tags, ambiguous-tag fallback, and "unknown", but the provided tests only mock a precomputed OSType. A small table-driven unit test here would lock down the behavior before CR usage depends on it.
Example cases worth pinning down
func TestDeriveOSType(t *testing.T) {
	tests := []struct {
		name       string
		properties map[string]any
		tags       []string
		want       string
	}{
		{
			name:       "vmware_ostype wins",
			properties: map[string]any{"vmware_ostype": "windows8Server64Guest"},
			want:       "windows8Server64Guest",
		},
		{
			name: "single ostype tag",
			tags: []string{"ostype:rhel7_64Guest"},
			want: "rhel7_64Guest",
		},
		{
			name: "multiple ostype tags are ambiguous",
			tags: []string{"ostype:rhel7_64Guest", "ostype:windows8Server64Guest"},
			want: "unknown",
		},
		{
			name: "fallback to unknown",
			want: "unknown",
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := deriveOSType(tt.properties, tt.tags); got != tt.want {
				t.Fatalf("deriveOSType() = %q, want %q", got, tt.want)
			}
		})
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go` around lines 494 - 523, Add focused table-driven unit tests for deriveOSType to lock its contract: create TestDeriveOSType that exercises (1) properties with a non-empty string "vmware_ostype" wins, (2) single "ostype:..." tag returns that value, (3) multiple "ostype:..." tags produce "unknown" (ambiguous), (4) no info returns "unknown", and also cover a non-string or empty "vmware_ostype" value falling back to tags; implement using a tests slice with t.Run and compare deriveOSType(tt.properties, tt.tags) to the expected want.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`:
- Around line 134-137: The code leaves the
CommittedResource.Status.ConsecutiveFailures count unchanged when
ApplyCommitmentState succeeded but the controller returns early to wait for
child readiness (e.g., the branch that returns ctrl.Result{} with
setNotReady(reason Reserving, "waiting for reservation placement")); update the
reconcile flow so that after a successful ApplyCommitmentState (or any code path
where placement succeeded and we are only passively waiting for child resources)
you reset or clear Status.ConsecutiveFailures (or set a dedicated
lastPlacementFailed boolean) before calling setNotReady or returning without
RequeueAfter; specifically adjust the logic around ApplyCommitmentState, the
setNotReady calls in the Reserving branches and similar branches noted (also the
other branches at the same pattern) to ensure the slowdown gate no longer
observes stale consecutive failure counts.
In `@internal/scheduling/reservations/commitments/usage_reconciler.go`:
- Line 137: The metric usage_reconciler.assignedInstances currently calls
assignedInstances.Set(float64(totalAssigned)) and overwrites values across
projects; update UsageReconcilerMonitor so assignedInstances is either created
with a project_id label and Set is called with that label for each project's
totalAssigned, or change the metric to an aggregate (e.g., maintain a
cluster-wide sum in the reconciler and Set once) depending on intended
semantics—add project_id as a label on the GaugeVec and call
assignedInstances.WithLabelValues(projectID).Set(...) (or implement cluster
aggregation and remove per-project sets); confirm which approach matches
UsageReconcilerMonitor semantics and apply consistently across reconciles.
- Around line 117-131: The statusAge histogram is being observed inside the
inner loop (for _, state := range group) using cr.Status.LastUsageReconcileAt,
causing identical samples to be recorded once per commitment; move the single
observation out of the inner loops so you call r.Monitor.statusAge.Observe(...)
exactly once per reconcile (using
now.Time.Sub(cr.Status.LastUsageReconcileAt.Time).Seconds()) after the loops
complete (or alternatively change to use each
state.Target.Status.LastUsageReconcileAt if you intend per-commitment samples);
update the block around writeUsageStatus/state loop (and variables
written/totalAssigned) so observation happens once per reconcile rather than
inside writeUsageStatus loop.
---
Nitpick comments:
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api.go`:
- Around line 494-523: Add focused table-driven unit tests for deriveOSType to
lock its contract: create TestDeriveOSType that exercises (1) properties with a
non-empty string "vmware_ostype" wins, (2) single "ostype:..." tag returns that
value, (3) multiple "ostype:..." tags produce "unknown" (ambiguous), (4) no info
returns "unknown", and also cover a non-string or empty "vmware_ostype" value
falling back to tags; implement using a tests slice with t.Run and compare
deriveOSType(tt.properties, tt.tags) to the expected want.
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go`:
- Around line 276-306: Replace the direct call to SyncAllImages with the
controller dispatch path by invoking NovaSyncer.Sync(ctx) in
TestNovaSyncer_SyncImages so the NovaDatasourceTypeImages branch is exercised;
keep the same assertions (expect no error, expect count n == 1) and the
subsequent DB checks for Image table unchanged, referencing the test function
TestNovaSyncer_SyncImages and the NovaSyncer methods Sync and SyncAllImages and
the NovaDatasource{Type: v1alpha1.NovaDatasourceTypeImages} configuration so the
test still validates the DB contents for img-1 and OSType windows8Server64Guest.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4d1aa827-b4c6-4cb0-a435-bb5de63e0a40
📒 Files selected for processing (6)
- cmd/manager/main.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_api.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/manager/main.go
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/scheduling/reservations/commitments/committed_resource_controller.go (1)
347-352: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift | Don't drop existing reservations when rolling back a legacy CR without AcceptedSpec.
This fallback still deletes child reservations for CRs that were accepted before status.acceptedSpec existed. After that, isRejectedForGeneration() keeps short-circuiting the same generation, so the accepted placement is never recreated until a new spec update arrives. That can deprovision an already-active commitment on its first failed mutation after rollout.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/committed_resource_controller.go` around lines 347 - 352, The code currently deletes child reservations when status.AcceptedSpec is absent, which removes valid placements for legacy-accepted CRs; instead, when AcceptedSpec is nil do not call r.deleteChildReservations(ctx, cr) — preserve existing child reservations and return without deleting (or requeue) so the previously-accepted placement remains intact; adjust the conditional logic around AcceptedSpec and isRejectedForGeneration() to avoid short-circuiting the same generation (i.e., skip teardown path and keep reservation state) and update the comment in committed_resource_controller.go near the AcceptedSpec check to reflect this behavior.
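A minimal sketch of the behavior the comment asks for, assuming the helper and field names it mentions (deleteChildReservations, Status.AcceptedSpec); the reconciler plumbing is reduced to a small interface for illustration and is not the actual controller code.

```go
package commitments

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"

	v1alpha1 "example.com/cortex/api/v1alpha1" // placeholder import path
)

type childReservationDeleter interface {
	deleteChildReservations(ctx context.Context, cr *v1alpha1.CommittedResource) error
}

// handleRejectedGeneration tears down reservations for a rejected spec, but
// leaves legacy CRs (accepted before status.acceptedSpec existed) untouched so
// an active commitment is not deprovisioned by its first failed mutation.
func handleRejectedGeneration(ctx context.Context, r childReservationDeleter, cr *v1alpha1.CommittedResource) (ctrl.Result, error) {
	if cr.Status.AcceptedSpec == nil {
		// Keep the existing child reservations and revisit later.
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	}
	if err := r.deleteChildReservations(ctx, cr); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```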
🧹 Nitpick comments (2)
internal/knowledge/datasources/plugins/openstack/nova/nova_api_test.go (1)
541-599: 💤 Low value | TestDeriveOSType looks good.
The 7 cases cleanly exercise every distinct branch: property precedence, type-mismatch fallthrough, empty-string fallthrough, single/multiple tag disambiguation, and no-input defaults.
One optional gap: none of the cases passes nil as the properties argument (all use map[string]any{}). If callers can supply nil — e.g. when a Glance image carries no extra properties — adding a case like:
+	{
+		name:       "nil properties falls through to tags",
+		properties: nil,
+		tags:       []string{"ostype:alpine"},
+		want:       "alpine",
+	},
would confirm nil-safety at the cost of one extra row. Not urgent since nil-map lookup in Go is safe, but it documents the contract explicitly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api_test.go` around lines 541 - 599, Add a nil-properties test case to TestDeriveOSType to document and verify nil-safety: add a table row where properties is nil (instead of map[string]any{}), provide an appropriate tags slice (e.g. with a single ostype tag or none) and the expected want value, then run the table-driven tests; update the tests slice in TestDeriveOSType that constructs test cases for deriveOSType so it includes this nil case to ensure deriveOSType handles a nil properties map.

internal/scheduling/reservations/commitments/usage.go (1)
173-177: 🏗️ Heavy lift | Avoid cluster-wide CommittedResource scans in the usage path.
Both helpers list every CommittedResource and then filter in memory by project. CalculateUsage() calls BuildVMAssignmentsFromStatus() on every API request, so report latency grows with total cluster CR count instead of the target project's footprint. Please add a spec.projectID field index and use client.MatchingFields here so both the API and the usage reconciler stay on a per-project read path.
Also applies to: 205-208
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/usage.go` around lines 173 - 177, The code currently lists all CommittedResource objects cluster-wide in UsageCalculator.BuildVMAssignmentsFromStatus (and the similar helper at the 205-208 region called by CalculateUsage), causing per-request latency to scale with total CRs; add a field index for spec.projectID on the CommittedResource type and change the client.List calls to use client.MatchingFields{"spec.projectID": projectID} (i.e., pass the projectID via client.MatchingFields when listing) so BuildVMAssignmentsFromStatus and the other helper perform a scoped per-project read instead of a full cluster scan; update any related List usages in CalculateUsage to the same pattern.
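A sketch of the per-project read path suggested above, assuming the CommittedResource/CommittedResourceList types from api/v1alpha1 and a Go field named ProjectID behind the spec.projectID key; the index name, import path, and function names are illustrative.

```go
package commitments

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.com/cortex/api/v1alpha1" // placeholder import path
)

const projectIDIndex = "spec.projectID"

// RegisterProjectIDIndex is called once at manager startup so cached List
// calls can be filtered by project instead of scanning every CommittedResource.
func RegisterProjectIDIndex(ctx context.Context, mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &v1alpha1.CommittedResource{}, projectIDIndex,
		func(obj client.Object) []string {
			cr := obj.(*v1alpha1.CommittedResource)
			return []string{cr.Spec.ProjectID}
		})
}

// listProjectCommittedResources replaces the unbounded cluster-wide List.
func listProjectCommittedResources(ctx context.Context, c client.Client, projectID string) (*v1alpha1.CommittedResourceList, error) {
	var list v1alpha1.CommittedResourceList
	if err := c.List(ctx, &list, client.MatchingFields{projectIDIndex: projectID}); err != nil {
		return nil, err
	}
	return &list, nil
}
```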
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/scheduling/reservations/commitments/usage.go`:
- Around line 677-678: The SELECT currently uses COALESCE(i.os_type, 'unknown')
which only handles NULLs; change it to use COALESCE(NULLIF(i.os_type, ''),
'unknown') so empty-string image os_type values are normalized to "unknown" —
update the SQL expression that produces AS os_type (the COALESCE around
i.os_type) in internal/scheduling/reservations/commitments/usage.go.
---
Duplicate comments:
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`:
- Around line 347-352: The code currently deletes child reservations when
status.AcceptedSpec is absent, which removes valid placements for
legacy-accepted CRs; instead, when AcceptedSpec is nil do not call
r.deleteChildReservations(ctx, cr) — preserve existing child reservations and
return without deleting (or requeue) so the previously-accepted placement
remains intact; adjust the conditional logic around AcceptedSpec and
isRejectedForGeneration() to avoid short-circuiting the same generation (i.e.,
skip teardown path and keep reservation state) and update the comment in
committed_resource_controller.go near the AcceptedSpec check to reflect this
behavior.
---
Nitpick comments:
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_api_test.go`:
- Around line 541-599: Add a nil-properties test case to TestDeriveOSType to
document and verify nil-safety: add a table row where properties is nil (instead
of map[string]any{}), provide an appropriate tags slice (e.g. with a single
ostype tag or none) and the expected want value, then run the table-driven
tests; update the tests slice in TestDeriveOSType that constructs test cases for
deriveOSType so it includes this nil case to ensure deriveOSType handles a nil
properties map.
In `@internal/scheduling/reservations/commitments/usage.go`:
- Around line 173-177: The code currently lists all CommittedResource objects
cluster-wide in UsageCalculator.BuildVMAssignmentsFromStatus (and the similar
helper at the 205-208 region called by CalculateUsage), causing per-request
latency to scale with total CRs; add a field index for spec.projectID on the
CommittedResource type and change the client.List calls to use
client.MatchingFields{"spec.projectID": projectID} (i.e., pass the projectID via
client.MatchingFields when listing) so BuildVMAssignmentsFromStatus and the
other helper perform a scoped per-project read instead of a full cluster scan;
update any related List usages in CalculateUsage to the same pattern.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d7e6fe85-3793-4214-8307-65f384518531
📒 Files selected for processing (21)
- api/v1alpha1/committed_resource_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- helm/bundles/cortex-nova/values.yaml
- helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml
- internal/knowledge/datasources/plugins/openstack/nova/nova_api_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_types.go
- internal/scheduling/reservations/commitments/api/change_commitments.go
- internal/scheduling/reservations/commitments/api/change_commitments_e2e_test.go
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/integration_test.go
- internal/scheduling/reservations/commitments/syncer.go
- internal/scheduling/reservations/commitments/usage.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
- internal/scheduling/reservations/commitments/usage_reconciler_monitor.go
- internal/scheduling/reservations/commitments/usage_reconciler_test.go
- tools/visualize-committed-resources/main.go
🚧 Files skipped from review as they are similar to previous changes (13)
- internal/knowledge/datasources/plugins/openstack/nova/nova_sync_test.go
- helm/bundles/cortex-nova/values.yaml
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/usage_reconciler_monitor.go
- api/v1alpha1/zz_generated.deepcopy.go
- internal/scheduling/reservations/commitments/integration_test.go
- internal/knowledge/datasources/plugins/openstack/nova/nova_types.go
- internal/scheduling/reservations/commitments/api/change_commitments_e2e_test.go
- helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml
- internal/scheduling/reservations/commitments/api/change_commitments_test.go
- internal/scheduling/reservations/commitments/syncer.go
- tools/visualize-committed-resources/main.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (2)
internal/scheduling/reservations/commitments/usage.go (1)
173-188: Both BuildVMAssignmentsFromStatus and buildCommitmentCapacityMap issue an unbounded full-cluster List on every call.
Neither passes a field selector or client.InNamespace option to client.List. Both functions are called once per project reconcile/API request. In a cluster with many CommittedResource CRs spread across namespaces, this means each project query triggers a full watch-cache scan. Consider caching or adding a field-index for spec.projectID (registered at manager startup) so both list calls can use client.MatchingFields to pre-filter, avoiding the in-memory project filter loop.
Also applies to: 198-260
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/usage.go` around lines 173 - 188, BuildVMAssignmentsFromStatus and buildCommitmentCapacityMap currently call c.client.List with no selector, causing a full-cluster scan each call; register a field index for v1alpha1.CommittedResource.spec.projectID at manager startup and then change the List invocations in BuildVMAssignmentsFromStatus and buildCommitmentCapacityMap to use client.MatchingFields(map[string]string{"spec.projectID": projectID}) (or client.InNamespace when appropriate) so the API server/cache pre-filters by projectID instead of filtering in-memory; alternatively implement a short-lived cache keyed by projectID to avoid repeated full list calls.

internal/scheduling/reservations/commitments/committed_resource_controller.go (1)
368-384: 💤 Low value | retryDelay looks correct; one minor robustness note.
The exponent derivation (base → 2·base → 4·base → … → 64·base) and the MaxRequeueInterval cap are sound. One edge case worth considering: if r.Conf.RequeueIntervalRetry.Duration is misconfigured to 0, the loop spins to exp == 6 (loop terminates fine, no hang) and the function returns 0, which short-circuits any backoff and effectively disables retry pacing for the AllowRejection=false paths. A small early-out (if base <= 0 { return defaultRetry } or equivalent validation at config load) would make the misconfiguration mode obvious instead of silent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/committed_resource_controller.go` around lines 368 - 384, The retryDelay method should guard against a zero or negative base backoff (r.Conf.RequeueIntervalRetry.Duration) which silently disables pacing; add an early-out at the top of retryDelay (before using base and computing exp) that returns a safe default duration (or the validated configured default) when base <= 0, or alternatively surface a config validation error during startup; update retryDelay (function name) to return that default instead of proceeding when r.Conf.RequeueIntervalRetry.Duration is non-positive, keeping the existing doubling logic and MaxRequeueInterval cap (r.Conf.MaxRequeueInterval.Duration) intact.
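A standalone sketch of the backoff with the suggested early-out for a non-positive base; the config struct and the 30-second default are illustrative, while the 64x cap and the MaxRequeueInterval clamp follow the comment.

```go
package main

import (
	"fmt"
	"time"
)

type retryConf struct {
	RequeueIntervalRetry time.Duration // base backoff between placement retries
	MaxRequeueInterval   time.Duration // hard cap on the requeue delay
}

// retryDelay doubles the delay per consecutive failure (base, 2*base, ... up to
// 64*base) and clamps it to MaxRequeueInterval. A non-positive base falls back
// to a default instead of silently disabling retry pacing.
func retryDelay(conf retryConf, consecutiveFailures int) time.Duration {
	base := conf.RequeueIntervalRetry
	if base <= 0 {
		base = 30 * time.Second // assumed default; validate at config load in real code
	}
	exp := consecutiveFailures
	if exp < 0 {
		exp = 0
	}
	if exp > 6 {
		exp = 6
	}
	delay := base << exp
	if conf.MaxRequeueInterval > 0 && delay > conf.MaxRequeueInterval {
		delay = conf.MaxRequeueInterval
	}
	return delay
}

func main() {
	conf := retryConf{RequeueIntervalRetry: 10 * time.Second, MaxRequeueInterval: 5 * time.Minute}
	for failures := 0; failures <= 7; failures++ {
		fmt.Printf("failures=%d delay=%s\n", failures, retryDelay(conf, failures))
	}
}
```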
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml`:
- Around line 33-39: The new printer column and usedResources fields are
inconsistent with spec.resourceType: replace or align the jsonPath references
for UsedCPU (and other occurrences referenced at 130-137 and 361-363) so they
use the same key as the reconciler output (e.g., change
.status.usedResources.cpu to .status.usedResources.cores) or update
spec.resourceType to use "cpu" everywhere; update the printer column name(s)
(UsedCPU) and any descriptions to match the chosen key to ensure the
reconciler-populated field is visible.
- Around line 345-352: The schema for usageObservedGeneration currently allows
negative integers; add a non-negative constraint by setting the OpenAPI
validation minimum to 0 for the usageObservedGeneration property (keep format:
int64 and type: integer) so the CRD only accepts values >= 0 and follows
observedGeneration semantics.
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`:
- Around line 416-436: setReadyConditionFalse currently preserves
LastTransitionTime when status and reason are unchanged, causing a
crossing-generation leak; change it so that if the existing condition's
ObservedGeneration != generation (i.e., generation advanced) you treat it as a
transition and set newCond.LastTransitionTime = now regardless of status/reason
and resetTimer flag. Update setReadyConditionFalse to check c.ObservedGeneration
against the provided generation and refresh LastTransitionTime when they differ;
also update the surrounding doc comment to state that a generation change
refreshes the timer so retryDelay and the watch-suppression logic see time spent
only for the current generation.
In `@internal/scheduling/reservations/commitments/e2e_checks.go`:
- Around line 126-146: CheckCommitmentsRoundTrip ignores the config.NoCleanup
flag because e2eRoundTripResource doesn't accept or honor a noCleanup parameter
and always performs deferred deletion; update e2eRoundTripResource (and its call
site in CheckCommitmentsRoundTrip) to accept a noCleanup boolean derived from
config.NoCleanup, then condition the deferred cleanup inside
e2eRoundTripResource on !noCleanup (following the same pattern used in
e2eBatchFlavorGroupResource) so the config.NoCleanup setting prevents test
commitments from being deleted.
In `@internal/scheduling/reservations/commitments/usage.go`:
- Around line 154-165: The log computes assignedToCommitments by iterating all
entries in vmAssignments (which can be stale and longer than live vms), causing
payg = len(vms)-assignedToCommitments to go negative; change the counting to
derive assignedToCommitments from the live vms slice instead: iterate i from 0
to len(vms)-1 and increment assignedToCommitments only when the corresponding
vmAssignments[i] exists and is non-empty (ensuring you don't read past
vmAssignments), then compute payg as len(vms)-assignedToCommitments before
logging (use the existing variables vmAssignments, vms, assignedToCommitments in
the UsageReconciler context).
---
Nitpick comments:
In
`@internal/scheduling/reservations/commitments/committed_resource_controller.go`:
- Around line 368-384: The retryDelay method should guard against a zero or
negative base backoff (r.Conf.RequeueIntervalRetry.Duration) which silently
disables pacing; add an early-out at the top of retryDelay (before using base
and computing exp) that returns a safe default duration (or the validated
configured default) when base <= 0, or alternatively surface a config validation
error during startup; update retryDelay (function name) to return that default
instead of proceeding when r.Conf.RequeueIntervalRetry.Duration is non-positive,
keeping the existing doubling logic and MaxRequeueInterval cap
(r.Conf.MaxRequeueInterval.Duration) intact.
In `@internal/scheduling/reservations/commitments/usage.go`:
- Around line 173-188: BuildVMAssignmentsFromStatus and
buildCommitmentCapacityMap currently call c.client.List with no selector,
causing a full-cluster scan each call; register a field index for
v1alpha1.CommittedResource.spec.projectID at manager startup and then change the
List invocations in BuildVMAssignmentsFromStatus and buildCommitmentCapacityMap
to use client.MatchingFields(map[string]string{"spec.projectID": projectID}) (or
client.InNamespace when appropriate) so the API server/cache pre-filters by
projectID instead of filtering in-memory; alternatively implement a short-lived
cache keyed by projectID to avoid repeated full list calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0e96044a-5303-4703-9d02-35144184d5aa
📒 Files selected for processing (16)
- api/v1alpha1/committed_resource_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- cmd/manager/main.go
- helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/e2e_checks.go
- internal/scheduling/reservations/commitments/field_index.go
- internal/scheduling/reservations/commitments/integration_test.go
- internal/scheduling/reservations/commitments/usage.go
- internal/scheduling/reservations/commitments/usage_internals_test.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
- internal/scheduling/reservations/commitments/usage_reconciler_test.go
- tools/visualize-committed-resources/main.go
✅ Files skipped from review due to trivial changes (1)
- internal/scheduling/reservations/commitments/field_index.go
🚧 Files skipped from review as they are similar to previous changes (9)
- api/v1alpha1/zz_generated.deepcopy.go
- internal/scheduling/reservations/commitments/api/usage_test.go
- cmd/manager/main.go
- tools/visualize-committed-resources/main.go
- internal/scheduling/reservations/commitments/usage_reconciler_test.go
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
- api/v1alpha1/committed_resource_types.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- jsonPath: .status.usedResources.memory
  name: UsedMemory
  priority: 1
  type: string
- jsonPath: .status.usedAmount
  name: UsedAmount
- jsonPath: .status.usedResources.cpu
  name: UsedCPU
  priority: 1
Inconsistent resource key naming (cores vs cpu) can break UsedCPU visibility.
spec.resourceType uses cores, but the new printer column and usedResources description use cpu. If reconciler output is keyed by cores, .status.usedResources.cpu will stay empty and clients get mixed semantics.
Suggested schema alignment (one consistent key)
- - jsonPath: .status.usedResources.cpu
+ - jsonPath: .status.usedResources.cores
name: UsedCPU
priority: 1
type: string
...
- (e.g. "memory" in MiB binary SI, "cpu" as core count). Populated by the usage reconciler.
+ (e.g. "memory" in MiB binary SI, "cores" as core count). Populated by the usage reconciler.
Also applies to: 130-137, 361-363
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm/library/cortex/files/crds/cortex.cloud_committedresources.yaml` around
lines 33 - 39, The new printer column and usedResources fields are inconsistent
with spec.resourceType: replace or align the jsonPath references for UsedCPU
(and other occurrences referenced at 130-137 and 361-363) so they use the same
key as the reconciler output (e.g., change .status.usedResources.cpu to
.status.usedResources.cores) or update spec.resourceType to use "cpu"
everywhere; update the printer column name(s) (UsedCPU) and any descriptions to
match the chosen key to ensure the reconciler-populated field is visible.
Actionable comments posted: 2
♻️ Duplicate comments (1)
internal/scheduling/reservations/commitments/usage.go (1)
154-159: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win | Count committed VMs from the live VM slice, not raw status entries.
vmAssignments can still contain stale UUIDs until status is cleaned up, so counting every map entry here can make the logged payg value go negative. Derive assignedToCommitments from vms and test vmAssignments[vm.UUID] != "" instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/usage.go` around lines 154 - 159, The current loop counts committed VMs by iterating vmAssignments which may contain stale entries; change the logic to iterate over the live slice vms and increment assignedToCommitments only when vmAssignments[vm.UUID] != "" so the count reflects actual live VMs. Update the code that sets assignedToCommitments (currently declared as assignedToCommitments := 0) to use the vms slice and the vm.UUID key check against vmAssignments, leaving vmAssignments unchanged.
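A self-contained sketch of the counting fix: the committed count is derived from the live VM slice, so stale status entries cannot push the PAYG count negative. The VM shape and map values below are assumptions for illustration only.

```go
package main

import "fmt"

type VM struct{ UUID string }

func main() {
	vms := []VM{{UUID: "vm-a"}, {UUID: "vm-b"}, {UUID: "vm-c"}}
	// Status-derived assignments may still reference VMs that no longer exist.
	vmAssignments := map[string]string{
		"vm-a":     "commitment-1",
		"vm-stale": "commitment-2", // stale entry, not in vms
	}

	assignedToCommitments := 0
	for _, vm := range vms {
		if vmAssignments[vm.UUID] != "" {
			assignedToCommitments++
		}
	}
	payg := len(vms) - assignedToCommitments
	fmt.Println("committed:", assignedToCommitments, "payg:", payg) // committed: 1 payg: 2
}
```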
🧹 Nitpick comments (1)
internal/scheduling/reservations/commitments/api/report_usage_test.go (1)
53-390: ⚡ Quick win | Add one case that asserts a non-default os_type.
None of the table cases set TestVMUsage.OSType or ExpectedVMUsage.OSType, so this new path only exercises the "unknown" fallback. A single explicit value here would catch regressions in the DB row → API attribute wiring.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/scheduling/reservations/commitments/api/report_usage_test.go` around lines 53 - 390, Add a new UsageReportTestCase in the testCases table that sets a non-default OS type on the input VM and the matching ExpectedVMUsage to assert DB→API wiring: populate TestVMUsage.OSType (e.g., "windows" or "linux") for the VM created with newTestVMUsage/newTestVMUsageWithEmptyAZ and set the corresponding ExpectedVMUsage.OSType in Expected.VMs for the same UUID so the test validates the non-default OSType is preserved instead of falling back to "unknown".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/scheduling/reservations/commitments/usage_reconciler.go`:
- Around line 109-116: The reconciler currently returns early when
buildCommitmentCapacityMap returns no commitments, which leaves stale
AssignedInstances/UsedResources in the CR status and causes CalculateUsage() to
treat them as active; update the CR status to clear AssignedInstances and
UsedResources (set to empty/zero) before returning from the empty-commitments
path (the branch that checks len(commitmentsByAZFG) == 0 and the similar block
around lines 146-156), using the same status update pattern you use elsewhere
(via the reconciler's client/status update) so subsequent CalculateUsage() sees
no committed resources.
In `@internal/scheduling/reservations/commitments/usage.go`:
- Around line 209-214: The loop is filtering active commitments using
cr.Spec.State but the function should use the accepted snapshot; after the nil
guard for cr.Status.AcceptedSpec, check cr.Status.AcceptedSpec.State (not
cr.Spec.State) when deciding whether to continue, i.e., replace the state check
that references cr.Spec.State with a check against cr.Status.AcceptedSpec.State
so only the accepted generation controls whether capacity is counted
(references: the loop over allCRs.Items, cr.Spec.State, cr.Status.AcceptedSpec,
and AcceptedSpec.State).
---
Duplicate comments:
In `@internal/scheduling/reservations/commitments/usage.go`:
- Around line 154-159: The current loop counts committed VMs by iterating
vmAssignments which may contain stale entries; change the logic to iterate over
the live slice vms and increment assignedToCommitments only when
vmAssignments[vm.UUID] != "" so the count reflects actual live VMs. Update the
code that sets assignedToCommitments (currently declared as
assignedToCommitments := 0) to use the vms slice and the vm.UUID key check
against vmAssignments, leaving vmAssignments unchanged.
---
Nitpick comments:
In `@internal/scheduling/reservations/commitments/api/report_usage_test.go`:
- Around line 53-390: Add a new UsageReportTestCase in the testCases table that
sets a non-default OS type on the input VM and the matching ExpectedVMUsage to
assert DB→API wiring: populate TestVMUsage.OSType (e.g., "windows" or "linux")
for the VM created with newTestVMUsage/newTestVMUsageWithEmptyAZ and set the
corresponding ExpectedVMUsage.OSType in Expected.VMs for the same UUID so the
test validates the non-default OSType is preserved instead of falling back to
"unknown".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e92b69d4-f307-42e9-ab27-efe10ddb78cb
📒 Files selected for processing (7)
- internal/scheduling/reservations/commitments/api/report_usage_test.go
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/field_index.go
- internal/scheduling/reservations/commitments/usage.go
- internal/scheduling/reservations/commitments/usage_reconciler.go
🚧 Files skipped from review as they are similar to previous changes (4)
- internal/scheduling/reservations/commitments/api/usage_test.go
- internal/scheduling/reservations/commitments/field_index.go
- internal/scheduling/reservations/commitments/committed_resource_controller_test.go
- internal/scheduling/reservations/commitments/committed_resource_controller.go
Test Coverage Report
Test Coverage 📊: 69.1%
### Release digest — 2026-05-07 — [#814](#814)

#### cortex v0.0.47 (sha-7d1745d8)

**New features:**
- `ProjectQuota` CRD with per-resource, per-AZ quota breakdown and PAYG calculation ([#796](#796))
- `FlavorGroupCapacity` CRD + background capacity controller for pre-computed per-flavor VM slot capacity per (flavor group × AZ) ([#728](#728))
- Capacity reporting in `POST /commitments/v1/report-capacity` now uses real `FlavorGroupCapacity` CRD values (replaces placeholder zeros)
- CommittedResource usage reconciler — moves usage calculation into CRD status, feeding both LIQUID API and quota controller ([#800](#800))
- KVM OS version label on host capacity metrics ([#810](#810))
- KVM project usage metrics (running VMs / resource usage per project/flavor) ([#803](#803))
- `domain_id` + name on vmware project capacity metrics ([#802](#802))
- `domain_id` in vmware project commitment KPI ([#806](#806))
- Weighing explainer for scheduling decisions ([#808](#808))

**Refactors:**
- KVM host capacity metric moved to infrastructure plugins package ([#809](#809))
- Deprecated per-compute KPIs removed (`flavor_running_vms`, `host_running_vms`, `resource_capacity_kvm`) ([#807](#807))
- Bundle-specific RBAC templates moved from library chart into `cortex-ironcore` / `cortex-pods` bundles ([#797](#797))
- Webhook templates moved back into `cortex-nova` ([#805](#805))
- `testlib.Ptr` replaced with native `new()` ([#801](#801))

**Fixes:**
- Remove `ignoreAllocations` from kvm-report-capacity pipeline to unblock older admission webhook ([#812](#812))
- Suppress nova scheduling alerts on transient `no such host` DNS errors
- Add `identity-domains` as KPI dependency
- Rename hypervisor `ClusterRoleBinding` to avoid `roleRef` conflict on redeploy ([#804](#804))

#### cortex-nova v0.0.60 (sha-7d1745d8)
Includes cortex v0.0.47. Adds Prometheus datasources and KPI CRD templates for KVM project usage/utilization, and updated RBAC for `FlavorGroupCapacity` + `ProjectQuota` CRDs.
Moves CR usage computation out of the API handler and into a dedicated reconciler that persists results in CRD status, making usage data available to both the LIQUID API and quota controller.