failover reservation misc changes after first QA testing #607
Conversation
* eligibility: reuse the dep graph instead of rebuilding it all the time
* do not create Decisions
* set failover to perform less work in the default config
* added AZ to failover reservations
ℹ️ Recent review info
No actionable comments were generated in the recent review. 🎉
⚙️ Run configuration — Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro | Run ID:
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
The failover-reservations controller configuration system is refactored to adopt Kubernetes duration types and introduce an `ApplyDefaults()` step.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller
    participant Config
    participant VM Selection
    participant Eligibility Checker
    participant Scheduler
    participant Exclusion Set
    Controller->>Config: ApplyDefaults()
    Config-->>Controller: Config with defaults applied
    Controller->>VM Selection: SelectVMsToProcess()
    Note over VM Selection: Use VMSelectionRotationInterval<br/>for deterministic offset
    VM Selection-->>Controller: Selected VMs
    loop For each selected VM
        Controller->>Eligibility Checker: CheckVMsStillEligible(vms)
        Eligibility Checker-->>Controller: Ineligible VMs map
    end
    loop For each remaining eligible VM
        Controller->>Scheduler: ScheduleAndBuild(..., excludeHypervisors)
        Note over Scheduler: Skip hosts in excludeHypervisors<br/>if LimitOneNewReservationPerHypervisor enabled
        Scheduler-->>Controller: New Reservation or reused candidate
        alt New Reservation Created
            Controller->>Exclusion Set: Add host to excludeHypervisors
            Controller->>Scheduler: PatchReservationStatus(newReservation)
        else Reuse Candidate
            Controller->>Scheduler: PatchReservationStatus(existingReservation)
        end
    end
    Controller->>Controller: Determine requeueAfter based on<br/>MinSuccessForShortInterval/<br/>MaxFailuresForShortInterval
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (inconclusive)
Actionable comments posted: 2
🧹 Nitpick comments (3)
internal/scheduling/reservations/failover/reservation_scheduling.go (1)
253-260: Consider pushing `excludeHypervisors` down into the scheduler request.
Right now those hosts are still scored remotely and only discarded afterwards. If the scheduler response is bounded rather than exhaustive, that can also turn a valid placement into a local false negative. Passing the exclusion set via `IgnoreHosts` would avoid both problems.
Also applies to: 273-279
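A minimal sketch of the suggested change, assuming the request exposes an `IgnoreHosts` field as the comment speculates; the helper below is hypothetical and not part of the controller:

```go
package main

import (
	"fmt"
	"slices"
)

// ignoreHostList converts the local exclusion set into the slice a
// scheduler request could carry, so excluded hosts are never scored
// remotely. Sorting keeps request payloads deterministic.
func ignoreHostList(excludeHypervisors map[string]struct{}) []string {
	hosts := make([]string, 0, len(excludeHypervisors))
	for h := range excludeHypervisors {
		hosts = append(hosts, h)
	}
	slices.Sort(hosts)
	return hosts
}

func main() {
	exclude := map[string]struct{}{"hv-02": {}, "hv-01": {}}
	// req.IgnoreHosts = ignoreHostList(exclude) // field name is an assumption
	fmt.Println(ignoreHostList(exclude)) // [hv-01 hv-02]
}
```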
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/reservation_scheduling.go` around lines 253 - 260, the scheduler currently receives the full host list and only filters out excludeHypervisors after scoring, which can yield false negatives; update scheduleAndBuildNewFailoverReservation to pass excludeHypervisors down into the scheduler request by populating the request's IgnoreHosts (or similar) field so the scheduler will omit those hosts during scoring, and make the same change in the other call site around the block referenced at lines 273-279; locate usages of excludeHypervisors and the scheduler request construction in this file and set IgnoreHosts = keys(excludeHypervisors) (or equivalent) before invoking the scheduler so excluded hypervisors are not considered remotely.

internal/scheduling/reservations/failover/reservation_eligibility.go (1)
206-211: `removeVMFromReservation` doesn't clean up `vmToCurrentHypervisor`.
When a VM is removed from its last reservation, the entry in `vmToCurrentHypervisor` remains. While this doesn't affect correctness (the graph is typically short-lived), it could cause subtle issues if the graph is reused extensively.
Consider cleaning up `vmToCurrentHypervisor` when a VM has no remaining reservations:
💡 Optional cleanup suggestion
```diff
 func (g *DependencyGraph) removeVMFromReservation(vmUUID, resKey, resHost string) {
 	delete(g.vmToReservations[vmUUID], resKey)
 	delete(g.vmToReservationHosts[vmUUID], resHost)
 	delete(g.reservationToVMs[resKey], vmUUID)
+	// Clean up VM entry if it has no remaining reservations
+	if len(g.vmToReservations[vmUUID]) == 0 {
+		delete(g.vmToCurrentHypervisor, vmUUID)
+	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/reservation_eligibility.go` around lines 206 - 211, removeVMFromReservation currently deletes entries from vmToReservations, vmToReservationHosts, and reservationToVMs but leaves vmToCurrentHypervisor populated; update removeVMFromReservation to check vmToReservations[vmUUID] (or length of that map/collection) after deleting the reservation and if it is empty or nil, also delete g.vmToCurrentHypervisor[vmUUID] so the VM's hypervisor entry is cleaned up when it has no remaining reservations.

internal/scheduling/reservations/failover/integration_test.go (1)
981-1002: Consider documenting the magic number 3.The limit of 3 hosts simulates Nova's behavior of only considering top-ranked hosts, but this magic number could benefit from a constant or comment explaining why 3 was chosen.
💡 Suggested improvement
```diff
+	// Only consider the first 3 hosts (simulating Nova's behavior of using top hosts)
+	// Nova typically limits evaluation to a small set of top-weighted hosts
+	const maxHostsToCheck = 3
 	hostsToCheck := response.Hosts
-	if len(hostsToCheck) > 3 {
-		hostsToCheck = hostsToCheck[:3]
+	if len(hostsToCheck) > maxHostsToCheck {
+		hostsToCheck = hostsToCheck[:maxHostsToCheck]
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/integration_test.go` around lines 981 - 1002, Replace the magic number 3 used when trimming response.Hosts with a named constant and document its intent: introduce something like maxHostsToConsider (or topHostsToConsider) near the test setup and use hostsToCheck = hostsToCheck[:maxHostsToConsider] (guarded by len check as existing), and add a short comment explaining that this mirrors Nova's behavior of only evaluating the top-ranked hosts; update any references in the loop that rely on that limit (e.g., hostsToCheck, candidateHost) to use the new constant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/scheduling/reservations/failover/config.go`:
- Around line 39-43: The ApplyDefaults logic is currently unable to distinguish
"unset" vs "explicit zero/false" for ShortReconcileInterval and
LimitOneNewReservationPerHypervisor; change the types to presence-sensitive ones
and only apply defaults when the field is nil—specifically, make
ShortReconcileInterval a pointer (*metav1.Duration) and only set the 100ms
fallback when ShortReconcileInterval == nil (leaving an explicit zero duration
untouched), and make LimitOneNewReservationPerHypervisor a *bool and set it to
true only when nil (preserving an explicit false); then update ApplyDefaults()
to check for nil on these fields (instead of checking value) and adjust any
callers/serialization handling accordingly.
In `@internal/scheduling/reservations/failover/reservation_eligibility.go`:
- Around line 60-93: The loop over the allocations map in CheckVMsStillEligible
is non-deterministic; convert the map keys into a sorted slice before iterating
so eligibility checks and subsequent calls into baseGraph (e.g.,
baseGraph.isVMEligibleForReservation and baseGraph.removeVMFromReservation) run
in a stable order. Specifically, after calling getFailoverAllocations(&res)
produce a []string of vm UUIDs, sort it (e.g., using slices.Sort), then iterate
that sorted slice instead of ranging the map keys; keep the existing checks
against vm.CurrentHypervisor and baseGraph.vmToCurrentHypervisor and call
baseGraph.removeVMFromReservation(vm.UUID, resKey, res.Status.Host) when
ineligible.
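A sketch of the stable-order iteration described above, with an assumed map shape (VM UUID keys); the actual allocation type returned by `getFailoverAllocations` may differ:

```go
package main

import (
	"fmt"
	"slices"
)

// sortedVMUUIDs collects map keys into a slice and sorts them, so callers
// can range over VM UUIDs in a stable order instead of Go's randomized
// map iteration order. The uuid -> hypervisor value type is illustrative.
func sortedVMUUIDs(allocations map[string]string) []string {
	uuids := make([]string, 0, len(allocations))
	for uuid := range allocations {
		uuids = append(uuids, uuid)
	}
	slices.Sort(uuids)
	return uuids
}

func main() {
	allocs := map[string]string{"vm-c": "hv-1", "vm-a": "hv-2", "vm-b": "hv-1"}
	for _, uuid := range sortedVMUUIDs(allocs) {
		// Eligibility checks and graph mutations now run in a
		// deterministic order across reconcile runs.
		fmt.Println(uuid, allocs[uuid])
	}
}
```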
---
Nitpick comments:
In `@internal/scheduling/reservations/failover/integration_test.go`:
- Around line 981-1002: Replace the magic number 3 used when trimming
response.Hosts with a named constant and document its intent: introduce
something like maxHostsToConsider (or topHostsToConsider) near the test setup
and use hostsToCheck = hostsToCheck[:maxHostsToConsider] (guarded by len check
as existing), and add a short comment explaining that this mirrors Nova's
behavior of only evaluating the top-ranked hosts; update any references in the
loop that rely on that limit (e.g., hostsToCheck, candidateHost) to use the new
constant.
In `@internal/scheduling/reservations/failover/reservation_eligibility.go`:
- Around line 206-211: removeVMFromReservation currently deletes entries from
vmToReservations, vmToReservationHosts, and reservationToVMs but leaves
vmToCurrentHypervisor populated; update removeVMFromReservation to check
vmToReservations[vmUUID] (or length of that map/collection) after deleting the
reservation and if it is empty or nil, also delete
g.vmToCurrentHypervisor[vmUUID] so the VM's hypervisor entry is cleaned up when
it has no remaining reservations.
In `@internal/scheduling/reservations/failover/reservation_scheduling.go`:
- Around line 253-260: The scheduler currently receives the full host list and
only filters out excludeHypervisors after scoring, which can yield false
negatives; update scheduleAndBuildNewFailoverReservation to pass
excludeHypervisors down into the scheduler request by populating the request's
IgnoreHosts (or similar) field so the scheduler will omit those hosts during
scoring, and make the same change in the other call site around the block
referenced at lines 273-279; locate usages of excludeHypervisors and the
scheduler request construction in this file and set IgnoreHosts =
keys(excludeHypervisors) (or equivalent) before invoking the scheduler so
excluded hypervisors are not considered remotely.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a78f3c19-ce4a-48b4-9d55-6c748c86fe4e
📒 Files selected for processing (13)
- cmd/main.go
- helm/bundles/cortex-nova/templates/pipelines_kvm.yaml
- helm/bundles/cortex-nova/values.yaml
- internal/scheduling/reservations/failover/config.go
- internal/scheduling/reservations/failover/context.go
- internal/scheduling/reservations/failover/controller.go
- internal/scheduling/reservations/failover/controller_test.go
- internal/scheduling/reservations/failover/helpers.go
- internal/scheduling/reservations/failover/integration_test.go
- internal/scheduling/reservations/failover/reservation_eligibility.go
- internal/scheduling/reservations/failover/reservation_eligibility_test.go
- internal/scheduling/reservations/failover/reservation_scheduling.go
- internal/scheduling/reservations/scheduler_client.go
Test Coverage Report 📊: 68.4%