fix: handle stale ResourceClaims blocking quota admission retries#524
Merged
zachsmith1 merged 12 commits intomainfrom Mar 18, 2026
Merged
fix: handle stale ResourceClaims blocking quota admission retries#524zachsmith1 merged 12 commits intomainfrom
zachsmith1 merged 12 commits intomainfrom
Conversation
When a ResourceClaim is denied or times out, it persists with a deterministic name. On retry, the admission plugin generates the same claim name and the Create call fails with AlreadyExists — surfaced as a generic "Insufficient quota" error even when quota is available. Add handleExistingClaim to detect and resolve stale claims before registering the watch waiter: granted claims short-circuit to allow, denied/pending claims are deleted so a fresh claim can be created. Add milo_quota_existing_claim_resolutions_total counter to track how often each resolution path is hit in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
scotwells
reviewed
Mar 18, 2026
… claims Denied claims (Granted=False, reason=QuotaExceeded) are already handled by the DeniedAutoClaimCleanupController. The admission plugin now only deletes pending claims inline — these are left behind by timeouts and are not covered by the GC controller. Split the single test into two: one for denied claims (verifies GC deferral) and one for pending/timed-out claims (verifies inline cleanup). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some API server code paths (e.g. apiextensions decoding CRDs) produce unstructured objects without a top-level metadata key in the Object map. The admission attributes carry name/namespace separately, but CEL template expressions like trigger.metadata.name fail with "no such key: metadata" because the map lacks the metadata envelope. Backfill metadata.name and metadata.namespace from the admission attributes when the unstructured object is missing them. This fixes EndpointSlice (and other native type) quota enforcement in project virtual control planes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add V(2) log to capture the top-level keys and metadata presence when the admission plugin receives an unstructured object. This will help identify why EndpointSlice arrives without metadata in the project virtual control plane. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log object type, keys, and metadata presence when a structured type (like EndpointSlice) is converted to unstructured in the admission plugin. This complements the existing unstructured path logging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure admission object debug logs always emit (not filtered by verbosity) and include the GVK to identify which resource type is being processed. This helps catch the EndpointSlice admission path before logs rotate out on high-volume API servers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ll fires Move debug logging after the type switch so it fires for all objects. Log objectType (Go type), GVK, keys, and metadata presence in a single log line. Add ERROR-level log when the metadata backfill actually fires so we can confirm the root cause in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
runtime.DefaultUnstructuredConverter.ToUnstructured uses Go struct field names (objectMeta, typeMeta) rather than JSON tags (metadata, apiVersion/ kind). This causes CEL expressions like trigger.metadata.name to fail with "no such key: metadata" for structured types like EndpointSlice. Replace ToUnstructured with JSON marshal/unmarshal to produce the same key names as JSON-decoded unstructured objects. This ensures consistent CEL evaluation regardless of whether the admission object arrives as a structured Go type or *unstructured.Unstructured. Remove the metadata backfill workaround since the root cause is fixed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the V(0) admission object debug logging that was added to diagnose the structured-to-unstructured conversion issue. The root cause has been identified and fixed (JSON round-trip instead of ToUnstructured). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove handleExistingClaim, classifyClaim, and the existing_claim_resolutions_total metric. Denied claims (including timeouts) are handled by the DeniedAutoClaimCleanupController via garbage collection — no evidence of stale claims blocking retries in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scotwells
approved these changes
Mar 18, 2026
Contributor
scotwells
left a comment
There was a problem hiding this comment.
@zachsmith1 having a chainsaw end-to-end tests for this would give us a bit more confidence that this works as expected.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
milo_quota_existing_claim_resolutions_totalcounter (labels:not_found,existing_granted,stale_deleted,delete_failed,get_failed) to track how often each resolution path activates in production