Skip to content

fix: handle stale ResourceClaims blocking quota admission retries#524

Merged
zachsmith1 merged 12 commits intomainfrom
fix/stale-claim-retry
Mar 18, 2026
Merged

fix: handle stale ResourceClaims blocking quota admission retries#524
zachsmith1 merged 12 commits intomainfrom
fix/stale-claim-retry

Conversation

@zachsmith1
Copy link
Copy Markdown
Contributor

@zachsmith1 zachsmith1 commented Mar 18, 2026

Summary

  • Fixed marshaling of unstructured object so that metadata was correctly usable within CEL
  • Added milo_quota_existing_claim_resolutions_total counter (labels: not_found, existing_granted, stale_deleted, delete_failed, get_failed) to track how often each resolution path activates in production

When a ResourceClaim is denied or times out, it persists with a
deterministic name. On retry, the admission plugin generates the same
claim name and the Create call fails with AlreadyExists — surfaced as
a generic "Insufficient quota" error even when quota is available.

Add handleExistingClaim to detect and resolve stale claims before
registering the watch waiter: granted claims short-circuit to allow,
denied/pending claims are deleted so a fresh claim can be created.

Add milo_quota_existing_claim_resolutions_total counter to track how
often each resolution path is hit in production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joggrbot
Copy link
Copy Markdown
Contributor

joggrbot bot commented Mar 18, 2026

📝 Documentation Analysis

All docs are up to date! 🎉


✅ Latest commit analyzed: c190708 | Powered by Joggr

@zachsmith1 zachsmith1 requested a review from scotwells March 18, 2026 17:17
zachsmith1 and others added 8 commits March 18, 2026 10:27
… claims

Denied claims (Granted=False, reason=QuotaExceeded) are already handled
by the DeniedAutoClaimCleanupController. The admission plugin now only
deletes pending claims inline — these are left behind by timeouts and
are not covered by the GC controller.

Split the single test into two: one for denied claims (verifies GC
deferral) and one for pending/timed-out claims (verifies inline cleanup).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Some API server code paths (e.g. apiextensions decoding CRDs) produce
unstructured objects without a top-level metadata key in the Object
map. The admission attributes carry name/namespace separately, but CEL
template expressions like trigger.metadata.name fail with "no such
key: metadata" because the map lacks the metadata envelope.

Backfill metadata.name and metadata.namespace from the admission
attributes when the unstructured object is missing them. This fixes
EndpointSlice (and other native type) quota enforcement in project
virtual control planes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add V(2) log to capture the top-level keys and metadata presence when
the admission plugin receives an unstructured object. This will help
identify why EndpointSlice arrives without metadata in the project
virtual control plane.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log object type, keys, and metadata presence when a structured type
(like EndpointSlice) is converted to unstructured in the admission
plugin. This complements the existing unstructured path logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure admission object debug logs always emit (not filtered by
verbosity) and include the GVK to identify which resource type is
being processed. This helps catch the EndpointSlice admission path
before logs rotate out on high-volume API servers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ll fires

Move debug logging after the type switch so it fires for all objects.
Log objectType (Go type), GVK, keys, and metadata presence in a single
log line. Add ERROR-level log when the metadata backfill actually fires
so we can confirm the root cause in production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
runtime.DefaultUnstructuredConverter.ToUnstructured uses Go struct field
names (objectMeta, typeMeta) rather than JSON tags (metadata, apiVersion/
kind). This causes CEL expressions like trigger.metadata.name to fail
with "no such key: metadata" for structured types like EndpointSlice.

Replace ToUnstructured with JSON marshal/unmarshal to produce the same
key names as JSON-decoded unstructured objects. This ensures consistent
CEL evaluation regardless of whether the admission object arrives as a
structured Go type or *unstructured.Unstructured.

Remove the metadata backfill workaround since the root cause is fixed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the V(0) admission object debug logging that was added to
diagnose the structured-to-unstructured conversion issue. The root
cause has been identified and fixed (JSON round-trip instead of
ToUnstructured).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zachsmith1 zachsmith1 requested a review from scotwells March 18, 2026 21:26
Remove handleExistingClaim, classifyClaim, and the
existing_claim_resolutions_total metric. Denied claims (including
timeouts) are handled by the DeniedAutoClaimCleanupController via
garbage collection — no evidence of stale claims blocking retries
in production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zachsmith1 zachsmith1 requested review from scotwells and removed request for scotwells March 18, 2026 21:53
Copy link
Copy Markdown
Contributor

@scotwells scotwells left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zachsmith1 having a chainsaw end-to-end tests for this would give us a bit more confidence that this works as expected.

@zachsmith1 zachsmith1 merged commit acb29dc into main Mar 18, 2026
8 checks passed
@zachsmith1 zachsmith1 deleted the fix/stale-claim-retry branch March 18, 2026 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants