Implement placement shim /traits APIs (#729)
Conversation
📝 Walkthrough

This pull request adds distributed trait management to the cortex placement shim. It introduces a Kubernetes-based distributed lock package, refactors trait handlers to support local ConfigMap storage with filtering capabilities, adds upstream synchronization for custom traits, and extends the Helm configuration to support trait persistence and RBAC permissions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Shim as Shim Handler
    participant Lock as Lease Lock
    participant ConfigMap as Local ConfigMap
    participant Upstream as Upstream Server
    Client->>Shim: PUT /traits/{name}<br/>(enableTraits=true)
    activate Shim
    Shim->>Shim: Validate CUSTOM_ prefix
    Shim->>ConfigMap: Check trait existence<br/>(merged static+dynamic)
    Shim->>Lock: AcquireLock(traits-lock)
    activate Lock
    Lock-->>Shim: Lock acquired
    deactivate Lock
    Shim->>ConfigMap: Create/Update dynamic<br/>traits ConfigMap
    ConfigMap-->>Shim: Success
    Shim->>Upstream: PUT /traits/{name}<br/>(best-effort sync)
    Upstream-->>Shim: 200/5xx
    Shim->>Lock: ReleaseLock(traits-lock)
    Lock-->>Shim: Released
    Shim-->>Client: 201 (new) / 204 (existing)
    deactivate Shim
```

```mermaid
sequenceDiagram
    participant Client
    participant Shim as Shim Handler
    participant StaticCM as Static Traits ConfigMap
    participant CustomCM as Custom Traits ConfigMap
    Client->>Shim: GET /traits?name=in:foo
    activate Shim
    Shim->>StaticCM: Fetch static traits
    StaticCM-->>Shim: {traits JSON}
    Shim->>CustomCM: Fetch custom traits
    CustomCM-->>Shim: {traits JSON}
    Shim->>Shim: Merge + Parse both<br/>ConfigMaps
    Shim->>Shim: Filter by name=in:foo
    Shim-->>Client: 200 {traits: [filtered]}
    deactivate Shim
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Split the Lease-based locking code from handle_traits.go into a reusable pkg/resourcelock library. The new ResourceLocker provides AcquireLock / ReleaseLock with functional options and a holder-identity safety check on release. Refine traits handlers: add structured logging (matching handle_resource_providers.go style), OpenStack API doc links, and inline comments on key code paths. Rename default configMapName to cortex-placement-shim-traits.
The controller-runtime informer cache needs list/watch on Leases to set up its reflector. Without these, the shim fails at startup with a forbidden error on leases.coordination.k8s.io.
When HandleUpdateTrait creates a CUSTOM_* trait locally, also fire a
PUT /traits/{name} to upstream placement so that forwarded endpoints
like PUT /resource_providers/{uuid}/traits can reference the trait.
The sync is best-effort: errors are logged but never fail the request,
and the call is skipped entirely if upstream is not configured.
The upstream placement API requires authentication. Clone the incoming
request headers into the best-effort PUT /traits/{name} sync call so
that the X-Auth-Token is forwarded.
When enableTraits is true, traits are served from ConfigMaps. The static list may be empty (no Helm-managed traits configured), so the "at least one trait" assertion must only apply when forwarding to upstream placement. Also skip the "show known trait" sub-test when the list is empty to avoid an index-out-of-range.
Actionable comments posted: 10
🧹 Nitpick comments (4)
internal/shim/placement/handle_traits.go (2)
438-444: URL-escape the trait name when building the upstream path.

`url.JoinPath(u.Path, "/traits/"+name)` doesn't percent-encode `name`. The `CUSTOM_` prefix + charset validation upstream limits exposure, but in this handler `name` comes straight from the path parameter without charset validation beyond `HasPrefix("CUSTOM_")`. Defensive encoding avoids surprises if validation is ever loosened:

♻️ Proposed refactor

```diff
- u.Path, err = url.JoinPath(u.Path, "/traits/"+name)
+ u.Path, err = url.JoinPath(u.Path, "/traits/", url.PathEscape(name))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/shim/placement/handle_traits.go` around lines 438 - 444, The upstream trait path is built using the raw variable name (name) which isn't percent-encoded; update the path construction in the handler that creates the upstream request (the code just before http.NewRequestWithContext in handle_traits.go) to URL-escape the trait segment (e.g., use url.PathEscape(name) when composing the "/traits/{name}" portion or use url.JoinPath with an escaped segment) so the upstream path is safely percent-encoded; then proceed to create the request (http.NewRequestWithContext) and preserve the existing header forwarding (req.Header = incomingHeader.Clone()).
222-256: Synchronous best-effort upstream sync adds tail latency to PUT /traits.

`syncTraitToUpstream` is described as best-effort (errors swallowed), but it's awaited inline before `w.WriteHeader(http.StatusCreated)`. A slow/hanging upstream (e.g., TCP connect delays, slow DNS) blocks the client response. Even with the request context's timeout guarding it, this propagates upstream latency to every trait creation. Consider firing this asynchronously with a detached context after the local write succeeds, or add an explicit short timeout for just the sync request:

♻️ Proposed refactor

```diff
- log.Info("created custom traits configmap with new trait", "trait", name)
- s.syncTraitToUpstream(ctx, name, r.Header)
- w.WriteHeader(http.StatusCreated)
+ log.Info("created custom traits configmap with new trait", "trait", name)
+ go s.syncTraitToUpstream(context.WithoutCancel(ctx), name, r.Header.Clone())
+ w.WriteHeader(http.StatusCreated)
```

(apply similarly at line 255)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/shim/placement/handle_traits.go` around lines 222 - 256, The PUT /traits handler currently calls s.syncTraitToUpstream(ctx, name, r.Header) synchronously after the local write, which makes client responses wait on upstream latency; change this to perform the upstream sync asynchronously with a detached context (or a new context with a short timeout) so the handler returns immediately after successful local persist (after s.writeTraits and s.Update). Locate the call to syncTraitToUpstream in the handler (following log.Info("added custom trait to configmap", "trait", name) and before w.WriteHeader(http.StatusCreated)), and replace the blocking call with a goroutine that invokes syncTraitToUpstream using either context.Background() (detached) or a context.WithTimeout(ctx, shortDuration) so errors remain best-effort and do not add tail latency to the response.

internal/shim/placement/handle_traits_e2e.go (1)
61-326: Repeated HTTP request scaffolding inviting extraction + `defer` accumulation.

Each step repeats `NewRequestWithContext` → set `X-Auth-Token` → set `OpenStack-API-Version` → `Do` → `defer resp.Body.Close()`. With ~10 sequential calls in one function, every `resp.Body` stays open until function exit, and the file is noisy. A small helper (e.g., `doE2EReq(ctx, sc, method, url string) (*http.Response, error)` that also closes the body inline when the body isn't consumed) would shrink this file substantially and fix the lingering-open-body hygiene issue. Non-blocking.

♻️ Proposed helper sketch

```go
func doTraitReq(ctx context.Context, sc *serviceClient, method, url string) (int, []byte, error) {
	req, err := http.NewRequestWithContext(ctx, method, url, http.NoBody)
	if err != nil {
		return 0, nil, err
	}
	req.Header.Set("X-Auth-Token", sc.TokenID)
	req.Header.Set("OpenStack-API-Version", "placement 1.6")
	req.Header.Set("Accept", "application/json")
	resp, err := sc.HTTPClient.Do(req)
	if err != nil {
		return 0, nil, err
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return resp.StatusCode, body, nil
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/shim/placement/handle_traits_e2e.go` around lines 61 - 326, The test function in handle_traits_e2e.go repeats HTTP request scaffolding (NewRequestWithContext, setting X-Auth-Token/OpenStack-API-Version, sc.HTTPClient.Do, and defer resp.Body.Close()) causing noisy code and accumulating defers; extract a small helper (e.g., doTraitReq or doE2EReq) that accepts (ctx, sc *serviceClient, method, url string), builds the request, sets headers, performs Do, reads and closes resp.Body inside the helper, and returns status + body (or parsed JSON) and error; then replace callers in the e2e flow (the GET /traits list, GET /traits/{name}, PUT/DELETE requests, filtered GET, etc.) to use the helper to eliminate repeated code and avoid leaving multiple resp.Body defers open.

internal/shim/placement/handle_traits_test.go (1)
112-202: Consider consolidating the List-filter tests into a single table-driven test.

`TestHandleListTraitsLocal{,Merged,Filtered,StartsWith,FilterUnknown}` share identical plumbing and differ only in seed data, query string, and expected traits. A struct-driven form would reduce duplication and keep the file short, in line with the project's test-style guidelines. Fully optional — current form is readable.

As per coding guidelines: "Use struct-based test cases when applicable, but limit yourself to the most relevant cases" and "Test files should be short and contain only necessary test cases".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/shim/placement/handle_traits_test.go` around lines 112 - 202, Multiple nearly-identical tests (TestHandleListTraitsLocal, TestHandleListTraitsLocalMerged, TestHandleListTraitsLocalFiltered, TestHandleListTraitsLocalStartsWith, TestHandleListTraitsLocalFilterUnknown) duplicate setup and assertion logic; convert them into a single table-driven test that iterates cases containing seed traits, optional merged traits, query string, and expected result. Implement a slice of structs with fields like name, seed ([]string), merged (string or []string), query (string), want ([]string) and loop over cases calling newTraitShim, serveHandler with s.HandleListTraits, decoding into traitsListResponse and asserting lengths and values; keep existing helper functions (newTraitShim, serveHandler, traitsListResponse) and preserve each case's assertions but remove duplicated test functions. Ensure case names are used with t.Run to keep failures scoped.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-placement-shim/templates/clusterrole.yaml`:
- Around line 24-44: Replace the cluster-scoped binding with a namespace-scoped
Role + RoleBinding: create a Role resource (with the same rules for resources
"configmaps" and "leases" under apiGroups "" and "coordination.k8s.io" and verbs
get,list,watch,create,update,delete as applicable) instead of a ClusterRole, and
change the ClusterRoleBinding to a RoleBinding that binds that Role to the
ServiceAccount in {{ .Release.Namespace }}; update template names (e.g., replace
references to the ClusterRole/ClusterRoleBinding templates with Role/RoleBinding
templates) and ensure the RoleBinding.subject references the ServiceAccount and
metadata.namespace is set to {{ .Release.Namespace }} so permissions are
strictly namespaced.
In `@helm/library/cortex-shim/templates/deployment.yaml`:
- Around line 48-58: The template currently always emits the POD_NAMESPACE
downward-API env var and can duplicate a user-provided one; update the env block
in deployment.yaml to first check the user-supplied map
(.Values.deployment.container.env) for a "POD_NAMESPACE" key (e.g. using hasKey
or index) and only render the downward-API POD_NAMESPACE when the user has not
supplied it, leaving the existing range over .Values.deployment.container.env
unchanged so user values take precedence.
In `@internal/shim/placement/handle_traits_e2e.go`:
- Around line 91-113: The test in handle_traits_e2e.go currently asserts that
GET /traits/{name} returns 200 for knownTrait, but upstream Placement may return
204; update the check in the Phase 1 block that inspects resp.StatusCode (and
the analogous Phase 2 check around the later GET of the custom trait) to accept
either http.StatusOK (200) or http.StatusNoContent (204) as success; ensure the
log message still reports success when either status is returned and reference
knownTrait when logging the result; also align this behavior with the
HandleShowTrait handler comment that documents passthrough returning 204.
In `@internal/shim/placement/handle_traits_test.go`:
- Around line 204-226: Update the two local-handler tests so their expected HTTP
status matches the updated HandleShowTrait behavior: change
TestHandleShowTraitLocalFound and TestHandleShowTraitLocalFoundCustom to expect
http.StatusNoContent (204) instead of http.StatusOK (200), keeping
TestHandleShowTraitLocalNotFound as http.StatusNotFound; ensure these tests
reference s.HandleShowTrait so they remain consistent with the new handler
semantics and the passthrough tests.
In `@internal/shim/placement/handle_traits.go`:
- Line 187: The lockerID currently generated as fmt.Sprintf("shim-%d",
time.Now().UnixNano()) does not uniquely identify the replica and can collide;
change it to include the replica identity (e.g., append os.Hostname() or the
POD_NAME env value) so the HolderIdentity check in ReleaseLock is meaningful,
and update both places where lockerID is built (the request lock path and the
deletion path in HandleDeleteTrait). Preferably compute this replica-wide
identifier once on Shim creation (store as a field on the Shim struct, e.g.,
shim.replicaID) and use that stored value when constructing lockerID rather than
recreating it per-request.
- Around line 187-197: The deferred ReleaseLock is using the request context
(r.Context()) so if the request is cancelled the ReleaseLock call fails and the
lease remains until expiry; change the defer to use a detached context with a
short timeout instead of ctx: create a new context.Context via
context.WithTimeout(context.Background(), <small duration>) (and call its
cancel) and pass that context into s.resourceLocker.ReleaseLock(ctxRelease,
s.traitsLockName(), lockerID) inside the defer; apply the same change in both
the handler that calls s.resourceLocker.AcquireLock (the code around
s.resourceLocker.AcquireLock, s.resourceLocker.ReleaseLock, s.traitsLockName(),
lockerID) and in HandleDeleteTrait to ensure ReleaseLock is always invoked with
a non-cancelled short-lived context.
In `@internal/shim/placement/shim_test.go`:
- Around line 461-480: The test TestConfigValidateEnableTraitsRequiresConfig can
pass spuriously if POD_NAMESPACE is already set in the environment; before
calling c.validate() for the negative case that expects an error after setting
c.Traits.ConfigMapName, unset/clear the POD_NAMESPACE env var so the validation
fails deterministically, then set POD_NAMESPACE (e.g., via t.Setenv or
os.Setenv) before the final positive validation; adjust the test around calls to
config.validate() and references to c.Traits.ConfigMapName to ensure the
negative assertion occurs with POD_NAMESPACE unset and the success assertion
occurs with it set.
In `@internal/shim/placement/shim.go`:
- Around line 158-159: The error string produced when checking
os.Getenv("POD_NAMESPACE") should be lowercase to satisfy lint rules; update the
errors.New(...) call inside the POD_NAMESPACE empty check (the block using
os.Getenv("POD_NAMESPACE") == "") to use a fully lowercase message such as
"pod_namespace environment variable is required when features.enableTraits is
true" while keeping the rest of the condition and logic unchanged.
In `@pkg/resourcelock/resourcelock.go`:
- Around line 117-124: The AcquireLock logic currently does unconditional
time.Sleep calls (e.g., after apierrors.IsAlreadyExists in the Create path and
similar places) which ignores context cancellation and can block until
retryInterval; replace those sleeps with a context-aware wait: compute remaining
:= time.Until(deadline) and wait := min(rl.retryInterval, remaining), then use a
select { case <-time.After(wait): continue; case <-ctx.Done(): return ctx.Err()
} so the goroutine returns promptly on cancellation; apply the same change to
all sleep locations in AcquireLock (the blocks around rl.client.Create, the
competing update paths, and any other unconditional time.Sleep spots) ensuring
you reference rl.retryInterval, deadline, ErrLockHeld, and ctx in the updated
logic.
- Around line 52-55: WithLeaseDuration currently truncates sub-second durations
causing 0-second leases; change it to round up and enforce a minimum of 1
second: compute seconds = int32(math.Ceil(d.Seconds())) (import math) and if
seconds < 1 set seconds = 1, then assign rl.leaseDuration = seconds; keep
function name WithLeaseDuration and struct ResourceLocker, and ensure
negative/zero inputs become the 1-second minimum.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa833f1c-8a66-4e43-906b-ec5d4e456e04
📒 Files selected for processing (11)
- helm/bundles/cortex-placement-shim/templates/clusterrole.yaml
- helm/bundles/cortex-placement-shim/templates/configmap-traits.yaml
- helm/bundles/cortex-placement-shim/values.yaml
- helm/library/cortex-shim/templates/deployment.yaml
- internal/shim/placement/handle_traits.go
- internal/shim/placement/handle_traits_e2e.go
- internal/shim/placement/handle_traits_test.go
- internal/shim/placement/shim.go
- internal/shim/placement/shim_test.go
- pkg/resourcelock/resourcelock.go
- pkg/resourcelock/resourcelock_test.go
- HandleShowTrait now returns 204 (matching OpenStack Placement spec)
- ReleaseLock defers use detached context to survive request cancellation
- lockerID includes os.Hostname() for unique replica identity
- AcquireLock respects context cancellation instead of blind time.Sleep
- WithLeaseDuration rounds up sub-second values and clamps to [1, MaxInt32]
- Lowercase error string for POD_NAMESPACE validation
- Guard POD_NAMESPACE in test with t.Setenv for environment isolation
- Helm deployment template skips POD_NAMESPACE if user already provides it
Test Coverage Report 📊: 70.0%
Feature-gated /traits endpoints that serve traits locally from Kubernetes
ConfigMaps instead of forwarding to upstream placement.
When conf.features.enableTraits is true, the shim reads from two ConfigMaps:
a static one managed by Helm (standard traits) and a dynamic one created
on-demand by the shim (CUSTOM_* traits). Writes to the dynamic ConfigMap are
serialized across replicas via a Lease-based distributed lock.
New pkg/resourcelock library: reusable Lease-backed AcquireLock/ReleaseLock
extracted from the traits code, designed for short-lived locks.
Helm: new configmap-traits.yaml template, Lease RBAC, configMapName and
static traits in values.