test(ats): tighten smoke-test timeout and rerun budget#655
Merged
Conversation
Cut the per-attempt deployment-readiness timeout from 560s to 180s and reduce @pytest.mark.flaky reruns from 5 to 1. The module-scoped `deployment` fixture is recreated on every rerun, so the full timeout was paid each attempt (1 initial + 5 reruns = 6 attempts, 6 x 560s ~= 56 min worst case). With the new budget, worst case drops to ~6 min while real failures still get a second chance. 180s is still ~3x the realistic ready time for the single-replica muster Deployment in kind. Same fix as giantswarm/mcp-kubernetes#406.
QuentinBisson
added a commit
that referenced
this pull request
May 12, 2026
#647) * test(ats): tighten smoke-test timeout and rerun budget (#655) Cut the per-attempt deployment-readiness timeout from 560s to 180s and reduce @pytest.mark.flaky reruns from 5 to 1. The module-scoped `deployment` fixture is recreated on every rerun, so the full timeout was paid each attempt (1 initial + 5 reruns = 6 attempts, 6 x 560s ~= 56 min worst case). With the new budget, worst case drops to ~6 min while real failures still get a second chance. 180s is still ~3x the realistic ready time for the single-replica muster Deployment in kind. Same fix as giantswarm/mcp-kubernetes#406. * chore(deps): update dependency requests to ~=2.34.0 (#658) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * refactor(aggregator,workflow): introduce consumer-defined TokenBroker and EntityProvider ports [1/8] Define the hexagonal seam ahead of the broker bounded-context refactor. No callers yet, no behavior change. - internal/aggregator/token_broker.go: TokenBroker interface (full surface: Begin/Complete flow, GetToken, ExchangeToken, RevokeSession, Introspect, WatchAuthEvents) plus port-owned value types BeginRequest, FlowURL, Session, Token, ExchangeRequest, Claims, AuthEvent, AuthEventType. - internal/aggregator/entity_provider.go: EntityProvider interface plus EntityChange[T] / EntityChangeType. MCPServer and Workflow are temporary type aliases to internal/api/{MCPServer,Workflow}; PR 6 swaps them for port-owned structs when the EntityProvider migration lands. - internal/workflow/token_broker.go: narrower TokenBroker interface (GetToken + ExchangeToken only) with its own Token and ExchangeRequest value types. The broker domain will structurally satisfy both this and the aggregator's view in PR 4. 193 LOC added, 0 removed. Each consumer owns its own port. No shared pkg/contracts/ or ports.go kitchen sink. * refactor(aggregator): widen TokenBroker with SessionIssuer and RevokeUser Two intent-level additions surfaced when refactoring the aggregator's existing api.GetOAuthHandler() call sites against the port: - SessionIssuer(ctx, sessionID) (string, error) -- returns the IdP issuer bound to a session. Replaces the intent of api.OAuthHandler's FindTokenWithIDToken(sessionID).Issuer fallback, which exposed the cache shape ("find any token entry that happens to have an ID token") instead of the underlying question ("what issuer does this session belong to"). - RevokeUser(ctx, subject) error -- "sign out everywhere" for a user. Mirror of RevokeSession; covers api.OAuthHandler.DeleteTokensByUser without naming the storage. Cache-mutating operations (StoreToken, ClearTokenByIssuer, FindTokenWithIDToken, DeleteTokensByX) are intentionally NOT added: they reach into the broker's token store, which is implementation detail and must remain invisible across the seam. The broker self-manages its cache via mcp-oauth's session lifecycle hooks; aggregator-side logic that currently mutates the store (storeIDTokenForSSO etc.) moves into the broker domain in PR 4. The litmus test for membership on this interface: "would this method make sense over gRPC when the broker extracts to its own pod (Phase 7+)?" Storage mutators fail it; intent-level ops pass. HTTP wiring (GetHTTPHandler, GetCallbackPath, GetCIMD*) is also not added: http.Handler doesn't cross gRPC. A Mount(*http.ServeMux) method on the broker is the right shape and lands in a follow-up (PR 5 or its own PR). No behavior change. No callers yet. * refactor(aggregator): narrow TokenBroker to currently-consumable surface Drop Introspect, WatchAuthEvents, Claims, AuthEvent, and AuthEventType from the port. Both methods are Phase-4 surface and have no consumer today; their underlying broker capabilities (RFC 7662-style introspection, internal event streaming) don't exist on broker.Manager yet. Keeping them on the port forces every adapter to ship stub implementations, which the project's no-stubs rule forbids. They come back in Phase 4 -- once mcp-oauth's validate-token path is hoisted onto broker.Manager and the SessionCreation/Revocation/TokenRefresh callbacks (currently wired in internal/aggregator/server.go) are inverted into a broker-owned event stream. At that point Claims, AuthEvent, and AuthEventType land alongside the real implementations. Remaining port surface (7 methods, all intent-level, all backed by an existing or PR-4-added broker.Manager method): BeginOAuthFlow, CompleteOAuthFlow, GetToken, ExchangeToken, RevokeSession, RevokeUser, SessionIssuer. * refactor(aggregator): defer BeginOAuthFlow / CompleteOAuthFlow to PR 5 Drop the two manual-auth-flow methods plus their types (BeginRequest, FlowURL, Session) from PR 1's port surface. They're consumed by auth_tools.go and the OAuth callback handler -- PR 5 territory -- not by PR 4's auth_resource.go refactor. broker.Manager's existing HandleCallback returns only error, not a Session with subject/issuer. Mapping CompleteOAuthFlow onto it cleanly requires either widening Manager's return (a Phase 4 concern aligned with breaking out the session lifecycle) or a token-store re-read. Neither belongs in PR 4. PR 5 brings them back paired with the Manager-side changes that make the return shape honest. Port now: GetToken, ExchangeToken, RevokeSession, RevokeUser, SessionIssuer (5 methods). Each has a direct broker.Manager equivalent. * refactor(aggregator): defer ExchangeToken to PR 5 Drop ExchangeToken / ExchangeRequest from PR 1's port surface for the same reason BeginOAuthFlow / CompleteOAuthFlow were dropped: the underlying broker.Manager.ExchangeTokenForRemoteCluster takes an *api.TokenExchangeConfig (remote IdP endpoint, audience, connector ID), which the abstract ExchangeRequest shape doesn't carry. Mapping requires either widening ExchangeRequest with broker implementation detail (TokenExchangeConfig is a broker concern) or having the broker discover the config from registered server state. That design choice belongs in PR 5, paired with connection_helper.go's refactor where the actual call sites live and the right shape becomes obvious. PR 1's port is now: GetToken, RevokeSession, RevokeUser, SessionIssuer. 4 methods, each maps directly to an existing broker.Manager method (GetTokenByIssuer, DeleteTokensBySession, DeleteTokensByUser, plus the PR 4-added SessionIssuer). The port stays the design doc; each subsequent PR widens it as call sites are migrated. * review: drop narrative comments from PR 1 ports; narrow workflow.TokenBroker Three review fixes against muster's "comments describe present invariants only" rule: - internal/aggregator/token_broker.go: drop the PR 4/5 + Phase 4 roadmap block from the type doc. Keep the design invariants (consumer-defined port, storage mutators absent, gRPC litmus test). - internal/aggregator/entity_provider.go: rewrite MCPServer's alias doc to describe the present alias rationale rather than the future PR-6 migration. - internal/workflow/token_broker.go: drop ExchangeToken and ExchangeRequest. The aggregator port dropped them when no consumer surfaced; workflow has none either. The port comes back when an actual workflow caller materializes, paired with the call-site design. * review: tighten PR 1 doc comments Per code review: keep only invariants and non-obvious WHY; drop narrative phrasing. No behavior change. - TokenBroker doc: trim from 11 lines to 6. Drop the cross-file "Workflow defines its own view in internal/workflow/token_broker.go" pointer (godoc surfaces the cross-package type by name) and the explanatory "methods here are shaped so...". - Token doc (both files): one short sentence per type. Drop the "deployment time" detail; the invariant is just that callers do not branch. - workflow.TokenBroker: drop the "consumer-defined; broker provides an adapter that structurally satisfies both" line -- restated from the aggregator side, redundant in the workflow file. - EntityProvider.MCPServer doc: drop the "consumers reference the port type, which is what matters when the underlying definition later moves" forward-looking phrasing; keep only the present alias rationale. * review: narrow PR 1 to currently-consumed surface Drop RevokeSession, RevokeUser, and EntityProvider per code-review: - RevokeSession / RevokeUser have no production consumers in the current 7-PR stack. The session-revoke path runs through PR 5c's RecordSessionRevoked instead. Leaving the two methods on the port forces every adapter (in-process, mock, future gRPC) to carry empty surface. Apply the same gating the port used for Introspect / Watch / BeginOAuthFlow / ExchangeToken: methods land in the PR that wires their first consumer. - EntityProvider is defined but unused anywhere in the stack. The reconciler migration that consumes it is a separate PR (PR 6); the interface should land there alongside its adapters and tests, not here as speculative surface. UpdateStatus's status any also fails the gRPC litmus test -- a port-owned status struct will follow when PR 6 introduces the interface. PR 1's port now: GetToken, SessionIssuer (2 methods). Each grows the port in its own PR. * review: drop speculative workflow.TokenBroker port No workflow consumer references this interface. Per the rule "port grows only when a consumer needs it" (the rule used to delete the EntityProvider port in this same PR), the workflow port is removed until a workflow consumer materializes. * review: trim narrative comment on TokenBroker port * review: rename port parameter audience -> issuer GetToken's third argument was named 'audience' but every implementation, adapter, call site, and doc comment in the stack treats it as the issuer URL (the iss claim of the cached token). RFC 8707 audience specifically means the resource server identifier the token is bound to (aud claim), which is distinct from the issuer and only surfaces in RFC 8693 token-exchange in later PRs. Rename to issuer to remove the drift before downstream PRs ossify it. * review: trim narrative from TokenBroker doc comment * refactor(broker): rename internal/oauth/ -> internal/broker/ [2/8] (#648) * refactor(broker): rename internal/oauth/ -> internal/broker/ [2/8] Pure move. Package rename only, no behavior change. - git mv 20 .go files and 2 embedded templates from internal/oauth/ to internal/broker/. - Rewrite package declaration: oauth -> broker (20 files). - Update first line of doc.go: "Package oauth ..." -> "Package broker ...". - Update 3 import sites: * internal/aggregator/manager.go: import + 7 symbol references. * internal/aggregator/server.go: import + 1 symbol reference (musteroauth.DecodeEncryptionKey -> broker.DecodeEncryptionKey; external mcp-oauth aliased as `oauth` stays unaliased). * internal/server/oauth_http.go: import + 2 symbol references. (This file moves to internal/broker/http/ in PR 3.) - Fix one live doc reference in pkg/oauth/doc.go pointing at internal/oauth. ADRs under docs/explanation/decisions/ and the historical CHANGELOG entry mentioning internal/oauth are intentionally left as-is (immutable history). No back-compat shim left in internal/oauth/ -- only two aggregator files imported it, both updated inline. This introduces a transient direct internal/aggregator/ -> internal/broker/ import that PR 4-5 removes when the local adapter takes over; PR 8 enforces the seam with a CI rule. Diff: +33/-33 lines (git detected the renames). ~3050 LOC moved. make test (-race) and golangci-lint clean across all packages. * review: pin TLS 1.2 minimum on broker's custom-CA HTTP client gosec G402 fires when tls.Config has no MinVersion pin. The cleanup PR adds the same line to internal/oauth/manager.go; once that merges and this branch rebases, the rename brings it forward and this commit becomes redundant. Kept here so PR 648 CI passes independently of cleanup PR merge order. * review: refresh broker/doc.go for rename rationale + correct pkg/oauth NewClient example internal/broker/doc.go now opens by acknowledging the package was renamed from 'oauth'/'OAuth 2.1 proxy' to 'broker' and why (token-flow coordination beyond pure OAuth proxy). pkg/oauth/doc.go's usage example showed oauth.NewClient(httpClient, logger), which has never matched the functional-option signature; fix to oauth.NewClient(oauth.WithHTTPClient(httpClient)). * review: trim historical narrative from broker doc.go * refactor(broker): move oauth_http to internal/broker/http (brokerhttp) [3/8] (#649) * refactor(broker): move oauth_http to internal/broker/http as package brokerhttp [3/8] Pure move. No behavior change. - git mv internal/server/oauth_http.go -> internal/broker/http/oauth_http.go - git mv internal/server/oauth_http_test.go -> internal/broker/http/oauth_http_test.go - package renamed server -> brokerhttp (not "http" -- the file imports net/http extensively and "package http" would shadow it; aliasing every http.X call in the file is louder than the package-name divergence). - Add internal/broker/http/doc.go. Cross-file deps handled: oauth_http.go used GetIDToken / ContextWithIDToken from internal/server/token_provider.go (same package previously). After the move those calls are qualified server.X and the file imports internal/server/. token_provider.go stays in internal/server/ because the aggregator still imports it for ContextWithIDToken / GetIDTokenFromContext across four files; moving it would expand the seam blast radius. Test file got the same treatment (4 functions qualified, internal/server imported). Caller updates: - internal/aggregator/server.go: split the import. internal/server stays for context helpers (GetIDTokenFromContext, still used at L1467); internal/broker/http added under alias brokerhttp for *OAuthHTTPServer and NewOAuthHTTPServer (2 references). internal/server/ now contains only doc.go + token_provider.go. The doc.go still describes the moved OAuth HTTP server; PR 4 or a follow-up should rewrite it to describe the remaining context-helper surface only. Diff: 4 files changed, +17 / -14 (renames detected at >=98% similarity). make test -race and golangci-lint clean across all packages. * review: rewrite internal/server/doc.go for the post-move package shape internal/server/ used to host the OAuth HTTP server (now at internal/broker/http/). The package now contains only token_provider.go -- three small ID-token context helpers. The previous doc.go described the moved content (ADR 005, OAuth endpoints, RFC 7591 registration) and would mislead readers. New doc.go describes the present package: OIDC ID-token context helpers used by the aggregator and the broker HTTP adapter to thread the authenticated user's ID token between layers. No narrative reference to the move; ADRs under docs/explanation/decisions/ retain the history. * review: tighten brokerhttp doc.go Drop the "named brokerhttp rather than http to avoid shadowing net/http" sentence -- it explains a past naming decision rather than a present invariant, and the reason becomes evident the moment anyone tries to rename the package and the import collision surfaces. Keep only the present description (what the package exposes, who imports it, and the import-boundary invariant). * review: add internal/server/token_provider_test.go After PR 3 moved oauth_http_test.go out, internal/server/ had no in-package tests. token_provider.go's three functions were exercised cross-package, which under-reports coverage attribution. This brings them back inline. Three tests, table-driven where applicable: ContextWithIDToken round trip, GetIDTokenFromContext absence, GetIDToken across nil / no-extra / present / wrong-type cases. * review: wire oauthHTTPServer.Shutdown into aggregator shutdown; drop unused UserInfo alias The OAuth HTTP server's Shutdown method calls into mcp-oauth's subserver Shutdown which releases rate-limiter and token-store goroutines; it had no callers. Wire it into AggregatorServer.Shutdown alongside the other HTTP server cleanups. UserInfo type alias in internal/server/token_provider.go had zero call sites; drop it (and the now-unused providers import). * refactor(broker): move ID token helpers from internal/server into package broker ContextWithIDToken, GetIDTokenFromContext and GetIDToken are broker-domain helpers: they encode the OIDC ID token that the broker's HTTP middleware puts on the request context and that the aggregator reads back when it delegates to broker ports. Hosting them under internal/server made internal/server a vestigial package (3 funcs + a doc) and forced both the broker HTTP adapter and the aggregator to import internal/server for what is plainly broker functionality. Moves token_provider.go -> internal/broker/idtoken.go (package broker) and removes internal/server entirely. Callers in internal/broker/http and internal/aggregator switch from server.X to broker.X. * review: extract testServiceName to clear goconst finding * refactor(broker): thread ID token explicitly across the broker→aggregator seam Delete broker.ContextWithIDToken / broker.GetIDTokenFromContext and the package-level contextKey/idTokenKey. The broker→aggregator seam now carries the ID token via explicit parameters: - brokerhttp.OAuthHTTPServer.SetOnAuthenticated callback widens to func(ctx, sessionID, idToken string). - aggregator.EstablishConnectionWithTokenForwarding adds an idToken parameter and uses it directly. - aggregator.EstablishConnectionWithTokenExchange adds an idToken parameter and drops the now-unused musterIssuer parameter. - aggregator.establishSSOConnection adds an idToken parameter; producers in initSSOForSession and adminReconnectServer pass it through. - aggregator.getIDTokenForForwarding becomes lookupIDTokenForSession with no ctx parameter; the OAuth proxy store is the sole source for call sites that don't receive an ID token from their caller (MCP tool dispatch, header-func closures). Tests follow the new signatures and assert the forwarded ID token reaches the onAuthenticated callback. Forwarded-bearer flow without parseable JWT exp loses its ctx-based fallback; the existing warning at injectExternalIDToken already prescribes re-auth in that case, so the behaviour now matches the documented contract. * review: restore lookupIDTokenForSession debug logs and tighten doc comments - internal/aggregator/connection_helper.go: emit hit/miss Debug lines so the SSO token-source observability surface that getIDTokenForForwarding had is preserved after the rename. - internal/aggregator/auth_resource.go: keep the one-sentence pointer that manual-auth servers go through core_auth_login rather than this path. - internal/broker/idtoken.go: drop the "Kubernetes OIDC authentication requires the ID token" sentence — that is caller motivation, not a GetIDToken invariant. * review: drop call-site references from broker package and test double - internal/broker/doc.go: remove the "TokenBroker port in internal/aggregator consumes this package" sentence; the package's role doesn't depend on naming its consumer. - internal/aggregator/connection_helper_test.go: rephrase the mock's purpose generically rather than coupling it to one test function name. --------- Co-authored-by: Timo Derstappen <timo@giantswarm.io> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The smoke-test ATS job worst-case runtime is dominated by the module-scoped
deploymentfixture being recreated on every flaky rerun. Each rerun pays the full readiness timeout, and the rerun budget is set to 5:This PR:
timeoutfrom560→180(still ~3× the realistic ready time for the single-replicamusterDeployment in kind).@pytest.mark.flaky(reruns=5, reruns_delay=10)→@pytest.mark.flaky(reruns=1, reruns_delay=15). One retry is enough for transient kube-API hiccups;wait_for_deployments_to_runis already a polling loop, so additional reruns don't shake anything loose.Worst-case runtime drops from ~57 min to ~6 min while real failures still get a second chance.
This does not address why the deployment occasionally fails to become ready in CI — that's a separate issue.
Same fix as giantswarm/mcp-kubernetes#406; sibling PRs in
klaus,mcp-prometheus, anddex-app.Made with Cursor