Skip to content

test(ats): tighten smoke-test timeout and rerun budget#655

Merged
teemow merged 1 commit into
mainfrom
chore/ats-shorter-timeouts
May 11, 2026
Merged

test(ats): tighten smoke-test timeout and rerun budget#655
teemow merged 1 commit into
mainfrom
chore/ats-shorter-timeouts

Conversation

@teemow
Copy link
Copy Markdown
Member

@teemow teemow commented May 11, 2026

Summary

The smoke-test ATS job worst-case runtime is dominated by the module-scoped deployment fixture being recreated on every flaky rerun. Each rerun pays the full readiness timeout, and the rerun budget is set to 5:

1 initial + 5 reruns = 6 attempts
6 × 560 s ≈ 56 min worst case

This PR:

  • Cuts the per-attempt timeout from 560180 (still ~3× the realistic ready time for the single-replica muster Deployment in kind).
  • Reduces @pytest.mark.flaky(reruns=5, reruns_delay=10)@pytest.mark.flaky(reruns=1, reruns_delay=15). One retry is enough for transient kube-API hiccups; wait_for_deployments_to_run is already a polling loop, so additional reruns don't shake anything loose.

Worst-case runtime drops from ~57 min to ~6 min while real failures still get a second chance.

This does not address why the deployment occasionally fails to become ready in CI — that's a separate issue.

Same fix as giantswarm/mcp-kubernetes#406; sibling PRs in klaus, mcp-prometheus, and dex-app.

Made with Cursor

Cut the per-attempt deployment-readiness timeout from 560s to 180s and
reduce @pytest.mark.flaky reruns from 5 to 1.

The module-scoped `deployment` fixture is recreated on every rerun, so
the full timeout was paid each attempt (1 initial + 5 reruns = 6 attempts,
6 x 560s ~= 56 min worst case). With the new budget, worst case drops to
~6 min while real failures still get a second chance. 180s is still ~3x
the realistic ready time for the single-replica muster Deployment in kind.

Same fix as giantswarm/mcp-kubernetes#406.
@teemow teemow requested a review from a team as a code owner May 11, 2026 11:13
@teemow teemow merged commit 2e15b60 into main May 11, 2026
7 checks passed
@teemow teemow deleted the chore/ats-shorter-timeouts branch May 11, 2026 11:21
QuentinBisson added a commit that referenced this pull request May 12, 2026
#647)

* test(ats): tighten smoke-test timeout and rerun budget (#655)

Cut the per-attempt deployment-readiness timeout from 560s to 180s and
reduce @pytest.mark.flaky reruns from 5 to 1.

The module-scoped `deployment` fixture is recreated on every rerun, so
the full timeout was paid each attempt (1 initial + 5 reruns = 6 attempts,
6 x 560s ~= 56 min worst case). With the new budget, worst case drops to
~6 min while real failures still get a second chance. 180s is still ~3x
the realistic ready time for the single-replica muster Deployment in kind.

Same fix as giantswarm/mcp-kubernetes#406.

* chore(deps): update dependency requests to ~=2.34.0 (#658)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* refactor(aggregator,workflow): introduce consumer-defined TokenBroker and EntityProvider ports [1/8]

Define the hexagonal seam ahead of the broker bounded-context refactor.
No callers yet, no behavior change.

- internal/aggregator/token_broker.go: TokenBroker interface (full surface:
  Begin/Complete flow, GetToken, ExchangeToken, RevokeSession, Introspect,
  WatchAuthEvents) plus port-owned value types BeginRequest, FlowURL,
  Session, Token, ExchangeRequest, Claims, AuthEvent, AuthEventType.
- internal/aggregator/entity_provider.go: EntityProvider interface plus
  EntityChange[T] / EntityChangeType. MCPServer and Workflow are temporary
  type aliases to internal/api/{MCPServer,Workflow}; PR 6 swaps them for
  port-owned structs when the EntityProvider migration lands.
- internal/workflow/token_broker.go: narrower TokenBroker interface
  (GetToken + ExchangeToken only) with its own Token and ExchangeRequest
  value types. The broker domain will structurally satisfy both this and
  the aggregator's view in PR 4.

193 LOC added, 0 removed. Each consumer owns its own port. No shared
pkg/contracts/ or ports.go kitchen sink.

* refactor(aggregator): widen TokenBroker with SessionIssuer and RevokeUser

Two intent-level additions surfaced when refactoring the aggregator's existing
api.GetOAuthHandler() call sites against the port:

- SessionIssuer(ctx, sessionID) (string, error) -- returns the IdP issuer
  bound to a session. Replaces the intent of api.OAuthHandler's
  FindTokenWithIDToken(sessionID).Issuer fallback, which exposed the cache
  shape ("find any token entry that happens to have an ID token") instead
  of the underlying question ("what issuer does this session belong to").
- RevokeUser(ctx, subject) error -- "sign out everywhere" for a user.
  Mirror of RevokeSession; covers api.OAuthHandler.DeleteTokensByUser
  without naming the storage.

Cache-mutating operations (StoreToken, ClearTokenByIssuer,
FindTokenWithIDToken, DeleteTokensByX) are intentionally NOT added: they
reach into the broker's token store, which is implementation detail and
must remain invisible across the seam. The broker self-manages its cache
via mcp-oauth's session lifecycle hooks; aggregator-side logic that
currently mutates the store (storeIDTokenForSSO etc.) moves into the
broker domain in PR 4. The litmus test for membership on this interface:
"would this method make sense over gRPC when the broker extracts to its
own pod (Phase 7+)?" Storage mutators fail it; intent-level ops pass.

HTTP wiring (GetHTTPHandler, GetCallbackPath, GetCIMD*) is also not added:
http.Handler doesn't cross gRPC. A Mount(*http.ServeMux) method on the
broker is the right shape and lands in a follow-up (PR 5 or its own PR).

No behavior change. No callers yet.

* refactor(aggregator): narrow TokenBroker to currently-consumable surface

Drop Introspect, WatchAuthEvents, Claims, AuthEvent, and AuthEventType from
the port. Both methods are Phase-4 surface and have no consumer today; their
underlying broker capabilities (RFC 7662-style introspection, internal event
streaming) don't exist on broker.Manager yet. Keeping them on the port
forces every adapter to ship stub implementations, which the project's
no-stubs rule forbids.

They come back in Phase 4 -- once mcp-oauth's validate-token path is
hoisted onto broker.Manager and the SessionCreation/Revocation/TokenRefresh
callbacks (currently wired in internal/aggregator/server.go) are inverted
into a broker-owned event stream. At that point Claims, AuthEvent, and
AuthEventType land alongside the real implementations.

Remaining port surface (7 methods, all intent-level, all backed by an
existing or PR-4-added broker.Manager method): BeginOAuthFlow,
CompleteOAuthFlow, GetToken, ExchangeToken, RevokeSession, RevokeUser,
SessionIssuer.

* refactor(aggregator): defer BeginOAuthFlow / CompleteOAuthFlow to PR 5

Drop the two manual-auth-flow methods plus their types (BeginRequest,
FlowURL, Session) from PR 1's port surface. They're consumed by
auth_tools.go and the OAuth callback handler -- PR 5 territory -- not by
PR 4's auth_resource.go refactor.

broker.Manager's existing HandleCallback returns only error, not a Session
with subject/issuer. Mapping CompleteOAuthFlow onto it cleanly requires
either widening Manager's return (a Phase 4 concern aligned with breaking
out the session lifecycle) or a token-store re-read. Neither belongs in
PR 4. PR 5 brings them back paired with the Manager-side changes that
make the return shape honest.

Port now: GetToken, ExchangeToken, RevokeSession, RevokeUser,
SessionIssuer (5 methods). Each has a direct broker.Manager equivalent.

* refactor(aggregator): defer ExchangeToken to PR 5

Drop ExchangeToken / ExchangeRequest from PR 1's port surface for the same
reason BeginOAuthFlow / CompleteOAuthFlow were dropped: the underlying
broker.Manager.ExchangeTokenForRemoteCluster takes an *api.TokenExchangeConfig
(remote IdP endpoint, audience, connector ID), which the abstract
ExchangeRequest shape doesn't carry. Mapping requires either widening
ExchangeRequest with broker implementation detail (TokenExchangeConfig
is a broker concern) or having the broker discover the config from
registered server state.

That design choice belongs in PR 5, paired with connection_helper.go's
refactor where the actual call sites live and the right shape becomes
obvious.

PR 1's port is now: GetToken, RevokeSession, RevokeUser, SessionIssuer.
4 methods, each maps directly to an existing broker.Manager method
(GetTokenByIssuer, DeleteTokensBySession, DeleteTokensByUser, plus the
PR 4-added SessionIssuer). The port stays the design doc; each
subsequent PR widens it as call sites are migrated.

* review: drop narrative comments from PR 1 ports; narrow workflow.TokenBroker

Three review fixes against muster's "comments describe present invariants
only" rule:

- internal/aggregator/token_broker.go: drop the PR 4/5 + Phase 4 roadmap
  block from the type doc. Keep the design invariants (consumer-defined
  port, storage mutators absent, gRPC litmus test).
- internal/aggregator/entity_provider.go: rewrite MCPServer's alias doc
  to describe the present alias rationale rather than the future PR-6
  migration.
- internal/workflow/token_broker.go: drop ExchangeToken and
  ExchangeRequest. The aggregator port dropped them when no consumer
  surfaced; workflow has none either. The port comes back when an actual
  workflow caller materializes, paired with the call-site design.

* review: tighten PR 1 doc comments

Per code review: keep only invariants and non-obvious WHY; drop narrative
phrasing. No behavior change.

- TokenBroker doc: trim from 11 lines to 6. Drop the cross-file "Workflow
  defines its own view in internal/workflow/token_broker.go" pointer
  (godoc surfaces the cross-package type by name) and the explanatory
  "methods here are shaped so...".
- Token doc (both files): one short sentence per type. Drop the
  "deployment time" detail; the invariant is just that callers do not
  branch.
- workflow.TokenBroker: drop the "consumer-defined; broker provides an
  adapter that structurally satisfies both" line -- restated from the
  aggregator side, redundant in the workflow file.
- EntityProvider.MCPServer doc: drop the "consumers reference the port
  type, which is what matters when the underlying definition later moves"
  forward-looking phrasing; keep only the present alias rationale.

* review: narrow PR 1 to currently-consumed surface

Drop RevokeSession, RevokeUser, and EntityProvider per code-review:

- RevokeSession / RevokeUser have no production consumers in the
  current 7-PR stack. The session-revoke path runs through PR 5c's
  RecordSessionRevoked instead. Leaving the two methods on the port
  forces every adapter (in-process, mock, future gRPC) to carry empty
  surface. Apply the same gating the port used for Introspect / Watch /
  BeginOAuthFlow / ExchangeToken: methods land in the PR that wires
  their first consumer.
- EntityProvider is defined but unused anywhere in the stack. The
  reconciler migration that consumes it is a separate PR (PR 6); the
  interface should land there alongside its adapters and tests, not
  here as speculative surface. UpdateStatus's status any also fails the
  gRPC litmus test -- a port-owned status struct will follow when PR 6
  introduces the interface.

PR 1's port now: GetToken, SessionIssuer (2 methods). Each grows the
port in its own PR.

* review: drop speculative workflow.TokenBroker port

No workflow consumer references this interface. Per the rule "port
grows only when a consumer needs it" (the rule used to delete the
EntityProvider port in this same PR), the workflow port is removed
until a workflow consumer materializes.

* review: trim narrative comment on TokenBroker port

* review: rename port parameter audience -> issuer

GetToken's third argument was named 'audience' but every implementation,
adapter, call site, and doc comment in the stack treats it as the issuer
URL (the iss claim of the cached token). RFC 8707 audience specifically
means the resource server identifier the token is bound to (aud claim),
which is distinct from the issuer and only surfaces in RFC 8693
token-exchange in later PRs. Rename to issuer to remove the drift before
downstream PRs ossify it.

* review: trim narrative from TokenBroker doc comment

* refactor(broker): rename internal/oauth/ -> internal/broker/ [2/8] (#648)

* refactor(broker): rename internal/oauth/ -> internal/broker/ [2/8]

Pure move. Package rename only, no behavior change.

- git mv 20 .go files and 2 embedded templates from internal/oauth/ to
  internal/broker/.
- Rewrite package declaration: oauth -> broker (20 files).
- Update first line of doc.go: "Package oauth ..." -> "Package broker ...".
- Update 3 import sites:
  * internal/aggregator/manager.go: import + 7 symbol references.
  * internal/aggregator/server.go: import + 1 symbol reference
    (musteroauth.DecodeEncryptionKey -> broker.DecodeEncryptionKey;
    external mcp-oauth aliased as `oauth` stays unaliased).
  * internal/server/oauth_http.go: import + 2 symbol references.
    (This file moves to internal/broker/http/ in PR 3.)
- Fix one live doc reference in pkg/oauth/doc.go pointing at internal/oauth.

ADRs under docs/explanation/decisions/ and the historical CHANGELOG entry
mentioning internal/oauth are intentionally left as-is (immutable history).

No back-compat shim left in internal/oauth/ -- only two aggregator files
imported it, both updated inline. This introduces a transient direct
internal/aggregator/ -> internal/broker/ import that PR 4-5 removes when
the local adapter takes over; PR 8 enforces the seam with a CI rule.

Diff: +33/-33 lines (git detected the renames). ~3050 LOC moved.
make test (-race) and golangci-lint clean across all packages.

* review: pin TLS 1.2 minimum on broker's custom-CA HTTP client

gosec G402 fires when tls.Config has no MinVersion pin. The cleanup PR
adds the same line to internal/oauth/manager.go; once that merges and
this branch rebases, the rename brings it forward and this commit
becomes redundant. Kept here so PR 648 CI passes independently of
cleanup PR merge order.

* review: refresh broker/doc.go for rename rationale + correct pkg/oauth NewClient example

internal/broker/doc.go now opens by acknowledging the package was
renamed from 'oauth'/'OAuth 2.1 proxy' to 'broker' and why (token-flow
coordination beyond pure OAuth proxy). pkg/oauth/doc.go's usage example
showed oauth.NewClient(httpClient, logger), which has never matched the
functional-option signature; fix to oauth.NewClient(oauth.WithHTTPClient(httpClient)).

* review: trim historical narrative from broker doc.go

* refactor(broker): move oauth_http to internal/broker/http (brokerhttp) [3/8] (#649)

* refactor(broker): move oauth_http to internal/broker/http as package brokerhttp [3/8]

Pure move. No behavior change.

- git mv internal/server/oauth_http.go -> internal/broker/http/oauth_http.go
- git mv internal/server/oauth_http_test.go -> internal/broker/http/oauth_http_test.go
- package renamed server -> brokerhttp (not "http" -- the file imports
  net/http extensively and "package http" would shadow it; aliasing every
  http.X call in the file is louder than the package-name divergence).
- Add internal/broker/http/doc.go.

Cross-file deps handled: oauth_http.go used GetIDToken / ContextWithIDToken
from internal/server/token_provider.go (same package previously). After the
move those calls are qualified server.X and the file imports
internal/server/. token_provider.go stays in internal/server/ because the
aggregator still imports it for ContextWithIDToken / GetIDTokenFromContext
across four files; moving it would expand the seam blast radius.

Test file got the same treatment (4 functions qualified, internal/server
imported).

Caller updates:
- internal/aggregator/server.go: split the import. internal/server stays
  for context helpers (GetIDTokenFromContext, still used at L1467);
  internal/broker/http added under alias brokerhttp for *OAuthHTTPServer
  and NewOAuthHTTPServer (2 references).

internal/server/ now contains only doc.go + token_provider.go. The doc.go
still describes the moved OAuth HTTP server; PR 4 or a follow-up should
rewrite it to describe the remaining context-helper surface only.

Diff: 4 files changed, +17 / -14 (renames detected at >=98% similarity).
make test -race and golangci-lint clean across all packages.

* review: rewrite internal/server/doc.go for the post-move package shape

internal/server/ used to host the OAuth HTTP server (now at
internal/broker/http/). The package now contains only token_provider.go --
three small ID-token context helpers. The previous doc.go described the
moved content (ADR 005, OAuth endpoints, RFC 7591 registration) and would
mislead readers.

New doc.go describes the present package: OIDC ID-token context helpers
used by the aggregator and the broker HTTP adapter to thread the
authenticated user's ID token between layers. No narrative reference to
the move; ADRs under docs/explanation/decisions/ retain the history.

* review: tighten brokerhttp doc.go

Drop the "named brokerhttp rather than http to avoid shadowing net/http"
sentence -- it explains a past naming decision rather than a present
invariant, and the reason becomes evident the moment anyone tries to
rename the package and the import collision surfaces. Keep only the
present description (what the package exposes, who imports it, and the
import-boundary invariant).

* review: add internal/server/token_provider_test.go

After PR 3 moved oauth_http_test.go out, internal/server/ had no
in-package tests. token_provider.go's three functions were exercised
cross-package, which under-reports coverage attribution. This brings
them back inline.

Three tests, table-driven where applicable: ContextWithIDToken round
trip, GetIDTokenFromContext absence, GetIDToken across nil / no-extra /
present / wrong-type cases.

* review: wire oauthHTTPServer.Shutdown into aggregator shutdown; drop unused UserInfo alias

The OAuth HTTP server's Shutdown method calls into mcp-oauth's
subserver Shutdown which releases rate-limiter and token-store
goroutines; it had no callers. Wire it into AggregatorServer.Shutdown
alongside the other HTTP server cleanups.

UserInfo type alias in internal/server/token_provider.go had zero
call sites; drop it (and the now-unused providers import).

* refactor(broker): move ID token helpers from internal/server into package broker

ContextWithIDToken, GetIDTokenFromContext and GetIDToken are broker-domain
helpers: they encode the OIDC ID token that the broker's HTTP middleware
puts on the request context and that the aggregator reads back when it
delegates to broker ports. Hosting them under internal/server made
internal/server a vestigial package (3 funcs + a doc) and forced both the
broker HTTP adapter and the aggregator to import internal/server for what
is plainly broker functionality.

Moves token_provider.go -> internal/broker/idtoken.go (package broker)
and removes internal/server entirely. Callers in internal/broker/http and
internal/aggregator switch from server.X to broker.X.

* review: extract testServiceName to clear goconst finding

* refactor(broker): thread ID token explicitly across the broker→aggregator seam

Delete broker.ContextWithIDToken / broker.GetIDTokenFromContext and the
package-level contextKey/idTokenKey. The broker→aggregator seam now carries
the ID token via explicit parameters:

- brokerhttp.OAuthHTTPServer.SetOnAuthenticated callback widens to
  func(ctx, sessionID, idToken string).
- aggregator.EstablishConnectionWithTokenForwarding adds an idToken parameter
  and uses it directly.
- aggregator.EstablishConnectionWithTokenExchange adds an idToken parameter
  and drops the now-unused musterIssuer parameter.
- aggregator.establishSSOConnection adds an idToken parameter; producers in
  initSSOForSession and adminReconnectServer pass it through.
- aggregator.getIDTokenForForwarding becomes lookupIDTokenForSession with no
  ctx parameter; the OAuth proxy store is the sole source for call sites
  that don't receive an ID token from their caller (MCP tool dispatch,
  header-func closures).

Tests follow the new signatures and assert the forwarded ID token reaches
the onAuthenticated callback.

Forwarded-bearer flow without parseable JWT exp loses its ctx-based fallback;
the existing warning at injectExternalIDToken already prescribes re-auth in
that case, so the behaviour now matches the documented contract.

* review: restore lookupIDTokenForSession debug logs and tighten doc comments

- internal/aggregator/connection_helper.go: emit hit/miss Debug lines so the
  SSO token-source observability surface that getIDTokenForForwarding had is
  preserved after the rename.
- internal/aggregator/auth_resource.go: keep the one-sentence pointer that
  manual-auth servers go through core_auth_login rather than this path.
- internal/broker/idtoken.go: drop the "Kubernetes OIDC authentication
  requires the ID token" sentence — that is caller motivation, not a
  GetIDToken invariant.

* review: drop call-site references from broker package and test double

- internal/broker/doc.go: remove the "TokenBroker port in internal/aggregator
  consumes this package" sentence; the package's role doesn't depend on
  naming its consumer.
- internal/aggregator/connection_helper_test.go: rephrase the mock's purpose
  generically rather than coupling it to one test function name.

---------

Co-authored-by: Timo Derstappen <timo@giantswarm.io>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant