
docs: translator series final state (PR 16/16) #724

Open
QuentinBisson wants to merge 8 commits into feat/mcpserver-suspended-cleanup-core-service from docs/translator-agw-final-state


@QuentinBisson QuentinBisson commented May 19, 2026

Step 16/16 of the agentgateway adapter series. Plan: /home/quentin/.claude/plans/declarative-forging-lynx.md.

Summary

Ships the documented end-state of the muster-in-front-of-agentgateway intermediate topology and the BDD scenarios that prove the regression + feature surface the pivot preserves. Two artefacts in one PR — docs and scenarios are both forms of executable specification; landing them together keeps docs from drifting from runtime reality.

Stacked on #723 (PR 15/16). PRs 11–15 must merge first; this PR's diff base shifts to main after.

Notes on scope drift vs the plan

  • Helm RBAC for agentgateway.dev/v1alpha1 and gateway.networking.k8s.io/v1 ships here. It arguably belongs with PR 3 (feat(agentgateway): PR 3/16 — K8s emitter (cluster mode) #692, cluster-mode k8s.Applier) or PR 11 (feat(aggregator): PR 11+15a — upstream-proxy rewrite + spec.suspended landing early #720, cluster-mode wiring) — without these grants the muster ServiceAccount 403s on every emit. Landing it in the docs+test PR means cluster-mode was effectively un-deployable across the prior stack. Bundling it here was the pragmatic choice given how late the gap was found; future series should land RBAC with the first PR that emits the resources.
  • No retitling of PR 1–15 scenarios. The plan called out PR 16 as the natural home for restructuring scenarios added in earlier PRs of the series to match the final architecture. After auditing the as-shipped scenario names this turned out to be a no-op: scenarios from PRs 1–15 (agentgateway-subprocess-data-plane, mcpserver-mixed-transports, mcpserver-suspended-resumed, session-multi-user-tool-isolation, etc.) already describe what they exercise under muster-in-front. No renames were required.

Documentation

  • docs/operations/installation.md — new Data plane: muster + agentgateway section with Filesystem mode, Cluster mode — one MCPServer CRD suffices, and Pause / resume an MCPServer subsections. Covers binary resolver precedence (MUSTER_AGW_BINARY env → ~/.config/muster/bin cache → pinned GitHub release with SHA-256), the combined <configPath>/agentgateway/agentgateway.yaml, listener defaults (muster on 8080), MUSTER_AGW_UPSTREAM_URL and the Helm value muster.agentgateway.upstreamURL, the per-MCPServer emitted resources table, the NotSupportedInCluster stdio condition, and three equivalent ways to flip spec.suspended (kubectl patch, YAML edit, core_mcpserver_update). Closes with the deprecation note on core_service_list / core_service_status and points readers at core_mcpserver_list / _get / _reconnect.
  • No docs/runbooks/deploy-muster-with-agentgateway.md: docs/runbooks/ does not exist on this branch (prompt scoped the runbook as conditional).
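The resolver precedence described above (environment override, then local cache, then a checksum-verified pinned release) can be sketched roughly as follows. This is an illustrative sketch only, not muster's resolver: the function names `resolveBinary` and `verifySHA256`, the injected lookup helpers, and the cached file name `agentgateway` are assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// resolveBinary returns a binary path and the source that won.
// Order follows the PR description: MUSTER_AGW_BINARY, then the cache
// directory, then a pinned release download. lookupEnv and exists are
// injected so the order can be exercised without real files.
// (Hypothetical helper, not muster's actual API.)
func resolveBinary(lookupEnv func(string) string, exists func(string) bool, cacheDir string) (string, string) {
	if p := lookupEnv("MUSTER_AGW_BINARY"); p != "" {
		return p, "env"
	}
	cached := filepath.Join(cacheDir, "agentgateway") // file name is an assumption
	if exists(cached) {
		return cached, "cache"
	}
	// Caller would download the pinned release here and verify it.
	return cached, "download"
}

// verifySHA256 reports whether data hashes to the expected hex digest.
func verifySHA256(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == wantHex
}

func main() {
	exists := func(p string) bool {
		_, err := os.Stat(p)
		return err == nil
	}
	home, _ := os.UserHomeDir()
	path, source := resolveBinary(os.Getenv, exists, filepath.Join(home, ".config", "muster", "bin"))
	fmt.Println(source, path)
}
```

Injecting the environment and filesystem lookups keeps the precedence order testable in isolation, which is the property the resolver tests listed later in this description are said to cover.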

Helm RBAC

  • helm/muster/templates/rbac.yaml — grants the muster ServiceAccount full CRUD on agentgateway.dev/v1alpha1 AgentgatewayBackend + AgentgatewayPolicy and gateway.networking.k8s.io/v1 HTTPRoute. Without this the cluster-mode k8s.Applier would 403 on every emit. Also drops the stale teleport identity mention from the secrets-rule comment (teleport was removed in Remove Teleport authentication support #687).
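To illustrate why the missing grants surface as a 403 on every emit, here is a small model of RBAC rule matching. It is a sketch under assumptions: the `policyRule` struct only mirrors the general shape of a Kubernetes RBAC rule, the plural resource names for the agentgateway kinds are guessed from the kind names, and the verb list is one reading of "full CRUD". The actual template in helm/muster/templates/rbac.yaml is authoritative.

```go
package main

import "fmt"

// policyRule mirrors the shape of an RBAC rule (local type, not k8s.io/api).
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allowed reports whether any rule grants verb on group/resource.
// With no matching rule the API server would answer 403 Forbidden.
func allowed(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

// musterRules lists the grants described in this PR.
// Resource plurals for agentgateway.dev and the verb set are assumptions.
func musterRules() []policyRule {
	crud := []string{"get", "list", "watch", "create", "update", "patch", "delete"}
	return []policyRule{
		{
			APIGroups: []string{"agentgateway.dev"},
			Resources: []string{"agentgatewaybackends", "agentgatewaypolicies"},
			Verbs:     crud,
		},
		{
			APIGroups: []string{"gateway.networking.k8s.io"},
			Resources: []string{"httproutes"},
			Verbs:     crud,
		},
	}
}

func main() {
	fmt.Println("with grants:", allowed(musterRules(), "gateway.networking.k8s.io", "httproutes", "create"))
	fmt.Println("without grants:", allowed(nil, "gateway.networking.k8s.io", "httproutes", "create"))
}
```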

BDD scenarios (internal/testing/scenarios/)

Regression

| Scenario | Locks |
| --- | --- |
| mcpserver-required-audiences-config.yaml (new) | spec.auth.requiredAudiences round-trips through create + get + update; multi-value preserved |
| mcpserver-token-exchange-config.yaml (new) | RFC 8693 standalone tokenExchange (non-Teleport) CRUD; mutually-exclusive-with-authorizationServer CEL rejection |
| mcpserver-authorization-server-override.yaml (extended) | #599 override survives core_mcpserver_update |
| session-multi-user-tool-isolation.yaml (existing) | ADR-006 subject-scoped tool visibility |
| mcpserver-suspended-resumed.yaml (from #723) | spec.suspended pause / resume cycle |
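The mutual-exclusion rule that the token-exchange scenario locks is enforced by a CEL rule on the CRD according to this description. The same invariant, expressed as a plain Go check for illustration, might look like the sketch below. The struct and field names (`authSpec`, `tokenExchange`, `authorizationServer`, `TokenURL`, `Issuer`) are hypothetical and not taken from the muster API types.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the spec.auth sub-objects.
type tokenExchange struct{ TokenURL string }
type authorizationServer struct{ Issuer string }

type authSpec struct {
	TokenExchange       *tokenExchange
	AuthorizationServer *authorizationServer
	RequiredAudiences   []string
}

var errMutuallyExclusive = errors.New("spec.auth: tokenExchange and authorizationServer are mutually exclusive")

// validateAuth rejects a spec that sets both blocks, mirroring the
// invariant the CEL rule is described as enforcing.
func validateAuth(a authSpec) error {
	if a.TokenExchange != nil && a.AuthorizationServer != nil {
		return errMutuallyExclusive
	}
	return nil
}

func main() {
	both := authSpec{
		TokenExchange:       &tokenExchange{TokenURL: "https://idp.example/token"},
		AuthorizationServer: &authorizationServer{Issuer: "https://idp.example"},
	}
	fmt.Println(validateAuth(both))

	onlyExchange := authSpec{
		TokenExchange:     &tokenExchange{TokenURL: "https://idp.example/token"},
		RequiredAudiences: []string{"aud-a", "aud-b"},
	}
	fmt.Println(validateAuth(onlyExchange))
}
```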

Feature

| Scenario | Locks |
| --- | --- |
| agentgateway-subprocess-data-plane.yaml (existing) | Filesystem-mode auto-spawn + streamable-http tool calls via subprocess data plane |
| agentgateway-subprocess-stdio-target.yaml (new) | Filesystem-mode stdio MCPServer spawned by agentgateway via mcp.targets[].stdio |
| mcpserver-auth-required-through-agentgateway.yaml (existing) | Upstream-proxy routing preserves OAuth challenge (WWW-Authenticate + auth_required state) |
| mcpserver-reconnect.yaml (new) | core_mcpserver_reconnect deregister + re-register round trip; tool calls survive |
| mcpserver-mixed-transports.yaml (existing) | Stdio + streamable-http co-existence with spec.suspended toggle |

End-to-end

| Coverage | Where |
| --- | --- |
| Apply / delete MCPServer round trip | mcpserver-crud.yaml + mcpserver-delete.yaml + mcpserver-lifecycle.yaml (existing) |
| Clean SIGTERM ordering (reconciler → agw → orch) | internal/testing/integration/agw_subprocess_integration_test.go (existing) |
| Subprocess crash + restart with exponential backoff | internal/agentgateway/subprocess/manager_test.go: TestManager_Restart_OnCrash_WithBackoff (existing) |
| Binary resolver precedence + SHA-256 verification | internal/agentgateway/binary/resolver_test.go (existing) |
| Cluster-mode CRD emission with OwnerReferences cascade | internal/reconciler/agentgateway/k8s/applier_test.go (existing) |
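For readers unfamiliar with the crash-restart row above, capped exponential backoff in its simplest form looks like the sketch below. The base delay, the cap, and the function name `backoffDelay` are assumptions for illustration; the values and jitter policy used by the subprocess manager are not stated in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay doubles base once per prior attempt and clamps at max.
// attempt 0 is the first restart. Jitter is omitted to keep it short.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Assumed values: 500ms base, 30s cap.
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("restart %d after %v\n", attempt, backoffDelay(attempt, 500*time.Millisecond, 30*time.Second))
	}
}
```

Clamping inside the loop avoids overflowing the duration when the attempt counter grows large after a long crash loop.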

Out of scope

Documents the muster-in-front-of-agentgateway intermediate topology
and ships the BDD scenarios that prove the regression + feature
surface the pivot preserves.

- docs/operations/installation.md: filesystem-mode subprocess auto-spawn
  (binary resolver precedence, combined agentgateway.yaml location,
  listener defaults), cluster-mode CRD emission (AgentgatewayBackend +
  HTTPRoute + AgentgatewayPolicy per MCPServer with OwnerReference
  cascade), spec.suspended pause/resume across both modes, and the
  deprecation note on core_service_list / core_service_status.
- helm/muster/templates/rbac.yaml: grant the muster ServiceAccount full
  CRUD on agentgateway.dev/v1alpha1 AgentgatewayBackend +
  AgentgatewayPolicy and gateway.networking.k8s.io/v1 HTTPRoute so the
  cluster-mode reconciler can emit the agentgateway config stack.
- internal/testing/scenarios:
  - mcpserver-required-audiences-config.yaml (regression):
    spec.auth.requiredAudiences round-trips through create + get +
    update, including the multi-value case.
  - mcpserver-token-exchange-config.yaml (regression): RFC 8693
    standalone tokenExchange (non-Teleport) CRUD + the
    mutually-exclusive-with-authorizationServer rejection.
  - mcpserver-authorization-server-override.yaml (extended): the #599
    override survives core_mcpserver_update.
  - mcpserver-reconnect.yaml (feature): core_mcpserver_reconnect
    force-reconnect with tool-call survival before and after.
  - agentgateway-subprocess-stdio-target.yaml (feature): filesystem
    stdio MCPServer spawned by agentgateway via mcp.targets[].stdio
    and proxied through the subprocess data plane.
…-core-service' into docs/translator-agw-final-state
QuentinBisson and others added 3 commits May 20, 2026 10:32
…-core-service' into docs/translator-agw-final-state
docs/operations/installation.md described the BDD port-override and
SIGTERM drain in terms of internal Go symbols (aggregator.AgentgatewayPort,
yamlapply.WithListenerPort). aggregator.AgentgatewayPort isn't even a
real config field — the BDD harness sets an internal AggregatorConfig
struct field, not anything user-visible. Replace the prose with a
user-facing description: muster picks an unused loopback port at
startup so parallel instances coexist; the SIGTERM drain waits up to
ten seconds.
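
A common way to pick an unused loopback port, and to bound a drain by a timeout, is sketched below. This illustrates the general technique only; `freeLoopbackPort` and `drained` are hypothetical names, and how muster actually selects the port or waits for the subprocess is not shown in this commit message.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// freeLoopbackPort asks the kernel for an unused TCP port on 127.0.0.1
// by binding to port 0, then releases it. Another process could take the
// port before the caller rebinds, so production code usually keeps the
// listener open instead of closing it.
func freeLoopbackPort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// drained waits for done to close and gives up after timeout.
func drained(done <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	port, err := freeLoopbackPort()
	if err != nil {
		fmt.Println("no port:", err)
		return
	}
	fmt.Println("candidate port:", port)

	done := make(chan struct{})
	go func() { close(done) }() // stand-in for the subprocess exiting
	fmt.Println("drained:", drained(done, 10*time.Second))
}
```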

The deprecation paragraph below it still framed core_service_start /
core_service_stop / core_service_restart as removed-but-noted; lead
with what users should do (spec.suspended via core_mcpserver_update,
plus core_mcpserver_reconnect for the force-reconnect verb).

mcpserver-reconnect.yaml:
- Use core_mcpserver_get (canonical) instead of core_service_status
  (deprecated alias). State assertions move to the capitalised CRD
  values ("Connected"/"Disconnected").
- reconnect-non-existent-fails asserted on "Failed to reconnect", an
  English error-prefix string carried by HandleErrorWithPrefix. A
  refactor of that helper would break the scenario without breaking
  any unit test. Assert on the input name ("no-such-server") instead,
  which the underlying NotFoundError formats into its message.
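
The difference between the two assertion styles can be sketched in Go as follows. The `NotFoundError` type, its message format, and the `reconnect` function here are stand-ins written for this example, not muster's implementations; only the "Failed to reconnect" prefix and the "no-such-server" name come from the scenario described above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// NotFoundError is a hypothetical stand-in that formats the requested
// name into its message.
type NotFoundError struct {
	Kind string
	Name string
}

func (e *NotFoundError) Error() string {
	return fmt.Sprintf("%s %q not found", e.Kind, e.Name)
}

// reconnect fails with a prefixed NotFoundError for unknown servers.
// The prefix is presentation and may change; the name is input data.
func reconnect(name string, known map[string]bool) error {
	if !known[name] {
		return fmt.Errorf("Failed to reconnect: %w", &NotFoundError{Kind: "MCPServer", Name: name})
	}
	return nil
}

// mentionsName is the robust assertion: it survives a reworded prefix.
func mentionsName(err error, name string) bool {
	return err != nil && strings.Contains(err.Error(), name)
}

// isNotFound checks the error type rather than any message text.
func isNotFound(err error) bool {
	var nf *NotFoundError
	return errors.As(err, &nf)
}

func main() {
	err := reconnect("no-such-server", map[string]bool{"filesystem": true})
	fmt.Println(err)
	fmt.Println("mentions name:", mentionsName(err, "no-such-server"))
	fmt.Println("is not-found:", isNotFound(err))
}
```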

mcpserver-token-exchange-config.yaml and
mcpserver-required-audiences-config.yaml: tighten scope claims in the
description and header comment so it's clear these scenarios lock the
schema field round-trip, with on-wire RFC 8693 / cross-cluster SSO
covered by oauth-sso-token-exchange-basic.yaml against the mock Dex +
mock backend topology. requiredAudiences' aggregation helper is
unit-tested in internal/api/handlers_mcpserver_test.go::TestCollectRequiredAudiences.
@QuentinBisson QuentinBisson marked this pull request as ready for review May 20, 2026 14:31
@QuentinBisson QuentinBisson requested a review from a team as a code owner May 20, 2026 14:31
@QuentinBisson QuentinBisson changed the title from "docs+test: translator series final state (PR 16/16)" to "docs: translator series final state (PR 16/16)" May 21, 2026