feat(server): pluggable request-auth framework (management + runtime)#204

Merged
abhinav-galileo merged 12 commits into main from
abhi/management-auth-framework
Apr 30, 2026
@abhinav-galileo abhinav-galileo commented Apr 28, 2026

Summary

Pluggable request-auth framework that handles both auth flows the
system needs:

  • Management. Online check on every request. The default
    authorizer authenticates the credential and authorizes the
    operation; in production this is HttpUpstreamAuthProvider
    forwarding to a configurable upstream service.
  • Runtime. Two-phase exchange-then-verify. A target-bearing call
    presents a long-lived credential plus (target_type, target_id) to
    a token exchange endpoint; the server mints a short-lived HS256 JWT
    bound to that target. Subsequent runtime calls verify the JWT
    locally, with no upstream round-trip on the hot path.

Both flows share the same primitives: endpoints declare an Operation,
an installed RequestAuthorizer (a Protocol) decides the request, and a
Principal is returned. A per-operation registry lets a deployment point
management ops at one provider and runtime ops at another.

Migrates the /control-bindings endpoint family onto the framework
and ships the runtime token exchange endpoint. The runtime resolution
path itself (/evaluation etc.) is wired in a follow-up; its
provider override (LocalJwtVerifyProvider) is already in place when
the runtime secret is configured.

Module layout

server/src/agent_control_server/auth_framework/
  __init__.py                   # public API
  core.py                       # Operation, Principal, RequestAuthorizer, require_operation, registry
  config.py                     # configure_auth_from_env, RuntimeAuthConfig, set_runtime_auth_config
  runtime_token.py              # HS256 mint / verify helpers, UpstreamGrantExpiredError
  providers/
    __init__.py
    header.py                   # HeaderAuthProvider + DEFAULT_OPERATION_ACCESS
    http_upstream.py            # HttpUpstreamAuthProvider (forward + parse grant)
    local_jwt.py                # LocalJwtVerifyProvider (hot-path JWT verify)

server/src/agent_control_server/endpoints/
  auth.py                       # POST /api/v1/auth/runtime-token-exchange

auth.py (legacy local credential check) is unchanged;
HeaderAuthProvider re-uses _validate_api_key from it. Non-binding
routes still go through the legacy router-level gate; their migration
happens in follow-up PRs.

Operation vocabulary

class Operation(StrEnum):
    # Wired on endpoints in this PR.
    CONTROL_BINDINGS_READ = "control_bindings.read"
    CONTROL_BINDINGS_WRITE = "control_bindings.write"
    RUNTIME_TOKEN_EXCHANGE = "runtime.token_exchange"

    # Reserved; not yet wired on endpoints.
    CONTROLS_READ = "controls.read"
    CONTROLS_CREATE = "controls.create"
    CONTROLS_UPDATE = "controls.update"
    CONTROLS_DELETE = "controls.delete"
    RUNTIME_USE = "runtime.use"

Per-operation authorizer registry

set_authorizer(authorizer, operation=...) overrides the default for
one operation. Without operation=, it becomes the default for every
operation that does not have a specific binding. Used to route
management ops through one provider and Operation.RUNTIME_USE
through LocalJwtVerifyProvider:

set_authorizer(HttpUpstreamAuthProvider(...))                 # default
set_authorizer(LocalJwtVerifyProvider(secret=...),             # override
               operation=Operation.RUNTIME_USE)

require_operation(op) consults the override first, then falls back to
the default. The local-credential path (no override installed) routes
everything to HeaderAuthProvider; the no-auth flow
(api_key_enabled=False) is preserved end-to-end.
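
The lookup order described above can be sketched as follows. This is a minimal illustration, not the code in core.py: the AuthorizerRegistry class, its clear_override method, and the use of plain strings as stand-in authorizers are assumptions made for the example (the real registry is described only through set_authorizer, get_authorizer, and require_operation).

```python
from enum import Enum
from typing import Dict, Optional


class Operation(str, Enum):
    CONTROL_BINDINGS_READ = "control_bindings.read"
    RUNTIME_USE = "runtime.use"


class AuthorizerRegistry:
    """Hypothetical stand-in for the registry state held in core.py."""

    def __init__(self) -> None:
        self._default: Optional[object] = None
        self._overrides: Dict[Operation, object] = {}

    def set_authorizer(self, authorizer: object,
                       operation: Optional[Operation] = None) -> None:
        # Without operation=, the authorizer becomes the default.
        if operation is None:
            self._default = authorizer
        else:
            self._overrides[operation] = authorizer

    def clear_override(self, operation: Operation) -> None:
        # Hypothetical helper: dropping an override restores the default.
        self._overrides.pop(operation, None)

    def get_authorizer(self, operation: Operation) -> object:
        # Per-operation override first, then the default.
        authorizer = self._overrides.get(operation, self._default)
        if authorizer is None:
            raise RuntimeError("no authorizer configured")
        return authorizer
```

With a default and a RUNTIME_USE override installed, management operations resolve to the default while RUNTIME_USE resolves to the override; clearing the override sends RUNTIME_USE back to the default.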

require_operation accepts an optional context_builder so the
endpoint can surface request-shaped context (path / query / body
fields) to the authorizer. The body-bearing binding endpoints, the
target-filtered list endpoint, and the runtime token exchange
endpoint all forward (target_type, target_id) so an upstream that
resolves the target's owning project has the identifiers it needs to
make a project-level decision.
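
As an illustration of what a context builder might return for the target-filtered list endpoint, here is a small sketch. The function name, its signature (a plain mapping of query parameters), and the choice to return None when no target filter is present are assumptions; the actual builders receive whatever request object require_operation passes them.

```python
from typing import Dict, Mapping, Optional


def binding_list_context(query: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Forward target identifiers from query params only when present."""
    context = {
        key: query[key]
        for key in ("target_type", "target_id")
        if query.get(key)  # skip missing or empty values
    }
    # Assumption: an unfiltered list request forwards no context at all.
    return context or None
```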

Providers (three ship in-tree)

HeaderAuthProvider: local-credential path, single namespace.

  • Maps each Operation to one of three access levels (PUBLIC,
    AUTHENTICATED, ADMIN); single source of truth in
    DEFAULT_OPERATION_ACCESS.
  • Reuses the existing local API-key + session-cookie credential
    check from auth.py, so behavior matches the previous
    require_admin_key path verbatim.
  • Returns a normalized runtime.use scope only for
    Operation.RUNTIME_TOKEN_EXCHANGE, so the exchange endpoint can
    uniformly require runtime.use in principal.scopes across every
    provider; there is no implicit fallback that could escalate an
    upstream-supplied empty scope grant.
  • The no-auth flow (api_key_enabled=False) is preserved: every
    operation succeeds with a non-admin Principal. Pinned by a
    regression test.
  • Always returns DEFAULT_NAMESPACE_KEY. The namespace header lookup
    branch is preserved but inert until non-binding write endpoints are
    threaded.
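
The access-level behavior in the bullets above can be sketched roughly as follows. The level assigned to each operation, the simplified Principal, and the authorize function with its boolean inputs are all assumptions for illustration; the real mapping lives in DEFAULT_OPERATION_ACCESS and the real credential check is the legacy validator in auth.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class AccessLevel(Enum):
    PUBLIC = "public"
    AUTHENTICATED = "authenticated"
    ADMIN = "admin"


# Hypothetical level assignments, not the shipped table.
OPERATION_ACCESS = {
    "control_bindings.read": AccessLevel.AUTHENTICATED,
    "control_bindings.write": AccessLevel.ADMIN,
    "runtime.token_exchange": AccessLevel.AUTHENTICATED,
}


@dataclass(frozen=True)
class Principal:
    """Simplified stand-in for the framework's Principal."""
    is_admin: bool
    scopes: Tuple[str, ...] = ()


def authorize(operation: str, *, api_key_enabled: bool,
              key_valid: bool, key_is_admin: bool) -> Principal:
    level = OPERATION_ACCESS[operation]  # unknown operation raises KeyError
    # runtime.use is surfaced only for the token exchange operation.
    scopes = ("runtime.use",) if operation == "runtime.token_exchange" else ()
    # No-auth mode and PUBLIC operations succeed with a non-admin Principal.
    if not api_key_enabled or level is AccessLevel.PUBLIC:
        return Principal(is_admin=False, scopes=scopes)
    if not key_valid:
        raise PermissionError("missing or invalid credential")
    if level is AccessLevel.ADMIN and not key_is_admin:
        raise PermissionError("admin credential required")
    return Principal(is_admin=key_is_admin, scopes=scopes)
```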

HttpUpstreamAuthProvider: generic upstream-delegating provider.

  • Forwards caller credentials (X-API-Key, Authorization,
    Cookie) on a POST to a configurable URL with
    {operation, context?}.
  • Optional service-to-service token header for upstream trust.
  • Parses the upstream response into a Principal: namespace_key,
    is_admin, caller_id, plus optional grant fields (target_type,
    target_id, scopes, expires_at) so the runtime token exchange
    can mint from the same response.
  • Maps 200 to Principal; 401 / 403 / 404 to matching error;
    5xx, network errors, malformed payloads, naive (tzinfo-less)
    expires_at, and partial target grants (only one of target_type
    / target_id) all fail closed (502/503).
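
A rough sketch of the status mapping, using only the standard library. The function name and return shape are hypothetical, and treating is_admin as a required boolean is an assumption made to show one strict-typing check; the actual provider validates the whole grant with a strict Pydantic model.

```python
import json
from typing import Optional, Tuple

PASS_THROUGH = {401, 403, 404}


def classify_upstream(status: int, body: bytes) -> Tuple[int, Optional[dict]]:
    """Return (status to surface, parsed grant or None)."""
    if status in PASS_THROUGH:
        return status, None          # mapped to the matching error
    if status != 200:
        return 503, None             # 5xx and other unexpected statuses fail closed
    try:
        payload = json.loads(body)
    except ValueError:               # covers JSON and UTF-8 decode errors
        return 502, None
    # Strict typing: the string "false" must not be coerced to a bool.
    if not isinstance(payload, dict) or not isinstance(payload.get("is_admin"), bool):
        return 502, None
    return 200, payload
```

Network errors would take the same 503 path as the non-200 branch; they are omitted here because no request is actually made.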

LocalJwtVerifyProvider: hot-path runtime verifier.

  • Reads a Bearer token from Authorization, verifies signature
    against the runtime secret, checks domain == "runtime", the
    issuer, expiry, and that the token's scope covers the requested
    Operation.
  • Returns a Principal with the bound (namespace_key, target_type,
    target_id) so runtime endpoints inherit the namespace and target
    binding without re-deriving them.
  • When the dependency surfaces target_type / target_id via
    context_builder, the provider also enforces that they match the
    token's binding; runtime endpoints get the request-target check
    for free.
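
The checks that run after signature verification might look like the sketch below. AuthError, extract_bearer, and enforce_claims are hypothetical names; the 401 for a wrong domain and the exact-match scope test are assumptions, while the 401 for a missing or non-Bearer header and the 403 for wrong scope or target mismatch follow the test plan.

```python
from typing import Mapping, Optional


class AuthError(Exception):
    def __init__(self, status: int, detail: str) -> None:
        super().__init__(detail)
        self.status = status


def extract_bearer(authorization: Optional[str]) -> str:
    """Pull the token out of an Authorization header value."""
    if not authorization:
        raise AuthError(401, "missing token")
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        raise AuthError(401, "expected a Bearer token")
    return token.strip()


def enforce_claims(claims: Mapping[str, object], operation: str,
                   context: Optional[Mapping[str, str]] = None) -> None:
    """Checks applied to already-verified claims (signature, issuer, exp done)."""
    if claims.get("domain") != "runtime":
        raise AuthError(401, "not a runtime token")  # status is an assumption
    if operation not in (claims.get("scopes") or []):
        raise AuthError(403, "scope does not cover the operation")
    # Only enforced when the endpoint surfaced a target via context_builder.
    for field in ("target_type", "target_id"):
        if context and field in context and context[field] != claims.get(field):
            raise AuthError(403, "request target does not match token binding")
```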

Runtime token shape

HS256, dedicated secret (AGENT_CONTROL_RUNTIME_TOKEN_SECRET),
issuer agent-control/server. Claims:

| Claim | Purpose |
| --- | --- |
| domain | Pinned to runtime; tokens minted here MUST not be accepted on management endpoints. |
| namespace_key | The namespace the token authorizes within. Required for mint and verify; preserved end-to-end so a token minted for one namespace cannot be used to resolve controls in another. |
| actor_id | Caller identity surfaced from the upstream grant. |
| scopes | Granted runtime capabilities (e.g., ["runtime.use"]). The exchange endpoint refuses to mint when principal.scopes does not contain runtime.use, including the case where the upstream's grant explicitly lists an empty scope set. |
| target_type / target_id | Bind the token to one target. |
| iat / exp | Bounded lifetime. The local TTL is capped by the upstream grant's expires_at so the local token can never outlive its grant. |
| jti | Random identifier; reserved for future revocation. |

mint_runtime_token rejects an upstream_expires_at whose
tzinfo is None or whose utcoffset() is None with
RuntimeTokenError so a custom authorizer that supplies a naive
datetime surfaces as a typed auth error rather than a raw TypeError
deeper in the comparison.

Runtime token exchange endpoint

POST /api/v1/auth/runtime-token-exchange
{ "target_type": "...", "target_id": "..." }
  • Authenticated and authorized via Operation.RUNTIME_TOKEN_EXCHANGE
    through the default authorizer (typically
    HttpUpstreamAuthProvider in production). The endpoint's
    context_builder forwards the requested target to the authorizer so
    an upstream can authorize against the right resource.
  • Refuses with 503 when AGENT_CONTROL_RUNTIME_TOKEN_SECRET is not
    configured.
  • Mints a local token from Principal.scopes /
    Principal.grant_expires_at, capped by the configured TTL (default
    300s).
  • When the provider's Principal carries a target binding, the
    endpoint verifies it matches the requested target before minting.
  • An upstream grant whose expires_at is already in the past
    surfaces as 502 (UpstreamGrantExpiredError), distinct from the
    503 misconfigured-server path so the public status reflects which
    side the operator should investigate.

Response: { token, expires_at, target_type, target_id, scopes }.
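
From the caller's side, the exchange request could be assembled as in this sketch. The helper name, the base URL, and the example target values are placeholders; only the path, the JSON body fields, and the X-API-Key header come from the description above. The request is built but not sent.

```python
import json
import urllib.request


def build_exchange_request(base_url: str, api_key: str,
                           target_type: str, target_id: str) -> urllib.request.Request:
    """Build (but do not send) a runtime token exchange request."""
    body = json.dumps({"target_type": target_type, "target_id": target_id})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/auth/runtime-token-exchange",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )


# Placeholder values; pass the result to urllib.request.urlopen to send it.
request = build_exchange_request("http://localhost:8000", "example-key", "agent", "agent-123")
```

The returned token would then be presented as an Authorization: Bearer header on runtime calls once those endpoints are wired.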

Storage namespace under the framework

The migrated binding endpoints take the storage namespace_key from
get_namespace_key (the same resolver the rest of the server uses),
not from principal.namespace_key. The auth chain still runs through
require_operation for authentication and authorization, but the
row's namespace is sourced from the resolver so binding writes and
runtime reads stay in lockstep until auth-derived namespace
resolution lands across /controls, /policies, /agents, and
/evaluation together. The principal's namespace is observed (and
used by LocalJwtVerifyProvider for its own contract) but is not
used to pick the row's storage namespace at this stage.

Migrated endpoints

All seven /api/v1/control-bindings* endpoints now use
Depends(require_operation(...)):

| Method | Path | Operation | Context forwarded |
| --- | --- | --- | --- |
| PUT | /control-bindings | control_bindings.write | body: target_type, target_id |
| GET | /control-bindings | control_bindings.read | query: target_type, target_id (when present) |
| GET | /control-bindings/{binding_id} | control_bindings.read | N/A (namespace-wide) |
| PATCH | /control-bindings/{binding_id} | control_bindings.write | N/A (namespace-wide) |
| DELETE | /control-bindings/{binding_id} | control_bindings.write | N/A (namespace-wide) |
| PUT | /control-bindings/by-key | control_bindings.write | body: target_type, target_id |
| POST | /control-bindings/by-key:delete | control_bindings.write | body: target_type, target_id |

The three binding-id-based routes (GET, PATCH, DELETE on
/control-bindings/{binding_id}) are documented as namespace-wide:
their target identifiers are not available before the binding row is
loaded, and require_operation is single-pass. Clients whose
authorization model requires per-target permissions are steered to
the natural-key endpoints and the target-filtered list, all of which
forward the target to the authorizer. Two-phase auth on the by-id
routes is a follow-up.

New: POST /api/v1/auth/runtime-token-exchange (operation
runtime.token_exchange).

The framework-protected routers (/control-bindings, /auth) are
mounted with the existing non-validating get_api_key_from_header
Security extractor as a router-level dependency. require_operation
still owns runtime authentication and authorization; the Security
dependency exists purely so the generated OpenAPI spec advertises
X-API-Key on these routes for downstream SDK generation.

Generated client

The TypeScript wrapper exposes both auth and controlBindings
getters alongside the existing surface, so consumers using the
public client can call runtimeTokenExchange and the binding API
without reaching into the generated internals.

Env vars

| Var | Default | Purpose |
| --- | --- | --- |
| AGENT_CONTROL_AUTH_MODE | header | Default authorizer: header or http_upstream. |
| AGENT_CONTROL_AUTH_UPSTREAM_URL | none | Required when mode is http_upstream. |
| AGENT_CONTROL_AUTH_UPSTREAM_TIMEOUT_SECONDS | 5.0 | Per-request timeout. |
| AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN | none | Optional upstream service token. |
| AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN_HEADER | X-Agent-Control-Service-Token | Header name for the service token. |
| AGENT_CONTROL_RUNTIME_TOKEN_SECRET | none | Required to enable runtime auth + the exchange endpoint. Validated at startup; rejected if shorter than 32 bytes. |
| AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS | 300 | Local token TTL ceiling (capped further by the upstream grant). Validated at startup. |

configure_auth_from_env parses both runtime fields once at startup
into a frozen RuntimeAuthConfig. The exchange endpoint and
LocalJwtVerifyProvider read the same object, so the mint and verify
sides cannot drift apart within a process. When the runtime secret is
absent, RUNTIME_USE falls through to the default authorizer; this
is logged at WARNING so an operator can immediately see what trust
model is in effect. RUNTIME_USE is reserved and not wired to
/evaluation in this PR, so this fallback does not affect the
runtime hot path yet. The follow-up that wires runtime endpoints
should explicitly choose legacy fallback or fail-closed JWT-only
behavior.
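
The startup validation of the two runtime variables might be sketched like this. RuntimeAuthConfigSketch, its fields, and load_runtime_auth_config are hypothetical names, and treating a non-integer or non-positive TTL as invalid is an assumption about what "validated at startup" covers; the 32-byte minimum and the 300s default are taken from the table above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

MIN_SECRET_BYTES = 32
DEFAULT_TTL_SECONDS = 300


@dataclass(frozen=True)
class RuntimeAuthConfigSketch:
    """Hypothetical shape; the real class is RuntimeAuthConfig in config.py."""
    secret: Optional[bytes]
    ttl_seconds: int


def load_runtime_auth_config(env: Mapping[str, str]) -> RuntimeAuthConfigSketch:
    raw_secret = env.get("AGENT_CONTROL_RUNTIME_TOKEN_SECRET")
    secret = raw_secret.encode("utf-8") if raw_secret else None
    if secret is not None and len(secret) < MIN_SECRET_BYTES:
        raise ValueError("runtime token secret must be at least 32 bytes")
    raw_ttl = env.get("AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS", str(DEFAULT_TTL_SECONDS))
    ttl_seconds = int(raw_ttl)  # non-integer input raises ValueError
    if ttl_seconds <= 0:
        raise ValueError("runtime token TTL must be a positive integer")
    # Frozen, so mint and verify read the same values for the process lifetime.
    return RuntimeAuthConfigSketch(secret=secret, ttl_seconds=ttl_seconds)
```

In the server this would be called once with os.environ; passing a mapping keeps the sketch easy to exercise in isolation.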

Out of scope (follow-ups)

  • Migrate /controls CRUD onto require_operation using the
    reserved CONTROLS_* operations.
  • Wire Operation.RUNTIME_USE on the runtime resolution path
    (/evaluation, etc.) and the SDK side of the runtime exchange.
    The provider override is already in place when the runtime secret
    is configured.
  • Migrate /agents/initAgent onto require_operation. The endpoint's
    context_builder should forward the request's target_type /
    target_id to the authorizer so an upstream can authorize against
    the requested resource.
  • Auth-derived get_namespace_key so the binding endpoints can use
    the principal's namespace for storage along with the rest of the
    server.
  • Two-phase auth for the three binding-id-based routes
    (GET/PATCH/DELETE /control-bindings/{binding_id}) so they can
    forward target context to the upstream.
  • Drop auth.py's require_admin_key once every management
    endpoint is migrated.

Stacking

Stacked on PR #203 (abhi/data-model-v1); rebased onto its
current head 8adc328 so the merged effective-controls contract,
namespace-threaded agent endpoints, and savepoint-protected binding
writes are the base this PR builds on. Will rebase onto main once
#203 merges.

Test plan

  • 55 framework + endpoint tests covering:
    • Default coverage: every Operation member has a default access
      mapping (regression guard).
    • HeaderAuthProvider: PUBLIC bypass, AUTHENTICATED + ADMIN paths
      route to the legacy validator with the right require_admin
      flag, no-auth mode passes admin operations, namespace-header
      lookup currently inert, unknown operation raises, normalized
      runtime.use scope returned for RUNTIME_TOKEN_EXCHANGE.
    • HttpUpstreamAuthProvider: 200 happy path with realistic JSON
      wire shapes (ISO datetime + JSON array scopes round-trip),
      service token forwarding, 401/403/404 mapping, 5xx fail-closed,
      network-error fail-closed, strict-grant rejection on wrong-typed
      is_admin / malformed scopes / bad expires_at / non-string
      target fields, partial target grant rejected, naive expires_at
      rejected.
    • require_operation factory: routes through the installed
      authorizer, per-operation overrides take precedence, clearing an
      override falls back to the default, get_authorizer raises
      when nothing is set.
    • Lifecycle: reconfiguring without the runtime secret drops the
      previous LocalJwtVerifyProvider override; teardown clears
      every authorizer; secret shorter than 32 bytes raises at
      startup; invalid TTL raises at startup.
    • Runtime token mint / verify: round-trip, wrong-secret rejection,
      expiry rejection, TTL capped by upstream grant, management-domain
      token refused on runtime verify, missing-namespace rejection,
      already-expired upstream grant raises UpstreamGrantExpiredError,
      naive upstream_expires_at raises RuntimeTokenError.
    • LocalJwtVerifyProvider: target-bound Principal, namespace
      carried from token, missing token returns 401, wrong scope
      returns 403, non-Bearer header returns 401, target-context match
      enforcement (mismatch on type or id returns 403).
    • Exchange endpoint: 503 without secret, mint when configured,
      target mismatch rejected (400), missing target rejected (422),
      grant-without-runtime-use rejected (no privilege escalation),
      explicit empty-scope grant rejected (no fallback escalation),
      target context forwarded to authorizer, non-default namespace
      propagates into the token, full exchange-then-verify round trip,
      already-expired upstream grant surfaces as 502 distinct from the
      503 misconfigured-server path.
  • Full server suite: 676 passed.
  • make lint clean.
  • make typecheck clean.
  • make sdk-ts-generate-check clean.
  • TypeScript SDK regenerated alongside the new endpoint
    (auth-runtime-token-exchange, request/response models, Auth
    and ControlBindings groups exposed via the public client).


codecov Bot commented Apr 28, 2026

@abhinav-galileo abhinav-galileo changed the title from "feat(server): pluggable request-auth framework + migrate control bindings" to "feat(server): pluggable request-auth framework (management + runtime)" Apr 28, 2026
@abhinav-galileo abhinav-galileo marked this pull request as ready for review April 28, 2026 21:46
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from b87b27f to 8ecb871 April 29, 2026 18:56
Comment thread server/src/agent_control_server/endpoints/auth.py Outdated
Comment thread server/src/agent_control_server/main.py
Comment thread sdks/typescript/src/generated/sdk/sdk.ts
abhinav-galileo added a commit that referenced this pull request Apr 29, 2026
…g endpoints

The seven /control-bindings endpoints were migrated onto require_operation
in #204, but none supplied a context_builder. Upstream authorizers that
resolve the target's owning project (e.g., Galileo's
check_management_access) need (target_type, target_id) to make a
project-level decision; without them the upstream returns 400 and the
provider fails closed with 503.

Two builders, four endpoints wired:

- _binding_body_context — reads target_type/target_id from the request
  body. Wired on PUT "", PUT "/by-key", POST "/by-key:delete".
- _binding_list_context — reads target_type/target_id from query params
  when the GET list endpoint is target-scoped. Wired on GET "".

The header provider's behavior is unchanged because it ignores context.
Validated end-to-end against the live api PR #6350 + authz PR #145
stack: GET with target filter, PUT with owned target, foreign-target
404, no-auth 401 all behave correctly.

Out of scope (separate follow-up): the binding_id-based endpoints
(GET/PATCH/DELETE /{binding_id}) need a 2-phase auth — look up the
binding by namespace+id to discover its target, then auth-check with
target context. That's a deeper change to the require_operation contract
and is tracked separately.
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from 70c8229 to e5f9654 April 29, 2026 22:42
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from e5f9654 to 84db093 April 29, 2026 23:14
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from 84db093 to 7698c07 April 29, 2026 23:31
Comment thread server/src/agent_control_server/auth_framework/config.py
Base automatically changed from abhi/data-model-v1 to main April 30, 2026 17:04

Endpoints declare a generic Operation; an installed RequestAuthorizer
decides whether the request is allowed and returns the resolved
Principal (namespace + admin flag + caller id). Two providers ship
in-tree:

- HeaderAuthProvider: OSS / single-namespace default. Maps each
  Operation to one of three access levels (PUBLIC / AUTHENTICATED /
  ADMIN) and reuses the legacy local credential check; behavior matches
  the previous require_admin_key path verbatim. V1 ignores the
  X-Namespace-Key header and always returns the default namespace
  because non-binding write endpoints still hardcode it; the branch is
  preserved for a follow-up that lifts the lock.
- HttpUpstreamAuthProvider: forwards caller credentials to a
  configurable upstream URL. Maps 401/403/404 directly; fail-closed
  (503) on 5xx and network errors; rejects malformed principals (502).

Control-binding endpoints now declare CONTROL_BINDINGS_READ /
CONTROL_BINDINGS_WRITE via require_operation(...) and read the
resolved namespace from the returned Principal. The router is mounted
without the legacy router-level gate so the framework owns
authentication and authorization end-to-end.

Reserved Operation members for controls.* and runtime.use are defined
but not yet wired; their migrations land in follow-up PRs.

Rename so the framework's vocabulary is factual:

- OssAccessLevel -> AccessLevel
- OSS_OPERATION_ACCESS -> DEFAULT_OPERATION_ACCESS
- Comments / docstrings: replace "OSS / single-namespace" framing with
  factual descriptions of the local-credential path.

Drop the unjustified MANAGEMENT_ prefix on environment variables;
this PR only configures one auth flow:

- AGENT_CONTROL_MANAGEMENT_AUTH_MODE -> AGENT_CONTROL_AUTH_MODE
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_URL -> AGENT_CONTROL_AUTH_UPSTREAM_URL
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_TIMEOUT_SECONDS -> AGENT_CONTROL_AUTH_UPSTREAM_TIMEOUT_SECONDS
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_SERVICE_TOKEN -> AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN
- AGENT_CONTROL_MANAGEMENT_AUTH_UPSTREAM_SERVICE_TOKEN_HEADER -> AGENT_CONTROL_AUTH_UPSTREAM_SERVICE_TOKEN_HEADER

Add a regression test for the no-auth flow: when api_key_enabled is
False, even admin operations succeed with a non-admin Principal,
matching the pre-framework local-auth behavior.

Completes the framework's auth coverage. Management and runtime are
genuinely different protocols, and they now route through different
authorizers via the per-operation registry:

- Per-operation override on the registry. set_authorizer(authorizer,
  operation=...) overrides the default for one operation; calls
  without operation= become the default for everything else. Used to
  point Operation.RUNTIME_USE at LocalJwtVerifyProvider while leaving
  the default authorizer (header or http_upstream) for management.

- Runtime token mint/verify. HS256 JWT, dedicated secret
  (AGENT_CONTROL_RUNTIME_TOKEN_SECRET), short TTL capped by the
  upstream grant's expiry. domain="runtime" claim pins the token to
  the runtime path. Issuer is agent-control/server.

- LocalJwtVerifyProvider verifies the Bearer token, checks the scope
  covers the requested Operation, and returns a Principal with the
  bound (target_type, target_id) so endpoints can match the request
  target.

- POST /api/v1/auth/runtime-token-exchange. Authenticates via the
  default authorizer (typically HttpUpstreamAuthProvider in
  production, which forwards the credential to the configured
  upstream) and mints a local runtime token from the resulting
  Principal. Refuses with 503 when the runtime secret is not
  configured.

- Principal grew target_type, target_id, scopes, grant_expires_at
  fields so providers can surface the upstream grant's binding and
  the exchange endpoint can mint a token from it. HttpUpstreamAuthProvider
  parses the matching optional fields from the upstream JSON response.

- Configuration: AGENT_CONTROL_AUTH_* configures the default authorizer;
  AGENT_CONTROL_RUNTIME_TOKEN_SECRET (+ optional
  AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS) enables the runtime override.
  Without the secret, runtime endpoints fall through to the default
  authorizer.

Tests: 18 new unit + integration tests covering the registry overrides,
token round-trip / wrong-secret / expired / wrong-domain rejection,
JWT-verify provider behavior (target binding, missing token, wrong
scope, non-Bearer header), and the exchange endpoint (503 without
secret, mint when configured, target mismatch, missing target,
context forwarded to authorizer, full exchange-then-verify round trip).

The TypeScript SDK regenerates with the new endpoint surface
(runtime-token-exchange) — committed alongside.
…es/grant

Five hardening changes prompted by review:

- Runtime tokens carry namespace_key. mint_runtime_token now requires
  it; the JWT payload includes it; verify_runtime_token rejects tokens
  without it; LocalJwtVerifyProvider returns the token's namespace on
  the resulting Principal instead of always defaulting. Otherwise a
  token minted for org A would resolve runtime controls in the default
  namespace once /evaluation is wired to RUNTIME_USE.

- Exchange endpoint refuses to add runtime.use to a grant that omits
  it. If the upstream returned an explicit scope set without
  runtime.use, the credential is not authorized for runtime use on
  this target — minting one anyway would be privilege escalation.
  Defaulting to runtime.use is preserved only when the provider
  returned no scoped grant (e.g., local header path).

- HttpUpstreamAuthProvider parses the upstream response with a strict
  Pydantic model (strict=True). Wrong-typed is_admin, malformed
  scopes, bad expires_at, and non-string target fields fail closed
  with 502 instead of being silently coerced or dropped. Unknown
  fields are still tolerated so the upstream can evolve.

- LocalJwtVerifyProvider enforces target context match when the
  dependency surfaces it. Future runtime endpoints can declare a
  context_builder that extracts target_type/target_id from the
  request; the provider verifies the token's binding matches and
  rejects with 403 otherwise.

- Auth provider lifecycle. configure_auth_from_env tracks installed
  providers; teardown_auth (called from FastAPI lifespan shutdown)
  closes any aclose-able providers — releases the
  HttpUpstreamAuthProvider's owned httpx.AsyncClient.

Tests: nine new cases covering token-namespace round-trip, target
context mismatch on type and id, strict grant rejection across each
malformed field, the privilege-escalation guard, and a full
non-default-namespace round trip through the exchange endpoint.
… on reconfigure

Two follow-up fixes from review:

- HttpUpstreamAuthProvider validates against the raw response bytes via
  _UpstreamGrant.model_validate_json instead of round-tripping through
  response.json() and model_validate. Pydantic's JSON parser accepts
  ISO datetimes and JSON arrays (the actual wire shapes any HTTP
  service produces) while strict=True still rejects type-coercion
  bugs like "false" -> True or non-string entries in scopes. Adds a
  regression test that pins the JSON wire shape: ISO expires_at +
  array scopes now round-trip correctly.

- configure_auth_from_env clears any prior default and operation
  overrides before installing fresh ones; teardown_auth clears them
  too. Without this, removing the runtime token secret between two
  configure calls left the previous LocalJwtVerifyProvider override
  installed on Operation.RUNTIME_USE — silent inconsistency where the
  config path said runtime should fall through but the registry
  disagreed. Adds a regression test that exercises the full
  configure-then-reconfigure path.

A target binding is only meaningful as a (target_type, target_id)
pair. The previous schema allowed each field independently, so a
malformed grant carrying only target_type would pass type validation
and the exchange endpoint's per-field equality check would fall
through (the upstream's None never trips the != against the request
body), letting the endpoint mint a token bound to whatever target_id
the request asked for.

Add a model validator on _UpstreamGrant that fails closed when exactly
one of the two fields is set; both supplied or both omitted is the
only acceptable shape. Pydantic's ValidationError surfaces as 502 like
every other malformed-grant case.
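
The both-or-neither rule can be sketched without Pydantic as below. The function name and the ValueError are assumptions; in the PR the check is a model validator on _UpstreamGrant whose ValidationError surfaces as 502.

```python
from typing import Mapping, Optional, Tuple


def validate_target_pair(grant: Mapping[str, object]) -> Optional[Tuple[str, str]]:
    """Accept a full (target_type, target_id) pair or no target at all."""
    target_type = grant.get("target_type")
    target_id = grant.get("target_id")
    # Exactly one field set is a malformed grant: fail closed.
    if (target_type is None) != (target_id is None):
        raise ValueError("target_type and target_id must be supplied together")
    if target_type is None:
        return None
    if not isinstance(target_type, str) or not isinstance(target_id, str):
        raise ValueError("target fields must be strings")
    return target_type, target_id
```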

Tests cover both half-supplied shapes (target_type only and target_id
only). Also drop two stale comments referring to upstream-specific
implementation choices that bled in earlier — the framework is
generic.
Two distinct timing-related fail-closed gaps:

1. Pydantic with strict=True still accepts a naive ISO datetime for the
   upstream's expires_at because strict mode enforces types, not
   timezone awareness.
   Comparing the resulting naive datetime against datetime.now(UTC) at
   mint time raises TypeError and surfaces as a 500. Add a field
   validator on _UpstreamGrant.expires_at that rejects naive datetimes,
   so a malformed grant fails closed with a 502 alongside the rest of
   the strict-grant rejections.

2. mint_runtime_token would happily mint when upstream_expires_at <=
   issued_at, returning a 200 with an exp claim already in the past.
   Introduce UpstreamGrantExpiredError(RuntimeTokenError) and raise it
   in that case. The exchange endpoint maps this distinct error class
   to a 502 (upstream returned bad data) rather than the existing 503
   (server misconfigured), so the public status reflects which side
   the operator should investigate.
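
A sketch of the expiry handling in item 2, assuming the helper caps the token lifetime at the upstream expiry when one is present. `resolve_expiry` and `status_for` are hypothetical names; the two error classes are named as in this PR but redefined here so the block stands alone.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class RuntimeTokenError(Exception):
    """Stand-in base error; the exchange endpoint maps it to 503."""


class UpstreamGrantExpiredError(RuntimeTokenError):
    """The upstream grant was already expired at mint time; mapped to 502."""


def resolve_expiry(issued_at: datetime, ttl_seconds: int,
                   upstream_expires_at: Optional[datetime] = None) -> datetime:
    """Return the exp claim, refusing grants that are already at or past issued_at."""
    exp = issued_at + timedelta(seconds=ttl_seconds)
    if upstream_expires_at is not None:
        if upstream_expires_at <= issued_at:
            raise UpstreamGrantExpiredError("upstream grant expired before mint")
        exp = min(exp, upstream_expires_at)
    return exp


def status_for(exc: RuntimeTokenError) -> int:
    # 502: the upstream returned bad data. 503: the server is misconfigured.
    return 502 if isinstance(exc, UpstreamGrantExpiredError) else 503


now = datetime(2026, 4, 30, 12, 0, tzinfo=timezone.utc)
capped = resolve_expiry(now, 300, now + timedelta(seconds=60))
```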

Tests:

- _UpstreamGrant rejects naive expires_at -> 502 (parser fail-closed).
- mint_runtime_token raises UpstreamGrantExpiredError when the grant is
  already past or exactly at issued_at.
- Exchange endpoint surfaces the expired grant as 502 (vs 503 for the
  misconfigured-server path).
…g endpoints

The seven /control-bindings endpoints were migrated onto require_operation
in #204, but none supplied a context_builder. Upstream authorizers that
resolve the target's owning project (e.g., Galileo's
check_management_access) need (target_type, target_id) to make a
project-level decision; without them the upstream returns 400 and the
provider fails closed with 503.

Two builders, four endpoints wired:

- _binding_body_context — reads target_type/target_id from the request
  body. Wired on PUT "", PUT "/by-key", POST "/by-key:delete".
- _binding_list_context — reads target_type/target_id from query params
  when the GET list endpoint is target-scoped. Wired on GET "".
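
The two builders can be sketched as plain functions over mappings. This is an illustration only: the real builders take the framework's request objects, and returning `None` for an unscoped list call is an assumption about how "target-scoped" is decided.

```python
from typing import Mapping, Optional


def binding_body_context(body: Mapping[str, str]) -> dict:
    """Context for write endpoints: the target pair comes from the request body."""
    return {"target_type": body["target_type"], "target_id": body["target_id"]}


def binding_list_context(query: Mapping[str, str]) -> Optional[dict]:
    """Context for the list endpoint: only present when the query is target-scoped."""
    target_type = query.get("target_type")
    target_id = query.get("target_id")
    if target_type is None or target_id is None:
        return None  # assumed: an unscoped list forwards no target context
    return {"target_type": target_type, "target_id": target_id}


scoped = binding_list_context({"target_type": "agent", "target_id": "a1"})
unscoped = binding_list_context({})
```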

The header provider's behavior is unchanged because it ignores context.
Validated end-to-end against the live api PR #6350 + authz PR #145
stack: GET with target filter, PUT with owned target, foreign-target
404, no-auth 401 all behave correctly.

Out of scope (separate follow-up): the binding_id-based endpoints
(GET/PATCH/DELETE /{binding_id}) need two-phase auth: look up the
binding by namespace+id to discover its target, then auth-check with
target context. That is a deeper change to the require_operation contract
and is tracked separately.
… startup, advertise APIKeyHeader

Five review issues against the auth framework:

1. Empty upstream scopes: the exchange endpoint previously fell back to
   minting a runtime.use token whenever principal.scopes was falsy,
   which is the same shape an upstream produces by returning an explicit
   ``"scopes": []``. The fallback is removed; the endpoint now requires
   runtime.use to be present in principal.scopes for every provider.
   HeaderAuthProvider explicitly grants runtime.use only when authorizing
   Operation.RUNTIME_TOKEN_EXCHANGE, so the local path keeps its V1
   behavior while upstream privilege escalation is closed off.

2. Runtime config consolidation: AGENT_CONTROL_RUNTIME_TOKEN_SECRET and
   the TTL are now parsed once at startup into a frozen RuntimeAuthConfig
   that the mint side and the LocalJwtVerifyProvider verify side both
   read. configure_auth_from_env raises at startup on misconfiguration
   instead of producing a runtime 500 from an invalid TTL or a too-short
   secret.

3. Runtime token secret strength: HS256 needs >= 32 bytes of secret
   material; values shorter than that are rejected at startup.

4. RUNTIME_USE fallback warning: when no runtime secret is configured
   the LocalJwtVerifyProvider override is not installed (V1 behavior
   unchanged), but the startup log now warns that RUNTIME_USE will fall
   through to the default authorizer, giving operators a clear signal
   to either configure the secret or accept the long-lived-credential
   trust model.

5. OpenAPI security entries: the framework-protected routers
   (/control-bindings, /auth) are now mounted with the existing
   non-validating get_api_key_from_header Security extractor as a
   router-level dependency. require_operation still owns runtime
   authentication and authorization; the Security dependency exists
   purely so the generated OpenAPI spec advertises X-API-Key on these
   routes for downstream SDK generation. Confirmed: server/.generated/
   openapi.json now lists ``security: [{APIKeyHeader: []}]`` on every
   framework-protected operation.
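
Items 2 and 3 amount to parsing once and failing at startup. A minimal sketch, assuming a TTL variable named `AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS` with a 300-second default (both hypothetical); the field names on `RuntimeAuthConfig` and the `load_runtime_auth_config` helper are also illustrative, not the real ones.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

MIN_SECRET_BYTES = 32  # HS256 needs at least 32 bytes of secret material


@dataclass(frozen=True)
class RuntimeAuthConfig:
    """Frozen so the mint side and the verify side read the same values."""
    secret: bytes
    ttl_seconds: int


def load_runtime_auth_config(env: Mapping[str, str]) -> Optional[RuntimeAuthConfig]:
    """Parse the runtime auth settings once; raise instead of failing per request."""
    raw_secret = env.get("AGENT_CONTROL_RUNTIME_TOKEN_SECRET")
    if not raw_secret:
        return None  # runtime tokens disabled
    secret = raw_secret.encode("utf-8")
    if len(secret) < MIN_SECRET_BYTES:
        raise ValueError("runtime token secret must be at least 32 bytes")
    # Hypothetical variable name and default, for illustration only.
    ttl_seconds = int(env.get("AGENT_CONTROL_RUNTIME_TOKEN_TTL_SECONDS", "300"))
    if ttl_seconds <= 0:
        raise ValueError("runtime token TTL must be positive")
    return RuntimeAuthConfig(secret=secret, ttl_seconds=ttl_seconds)


config = load_runtime_auth_config({"AGENT_CONTROL_RUNTIME_TOKEN_SECRET": "k" * 32})
```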

The TypeScript wrapper AgentControlClient is also extended with an
``auth`` getter so the runtimeTokenExchange method generated under the
Auth group is reachable through the public client.

A new fixture (``runtime_config_enabled``) replaces the previous
os.environ patching in test_runtime_token_exchange_endpoint.py so tests
exercise the same config singleton production uses; one new test pins
the empty-scope rejection.
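
The empty-scope rejection pinned by that test can be sketched as follows. `Principal` here is a pared-down stand-in with assumed field names, and the two predicate functions are hypothetical; they only contrast the removed fallback with the current rule.

```python
from dataclasses import dataclass
from typing import Tuple

RUNTIME_USE_SCOPE = "runtime.use"


@dataclass(frozen=True)
class Principal:
    """Hypothetical minimal Principal; the real one carries more fields."""
    subject: str
    scopes: Tuple[str, ...] = ()


def may_mint_with_fallback(principal: Principal) -> bool:
    # Removed behavior: empty scopes were treated as "grant runtime.use".
    return not principal.scopes or RUNTIME_USE_SCOPE in principal.scopes


def may_mint(principal: Principal) -> bool:
    # Current behavior: the scope must be explicitly present.
    return RUNTIME_USE_SCOPE in principal.scopes


empty = Principal(subject="upstream-user", scopes=())
granted = Principal(subject="upstream-user", scopes=("runtime.use",))
```

An upstream that answers with `"scopes": []` produces the `empty` principal, which the fallback accepted and the current rule refuses.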
…ding routes as namespace-wide

Two review issues:

1. ``mint_runtime_token`` now rejects a naive ``upstream_expires_at``
   with ``RuntimeTokenError`` instead of letting the comparison against
   ``datetime.now(UTC)`` raise a raw ``TypeError`` (which surfaces as a
   500). The HTTP-upstream parser already rejects timezone-less
   ``expires_at`` on the wire, but custom authorizers and tests can
   still call the helper directly; the lower-level API is now
   self-contained.

2. The three binding-id-based routes (GET/PATCH/DELETE
   ``/control-bindings/{binding_id}``) are documented as namespace-wide
   in the OpenAPI summary and docstrings. Per-target authorization is
   not possible on these routes today because ``require_operation`` is
   single-pass and the target identifiers are only discoverable after
   the binding row is loaded. Clients whose authorization model needs
   per-target permissions are explicitly steered to the natural-key
   endpoints (``PUT /by-key``, ``POST /by-key:delete``) and the
   target-filtered list, all of which forward
   ``(target_type, target_id)`` to the authorizer. Two-phase auth for
   the by-id routes is tracked as a separate follow-up.

Also: TypeScript SDK regenerated to pick up the new endpoint summaries.
…ten tzinfo guard

Two review issues:

1. Binding endpoints previously used ``principal.namespace_key`` for
   the row's storage namespace. With HeaderAuthProvider this was always
   the default namespace, so the V1 contract held; with
   HttpUpstreamAuthProvider returning an org-scoped namespace, binding
   writes would land in that namespace while initAgent / GET
   /agents/{name}/controls / /evaluation still resolved through
   ``get_namespace_key`` (V1 default), making target-bound controls
   invisible to runtime resolution. The seven binding endpoints now
   read storage namespace from ``get_namespace_key`` so writes and
   reads stay in lockstep until auth-derived namespace resolution
   lands across every endpoint. The auth chain still runs via
   ``require_operation`` for authentication and authorization; the
   resolved Principal is no longer used to pick the storage namespace.

2. The ``mint_runtime_token`` tzinfo guard now also checks
   ``utcoffset() is None`` so a custom ``tzinfo`` subclass that returns
   None from ``utcoffset()`` is rejected at the helper boundary
   instead of raising a raw ``TypeError`` from the comparison below.
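
A sketch of the tightened guard using only the standard library. `require_aware` and `NoOffset` are hypothetical names; `RuntimeTokenError` is redefined here so the block stands alone.

```python
from datetime import datetime, timezone, tzinfo


class RuntimeTokenError(Exception):
    """Stand-in for the framework's error class."""


def require_aware(value: datetime) -> datetime:
    """Reject datetimes that Python treats as naive, including odd tzinfo subclasses."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise RuntimeTokenError("upstream_expires_at must be timezone-aware")
    return value


class NoOffset(tzinfo):
    """A tzinfo that is set but reports no offset, so the datetime is still naive."""

    def utcoffset(self, dt):
        return None

    def dst(self, dt):
        return None

    def tzname(self, dt):
        return "none"


odd = datetime(2026, 5, 1, tzinfo=NoOffset())
aware = datetime(2026, 5, 1, tzinfo=timezone.utc)

# Without the guard, the ordering comparison itself raises TypeError.
try:
    odd <= aware
    comparison_failed = False
except TypeError:
    comparison_failed = True
```

Checking `tzinfo is None` alone would let `odd` through, because its `tzinfo` attribute is set even though `utcoffset()` returns `None`.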

TypeScript SDK regenerated to pick up the binding-endpoint docstring
updates.
…inctly

- _load_runtime_ttl_seconds enforces a 1-day maximum on the configured
  TTL so a misconfigured value cannot mint long-lived tokens. The
  upstream-grant ceiling in mint_runtime_token only fires when the
  upstream surfaces an expiry; this cap closes the configuration gap.
- HttpUpstreamAuthProvider distinguishes 429 from the catch-all 503
  branch with a rate-limit-specific detail and a Retry-After hint, and
  names the unexpected status in the catch-all detail so operators can
  tell the two failure modes apart in logs.
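
The TTL cap can be sketched as a small parser. The signature below is an assumption (the real `_load_runtime_ttl_seconds` may read the environment itself), and treating exactly one day as valid is a guess at the boundary.

```python
MAX_TTL_SECONDS = 24 * 60 * 60  # the 1-day maximum


def load_runtime_ttl_seconds(raw: str) -> int:
    """Parse a configured TTL, rejecting non-integers and out-of-range values."""
    try:
        ttl = int(raw)
    except ValueError:
        raise ValueError(f"runtime token TTL must be an integer, got {raw!r}") from None
    if not 0 < ttl <= MAX_TTL_SECONDS:
        raise ValueError(f"runtime token TTL must be in 1..{MAX_TTL_SECONDS} seconds")
    return ttl


five_minutes = load_runtime_ttl_seconds("300")
```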
@abhinav-galileo abhinav-galileo force-pushed the abhi/management-auth-framework branch from dee7742 to ec77366 Compare April 30, 2026 17:15
@abhinav-galileo abhinav-galileo merged commit fae0ad3 into main Apr 30, 2026
6 checks passed
@abhinav-galileo abhinav-galileo deleted the abhi/management-auth-framework branch April 30, 2026 17:24
galileo-automation pushed a commit that referenced this pull request May 2, 2026
## [2.5.0](ts-sdk-v2.4.0...ts-sdk-v2.5.0) (2026-05-02)

### Features

* **sdk-ts:** expose debug logger option ([66aba97](66aba97))
* **sdk:** add config driven sink selection ([#176](#176)) ([64c169f](64c169f))
* **server:** namespace scoping and control bindings ([#203](#203)) ([15ed4fd](15ed4fd))
* **server:** pluggable request-auth framework (management + runtime) ([#204](#204)) ([fae0ad3](fae0ad3)), closes [#203](#203)

### Bug Fixes

* **server:** add httpx to runtime dependencies ([#205](#205)) ([b4dff6f](b4dff6f))
@galileo-automation
Collaborator

🎉 This PR is included in version 2.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
