fix: unblock SP calls for UC in production#310
Conversation
Phase 1 of 2 — remove pre-policy 401 on missing x-forwarded-user; let the
volume policy decide via { id: <sp-id>, isServicePrincipal: true }.
asUser(req) keeps strict throw semantics; single logger.debug on fallback
replaces the dev-mode logger.warn. Startup no-explicit-policy warning
broadened to mention header-less HTTP. Tests rewritten to codify new
contract. Two pre-existing auto-generated files reformatted to match biome.
Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Phase 2 of 2 — non-behavioral polish. JSDoc on FilePolicyUser.id / isServicePrincipal broadened to describe header-less HTTP as a valid SP call origin. VolumeHandle JSDoc notes asUser(req) throws AuthenticationError.missingToken regardless of NODE_ENV. Files-plugin docs paragraphs that implied x-forwarded-user was mandatory have been reworded. Auto-regenerated typedoc for FilePolicyUser included. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Adds optional `auth: "service-principal" | "on-behalf-of-user"` to both VolumeConfig and IFilesConfig. Resolution order: volume.auth ?? plugin.auth ?? "service-principal". Removes the undocumented `bypassPolicy` parameter from createVolumeAPI (zero callsites in packages/ or apps/, so no consumers to migrate). Phase 1 of files-per-volume-auth-mode — config surface only, no routing or SDK identity change yet. _resolveAuth is unused at this point and will be wired in Phase 2. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Adds `_extractObiUser` to the files plugin and wires identity selection into `_enforcePolicy` so it branches on `_resolveAuth(volumeKey)`: - service-principal volumes: existing inline extraction (unchanged) - on-behalf-of-user volumes: require x-forwarded-access-token + user headers; 401 in production when missing, dev-fallback to SP with a single warn in development Only the policy-user identity changes here. UC SDK calls still execute as the service principal until Phase 3 wires `runInUserContext`. Phase 2 of files-per-volume-auth-mode. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Wires the seven read handlers (list, read, download, raw, exists, metadata, preview) through a new `_runWithAuth(req, volumeKey, fn)` helper that: - runs `fn` directly on SP volumes — byte-for-byte identical to today - wraps `fn` in `runInUserContext` on OBO volumes when both x-forwarded headers are present, so the SDK call and `getCurrentUserId()` resolve to the end user's identity. Cache keys use `getCurrentUserId()`, so per-user cache isolation falls out for free. Policy still gates first; `_enforcePolicy` already 401s in production when OBO headers are missing, so the dev-fallback path inside `_runWithAuth` is only reachable under `NODE_ENV === "development"`. Also rewrites the cache invalidation comment and the `VolumeAPI` / `VolumeHandle` / `exports()` JSDoc to describe both modes (previously all three claimed "operations execute as the service principal"). Phase 3 of files-per-volume-auth-mode — first end-to-end demoable slice. Write routes (Phase 4) and `VolumeHandle.asUser` SDK identity (Phase 5) are not yet wired. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Wires `_handleUpload`, `_handleMkdir`, and `_handleDelete` through `_runWithAuth` so write traffic on OBO volumes executes as the end user. SP volumes are byte-for-byte identical (no-op wrap fall-through). The upload handler does a hand-rolled `fetch PUT` and calls `client.config.authenticate(headers)` to populate auth headers — the whole body now runs inside the user-context wrap, so the user-token client is what authenticates the outgoing request. Adds the non-negotiable upload-headers contract test: triggers an OBO upload and asserts the outgoing fetch carries `Authorization: Bearer USER-TOKEN-FOO` (not the SP's marker), proving the chain runInUserContext → getWorkspaceClient → client.config.authenticate → fetch headers is intact end-to-end. Verified the test fails meaningfully if the wrap is removed. Phase 4 of files-per-volume-auth-mode. Phase 5 (`VolumeHandle.asUser` SDK identity fix) is next. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
`appKit.files("vol").asUser(req).<op>()` previously only swapped the
user identity passed to the volume policy — the underlying SDK call
still ran as the service principal. After this change, every method on
the returned handle is wrapped in `runInUserContext` with the request's
user identity, so the UC SDK call genuinely executes as the end user.
This is a hard override at the SDK level: it applies even on
`auth: "service-principal"` volumes, where it was previously a no-op
for SDK identity.
Implementation: a new `_wrapVolumeAPIInUserContext` helper wraps each
method in the returned `VolumeAPI`. Reuses `_buildUserContextOrNull` so
the dev-mode fallback (`NODE_ENV === "development"` + missing token)
skips the wrap and continues to execute as the SP — matching pre-OBO
behavior locally without a reverse proxy.
Strict `_extractUser` is unchanged — `asUser` still throws
`AuthenticationError.missingToken` in production when the user header
is missing.
Programmatic OBO note: `appKit.files("obo-vol").<op>()` (no req, no
asUser) cannot synthesize a user identity and continues to execute
against whatever client `getWorkspaceClient()` resolves to at the call
site (typically the SP at the top level). The OBO volume default
applies to HTTP route traffic via `_runWithAuth`. For programmatic
per-user execution, `asUser(req)` is the supported path.
Phase 5 of files-per-volume-auth-mode.
BREAKING CHANGE: programmatic callers of
appKit.files("vol").asUser(req).<op>() that expected the SDK call to
execute as the service principal (with only the policy seeing the
user) now see the SDK call execute as the user. Audit programmatic
asUser callers and remove the asUser wrap if SP credentials were the
desired behavior.
Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
…phase 6)
Tags every Files plugin trace span with `files.auth_mode` set to either
`"service-principal"` or `"on-behalf-of-user"`, reflecting what
operationally happened (whether the SDK call ran inside
`runInUserContext`). Two complementary plumbing paths land on the same
attribute key:
- HTTP routes: route handlers compute the effective mode via a new
`_effectiveAuthMode(req, volumeKey)` helper and thread it into the
existing `PluginExecutionSettings.telemetryInterceptor.attributes`
shape — `TelemetryInterceptor` already passes those attributes to
`tracer.startActiveSpan`, so the attribute lands on the existing
`plugin.execute` span without new infrastructure.
- Programmatic API (`exports()` SP path + `asUser` OBO path): wraps each
VolumeAPI method in a new `_withAuthModeSpan(operation, mode, fn)`
helper that calls `this.telemetry.startActiveSpan("files.<op>", ...)`.
Programmatic calls previously created NO span, so this introduces new
`files.<op>` spans on programmatic surface — a small observability
enrichment beyond pure attribute plumbing. Spans respect the plugin's
`traces` config; the noop tracer takes over when traces are disabled.
Updates `getResourceRequirements()` JSDoc to document the per-volume
permission split: SP volumes need the SP to hold `WRITE_VOLUME`; OBO
volumes need the end user (not the SP) to hold it. Adds a
`// TODO: extend plugin-manifest.schema.json` marker for the eventual
schema-level fix.
Phase 6 of files-per-volume-auth-mode. Phase 7 (docs, playground,
changelog) is next.
Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Final polish phase for files-per-volume-auth-mode: JSDoc: - Expand `VolumeConfig.auth` and `IFilesConfig.auth` with SP and OBO example blocks plus resolution-order notes. - Extend `FilePolicyUser.isServicePrincipal` JSDoc with the full six-row matrix covering SP/OBO x HTTP/programmatic x header presence, including the `asUser(req)` rows. Docs (`docs/docs/plugins/files.md`): - New "Auth modes" section: two modes, resolution order, per-mode permission requirements, prod/dev OBO behavior, manifest-scope limitation, side-by-side config examples. - New `usersOnly` policy example using `isServicePrincipal: false` and a "Policy user matrix" subsection. - Rewrites every "all operations execute as the service principal" paragraph to describe both modes. - Dedicated `asUser(req)` subsection with the honest limitation that programmatic OBO defaults don't apply without `asUser`. Dev-playground: - Server: new `obo_demo` volume with `auth: "on-behalf-of-user"` and a `usersOnly` policy; new smoke route `GET /policy/obo-volume` using `asUser(req)`. `app.yaml` gets the matching `DATABRICKS_VOLUME_OBO_DEMO` resource binding. - Client: `policy-matrix.route.tsx` gets a "Per-volume OBO mode" section with two probes (HTTP route + programmatic smoke) so the end-to-end OBO path is exercisable from the UI. Changelog: - New `## Unreleased` section in `CHANGELOG.md` (lives above the release-it generated `[0.24.0]` block) covering the per-volume auth feature, the `files.auth_mode` span, the `asUser` SDK identity behavior change, the `bypassPolicy` removal, and the honest limitation about programmatic OBO without `asUser`. Generated: - `types.generated.ts` and `plugin-manifest.generated.ts` are regenerated formatting (line-collapsing) from `pnpm build`'s generator step. Content unchanged. Backpressure: pnpm build, pnpm docs:build, pnpm check:fix, pnpm -r typecheck, pnpm test (1666 passing) all clean. Phase 7 of files-per-volume-auth-mode — feature is shippable. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
…che, spans
Addresses four findings from the multi-model review of the files OBO
feature.
1. asUser privilege confusion in production (CRITICAL, security)
`_extractUser` now requires BOTH `x-forwarded-user` AND
`x-forwarded-access-token` in production — throws
`AuthenticationError.missingToken` when either is missing. Previously
a request with the user header but no token returned a SP-wrapped
API where the policy saw the request as a real end user
(isServicePrincipal: undefined) but the SDK ran with SP credentials,
so policies like `usersOnly: !user.isServicePrincipal` were
satisfiable while wielding the SP's broader UC grants. CWE-639.
Dev fallback now marks the returned policy user as
`isServicePrincipal: true` and warns (was debug) so policies can't
be tricked even in dev.
2. Double WorkspaceClient allocation per OBO request (HIGH, perf)
New `_resolveAuthForRequest(req, volumeKey)` returns
{ mode, userCtx } and builds the UserContext at most once per
request. `_runWithAuth(userCtx, fn)` now takes the pre-built ctx.
Removes `_effectiveAuthMode` (no longer needed). Every OBO HTTP
request previously constructed two WorkspaceClients; cache hits
paid that cost too. Now: one allocation, even on cache hits.
3. Write-cache invalidation on OBO + wrong path arg (HIGH, correctness)
Two related bugs at `_invalidateListCache`: (a) cache keys included
`getCurrentUserId()`, so user A's write only busted user A's list
cache — user B continued to see stale listings; (b) the `path` arg
was the file path instead of the parent directory.
Fix: list-cache is disabled on OBO volumes (no cross-user staleness
possible). On SP volumes, invalidation now derives the parent
directory via `parentDirectory()`, falling back to the `"__root__"`
sentinel for root-level writes — matches `_handleList`'s key shape.
Cache delete + path resolution are wrapped in best-effort try/catch
so an invalidation failure cannot convert a successful write into
HTTP 500.
4. Duplicate `files.<op>` spans on programmatic calls (HIGH, perf)
`FilesConnector.<op>` already opens a `files.<op>` span via its
`traced()` decorator. The Phase 6 `_withAuthModeSpan` opened an
identical span on top, doubling span allocation/export volume on
every programmatic call.
Fix: new `runWithFilesSpanAttributes(attrs, fn)` exported from
`connectors/files/client.ts` uses AsyncLocalStorage to make ambient
attributes available to the connector's `traced()` decorator.
`_withAuthModeSpan` renamed to `_withAuthModeAttributes` — no longer
creates a span; just sets `files.auth_mode` on the connector's
existing span via the ALS channel. Programmatic calls now produce
exactly one `files.<op>` span. HTTP route attribute path
(`PluginExecutionSettings.telemetryInterceptor.attributes`) is
unchanged.
Tests: 8 new regression guards covering each fix's failure mode (1674
total, was 1666). The pre-existing `DownloadResponse` unused-import
warning is also resolved as cleanup fallout.
User-facing notes (for changelog follow-up):
- asUser(req) throws on missing token in production (was silent SP
fallback — the previous behavior was a privilege-confusion bug).
- OBO volume reads are no longer cached. Trade-off: every OBO read
hits the SDK; in exchange, no cross-user staleness. SP volume reads
still cache.
- Programmatic appKit.files(vol).<op>() emits one `files.<op>` span
instead of two. The `files.auth_mode` attribute now lands on the
connector's existing span.
Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
Two follow-up review fixes on top of 3f98628. A — _invalidateListCache not awaited (medium, correctness) Write handlers (_handleUpload, _handleMkdir, _handleDelete) called _invalidateListCache without `await`, so res.json() shipped before cache invalidation completed. A client that issued a follow-up GET /list in the same tick could hit the stale cache. Fix: _invalidateListCache is now genuinely `async` and all three call sites `await` it before sending the response. Best-effort try/catch around cache.delete and connector.resolvePath remains, so an invalidation failure cannot convert a successful write into 500. Regression test installs a deferred-promise cache.delete and asserts res.json is NOT called until the delete settles. Verified the test fails when the `await` is removed and passes when it is added back. B — hardcoded ports in integration tests (medium, CI hygiene) plugin.integration.test.ts bound to fixed offsets of TEST_PORT (9880, 9881, 9882). Concurrent CI runs or stale test processes risked EADDRINUSE flakiness. Fix: bind `port: 0` so the OS assigns ephemeral ports. New getListeningPort() helper waits for the server's `listening` event and returns the assigned port for building localBase URLs. Chose port-0 over supertest because the tests build their own AppKit instance, hold the Server for afterAll cleanup, and use real fetch calls — moving to supertest would have required restructuring the fixture lifecycle. Tests: 1675 (was 1674 + 1 new). typecheck + biome clean. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Signed-off-by: Atila Fassina <atila@fassina.eu> # Conflicts: # CHANGELOG.md # apps/dev-playground/server/index.ts
- _readPathQuery helper rejects array/object query params with 400 across all 7 path-bearing handlers (was: req.query.path raw-cast) - /read streams via connector.download + size-enforcing TransformStream capped at maxReadSize (new VolumeConfig field, default 10MB); /read no longer participates in the read-tier cache - Upload TransformStream enforces bytesReceived <= contentLength when declared, closing a per-user policy bypass where small Content-Length + larger body would exceed an approved upload size - _enforcePolicy 401 and _handleApiError 4xx now return generic public bodies (Unauthorized / standard HTTP STATUS_CODES); raw error.message goes to server-side logs only (CWE-209) Co-authored-by: Isaac Co-authored-by: Orca <help@stably.ai> Signed-off-by: Atila Fassina <atila@fassina.eu>
Two HIGH correctness findings from multi-model review of sp-files vs main:
1. /read could leak a partial 200 body before triggering 413. Once
nodeStream.pipe(res) began, headers were sent and the size-cap
fallback became res.destroy() mid-stream — breaking the all-or-nothing
contract. Replaced the pipe with a bounded getReader() drain into
chunks; on overflow we respond 413 atomically (and reader.cancel()
the upstream), otherwise res.send(Buffer.concat(chunks, bytesRead))
ships the body in one shot. Memory still bounded by maxReadSize.
2. _invalidateListCache deleted only "__root__" for root-level writes,
but _handleList keys ?path=/ listings under connector.resolvePath("/").
A client listing via ?path=/ saw stale data after a sibling write.
Invalidator now deletes both keys (with try/catch around resolvePath
and a no-op when it equals the sentinel).
113 plugin tests + 28 integration tests + workspace typecheck green.
Co-authored-by: Isaac
Signed-off-by: Atila Fassina <atila@fassina.eu>
There was a problem hiding this comment.
Pull request overview
This PR adds per-volume authentication mode support to the Files plugin so Unity Catalog SDK calls can execute either as the service principal (default) or on-behalf-of-user (OBO) using forwarded user headers, unblocking UC volume access for OBO apps in production and improving telemetry + request hardening.
Changes:
- Introduces
authconfiguration at both plugin (IFilesConfig.auth) and volume (VolumeConfig.auth) levels, wiring HTTP routes andVolumeHandle.asUser(req)throughrunInUserContextfor OBO execution. - Adds
files.auth_modetelemetry tagging across HTTP + programmatic code paths (implemented via AsyncLocalStorage attribute propagation into connector spans). - Hardens HTTP behavior: stricter header/token requirements in production, safer error responses, upload
Content-Lengthenforcement, and/readresponse-size limiting viamaxReadSize.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/appkit/src/plugins/files/types.ts | Adds new config fields (auth, maxReadSize) and updates public API docs to reflect SP vs OBO execution semantics. |
| packages/appkit/src/plugins/files/policy.ts | Expands FilePolicyUser docs to clarify id/isServicePrincipal behavior across auth modes and call paths. |
| packages/appkit/src/plugins/files/plugin.ts | Implements auth resolution, OBO context wiring (runInUserContext), telemetry attributes, query parsing hardening, cache behavior changes, and updated error handling. |
| packages/appkit/src/connectors/files/client.ts | Adds ALS-based span attribute propagation (runWithFilesSpanAttributes) merged into connector spans. |
| packages/appkit/src/plugins/files/tests/plugin.test.ts | Adds extensive unit coverage for auth inheritance, OBO identity selection, cache behavior, invalidation timing, and telemetry attributes. |
| packages/appkit/src/plugins/files/tests/plugin.integration.test.ts | Improves integration test reliability by using ephemeral ports and adds coverage for header-less SP-mode behavior. |
| docs/docs/plugins/files.md | Documents per-volume auth modes, policy-user matrix, and programmatic asUser(req) semantics. |
| docs/docs/api/appkit/Interface.FilePolicyUser.md | Updates generated API docs for FilePolicyUser fields and behavior matrix. |
| CHANGELOG.md | Adds an Unreleased entry describing new auth modes, telemetry attribute, behavior changes, and limitations. |
| apps/dev-playground/server/index.ts | Adds an OBO demo volume + probe route to validate end-to-end OBO behavior. |
| apps/dev-playground/client/src/routes/policy-matrix.route.tsx | Adds UI smoke tests/panels for the per-volume OBO demo. |
| apps/dev-playground/app.yaml | Adds env wiring for the OBO demo volume. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… msg, docs Four findings from the GitHub Copilot review on PR #310: 1. _invalidateListCache missed absolute UC root paths. A write to /Volumes/<c>/<s>/<v>/foo.txt has parentDirectory("/Volumes/<c>/<s>/<v>"), which the prior code didn't recognize as root-level — the rootless "__root__" listing stayed stale. Now the invalidator computes the volume root via connector.resolvePath(""), recognizes it as root-level, and invalidates all three keys "__root__", <volumeRoot>, <volumeRoot>/. Replaces the previous resolvePath("/") attempt that was a no-op (resolvePath rejects "/" since it's not /Volumes/...). 2. _extractObiUser threw "Missing x-forwarded-access-token" in every non-(token && userId) production case, so a request with only the token (but no x-forwarded-user) got a misleading server-side error. Now distinguishes: !token throws "missing token", token-but-no-userId throws "missing user header". HTTP body remains generic "Unauthorized" either way. 3. docs/plugins/files.md said OBO volumes get per-user cache isolation automatically. Code disables read/list cache entirely on OBO (cross-user invalidation isn't possible with the current key scheme). Doc updated to match. 4. docs/plugins/files.md said asUser(req) throws when x-forwarded-user is absent. Code requires BOTH x-forwarded-user AND x-forwarded-access-token in production. Doc updated at both occurrences. Tests: existing root-level invalidation test rewritten as a parameterized matrix covering both relative ("newdir") and absolute ("/Volumes/.../newdir") root-level writes; both must invalidate __root__ + <volumeRoot> + <volumeRoot>/. 188 plugin tests + workspace typecheck green. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Agentic review — performance & placement notes (Xavier panel)This comment summarizes three medium-severity performance / cost-of-operations findings from an automated review pass (Xavier remoras on Where to look: PR “Files changed” → 1) HTTP
|
Typo fix per review (OBO = on-behalf-of-user, not "Obi"). Touches the declaration, the single call site in `_enforcePolicy`, and one test-comment reference. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
The NOTE claimed SDK calls "still use the service principal until a later phase wires the actual OBO runInUserContext plumbing" — but that phase shipped: every handler now wraps in `_runWithAuth` / `runInUserContext` when the volume resolves to OBO. Replace the NOTE with the current truth: the policy and SDK identities are selected separately and converge per the policy-user matrix. Co-authored-by: Isaac Signed-off-by: Atila Fassina <atila@fassina.eu>
Add on-behalf-of-user (OBO) support to the files plugin so SDK calls run as the end user, unblocking Unity Catalog volume access for OBO apps deployed on Databricks Apps.
What changed
Per-volume opt-in to OBO via a new
authfield on each volume config:When
auth: "on-behalf-of-user":x-forwarded-user+x-forwarded-access-token(401 inproduction if either is missing; dev mode falls back to SP identity with a
warning).
runInUserContextso UC sees the user, not the SP.files.auth_modespan attribute distinguishes OBO vs SP traffic intelemetry.
appKit.files("vol").asUser(req).read(...)works on any volume(forces OBO regardless of the volume's default auth mode).
Note
OBO read tier intentionally disables caching (the cache layer keys by user; a write by user A wouldn't invalidate user B's view of the same path). Documented in the code; future work tracked for per-(volume, path) generation counters.
What's NOT changed
auth: "service-principal"semantics with no migration.WRITE_VOLUMEto the SP for every volume. OBO volumes need the user to also have UC permission, which has to be communicated out-of-band until the manifest schema gains a per-volume auth scope. (TODO noted ingetResourceRequirements.)Hardening
Content-Lengthparsing (strict regex, rejects negative / partial /array values; stream-level byte counter as the real gate).
_handleApiErrorno longer reflects raw SDKerror.messageon 4xx (CWE-209); generic body, raw to server log./readroute is atomic: drains the body into a memory buffer up tomaxReadSize; on overflow responds 413 atomically (no partial 200 body leak)."__root__"andconnector.resolvePath("/")so?path=/listings stay fresh._extractObiUseris strict in production (both headers required), warning + SP fallback only in dev._invalidateListCacheis awaited before the HTTP success response so a follow-upGET /listin the same tick can't observe stale data.Tests
asUser()programmatic path,files.auth_modespan attribute on every code path, root-level cache invalidation key parity.Test plan
packages/appkit/src/plugins/files/tests/plugin.integration.test.ts)pnpm -r typecheck