Fix U2M/M2M OAuth for SPOG (unified) AWS hosts#374
Conversation
SPOG / unified hosts (a single *.databricks.com custom URL that fronts
workspaces across accounts, addressed with ?o=<workspaceId>) failed OAuth
in two ways:
1. InferCloudFromHost returned Unknown for bare *.databricks.com hosts, so
the U2M authenticator errored ("unhandled cloud type") before any OAuth.
2. For AWS/GCP, GetEndpoint discovered the OIDC endpoint by hitting
https://<host>/oidc directly. On a SPOG host that resolves to the
account-agnostic account-console authorize endpoint, which mints a token
for the caller's default account. The target workspace (owned by a
different account) then rejects it with 400 "Invalid Token".
Fixes:
- InferCloudFromHost: classify bare *.databricks.com hosts as AWS (checked
after Azure/GCP so those remain correctly classified).
- GetEndpoint (AWS/GCP): resolve the OIDC issuer via
/.well-known/databricks-config, substituting the {account_id} placeholder,
so unified/SPOG hosts use their account-rooted endpoint. Normal workspace
hosts advertise https://<host>/oidc, so the issuer is identical to before.
Any failure (absent endpoint, non-200, unparseable, timeout) falls back to
the previous bare-host issuer, so existing behavior is preserved. This
mirrors databricks-sdk-go's config host-metadata resolution.
Adds oauth_test.go covering cloud inference (incl. SPOG and the
GCP/Azure disambiguation), {account_id} substitution, and the
databricks-config failure/fallback paths.
Tested end-to-end against staging via the real driver auth paths:
- AWS non-SPOG (e2-dogfood...cloud.databricks.com): U2M and M2M pass.
- AWS SPOG (dogfood.staging.databricks.com): U2M and M2M pass.
- Azure SPOG (dogfood-spog.staging.azuredatabricks.net): U2M passes
(Azure path unchanged; account resolution handled by the host redirector).
- Full unit suite passes with no regressions.
Co-authored-by: Isaac
| client := &http.Client{Timeout: hostConfigTimeout} | ||
|
|
||
| meta, ok := fetchHostMetadata(ctx, client, url) | ||
| if !ok || meta.OIDCEndpoint == "" { |
There was a problem hiding this comment.
[HIGH] resolveOIDCIssuer guards only !ok || meta.OIDCEndpoint == ""; it never checks meta.AccountID. If metadata returns oidc_endpoint: ".../accounts/{account_id}" with account_id: "", substituteAccountID (line 98) yields .../accounts/ — an account-less, malformed issuer. Execution has already committed past the fallback here, so oidc.NewProvider fails and GetEndpoint returns a hard error rather than degrading to the bare host.
This directly contradicts this function's own doc comment: "Any failure (… missing field …) falls back to the bare-host issuer."
Fix: when the post-substitution string still contains {account_id} (or OIDCEndpoint has the placeholder while AccountID == ""), return fallback. Flagged independently by the devil's-advocate, test, language, and architecture reviewers.
Posted by code-review-squad · /full-review · feedback: #code-review-squad-feedback
| fallback := fmt.Sprintf("https://%s/oidc", hostName) | ||
|
|
||
| url := fmt.Sprintf("https://%s/.well-known/databricks-config", hostName) | ||
| client := &http.Client{Timeout: hostConfigTimeout} |
There was a problem hiding this comment.
[MEDIUM] &http.Client{Timeout: hostConfigTimeout} has a nil Transport, so it uses http.DefaultTransport — bypassing WithTransport, WithSkipTLSHostVerify (connector.go:378-399), and the driver's PooledTransport. A user behind a corporate proxy (programmatic transport), with a custom CA / mTLS, or with InsecureSkipVerify set in dev will have this fetch fail TLS/connect and silently fall back to bare-host discovery — the exact broken path this PR fixes — with no diagnostic. (Env-var proxies still work; http.DefaultTransport honors HTTPS_PROXY.)
Not a regression: the pre-existing oidc.NewProvider(ctx, …) already uses the default client (no oidc.ClientContext installed). But it does mean the fix is a no-op in exactly the enterprise/dev environments where SPOG is most likely exercised. fetchHostMetadata already takes an injected *http.Client; thread the driver's configured client/transport (or at least cfg.TLSConfig) down through GetEndpoint → resolveOIDCIssuer. This also closes the testability gap on resolveOIDCIssuer.
Posted by code-review-squad · /full-review · feedback: #code-review-squad-feedback
|
|
||
| meta, ok := fetchHostMetadata(ctx, client, url) | ||
| if !ok || meta.OIDCEndpoint == "" { | ||
| return fallback |
There was a problem hiding this comment.
[MEDIUM] Every non-200, redirect-to-HTML, malformed JSON, or wrong-account_id response collapses to the same silent bare-host fallback. A genuinely misconfigured SPOG host then re-produces the opaque 400 Invalid Token this PR exists to eliminate, with zero breadcrumb as to why.
The repo already uses zerolog/log in these packages — a log.Debug()/log.Warn() on the fallback path would pay for itself in support time. (security + devil's-advocate)
Posted by code-review-squad · /full-review · feedback: #code-review-squad-feedback
| } | ||
| } | ||
|
|
||
| func TestResolveOIDCIssuer_substitutesAccountID(t *testing.T) { |
There was a problem hiding this comment.
[MEDIUM] TestResolveOIDCIssuer_substitutesAccountID (and _workspaceHostUnchanged at line 90) are misnamed: they call fetchHostMetadata + substituteAccountID directly and never call resolveOIDCIssuer. So the actual fallback wiring (!ok || OIDCEndpoint == "" → bare host), the URL-format construction, the oidc_endpoint == "" guard, and the empty-account_id case (see the High finding) are all uncovered. TestFetchHostMetadata_failuresFallBack proves failure detection (ok=false), not the fallback string mapping. And GetEndpoint end-to-end — the M2M TokenURL actually coming out account-rooted, the whole point of the PR — is untested.
Suggested fix
add a table-driven resolveOIDCIssuer test against an httptest server covering substitute / unchanged / empty-endpoint-fallback / failure-fallback / empty-account_id. (test + devil's-advocate)
/full-review · feedback: #code-review-squad-feedback
| // For normal workspace hosts this resolves to https://<host>/oidc, identical to | ||
| // the previous behavior. | ||
| issuerURL := resolveOIDCIssuer(ctx, hostName) | ||
| ctx = oidc.InsecureIssuerURLContext(ctx, issuerURL) |
There was a problem hiding this comment.
[MEDIUM] oidc.InsecureIssuerURLContext disables the check that the discovered config's issuer field matches the requested issuer. Pre-PR this was benign because issuerURL was always pinned to https://<host>/oidc. Post-PR the issuer is taken from a host-supplied JSON document (line 50) and may point at a different host. Not exploitable without controlling the target host's HTTPS response, but it widens the trust surface and deserves an explicit decision:
- validate the resolved issuer host against an allowlist of Databricks suffixes before passing to
NewProvider, or - drop the
InsecureIssuerURLContextoverride on the AWS/GCP path now that the issuer is resolved dynamically, so go-oidc re-enforces issuer matching.
Posted by code-review-squad · /full-review · feedback: #code-review-squad-feedback
| // substituteAccountID resolves the {account_id} placeholder in the advertised | ||
| // oidc_endpoint. Workspace hosts have no placeholder and are returned unchanged. | ||
| func substituteAccountID(meta hostMetadata) string { | ||
| return strings.ReplaceAll(meta.OIDCEndpoint, "{account_id}", meta.AccountID) |
There was a problem hiding this comment.
[LOW] substituteAccountID returns meta.OIDCEndpoint verbatim with no scheme check. If the metadata doc returns oidc_endpoint: "http://.../oidc", oidc.NewProvider issues a plaintext discovery GET and the derived authorize/token URLs are HTTP — OAuth traffic in cleartext. The hard-coded fallback is always HTTPS, so this only appears on the new metadata-driven path.
Suggested fix
a one-line strings.HasPrefix(resolved, "https://") guard (fall back otherwise) closes it and matches the stated "fall back on anything unexpected" intent. Gated on controlling the (HTTPS-served) metadata response.
/full-review · feedback: #code-review-squad-feedback
| fallback := fmt.Sprintf("https://%s/oidc", hostName) | ||
|
|
||
| url := fmt.Sprintf("https://%s/.well-known/databricks-config", hostName) | ||
| client := &http.Client{Timeout: hostConfigTimeout} |
There was a problem hiding this comment.
[LOW] Every AWS/GCP GetEndpoint now makes a synchronous /.well-known/databricks-config call before the existing OIDC discovery, on the connection-setup path. It is bounded only by the fixed 30s hostConfigTimeout — both production callers pass context.Background(), so the caller's deadline doesn't tighten it (though fetchHostMetadata does honor ctx cancellation via NewRequestWithContext). The result is immutable per host but is re-fetched on every authenticator construction with no caching.
Consider: deriving the timeout from the incoming ctx, a tighter default, and memoizing per host. databricks-sdk-go resolves host metadata once per Config and reuses it. (architecture + security + devil's-advocate)
Posted by code-review-squad · /full-review · feedback: #code-review-squad-feedback
Code Review Squad — ReviewScore: 76/100 — MODERATE RISK The fix is well-targeted and the fallback design is conservative (any metadata-fetch failure degrades to the historical bare-host issuer), and the Low FindingsLow findings — click to expand"Mirrors databricks-sdk-go" claim is inaccurate; /.well-known/databricks-config server contract unverifiedThe devil's-advocate reviewer checked databricks-sdk-go v0.111.0: it does not fetch Two asks: (1) correct the PR description's "Mirrors databricks-sdk-go" wording, and (2) confirm that Also worth a one-line confirmation (raised by devil's-advocate + security): the AWS-only assumption for bare |
- Fall back to the bare-host issuer when oidc_endpoint has an {account_id}
placeholder but account_id is empty (previously produced a malformed
".../accounts/" issuer and a hard error, violating the fallback contract).
- Validate the metadata-supplied issuer before use (isValidDatabricksIssuer):
must be https, on a recognized Databricks domain, with no unresolved
placeholder. Covers the http:// cleartext case and bounds the trust placed
in the host-supplied document given InsecureIssuerURLContext disables the
issuer-match check.
- Log (debug/warn) on every fallback path so a misconfigured SPOG host leaves
a breadcrumb instead of silently reproducing the opaque 400.
- Inject the *http.Client into resolveOIDCIssuer and bound OIDC discovery with
the same client (oidc.ClientContext); tighten the timeout 30s -> 10s.
- Tests: drive resolveOIDCIssuer end-to-end via httptest (substitute /
workspace-unchanged / empty-account_id / empty-endpoint / non-https /
non-databricks-host / 404 / garbage / unreachable) and add
isValidDatabricksIssuer coverage.
Co-authored-by: Isaac
|
Thanks for the review — addressed in 548e4bf. [HIGH] empty [MED] [MED] silent fallback / no diagnostics. Added [MED] tests skip [LOW] latency / timeout / caching. Tightened the timeout 30s→10s and bound OIDC discovery with the same client ( [MED] transport/TLS not honored. Acknowledged and, as the review notes, not a regression — the pre-existing [LOW] "mirrors databricks-sdk-go" wording + contract. Corrected the PR description. I verified the bare |
Code Review Squad — Follow-up on
|
The hardening commit made InferCloudFromHost (substring matching) load-bearing
for the cross-host issuer trust decision in isValidDatabricksIssuer, so a host
like "databricks.com.evil.example" would have passed.
Replace the InferCloudFromHost call in the trust gate with isDatabricksHost,
which uses HasSuffix against an explicit list of Databricks DNS suffixes
(case-insensitive). Cloud routing (InferCloudFromHost) is left unchanged.
Tests: add TestIsDatabricksHost and extend TestIsValidDatabricksIssuer /
TestResolveOIDCIssuer with suffix-spoof ("databricks.com.evil.example"),
"-databricks.com", and bare-apex cases, plus the real SPOG/non-SPOG issuer
hosts as positives.
Verified end-to-end against staging over the network (opt-in test, not
committed): GetEndpoint resolves the SPOG host to its account-rooted endpoint
(accounts.staging.cloud.databricks.com/oidc/accounts/<id>/v1/{authorize,token})
and the non-SPOG host to its workspace endpoint, unchanged by the gate.
Co-authored-by: Isaac
|
Thanks for the re-review. Both remaining items addressed. #2 (Contains looseness now a trust boundary). Fixed in ae8bf4d. The trust gate no longer goes through #1 (server contract — gating). Verified live, not assumed.
Field names match the struct exactly. And end-to-end through the real driver the full CUJ works on these hosts (U2M + M2M, table in the PR description). I also ran an opt-in live test of So it is a shipped contract (dev/staging today; prod ramp is tracked separately by the platform team). I corrected the "Mirrors databricks-sdk-go" wording in the PR description — the SDK has the same host-metadata concept in its All hermetic unit tests pass; full suite (incl. root package) green with no regressions. |
Problem
SPOG / unified hosts — a single
*.databricks.comcustom URL that fronts workspaces across multiple accounts, addressed with?o=<workspaceId>— failed OAuth (U2M and M2M) in two ways:Cloud inference.
InferCloudFromHostreturnedUnknownfor bare*.databricks.comhosts, sou2m.NewAuthenticatorerrored withunhandled cloud typebefore any OAuth happened.Endpoint discovery. For AWS/GCP,
GetEndpointdiscovered the OIDC endpoint by hittinghttps://<host>/oidcdirectly. On a SPOG host the discovery doc points at the account-agnostic account-console authorize endpoint (accounts.../oidc/v1/authorize), which mints a token for the caller's default account. The target workspace (owned by a different account) then rejects it with400 Invalid Token(ext_authz_denied).The non-SPOG path was unaffected: a dedicated workspace host advertises its own workspace-rooted endpoint, so there was never any account ambiguity.
Fix
InferCloudFromHost: classify bare*.databricks.comhosts as AWS. Checked after the Azure (.azuredatabricks.net) and GCP (.gcp.databricks.com) matches so those stay correctly classified.GetEndpoint(AWS/GCP): resolve the OIDC issuer via/.well-known/databricks-config, substituting the{account_id}placeholder, so unified/SPOG hosts use their account-rooted endpoint (https://<host>/oidc/accounts/<accountId>). Normal workspace hosts advertisehttps://<host>/oidc, so the resolved issuer is identical to before. The resolved issuer is validated (https + recognized Databricks domain + placeholder fully resolved); any failure or unusable value falls back to the previous bare-host issuer, preserving existing behavior.Azure is unchanged — its account resolution is handled server-side by the host's
/oidc/oauth2/v2.0/authorizeredirector.Server contract
/.well-known/databricks-configwas verified live on staging: bothdogfood.staging.databricks.com(unified) ande2-dogfood.staging.cloud.databricks.com(workspace) return{oidc_endpoint, account_id}with those exact field names; the unified host returnsoidc_endpoint: https://<host>/oidc/accounts/{account_id}. (The same host-metadata concept exists indatabricks-sdk-go'sconfigpackage; this driver implements its own minimal resolver against the shipped endpoint.)Tests
auth/oauth/oauth_test.go:InferCloudFromHost(incl. SPOG hosts and GCP/Azure disambiguation),resolveOIDCIssuerend-to-end viahttptest(substitute / workspace-unchanged / empty-account_id/ empty-endpoint / non-https / non-databricks-host / 404 / garbage / unreachable),isValidDatabricksIssuer, andfetchHostMetadatafailure paths.End-to-end against staging via the real driver auth paths (no overrides):
e2-dogfood…cloud.databricks.com(non-SPOG AWS)e2-dogfood…cloud.databricks.com(non-SPOG AWS)dogfood.staging.databricks.com(SPOG AWS)dogfood.staging.databricks.com(SPOG AWS)dogfood-spog.staging.azuredatabricks.net(SPOG Azure)Full unit suite passes with no regressions.
Known limitation
The
databricks-configGET (and the existing OIDC discovery) use the default HTTP transport; a connector-suppliedWithTransport/WithSkipTLSHostVerifyis not threaded into the OAuth endpoint-resolution path. This is pre-existing (the prioroidc.NewProviderdiscovery had the same behavior) and is left for a follow-up since it requires changing the public authenticator constructors.