Extract SPOG org-id from cluster httpPath for non-Thrift requests #1472

Open
msrathore-db wants to merge 1 commit into databricks:main from msrathore-db:fix-spog-org-id-extract-cluster-path

@msrathore-db
Collaborator

Summary

  • Fixes silent telemetry loss on SPOG (custom-URL) hosts when connecting to an all-purpose cluster via a Thrift httpPath like sql/protocolv1/o/<workspace-id>/<cluster-id>.
  • parseCustomHeaders now extracts the workspace ID from the /o/<wsid>/ path segment as a fallback when ?o=<wsid> is not present, and sets it as the x-databricks-org-id header.
  • Priority order preserved: explicit http.header.x-databricks-org-id > ?o= in httpPath > /sql/protocolv1/o/<wsid>/ path segment.
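
The priority order above can be sketched as a "first non-empty candidate wins" lookup. This is only an illustration, not the driver's code: the class name OrgIdPrioritySketch and the method resolveOrgId are hypothetical, and the three candidates are assumed to have been extracted already (caller-supplied header, ?o= value, /o/<wsid>/ path segment).

```java
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical illustration of the precedence described in the summary.
// None of these names come from the driver itself.
public class OrgIdPrioritySketch {

  /**
   * Returns the first candidate that is neither null nor blank, checking in
   * priority order: explicit header, then ?o= value, then path segment.
   */
  static Optional<String> resolveOrgId(
      String explicitHeader, String queryParamOrgId, String pathOrgId) {
    return Stream.of(explicitHeader, queryParamOrgId, pathOrgId)
        .filter(v -> v != null && !v.trim().isEmpty())
        .findFirst();
  }

  public static void main(String[] args) {
    // Only the path segment is available: it is used as the fallback.
    System.out.println(resolveOrgId(null, null, "123"));
    // ?o= is present: it takes precedence over the path segment.
    System.out.println(resolveOrgId(null, "456", "123"));
    // Caller set the header explicitly: it wins over both.
    System.out.println(resolveOrgId("999", "456", "123"));
  }
}
```

In this sketch, a result with no value corresponds to the case where no x-databricks-org-id header is set at all.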

Why

On a SPOG host the workspace identity has to be in either the URL or the x-databricks-org-id header so PoPP can route a request to the right workspace. For cluster Thrift this happens "for free" because the workspace ID is in the /o/<wsid>/ segment of the path, so PoPP routes it correctly via routing_reason=workspace-id and the Thrift session opens fine without the user supplying ?o=. The connection-level telemetry, feature-flag, and other helper requests, however, target different paths (/telemetry-ext, /api/...) that don't carry the workspace ID, and the driver only set the x-databricks-org-id header when ?o= was present in httpPath. Result: those requests had no workspace context on SPOG, PoPP fell back to default (account) routing, and the response was an HTTP 303 redirect to /login — silently dropping telemetry.

Reproduced against peco.azuredatabricks.net (Prod SPOG), workspace 6436897454825492, cluster 1214-195625-gtrwbe64:

| URL httpPath | Thrift queries | Telemetry POST /telemetry-ext |
| --- | --- | --- |
| sql/protocolv1/o/<wsid>/<cluster> (no ?o=, no fix) | ✅ 200 | 303 → /login (silent loss) |
| Same, with this fix | ✅ 200 | 200 OK |

The fix only adds extraction logic; it does not modify the outgoing request URL or the routing path.

What changes

src/main/java/com/databricks/jdbc/api/impl/DatabricksConnectionContext.java

  • New static CLUSTER_PATH_ORG_ID_PATTERN regex matching (?:^|/)sql/protocolv1/o/(\d+)/[^/?]+.
  • parseCustomHeaders falls back to that pattern when the ?o= extraction does not yield a workspace ID. The existing precedence guard (explicit http.header.x-databricks-org-id already set by the caller wins) is unchanged. The fallback runs only when httpPath is non-empty.
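
A minimal sketch of the fallback extraction, assuming the pattern text quoted above. The class ClusterPathOrgIdSketch, the method extractOrgId, and the QUERY_ORG_ID_PATTERN used for ?o= are hypothetical stand-ins; the real parseCustomHeaders may parse ?o= differently, and the explicit-header precedence guard is left out here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the extraction step; not the driver's implementation.
public class ClusterPathOrgIdSketch {

  // Pattern text as quoted in the PR description.
  static final Pattern CLUSTER_PATH_ORG_ID_PATTERN =
      Pattern.compile("(?:^|/)sql/protocolv1/o/(\\d+)/[^/?]+");

  // Assumption: a simple digits-only ?o= matcher, used only for this sketch.
  static final Pattern QUERY_ORG_ID_PATTERN = Pattern.compile("[?&]o=(\\d+)");

  /** Tries ?o= first, then falls back to the /o/<wsid>/ cluster path segment. */
  static Optional<String> extractOrgId(String httpPath) {
    if (httpPath == null || httpPath.isEmpty()) {
      return Optional.empty(); // the fallback only runs for a non-empty httpPath
    }
    Matcher query = QUERY_ORG_ID_PATTERN.matcher(httpPath);
    if (query.find()) {
      return Optional.of(query.group(1));
    }
    Matcher path = CLUSTER_PATH_ORG_ID_PATTERN.matcher(httpPath);
    if (path.find()) {
      return Optional.of(path.group(1));
    }
    return Optional.empty(); // e.g. warehouse paths carry no /o/<wsid>/ segment
  }

  public static void main(String[] args) {
    System.out.println(extractOrgId("sql/protocolv1/o/6436897454825492/1214-195625-gtrwbe64"));
    System.out.println(extractOrgId("/sql/protocolv1/o/123/abc?o=456"));
    System.out.println(extractOrgId("/sql/1.0/warehouses/abc123"));
  }
}
```

The `(?:^|/)` prefix lets the pattern match with or without a leading slash, and requiring the literal `sql/protocolv1/o/` keeps it from matching warehouse-style paths.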

src/test/java/com/databricks/jdbc/api/impl/DatabricksConnectionContextTest.java

  • Five new tests under the existing "SPOG ?o= Tests" block covering:
    • Cluster path without ?o= → header extracted from /o/<wsid>/
    • Cluster path with leading / → same
    • Cluster path with ?o= → ?o= value wins (priority)
    • Cluster path + explicit http.header.x-databricks-org-id → caller header wins
    • Warehouse path without ?o= → no header (regression guard: the new regex must not match warehouse paths)

NEXT_CHANGELOG.md

  • Added a ### Fixed entry describing the change.

Test plan

  • mvn test -Dtest='DatabricksConnectionContextTest' — 136 tests pass (5 new).
  • mvn package -DskipTests — uber jar builds; spotless:apply clean.
  • End-to-end verification against peco.azuredatabricks.net (Prod SPOG) all-purpose cluster 1214-195625-gtrwbe64 with PAT:
    • Outgoing wire log shows x-databricks-org-id: 6436897454825492 on the telemetry POST.
    • Telemetry response flipped from HTTP/1.1 303 See Other (with location: /login?next_url=%2Ftelemetry-ext) to HTTP/1.1 200 OK.
    • All four Thrift operations (CREATE_SESSION, two EXECUTE_STATEMENT, DELETE_SESSION) continue to return 200 — behaviour unchanged.

Out of scope

  • Stg SPOG (dogfood-spog.staging.azuredatabricks.net) currently rejects PAT for cluster Thrift with Authentication Error: Authorization header received from the client is empty. regardless of ?o= or header presence — the 401 is generated by the upstream cluster HS2 handler, not by PoPP (response carries apiproxy-response-code-details: via_upstream). This is a server-side workspace/cluster auth config issue and is not addressed here.
  • The corresponding C# ADBC driver (adbc-drivers/databricks, HiveServer2-backed DatabricksConnection) has the same parsing limitation in PropertyHelper.ParseOrgIdFromProperties. That fix belongs in the ADBC repo.

This pull request and its description were written with assistance from Claude Code.

For all-purpose-compute Thrift connections on SPOG (custom-URL) hosts
the httpPath is /sql/protocolv1/o/<workspace-id>/<cluster-id> and the
workspace ID is encoded in the path itself. PoPP routes the Thrift
request correctly off the /o/<wsid>/ segment, so the connection succeeds
without an explicit ?o= query parameter.

Other requests on the same connection (telemetry POSTs to /telemetry-ext,
feature-flag fetches, etc.) hit different paths that don't carry the
workspace ID. Previously parseCustomHeaders only looked at ?o= in the
httpPath, so the x-databricks-org-id header was never set for cluster
URLs without ?o=. On SPOG hosts PoPP then had no workspace context for
these requests and redirected them to /login (HTTP 303), silently
dropping telemetry.

Extend parseCustomHeaders to also extract the workspace ID from the
cluster path segment as a fallback when ?o= is absent. Priority order
is preserved: explicit http.header.x-databricks-org-id > ?o= query
param > /o/<wsid>/ path segment.

Verified end-to-end against peco.azuredatabricks.net (Prod SPOG) cluster
1214-195625-gtrwbe64 via OSS JDBC PAT: telemetry POST /telemetry-ext now
returns HTTP 200 instead of HTTP 303 redirect to /login.

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>