fix(databricks-jdbc-driver): support OIDC/workload identity for export buckets #11143
When a Databricks export bucket is configured for OIDC / workload identity
there are no static credentials, but the driver passed
`{ accessKeyId: '', secretAccessKey: '' }` to the S3 client unconditionally.
The AWS SDK treats those empty strings as explicit credentials and the request
fails with `AuthorizationHeaderMalformed: ... a non-empty Access Key (AKID)
must be provided`, instead of falling back to the default provider chain
(`AWS_WEB_IDENTITY_TOKEN_FILE`, IRSA, ...).
Athena and BigQuery already work in this setup (Athena omits credentials when
none are set; BigQuery uses GCS). This brings Databricks in line:
- DatabricksDriver omits credentials across all three bucket types (S3, GCS,
Azure) when none are configured, so the cloud SDK default chain /
`DefaultAzureCredential` (`AZURE_FEDERATED_TOKEN_FILE`) resolves them.
- base-driver `aws.fs` adds `normalizeS3ClientConfig`, which strips blank static
credentials and a blank region before constructing the S3 client — protecting
every S3-using driver, while leaving credential provider functions and
fully-populated static credentials untouched.
- base-driver `gcs.fs` adds `hasGCSCredentials`, treating empty string/object as
absent so the Google SDK falls back to Application Default Credentials.
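A rough sketch of how the two helpers could behave, not the actual base-driver code. The `*Sketch` names, the `S3ClientConfigLike` type, and the rule that half-populated static credentials are stripped are assumptions made for illustration; only the behavior for blank, fully-populated, and provider-function credentials comes from the description above.

```typescript
// Hypothetical local types; the real helpers operate on the AWS SDK's own config type.
type StaticCredentials = { accessKeyId?: string; secretAccessKey?: string; sessionToken?: string };
type CredentialProvider = () => Promise<StaticCredentials>;

interface S3ClientConfigLike {
  region?: string;
  credentials?: StaticCredentials | CredentialProvider;
  [key: string]: unknown;
}

// True only for strings that contain something other than whitespace.
function isNonBlank(value: unknown): value is string {
  return typeof value === 'string' && value.trim() !== '';
}

// Drops a blank region and blank static credentials so the SDK default chain can engage.
function normalizeS3ClientConfigSketch(config: S3ClientConfigLike): S3ClientConfigLike {
  const { region, credentials, ...rest } = config;
  const normalized: S3ClientConfigLike = { ...rest };

  if (isNonBlank(region)) {
    normalized.region = region;
  }

  if (typeof credentials === 'function') {
    // Provider functions are passed through untouched.
    normalized.credentials = credentials;
  } else if (
    credentials &&
    isNonBlank(credentials.accessKeyId) &&
    isNonBlank(credentials.secretAccessKey)
  ) {
    // Fully-populated static credentials are passed through untouched.
    normalized.credentials = credentials;
  }
  // Assumption: anything else (blank or half-populated) is omitted.

  return normalized;
}

// Treats undefined, null, '' and {} as "no credentials configured".
function hasGCSCredentialsSketch(credentials: unknown): boolean {
  if (credentials === undefined || credentials === null) return false;
  if (typeof credentials === 'string') return credentials.trim() !== '';
  if (typeof credentials === 'object') return Object.keys(credentials as object).length > 0;
  return false;
}
```

With this shape, `normalizeS3ClientConfigSketch({ region: '', credentials: { accessKeyId: '', secretAccessKey: '' } })` returns an object with neither `region` nor `credentials`, which is the case that lets the default provider chain run.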
Tests: base-driver unit tests cover the normalizers and verify the S3/GCS/Azure
SDK clients construct without throwing when no credentials are passed; the
Databricks driver test covers OIDC (no keys) and static-key S3 cases.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude finished @bsod90's task in 1m 51s

Claude Code Review
Verdict: LGTM. Clean fix, well-targeted, defense-in-depth at the right boundary. 0 high, 0 medium, 2 low.

Summary
The fix correctly identifies the root cause: passing empty-string credentials to the AWS SDK is interpreted as explicit credentials, blocking the default provider chain (web identity token file for OIDC/IRSA). The fix is applied in two layers.

Security / perf
No security concerns: the change moves credential resolution to the SDK's default chain, which is the documented safe path for OIDC/workload identity. No performance impact.

Nice fix.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #11143 +/- ##
===========================================
- Coverage 79.18% 58.80% -20.38%
===========================================
Files 474 217 -257
Lines 96880 17511 -79369
Branches 3558 3563 +5
===========================================
- Hits 76712 10298 -66414
+ Misses 19650 6695 -12955
Partials 518 518
Problem
A Databricks deployment configured to authenticate to its S3 export bucket via OIDC / workload identity (no static `accessKeyId`/`secretAccessKey`) fails when building pre-aggregations with `AuthorizationHeaderMalformed: ... a non-empty Access Key (AKID) must be provided`.

The Databricks driver's export-bucket S3 path passed credentials unconditionally. Under OIDC there are no static keys, so the SDK receives `accessKeyId: ''`/`secretAccessKey: ''` and treats them as explicit (malformed) credentials instead of falling back to its default provider chain (`AWS_WEB_IDENTITY_TOKEN_FILE`, IRSA, env vars, ...).

Athena and BigQuery already work in this setup: Athena omits credentials when none are configured, and BigQuery's export bucket is GCS (default-chain-friendly). This PR brings Databricks in line and hardens the shared storage helpers.
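A minimal sketch of the difference between the failing and the intended behavior. The setting names (`awsKey`, `awsSecret`, `awsRegion`) and the `buildS3Options` helper are hypothetical and are not the driver's actual fields. The point is only that `credentials` must be absent, not empty, for the SDK to consult its default chain.

```typescript
// Hypothetical export-bucket settings; the real driver config may use other names.
interface ExportBucketSettings {
  awsKey?: string;
  awsSecret?: string;
  awsRegion?: string;
}

// Subset of the options an S3 client accepts, declared locally to stay self-contained.
interface S3Options {
  region?: string;
  credentials?: { accessKeyId: string; secretAccessKey: string };
}

// Builds client options, adding static credentials only when both key and secret are set.
function buildS3Options(settings: ExportBucketSettings): S3Options {
  const options: S3Options = {};

  if (settings.awsRegion) {
    options.region = settings.awsRegion;
  }

  if (settings.awsKey && settings.awsSecret) {
    options.credentials = {
      accessKeyId: settings.awsKey,
      secretAccessKey: settings.awsSecret,
    };
  }
  // Otherwise `credentials` stays undefined, leaving resolution to the SDK default chain.

  return options;
}

// OIDC / workload identity: no static keys configured.
const oidcOptions = buildS3Options({ awsKey: '', awsSecret: '', awsRegion: '' });

// Static keys configured: passed through unchanged.
const staticOptions = buildS3Options({ awsKey: 'AKIAEXAMPLE', awsSecret: 'secret', awsRegion: 'us-east-1' });
```

Here `oidcOptions` is an empty object, while `staticOptions` carries both the region and the key pair.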
Changes
`cubejs-databricks-jdbc-driver`: `getCsvFiles` omits credentials when none are configured, across all three bucket types:
- S3: `credentials` only when both key and secret are set; omit a blank region too → AWS SDK default chain resolves the web-identity token file.
- Azure: `undefined` (not `''`) → falls through to `DefaultAzureCredential`, honoring `AZURE_FEDERATED_TOKEN_FILE` + `AZURE_CLIENT_ID`/`AZURE_TENANT_ID`.
- GCS: `gcsCredentials || undefined` → Google SDK falls back to Application Default Credentials (`GOOGLE_APPLICATION_CREDENTIALS`, including WIF `external_account` configs).

`cubejs-base-driver` (defense-in-depth at the library boundary):
- `aws.fs`: new `normalizeS3ClientConfig` strips blank static credentials and a blank region before `new S3(...)`. Protects every S3-using driver. Credential provider functions (e.g. `fromTemporaryCredentials`) and fully-populated static credentials are left untouched.
- `gcs.fs`: new `hasGCSCredentials` treats empty string/object as absent so the Google SDK uses ADC.
- `azure.fs`: unchanged; it already falls through to `DefaultAzureCredential` correctly.

Testing
- `cubejs-base-driver/test/unit/storage-fs.test.ts` (new): verifies the normalizers, and that the S3 / GCS / Azure SDK clients construct without throwing when no credentials are passed (so the default chain can engage).
- `cubejs-databricks-jdbc-driver/test/DatabricksDriver.test.ts` (extended): drives `unload()` with the S3 SDK mocked and asserts that credentials are omitted under OIDC and passed through when static keys are configured.

🤖 Generated with Claude Code