fix(gocloud): support cross-bucket IO for write.metadata.path#1078
fix(gocloud): support cross-bucket IO for write.metadata.path#1078cassio-paesleme wants to merge 3 commits into
Conversation
The blob FileIO implementation assumes all file operations target a single S3 bucket (the warehouse bucket). When Iceberg's write.metadata.path table property points to a different bucket, the defaultKeyExtractor strips the wrong prefix and the file lands in the warehouse bucket under a mangled key. Readers following the absolute URI in metadata.json get a 404 because the file is not where they expect it. Add multi-bucket support to blobFileIO: - Add resolveBucket() which parses the full S3 URI, routes to the primary bucket for same-bucket paths, and lazily opens secondary buckets via a BucketOpener callback for cross-bucket paths. - Update Open, NewWriter, WriteFile, Remove, and DeleteFiles to use resolveBucket instead of always using the embedded primary bucket. - Wire S3 scheme registration to use createMultiBucketBlobFS with a BucketOpener that reuses the same AWS config for secondary buckets. - Secondary bucket handles are cached and safe for concurrent use. - Backward compatible: callers that don't set BucketOpener get the same legacy behavior (key extraction falls through to primary bucket). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
laskoviymishka
left a comment
There was a problem hiding this comment.
Thanks for picking this up — write.metadata.path on a separate bucket is a real production gap. The routing shape with resolveBucket + per-bucket cache + BucketOpener callback also looks like the right direction.
I’d hold before merging though.
Main concern: the read path still doesn’t look fixed. blobOpenFile.ReadAt still goes to the primary bucket, so any cross-bucket Parquet / Avro / Puffin read will fail on the first range request. The new TestMultiBucketWriteAndRead doesn’t catch this because io.ReadAll uses Read, not ReadAt.
A few other things I’d sort out before merge:
WalkDirstill hardcodes the primary bucket, so orphan-file enumeration overwrite.metadata.pathis still wrong.- Cached secondary buckets are never closed, so long-running processes can leak HTTP transports.
- The fix is S3-only. GCS and Azure still call
createBlobFS, so they keep the original mangled-key bug. I’d either extend the fix to those backends or call out the S3-only scope clearly and file a follow-up.
One smaller thing I wouldn’t block on: TestMultiBucketFallbackWithoutOpener currently asserts the broken legacy behavior as a contract. Worth either expecting an error, or renaming the test to make that intent explicit.
Once those are sorted, happy to take another pass.
| name := filepath.Base(key) | ||
|
|
||
| r, err := bfs.NewReader(bfs.ctx, key, nil) | ||
| r, err := bucket.NewReader(bfs.ctx, key, nil) |
There was a problem hiding this comment.
I think there's a subtle issue — Open now routes through the resolved bucket here, but blobOpenFile.ReadAt (line 49) still calls f.b.NewRangeReader(...), which dispatches to the embedded primary bucket. So any cross-bucket file opened here fails on the first ReadAt (404, or worse if a same-named key exists in the primary), and that's the path Parquet column readers and the puffin/DV reader actually take. TestMultiBucketWriteAndRead passes only because io.ReadAll uses Read.
Could we thread bucket into blobOpenFile and have ReadAt use f.bucket.NewRangeReader(...)? A ReadAt-specific test would help lock it in. wdyt?
| bucketOpener BucketOpener | ||
|
|
||
| mu sync.RWMutex | ||
| buckets map[string]*blob.Bucket |
There was a problem hiding this comment.
These secondary buckets are never closed — blobFileIO only inherits Close() from the embedded primary. Each cached *blob.Bucket holds its own HTTP transport / connection pool, so a long-running process that touches many distinct metadata buckets will leak transports indefinitely.
I'd add a Close() on blobFileIO that walks bfs.buckets under the lock and closes each before closing the primary. wdyt?
| } | ||
| } | ||
|
|
||
| func (bfs *blobFileIO) WalkDir(root string, fn fs.WalkDirFunc) error { |
There was a problem hiding this comment.
WalkDir is the one IO entry point that doesn't go through resolveBucket — it still hardcodes fs.WalkDir(bfs.Bucket, ...). So if root names a secondary bucket (e.g., orphan-file enumeration on write.metadata.path), this silently walks the primary warehouse bucket and returns the wrong listing.
Could we route this through resolveBucket too? Probably worth a small test covering the cross-bucket root.
| return createS3Bucket(ctx, u, props) | ||
| } | ||
|
|
||
| return createMultiBucketBlobFS(ctx, bucket, parsed.Host, opener), nil |
There was a problem hiding this comment.
The fix lands for S3 here, but registerGCSScheme (line 64) and registerAzureSchemes (line 76) still call createBlobFS, so a GCS or ADLS user with write.metadata.path on a separate bucket still hits the mangled-key bug we're fixing. PyIceberg and Java handle cross-bucket GCS transparently, so tables written by them become unreadable from Go in that layout.
Could we either add BucketOpener impls for GCS/Azure here, or call out the S3-only scope in the PR description with a follow-up tracked? wdyt?
Summary
The blob
FileIOimplementation assumes all file operations target a single S3 bucket (the warehouse bucket opened at init time). When Iceberg'swrite.metadata.pathtable property points to a different bucket (e.g. a dedicated versioned metadata bucket),defaultKeyExtractorstrips the wrong bucket prefix and the file lands in the warehouse bucket under a mangled key. Readers following the absolute S3 URI inmetadata.jsonget a 404.Concrete failure mode: Setting
write.metadata.path = s3://metadata-bucket/db/table/causes manifest lists (snap-*.avro) and manifest files (*-m*.avro) to be written tos3://warehouse-bucket/metadata-bucket/db/table/snap-*.avroinstead ofs3://metadata-bucket/db/table/snap-*.avro. Themetadata.jsonrecords the correct URI, but the bytes are in the wrong place.Changes
resolveBucket()toblobFileIOwhich parses the full S3 URI and routes to the correct bucketBucketOpenercallback and cached withsync.RWMutexOpen,NewWriter,WriteFile,Remove, andDeleteFilesto useresolveBucketBucketOpenerthat reuses the same AWS configBucketOpenerget the same legacy behaviorTest plan
TestMultiBucketWriteAndRead- writes to two memblob buckets via different S3 URIs, verifies files land in the correct bucket and can be read backTestMultiBucketDelete- verifiesRemoveandDeleteFilesroute to the correct bucketTestMultiBucketOpenerCaching- verifies the opener is called once per bucket nameTestMultiBucketFallbackWithoutOpener- verifies backward compatibility when no opener is set🤖 Generated with Claude Code