[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests#60414
[opt](s3) Skip S3 listing for deterministic file paths using HEAD requests#60414dataroaring wants to merge 3 commits intomasterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
b5f5124 to
82c3832
Compare
There was a problem hiding this comment.
Pull request overview
This PR optimizes S3 file access by using HEAD requests instead of LIST operations for deterministic file paths (paths without wildcards like *, ?, [...]). This enables loading data from S3 when only s3:GetObject permission is granted, without requiring s3:ListBucket permission.
Changes:
- Added utility methods to detect deterministic patterns and expand brace patterns
- Modified S3 object listing logic to use HEAD requests for deterministic paths
- Added comprehensive unit tests for the new utility methods
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java |
Added isDeterministicPattern() to detect paths without wildcards, and expandBracePatterns() with helper methods to expand brace patterns to concrete paths |
fe/fe-core/src/main/java/org/apache/doris/fs/obj/S3ObjStorage.java |
Modified globListInternal() to use HEAD requests instead of LIST operations for deterministic paths when no limits or startFile are specified |
fe/fe-core/src/test/java/org/apache/doris/common/util/S3UtilTest.java |
Added comprehensive unit tests for isDeterministicPattern() and expandBracePatterns() methods |
Comments suppressed due to low confidence (1)
fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java:62
- Duplicate imports detected. Lines 57-58 import
java.util.ArrayListandjava.util.List, but lines 60-62 duplicate these imports. Remove the duplicate imports on lines 60 and 62.
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public void testExpandBracePatterns_withPath() { | ||
| // Full path with braces | ||
| List<String> result = S3Util.expandBracePatterns("data/year{2023,2024}/month{01,02}/file.csv"); | ||
| Assert.assertEquals(8, result.size()); |
There was a problem hiding this comment.
The expected result size is incorrect. The pattern "data/year{2023,2024}/month{01,02}/file.csv" should expand to 4 paths (2 years × 2 months), not 8. The correct expansion should produce:
data/year2023/month01/file.csvdata/year2023/month02/file.csvdata/year2024/month01/file.csvdata/year2024/month02/file.csv
Change the expected size from 8 to 4.
| Assert.assertEquals(8, result.size()); | |
| Assert.assertEquals(4, result.size()); |
77f8a42 to
8f528c8
Compare
|
run buildall |
8f528c8 to
34a6005
Compare
|
run buildall |
TPC-H: Total hot run time: 31856 ms |
ClickBench: Total hot run time: 28.39 s |
34a6005 to
d495108
Compare
|
run buildall |
TPC-H: Total hot run time: 31900 ms |
ClickBench: Total hot run time: 28.68 s |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 32219 ms |
ClickBench: Total hot run time: 28.74 s |
FE UT Coverage ReportIncrement line coverage |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
TPC-H: Total hot run time: 30957 ms |
ClickBench: Total hot run time: 28.13 s |
FE UT Coverage ReportIncrement line coverage |
…uests
For S3 paths without wildcards (*, ?, [...]), use HEAD requests instead
of ListObjectsV2 to avoid requiring s3:ListBucket permission. This is
useful when only s3:GetObject permission is granted.
Brace patterns like {1..10} are expanded to concrete file paths and
verified individually with HEAD requests.
…D request optimization
Expand bracket character class patterns like [abc] and [0-9] to concrete
paths for the S3 HEAD request optimization. Previously, all bracket
patterns were treated as non-deterministic wildcards requiring listing.
Now [abc] is converted to {a,b,c} and [0-9] to {0,1,...,9} before brace
expansion, allowing HEAD requests instead of ListObjects. Negated
brackets [!abc] and [^abc] still fall back to listing since they cannot
be expanded deterministically.
Also adds tests for malformed brace patterns ({1,2 without closing })
to document the existing literal-fallback behavior.
…ositives When a virtual-host URL (http://bucket.endpoint/key) is used with use_path_style=true, path-style URI parsing extracts the wrong bucket and key. The HEAD request URL coincidentally matches the correct virtual-host URL, so HEAD succeeds where ListObjects correctly fails. This happens because in path-style, the SDK constructs: HEAD http://bucket.endpoint/{parsed-bucket}/{parsed-key} which equals the original URL, while LIST constructs: GET http://bucket.endpoint/{parsed-bucket}?list-type=2&prefix=... which the service cannot interpret via virtual-host routing. Fix: skip the HEAD optimization when isUsePathStyle=true. Path-style users typically have full S3 permissions (common with MinIO/self-hosted). Fixes regression in test_s3_tvf_s3_storage shouldFail test case.
4702df9 to
1e8a877
Compare
|
run buildall |
TPC-H: Total hot run time: 30500 ms |
ClickBench: Total hot run time: 28.53 s |
FE Regression Coverage ReportIncrement line coverage |
Summary
*,?,[...]), use HEAD requests instead of ListObjectsV2 to avoid requirings3:ListBucketpermission{1..10}are expanded to concrete file paths and verified individually with HEAD requestss3:GetObjectpermission is grantedMotivation
S3
ListBucketpermission is often more restricted thanGetObjectin enterprise environments. When users specify exact file paths or deterministic patterns likefile{1..3}.csv, listing is unnecessary since the file names can be determined from the input.Changes
S3Util.javaisDeterministicPattern()to detect paths without wildcards, andexpandBracePatterns()to expand brace patterns to concrete pathsS3ObjStorage.javaglobListInternal()to use HEAD requests for deterministic pathsS3UtilTest.javaExamples
s3://bucket/data/file.csvs3://bucket/data/file{1..3}.csvs3://bucket/data/*.csvTest Plan
isDeterministicPattern()expandBracePatterns()🤖 Generated with Claude Code