@github-actions
Contributor

Cherry-picked from #59217

…rg table queries (#59217)

### What problem does this PR solve?

### Problem
When querying Iceberg tables with a large number of data files, the
`LocationPath.of` method consumes significant CPU time. The main
bottlenecks are:
1. Repeated regex parsing in `S3URI.create()` for each file path
2. Multiple `String.split()` calls for scheme extraction
3. Repeated `StorageProperties` lookups from a map for each file

### Solution
This PR introduces several optimizations to reduce CPU overhead:

#### 1. Optimize scheme parsing in `LocationPath.parseScheme`
- Replace `String.split("://")` with `indexOf` + `substring` to avoid
array allocation
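
The idea behind this change can be sketched as follows. This is a minimal illustration, not the actual `LocationPath.parseScheme` code: the class name `SchemeParseSketch` and the choice to return `null` when no scheme is present are assumptions made for the example.

```java
// Hypothetical sketch; not the Doris implementation.
final class SchemeParseSketch {
    private static final String SEPARATOR = "://";

    // Returns the text before the first "://", or null when the separator
    // is missing or the scheme would be empty. String.split("://") would
    // compile the argument as a regex and allocate a String[] on every
    // call; indexOf + substring allocates only the resulting scheme string.
    static String parseScheme(String location) {
        int idx = location.indexOf(SEPARATOR);
        if (idx <= 0) {
            return null;
        }
        return location.substring(0, idx);
    }
}
```

For example, `SchemeParseSketch.parseScheme("s3a://bucket/key")` yields `"s3a"`, while a plain local path with no `://` yields `null`.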

#### 2. Add fast path for S3-compatible schemes in `S3PropertyUtils.validateAndNormalizeUri`
- For simple S3-compatible URIs such as `oss://bucket/key` and
`s3a://bucket/key`, use direct string replacement instead of full S3URI
regex parsing
- Fall back to full S3URI parsing only for complex HTTP URLs
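
A rough sketch of the fast-path idea, under stated assumptions: the class `S3FastPathSketch`, the scheme list, and the `fullParser` callback are all hypothetical and stand in for the real logic in `S3PropertyUtils.validateAndNormalizeUri` and `S3URI`, which may differ.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not the Doris implementation.
final class S3FastPathSketch {
    // Assumed set of schemes treated as S3-compatible; the real list may differ.
    private static final Set<String> S3_COMPATIBLE =
            Set.of("s3", "s3a", "s3n", "oss", "cos", "obs");

    // fullParser is a placeholder for the expensive regex-based parsing.
    static String normalize(String uri, UnaryOperator<String> fullParser) {
        int idx = uri.indexOf("://");
        if (idx > 0) {
            String scheme = uri.substring(0, idx).toLowerCase(Locale.ROOT);
            if (S3_COMPATIBLE.contains(scheme)) {
                // Fast path: swap the scheme and keep "://bucket/key" untouched.
                return "s3" + uri.substring(idx);
            }
        }
        // Slow path: anything else, such as HTTP(S) endpoint URLs.
        return fullParser.apply(uri);
    }
}
```

With this shape, `oss://bucket/key` becomes `s3://bucket/key` without invoking the parser, and only URIs outside the assumed scheme set pay the full parsing cost.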

#### 3. Add path prefix caching in `IcebergScanNode`
- Cache `StorageProperties`, schema, and path prefix mapping on the first
file
- For subsequent files with the same prefix, directly transform paths
using string replacement
- Add a new `LocationPath.ofDirect()` method that creates a `LocationPath`
without any parsing
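
The prefix-caching idea can be illustrated with a small sketch. Everything here is an assumption for illustration: the class `PathPrefixCacheSketch`, the `slowNormalize` callback standing in for the full `LocationPath.of` work, and the choice of the parent directory as the cached prefix. It omits the `StorageProperties` and schema caching described above and is not thread-safe.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch; not the IcebergScanNode implementation.
final class PathPrefixCacheSketch {
    private final UnaryOperator<String> slowNormalize; // stand-in for full parsing
    private String cachedOriginalPrefix;
    private String cachedNormalizedPrefix;
    int slowPathCalls = 0; // exposed only to make the effect observable

    PathPrefixCacheSketch(UnaryOperator<String> slowNormalize) {
        this.slowNormalize = slowNormalize;
    }

    String normalize(String path) {
        // Cache hit: rewrite the prefix with plain string operations.
        if (cachedOriginalPrefix != null && path.startsWith(cachedOriginalPrefix)) {
            return cachedNormalizedPrefix + path.substring(cachedOriginalPrefix.length());
        }
        // Cache miss: pay for the full normalization once.
        slowPathCalls++;
        String normalized = slowNormalize.apply(path);
        int origSlash = path.lastIndexOf('/');
        int normSlash = normalized.lastIndexOf('/');
        // Remember the mapping only if the file name part is unchanged.
        if (origSlash >= 0 && normSlash >= 0
                && path.substring(origSlash).equals(normalized.substring(normSlash))) {
            cachedOriginalPrefix = path.substring(0, origSlash + 1);
            cachedNormalizedPrefix = normalized.substring(0, normSlash + 1);
        }
        return normalized;
    }
}
```

In this sketch, two files under the same directory trigger only one call to the slow path; the second is resolved by prefix replacement alone.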
@github-actions github-actions bot requested a review from yiguolei as a code owner December 28, 2025 08:28
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Dec 28, 2025
@hello-stephen
Contributor

run buildall

@yiguolei yiguolei merged commit d8f7adc into branch-4.0 Dec 29, 2025
24 of 27 checks passed
@github-actions github-actions bot deleted the auto-pick-59217-branch-4.0 branch December 29, 2025 02:22
5 participants