[Fix](storage)Unify globList Implementation Using AWS SDK and Optimize S3TVF Handling#49596

Merged
CalvinKirs merged 4 commits into apache:branch-refactor_property from CalvinKirs:branch-refactor_property-hdfs-test-2
Mar 28, 2025

@CalvinKirs
Member

Background

Previously, the globList implementation used two different protocols for object storage access, leading to inconsistencies between the Frontend (FE) and Backend (BE). To resolve this issue, we are migrating globList to use the native AWS SDK, ensuring a unified access approach across both FE and BE. This change reduces protocol discrepancies, improves maintainability, and is expected to offer performance benefits (to be validated via benchmarking).

Additionally, we have adjusted the S3 Table-Valued Function (S3TVF) handling of region and endpoint. Instead of explicitly specifying these parameters, they are now extracted directly from the S3 URL. As a result, we have rolled back the previous commit that introduced explicit region and endpoint settings. However, we still need to discuss whether similar changes should be applied consistently across other parts of the system.
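To illustrate what extracting the region and endpoint from the URL could look like, here is a minimal sketch. It assumes AWS-style hostnames (`<bucket>.s3.<region>.amazonaws.com` or `s3.<region>.amazonaws.com`); the class `S3UrlSketch`, its `parse` method, and the `Location` holder are hypothetical names and are not taken from this PR, whose actual parsing rules may differ (for example, for non-AWS S3-compatible endpoints).

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not code from this PR.
final class S3UrlSketch {
    // Assumed host layout: optional "<bucket>." then "s3.<region>.amazonaws.com".
    private static final Pattern AWS_HOST =
            Pattern.compile("^(?:(.+)\\.)?s3[.-]([a-z0-9-]+)\\.amazonaws\\.com$");

    // Simple value holder for the two derived settings.
    static final class Location {
        final String region;
        final String endpoint;

        Location(String region, String endpoint) {
            this.region = region;
            this.endpoint = endpoint;
        }
    }

    // Derives region and endpoint from the URL host instead of separate properties.
    static Location parse(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no parsable host: " + url);
        }
        Matcher m = AWS_HOST.matcher(host);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an AWS-style S3 host: " + host);
        }
        String region = m.group(2);
        // The endpoint is the regional service host without the bucket label.
        String endpoint = uri.getScheme() + "://s3." + region + ".amazonaws.com";
        return new Location(region, endpoint);
    }

    public static void main(String[] args) {
        Location loc = parse("https://my-bucket.s3.us-west-2.amazonaws.com/data/a.csv");
        System.out.println(loc.region + " " + loc.endpoint);
    }
}
```

A sketch like this fails fast on hosts it does not recognize, which is one reason the open question about applying the same approach elsewhere matters: any URL that does not encode its region would still need explicit settings.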

Changes

  • Migrated globList to AWS SDK Native Implementation

  • Replaced the existing implementation with AWS SDK’s listObjectsV2 API to ensure consistency across object storage operations.

  • Eliminated the need to maintain two different protocols for listing objects.

  • Improved alignment between FE and BE storage access.
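The listing approach described above can be sketched roughly as follows: take the longest wildcard-free prefix of the glob, list keys under that prefix, then filter the keys against the pattern. This is a sketch under stated assumptions, not the PR's implementation. The class `GlobListSketch` and its methods are hypothetical, only `*` and `?` are handled, `*` is assumed not to cross `/`, and an in-memory list stands in for the paged results that a listObjectsV2 call with the same prefix would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not code from this PR.
final class GlobListSketch {
    // Longest leading part of the glob without '*' or '?'; usable as the listing prefix.
    static String literalPrefix(String glob) {
        int i = 0;
        while (i < glob.length() && glob.charAt(i) != '*' && glob.charAt(i) != '?') {
            i++;
        }
        return glob.substring(0, i);
    }

    // Translates a simple glob to a regex. Assumption: wildcards do not match '/'.
    static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append("[^/]*");
            } else if (c == '?') {
                sb.append("[^/]");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // 'allKeys' stands in for the object keys in a bucket. In a real client the
    // startsWith check would be done server-side by passing the prefix to
    // listObjectsV2 and following continuation tokens across pages.
    static List<String> globList(List<String> allKeys, String glob) {
        String prefix = literalPrefix(glob);
        Pattern pattern = toRegex(glob);
        List<String> matched = new ArrayList<>();
        for (String key : allKeys) {
            if (key.startsWith(prefix) && pattern.matcher(key).matches()) {
                matched.add(key);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "data/2024/a.csv", "data/2024/b.txt", "data/2024/sub/c.csv", "data/2023/a.csv");
        System.out.println(globList(keys, "data/2024/*.csv"));
    }
}
```

Narrowing the listing by prefix first keeps the number of keys that must be fetched and pattern-matched small, which is where any performance benefit of a single SDK-based path would most plausibly show up.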

Fix S3 storage

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@CalvinKirs changed the title from "[Fix](S3Tvf)Unify globList Implementation Using AWS SDK and Optimize S3TVF Handling" to "[Fix]( storage)Unify globList Implementation Using AWS SDK and Optimize S3TVF Handling" on Mar 28, 2025
@CalvinKirs changed the title from "[Fix]( storage)Unify globList Implementation Using AWS SDK and Optimize S3TVF Handling" to "[Fix](storage)Unify globList Implementation Using AWS SDK and Optimize S3TVF Handling" on Mar 28, 2025
@CalvinKirs merged commit 6f4568e into apache:branch-refactor_property on Mar 28, 2025
11 of 12 checks passed