@github-actions github-actions bot commented Feb 2, 2026

Cherry-picked from #60063

…ive small files (#60063)

## What problem does this PR solve?

When the cluster has many BEs, an Iceberg rewrite_data_files operation
generates an unexpectedly large number of small files.

Problem: total_files = task_count × active_BE_count × partition_count
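
To make the multiplication concrete, here is a small worked example. The numbers (4 tasks, 30 active BEs, 1 partition) are illustrative assumptions, not measurements from this PR, and `total_output_files` is a hypothetical helper that only restates the formula above.

```python
def total_output_files(task_count: int, active_be_count: int, partition_count: int) -> int:
    """Restates the formula from the PR description; not code from Doris itself."""
    return task_count * active_be_count * partition_count


# Assumed, illustrative inputs: 4 tasks, 30 active BEs, 1 partition.
files = total_output_files(task_count=4, active_be_count=30, partition_count=1)
print(files)  # 120
```

Under these assumed inputs a single-partition rewrite already yields over a hundred files, which is the kind of growth the formula describes as the BE count increases.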

## What is changed and how it works?

This change refines the parallelism strategy of Iceberg
rewrite_data_files to maximize concurrency without producing
extra output files.

- If expectedFileCount <= alive BE count, use GATHER. Read parallelism
is no longer forced to 1; it is capped at
min(defaultParallelism, expectedFileCount) so output files never exceed
the expected count.
- If expectedFileCount > alive BE count, set per‑BE parallelism to
min(defaultParallelism, floor(expectedFileCount / aliveBeCount)), so
that total writers across all BEs do not exceed the expected file count.
- Updated unit and regression tests to cover GATHER vs non‑GATHER paths
and boundary cases.
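
The two branches above can be sketched as follows. This is a minimal Python illustration of the rules as described, not the actual Doris implementation: `RewritePlan` and `plan_rewrite_parallelism` are hypothetical names, and the `max_writers` bound in the GATHER case assumes that the writer count equals the capped parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewritePlan:
    use_gather: bool   # True when the GATHER path is chosen
    parallelism: int   # capped parallelism (per BE on the non-GATHER path)
    max_writers: int   # upper bound on concurrent writers (assumption, see lead-in)


def plan_rewrite_parallelism(expected_file_count: int,
                             alive_be_count: int,
                             default_parallelism: int) -> RewritePlan:
    if min(expected_file_count, alive_be_count, default_parallelism) < 1:
        raise ValueError("all inputs must be >= 1")

    if expected_file_count <= alive_be_count:
        # GATHER path: parallelism is capped, not forced to 1.
        parallelism = min(default_parallelism, expected_file_count)
        return RewritePlan(True, parallelism, parallelism)

    # Non-GATHER path: floor division keeps per_be * alive_be_count
    # at or below expected_file_count.
    per_be = min(default_parallelism, expected_file_count // alive_be_count)
    return RewritePlan(False, per_be, per_be * alive_be_count)


print(plan_rewrite_parallelism(5, 10, 8))
print(plan_rewrite_parallelism(25, 10, 8))
```

With 10 alive BEs and a default parallelism of 8 (both assumed values), an expected count of 5 takes the GATHER path with parallelism 5, while an expected count of 25 gives 2 per BE, for at most 20 writers.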

## Benefits

- Small data: output drops from 100+ files to ~1 file (a 90%+ reduction)
- Adaptive strategy, no manual tuning needed

## Check List

- [x] Code changes
- [x] Test strategy described

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions github-actions bot requested a review from yiguolei as a code owner February 2, 2026 09:54

Thearas commented Feb 2, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring closed this Feb 2, 2026
@dataroaring dataroaring reopened this Feb 2, 2026

Thearas commented Feb 2, 2026

run buildall

@morningman morningman left a comment

LGTM


github-actions bot commented Feb 3, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 3, 2026

github-actions bot commented Feb 3, 2026

PR approved by anyone and no changes requested.

@yiguolei yiguolei merged commit 9c78f17 into branch-4.0 Feb 3, 2026
28 of 31 checks passed
@github-actions github-actions bot deleted the auto-pick-60063-branch-4.0 branch February 3, 2026 02:50

Labels: approved, reviewed

5 participants