
[fix](insert) Report physical file count in LoadStatistic.FileNumber #62804

Merged
JNSimba merged 4 commits into apache:master from JNSimba:fix/streaming-load-filenumber-physical-count on Apr 29, 2026

Conversation

JNSimba (Member) commented Apr 24, 2026

What problem does this PR solve?

InsertIntoTableCommand.applyInsertPlanStatistic populated LoadStatistic.fileNum from FileScanNode.getSelectedSplitNum(), i.e. the BE split count, not the number of physical input files. When a file crossed the split-size threshold (default max_initial_file_split_size × 1.1 ≈ 35.2MB) and was cut into multiple splits, both jobs("type"="insert").LoadStatistic.FileNumber and tasks("type"="insert").LoadStatistic.FileNumber reported a value larger than the actual file list. In the user-reported scenario, 8 input files appeared as FileNumber = 16 because each 42MB file was split in two. Data correctness is unaffected; only the displayed statistic was misleading.

This affects both streaming insert jobs and regular INSERT INTO ... SELECT FROM S3/HDFS/Hive.
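The miscount described above is simple split arithmetic. The sketch below illustrates it; the constant and the ceil-division rule are assumptions modeled on the `max_initial_file_split_size × 1.1` heuristic in the description, not the actual Doris splitting code:

```java
// Minimal sketch of the reported miscount. The constant and splitting
// rule are assumptions mirroring the PR description, not Doris code.
public class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    // Assumed max_initial_file_split_size; threshold is 1.1x, about 35.2MB.
    static final long SPLIT_SIZE = 32 * MB;

    // Splits produced for one file: files under the threshold stay whole,
    // larger files are cut into ceil(size / SPLIT_SIZE) pieces.
    static long splitsFor(long fileSize) {
        if (fileSize <= (long) (SPLIT_SIZE * 1.1)) {
            return 1;
        }
        return (fileSize + SPLIT_SIZE - 1) / SPLIT_SIZE; // ceil division
    }

    public static void main(String[] args) {
        long splitNum = 0;
        for (int i = 0; i < 8; i++) {      // 8 input files of 42MB each
            splitNum += splitsFor(42 * MB);
        }
        // FileNumber was populated from the split count: 16, not 8.
        System.out.println(splitNum);
    }
}
```

Each 42MB file exceeds the ~35.2MB threshold and is cut into ceil(42/32) = 2 splits, so 8 files surface as 16.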

Fix

  • Add FileScanNode.selectedFileNum (default -1).
  • In FileQueryScanNode.createScanRangeLocations, populate it from distinct FileSplit.getPathString() collected in the existing split-assembly loop (zero extra traversal, only the non-batch-mode path; batch-mode scans don't materialize splits on FE).
  • InsertIntoTableCommand.applyInsertPlanStatistic prefers getSelectedFileNum(); falls back to getSelectedSplitNum() for batch-mode scans (Hudi/Iceberg/Paimon).

EXPLAIN's inputSplitNum continues to report the split count; the two fields are now semantically distinct.
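The fix amounts to one extra set insertion inside the existing split-assembly loop plus a guarded fallback. A hedged sketch follows; the class, field, and method names are illustrative stand-ins for the Doris internals, not the real signatures:

```java
// Illustrative sketch of the fix; names mimic the description above but
// are not the actual Doris FileScanNode/InsertIntoTableCommand APIs.
public class FileNumberSketch {
    record Split(String pathString, long offset, long length) {}

    private long selectedFileNum = -1;   // -1: not populated (batch mode)
    private long selectedSplitNum = 0;

    // Mirrors the existing split-assembly loop: counting distinct paths
    // piggybacks on the same pass, so there is no extra traversal.
    void createScanRangeLocations(java.util.List<Split> splits) {
        java.util.Set<String> paths = new java.util.HashSet<>();
        for (Split s : splits) {
            selectedSplitNum++;
            paths.add(s.pathString());   // distinct physical files
        }
        selectedFileNum = paths.size();
    }

    // Mirrors applyInsertPlanStatistic: prefer the physical file count,
    // fall back to the split count when it was never populated.
    long fileNumForStatistic() {
        return selectedFileNum >= 0 ? selectedFileNum : selectedSplitNum;
    }
}
```

With two splits of `a.csv` and one of `b.csv`, the statistic reports 2 files rather than 3 splits; a scan that never calls the split-assembly path keeps `-1` and falls back to the split count.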

Behavior changes

  • LoadStatistic.FileNumber now reports physical file count for INSERT FROM S3/HDFS/TVF (streaming and regular). Previously it reported BE split count.
  • Batch-mode scans (Hudi/Iceberg/Paimon INSERT FROM ...) keep the previous behavior (fallback to estimated split count). Wiring selectedFileNum for those sources is a separate follow-up.

Compatibility

The proto field StreamingTaskCommitAttachmentPB.num_files is unchanged. Its semantics shift from split count to physical file count for new attachments. For streaming jobs already running before the upgrade, the persisted jobStatistic.fileNumber (and cloud streaming_job.num_files) remain cumulative and continue accumulating; existing jobs' LoadStatistic.FileNumber will be a mix of pre- and post-upgrade values until the job is recreated.

Release note

Fix LoadStatistic.FileNumber for INSERT jobs (including streaming insert) reading from S3/HDFS/TVF to report physical file count instead of BE split count.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. LoadStatistic.FileNumber for INSERT FROM S3/HDFS/TVF (streaming and regular) now reflects physical file count instead of BE split count. Batch-mode external table scans (Hudi/Iceberg/Paimon) keep the previous fallback.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…eNumber

Previously StreamingInsertJob.beforeCommitted wrote loadStatistic.getFileNumber()
into the txn commit attachment, which stores the BE selectedSplitNum (split count)
rather than the physical file count. When a file crossed the split-size threshold
and was cut into multiple splits, jobs("type"="insert").LoadStatistic.FileNumber
appeared to be "doubled" relative to the actual file list.

Fix: introduce Offset.getPhysicalFileNum() (default -1) and override it in S3Offset
to return the file count recorded when listing S3. beforeCommitted now prefers the
offset's physical file count; other sources fall back to the existing behavior.

EXPLAIN's inputSplitNum semantics are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
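The override pattern in this commit message can be sketched as follows. This illustrates the intermediate Offset-level approach (later superseded by the FileScanNode-level fix in a follow-up commit); class names follow the message, but the signatures are assumed:

```java
// Sketch of the intermediate Offset-level approach described above.
// Class names follow the commit message; signatures are assumed.
class Offset {
    // Physical file count behind this offset, or -1 when unknown.
    long getPhysicalFileNum() {
        return -1;
    }
}

class S3Offset extends Offset {
    private final long listedFileNum;   // recorded while listing S3

    S3Offset(long listedFileNum) {
        this.listedFileNum = listedFileNum;
    }

    @Override
    long getPhysicalFileNum() {
        return listedFileNum;
    }
}

class BeforeCommittedSketch {
    // Mirrors beforeCommitted: prefer the offset's physical file count;
    // other sources fall back to the existing split-based statistic.
    static long numFiles(Offset runningOffset, long statisticFileNumber) {
        long physical = runningOffset.getPhysicalFileNum();
        return physical >= 0 ? physical : statisticFileNumber;
    }
}
```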
Thearas (Contributor) commented Apr 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

JNSimba changed the title from "[fix](streaming-load) Report physical file count in LoadStatistic.fileNumber" to "[fix](streamingjob) Report physical file count in LoadStatistic.fileNumber" on Apr 24, 2026
…itted

Drop the redundant null guard on runningOffset (getCommitOffsetJson above
already dereferences it) and the explanatory comment that duplicates the
Offset.getPhysicalFileNum javadoc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JNSimba (Member, Author) commented Apr 24, 2026

/review

github-actions Bot (Contributor) commented

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24879802575

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

JNSimba (Member, Author) commented Apr 27, 2026

/review

github-actions Bot (Contributor) left a comment

Two blocking issues found:

  1. fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java:1058
    This only changes the count written into new txn attachments. Existing jobStatistic.fileNumber values and cloud streaming_job.num_files are cumulative persisted totals that still contain the old split-based counts, and both paths keep adding the new value on top. For already-running jobs after upgrade, jobs("type"="insert").LoadStatistic.FileNumber will remain permanently mixed instead of converging to physical file counts.

  2. regression-test/suites/job_p0/streaming_job/test_streaming_insert_job_file_number.groovy:95
    The new regression adds qt_select but does not add the generated baseline file regression-test/data/job_p0/streaming_job/test_streaming_insert_job_file_number.out, so the case is incomplete as submitted.

Checkpoint conclusions:

  • Goal of the task: partially achieved. New S3 task commits use physical file counts, but upgrade/persisted-state handling is incomplete and the added regression case is not fully checked in.
  • Change size/focus: small and focused.
  • Concurrency: no new locking/concurrency issue identified; the change stays inside the existing beforeCommitted() write-lock path.
  • Lifecycle/persistence: problematic for upgraded jobs because num_files is stored cumulatively in both FE job images and cloud meta-service state without migration/recalculation.
  • Configuration: no new config.
  • Compatibility: not handled for existing streaming jobs across upgrade; the semantic change to persisted num_files is user-visible.
  • Parallel paths: no additional source-provider path was accidentally broken; JDBC still falls back to the old behavior.
  • Test coverage: insufficient as submitted. The new regression covers the happy path only, and its .out baseline is missing.
  • Test result files: missing .out file for the new qt_select.
  • Observability: no additional logging/metrics appear necessary.
  • Transaction/persistence: affected; the txn attachment field is persisted/replayed, so compatibility matters here.
  • FE/BE variable passing: no new FE-BE protocol field was added.
  • Performance: no material concern found.
  • Other issues: none beyond the two blockers above.

User focus:

  • No additional user-provided review focus.

…case

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JNSimba (Member, Author) commented Apr 27, 2026

/review

JNSimba (Member, Author) commented Apr 27, 2026

run buildall

github-actions Bot (Contributor) left a comment

Findings

  1. tasks("type"="insert") still exposes split-based LoadStatistic.FileNumber via StreamingInsertTask.getTvfInfo(), so the same field now has different semantics between jobs() and tasks().
  2. An existing inline thread on StreamingInsertJob.beforeCommitted() is still unresolved: upgraded streaming jobs keep mixed historical totals unless the persisted statistic is migrated or reset. I am not duplicating that thread.
  3. The previously raised missing baseline issue looks fixed in the current head.

Critical Checkpoints

  • Goal: Partially met. The PR fixes the non-cloud jobs() happy path for newly committed S3 tasks and adds a matching regression baseline, but it does not fully solve upgraded jobs and it leaves the parallel tasks() surface unchanged.
  • Scope/minimality: Small and focused.
  • Concurrency: No new locking or thread-safety issue found; beforeCommitted() still uses the existing job write lock.
  • Lifecycle/static init: No special lifecycle or static-init issue found.
  • Config: No new config.
  • Compatibility/persistence: Not fully handled because pre-upgrade persisted aggregates remain mixed without migration/reset.
  • Parallel code paths: Not fully handled; tasks() still reports split count.
  • Special checks: The physicalFileNum >= 0 fallback is reasonable for non-file sources.
  • Test coverage: Improved for the S3 jobs() regression path, but still missing coverage for upgrade behavior and the tasks() surface.
  • Test result changes: The new .out baseline is present in the current head.
  • Observability: User-visible statistics remain inconsistent across surfaces.
  • Transaction/persistence: New-commit attachment plumbing is wired, but the migration story is incomplete.
  • Data writes/modifications: No data-atomicity concern beyond metadata/statistics updates.
  • FE-BE variable passing: No new propagation issue found.
  • Performance: No material concern found.
  • User focus: No additional user-provided focus.

Track distinct physical files in FileScanNode (-1 for batch-mode scans
where splits are not materialized on FE) and prefer it over split count
in InsertIntoTableCommand.applyInsertPlanStatistic. tasks() and jobs()
views now report consistent semantics for streaming and regular INSERTs.

Drop the now-redundant Offset.getPhysicalFileNum() source-layer fallback
since loadStatistic.getFileNumber() already carries the correct value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JNSimba (Member, Author) commented Apr 27, 2026

/review

JNSimba (Member, Author) commented Apr 27, 2026

run buildall

github-actions Bot (Contributor) left a comment

In addition to the existing review threads, I found one more blocking issue.

  1. Batch-mode external scans still do not populate physical file counts. FileQueryScanNode only sets selectedFileNum in the eager-split path, and InsertIntoTableCommand.applyInsertPlanStatistic() falls back to selectedSplitNum otherwise. For batch-mode Hive/Hudi/Iceberg/MaxCompute inserts, that value is a split or partition estimate rather than a physical file count, so LoadStatistic.FileNumber is still wrong for large INSERT INTO ... SELECT jobs.

Critical checkpoints

  • Goal / correctness: Partially satisfied. The eager-split S3 or TVF path is fixed, but batch-mode external-table inserts still violate the intended FileNumber semantics.
  • Change size / focus: Small and focused, but not complete because one parallel code path is still using different semantics.
  • Concurrency: No new concurrency issue found in the touched code.
  • Lifecycle / static initialization: No special lifecycle or SIOF issue found.
  • Config: No new config was added, but existing external-table batch-mode settings are an applicable path and remain incorrect.
  • Compatibility: Existing review threads already cover the mixed old and new semantics across upgrades, so I did not duplicate that point.
  • Parallel code paths: Not fully handled; the batch-mode FileQueryScanNode path is still different from the eager-split path.
  • Special conditions: The new fallback explicitly calls out batch mode, but the fallback result is still semantically wrong there.
  • Test coverage: The new regression case covers the streaming S3 eager-split path only; there is still no coverage for batch-mode external-table inserts.
  • Test result files: The new .out file is present and consistent with the added test.
  • Observability: No new observability gap found.
  • Transaction / persistence: No additional persistence defect found beyond the already-raised compatibility concern.
  • Data writes / atomicity: No new transactionality issue found in the touched code.
  • FE/BE variable passing: Not applicable for this change.
  • Performance: No material performance regression found from the added distinct-path accounting in the eager-split path.

User focus: no additional user-provided focus; no extra issue beyond the full review.

hello-stephen (Contributor) commented

FE UT Coverage Report

Increment line coverage 10.00% (1/10) 🎉
Increment coverage report
Complete coverage report

hello-stephen (Contributor) commented

FE Regression Coverage Report

Increment line coverage 40.91% (9/22) 🎉
Increment coverage report
Complete coverage report

JNSimba changed the title from "[fix](streamingjob) Report physical file count in LoadStatistic.fileNumber" to "[fix](insert) Report physical file count in LoadStatistic.FileNumber" on Apr 28, 2026
liaoxin01 (Contributor) left a comment

LGTM

github-actions Bot added the approved label Apr 29, 2026
github-actions Bot (Contributor) commented

PR approved by at least one committer and no changes requested.

@JNSimba JNSimba merged commit d5ed612 into apache:master Apr 29, 2026
35 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
…62804)

yiguolei pushed a commit that referenced this pull request May 1, 2026
….FileNumber #62804 (#62952)

Cherry-picked from #62804

Co-authored-by: wudi <wudi@selectdb.com>
@yiguolei yiguolei removed the dev/4.1.x label May 1, 2026

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.1-merged


5 participants