
[fix](insert) Report physical file count in LoadStatistic.FileNumber #62804

Merged
JNSimba merged 4 commits into apache:master from JNSimba:fix/streaming-load-filenumber-physical-count on Apr 29, 2026

Conversation

JNSimba (Member) commented Apr 24, 2026

What problem does this PR solve?

InsertIntoTableCommand.applyInsertPlanStatistic populated LoadStatistic.fileNum from FileScanNode.getSelectedSplitNum(), i.e. the BE split count, not the number of physical input files. When a file crossed the split-size threshold (default max_initial_file_split_size × 1.1 ≈ 35.2MB) and was cut into multiple splits, both jobs("type"="insert").LoadStatistic.FileNumber and tasks("type"="insert").LoadStatistic.FileNumber reported a value larger than the actual file list. In the user-reported scenario, 8 input files appeared as FileNumber = 16 because each 42MB file was split in two. Data correctness is unaffected; only the displayed statistic was misleading.

This affects both streaming insert jobs and regular INSERT INTO ... SELECT FROM S3/HDFS/Hive.
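The miscount described above is simple split arithmetic. The sketch below illustrates it; the constant and the ceil-division rule are assumptions modeled on the `max_initial_file_split_size × 1.1` heuristic in the description, not the actual Doris splitting code:

```java
// Minimal sketch of the reported miscount. The constant and splitting
// rule are assumptions mirroring the PR description, not Doris code.
public class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    // Assumed max_initial_file_split_size; threshold is 1.1x, about 35.2MB.
    static final long SPLIT_SIZE = 32 * MB;

    // Splits produced for one file: files under the threshold stay whole,
    // larger files are cut into ceil(size / SPLIT_SIZE) pieces.
    static long splitsFor(long fileSize) {
        if (fileSize <= (long) (SPLIT_SIZE * 1.1)) {
            return 1;
        }
        return (fileSize + SPLIT_SIZE - 1) / SPLIT_SIZE; // ceil division
    }

    public static void main(String[] args) {
        long splitNum = 0;
        for (int i = 0; i < 8; i++) {      // 8 input files of 42MB each
            splitNum += splitsFor(42 * MB);
        }
        // FileNumber was populated from the split count: 16, not 8.
        System.out.println(splitNum);
    }
}
```

Each 42MB file exceeds the ~35.2MB threshold and is cut into ceil(42/32) = 2 splits, so 8 files surface as 16.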

Fix

  • Add FileScanNode.selectedFileNum (default -1).
  • In FileQueryScanNode.createScanRangeLocations, populate it from distinct FileSplit.getPathString() collected in the existing split-assembly loop (zero extra traversal, only the non-batch-mode path; batch-mode scans don't materialize splits on FE).
  • InsertIntoTableCommand.applyInsertPlanStatistic prefers getSelectedFileNum(); falls back to getSelectedSplitNum() for batch-mode scans (Hudi/Iceberg/Paimon).

EXPLAIN's inputSplitNum continues to report the split count; the two fields are now semantically distinct.
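The fix amounts to one extra set insertion inside the existing split-assembly loop plus a guarded fallback. A hedged sketch follows; the class, field, and method names are illustrative stand-ins for the Doris internals, not the real signatures:

```java
// Illustrative sketch of the fix; names mimic the description above but
// are not the actual Doris FileScanNode/InsertIntoTableCommand APIs.
public class FileNumberSketch {
    record Split(String pathString, long offset, long length) {}

    private long selectedFileNum = -1;   // -1: not populated (batch mode)
    private long selectedSplitNum = 0;

    // Mirrors the existing split-assembly loop: counting distinct paths
    // piggybacks on the same pass, so there is no extra traversal.
    void createScanRangeLocations(java.util.List<Split> splits) {
        java.util.Set<String> paths = new java.util.HashSet<>();
        for (Split s : splits) {
            selectedSplitNum++;
            paths.add(s.pathString());   // distinct physical files
        }
        selectedFileNum = paths.size();
    }

    // Mirrors applyInsertPlanStatistic: prefer the physical file count,
    // fall back to the split count when it was never populated.
    long fileNumForStatistic() {
        return selectedFileNum >= 0 ? selectedFileNum : selectedSplitNum;
    }
}
```

With two splits of `a.csv` and one of `b.csv`, the statistic reports 2 files rather than 3 splits; a scan that never calls the split-assembly path keeps `-1` and falls back to the split count.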

Behavior changes

  • LoadStatistic.FileNumber now reports physical file count for INSERT FROM S3/HDFS/TVF (streaming and regular). Previously it reported BE split count.
  • Batch-mode scans (Hudi/Iceberg/Paimon INSERT FROM ...) keep the previous behavior (fallback to estimated split count). Wiring selectedFileNum for those sources is a separate follow-up.

Compatibility

The proto field StreamingTaskCommitAttachmentPB.num_files is unchanged. Its semantics shift from split count to physical file count for new attachments. For streaming jobs already running before the upgrade, the persisted jobStatistic.fileNumber (and cloud streaming_job.num_files) remain cumulative and continue accumulating; existing jobs' LoadStatistic.FileNumber will be a mix of pre- and post-upgrade values until the job is recreated.

Release note

Fix LoadStatistic.FileNumber for INSERT jobs (including streaming insert) reading from S3/HDFS/TVF to report physical file count instead of BE split count.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. LoadStatistic.FileNumber for INSERT FROM S3/HDFS/TVF (streaming and regular) now reflects physical file count instead of BE split count. Batch-mode external table scans (Hudi/Iceberg/Paimon) keep the previous fallback.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…eNumber

Previously StreamingInsertJob.beforeCommitted wrote loadStatistic.getFileNumber()
into the txn commit attachment, which stores the BE selectedSplitNum (split count)
rather than the physical file count. When a file crossed the split-size threshold
and was cut into multiple splits, jobs("type"="insert").LoadStatistic.FileNumber
appeared to be "doubled" relative to the actual file list.

Fix: introduce Offset.getPhysicalFileNum() (default -1) and override it in S3Offset
to return the file count recorded when listing S3. beforeCommitted now prefers the
offset's physical file count; other sources fall back to the existing behavior.

EXPLAIN's inputSplitNum semantics are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
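The override pattern in this commit message can be sketched as follows. This illustrates the intermediate Offset-level approach (later superseded by the FileScanNode-level fix in a follow-up commit); class names follow the message, but the signatures are assumed:

```java
// Sketch of the intermediate Offset-level approach described above.
// Class names follow the commit message; signatures are assumed.
class Offset {
    // Physical file count behind this offset, or -1 when unknown.
    long getPhysicalFileNum() {
        return -1;
    }
}

class S3Offset extends Offset {
    private final long listedFileNum;   // recorded while listing S3

    S3Offset(long listedFileNum) {
        this.listedFileNum = listedFileNum;
    }

    @Override
    long getPhysicalFileNum() {
        return listedFileNum;
    }
}

class BeforeCommittedSketch {
    // Mirrors beforeCommitted: prefer the offset's physical file count;
    // other sources fall back to the existing split-based statistic.
    static long numFiles(Offset runningOffset, long statisticFileNumber) {
        long physical = runningOffset.getPhysicalFileNum();
        return physical >= 0 ? physical : statisticFileNumber;
    }
}
```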
Thearas (Contributor) commented Apr 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

JNSimba changed the title from "[fix](streaming-load) Report physical file count in LoadStatistic.fileNumber" to "[fix](streamingjob) Report physical file count in LoadStatistic.fileNumber" on Apr 24, 2026
…itted

Drop the redundant null guard on runningOffset (getCommitOffsetJson above
already dereferences it) and the explanatory comment that duplicates the
Offset.getPhysicalFileNum javadoc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JNSimba (Member, Author) commented Apr 24, 2026

/review

github-actions Bot (Contributor) commented

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24879802575

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

JNSimba (Member, Author) commented Apr 27, 2026

/review

github-actions Bot (Contributor) left a comment

Two blocking issues found:

  1. fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java:1058
    This only changes the count written into new txn attachments. Existing jobStatistic.fileNumber values and cloud streaming_job.num_files are cumulative persisted totals that still contain the old split-based counts, and both paths keep adding the new value on top. For already-running jobs after upgrade, jobs("type"="insert").LoadStatistic.FileNumber will remain permanently mixed instead of converging to physical file counts.

  2. regression-test/suites/job_p0/streaming_job/test_streaming_insert_job_file_number.groovy:95
    The new regression adds qt_select but does not add the generated baseline file regression-test/data/job_p0/streaming_job/test_streaming_insert_job_file_number.out, so the case is incomplete as submitted.

Checkpoint conclusions:

  • Goal of the task: partially achieved. New S3 task commits use physical file counts, but upgrade/persisted-state handling is incomplete and the added regression case is not fully checked in.
  • Change size/focus: small and focused.
  • Concurrency: no new locking/concurrency issue identified; the change stays inside the existing beforeCommitted() write-lock path.
  • Lifecycle/persistence: problematic for upgraded jobs because num_files is stored cumulatively in both FE job images and cloud meta-service state without migration/recalculation.
  • Configuration: no new config.
  • Compatibility: not handled for existing streaming jobs across upgrade; the semantic change to persisted num_files is user-visible.
  • Parallel paths: no additional source-provider path was accidentally broken; JDBC still falls back to the old behavior.
  • Test coverage: insufficient as submitted. The new regression covers the happy path only, and its .out baseline is missing.
  • Test result files: missing .out file for the new qt_select.
  • Observability: no additional logging/metrics appear necessary.
  • Transaction/persistence: affected; the txn attachment field is persisted/replayed, so compatibility matters here.
  • FE/BE variable passing: no new FE-BE protocol field was added.
  • Performance: no material concern found.
  • Other issues: none beyond the two blockers above.

User focus:

  • No additional user-provided review focus.

…case

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JNSimba (Member, Author) commented Apr 27, 2026

/review

JNSimba (Member, Author) commented Apr 27, 2026

run buildall

github-actions Bot (Contributor) left a comment

Findings

  1. tasks("type"="insert") still exposes split-based LoadStatistic.FileNumber via StreamingInsertTask.getTvfInfo(), so the same field now has different semantics between jobs() and tasks().
  2. An existing inline thread on StreamingInsertJob.beforeCommitted() is still unresolved: upgraded streaming jobs keep mixed historical totals unless the persisted statistic is migrated or reset. I am not duplicating that thread.
  3. The previously raised missing baseline issue looks fixed in the current head.

Critical Checkpoints

  • Goal: Partially met. The PR fixes the non-cloud jobs() happy path for newly committed S3 tasks and adds a matching regression baseline, but it does not fully solve upgraded jobs and it leaves the parallel tasks() surface unchanged.
  • Scope/minimality: Small and focused.
  • Concurrency: No new locking or thread-safety issue found; beforeCommitted() still uses the existing job write lock.
  • Lifecycle/static init: No special lifecycle or static-init issue found.
  • Config: No new config.
  • Compatibility/persistence: Not fully handled because pre-upgrade persisted aggregates remain mixed without migration/reset.
  • Parallel code paths: Not fully handled; tasks() still reports split count.
  • Special checks: The physicalFileNum >= 0 fallback is reasonable for non-file sources.
  • Test coverage: Improved for the S3 jobs() regression path, but still missing coverage for upgrade behavior and the tasks() surface.
  • Test result changes: The new .out baseline is present in the current head.
  • Observability: User-visible statistics remain inconsistent across surfaces.
  • Transaction/persistence: New-commit attachment plumbing is wired, but the migration story is incomplete.
  • Data writes/modifications: No data-atomicity concern beyond metadata/statistics updates.
  • FE-BE variable passing: No new propagation issue found.
  • Performance: No material concern found.
  • User focus: No additional user-provided focus.

Track distinct physical files in FileScanNode (-1 for batch-mode scans
where splits are not materialized on FE) and prefer it over split count
in InsertIntoTableCommand.applyInsertPlanStatistic. tasks() and jobs()
views now report consistent semantics for streaming and regular INSERTs.

Drop the now-redundant Offset.getPhysicalFileNum() source-layer fallback
since loadStatistic.getFileNumber() already carries the correct value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JNSimba (Member, Author) commented Apr 27, 2026

/review

JNSimba (Member, Author) commented Apr 27, 2026

run buildall

github-actions Bot (Contributor) left a comment

In addition to the existing review threads, I found one more blocking issue.

  1. Batch-mode external scans still do not populate physical file counts. FileQueryScanNode only sets selectedFileNum in the eager-split path, and InsertIntoTableCommand.applyInsertPlanStatistic() falls back to selectedSplitNum otherwise. For batch-mode Hive/Hudi/Iceberg/MaxCompute inserts, that value is a split or partition estimate rather than a physical file count, so LoadStatistic.FileNumber is still wrong for large INSERT INTO ... SELECT jobs.

Critical checkpoints

  • Goal / correctness: Partially satisfied. The eager-split S3 or TVF path is fixed, but batch-mode external-table inserts still violate the intended FileNumber semantics.
  • Change size / focus: Small and focused, but not complete because one parallel code path is still using different semantics.
  • Concurrency: No new concurrency issue found in the touched code.
  • Lifecycle / static initialization: No special lifecycle or SIOF issue found.
  • Config: No new config was added, but existing external-table batch-mode settings are an applicable path and remain incorrect.
  • Compatibility: Existing review threads already cover the mixed old and new semantics across upgrades, so I did not duplicate that point.
  • Parallel code paths: Not fully handled; the batch-mode FileQueryScanNode path is still different from the eager-split path.
  • Special conditions: The new fallback explicitly calls out batch mode, but the fallback result is still semantically wrong there.
  • Test coverage: The new regression case covers the streaming S3 eager-split path only; there is still no coverage for batch-mode external-table inserts.
  • Test result files: The new .out file is present and consistent with the added test.
  • Observability: No new observability gap found.
  • Transaction / persistence: No additional persistence defect found beyond the already-raised compatibility concern.
  • Data writes / atomicity: No new transactionality issue found in the touched code.
  • FE/BE variable passing: Not applicable for this change.
  • Performance: No material performance regression found from the added distinct-path accounting in the eager-split path.

User focus: no additional user-provided focus; no extra issue beyond the full review.

hello-stephen (Contributor) commented

FE UT Coverage Report

Increment line coverage 10.00% (1/10) 🎉
Increment coverage report
Complete coverage report

hello-stephen (Contributor) commented

FE Regression Coverage Report

Increment line coverage 40.91% (9/22) 🎉
Increment coverage report
Complete coverage report

JNSimba changed the title from "[fix](streamingjob) Report physical file count in LoadStatistic.fileNumber" to "[fix](insert) Report physical file count in LoadStatistic.FileNumber" on Apr 28, 2026
liaoxin01 (Contributor) left a comment

LGTM

github-actions Bot added the approved label Apr 29, 2026
github-actions Bot (Contributor) commented

PR approved by at least one committer and no changes requested.

@JNSimba JNSimba merged commit d5ed612 into apache:master Apr 29, 2026
35 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 29, 2026
…62804)

yiguolei pushed a commit that referenced this pull request May 1, 2026
….FileNumber #62804 (#62952)

Cherry-picked from #62804

Co-authored-by: wudi <wudi@selectdb.com>
@yiguolei yiguolei removed the dev/4.1.x label May 1, 2026

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.1-merged


5 participants