[fix](insert) Report physical file count in LoadStatistic.FileNumber #62804
Conversation
…eNumber
Previously StreamingInsertJob.beforeCommitted wrote loadStatistic.getFileNumber()
into the txn commit attachment, which stores the BE selectedSplitNum (split count)
rather than the physical file count. When a file crossed the split-size threshold
and was cut into multiple splits, jobs("type"="insert").LoadStatistic.FileNumber
appeared to be "doubled" relative to the actual file list.
Fix: introduce Offset.getPhysicalFileNum() (default -1) and override it in S3Offset
to return the file count recorded when listing S3. beforeCommitted now prefers the
offset's physical file count; other sources fall back to the existing behavior.
EXPLAIN's inputSplitNum semantics are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
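The preference logic described in the commit can be sketched in isolation. This is a simplified stand-in, not the actual Doris `Offset`/`StreamingInsertJob` code: the class names and the helper `fileNumberForAttachment` are illustrative, and only the `-1` default plus the `physical >= 0` preference mirror the description above.

```java
import java.util.List;

public class OffsetSketch {
    abstract static class Offset {
        // -1 means "physical file count not known for this source".
        long getPhysicalFileNum() {
            return -1;
        }
    }

    static class S3Offset extends Offset {
        private final List<String> listedFiles;

        S3Offset(List<String> listedFiles) {
            this.listedFiles = listedFiles;
        }

        @Override
        long getPhysicalFileNum() {
            return listedFiles.size(); // file count recorded when listing S3
        }
    }

    // Mirrors the beforeCommitted() preference: use the physical count when the
    // offset knows it, otherwise fall back to the existing split-based statistic.
    static long fileNumberForAttachment(Offset offset, long splitBasedFileNum) {
        long physical = offset.getPhysicalFileNum();
        return physical >= 0 ? physical : splitBasedFileNum;
    }

    public static void main(String[] args) {
        Offset s3 = new S3Offset(List.of("a.parquet", "b.parquet"));
        System.out.println(fileNumberForAttachment(s3, 4)); // 2: physical count wins

        Offset jdbc = new Offset() {}; // no file notion -> default -1 -> fallback
        System.out.println(fileNumberForAttachment(jdbc, 4)); // 4
    }
}
```

A non-file source simply inherits the `-1` default, so the fallback path needs no per-source changes.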
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
…itted
Drop the redundant null guard on runningOffset (getCommitOffsetJson above already dereferences it) and the explanatory comment that duplicates the Offset.getPhysicalFileNum javadoc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/review

OpenCode automated review failed and did not complete. Error: the review step failed (possibly a timeout or cancellation). Please inspect the workflow logs and rerun the review after the underlying issue is resolved.
/review |
Two blocking issues found:
- `fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java:1058`
  This only changes the count written into new txn attachments. Existing `jobStatistic.fileNumber` values and cloud `streaming_job.num_files` are cumulative persisted totals that still contain the old split-based counts, and both paths keep adding the new value on top. For already-running jobs after upgrade, `jobs("type"="insert").LoadStatistic.FileNumber` will remain permanently mixed instead of converging to physical file counts.
- `regression-test/suites/job_p0/streaming_job/test_streaming_insert_job_file_number.groovy:95`
  The new regression adds `qt_select` but does not add the generated baseline file `regression-test/data/job_p0/streaming_job/test_streaming_insert_job_file_number.out`, so the case is incomplete as submitted.
Checkpoint conclusions:
- Goal of the task: partially achieved. New S3 task commits use physical file counts, but upgrade/persisted-state handling is incomplete and the added regression case is not fully checked in.
- Change size/focus: small and focused.
- Concurrency: no new locking/concurrency issue identified; the change stays inside the existing `beforeCommitted()` write-lock path.
- Lifecycle/persistence: problematic for upgraded jobs because `num_files` is stored cumulatively in both FE job images and cloud meta-service state without migration/recalculation.
- Configuration: no new config.
- Compatibility: not handled for existing streaming jobs across upgrade; the semantic change to persisted `num_files` is user-visible.
- Parallel paths: no additional source-provider path was accidentally broken; JDBC still falls back to the old behavior.
- Test coverage: insufficient as submitted. The new regression covers the happy path only, and its `.out` baseline is missing.
- Test result files: missing `.out` file for the new `qt_select`.
- Observability: no additional logging/metrics appear necessary.
- Transaction/persistence: affected; the txn attachment field is persisted/replayed, so compatibility matters here.
- FE/BE variable passing: no new FE-BE protocol field was added.
- Performance: no material concern found.
- Other issues: none beyond the two blockers above.
User focus:
- No additional user-provided review focus.
…case
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/review

run buildall
Findings
- `tasks("type"="insert")` still exposes the split-based `LoadStatistic.FileNumber` via `StreamingInsertTask.getTvfInfo()`, so the same field now has different semantics between `jobs()` and `tasks()`.
- An existing inline thread on `StreamingInsertJob.beforeCommitted()` is still unresolved: upgraded streaming jobs keep mixed historical totals unless the persisted statistic is migrated or reset. I am not duplicating that thread.
- The previously raised missing baseline issue looks fixed in the current head.
Critical Checkpoints
- Goal: Partially met. The PR fixes the non-cloud `jobs()` happy path for newly committed S3 tasks and adds a matching regression baseline, but it does not fully solve upgraded jobs and it leaves the parallel `tasks()` surface unchanged.
- Scope/minimality: Small and focused.
- Concurrency: No new locking or thread-safety issue found; `beforeCommitted()` still uses the existing job write lock.
- Lifecycle/static init: No special lifecycle or static-init issue found.
- Config: No new config.
- Compatibility/persistence: Not fully handled because pre-upgrade persisted aggregates remain mixed without migration/reset.
- Parallel code paths: Not fully handled; `tasks()` still reports split count.
- Special checks: The `physicalFileNum >= 0` fallback is reasonable for non-file sources.
- Test coverage: Improved for the S3 `jobs()` regression path, but still missing coverage for upgrade behavior and the `tasks()` surface.
- Test result changes: The new `.out` baseline is present in the current head.
- Observability: User-visible statistics remain inconsistent across surfaces.
- Transaction/persistence: New-commit attachment plumbing is wired, but the migration story is incomplete.
- Data writes/modifications: No data-atomicity concern beyond metadata/statistics updates.
- FE-BE variable passing: No new propagation issue found.
- Performance: No material concern found.
- User focus: No additional user-provided focus.
Track distinct physical files in FileScanNode (-1 for batch-mode scans where splits are not materialized on FE) and prefer it over split count in InsertIntoTableCommand.applyInsertPlanStatistic. tasks() and jobs() views now report consistent semantics for streaming and regular INSERTs. Drop the now-redundant Offset.getPhysicalFileNum() source-layer fallback since loadStatistic.getFileNumber() already carries the correct value.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
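The distinct-file accounting described in the commit above can be sketched as follows. The names (`FileSplitLike`, `ScanNodeStats`, `assemble`) are illustrative stand-ins, not the actual Doris `FileScanNode`/`FileQueryScanNode` APIs; only the idea is preserved: collect distinct paths inside the existing split-assembly loop and keep `-1` as the "not materialized on FE" sentinel.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitAssemblySketch {
    record FileSplitLike(String pathString, long start, long length) {}

    static class ScanNodeStats {
        int selectedSplitNum = 0;
        // -1 for batch-mode scans, where splits are not materialized on FE.
        int selectedFileNum = -1;
    }

    static ScanNodeStats assemble(List<FileSplitLike> splits) {
        ScanNodeStats stats = new ScanNodeStats();
        Set<String> distinctPaths = new HashSet<>();
        for (FileSplitLike split : splits) {
            stats.selectedSplitNum++;              // existing split accounting
            distinctPaths.add(split.pathString()); // piggybacks on the same loop
        }
        stats.selectedFileNum = distinctPaths.size();
        return stats;
    }

    // Preference in the spirit of applyInsertPlanStatistic:
    // physical file count if known, otherwise the split count.
    static long fileNumber(ScanNodeStats stats) {
        return stats.selectedFileNum >= 0 ? stats.selectedFileNum : stats.selectedSplitNum;
    }

    public static void main(String[] args) {
        List<FileSplitLike> splits = List.of(
                new FileSplitLike("s3://bucket/a.csv", 0, 32),
                new FileSplitLike("s3://bucket/a.csv", 32, 10), // same file, second split
                new FileSplitLike("s3://bucket/b.csv", 0, 20));
        ScanNodeStats stats = assemble(splits);
        System.out.println(stats.selectedSplitNum); // 3 splits
        System.out.println(fileNumber(stats));      // 2 physical files
    }
}
```

A batch-mode scan would simply never call `assemble`, leaving `selectedFileNum` at `-1` so `fileNumber` falls back to the split count.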
/review

run buildall
In addition to the existing review threads, I found one more blocking issue.
- Batch-mode external scans still do not populate physical file counts. `FileQueryScanNode` only sets `selectedFileNum` in the eager-split path, and `InsertIntoTableCommand.applyInsertPlanStatistic()` falls back to `selectedSplitNum` otherwise. For batch-mode Hive/Hudi/Iceberg/MaxCompute inserts, that value is a split or partition estimate rather than a physical file count, so `LoadStatistic.FileNumber` is still wrong for large `INSERT INTO ... SELECT` jobs.
Critical checkpoints
- Goal / correctness: Partially satisfied. The eager-split S3 or TVF path is fixed, but batch-mode external-table inserts still violate the intended `FileNumber` semantics.
- Change size / focus: Small and focused, but not complete because one parallel code path is still using different semantics.
- Concurrency: No new concurrency issue found in the touched code.
- Lifecycle / static initialization: No special lifecycle or SIOF issue found.
- Config: No new config was added, but existing external-table batch-mode settings are an applicable path and remain incorrect.
- Compatibility: Existing review threads already cover the mixed old and new semantics across upgrades, so I did not duplicate that point.
- Parallel code paths: Not fully handled; the batch-mode `FileQueryScanNode` path is still different from the eager-split path.
- Special conditions: The new fallback explicitly calls out batch mode, but the fallback result is still semantically wrong there.
- Test coverage: The new regression case covers the streaming S3 eager-split path only; there is still no coverage for batch-mode external-table inserts.
- Test result files: The new `.out` file is present and consistent with the added test.
- Observability: No new observability gap found.
- Transaction / persistence: No additional persistence defect found beyond the already-raised compatibility concern.
- Data writes / atomicity: No new transactionality issue found in the touched code.
- FE/BE variable passing: Not applicable for this change.
- Performance: No material performance regression found from the added distinct-path accounting in the eager-split path.

User focus: no additional user-provided focus; no extra issue beyond the full review.
FE UT Coverage Report: increment line coverage
FE Regression Coverage Report: increment line coverage
PR approved by at least one committer and no changes requested.
…62804) ### What problem does this PR solve? `InsertIntoTableCommand.applyInsertPlanStatistic` populated `LoadStatistic.fileNum` from `FileScanNode.getSelectedSplitNum()`, i.e. the BE **split count**, not the number of physical input files. When a file crossed the split-size threshold (default `max_initial_file_split_size × 1.1 ≈ 35.2MB`) and was cut into multiple splits, both `jobs("type"="insert").LoadStatistic.FileNumber` and `tasks("type"="insert").LoadStatistic.FileNumber` reported a value larger than the actual file list. In the user-reported scenario, 8 input files appeared as `FileNumber = 16` because each 42MB file was split in two. Data correctness is unaffected; only the displayed statistic was misleading. This affects both streaming insert jobs and regular `INSERT INTO ... SELECT FROM S3/HDFS/Hive`.
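The split arithmetic in the problem statement can be checked with a small sketch. This is illustrative only: the chunking model below is a simplification of the actual BE split planner, and the 32MB base size is inferred from the stated `max_initial_file_split_size × 1.1 ≈ 35.2MB` threshold; only the 8-files-reported-as-16 outcome comes from the description above.

```java
// Why 8 x 42MB input files report a split-based FileNumber of 16
// under a ~35.2MB split threshold (simplified chunking model).
public class SplitCountSketch {
    static final long MB = 1024L * 1024L;
    static final long SPLIT_SIZE = 32 * MB;                   // assumed max_initial_file_split_size
    static final long THRESHOLD = (long) (SPLIT_SIZE * 1.1);  // ~35.2MB cut-over

    // Files at or below the threshold stay whole; larger files are
    // chunked into SPLIT_SIZE pieces (ceiling division).
    static long splitsForFile(long fileSize) {
        if (fileSize <= THRESHOLD) {
            return 1;
        }
        return (fileSize + SPLIT_SIZE - 1) / SPLIT_SIZE;
    }

    public static void main(String[] args) {
        long files = 8;
        long perFile = splitsForFile(42 * MB);   // 42MB > 35.2MB -> 2 splits
        System.out.println(files * perFile);     // 16: the misleading split-based count
        System.out.println(files);               // 8: the physical file count
    }
}
```

A 30MB file under the same model stays at one split, which is why the doubling only appears once files cross the threshold.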
### Fix
- Add `FileScanNode.selectedFileNum` (default `-1`).
- In `FileQueryScanNode.createScanRangeLocations`, populate it from the distinct `FileSplit.getPathString()` values collected in the existing split-assembly loop (zero extra traversal; non-batch-mode path only, since batch-mode scans don't materialize splits on FE).
- `InsertIntoTableCommand.applyInsertPlanStatistic` prefers `getSelectedFileNum()` and falls back to `getSelectedSplitNum()` for batch-mode scans (Hudi/Iceberg/Paimon).
- `EXPLAIN`'s `inputSplitNum` continues to report the split count; the two fields are now semantically distinct.

### Behavior changes
- `LoadStatistic.FileNumber` now reports the physical file count for `INSERT FROM S3/HDFS/TVF` (streaming and regular). Previously it reported the BE split count.
- Batch-mode external-table inserts (Hudi/Iceberg/Paimon `INSERT FROM ...`) keep the previous behavior (fallback to the estimated split count). Wiring `selectedFileNum` for those sources is a separate follow-up.

### Compatibility
The proto field `StreamingTaskCommitAttachmentPB.num_files` is unchanged. Its semantics shift from split count to physical file count for new attachments. For streaming jobs already running before the upgrade, the persisted `jobStatistic.fileNumber` (and cloud `streaming_job.num_files`) remain cumulative and continue accumulating; existing jobs' `LoadStatistic.FileNumber` will be a mix of pre- and post-upgrade values until the job is recreated.

### Release note
Fix `LoadStatistic.FileNumber` for INSERT jobs (including streaming insert) reading from S3/HDFS/TVF to report the physical file count instead of the BE split count.

### Check List (For Author)
- Test: behavior changed. `LoadStatistic.FileNumber` for INSERT FROM S3/HDFS/TVF (streaming and regular) now reflects the physical file count instead of the BE split count. Batch-mode external table scans (Hudi/Iceberg/Paimon) keep the previous fallback.

### Does this need documentation?

### Check List (For Reviewer who merge this PR)