[fix](streaming-job) fix filteredRows always 0 on single-table S3 streaming #62816
Conversation
…xn commit
For single-table S3 streaming insert jobs running with
enable_insert_strict=false + insert_max_filter_ratio>0, BE correctly
filters bad rows but jobStatistic.filteredRows stays at 0 because the
txn commit path had no channel for filteredRows end-to-end:
- StreamingTaskCommitAttachmentPB has no filtered_rows field.
- LoadStatistic has no filteredRows field; the value only lives as a
  local int in AbstractInsertExecutor for strict/ratio checks (see the
  sketch after this list).
- beforeCommitted() builds the attachment from loadStatistic.get*() so
there is nothing to read.
- updateJobStatisticAndOffset() / updateCloudJobStatisticAndOffset()
accumulate every other stat but skip filteredRows.
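
A minimal sketch of that dead end, assuming the usual Doris load-counter keys (`dpp.abnorm.ALL` / `dpp.norm.ALL`) and a simplified ratio gate; the names are illustrative, not the exact FE code:

```java
import java.util.Map;

class FilteredRowsDeadEndSketch {
    // Simplified from the description above: the BE-reported counter is read
    // into a local, used once for the strict/ratio gate, and then dropped.
    static void checkFilterRatio(Map<String, String> loadCounters, double maxFilterRatio)
            throws Exception {
        int filteredRows = Integer.parseInt(loadCounters.getOrDefault("dpp.abnorm.ALL", "0"));
        int loadedRows = Integer.parseInt(loadCounters.getOrDefault("dpp.norm.ALL", "0"));
        if (filteredRows > (filteredRows + loadedRows) * maxFilterRatio) {
            throw new Exception("too many filtered rows");
        }
        // End of the line: nothing copies filteredRows into LoadStatistic or the
        // txn commit attachment, so jobStatistic.filteredRows stays 0.
    }
}
```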
The multi-table CDC non-txn path via CommitOffsetRequest already does
the symmetric accumulate into nonTxnJobStatistic correctly; this PR
wires the same channel for the single-table txn path:
1. cloud.proto: add optional int64 filtered_rows = 7 to
StreamingTaskCommitAttachmentPB.
2. StreamingTaskTxnCommitAttachment: add filteredRows field
(@SerializedName("fr")), extend full-args constructor, read from PB
in PB constructor, include in toString().
3. TxnUtil.streamingTaskTxnCommitAttachmentToPb: populate the new PB
field.
4. LoadStatistic: add filteredRows field + getter/setter; expose it in
toJson() so the loadStatistic column surfaces it too.
5. AbstractInsertExecutor.execImpl: after reading filteredRows from
coordinator.getLoadCounters(), persist it into
insertLoadJob.loadStatistic (symmetric to how BrokerLoadJob pushes
DPP_ABNORMAL_ALL into its own job state).
6. StreamingInsertJob.beforeCommitted: pass loadStatistic.getFilteredRows()
into the new attachment constructor arg.
7. StreamingInsertJob.updateJobStatisticAndOffset: accumulate
jobStatistic.setFilteredRows(old + attachment.getFilteredRows()).
8. StreamingInsertJob.updateCloudJobStatisticAndOffset: overwrite
   jobStatistic.setFilteredRows(attachment.getFilteredRows()) to match
   the existing latest-snapshot semantics of that method (both update
   paths are sketched below).
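
A condensed sketch of items 7 and 8, with simplified types: the EditLog path sees one delta per committed txn and accumulates, while the cloud path receives a cumulative snapshot from the meta service and overwrites.

```java
class StreamingInsertJobSketch {
    static class JobStatistic { long filteredRows; }
    static class Attachment { long filteredRows; }

    final JobStatistic jobStatistic = new JobStatistic();

    // item 7: live commit and FE EditLog replay; each attachment is one txn's delta
    void updateJobStatisticAndOffset(Attachment att) {
        jobStatistic.filteredRows += att.filteredRows; // accumulate
    }

    // item 8: cloud MS replay; the attachment already holds the cumulative total
    void updateCloudJobStatisticAndOffset(Attachment att) {
        jobStatistic.filteredRows = att.filteredRows; // overwrite (latest snapshot)
    }
}
```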
After this fix, filteredRows is correct for live accumulation, FE
EditLog replay (replayOnCommitted) and cloud MS replay
(replayOnCloudMode), all three paths reading the same PB.
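
The shared PB field, shown with the accessor names protoc would generate for `optional int64 filtered_rows = 7` (a fragment; `attachment` stands in for the FE-side object):

```java
// write side (TxnUtil.streamingTaskTxnCommitAttachmentToPb): populate the field
StreamingTaskCommitAttachmentPB pb = StreamingTaskCommitAttachmentPB.newBuilder()
        .setScannedRows(attachment.getScannedRows())
        .setFilteredRows(attachment.getFilteredRows())
        .build();

// read side (the PB constructor shared by all three replay paths): guard the optional
long filteredRows = pb.hasFilteredRows() ? pb.getFilteredRows() : 0L;
```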
Added regression test test_streaming_insert_job_filtered_rows which
loads example_[0-1].csv into a table with c2 INT NOT NULL (non-parseable
names force every row to be filtered, mirroring the pattern in
test_streaming_mysql_job_errormsg), and asserts
scannedRows=20, filteredRows=20, fileNumber=2 and an empty table.
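
For an all-rows-filtered load to still commit, the test's filter-ratio setting must tolerate 100% filtering; a quick check of that arithmetic (the 1.0 value is an assumption about the test's session setting, and the gate mirrors the sketch above):

```java
public class FilterRatioCheck {
    public static void main(String[] args) {
        long scanned = 20, filtered = 20, loaded = scanned - filtered; // every row filtered
        double maxFilterRatio = 1.0; // assumption: the insert_max_filter_ratio the test sets
        // commit is allowed only while filtered stays within the ratio of total rows
        boolean commits = filtered <= maxFilterRatio * (filtered + loaded); // 20 <= 20.0
        System.out.println(commits + ": job succeeds, filteredRows=20, table left empty");
    }
}
```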
/review
Blocking issue:
- Cloud replay/reset still loses `filteredRows`. `replayOnCloudMode()` now reads `filteredRows` from `GetStreamingTaskCommitAttachResponse`, but the meta-service code that stores streaming-job progress only accumulates/preserves `scanned_rows`, `load_bytes`, `num_files`, and `file_bytes` in `cloud/src/meta-service/meta_service_txn.cpp`. That means the live cloud path can look correct, because the commit response echoes the request attachment, while FE restart, `replayOnCloudMode()`, and offset reset in cloud mode still reload `filteredRows` as `0`.
Critical checkpoint conclusions:
- Goal: Partially met. The shared-nothing txn path is fixed, but cloud replay is still incorrect.
- Scope: Small and focused.
- Concurrency: No new locking or thread-safety issue found; the FE changes follow existing callback and write-lock patterns.
- Lifecycle/Persistence: Not fully correct in cloud mode because the new field is not persisted in the cloud streaming-job snapshot.
- Parallel paths: The non-txn CDC path already handled `filteredRows`; cloud txn replay remains inconsistent with the other replay paths.
- Compatibility: The proto addition is backward-compatible.
- Tests: Added regression coverage for the local txn path, but there is no cloud/meta-service coverage for the new field.
- Observability: FE job/load statistics improve on the direct path, but cloud replay still reports the wrong value.
- Data writes/transactions: No data-visibility or transaction-atomicity regression found beyond incorrect persisted job statistics.
- User focus: No additional user-provided focus points.
I did not find another distinct blocking issue beyond the cloud replay hole.
…aming-job snapshot

The previous commit threaded filteredRows from BE through the FE txn commit path, but cloud replay still reloaded 0 because the meta-service streaming-job snapshot did not store filtered_rows:
- update_streaming_job_meta() accumulated/initialized only scanned_rows / load_bytes / num_files / file_bytes.
- reset_streaming_job_offset() preserved only the same four fields when rewriting the snapshot with a new offset.

So in cloud mode, while the live commit response echoed the request attachment correctly, FE restart, replayOnCloudMode() and offset reset all reloaded filtered_rows as 0.

Mirror the four-stat handling for the new field:
1. update_streaming_job_meta(): accumulate filtered_rows in the prev-existed branch and initialize it in the first-write branch, matching scanned_rows.
2. reset_streaming_job_offset(): preserve prev_job_info.filtered_rows() alongside the other accumulated stats.

Tests:
- UpdateStreamingJobMetaFilteredRowsAccumulateTest: two commits then a get, asserts filtered_rows accumulates (30 + 20 = 50) the same way scanned_rows does, then resets offset and verifies filtered_rows is preserved.
- GetStreamingTaskCommitAttachTest / ResetStreamingJobOffsetTest: populate filtered_rows on the seed commit and assert it round-trips through get and survives reset_streaming_job_offset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
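The touched logic is C++ in cloud/src/meta-service/meta_service_txn.cpp; to keep the examples in one language, here is a Java stand-in for the two branches the commit describes, with every name a simplified placeholder:

```java
class StreamingJobMetaSketch {
    static class JobProgress { long offset; long scannedRows; long filteredRows; }
    static class Attachment { long scannedRows; long filteredRows; }

    // update_streaming_job_meta(): prev-existed branch accumulates,
    // first-write branch initializes, matching scanned_rows
    static JobProgress update(JobProgress prev, Attachment att) {
        JobProgress next = new JobProgress();
        next.scannedRows  = (prev != null ? prev.scannedRows  : 0) + att.scannedRows;
        next.filteredRows = (prev != null ? prev.filteredRows : 0) + att.filteredRows; // newly mirrored
        return next;
    }

    // reset_streaming_job_offset(): rewrite the offset, preserve accumulated stats
    static JobProgress reset(JobProgress prev, long newOffset) {
        JobProgress next = new JobProgress();
        next.offset = newOffset;
        next.scannedRows = prev.scannedRows;
        next.filteredRows = prev.filteredRows; // previously dropped, so replay reloaded 0
        return next;
    }
}
```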
…on comments

- Remove UpdateStreamingJobMetaFilteredRowsAccumulateTest. The filtered_rows round-trip through update / get / reset is already covered by the EXPECT_EQ filtered_rows assertions added to GetStreamingTaskCommitAttachTest and ResetStreamingJobOffsetTest; the new test was duplicating that coverage with extra non-filtered fields whose accumulation is not what this PR changes.
- Trim test_streaming_insert_job_filtered_rows.groovy comments: drop the issue number and cross-test reference, keep only the two invariants the reader needs to understand the schema and ratio choice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/review
run buildall
Findings
- High: `AbstractInsertExecutor` still parses `DPP_ABNORMAL_ALL` with `Integer.parseInt` into an `int` (fe/fe-core/src/main/java/org/apache/doris/nereids/trees/plans/commands/insert/AbstractInsertExecutor.java:216). `Coordinator`/`LoadContext` accumulate that counter as a `long`, and this PR now carries it through `long`/`int64` fields. Once a load filters more than `Integer.MAX_VALUE` rows, FE will fail before the new `loadStatistic` / txn-attachment path is built, so the end-to-end `filteredRows` propagation is still incorrect for large jobs (see the sketch after this list).
- Medium: `StreamingInsertJob` still only updates `COUNTER_STREAMING_JOB_FILTER_ROWS` in the non-txn CDC path. After this PR, single-table txn jobs will show non-zero `jobStatistic.filteredRows`, but the FE metrics endpoint still reports `streaming_job_filter_rows=0` for that same path, and cloud replay never resets the counter to the cumulative value either.
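The High finding distilled (a standalone sketch, not the reviewed source): once the BE counter string exceeds 32 bits, `Integer.parseInt` throws before any of the newly wired `long` plumbing runs.

```java
public class CounterOverflowSketch {
    public static void main(String[] args) {
        String counter = String.valueOf(Integer.MAX_VALUE + 1L); // "2147483648"
        long widened = Long.parseLong(counter);   // fine: matches the long/int64 channel
        System.out.println(widened);
        int narrowed = Integer.parseInt(counter); // NumberFormatException: FE fails
                                                  // before the attachment is built
    }
}
```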
Critical Checkpoints
- Goal / correctness: The PR does wire `filteredRows` through the single-table txn attachment, FE replay, and cloud meta-service snapshot/reset paths, and the previously raised cloud replay issue appears fixed in the current diff. However, the large-counter overflow above means the goal is still not fully met end-to-end.
- Scope / minimality: The change is small and focused on the filtered-row bookkeeping path.
- Concurrency: No new concurrency model or lock-order risk was introduced in the reviewed paths; FE changes stay under the existing job lock / callback flow and cloud changes stay inside the existing meta-service txn update path.
- Lifecycle / persistence: The new optional proto field is wire-compatible, and the cloud snapshot write/read/reset logic now includes `filtered_rows`. I did not find a new static lifecycle issue.
- Config: No new configuration items are added.
- Compatibility: Adding `optional int64 filtered_rows = 7` is compatible with readers that ignore unknown protobuf fields.
- Parallel paths: The multi-table non-txn path already handled filtered rows, and the cloud snapshot path is now updated too. The remaining parity gaps are the int-limited source parsing in the single-table executor and the missing metric update in the txn path.
- Special conditional checks: No new special conditional branch looks incorrect.
- Test coverage: The new S3 regression test plus the meta-service get/reset assertions improve coverage for the intended fix. I did not find coverage for very large filtered-row counters or the metric parity issue.
- Test result files: No `.out` result files were changed.
- Observability: `jobs("type"="insert")` becomes more accurate, but `/metrics` is still inconsistent for single-table txn jobs.
- Transaction / persistence: Attachment serialization/deserialization and the cloud snapshot round-trip now look aligned in the touched code. No additional failover/persistence gap was found beyond the issues above.
- Data writes / atomicity: This PR changes bookkeeping rather than row visibility; I did not find a new atomicity regression in the reviewed paths.
- FE/BE variable passing: The new `filtered_rows` field is threaded through the touched FE/cloud paths for single-table streaming commits.
- Performance: No meaningful performance regression stood out in the new code.
- User focus: No additional user-provided focus points.
run nonConcurrent
Single-table txn commit and cloud replay accumulated jobStatistic.filteredRows
but never touched COUNTER_STREAMING_JOB_FILTER_ROWS, so /metrics still reported
0 while jobs("type"="insert") showed the right value. Mirror the TOTAL_ROWS /
LOAD_BYTES pattern in both paths so the two observation points stay aligned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
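One plausible shape of the mirroring described above (the counter name comes from the review; the stand-in `increase` helper and the cumulative-delta handling on the cloud path are assumptions, not the merged code):

```java
class MetricMirrorSketch {
    static class Counter {
        long value;
        void increase(long delta) { value += delta; } // stands in for the FE metric counter
    }
    static final Counter COUNTER_STREAMING_JOB_FILTER_ROWS = new Counter();

    // live commit / EditLog replay: the attachment carries one txn's delta
    static void onTxnCommitted(long attachmentFilteredRows) {
        COUNTER_STREAMING_JOB_FILTER_ROWS.increase(attachmentFilteredRows);
    }

    // cloud replay: the attachment is cumulative, so bump only by what is new
    static void onCloudReplay(long cumulativeFilteredRows, long alreadyCounted) {
        COUNTER_STREAMING_JOB_FILTER_ROWS.increase(cumulativeFilteredRows - alreadyCounted);
    }
}
```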
run buildall
PR approved by at least one committer and no changes requested.
…eaming (#62816)

Fix `filteredRows` always reported as 0 in `jobs("type"="insert")` for single-table S3 streaming insert jobs under `enable_insert_strict=false` + `insert_max_filter_ratio>0`. The filter count is now propagated from BE through the txn commit attachment into job statistics, and survives FE EditLog replay and cloud meta-service round-trip. Added regression test `test_streaming_insert_job_filtered_rows`.
What problem does this PR solve?
Problem Summary:
For single-table S3 streaming insert jobs running under `session.enable_insert_strict=false` + `session.insert_max_filter_ratio>0`, BE correctly filters bad rows but `jobStatistic.filteredRows` stays at `0` in the `jobs("type"="insert")` view. The issue reproduces in the live path (before any restart) and also after FE EditLog replay, because the whole commit chain never carried `filteredRows` end-to-end for the txn path.

Root cause: the single-table txn commit pipeline has no `filteredRows` channel.
- `StreamingTaskCommitAttachmentPB` has no `filtered_rows` field (only `scanned_rows` / `load_bytes` / `num_files` / `file_bytes`).
- `LoadStatistic` has no `filteredRows` field; the value only lives as a local `int` in `AbstractInsertExecutor` for strict-mode / `insert_max_filter_ratio` checks and is never pushed anywhere persistent.
- `StreamingInsertJob.beforeCommitted()` builds the attachment from `loadStatistic.get*()`, so there is nothing to read even if the attachment class had a field.
- `updateJobStatisticAndOffset(StreamingTaskTxnCommitAttachment, boolean)` and `updateCloudJobStatisticAndOffset()` accumulate every other stat but skip `filteredRows`.

The multi-table CDC non-txn path (`CommitOffsetRequest` → `updateNoTxnJobStatisticAndOffset()`) already accumulates `filteredRows` into `nonTxnJobStatistic` correctly; the single-table txn path needed the same wiring.

Fix

Thread `filteredRows` along the same channel scanned/loadBytes/fileNumber/fileSize already use:
1. `cloud.proto`: add `optional int64 filtered_rows = 7` to `StreamingTaskCommitAttachmentPB`.
2. `StreamingTaskTxnCommitAttachment`: add `filteredRows` field (`@SerializedName("fr")`), extend the full-args constructor, read from PB in the PB constructor, include in `toString()`.
3. `TxnUtil.streamingTaskTxnCommitAttachmentToPb`: populate the new PB field.
4. `LoadStatistic`: add `filteredRows` field + getter/setter; surface it in `toJson()` so the `loadStatistic` column of `jobs("type"="insert")` shows it too.
5. `AbstractInsertExecutor.execImpl`: after reading `filteredRows` from `coordinator.getLoadCounters()`, persist it into `insertLoadJob.getLoadStatistic().setFilteredRows(...)` (symmetric to how `BrokerLoadJob` pushes `DPP_ABNORMAL_ALL` into its own job state).
6. `StreamingInsertJob.beforeCommitted`: pass `loadStatistic.getFilteredRows()` into the new attachment constructor arg.
7. `StreamingInsertJob.updateJobStatisticAndOffset(attachment, isReplay)` (live + FE EditLog replay): accumulate `jobStatistic.setFilteredRows(old + attachment.getFilteredRows())`.
8. `StreamingInsertJob.updateCloudJobStatisticAndOffset` (cloud MS replay): overwrite `jobStatistic.setFilteredRows(attachment.getFilteredRows())`, matching the existing latest-snapshot semantics of that method.

After the fix all three read paths (live accumulate, `replayOnCommitted`, `replayOnCloudMode`) see the same PB field, so `filteredRows` is correct whether BE or FE is restarted.

Added regression test `test_streaming_insert_job_filtered_rows`: loads `example_[0-1].csv` into a table with `c2 INT NOT NULL`. Non-parseable name strings on a NOT NULL int column force every row to be filtered (mirrors the `age int NOT NULL` + `'abc'` pattern from `test_streaming_mysql_job_errormsg`). Asserts `scannedRows=20`, `filteredRows=20`, `fileNumber=2`, and the target table ends up empty. Before this fix the test fails at `filteredRows == 20` (observed `0`); after the fix it passes.

Release note

Fix `filteredRows` always reported as 0 in `jobs("type"="insert")` for single-table S3 streaming insert jobs under `enable_insert_strict=false` + `insert_max_filter_ratio>0`. The filter count is now propagated from BE through the txn commit attachment into job statistics, and survives FE EditLog replay and cloud meta-service round-trip.

Check List (For Author)

Test
Behavior changed: `jobs("type"="insert").jobStatistic.filteredRows` (and `loadStatistic.FilteredRows`) now report the actual number of rows filtered by BE on the single-table streaming commit path, instead of always 0.

Does this need documentation?

Check List (For Reviewer who merge this PR)