@JNSimba
Member

@JNSimba JNSimba commented Feb 3, 2026

What problem does this PR solve?

Related PR: #58898 #59461

In some scenarios, it is necessary to tolerate a certain amount of erroneous data.

Supported parameters:

load.strict_mode: Whether to enable strict mode. Defaults to false.

load.max_filter_ratio: The maximum allowed ratio of erroneous (filtered) rows to total rows within the sampling window. Defaults to 0, that is, zero tolerance. The sampling window is max_interval * 10. If the ratio of erroneous rows to total rows within the sampling window exceeds max_filter_ratio, the job is paused and manual intervention is required to investigate the data quality issue.

Example:

CREATE JOB test_streaming_mysql_job_errormsg
ON STREAMING
FROM MYSQL (
"jdbc_url" = "jdbc:mysql://127.0.0.1:3308",
......
)
TO DATABASE database (
"table.create.properties.replication_num" = "1",
...
"load.max_filter_ratio" = "1"
)
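
The windowed check described above can be sketched roughly as follows. This is a minimal Python illustration, not the code from this PR (the actual logic lives in the Java FE). The class name `FilterRatioMonitor`, its fields, and the decision to reset the counters once a window has elapsed are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FilterRatioMonitor:
    """Hypothetical sketch of a windowed data-quality check."""

    max_filter_ratio: float   # e.g. the value of load.max_filter_ratio
    window_ms: int            # assumed to be max_interval * 10, in milliseconds
    sample_start_ms: int = 0
    filtered_rows: int = 0
    total_rows: int = 0

    def record(self, now_ms: int, filtered: int, loaded: int) -> bool:
        """Add one batch and return True if the job should be paused."""
        # Assumption: start a fresh window once the current one has elapsed.
        if now_ms - self.sample_start_ms > self.window_ms:
            self.sample_start_ms = now_ms
            self.filtered_rows = 0
            self.total_rows = 0
        self.filtered_rows += filtered
        self.total_rows += filtered + loaded
        if self.total_rows == 0:
            return False
        # Pause only when the ratio strictly exceeds the configured limit.
        return self.filtered_rows / self.total_rows > self.max_filter_ratio
```

Under these assumptions, a limit of 0.1 tolerates 20 filtered rows out of 200 but pauses the job at 50 out of 300. With "load.max_filter_ratio" = "1", as in the example above, the ratio can never exceed the limit, so this check would never pause the job.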

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented Feb 3, 2026

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/21) 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 32066 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a92766b116558900d2d541e5a74d1b476a6b0af8, data reload: false

------ Round 1 ----------------------------------
q1	17649	5228	5047	5047
q2	2038	308	188	188
q3	10202	1355	750	750
q4	10232	896	320	320
q5	8139	2177	1952	1952
q6	227	179	149	149
q7	904	738	607	607
q8	9270	1444	1238	1238
q9	5390	4807	4832	4807
q10	6877	1921	1552	1552
q11	511	303	278	278
q12	379	377	227	227
q13	17781	4071	3201	3201
q14	256	249	226	226
q15	908	839	817	817
q16	678	677	637	637
q17	665	767	511	511
q18	6755	6542	6524	6524
q19	1448	1005	622	622
q20	408	370	229	229
q21	2638	2099	1907	1907
q22	358	328	277	277
Total cold run time: 103713 ms
Total hot run time: 32066 ms
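
The bot output does not label the columns. The Round 1 figures above are consistent with the first number per query being the cold run, the next two being hot runs, and the last being the faster of the two hot runs. The short Python check below reproduces both totals under that reading; the column interpretation and the name `totals` are assumptions, while the numbers are copied from the table.

```python
# (query, first run, second run, third run) in ms, copied from Round 1 above.
ROUND_1 = [
    ("q1", 17649, 5228, 5047), ("q2", 2038, 308, 188),
    ("q3", 10202, 1355, 750), ("q4", 10232, 896, 320),
    ("q5", 8139, 2177, 1952), ("q6", 227, 179, 149),
    ("q7", 904, 738, 607), ("q8", 9270, 1444, 1238),
    ("q9", 5390, 4807, 4832), ("q10", 6877, 1921, 1552),
    ("q11", 511, 303, 278), ("q12", 379, 377, 227),
    ("q13", 17781, 4071, 3201), ("q14", 256, 249, 226),
    ("q15", 908, 839, 817), ("q16", 678, 677, 637),
    ("q17", 665, 767, 511), ("q18", 6755, 6542, 6524),
    ("q19", 1448, 1005, 622), ("q20", 408, 370, 229),
    ("q21", 2638, 2099, 1907), ("q22", 358, 328, 277),
]


def totals(rows):
    """Return (cold total, hot total), taking the faster hot run per query."""
    cold = sum(first for _, first, _, _ in rows)
    hot = sum(min(second, third) for _, _, second, third in rows)
    return cold, hot


print(totals(ROUND_1))  # (103713, 32066)
```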

----- Round 2, with runtime_filter_mode=off -----
q1	5373	5311	5281	5281
q2	273	343	262	262
q3	2194	2675	2259	2259
q4	1366	1733	1303	1303
q5	4318	4253	4329	4253
q6	215	181	141	141
q7	2441	2023	1942	1942
q8	2602	2612	2474	2474
q9	7485	7782	7465	7465
q10	2798	3058	2582	2582
q11	567	475	446	446
q12	684	748	639	639
q13	3945	4966	3671	3671
q14	289	314	299	299
q15	883	839	855	839
q16	675	729	698	698
q17	1213	1370	1393	1370
q18	8039	8185	7594	7594
q19	844	848	875	848
q20	2079	2161	2012	2012
q21	4735	4165	4164	4164
q22	596	544	522	522
Total cold run time: 53614 ms
Total hot run time: 51064 ms

@doris-robot

ClickBench: Total hot run time: 28.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit a92766b116558900d2d541e5a74d1b476a6b0af8, data reload: false

query1	0.05	0.05	0.05
query2	0.10	0.04	0.04
query3	0.26	0.08	0.09
query4	1.62	0.11	0.11
query5	0.27	0.25	0.25
query6	1.17	0.69	0.67
query7	0.03	0.02	0.03
query8	0.06	0.03	0.04
query9	0.56	0.51	0.50
query10	0.56	0.54	0.55
query11	0.15	0.10	0.09
query12	0.14	0.11	0.11
query13	0.63	0.61	0.60
query14	1.05	1.04	1.04
query15	0.88	0.87	0.88
query16	0.42	0.39	0.40
query17	1.18	1.15	1.15
query18	0.23	0.21	0.21
query19	2.10	2.00	2.00
query20	0.02	0.01	0.02
query21	15.40	0.25	0.14
query22	5.20	0.05	0.05
query23	15.92	0.31	0.11
query24	1.07	0.63	0.79
query25	0.11	0.05	0.11
query26	0.16	0.13	0.13
query27	0.09	0.06	0.06
query28	5.00	1.14	0.97
query29	12.56	3.94	3.21
query30	0.28	0.12	0.11
query31	2.82	0.64	0.41
query32	3.25	0.59	0.49
query33	3.27	3.26	3.35
query34	16.18	5.39	4.70
query35	4.76	4.77	4.73
query36	0.67	0.50	0.50
query37	0.10	0.07	0.06
query38	0.07	0.04	0.04
query39	0.04	0.02	0.03
query40	0.20	0.18	0.16
query41	0.09	0.03	0.03
query42	0.05	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 98.81 s
Total hot run time: 28.69 s

@JNSimba JNSimba changed the title from "[Improve](StreamingJob) add stream load properties for mysql/pg streaming job" to "[Improve](StreamingJob) add max_filter_ratio and strict mode for mysql/pg streaming job" Feb 3, 2026
@JNSimba
Member Author

JNSimba commented Feb 3, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 31554 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 856a0da5bcdcd9c316beadb076a751f95b163f82, data reload: false

------ Round 1 ----------------------------------
q1	17652	5323	5042	5042
q2	2030	298	204	204
q3	10233	1281	748	748
q4	10200	803	324	324
q5	7519	2207	1886	1886
q6	197	179	149	149
q7	867	761	596	596
q8	9271	1421	1021	1021
q9	5081	4807	4826	4807
q10	6817	1929	1565	1565
q11	502	304	277	277
q12	337	373	232	232
q13	17797	4038	3263	3263
q14	243	236	227	227
q15	874	828	820	820
q16	676	668	624	624
q17	651	832	456	456
q18	6685	6449	6288	6288
q19	1234	997	631	631
q20	390	348	232	232
q21	2672	2015	1897	1897
q22	353	313	265	265
Total cold run time: 102281 ms
Total hot run time: 31554 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5292	5251	5235	5235
q2	259	342	265	265
q3	2187	2664	2265	2265
q4	1339	1725	1317	1317
q5	4275	4235	4298	4235
q6	228	187	139	139
q7	1948	2302	1918	1918
q8	2545	2438	2518	2438
q9	7569	7453	7581	7453
q10	2952	3135	2581	2581
q11	542	495	463	463
q12	679	727	615	615
q13	3789	4480	3576	3576
q14	310	348	288	288
q15	896	837	856	837
q16	684	728	694	694
q17	1203	1374	1382	1374
q18	8016	7965	7849	7849
q19	890	847	849	847
q20	2091	2205	2104	2104
q21	4751	4229	4147	4147
q22	568	545	496	496
Total cold run time: 53013 ms
Total hot run time: 51136 ms

@doris-robot

ClickBench: Total hot run time: 28.33 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 856a0da5bcdcd9c316beadb076a751f95b163f82, data reload: false

query1	0.05	0.05	0.05
query2	0.09	0.05	0.04
query3	0.26	0.09	0.08
query4	1.60	0.11	0.11
query5	0.27	0.26	0.24
query6	1.16	0.67	0.67
query7	0.03	0.02	0.02
query8	0.05	0.04	0.04
query9	0.56	0.51	0.49
query10	0.54	0.54	0.55
query11	0.15	0.10	0.10
query12	0.14	0.11	0.10
query13	0.64	0.61	0.61
query14	1.06	1.08	1.07
query15	0.87	0.84	0.87
query16	0.44	0.39	0.41
query17	1.17	1.14	1.08
query18	0.23	0.21	0.21
query19	2.02	1.95	1.99
query20	0.02	0.01	0.01
query21	15.39	0.27	0.15
query22	4.97	0.05	0.05
query23	15.83	0.28	0.10
query24	0.93	0.93	0.32
query25	0.11	0.08	0.06
query26	0.14	0.14	0.14
query27	0.10	0.04	0.06
query28	3.68	1.13	0.96
query29	12.58	3.93	3.17
query30	0.28	0.13	0.12
query31	2.81	0.64	0.43
query32	3.23	0.60	0.49
query33	3.26	3.25	3.28
query34	16.42	5.44	4.74
query35	4.79	4.80	4.78
query36	0.64	0.50	0.50
query37	0.11	0.08	0.07
query38	0.07	0.05	0.04
query39	0.05	0.04	0.04
query40	0.19	0.16	0.15
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.06	0.03	0.03
Total cold run time: 97.11 s
Total hot run time: 28.33 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/63) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 33.33% (21/63) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Member Author

JNSimba commented Feb 4, 2026

run buildall

@JNSimba
Member Author

JNSimba commented Feb 4, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 32090 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4d2d8f3b8f91bd43866861078f2ea1ebfb364c06, data reload: false

------ Round 1 ----------------------------------
q1	17613	5233	5111	5111
q2	2067	316	230	230
q3	10142	1305	726	726
q4	10201	822	322	322
q5	7560	2173	1846	1846
q6	196	183	147	147
q7	882	737	623	623
q8	9267	1417	1073	1073
q9	5181	4764	4912	4764
q10	6799	1915	1561	1561
q11	492	296	277	277
q12	340	370	228	228
q13	17759	4045	3280	3280
q14	256	238	212	212
q15	904	825	804	804
q16	664	673	622	622
q17	634	787	508	508
q18	6734	6579	7520	6579
q19	1249	1056	640	640
q20	408	385	232	232
q21	2983	2440	2012	2012
q22	374	331	293	293
Total cold run time: 102705 ms
Total hot run time: 32090 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5672	5531	5807	5531
q2	272	339	253	253
q3	2359	2858	2438	2438
q4	1508	1811	1431	1431
q5	4594	4638	4830	4638
q6	239	182	140	140
q7	2055	1921	1785	1785
q8	2504	2402	2409	2402
q9	7617	7465	7386	7386
q10	2879	3050	2600	2600
q11	595	501	465	465
q12	685	836	608	608
q13	3780	4488	3479	3479
q14	262	285	265	265
q15	844	797	797	797
q16	644	682	646	646
q17	1092	1260	1310	1260
q18	7576	7204	7199	7199
q19	804	775	798	775
q20	2021	2049	1875	1875
q21	4493	4231	4083	4083
q22	573	538	491	491
Total cold run time: 53068 ms
Total hot run time: 50547 ms

@doris-robot

ClickBench: Total hot run time: 28.38 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4d2d8f3b8f91bd43866861078f2ea1ebfb364c06, data reload: false

query1	0.06	0.05	0.05
query2	0.09	0.04	0.04
query3	0.26	0.08	0.08
query4	1.59	0.11	0.11
query5	0.26	0.25	0.25
query6	1.16	0.69	0.66
query7	0.03	0.03	0.03
query8	0.05	0.04	0.04
query9	0.57	0.50	0.49
query10	0.54	0.56	0.55
query11	0.14	0.09	0.09
query12	0.14	0.10	0.11
query13	0.63	0.60	0.60
query14	1.09	1.08	1.04
query15	0.88	0.87	0.87
query16	0.40	0.39	0.40
query17	1.12	1.10	1.12
query18	0.23	0.21	0.21
query19	2.13	2.02	2.07
query20	0.02	0.02	0.01
query21	15.39	0.26	0.15
query22	5.47	0.05	0.05
query23	15.85	0.29	0.11
query24	1.70	0.34	0.54
query25	0.09	0.06	0.08
query26	0.14	0.13	0.13
query27	0.06	0.05	0.08
query28	4.05	1.14	0.95
query29	12.55	3.91	3.14
query30	0.28	0.14	0.14
query31	2.81	0.64	0.39
query32	3.24	0.61	0.50
query33	3.17	3.25	3.27
query34	15.97	5.35	4.72
query35	4.79	4.80	4.79
query36	0.64	0.51	0.49
query37	0.11	0.07	0.08
query38	0.07	0.05	0.04
query39	0.05	0.03	0.03
query40	0.18	0.15	0.16
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.04
Total cold run time: 98.18 s
Total hot run time: 28.38 s

@JNSimba
Member Author

JNSimba commented Feb 4, 2026

run p0

Copilot AI left a comment

Pull request overview

This PR adds support for load.max_filter_ratio and load.strict_mode properties to MySQL/PostgreSQL streaming jobs, enabling error tolerance configuration for data quality monitoring.

Changes:

  • Added data quality monitoring with configurable filter ratio thresholds using a sliding window approach
  • Introduced LoadStatistic class to track filtered rows, loaded rows, and load bytes
  • Modified target properties validation to support load properties prefix
  • Refactored statistics tracking from scannedBytes to loadBytes and added filteredRows tracking
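
To illustrate the load-properties handling listed above, the sketch below shows one way `load.`-prefixed target properties could be turned into stream load properties. It is written in Python for brevity and is not the code from this PR. The function name `build_stream_load_props` and the prefix-stripping approach are assumptions; the defaults come from the PR description.

```python
LOAD_PREFIX = "load."  # prefix used by load.strict_mode and load.max_filter_ratio


def build_stream_load_props(target_props: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: collect load.* target properties without the prefix."""
    # Defaults taken from the PR description: strict mode off, zero tolerance.
    props = {"strict_mode": "false", "max_filter_ratio": "0"}
    for key, value in target_props.items():
        if key.startswith(LOAD_PREFIX):
            props[key[len(LOAD_PREFIX):]] = value
    return props
```

For example, passing the target properties from the PR description would yield `{"strict_mode": "false", "max_filter_ratio": "1"}`, while properties without the `load.` prefix, such as `table.create.properties.replication_num`, are left out.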

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 13 comments.

File	Description
LoadStatistic.java	New class to track load statistics (filtered/loaded rows, bytes)
HttpPutBuilder.java	Changed properties parameter type from Properties to Map<String, String>
DorisBatchStreamLoad.java	Added load statistics tracking and stream load properties support
PipelineCoordinator.java	Updated to pass LoadStatistic object in commitOffset
StreamingJobUtils.java	Moved TABLE_PROPS_PREFIX constant to DataSourceConfigKeys
StreamingMultiTblTask.java	Generate stream load properties based on max_filter_ratio and strict_mode
StreamingJobStatistic.java	Added filteredRows field
StreamingJobSchedulerTask.java	Initialize sampleStartTime when job transitions to RUNNING
StreamingInsertJob.java	Implemented checkDataQuality method with sliding window monitoring
DataSourceConfigValidator.java	Updated to allow load properties prefix in target validation
StreamingJobAction.java	Removed CommitOffsetRequest inner class (moved to separate file)
DorisParser.g4	Made sourceProperties optional in jobFromToClause grammar
CommitOffsetRequest.java	New file with fields for filtered/loaded rows and load bytes
DataSourceConfigKeys.java	Added TABLE_PROPS_PREFIX and LOAD_PROPERTIES constants
WriteRecordRequest.java	Added streamLoadProps field, removed unused abstract methods
JobBaseRecordRequest.java	Removed unused abstract methods
FetchRecordRequest.java	Removed unused method implementations
Test files	Updated to parse JSON statistics and verify new fields; adjusted expected byte counts

@JNSimba
Member Author

JNSimba commented Feb 4, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 32501 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b95a3b7b990e8c9d6abc2e825c51393d3d6ed0dc, data reload: false

------ Round 1 ----------------------------------
q1	17667	5365	5078	5078
q2	2031	321	217	217
q3	10185	1387	783	783
q4	10213	898	365	365
q5	7534	2292	2053	2053
q6	221	187	152	152
q7	898	737	631	631
q8	9279	1562	1200	1200
q9	5501	4913	4867	4867
q10	6950	1977	1582	1582
q11	538	302	266	266
q12	380	401	243	243
q13	17781	4136	3261	3261
q14	253	241	218	218
q15	938	833	809	809
q16	692	687	617	617
q17	655	876	567	567
q18	6910	6637	6459	6459
q19	1100	1115	675	675
q20	414	366	234	234
q21	2953	2322	1944	1944
q22	362	320	280	280
Total cold run time: 103455 ms
Total hot run time: 32501 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5412	5391	5398	5391
q2	269	357	252	252
q3	2242	2733	2264	2264
q4	1421	1829	1343	1343
q5	4311	4259	4869	4259
q6	297	204	154	154
q7	2194	2006	1865	1865
q8	2638	2552	2479	2479
q9	7709	7702	7819	7702
q10	2961	3062	2730	2730
q11	583	462	448	448
q12	740	755	616	616
q13	3861	4293	3712	3712
q14	304	331	292	292
q15	870	822	796	796
q16	685	703	725	703
q17	1263	1449	1428	1428
q18	8266	8048	7913	7913
q19	934	895	902	895
q20	2176	2222	1980	1980
q21	4970	4492	4408	4408
q22	588	529	510	510
Total cold run time: 54694 ms
Total hot run time: 52140 ms

@doris-robot

ClickBench: Total hot run time: 28.32 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b95a3b7b990e8c9d6abc2e825c51393d3d6ed0dc, data reload: false

query1	0.05	0.05	0.05
query2	0.10	0.04	0.05
query3	0.25	0.08	0.09
query4	1.61	0.11	0.12
query5	0.27	0.25	0.24
query6	1.16	0.67	0.68
query7	0.04	0.03	0.03
query8	0.05	0.04	0.04
query9	0.56	0.50	0.48
query10	0.55	0.53	0.54
query11	0.15	0.09	0.09
query12	0.15	0.11	0.11
query13	0.63	0.60	0.62
query14	1.05	1.07	1.06
query15	0.88	0.85	0.88
query16	0.39	0.38	0.38
query17	1.11	1.13	1.13
query18	0.23	0.21	0.22
query19	2.11	1.97	2.09
query20	0.03	0.01	0.01
query21	15.39	0.27	0.15
query22	5.06	0.06	0.05
query23	16.03	0.30	0.10
query24	1.00	0.70	0.31
query25	0.07	0.10	0.07
query26	0.14	0.13	0.14
query27	0.06	0.06	0.05
query28	3.11	1.15	0.97
query29	12.60	3.91	3.16
query30	0.28	0.13	0.13
query31	2.83	0.65	0.41
query32	3.24	0.59	0.51
query33	3.16	3.24	3.22
query34	16.21	5.38	4.73
query35	4.85	4.81	4.78
query36	0.66	0.50	0.49
query37	0.12	0.07	0.06
query38	0.08	0.05	0.05
query39	0.04	0.04	0.03
query40	0.20	0.17	0.15
query41	0.09	0.04	0.03
query42	0.04	0.02	0.02
query43	0.05	0.04	0.04
Total cold run time: 96.68 s
Total hot run time: 28.32 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 55.07% (38/69) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Member Author

JNSimba commented Feb 5, 2026

run cloud_p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 55.07% (38/69) 🎉
Increment coverage report
Complete coverage report

Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Feb 5, 2026
@github-actions
Contributor

github-actions bot commented Feb 5, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Feb 5, 2026

PR approved by anyone and no changes requested.

Contributor

@sollhui sollhui left a comment

LGTM

@JNSimba JNSimba merged commit d655b4b into apache:master Feb 5, 2026
30 of 31 checks passed
github-actions bot pushed a commit that referenced this pull request Feb 5, 2026
…l/pg streaming job (#60473)

yiguolei pushed a commit that referenced this pull request Feb 5, 2026
…ode for mysql/pg streaming job #60473 (#60527)

Cherry-picked from #60473

Co-authored-by: wudi <wudi@selectdb.com>

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.4-merged, reviewed


6 participants