[fix](scan) Fix missing predicate filter when Native and JNI readers are mixed in FileScanner #61929

Merged
morningman merged 3 commits into apache:master from HYDCP:fix-paimon-pushdown-bug on Apr 3, 2026

@wenzhenghu
Contributor

What problem does this PR solve?

Problem Summary:
When a query on a Paimon table (or another external table) has a predicate that cannot be pushed down (e.g., LIKE '%*%'), and a single FileScanner instance processes Native splits (Parquet/ORC) and then JNI splits, the rows returned by the JNI reader skip the fallback filtering at the Scanner layer. Rows that do not match the predicate then leak into the final result.

Root Cause:

  1. When FileScanner prepares to read a Native split, it calls _process_late_arrival_conjuncts(), which assigns _conjuncts to _push_down_conjuncts.
  2. At the end of that logic it mistakenly calls _conjuncts.clear(), wiping out the shared scanner-level fallback _conjuncts.
  3. When the scanner subsequently processes a JNI split (which does not trigger the push-down logic), Scanner::_filter_output_block() finds _conjuncts empty, so predicates such as LIKE are bypassed entirely.
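
The sequence above can be reproduced in a small toy model. This is only a sketch under stated assumptions: `ToyScanner`, `keep_matching`, `read_native_split`, `read_jni_split`, and the `clear_after_push_down` flag are hypothetical names invented for illustration, rows are plain strings, and the predicate "contains `*`" stands in for `LIKE '%*%'`. It is not the Doris implementation.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; none of these are the real Doris types.
using Row = std::string;
using Predicate = std::function<bool(const Row&)>;

// Keeps only the rows that satisfy every predicate (all rows if there are none).
std::vector<Row> keep_matching(const std::vector<Predicate>& preds,
                               const std::vector<Row>& rows) {
    std::vector<Row> out;
    for (const Row& row : rows) {
        bool keep = true;
        for (const Predicate& p : preds) keep = keep && p(row);
        if (keep) out.push_back(row);
    }
    return out;
}

struct ToyScanner {
    std::vector<Predicate> conjuncts;            // scanner-level fallback filter
    std::vector<Predicate> push_down_conjuncts;  // handed to the native reader
    bool clear_after_push_down = false;          // true models the pre-fix behavior

    // Native path: prepare push-down, filter in the reader, then at scanner level.
    std::vector<Row> read_native_split(const std::vector<Row>& rows) {
        push_down_conjuncts = conjuncts;
        if (clear_after_push_down) conjuncts.clear();  // the bug: fallback is lost
        return keep_matching(conjuncts, keep_matching(push_down_conjuncts, rows));
    }

    // JNI path: no push-down step, so only the scanner-level fallback applies.
    std::vector<Row> read_jni_split(const std::vector<Row>& rows) const {
        return keep_matching(conjuncts, rows);
    }
};

// Reads one native split, then one JNI split, and returns the JNI rows.
std::vector<Row> jni_rows_after_native_split(bool clear_after_push_down) {
    ToyScanner scanner;
    scanner.clear_after_push_down = clear_after_push_down;
    scanner.conjuncts.push_back(
            [](const Row& r) { return r.find('*') != std::string::npos; });
    scanner.read_native_split({"a*b", "abc"});
    return scanner.read_jni_split({"x*y", "xyz"});
}

void check_mixed_splits() {
    // With the clear() in place, the non-matching row "xyz" leaks through.
    assert(jni_rows_after_native_split(true).size() == 2);
    // Without it, only the matching row survives.
    assert(jni_rows_after_native_split(false).size() == 1);
}
```

In this model the native split is still filtered correctly in both modes, because the reader holds its own copy of the predicates. Only the later JNI split is affected, which matches the symptom described above.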

Solution:
Remove the _conjuncts.clear() call in _process_late_arrival_conjuncts(). This ensures that _conjuncts is always retained as the final fallback filter at the Scanner level, regardless of how underlying readers execute their own push-down predicates.

Added a BE unit test process_late_arrival_conjuncts_retain to prevent regression.
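
As a rough illustration of the invariant such a regression test might guard, the sketch below checks that the scanner-level predicates survive repeated push-down preparation. It assumes a hypothetical `ToyFileScanner` with string rows and is not the actual `process_late_arrival_conjuncts_retain` test, whose contents are not shown in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a scanner; not the real Doris FileScanner.
using Predicate = std::function<bool(const std::string&)>;

struct ToyFileScanner {
    std::vector<Predicate> conjuncts;            // scanner-level fallback
    std::vector<Predicate> push_down_conjuncts;  // copy given to native readers

    // Post-fix behavior in this model: copy the predicates, keep the fallback.
    void process_late_arrival_conjuncts() { push_down_conjuncts = conjuncts; }
};

// Returns true when the fallback survives preparation for several native
// splits in a row, as when one scanner handles many splits.
bool conjuncts_retained(std::size_t native_splits) {
    ToyFileScanner scanner;
    scanner.conjuncts.push_back(
            [](const std::string& row) { return row.find('*') != std::string::npos; });
    const std::size_t expected = scanner.conjuncts.size();
    for (std::size_t i = 0; i < native_splits; ++i) {
        scanner.process_late_arrival_conjuncts();
        if (scanner.conjuncts.size() != expected ||
            scanner.push_down_conjuncts.size() != expected) {
            return false;
        }
    }
    return true;
}

void check_conjuncts_retained() {
    assert(conjuncts_retained(1));
    assert(conjuncts_retained(3));
}
```

Checking the size after every call, rather than once at the end, would catch a regression that clears the fallback on any individual split.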

Release note

Fix a correctness issue where complex string predicates (like LIKE) might fail to filter dirty data when querying external tables with mixed native and JNI splits.

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wenzhenghu
Contributor Author

run buildall

@wenzhenghu
Contributor Author

/review

@doris-robot

TPC-H: Total hot run time: 26596 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 118e99f43912c587c0d89e620a146f0d389a111e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17630	4500	4284	4284
q2	q3	10648	753	511	511
q4	4672	347	251	251
q5	7557	1211	1032	1032
q6	173	172	145	145
q7	770	835	695	695
q8	9385	1469	1361	1361
q9	4897	4681	4672	4672
q10	6336	1916	1650	1650
q11	471	251	244	244
q12	747	576	460	460
q13	18039	2668	1957	1957
q14	225	225	213	213
q15	q16	743	729	673	673
q17	736	812	480	480
q18	6287	5418	5256	5256
q19	1115	992	616	616
q20	536	506	380	380
q21	4612	1819	1398	1398
q22	423	457	318	318
Total cold run time: 96002 ms
Total hot run time: 26596 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4792	4695	4571	4571
q2	q3	3980	4328	3858	3858
q4	861	1169	772	772
q5	4060	4354	4347	4347
q6	179	171	136	136
q7	1762	1612	1531	1531
q8	2475	2724	2616	2616
q9	7598	7471	7543	7471
q10	3780	3948	3635	3635
q11	514	441	415	415
q12	477	580	450	450
q13	2420	2969	2142	2142
q14	317	312	278	278
q15	q16	743	756	724	724
q17	1170	1363	1334	1334
q18	7295	6769	6693	6693
q19	930	897	905	897
q20	2102	2215	2007	2007
q21	3943	3451	3401	3401
q22	455	442	382	382
Total cold run time: 49853 ms
Total hot run time: 47660 ms

@wenzhenghu
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 169796 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 118e99f43912c587c0d89e620a146f0d389a111e, data reload: false

query5	4369	671	516	516
query6	349	233	214	214
query7	4225	464	262	262
query8	346	244	227	227
query9	8711	2768	2779	2768
query10	536	408	358	358
query11	6968	5103	4887	4887
query12	181	129	124	124
query13	1272	473	349	349
query14	5769	3800	3526	3526
query14_1	2936	2902	2921	2902
query15	216	195	175	175
query16	1015	471	462	462
query17	1125	734	630	630
query18	2501	460	356	356
query19	231	216	191	191
query20	135	126	124	124
query21	218	142	112	112
query22	13320	14387	14367	14367
query23	16895	16479	15929	15929
query23_1	16018	16081	15847	15847
query24	7198	1595	1209	1209
query24_1	1227	1249	1229	1229
query25	621	469	412	412
query26	1241	264	148	148
query27	2768	489	298	298
query28	4448	1842	1852	1842
query29	861	578	487	487
query30	305	232	193	193
query31	1019	928	880	880
query32	76	72	69	69
query33	527	340	293	293
query34	884	877	523	523
query35	668	680	590	590
query36	1127	1162	983	983
query37	137	94	82	82
query38	2965	2921	2876	2876
query39	870	839	813	813
query39_1	796	806	781	781
query40	234	151	136	136
query41	62	60	59	59
query42	262	258	256	256
query43	241	268	230	230
query44	
query45	195	187	187	187
query46	891	996	606	606
query47	2131	2140	2116	2116
query48	331	332	239	239
query49	659	474	390	390
query50	716	277	219	219
query51	4184	4073	4165	4073
query52	275	270	260	260
query53	301	341	294	294
query54	318	295	300	295
query55	95	91	88	88
query56	330	349	327	327
query57	1989	1926	1833	1833
query58	296	278	284	278
query59	2796	2960	2789	2789
query60	351	333	326	326
query61	163	159	163	159
query62	628	594	536	536
query63	314	278	277	277
query64	5051	1293	1025	1025
query65	
query66	1457	460	374	374
query67	24091	24223	24091	24091
query68	
query69	410	321	295	295
query70	942	933	963	933
query71	353	313	302	302
query72	2895	2693	2435	2435
query73	567	557	319	319
query74	9644	9555	9377	9377
query75	2864	2756	2478	2478
query76	2309	1028	695	695
query77	383	399	312	312
query78	10934	11069	10462	10462
query79	3381	781	570	570
query80	1757	632	546	546
query81	585	267	232	232
query82	941	162	131	131
query83	356	296	255	255
query84	262	125	96	96
query85	918	527	467	467
query86	560	309	295	295
query87	3219	3110	3019	3019
query88	4854	2669	2671	2669
query89	446	371	349	349
query90	2094	189	183	183
query91	172	168	138	138
query92	90	80	69	69
query93	2753	846	498	498
query94	637	314	285	285
query95	591	353	319	319
query96	662	533	230	230
query97	2463	2493	2373	2373
query98	248	229	220	220
query99	1012	1018	932	932
Total cold run time: 257309 ms
Total hot run time: 169796 ms

@doris-robot

TPC-H: Total hot run time: 26719 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 415832f7ea3d03d9eff216a040a6199fa00f5cda, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17604	4406	4297	4297
q2	q3	10639	786	522	522
q4	4704	361	248	248
q5	7891	1223	1023	1023
q6	217	172	150	150
q7	836	843	670	670
q8	10736	1512	1389	1389
q9	6543	4808	4744	4744
q10	6346	1955	1668	1668
q11	498	245	244	244
q12	745	578	458	458
q13	18034	2739	1930	1930
q14	227	235	210	210
q15	q16	736	731	677	677
q17	740	869	421	421
q18	6334	5333	5226	5226
q19	1116	1001	624	624
q20	549	497	378	378
q21	4507	2221	1552	1552
q22	410	322	288	288
Total cold run time: 99412 ms
Total hot run time: 26719 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4738	4554	4729	4554
q2	q3	3885	4370	3832	3832
q4	907	1225	770	770
q5	4089	4443	4360	4360
q6	183	179	139	139
q7	1836	1686	1555	1555
q8	2542	2797	2652	2652
q9	7790	7435	7418	7418
q10	3792	4035	3649	3649
q11	521	436	421	421
q12	545	683	444	444
q13	2461	2929	2026	2026
q14	281	294	273	273
q15	q16	694	795	747	747
q17	1228	1417	1380	1380
q18	7265	6924	6705	6705
q19	897	898	918	898
q20	2185	2204	2087	2087
q21	4021	3508	3368	3368
q22	449	433	387	387
Total cold run time: 50309 ms
Total hot run time: 47665 ms

@doris-robot

TPC-DS: Total hot run time: 168918 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 415832f7ea3d03d9eff216a040a6199fa00f5cda, data reload: false

query5	4338	642	509	509
query6	333	218	197	197
query7	4213	458	261	261
query8	348	239	223	223
query9	8727	2709	2692	2692
query10	535	405	340	340
query11	6878	5078	4872	4872
query12	185	130	131	130
query13	1286	453	347	347
query14	5769	3703	3493	3493
query14_1	2864	2827	2817	2817
query15	210	193	179	179
query16	1001	464	458	458
query17	920	727	618	618
query18	2455	456	353	353
query19	225	219	187	187
query20	135	127	124	124
query21	218	133	108	108
query22	13246	14890	14572	14572
query23	16675	16493	15881	15881
query23_1	16459	15722	15684	15684
query24	7237	1612	1233	1233
query24_1	1227	1256	1242	1242
query25	607	452	397	397
query26	1243	261	147	147
query27	2782	481	296	296
query28	4467	1839	1831	1831
query29	848	547	478	478
query30	292	245	191	191
query31	1011	950	877	877
query32	82	73	70	70
query33	507	331	288	288
query34	882	887	519	519
query35	649	671	597	597
query36	1050	1115	997	997
query37	132	91	80	80
query38	2936	2946	2851	2851
query39	856	821	819	819
query39_1	802	809	797	797
query40	233	149	139	139
query41	62	58	58	58
query42	264	260	250	250
query43	238	247	226	226
query44	
query45	206	190	183	183
query46	926	979	600	600
query47	2103	2144	2066	2066
query48	323	309	225	225
query49	637	470	385	385
query50	688	278	219	219
query51	4149	4045	3972	3972
query52	269	264	258	258
query53	290	347	294	294
query54	304	269	266	266
query55	94	87	81	81
query56	319	318	308	308
query57	1930	1711	1596	1596
query58	304	275	267	267
query59	2824	2944	2755	2755
query60	329	329	317	317
query61	156	156	156	156
query62	621	584	529	529
query63	311	280	272	272
query64	5080	1268	1038	1038
query65	
query66	1466	455	351	351
query67	24153	24286	24119	24119
query68	
query69	406	312	286	286
query70	990	981	950	950
query71	328	303	289	289
query72	2874	2668	2437	2437
query73	544	544	314	314
query74	9618	9586	9375	9375
query75	2859	2765	2465	2465
query76	2327	1028	716	716
query77	369	398	306	306
query78	10990	11082	10435	10435
query79	2952	768	573	573
query80	1714	621	535	535
query81	581	261	230	230
query82	978	159	118	118
query83	341	262	252	252
query84	300	119	106	106
query85	914	497	442	442
query86	497	329	325	325
query87	3169	3080	2994	2994
query88	3554	2654	2634	2634
query89	415	365	341	341
query90	2142	188	176	176
query91	171	170	135	135
query92	84	77	73	73
query93	1790	821	497	497
query94	644	323	279	279
query95	596	391	311	311
query96	651	518	228	228
query97	2493	2477	2438	2438
query98	235	219	215	215
query99	1008	970	920	920
Total cold run time: 253623 ms
Total hot run time: 168918 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.90% (19955/37722)
Line Coverage 36.44% (187214/513819)
Region Coverage 32.66% (145066/444148)
Branch Coverage 33.84% (63609/187943)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.57% (27176/36940)
Line Coverage 57.15% (292779/512274)
Region Coverage 54.39% (243812/448265)
Branch Coverage 56.10% (105751/188509)

github-actions bot added the "approved" label (indicates a PR has been approved by one committer) on Apr 2, 2026
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

PR approved by anyone and no changes requested.

@morningman morningman closed this Apr 3, 2026
@morningman morningman reopened this Apr 3, 2026
@morningman morningman merged commit 1ce2e40 into apache:master Apr 3, 2026
44 of 46 checks passed
github-actions bot pushed a commit that referenced this pull request Apr 3, 2026
…are mixed in FileScanner (#61929)

morningman pushed a commit that referenced this pull request Apr 3, 2026
suxiaogang223 pushed a commit to suxiaogang223/doris that referenced this pull request Apr 3, 2026
…are mixed in FileScanner (apache#61929)

yiguolei pushed a commit that referenced this pull request Apr 3, 2026
…JNI readers are mixed in FileScanner #61929 (#62078)

Cherry-picked from #61929

Co-authored-by: Wen Zhenghu <wenzhenghu.zju@gmail.com>
Labels: approved, dev/3.1.x, dev/4.0.6-merged, dev/4.1.0-merged, reviewed
