
[fix](be) Restrict varbinary predicate block to file scan #64019

Merged
BiteTheDDDDt merged 1 commit into apache:master from BiteTheDDDDt:codex/restrict-varbinary-file-scan on Jun 3, 2026

@BiteTheDDDDt
Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: The scan operator unconditionally skipped pushdown of column predicates and TopN runtime predicates on VARBINARY columns. The guard was originally introduced to work around predicate limitations in the external Parquet/file scan reader, so applying it in the shared scan path also blocked non-file scans. This change adds a scan-operator hook that reports whether a column predicate can be pushed down, keeps the default permissive, and makes FileScanOperatorX reject VARBINARY column predicates.
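A minimal sketch of the hook described in the summary, not the actual patch. The `Sketch`-suffixed classes and the `ColumnType` enum are hypothetical stand-ins, and the signature of `can_push_down_column_predicate()` is an assumption; only the method name (mentioned later in this conversation) and the behavior (permissive default, file scan rejects VARBINARY) come from the PR text.

```cpp
#include <cassert>

// Hypothetical stand-in for the BE's column type enum; only the values needed here.
enum class ColumnType { INT, STRING, VARBINARY };

// Shared scan operator: the capability hook is permissive by default.
class ScanOperatorSketch {
public:
    virtual ~ScanOperatorSketch() = default;

    // Assumed signature: may a predicate on a column of this type be pushed
    // down to the reader as a column predicate?
    virtual bool can_push_down_column_predicate(ColumnType /*type*/) const { return true; }
};

// Non-file scans inherit the permissive default.
class OlapScanOperatorSketch : public ScanOperatorSketch {};

// File scans reject VARBINARY because of the external reader's limitation.
class FileScanOperatorSketch : public ScanOperatorSketch {
public:
    bool can_push_down_column_predicate(ColumnType type) const override {
        return type != ColumnType::VARBINARY;
    }
};

// Small usage example: the decision depends on the concrete operator type.
inline void check_hook_sketch() {
    FileScanOperatorSketch file_scan;
    OlapScanOperatorSketch olap_scan;
    const ScanOperatorSketch& base = file_scan;
    assert(!base.can_push_down_column_predicate(ColumnType::VARBINARY));
    assert(base.can_push_down_column_predicate(ColumnType::STRING));
    assert(olap_scan.can_push_down_column_predicate(ColumnType::VARBINARY));
}
```

Keeping the default permissive means a scan type only needs an override when its reader has a known limitation.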

Release note

None

Check List (For Author)

  • Test: Manual test
    • ./build.sh --be
    • build-support/check-format.sh
    • build-support/run-clang-tidy.sh (fails due to existing jni-util.h static_assert analysis errors and pre-existing scan_operator.cpp complexity/style diagnostics)
  • Behavior changed: Yes. VARBINARY predicate pushdown is now blocked only for file scans.
  • Does this need documentation: No

Copilot AI review requested due to automatic review settings June 2, 2026 08:45
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt
Contributor Author

/review

@BiteTheDDDDt
Contributor Author

run buildall

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@github-actions github-actions Bot left a comment

Summary: Requesting changes for a TopN runtime predicate regression in file scans. The PR goal is clear and the hook-based split between shared scan logic and file-scan limitations is focused, but the current target initialization path disables VARBINARY TopN filters entirely for file scans instead of only preventing reader/column-predicate pushdown.

Critical checkpoint conclusions:

  • Goal/test: The change partially achieves the goal, but the file-scan VARBINARY TopN case is not covered and regresses runtime filter application. Only manual testing is reported; no targeted regression/BE test proves the affected path.
  • Scope: The code is small and focused.
  • Concurrency/lifecycle: No new threads, locks, global/static lifecycle, or ownership cycles were introduced. Existing RuntimePredicate locking is unchanged.
  • Config/compatibility/protocol: No new config, storage format, or FE-BE protocol changes.
  • Parallel paths: Column predicate normalization and TopN predicate initialization are parallel decisions; the new capability hook is applied to both, but prepare() now conflates "cannot push down as a column predicate" with "do not initialize the target", which is the issue below.
  • Tests/results: No new test or expected result is added. A focused external file-scan VARBINARY TopN/runtime-filter case would catch this.
  • Observability/performance: No new observability required. The issue is performance/behavioral because a safe residual VTopN predicate is not installed.
  • Transaction/data correctness: Not applicable; no transaction, persistence, or data write path changes.
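The distinction drawn in the "Parallel paths" point can be sketched as two independent decisions per TopN target. This is an illustration only: `TargetDecision`, `decide_conflated`, and `decide_separated` are hypothetical names, and the real prepare() logic is not reproduced here.

```cpp
#include <cassert>

// Hypothetical summary of what happens to one TopN runtime predicate target.
struct TargetDecision {
    bool target_initialized;     // the scan operator can still evaluate the residual predicate
    bool pushed_as_column_pred;  // the reader evaluates it as a column predicate
};

// Behavior flagged in the review: a failed capability check skips the target
// entirely, so no residual predicate is installed either.
inline TargetDecision decide_conflated(bool can_push_down) {
    if (!can_push_down) {
        return {false, false};  // comparable to skipping the target before it is initialized
    }
    return {true, true};
}

// Behavior the review asks for: initialization does not depend on the
// capability check; only the reader-level pushdown does.
inline TargetDecision decide_separated(bool can_push_down) {
    return {true, can_push_down};
}

// Small usage example for a target the reader cannot handle.
inline void check_decisions_sketch() {
    assert(!decide_conflated(false).target_initialized);
    assert(decide_separated(false).target_initialized);
    assert(!decide_separated(false).pushed_as_column_pred);
}
```

For targets that can be pushed down, both variants agree; they differ only for targets such as VARBINARY columns in a file scan.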

User focus: No additional user-provided review focus was supplied.

Comment thread be/src/exec/operator/scan_operator.cpp Outdated
@BiteTheDDDDt BiteTheDDDDt force-pushed the codex/restrict-varbinary-file-scan branch from 11f26e0 to 830a82d Compare June 2, 2026 10:50
@BiteTheDDDDt
Contributor Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (21/21) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.78% (28196/38217)
Line Coverage 57.75% (306837/531277)
Region Coverage 54.53% (257255/471750)
Branch Coverage 55.99% (111491/199118)

@hello-stephen
Contributor

TPC-H: Total hot run time: 29098 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 830a82d272521b339876e1fd9e4be7fc12993a5d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17783	4031	4004	4004
q2	
q3	10761	1348	817	817
q4	4692	477	347	347
q5	7538	900	597	597
q6	180	169	137	137
q7	766	866	640	640
q8	9370	1487	1556	1487
q9	5871	4490	4453	4453
q10	6760	1808	1556	1556
q11	434	272	253	253
q12	627	426	289	289
q13	18100	3387	2727	2727
q14	267	256	243	243
q15	
q16	816	782	712	712
q17	1028	1004	964	964
q18	7329	5762	5493	5493
q19	1502	1287	1052	1052
q20	511	407	267	267
q21	6188	2755	2729	2729
q22	471	374	331	331
Total cold run time: 100994 ms
Total hot run time: 29098 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5082	4820	4666	4666
q2	
q3	4755	5287	4635	4635
q4	2130	2216	1377	1377
q5	4802	4861	4706	4706
q6	239	188	131	131
q7	1832	1751	1606	1606
q8	2425	2102	2096	2096
q9	7939	7564	7364	7364
q10	4745	4653	4201	4201
q11	527	395	355	355
q12	731	738	538	538
q13	2964	3324	2832	2832
q14	281	286	259	259
q15	
q16	678	699	615	615
q17	1291	1261	1254	1254
q18	7195	6813	6874	6813
q19	1120	1099	1104	1099
q20	2232	2218	1962	1962
q21	5280	4590	4460	4460
q22	525	457	400	400
Total cold run time: 56773 ms
Total hot run time: 51369 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169188 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 830a82d272521b339876e1fd9e4be7fc12993a5d, data reload: false

query5	4388	617	456	456
query6	449	193	186	186
query7	4819	563	306	306
query8	371	208	212	208
query9	8749	3990	3978	3978
query10	439	313	270	270
query11	5887	2333	2152	2152
query12	163	106	101	101
query13	1286	644	433	433
query14	6436	5409	5098	5098
query14_1	4424	4425	4318	4318
query15	203	203	174	174
query16	1020	443	450	443
query17	1095	685	575	575
query18	2559	472	349	349
query19	192	179	141	141
query20	119	108	106	106
query21	216	136	111	111
query22	13736	13518	13351	13351
query23	17433	16546	16193	16193
query23_1	16212	16216	16383	16216
query24	7468	1767	1287	1287
query24_1	1314	1303	1308	1303
query25	557	448	375	375
query26	1294	306	160	160
query27	2678	587	335	335
query28	4455	2039	1998	1998
query29	1080	602	490	490
query30	319	230	195	195
query31	1112	1085	965	965
query32	111	62	61	61
query33	548	333	263	263
query34	1201	1110	675	675
query35	757	796	722	722
query36	1352	1414	1302	1302
query37	158	113	95	95
query38	3211	3166	3023	3023
query39	934	907	905	905
query39_1	870	877	881	877
query40	223	127	108	108
query41	71	69	67	67
query42	97	96	100	96
query43	320	316	281	281
query44	
query45	196	181	187	181
query46	1103	1223	745	745
query47	2378	2386	2212	2212
query48	411	426	299	299
query49	642	464	363	363
query50	1004	349	251	251
query51	4305	4367	4232	4232
query52	85	86	75	75
query53	251	264	194	194
query54	263	214	191	191
query55	77	75	69	69
query56	226	222	215	215
query57	1430	1426	1321	1321
query58	242	219	217	217
query59	1535	1662	1417	1417
query60	285	246	231	231
query61	156	157	150	150
query62	693	654	589	589
query63	226	180	190	180
query64	2543	786	643	643
query65	
query66	1762	471	335	335
query67	29037	29710	29520	29520
query68	
query69	418	306	266	266
query70	991	924	908	908
query71	297	224	213	213
query72	3095	2683	2463	2463
query73	840	750	426	426
query74	5157	4934	4778	4778
query75	2657	2551	2228	2228
query76	2337	1142	797	797
query77	345	370	292	292
query78	12392	12387	11905	11905
query79	1309	1063	734	734
query80	561	468	389	389
query81	456	283	248	248
query82	239	153	123	123
query83	267	273	244	244
query84	262	140	115	115
query85	857	535	433	433
query86	341	314	288	288
query87	3353	3328	3175	3175
query88	3598	2759	2719	2719
query89	423	381	330	330
query90	2209	180	185	180
query91	176	166	136	136
query92	64	63	59	59
query93	1595	1385	827	827
query94	525	320	347	320
query95	686	374	345	345
query96	1079	816	351	351
query97	2685	2681	2585	2585
query98	223	228	219	219
query99	1169	1173	1014	1014
Total cold run time: 249904 ms
Total hot run time: 169188 ms

Contributor

@hubgeter hubgeter left a comment

LGTM

@BiteTheDDDDt
Contributor Author

/review

@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 3, 2026
Contributor

@github-actions github-actions Bot left a comment

Review summary: I reviewed the actual GitHub PR diff for the four changed scan operator files and did not find additional blocking issues beyond the already-known inline thread. The current patch no longer uses continue for non-pushdown TopN targets and initializes the runtime predicate target with cid = -1, so the residual TopN predicate path is preserved for file-scan VARBINARY targets.

Critical checkpoint conclusions:

  • Goal/test: The goal is to restrict the VARBINARY column predicate pushdown block to file scans while allowing other scan types. The code implements this via a virtual can_push_down_column_predicate() hook and a FileScanOperatorX override. The PR reports BE build/format/manual checks; I did not run additional tests in this review.
  • Scope/clarity: The change is small and focused on scan predicate pushdown capability routing.
  • Concurrency/lifecycle: No new shared mutable state, thread lifecycle, lock ordering, or dependency objects are introduced.
  • Configuration/compatibility: No new config, storage format, function symbol, or FE/BE protocol compatibility change is introduced.
  • Parallel paths: The generic scan predicate path and file-scan TopN path are both routed through the new capability hook; OLAP keeps its existing key-column TopN restriction.
  • Conditional checks: The file-scan-specific VARBINARY check is localized in FileScanOperatorX::can_push_down_column_predicate() and matches the stated external-reader limitation.
  • Tests: No new automated regression/unit test is added. The PR states manual BE build/style/tidy checks; this is acceptable for the small routing fix but leaves residual risk around external file-scan VARBINARY TopN coverage.
  • Observability/transactions/persistence/data writes: Not applicable; this change affects scan predicate routing only.
  • Performance: No meaningful extra hot-path overhead beyond one virtual capability check during predicate normalization/TopN setup.
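The patched behavior described in the review summary can be illustrated as follows. The sketch assumes that cid = -1 acts as a "no column predicate" sentinel and uses integer values in place of VARBINARY data to stay short; `TopNTargetSketch`, `init_target`, `reader_filters`, and `apply_residual` are hypothetical names, not Doris APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: -1 is read as "no column id", i.e. nothing is handed to the reader.
constexpr int32_t kNoColumnId = -1;

struct TopNTargetSketch {
    int32_t cid;        // column id given to the reader, or kNoColumnId
    int64_t threshold;  // current TopN bound (ascending order: keep values <= threshold)
};

// The target is always initialized; only the column id depends on the capability check.
inline TopNTargetSketch init_target(int32_t column_id, bool can_push_down, int64_t threshold) {
    assert(column_id >= 0);
    return {can_push_down ? column_id : kNoColumnId, threshold};
}

// True when the reader would evaluate the predicate as a column predicate.
inline bool reader_filters(const TopNTargetSketch& target) {
    return target.cid != kNoColumnId;
}

// Residual evaluation in the scan operator; it runs whether or not the reader filtered.
inline std::vector<int64_t> apply_residual(const TopNTargetSketch& target,
                                           const std::vector<int64_t>& values) {
    std::vector<int64_t> kept;
    for (int64_t v : values) {
        if (v <= target.threshold) {
            kept.push_back(v);
        }
    }
    return kept;
}
```

With `init_target(3, false, 10)`, `reader_filters` returns false, yet `apply_residual` still drops values above the bound, which is the residual path the review says is preserved.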

User focus: No additional user-provided review focus was specified.

@BiteTheDDDDt BiteTheDDDDt merged commit 2c0181b into apache:master Jun 3, 2026
31 of 32 checks passed

Labels

approved Indicates a PR has been approved by one committer.

5 participants