
[fix](be) Remove original predicates after dict rewrite#63199

Closed
Gabriel39 wants to merge 2 commits into apache:master from Gabriel39:fix_013


@Gabriel39
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Parquet dictionary filtering rewrites string predicates to dictionary id predicates, but the original slot predicates could remain in the filter conjunct list. During lazy materialization the predicate column is decoded as INT32 dictionary ids, so executing the original string IN predicate can assert-cast ColumnInt32 to ColumnString and abort. Remove the original slot predicates after a successful dictionary predicate rewrite so only the rewritten int predicates are executed in the dictionary-id phase.
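
To make the failure mode above concrete, here is a minimal, self-contained C++ sketch of the dictionary rewrite. It is not Doris code: `Dictionary`, `rewrite_in_to_dict_codes`, and `filter_by_codes` are hypothetical names invented for illustration, and the real BE operates on column and expression-context types rather than plain vectors. The sketch assumes the string IN-list is evaluated once against the dictionary, after which only integer codes are compared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a Parquet dictionary page: code i maps to dict[i].
using Dictionary = std::vector<std::string>;

// Evaluate a string IN-list once against the dictionary and return the
// matching dictionary codes. This models the "dictionary rewrite" step.
std::set<int32_t> rewrite_in_to_dict_codes(const Dictionary& dict,
                                           const std::set<std::string>& in_list) {
    std::set<int32_t> codes;
    for (std::size_t i = 0; i < dict.size(); ++i) {
        if (in_list.count(dict[i]) > 0) {
            codes.insert(static_cast<int32_t>(i));
        }
    }
    return codes;
}

// Filter a column of still-encoded dictionary ids with the rewritten int
// predicate. No string value is decoded here.
std::vector<bool> filter_by_codes(const std::vector<int32_t>& encoded_ids,
                                  const std::set<int32_t>& codes) {
    std::vector<bool> keep;
    keep.reserve(encoded_ids.size());
    for (int32_t id : encoded_ids) {
        keep.push_back(codes.count(id) > 0);
    }
    return keep;
}
```

In this model the predicate column holds only `int32_t` ids during the filtering phase, so a predicate that still expects string values has nothing valid to read. That is the mismatch the PR describes, and it is why only the rewritten integer predicate should remain once the rewrite succeeds.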

Release note

Fix BE crash when Parquet dictionary filtering executes string IN predicates during lazy materialization.

Check List (For Author)

  • Test: Manual test

    • Ran build-support/clang-format.sh for the touched BE files and git diff --check.
  • Behavior changed: No

  • Does this need documentation: No

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39
Contributor Author

/review

Contributor

@github-actions github-actions Bot left a comment


I found one blocking review issue: this BE scan-path crash fix changes predicate execution behavior but has no regression coverage. The implementation itself is small and aligned with the stated goal: after dictionary predicate rewrite succeeds, the original string predicate is removed from the predicate-phase conjunct list so it is not evaluated against ColumnInt32 dictionary ids.

Critical checkpoint conclusions:

  • Goal/test: The code appears to address the stated crash path, but there is no automated test proving the lazy-materialization + Parquet string dictionary IN predicate case no longer crashes.
  • Scope: The code change is focused and minimal.
  • Concurrency/lifecycle: No new concurrency, shared mutable state, or non-obvious lifecycle management is introduced.
  • Config/compatibility: No new config, storage format, RPC, or compatibility concern.
  • Parallel paths: Existing Parquet dictionary-filter tests exist, but the lazy materialization path that triggered this bug is not covered by this PR.
  • Error handling: The new invariant check is consistent with the preceding lookup; no ignored Status was found.
  • Data correctness/performance: Removing the original predicate after successful rewrite is logically consistent with the dictionary-id filtering phase and avoids redundant/crashing evaluation. The erase cost is negligible for the small conjunct vector.
  • Observability: No new observability seems necessary for this narrow fix.

User focus: No additional user-provided focus points were present.

// 4. Rewrite conjuncts.
RETURN_IF_ERROR(_rewrite_dict_conjuncts(
        dict_codes, slot_id, temp_block.get_by_position(dict_pos).column->is_nullable()));
_remove_slot_filter_conjuncts(slot_id);
Contributor


This fixes a BE crash path, but the PR only records manual testing and does not add an automated regression. Please add coverage for the exact scenario: Parquet dictionary filtering on a string IN predicate with lazy materialization enabled and at least one lazy-read column, so the predicate phase reads dictionary ids and would previously execute the original string predicate against ColumnInt32. There is already related coverage in regression-test/suites/external_table_p0/hive/test_string_dict_filter.groovy and test_parquet_lazy_mat_profile.groovy; extending one of those would prevent this crash from regressing.
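
For readers following the thread, the removal step in the hunk under review can be pictured with the sketch below. It is an assumption about the general shape of such a helper, not the PR's actual implementation: `ToyConjunct`, `ToyFilterState`, and its members are hypothetical stand-ins, and the real `_remove_slot_filter_conjuncts` works on Doris expression contexts whose details are not shown in this conversation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an expression context; not the Doris type.
struct ToyConjunct {
    std::string description;
};
using ToyConjunctPtr = std::shared_ptr<ToyConjunct>;

struct ToyFilterState {
    // All conjuncts evaluated in the predicate phase.
    std::vector<ToyConjunctPtr> filter_conjuncts;
    // Single-slot conjuncts grouped by the slot they reference (assumed layout).
    std::map<int, std::vector<ToyConjunctPtr>> slot_id_to_conjuncts;

    // Drop every original conjunct registered for `slot_id` from the
    // predicate-phase list. Returns the number of conjuncts removed.
    std::size_t remove_slot_filter_conjuncts(int slot_id) {
        auto it = slot_id_to_conjuncts.find(slot_id);
        if (it == slot_id_to_conjuncts.end()) {
            return 0;
        }
        const std::vector<ToyConjunctPtr>& originals = it->second;
        const std::size_t before = filter_conjuncts.size();
        // Erase-remove idiom: one pass over the (small) conjunct list.
        filter_conjuncts.erase(
                std::remove_if(filter_conjuncts.begin(), filter_conjuncts.end(),
                               [&originals](const ToyConjunctPtr& c) {
                                   return std::find(originals.begin(), originals.end(), c) !=
                                          originals.end();
                               }),
                filter_conjuncts.end());
        return before - filter_conjuncts.size();
    }
};
```

A unit-level check in this style would register a string IN conjunct for one slot, call the removal after a successful rewrite, and assert that conjuncts for other slots are untouched. It would complement, not replace, the end-to-end regression case requested above, which exercises the real lazy-materialization read path.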

@Gabriel39
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29405 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b07cfe52d094bb23ff9becb551b4c9211066b0dc, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17641	3851	3832	3832
q2	q3	10703	863	597	597
q4	4664	451	349	349
q5	7432	1341	1135	1135
q6	189	164	139	139
q7	908	937	760	760
q8	9323	1391	1266	1266
q9	5608	5387	5337	5337
q10	6262	2077	1812	1812
q11	472	265	256	256
q12	633	414	288	288
q13	18090	3339	2725	2725
q14	294	283	269	269
q15	q16	900	848	785	785
q17	948	981	716	716
q18	6474	5690	5547	5547
q19	1159	1175	1019	1019
q20	525	402	260	260
q21	4683	2452	1980	1980
q22	523	393	333	333
Total cold run time: 97431 ms
Total hot run time: 29405 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4793	4768	4796	4768
q2	q3	4703	4773	4189	4189
q4	2125	2179	1421	1421
q5	4989	5087	5295	5087
q6	203	173	133	133
q7	2046	1777	1638	1638
q8	3352	3097	3104	3097
q9	8357	8450	8471	8450
q10	4543	4511	4278	4278
q11	602	413	413	413
q12	718	756	517	517
q13	3217	3593	2891	2891
q14	329	317	273	273
q15	q16	776	788	791	788
q17	1330	1282	1241	1241
q18	7899	7122	7102	7102
q19	1202	1152	1163	1152
q20	2219	2279	1923	1923
q21	6092	5467	4866	4866
q22	558	520	414	414
Total cold run time: 60053 ms
Total hot run time: 54641 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172268 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b07cfe52d094bb23ff9becb551b4c9211066b0dc, data reload: false

query5	4326	655	524	524
query6	336	234	206	206
query7	4210	550	311	311
query8	319	224	218	218
query9	8852	4117	4079	4079
query10	452	334	312	312
query11	5799	2437	2233	2233
query12	180	129	132	129
query13	1318	608	452	452
query14	6022	5398	5100	5100
query14_1	4393	4374	4383	4374
query15	213	207	186	186
query16	1029	468	438	438
query17	1140	781	642	642
query18	2723	487	366	366
query19	228	211	166	166
query20	143	132	133	132
query21	216	140	122	122
query22	13606	14008	14552	14008
query23	17399	16564	16203	16203
query23_1	16252	16330	16329	16329
query24	7409	1743	1352	1352
query24_1	1365	1327	1356	1327
query25	540	536	452	452
query26	1298	314	166	166
query27	2706	607	331	331
query28	4389	1960	1955	1955
query29	1005	650	506	506
query30	297	234	198	198
query31	1090	1060	940	940
query32	93	76	70	70
query33	538	344	292	292
query34	1138	1141	652	652
query35	776	780	678	678
query36	1318	1310	1202	1202
query37	142	99	84	84
query38	3201	3110	3078	3078
query39	918	906	891	891
query39_1	878	877	909	877
query40	231	151	141	141
query41	64	62	60	60
query42	110	107	106	106
query43	323	325	279	279
query44	
query45	215	196	191	191
query46	1054	1206	748	748
query47	2290	2269	2263	2263
query48	407	429	313	313
query49	624	542	421	421
query50	698	282	221	221
query51	4381	4323	4196	4196
query52	105	111	94	94
query53	251	294	203	203
query54	310	268	254	254
query55	91	88	82	82
query56	302	300	293	293
query57	1426	1404	1310	1310
query58	286	280	263	263
query59	1573	1638	1481	1481
query60	347	336	320	320
query61	160	159	157	157
query62	658	617	571	571
query63	244	209	202	202
query64	2428	821	676	676
query65	
query66	1689	525	390	390
query67	30097	30039	29915	29915
query68	
query69	465	339	313	313
query70	1053	1041	999	999
query71	315	270	271	270
query72	2958	2865	2544	2544
query73	870	803	417	417
query74	5071	4926	4751	4751
query75	2801	2665	2335	2335
query76	2278	1146	845	845
query77	444	467	375	375
query78	12887	12844	12276	12276
query79	1492	1047	753	753
query80	717	626	519	519
query81	462	278	247	247
query82	1359	163	124	124
query83	390	280	253	253
query84	266	139	112	112
query85	863	515	447	447
query86	384	372	324	324
query87	3383	3355	3189	3189
query88	3539	2691	2684	2684
query89	420	381	334	334
query90	1926	173	182	173
query91	181	172	144	144
query92	88	81	72	72
query93	936	959	549	549
query94	536	346	300	300
query95	654	366	346	346
query96	1007	783	336	336
query97	2723	2717	2598	2598
query98	237	229	234	229
query99	1100	1097	960	960
Total cold run time: 253303 ms
Total hot run time: 172268 ms

@Gabriel39 Gabriel39 closed this May 19, 2026