[refactor](storage) Drop PredicateColumnType by csun5285 · Pull Request #64128 · apache/doris

csun5285 · 2026-06-05T01:32:58Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-06-05T01:33:03Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

csun5285 · 2026-06-05T01:33:08Z

run buildall

csun5285 · 2026-06-05T06:28:44Z

run buildall

csun5285 · 2026-06-05T06:29:10Z

run buildall

hello-stephen · 2026-06-05T10:18:30Z

BE Regression && UT Coverage Report

Increment line coverage 87.50% (98/112) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.85% (28222/38213)
Line Coverage	57.82% (306996/530916)
Region Coverage	54.70% (257459/470702)
Branch Coverage	56.07% (111651/199119)

PredicateColumnType<T> was a storage-layer wrapper that flattened every predicate column into PaddedPODArray<value_type> so the SIMD-friendly predicate-eval loops could use a uniform `data_array[i]` access regardless of the underlying type. The cost was: a parallel column hierarchy with restricted API, an extra 16 bytes/row (StringRef header) for strings on top of the actual chars, and the need for callers to thread the column's "role" (predicate vs output) down through the read path.

This change retires PredicateColumnType entirely:

- Schema::get_predicate_column_ptr now allocates the canonical PrimitiveTypeTraits<T>::ColumnType (ColumnVector / ColumnDecimal / ColumnString / ColumnIPv4 / ...). ColumnDictI32 is unchanged: it still serves the low-cardinality string fast path for predicate eval.
- filter_by_selector is now implemented on ColumnVector<T> and ColumnDecimal<T> with the same selector-gather semantics PredicateColumnType used.
- ColumnString does not expose a contiguous StringRef[] array, so per-row access in predicate evaluators flows through ColumnElementView<Type> (already in core/column/column_execute_util.h for the compute layer) rather than a pointer subscript:

      ColumnElementView<Type> view {column};
      _base_loop_vec<...>(size, flags, null_map, view, _value);

  The ColumnElementView numeric specialization yields T via pointer arithmetic; the string specialization yields StringRef via get_data_at. Both share the same `view[i]` call shape, with no if-constexpr at the call site. The compute-layer ColumnElementView gains size() / operator[] aliases, and its TYPE_STRING specialization is generalized to all is_string_type(PType) via a defaulted bool template param so TYPE_CHAR / TYPE_VARCHAR / TYPE_JSONB all resolve correctly.
- For HybridSet::find paths that need const T* per row, a small ColumnPointerCursor<Type> lives next to ColumnElementView. The numeric specialization holds a const T* (zero copy); the string specialization stages each row into a member StringRef and returns its address (HybridSet::find consumes synchronously, so the staged-cell reuse is safe).
- The BloomFilter / BitmapFilter find_fixed_len_olap_engine API is redesigned to take `const IColumn&` instead of a `const char*` that was reinterpreted as `const T[]`. CommonFindOp / StringFindOp each specialize per-row access:
  - CommonFindOp: reads column.get_raw_data() as `const T*`, passes `[data](int i){return data[i];}` as the accessor.
  - StringFindOp: `assert_cast<const ColumnString&>(column)`, passes `[&col](int i){return col.get_data_at(i);}`.

  Storage-side BF/Bitmap callers collapse to one line, and the previous workaround (materializing a temporary `vector<StringRef>` to satisfy the legacy char* API) goes away. This saves one jemalloc + N×16-byte store/load per evaluate on string columns.
- Schema-template CHAR predicate columns previously got trailing-zero padding stripped on every PredicateColumnType<TYPE_CHAR>::get_data_at call. ColumnString no longer does that; the strip is handled by Block::shrink_char_type_column_suffix_zero on the output side (and page-decoder-level for the read path per upstream PR apache#63291), so no extra pass is needed here.
- ColumnDictI32::convert_to_predicate_column_if_dictionary now produces ColumnString for mid-batch dict->plain fallback.
- predicate_column.h / predicate_column_test.cpp deleted; the PredicateColumnHolderType<T> transition alias and all core/column/predicate_column.h includes are removed.

On the _base_loop_vec signature: upstream used `const TArray* __restrict data_array`, which worked because PredicateColumnType<TYPE_STRING> physically held a contiguous StringRef[]. After this change the string side is ColumnElementView<TYPE_STRING> (a struct value), and `__restrict` is a pointer-only qualifier, so we pass TArray by value without it.

Verified on a Release build (objdump) that this does NOT regress SIMD: vectorizable types (INT / BIGINT / dict-encoded string) still emit fully-vectorized loops (vpcmpeqd / vpcmpeqq, 4× unrolled, 16 elements/iter). The compiler's loop versioning emits one runtime alias check at function entry (~5 cycles, <0.5% of a 1024-row batch's total cost); the main loop body is identical to the __restrict version. Non-vectorizable types (LARGEINT / DOUBLE / DECIMAL / ColumnString memcmp) were scalar regardless of __restrict. See the comment above _base_loop_vec for full rationale.

Storage compiles clean in both Release and ASAN trees.

Targeted UTs: 319 tests across 13 suites (Column* / Predicate* / Segment* / BloomFilterFunc / BitmapFilterPredicate / ColumnExecuteUtil): PASS, 0 FAIL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
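The element-view and pointer-cursor ideas described above can be illustrated with a small standalone sketch. It does not use the Doris headers: `ToyStringColumn`, `NumericElementView`, `StringElementView`, `StringPointerCursor`, `eq_loop`, `match_ints` and `match_strings` are hypothetical names invented for this illustration, and `std::string_view` stands in for StringRef. The real ColumnElementView / ColumnPointerCursor / _base_loop_vec signatures may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a string column: one contiguous char buffer plus
// end offsets, so there is no contiguous array of string references to index.
struct ToyStringColumn {
    std::string chars;
    std::vector<std::size_t> offsets; // offsets[i] = end of row i in chars

    void push_back(std::string_view s) {
        chars.append(s);
        offsets.push_back(chars.size());
    }
    std::size_t size() const { return offsets.size(); }
    std::string_view get_data_at(std::size_t i) const {
        std::size_t begin = (i == 0) ? 0 : offsets[i - 1];
        return std::string_view(chars.data() + begin, offsets[i] - begin);
    }
};

// Numeric view: wraps a raw pointer, so view[i] is plain pointer arithmetic.
template <typename T>
struct NumericElementView {
    const T* data;
    std::size_t n;
    explicit NumericElementView(const std::vector<T>& col)
            : data(col.data()), n(col.size()) {}
    std::size_t size() const { return n; }
    T operator[](std::size_t i) const { return data[i]; }
};

// String view: same view[i] call shape, but each access goes through get_data_at.
struct StringElementView {
    const ToyStringColumn* col;
    explicit StringElementView(const ToyStringColumn& c) : col(&c) {}
    std::size_t size() const { return col->size(); }
    std::string_view operator[](std::size_t i) const { return col->get_data_at(i); }
};

// Predicate loop that takes the view by value (no __restrict, since the view
// is a struct rather than a pointer) and writes one flag per row.
template <typename View, typename Value>
void eq_loop(std::size_t size, std::uint8_t* flags, View view, const Value& value) {
    for (std::size_t i = 0; i < size; ++i) {
        flags[i] = (view[i] == value) ? 1 : 0;
    }
}

// Cursor for APIs that want a pointer per row: each row is staged into a member
// cell and its address returned. The pointer is only valid until the next call.
struct StringPointerCursor {
    StringElementView view;
    std::string_view cell;
    explicit StringPointerCursor(const ToyStringColumn& c) : view(c) {}
    const std::string_view* at(std::size_t i) {
        cell = view[i];
        return &cell;
    }
};

// Convenience wrappers used to exercise the sketch.
inline std::vector<std::uint8_t> match_ints(const std::vector<std::int32_t>& col,
                                            std::int32_t v) {
    std::vector<std::uint8_t> flags(col.size());
    eq_loop(col.size(), flags.data(), NumericElementView<std::int32_t>(col), v);
    return flags;
}

inline std::vector<std::uint8_t> match_strings(const ToyStringColumn& col,
                                               std::string_view v) {
    std::vector<std::uint8_t> flags(col.size());
    eq_loop(col.size(), flags.data(), StringElementView(col), v);
    return flags;
}
```

In this sketch the loop body is written once and the per-type difference lives entirely in `operator[]`, which is the property the description attributes to the `view[i]` call shape. The cursor shows why staged-cell reuse is only safe when the consumer reads the pointer before the next row is staged.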

csun5285 · 2026-06-06T03:37:42Z

run buildall

csun5285 · 2026-06-06T03:37:53Z

/review

github-actions

Review completed for the PredicateColumnType removal refactor. I did not find a blocking correctness issue in the changed paths.

Critical checkpoint conclusions:

Goal and tests: The PR replaces storage predicate columns with canonical ColumnVector/ColumnDecimal/ColumnString types and updates predicate/runtime-filter call sites accordingly. The changed BE unit tests cover representative predicate, bloom filter, bitmap filter, nullable, and selector-copy paths.
Scope: The code is focused on removing PredicateColumnType and updating direct dependents; no unrelated functional expansion is apparent in the GitHub PR diff.
Concurrency: No new shared mutable state, thread entry, or lock ordering change was introduced. Existing dictionary predicate caches remain on the same structures as before.
Lifecycle: No new static/global initialization or ownership cycle risk was found. New per-row view helpers are stack-local and used synchronously.
Configuration and compatibility: No config, wire protocol, storage format, or persisted metadata change is introduced.
Parallel paths: The storage comparison, in-list, like, bloom-filter, bitmap-filter, dictionary conversion, nullable, and selector copy paths were all updated consistently for the new column representations.
Conditional checks: New asserts/checks are consistent with the prior predicate-column invariants; I did not find a silent-continuation path added by this PR.
Test coverage: Unit tests were updated for the new APIs. I did not run the test suite in this review runner.
Observability: No new observability appears necessary for this internal representation refactor.
Transactions/data visibility: The PR does not change transaction, publish, rowset visibility, or delete-bitmap logic.
Performance: The replacement avoids the extra predicate-column representation. I checked the hot predicate loops for obvious redundant copies and did not find a blocking regression.

User focus: No additional user-provided review focus was present.

hello-stephen · 2026-06-06T06:13:35Z

TPC-H: Total hot run time: 29821 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 46012540c84b5a3ad3a4e2c7c65befebfe0d1481, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17743	4083	4031	4031
q2	2021	330	203	203
q3	10302	1499	873	873
q4	4685	477	347	347
q5	7513	876	586	586
q6	184	178	137	137
q7	775	876	634	634
q8	9376	1624	1643	1624
q9	5971	4587	4541	4541
q10	6759	1810	1531	1531
q11	437	276	243	243
q12	638	427	292	292
q13	18095	3449	2772	2772
q14	265	264	253	253
q15	q16	805	789	710	710
q17	981	983	1020	983
q18	7272	5800	5586	5586
q19	1362	1395	1109	1109
q20	513	396	273	273
q21	6515	2775	3000	2775
q22	470	381	318	318
Total cold run time: 102682 ms
Total hot run time: 29821 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	5042	4811	4792	4792
q2	349	375	232	232
q3	4945	5257	4795	4795
q4	2140	2177	1422	1422
q5	4829	4795	4713	4713
q6	235	181	128	128
q7	1867	1702	1589	1589
q8	2438	2114	2156	2114
q9	7954	7932	7467	7467
q10	4745	4707	4261	4261
q11	522	386	350	350
q12	733	750	526	526
q13	3020	3357	2768	2768
q14	283	276	254	254
q15	q16	686	697	627	627
q17	1293	1260	1257	1257
q18	7435	6932	6961	6932
q19	1188	1081	1086	1081
q20	2229	2229	1940	1940
q21	5296	4651	4452	4452
q22	512	454	402	402
Total cold run time: 57741 ms
Total hot run time: 52102 ms

hello-stephen · 2026-06-06T06:24:45Z

TPC-DS: Total hot run time: 176302 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 46012540c84b5a3ad3a4e2c7c65befebfe0d1481, data reload: false

query5	4331	634	484	484
query6	454	205	183	183
query7	4839	531	314	314
query8	370	223	204	204
query9	8791	4000	4054	4000
query10	446	322	259	259
query11	5920	2324	2132	2132
query12	158	104	103	103
query13	1295	601	414	414
query14	6377	5388	5097	5097
query14_1	4381	4352	4367	4352
query15	207	202	176	176
query16	984	467	407	407
query17	916	696	572	572
query18	2418	475	339	339
query19	201	185	147	147
query20	112	110	103	103
query21	218	138	125	125
query22	13665	13591	13418	13418
query23	17363	16575	16198	16198
query23_1	16314	16370	16271	16271
query24	7544	1794	1299	1299
query24_1	1359	1361	1336	1336
query25	584	493	419	419
query26	1288	330	182	182
query27	2671	570	350	350
query28	4572	2010	2012	2010
query29	1109	645	511	511
query30	315	243	205	205
query31	1112	1099	960	960
query32	116	68	63	63
query33	539	327	265	265
query34	1212	1131	625	625
query35	765	803	679	679
query36	1423	1417	1191	1191
query37	157	109	97	97
query38	3218	3159	3105	3105
query39	945	920	887	887
query39_1	875	892	873	873
query40	231	127	110	110
query41	73	68	71	68
query42	101	97	99	97
query43	329	328	287	287
query44	1473	776	794	776
query45	193	190	182	182
query46	1035	1222	754	754
query47	2368	2380	2239	2239
query48	415	438	303	303
query49	656	481	362	362
query50	1085	372	268	268
query51	4365	4340	4237	4237
query52	90	91	79	79
query53	261	278	194	194
query54	296	239	220	220
query55	92	89	71	71
query56	278	246	228	228
query57	1447	1423	1335	1335
query58	254	230	219	219
query59	1606	1728	1433	1433
query60	280	256	229	229
query61	162	156	154	154
query62	699	657	585	585
query63	237	189	197	189
query64	2579	790	631	631
query65	4843	4795	4786	4786
query66	1819	474	337	337
query67	29988	29701	29717	29701
query68	3497	1578	907	907
query69	424	305	258	258
query70	1102	980	959	959
query71	295	231	207	207
query72	2945	2726	2464	2464
query73	827	780	439	439
query74	5164	4975	4784	4784
query75	2659	2602	2263	2263
query76	2378	1183	797	797
query77	365	390	286	286
query78	12323	12455	11902	11902
query79	1431	1186	759	759
query80	1275	482	398	398
query81	522	279	246	246
query82	626	157	127	127
query83	355	275	261	261
query84	300	181	114	114
query85	929	555	447	447
query86	430	304	299	299
query87	3381	3382	3243	3243
query88	3707	2778	2795	2778
query89	425	394	336	336
query90	1888	189	184	184
query91	178	173	136	136
query92	65	62	60	60
query93	1597	1497	859	859
query94	721	365	316	316
query95	690	482	350	350
query96	1028	844	361	361
query97	2690	2676	2526	2526
query98	215	216	205	205
query99	1179	1168	1036	1036
Total cold run time: 262753 ms
Total hot run time: 176302 ms

hello-stephen · 2026-06-06T06:28:51Z

BE UT Coverage Report

Increment line coverage 42.86% (48/112) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	53.78% (21040/39123)
Line Coverage	37.46% (200038/534039)
Region Coverage	33.49% (156920/468584)
Branch Coverage	34.52% (68659/198874)

hello-stephen · 2026-06-06T06:29:45Z

ClickBench: Total hot run time: 25.24 s

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 46012540c84b5a3ad3a4e2c7c65befebfe0d1481, data reload: false

query1	0.01	0.01	0.01
query2	0.10	0.06	0.06
query3	0.26	0.14	0.14
query4	1.61	0.15	0.14
query5	0.25	0.22	0.22
query6	1.24	1.04	1.07
query7	0.04	0.01	0.00
query8	0.06	0.04	0.04
query9	0.38	0.30	0.31
query10	0.57	0.55	0.55
query11	0.20	0.15	0.15
query12	0.19	0.15	0.15
query13	0.47	0.49	0.48
query14	1.02	1.01	1.02
query15	0.62	0.59	0.60
query16	0.32	0.32	0.33
query17	1.06	1.13	1.09
query18	0.23	0.21	0.22
query19	2.10	1.96	1.96
query20	0.02	0.01	0.01
query21	15.43	0.25	0.13
query22	4.70	0.05	0.05
query23	16.16	0.30	0.13
query24	2.96	0.43	0.32
query25	0.12	0.06	0.03
query26	0.75	0.20	0.16
query27	0.04	0.04	0.03
query28	3.52	0.89	0.51
query29	12.50	4.30	3.43
query30	0.29	0.15	0.16
query31	2.77	0.61	0.32
query32	3.23	0.60	0.48
query33	3.28	3.22	3.18
query34	15.54	4.24	3.50
query35	3.56	3.55	3.54
query36	0.56	0.43	0.42
query37	0.09	0.07	0.07
query38	0.06	0.04	0.04
query39	0.04	0.04	0.03
query40	0.17	0.17	0.15
query41	0.10	0.04	0.04
query42	0.04	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.7 s
Total hot run time: 25.24 s

hello-stephen · 2026-06-06T07:49:28Z

BE Regression && UT Coverage Report

Increment line coverage 87.50% (98/112) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.80% (28204/38219)
Line Coverage	57.83% (307053/530994)
Region Coverage	54.55% (256808/470789)
Branch Coverage	55.99% (111502/199157)

hello-stephen · 2026-06-06T08:01:57Z

BE Regression && UT Coverage Report

Increment line coverage 87.50% (98/112) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.80% (28204/38219)
Line Coverage	57.83% (307053/530994)
Region Coverage	54.55% (256808/470789)
Branch Coverage	55.99% (111502/199157)

csun5285 requested review from gavinchou, liaoxin01 and yiguolei as code owners June 5, 2026 01:32

csun5285 force-pushed the refactor/drop-predicate-column-type branch from 9f59008 to aa7c0ff Compare June 5, 2026 01:39

csun5285 force-pushed the refactor/drop-predicate-column-type branch from aa7c0ff to 671bc54 Compare June 5, 2026 06:29

csun5285 force-pushed the refactor/drop-predicate-column-type branch from 671bc54 to 4601254 Compare June 6, 2026 03:36

github-actions Bot reviewed Jun 6, 2026

2 participants
