Skip to content

[feature](function) Support murmur_hash3_128 function#63196

Open
seawinde wants to merge 4 commits into
apache:masterfrom
seawinde:feature-murmur-hash3-128-upstream
Open

[feature](function) Support murmur_hash3_128 function#63196
seawinde wants to merge 4 commits into
apache:masterfrom
seawinde:feature-murmur-hash3-128-upstream

Conversation

@seawinde
Copy link
Copy Markdown
Member

@seawinde seawinde commented May 13, 2026

What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary:

This PR adds a builtin 128-bit MurmurHash3 scalar function, murmur_hash3_128, for callers that need a wider hash value than the existing 32-bit and 64-bit variants.

File Change Description
be/src/exprs/function/function_hash.cpp Adds BE implementation for murmur_hash3_128, returning LARGEINT.
be/test/exprs/function/function_hash_test.cpp Adds unit coverage for constant and multi-argument hash cases.
BuiltinScalarFunctions.java Registers the Nereids scalar function.
MurmurHash3128.java Adds FE scalar function metadata and signatures.
ScalarFunctionVisitor.java Adds visitor entry for the new scalar function.

Design rationale: the function reuses the existing MurmurHash3 x64 128-bit processing path in BE and exposes the packed 128-bit result as LARGEINT, matching Doris' existing signed 128-bit integer representation.

BE execution logic:

murmur_hash3_128 follows the same variadic hash execution model as the existing Doris hash functions. The BE function framework does not evaluate all arguments for one row in a single call. Instead, FunctionVariadicArgumentsBase invokes the implementation once per argument column:

first_apply(arg0, result_column)
combine_apply(arg1, result_column)
combine_apply(arg2, result_column)
...

For a query such as:

SELECT murmur_hash3_128(k1, 'world') FROM t;

the first round processes the whole k1 column with execute<true>(). It creates one LARGEINT result slot per row and initializes each row's MurmurHash3 128-bit state from seed 0. The second round processes the constant argument 'world' with execute<false>(). It reads each row's previous state from the result column, updates that state with the new argument bytes, and writes the packed state back.

For example, if k1 has two rows:

row0: "hello"
row1: "apache"

the state evolves as:

Initial result column:
row0: empty
row1: empty

Round 1: execute<true>() for k1
row0: init_hash("hello")  -> state_hello
row1: init_hash("apache") -> state_apache

Result column:
row0: pack(state_hello)
row1: pack(state_apache)

Round 2: execute<false>() for 'world'
row0: unpack(state_hello)  -> update_hash("world") -> state_hello_world
row1: unpack(state_apache) -> update_hash("world") -> state_apache_world

Final result column:
row0: pack(state_hello_world)
row1: pack(state_apache_world)

The underlying MurmurHash3 128-bit state is the pair (h1, h2). For a single argument, the implementation can call the existing murmur_hash3_x64_128(data, len, 0, out) directly. For multiple arguments, each later argument must continue from the previous (h1, h2) state; calling murmur_hash3_x64_128 independently for every argument would restart from seed 0 and lose the effect of earlier arguments.

Because the SQL return type is LARGEINT and the BE result column stores one __int128_t value per row, the implementation packs (h1, h2) into the result column between argument rounds and unpacks it before processing the next argument:

high 64 bits                 low 64 bits
+-------------------------+-------------------------+
|           h2            |           h1            |
+-------------------------+-------------------------+

This state storage is required by Doris' vectorized, per-argument-column execution model. It is not a requirement of the MurmurHash3 algorithm itself.

Release note

Support murmur_hash3_128 function.

Check List (For Author)

  • Test
    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

Unit tests run locally:

  • ./run-fe-ut.sh --coverage --run org.apache.doris.nereids.rules.analysis.FunctionRegistryTest passed.

  • ./run-be-ut.sh --run '--filter=HashFunctionTest.*' was attempted, but the local worktree failed before compilation while downloading the apache-orc submodule from GitHub.

  • Behavior changed:

    • No.
    • Yes. Adds a new builtin scalar function.
  • Does this need documentation?

    • No.
    • Yes. Function documentation should be added if this PR is kept beyond CI validation.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

### What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary: Add a 128-bit MurmurHash3 scalar function that returns LARGEINT. The function is implemented in BE, registered in FE Nereids scalar functions, and covered by BE hash function tests.

### Release note

Support murmur_hash3_128 function.

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --coverage --run org.apache.doris.nereids.rules.analysis.FunctionRegistryTest
    - ./run-be-ut.sh --run '--filter=HashFunctionTest.*' attempted, but local worktree failed before compilation while downloading apache-orc submodule from GitHub.
- Behavior changed: Yes. Adds a new builtin scalar hash function.
- Does this need documentation: Yes
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@seawinde
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29748 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e5a3a613d211ffe09d9dbcbdd50ecc8ef12a82a4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17612	3854	3838	3838
q2	q3	10705	893	609	609
q4	4664	471	358	358
q5	7578	1336	1167	1167
q6	267	174	144	144
q7	976	952	755	755
q8	10351	1415	1283	1283
q9	6480	5417	5375	5375
q10	6326	2083	1808	1808
q11	473	270	255	255
q12	691	413	299	299
q13	18175	3351	2779	2779
q14	296	284	264	264
q15	q16	905	870	792	792
q17	1025	952	847	847
q18	6355	5706	5621	5621
q19	1616	1238	1100	1100
q20	510	409	267	267
q21	4529	2276	1884	1884
q22	425	352	303	303
Total cold run time: 99959 ms
Total hot run time: 29748 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4218	4097	4074	4074
q2	q3	4635	4796	4153	4153
q4	2096	2160	1389	1389
q5	4955	4974	5236	4974
q6	192	167	133	133
q7	2023	2025	1738	1738
q8	3469	3186	3207	3186
q9	8552	8480	8405	8405
q10	4501	4510	4269	4269
q11	621	429	422	422
q12	705	772	508	508
q13	3195	3632	3039	3039
q14	316	311	265	265
q15	q16	772	798	719	719
q17	1352	1384	1320	1320
q18	8090	7136	7144	7136
q19	1182	1225	1143	1143
q20	2249	2234	1973	1973
q21	6193	5432	4917	4917
q22	571	530	411	411
Total cold run time: 59887 ms
Total hot run time: 54174 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170678 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e5a3a613d211ffe09d9dbcbdd50ecc8ef12a82a4, data reload: false

query5	4315	664	509	509
query6	318	218	200	200
query7	4224	564	320	320
query8	341	235	212	212
query9	8831	4040	4031	4031
query10	457	361	310	310
query11	5984	2378	2194	2194
query12	186	151	133	133
query13	1280	644	447	447
query14	6810	5427	5051	5051
query14_1	4357	4320	4350	4320
query15	210	199	180	180
query16	995	460	430	430
query17	1230	767	625	625
query18	2729	501	351	351
query19	241	201	169	169
query20	145	131	135	131
query21	231	140	115	115
query22	13684	13961	14494	13961
query23	17252	16516	16155	16155
query23_1	16216	16336	16247	16247
query24	7772	1989	1412	1412
query24_1	1381	1366	1355	1355
query25	583	528	452	452
query26	1291	316	177	177
query27	2672	606	337	337
query28	4403	1965	1966	1965
query29	1016	664	548	548
query30	307	228	199	199
query31	1128	1078	931	931
query32	83	74	75	74
query33	568	358	308	308
query34	1166	1109	643	643
query35	784	779	675	675
query36	1334	1325	1173	1173
query37	149	97	93	93
query38	3207	3121	3063	3063
query39	924	917	907	907
query39_1	871	878	870	870
query40	233	159	140	140
query41	68	67	63	63
query42	109	109	109	109
query43	321	333	284	284
query44	
query45	209	202	193	193
query46	1126	1187	751	751
query47	2273	2277	2173	2173
query48	406	406	294	294
query49	656	595	446	446
query50	703	274	223	223
query51	4370	4306	4153	4153
query52	106	111	98	98
query53	247	272	204	204
query54	306	278	245	245
query55	91	88	83	83
query56	302	314	293	293
query57	1408	1369	1301	1301
query58	295	271	266	266
query59	1548	1596	1369	1369
query60	343	337	319	319
query61	159	153	153	153
query62	670	609	555	555
query63	243	204	197	197
query64	2317	816	689	689
query65	
query66	1686	501	397	397
query67	29412	29931	29830	29830
query68	
query69	456	341	311	311
query70	1031	974	995	974
query71	347	284	265	265
query72	3061	2724	2235	2235
query73	857	718	405	405
query74	5117	4947	4790	4790
query75	2784	2665	2331	2331
query76	2301	1148	799	799
query77	416	424	343	343
query78	12971	13098	12405	12405
query79	1477	992	697	697
query80	1380	586	481	481
query81	521	278	237	237
query82	1008	157	124	124
query83	319	279	250	250
query84	263	138	118	118
query85	941	517	460	460
query86	434	317	324	317
query87	3418	3354	3202	3202
query88	3522	2670	2622	2622
query89	439	377	343	343
query90	1908	182	177	177
query91	178	165	138	138
query92	79	73	77	73
query93	1124	939	582	582
query94	718	339	298	298
query95	687	452	352	352
query96	1085	793	329	329
query97	2694	2667	2587	2587
query98	242	227	229	227
query99	1088	1120	964	964
Total cold run time: 255088 ms
Total hot run time: 170678 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.81% (47/62) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.73% (27816/37727)
Line Coverage 57.58% (301057/522826)
Region Coverage 54.80% (251109/458234)
Branch Coverage 56.29% (108578/192887)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

### What problem does this PR solve?

Issue Number: N/A

Related PR: apache#63196

Problem Summary: Add regression coverage for the new MURMUR_HASH3_128 scalar function. The tests validate NULL propagation, empty string, single argument, multiple arguments, unicode input, and table column evaluation against deterministic 128-bit LARGEINT results.

### Release note

None

### Check List (For Author)

- Test: Regression test

    - Added coverage in test_hash_function.groovy.

    - Ran git diff --check.

    - Attempted ./run-regression-test.sh --run -d query_p0/sql_functions/hash_functions -s test_hash_function, but the local running cluster does not include MURMUR_HASH3_128 yet and failed with Can not found function 'MURMUR_HASH3_128'.

- Behavior changed: No

- Does this need documentation: No
@seawinde
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29625 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c79471f90a154f510ebb6a6e7b3f8b314da95e4d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17667	3920	3870	3870
q2	q3	10697	890	612	612
q4	4698	469	351	351
q5	7785	1354	1137	1137
q6	272	173	141	141
q7	926	953	762	762
q8	10140	1437	1317	1317
q9	6650	5487	5313	5313
q10	6316	2063	1814	1814
q11	469	265	256	256
q12	689	433	299	299
q13	18157	3311	2786	2786
q14	290	287	264	264
q15	q16	909	854	791	791
q17	988	1129	679	679
q18	6516	5719	5559	5559
q19	1321	1209	996	996
q20	520	412	401	401
q21	4690	2324	1911	1911
q22	472	394	366	366
Total cold run time: 100172 ms
Total hot run time: 29625 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4840	4590	4979	4590
q2	q3	4688	4821	4206	4206
q4	2182	2242	1388	1388
q5	4981	5004	5308	5004
q6	198	180	140	140
q7	2042	1834	1634	1634
q8	3390	3125	3189	3125
q9	8589	8488	8408	8408
q10	4479	4526	4260	4260
q11	612	434	406	406
q12	694	746	511	511
q13	3277	3639	2921	2921
q14	311	304	274	274
q15	q16	764	785	741	741
q17	1404	1391	1281	1281
q18	8011	7149	7303	7149
q19	1183	1196	1170	1170
q20	2284	2238	1945	1945
q21	6164	5397	4826	4826
q22	522	480	393	393
Total cold run time: 60615 ms
Total hot run time: 54372 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170846 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c79471f90a154f510ebb6a6e7b3f8b314da95e4d, data reload: false

query5	4335	678	510	510
query6	327	219	203	203
query7	4264	544	311	311
query8	315	228	220	220
query9	8844	4019	3935	3935
query10	460	347	283	283
query11	5779	2371	2202	2202
query12	182	130	123	123
query13	1302	619	439	439
query14	6618	5349	5003	5003
query14_1	4287	4311	4306	4306
query15	210	202	181	181
query16	1047	441	423	423
query17	1106	736	618	618
query18	2727	488	344	344
query19	221	198	157	157
query20	137	133	126	126
query21	222	135	113	113
query22	13624	13576	13403	13403
query23	17174	16340	16171	16171
query23_1	16105	16195	16127	16127
query24	7465	1761	1356	1356
query24_1	1361	1348	1348	1348
query25	573	531	460	460
query26	1319	318	177	177
query27	2666	584	330	330
query28	4440	1956	1959	1956
query29	1042	652	546	546
query30	325	246	209	209
query31	1158	1072	954	954
query32	91	76	77	76
query33	553	373	312	312
query34	1170	1162	639	639
query35	784	788	686	686
query36	1321	1338	1165	1165
query37	151	99	104	99
query38	3206	3147	3065	3065
query39	939	942	907	907
query39_1	898	903	871	871
query40	257	155	138	138
query41	72	68	68	68
query42	113	109	110	109
query43	318	330	285	285
query44	
query45	212	208	197	197
query46	1059	1204	728	728
query47	2276	2339	2237	2237
query48	404	447	295	295
query49	651	548	446	446
query50	701	299	227	227
query51	4311	4303	4194	4194
query52	107	108	98	98
query53	257	281	213	213
query54	325	285	274	274
query55	95	92	86	86
query56	332	317	324	317
query57	1437	1393	1290	1290
query58	322	279	281	279
query59	1539	1622	1451	1451
query60	354	345	339	339
query61	220	158	154	154
query62	673	620	550	550
query63	229	215	200	200
query64	2341	841	683	683
query65	
query66	1682	510	401	401
query67	30175	29883	29771	29771
query68	
query69	469	334	309	309
query70	1009	980	953	953
query71	312	262	265	262
query72	2949	2724	2455	2455
query73	813	768	428	428
query74	5063	4925	4752	4752
query75	2768	2664	2310	2310
query76	2265	1136	769	769
query77	408	444	342	342
query78	12893	12966	12324	12324
query79	1498	995	732	732
query80	869	586	463	463
query81	495	276	243	243
query82	1322	159	122	122
query83	355	281	248	248
query84	260	144	113	113
query85	912	533	445	445
query86	432	377	320	320
query87	3420	3372	3191	3191
query88	3559	2716	2664	2664
query89	456	387	331	331
query90	1778	181	179	179
query91	181	172	142	142
query92	78	77	76	76
query93	957	932	568	568
query94	626	337	319	319
query95	647	369	435	369
query96	1042	778	359	359
query97	2711	2685	2555	2555
query98	240	231	235	231
query99	1125	1103	984	984
Total cold run time: 254168 ms
Total hot run time: 170846 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 75.81% (47/62) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.58% (20635/38513)
Line Coverage 37.20% (195078/524350)
Region Coverage 33.61% (152683/454289)
Branch Coverage 34.60% (66543/192303)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.81% (47/62) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.80% (27830/37712)
Line Coverage 57.73% (301875/522934)
Region Coverage 55.04% (252399/458594)
Branch Coverage 56.53% (109105/193005)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 46.67% (14/30) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

### What problem does this PR solve?

Issue Number: None

Related PR: apache#63196

Problem Summary: Align murmur_hash3_128 BE implementation with the existing 128-bit MurmurHash3 primitive for the first argument, while keeping Doris variadic hash semantics for multiple arguments by carrying the intermediate (h1, h2) state between arguments. Also make the LARGEINT packing layout explicit without std::memcpy.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --cached --check for the changed BE files.
    - Tried ./run-be-ut.sh --run '--filter=HashFunctionTest.*', but local CMake configuration stopped before building because contrib/openblas could not find OpenMP_C.
- Behavior changed: No
- Does this need documentation: No
@seawinde seawinde marked this pull request as ready for review May 15, 2026 10:19
@morrySnow morrySnow requested a review from zclllyybb May 15, 2026 10:24
@morrySnow
Copy link
Copy Markdown
Contributor

/review

### What problem does this PR solve?

Issue Number: None

Related PR: apache#63196

Problem Summary: Add focused coverage for murmur_hash3_128 argument combinations. BE unit tests now exercise const and non-const argument combinations through check_function_all_arg_comb, and regression tests cover table-column multi-argument calls to validate variadic state propagation with column and constant arguments.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh be/test/exprs/function/function_hash_test.cpp.
    - Ran git diff --check for the changed test files.
    - Tried ./run-be-ut.sh --run '--filter=HashFunctionTest.*', but local CMake configuration stopped before building because contrib/openblas could not find OpenMP_C.
    - Tried ./run-regression-test.sh --run -d query_p0/sql_functions/hash_functions -s test_hash_function, but the connected local cluster did not include MURMUR_HASH3_128 yet and failed with Can not found function 'MURMUR_HASH3_128'.
- Behavior changed: No
- Does this need documentation: No
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review completed. I did not find any blocking correctness issue in the PR.

Checkpoint conclusions:

  • Goal and proof: The PR adds murmur_hash3_128 returning LARGEINT; BE implementation, FE/Nereids registration, unit tests, and regression coverage are present. The submitted test commands were partly blocked in the author's environment; I verified the diff and code paths but did not run the full Doris test suite in this runner.
  • Scope and clarity: The change is focused on one new scalar function and follows the existing variadic hash function structure.
  • Concurrency: No new shared mutable state, locks, threads, or lifecycle-sensitive async behavior are introduced.
  • Lifecycle/static initialization: No new cross-TU static objects with initialization dependencies were found; the added constants/functions are local/simple.
  • Configuration: No config items are added.
  • Compatibility: No storage/editlog format change. FE/BE function metadata and BE registration are both added for the new symbol.
  • Parallel paths: Existing 32/64-bit hash paths are not modified beyond registration context; the new function follows the existing hash path pattern.
  • Conditional checks/error handling: Unsupported BE column types return Status::NotSupported, consistent with neighboring code. No ignored Status in the new BE execution path.
  • Tests: Coverage includes BE unit cases and regression cases for NULL, empty string, single/multiple args, unicode input, and table-column evaluation. Regression assertions are deterministic.
  • Test results: No generated .out result was added for the new assertions; values are compared explicitly in Groovy.
  • Observability: No additional logs/metrics appear necessary for this scalar function.
  • Transactions/persistence/data writes: Not applicable.
  • FE-BE variable passing: Not applicable beyond scalar function registration/signature exposure.
  • Performance/memory: No obvious avoidable allocation or untracked large memory ownership was introduced; execution remains per-column and in-place over the result vector.
  • User focus: No additional user-provided review focus was present.

Residual risk: Full BE/regression execution was not performed by me in this runner, so CI remains the source of truth for build/test validation.

@seawinde
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 30812 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 76fb519e146749e3726ca0440c7c35d1094a27d8, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17731	3912	3870	3870
q2	q3	10764	1352	812	812
q4	4682	472	352	352
q5	7614	2256	2121	2121
q6	260	173	135	135
q7	903	776	635	635
q8	9377	1692	1556	1556
q9	6355	4878	4886	4878
q10	6449	2105	1759	1759
q11	434	268	245	245
q12	702	428	296	296
q13	18173	3368	2766	2766
q14	258	255	227	227
q15	q16	811	780	708	708
q17	996	1010	1013	1010
q18	6635	5763	5378	5378
q19	1254	1277	1078	1078
q20	506	386	263	263
q21	5626	2551	2427	2427
q22	424	351	296	296
Total cold run time: 99954 ms
Total hot run time: 30812 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4198	4112	4108	4108
q2	q3	4488	4835	4367	4367
q4	2134	2178	1375	1375
q5	4391	4250	4252	4250
q6	231	174	129	129
q7	1784	1913	1643	1643
q8	2489	2101	2063	2063
q9	7715	7783	7699	7699
q10	4537	4499	4067	4067
q11	583	417	390	390
q12	730	759	522	522
q13	3426	3673	2946	2946
q14	305	302	260	260
q15	q16	753	743	685	685
q17	1346	1296	1315	1296
q18	7811	7322	6925	6925
q19	1132	1079	1099	1079
q20	2182	2190	1916	1916
q21	5321	4584	4459	4459
q22	510	453	394	394
Total cold run time: 56066 ms
Total hot run time: 50573 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169194 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 76fb519e146749e3726ca0440c7c35d1094a27d8, data reload: false

query5	4350	647	530	530
query6	344	215	199	199
query7	4221	577	294	294
query8	317	223	237	223
query9	8860	4012	3985	3985
query10	446	332	306	306
query11	5816	2370	2195	2195
query12	183	130	126	126
query13	1290	637	415	415
query14	6011	5385	5039	5039
query14_1	4340	4383	4342	4342
query15	215	211	190	190
query16	1020	466	435	435
query17	1144	729	604	604
query18	2598	498	368	368
query19	217	210	178	178
query20	141	134	134	134
query21	217	141	126	126
query22	13560	13510	13365	13365
query23	17242	16333	16017	16017
query23_1	16220	16091	16061	16061
query24	7430	1737	1310	1310
query24_1	1320	1296	1315	1296
query25	561	504	451	451
query26	1328	331	181	181
query27	2668	562	347	347
query28	4444	1972	1937	1937
query29	1018	652	539	539
query30	313	246	203	203
query31	1123	1065	931	931
query32	93	78	74	74
query33	546	363	302	302
query34	1193	1124	651	651
query35	772	792	702	702
query36	1366	1361	1147	1147
query37	152	105	88	88
query38	3235	3107	3052	3052
query39	919	912	902	902
query39_1	894	881	892	881
query40	236	144	126	126
query41	69	66	63	63
query42	110	108	113	108
query43	320	328	281	281
query44	
query45	211	198	191	191
query46	1042	1178	729	729
query47	2325	2337	2189	2189
query48	397	431	294	294
query49	639	483	388	388
query50	1036	354	248	248
query51	4335	4333	4255	4255
query52	100	106	99	99
query53	268	303	212	212
query54	310	275	248	248
query55	98	96	90	90
query56	289	299	306	299
query57	1400	1378	1322	1322
query58	302	273	273	273
query59	1564	1590	1380	1380
query60	315	321	307	307
query61	151	157	156	156
query62	679	622	561	561
query63	239	194	205	194
query64	2391	790	658	658
query65	
query66	1740	473	346	346
query67	29249	29837	29818	29818
query68	
query69	476	335	325	325
query70	1053	995	980	980
query71	314	268	268	268
query72	3005	2717	2421	2421
query73	845	782	407	407
query74	5068	4863	4747	4747
query75	2666	2579	2289	2289
query76	2305	1135	734	734
query77	401	407	339	339
query78	12189	12185	11631	11631
query79	1484	1074	774	774
query80	1293	531	460	460
query81	524	279	241	241
query82	980	162	119	119
query83	321	275	250	250
query84	263	135	110	110
query85	929	546	450	450
query86	450	341	346	341
query87	3436	3400	3196	3196
query88	3574	2670	2654	2654
query89	446	383	339	339
query90	1888	183	179	179
query91	179	175	136	136
query92	75	83	73	73
query93	1611	1433	832	832
query94	700	349	272	272
query95	662	384	438	384
query96	1106	802	322	322
query97	2704	2693	2547	2547
query98	232	233	229	229
query99	1112	1114	1000	1000
Total cold run time: 253247 ms
Total hot run time: 169194 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 6.67% (1/15) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.00% (66/75) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.60% (27818/37795)
Line Coverage 57.53% (301340/523761)
Region Coverage 54.78% (251914/459861)
Branch Coverage 56.30% (108868/193357)

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 93.33% (14/15) 🎉
Increment coverage report
Complete coverage report

@zclllyybb
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in the new BE implementation. The function registration and FE/Nereids plumbing are focused and consistent with adjacent hash functions, and the added regression/BE tests cover ordinary nullable, constant, unicode, and multi-argument behavior. However, the new 128-bit hashing path introduces an unsafe length narrowing before the Murmur primitive, which can read out of bounds for large STRING values.

Critical checkpoint conclusions:

  • Goal/test proof: The PR adds murmur_hash3_128 and tests normal single/multi-argument behavior, but misses the large-input boundary affected by the implementation.
  • Scope/focus: The change is small and localized.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, config, storage format, or FE-BE protocol compatibility concerns found.
  • Parallel paths: FE and BE registration paths for the new function are present; the issue is specific to the new 128-bit implementation path.
  • Error handling/memory safety: Blocking issue found: size_t to int truncation can make murmur_hash3_x64_process operate on a negative length.
  • Test coverage: Good normal-path coverage, but missing the boundary/negative test for lengths beyond int range.
  • Observability/transactions/data writes: Not applicable.
  • Performance: No additional performance issue found.

User focus: No additional user-provided review focus was specified.

uint64_t hash[2] = {0, 0};
// The first SQL argument starts from seed 0, so it can use the existing 128-bit primitive
// directly. Later arguments must use update_hash() to continue from the saved (h1, h2).
murmur_hash3_x64_128(data, static_cast<int>(size), 0, hash);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This narrows a SQL string length from size_t to int before entering murmur_hash3_x64_128 / murmur_hash3_x64_process. Doris string offsets are not limited to INT_MAX (StringLengthType is uint32_t), so a large STRING value can wrap to a negative len; murmur_hash3_x64_process then computes a negative nblocks and derives tail = data + nblocks * 16, which can point before the buffer and read out of bounds. Please either make the 128-bit Murmur primitive accept a non-narrowing length type (as other hash wrappers do at their public boundary) or reject/guard values larger than INT_MAX before calling it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants