[fix](be) Compare JSON numeric values by value#63396

Open: mrhhsg wants to merge 3 commits into apache:master from mrhhsg:fix/json-contains-numeric-equality

@mrhhsg (Member) commented May 19, 2026

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: json_contains compared JSONB numeric values by their stored binary categories. As a result, semantically equal numbers such as integer 1 and floating-point 1.0, or decimals with different scales, were not treated as contained. This changes JsonbValue::contains to compare integer, floating-point, and decimal JSONB values by numeric value, while keeping exact integer and decimal comparisons to avoid precision-related false matches.
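
To illustrate why the summary insists on exact comparison, here is a minimal sketch of an integer-versus-double equality check that does not round the integer to a double first. The helper name `int_equals_double` is hypothetical and this is not the PR's code; it only shows the idea of comparing in integer space after confirming the double is integral and in range.

```cpp
#include <cstdint>

// Hypothetical helper (not taken from the PR): returns true only when the
// double `d` represents exactly the same mathematical value as `i`.
constexpr bool int_equals_double(int64_t i, double d) {
    // 2^63 is exactly representable as a double.
    constexpr double kTwoPow63 = 9223372036854775808.0;
    // NaN fails both comparisons; infinities and values outside
    // [-2^63, 2^63) cannot equal any int64_t.
    if (!(d >= -kTwoPow63 && d < kTwoPow63)) {
        return false;
    }
    // In range, so the truncating cast is well-defined.
    const int64_t truncated = static_cast<int64_t>(d);
    // Reject doubles with a fractional part, then compare as integers.
    return static_cast<double>(truncated) == d && truncated == i;
}
```

A naive `static_cast<double>(i) == d` would report 9007199254740993 (2^53 + 1) as equal to 9007199254740992.0, because the integer rounds to the nearest double; the sketch above reports them as different.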

Release note

Fix json_contains to match semantically equal JSON numeric values across numeric storage categories.
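
As a rough illustration of "decimals with different scales", the sketch below compares two decimals stored as an unscaled integer plus a scale. The `Decimal` struct and both functions are hypothetical, and the technique (stripping trailing zeros) is a simplification chosen to avoid overflow in 64 bits; the PR itself works with scale multipliers on wider integers.

```cpp
#include <cstdint>

// Hypothetical representation: value = unscaled / 10^scale.
struct Decimal {
    int64_t unscaled;
    uint32_t scale;
};

// Strip trailing decimal zeros so equal values share one representation.
constexpr Decimal normalize(Decimal d) {
    while (d.scale > 0 && d.unscaled % 10 == 0) {
        d.unscaled /= 10;
        --d.scale;
    }
    return d;
}

// Two decimals are numerically equal when their normalized forms match.
constexpr bool decimal_equals(Decimal a, Decimal b) {
    const Decimal x = normalize(a);
    const Decimal y = normalize(b);
    return x.unscaled == y.unscaled && x.scale == y.scale;
}
```

With this sketch, 1.0 (unscaled 10, scale 1) and 1.00 (unscaled 100, scale 2) both normalize to unscaled 1, scale 0, so they compare equal.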

Check List (For Author)

  • Test:
    • Build: ./build.sh --be --fe
    • Unit Test: ./run-be-ut.sh --run --filter=JsonbDocumentTest.contains_numeric_equality
    • Regression test: ./run-regression-test.sh --conf output/local-regression/regression-conf-29000.groovy --run -d query_p0/sql_functions/json_functions -s test_json_contains_numeric_equality -forceGenOut
    • Regression test: ./run-regression-test.sh --conf output/local-regression/regression-conf-29000.groovy --run -d query_p0/sql_functions/json_functions -s test_json_contains_numeric_equality
    • Style: build-support/check-format.sh
    • Style: git diff --check
    • Static analysis: build-support/run-clang-tidy.sh --build-dir be/build_Release (attempted; failed on pre-existing header/system include diagnostics unrelated to the changed lines)
  • Behavior changed: Yes (json_contains now treats semantically equal JSON numeric values across integer, floating-point, and decimal JSONB representations as contained)
  • Does this need documentation: No

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg (Member, Author) commented May 19, 2026

/review

@mrhhsg (Member, Author) commented May 20, 2026

run buildall

@hello-stephen (Contributor)

TPC-H: Total hot run time: 30868 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d8e3ea2009bc845705e638373a15960de0f5792a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17723	3847	3855	3847
q2	q3	10814	1406	833	833
q4	4706	474	350	350
q5	7791	2252	2082	2082
q6	380	178	138	138
q7	994	757	658	658
q8	9476	1645	1634	1634
q9	6495	4860	4888	4860
q10	6395	2179	1768	1768
q11	437	287	238	238
q12	643	427	296	296
q13	18168	3463	2787	2787
q14	260	256	231	231
q15	q16	814	774	706	706
q17	988	969	899	899
q18	6902	5682	5525	5525
q19	1181	1285	1101	1101
q20	536	415	261	261
q21	5530	2560	2352	2352
q22	420	359	302	302
Total cold run time: 100653 ms
Total hot run time: 30868 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4195	4361	4102	4102
q2	q3	4528	4939	4367	4367
q4	2101	2206	1378	1378
q5	4417	4274	4294	4274
q6	230	175	129	129
q7	2262	1898	1642	1642
q8	2493	2150	2077	2077
q9	7859	7822	7747	7747
q10	4515	4502	4033	4033
q11	586	408	388	388
q12	725	890	541	541
q13	3419	3712	2982	2982
q14	313	321	290	290
q15	q16	754	737	690	690
q17	1342	1324	1307	1307
q18	8022	7306	7222	7222
q19	1117	1068	1123	1068
q20	2232	2220	1919	1919
q21	5296	4572	4473	4473
q22	531	470	413	413
Total cold run time: 56937 ms
Total hot run time: 51042 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 169460 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d8e3ea2009bc845705e638373a15960de0f5792a, data reload: false

query5	4314	653	525	525
query6	340	223	199	199
query7	4225	594	308	308
query8	343	226	213	213
query9	8841	4032	4010	4010
query10	439	342	302	302
query11	5771	2395	2180	2180
query12	177	135	128	128
query13	1260	569	443	443
query14	5956	5379	5059	5059
query14_1	4361	4346	4353	4346
query15	216	210	183	183
query16	1028	462	467	462
query17	1069	758	617	617
query18	2461	489	369	369
query19	226	210	167	167
query20	150	134	134	134
query21	220	148	118	118
query22	13620	13488	13348	13348
query23	17145	16301	16045	16045
query23_1	16116	16147	16215	16147
query24	7576	1740	1330	1330
query24_1	1301	1308	1315	1308
query25	558	464	418	418
query26	1301	318	170	170
query27	2706	542	349	349
query28	4427	1972	1959	1959
query29	985	627	536	536
query30	305	244	200	200
query31	1121	1062	926	926
query32	87	71	76	71
query33	540	357	286	286
query34	1148	1135	639	639
query35	764	778	672	672
query36	1354	1385	1157	1157
query37	154	101	89	89
query38	3211	3099	3021	3021
query39	931	923	901	901
query39_1	866	873	865	865
query40	231	152	123	123
query41	65	63	63	63
query42	109	110	109	109
query43	321	335	298	298
query44	
query45	211	199	196	196
query46	1068	1221	733	733
query47	2371	2335	2219	2219
query48	353	396	286	286
query49	637	500	385	385
query50	984	350	248	248
query51	4322	4412	4243	4243
query52	104	106	98	98
query53	253	284	204	204
query54	320	305	263	263
query55	95	91	86	86
query56	331	297	311	297
query57	1422	1404	1337	1337
query58	304	289	277	277
query59	1538	1649	1429	1429
query60	332	321	303	303
query61	157	146	155	146
query62	676	622	562	562
query63	255	207	214	207
query64	2420	792	615	615
query65	
query66	1718	466	359	359
query67	29946	29896	29849	29849
query68	
query69	463	350	306	306
query70	971	956	988	956
query71	314	285	277	277
query72	2974	2676	2440	2440
query73	870	731	443	443
query74	5095	4908	4772	4772
query75	2807	2570	2263	2263
query76	2323	1138	754	754
query77	399	409	327	327
query78	12119	12032	11677	11677
query79	1450	1007	733	733
query80	660	533	455	455
query81	448	286	238	238
query82	1419	153	117	117
query83	350	281	263	263
query84	314	141	107	107
query85	902	532	449	449
query86	392	318	309	309
query87	3385	3398	3221	3221
query88	3606	2666	2655	2655
query89	440	383	335	335
query90	1936	179	180	179
query91	176	169	137	137
query92	79	77	75	75
query93	1539	1543	899	899
query94	566	349	299	299
query95	664	388	428	388
query96	1021	812	332	332
query97	2713	2715	2562	2562
query98	237	236	232	232
query99	1125	1103	926	926
Total cold run time: 252693 ms
Total hot run time: 169460 ms

Comment thread be/src/util/jsonb_document.h Outdated
decimal.value / scale_multiplier == integer_value;
}

inline wide::Int256 power_of_five(uint32_t exponent) {
Contributor review comment:

Could this be turned into a lookup table? The current implementation performs poorly.
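
A minimal sketch of what a table-driven `power_of_five` could look like. It uses `uint64_t` as a stand-in for the `wide::Int256` type in the PR, so the table stops at 5^27 (the largest power of five that fits in 64 bits); the names `PowersOfFive`, `kPowersOfFiveCount`, and `kPowersOfFive` are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// 5^27 is the largest power of five representable in uint64_t.
constexpr std::size_t kPowersOfFiveCount = 28;

struct PowersOfFive {
    uint64_t values[kPowersOfFiveCount];
};

// Build the table once at compile time instead of multiplying per call.
constexpr PowersOfFive make_powers_of_five() {
    PowersOfFive table{};
    uint64_t value = 1;
    for (std::size_t i = 0; i < kPowersOfFiveCount; ++i) {
        table.values[i] = value;
        if (i + 1 < kPowersOfFiveCount) {
            value *= 5; // never executed for the last slot, so no overflow
        }
    }
    return table;
}

constexpr PowersOfFive kPowersOfFive = make_powers_of_five();

// Precondition in this sketch: exponent < kPowersOfFiveCount.
constexpr uint64_t power_of_five(uint32_t exponent) {
    return kPowersOfFive.values[exponent];
}
```

Each call becomes a single indexed load; the trade-off is that the caller has to guarantee the exponent stays within the table.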

Comment thread be/src/util/jsonb_document.h Outdated
const auto divisor = wide::Int256(1) << divisor_exponent;
return significand % divisor == 0 && value == significand / divisor;
}
if (exponent >= 255) {
Contributor review comment:

Why is this 255? It needs a comment.
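
One plausible reading of the 255 boundary is that a signed 256-bit integer has 255 value bits, so `1 << 255` is already out of range. The sketch below shows the same reasoning on `int64_t`, where the limit is 63; the function `equals_scaled` and its restriction to positive significands are assumptions for illustration, not the PR's code.

```cpp
#include <cstdint>
#include <limits>

// int64_t has 63 value bits; the analogous count for a signed 256-bit
// integer would be 255.
constexpr int kValueBits = std::numeric_limits<int64_t>::digits;

// Hypothetical helper: true when significand * 2^exponent == value.
// This sketch only handles significand > 0 and exponent >= 0.
constexpr bool equals_scaled(int64_t value, int64_t significand, int exponent) {
    if (significand <= 0 || exponent < 0) {
        return false; // outside the scope of this sketch
    }
    // Shifting by kValueBits or more would overflow (and is undefined
    // behaviour), and the result could not equal any int64_t anyway.
    if (exponent >= kValueBits) {
        return false;
    }
    // Reject products that would not fit before performing the shift.
    if (significand > (std::numeric_limits<int64_t>::max() >> exponent)) {
        return false;
    }
    return (significand << exponent) == value;
}
```

Checking the exponent before shifting keeps the comparison well-defined for very large floating-point values, which can simply be reported as not equal.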

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 79.05% (117/148) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.57% (20743/38722)
Line Coverage 37.22% (196225/527200)
Region Coverage 33.52% (153655/458357)
Branch Coverage 34.61% (67053/193749)

mrhhsg added a commit to mrhhsg/doris that referenced this pull request May 20, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63396

Problem Summary: Optimize the JSON numeric comparison helper by replacing per-call power-of-five multiplication with a precomputed table, document the Int256 shift boundary, and refresh the Presto JSON function expected results affected by numeric equality semantics.

### Release note

None

### Check List (For Author)

- Test: Compile check / Format check

    - build-support/clang-format.sh

    - git diff --check

    - ninja -C be/ut_build_ASAN test/CMakeFiles/doris_be_test.dir/core/jsonb/jsonb_document_test.cpp.o src/util/CMakeFiles/Util.dir/jsonb_document.cpp.o

- Behavior changed: No

- Does this need documentation: No
mrhhsg added 2 commits May 20, 2026 18:08
@mrhhsg force-pushed the fix/json-contains-numeric-equality branch from 378b5b6 to c9db681 on May 20, 2026 10:09
@mrhhsg (Member, Author) commented May 20, 2026

/review

@github-actions bot (Contributor) left a review

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test: The PR implements value-based JSON numeric equality for JSONB contains and adds BE/regression coverage for common int/double/decimal cases. The main goal is covered for normal valid inputs.
  • Scope: The change is focused and localized to JSONB numeric comparison plus tests.
  • Concurrency/lifecycle: No new concurrency, locking, background thread, or non-trivial lifecycle management is introduced.
  • Configuration/compatibility: No new config or storage-format change. Existing serialized JSONB decimals are reinterpreted with new comparison semantics; invalid scale metadata now matters in a new path.
  • Parallel paths: Array/object contains delegate to the same JsonbValue::contains path, so the behavior is applied consistently.
  • Error handling/memory safety: One issue found: release builds can read past the powers-of-five lookup table for a decimal scale outside MAX_DECIMALV3_SCALE.
  • Tests: Tests cover ordinary valid numeric equality but do not cover boundary/malformed decimal scale handling.
  • Observability/performance: No new observability requirement. Existing review threads already cover the decimal multiplier performance and shift-limit comment concerns, so I did not repeat them.
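
The out-of-bounds concern in the error-handling checkpoint can be sketched as follows: validate the scale on every build type before it is used as a table index, rather than relying on a debug-only assertion. The limit `kMaxScale`, the power-of-ten table, and the function names are hypothetical stand-ins, not Doris's `MAX_DECIMALV3_SCALE` or its actual table.

```cpp
#include <cstdint>

// Hypothetical limit for this sketch; the real constant lives in Doris.
constexpr uint32_t kMaxScale = 9;

constexpr uint64_t kPowersOfTen[kMaxScale + 1] = {
        1,      10,      100,      1000,      10000,
        100000, 1000000, 10000000, 100000000, 1000000000};

struct Multiplier {
    bool valid;
    uint64_t value;
};

// Bounds-check first so malformed scale metadata never indexes the table.
constexpr Multiplier scale_multiplier(uint32_t scale) {
    return scale <= kMaxScale ? Multiplier{true, kPowersOfTen[scale]}
                              : Multiplier{false, 0};
}

// A decimal (unscaled / 10^scale) equals an integer only when the scale is
// valid, the division is exact, and the quotient matches.
constexpr bool decimal_equals_integer(int64_t unscaled, uint32_t scale, int64_t integer) {
    const Multiplier m = scale_multiplier(scale);
    if (!m.valid) {
        return false; // treat invalid metadata as "not equal"
    }
    const int64_t divisor = static_cast<int64_t>(m.value);
    return unscaled % divisor == 0 && unscaled / divisor == integer;
}
```

Returning "not equal" for an invalid scale is one possible policy; reporting an error would be another, depending on how malformed JSONB should be surfaced.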

User focus: No additional user-provided review focus was present.

Comment thread be/src/util/jsonb_document.h
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63396

Problem Summary: Validate JSONB decimal scale metadata before using it in numeric contains comparison so malformed or incompatible serialized decimals cannot index past the powers-of-five lookup table in release builds. Add BE unit coverage for invalid decimal scale metadata.

### Release note

None

### Check List (For Author)

- Test: Unit Test / Format check
    - ninja -C be/ut_build_ASAN test/CMakeFiles/doris_be_test.dir/core/jsonb/jsonb_document_test.cpp.o
    - build-support/clang-format.sh
    - build-support/check-format.sh
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
@mrhhsg (Member, Author) commented May 21, 2026

/review

@mrhhsg (Member, Author) commented May 21, 2026

run buildall

@github-actions bot (Contributor) left a review

Automated review summary: I found no additional blocking issues beyond the existing inline review context. The prior concerns about power-of-five lookup performance, the Int256 shift-boundary explanation, and release-path decimal scale validation appear addressed in the current head.

Critical checkpoint conclusions:

  • Goal and tests: The change implements value-based numeric comparison for json_contains/JsonbValue::contains across integer, floating-point, and decimal JSONB storage categories. BE unit coverage and a regression case were added, and affected Presto expected outputs were refreshed.
  • Scope: The implementation is focused on JSONB numeric comparison helpers plus related tests/results.
  • Concurrency: No shared mutable state or runtime concurrency path was introduced; the inline lookup table is immutable.
  • Lifecycle/static initialization: No non-trivial cross-TU lifecycle dependency was found; the added table is an inline constexpr value in the header.
  • Configuration/compatibility/persistence/data writes: No config, protocol, storage-format, EditLog, transaction, or data-write path changes were introduced.
  • Parallel paths: The reviewed json_contains path reaches JsonbValue::contains; recursive array/object containment continues to reuse that path.
  • Error handling and invariants: Decimal scale is now validated on the release path before table indexing. Numeric boundary checks avoid the reviewed out-of-range casts/shifts.
  • Test coverage/results: Tests cover core numeric category equality and invalid decimal scale metadata. I did not rerun the build/tests in this review; residual risk is mainly additional extreme numeric boundary combinations not covered by the added cases.
  • Observability/performance: No new observability appears necessary for this local comparison helper. The precomputed table addresses the previously raised obvious performance concern.
  • User focus: No additional user-provided review focus was specified.

@hello-stephen (Contributor)

TPC-H: Total hot run time: 31325 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b32ca54d6ad0c429fe40db6d3951d8ab736e7dae, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17702	3997	3958	3958
q2	q3	10795	1446	831	831
q4	4716	472	350	350
q5	8476	2309	2160	2160
q6	393	181	136	136
q7	935	770	654	654
q8	9657	1969	1693	1693
q9	7009	4986	4933	4933
q10	6464	2108	1789	1789
q11	447	272	245	245
q12	690	425	285	285
q13	18197	3524	2749	2749
q14	263	255	235	235
q15	q16	815	771	708	708
q17	977	970	908	908
q18	6900	5767	5607	5607
q19	1202	1271	1083	1083
q20	518	393	265	265
q21	5732	2681	2434	2434
q22	444	361	302	302
Total cold run time: 102332 ms
Total hot run time: 31325 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4267	4127	4218	4127
q2	q3	4518	4916	4329	4329
q4	2120	2202	1385	1385
q5	4398	4300	4905	4300
q6	261	197	149	149
q7	1933	1903	1577	1577
q8	2460	2133	2109	2109
q9	7805	7814	7754	7754
q10	4596	4525	4344	4344
q11	600	424	377	377
q12	719	735	517	517
q13	3329	3725	2940	2940
q14	297	303	269	269
q15	q16	698	724	653	653
q17	1352	1362	1321	1321
q18	8041	7613	7053	7053
q19	1167	1085	1138	1085
q20	2215	2207	1927	1927
q21	5386	4628	4442	4442
q22	529	498	425	425
Total cold run time: 56691 ms
Total hot run time: 51083 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 169323 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b32ca54d6ad0c429fe40db6d3951d8ab736e7dae, data reload: false

query5	4343	644	512	512
query6	330	227	203	203
query7	4257	566	321	321
query8	328	237	219	219
query9	8844	3983	3975	3975
query10	451	355	312	312
query11	5848	2391	2172	2172
query12	183	137	128	128
query13	1308	639	432	432
query14	5986	5330	5060	5060
query14_1	4351	4362	4349	4349
query15	212	207	191	191
query16	1050	452	473	452
query17	1178	784	604	604
query18	2743	528	371	371
query19	221	216	171	171
query20	140	136	131	131
query21	218	143	121	121
query22	13519	13713	13367	13367
query23	17234	16414	15965	15965
query23_1	16154	16209	16111	16111
query24	7454	1782	1322	1322
query24_1	1312	1317	1290	1290
query25	600	505	440	440
query26	1317	350	179	179
query27	2687	581	347	347
query28	4447	1984	1929	1929
query29	1033	642	521	521
query30	314	244	202	202
query31	1136	1075	945	945
query32	101	78	76	76
query33	559	377	306	306
query34	1192	1116	659	659
query35	772	809	674	674
query36	1324	1349	1242	1242
query37	152	103	91	91
query38	3220	3148	3056	3056
query39	926	942	884	884
query39_1	889	876	874	874
query40	227	143	125	125
query41	65	66	63	63
query42	108	109	111	109
query43	326	329	298	298
query44	
query45	218	213	197	197
query46	1091	1215	736	736
query47	2336	2315	2251	2251
query48	405	411	291	291
query49	637	484	382	382
query50	984	359	262	262
query51	4271	4237	4212	4212
query52	105	105	93	93
query53	265	274	207	207
query54	317	269	269	269
query55	94	90	85	85
query56	303	324	307	307
query57	1451	1428	1322	1322
query58	297	266	267	266
query59	1578	1619	1418	1418
query60	320	332	309	309
query61	162	159	157	157
query62	658	620	549	549
query63	234	195	205	195
query64	2399	791	633	633
query65	
query66	1697	491	349	349
query67	30035	30155	29274	29274
query68	
query69	463	336	312	312
query70	1015	1006	1014	1006
query71	304	269	264	264
query72	3019	2725	2386	2386
query73	876	773	447	447
query74	5059	4893	4772	4772
query75	2679	2593	2271	2271
query76	2266	1148	779	779
query77	398	403	333	333
query78	12124	12151	11532	11532
query79	1477	1058	761	761
query80	989	541	467	467
query81	510	289	270	270
query82	1351	156	119	119
query83	360	281	259	259
query84	264	139	116	116
query85	937	541	459	459
query86	443	325	323	323
query87	3436	3365	3268	3268
query88	3577	2673	2652	2652
query89	445	401	338	338
query90	1794	187	182	182
query91	184	168	141	141
query92	81	82	73	73
query93	1555	1475	904	904
query94	640	351	278	278
query95	682	375	355	355
query96	1092	786	369	369
query97	2761	2686	2579	2579
query98	232	226	232	226
query99	1128	1109	984	984
Total cold run time: 253671 ms
Total hot run time: 169323 ms

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 80.89% (127/157) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.73% (27913/37859)
Line Coverage 57.65% (303274/526067)
Region Coverage 54.91% (254306/463101)
Branch Coverage 56.35% (109724/194724)
