Skip to content

[fix](be) Validate TDigest sizes in quantile state#63662

Open
zclllyybb wants to merge 1 commit into
apache:masterfrom
zclllyybb:codex-fix-doris-26005
Open

[fix](be) Validate TDigest sizes in quantile state#63662
zclllyybb wants to merge 1 commit into
apache:masterfrom
zclllyybb:codex-fix-doris-26005

Conversation

@zclllyybb
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: quantile_state_from_base64() decodes user-provided base64 strings into QuantileState::deserialize(). QuantileState::is_valid() previously checked only the outer TDigest serialized length, while TDigest::unserialize() trusted nested vector count fields before resizing and copying from the input buffer. A malformed TDigest payload could therefore pass validation and drive out-of-bounds reads or oversized allocations during deserialization.

This PR validates the serialized TDigest layout against the available buffer before QuantileState accepts it, including the total length and all nested vector count fields. It also tightens QuantileState length checks for explicit and TDigest payloads, and adds a BE unit test that starts from a real TDigest-backed QuantileState, corrupts one nested count after a base64 roundtrip, and verifies validation and deserialization reject the payload.

Release note

Malformed TDigest quantile state payloads are now rejected during deserialization.

Check List (For Author)

  • Test:
    • Unit Test: ./run-be-ut.sh --run --filter=function_quantile_state_test.rejects_tdigest_base64_with_corrupted_inner_count -j 90
    • Unit Test: ./run-be-ut.sh --run --filter=function_quantile_state_test.* -j 90
    • Static Analysis: build-support/run-clang-tidy.sh
    • Code Style: python3 build-support/run_clang_format.py --clang-format-executable /mnt/disk6/common/ldb_toolchain_toucan/bin/clang-format --inplace false ...
  • Behavior changed: Yes. Malformed TDigest-backed quantile state payloads are rejected instead of being accepted for deserialization.
  • Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: SQL users can pass base64 strings to quantile_state_from_base64(), which decodes the payload into QuantileState::deserialize(). QuantileState::is_valid() previously checked only the outer TDigest serialized length, while TDigest::unserialize() trusted nested vector count fields before resizing and copying from the input buffer. A malformed TDigest payload could therefore pass validation and drive out-of-bounds reads or oversized allocations during deserialization.

This change validates the serialized TDigest layout against the available buffer before QuantileState accepts it, including the total length and all nested vector count fields. It also tightens QuantileState length checks for explicit and TDigest payloads, and adds a BE unit test that starts from a real TDigest-backed QuantileState, corrupts one nested count after base64 roundtrip, and verifies validation and deserialization reject the payload.

### Release note

Malformed TDigest quantile state payloads are now rejected during deserialization.

### Check List (For Author)

- Test:
    - Unit Test: ./run-be-ut.sh --run --filter=function_quantile_state_test.rejects_tdigest_base64_with_corrupted_inner_count -j 90
    - Unit Test: ./run-be-ut.sh --run --filter=function_quantile_state_test.* -j 90
    - Static Analysis: build-support/run-clang-tidy.sh
    - Code Style: python3 build-support/run_clang_format.py --clang-format-executable /mnt/disk6/common/ldb_toolchain_toucan/bin/clang-format --inplace false ...
- Behavior changed: Yes. Malformed TDigest-backed quantile state payloads are rejected instead of being accepted for deserialization.
- Does this need documentation: No
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zclllyybb
Copy link
Copy Markdown
Contributor Author

run buildall

@zclllyybb
Copy link
Copy Markdown
Contributor Author

/review

@zclllyybb zclllyybb marked this pull request as ready for review May 26, 2026 07:16
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one remaining malformed-input path in the validation being changed.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to reject malformed TDigest-backed quantile state payloads before deserialization and adds a focused BE unit test for a corrupted TDigest count. The TDigest count case is covered, but the same hardened QuantileState::is_valid() path still accepts an invalid EXPLICIT state with zero elements.
  • Scope/focus: The change is small and focused on serialization validation.
  • Concurrency/lifecycle: No new concurrency, locks, static initialization, or non-obvious lifecycle ownership changes were introduced.
  • Config/compatibility: No config item or serialized-format change was introduced; this only rejects malformed existing payloads.
  • Parallel paths: QuantileState::deserialize() now calls the new TDigest validator, but explicit-state validation remains incomplete in the same function.
  • Special checks/error handling: The added size checks improve bounds safety, but one validation condition is still missing.
  • Tests: Added TDigest malformed-count coverage. I could not run the focused test in this runner because BE UT configuration failed before build: thirdparty/installed/bin/protoc: No such file or directory.
  • Observability/transactions/data writes/protocol variables: Not applicable.
  • Performance: The structural validation is linear in the serialized vector metadata and does not add meaningful overhead beyond deserialization.

User focus: No additional user-provided review focus was supplied.

}
uint16_t num_explicits = decode_fixed16_le(ptr);
ptr += sizeof(uint16_t);
const size_t remaining_size = static_cast<size_t>(end - ptr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation still accepts a malformed EXPLICIT state with num_explicits == 0. That payload is not produced by normal QuantileState construction, but quantile_state_from_base64() can supply it directly: is_valid() returns true when the remaining size is zero, then deserialize() does _explicit_data.resize(0) and evaluates &_explicit_data[0] in the subsequent memcpy, which is undefined for an empty vector even with a zero byte count. Since this PR is hardening untrusted serialized input in QuantileState::is_valid(), please reject zero explicit counts here (or the minimum valid explicit count, if that invariant should be >= 2) before accepting the payload.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31607 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ef0b673aa712110c292e007d7339b59d4e379995, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17656	4034	3998	3998
q2	q3	10766	1416	825	825
q4	4686	475	348	348
q5	7549	2294	2122	2122
q6	312	183	144	144
q7	924	779	655	655
q8	9347	1706	1515	1515
q9	6935	4979	4967	4967
q10	6454	2274	1881	1881
q11	428	266	240	240
q12	693	427	321	321
q13	18171	3432	2779	2779
q14	272	260	243	243
q15	q16	825	767	716	716
q17	986	970	927	927
q18	7179	5754	5473	5473
q19	1311	1403	1141	1141
q20	545	422	289	289
q21	5999	2771	2696	2696
q22	467	376	327	327
Total cold run time: 101505 ms
Total hot run time: 31607 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4861	4821	4983	4821
q2	q3	4859	5280	4694	4694
q4	2169	2251	1437	1437
q5	4941	4951	4676	4676
q6	243	175	136	136
q7	1887	1788	1563	1563
q8	2440	2054	1983	1983
q9	7462	7475	7478	7475
q10	4756	4732	4235	4235
q11	548	390	363	363
q12	742	753	557	557
q13	3151	3395	2835	2835
q14	289	292	252	252
q15	q16	683	702	626	626
q17	1317	1280	1281	1280
q18	7486	6985	6820	6820
q19	1100	1111	1159	1111
q20	2238	2232	1981	1981
q21	5364	4725	4615	4615
q22	524	477	417	417
Total cold run time: 57060 ms
Total hot run time: 51877 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 172739 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ef0b673aa712110c292e007d7339b59d4e379995, data reload: false

query5	4332	676	539	539
query6	368	222	207	207
query7	4247	581	319	319
query8	341	239	224	224
query9	8821	4144	4139	4139
query10	453	349	290	290
query11	5755	2535	2251	2251
query12	179	130	126	126
query13	1301	603	419	419
query14	6242	5590	5311	5311
query14_1	4584	4548	4612	4548
query15	220	204	190	190
query16	1028	462	426	426
query17	1138	725	604	604
query18	2552	489	359	359
query19	216	206	160	160
query20	149	137	131	131
query21	221	150	127	127
query22	13730	13751	13488	13488
query23	17438	16525	16241	16241
query23_1	16362	16512	16424	16424
query24	7457	1813	1344	1344
query24_1	1340	1337	1331	1331
query25	573	500	459	459
query26	1314	323	175	175
query27	2701	572	363	363
query28	4452	2021	2022	2021
query29	989	640	528	528
query30	304	245	202	202
query31	1160	1100	969	969
query32	90	78	74	74
query33	545	374	304	304
query34	1192	1118	681	681
query35	800	801	708	708
query36	1416	1408	1244	1244
query37	157	104	97	97
query38	3267	3198	3095	3095
query39	937	941	920	920
query39_1	885	880	885	880
query40	237	153	135	135
query41	73	72	67	67
query42	113	114	112	112
query43	350	355	315	315
query44	
query45	224	216	202	202
query46	1120	1181	758	758
query47	2345	2386	2228	2228
query48	414	427	300	300
query49	646	529	406	406
query50	1032	344	258	258
query51	4266	4321	4276	4276
query52	116	109	98	98
query53	268	303	211	211
query54	329	309	266	266
query55	96	95	88	88
query56	339	321	327	321
query57	1439	1414	1343	1343
query58	311	288	282	282
query59	1666	1731	1519	1519
query60	370	312	307	307
query61	155	160	158	158
query62	696	657	590	590
query63	245	198	207	198
query64	2385	793	647	647
query65	
query66	1715	483	355	355
query67	29920	29808	29762	29762
query68	
query69	455	344	313	313
query70	1021	1060	1025	1025
query71	316	281	273	273
query72	3008	2929	2384	2384
query73	846	767	423	423
query74	5182	4950	4834	4834
query75	2717	2613	2276	2276
query76	2279	1160	792	792
query77	425	436	346	346
query78	12366	12489	11980	11980
query79	1474	1032	729	729
query80	953	532	462	462
query81	505	290	242	242
query82	1411	159	122	122
query83	367	285	251	251
query84	265	142	113	113
query85	916	527	443	443
query86	436	345	350	345
query87	3483	3434	3253	3253
query88	3653	2782	2719	2719
query89	452	396	345	345
query90	1768	189	186	186
query91	180	167	143	143
query92	84	80	77	77
query93	1490	1563	939	939
query94	616	358	313	313
query95	675	385	436	385
query96	1067	786	336	336
query97	2768	2732	2590	2590
query98	238	228	221	221
query99	1183	1142	1036	1036
Total cold run time: 256082 ms
Total hot run time: 172739 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.18% (43/55) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.09% (27413/38024)
Line Coverage 55.41% (292515/527869)
Region Coverage 52.49% (243902/464624)
Branch Coverage 53.73% (104968/195353)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.18% (43/55) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.10% (27416/38024)
Line Coverage 55.42% (292560/527869)
Region Coverage 52.51% (243965/464624)
Branch Coverage 53.74% (104990/195353)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants