[fix](parquet) Fix wrong condition #63509

Merged
Gabriel39 merged 5 commits into apache:master from Gabriel39:fix_0522 on May 25, 2026

@Gabriel39
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39
Contributor Author

/review

@github-actions (bot) left a comment
Contributor

I found a blocking correctness issue in the Parquet bounds check.

Critical checkpoint conclusions:

  • Goal/test: the PR appears intended to make the byte-array bounds check overflow-safe, but the current expression still performs the potentially overflowing 32-bit addition before widening. I did not find a test that proves malformed large lengths are rejected.
  • Scope: the change is small and focused, but the implementation does not accomplish the stated safety property.
  • Concurrency/lifecycle: no concurrency or special lifecycle concern in this decoder path.
  • Config/compatibility/persistence: no new configs, storage format changes, or edit-log compatibility concerns found.
  • Parallel paths: the same overflow pattern is repeated in all three modified checks in this file and should be fixed consistently.
  • Tests: malformed byte-array length coverage appears missing for this overflow case.
  • Observability/performance: no additional observability or performance issue found beyond the correctness problem below.

User focus: no additional user-provided review focus was supplied.

decode_fixed32_le(reinterpret_cast<const uint8_t*>(_data->data) + _offset);
_offset += 4;
- if (UNLIKELY(_offset + length) > _data->size) {
+ if (UNLIKELY(cast_set<uint64_t>(_offset + length) > _data->size)) {
Contributor

This still overflows before the cast: _offset and length are both uint32_t, so a malformed value such as _offset = 8 and length = UINT32_MAX wraps in _offset + length before cast_set<uint64_t> sees it. That can pass the bounds check and then skip/read outside the Slice. Cast one operand before adding, or compare with subtraction such as length > _data->size - _offset. The same pattern appears in the two decode paths below.
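
To make the wraparound concrete, here is a minimal sketch of the three forms of the check discussed in this comment. The function names are hypothetical, `static_cast` stands in for Doris's `cast_set`, and the parameters stand in for `_offset`, `length`, and `_data->size`; it illustrates the arithmetic only and is not the code in byte_array_plain_decoder.cpp.

```cpp
#include <cstdint>
#include <limits>

// Buggy form: the sum is computed in 32 bits (assuming a 32-bit unsigned int)
// and wraps modulo 2^32 before it is widened to 64 bits.
constexpr bool fits_wrapping(uint32_t offset, uint32_t length, uint64_t size) {
    return static_cast<uint64_t>(offset + length) <= size;
}

// Option 1: widen one operand first so the addition itself happens in 64 bits.
constexpr bool fits_widened(uint32_t offset, uint32_t length, uint64_t size) {
    return static_cast<uint64_t>(offset) + length <= size;
}

// Option 2: compare the length against the remaining bytes; no addition at all.
constexpr bool fits_subtract(uint32_t offset, uint32_t length, uint64_t size) {
    return offset <= size && length <= size - offset;
}

constexpr uint32_t kMax32 = std::numeric_limits<uint32_t>::max();

// offset = 8, length = UINT32_MAX, buffer of 16 bytes: 8 + UINT32_MAX wraps to 7.
static_assert(fits_wrapping(8, kMax32, 16), "wrapped sum slips past the check");
static_assert(!fits_widened(8, kMax32, 16), "widened sum is rejected");
static_assert(!fits_subtract(8, kMax32, 16), "subtraction form is rejected");
```

Both safe forms reject the malformed value. The subtraction form has the extra property that it stays correct even if the operands are already as wide as the largest available integer type.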

@Gabriel39
Contributor Author

/review

@github-actions (bot) left a comment
Contributor

Review completed. I did not find additional blocking issues in the current PR head. The previously raised overflow concern in byte_array_plain_decoder.cpp is addressed by casting _offset before adding length in all three changed checks.

Critical checkpoint conclusions:

  • Goal/test: The goal is to prevent integer overflow in byte-array plain decoder bounds checks. The current code accomplishes this for the reviewed overflow path. No new test is included; risk is limited by the small localized change, but a malformed Parquet regression/unit test would improve coverage.
  • Scope: The modification is small, clear, and focused.
  • Concurrency/lifecycle/config/compatibility: Not applicable; no shared state, lifecycle, configuration, protocol, or storage-format change is introduced.
  • Parallel paths: All three equivalent BYTE_ARRAY plain decode/skip paths in this file were updated consistently.
  • Error handling/data correctness: Existing Status error propagation is preserved, and malformed input now fails before using an out-of-bounds slice.
  • Performance/observability: No meaningful performance or observability impact.
  • User focus: No additional user-provided review focus was specified.

@Gabriel39
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31276 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1f7887396b50493038df3c3a294212e688bc3172, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17829	3872	3832	3832
q2	q3	10794	1365	801	801
q4	4688	476	348	348
q5	7595	2265	2046	2046
q6	231	175	138	138
q7	928	771	627	627
q8	9352	1822	1659	1659
q9	5056	4880	4851	4851
q10	6310	2058	1785	1785
q11	430	285	252	252
q12	629	417	296	296
q13	18099	3297	2805	2805
q14	263	252	232	232
q15	q16	816	770	703	703
q17	988	946	920	920
q18	7143	5942	5508	5508
q19	1341	1215	1095	1095
q20	520	397	338	338
q21	6179	2826	2711	2711
q22	454	369	329	329
Total cold run time: 99645 ms
Total hot run time: 31276 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4846	4606	4484	4484
q2	q3	4826	5213	4591	4591
q4	2113	2269	1392	1392
q5	4881	4699	4591	4591
q6	231	176	126	126
q7	1942	1706	1509	1509
q8	2339	2023	2042	2023
q9	7631	7347	7137	7137
q10	4473	4426	3996	3996
q11	514	375	343	343
q12	704	714	506	506
q13	3016	3301	2863	2863
q14	274	281	247	247
q15	q16	680	693	602	602
q17	1251	1226	1207	1207
q18	7150	6922	6803	6803
q19	1103	1082	1116	1082
q20	2220	2191	1940	1940
q21	5275	4613	4467	4467
q22	517	449	400	400
Total cold run time: 55986 ms
Total hot run time: 50309 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169463 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1f7887396b50493038df3c3a294212e688bc3172, data reload: false

query5	4332	645	513	513
query6	338	215	205	205
query7	4226	542	297	297
query8	326	234	214	214
query9	8834	3970	3939	3939
query10	449	334	305	305
query11	5805	2404	2185	2185
query12	185	127	122	122
query13	1299	550	437	437
query14	5983	5337	5048	5048
query14_1	4319	4313	4304	4304
query15	203	200	182	182
query16	1025	439	424	424
query17	1123	700	582	582
query18	2748	475	344	344
query19	215	195	155	155
query20	132	129	125	125
query21	228	141	124	124
query22	13617	13547	13362	13362
query23	17214	16348	15958	15958
query23_1	16039	16096	16133	16096
query24	7459	1770	1262	1262
query24_1	1318	1307	1296	1296
query25	524	465	437	437
query26	1285	331	171	171
query27	2713	532	338	338
query28	4390	1959	1937	1937
query29	1012	591	477	477
query30	300	251	197	197
query31	1098	1069	923	923
query32	87	74	70	70
query33	516	344	292	292
query34	1152	1146	654	654
query35	772	787	665	665
query36	1288	1319	1243	1243
query37	150	99	90	90
query38	3183	3118	3051	3051
query39	925	916	890	890
query39_1	889	885	865	865
query40	230	147	123	123
query41	64	62	64	62
query42	109	106	104	104
query43	322	325	290	290
query44	
query45	213	198	195	195
query46	1071	1194	740	740
query47	2369	2344	2198	2198
query48	398	423	310	310
query49	628	514	372	372
query50	954	355	245	245
query51	4367	4247	4244	4244
query52	101	104	92	92
query53	251	273	201	201
query54	329	261	260	260
query55	93	89	85	85
query56	307	298	310	298
query57	1426	1396	1322	1322
query58	317	290	278	278
query59	1553	1618	1393	1393
query60	328	339	325	325
query61	184	179	176	176
query62	670	632	559	559
query63	244	206	209	206
query64	2431	787	647	647
query65	
query66	1703	478	364	364
query67	29996	29949	29829	29829
query68	
query69	458	338	312	312
query70	978	957	990	957
query71	300	276	263	263
query72	3038	2697	2573	2573
query73	846	764	417	417
query74	5031	4884	4704	4704
query75	2649	2595	2248	2248
query76	2291	1145	741	741
query77	403	412	356	356
query78	12368	12069	11534	11534
query79	1239	1095	754	754
query80	601	559	480	480
query81	459	281	247	247
query82	237	162	125	125
query83	279	283	253	253
query84	263	142	113	113
query85	913	623	529	529
query86	365	364	316	316
query87	3366	3357	3240	3240
query88	3536	2727	2639	2639
query89	420	396	335	335
query90	2168	223	175	175
query91	175	171	136	136
query92	74	80	71	71
query93	1354	1544	863	863
query94	524	360	308	308
query95	676	392	337	337
query96	982	771	326	326
query97	2731	2677	2558	2558
query98	231	224	220	220
query99	1122	1118	1000	1000
Total cold run time: 251344 ms
Total hot run time: 169463 ms

@Gabriel39
Contributor Author

/review

@github-actions (bot) left a comment
Contributor

I found one blocking correctness/safety issue in the PR. The plain decoder overflow concern already raised in the existing review thread is known context and was not duplicated here.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to harden malformed Parquet byte-array length handling and updates an HDFS TVF regression expectation, but the dictionary decoder fix is still incomplete for truncated dictionary pages.
  • Scope: The change is small and focused, but the dictionary bounds condition does not cover the actual bytes consumed.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, config, wire/storage-format, or compatibility concerns found.
  • Parallel paths: Plain and dictionary paths are both relevant; the dictionary path has a distinct remaining OOB case.
  • Error handling/data correctness/memory safety: The dictionary decoder can read/copy outside the page before returning Status::Corruption, so malformed input can still violate memory safety.
  • Tests: The regression test exercises a plain-decoder error path, but I did not see coverage proving malformed dictionary byte-array pages are rejected before any out-of-bounds access.
  • Observability/performance: No additional observability or performance concerns found for this small change.
  • User focus: No additional user-provided review focus was specified.

return Status::Corruption("Wrong data length in dictionary");
}
uint32_t l = decode_fixed32_le(_dict.get() + offset_cursor);
if (UNLIKELY(offset_cursor + l > cast_set<uint32_t>(length) ||
Contributor

This check still allows truncated dictionary pages to be read out of bounds. decode_fixed32_le(_dict.get() + offset_cursor) consumes 4 bytes, but the preceding check only rejects offset_cursor > length; for example length = 2 and num_values = 1 reaches the decode with only two bytes available. The payload check also compares offset_cursor + l before accounting for the 4-byte prefix, so length = 4, l = 1 passes here, then the second loop copies one byte from dict_item_address + 4 before reporting corruption. Please validate length - offset_cursor >= 4 before decoding the prefix and then validate l <= length - offset_cursor after advancing past the prefix, using subtraction or widened arithmetic to avoid overflow.
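
The two validations requested in this comment can be sketched as follows. This is a simplified, hypothetical walker over a length-prefixed dictionary page, not the Doris `set_dict` implementation: `load_fixed32_le` and `parse_dict_page` are illustrative names, and the page length is assumed to be already non-negative and held in a `size_t`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper: reads a little-endian uint32 from 4 readable bytes.
inline uint32_t load_fixed32_le(const uint8_t* p) {
    return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Walks num_values entries of the form [4-byte length][payload] in a page of
// `length` bytes. Returns false (treat as corruption) before reading any byte
// outside the page. The loop keeps the invariant cursor <= length, so every
// subtraction below is safe from underflow.
inline bool parse_dict_page(const uint8_t* dict, size_t length, size_t num_values,
                            std::vector<std::string_view>* out) {
    size_t cursor = 0;
    out->clear();
    for (size_t i = 0; i < num_values; ++i) {
        if (length - cursor < 4) return false;  // not enough bytes for the prefix
        const uint32_t l = load_fixed32_le(dict + cursor);
        cursor += 4;                            // advance past the prefix first
        if (l > length - cursor) return false;  // payload does not fit in the rest
        out->emplace_back(reinterpret_cast<const char*>(dict + cursor), l);
        cursor += l;
    }
    return true;
}
```

With this shape, the two cases from the comment are both rejected without an out-of-bounds access: `length = 2, num_values = 1` fails the prefix check, and `length = 4` with a prefix of `l = 1` fails the payload check because zero bytes remain after the prefix.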

@Gabriel39
Contributor Author

/review

@github-actions (bot) left a comment
Contributor

The PR still has blocking correctness/safety issues already captured in the existing inline review threads, so I am not duplicating those comments. In particular, the byte-array plain decoder still needs overflow-safe length checks in all decode/skip paths, and the byte-array dictionary decoder still needs to verify that at least 4 bytes remain before reading each length prefix and that the payload length fits after advancing past the prefix.

Critical checkpoint conclusions:

  • Goal/test: the goal appears to be hardening malformed Parquet byte-array handling and updating the regression expectation; the current code does not fully accomplish it because malformed pages can still drive out-of-bounds reads before returning an error.
  • Scope: the change is small and focused, but the bounds checks are incomplete.
  • Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, or compatibility concerns found in the changed code.
  • Parallel paths: the same plain-decoder length pattern appears in skip, content decode, and filtered-content decode; all need the same overflow-safe treatment.
  • Tests/results: the regression test exercises the ARROW-17100 failure path, but the code still lacks coverage for the truncated dictionary-prefix/payload cases and overflow cases described in the existing comments.
  • Observability: existing error propagation is sufficient once the bounds checks are correct.
  • Transaction/persistence/data writes/FE-BE variables: not applicable.
  • Performance: no additional performance issue found.

User focus: no additional user-provided review focus was specified.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Parquet byte array decoders still used addition-based bounds checks for malformed length-prefixed data. The plain decoder could overflow _offset + length before rejecting a value, and the dictionary decoder could decode or copy from truncated pages before validating that enough bytes remained for the length prefix and payload. This change validates remaining bytes with subtraction before decoding or advancing offsets, rejects negative dictionary page lengths, and adds unit coverage for truncated prefixes, truncated payloads, and overflow-sized values.
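
As an illustration of the approach summarized above, the following is a minimal sketch of a plain-decoder-style skip loop that validates remaining bytes by subtraction before advancing, together with a helper for building the malformed inputs the new unit tests are described as covering. `checked_page_length`, `PlainCursor`, and `append_value` are hypothetical names introduced for this example; they are not the Doris classes or test helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: a negative page length is rejected instead of being
// reinterpreted as a huge unsigned size.
inline std::optional<uint32_t> checked_page_length(int32_t length) {
    if (length < 0) return std::nullopt;
    return static_cast<uint32_t>(length);
}

// Hypothetical stand-in for a plain BYTE_ARRAY decoder's cursor state.
struct PlainCursor {
    const uint8_t* data;
    uint32_t size;
    uint32_t offset = 0;  // invariant: offset <= size

    // Skips n length-prefixed values. On malformed input it returns false and
    // leaves offset at the start of the first bad value.
    bool skip_values(size_t n) {
        for (size_t i = 0; i < n; ++i) {
            if (size - offset < 4) return false;  // truncated prefix
            const uint8_t* p = data + offset;
            const uint32_t len = uint32_t{p[0]} | (uint32_t{p[1]} << 8) |
                                 (uint32_t{p[2]} << 16) | (uint32_t{p[3]} << 24);
            const uint32_t remaining = size - offset - 4;
            if (len > remaining) return false;  // truncated or overflow-sized payload
            offset += 4 + len;  // cannot wrap: 4 + len <= size - offset
        }
        return true;
    }
};

// Test helper: appends a little-endian declared length followed by whatever
// payload bytes are given, so the two can deliberately disagree.
inline void append_value(std::vector<uint8_t>& buf, uint32_t declared_len,
                         std::string_view payload) {
    for (int shift = 0; shift < 32; shift += 8) {
        buf.push_back(static_cast<uint8_t>(declared_len >> shift));
    }
    buf.insert(buf.end(), payload.begin(), payload.end());
}
```

A test can then build a buffer with `append_value(buf, 5, "ab")` for a truncated payload, or with a declared length of `std::numeric_limits<uint32_t>::max()` for an overflow-sized value, and expect `skip_values` to return false with `offset` unchanged.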

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
    - Could not run build-support/clang-format.sh because llvm@16/clang-format is not installed in this environment
    - Could not run BE UT because run-be-ut.sh requires JDK-17, but this environment only has JDK-11
- Behavior changed: No
- Does this need documentation: No
@Gabriel39
Contributor Author

/review

@github-actions (bot) left a comment
Contributor

No blocking issues found in the current GitHub PR diff. The final head revision fixes the previously raised byte-array plain overflow and dictionary truncated-page concerns by checking remaining bytes before decoding/advancing, and adds focused BE unit coverage for malformed prefixes/payloads and overflow-sized plain values.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets malformed Parquet byte-array bounds checks; the final code accomplishes this for the reviewed plain and dictionary paths, with added unit tests.
  • Scope: The effective GitHub diff is small and focused on Parquet decoder validation plus the related regression expectation update.
  • Concurrency/lifecycle: No new shared state, locking, threads, or non-obvious lifecycle/static initialization concerns were introduced.
  • Configuration/compatibility: No new config, protocol, storage format, or rolling-upgrade compatibility concern.
  • Parallel paths: The plain decoder skip/content/filtered-content paths were updated consistently; dictionary set_dict validates both sizing and copy passes.
  • Error handling/data correctness: New checks return non-OK Status before out-of-bounds decode/copy; Status propagation is preserved.
  • Tests: Added BE unit tests cover the core malformed cases. I did not run BE UTs in this review; I ran git diff --check for the final head commit.
  • Observability/performance: No new observability need; added checks are constant-time and appropriate for corrupted input handling.

User focus: No additional user-provided review focus was specified.

@Gabriel39
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31534 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2785888f4ff92783aa1c8dd4a4d9d6f4625d334b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17693	4088	4138	4088
q2	q3	10785	1389	813	813
q4	4683	466	357	357
q5	7566	2271	2136	2136
q6	244	179	138	138
q7	968	750	622	622
q8	9447	1895	1546	1546
q9	5127	4894	4903	4894
q10	6434	2065	1844	1844
q11	436	284	248	248
q12	624	424	292	292
q13	18118	3391	2770	2770
q14	271	256	237	237
q15	q16	816	757	705	705
q17	998	951	1033	951
q18	7148	5717	5515	5515
q19	1330	1304	1057	1057
q20	671	457	301	301
q21	6374	2845	2648	2648
q22	459	372	434	372
Total cold run time: 100192 ms
Total hot run time: 31534 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4775	4687	4661	4661
q2	q3	4797	5349	4616	4616
q4	2126	2220	1422	1422
q5	4979	4633	4632	4632
q6	239	196	140	140
q7	1906	1758	1519	1519
q8	2368	2131	2161	2131
q9	7730	7313	7232	7232
q10	4460	4422	3999	3999
q11	523	376	381	376
q12	708	724	510	510
q13	3086	3402	2869	2869
q14	282	273	244	244
q15	q16	687	700	609	609
q17	1271	1243	1244	1243
q18	7282	6766	6649	6649
q19	1112	1119	1075	1075
q20	2199	2205	1930	1930
q21	5376	4668	4608	4608
q22	528	455	413	413
Total cold run time: 56434 ms
Total hot run time: 50878 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170029 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2785888f4ff92783aa1c8dd4a4d9d6f4625d334b, data reload: false

query5	4299	654	518	518
query6	337	217	200	200
query7	4229	587	291	291
query8	334	228	209	209
query9	8817	4009	4009	4009
query10	468	334	311	311
query11	5755	2424	2225	2225
query12	189	128	121	121
query13	1283	611	452	452
query14	6026	5360	5077	5077
query14_1	4371	4364	4335	4335
query15	211	207	187	187
query16	990	453	441	441
query17	1168	734	622	622
query18	2500	496	364	364
query19	216	205	169	169
query20	134	134	132	132
query21	217	146	120	120
query22	13609	13529	13332	13332
query23	17210	16253	16085	16085
query23_1	16180	16173	16106	16106
query24	7406	1754	1275	1275
query24_1	1293	1279	1346	1279
query25	540	465	399	399
query26	1328	333	173	173
query27	2658	549	334	334
query28	4440	1969	1961	1961
query29	1035	626	526	526
query30	310	241	201	201
query31	1137	1060	927	927
query32	90	73	75	73
query33	529	361	299	299
query34	1170	1138	634	634
query35	758	781	663	663
query36	1302	1313	1205	1205
query37	154	97	86	86
query38	3209	3188	3071	3071
query39	935	909	923	909
query39_1	876	881	876	876
query40	223	142	125	125
query41	73	63	63	63
query42	108	111	105	105
query43	319	325	275	275
query44	
query45	208	194	188	188
query46	1052	1237	702	702
query47	2268	2308	2160	2160
query48	357	442	291	291
query49	641	496	386	386
query50	1015	369	259	259
query51	4456	4280	4237	4237
query52	104	104	94	94
query53	257	273	198	198
query54	326	268	251	251
query55	96	89	90	89
query56	300	317	293	293
query57	1433	1410	1264	1264
query58	293	278	268	268
query59	1551	1629	1442	1442
query60	317	314	301	301
query61	162	158	163	158
query62	675	625	559	559
query63	238	195	209	195
query64	2439	796	640	640
query65	
query66	1722	464	351	351
query67	30082	30057	29982	29982
query68	
query69	455	341	306	306
query70	998	964	965	964
query71	302	274	265	265
query72	3245	2885	2583	2583
query73	826	782	426	426
query74	5068	4897	4730	4730
query75	2675	2595	2262	2262
query76	2282	1161	769	769
query77	390	404	340	340
query78	12284	12338	11662	11662
query79	1504	1028	725	725
query80	642	553	459	459
query81	452	287	241	241
query82	1389	161	124	124
query83	362	282	257	257
query84	267	144	112	112
query85	895	557	456	456
query86	395	311	316	311
query87	3433	3340	3233	3233
query88	3538	2670	2645	2645
query89	436	384	328	328
query90	2022	181	180	180
query91	182	174	140	140
query92	78	75	70	70
query93	1470	1482	886	886
query94	551	348	314	314
query95	672	465	351	351
query96	1017	785	329	329
query97	2683	2698	2543	2543
query98	247	226	223	223
query99	1136	1109	1012	1012
Total cold run time: 253335 ms
Total hot run time: 170029 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 84.62% (33/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.65% (20760/38697)
Line Coverage 37.26% (196664/527875)
Region Coverage 33.61% (154289/458991)
Branch Coverage 34.59% (67168/194161)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 84.62% (33/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.78% (27964/37903)
Line Coverage 57.67% (303646/526529)
Region Coverage 54.78% (253879/463420)
Branch Coverage 56.39% (109892/194888)

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 25, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@Gabriel39 Gabriel39 merged commit 3bdcb38 into apache:master May 25, 2026
31 of 32 checks passed

3 participants