Test 0603#64054

Draft
Gabriel39 wants to merge 14 commits into apache:master from Gabriel39:test_0603

Conversation

@Gabriel39
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39 marked this pull request as draft June 3, 2026 04:29
@Gabriel39
Contributor Author

run buildall

@Gabriel39
Contributor Author

run buildall

Gabriel39 added 2 commits June 3, 2026 13:13
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Squash the refactored reader branch into one commit on top of master. The change adds the refactored TableReader/FileReader stack, the new parquet reader path, table-format readers, nested projection/filter support, aggregate pushdown support, FileScannerV2, and related BE tests and design docs.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --cached --check before committing.
- Behavior changed: Yes
- Does this need documentation: No
@Gabriel39
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.34% (1906/2433)
Line Coverage 64.85% (34028/52468)
Region Coverage 65.39% (17546/26833)
Branch Coverage 54.06% (9314/17230)

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.34% (1906/2433)
Line Coverage 64.79% (33993/52468)
Region Coverage 65.30% (17521/26833)
Branch Coverage 53.98% (9300/17230)

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

Gabriel39 added 7 commits June 3, 2026 14:17
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: TableReader could pass shared COW columns to mutable column builders in two paths. The new Parquet scan scheduler read non-predicate output columns by calling assert_mutable() on columns still owned by the file block, which throws when the block retains another reference. Complex projection materialization also rebuilt struct, array, and map columns with child pointers shared with the source nested column, so projected nested fallback scans could fail with COW::assert_mutable. This change uses scoped block column mutation for Parquet output reads and recursively detaches complex child columns with IColumn::mutate before wrapping them in result complex columns.
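The recursive detach described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyColumn` and `detach` are hypothetical names, `std::shared_ptr` stands in for the COW pointer, and none of this is the Doris `COW`/`IColumn` API.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a copy-on-write column (hypothetical, not Doris code).
struct ToyColumn {
    std::vector<int> values;
    std::vector<std::shared_ptr<ToyColumn>> children;  // nested child columns
};
using ToyColumnPtr = std::shared_ptr<ToyColumn>;

// Return a pointer that is safe to mutate. The column is reused when the
// caller handed over the only reference; otherwise this level is copied.
// Children are detached recursively so no nested pointer stays shared.
ToyColumnPtr detach(ToyColumnPtr col) {
    assert(col != nullptr);
    ToyColumnPtr owned = (col.use_count() == 1) ? std::move(col)
                                                : std::make_shared<ToyColumn>(*col);
    for (ToyColumnPtr& child : owned->children) {
        child = detach(std::move(child));
    }
    return owned;
}
```

In this sketch, calling `detach(source)` while `source` is still held elsewhere yields a private copy at every nesting level, so writes to the result never reach the source's children.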

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/reader/table_reader.h be/src/format/new_parquet/parquet_scan.cpp
    - PARALLEL=8 JDK_17=/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home JAVA_HOME=/usr/local/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home ./run-be-ut.sh --run --filter=TableReaderTest.ReopenSplitAfterClose:TableReaderTest.PushDownMinMaxFallsBackForProjectedListStructLeaf:TableReaderTest.PushDownMinMaxFallsBackForProjectedMapValueStructLeaf (not completed: submodule setup could not lock .git/config in the sandbox and fallback curl could not resolve github.com)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested Parquet scalar assembly copied present scalar values by calling insert_from on the destination column directly. When the destination is a nullable list element, struct child, or map value column, the source batch values are stored in the non-nullable nested data column, so ColumnNullable::insert_from tried to cast the source to ColumnNullable and aborted. This change appends present nested scalar values into the nullable destination's nested column and records a non-null entry in its null map.
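A minimal sketch of the append path described above, assuming a toy nullable column made of a value vector and a parallel null map. `ToyNullableInts` and `append_present` are hypothetical names for illustration, not the helpers in `nested_column_reader.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a nullable column (hypothetical, not ColumnNullable).
struct ToyNullableInts {
    std::vector<int32_t> nested;    // physical values, one per row
    std::vector<uint8_t> null_map;  // 1 = NULL, 0 = present
};

// Copy one present value from a non-nullable source batch into a nullable
// destination: write into the nested values and record a non-null entry,
// instead of treating the source as a nullable column.
void append_present(ToyNullableInts& dst, const std::vector<int32_t>& src,
                    std::size_t row) {
    assert(row < src.size());
    dst.nested.push_back(src[row]);
    dst.null_map.push_back(0);
}
```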

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/nested_column_reader.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested Parquet scalar batches can store values in a nullable column when the leaf schema is nullable, while the destination slot being filled may be a non-nullable child column after parent-level definition checks have already proven the value is present. The direct insert_from path then tried to insert a ColumnNullable value into a non-nullable destination such as ColumnString and aborted with a bad cast. This change unwraps nullable source batches for present values before appending, while preserving nullable destination handling.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/nested_column_reader.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested Parquet scalar batches can contain nullable source values. The scalar append helper previously treated any nullable source null at a present value slot as corruption, but nullable destination columns such as struct child b or nullable list/map values should receive a null entry. This change appends a default null when both source value and destination are nullable, while still rejecting null source values for required destinations.
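The resulting decision table can be sketched as follows. This is an illustration only: `ToyDestination`, `AppendResult`, and `append_scalar` are hypothetical names, and `std::optional` models a source entry that may be null.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy destination column (hypothetical); null_map is used only when nullable.
struct ToyDestination {
    bool nullable = false;
    std::vector<int32_t> values;
    std::vector<uint8_t> null_map;
};

enum class AppendResult { kOk, kNullInRequiredColumn };

// Append one source entry; std::nullopt models a null in a nullable source.
AppendResult append_scalar(ToyDestination& dst, std::optional<int32_t> src) {
    assert(!dst.nullable || dst.values.size() == dst.null_map.size());
    if (!src.has_value()) {
        if (!dst.nullable) {
            return AppendResult::kNullInRequiredColumn;  // still rejected
        }
        dst.values.push_back(0);  // default placeholder under the null entry
        dst.null_map.push_back(1);
        return AppendResult::kOk;
    }
    dst.values.push_back(*src);  // unwrap the present source value
    if (dst.nullable) {
        dst.null_map.push_back(0);
    }
    return AppendResult::kOk;
}
```

A null source entry therefore becomes a null row only when the destination can hold one; a required destination is left unchanged and the caller sees the error.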

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/nested_column_reader.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested Parquet scalar batches mixed two nullability models: nullable leaf values were stored in ColumnNullable while complex assemblers also interpreted definition levels to build element, value, and struct-child null maps. This allowed nullable source columns to leak into required child outputs and caused bad casts in complex type validation. The fix normalizes nested scalar batches so values_column always contains the non-nullable physical values, only max-definition-level slots receive value indices, and struct scalar children use the nullable-aware append helper to materialize child nulls from definition levels.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp be/src/format/new_parquet/reader/nested_column_reader.cpp be/src/format/new_parquet/reader/struct_column_reader.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Doris DataTypeArray creates nullable element columns by default, but Parquet LIST element nullability must follow the file schema. Required list elements were therefore materialized into ColumnNullable wrappers and later failed validation when callers expected the required element type such as ColumnInt32. This change removes the default nullable wrapper from list element data columns when the Parquet element reader type is non-nullable, including nested list elements and map LIST values.
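As a rough sketch of the idea, assuming a toy type descriptor rather than the Doris `IDataType` hierarchy: `ToyType`, `make_default_list`, and `follow_file_schema` are hypothetical names, and both type trees are assumed to have the same shape.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy type descriptor (hypothetical, not a Doris data type).
struct ToyType {
    std::string name;
    bool nullable = false;
    std::shared_ptr<ToyType> element;  // non-null only for list types
};

// Engine-style default: every list element is made nullable.
ToyType make_default_list(ToyType element) {
    element.nullable = true;
    return ToyType{"list", false, std::make_shared<ToyType>(std::move(element))};
}

// Copy element nullability from the file schema, level by level, so that
// required elements lose the default nullable wrapper (nested lists too).
void follow_file_schema(ToyType& engine_type, const ToyType& file_type) {
    assert(engine_type.name == file_type.name);
    if (!engine_type.element || !file_type.element) {
        return;
    }
    engine_type.element->nullable = file_type.element->nullable;
    follow_file_schema(*engine_type.element, *file_type.element);
}
```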

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/list_column_reader.cpp be/src/format/new_parquet/reader/map_column_reader.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

Gabriel39 added 3 commits June 3, 2026 14:53
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested Parquet reads can intentionally leave value slots at the tail of a batch when the levels cross a requested parent batch boundary. The assembler moves those tail levels into overflow for the next read or skip call. read_nested_leaf_batch incorrectly required every value written by Arrow RecordReader to be consumed immediately, so map LIST skip/read overflow cases failed with an extra values corruption error. This removes that eager check and lets overflow handling preserve the tail values.
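The boundary handling can be illustrated with a small sketch, assuming that a repetition level of 0 marks the first slot of a new parent row and that the decoded batch starts on a row boundary. `ToyLevelBatch` and `split_at_row_boundary` are hypothetical names, not the assembler's real interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyLevelBatch {
    std::vector<int16_t> consumed;  // levels assembled by this call
    std::vector<int16_t> overflow;  // tail levels kept for the next read/skip
};

// Split decoded repetition levels at the end of `rows_wanted` parent rows.
// Slots past that boundary are not an error; they are carried over.
ToyLevelBatch split_at_row_boundary(const std::vector<int16_t>& rep_levels,
                                    std::size_t rows_wanted) {
    std::size_t rows_started = 0;
    std::size_t i = 0;
    for (; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
            if (rows_started == rows_wanted) {
                break;  // this row belongs to a later call
            }
            ++rows_started;
        }
    }
    const auto cut = rep_levels.begin() + static_cast<std::ptrdiff_t>(i);
    ToyLevelBatch out;
    out.consumed.assign(rep_levels.begin(), cut);
    out.overflow.assign(cut, rep_levels.end());
    assert(out.consumed.size() + out.overflow.size() == rep_levels.size());
    return out;
}
```

Under these assumptions, a check that every decoded slot was consumed would fail whenever `overflow` is non-empty, which is the case the commit stops treating as corruption.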

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Nested nullable Parquet leaves can still produce a value slot for null element/value positions. read_nested_leaf_batch skipped value index assignment when the definition level was below the leaf max definition level, which shifted all values after null slots and produced default/empty values in complex list and map reads. This change keeps value slots aligned with Arrow RecordReader output for all slots that belong to the requested shape; null materialization remains controlled by definition levels in the nullable append helpers.
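A minimal sketch of the alignment rule, assuming the decoder writes one value slot for every position whose definition level is at least some threshold. `slot_def_level`, `kNoValue`, and `assign_value_indices` are hypothetical names chosen for the illustration, and the threshold is an assumption rather than a documented Arrow parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int32_t kNoValue = -1;

// For every level slot, return the index of its entry in the decoder's
// value buffer, or kNoValue when the slot has no entry there.
std::vector<int32_t> assign_value_indices(const std::vector<int16_t>& def_levels,
                                          int16_t slot_def_level) {
    assert(slot_def_level >= 0);
    std::vector<int32_t> indices;
    indices.reserve(def_levels.size());
    int32_t next = 0;
    for (int16_t def : def_levels) {
        // A null leaf (def below the leaf maximum but at or above the
        // threshold) still consumes an index, so later values do not shift.
        indices.push_back(def >= slot_def_level ? next++ : kNoValue);
    }
    return indices;
}
```

With definition levels `{3, 2, 3, 0, 3}` and a threshold of 2, the null leaf at position 1 takes index 1, so the value at position 2 reads index 2. Counting only the slots at level 3 would have given it index 1 and shifted every later value.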

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp
    - Focused BE UT was not rerun successfully in this sandbox because run-be-ut.sh dependency setup cannot lock .git/config and fallback curl cannot resolve github.com.
- Behavior changed: No
- Does this need documentation: No
@Gabriel39
Contributor Author

run buildall

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-H: Total hot run time: 29699 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 3a994810d44470a4277f395389e7d483c9c7e53d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17663	4056	4067	4056
q2	q3	11033	1438	819	819
q4	4766	492	345	345
q5	8645	902	588	588
q6	353	176	138	138
q7	897	850	654	654
q8	10890	1592	1687	1592
q9	6974	4544	4515	4515
q10	6838	1823	1559	1559
q11	442	270	256	256
q12	641	433	300	300
q13	18143	3389	2783	2783
q14	267	259	251	251
q15	q16	817	770	709	709
q17	987	966	967	966
q18	6870	5696	5619	5619
q19	1185	1328	1227	1227
q20	554	447	314	314
q21	5955	2901	2694	2694
q22	618	376	314	314
Total cold run time: 104538 ms
Total hot run time: 29699 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4841	4820	4840	4820
q2	q3	4903	5305	4671	4671
q4	2143	2205	1411	1411
q5	5018	4661	4692	4661
q6	225	188	134	134
q7	1950	1809	1603	1603
q8	2396	2118	2132	2118
q9	7827	7466	7449	7449
q10	4739	4697	4238	4238
q11	540	393	361	361
q12	757	741	524	524
q13	3090	3397	2781	2781
q14	277	282	250	250
q15	q16	686	692	628	628
q17	1288	1265	1259	1259
q18	7386	6967	6732	6732
q19	1111	1109	1083	1083
q20	2218	2217	1945	1945
q21	5409	5071	4501	4501
q22	545	457	399	399
Total cold run time: 57349 ms
Total hot run time: 51568 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168999 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3a994810d44470a4277f395389e7d483c9c7e53d, data reload: false

query5	4333	624	484	484
query6	466	204	181	181
query7	4890	576	303	303
query8	372	224	200	200
query9	8748	4087	4086	4086
query10	464	317	263	263
query11	5876	2336	2201	2201
query12	160	100	99	99
query13	1300	622	411	411
query14	6367	5387	5067	5067
query14_1	4391	4406	4348	4348
query15	206	195	182	182
query16	1003	455	452	452
query17	1097	689	577	577
query18	2464	468	345	345
query19	200	175	147	147
query20	111	105	104	104
query21	211	136	113	113
query22	13726	13638	13371	13371
query23	17489	16538	16143	16143
query23_1	16243	16358	16379	16358
query24	7873	1766	1313	1313
query24_1	1295	1308	1300	1300
query25	538	460	407	407
query26	1319	296	168	168
query27	2682	557	330	330
query28	4505	2053	2039	2039
query29	1072	597	474	474
query30	312	238	198	198
query31	1133	1091	947	947
query32	107	61	59	59
query33	534	310	260	260
query34	1212	1131	679	679
query35	765	794	675	675
query36	1402	1405	1249	1249
query37	159	106	95	95
query38	3229	3153	3068	3068
query39	935	928	910	910
query39_1	884	878	881	878
query40	229	129	106	106
query41	72	70	69	69
query42	101	98	98	98
query43	324	331	286	286
query44	
query45	200	199	180	180
query46	1112	1221	758	758
query47	2343	2340	2230	2230
query48	410	418	326	326
query49	650	485	377	377
query50	1082	370	273	273
query51	4348	4307	4338	4307
query52	91	92	79	79
query53	255	266	199	199
query54	292	234	224	224
query55	88	78	73	73
query56	252	254	243	243
query57	1415	1435	1354	1354
query58	265	231	224	224
query59	1607	1668	1454	1454
query60	294	263	249	249
query61	186	182	186	182
query62	697	669	595	595
query63	236	190	193	190
query64	2645	850	675	675
query65	
query66	1812	491	367	367
query67	29818	29212	29018	29018
query68	
query69	444	310	275	275
query70	968	977	937	937
query71	304	223	275	223
query72	3093	2637	2388	2388
query73	872	819	441	441
query74	5148	4952	4805	4805
query75	2652	2585	2229	2229
query76	2345	1164	760	760
query77	362	372	285	285
query78	12407	12523	11789	11789
query79	1483	1025	763	763
query80	1256	485	398	398
query81	524	285	248	248
query82	656	160	119	119
query83	319	308	249	249
query84	265	136	110	110
query85	933	530	446	446
query86	429	305	294	294
query87	3393	3385	3175	3175
query88	3641	2786	2758	2758
query89	438	379	330	330
query90	2041	182	187	182
query91	179	165	141	141
query92	64	57	56	56
query93	1525	1391	901	901
query94	721	357	305	305
query95	680	377	354	354
query96	1043	769	387	387
query97	2728	2722	2560	2560
query98	232	209	207	207
query99	1161	1163	1015	1015
Total cold run time: 253495 ms
Total hot run time: 168999 ms

@hello-stephen
Contributor

TPC-H: Total hot run time: 29366 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit a77557ce975ae5df26f04d3c363fb0d4cf06e87b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17703	4142	4105	4105
q2	q3	10929	1354	812	812
q4	4744	484	342	342
q5	8224	885	608	608
q6	311	169	139	139
q7	845	858	621	621
q8	10708	1631	1586	1586
q9	6884	4463	4457	4457
q10	6739	1830	1548	1548
q11	446	278	251	251
q12	647	435	299	299
q13	18154	3376	2780	2780
q14	267	258	247	247
q15	q16	822	772	708	708
q17	1380	1005	957	957
q18	6846	5658	5467	5467
q19	2136	1355	1249	1249
q20	558	422	286	286
q21	6107	2744	2584	2584
q22	466	369	320	320
Total cold run time: 104916 ms
Total hot run time: 29366 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4840	4713	5122	4713
q2	q3	4913	5237	4638	4638
q4	2088	2233	1386	1386
q5	4956	4744	4742	4742
q6	229	178	130	130
q7	1862	1739	1604	1604
q8	2344	1950	1903	1903
q9	7300	7343	7322	7322
q10	4741	4682	4203	4203
q11	530	383	358	358
q12	734	739	527	527
q13	2946	3393	2783	2783
q14	281	290	251	251
q15	q16	683	705	608	608
q17	1268	1238	1242	1238
q18	7208	6824	6873	6824
q19	1114	1124	1112	1112
q20	2222	2223	1954	1954
q21	5268	4560	4510	4510
q22	520	454	400	400
Total cold run time: 56047 ms
Total hot run time: 51206 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169411 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a77557ce975ae5df26f04d3c363fb0d4cf06e87b, data reload: false

query5	4336	638	475	475
query6	434	200	180	180
query7	4833	571	314	314
query8	367	217	203	203
query9	8750	4095	4099	4095
query10	458	311	264	264
query11	5948	2370	2159	2159
query12	156	104	98	98
query13	1320	602	438	438
query14	6421	5491	5167	5167
query14_1	4484	4500	4525	4500
query15	207	197	174	174
query16	1025	450	424	424
query17	1099	682	571	571
query18	2475	485	341	341
query19	202	181	133	133
query20	112	106	105	105
query21	229	147	120	120
query22	13573	13610	13449	13449
query23	17325	16427	16196	16196
query23_1	16259	16296	16195	16195
query24	7694	1763	1326	1326
query24_1	1305	1310	1306	1306
query25	552	463	399	399
query26	1337	321	172	172
query27	2647	586	346	346
query28	4471	2038	2022	2022
query29	1062	595	498	498
query30	346	222	203	203
query31	1123	1074	964	964
query32	116	63	57	57
query33	518	316	257	257
query34	1190	1147	656	656
query35	767	774	690	690
query36	1375	1430	1245	1245
query37	151	108	96	96
query38	3223	3146	3055	3055
query39	941	926	902	902
query39_1	872	874	874	874
query40	220	123	102	102
query41	66	65	64	64
query42	95	96	91	91
query43	321	324	287	287
query44	
query45	202	188	183	183
query46	1069	1245	754	754
query47	2314	2365	2292	2292
query48	414	425	316	316
query49	631	464	357	357
query50	1104	343	261	261
query51	4350	4298	4364	4298
query52	87	90	77	77
query53	254	276	193	193
query54	266	221	194	194
query55	80	76	72	72
query56	238	230	219	219
query57	1467	1420	1324	1324
query58	234	216	199	199
query59	1618	1651	1454	1454
query60	269	252	227	227
query61	159	152	158	152
query62	692	676	592	592
query63	235	180	187	180
query64	2633	859	684	684
query65	
query66	1834	455	341	341
query67	29805	29697	29525	29525
query68	
query69	427	304	265	265
query70	976	1000	966	966
query71	294	220	207	207
query72	3006	2724	2382	2382
query73	822	755	459	459
query74	5122	4987	4752	4752
query75	2660	2584	2239	2239
query76	2340	1174	774	774
query77	358	368	295	295
query78	12368	12218	11801	11801
query79	1253	1024	759	759
query80	536	508	385	385
query81	458	276	246	246
query82	245	155	123	123
query83	270	281	250	250
query84	255	148	113	113
query85	878	534	435	435
query86	335	298	279	279
query87	3361	3376	3171	3171
query88	3645	2758	2720	2720
query89	419	383	330	330
query90	2193	183	182	182
query91	175	168	144	144
query92	65	59	58	58
query93	1559	1449	902	902
query94	590	368	305	305
query95	703	392	443	392
query96	1100	773	344	344
query97	2696	2689	2563	2563
query98	223	202	207	202
query99	1147	1171	1027	1027
Total cold run time: 251288 ms
Total hot run time: 169411 ms
