@eldenmoon
Member

@eldenmoon eldenmoon commented Aug 8, 2025

This commit introduces a new mechanism to improve compaction
performance for tables with VARIANT columns by enabling vertical
compaction for frequently accessed subcolumns.

Key Changes:

  1. Vertical Compaction of Subcolumns: During compaction, the table schema is temporarily extended to treat "hot" subcolumns (frequently accessed paths) within the VARIANT column as separate, top-level columns. This allows them to be compacted more efficiently in their own column groups, improving overall performance. This behavior is controlled by the new configuration parameter enable_vertical_compact_variant_subcolumns, which is enabled by default.

  2. Schema Handling in Rowsets:

    • To prevent these temporary extracted subcolumns from being persisted permanently, a new flag, strip_variant_extracted_columns_in_rowset_meta, has been added to RowsetWriterContext. When enabled, the RowsetWriter removes these temporary columns from the schema before saving the rowset metadata.
    • This change ensures that the in-memory Rowset object uses the same schema as the one persisted in its RowsetMeta, preventing potential data inconsistencies.

  3. Sparse Column Merging on Read:

    • A new SparseColumnMergeIterator has been implemented. This iterator is responsible for merging data from the base sparse representation of the VARIANT with data from any extracted subcolumns during reads. This provides a complete and unified view of the VARIANT data to the upper-level query engine.

  4. Other Fixes and Improvements:

    • The DataTypeVariant constructor now correctly accepts the max_subcolumns_count parameter.
    • A new correctness check (check_path_stats) has been added to ensure that the final output rowset does not contain any temporary, extracted subcolumns after compaction.
    • Previously commented-out code related to subcolumn indexing and sparse column merging has been re-enabled.
    • The build_basic_info method in the compaction process has been refactored to return a Status for more robust error handling.
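
For readers unfamiliar with the flow, the key changes above can be sketched roughly as follows. This is a simplified illustration, not the Doris implementation: `Column`, `Schema`, `Row`, and every function name here are hypothetical stand-ins, the `variant.path` naming of extracted columns is an assumption, and so is the rule that an extracted value wins over the sparse one when both exist.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not the Doris classes.
struct Column {
    std::string name;
    bool is_extracted; // true for a temporary subcolumn lifted out of a VARIANT
};
using Schema = std::vector<Column>;
using Row = std::map<std::string, std::string>; // subcolumn path -> value

// Change 1: append each hot path as a temporary top-level column so that a
// vertical compaction can place it in its own column group.
Schema extend_with_hot_paths(Schema schema, const std::string& variant,
                             const std::vector<std::string>& hot_paths) {
    assert(!variant.empty()); // a VARIANT column name is required
    for (const auto& path : hot_paths) {
        // Assumed naming scheme: "<variant>.<path>".
        schema.push_back(Column{variant + "." + path, true});
    }
    return schema;
}

// Change 2: drop the temporary columns before the schema is persisted, so
// the stored schema matches the original table schema.
Schema strip_extracted(Schema schema) {
    schema.erase(std::remove_if(schema.begin(), schema.end(),
                                [](const Column& c) { return c.is_extracted; }),
                 schema.end());
    return schema;
}

// Change 4 (in the spirit of check_path_stats): report whether any temporary
// column is still present in a schema.
bool has_extracted(const Schema& schema) {
    return std::any_of(schema.begin(), schema.end(),
                       [](const Column& c) { return c.is_extracted; });
}

// Change 3: on read, combine the sparse remainder with the extracted
// subcolumns into one view of the VARIANT value.
Row merge_on_read(const Row& sparse, const Row& extracted) {
    Row merged = sparse;
    for (const auto& kv : extracted) {
        merged[kv.first] = kv.second; // assumption: extracted value wins
    }
    return merged;
}
```

In this sketch a compaction would call `extend_with_hot_paths` before writing, `strip_extracted` before saving metadata, and could use `has_extracted` as a final sanity check on the output schema.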

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Aug 8, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@eldenmoon
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33939 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9ea747f694e6cf653076d175089c5af9dea7b95c, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17621	5272	5299	5272
q2	1923	317	193	193
q3	10270	1430	723	723
q4	10233	1011	542	542
q5	7812	2399	2401	2399
q6	179	167	133	133
q7	891	778	609	609
q8	9321	1343	1095	1095
q9	6932	5080	5124	5080
q10	6920	2366	1976	1976
q11	459	277	266	266
q12	359	368	218	218
q13	17774	3444	2964	2964
q14	233	236	226	226
q15	551	467	466	466
q16	424	411	380	380
q17	588	829	351	351
q18	7500	7061	6982	6982
q19	2189	987	550	550
q20	329	307	217	217
q21	3365	3071	2297	2297
q22	1067	1062	1000	1000
Total cold run time: 106940 ms
Total hot run time: 33939 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5457	5309	5429	5309
q2	244	315	217	217
q3	2100	2569	2243	2243
q4	1317	1725	1325	1325
q5	4149	4469	4336	4336
q6	229	184	146	146
q7	1991	1923	1840	1840
q8	2574	2555	2624	2555
q9	7395	7329	7395	7329
q10	3109	3412	2901	2901
q11	544	498	498	498
q12	696	830	613	613
q13	3428	3741	3240	3240
q14	319	347	295	295
q15	519	455	444	444
q16	479	722	472	472
q17	1274	1545	1354	1354
q18	8044	7922	7673	7673
q19	11768	880	858	858
q20	1979	1909	1775	1775
q21	15049	4222	4337	4222
q22	1050	1028	976	976
Total cold run time: 73714 ms
Total hot run time: 50621 ms
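
The totals in these reports can be reproduced from the rows. The sketch below is inferred from the numbers, not taken from the benchmark scripts: it assumes the first value in each row is the cold run, the next two are hot runs, and the total hot time sums the faster hot run per query. `QueryTimes`, `total_cold`, `total_hot`, and `tpch_round1` are names made up for the example; the data is the Round 1 table above.

```cpp
#include <algorithm>
#include <vector>

// One result row in milliseconds: a cold run followed by two hot runs.
struct QueryTimes {
    long cold;
    long hot1;
    long hot2;
};

// Total cold run time: sum of the cold column.
long total_cold(const std::vector<QueryTimes>& rows) {
    long sum = 0;
    for (const auto& r : rows) sum += r.cold;
    return sum;
}

// Total hot run time: sum of the faster of the two hot runs per query.
long total_hot(const std::vector<QueryTimes>& rows) {
    long sum = 0;
    for (const auto& r : rows) sum += std::min(r.hot1, r.hot2);
    return sum;
}

// TPC-H Round 1 rows q1..q22, copied from the report above.
std::vector<QueryTimes> tpch_round1() {
    return {
        {17621, 5272, 5299}, {1923, 317, 193},    {10270, 1430, 723},
        {10233, 1011, 542},  {7812, 2399, 2401},  {179, 167, 133},
        {891, 778, 609},     {9321, 1343, 1095},  {6932, 5080, 5124},
        {6920, 2366, 1976},  {459, 277, 266},     {359, 368, 218},
        {17774, 3444, 2964}, {233, 236, 226},     {551, 467, 466},
        {424, 411, 380},     {588, 829, 351},     {7500, 7061, 6982},
        {2189, 987, 550},    {329, 307, 217},     {3365, 3071, 2297},
        {1067, 1062, 1000},
    };
}
```

With the Round 1 rows, `total_cold` gives 106940 and `total_hot` gives 33939, matching the reported totals.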

@doris-robot

TPC-DS: Total hot run time: 171380 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9ea747f694e6cf653076d175089c5af9dea7b95c, data reload: false

============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	989	408	456	408
query2	6514	1757	1789	1757
query3	6742	227	223	223
query4	27081	23493	22938	22938
query5	4384	632	523	523
query6	326	256	247	247
query7	4638	518	296	296
query8	300	234	214	214
query9	8596	2975	2917	2917
query10	492	386	303	303
query11	15569	14977	15090	14977
query12	186	138	135	135
query13	1655	564	427	427
query14	8644	5888	5890	5888
query15	221	193	178	178
query16	7144	669	480	480
query17	1215	763	632	632
query18	2075	473	333	333
query19	225	215	195	195
query20	147	147	141	141
query21	222	129	119	119
query22	4323	3902	3916	3902
query23	34678	34519	34392	34392
query24	5241	2419	2449	2419
query25	513	599	449	449
query26	716	291	158	158
query27	2241	494	354	354
query28	3098	2327	2324	2324
query29	646	600	490	490
query30	290	229	196	196
query31	883	805	699	699
query32	90	77	80	77
query33	516	433	361	361
query34	864	830	512	512
query35	796	854	743	743
query36	1014	1043	967	967
query37	133	112	96	96
query38	3908	4035	3881	3881
query39	1440	1402	1404	1402
query40	234	142	143	142
query41	61	58	55	55
query42	142	123	121	121
query43	528	514	487	487
query44	1436	876	881	876
query45	203	188	186	186
query46	950	1060	681	681
query47	1794	1819	1754	1754
query48	400	431	310	310
query49	679	519	443	443
query50	696	691	419	419
query51	4199	4140	4159	4140
query52	125	129	130	129
query53	257	287	214	214
query54	640	635	557	557
query55	93	92	92	92
query56	364	371	354	354
query57	1215	1248	1145	1145
query58	342	334	338	334
query59	2595	2626	2608	2608
query60	401	396	399	396
query61	129	123	121	121
query62	771	723	668	668
query63	251	217	215	215
query64	2346	1096	831	831
query65	4274	4160	4150	4150
query66	1010	452	375	375
query67	query68	16477	889	987	889
query69	1072	282	278	278
query70	1357	1096	1108	1096
query71	734	321	324	321
query72	9197	2300	2226	2226
query73	3404	675	359	359
query74	9177	8666	8802	8666
query75	7568	3130	2692	2692
query76	8823	1231	804	804
query77	1143	469	330	330
query78	9550	11153	9950	9950
query79	14938	663	572	572
query80	1821	562	497	497
query81	558	260	224	224
query82	509	156	119	119
query83	372	294	284	284
query84	292	96	79	79
query85	809	378	339	339
query86	372	301	298	298
query87	4362	4179	4190	4179
query88	5509	2242	2244	2242
query89	487	366	319	319
query90	2424	229	230	229
query91	155	139	112	112
query92	88	73	68	68
query93	6307	980	657	657
query94	1084	401	290	290
query95	418	333	319	319
query96	502	626	283	283
query97	2662	2749	2608	2608
query98	256	240	218	218
query99	1480	1455	1277	1277
Total cold run time: 298797 ms
Total hot run time: 171380 ms

@doris-robot

ClickBench: Total hot run time: 33.52 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 9ea747f694e6cf653076d175089c5af9dea7b95c, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.24	0.07	0.07
query4	1.62	0.11	0.11
query5	0.43	0.41	0.46
query6	1.18	0.69	0.67
query7	0.03	0.02	0.02
query8	0.05	0.05	0.04
query9	0.56	0.49	0.48
query10	0.54	0.54	0.52
query11	0.16	0.11	0.10
query12	0.15	0.11	0.11
query13	0.64	0.64	0.67
query14	0.90	1.13	1.11
query15	0.90	0.87	0.88
query16	0.39	0.39	0.38
query17	1.09	1.07	1.05
query18	0.22	0.20	0.20
query19	1.92	1.91	1.90
query20	0.02	0.02	0.02
query21	15.38	0.83	0.57
query22	0.79	1.24	0.72
query23	14.87	1.15	0.64
query24	6.47	0.79	1.20
query25	0.51	0.26	0.21
query26	0.62	0.15	0.13
query27	0.06	0.05	0.05
query28	10.34	0.83	0.45
query29	12.62	3.84	3.32
query30	3.09	2.98	2.94
query31	2.83	0.57	0.40
query32	3.25	0.57	0.49
query33	3.04	3.22	3.20
query34	16.20	5.25	5.02
query35	4.95	4.97	5.05
query36	0.70	0.52	0.50
query37	0.10	0.08	0.07
query38	0.06	0.05	0.04
query39	0.03	0.02	0.03
query40	0.17	0.14	0.13
query41	0.09	0.04	0.03
query42	0.04	0.03	0.02
query43	0.04	0.04	0.03
Total cold run time: 107.41 s
Total hot run time: 33.52 s

@eldenmoon eldenmoon closed this Aug 13, 2025