[opt](be) Optimize row store lookup to support batch row reading per segment#63434

Open
HappenLee wants to merge 1 commit into apache:master from HappenLee:new_cache

@HappenLee
Contributor

What problem does this PR solve?

Refactor lookup_row_data to accept multiple row IDs for the same segment and read them in a single batch into a ColumnString, instead of reading them one by one into per-row std::string buffers. This reduces the per-row overhead of column iterator calls and string copies in multi-row fetch scenarios (e.g. batch point queries and index lookups).
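
To make the change in call shape concrete, here is a minimal sketch of the two styles. It is not the Doris code: `ToyStringColumn`, `ToySegment`, `lookup_one`, and `lookup_batch` are hypothetical stand-ins, and the chars-plus-offsets layout is only a loose model of a string column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a string column: one contiguous byte buffer
// plus the end offset of every row (an assumption, not the real ColumnString).
struct ToyStringColumn {
    std::string chars;
    std::vector<uint32_t> offsets;  // offsets[i] = end of row i inside chars

    void insert(std::string_view value) {
        chars.append(value);
        offsets.push_back(static_cast<uint32_t>(chars.size()));
    }
    std::size_t size() const { return offsets.size(); }
    std::string_view at(std::size_t i) const {
        const uint32_t begin = (i == 0) ? 0 : offsets[i - 1];
        return std::string_view(chars).substr(begin, offsets[i] - begin);
    }
};

// Hypothetical segment: the vector index plays the role of the row id.
using ToySegment = std::vector<std::string>;

// Old shape: one call and one freshly allocated std::string per row.
std::string lookup_one(const ToySegment& segment, uint32_t row_id) {
    return segment.at(row_id);
}

// New shape: one call per segment, every row appended to a shared column.
void lookup_batch(const ToySegment& segment, const std::vector<uint32_t>& row_ids,
                  ToyStringColumn& out) {
    for (uint32_t row_id : row_ids) {
        out.insert(segment.at(row_id));
    }
}
```

In this toy model the batch variant performs one call per segment and grows a single buffer, while the per-row variant returns a separate string for every row id.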

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…segment

Refactor `lookup_row_data` to accept multiple row IDs for the same segment,
reading them in a single batch via `ColumnString` instead of one-by-one with
per-row `std::string` buffers. This reduces the per-row overhead of column
iterator calls and string copy for multi-row fetch scenarios (e.g. batch
point queries and index lookup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@HappenLee
Contributor Author

run buildall

@eldenmoon
Member

/review

Contributor

@github-actions github-actions Bot left a comment

I found one blocking correctness issue in the row-store batch lookup path. The PR goal is clear and focused, but the new batched call changes a per-row lookup into ColumnIterator::read_by_rowids() without preserving that iterator's ordering assumptions. Tests are also missing for unordered row ids, which is the scenario that exposes this regression.

Critical checkpoint conclusions:

  • Goal: optimize row-store lookup by batching row ids per segment, but the current implementation does not safely accomplish it for arbitrary request order.
  • Scope: small and focused.
  • No new concurrency, lifecycle, config, persistence, or FE/BE protocol compatibility concerns found.
  • No transaction/delete-bitmap changes.
  • Rowset/segment lifetimes follow existing shared_ptr/cache-handle patterns.
  • Error handling mostly follows existing style.
  • Observability is unchanged and sufficient for this refactor.
  • Performance intent is good, but it must not rely on unsorted row ids.
  • Test coverage is insufficient because no case covers non-monotonic row ids in the batch materialization path.

User focus points: no additional user-provided review focus.

RETURN_IF_ERROR(scope_timer_run(
        [&]() {
            return tablet->lookup_row_data({}, segment_id, row_ids, rowset, stats,
                                           *row_store_rows);
Contributor

This batched row-store read can return wrong rows when row_ids are not monotonically increasing. read_batch_doris_format_row() groups only contiguous equal file_ids from the request; it does not sort by row id. The request is built in result-row order (MaterializationSharedState::create_muiltget_result() appends row_location.row_id as rows arrive), so a batch for the same segment can be in arbitrary order such as [100, 1]. FileColumnIterator::read_by_rowids() seeks to rowids[total_read_count] and then advances within the current page, so it assumes the remaining row ids are ordered within/after that page; the old per-row loop did not have this assumption. Please either preserve the old one-by-one path for unordered input, or sort (row_id, original_index) for the iterator and restore the original output order before jsonb_to_block().
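
The second option (sort, read, then restore the request order) could look roughly like the sketch below. It is only an illustration under stated assumptions: `SortedReader` and `read_in_request_order` are hypothetical names, the reader callback stands in for an iterator that requires ascending row ids, and it is assumed to tolerate duplicate ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader: returns one row per id and requires ascending ids.
using SortedReader =
        std::function<std::vector<std::string>(const std::vector<uint32_t>&)>;

std::vector<std::string> read_in_request_order(const std::vector<uint32_t>& row_ids,
                                               const SortedReader& read_sorted) {
    // order[k] is the position in row_ids of the k-th smallest row id.
    std::vector<std::size_t> order(row_ids.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return row_ids[a] < row_ids[b];
    });

    // Hand the reader the ids in ascending order.
    std::vector<uint32_t> sorted_ids;
    sorted_ids.reserve(row_ids.size());
    for (std::size_t idx : order) {
        sorted_ids.push_back(row_ids[idx]);
    }
    std::vector<std::string> sorted_rows = read_sorted(sorted_ids);
    assert(sorted_rows.size() == sorted_ids.size());

    // Scatter each row back to the slot it had in the original request.
    std::vector<std::string> result(row_ids.size());
    for (std::size_t k = 0; k < order.size(); ++k) {
        result[order[k]] = std::move(sorted_rows[k]);
    }
    return result;
}
```

With a request of `[100, 1]`, the reader sees `[1, 100]` and the caller still gets the row for id 100 first, which is the property a regression test for unordered row ids would need to check.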

Member

@eldenmoon eldenmoon left a comment

LGTM

@github-actions (bot) added the `approved` label (indicates a PR has been approved by one committer) on May 20, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31506 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 30123fc7550fb55d0aaa2172a5bf51eacfe13e6b, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17801	4014	3941	3941
q2	2108	336	200	200
q3	10248	1427	849	849
q4	4687	482	363	363
q5	7679	2254	2100	2100
q6	348	179	148	148
q7	937	821	612	612
q8	9361	1743	1630	1630
q9	5400	4959	4977	4959
q10	6503	2101	1821	1821
q11	445	274	253	253
q12	689	440	310	310
q13	18182	3428	2814	2814
q14	269	262	241	241
q15	(no timings reported)
q16	781	776	705	705
q17	998	920	899	899
q18	7026	5897	5646	5646
q19	1256	1337	1021	1021
q20	505	396	273	273
q21	5743	2597	2416	2416
q22	424	360	305	305
Total cold run time: 101390 ms
Total hot run time: 31506 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4291	4158	4135	4135
q2	327	364	222	222
q3	4618	4932	4353	4353
q4	2137	2196	1440	1440
q5	4455	4321	4483	4321
q6	256	197	152	152
q7	2135	1857	1586	1586
q8	2751	2084	2116	2084
q9	7844	7828	7659	7659
q10	4669	4559	4141	4141
q11	610	401	372	372
q12	754	745	522	522
q13	3398	3784	3110	3110
q14	300	302	270	270
q15	(no timings reported)
q16	733	769	642	642
q17	1341	1325	1358	1325
q18	7936	7397	6955	6955
q19	1141	1093	1142	1093
q20	2216	2227	1941	1941
q21	5374	4714	4509	4509
q22	525	479	427	427
Total cold run time: 57811 ms
Total hot run time: 51259 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 176469 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 30123fc7550fb55d0aaa2172a5bf51eacfe13e6b, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4315	667	526	526
query6	343	228	214	214
query7	4236	563	312	312
query8	321	235	219	219
query9	8813	4078	4074	4074
query10	459	368	299	299
query11	5768	2491	2262	2262
query12	185	131	130	130
query13	1297	621	439	439
query14	6004	5432	5150	5150
query14_1	4487	4463	4466	4463
query15	214	212	192	192
query16	1062	464	431	431
query17	1156	756	649	649
query18	2753	504	369	369
query19	240	215	175	175
query20	142	138	137	137
query21	222	146	122	122
query22	13756	13716	13491	13491
query23	17118	16391	16055	16055
query23_1	16215	16279	16318	16279
query24	7589	1778	1322	1322
query24_1	1335	1344	1357	1344
query25	589	506	444	444
query26	1308	323	181	181
query27	2694	566	351	351
query28	4420	1981	1984	1981
query29	1091	646	508	508
query30	309	239	201	201
query31	1115	1099	939	939
query32	88	73	76	73
query33	554	365	295	295
query34	1163	1130	638	638
query35	758	788	677	677
query36	1346	1321	1185	1185
query37	145	104	93	93
query38	3205	3131	3075	3075
query39	933	931	908	908
query39_1	886	884	871	871
query40	225	148	126	126
query41	66	65	63	63
query42	109	117	112	112
query43	323	334	280	280
query44	1390	778	784	778
query45	212	201	194	194
query46	1049	1179	734	734
query47	2277	2318	2139	2139
query48	410	427	284	284
query49	624	505	388	388
query50	999	350	251	251
query51	4274	4339	4275	4275
query52	112	111	99	99
query53	272	286	209	209
query54	315	269	274	269
query55	98	92	91	91
query56	299	312	301	301
query57	1466	1390	1320	1320
query58	297	294	269	269
query59	1552	1626	1390	1390
query60	333	334	305	305
query61	163	160	160	160
query62	675	620	569	569
query63	246	215	219	215
query64	2356	806	644	644
query65	4880	4722	4719	4719
query66	1701	476	349	349
query67	30241	29508	30062	29508
query68	2361	1549	910	910
query69	455	344	304	304
query70	1130	1038	947	947
query71	303	292	277	277
query72	2949	2736	2443	2443
query73	860	776	418	418
query74	5053	4990	4763	4763
query75	2670	2661	2304	2304
query76	2287	1194	808	808
query77	407	421	337	337
query78	12204	12134	11817	11817
query79	1399	1104	730	730
query80	1254	537	474	474
query81	514	287	241	241
query82	1375	164	123	123
query83	350	285	253	253
query84	270	142	114	114
query85	931	551	454	454
query86	433	346	321	321
query87	3417	3348	3237	3237
query88	3519	2682	2683	2682
query89	440	387	358	358
query90	1793	193	191	191
query91	179	169	143	143
query92	83	80	74	74
query93	1487	1435	874	874
query94	621	372	349	349
query95	711	395	370	370
query96	955	757	358	358
query97	2713	2719	2579	2579
query98	250	237	231	231
query99	1088	1122	993	993
Total cold run time: 263095 ms
Total hot run time: 176469 ms

@hello-stephen
Contributor

ClickBench: Total hot run time: 24.64 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 30123fc7550fb55d0aaa2172a5bf51eacfe13e6b, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.01	0.01	0.01
query2	0.10	0.05	0.05
query3	0.26	0.14	0.14
query4	1.61	0.15	0.14
query5	0.25	0.23	0.22
query6	1.23	1.08	1.02
query7	0.05	0.01	0.01
query8	0.08	0.04	0.04
query9	0.37	0.31	0.32
query10	0.59	0.54	0.54
query11	0.20	0.14	0.15
query12	0.19	0.15	0.15
query13	0.49	0.47	0.46
query14	1.00	1.02	1.02
query15	0.61	0.61	0.60
query16	0.31	0.33	0.32
query17	1.07	1.08	1.13
query18	0.23	0.21	0.20
query19	2.07	1.91	1.98
query20	0.02	0.01	0.01
query21	15.47	0.20	0.15
query22	4.95	0.06	0.05
query23	16.10	0.31	0.12
query24	2.97	0.40	0.33
query25	0.12	0.05	0.05
query26	0.72	0.21	0.15
query27	0.04	0.04	0.04
query28	3.51	0.81	0.36
query29	12.48	4.33	3.46
query30	0.27	0.15	0.16
query31	2.77	0.60	0.33
query32	3.23	0.60	0.50
query33	3.13	3.28	3.21
query34	15.66	3.99	3.31
query35	3.26	3.25	3.29
query36	0.55	0.46	0.44
query37	0.10	0.07	0.06
query38	0.06	0.04	0.04
query39	0.04	0.02	0.03
query40	0.18	0.16	0.15
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.51 s
Total hot run time: 24.64 s

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 67.39% (31/46) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.65% (27941/37936)
Line Coverage 57.61% (303191/526302)
Region Coverage 54.82% (253982/463292)
Branch Coverage 56.36% (109780/194783)
