
[chore](regression-test) make test_point_query_ck nonConcurrent #63587

Merged
eldenmoon merged 1 commit into apache:master from eldenmoon:codex/point-query-row-cache-master
May 25, 2026

Conversation

@eldenmoon
Member

@eldenmoon eldenmoon commented May 25, 2026

test_point_query and test_point_query_ck should run serially, so they should be marked nonConcurrent.

Copilot AI review requested due to automatic review settings May 25, 2026 06:21
@eldenmoon
Member Author

run buildall

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

Copilot AI left a comment


Pull request overview

Fixes incorrect point-query row accounting when a row is served from the storage row cache, ensuring downstream logic (notably delete-sign filtering and result block padding) sees the correct “hit row” count and doesn’t return stale deleted rows.

Changes:

  • Increment _row_hits when a row-cache lookup hits in PointQueryExecutor::_lookup_row_key().
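
To illustrate why the hit count matters, here is a minimal C++ sketch. It assumes the delete-sign filter only runs when at least one row was counted as a hit; that gate is inferred from the overview above and is not shown in this thread. `ToyRow`, `build_result`, and `count_cache_hits` are hypothetical names for illustration, not Doris code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a fetched row; not a Doris type.
struct ToyRow {
    std::string value;
    bool delete_sign;  // true if the row was logically deleted
    bool from_cache;   // true if the row came from the row cache
};

// Builds the visible result. `count_cache_hits` toggles the behaviour the
// overview describes: whether a row-cache hit increments the hit counter.
std::vector<std::string> build_result(const std::vector<ToyRow>& fetched,
                                      bool count_cache_hits) {
    std::size_t row_hits = 0;
    for (const ToyRow& row : fetched) {
        if (!row.from_cache || count_cache_hits) {
            ++row_hits;
        }
    }

    // Assumed gate: delete-sign filtering is skipped when no hits were counted.
    const bool filter_deleted = row_hits > 0;

    std::vector<std::string> result;
    for (const ToyRow& row : fetched) {
        if (filter_deleted && row.delete_sign) {
            continue;  // drop logically deleted rows
        }
        result.push_back(row.value);
    }
    return result;
}

// Usage: a deleted row served only from the cache leaks unless hits are counted.
void usage_example() {
    const std::vector<ToyRow> fetched = {{"stale", true, true}};
    assert(build_result(fetched, false).size() == 1);  // stale row is returned
    assert(build_result(fetched, true).empty());       // stale row is filtered
}
```

In this toy model the only difference between the two calls is whether the cached row contributes to `row_hits`, which is the accounting change the overview describes.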


@eldenmoon
Member Author

/review

@eldenmoon eldenmoon changed the title from "[fix](be) Count row cache hits in point query results" to "[fix](short circuit) Count row cache hits in point query results" on May 25, 2026
Contributor

@github-actions github-actions Bot left a comment


I found one data-correctness issue in the row-cache hit path. The change fixes row-hit accounting for cached rows, but the cached-row path still bypasses column-store backfill when the row-store column does not contain all requested columns, so cache hits can return default values for real columns.

Critical checkpoint conclusions:

  • Goal/test: the PR targets point-query row-cache hit accounting for delete-sign filtering. The one-line change addresses full-row cached results, but misses partial row-store projections. The PR description reports manual/regression coverage; I did not run tests in this review.
  • Scope: the change is small, but the affected row-cache path has a parallel missing-column backfill path that also needs to be handled.
  • Concurrency/lifecycle: no new shared state or lifecycle hazard introduced by the line itself; existing cache handles remain RAII-managed.
  • Compatibility/configuration/persistence: no protocol, storage format, config, or EditLog change.
  • Data correctness: blocking issue below for partial row-store cached point queries returning defaults instead of stored column values.
  • Performance/observability: no new material concern beyond the correctness issue.

User focus: no additional user-provided review focus was supplied.

Comment thread on be/src/service/point_query_executor.cpp (outdated)
 if (hit_cache) {
     _row_read_ctxs[i]._cached_row_data = std::move(cache_handle);
     ++_profile_metrics.row_cache_hits;
+    _row_hits++;
Contributor


This cache-hit path still skips the column-store backfill needed when the row-store column is partial. RowCache stores only ROW_STORE_COL from BaseTablet::lookup_row_data() and is keyed by tablet id + primary key, not by the requested projection. For a table using row_store_columns, a first point query can populate the cache with only the row-store columns; a later point query for the same key that requests a column outside that row store hits here, increments _row_hits, and _lookup_row_data() deserializes the cached row then continues before the _reusable->missing_col_uids() column-store read. The padding block at the end then inserts defaults for the missing result columns, so the query can return incorrect values. Please either bypass/disable the row cache when _reusable->missing_col_uids() is non-empty, or still perform the row-location lookup on cache hits so the cached row can be supplemented from column store.
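
The first option suggested above (bypass the row cache whenever the request needs columns outside the row store) can be sketched in C++ as follows. This is a toy model under stated assumptions, not the Doris implementation: `ToyTablet`, `needs_backfill`, `project`, and `point_lookup` are hypothetical names, and plain maps stand in for the row cache and the column store.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using ColumnUid = int;
using Row = std::map<ColumnUid, std::string>;

// Hypothetical stand-in for a tablet; not a Doris type.
struct ToyTablet {
    std::set<ColumnUid> row_store_uids;       // columns kept in the row store
    std::map<std::string, Row> column_store;  // holds every column of every row
    std::map<std::string, Row> row_cache;     // caches row-store columns only
    int cache_hits = 0;
};

// Copies the listed columns from `src`, skipping any that are absent.
template <typename Uids>
Row project(const Row& src, const Uids& uids) {
    Row out;
    for (ColumnUid uid : uids) {
        auto it = src.find(uid);
        if (it != src.end()) out.emplace(uid, it->second);
    }
    return out;
}

// True if any requested column lives outside the row store.
bool needs_backfill(const ToyTablet& t, const std::vector<ColumnUid>& requested) {
    for (ColumnUid uid : requested) {
        if (t.row_store_uids.count(uid) == 0) return true;
    }
    return false;
}

// Returns false if the key does not exist; otherwise fills `*out`.
bool point_lookup(ToyTablet& t, const std::string& key,
                  const std::vector<ColumnUid>& requested, Row* out) {
    // Only trust the cache when it can satisfy the whole projection.
    if (!needs_backfill(t, requested)) {
        auto hit = t.row_cache.find(key);
        if (hit != t.row_cache.end()) {
            ++t.cache_hits;
            *out = project(hit->second, requested);
            return true;
        }
    }
    auto stored = t.column_store.find(key);
    if (stored == t.column_store.end()) return false;
    // Warm the cache with the row-store columns only, as the comment describes.
    t.row_cache[key] = project(stored->second, t.row_store_uids);
    *out = project(stored->second, requested);
    return true;
}

// Usage: warm the cache with column 1, then request column 2 as well.
void usage_example() {
    ToyTablet t;
    t.row_store_uids = {1};
    t.column_store["k"][1] = "a";
    t.column_store["k"][2] = "b";
    Row row;
    assert(point_lookup(t, "k", {1}, &row));     // populates the cache
    assert(point_lookup(t, "k", {1, 2}, &row));  // bypasses the cache
    assert(row.at(2) == "b");                    // real value, not a default
}
```

The trade-off in this variant is that projections outside the row store never benefit from the cache. The second option in the comment (keeping the cache hit and supplementing it from the column store) would preserve the hit at the cost of a row-location lookup.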

@eldenmoon eldenmoon force-pushed the codex/point-query-row-cache-master branch from 4b9923e to 52c40c9 Compare May 25, 2026 06:43
@eldenmoon
Member Author

/review

@eldenmoon
Member Author

run buildall

Contributor

@github-actions github-actions Bot left a comment


I completed the PR review and found no additional distinct issues beyond the already-open inline thread. That existing thread is still blocking: the row-cache hit path increments _row_hits and then bypasses the column-store backfill path, so point queries on tables using partial row_store_columns can return default values for requested columns that are not present in the cached row-store payload.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to count storage row-cache hits so delete-sign filtering runs for cached rows. The added regression covers the full row-store delete/compaction scenario, but it does not cover partial row-store projections where cache hits still need missing-column backfill.
  • Scope/focus: the code change is minimal, but the affected cache-hit path is not functionally complete because it skips an existing parallel path for _reusable->missing_col_uids().
  • Concurrency/lifecycle: no new shared mutable state, thread entry, lock-order, or cache-handle lifecycle issue was introduced by this one-line change; RowCache::CacheHandle remains RAII-managed.
  • Configuration/compatibility/persistence: no new config, protocol/storage format, FE-BE variable passing, or EditLog compatibility concern.
  • Data correctness: blocking issue remains in the existing inline comment for partial row-store cached point queries returning incorrect defaults.
  • Transaction/write correctness: no transaction or persistence path is directly changed, but the read-side visibility result can still be incorrect for the partial row-store cache-hit scenario.
  • Tests: regression coverage was added for delete-sign filtering after full-row storage row-cache hits. Missing coverage for partial row_store_columns with a cache warm-up followed by a projection outside the row store.
  • Performance/observability: no additional material issue found; no new observability appears necessary for this small fix.

User focus: no additional user-provided review focus was supplied.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31181 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4b9923e6d9cd3345dbb1a6d4182be250eee794f6, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17759	3962	3944	3944
q2	q3	10816	1353	825	825
q4	4680	469	359	359
q5	7574	2260	2106	2106
q6	256	175	136	136
q7	956	805	650	650
q8	9351	1676	1616	1616
q9	6726	4926	4961	4926
q10	6465	2234	1900	1900
q11	441	279	250	250
q12	700	423	296	296
q13	18213	3339	2744	2744
q14	270	254	242	242
q15	q16	816	778	708	708
q17	921	930	1012	930
q18	6899	5849	5612	5612
q19	1364	1315	1087	1087
q20	522	404	257	257
q21	5794	2624	2293	2293
q22	424	352	300	300
Total cold run time: 100947 ms
Total hot run time: 31181 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4357	4270	4236	4236
q2	q3	4581	4925	4377	4377
q4	2078	2238	1381	1381
q5	4422	4325	4704	4325
q6	260	206	147	147
q7	2030	1877	1766	1766
q8	2479	2145	2161	2145
q9	7972	8083	8184	8083
q10	4922	4894	4600	4600
q11	602	432	398	398
q12	761	781	561	561
q13	3357	3737	3126	3126
q14	310	356	286	286
q15	q16	748	762	696	696
q17	1439	1371	1354	1354
q18	7975	7216	7153	7153
q19	1101	1108	1101	1101
q20	2248	2248	1976	1976
q21	5282	4630	4482	4482
q22	547	483	421	421
Total cold run time: 57471 ms
Total hot run time: 52614 ms

test_point_query and test_point_query_ck should be serial with nonConcurrent
@hello-stephen
Contributor

TPC-H: Total hot run time: 31929 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 52c40c9e4d37547de56fd7cdf0b8359919d8a7ed, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17899	4126	4067	4067
q2	q3	10838	1425	846	846
q4	4682	482	341	341
q5	7542	2305	2083	2083
q6	233	180	138	138
q7	948	808	647	647
q8	9463	1754	1677	1677
q9	5233	5020	4968	4968
q10	6375	2215	1889	1889
q11	444	278	242	242
q12	636	431	305	305
q13	18105	3367	2831	2831
q14	263	257	235	235
q15	q16	816	800	706	706
q17	996	948	1006	948
q18	6839	5772	5512	5512
q19	1318	1328	1217	1217
q20	570	463	299	299
q21	6219	2924	2672	2672
q22	568	377	306	306
Total cold run time: 99987 ms
Total hot run time: 31929 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4773	4873	4948	4873
q2	q3	4974	5321	4683	4683
q4	2143	2211	1445	1445
q5	4975	4839	4690	4690
q6	233	184	130	130
q7	1876	1765	1561	1561
q8	2439	2124	2219	2124
q9	8005	7481	7492	7481
q10	4757	4673	4285	4285
q11	540	387	356	356
q12	745	752	544	544
q13	3017	3408	2839	2839
q14	277	292	248	248
q15	q16	696	703	611	611
q17	1322	1266	1262	1262
q18	7172	6899	6688	6688
q19	1099	1100	1106	1100
q20	2223	2242	1943	1943
q21	5380	4682	4587	4587
q22	527	458	401	401
Total cold run time: 57173 ms
Total hot run time: 51851 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 173331 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4b9923e6d9cd3345dbb1a6d4182be250eee794f6, data reload: false

query5	4317	670	538	538
query6	333	223	204	204
query7	4242	571	303	303
query8	323	235	226	226
query9	8829	4161	4158	4158
query10	494	351	302	302
query11	5879	2500	2256	2256
query12	187	129	129	129
query13	1269	634	462	462
query14	6186	5566	5272	5272
query14_1	4621	4601	4555	4555
query15	216	215	192	192
query16	996	467	451	451
query17	1160	744	626	626
query18	2497	509	381	381
query19	220	211	173	173
query20	143	136	144	136
query21	219	145	123	123
query22	13750	13673	13495	13495
query23	17489	16695	16272	16272
query23_1	16576	16579	16450	16450
query24	7512	1819	1335	1335
query24_1	1331	1351	1318	1318
query25	605	508	446	446
query26	1321	329	176	176
query27	2720	583	345	345
query28	4445	2010	2012	2010
query29	992	651	526	526
query30	315	243	205	205
query31	1132	1088	973	973
query32	97	78	74	74
query33	549	364	327	327
query34	1205	1165	671	671
query35	778	796	736	736
query36	1454	1432	1264	1264
query37	156	101	86	86
query38	3256	3193	3160	3160
query39	954	910	902	902
query39_1	869	900	900	900
query40	227	146	125	125
query41	66	64	62	62
query42	108	111	109	109
query43	346	355	303	303
query44	
query45	214	207	208	207
query46	1125	1176	734	734
query47	2353	2375	2275	2275
query48	417	423	299	299
query49	632	499	391	391
query50	998	346	250	250
query51	4372	4313	4277	4277
query52	105	111	95	95
query53	263	282	204	204
query54	324	276	253	253
query55	99	92	89	89
query56	311	307	300	300
query57	1442	1431	1326	1326
query58	308	283	278	278
query59	1641	1672	1456	1456
query60	320	318	310	310
query61	196	159	160	159
query62	701	657	588	588
query63	249	202	214	202
query64	2409	815	655	655
query65	
query66	1717	490	374	374
query67	30237	30154	30036	30036
query68	
query69	464	371	308	308
query70	1003	1048	1016	1016
query71	314	281	276	276
query72	3033	2743	2419	2419
query73	837	811	466	466
query74	5131	5049	4828	4828
query75	2697	2607	2292	2292
query76	2298	1160	801	801
query77	420	430	345	345
query78	12362	12699	11834	11834
query79	1509	1055	746	746
query80	768	561	447	447
query81	476	279	252	252
query82	1370	162	124	124
query83	323	280	249	249
query84	272	141	112	112
query85	938	545	487	487
query86	449	343	312	312
query87	3441	3438	3251	3251
query88	3719	2758	2739	2739
query89	450	391	345	345
query90	1779	188	189	188
query91	181	171	139	139
query92	79	80	75	75
query93	1474	1442	874	874
query94	624	353	316	316
query95	680	396	352	352
query96	1119	834	339	339
query97	2759	2746	2602	2602
query98	247	230	231	230
query99	1200	1154	1046	1046
Total cold run time: 255956 ms
Total hot run time: 173331 ms

@eldenmoon eldenmoon force-pushed the codex/point-query-row-cache-master branch from 52c40c9 to 50db82a Compare May 25, 2026 07:01
@eldenmoon eldenmoon changed the title from "[fix](short circuit) Count row cache hits in point query results" to "[chore](regression-test) make test_point_query_ck nonConcurrent" on May 25, 2026
@eldenmoon
Member Author

run buildall

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172699 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 52c40c9e4d37547de56fd7cdf0b8359919d8a7ed, data reload: false

query5	4331	642	528	528
query6	334	235	197	197
query7	4229	538	312	312
query8	321	237	241	237
query9	8830	4075	4073	4073
query10	459	346	305	305
query11	5802	2419	2231	2231
query12	183	131	127	127
query13	1361	599	459	459
query14	6222	5430	5173	5173
query14_1	4478	4487	4448	4448
query15	214	204	185	185
query16	1001	443	443	443
query17	1127	714	589	589
query18	2686	477	341	341
query19	211	203	161	161
query20	136	134	127	127
query21	213	133	115	115
query22	13658	13542	13493	13493
query23	17475	16569	16301	16301
query23_1	16409	16380	16417	16380
query24	7464	1811	1329	1329
query24_1	1319	1352	1327	1327
query25	583	508	442	442
query26	1310	322	183	183
query27	2723	574	349	349
query28	4441	2020	2019	2019
query29	988	641	523	523
query30	309	245	209	209
query31	1137	1079	977	977
query32	89	76	76	76
query33	558	370	307	307
query34	1186	1177	663	663
query35	785	792	698	698
query36	1458	1399	1258	1258
query37	164	107	95	95
query38	3221	3177	3112	3112
query39	938	940	920	920
query39_1	883	880	874	874
query40	238	155	130	130
query41	92	69	69	69
query42	112	110	111	110
query43	331	337	292	292
query44	
query45	215	209	199	199
query46	1122	1192	743	743
query47	2415	2448	2286	2286
query48	427	418	315	315
query49	657	519	402	402
query50	1018	357	264	264
query51	4355	4305	4289	4289
query52	105	107	95	95
query53	253	281	231	231
query54	341	292	265	265
query55	95	93	88	88
query56	311	319	320	319
query57	1453	1440	1385	1385
query58	313	284	275	275
query59	1567	1683	1460	1460
query60	341	337	331	331
query61	204	154	148	148
query62	703	687	581	581
query63	240	205	205	205
query64	2358	795	609	609
query65	
query66	1656	491	354	354
query67	30106	29946	29782	29782
query68	
query69	460	349	299	299
query70	1006	1025	960	960
query71	308	275	264	264
query72	2926	2942	2405	2405
query73	822	797	408	408
query74	5161	4980	4835	4835
query75	2692	2591	2256	2256
query76	2314	1151	802	802
query77	403	428	336	336
query78	12344	12447	11943	11943
query79	1459	1081	751	751
query80	884	535	451	451
query81	511	300	240	240
query82	1354	159	122	122
query83	351	300	250	250
query84	264	145	110	110
query85	908	550	497	497
query86	440	327	298	298
query87	3463	3383	3247	3247
query88	3656	2779	2730	2730
query89	452	398	346	346
query90	1830	190	182	182
query91	184	172	142	142
query92	78	81	74	74
query93	1524	1449	838	838
query94	625	352	303	303
query95	683	466	337	337
query96	1009	804	342	342
query97	2766	2791	2612	2612
query98	240	232	227	227
query99	1185	1146	1039	1039
Total cold run time: 255371 ms
Total hot run time: 172699 ms

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 25, 2026
@eldenmoon
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment


Review result: request changes.

Findings: one blocking test-scheduling issue was found. The existing point_query_executor row-cache comment is already-known review context and was not duplicated.

Critical checkpoint conclusions:

  • Goal/test proof: The PR aims to make point_query tests serial, but it only marks test_point_query_ck as nonConcurrent; the sibling test_point_query suite remains concurrent, so the stated scheduling goal is incomplete.
  • Scope/clarity: The change is small, but incomplete for the stated problem.
  • Concurrency: This is directly about regression concurrency. nonConcurrent suites run in the SINGLE phase after NORMAL suites, while test_point_query still runs as NORMAL and mutates global FE config, so it can still interfere with other normal suites.
  • Lifecycle/static initialization: Not applicable.
  • Configuration items: No new config is added; existing global FE config mutation remains a concurrency hazard in test_point_query.
  • Compatibility/storage/transaction/data correctness: Not applicable for this test-only PR.
  • Parallel code paths: test_point_query is the parallel sibling path and should be updated as well.
  • Special checks: Not applicable.
  • Test coverage/results: No new test output is involved; this is regression scheduler metadata only.
  • Observability/performance: Not applicable.

User focus: No additional user-provided review focus was specified.

 import java.math.BigDecimal;

-suite("test_point_query_ck") {
+suite("test_point_query_ck", "nonConcurrent") {
Contributor


This only moves test_point_query_ck into the nonConcurrent phase, but the sibling regression-test/suites/point_query_p0/test_point_query.groovy suite is still declared as suite("test_point_query"). That suite also mutates process-wide FE state (ADMIN SET FRONTEND CONFIG ("enable_lightweight_lookup_request" = ...)), and the regression framework runs NORMAL suites concurrently before SINGLE suites (RegressionTest.runScripts waits for NORMAL futures before running GroupExecType.SINGLE). So this PR prevents test_point_query_ck from overlapping with normal suites, but test_point_query can still overlap with other normal suites and cause the same class of global-config interference that the PR title says should be fixed. Please mark test_point_query as nonConcurrent too, or otherwise explain why only the CK variant needs serialization.
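
The two-phase ordering described above can be modelled with a short C++ sketch. It is a toy model, not the regression framework's code: `Suite`, `PhasePlan`, `plan_phases`, and `may_overlap` are hypothetical names, and `some_other_suite` is an invented placeholder. Only the `NORMAL` and `SINGLE` labels come from the comment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Phase labels as named in the review comment.
enum class ExecType { NORMAL, SINGLE };

// Hypothetical stand-in for a regression suite declaration.
struct Suite {
    std::string name;
    ExecType type;
};

struct PhasePlan {
    std::vector<std::string> concurrent;  // phase 1: may run side by side
    std::vector<std::string> serial;      // phase 2: one at a time, after phase 1
};

// Splits suites into the concurrent phase and the serial phase that follows it.
PhasePlan plan_phases(const std::vector<Suite>& suites) {
    PhasePlan plan;
    for (const Suite& s : suites) {
        if (s.type == ExecType::NORMAL) {
            plan.concurrent.push_back(s.name);
        } else {
            plan.serial.push_back(s.name);
        }
    }
    return plan;
}

// Two suites can interfere through global state only if both run in phase 1.
bool may_overlap(const Suite& a, const Suite& b) {
    return a.type == ExecType::NORMAL && b.type == ExecType::NORMAL;
}

// Usage: the state of the suites as this PR leaves them.
void usage_example() {
    const Suite point_query{"test_point_query", ExecType::NORMAL};
    const Suite point_query_ck{"test_point_query_ck", ExecType::SINGLE};
    const Suite other{"some_other_suite", ExecType::NORMAL};  // hypothetical
    assert(may_overlap(point_query, other));      // still exposed to interference
    assert(!may_overlap(point_query_ck, other));  // isolated by the serial phase
}
```

Under this model, marking `test_point_query` as `SINGLE` as well would make `may_overlap` false for it against every other suite, which is the change the comment requests.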

@eldenmoon eldenmoon merged commit 0c0dba3 into apache:master May 25, 2026
34 checks passed
github-actions Bot pushed a commit that referenced this pull request May 25, 2026
test_point_query and test_point_query_ck should be serial with
nonConcurrent
github-actions Bot pushed a commit that referenced this pull request May 25, 2026
test_point_query and test_point_query_ck should be serial with
nonConcurrent
morningman pushed a commit that referenced this pull request May 25, 2026
…urrent #63587 (#63623)

Cherry-picked from #63587

Co-authored-by: lihangyu <lihangyu@selectdb.com>

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.1.x, dev/4.0.6-merged, dev/4.1.x


5 participants