[fix](fe) set cloud version_cache_ttl to 0 temporarily if retry a query with -230 #63721

Open
mymeiyi wants to merge 1 commit into apache:master from mymeiyi:fix-fe-230
@mymeiyi
Contributor

@mymeiyi mymeiyi commented May 27, 2026

If a query fails with an E-230 error and cloud_partition_version_cache_ttl_ms is not already 0, this PR temporarily sets the session variable to 0 for the retry so that the newest version is fetched.
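
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RetrySketch`, `Session`, `QueryException`, `Query` and `executeWithRetry` are hypothetical names, and the sketch assumes that only E-230 failures are retried and that the original TTL is restored once the statement finishes.

```java
// Hypothetical stand-ins; only the error code and the session variable name
// come from the PR description.
final class RetrySketch {
    static final int E_230 = -230;

    static final class Session {
        // Mirrors cloud_partition_version_cache_ttl_ms.
        long cloudPartitionVersionCacheTtlMs;

        Session(long ttlMs) {
            this.cloudPartitionVersionCacheTtlMs = ttlMs;
        }
    }

    static final class QueryException extends RuntimeException {
        final int code;

        QueryException(int code) {
            super("query failed with code " + code);
            this.code = code;
        }
    }

    interface Query<T> {
        T run(Session session);
    }

    static <T> T executeWithRetry(Session session, Query<T> query, int maxRetries) {
        final long originalTtl = session.cloudPartitionVersionCacheTtlMs;
        try {
            for (int attempt = 0; ; attempt++) {
                try {
                    return query.run(session);
                } catch (QueryException e) {
                    // Assumption: anything other than E-230 is not retried here.
                    if (e.code != E_230 || attempt >= maxRetries) {
                        throw e;
                    }
                    // Bypass the version cache for the next attempt only.
                    if (session.cloudPartitionVersionCacheTtlMs != 0) {
                        session.cloudPartitionVersionCacheTtlMs = 0;
                    }
                }
            }
        } finally {
            // The user's configured TTL is visible again after the statement.
            session.cloudPartitionVersionCacheTtlMs = originalTtl;
        }
    }
}
```

With a session TTL of 5000 ms, a first attempt that fails with E-230 would be followed by a second attempt that sees a TTL of 0, and the session would read 5000 ms again afterwards.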

Copilot AI review requested due to automatic review settings May 27, 2026 06:21
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

Copilot AI left a comment

Pull request overview

This PR updates FE query retry behavior in cloud mode to temporarily disable cloud partition/table version caching when retrying after an E-230 error, and adds regression/unit tests to validate the behavior and session variable restoration.

Changes:

  • In StmtExecutor.queryRetry, detect E-230 retry conditions and set cloud_*_version_cache_ttl_ms to 0 for the next retry attempt (via setVarOnce), relying on existing per-execution session var reversion.
  • Add FE unit tests covering the E-230 detection logic and the temporary TTL override + revert path.
  • Extend the cloud regression test to validate that session TTLs remain unchanged after a retry sequence triggered by injected E-230.
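
The "set once, revert after execution" idea in the first bullet can be modeled with a small sketch. Everything here is hypothetical: `OneShotSessionVars` is an illustrative class, its `setVarOnce` is only named after the call mentioned above (the real signature is not shown in this thread), and `revert` stands in for the existing per-execution reversion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a one-shot session-variable override.
final class OneShotSessionVars {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, String> savedOriginals = new HashMap<>();

    OneShotSessionVars(Map<String, String> initial) {
        values.putAll(initial);
    }

    String get(String name) {
        return values.get(name);
    }

    // Overrides a variable for the current execution only.
    // Returns false when the variable is unknown, so callers can react.
    boolean setVarOnce(String name, String value) {
        if (!values.containsKey(name)) {
            return false;
        }
        // Remember the first original value even if the variable is overridden twice.
        savedOriginals.putIfAbsent(name, values.get(name));
        values.put(name, value);
        return true;
    }

    // Meant to be called from a finally block once the statement has finished.
    void revert() {
        values.putAll(savedOriginals);
        savedOriginals.clear();
    }
}
```

Keeping the saved originals separate from the live values is what lets the regression check in the third bullet hold: after the retry sequence, the session reads the same TTLs as before.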

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| regression-test/suites/cloud_p0/query_retry/test_retry_e-230.groovy | Adds a regression scenario verifying TTL values are restored after an E-230-triggered retry. |
| fe/fe-core/src/test/java/org/apache/doris/qe/StmtExecutorTest.java | Adds unit tests for E-230 detection and temporary TTL override + revert behavior. |
| fe/fe-core/src/main/java/org/apache/doris/qe/StmtExecutor.java | Implements cloud-mode E-230 retry hook to temporarily disable version caching for the next attempt. |

Comment thread regression-test/suites/cloud_p0/query_retry/test_retry_e-230.groovy
Comment thread fe/fe-core/src/main/java/org/apache/doris/qe/StmtExecutor.java
@mymeiyi
Contributor Author

mymeiyi commented May 27, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Review summary:

No additional blocking issues found in the PR diff beyond the existing inline threads about the unconditional TTL log message and the typoed test future variable names.

Critical checkpoint conclusions:

  • Goal and proof: The PR targets E-230 retry failures caused by stale cloud version caches by temporarily forcing table/partition version TTLs to 0 for the next retry. The added unit tests cover the decision and revert path, and the cloud regression test covers retry success with restored session TTLs.
  • Scope and clarity: The production change is small and localized to StmtExecutor retry handling.
  • Concurrency: No new shared cross-session mutable state is introduced. The session variable changes are scoped to the current ConnectContext and restored through the existing execute() finally path.
  • Lifecycle: No new static initialization or object lifecycle concerns found. Config.cloud_unique_id changes in the unit test are restored in finally.
  • Configuration: No new config/session variables are added; existing cloud TTL session variables are temporarily overridden through existing SET_VAR-style machinery.
  • Compatibility: No FE-BE protocol, storage format, or persistence compatibility change found.
  • Parallel paths: The change applies to StmtExecutor query retry; async MTMV retry has separate existing coverage and was not modified here.
  • Conditional checks: The E-230/cloud-mode/TTL condition is consistent with the target failure mode.
  • Tests and results: Added tests are relevant. Existing review already covers a minor test readability typo.
  • Observability: Existing review already covers the misleading INFO log when setVarOnce fails; no additional observability blocker found.
  • Transactions/persistence/data writes: No new persistence path is introduced. For query retry, successful execution observes freshly fetched cloud versions while original session TTLs are restored afterward.
  • Performance: The extra work is only on retry after E-230 and consists of two session variable updates before replanning, so no hot-path performance issue found.
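
The observability point above (an INFO message that is misleading when the override fails) could be addressed along these lines. This is only a sketch under assumptions: `RetryTtlLogging` and `disableCaches` are hypothetical, the `setter` predicate stands in for whatever applies the override and reports success, and log lines are collected in a list instead of going to a real logger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper: report success only when the override was actually applied.
final class RetryTtlLogging {
    static List<String> disableCaches(List<String> ttlVars,
                                      BiPredicate<String, String> setter) {
        List<String> log = new ArrayList<>();
        for (String name : ttlVars) {
            if (setter.test(name, "0")) {
                log.add("INFO set " + name + "=0 for retry");
            } else {
                // A failed override is reported as such, not as a success.
                log.add("WARN failed to set " + name + "=0 for retry");
            }
        }
        return log;
    }
}
```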

User focus: No additional user-provided review focus was supplied.

@mymeiyi
Contributor Author

mymeiyi commented May 27, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

I found one correctness issue in the retry path. The change handles normal session TTLs, but SQL-level SET_VAR hints can still overwrite the temporary zero before planning reads cloud visible versions.

Critical checkpoint conclusions:

  • Goal/test: the PR targets E-230 retries with stale cloud version cache and adds unit/regression coverage, but the SET_VAR hint path is not covered and can bypass the fix.
  • Scope: the implementation is small and focused.
  • Concurrency/lifecycle: no new shared mutable state, locks, or special lifecycle concerns found.
  • Config/compatibility: no new configuration or incompatible storage/protocol change.
  • Parallel paths: normal session variables are covered, but the SQL-level SET_VAR path is a parallel session-variable mutation path and needs handling.
  • Tests: added tests cover plain session TTL restore; they do not cover SET_VAR overriding the retry TTL.
  • Observability/performance/transactions: no additional blocking issue found beyond the existing known INFO-log review thread.
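
The SET_VAR concern comes down to the order in which overrides are applied. The sketch below is a toy model, not Doris code: `HintOrderingSketch` and `effectiveVars` are hypothetical, and applying the retry override after the hint is just one possible remedy, not necessarily the one this PR adopts.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of which value the planner would see for a session variable.
final class HintOrderingSketch {
    static final String TTL = "cloud_partition_version_cache_ttl_ms";

    static Map<String, String> effectiveVars(Map<String, String> session,
                                             Map<String, String> hintVars,
                                             Map<String, String> retryOverrides,
                                             boolean retryOverridesLast) {
        Map<String, String> effective = new HashMap<>(session);
        if (retryOverridesLast) {
            // Hint first, retry override second: the temporary zero survives.
            effective.putAll(hintVars);
            effective.putAll(retryOverrides);
        } else {
            // Retry override first, hint second: the hint overwrites the zero.
            effective.putAll(retryOverrides);
            effective.putAll(hintVars);
        }
        return effective;
    }
}
```

With a session TTL of 5000, a hint that sets 3000 and a retry override of 0, the second ordering leaves 3000 in effect, which is the bypass described in the review; the first ordering leaves 0.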

User focus: no additional user-provided review focus was specified.

Comment thread fe/fe-core/src/main/java/org/apache/doris/qe/StmtExecutor.java
@mymeiyi
Contributor Author

mymeiyi commented May 27, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31011 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8d5d55921b0b61c67bbfb77a3f8dd0ac6dc417e3, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17688	3969	3933	3933
q2
q3	10741	1371	788	788
q4	4685	468	342	342
q5	7632	2268	2087	2087
q6	241	173	134	134
q7	956	810	636	636
q8	9355	1724	1527	1527
q9	6644	4950	4900	4900
q10	6448	2249	1896	1896
q11	429	269	242	242
q12	697	414	295	295
q13	18266	3352	2780	2780
q14	267	259	237	237
q15
q16	819	774	707	707
q17	976	962	959	959
q18	6964	5861	5565	5565
q19	1286	1276	1011	1011
q20	498	394	249	249
q21	5603	2455	2416	2416
q22	433	360	307	307
Total cold run time: 100628 ms
Total hot run time: 31011 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4346	4263	4250	4250
q2
q3	4528	4936	4347	4347
q4	2096	2193	1389	1389
q5	4413	4312	4330	4312
q6	225	288	197	197
q7	2035	1942	1685	1685
q8	2439	2165	2117	2117
q9	7970	7947	7941	7941
q10	4821	4772	4479	4479
q11	573	400	376	376
q12	735	754	521	521
q13	3245	3698	2932	2932
q14	300	291	295	291
q15
q16	716	742	649	649
q17	1346	1327	1314	1314
q18	7938	7409	6808	6808
q19	1128	1142	1159	1142
q20	2205	2225	2001	2001
q21	5298	4537	4419	4419
q22	533	454	402	402
Total cold run time: 56890 ms
Total hot run time: 51572 ms
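
The report does not label its columns. The numbers are consistent with the four values per query being one cold run, two hot runs and the smaller of the two hot runs, with each total being a column sum: summing the first and last columns of Round 1 gives the reported 100628 ms and 31011 ms. The sketch below checks that reading on three rows copied from Round 1; `TpchTotals` and `totals` are hypothetical names, and the column interpretation is an inference rather than something the report states.

```java
// Hypothetical helper that recomputes totals from report rows of the form
// "<query> <cold> <hot1> <hot2> <best hot>".
final class TpchTotals {
    // Returns {coldTotal, hotTotal}.
    static long[] totals(String[] rows) {
        long cold = 0;
        long hot = 0;
        for (String row : rows) {
            String[] f = row.trim().split("\\s+");
            if (f.length < 5) {
                continue; // rows without all four timings are skipped
            }
            cold += Long.parseLong(f[1]);
            // Assumed meaning of the last column: min of the two hot runs.
            hot += Math.min(Long.parseLong(f[2]), Long.parseLong(f[3]));
        }
        return new long[] {cold, hot};
    }
}
```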

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171852 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8d5d55921b0b61c67bbfb77a3f8dd0ac6dc417e3, data reload: false

query5	4355	674	524	524
query6	321	234	204	204
query7	4254	559	307	307
query8	326	238	226	226
query9	8804	4010	3991	3991
query10	440	361	296	296
query11	5786	2578	2278	2278
query12	184	129	130	129
query13	1282	621	446	446
query14	6168	5452	5163	5163
query14_1	4486	4486	4482	4482
query15	211	206	185	185
query16	1014	455	430	430
query17	1134	722	576	576
query18	2658	481	372	372
query19	215	201	162	162
query20	138	133	130	130
query21	217	136	112	112
query22	13619	13637	13342	13342
query23	17150	16607	16250	16250
query23_1	16261	16364	16241	16241
query24	7401	1775	1265	1265
query24_1	1340	1319	1285	1285
query25	528	461	419	419
query26	837	327	178	178
query27	2710	603	333	333
query28	4436	1952	1935	1935
query29	926	640	505	505
query30	302	236	202	202
query31	1130	1065	944	944
query32	99	79	76	76
query33	582	365	313	313
query34	1174	1146	643	643
query35	778	804	715	715
query36	1414	1395	1273	1273
query37	150	114	91	91
query38	3241	3170	3163	3163
query39	952	924	898	898
query39_1	897	893	881	881
query40	229	151	127	127
query41	71	72	68	68
query42	109	112	110	110
query43	331	330	302	302
query44	
query45	216	208	203	203
query46	1118	1215	782	782
query47	2426	2457	2261	2261
query48	433	418	318	318
query49	627	501	410	410
query50	1034	351	255	255
query51	4368	4463	4282	4282
query52	108	106	95	95
query53	260	279	202	202
query54	323	288	263	263
query55	96	98	88	88
query56	319	315	309	309
query57	1452	1439	1342	1342
query58	307	283	278	278
query59	1602	1700	1469	1469
query60	331	342	316	316
query61	185	181	181	181
query62	706	657	567	567
query63	245	208	238	208
query64	1932	833	679	679
query65	
query66	1621	499	382	382
query67	29759	29601	29627	29601
query68	
query69	460	336	303	303
query70	1043	1018	984	984
query71	318	277	272	272
query72	2924	2667	2354	2354
query73	835	764	453	453
query74	5135	4966	4772	4772
query75	2698	2603	2268	2268
query76	2320	1130	771	771
query77	400	387	334	334
query78	12436	12636	11939	11939
query79	1599	1078	747	747
query80	1280	545	460	460
query81	533	283	238	238
query82	956	163	158	158
query83	317	271	242	242
query84	261	145	110	110
query85	915	526	445	445
query86	460	363	293	293
query87	3431	3376	3229	3229
query88	3628	2721	2736	2721
query89	450	387	354	354
query90	1883	183	171	171
query91	177	166	137	137
query92	83	78	72	72
query93	1652	1404	947	947
query94	720	360	312	312
query95	686	490	334	334
query96	1042	782	363	363
query97	2762	2744	2648	2648
query98	238	231	224	224
query99	1182	1148	1044	1044
Total cold run time: 253967 ms
Total hot run time: 171852 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 1.64% (1/61) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 50.00% (12/24) 🎉
Increment coverage report
Complete coverage report

Collaborator

@deardeng deardeng left a comment

LGTM

@github-actions github-actions Bot added the approved label (Indicates a PR has been approved by one committer.) May 28, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

4 participants