[fix](fe) fix host not match if start fe in metadata_failure_recovery by mymeiyi · Pull Request #62748 · apache/doris

mymeiyi · 2026-04-23T07:38:05Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When an FE starts in metadata_failure_recovery mode on a host different from the one recorded in its metadata, CloudClusterChecker drops that FE. BDBJE is then left with no FE, and the FE cannot start normally:

2026-04-23 11:37:15,024 INFO (cloud cluster check|82) [Env.dropFrontendFromBDBJE():3515] remove frontend: name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:37:15,026 INFO (cloud cluster check|82) [CloudSystemInfoService.updateFrontends():442] dropped cloud frontend=name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false

2026-04-23 11:39:01,373 INFO (mysql-nio-pool-3|491) [BDBEnvironment.getReplicationGroupAdmin():237] addresses is empty
2026-04-23 11:39:01,374 WARN (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():105] failed to get leader: Cannot invoke "com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName()" because "replicationGroupAdmin" is null
2026-04-23 11:39:01,374 INFO (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():124] bdbje fes [], env fes []
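The sequence in these logs can be modeled with a small sketch. It assumes, for illustration only, that the checker compares the frontends it expects against the frontends in local metadata by host and port, and drops any local entry with no match. The `Frontend` class, the `diff_frontends` function, and the `10.0.0.2` address are hypothetical and are not Doris APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frontend:
    """Hypothetical stand-in for an FE entry; not a Doris class."""
    name: str
    host: str
    port: int


def diff_frontends(expected, local):
    """Return (to_add, to_drop), matching entries by (host, port) only."""
    expected_keys = {(fe.host, fe.port) for fe in expected}
    local_keys = {(fe.host, fe.port) for fe in local}
    to_add = [fe for fe in expected if (fe.host, fe.port) not in local_keys]
    to_drop = [fe for fe in local if (fe.host, fe.port) not in expected_keys]
    return to_add, to_drop


NODE_NAME = "fe_83d061f4_31b3_43ee_9764_5506795e0bfe"

# Restored metadata still carries the host recorded before recovery.
restored = [Frontend(NODE_NAME, "183.70.1.1", 9010)]
# The same node now runs on a different host (address chosen for illustration).
current = [Frontend(NODE_NAME, "10.0.0.2", 9010)]

to_add, to_drop = diff_frontends(current, restored)
remaining = [fe for fe in restored if fe not in to_drop]

print("dropped:", [fe.host for fe in to_drop])
print("remaining after drop:", remaining)
```

Under this assumption the only local entry is dropped because its host is stale, even though its node name is unchanged, which would leave an empty frontend list like the one in the last log line.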

This PR also modifies the regression framework to support starting an FE with restore_snapshot.

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2026-04-23T07:38:13Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Copilot

Pull request overview

Fixes cloud FE startup in metadata_failure_recovery when restored metadata contains a stale FE host/IP, which previously caused CloudClusterChecker to drop the only FE and leave the BDBJE group empty. Also extends the docker-compose runtime and regression framework to better support restore/snapshot recovery workflows.

Changes:

FE: In cloud recovery mode, locate the FE entry by nodeName and persist an updated host to match the current node before cloud cluster checking.
Regression framework: Add SuiteCluster start/stop helpers for meta services and recyclers.
Docker-compose runtime: Auto-detect and run a restore script, start FE with recovery flags, update default FDB version, and adjust fdb monitor config.
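The FE-side change in the first item above can be sketched as follows. This is a simplified model, not the code in `Env.java`: `Frontend`, `fix_self_host`, and the `journal` list are hypothetical names, and appending to the list merely stands in for persisting the modified entry.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frontend:
    """Hypothetical stand-in for an FE entry; not a Doris class."""
    name: str
    host: str
    port: int


def fix_self_host(frontends, self_name, current_host, journal):
    """Rewrite the host of the entry named self_name if it is stale.

    frontends maps node name -> Frontend. The change is appended to
    journal so that a replay would observe the same update.
    Returns True if an entry was modified.
    """
    fe = frontends.get(self_name)
    if fe is None or fe.host == current_host:
        return False  # nothing to fix
    updated = replace(fe, host=current_host)
    frontends[self_name] = updated
    journal.append(("modify_frontend", updated))
    return True


NODE_NAME = "fe_83d061f4_31b3_43ee_9764_5506795e0bfe"
frontends = {NODE_NAME: Frontend(NODE_NAME, "183.70.1.1", 9010)}
journal = []

# The current host is an illustrative address, not taken from the PR.
changed = fix_self_host(frontends, NODE_NAME, "10.0.0.2", journal)
print(changed, frontends[NODE_NAME].host, len(journal))
```

Looking the entry up by node name rather than by host is what lets the update succeed when the host is the very field that changed. In this model the call has to run before any host-based comparison of frontends.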

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| regression-test/framework/src/main/groovy/org/apache/doris/regression/suite/SuiteCluster.groovy | Adds start/stop helpers for meta services and recyclers in the regression cluster wrapper. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Updates FE host in recovered metadata (cloud mode) to prevent `CloudClusterChecker` from dropping self during recovery. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes `restore_snapshot.sh` when present and starts FE with `--metadata_failure_recovery` and `--recovery_journal_id`. |
| docker/runtime/doris-compose/resource/fdb.conf | Adds a `backup_agent.1` section alongside the existing backup agent config. |
| docker/runtime/doris-compose/command.py | Bumps default FoundationDB image version for cloud clusters. |
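The `init_fe.sh` behavior summarized above can be modeled roughly as below. This is a sketch under assumptions, not a translation of the script: the function names are hypothetical, the recovery flags are assumed to be passed only after a restore ran, and the way `--recovery_journal_id` receives its value (a separate argument here) is a guess.

```python
import subprocess
from pathlib import Path


def run_restore_if_present(script: Path, run=subprocess.run) -> bool:
    """Run the restore script if it exists; return True if it ran.

    With check=True a non-zero exit raises CalledProcessError, so FE
    startup does not continue on top of a failed restore.
    """
    if not script.is_file():
        return False
    run(["bash", str(script)], check=True)
    return True


def fe_start_args(restored: bool, journal_id=None):
    """Build an illustrative FE start command line."""
    args = ["start_fe.sh"]
    if restored:
        args.append("--metadata_failure_recovery")
        if journal_id is not None:
            # Argument form is an assumption; the script may differ.
            args += ["--recovery_journal_id", str(journal_id)]
    return args


print(fe_start_args(False))
print(fe_start_args(True, journal_id=12345))
```

The `run` parameter exists only so the sketch can be exercised without executing a real script; the journal id `12345` is a placeholder.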

github-actions · 2026-04-24T06:30:47Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-04-24T06:30:49Z

PR approved by anyone and no changes requested.

mymeiyi · 2026-04-24T06:32:44Z

/review

github-actions · 2026-04-24T08:28:53Z

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24875837124

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

mymeiyi · 2026-04-24T08:49:30Z

run buildall

hello-stephen · 2026-04-24T11:20:37Z

FE Regression Coverage Report

Increment line coverage 0.00% (0/67) 🎉
Increment coverage report
Complete coverage report

mymeiyi · 2026-05-07T06:17:35Z

run buildall

github-actions · 2026-05-07T08:19:46Z

PR approved by at least one committer and no changes requested.

mymeiyi · 2026-05-07T11:14:57Z

/review

github-actions

I found one blocking reliability issue in the FDB backup-agent credential setup. Existing review threads already cover the recovery-script exit handling in init_fe.sh and the SuiteCluster naming/comment concerns, so I did not duplicate those.

Critical checkpoint conclusions:

Goal/test: The PR aims to allow cloud FE metadata recovery after host changes and add compose support for restore workflows. The FE-side recovery host update matches that goal, but the FDB credential setup is incomplete for already-initialized compose clusters. No PR test evidence is provided in the checklist.
Scope: Most changes are focused, but the FDB backup-agent config change needs the credential file lifecycle fixed.
Concurrency/lifecycle: The FE recovery host update runs during master transition before CloudClusterChecker starts, so no new obvious runtime concurrency issue was found there. The FDB credential creation has a lifecycle bug across container restarts/upgrades.
Config compatibility: Changing enable_check_fe_drop_in_safe_time to mutable looks safe for runtime toggling. The FDB default version/config change needs compatibility with existing initialized compose volumes.
Persistence/transactionality: FE logs the modified frontend host after replay; I did not find an additional persistence gap in that path.
Parallel code paths: CloudClusterChecker self-node diff behavior appears addressed by updating the in-memory FE host before the checker runs.
Tests: No focused regression/manual test evidence was included for recovery with an already-initialized FDB container.
Observability/performance: Existing recovery logs are sufficient for the reviewed paths; no material performance concern found.

User focus: No additional user-provided review focus was specified.

mymeiyi · 2026-05-11T03:00:51Z

run buildall

github-actions · 2026-05-11T03:16:41Z

PR approved by at least one committer and no changes requested.

hello-stephen · 2026-05-11T04:16:36Z

TPC-H: Total hot run time: 29381 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 90317a6c13e401bcf8ca372725641da54b2976a9, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17694	3818	3900	3818
q2	q3	10694	858	605	605
q4	4675	454	342	342
q5	7436	1337	1150	1150
q6	193	166	137	137
q7	898	959	758	758
q8	9596	1354	1258	1258
q9	6246	5412	5309	5309
q10	6301	2090	1841	1841
q11	484	272	254	254
q12	681	422	291	291
q13	18204	3307	2750	2750
q14	295	284	261	261
q15	q16	903	864	787	787
q17	1011	1077	647	647
q18	6411	5661	5587	5587
q19	1280	1186	1086	1086
q20	517	385	257	257
q21	4718	2302	1913	1913
q22	453	390	330	330
Total cold run time: 98690 ms
Total hot run time: 29381 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4668	4554	4595	4554
q2	q3	4698	4773	4242	4242
q4	2215	2225	1412	1412
q5	5016	5029	5252	5029
q6	191	167	133	133
q7	2090	1794	1640	1640
q8	3333	3101	3119	3101
q9	8513	8440	8449	8440
q10	4552	4516	4238	4238
q11	631	419	405	405
q12	693	745	537	537
q13	3298	3600	2907	2907
q14	334	316	275	275
q15	q16	797	808	708	708
q17	1333	1299	1251	1251
q18	7884	7127	7113	7113
q19	1167	1174	1171	1171
q20	2238	2210	1938	1938
q21	6629	5387	4812	4812
q22	515	476	398	398
Total cold run time: 60795 ms
Total hot run time: 54304 ms
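The totals in this report can be reproduced from the per-query rows. The column meanings are not stated in the comment; the sketch below assumes the first number on each row is the cold run, the next two are hot runs, and the last is the smaller of the two hot runs, which is consistent with the Round 1 figures. The `summarize` helper is illustrative and is not part of tpch-tools.

```python
# Round 1 rows copied from the report above (whitespace-separated).
ROUND_1 = """\
q1 17694 3818 3900 3818
q2 q3 10694 858 605 605
q4 4675 454 342 342
q5 7436 1337 1150 1150
q6 193 166 137 137
q7 898 959 758 758
q8 9596 1354 1258 1258
q9 6246 5412 5309 5309
q10 6301 2090 1841 1841
q11 484 272 254 254
q12 681 422 291 291
q13 18204 3307 2750 2750
q14 295 284 261 261
q15 q16 903 864 787 787
q17 1011 1077 647 647
q18 6411 5661 5587 5587
q19 1280 1186 1086 1086
q20 517 385 257 257
q21 4718 2302 1913 1913
q22 453 390 330 330
"""


def summarize(report: str):
    """Return (cold_total, hot_total) in ms for a block of result rows."""
    cold_total = 0
    hot_total = 0
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # row without a full set of timings
        # Rows such as "q2 q3 ..." carry two labels, so read from the end.
        cold, hot_a, hot_b, best = (int(x) for x in fields[-4:])
        if best != min(hot_a, hot_b):
            raise ValueError(f"unexpected last column in row: {line!r}")
        cold_total += cold
        hot_total += best
    return cold_total, hot_total


print(summarize(ROUND_1))  # prints (98690, 29381)
```

Both sums match the "Total cold run time" and "Total hot run time" lines reported for Round 1.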

hello-stephen · 2026-05-11T04:27:33Z

TPC-DS: Total hot run time: 170403 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 90317a6c13e401bcf8ca372725641da54b2976a9, data reload: false

query5	4361	654	509	509
query6	321	218	200	200
query7	4234	571	317	317
query8	325	227	209	209
query9	8858	3998	4051	3998
query10	453	352	287	287
query11	5793	2446	2261	2261
query12	183	127	126	126
query13	1298	616	433	433
query14	6683	5393	5120	5120
query14_1	4398	4434	4323	4323
query15	214	202	181	181
query16	997	435	429	429
query17	1143	772	621	621
query18	2716	467	347	347
query19	210	203	170	170
query20	137	136	129	129
query21	220	136	116	116
query22	13541	13606	13368	13368
query23	17186	16389	16011	16011
query23_1	16213	16113	16161	16113
query24	7433	1733	1350	1350
query24_1	1358	1363	1334	1334
query25	554	481	444	444
query26	1282	311	170	170
query27	2726	565	326	326
query28	4410	1966	1925	1925
query29	1012	611	499	499
query30	302	242	198	198
query31	1147	1066	940	940
query32	91	77	76	76
query33	560	360	314	314
query34	1167	1078	640	640
query35	763	791	695	695
query36	1380	1376	1133	1133
query37	157	103	92	92
query38	3256	3154	3049	3049
query39	939	923	950	923
query39_1	871	878	864	864
query40	241	156	146	146
query41	72	63	61	61
query42	115	122	115	115
query43	331	329	296	296
query44	
query45	208	200	196	196
query46	1059	1179	728	728
query47	2388	2382	2240	2240
query48	427	420	287	287
query49	630	533	442	442
query50	734	299	225	225
query51	4378	4340	4244	4244
query52	113	105	96	96
query53	245	280	204	204
query54	324	268	253	253
query55	92	95	85	85
query56	304	304	300	300
query57	1421	1397	1298	1298
query58	291	271	276	271
query59	1549	1588	1420	1420
query60	362	339	328	328
query61	157	153	156	153
query62	673	617	551	551
query63	247	196	207	196
query64	2360	827	683	683
query65	
query66	1681	514	403	403
query67	29936	29945	29833	29833
query68	
query69	453	339	300	300
query70	1023	986	995	986
query71	317	286	274	274
query72	3120	2915	2506	2506
query73	780	762	431	431
query74	5077	4942	4776	4776
query75	2809	2701	2324	2324
query76	2324	1117	736	736
query77	408	432	348	348
query78	12943	12977	12320	12320
query79	1490	1017	751	751
query80	1352	598	493	493
query81	529	274	241	241
query82	1074	162	120	120
query83	344	277	252	252
query84	256	141	111	111
query85	928	512	443	443
query86	455	338	302	302
query87	3408	3346	3227	3227
query88	3547	2679	2665	2665
query89	447	385	337	337
query90	1935	181	183	181
query91	180	172	141	141
query92	79	80	74	74
query93	1035	956	556	556
query94	727	357	290	290
query95	682	401	343	343
query96	1047	778	347	347
query97	2701	2718	2570	2570
query98	245	233	235	233
query99	1143	1108	983	983
Total cold run time: 255102 ms
Total hot run time: 170403 ms

hello-stephen · 2026-05-11T04:37:13Z

FE UT Coverage Report

Increment line coverage 0.00% (0/31) 🎉
Increment coverage report
Complete coverage report

mymeiyi · 2026-05-11T08:34:35Z

/review

github-actions

I did not add duplicate inline comments because the remaining blocking issue is already covered by the existing thread on docker/runtime/doris-compose/resource/init_fdb.sh:30.

Critical checkpoint conclusions:

Goal/test proof: the PR aims to support metadata failure recovery with cloud compose backup/restore credentials and recovery startup. The FE recovery-script exit handling has been fixed, but the FDB credentials path is still not correct for already-initialized clusters. I do not see a regression/docker test covering the upgrade/restart path with an existing HAS_INIT_FDB_FILE and backup_agent --blob-credentials.
Scope/focus: the current server-side diff is reasonably focused on compose cloud recovery and FE host correction, with SuiteCluster helper additions.
Concurrency/lifecycle: FE recovery host update runs during master startup before CloudClusterChecker, which matches the stated lifecycle goal. The FDB startup lifecycle remains unsafe because init_db returns before creating blob_creds.json on existing clusters while fdb.conf always passes that file to backup_agent.
Configuration: enable_check_fe_drop_in_safe_time becoming mutable is low risk. --cloud-config is create-time only and flows to FDB envs, but its credential file generation still needs to happen before the initialized-cluster guard or be made conditional.
Compatibility/persistence: logModifyFrontend(selfFe) has matching replay. I did not find an additional persistence issue in the reviewed diff.
Parallel paths: no additional parallel FE recovery path issue found. The existing FDB credential issue applies to the existing-cluster/restart path, distinct from first initialization.
Tests: missing coverage for --cloud-config propagation to blob_creds.json, FDB backup_agent startup with credentials, and metadata recovery startup from restore_snapshot.sh.
Observability: current logs are adequate for the FE recovery-script flow; the FDB credential failure would be easier to diagnose if credential creation/absence is logged before startup.

User focus: no additional user-provided review focus was supplied.
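The ordering problem described under Concurrency/lifecycle and Configuration can be illustrated with a small model of the first suggested remedy, creating the credentials file before the initialized-cluster guard. This is not the contents of `init_fdb.sh`: the marker file name, the function names, and the credential payload are all hypothetical.

```python
import json
import logging
import tempfile
from pathlib import Path

log = logging.getLogger("init_fdb_sketch")


def ensure_blob_credentials(creds_path: Path, credentials: dict) -> None:
    """Write the credentials file on every start, not only on first init."""
    creds_path.write_text(json.dumps(credentials))
    log.info("wrote blob credentials to %s", creds_path)


def init_db(data_dir: Path, credentials: dict) -> bool:
    """Return True if the database was initialized by this call."""
    # Credentials are created before the guard, so a restart of an
    # already-initialized cluster still finds the file.
    ensure_blob_credentials(data_dir / "blob_creds.json", credentials)

    marker = data_dir / "has_init_fdb"  # hypothetical marker file name
    if marker.exists():
        log.info("cluster already initialized, skipping database setup")
        return False
    marker.touch()
    return True


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    creds = {"placeholder": "not a real credential format"}
    first = init_db(data_dir, creds)
    # Simulate a volume that was initialized before credentials existed.
    (data_dir / "blob_creds.json").unlink()
    second = init_db(data_dir, creds)
    creds_present = (data_dir / "blob_creds.json").exists()

print(first, second, creds_present)
```

If the credentials step sat after the guard instead, the second call would return early and leave the file missing, which is the restart case the review describes. The log call before the guard also reflects the observability suggestion in the last checkpoint.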

resolve

dataroaring

LGTM

Copilot AI review requested due to automatic review settings April 23, 2026 07:38

mymeiyi requested review from CalvinKirs, dataroaring and morningman as code owners April 23, 2026 07:38

mymeiyi changed the title ~~[fix](fe) fix host not match if fe starts in metadata_failure_recovery~~ [fix](fe) fix host not match if start fe in metadata_failure_recovery Apr 23, 2026

Copilot started reviewing on behalf of mymeiyi April 23, 2026 07:39

Copilot AI reviewed Apr 23, 2026

mymeiyi force-pushed the fix-br-0423 branch from f270635 to 95d643e Compare April 24, 2026 06:14

mymeiyi requested review from gavinchou and w41ter as code owners April 24, 2026 06:14

w41ter previously approved these changes Apr 24, 2026

github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 24, 2026

github-actions Bot added the reviewed label Apr 24, 2026

mymeiyi dismissed w41ter’s stale review via a39f819 April 27, 2026 03:25

github-actions Bot removed the approved Indicates a PR has been approved by one committer. label Apr 27, 2026

w41ter previously approved these changes May 7, 2026

github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 7, 2026

github-actions Bot requested changes May 7, 2026

Comment thread docker/runtime/doris-compose/resource/init_fdb.sh

doris pr

90317a6

mymeiyi dismissed w41ter’s stale review via 90317a6 May 11, 2026 02:58

mymeiyi force-pushed the fix-br-0423 branch from a39f819 to 90317a6 Compare May 11, 2026 02:58

github-actions Bot removed the approved Indicates a PR has been approved by one committer. label May 11, 2026

gavinchou approved these changes May 11, 2026

gavinchou added the dev/4.1.x label May 11, 2026

github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 11, 2026

github-actions Bot previously requested changes May 11, 2026

dataroaring approved these changes May 12, 2026

mymeiyi merged commit 058d978 into apache:master May 12, 2026
34 of 35 checks passed

github-actions Bot added the dev/4.1.x-conflict label May 12, 2026

This was referenced May 18, 2026

branch-4.0: [fix](fe) fix host not match if start fe in metadata_failure_recovery (#62748) #63360

Open

branch-4.1: [fix](fe) fix host not match if start fe in metadata_failure_recovery (#62748) #63362

Open
