[fix](fe) fix host not match if start fe in metadata_failure_recovery #62748

Merged
mymeiyi merged 1 commit into apache:master from mymeiyi:fix-br-0423, May 12, 2026

@mymeiyi
Contributor

@mymeiyi mymeiyi commented Apr 23, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

  1. When an FE starts in metadata_failure_recovery mode on a host different from the one recorded in the metadata, CloudClusterChecker drops that FE. No FE is then left in BDBJE, so the FE cannot start normally:
2026-04-23 11:37:15,024 INFO (cloud cluster check|82) [Env.dropFrontendFromBDBJE():3515] remove frontend: name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:37:15,026 INFO (cloud cluster check|82) [CloudSystemInfoService.updateFrontends():442] dropped cloud frontend=name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false

2026-04-23 11:39:01,373 INFO (mysql-nio-pool-3|491) [BDBEnvironment.getReplicationGroupAdmin():237] addresses is empty
2026-04-23 11:39:01,374 WARN (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():105] failed to get leader: Cannot invoke "com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName()" because "replicationGroupAdmin" is null
2026-04-23 11:39:01,374 INFO (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():124] bdbje fes [], env fes []
  2. Modify the regression framework to support starting an FE with restore_snapshot.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings April 23, 2026 07:38
@Thearas
Contributor

Thearas commented Apr 23, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mymeiyi mymeiyi changed the title from "[fix](fe) fix host not match if fe starts in metadata_failure_recovery" to "[fix](fe) fix host not match if start fe in metadata_failure_recovery" Apr 23, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes cloud FE startup in metadata_failure_recovery when restored metadata contains a stale FE host/IP, which previously caused CloudClusterChecker to drop the only FE and leave the BDBJE group empty. Also extends the docker-compose runtime and regression framework to better support restore/snapshot recovery workflows.

Changes:

  • FE: In cloud recovery mode, locate the FE entry by nodeName and persist an updated host to match the current node before cloud cluster checking.
  • Regression framework: Add SuiteCluster start/stop helpers for meta services and recyclers.
  • Docker-compose runtime: Auto-detect and run a restore script, start FE with recovery flags, update default FDB version, and adjust fdb monitor config.
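The FE-side change can be pictured with a small model. This is only a sketch in Python, not the Java code in Env.java: the `Frontend` class, `fix_self_host`, and `checker_keep` are hypothetical names, the host `10.0.0.2` is made up, and the assumption that the checker compares FEs by host and edit-log port is inferred from the log lines in the PR description rather than confirmed by the diff.

```python
from dataclasses import dataclass, replace
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Frontend:
    """Hypothetical stand-in for an FE entry stored in the recovered metadata."""
    node_name: str
    host: str
    edit_log_port: int


def fix_self_host(frontends: List[Frontend], self_node_name: str,
                  current_host: str) -> List[Frontend]:
    """Locate the FE entry by node name and rewrite its host to the current one."""
    return [
        replace(fe, host=current_host)
        if fe.node_name == self_node_name and fe.host != current_host
        else fe
        for fe in frontends
    ]


def checker_keep(frontends: List[Frontend],
                 expected: Set[Tuple[str, int]]) -> List[Frontend]:
    """Toy cluster check (assumed behavior): keep only FEs whose address is expected."""
    return [fe for fe in frontends if (fe.host, fe.edit_log_port) in expected]


# Restored metadata still carries the old host; the node now runs elsewhere.
SELF_NAME = "fe_83d061f4_31b3_43ee_9764_5506795e0bfe"
recovered = [Frontend(SELF_NAME, "183.70.1.1", 9010)]
expected_addresses = {("10.0.0.2", 9010)}  # hypothetical current address

dropped_everything = checker_keep(recovered, expected_addresses)
fixed = fix_self_host(recovered, SELF_NAME, "10.0.0.2")
survivors = checker_keep(fixed, expected_addresses)
```

Without the host rewrite the toy check keeps nothing, which corresponds to the empty BDBJE group in the logs; with the rewrite applied first, the self node survives the check.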

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| regression-test/framework/src/main/groovy/org/apache/doris/regression/suite/SuiteCluster.groovy | Adds start/stop helpers for meta services and recyclers in the regression cluster wrapper. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Updates FE host in recovered metadata (cloud mode) to prevent CloudClusterChecker from dropping self during recovery. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes restore_snapshot.sh when present and starts FE with --metadata_failure_recovery and --recovery_journal_id. |
| docker/runtime/doris-compose/resource/fdb.conf | Adds a backup_agent.1 section alongside the existing backup agent config. |
| docker/runtime/doris-compose/command.py | Bumps default FoundationDB image version for cloud clusters. |

Comment thread docker/runtime/doris-compose/resource/init_fe.sh
w41ter
w41ter previously approved these changes Apr 24, 2026
@github-actions github-actions Bot added the approved label Apr 24, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@mymeiyi
Contributor Author

mymeiyi commented Apr 24, 2026

/review

@github-actions
Contributor

OpenCode automated review failed and did not complete.

Error: Review step was failure (possibly timeout or cancelled)
Workflow run: https://github.com/apache/doris/actions/runs/24875837124

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@mymeiyi
Contributor Author

mymeiyi commented Apr 24, 2026

run buildall

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/67) 🎉
Increment coverage report
Complete coverage report

@github-actions github-actions Bot removed the approved label Apr 27, 2026
@mymeiyi
Contributor Author

mymeiyi commented May 7, 2026

run buildall

w41ter
w41ter previously approved these changes May 7, 2026
@github-actions github-actions Bot added the approved label May 7, 2026
@github-actions
Contributor

github-actions Bot commented May 7, 2026

PR approved by at least one committer and no changes requested.

@mymeiyi
Contributor Author

mymeiyi commented May 7, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

I found one blocking reliability issue in the FDB backup-agent credential setup. Existing review threads already cover the recovery-script exit handling in init_fe.sh and the SuiteCluster naming/comment concerns, so I did not duplicate those.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to allow cloud FE metadata recovery after host changes and add compose support for restore workflows. The FE-side recovery host update matches that goal, but the FDB credential setup is incomplete for already-initialized compose clusters. No PR test evidence is provided in the checklist.
  • Scope: Most changes are focused, but the FDB backup-agent config change needs the credential file lifecycle fixed.
  • Concurrency/lifecycle: The FE recovery host update runs during master transition before CloudClusterChecker starts, so no new obvious runtime concurrency issue was found there. The FDB credential creation has a lifecycle bug across container restarts/upgrades.
  • Config compatibility: Changing enable_check_fe_drop_in_safe_time to mutable looks safe for runtime toggling. The FDB default version/config change needs compatibility with existing initialized compose volumes.
  • Persistence/transactionality: FE logs the modified frontend host after replay; I did not find an additional persistence gap in that path.
  • Parallel code paths: CloudClusterChecker self-node diff behavior appears addressed by updating the in-memory FE host before the checker runs.
  • Tests: No focused regression/manual test evidence was included for recovery with an already-initialized FDB container.
  • Observability/performance: Existing recovery logs are sufficient for the reviewed paths; no material performance concern found.

User focus: No additional user-provided review focus was specified.

Comment thread docker/runtime/doris-compose/resource/init_fdb.sh
@github-actions github-actions Bot removed the approved label May 11, 2026
@mymeiyi
Contributor Author

mymeiyi commented May 11, 2026

run buildall

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved label May 11, 2026
@hello-stephen
Contributor

TPC-H: Total hot run time: 29381 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 90317a6c13e401bcf8ca372725641da54b2976a9, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17694	3818	3900	3818
q2	q3	10694	858	605	605
q4	4675	454	342	342
q5	7436	1337	1150	1150
q6	193	166	137	137
q7	898	959	758	758
q8	9596	1354	1258	1258
q9	6246	5412	5309	5309
q10	6301	2090	1841	1841
q11	484	272	254	254
q12	681	422	291	291
q13	18204	3307	2750	2750
q14	295	284	261	261
q15	q16	903	864	787	787
q17	1011	1077	647	647
q18	6411	5661	5587	5587
q19	1280	1186	1086	1086
q20	517	385	257	257
q21	4718	2302	1913	1913
q22	453	390	330	330
Total cold run time: 98690 ms
Total hot run time: 29381 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4668	4554	4595	4554
q2	q3	4698	4773	4242	4242
q4	2215	2225	1412	1412
q5	5016	5029	5252	5029
q6	191	167	133	133
q7	2090	1794	1640	1640
q8	3333	3101	3119	3101
q9	8513	8440	8449	8440
q10	4552	4516	4238	4238
q11	631	419	405	405
q12	693	745	537	537
q13	3298	3600	2907	2907
q14	334	316	275	275
q15	q16	797	808	708	708
q17	1333	1299	1251	1251
q18	7884	7127	7113	7113
q19	1167	1174	1171	1171
q20	2238	2210	1938	1938
q21	6629	5387	4812	4812
q22	515	476	398	398
Total cold run time: 60795 ms
Total hot run time: 54304 ms
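The report does not label its columns. Checking the arithmetic on the Round 1 rows suggests they are: query, cold run, two hot runs, and the faster of the two hot runs, with the totals being plain column sums. The sketch below re-derives the Round 1 totals under that reading; `parse_rows` and `totals` are hypothetical helper names, and the column interpretation is an inference from the numbers, not something the report states.

```python
# Round 1 rows copied verbatim from the TPC-H report above.
# Rows such as "q2 q3" carry two labels but a single set of timings.
ROUND1 = """\
q1	17694	3818	3900	3818
q2	q3	10694	858	605	605
q4	4675	454	342	342
q5	7436	1337	1150	1150
q6	193	166	137	137
q7	898	959	758	758
q8	9596	1354	1258	1258
q9	6246	5412	5309	5309
q10	6301	2090	1841	1841
q11	484	272	254	254
q12	681	422	291	291
q13	18204	3307	2750	2750
q14	295	284	261	261
q15	q16	903	864	787	787
q17	1011	1077	647	647
q18	6411	5661	5587	5587
q19	1280	1186	1086	1086
q20	517	385	257	257
q21	4718	2302	1913	1913
q22	453	390	330	330
"""


def parse_rows(text):
    """Split each line into (label, cold, hot1, hot2, best); skip short lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # rows without timings are ignored
        label = " ".join(parts[:-4])
        cold, hot1, hot2, best = (int(p) for p in parts[-4:])
        rows.append((label, cold, hot1, hot2, best))
    return rows


def totals(rows):
    """Sum the cold column and the per-query minimum of the two hot runs."""
    cold_total = sum(cold for _, cold, _, _, _ in rows)
    hot_total = sum(min(hot1, hot2) for _, _, hot1, hot2, _ in rows)
    return cold_total, hot_total


rows = parse_rows(ROUND1)
cold_total, hot_total = totals(rows)
```

Under this reading, `cold_total` and `hot_total` come out to the 98690 ms and 29381 ms that the report prints for Round 1.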

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170403 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 90317a6c13e401bcf8ca372725641da54b2976a9, data reload: false

query5	4361	654	509	509
query6	321	218	200	200
query7	4234	571	317	317
query8	325	227	209	209
query9	8858	3998	4051	3998
query10	453	352	287	287
query11	5793	2446	2261	2261
query12	183	127	126	126
query13	1298	616	433	433
query14	6683	5393	5120	5120
query14_1	4398	4434	4323	4323
query15	214	202	181	181
query16	997	435	429	429
query17	1143	772	621	621
query18	2716	467	347	347
query19	210	203	170	170
query20	137	136	129	129
query21	220	136	116	116
query22	13541	13606	13368	13368
query23	17186	16389	16011	16011
query23_1	16213	16113	16161	16113
query24	7433	1733	1350	1350
query24_1	1358	1363	1334	1334
query25	554	481	444	444
query26	1282	311	170	170
query27	2726	565	326	326
query28	4410	1966	1925	1925
query29	1012	611	499	499
query30	302	242	198	198
query31	1147	1066	940	940
query32	91	77	76	76
query33	560	360	314	314
query34	1167	1078	640	640
query35	763	791	695	695
query36	1380	1376	1133	1133
query37	157	103	92	92
query38	3256	3154	3049	3049
query39	939	923	950	923
query39_1	871	878	864	864
query40	241	156	146	146
query41	72	63	61	61
query42	115	122	115	115
query43	331	329	296	296
query44	
query45	208	200	196	196
query46	1059	1179	728	728
query47	2388	2382	2240	2240
query48	427	420	287	287
query49	630	533	442	442
query50	734	299	225	225
query51	4378	4340	4244	4244
query52	113	105	96	96
query53	245	280	204	204
query54	324	268	253	253
query55	92	95	85	85
query56	304	304	300	300
query57	1421	1397	1298	1298
query58	291	271	276	271
query59	1549	1588	1420	1420
query60	362	339	328	328
query61	157	153	156	153
query62	673	617	551	551
query63	247	196	207	196
query64	2360	827	683	683
query65	
query66	1681	514	403	403
query67	29936	29945	29833	29833
query68	
query69	453	339	300	300
query70	1023	986	995	986
query71	317	286	274	274
query72	3120	2915	2506	2506
query73	780	762	431	431
query74	5077	4942	4776	4776
query75	2809	2701	2324	2324
query76	2324	1117	736	736
query77	408	432	348	348
query78	12943	12977	12320	12320
query79	1490	1017	751	751
query80	1352	598	493	493
query81	529	274	241	241
query82	1074	162	120	120
query83	344	277	252	252
query84	256	141	111	111
query85	928	512	443	443
query86	455	338	302	302
query87	3408	3346	3227	3227
query88	3547	2679	2665	2665
query89	447	385	337	337
query90	1935	181	183	181
query91	180	172	141	141
query92	79	80	74	74
query93	1035	956	556	556
query94	727	357	290	290
query95	682	401	343	343
query96	1047	778	347	347
query97	2701	2718	2570	2570
query98	245	233	235	233
query99	1143	1108	983	983
Total cold run time: 255102 ms
Total hot run time: 170403 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/31) 🎉
Increment coverage report
Complete coverage report

@mymeiyi
Contributor Author

mymeiyi commented May 11, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

I did not add duplicate inline comments because the remaining blocking issue is already covered by the existing thread on docker/runtime/doris-compose/resource/init_fdb.sh:30.

Critical checkpoint conclusions:

  • Goal/test proof: the PR aims to support metadata failure recovery with cloud compose backup/restore credentials and recovery startup. The FE recovery-script exit handling has been fixed, but the FDB credentials path is still not correct for already-initialized clusters. I do not see a regression/docker test covering the upgrade/restart path with an existing HAS_INIT_FDB_FILE and backup_agent --blob-credentials.
  • Scope/focus: the current server-side diff is reasonably focused on compose cloud recovery and FE host correction, with SuiteCluster helper additions.
  • Concurrency/lifecycle: FE recovery host update runs during master startup before CloudClusterChecker, which matches the stated lifecycle goal. The FDB startup lifecycle remains unsafe because init_db returns before creating blob_creds.json on existing clusters while fdb.conf always passes that file to backup_agent.
  • Configuration: enable_check_fe_drop_in_safe_time becoming mutable is low risk. --cloud-config is create-time only and flows to FDB envs, but its credential file generation still needs to happen before the initialized-cluster guard or be made conditional.
  • Compatibility/persistence: logModifyFrontend(selfFe) has matching replay. I did not find an additional persistence issue in the reviewed diff.
  • Parallel paths: no additional parallel FE recovery path issue found. The existing FDB credential issue applies to the existing-cluster/restart path, distinct from first initialization.
  • Tests: missing coverage for --cloud-config propagation to blob_creds.json, FDB backup_agent startup with credentials, and metadata recovery startup from restore_snapshot.sh.
  • Observability: current logs are adequate for the FE recovery-script flow; the FDB credential failure would be easier to diagnose if credential creation/absence is logged before startup.
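The ordering problem described in the lifecycle and configuration bullets can be illustrated with a minimal sketch. It is written in Python rather than the shell of init_fdb.sh, and it is not the script's actual code: `ensure_blob_creds` and `init_fdb` are hypothetical functions, the credential payload is a placeholder rather than the real blob credentials format, and the marker file is simply named after the HAS_INIT_FDB_FILE variable mentioned in the review.

```python
import json
from pathlib import Path


def ensure_blob_creds(conf_dir: Path, creds: dict) -> Path:
    """Create blob_creds.json if it is missing; leave an existing file alone."""
    path = conf_dir / "blob_creds.json"
    if not path.exists():
        # Placeholder payload: the real credential file format is not shown in the PR page.
        path.write_text(json.dumps(creds))
    return path


def init_fdb(conf_dir: Path, creds: dict) -> bool:
    """Return True when first-time initialization ran, False on an existing cluster."""
    # Credential setup happens BEFORE the initialized-cluster guard, so a
    # restarted or upgraded container still gets the file backup_agent expects.
    ensure_blob_creds(conf_dir, creds)

    marker = conf_dir / "HAS_INIT_FDB_FILE"  # hypothetical marker file name
    if marker.exists():
        return False  # already initialized: skip database setup only

    # First-time database configuration would go here.
    marker.touch()
    return True
```

The alternative the review mentions, making the `--blob-credentials` argument conditional on the file existing, would also avoid the failure; this sketch only shows the reordering option.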

User focus: no additional user-provided review focus was supplied.

Contributor

@dataroaring dataroaring left a comment

LGTM

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.x, dev/4.1.x-conflict, reviewed
