branch-4.0: [fix](fe) fix host not match if start fe in metadata_failure_recovery (#62748) #63360
mymeiyi wants to merge 1 commit into
…apache#62748)

1. When an FE starts in metadata_failure_recovery mode with a different host, the `CloudClusterChecker` drops the FE, so no FE is left in BDBJE and the FE cannot start normally:

```
2026-04-23 11:37:15,024 INFO (cloud cluster check|82) [Env.dropFrontendFromBDBJE():3515] remove frontend: name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:37:15,026 INFO (cloud cluster check|82) [CloudSystemInfoService.updateFrontends():442] dropped cloud frontend=name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:39:01,373 INFO (mysql-nio-pool-3|491) [BDBEnvironment.getReplicationGroupAdmin():237] addresses is empty
2026-04-23 11:39:01,374 WARN (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():105] failed to get leader: Cannot invoke "com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName()" because "replicationGroupAdmin" is null
2026-04-23 11:39:01,374 INFO (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():124] bdbje fes [], env fes []
```

2. Modify the regression framework to support starting an FE with restore_snapshot.
Pull request overview
Cherry-pick of #62748 into branch-4.0. Fixes an issue where starting an FE in metadata_failure_recovery mode with a different host (e.g. after backup-restore on a new machine) would cause CloudClusterChecker to drop the only FE from BDBJE, leaving the replication group empty and the FE unable to start. Also extends the regression docker-compose framework to drive restore_snapshot flows for cloud clusters.
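The failure mode can be pictured with a small model. This is only a sketch under assumptions: it assumes the checker reconciles the locally known FEs against a set of expected `host:port` addresses, and the class `DropMismatchSketch`, its `Fe` type, the `reconcile` method, the name `fe_restored`, and the new host `10.0.0.2` are all hypothetical, not Doris code. The old address `183.70.1.1:9010` is the one that appears in the log in the PR description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only; none of these types are Doris classes.
final class DropMismatchSketch {
    static final class Fe {
        final String name;
        final String host;
        final int editLogPort;

        Fe(String name, String host, int editLogPort) {
            this.name = name;
            this.host = host;
            this.editLogPort = editLogPort;
        }

        String address() {
            return host + ":" + editLogPort;
        }
    }

    // Keep only the locally known FEs whose address is still expected;
    // every other entry is treated as stale and dropped.
    static List<Fe> reconcile(List<Fe> local, Set<String> expectedAddresses) {
        List<Fe> kept = new ArrayList<>();
        for (Fe fe : local) {
            if (expectedAddresses.contains(fe.address())) {
                kept.add(fe);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Restored metadata still records the old host.
        List<Fe> local = Arrays.asList(new Fe("fe_restored", "183.70.1.1", 9010));
        // The new machine is registered under a different (hypothetical) host.
        Set<String> expected = new HashSet<>(Arrays.asList("10.0.0.2:9010"));
        System.out.println("remaining FEs: " + reconcile(local, expected).size());
    }
}
```

In this model the only entry is removed because its address no longer matches, which corresponds to the `bdbje fes [], env fes []` state reported in the log.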
Changes:
- In `Env.checkCurrentNodeExist`, when running in recovery mode, look up the local FE by node name and, after validating that the role and edit-log port match, rewrite its host to the current node's address and persist via `logModifyFrontend` before `CloudClusterChecker` runs.
- Init script `init_fe.sh` detects `conf/restore_snapshot.sh`, runs it, extracts `JOURNAL_ID`, and passes `--metadata_failure_recovery --recovery_journal_id` to `run_fe`; `init_fdb.sh` writes `blob_creds.json` from `DORIS_CLOUD_*` env vars and `fdb.conf` enables `backup_agent` with those credentials.
- doris-compose gains a `--cloud-config` flag (merged into the cloud store config and propagated to FDB container env) plus `fdb-version` default bumped to `7.3.69`; `SuiteCluster` exposes `startMetaServices`/`stopMetaServices`/`startRecyclers`/`stopRecyclers` and a `cloudStoreConfigs` option. `enable_check_fe_drop_in_safe_time` becomes mutable (`CONF_mBool`).
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Adds updateRecoveryFrontendHostIfNeeded to rewrite local FE host during metadata recovery in cloud mode. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes restore_snapshot.sh, extracts JOURNAL_ID, and starts FE with recovery args. |
| docker/runtime/doris-compose/resource/init_fdb.sh | Generates FDB blob_creds.json from DORIS_CLOUD_* env vars (OSS/COS embed bucket in AK). |
| docker/runtime/doris-compose/resource/fdb.conf | Points backup_agent at the generated blob credentials and adds a second agent instance. |
| docker/runtime/doris-compose/command.py | Adds --cloud-config overrides, bumps default fdb-version to 7.3.69, reuses cached cloud config for storage vault. |
| docker/runtime/doris-compose/cluster.py | Propagates cloud_store_config into the FDB node's docker env. |
| cloud/src/common/config.h | Makes enable_check_fe_drop_in_safe_time mutable at runtime. |
| regression-test/framework/.../SuiteCluster.groovy | Adds cloudStoreConfigs option and start/stop helpers for meta services and recyclers. |
run buildall
pick: #62748