
branch-4.0: [fix](fe) fix host not match if start fe in metadata_failure_recovery (#62748) #63360

Open: mymeiyi wants to merge 1 commit into apache:branch-4.0 from mymeiyi:branch-4.0-pick-62748


@mymeiyi mymeiyi commented May 18, 2026

pick: #62748

…apache#62748)

1. When an FE starts in metadata_failure_recovery mode on a different host,
`CloudClusterChecker` drops that FE, leaving no FE in BDBJE,
so the FE cannot start normally:

```
2026-04-23 11:37:15,024 INFO (cloud cluster check|82) [Env.dropFrontendFromBDBJE():3515] remove frontend: name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false
2026-04-23 11:37:15,026 INFO (cloud cluster check|82) [CloudSystemInfoService.updateFrontends():442] dropped cloud frontend=name: fe_83d061f4_31b3_43ee_9764_5506795e0bfe, role: FOLLOWER, 183.70.1.1:9010, is alive: false

2026-04-23 11:39:01,373 INFO (mysql-nio-pool-3|491) [BDBEnvironment.getReplicationGroupAdmin():237] addresses is empty
2026-04-23 11:39:01,374 WARN (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():105] failed to get leader: Cannot invoke "com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName()" because "replicationGroupAdmin" is null
2026-04-23 11:39:01,374 INFO (mysql-nio-pool-3|491) [FrontendsProcNode.getFrontendsInfo():124] bdbje fes [], env fes []
```
2. Modify the regression framework to support starting an FE with restore_snapshot.
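
To make the failure in item 1 concrete, here is a minimal sketch of how a host-based membership check can empty the frontend list after a restore onto a new machine. It is a simplified model under stated assumptions, not the actual `CloudClusterChecker` logic: the `Fe` class, the `dropUnexpected` helper, the shortened node name, and the `10.0.0.5` address are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class HostMismatchDemo {
    /** Hypothetical stand-in for a frontend entry restored from metadata. */
    static final class Fe {
        final String name;
        final String host;
        final int editLogPort;

        Fe(String name, String host, int editLogPort) {
            this.name = name;
            this.host = host;
            this.editLogPort = editLogPort;
        }

        String endpoint() {
            return host + ":" + editLogPort;
        }
    }

    /**
     * Simplified model of a host-based membership check (an assumption, not
     * the real checker): keep only the frontends whose endpoint appears in
     * the expected set and drop the rest.
     */
    static List<Fe> dropUnexpected(List<Fe> restored, Set<String> expectedEndpoints) {
        List<Fe> kept = new ArrayList<>();
        for (Fe fe : restored) {
            if (expectedEndpoints.contains(fe.endpoint())) {
                kept.add(fe);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The restored metadata still records the old host seen in the log.
        List<Fe> restored = List.of(new Fe("fe_old", "183.70.1.1", 9010));
        // The new machine has a different, made-up address.
        Set<String> expected = Set.of("10.0.0.5:9010");
        List<Fe> remaining = dropUnexpected(restored, expected);
        System.out.println("remaining frontends: " + remaining.size());
    }
}
```

In this model the only restored frontend is dropped because its recorded host no longer matches, which mirrors the `bdbje fes [], env fes []` line in the log.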
Copilot AI review requested due to automatic review settings May 18, 2026 09:27
@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


Copilot AI left a comment


Pull request overview

Cherry-pick of #62748 into branch-4.0. Fixes an issue where starting an FE in metadata_failure_recovery mode with a different host (e.g. after backup-restore on a new machine) would cause CloudClusterChecker to drop the only FE from BDBJE, leaving the replication group empty and unable to start. Also extends the regression docker-compose framework to drive restore_snapshot flows for cloud clusters.

Changes:

  • In Env.checkCurrentNodeExist, when running in recovery mode, look up the local FE by node name and, after validating role and edit-log port match, rewrite its host to the current node's address and persist via logModifyFrontend before CloudClusterChecker runs.
  • Init script init_fe.sh detects conf/restore_snapshot.sh, runs it, extracts JOURNAL_ID, and passes --metadata_failure_recovery --recovery_journal_id to run_fe; init_fdb.sh writes blob_creds.json from DORIS_CLOUD_* env vars and fdb.conf enables backup_agent with those credentials.
  • doris-compose gains a --cloud-config flag (merged into the cloud store config and propagated to FDB container env) plus fdb-version default bumped to 7.3.69; SuiteCluster exposes startMetaServices/stopMetaServices/startRecyclers/stopRecyclers and a cloudStoreConfigs option. enable_check_fe_drop_in_safe_time becomes mutable (CONF_mBool).
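
The host rewrite described in the first bullet can be sketched roughly as follows. This is an illustrative model only, not the code from `Env.java`: the `Frontend` class, the `rewriteHostIfNeeded` signature, and the list standing in for the edit log are hypothetical, and the real method may handle a role or port mismatch differently (for example by failing startup rather than returning quietly).

```java
import java.util.List;
import java.util.Map;

final class RecoveryHostRewrite {
    enum Role { FOLLOWER, OBSERVER }

    /** Hypothetical stand-in for a frontend record stored in metadata. */
    static final class Frontend {
        final String nodeName;
        final Role role;
        String host;
        final int editLogPort;

        Frontend(String nodeName, Role role, String host, int editLogPort) {
            this.nodeName = nodeName;
            this.role = role;
            this.host = host;
            this.editLogPort = editLogPort;
        }
    }

    /**
     * Looks up the local FE by node name, checks that role and edit-log port
     * match, then rewrites the recorded host to the current address and
     * appends a record to the journal (a stand-in for logModifyFrontend).
     * Returns true only when the host was actually changed.
     */
    static boolean rewriteHostIfNeeded(Map<String, Frontend> frontends, String selfName,
            Role selfRole, String selfHost, int selfEditLogPort, List<String> journal) {
        Frontend fe = frontends.get(selfName);
        if (fe == null) {
            return false; // no record for this node name
        }
        if (fe.role != selfRole || fe.editLogPort != selfEditLogPort) {
            return false; // validation failed, leave metadata untouched
        }
        if (fe.host.equals(selfHost)) {
            return false; // host already matches, nothing to persist
        }
        fe.host = selfHost;
        journal.add("modifyFrontend " + selfName + " " + selfHost + ":" + selfEditLogPort);
        return true;
    }
}
```

Matching on node name rather than host is what lets the restored record be recognized as the local FE after it moves to a new machine. The rewrite has to be persisted before the cluster check runs for the fix to take effect.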

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Adds updateRecoveryFrontendHostIfNeeded to rewrite local FE host during metadata recovery in cloud mode. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes restore_snapshot.sh, extracts JOURNAL_ID, and starts FE with recovery args. |
| docker/runtime/doris-compose/resource/init_fdb.sh | Generates FDB blob_creds.json from DORIS_CLOUD_* env vars (OSS/COS embed bucket in AK). |
| docker/runtime/doris-compose/resource/fdb.conf | Points backup_agent at the generated blob credentials and adds a second agent instance. |
| docker/runtime/doris-compose/command.py | Adds --cloud-config overrides, bumps default fdb-version to 7.3.69, reuses cached cloud config for storage vault. |
| docker/runtime/doris-compose/cluster.py | Propagates cloud_store_config into the FDB node's docker env. |
| cloud/src/common/config.h | Makes enable_check_fe_drop_in_safe_time mutable at runtime. |
| regression-test/framework/.../SuiteCluster.groovy | Adds cloudStoreConfigs option and start/stop helpers for meta services and recyclers. |



mymeiyi commented May 18, 2026

run buildall

@hello-stephen

Cloud UT Coverage Report

Increment line coverage 🎉


| Category | Coverage |
| --- | --- |
| Function Coverage | 79.44% (1801/2267) |
| Line Coverage | 64.90% (32496/50072) |
| Region Coverage | 65.63% (16236/24737) |
| Branch Coverage | 56.20% (8665/15418) |
