[fix](fe) fix host not match if start fe in metadata_failure_recovery #62748
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Pull request overview
Fixes cloud FE startup in metadata_failure_recovery when restored metadata contains a stale FE host/IP, which previously caused CloudClusterChecker to drop the only FE and leave the BDBJE group empty. Also extends the docker-compose runtime and regression framework to better support restore/snapshot recovery workflows.
Changes:
- FE: In cloud recovery mode, locate the FE entry by `nodeName` and persist an updated host to match the current node before cloud cluster checking.
- Regression framework: Add SuiteCluster start/stop helpers for meta services and recyclers.
- Docker-compose runtime: Auto-detect and run a restore script, start FE with recovery flags, update default FDB version, and adjust fdb monitor config.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/framework/src/main/groovy/org/apache/doris/regression/suite/SuiteCluster.groovy | Adds start/stop helpers for meta services and recyclers in the regression cluster wrapper. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java | Updates FE host in recovered metadata (cloud mode) to prevent CloudClusterChecker from dropping self during recovery. |
| docker/runtime/doris-compose/resource/init_fe.sh | Executes restore_snapshot.sh when present and starts FE with --metadata_failure_recovery and --recovery_journal_id. |
| docker/runtime/doris-compose/resource/fdb.conf | Adds a backup_agent.1 section alongside the existing backup agent config. |
| docker/runtime/doris-compose/command.py | Bumps default FoundationDB image version for cloud clusters. |
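
The `init_fe.sh` behavior summarized in the table can be sketched roughly as follows. This is a hedged illustration, not the PR's actual script: the function names `run_restore_if_present` and `build_fe_args` are hypothetical, the way the journal id is obtained is not described in this conversation (here it is simply a parameter), and the space-separated form of `--recovery_journal_id` is an assumption.

```shell
# Hypothetical sketch of the restore-then-start flow; not the real init_fe.sh.

# Run the restore script if it exists and propagate its failure,
# so a broken restore never leads to a normal-looking FE start.
run_restore_if_present() {
    local restore_script="$1"
    if [ ! -f "$restore_script" ]; then
        echo "no restore script at ${restore_script}; normal startup" >&2
        return 0
    fi
    echo "running ${restore_script}" >&2
    if ! bash "$restore_script"; then
        echo "restore script failed; refusing to start FE" >&2
        return 1
    fi
}

# Print the extra FE start flags for recovery mode.
# The journal id source is an assumption: it is passed in by the caller.
build_fe_args() {
    local journal_id="${1:-}"
    if [ -n "$journal_id" ]; then
        printf '%s\n' "--metadata_failure_recovery --recovery_journal_id ${journal_id}"
    fi
}
```

A caller would stop on a non-zero return from `run_restore_if_present` and append the output of `build_fe_args` to the FE start command only when a restore actually ran.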
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was failure (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
|
run buildall |
FE Regression Coverage Report: Increment line coverage |
|
run buildall |
|
/review |
I found one blocking reliability issue in the FDB backup-agent credential setup. Existing review threads already cover the recovery-script exit handling in init_fe.sh and the SuiteCluster naming/comment concerns, so I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test: The PR aims to allow cloud FE metadata recovery after host changes and add compose support for restore workflows. The FE-side recovery host update matches that goal, but the FDB credential setup is incomplete for already-initialized compose clusters. No PR test evidence is provided in the checklist.
- Scope: Most changes are focused, but the FDB backup-agent config change needs the credential file lifecycle fixed.
- Concurrency/lifecycle: The FE recovery host update runs during master transition before CloudClusterChecker starts, so no new obvious runtime concurrency issue was found there. The FDB credential creation has a lifecycle bug across container restarts/upgrades.
- Config compatibility: Changing enable_check_fe_drop_in_safe_time to mutable looks safe for runtime toggling. The FDB default version/config change needs compatibility with existing initialized compose volumes.
- Persistence/transactionality: FE logs the modified frontend host after replay; I did not find an additional persistence gap in that path.
- Parallel code paths: CloudClusterChecker self-node diff behavior appears addressed by updating the in-memory FE host before the checker runs.
- Tests: No focused regression/manual test evidence was included for recovery with an already-initialized FDB container.
- Observability/performance: Existing recovery logs are sufficient for the reviewed paths; no material performance concern found.
User focus: No additional user-provided review focus was specified.
|
run buildall |
TPC-H: Total hot run time: 29381 ms |
TPC-DS: Total hot run time: 170403 ms |
FE UT Coverage Report: Increment line coverage |
|
/review |
I did not add duplicate inline comments because the remaining blocking issue is already covered by the existing thread on docker/runtime/doris-compose/resource/init_fdb.sh:30.
Critical checkpoint conclusions:
- Goal/test proof: the PR aims to support metadata failure recovery with cloud compose backup/restore credentials and recovery startup. The FE recovery-script exit handling has been fixed, but the FDB credentials path is still not correct for already-initialized clusters. I do not see a regression/docker test covering the upgrade/restart path with an existing `HAS_INIT_FDB_FILE` and `backup_agent --blob-credentials`.
- Scope/focus: the current server-side diff is reasonably focused on compose cloud recovery and FE host correction, with SuiteCluster helper additions.
- Concurrency/lifecycle: FE recovery host update runs during master startup before CloudClusterChecker, which matches the stated lifecycle goal. The FDB startup lifecycle remains unsafe because `init_db` returns before creating `blob_creds.json` on existing clusters while `fdb.conf` always passes that file to `backup_agent`.
- Configuration: `enable_check_fe_drop_in_safe_time` becoming mutable is low risk. `--cloud-config` is create-time only and flows to FDB envs, but its credential file generation still needs to happen before the initialized-cluster guard or be made conditional.
- Compatibility/persistence: `logModifyFrontend(selfFe)` has matching replay. I did not find an additional persistence issue in the reviewed diff.
- Parallel paths: no additional parallel FE recovery path issue found. The existing FDB credential issue applies to the existing-cluster/restart path, distinct from first initialization.
- Tests: missing coverage for `--cloud-config` propagation to `blob_creds.json`, FDB `backup_agent` startup with credentials, and metadata recovery startup from `restore_snapshot.sh`.
- Observability: current logs are adequate for the FE recovery-script flow; the FDB credential failure would be easier to diagnose if credential creation/absence is logged before startup.
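
One way to read the ordering concern above is sketched below. It is a hedged sketch rather than the actual `init_fdb.sh`: the helper `ensure_blob_creds`, the arguments taken by `init_db`, the use of `HAS_INIT_FDB_FILE` as a marker file name inside a work directory, and the credential JSON content are all assumptions made for illustration. The point is only that the credential file is written before the initialized-cluster guard returns, and that its creation or absence is logged.

```shell
# Hypothetical sketch; only the names init_db, HAS_INIT_FDB_FILE and
# blob_creds.json come from the review, everything else is assumed.

# Write the credential file on every start and log the outcome.
ensure_blob_creds() {
    local creds_file="$1"
    local creds_json="${2:-}"
    if [ -z "$creds_json" ]; then
        echo "no blob credentials supplied; not writing ${creds_file}" >&2
        return 0
    fi
    printf '%s\n' "$creds_json" > "$creds_file"
    chmod 600 "$creds_file"
    echo "wrote blob credentials to ${creds_file}" >&2
}

init_db() {
    local work_dir="$1"
    local creds_json="${2:-}"
    local marker="${work_dir}/HAS_INIT_FDB_FILE"

    # Runs before the guard, so restarts of an initialized cluster
    # still get a fresh credential file.
    ensure_blob_creds "${work_dir}/blob_creds.json" "$creds_json"

    if [ -f "$marker" ]; then
        echo "fdb already initialized; skipping one-time setup" >&2
        return 0
    fi

    # Placeholder for the real one-time database initialization.
    touch "$marker"
}
```

The conditional alternative mentioned in the review would instead leave `init_db` alone and pass `--blob-credentials` to `backup_agent` only when `blob_creds.json` exists.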
User focus: no additional user-provided review focus was supplied.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
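When FE starts in metadata_failure_recovery mode with a host that does not match the restored metadata, `CloudClusterChecker` drops the FE, leaving no FE in BDBJE, so the FE cannot start normally.
Release note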
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)