[Feature] Doris Cluster Snapshot Backup #61465
kaori-seasons wants to merge 13 commits into apache:master
run buildall
/review
Code Review Summary: [Feature] Doris Cluster Snapshot Backup
Reviewed all 17 changed files (4376 additions, 43 deletions). This PR introduces a new DorisSnapshotManager (C++) and DorisCloudSnapshotHandler (Java) implementing the full snapshot lifecycle (begin, update, commit, abort, drop, list, clone) plus recycling, compaction, and migration.
Critical Issues (2)
- Dead/unreachable code in `drop_snapshot` (`doris_snapshot_manager.cpp:552-555`): The idempotency check for `SNAPSHOT_RECYCLED` is unreachable because the guard at lines 542-548 already rejects all statuses except NORMAL and ABORTED. A user dropping an already-recycled snapshot gets an error instead of the intended OK response.
- TOCTOU race condition in `begin_snapshot` (`doris_snapshot_manager.cpp:248-264`): The two-transaction pattern creates a window where the snapshot exists without `image_url`. The second transaction does a blind overwrite without re-reading, so any concurrent `update_snapshot` modifications between txn1 and txn2 are silently lost. If txn2 fails, an orphaned PREPARE snapshot remains.
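The ordering problem in `drop_snapshot` can be illustrated with a minimal sketch. The enum values and the `classify_drop` helper below are hypothetical names chosen for illustration, not the PR's actual types; the only point is that the idempotency check has to run before the status guard, otherwise the guard shadows it.

```cpp
// Hypothetical status enum; the names mirror the statuses mentioned in the review.
enum class SnapshotStatus { PREPARE, NORMAL, ABORTED, RECYCLED };

// Hypothetical outcome of a drop request.
enum class DropResult { OK, ALREADY_DROPPED, INVALID_STATE };

// Decide the outcome of a drop request from the snapshot's current status.
DropResult classify_drop(SnapshotStatus status) {
    // Idempotency check first: an already-recycled snapshot is reported as
    // "already dropped", which the caller can map to an OK response.
    if (status == SnapshotStatus::RECYCLED) {
        return DropResult::ALREADY_DROPPED;
    }
    // Only after that, reject every status that cannot be dropped (e.g. PREPARE).
    if (status != SnapshotStatus::NORMAL && status != SnapshotStatus::ABORTED) {
        return DropResult::INVALID_STATE;
    }
    return DropResult::OK;
}
```

With this ordering, a repeated drop of the same snapshot reaches the idempotent branch instead of being rejected by the guard.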
Major Issues (5)
- `abort_snapshot` allows aborting NORMAL/RECYCLED snapshots (`doris_snapshot_manager.cpp:474-486`): Only ABORTED status has an idempotency guard. Any other status (including NORMAL, RECYCLED) is blindly transitioned to ABORTED, which is semantically incorrect and could interfere with the recycler.
- `txn->commit()` return value ignored in `recycle_snapshots` (`doris_snapshot_manager.cpp:962-964, 978-980, 1020-1023`): When marking snapshots as ABORTED/RECYCLED during recycling, commit failures are silently ignored. The in-memory status is already mutated, so subsequent classification logic may incorrectly add still-NORMAL/PREPARE snapshots to the recycling queue, potentially causing premature data deletion.
- `inverted_check_mvcc_meta_key` is a no-op (`doris_snapshot_manager.cpp:1330-1383`): Builds a `valid_snapshots` set but never queries it. The reference key scan only counts entries without validating them. Always returns 0 (success).
- TOCTOU race in `submitJob`/`submitAutoJob` (`DorisCloudSnapshotHandler.java:118-121`): The check-then-act on `AtomicReference<SnapshotState>` is not atomic. Two concurrent callers can both pass the null-check and enqueue work, potentially creating two snapshot workflows.
- `buildObjectKey` ignores `objInfo.prefix` (`DorisCloudSnapshotHandler.java:260-264`): The method accepts `objInfo` but never uses `objInfo.getPrefix()`. If the object store has a non-empty prefix configured, uploads go to the wrong path and images won't be found during restore.
Minor Issues
- Duplicate `txn_kv_` member (`doris_snapshot_manager.h:100`): Both base and derived class store a `shared_ptr<TxnKv>`. Should make the base class member `protected` or add an accessor.
- `current_time_seconds()` called twice (`doris_snapshot_manager.cpp:819-820`): `ctime` and `mtime` could differ if a second boundary is crossed. Capture once.
- Byte increment overflow (`doris_snapshot_manager.cpp:1095-1098`): `ref_end.back() + 1` wraps to 0x00 if the last byte is 0xFF, producing an empty/incorrect range scan.
- Massive code duplication (`DorisCloudSnapshotHandler.java`): `executeSnapshotWorkflow` and `executeAutoSnapshotWorkflow` are ~90% identical. Should be unified with an `isAuto` parameter.
- Config typo: `cloud_auto_snapshot_max_reversed_num` should be `cloud_auto_snapshot_max_reserved_num` (`Config.java:3858`). Fixing later is a breaking change for users.
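The byte increment overflow can be avoided by carrying past trailing 0xFF bytes when computing the exclusive end of a prefix scan. The following is a minimal sketch; `prefix_range_end` is a hypothetical helper name, and the convention that an empty result means "no upper bound" is an assumption the caller would have to handle.

```cpp
#include <string>

// Return the smallest key strictly greater than every key that starts with
// `prefix`. An empty result means there is no finite upper bound (the prefix
// was empty or consisted only of 0xFF bytes).
std::string prefix_range_end(std::string prefix) {
    // A trailing 0xFF cannot be incremented without wrapping to 0x00,
    // so drop it and carry into the previous byte instead.
    while (!prefix.empty() &&
           static_cast<unsigned char>(prefix.back()) == 0xFF) {
        prefix.pop_back();
    }
    if (prefix.empty()) {
        return prefix;
    }
    // Safe: the last byte is now at most 0xFE.
    prefix.back() = static_cast<char>(
            static_cast<unsigned char>(prefix.back()) + 1);
    return prefix;
}
```

For example, a prefix ending in 0xFF yields an end key one byte shorter, rather than an end key that sorts before the start of the range.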
Test Quality Observations
- C++ tests have significant boilerplate duplication (instance setup copied 5x).
- `testSubmitJobRejectsDuplicate` in Java tests has no actual rejection assertion.
- Regression test `test_snapshot_lifecycle.groovy` uses hardcoded column indices (e.g., `rows[0][6]`), which are fragile.
- Auto-snapshot test (Test 9 in lifecycle) is a no-op: it never verifies that the count increased.
fe/fe-core/src/main/java/org/apache/doris/cloud/snapshot/DorisCloudSnapshotHandler.java
- Fix mock_accessor.h include guard (#ifdef UNIT_TEST) in:
  - snapshot_chain_compactor.cpp
  - snapshot_data_migrator.cpp
- Fix hdfs_accessor.h include guard (#ifdef ENABLE_HDFS_STORAGE_VAULT)
- Add 15 C++ test cases: CloneInstance, RecycleSnapshots, Update/Commit/Drop
- Add 5 Java test methods: RPC failures, TTL, error paths, orphan abort
- Strengthen Groovy assertion from weak warn to hard fail
f5e45ca to 09675f1
a1ff2ff to e5781aa
What problem does this PR solve?
Issue Number: Related to issue-61464
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)