Skip to content

[Feature] Doris Cluster Snapshot Backup#61465

Open
kaori-seasons wants to merge 13 commits intoapache:masterfrom
kaori-seasons:isolated-backup
Open

[Feature] Doris Cluster Snapshot Backup#61465
kaori-seasons wants to merge 13 commits intoapache:masterfrom
kaori-seasons:isolated-backup

Conversation

@kaori-seasons
Copy link

What problem does this PR solve?

Issue Number: Related to issue-61464

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaori-seasons kaori-seasons changed the title Isolated backup [Feature] Doris Cluster Snapshot Backup Mar 18, 2026
@kaori-seasons
Copy link
Author

run buildall

@gavinchou
Copy link
Contributor

/review

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary: [Feature] Doris Cluster Snapshot Backup

Reviewed all 17 changed files (4376 additions, 43 deletions). This PR introduces a new DorisSnapshotManager (C++) and DorisCloudSnapshotHandler (Java) implementing the full snapshot lifecycle (begin, update, commit, abort, drop, list, clone) plus recycling, compaction, and migration.

Critical Issues (2)

  1. Dead/unreachable code in drop_snapshot (doris_snapshot_manager.cpp:552-555): The idempotency check for SNAPSHOT_RECYCLED is unreachable because the guard at line 542-548 already rejects all statuses except NORMAL and ABORTED. A user dropping an already-recycled snapshot gets an error instead of the intended OK response.

  2. TOCTOU race condition in begin_snapshot (doris_snapshot_manager.cpp:248-264): The two-transaction pattern creates a window where the snapshot exists without image_url. The second transaction does a blind overwrite without re-reading, so any concurrent update_snapshot modifications between txn1 and txn2 are silently lost. If txn2 fails, an orphaned PREPARE snapshot remains.

Major Issues (5)

  1. abort_snapshot allows aborting NORMAL/RECYCLED snapshots (doris_snapshot_manager.cpp:474-486): Only ABORTED status has an idempotency guard. Any other status (including NORMAL, RECYCLED) is blindly transitioned to ABORTED, which is semantically incorrect and could interfere with the recycler.

  2. txn->commit() return value ignored in recycle_snapshots (doris_snapshot_manager.cpp:962-964, 978-980, 1020-1023): When marking snapshots as ABORTED/RECYCLED during recycling, commit failures are silently ignored. The in-memory status is already mutated, so subsequent classification logic may incorrectly add still-NORMAL/PREPARE snapshots to the recycling queue, potentially causing premature data deletion.

  3. inverted_check_mvcc_meta_key is a no-op (doris_snapshot_manager.cpp:1330-1383): Builds a valid_snapshots set but never queries it. The reference key scan only counts entries without validating them. Always returns 0 (success).

  4. TOCTOU race in submitJob/submitAutoJob (DorisCloudSnapshotHandler.java:118-121): The check-then-act on AtomicReference<SnapshotState> is not atomic. Two concurrent callers can both pass the null-check and enqueue work, potentially creating two snapshot workflows.

  5. buildObjectKey ignores objInfo.prefix (DorisCloudSnapshotHandler.java:260-264): The method accepts objInfo but never uses objInfo.getPrefix(). If the object store has a non-empty prefix configured, uploads go to the wrong path and images won't be found during restore.

Minor Issues

  1. Duplicate txn_kv_ member (doris_snapshot_manager.h:100): Both base and derived class store a shared_ptr<TxnKv>. Should make base class member protected or add accessor.

  2. current_time_seconds() called twice (doris_snapshot_manager.cpp:819-820): ctime and mtime could differ if second boundary is crossed. Capture once.

  3. Byte increment overflow (doris_snapshot_manager.cpp:1095-1098): ref_end.back() + 1 wraps to 0x00 if last byte is 0xFF, producing an empty/incorrect range scan.

  4. Massive code duplication (DorisCloudSnapshotHandler.java): executeSnapshotWorkflow and executeAutoSnapshotWorkflow are ~90% identical. Should be unified with an isAuto parameter.

  5. Config typo: cloud_auto_snapshot_max_reversed_num should be cloud_auto_snapshot_max_reserved_num (Config.java:3858). Fixing later is a breaking change for users.

Test Quality Observations

  • C++ tests have significant boilerplate duplication (instance setup copied 5x).
  • testSubmitJobRejectsDuplicate in Java tests has no actual rejection assertion.
  • Regression test test_snapshot_lifecycle.groovy uses hardcoded column indices (e.g., rows[0][6]) which are fragile.
  • Auto-snapshot test (Test 9 in lifecycle) is a no-op — never verifies count increased.

- Fix mock_accessor.h include guard (#ifdef UNIT_TEST) in:
  - snapshot_chain_compactor.cpp
  - snapshot_data_migrator.cpp
- Fix hdfs_accessor.h include guard (#ifdef ENABLE_HDFS_STORAGE_VAULT)
- Add 15 C++ test cases: CloneInstance, RecycleSnapshots, Update/Commit/Drop
- Add 5 Java test methods: RPC failures, TTL, error paths, orphan abort
- Strengthen Groovy assertion from weak warn to hard fail
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants