branch-4.0: [Enhancement](recyclebin) Optimize lock granularity in CatalogRecycleBin #61366 #61674

Merged
yiguolei merged 1 commit into apache:branch-4.0 from morrySnow:pick-61366-branch-4.0 on Mar 25, 2026

@morrySnow (Contributor) commented Mar 24, 2026

picked from #61366

…Bin (apache#61366)

All methods in `CatalogRecycleBin.java` use `synchronized` (single
monitor lock), creating extremely coarse lock granularity. When
`erasePartition()` runs slowly with many partitions, other
`synchronized` methods block waiting for the lock. Callers like
`recyclePartition()` hold TABLE WRITE LOCK while waiting, causing
cascading blocking that can bring down the entire Doris metadata
service.

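Two complementary optimizations: read/write lock separation and per-item erase. The first gives each method the weakest lock it needs: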

- **Lock-free** (8 methods): Simple ConcurrentHashMap lookups
(`isRecyclePartition`, `getRecycleTimeById`, etc.)
- **Read lock** (4 methods): Read-only iterations
(`allTabletsInRecycledStatus`, `getInfo`, `write`, etc.)
- **Write lock** (11 methods): Map mutations
(`recycleDatabase/Table/Partition`, `recover*`, `clearAll`)
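The three tiers above can be sketched as follows. This is a minimal illustration, not the Doris code: the class `RecycleBinLockTiers`, its two maps, and the simplified method signatures are assumptions made for the example; only the method names and the lock tier assigned to each come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for CatalogRecycleBin.
class RecycleBinLockTiers {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Concurrent maps so that single-key lookups are safe without any lock.
    private final Map<Long, String> idToPartitionName = new ConcurrentHashMap<>();
    private final Map<Long, Long> idToRecycleTime = new ConcurrentHashMap<>();

    // Tier 1, lock-free: a single lookup on a concurrent map.
    boolean isRecyclePartition(long id) {
        return idToPartitionName.containsKey(id);
    }

    // Tier 2, read lock: iteration that needs a consistent view of both maps.
    // Many readers may hold the read lock at the same time.
    List<String> getInfo() {
        lock.readLock().lock();
        try {
            List<String> rows = new ArrayList<>();
            for (Map.Entry<Long, String> e : idToPartitionName.entrySet()) {
                rows.add(e.getKey() + ":" + e.getValue() + "@" + idToRecycleTime.get(e.getKey()));
            }
            return rows;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Tier 3, write lock: mutation that must update both maps atomically
    // with respect to readers holding the read lock.
    void recyclePartition(long id, String name, long nowMs) {
        lock.writeLock().lock();
        try {
            idToPartitionName.put(id, name);
            idToRecycleTime.put(id, nowMs);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The lock-free tier only works for reads that touch one map entry; anything that must see several entries or several maps in a consistent state still takes the read lock.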

Refactored all 12 erase methods to process items **one at a time** with
lock release between items:
- **Inside write lock (per item)**: cleanup RPCs + map removal + edit
log write
- **Release lock between items**: other operations can proceed

This reduces lock hold time from **O(N × T)** (all items) to **O(T)**
(one item) per acquisition.
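A rough sketch of the per-item pattern, under stated assumptions: the class `MicrobatchEraseSketch`, the `ttlMs` expiry rule, and taking the candidate snapshot under the read lock are all hypothetical choices for the example, and the cleanup RPCs and edit log write are reduced to a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of erasing one item per write-lock acquisition.
class MicrobatchEraseSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<Long, Long> idToRecycleTime = new ConcurrentHashMap<>();

    void recycle(long id, long recycledAtMs) {
        lock.writeLock().lock();
        try {
            idToRecycleTime.put(id, recycledAtMs);
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean contains(long id) {
        return idToRecycleTime.containsKey(id);
    }

    // Returns the number of items actually erased.
    int eraseExpired(long nowMs, long ttlMs) {
        // Step 1: snapshot the candidate ids (assumed here to be under the read lock).
        List<Long> candidates = new ArrayList<>();
        lock.readLock().lock();
        try {
            for (Map.Entry<Long, Long> e : idToRecycleTime.entrySet()) {
                if (nowMs - e.getValue() > ttlMs) {
                    candidates.add(e.getKey());
                }
            }
        } finally {
            lock.readLock().unlock();
        }

        // Step 2: erase one item per write-lock acquisition.
        int erased = 0;
        for (Long id : candidates) {
            lock.writeLock().lock();
            try {
                // Re-check: the item may have been recovered or erased since the snapshot.
                if (idToRecycleTime.remove(id) != null) {
                    // Placeholder for the cleanup RPCs and the edit log write.
                    erased++;
                }
            } finally {
                lock.writeLock().unlock();
            }
            // The lock is free here, so a waiting recycle or recover call can run.
        }
        return erased;
    }
}
```

Total erase work is unchanged; what shrinks is the longest stretch during which a waiting caller is blocked, which is the quantity that matters for a thread holding a table write lock.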

Changed 4 internal maps from `HashMap` to `ConcurrentHashMap` to enable
lock-free reads.

1. **NPE in `getIdListToEraseByRecycleTime`**: Used `getOrDefault` to
handle stale IDs that may be concurrently removed between snapshot and
processing
2. **DdlException in cascade erase**: Added try-catch in
`eraseDatabaseInstantly`/`eraseTableInstantly` for partitions/tables
concurrently erased by daemon
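Both fixes follow from the same race: once the lock is released between items, an ID captured in a snapshot may be gone by the time it is processed. The sketch below illustrates the two defensive patterns under assumptions of its own: `StaleIdSketch`, the `GONE` sentinel used as the `getOrDefault` default, and `AlreadyErasedException` (a stand-in for `DdlException`) are hypothetical and not taken from the Doris code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of tolerating ids removed after a snapshot was taken.
class StaleIdSketch {
    // Stand-in for DdlException in this example.
    static class AlreadyErasedException extends Exception {
        AlreadyErasedException(String msg) {
            super(msg);
        }
    }

    // Sentinel default (an assumption of this sketch) marking an id that is no longer present.
    private static final long GONE = -1L;

    // Pattern 1: map.get(id) returns null for a removed id, and unboxing that
    // null into a long throws NullPointerException. getOrDefault avoids this.
    static List<Long> expiredIds(List<Long> snapshot, Map<Long, Long> idToRecycleTime,
                                 long nowMs, long ttlMs) {
        List<Long> expired = new ArrayList<>();
        for (Long id : snapshot) {
            long recycledAt = idToRecycleTime.getOrDefault(id, GONE);
            if (recycledAt != GONE && nowMs - recycledAt > ttlMs) {
                expired.add(id);
            }
        }
        return expired;
    }

    static void eraseOne(long id, Map<Long, Long> idToRecycleTime) throws AlreadyErasedException {
        if (idToRecycleTime.remove(id) == null) {
            throw new AlreadyErasedException("id " + id + " is not in the recycle bin");
        }
    }

    // Pattern 2: a cascade erase skips children that another thread already erased
    // instead of aborting the whole operation.
    static int eraseCascade(List<Long> childIds, Map<Long, Long> idToRecycleTime) {
        int erased = 0;
        for (Long id : childIds) {
            try {
                eraseOne(id, idToRecycleTime);
                erased++;
            } catch (AlreadyErasedException e) {
                // Already erased concurrently: nothing left to do for this child.
            }
        }
        return erased;
    }
}
```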

- All 24 existing unit tests pass
- Added 3 new concurrency tests:
  - `testConcurrentReadsDoNotBlock`: 10 concurrent reader threads
  - `testConcurrentRecycleAndRead`: writer + 5 readers simultaneously
  - `testMicrobatchEraseReleasesLockBetweenItems`: verifies `recyclePartition` succeeds during erase
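One way to check that readers do not block each other is to require every reader to be inside the read lock at the same moment. The sketch below is not the PR's test code; `ConcurrentReadersSketch`, `readersOverlap`, and the timeouts are hypothetical, and it exercises a bare `ReentrantReadWriteLock` rather than `CatalogRecycleBin`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical check that N readers can hold the read lock simultaneously.
class ConcurrentReadersSketch {
    static boolean readersOverlap(int readers) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch allInside = new CountDownLatch(readers);
        AtomicInteger succeeded = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(readers);

        for (int i = 0; i < readers; i++) {
            pool.submit(() -> {
                lock.readLock().lock();
                try {
                    allInside.countDown();
                    // Only returns true if every reader entered the lock while
                    // this thread was still holding it.
                    if (allInside.await(5, TimeUnit.SECONDS)) {
                        succeeded.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.readLock().unlock();
                }
            });
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return succeeded.get() == readers;
    }
}
```

With a single `synchronized` monitor in place of the read lock, the first thread would wait on the latch while holding the monitor, no other thread could enter, and the check would time out and return false.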

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@morrySnow morrySnow requested a review from yiguolei as a code owner March 24, 2026 09:24
@morrySnow morrySnow changed the title [Enhancement](recyclebin) Optimize lock granularity in CatalogRecycleBin (#61366) branch-4.0: [Enhancement](recyclebin) Optimize lock granularity in CatalogRecycleBin #61366 Mar 24, 2026
@morrySnow (Contributor, Author) commented

run buildall

@hello-stephen (Contributor) commented

FE UT Coverage Report

Increment line coverage 63.29% (400/632) 🎉
Increment coverage report
Complete coverage report

@yiguolei yiguolei merged commit 0145cea into apache:branch-4.0 Mar 25, 2026
28 of 31 checks passed