[improvement](recycler) Avoid single-point read/write during sequentially reading key#62476
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Blocking issue found.
-
Goal of this PR
Reduce per-rowset point reads/writes while scanning recycler keys. The batching direction is reasonable, but the current implementation changes recycler semantics on a correctness-critical path and is not safe as written. -
Critical checkpoint conclusions
- Goal accomplished: Partially. Point operations are reduced, but the deferred abort path introduces a race that can delete rowset data for a transaction/job that commits before
loop_done()runs. - Change size/focus: Focused to recycler batching, but it changes behavior in both
recycle_rowsets()andrecycle_tmp_rowsets(). - Concurrency: Involved and currently unsafe. The recycler scan thread now queues abort work and only executes it at batch end, while concurrent
commit_rowset/commit_txn/finish_tablet_jobRPCs can still succeed during that window whenenable_mark_delete_rowset_before_recycle=false. - Lifecycle/static init: No special lifecycle or static initialization concerns found in this PR.
- Config changes: No new configs added, but an existing supported config combination regresses:
enable_abort_txn_and_job_for_delete_rowset_before_recycle=truewithenable_mark_delete_rowset_before_recycle=false. - Compatibility: No protocol/storage compatibility change observed.
- Parallel paths: The same regression exists in both formal recycle-rowset and tmp-rowset paths.
- Special conditions: The existing
end_version() != 1gate remains; no new explanation issues beyond the race above. - Test coverage: Existing tests mainly exercise the mark-before-delete flow and do not cover the interleaving where a commit/job-finish wins the race before deferred abort execution. I did not find new coverage for this regression.
- Observability: Logging is adequate for tracing the new path.
- Transaction/persistence: No new persistence format issue, but transaction/job state handling is where the correctness regression is introduced.
- Data writes/modifications: Not safe on the affected path because object deletion can proceed after a successful commit that happened before the deferred abort ran.
- FE/BE variable passing: Not applicable.
- Performance: The batching optimization is valid in principle.
- Other issues: No second independent blocker found beyond the deferred-abort race.
- Recycler-specific checkpoints
- Mark-before-delete two-phase flow: Preserved when mark-delete is enabled.
- Abort-before-delete aligned with origin: The mapping is still correct (load -> txn, compaction/schema-change -> job), but timing is no longer safe because abort is deferred.
- Packed files: Not affected by this PR.
- Conflict/retry/idempotency: The new deferred abort flow is not restart-safe enough on the affected config path because a concurrent commit can invalidate the recycler's stale decision before deletion submission.
Because this is a data-correctness issue on a supported code path, this should not be approved yet.
23e5acb to
2b1057e
Compare
2b1057e to
1600c44
Compare
|
/review |
There was a problem hiding this comment.
Requesting changes.
The batching direction is reasonable, but the current implementation does not preserve the recycler's safety contract.
Critical Checkpoints
- Goal: Batch point reads/writes while preserving two-phase recycle safety. Partially achieved. The point operations are reduced, but the new async batching delays
is_recycled/abort and changes error propagation, so the safety goal is not preserved. No new tests cover the new race/failure modes. - Scope/Minimality: The patch is focused on recycler rowset/tmp-rowset paths, but the helper refactor changes semantics, not just batching mechanics.
- Concurrency: Applicable. Mark/abort now run asynchronously in worker threads. I did not find a lock-order problem, but the new ordering opens a race between recycler scans and
commit_rowset/finish_job/commit_txn. - Lifecycle/Static Init: No lifecycle or static-initialization issue found in this PR.
- Configs: No new configuration items.
- Compatibility: No FE/BE protocol or storage-format compatibility issue found.
- Parallel Paths: Applicable. The same deferred logic was added to both
recycle_rowsets()andrecycle_tmp_rowsets(), and both inherit the same problems. - Special Checks: The new deferred checks do not establish why delaying the mark step is safe; in the current code it is not.
- Test Coverage: Insufficient for this change. No test covers page-sized scans, concurrent commit/finish while a rowset is only queued for marking, or batched mark failure such as
TXN_BYTES_TOO_LARGE/ conflict. - Observability: Logging exists, but the new worker-side failures are only logged and the outer recycle pass can still report success, so observability is not sufficient.
- Transaction/Persistence: Applicable. The new batched metadata rewrite path does not respect TxnKv's byte-limit assumptions, and mark/abort failures no longer propagate to the caller.
- Data Writes/Modifications: Applicable. Mark-before-delete is weakened because marking is deferred until the worker runs after
loop_done(). - FE/BE Variable Passing: Not applicable.
- Performance: It likely reduces point reads/writes, but the fixed-size batch can hit TxnKv byte/conflict limits and the async error handling can hide failed work.
- Other Issues: See inline comments for the concrete blocking cases.
Because of the above correctness regressions, I am requesting changes.
1600c44 to
f024980
Compare
|
run buildall |
f024980 to
019e763
Compare
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I am requesting changes based on the existing unresolved inline findings in this PR. I did not add duplicate inline comments because the current open threads already cover the blocking issues I found: deferred mark-before-delete reopens the commit race, deferred abort-before-delete can delete data that becomes committed, fixed-size batch marking can exceed TxnKv byte limits, and worker-side mark/abort failures are logged but no longer fail the recycle pass.
Critical checkpoint conclusions:
- Goal and tests: the goal is to reduce point read/write overhead during cloud recycler rowset cleanup while preserving recycle safety semantics. The current code does not fully preserve those semantics, and the PR does not add an automated regression/unit test proving the race-sensitive mark/abort paths.
- Scope: the change is focused on recycler batching, but it changes ordering and failure propagation in safety-critical paths rather than being a local performance-only optimization.
- Concurrency: this code is inherently concurrent with commit_rowset, commit_txn, finish_job/tablet_job, and recycler worker tasks. The existing threads identify windows where the scanner observes stale rowset state, defers mark/abort until loop_done/worker execution, and then deletes using stale eligibility. That violates the recycler's two-phase and abort-before-delete contracts.
- Lifecycle/static initialization: no new static/global lifecycle issue found.
- Configuration: existing config paths enable_mark_delete_rowset_before_recycle and enable_abort_txn_and_job_for_delete_rowset_before_recycle are affected. The supported disabled/enabled combinations are not preserved safely by the deferred flow.
- Compatibility/storage format: no new wire/storage format incompatibility found.
- Parallel paths: both recycle_rowsets and recycle_tmp_rowsets have the same class of deferred mark/abort failure; existing comments cover both paths.
- Conditional checks: the special base-version and PREPARE checks were reviewed; the main issue remains operation ordering and propagation rather than a missing local condition.
- Test coverage: no new automated test coverage was added for the changed recycler concurrency/failure behavior.
- Test results: no generated/modified test result files are part of this PR.
- Observability: added logs are not sufficient because worker failures are converted to log-and-return while the recycle pass can still report success.
- Transaction/persistence: the PR changes TxnKv transaction grouping and failure behavior; existing comments cover byte-limit risk and success reporting despite failed mark/abort transactions.
- Data writes/modifications: physical deletion can be scheduled after stale eligibility checks when mark/abort are delayed, which risks deleting data that concurrently becomes live.
- FE/BE variable passing: not applicable.
- Performance: batching may reduce per-rowset KV calls, but the optimization is not safe until ordering, byte-limit handling, and error propagation are fixed.
User focus: no additional user-provided review focus was specified.
…ally reading key (#62476) fix: #58459 This PR reduces point-read overhead in Cloud recycler rowset cleanup. Previously, each scanned rowset could immediately trigger metadata reads/writes to mark it as recycled or abort its related transaction/job. The new flow records rowset keys during scanning, then batch-processes recycled marks and deferred abort tasks in worker batches. Prepare rowset deletion is also deferred so the recycler re-reads the latest metadata before deleting data. This keeps the existing recycle safety semantics while reducing per-rowset KV operations during large recycle scans. **Release mode test** Recycling 10,000 rowsets, after enabling `enable_mark_delete_rowset_before_recycle` and `enable_abort_txn_and_job_for_delete_rowset_before_recycle`, the processing time increased by approximately 10%. **3514 ms -> 175 ms(mark) + 3811(abort and recycle)**
…ally reading key (#62476) fix: #58459 This PR reduces point-read overhead in Cloud recycler rowset cleanup. Previously, each scanned rowset could immediately trigger metadata reads/writes to mark it as recycled or abort its related transaction/job. The new flow records rowset keys during scanning, then batch-processes recycled marks and deferred abort tasks in worker batches. Prepare rowset deletion is also deferred so the recycler re-reads the latest metadata before deleting data. This keeps the existing recycle safety semantics while reducing per-rowset KV operations during large recycle scans. **Release mode test** Recycling 10,000 rowsets, after enabling `enable_mark_delete_rowset_before_recycle` and `enable_abort_txn_and_job_for_delete_rowset_before_recycle`, the processing time increased by approximately 10%. **3514 ms -> 175 ms(mark) + 3811(abort and recycle)**
What problem does this PR solve?
fix: #58459
This PR reduces point-read overhead in Cloud recycler rowset cleanup. Previously, each scanned rowset could immediately trigger metadata reads/writes to mark it as recycled or abort its related transaction/job. The new flow records rowset keys during scanning, then batch-processes recycled marks and deferred abort tasks in worker batches. Prepare rowset deletion is also deferred so the recycler re-reads the latest metadata before deleting data.
This keeps the existing recycle safety semantics while reducing per-rowset KV operations during large recycle scans.
Release mode test
Recycling 10,000 rowsets, after enabling
enable_mark_delete_rowset_before_recycleandenable_abort_txn_and_job_for_delete_rowset_before_recycle, the processing time increased by approximately 10%.3514 ms -> 175 ms(mark) + 3811(abort and recycle)
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)