Skip to content

[improvement](recycler) Avoid single-point read/write during sequentially reading key#62476

Merged
gavinchou merged 1 commit into
apache:masterfrom
wyxxxcat:reduce_point_read_at_recycle
May 11, 2026
Merged

[improvement](recycler) Avoid single-point read/write during sequentially reading key#62476
gavinchou merged 1 commit into
apache:masterfrom
wyxxxcat:reduce_point_read_at_recycle

Conversation

@wyxxxcat
Copy link
Copy Markdown
Collaborator

@wyxxxcat wyxxxcat commented Apr 14, 2026

What problem does this PR solve?

fix: #58459

This PR reduces point-read overhead in Cloud recycler rowset cleanup. Previously, each scanned rowset could immediately trigger metadata reads/writes to mark it as recycled or abort its related transaction/job. The new flow records rowset keys during scanning, then batch-processes recycled marks and deferred abort tasks in worker batches. Prepare rowset deletion is also deferred so the recycler re-reads the latest metadata before deleting data.

This keeps the existing recycle safety semantics while reducing per-rowset KV operations during large recycle scans.

Release mode test
Recycling 10,000 rowsets, after enabling enable_mark_delete_rowset_before_recycle and enable_abort_txn_and_job_for_delete_rowset_before_recycle, the processing time increased by approximately 10%.

3514 ms -> 175 ms(mark) + 3811(abort and recycle)

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issue found.

  1. Goal of this PR
    Reduce per-rowset point reads/writes while scanning recycler keys. The batching direction is reasonable, but the current implementation changes recycler semantics on a correctness-critical path and is not safe as written.

  2. Critical checkpoint conclusions

  • Goal accomplished: Partially. Point operations are reduced, but the deferred abort path introduces a race that can delete rowset data for a transaction/job that commits before loop_done() runs.
  • Change size/focus: Focused to recycler batching, but it changes behavior in both recycle_rowsets() and recycle_tmp_rowsets().
  • Concurrency: Involved and currently unsafe. The recycler scan thread now queues abort work and only executes it at batch end, while concurrent commit_rowset / commit_txn / finish_tablet_job RPCs can still succeed during that window when enable_mark_delete_rowset_before_recycle=false.
  • Lifecycle/static init: No special lifecycle or static initialization concerns found in this PR.
  • Config changes: No new configs added, but an existing supported config combination regresses: enable_abort_txn_and_job_for_delete_rowset_before_recycle=true with enable_mark_delete_rowset_before_recycle=false.
  • Compatibility: No protocol/storage compatibility change observed.
  • Parallel paths: The same regression exists in both formal recycle-rowset and tmp-rowset paths.
  • Special conditions: The existing end_version() != 1 gate remains; no new explanation issues beyond the race above.
  • Test coverage: Existing tests mainly exercise the mark-before-delete flow and do not cover the interleaving where a commit/job-finish wins the race before deferred abort execution. I did not find new coverage for this regression.
  • Observability: Logging is adequate for tracing the new path.
  • Transaction/persistence: No new persistence format issue, but transaction/job state handling is where the correctness regression is introduced.
  • Data writes/modifications: Not safe on the affected path because object deletion can proceed after a successful commit that happened before the deferred abort ran.
  • FE/BE variable passing: Not applicable.
  • Performance: The batching optimization is valid in principle.
  • Other issues: No second independent blocker found beyond the deferred-abort race.
  1. Recycler-specific checkpoints
  • Mark-before-delete two-phase flow: Preserved when mark-delete is enabled.
  • Abort-before-delete aligned with origin: The mapping is still correct (load -> txn, compaction/schema-change -> job), but timing is no longer safe because abort is deferred.
  • Packed files: Not affected by this PR.
  • Conflict/retry/idempotency: The new deferred abort flow is not restart-safe enough on the affected config path because a concurrent commit can invalidate the recycler's stale decision before deletion submission.

Because this is a data-correctness issue on a supported code path, this should not be approved yet.

Comment thread cloud/src/recycler/recycler.cpp
Comment thread cloud/src/recycler/recycler.cpp
@wyxxxcat wyxxxcat marked this pull request as draft April 14, 2026 07:23
@wyxxxcat wyxxxcat force-pushed the reduce_point_read_at_recycle branch 2 times, most recently from 23e5acb to 2b1057e Compare April 17, 2026 09:41
@wyxxxcat wyxxxcat marked this pull request as ready for review April 20, 2026 08:05
@wyxxxcat wyxxxcat force-pushed the reduce_point_read_at_recycle branch from 2b1057e to 1600c44 Compare April 22, 2026 08:31
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes.

The batching direction is reasonable, but the current implementation does not preserve the recycler's safety contract.

Critical Checkpoints

  • Goal: Batch point reads/writes while preserving two-phase recycle safety. Partially achieved. The point operations are reduced, but the new async batching delays is_recycled/abort and changes error propagation, so the safety goal is not preserved. No new tests cover the new race/failure modes.
  • Scope/Minimality: The patch is focused on recycler rowset/tmp-rowset paths, but the helper refactor changes semantics, not just batching mechanics.
  • Concurrency: Applicable. Mark/abort now run asynchronously in worker threads. I did not find a lock-order problem, but the new ordering opens a race between recycler scans and commit_rowset / finish_job / commit_txn.
  • Lifecycle/Static Init: No lifecycle or static-initialization issue found in this PR.
  • Configs: No new configuration items.
  • Compatibility: No FE/BE protocol or storage-format compatibility issue found.
  • Parallel Paths: Applicable. The same deferred logic was added to both recycle_rowsets() and recycle_tmp_rowsets(), and both inherit the same problems.
  • Special Checks: The new deferred checks do not establish why delaying the mark step is safe; in the current code it is not.
  • Test Coverage: Insufficient for this change. No test covers page-sized scans, concurrent commit/finish while a rowset is only queued for marking, or batched mark failure such as TXN_BYTES_TOO_LARGE / conflict.
  • Observability: Logging exists, but the new worker-side failures are only logged and the outer recycle pass can still report success, so observability is not sufficient.
  • Transaction/Persistence: Applicable. The new batched metadata rewrite path does not respect TxnKv's byte-limit assumptions, and mark/abort failures no longer propagate to the caller.
  • Data Writes/Modifications: Applicable. Mark-before-delete is weakened because marking is deferred until the worker runs after loop_done().
  • FE/BE Variable Passing: Not applicable.
  • Performance: It likely reduces point reads/writes, but the fixed-size batch can hit TxnKv byte/conflict limits and the async error handling can hide failed work.
  • Other Issues: See inline comments for the concrete blocking cases.

Because of the above correctness regressions, I am requesting changes.

Comment thread cloud/src/recycler/recycler.cpp
Comment thread cloud/src/recycler/recycler.cpp Outdated
Comment thread cloud/src/recycler/recycler.cpp
@wyxxxcat wyxxxcat force-pushed the reduce_point_read_at_recycle branch from 1600c44 to f024980 Compare April 23, 2026 06:31
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@wyxxxcat wyxxxcat force-pushed the reduce_point_read_at_recycle branch from f024980 to 019e763 Compare May 7, 2026 13:03
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

wyxxxcat commented May 9, 2026

run buildall

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 9, 2026
@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 83.22% (238/286) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.05% (1849/2369)
Line Coverage 64.73% (33225/51327)
Region Coverage 65.27% (16447/25198)
Branch Coverage 55.82% (8781/15732)

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am requesting changes based on the existing unresolved inline findings in this PR. I did not add duplicate inline comments because the current open threads already cover the blocking issues I found: deferred mark-before-delete reopens the commit race, deferred abort-before-delete can delete data that becomes committed, fixed-size batch marking can exceed TxnKv byte limits, and worker-side mark/abort failures are logged but no longer fail the recycle pass.

Critical checkpoint conclusions:

  • Goal and tests: the goal is to reduce point read/write overhead during cloud recycler rowset cleanup while preserving recycle safety semantics. The current code does not fully preserve those semantics, and the PR does not add an automated regression/unit test proving the race-sensitive mark/abort paths.
  • Scope: the change is focused on recycler batching, but it changes ordering and failure propagation in safety-critical paths rather than being a local performance-only optimization.
  • Concurrency: this code is inherently concurrent with commit_rowset, commit_txn, finish_job/tablet_job, and recycler worker tasks. The existing threads identify windows where the scanner observes stale rowset state, defers mark/abort until loop_done/worker execution, and then deletes using stale eligibility. That violates the recycler's two-phase and abort-before-delete contracts.
  • Lifecycle/static initialization: no new static/global lifecycle issue found.
  • Configuration: existing config paths enable_mark_delete_rowset_before_recycle and enable_abort_txn_and_job_for_delete_rowset_before_recycle are affected. The supported disabled/enabled combinations are not preserved safely by the deferred flow.
  • Compatibility/storage format: no new wire/storage format incompatibility found.
  • Parallel paths: both recycle_rowsets and recycle_tmp_rowsets have the same class of deferred mark/abort failure; existing comments cover both paths.
  • Conditional checks: the special base-version and PREPARE checks were reviewed; the main issue remains operation ordering and propagation rather than a missing local condition.
  • Test coverage: no new automated test coverage was added for the changed recycler concurrency/failure behavior.
  • Test results: no generated/modified test result files are part of this PR.
  • Observability: added logs are not sufficient because worker failures are converted to log-and-return while the recycle pass can still report success.
  • Transaction/persistence: the PR changes TxnKv transaction grouping and failure behavior; existing comments cover byte-limit risk and success reporting despite failed mark/abort transactions.
  • Data writes/modifications: physical deletion can be scheduled after stale eligibility checks when mark/abort are delayed, which risks deleting data that concurrently becomes live.
  • FE/BE variable passing: not applicable.
  • Performance: batching may reduce per-rowset KV calls, but the optimization is not safe until ordering, byte-limit handling, and error propagation are fixed.

User focus: no additional user-provided review focus was specified.

@gavinchou gavinchou merged commit 43a7b49 into apache:master May 11, 2026
37 checks passed
github-actions Bot pushed a commit that referenced this pull request May 11, 2026
…ally reading key (#62476)

fix: #58459


This PR reduces point-read overhead in Cloud recycler rowset cleanup.
Previously, each scanned rowset could immediately trigger metadata
reads/writes to mark it as recycled or abort its related
transaction/job. The new flow records rowset keys during scanning, then
batch-processes recycled marks and deferred abort tasks in worker
batches. Prepare rowset deletion is also deferred so the recycler
re-reads the latest metadata before deleting data.

This keeps the existing recycle safety semantics while reducing
per-rowset KV operations during large recycle scans.

**Release mode test**
Recycling 10,000 rowsets, after enabling
`enable_mark_delete_rowset_before_recycle` and
`enable_abort_txn_and_job_for_delete_rowset_before_recycle`, the
processing time increased by approximately 10%.


**3514 ms -> 175 ms(mark) + 3811(abort and recycle)**
github-actions Bot pushed a commit that referenced this pull request May 11, 2026
…ally reading key (#62476)

fix: #58459


This PR reduces point-read overhead in Cloud recycler rowset cleanup.
Previously, each scanned rowset could immediately trigger metadata
reads/writes to mark it as recycled or abort its related
transaction/job. The new flow records rowset keys during scanning, then
batch-processes recycled marks and deferred abort tasks in worker
batches. Prepare rowset deletion is also deferred so the recycler
re-reads the latest metadata before deleting data.

This keeps the existing recycle safety semantics while reducing
per-rowset KV operations during large recycle scans.

**Release mode test**
Recycling 10,000 rowsets, after enabling
`enable_mark_delete_rowset_before_recycle` and
`enable_abort_txn_and_job_for_delete_rowset_before_recycle`, the
processing time increased by approximately 10%.


**3514 ms -> 175 ms(mark) + 3811(abort and recycle)**
yiguolei pushed a commit that referenced this pull request May 11, 2026
…ing sequentially reading key #62476 (#63123)

Cherry-picked from #62476

Co-authored-by: Yixuan Wang <wangyixuan@selectdb.com>
morningman pushed a commit that referenced this pull request May 16, 2026
…ing sequentially reading key #62476 (#63122)

Cherry-picked from #62476

Co-authored-by: Yixuan Wang <wangyixuan@selectdb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. cloud dev/3.1.x dev/4.0.6-merged dev/4.1.1-merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants