Skip to content

[fix](regression) handle cumulative delete-version compaction wait#64945

Open
shuke987 wants to merge 1 commit into
apache:masterfrom
shuke987:fix-compaction-delete-version-wait
Open

[fix](regression) handle cumulative delete-version compaction wait#64945
shuke987 wants to merge 1 commit into
apache:masterfrom
shuke987:fix-compaction-delete-version-wait

Conversation

@shuke987

@shuke987 shuke987 commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

fix handle cumulative delete-version compaction wait

Summary

Fix trigger_and_wait_compaction so Cloud cumulative compaction that meets a delete version does not wait until the 300s timeout after valid progress has already happened.

When cumulative compaction meets a delete version, BE can return [E-2010] cumulative compaction meet delete version, advance the cumulative point, and let base compaction handle the rowsets. In that path the cumulative success/failure timestamps may not change, so the old helper kept polling even after base compaction had completed and run_status=false.

This patch treats E-2010 plus cumulative point advancement plus a changed base success time as an equivalent completed cumulative delete-version path while still waiting when run_status=true. If E-2010 advances the cumulative point but base success time has not changed yet, the helper keeps waiting even if the cumulative failure timestamp changed.

Root Cause

The case compaction/test_compacation_with_delete.groovy creates alternating data and delete rowsets, then calls trigger_and_wait_compaction(tableName, "cumulative").

In Cloud mode this can legally follow:

  1. cumulative compaction meets delete version and returns E-2010
  2. cumulative point advances
  3. base compaction handles the delete-version rowsets

The helper only watched cumulative success/failure timestamp changes. In the failing log, base compaction completed in 448 ms, but the helper waited for 5 minutes because the cumulative timestamps did not change.

Validation

  • git diff --check
  • git diff --check origin/master..HEAD
  • Local Groovy condition simulation:
    • E-2010 + cumulative point advanced + base success time changed + run_status=false exits wait
    • E-2010 + cumulative point advanced keeps waiting if base success time has not changed
    • E-2010 + cumulative point advanced + cumulative failure time changed still keeps waiting if base success time has not changed
    • run_status=true keeps waiting even if the delete-version/base-success condition is met
    • normal cumulative success timestamp change still exits wait when there is no delete-version handoff

Cloud P0 rerun is still needed for final validation.

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@shuke987

Copy link
Copy Markdown
Collaborator Author

run buildall

@shuke987 shuke987 force-pushed the fix-compaction-delete-version-wait branch from 7812dbb to ee175b9 Compare June 29, 2026 08:36
@shuke987

Copy link
Copy Markdown
Collaborator Author

run buildall

@shuke987

Copy link
Copy Markdown
Collaborator Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed PR #64945. I did not find a blocking issue or a valuable inline comment to add.

Critical checkpoint conclusions:

  • Goal and proof: The change addresses the regression-helper wait path for cumulative compaction that reports E-2010 after meeting a delete version. The new exit condition requires the E-2010 status, cumulative point advancement, and a changed base success timestamp, while still preserving the existing run_status wait.
  • Scope: The PR is small and focused; the only changed file is regression-test/plugins/plugin_compaction.groovy.
  • Concurrency/lifecycle: No production concurrency or lifecycle logic is changed. The helper continues to poll BE state and exits only after the tablet is no longer reporting a running compaction plus the observed delete-version handoff evidence.
  • Config/session/protocol compatibility: No new config, session variable, FE/BE protocol field, or expected-output format is introduced. The status fields read by the helper are already exposed by local and cloud tablet compaction status JSON.
  • Parallel paths: I checked the local and cloud compaction HTTP/status paths. Both expose cumulative point, cumulative/base status strings, and compaction timestamps used by this helper.
  • Tests/results: No .out file is changed. I did not run a local regression suite in this runner, but GitHub reports successful P0 Regression, Cloud P0, NonConcurrent Regression, compile, FE UT, BE UT, and Cloud UT checks for the PR head.
  • Performance/observability: The polling interval and logging behavior are unchanged; the added status checks are local parsing/comparison work inside the existing wait loop.

User focus: No additional user-provided review focus was supplied.

Subagent conclusions:

  • optimizer-rewrite: no candidate issue; the helper-only patch has no optimizer/rewrite, join, aggregate, or Nereids semantic surface.
  • tests-session-config: no candidate issue; no session/config propagation, expected-output, compatibility, or CI/style concern remained after static review.
  • Convergence: Round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set after the main merged dismissals and empty proposed inline comment set were recorded.

def success_time_unchanged = (oldStatus["last ${compaction_type} success time"] == tabletStatus["last ${compaction_type} success time"])
def failure_time_unchanged = (oldStatus["last ${compaction_type} failure time"] == tabletStatus["last ${compaction_type} failure time"])
running = running || (success_time_unchanged && failure_time_unchanged)
running = running || ((success_time_unchanged && failure_time_unchanged) && !completedByBaseCompactionAfterDeleteVersion)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When e-2010 advances cumu point, but base compaction hasn't run, the condition would be failure_time_unchanged=false, so running=false, but we expect running=true in this case.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Updated the wait condition so once E-2010 advances the cumulative point, the helper treats it as a delete-version handoff and keeps waiting until last base success time changes. A cumulative failure timestamp change alone no longer lets this path finish.

@shuke987 shuke987 force-pushed the fix-compaction-delete-version-wait branch from ee175b9 to 5a6d9d1 Compare July 1, 2026 07:26
@shuke987

shuke987 commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

run buildall

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants