[fix](cloud) Delete local rowsets before add_rowsets in cloud schema change #62256
liaoxin01 merged 4 commits into apache:master
…change

During cloud schema change, compaction on the new tablet may produce rowsets with wider version ranges (e.g. [818-822]) that overlap with the individual SC output rowsets (e.g. [818],[819],...,[822]). The MS correctly recycles these compaction rowsets, but the BE side did not mirror this behavior, leaving stale rowsets visible and causing duplicate keys in MOW tables.

Fix: Before calling add_rowsets for SC output, delete all local rowsets in [2, alter_version] from the new tablet, mirroring the MS-side recycle behavior. A new CloudTablet method delete_rowsets_for_schema_change is added that also removes edges from the version graph, preventing the greedy capture algorithm from preferring the wider stale compaction path over the individual SC output rowsets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
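The selection step described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: `VersionRange`, `SplitResult`, and `split_for_schema_change` are hypothetical names, and the sketch assumes a rowset is dropped only when its whole range lies inside `[2, alter_version]`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a rowset's version range [start, end].
struct VersionRange {
    int64_t start;
    int64_t end;
};

// Result of partitioning the new tablet's local rowsets.
struct SplitResult {
    std::vector<VersionRange> to_delete;  // fully inside [2, alter_version]
    std::vector<VersionRange> to_keep;    // everything else (e.g. post-alter rowsets)
};

// Assumption: a rowset is deleted only if its entire range is within
// [2, alter_version]; anything else is left untouched.
SplitResult split_for_schema_change(const std::vector<VersionRange>& local,
                                    int64_t alter_version) {
    SplitResult out;
    for (const VersionRange& v : local) {
        if (v.start >= 2 && v.end <= alter_version) {
            out.to_delete.push_back(v);
        } else {
            out.to_keep.push_back(v);
        }
    }
    return out;
}
```

With `alter_version = 822`, a local compaction rowset `[818-822]` lands in `to_delete`, while a post-alter rowset `[823]` stays in `to_keep`.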
Add tests verifying that delete_rowsets_for_schema_change correctly:
1. Removes compaction rowsets from rs_version_map and the version graph, so that capture_consistent_versions returns SC output rowsets instead of the stale compaction rowset (DORIS-25014 scenario)
2. Is a no-op when given empty input
3. Handles multiple compaction rowsets and preserves post-alter rowsets

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
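The first test scenario can be illustrated with a toy model of a greedy version graph. This is a simplified sketch under assumptions, not the Doris implementation: `ToyVersionGraph` is a hypothetical class, and "greedy" is modeled as always taking the edge that reaches furthest without passing the requested end version.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Version = std::pair<int64_t, int64_t>;  // [start, end]

// Hypothetical toy graph: each rowset [s, e] is an edge from s to e + 1.
class ToyVersionGraph {
public:
    void add(const Version& v) { edges_[v.first].insert(v.second); }

    void remove(const Version& v) {
        auto it = edges_.find(v.first);
        if (it == edges_.end()) return;
        it->second.erase(v.second);
        if (it->second.empty()) edges_.erase(it);
    }

    // Greedy capture: from the current version, take the rowset with the
    // largest end that does not exceed `end`. Returns an empty path if the
    // range cannot be covered.
    std::vector<Version> capture(int64_t start, int64_t end) const {
        std::vector<Version> path;
        int64_t cur = start;
        while (cur <= end) {
            auto it = edges_.find(cur);
            if (it == edges_.end()) return {};
            auto ub = it->second.upper_bound(end);
            if (ub == it->second.begin()) return {};
            --ub;  // largest end <= requested end
            path.push_back(Version(cur, *ub));
            cur = *ub + 1;
        }
        return path;
    }

private:
    std::map<int64_t, std::set<int64_t>> edges_;
};
```

In this model, as long as the stale `[818-822]` edge remains in the graph, `capture(818, 822)` returns it in preference to the five single-version rowsets; removing the edge makes the capture fall back to the individual SC outputs.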
…hange

Do not use the stale tracking mechanism (_stale_rs_version_map / _stale_version_path_map) for SC-deleted rowsets. SC output will create new rowsets with identical version ranges; a later compaction could put those into stale as well, causing two stale paths to reference the same version key. When one path is cleaned first, the other hits DCHECK(false) in delete_expired_stale_rowsets().

Instead, use same_version=true in modify_rs_metas to skip _stale_rs_metas, and schedule the rowsets for direct cache cleanup via add_unused_rowsets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
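The key collision described in this commit can be reproduced in a toy model. The sketch below is an assumption-laden illustration rather than Doris code: `ToyStaleTracker` is hypothetical, it keys stale rowsets by version only, and it counts missing keys where the real code would hit `DCHECK(false)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Version = std::pair<int64_t, int64_t>;
using StaleRowset = std::pair<Version, std::string>;  // version, rowset id

// Hypothetical tracker: one map keyed by version, plus a list of paths
// that each remember which version keys they own.
struct ToyStaleTracker {
    std::map<Version, std::string> stale_rowsets;
    std::vector<std::vector<Version>> stale_paths;

    void add_stale_path(const std::vector<StaleRowset>& rowsets) {
        std::vector<Version> path;
        for (std::size_t i = 0; i < rowsets.size(); ++i) {
            // A second rowset with the same version overwrites the first.
            stale_rowsets[rowsets[i].first] = rowsets[i].second;
            path.push_back(rowsets[i].first);
        }
        stale_paths.push_back(path);
    }

    // Sweeps every path and returns how many referenced keys were already
    // gone, i.e. how often the real code would have asserted.
    int sweep() {
        int missing = 0;
        for (std::size_t p = 0; p < stale_paths.size(); ++p) {
            for (std::size_t i = 0; i < stale_paths[p].size(); ++i) {
                if (stale_rowsets.erase(stale_paths[p][i]) == 0) ++missing;
            }
        }
        stale_paths.clear();
        return missing;
    }
};
```

If both the SC-deleted rowset and a later compaction input register a path for `[2-6]`, the second path finds its key already erased. Keeping SC-deleted rowsets out of the tracker leaves a single path per key.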
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…change (#62256)

### What problem does this PR solve?

Problem Summary:

During cloud schema change, the MS (Meta Service) side correctly recycles rowsets in `[2, alter_version]` on the new tablet when committing the SC job. However, the BE side did not mirror this behavior: it directly called `add_rowsets` for the SC output without first removing existing local rowsets. This could leave stale rowsets (e.g., compaction outputs on the new tablet) visible in `_rs_version_map`, and since their delete bitmap does not cover the SC output rows, duplicate keys may appear in MOW tables.

PR #61089 increased the likelihood of triggering this issue by enabling compaction on new tablets during SC, which makes it more common for the new tablet to have compaction rowsets with wider version ranges (e.g., `[818-822]`) that overlap with individual SC output rowsets (e.g., `[818],[819],...,[822]`). The `add_rowsets` overlap check (`to_add_v.contains(v)`) is one-directional: `[818].contains([818-822])` evaluates to false, so the stale compaction rowset was not removed.

Fix: Before calling `add_rowsets` for SC output, delete all local rowsets in `[2, alter_version]` from the new tablet, mirroring the MS-side recycle behavior. A new `CloudTablet::delete_rowsets_for_schema_change` method is added that also removes edges from the version graph, preventing the greedy capture algorithm from preferring the wider stale compaction path over the individual SC output rowsets.
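The one-directional nature of the overlap check is easy to see in isolation. The following is a minimal sketch: `Version` here is a simplified stand-in, and `evicted_by_add` is a hypothetical helper that applies `to_add_v.contains(v)` to every existing version.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a version range [first, second].
struct Version {
    int64_t first;
    int64_t second;

    // True only when `other` lies entirely inside this range.
    bool contains(const Version& other) const {
        return first <= other.first && second >= other.second;
    }
};

// Hypothetical helper: returns the existing versions that would be evicted
// when `to_add` is applied with the one-directional check.
std::vector<Version> evicted_by_add(const std::vector<Version>& existing,
                                    const std::vector<Version>& to_add) {
    std::vector<Version> evicted;
    for (const Version& v : existing) {
        for (const Version& to_add_v : to_add) {
            if (to_add_v.contains(v)) {
                evicted.push_back(v);
                break;
            }
        }
    }
    return evicted;
}
```

Adding the wide rowset `[818-822]` on top of five single-version rowsets evicts all five, but adding the five singles on top of `[818-822]` evicts nothing, which is the case the PR addresses.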
    // Step 3: delete_expired_stale_rowsets — this is where CI crashed
    // With old code: stale path from SC and compaction both reference [2-6] key,
    // causing DCHECK(false). With fix: only compaction stale path exists, no conflict.
    config::tablet_rowset_stale_sweep_time_sec = 0; // expire immediately
Shall we add a guard to restore this config once the test finishes?
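One way to address this comment is a small RAII guard that saves the old value and restores it on scope exit. This is only a sketch: `ScopedConfigOverride` is a hypothetical helper, and `toy_config::tablet_rowset_stale_sweep_time_sec` with its initial value of 300 is a stand-in chosen for illustration, not the real config definition.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical RAII guard: overrides a mutable setting for the lifetime of
// the guard and restores the previous value in the destructor.
template <typename T>
class ScopedConfigOverride {
public:
    ScopedConfigOverride(T& target, T value) : target_(target), saved_(target) {
        target_ = std::move(value);
    }
    ~ScopedConfigOverride() { target_ = std::move(saved_); }

    ScopedConfigOverride(const ScopedConfigOverride&) = delete;
    ScopedConfigOverride& operator=(const ScopedConfigOverride&) = delete;

private:
    T& target_;
    T saved_;
};

// Stand-in for the real config variable; the value 300 is arbitrary.
namespace toy_config {
int64_t tablet_rowset_stale_sweep_time_sec = 300;
}
```

In a test body, declaring `ScopedConfigOverride<int64_t> guard(toy_config::tablet_rowset_stale_sweep_time_sec, 0);` makes stale rowsets expire immediately, and the previous value comes back even if an assertion returns early from the test.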
        _tablet_meta->modify_rs_metas({}, rs_metas, false);
    }

    void CloudTablet::delete_rowsets_for_schema_change(const std::vector<RowsetSharedPtr>& to_delete,
Could we reuse the same abstraction as the local schema-change path here? Local schema change already calls delete_rowsets(..., false) to directly remove versions, so I wonder if cloud could extend delete_rowsets with the same semantics rather than adding a dedicated delete_rowsets_for_schema_change method.
### What problem does this PR solve?

Problem Summary:

During cloud schema change, the MS (Meta Service) side correctly recycles rowsets in `[2, alter_version]` on the new tablet when committing the SC job. However, the BE side did not mirror this behavior: it directly called `add_rowsets` for the SC output without first removing existing local rowsets. This could leave stale rowsets (e.g., compaction outputs on the new tablet) visible in `_rs_version_map`, and since their delete bitmap does not cover the SC output rows, duplicate keys may appear in MOW tables.

PR #61089 increased the likelihood of triggering this issue by enabling compaction on new tablets during SC, which makes it more common for the new tablet to have compaction rowsets with wider version ranges (e.g., `[818-822]`) that overlap with individual SC output rowsets (e.g., `[818],[819],...,[822]`). The `add_rowsets` overlap check (`to_add_v.contains(v)`) is one-directional: `[818].contains([818-822])` evaluates to false, so the stale compaction rowset was not removed.

Fix: Before calling `add_rowsets` for SC output, delete all local rowsets in `[2, alter_version]` from the new tablet, mirroring the MS-side recycle behavior. A new `CloudTablet::delete_rowsets_for_schema_change` method is added that also removes edges from the version graph, preventing the greedy capture algorithm from preferring the wider stale compaction path over the individual SC output rowsets.

### Release note

None