[fix](cloud) Add SC_COMPACTION_CONFLICT error code to retry cross-V1 compaction failures #62272

Merged
luwei16 merged 4 commits into apache:master from Yukang-Lian:fix/cloud-sc-compaction-conflict-retry on Apr 20, 2026
@Yukang-Lian
Collaborator

Summary

  • Add dedicated SC_COMPACTION_CONFLICT(101) TStatusCode for the cross-V1 compaction detection in cloud schema change
  • Include this error code in FE's schema change retry whitelist (AlterJobV2.getRetryTimes())
  • Previously, this transient condition returned INTERNAL_ERROR which FE does not retry, causing the schema change job to be permanently cancelled
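A minimal sketch of the whitelist idea described in the bullets above. It is not the actual `AlterJobV2.getRetryTimes()` code: the class name, the nested enum, the `retryEnabled` and `maxRetries` parameters, and the choice to return 0 for non-retryable codes are all assumptions made for illustration. The real whitelist also contains codes that are not shown here.

```java
import java.util.EnumSet;
import java.util.Set;

final class RetryWhitelistSketch {
    // Hypothetical stand-in for the Thrift-generated TStatusCode enum;
    // only the two codes discussed in this PR are modelled.
    enum StatusCode { INTERNAL_ERROR, SC_COMPACTION_CONFLICT }

    // Assumption: the real whitelist holds additional codes not listed here.
    private static final Set<StatusCode> RETRYABLE =
            EnumSet.of(StatusCode.SC_COMPACTION_CONFLICT);

    // Returns how many retries a failed task gets for the given status code.
    // retryEnabled and maxRetries stand in for FE config values (names assumed).
    static int getRetryTimes(StatusCode code, boolean retryEnabled, int maxRetries) {
        if (!retryEnabled || code == null) {
            return 0; // config switched off or unknown status: no retry
        }
        return RETRYABLE.contains(code) ? maxRetries : 0;
    }

    public static void main(String[] args) {
        // Transient conflict is retried; INTERNAL_ERROR stays a permanent failure.
        System.out.println(getRetryTimes(StatusCode.SC_COMPACTION_CONFLICT, true, 3)); // 3
        System.out.println(getRetryTimes(StatusCode.INTERNAL_ERROR, true, 3));         // 0
    }
}
```

Giving the transient condition its own code keeps `INTERNAL_ERROR` non-retryable, so genuinely permanent failures still cancel the job promptly.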

Background

The cloud SC compaction optimization lets new tablets run compaction while the schema change job waits in the queue. A safety check detects when a compaction rowset crosses the alter_version (V1) boundary. This is a transient condition: on retry, V1 will be higher and the crossing rowset falls within [2, V1'].

The bug was that FE treated INTERNAL_ERROR as a permanent failure (no retry), so this transient condition caused the entire schema change to be cancelled.
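To make the boundary condition concrete, here is a small sketch of the check. The BE implementation is in C++ and is not reproduced here; the definition of "crossing" used below (the rowset starts at or below V1 and ends above it), the class name, and the helper names are assumptions for illustration only.

```java
final class CrossV1CheckSketch {
    // A rowset covering the inclusive version range [start, end].
    static final class VersionRange {
        final long start;
        final long end;

        VersionRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Assumed definition: the rowset straddles the alter_version (V1) boundary.
    static boolean crossesAlterVersion(VersionRange r, long alterVersion) {
        return r.start <= alterVersion && r.end > alterVersion;
    }

    // True when the rowset lies entirely inside [2, alterVersion].
    static boolean withinAlterRange(VersionRange r, long alterVersion) {
        return r.start >= 2 && r.end <= alterVersion;
    }

    public static void main(String[] args) {
        VersionRange compacted = new VersionRange(8, 12);
        System.out.println(crossesAlterVersion(compacted, 10)); // true: V1 = 10 falls inside [8, 12]
        System.out.println(crossesAlterVersion(compacted, 15)); // false: V1' = 15 is past the rowset
        System.out.println(withinAlterRange(compacted, 15));    // true: [8, 12] lies within [2, 15]
    }
}
```

With V1 = 10 the rowset [8, 12] crosses the boundary; once a retry picks up V1' = 15 the same rowset sits inside [2, V1'], which is why the condition is transient.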

Changes

| File | Change |
|------|--------|
| `gensrc/thrift/Status.thrift` | Add `SC_COMPACTION_CONFLICT = 101` |
| `be/src/common/status.h` | Register `TStatusError(SC_COMPACTION_CONFLICT, false)` |
| `be/src/cloud/cloud_schema_change_job.cpp` | Cross-V1 check: `INTERNAL_ERROR` → `SC_COMPACTION_CONFLICT` |
| `fe/.../alter/AlterJobV2.java` | Add `SC_COMPACTION_CONFLICT` to retry whitelist |
| `fe/.../alter/AlterJobV2RetryTest.java` | Unit tests for `getRetryTimes()` (7 cases) |
| `regression-test/.../test_sc_compaction_cross_v1_retry.groovy` | E2E test: cross-V1 failure → retry → success |

Test plan

  • FE unit test: AlterJobV2RetryTest - verifies retry whitelist for all error codes, null handling, and config toggle
  • Regression test: test_sc_compaction_cross_v1_retry - uses debug points to trigger cross-V1, verifies SC stays RUNNING under override, then succeeds after override removal. Asserts data integrity and schema correctness.

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Yukang-Lian force-pushed the fix/cloud-sc-compaction-conflict-retry branch from 8183948 to dad5109 on April 9, 2026 07:58
@Yukang-Lian force-pushed the fix/cloud-sc-compaction-conflict-retry branch from dad5109 to bc9b76e on April 9, 2026 08:58
Comment thread be/src/cloud/cloud_schema_change_job.cpp (outdated)
@Yukang-Lian force-pushed the fix/cloud-sc-compaction-conflict-retry branch from 43ac15e to 779a597 on April 10, 2026 07:08
Collaborator

@Hastyshell Hastyshell left a comment

LGTM

@github-actions
Contributor

PR approved by anyone and no changes requested.

@Yukang-Lian
Collaborator Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 11.11% (1/9) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.70% (27409/37189)
Line Coverage 57.35% (295985/516141)
Region Coverage 54.55% (246509/451883)
Branch Coverage 56.21% (106815/190044)

…compaction failures

The cloud SC compaction optimization introduced a cross-V1 safety check
that returns INTERNAL_ERROR when a compaction rowset crosses the
alter_version boundary. However, FE does not retry INTERNAL_ERROR,
causing the schema change job to be permanently cancelled.

Add a dedicated SC_COMPACTION_CONFLICT(101) error code for this transient
condition, and include it in FE's schema change retry whitelist. When the
cross-V1 check triggers, abort the SC job in meta-service first so the
next retry registers a fresh job with a higher alter_version.
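
The abort-then-retry flow in this commit message can be illustrated with a toy loop. Everything here is hypothetical: the class, the `runAttempt` and `runWithRetry` helpers, and the per-attempt alter_version array stand in for what FE, BE, and meta-service do across real RPCs. The crossing test is the same assumed definition (rowset starts at or below V1 and ends above it).

```java
final class ScRetryLoopSketch {
    enum Result { OK, SC_COMPACTION_CONFLICT }

    // Models one schema change attempt: it fails with the dedicated code
    // when the compaction rowset [rowsetStart, rowsetEnd] straddles alterVersion.
    static Result runAttempt(long alterVersion, long rowsetStart, long rowsetEnd) {
        boolean crosses = rowsetStart <= alterVersion && rowsetEnd > alterVersion;
        return crosses ? Result.SC_COMPACTION_CONFLICT : Result.OK;
    }

    // Returns the number of attempts used, or -1 if the retry budget ran out.
    // alterVersionPerAttempt[i] is the alter_version registered for attempt i.
    static int runWithRetry(long[] alterVersionPerAttempt, long rowsetStart,
                            long rowsetEnd, int maxRetries) {
        for (int attempt = 0;
             attempt <= maxRetries && attempt < alterVersionPerAttempt.length;
             attempt++) {
            Result r = runAttempt(alterVersionPerAttempt[attempt], rowsetStart, rowsetEnd);
            if (r == Result.OK) {
                return attempt + 1;
            }
            // On conflict the old job would be aborted in meta-service here,
            // so the next attempt registers a fresh job with a newer alter_version.
        }
        return -1;
    }

    public static void main(String[] args) {
        // First attempt sees V1 = 10 and conflicts; the retry sees V1' = 15 and succeeds.
        System.out.println(runWithRetry(new long[] {10, 15}, 8, 12, 3)); // 2
        // If alter_version never advanced, every retry would hit the same conflict.
        System.out.println(runWithRetry(new long[] {10, 10}, 8, 12, 3)); // -1
    }
}
```

The second call shows why aborting the stale job matters in this model: without a fresh registration the alter_version would not move, and retrying alone would not clear the conflict.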
@Yukang-Lian force-pushed the fix/cloud-sc-compaction-conflict-retry branch from 779a597 to 91835d1 on April 14, 2026 08:11
@Yukang-Lian Yukang-Lian requested a review from w41ter as a code owner April 14, 2026 08:11
…ion init

Log per_row, sample_bytes, sample_rows immediately after all merge inputs
finish loading their first block, before the actual merge starts. This helps
diagnose memory issues by showing the actual per-row memory size at init time.
@Yukang-Lian
Collaborator Author

run buildall

@Yukang-Lian
Collaborator Author

run feut

@Yukang-Lian
Collaborator Author

run nonConcurrent

@Yukang-Lian
Collaborator Author

run p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/32) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 60.00% (6/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.09% (20171/37996)
Line Coverage 36.67% (189984/518102)
Region Coverage 32.94% (147595/448122)
Branch Coverage 34.05% (64568/189628)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 60.00% (6/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.95% (26773/37210)
Line Coverage 55.00% (284104/516518)
Region Coverage 52.06% (235444/452247)
Branch Coverage 53.46% (101681/190206)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 60.00% (6/10) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.98% (26783/37210)
Line Coverage 55.00% (284110/516518)
Region Coverage 52.05% (235392/452247)
Branch Coverage 53.45% (101671/190206)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/32) 🎉
Increment coverage report
Complete coverage report

@Yukang-Lian
Collaborator Author

run p0

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.02% (1842/2361)
Line Coverage 64.65% (32942/50958)
Region Coverage 65.24% (16331/25034)
Branch Coverage 55.80% (8714/15616)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/2) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 100.00% (23/23) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.23% (20333/38197)
Line Coverage 36.79% (191443/520375)
Region Coverage 33.11% (148898/449701)
Branch Coverage 34.18% (65024/190246)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (23/23) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.77% (27598/37410)
Line Coverage 57.53% (298444/518795)
Region Coverage 54.73% (248394/453877)
Branch Coverage 56.33% (107491/190832)

@github-actions github-actions Bot added the approved label Apr 20, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@luwei16 luwei16 merged commit c847318 into apache:master Apr 20, 2026
35 of 36 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 20, 2026
…compaction failures (#62272)

## Summary

- Add dedicated `SC_COMPACTION_CONFLICT(101)` TStatusCode for the
cross-V1 compaction detection in cloud schema change
- Include this error code in FE's schema change retry whitelist
(`AlterJobV2.getRetryTimes()`)
- Previously, this transient condition returned `INTERNAL_ERROR` which
FE does not retry, causing the schema change job to be permanently
cancelled

## Background

The cloud SC compaction optimization allows new tablets to do compaction
during schema change queue wait. A safety check detects when a
compaction rowset crosses the alter_version (V1) boundary. This is a
transient condition - on retry, V1 will be higher and the crossing
rowset falls within `[2, V1']`.

The bug was that FE treated `INTERNAL_ERROR` as a permanent failure (no
retry), so this transient condition caused the entire schema change to
be cancelled.

## Changes

| File | Change |
|------|--------|
| `gensrc/thrift/Status.thrift` | Add `SC_COMPACTION_CONFLICT = 101` |
| `be/src/common/status.h` | Register
`TStatusError(SC_COMPACTION_CONFLICT, false)` |
| `be/src/cloud/cloud_schema_change_job.cpp` | Cross-V1 check:
`INTERNAL_ERROR` → `SC_COMPACTION_CONFLICT` |
| `fe/.../alter/AlterJobV2.java` | Add `SC_COMPACTION_CONFLICT` to retry
whitelist |
| `fe/.../alter/AlterJobV2RetryTest.java` | Unit tests for
`getRetryTimes()` (7 cases) |
| `regression-test/.../test_sc_compaction_cross_v1_retry.groovy` | E2E
test: cross-V1 failure → retry → success |

## Test plan

- [x] FE unit test: `AlterJobV2RetryTest` - verifies retry whitelist for
all error codes, null handling, and config toggle
- [x] Regression test: `test_sc_compaction_cross_v1_retry` - uses debug
points to trigger cross-V1, verifies SC stays RUNNING under override,
then succeeds after override removal. Asserts data integrity and schema
correctness.
yiguolei pushed a commit that referenced this pull request Apr 22, 2026
…ry cross-V1 compaction failures #62272 (#62643)

Cherry-picked from #62272

Co-authored-by: Jimmy <lianyukang@selectdb.com>
Yukang-Lian added a commit to Yukang-Lian/doris that referenced this pull request Apr 22, 2026
…compaction failures (apache#62272)

Yukang-Lian added a commit to Yukang-Lian/doris that referenced this pull request Apr 23, 2026
…compaction failures (apache#62272)

yiguolei pushed a commit that referenced this pull request Apr 27, 2026
Cherry-pick the following PRs to `branch-4.0`:

- #61696 [Feature](compaction) add CompactionTaskTracker with system
table and HTTP API
- #61621 [fix](metrics) Fix prepared statement QPS metrics not counted
when audit log disabled
- #62272 [fix](cloud) Add SC_COMPACTION_CONFLICT error code to retry
cross-V1 compaction failures

#61696 had conflicts due to the BE directory layout difference between
master (`be/src/storage/...`) and branch-4.0 (`be/src/olap/...`). New
files were relocated to the branch-4.0 paths; include paths and FE
schema/thrift entries were merged accordingly.
Labels

approved, dev/3.1.x, dev/4.0.6-merged, dev/4.1.1-merged, reviewed
