[fix](cloud) Add SC_COMPACTION_CONFLICT error code to retry cross-V1 compaction failures #62272
PR approved by anyone and no changes requested.
…compaction failures The cloud SC compaction optimization introduced a cross-V1 safety check that returns INTERNAL_ERROR when a compaction rowset crosses the alter_version boundary. However, FE does not retry INTERNAL_ERROR, causing the schema change job to be permanently cancelled. Add a dedicated SC_COMPACTION_CONFLICT(101) error code for this transient condition, and include it in FE's schema change retry whitelist. When the cross-V1 check triggers, abort the SC job in meta-service first so the next retry registers a fresh job with a higher alter_version.
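The retry decision described in this commit message can be illustrated with a minimal sketch. It is not the actual `AlterJobV2.getRetryTimes()` implementation: the `StatusCode` enum, the `retryTimes` helper, its parameters, and the choice to return 0 for non-retryable codes are all assumptions made for illustration, and the real whitelist may contain other codes.

```java
import java.util.EnumSet;
import java.util.Set;

final class RetryWhitelistSketch {
    // Hypothetical stand-in for the Thrift-generated TStatusCode enum.
    enum StatusCode { OK, INTERNAL_ERROR, SC_COMPACTION_CONFLICT }

    // Hypothetical whitelist: only transient codes are retried.
    private static final Set<StatusCode> RETRYABLE =
            EnumSet.of(StatusCode.SC_COMPACTION_CONFLICT);

    // Returns how many retries are allowed for a failed schema change task.
    // Assumption: 0 means "do not retry, cancel the job".
    static int retryTimes(StatusCode code, boolean retryEnabled, int maxRetries) {
        if (!retryEnabled || code == null) {
            return 0;
        }
        return RETRYABLE.contains(code) ? maxRetries : 0;
    }

    public static void main(String[] args) {
        // Transient conflict: retried up to the configured limit.
        System.out.println(retryTimes(StatusCode.SC_COMPACTION_CONFLICT, true, 3));
        // Generic internal error: treated as permanent, no retry.
        System.out.println(retryTimes(StatusCode.INTERNAL_ERROR, true, 3));
    }
}
```

Under these assumptions, giving the transient condition its own code lets it be retried without also making every `INTERNAL_ERROR` retryable.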
…ion init Log per_row, sample_bytes, sample_rows immediately after all merge inputs finish loading their first block, before the actual merge starts. This helps diagnose memory issues by showing the actual per-row memory size at init time.
… compaction init" This reverts commit 6cfa629.
PR approved by at least one committer and no changes requested.
…compaction failures (#62272) ## Summary - Add dedicated `SC_COMPACTION_CONFLICT(101)` TStatusCode for the cross-V1 compaction detection in cloud schema change - Include this error code in FE's schema change retry whitelist (`AlterJobV2.getRetryTimes()`) - Previously, this transient condition returned `INTERNAL_ERROR` which FE does not retry, causing the schema change job to be permanently cancelled ## Background The cloud SC compaction optimization allows new tablets to do compaction during schema change queue wait. A safety check detects when a compaction rowset crosses the alter_version (V1) boundary. This is a transient condition - on retry, V1 will be higher and the crossing rowset falls within `[2, V1']`. The bug was that FE treated `INTERNAL_ERROR` as a permanent failure (no retry), so this transient condition caused the entire schema change to be cancelled. ## Changes | File | Change | |------|--------| | `gensrc/thrift/Status.thrift` | Add `SC_COMPACTION_CONFLICT = 101` | | `be/src/common/status.h` | Register `TStatusError(SC_COMPACTION_CONFLICT, false)` | | `be/src/cloud/cloud_schema_change_job.cpp` | Cross-V1 check: `INTERNAL_ERROR` → `SC_COMPACTION_CONFLICT` | | `fe/.../alter/AlterJobV2.java` | Add `SC_COMPACTION_CONFLICT` to retry whitelist | | `fe/.../alter/AlterJobV2RetryTest.java` | Unit tests for `getRetryTimes()` (7 cases) | | `regression-test/.../test_sc_compaction_cross_v1_retry.groovy` | E2E test: cross-V1 failure → retry → success | ## Test plan - [x] FE unit test: `AlterJobV2RetryTest` - verifies retry whitelist for all error codes, null handling, and config toggle - [x] Regression test: `test_sc_compaction_cross_v1_retry` - uses debug points to trigger cross-V1, verifies SC stays RUNNING under override, then succeeds after override removal. Asserts data integrity and schema correctness.
Cherry-pick the following PRs to `branch-4.0`:
- #61696 [Feature](compaction) add CompactionTaskTracker with system table and HTTP API
- #61621 [fix](metrics) Fix prepared statement QPS metrics not counted when audit log disabled
- #62272 [fix](cloud) Add SC_COMPACTION_CONFLICT error code to retry cross-V1 compaction failures

#61696 had conflicts due to the BE directory layout difference between master (`be/src/storage/...`) and branch-4.0 (`be/src/olap/...`). New files were relocated to the branch-4.0 paths; include paths and FE schema/thrift entries were merged accordingly.
## Summary
- Add dedicated `SC_COMPACTION_CONFLICT(101)` TStatusCode for the cross-V1 compaction detection in cloud schema change
- Include this error code in FE's schema change retry whitelist (`AlterJobV2.getRetryTimes()`)
- Previously, this transient condition returned `INTERNAL_ERROR`, which FE does not retry, causing the schema change job to be permanently cancelled

## Background
The cloud SC compaction optimization allows new tablets to do compaction during schema change queue wait. A safety check detects when a compaction rowset crosses the alter_version (V1) boundary. This is a transient condition: on retry, V1 will be higher and the crossing rowset falls within `[2, V1']`.

The bug was that FE treated `INTERNAL_ERROR` as a permanent failure (no retry), so this transient condition caused the entire schema change to be cancelled.

## Changes
| File | Change |
|------|--------|
| `gensrc/thrift/Status.thrift` | Add `SC_COMPACTION_CONFLICT = 101` |
| `be/src/common/status.h` | Register `TStatusError(SC_COMPACTION_CONFLICT, false)` |
| `be/src/cloud/cloud_schema_change_job.cpp` | Cross-V1 check: `INTERNAL_ERROR` → `SC_COMPACTION_CONFLICT` |
| `fe/.../alter/AlterJobV2.java` | Add `SC_COMPACTION_CONFLICT` to retry whitelist |
| `fe/.../alter/AlterJobV2RetryTest.java` | Unit tests for `getRetryTimes()` (7 cases) |
| `regression-test/.../test_sc_compaction_cross_v1_retry.groovy` | E2E test: cross-V1 failure → retry → success |

## Test plan
- [x] FE unit test: `AlterJobV2RetryTest` - verifies retry whitelist for all error codes, null handling, and config toggle
- [x] Regression test: `test_sc_compaction_cross_v1_retry` - uses debug points to trigger cross-V1, verifies SC stays RUNNING under override, then succeeds after override removal. Asserts data integrity and schema correctness.
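To make the cross-V1 condition from the Background concrete, here is a small sketch of why a retry resolves it. The real check lives in the C++ BE code; the Java helper names and the exact inequality (a rowset `[start, end]` crosses V1 when `start <= V1 < end`) are assumptions for illustration, not the actual implementation.

```java
final class CrossV1Sketch {
    // Assumed definition: the rowset spans versions on both sides of V1.
    static boolean crossesAlterVersion(long start, long end, long alterVersion) {
        return start <= alterVersion && alterVersion < end;
    }

    // Assumed definition: the rowset lies entirely inside [2, V1].
    static boolean withinAlterRange(long start, long end, long alterVersion) {
        return start >= 2 && end <= alterVersion;
    }

    public static void main(String[] args) {
        long start = 8;
        long end = 12;   // compaction output rowset [8, 12]

        long v1 = 10;    // alter_version of the first attempt
        System.out.println(crossesAlterVersion(start, end, v1));      // true

        long v1Retry = 15; // a fresh job registers a higher alter_version
        System.out.println(crossesAlterVersion(start, end, v1Retry)); // false
        System.out.println(withinAlterRange(start, end, v1Retry));    // true
    }
}
```

In this example the rowset `[8, 12]` straddles V1 = 10 on the first attempt, but once the retry registers V1' = 15 the same rowset sits inside `[2, V1']` and no longer trips the check.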