
[fix](cloud) skip stale tablet cache check for STOP_TOKEN to fix spurious schema change failure#61380

Open
Hastyshell wants to merge 2 commits into apache:master from Hastyshell:fix/CORE-5964-stop-token-stale-cache-master

Conversation

@Hastyshell
Collaborator

Problem

In cloud mode, schema change on MOW (Merge-on-Write) tables intermittently fails with:

task type: ALTER, status_code: INTERNAL_ERROR, status_message:
[(BE_IP)[INTERNAL_ERROR]failed to start tablet job:
meta_service_job.cpp could not perform compaction on expired tablet cache.
req_base_compaction_cnt=0, base_compaction_cnt=0,
req_cumulative_compaction_cnt=8, cumulative_compaction_cnt=9]

Root Cause

Schema change on a MOW table calls _process_delete_bitmap(), which registers a STOP_TOKEN compaction job via CloudCompactionStopToken::do_register(). The STOP_TOKEN is not a real compaction — it is a lock marker that blocks concurrent compactions during delete bitmap recalculation.

However, start_compaction_job() in the meta-service applies the stale tablet cache check unconditionally to all compaction types, including STOP_TOKEN. If a concurrent compaction on another BE node advances cumulative_compaction_cnt in the meta-service while the schema change BE still holds its old cached value, the STOP_TOKEN registration is rejected with STALE_TABLET_CACHE. This error propagates back to the FE as a fatal ALTER task failure.

Fix

Skip the stale tablet cache check when the compaction job type is STOP_TOKEN. Since STOP_TOKEN does not read or compact any rowsets, verifying the freshness of cached compaction counts is meaningless for it.

// Before
if (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
    compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt()) {

// After
if (compaction.type() != TabletCompactionJobPB::STOP_TOKEN &&
    (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
     compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt())) {

Testing

Added regression test StopTokenSkipsStaleTabletCacheCheck in cloud/test/meta_service_job_test.cpp that:

  1. Sets up a tablet with cumulative_compaction_cnt=9 on the meta-service side
  2. Verifies that a regular CUMULATIVE compaction with stale count=8 is still correctly rejected with STALE_TABLET_CACHE
  3. Verifies that a STOP_TOKEN with the same stale count=8 succeeds with OK

STOP_TOKEN is a lock marker registered by schema change (MOW tables) to
block concurrent compactions during delete bitmap recalculation. It does
not perform any actual compaction work, so checking whether the BE's
cached compaction counts are up-to-date is meaningless for it.

Before this fix, when a concurrent compaction on another BE advanced the
cumulative_compaction_cnt in the meta-service (e.g. from 8 to 9) while a
schema change was in progress, the subsequent STOP_TOKEN registration
would be rejected with STALE_TABLET_CACHE and propagate back as an ALTER
task failure, even though no stale-cache hazard actually existed for the
lock operation.

Fix: guard the stale-cache check with a type != STOP_TOKEN condition so
that STOP_TOKEN registrations always proceed regardless of cached counts.
Add a regression test (StopTokenSkipsStaleTabletCacheCheck) that
reproduces the exact scenario from CORE-5964.
@Thearas
Contributor

Thearas commented Mar 16, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Hastyshell
Collaborator Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category            Coverage
Function Coverage   79.24% (1798/2269)
Line Coverage       64.46% (32278/50077)
Region Coverage     65.35% (16163/24732)
Branch Coverage     55.78% (8612/15438)

