
[fix](cloud) skip stale tablet cache check for STOP_TOKEN to fix spurious schema change failure#61380

Open
Hastyshell wants to merge 2 commits into apache:master from Hastyshell:fix/CORE-5964-stop-token-stale-cache-master

Conversation

@Hastyshell
Collaborator

Problem

In cloud mode, schema change on MOW (Merge-on-Write) tables intermittently fails with:

task type: ALTER, status_code: INTERNAL_ERROR, status_message:
[(BE_IP)[INTERNAL_ERROR]failed to start tablet job:
meta_service_job.cpp could not perform compaction on expired tablet cache.
req_base_compaction_cnt=0, base_compaction_cnt=0,
req_cumulative_compaction_cnt=8, cumulative_compaction_cnt=9]

Root Cause

Schema change on a MOW table calls _process_delete_bitmap(), which registers a STOP_TOKEN compaction job via CloudCompactionStopToken::do_register(). The STOP_TOKEN is not a real compaction — it is a lock marker that blocks concurrent compactions during delete bitmap recalculation.

However, start_compaction_job() in the meta-service applies the stale tablet cache check unconditionally to all compaction types, including STOP_TOKEN. If a concurrent compaction on another BE node advances cumulative_compaction_cnt in the meta-service while the schema change BE still holds its old cached value, the STOP_TOKEN registration is rejected with STALE_TABLET_CACHE. This error propagates back to the FE as a fatal ALTER task failure.

Fix

Skip the stale tablet cache check when the compaction job type is STOP_TOKEN. Since STOP_TOKEN does not read or compact any rowsets, verifying the freshness of cached compaction counts is meaningless for it.

// Before
if (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
    compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt()) {

// After
if (compaction.type() != TabletCompactionJobPB::STOP_TOKEN &&
    (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
     compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt())) {

Testing

Added regression test StopTokenSkipsStaleTabletCacheCheck in cloud/test/meta_service_job_test.cpp that:

  1. Sets up a tablet with cumulative_compaction_cnt=9 on the meta-service side
  2. Verifies that a regular CUMULATIVE compaction with stale count=8 is still correctly rejected with STALE_TABLET_CACHE
  3. Verifies that a STOP_TOKEN with the same stale count=8 succeeds with OK

STOP_TOKEN is a lock marker registered by schema change (MOW tables) to
block concurrent compactions during delete bitmap recalculation. It does
not perform any actual compaction work, so checking whether the BE's
cached compaction counts are up-to-date is meaningless for it.

Before this fix, when a concurrent compaction on another BE advanced the
cumulative_compaction_cnt in the meta-service (e.g. from 8 to 9) while a
schema change was in progress, the subsequent STOP_TOKEN registration
would be rejected with STALE_TABLET_CACHE and propagate back as an ALTER
task failure, even though no stale-cache hazard actually existed for the
lock operation.

Fix: guard the stale-cache check with a type != STOP_TOKEN condition so
that STOP_TOKEN registrations always proceed regardless of cached counts.
Add a regression test (StopTokenSkipsStaleTabletCacheCheck) that
reproduces the exact scenario from CORE-5964.
@Thearas
Contributor

Thearas commented Mar 16, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Hastyshell
Collaborator Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category            Coverage
Function Coverage   79.24% (1798/2269)
Line Coverage       64.46% (32278/50077)
Region Coverage     65.35% (16163/24732)
Branch Coverage     55.78% (8612/15438)

