support async publish for cloud mow #61634
Draft
bobhan1 wants to merge 30 commits into apache:master
Add 10 self-contained design documents for the Cloud MOW two-phase commit feature, covering Proto/KV schema, MS commit/convert/publish APIs, FE commit phase, publish daemon, recovery, BE calc bitmap changes, and cleanup of legacy locks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…maParam, per-BE batching
Key changes across module design docs:
- Add MS tablet-level lock design (replaces table-level lock for 2PC tables)
- Change load param persistence from custom proto to TOlapTableSchemaParam (Thrift serialized bytes in TxnInfoPB.load_schema_param)
- Document per-BE batch dispatch model (one request per BE with all tablets)
- Clarify BE independently completes full pipeline per tablet
- Emphasize all old code preserved, feature gated by table property
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t lock, old code preservation
- Replace TxnLoadInfoPB with bytes load_schema_param (TOlapTableSchemaParam Thrift serialized bytes) throughout the document
- Add TOlapTableSchemaParam definition and cross-type serialization examples
- Add Section 8: MS Tablet-level lock KV/RPC design
- Add Section 9: Old code preservation principle
- Update proto summary with tablet lock RPC extensions
- Update KV summary with tablet lock KV
- Renumber subsequent sections (10->12, 11->13)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nces
Update FE recovery and BE calc bitmap docs to use load_schema_param (TOlapTableSchemaParam Thrift bytes) instead of the removed TxnLoadInfoPB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… with user-facing 2PC
Doris has an existing user-facing 2PC API (is_2pc, precommit_txn). The internal commit+publish separation mechanism is a different concept. Rename all references to avoid confusion:
- two_phase_commit -> mow_async_publish (proto fields)
- enable_two_phase_commit -> enable_mow_async_publish (table property)
- is_cloud_mow_2pc -> is_mow_async_publish (request fields)
- 两阶段提交 ("two-phase commit") -> 异步发布 ("async publish") (Chinese docs, when referring to the internal mechanism)
The only remaining "两阶段提交" ("two-phase commit") reference is for the user-facing 2PC (is_2pc field description), which is correct.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Module 0: Define foundational proto messages and KV schema for the async publish (two-phase commit) feature in Cloud MOW tables.
Proto changes:
- Add TxnTabletInfoPB message for tablet-to-BE mapping
- Add ConvertTmpRowsetRequest/Response for per-tablet rowset conversion
- Extend TxnInfoPB with async publish fields (30-35)
- Extend CommitTxnRequest with async publish/lightweight publish fields (20-23)
- Extend CommitTxnResponse with commit_versions field (10)
- Extend GetDeleteBitmapUpdateLockRequest with tablet-level lock fields (20-21)
- Extend RemoveDeleteBitmapUpdateLockRequest with tablet-level lock fields (20-21)
- Register convert_tmp_rowset RPC in MetaService
KV Schema changes:
- Add partition_commit_version_key (PartitionCommitVersionKeyInfo)
- Add meta_delete_bitmap_tablet_lock_key (MetaDeleteBitmapTabletLockInfo)
- Add encode/decode functions for both new key types
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add commit_txn_async_publish(), which implements the fast commit phase for MOW async publish. This phase only updates partition commit versions and TxnInfoPB, without rowset conversion or visible version update.
Key changes:
- Add commit_txn_async_publish() method in meta_service
- Add txn_async_publish_test to CMakeLists.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
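The commit phase described in this commit can be pictured with a small in-memory model. This is only a sketch: the `MetaStore` class, its dictionary layout, and the rule "next commit version = max(last commit version, visible version) + 1" are assumptions made for illustration and are not taken from the PR's code.

```python
from dataclasses import dataclass, field


@dataclass
class MetaStore:
    """Toy stand-in for the MS KV store (hypothetical layout)."""
    visible_version: dict = field(default_factory=dict)  # partition_id -> visible version
    commit_version: dict = field(default_factory=dict)   # partition_id -> last reserved commit version
    txns: dict = field(default_factory=dict)             # txn_id -> {"status", "commit_versions"}


def commit_txn_async_publish(ms, txn_id, partition_ids):
    """Reserve one commit version per partition; visible versions are not touched."""
    txn = ms.txns.get(txn_id)
    if txn is not None and txn["status"] in ("COMMITTED", "VISIBLE"):
        # Idempotent retry: re-return the versions stored by the first call.
        return dict(txn["commit_versions"])
    reserved = {}
    for pid in partition_ids:
        # Assumed rule: continue from whichever of the two versions is newer.
        base = max(ms.commit_version.get(pid, 0), ms.visible_version.get(pid, 1))
        reserved[pid] = base + 1
    ms.commit_version.update(reserved)
    ms.txns[txn_id] = {"status": "COMMITTED", "commit_versions": dict(reserved)}
    return reserved
```

In this model two commits on the same partition receive consecutive commit versions while the visible version stays where it was, which is the separation the commit message describes.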
…owset conversion
Module 2 of async publish: Implement MS-side per-tablet tmp rowset to formal
rowset conversion via new RPC.
MS changes:
- Add convert_tmp_rowset() RPC method in meta_service.{h,cpp}
- Implement idempotency check (formal rowset existence)
- Read tmp rowset KV and parse metadata
- Set version fields (start_version, end_version, visible_ts_ms)
- Write formal rowset KV and versioned rowset (if enabled)
- Update tablet stats (split mode via atomic_add)
- Delete tmp rowset KV
- Add monitoring variables (g_bvar_ms_convert_tmp_rowset, KV counters)
BE changes:
- Add convert_tmp_rowset() method in cloud_meta_mgr.{h,cpp}
- Build ConvertTmpRowsetRequest with txn_id, tablet_id, version, etc.
- Call MS RPC via MetaService_Stub::convert_tmp_rowset
- Return rowset_meta if requested
Tests:
- Add 8 unit tests in txn_async_publish_test.cpp:
  - ConvertTmpRowsetBasic: normal conversion + stats update
  - ConvertTmpRowsetIdempotent: retry safety
  - ConvertTmpRowsetNotFound: tmp rowset does not exist
  - ConvertTmpRowsetInvalidParams: param validation
  - ConvertTmpRowsetInvalidInstance: instance resolution error
  - ConvertTmpRowsetRecycled: recycled tmp rowset error
  - ConvertTmpRowsetVersionConflict: version conflict detection
  - ConvertTmpRowsetMultiTablet: independent multi-tablet conversion
All tests passed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
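The MS-side steps listed in this commit (idempotency check, read tmp rowset, set version fields, write formal rowset, update stats, delete tmp rowset) can be sketched as a toy function over a dictionary. The key tuples, the `num_rows` statistic, and passing `rowset_id` in the request are all hypothetical choices for this illustration; the real key encoding and request fields are defined in the PR, not here.

```python
class ConvertError(Exception):
    """Raised when the conversion cannot proceed."""


def convert_tmp_rowset(kv, stats, txn_id, tablet_id, rowset_id, version, visible_ts_ms):
    """Convert one tablet's tmp rowset into a formal rowset at `version`."""
    tmp_key = ("tmp_rowset", txn_id, tablet_id)   # hypothetical key layout
    formal_key = ("rowset", tablet_id, version)   # hypothetical key layout

    existing = kv.get(formal_key)
    if existing is not None:
        if existing["rowset_id"] == rowset_id:
            return existing  # retry of a finished conversion: nothing to redo
        raise ConvertError("version conflict: another rowset holds this version")

    tmp = kv.get(tmp_key)
    if tmp is None:
        raise ConvertError("tmp rowset not found")

    rowset = dict(tmp, start_version=version, end_version=version,
                  visible_ts_ms=visible_ts_ms)
    kv[formal_key] = rowset
    stats[tablet_id] = stats.get(tablet_id, 0) + rowset["num_rows"]
    del kv[tmp_key]
    return rowset
```

Checking the formal key before the tmp key is what makes a retry safe in this model: after the first call the tmp key is gone, but the second call returns the already-written rowset instead of failing or double-counting stats.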
- Add commit_txn_2pc_lightweight_publish() function in meta_service.h and meta_service_txn.cpp
- Add routing logic in commit_txn() for is_lightweight_publish flag
- Implement lightweight publish logic:
  - Read and validate TxnInfoPB (must be COMMITTED with mow_async_publish flag)
  - Validate commit_version == current_visible_version + 1
  - Update partition visible versions to commit versions
  - Update TxnInfoPB status to VISIBLE
  - Delete txn_running_key
  - Create CommitTxnLogPB and recycle information
  - Update table versions
- Add 9 unit tests for lightweight publish covering:
  - Basic functionality
  - Idempotency
  - Error scenarios (aborted, not found, wrong status)
  - Multi-partition support
  - End-to-end flow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
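The validation rules in the lightweight publish logic above can be modeled in a few lines. The dictionary shapes and return strings below are hypothetical; only the rules themselves (status must be COMMITTED with the async publish flag, and each commit version must equal the current visible version plus one) come from the commit message.

```python
def lightweight_publish(txn, visible_versions):
    """Advance visible versions for one txn, or raise without changing anything.

    txn: {"status": str, "mow_async_publish": bool, "commit_versions": {partition: version}}
    visible_versions: {partition: version}
    """
    if txn["status"] == "VISIBLE":
        return "already_visible"  # idempotent retry
    if txn["status"] != "COMMITTED" or not txn.get("mow_async_publish"):
        raise ValueError("txn is not a COMMITTED async-publish txn")

    # Validate every partition first so a failure leaves no partial update.
    for pid, commit_version in txn["commit_versions"].items():
        if commit_version != visible_versions[pid] + 1:
            raise ValueError(f"partition {pid}: version not continuous")

    for pid, commit_version in txn["commit_versions"].items():
        visible_versions[pid] = commit_version
    txn["status"] = "VISIBLE"
    return "published"
```

Running all checks before the first write mirrors what a single MS transaction would give for free; in this toy model it has to be done by hand.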
- Add CommittedTxnEntry class for tracking committed transactions
- Add CommittedTxnManager for managing committed transaction set
- Add enable_mow_async_publish table property
- Add Config.mow_async_publish_publish_timeout_seconds config
- Implement two-phase commit flow in CloudGlobalTransactionMgr:
  - Add routing logic in commitAndPublishTransaction()
  - Implement commitAndPublishOnePhase() (original flow)
  - Implement commitAndPublishTwoPhase() (new async publish flow)
  - Add populateCommitAttachment() helper function
- Add isEnableTwoPhaseCommit() and isMowAsyncPublish() methods to OlapTable
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
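As a rough picture of how a committed-transaction set with two lookup paths and a publish wait might fit together, here is a Python model. The actual classes are Java; everything below (method names, the use of `threading.Event` in place of a latch) is an illustrative assumption rather than the PR's API.

```python
import threading


class CommittedTxnEntry:
    """Tracks one committed txn and lets the committer wait for publish."""

    def __init__(self, txn_id, table_id):
        self.txn_id = txn_id
        self.table_id = table_id
        self._published = threading.Event()  # plays the role of a one-shot latch

    def mark_published(self):
        self._published.set()

    def await_publish(self, timeout_s):
        """Return True if published within the timeout, else False."""
        return self._published.wait(timeout_s)


class CommittedTxnManager:
    """Thread-safe set of committed txns, indexed by txn id and by table id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_txn = {}
        self._by_table = {}

    def add(self, entry):
        with self._lock:
            self._by_txn[entry.txn_id] = entry
            self._by_table.setdefault(entry.table_id, set()).add(entry.txn_id)

    def get(self, txn_id):
        with self._lock:
            return self._by_txn.get(txn_id)

    def txns_of_table(self, table_id):
        with self._lock:
            return sorted(self._by_table.get(table_id, ()))

    def remove(self, txn_id):
        with self._lock:
            entry = self._by_txn.pop(txn_id, None)
            if entry is None:
                return None
            ids = self._by_table[entry.table_id]
            ids.discard(txn_id)
            if not ids:
                del self._by_table[entry.table_id]
            return entry
```

The committing thread would call `await_publish` with the configured timeout while a background publisher calls `mark_published`; both indexes are updated under one lock so they cannot drift apart.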
- Add CloudPublishDaemon: background daemon that dispatches CalcDeleteBitmapTask to BEs and triggers lightweight publish when all tasks finish
- Modify MasterImpl.finishCalcDeleteBitmap: split async/sync paths cleanly; async mode keeps failed tasks in AgentTaskQueue for heartbeat auto-retry
- Add CalcDeleteBitmapTask.isAsyncPublish() to distinguish async mode (latch=null)
- Add Config: cloud_publish_interval_ms, cloud_publish_thread_pool_size
- Register CloudPublishDaemon in Env startup for cloud mode
- Add CloudPublishDaemonTest with unit tests
…ities
1. Add checkCommitInfo validation to executeCommitTxnRequestTwoPhase
2. Add MS RPC retry mechanism with backoff for both phases
3. Extract common callback handling to TxnUtil.executeCommitCallbacks
4. Extract common after-commit operations to TxnUtil.afterCommitCommon
5. Extract backoff logic to TxnUtil.backoff
6. CloudPublishDaemon now uses TxnUtil for callback and after-commit ops
7. Update test to adapt to new retry behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
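The commit message does not say which backoff policy TxnUtil.backoff uses. As one plausible shape, the sketch below retries a call with capped exponential backoff; the function name, parameters, defaults, and the choice of `ConnectionError` as the retryable error are all assumptions for illustration.

```python
import time


def call_with_backoff(rpc, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep):
    """Call `rpc()` until it succeeds or `max_attempts` is reached.

    Only ConnectionError is treated as retryable in this sketch; the delay
    doubles after each failure and is capped at `max_delay_s`.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, which matters when the same helper is shared by the commit and publish phases.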
This change propagates the enable_mow_async_publish table property from FE to BE tablet metadata, enabling BE to detect async publish tables at the tablet level.
Changes:
1. Proto: Add enable_mow_async_publish field to TabletMetaPB (field 36) and TabletMetaCloudPB (field 41)
2. Proto: Add is_mow_async_publish to UpdateDeleteBitmapRequest (field 53)
3. FE: CloudInternalCatalog.createTabletMetaBuilder() sets the flag
4. BE: pb_convert.cpp converts the flag in both directions
5. BE: TabletMeta stores and provides accessor for the flag
6. BE: BaseTablet exposes enable_mow_async_publish() method
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ap RPC
This change propagates the enable_mow_async_publish flag from tablet metadata to the update_delete_bitmap RPC, allowing MS to skip pending delete bitmap operations for async publish tables.
Changes:
1. Add is_mow_async_publish parameter to CloudMetaMgr::update_delete_bitmap()
2. Set the flag in UpdateDeleteBitmapRequest proto
3. Pass tablet.enable_mow_async_publish() in all call sites:
   - CloudTablet::save_delete_bitmap_to_ms() (load path)
   - CloudFullCompaction (compaction path)
   - CloudSchemaChangeJob (schema change path)
   - CloudTablet::calc_delete_bitmap_for_compaction()
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…in MS
For tables with async publish enabled, skip pending delete bitmap operations in update_delete_bitmap RPC:
1. Skip remove_pending_delete_bitmap() call
2. Skip pending delete bitmap KV write
This is safe because async publish tables do not need the pending cleanup mechanism: the publish phase ensures all operations complete successfully.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…toring for async publish
For async publish tables, replace table-level MS lock with tablet-level MS lock (delete_bitmap_tablet_lock). The lock acquisition order is:
rowset_update_lock (BE memory) -> delete_bitmap_tablet_lock (MS distributed)
This minimizes MS lock contention since compaction and publish are usually on the same BE, so the memory lock handles most cases.
MS changes:
- Add get_delete_bitmap_tablet_lock() using per-tablet KV keys
- Add remove_delete_bitmap_tablet_lock() for tablet-level lock release
- Dispatch based on tablet_level_lock flag in existing RPC handlers
BE changes:
- CloudMetaMgr: Add get/remove_delete_bitmap_tablet_lock() methods
- cloud_tablet.cpp (cumulative/base/index compaction): For async publish tablets, acquire rowset_update_lock then tablet lock
- cloud_full_compaction.cpp: Same lock ordering for full compaction
- cloud_engine_calc_delete_bitmap_task.cpp: Acquire/release tablet lock around convert_tmp_rowset in publish path
- compaction.cpp: Error recovery uses tablet lock for async publish tables
- cloud_schema_change_job.cpp: Pass is_mow_async_publish in update_delete_bitmap
- Skip pending delete bitmap in MS for async publish tables
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. MS update_delete_bitmap: check tablet-level lock instead of table-level lock for async publish tables
2. MS update_delete_bitmap: fix is_mow_async_publish scoping bug (moved from inner block to outer scope)
3. MS check_delete_bitmap_lock: add tablet_level_lock parameter to skip v2 mow_tablet_job_key check for tablet-level locks
4. MS finish_tablet_job/abort_tablet_job: release tablet-level lock for async publish compaction jobs
5. Proto: add use_delete_bitmap_tablet_lock to TabletCompactionJobPB
6. BE: all compaction types (cumulative/base/full/index_change) set use_delete_bitmap_tablet_lock flag in compaction job
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… calculation
Move get_delete_bitmap_tablet_lock from _handle_async_publish to handle(), right after rowset_update_lock acquisition.
Lock order: rowset_update_lock (memory) -> tablet lock (MS) -> calc delete bitmap -> convert_tmp_rowset -> local apply -> release tablet lock
Use Defer to ensure MS tablet lock is released on any exit path. _handle_async_publish no longer manages lock lifecycle.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
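The lock order and the "release on any exit path" guarantee can be demonstrated with a small trace. The BE code is C++ and uses Defer; this Python model uses context managers to get the same effect, and the `held` and `handle` helpers are hypothetical names that only record what would happen.

```python
from contextlib import contextmanager


@contextmanager
def held(name, trace):
    """Record acquisition and guaranteed release of a named lock."""
    trace.append(f"acquire {name}")
    try:
        yield
    finally:
        trace.append(f"release {name}")  # runs on success and on error


def handle(trace, fail_at=None):
    """Run the publish steps under both locks; optionally fail at one step."""
    steps = ["calc delete bitmap", "convert_tmp_rowset", "local apply"]
    with held("rowset_update_lock", trace):      # BE in-memory lock first
        with held("ms_tablet_lock", trace):      # then the MS tablet-level lock
            for step in steps:
                if step == fail_at:
                    raise RuntimeError(f"{step} failed")
                trace.append(step)
```

Because the locks are nested scopes, they are released in the reverse of the acquisition order even when a step raises, which is the property the commit relies on Defer to provide.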
Start CloudPublishDaemon in cloud mode so committed async-publish txns can be picked up by the background publish loop. Include involved_tablets in the async commit request so MS can derive partition commit versions and persist the async publish metadata. Propagate enable_mow_async_publish, db_id, table_id, and index_ids to CalcDeleteBitmapTask so BE can run convert_tmp_rowset in async publish. Add FE unit tests for the async commit request and publish task payload.
- Add property validation in InternalCatalog (cloud mode and MoW unique table only)
- Add setEnableMowAsyncPublish method to OlapTable
- Include property in SHOW CREATE TABLE output when two-phase commit is enabled
- Add unit tests for validation and DDL generation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… error handling
- Implement _handle_async_publish with proper flow:
  - Empty rowset: create hole filler with local time
  - Non-empty rowset: MS convert_tmp_rowset + local apply with MS time
- Only convert_tmp_rowset errors propagate; local apply is best-effort
- Get rowset from local txn_delete_bitmap_cache (has tablet_schema)
- Use set_cloud_fields_after_visible with proper time source
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o 10ms
Reduce the default scheduling interval for CloudPublishDaemon to improve publish latency for MOW tables.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…agent task type
Add CALC_DELETE_BITMAP_ASYNC_PUBLISH (1002) agent task to fully separate the async publish delete bitmap calculation code path from the original sync CALCULATE_DELETE_BITMAP (1000) task. Previously both modes shared the same task type and were differentiated by an enable_mow_async_publish flag, which left the two code paths interleaved and hard to maintain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a94b31d to d4f4b10
Issue Number: None
Related PR: None
Problem Summary: Remove sub_txn and compaction counter handling from CloudTabletCalcDeleteBitmapAsyncPublishTask, and add a dedicated CloudTablet lock so cloud compaction and async publish stay mutually exclusive across delete bitmap calculation and local rowset layout update.
Release note: None
- Test: No need to test (not run in this workspace; previous BE build in this environment was blocked by submodule/network setup and the user said they will build locally)
- Behavior changed: Yes (cloud async publish task no longer handles sub_txn or compaction cnts, and cloud compaction now serializes with async publish across delete bitmap calculation and local rowset layout update)
- Does this need documentation: No
d4f4b10 to d485aac
Issue Number: None
Related PR: None
Problem Summary: Return PUBLISH_VERSION_NOT_CONTINUOUS for cloud async publish version gaps, and handle that status in a dedicated cloud publish worker so short-lived discontinuous versions are retried inside BE instead of waiting for FE task report retries.
Release note: None
- Test: No need to test (not run here; FE changes in the worktree were intentionally excluded from this commit, and build/test were left to local verification)
- Behavior changed: Yes (cloud publish now distinguishes version-gap retries from delete bitmap lock conflicts and requeues retries inside BE worker threads)
- Does this need documentation: No
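To make the "requeue inside the worker" idea concrete, the sketch below publishes versions that may arrive out of order. It is a single-threaded toy: the queue, the retry limit, and the function name are assumptions for illustration, and the real worker is a BE thread pool.

```python
from collections import deque


def run_publish_worker(versions, visible_version, max_requeues=10):
    """Publish versions in order, requeueing any that would leave a gap.

    Returns (final_visible_version, publish_order). Raises RuntimeError if a
    version is still not continuous after `max_requeues` requeues.
    """
    queue = deque((v, 0) for v in versions)
    order = []
    while queue:
        version, requeues = queue.popleft()
        if version <= visible_version:
            continue  # already published, nothing to do
        if version == visible_version + 1:
            visible_version = version
            order.append(version)
        elif requeues < max_requeues:
            # Version gap: retry later inside the worker instead of failing.
            queue.append((version, requeues + 1))
        else:
            raise RuntimeError("PUBLISH_VERSION_NOT_CONTINUOUS")
    return visible_version, order
```

A gap that closes quickly (for example, version 3 arriving just after version 4) is absorbed by the requeue, while a gap that never closes still surfaces as an error once the limit is hit.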
ac45d7b to 6f7568d
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Restore compaction stats and tablet state checks for cloud async publish by reading them from MS while acquiring the tablet-level delete bitmap lock, then using those values in BE to decide whether sync_rowsets is needed and whether cached publish results can be reused.
### Release note
None
### Check List (For Author)
- Test: No need to test (not run in this workspace; added meta_service_test coverage only)
- Behavior changed: Yes (cloud async publish now syncs rowsets again when the locked MS tablet state or compaction stats are newer than local state)
- Does this need documentation: No
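The decision described in the problem summary can be written as a comparison between the local view and the values returned by MS with the lock. The field names below and the rule that a cached result is reusable exactly when no sync is needed are assumptions made for this sketch, not details confirmed by the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabletSnapshot:
    """Compaction stats and tablet state as seen by one side (hypothetical fields)."""
    base_compaction_cnt: int
    cumulative_compaction_cnt: int
    cumulative_point: int
    tablet_state: str


def needs_sync_rowsets(local, from_ms):
    """True when MS reports newer compaction stats or a different tablet state."""
    counters_newer = (
        from_ms.base_compaction_cnt > local.base_compaction_cnt
        or from_ms.cumulative_compaction_cnt > local.cumulative_compaction_cnt
        or from_ms.cumulative_point > local.cumulative_point
    )
    return counters_newer or from_ms.tablet_state != local.tablet_state


def can_reuse_cached_result(local, from_ms):
    """Assumed rule: a cached publish result is valid only if nothing changed."""
    return not needs_sync_rowsets(local, from_ms)
```

Reading the MS values in the same call that takes the lock means the comparison is made against a state that cannot change until the lock is released.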
…_delete_bitmap
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Motivation
In cloud-native (disaggregated storage-compute) mode, high-frequency concurrent loads on MOW (Merge-on-Write) tables suffer from a serious throughput bottleneck that cannot be alleviated by horizontally scaling resources.
How It Works in Shared-Nothing Mode (Baseline)
In shared-nothing mode, MOW loads use a two-phase commit (commit, then publish).
This architecture scales horizontally: adding more BEs reduces commit + publish latency and increases throughput.
The Problem in Cloud Mode
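Cloud mode currently uses a single-phase commit. The entire process runs serially under an FE table-level in-memory write lock. As a result, concurrent loads on the same table queue behind that lock, and adding resources does not raise throughput.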
Solution
Introduce a two-phase async publish mode for cloud MOW tables, controlled by the table property enable_mow_async_publish (new tables only). The core idea is to move the expensive publish operations (delete bitmap calculation + rowset conversion) out of the lock-holding path and execute them asynchronously.
Phase 1: Commit (Fast, Brief FE Table Lock)
- FE calls MS commit_txn with enable_mow_async_publish=true:
  - Reserves commit versions (stored under partition_commit_version_key, separate from visible version)
  - Persists transaction metadata in TxnInfoPB (involved tablets, commit versions, load schema params, etc.)
  - Marks the transaction COMMITTED
- FE registers the transaction in CommittedTxnManager
Phase 2: Publish (Async, No FE Table Lock)
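Driven by the new CloudPublishDaemon background thread:
- Dispatches CalcDeleteBitmapTasks to BEs. Failed tasks auto-retry via the AgentTaskQueue / BE heartbeat mechanism.
- Each BE calculates the delete bitmap, then calls convert_tmp_rowset to convert the tmp rowset to a formal versioned rowset (per-tablet, independent)
- Once all tasks finish, calls MS commit_txn with is_lightweight_publish=true: advances visible versions and marks the transaction VISIBLE
Key Design Decisions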
[Table not recovered in this copy; surviving cell fragments: commit_txn, convert_tmp_rowset]
Implementation Details
Proto & KV Schema (gensrc/proto/cloud.proto, cloud/src/meta-store/keys.cpp)
- TxnTabletInfoPB message: records per-tablet BE endpoint info for publish-phase task dispatch
- TxnInfoPB new fields: committed_versions, involved_tablets, load_schema_param, published_tablet_ids
- CommitTxnRequest new routing flags: enable_mow_async_publish and is_lightweight_publish
- convert_tmp_rowset RPC with ConvertTmpRowsetRequest / ConvertTmpRowsetResponse
- New keys: partition_commit_version_key (commit version storage), meta_delete_bitmap_tablet_lock_key (tablet-level lock)
Meta Service (cloud/src/meta-service/)
- commit_txn_async_publish: Commit-phase handler; reserves commit versions and persists txn metadata; idempotent (re-returns stored versions if already committed)
- commit_txn_2pc_lightweight_publish: Publish-phase handler; advances visible versions with strict version continuity validation
- convert_tmp_rowset: Per-tablet tmp→formal rowset conversion; idempotent (detects duplicate by rowset_id at target version)
- get_delete_bitmap_tablet_lock / remove_delete_bitmap_tablet_lock: fine-grained mutual exclusion between loads and compactions
- update_delete_bitmap: skips pending delete bitmap operations for async publish tables
FE (fe/fe-core/.../cloud/transaction/)
- CloudGlobalTransactionMgr: New commitAndPublishTwoPhase flow; fast commit, then awaitPublish(timeout) for async publish completion
- CloudPublishDaemon (new): MasterDaemon background thread with a two-phase loop: dispatch calc tasks → try to finish transactions via lightweight publish
- CommittedTxnManager (new): Thread-safe committed transaction tracking with dual indexes (txnId + tableId)
- CommittedTxnEntry (new): Per-transaction state tracking + CountDownLatch-based publish await mechanism
- TxnUtil (new): Shared utilities (commit callbacks, retry backoff)
- enable_mow_async_publish table property support (OlapTable, TableProperty, PropertyAnalyzer)
BE (be/src/cloud/)
- CloudEngineCalcDeleteBitmapTask: Async publish path adds MS tablet-level lock acquisition → delete bitmap calculation → convert_tmp_rowset call → best-effort local apply (merge delete bitmap + add rowset to tablet memory)
Tests
- cloud/test/txn_async_publish_test.cpp: 1500+ line MS integration test covering commit/publish/convert_tmp_rowset normal and error paths
- CloudPublishDaemonTest.java: FE publish daemon unit tests
- CloudGlobalTransactionMgrTest.java, OlapTableTest.java, CreateTableTest.java: Related unit tests
Risks & Limitations
Test Plan
- txn_async_publish_test passes
- CloudPublishDaemonTest passes