[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199
[feature](tso) Add global monotonically increasing Timestamp Oracle(TSO)#61199AntiTopQuark wants to merge 29 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
996adea to
66a8f8e
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
66a8f8e to
fa48c2a
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
fa48c2a to
dc6a3a6
Compare
|
run buildall |
…TSO) Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
dc6a3a6 to
9c9dec0
Compare
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
Signed-off-by: Jingzhe Jia <AntiTopQuark1350@outlook.com>
|
run buildall |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Code Review Summary for TSO Feature PR
This is a large, well-structured PR introducing a global monotonically increasing Timestamp Oracle (TSO) service. The end-to-end flow from FE (TSOService → transaction commit → thrift) to BE (publish version → txn_manager → rowset meta) is cleanly plumbed. Below are the critical checkpoint conclusions and specific issues found.
Critical Checkpoint Conclusions
1. Goal & Correctness: The PR implements the TSO feature as described. The core bit-manipulation in TSOTimestamp is correct and well-tested. However, there are correctness risks around TSO persistence — see inline comments.
2. Modification focus: The PR is relatively focused for its scope but touches 52 files. The BE changes are mechanical plumbing of commit_tso and are correct.
3. Concurrency: TSOService uses a ReentrantLock for the globalTimestamp object. Lock usage is lightweight (only read/write of two longs). calibrateTimestamp() and updateTimestamp() are called only from the single daemon thread. getTSO() can be called from transaction commit threads concurrently, but correctly acquires the lock only in generateTSO(). No deadlock risk identified.
4. Lifecycle: TSOService extends MasterDaemon and starts in startMasterOnlyDaemonThreads(). The isInitialized AtomicBoolean guards startup. However, see the issue about resetting isInitialized when the feature is disabled at runtime.
5. Configuration: Multiple configs added with appropriate defaults. The enable_tso_persist_journal and enable_tso_checkpoint_module flags both default to false, meaning TSO has no persistence by default. This is a significant design choice — see inline comment.
6. Incompatible changes: FeMetaVersion.MINIMUM_VERSION_REQUIRED was bumped to VERSION_141. This is concerning — see inline comment.
7. Parallel code paths: The 2PC commit path, sub-transaction path, and normal commit path all correctly call getCommitTSO(). The LocalTabletInvertedIndex.generatePartitionVersionInfoWhenReport() also correctly propagates commitTSO for the report path.
8. Test coverage: Unit tests cover TSOTimestamp bit operations, TSOService initialization/calibration/overflow, TSOAction REST API, and DatabaseTransactionMgr commit TSO propagation. Regression tests verify rowset commit TSO and API behavior. BE unit tests cover async publish with commit TSO. Coverage is reasonable.
9. Observability: Six new metrics added (drift, backward, calculated, updated, update_failed, get_success). The REST API /api/tso provides inspection without incrementing. Logging is present at appropriate levels.
10. Transaction/persistence: EditLog write/replay for OP_TSO_TIMESTAMP_WINDOW_END is correctly wired. However, the persistence is guarded by two separate config flags that default to false. See inline comments for risks.
11. Data writes: The commit_tso is written to RowsetMetaPB during make_visible(). This is atomic with the existing rowset meta save. No concurrency issues.
12. FE-BE variable passing: commit_tso is added to TPartitionVersionInfo in Thrift with default -1. Both the publish path and the report path correctly propagate it. Cloud pb_convert.cpp correctly handles the new field in both directions.
13. Performance: getTSO() has retry loops with sleep(200ms) on the commit path. For an experimental feature this is acceptable, but should be noted. The lock in generateTSO() is very lightweight (increment a counter).
Issues Found
See inline comments for specific issues. Summary:
- [High] TSO persistence default off (
Config.java:3401): With bothenable_tso_persist_journal=falseandenable_tso_checkpoint_module=false, TSO monotonicity is NOT guaranteed across FE restarts. - [Medium] MINIMUM_VERSION_REQUIRED bump (
FeMetaVersion.java:115): Should not be bumped to VERSION_141 as the TSO image module is conditionally written. - [Medium] Runtime disable resets isInitialized (
TSOService.java:80): Disablingenable_tso_featureat runtime resetsisInitialized, but concurrentgetTSO()callers may still be in-flight. - [Low] writeTimestampToBDBJE swallows all exceptions (
TSOService.java:420): Silent failure can lead to monotonicity violations. - [Low] updateTimestamp logs error but continues when physicalTime==0 (
TSOService.java:313): Should return early. - [Info]
tsomodule unconditionally registered (PersistMetaModules.java:47): Acceptable for experimental but worth noting.
| @ConfField(mutable = true, masterOnly = true, description = { | ||
| "Whether to enable persisting TSO window end into edit log. Enabling emits new op code," | ||
| + " which may break rollback to older versions."}) | ||
| public static boolean enable_tso_persist_journal = false; |
There was a problem hiding this comment.
[High] No TSO persistence by default risks monotonicity violation.
Both enable_tso_persist_journal (here) and enable_tso_checkpoint_module default to false. This means:
- On FE restart,
windowEndTSOwill be 0 calibrateTimestamp()will set physical time toSystem.currentTimeMillis()- If NTP adjusts the clock backward between restarts, previously issued TSOs could be reissued
This breaks the "monotonically increasing" guarantee stated in the PR description. For an experimental feature, consider either:
- Making
enable_tso_persist_journal = truethe default whenenable_tso_featureis enabled, OR - Clearly documenting that TSO monotonicity across restarts requires enabling these persistence flags, OR
- Having
calibrateTimestamp()refuse to start ifenable_tso_persist_journalis false (to make it fail-fast)
| // if (FE_METAVERSION < VERSION_94) ... | ||
| // these clause will be useless and we could remove them | ||
| public static final int MINIMUM_VERSION_REQUIRED = VERSION_140; | ||
| public static final int MINIMUM_VERSION_REQUIRED = VERSION_141; |
There was a problem hiding this comment.
[Medium] MINIMUM_VERSION_REQUIRED should not be bumped to VERSION_141.
Bumping MINIMUM_VERSION_REQUIRED means FE instances running this code will refuse to load any meta image written by an FE with version < 141. Since the TSO image module is conditionally written (enable_tso_checkpoint_module defaults to false), and the "tso" module in PersistMetaModules won't produce data unless enabled, there's no reason to require version 141 as minimum.
This bump prevents rollback to the previous FE version even when TSO is not in use. Consider keeping MINIMUM_VERSION_REQUIRED = VERSION_140 and only bumping VERSION_CURRENT.
| @Override | ||
| protected void runAfterCatalogReady() { | ||
| if (!Config.enable_tso_feature) { | ||
| isInitialized.set(false); |
There was a problem hiding this comment.
[Medium] Resetting isInitialized while getTSO() may be in-flight.
When enable_tso_feature is dynamically set to false, this code resets isInitialized to false. However, a concurrent transaction commit thread could be between the isInitialized.get() check at line 159 and the generateTSO() call at line 189. This is a TOCTOU race.
Moreover, fatalClockBackwardReported is also reset here. If the feature is re-enabled and a clock backward is detected, the TSOClockBackwardException at line 99 would be thrown again (which is RuntimeException subclass), potentially crashing the daemon thread depending on how MasterDaemon handles it.
Consider:
- Not resetting
isInitializedwhen disabling — just checkConfig.enable_tso_featureat thegetTSO()entry point - Or use a separate flag for "feature disabled" vs "not yet calibrated"
| tsoTimestamp); | ||
| } | ||
| } catch (Exception e) { | ||
| LOG.error("Failed to write TSO timestamp to BDBJE", e); |
There was a problem hiding this comment.
[Low] writeTimestampToBDBJE catches all exceptions and only logs.
Silently swallowing persistence failures means isInitialized can be set to true (in calibrateTimestamp() at line 274) even when the window end was never persisted. On FE restart, windowEndTSO would be 0, and TSO monotonicity could be violated.
Consider either:
- Propagating the exception (at least from
calibrateTimestamp(), where persistence failure should prevent initialization) - Or returning a boolean success/failure that the caller can check
| } | ||
|
|
||
| if (prevPhysicalTime == 0) { | ||
| LOG.error("TSO timestamp is not calibrated, please check"); |
There was a problem hiding this comment.
[Low] updateTimestamp() only logs error when prevPhysicalTime == 0 but continues execution.
If prevPhysicalTime == 0 (meaning TSO was never calibrated), the method continues to compute timeLag using this 0 value, which would always trigger the clock-drift warning at line 318. The method should return early here since continuing with uncalibrated state produces meaningless warnings and incorrect state transitions.
| "binlogs", "resourceGroups", "AnalysisMgrV2", "AsyncJobManager", "workloadSchedPolicy", | ||
| "insertOverwrite", "plsql", "dictionaryManager", "indexPolicy", "KeyManagerStore", | ||
| "constraintManager" | ||
| "constraintManager", "tso" |
There was a problem hiding this comment.
[Info] The "tso" module is unconditionally registered in MODULE_NAMES.
While saveTSO() checks Config.enable_tso_feature && Config.enable_tso_checkpoint_module before writing data, the module is always present. This means:
- Images written with TSO data could fail to load on older FEs that don't know about the
"tso"module (unlessignore_unknown_metadata_module=true) - The module will be iterated during every image save/load even when TSO is disabled
This is acceptable given the ignore_unknown_metadata_module safety valve and the experimental nature, but worth noting for future production readiness.
|
run buildall |
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 26791 ms |
TPC-DS: Total hot run time: 169782 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 26377 ms |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-DS: Total hot run time: 168160 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
Hi @morningman, please help review the PR again. Thank you very much. |
What problem does this PR solve?
Issue Number: close #61198
Related #57921
Problem Summary:
Release note
The service calibrates its initial timestamp at startup and periodically updates it to maintain a time window.
A TSO timestamp encodes the physical time and a logical counter; it is assembled and extracted by the new TSOTimestamp class.
The service exposes two main methods:
getTSO()– returns a new TSO timestamp for transaction commits.getCurrentTSO()– returns the current TSO without bumping the logical counter.experimental_enable_feature_tso– enables/disables the TSO feature.tso_service_update_interval_ms– interval in milliseconds for the TSO service to update its window.max_update_tso_retry_countandmax_get_tso_retry_count– retry limits for updating and obtaining TSOs.tso_service_window_duration_ms– length of the time window allocated by the TSO service.tso_time_offset_debug_mode– debug offset for the physical time.enable_tso_persist_journalandenable_tso_checkpoint_module– persistence switches for edit log and checkpoint.enable_tsowhich can be configured inCREATE TABLEor modified viaALTER TABLE. Only tables withenable_tso = truegenerate commit TSO for transactions; when disabled, commit_tso remains-1.TransactionStatenow fetches a commit TSO fromTSOServicewhen TSO is enabled and stores it in the transaction state andTableCommitInfo.TPartitionVersionInfo.commit_tso), and is persisted with each rowset (see next item).Rowset::make_visiblenow accepts acommit_tsoparameter and writes it toRowsetMeta.RowsetMetaPBadds a new fieldcommit_tsoto persist commit timestamps.information_schema.rowsetsintroduces a new columnCOMMIT_TSOallowing users to query the commit timestamp for each rowset.A new REST endpoint
/api/tsois added for retrieving current TSO information. It returns a JSON payload containing:window_end_physical_time– end of the current TSO time window.current_tso– the current composed 64‑bit TSO.current_tso_physical_timeandcurrent_tso_logical_counter– the decomposed physical and logical parts of the current TSO. This API does not increment the logical counter.New metrics counters (e.g.,
tso_clock_drift_detected,tso_clock_backward_detected,tso_clock_calculated,tso_clock_updated) expose state and health of the TSO service.FeMetaVersion.VERSION_CURRENTis bumped toVERSION_141to indicate the addition of the TSO module.New unit tests verify
TSOTimestampbit manipulation,TSOServicebehavior, commit TSO propagation, and the/api/tsoendpoint. Regression tests verify that rowset commit timestamps are populated when TSO is enabled and that the API returns increasing TSOs.Impact and Compatibility
experimental_enable_feature_tso. It is disabled by default and can be enabled in front-end configuration. When enabled, old FE versions without this feature cannot replay edit log entries containing TSO operations; therefore upgrade all FEs before enabling.enable_tsototrue. Tables with TSO enabled will produce commit TSO for each rowset and may require downstream consumers to handle the newcommit_tsofield./api/tsoto inspect current TSO values. No existing API is modified.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
docs: add TSO-related functions description and documentation. doris-website#3454
Check List (For Reviewer who merge this PR)