[Subtask]: Optimizations and troubleshooting for the Master-Slave mode. #4174

Merged
czy006 merged 3 commits into apache:master from
wardlican:amoro#4171
Apr 16, 2026

@wardlican
Contributor

Why are the changes needed?

Closes #4171.

Brief change log

  1. Assign a bucket_id to existing historical tables when Master-Slave mode is enabled.
  2. Fixed an issue where concurrently adding new tables could assign the same bucket_id to multiple tables, leaving the buckets unbalanced.
  3. Fixed an issue where the -msm flag was not passed to the optimizer at startup in Master-Slave mode.
  4. Fixed an issue where tables were erroneously deleted after a migration.
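
The race in item 2 can be illustrated with a minimal sketch. It is not the code from this patch: `BucketAllocator`, `assign`, and `load` are hypothetical names, buckets are assumed to be numbered 0..n-1, and the state is kept in memory rather than in the database or ZooKeeper. The point is only that picking the least-loaded bucket and recording the choice have to happen as one atomic step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a class from the Amoro codebase.
final class BucketAllocator {
  // tablesPerBucket[i] is the number of tables currently assigned to bucket i (assumed 0-based ids).
  private final int[] tablesPerBucket;
  private final Map<String, Integer> bucketByTable = new HashMap<>();

  BucketAllocator(int bucketCount) {
    if (bucketCount <= 0) {
      throw new IllegalArgumentException("bucketCount must be positive");
    }
    this.tablesPerBucket = new int[bucketCount];
  }

  // synchronized: two concurrent callers cannot both observe the same
  // "least loaded" bucket before either of them records its choice.
  synchronized int assign(String tableId) {
    Integer existing = bucketByTable.get(tableId);
    if (existing != null) {
      return existing; // already assigned, keep the assignment stable
    }
    int best = 0;
    for (int i = 1; i < tablesPerBucket.length; i++) {
      if (tablesPerBucket[i] < tablesPerBucket[best]) {
        best = i; // ties resolve to the lowest bucket id
      }
    }
    tablesPerBucket[best]++;
    bucketByTable.put(tableId, best);
    return best;
  }

  synchronized int load(int bucketId) {
    return tablesPerBucket[bucketId];
  }
}
```

Under the same assumptions, the backfill in item 1 could call `assign` once for each historical table that has no bucket yet. A multi-node deployment would need the equivalent atomicity in the shared store, which this in-memory sketch does not attempt to show.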

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@github-actions github-actions bot added the type:docs, module:ams-server, and module:ams-optimizer labels on Apr 10, 2026
@wardlican wardlican changed the title from "[Subtask]: Optimizations and troubleshooting for the Master-Slave mode. #4171" to "[Subtask]: Optimizations and troubleshooting for the Master-Slave mode." on Apr 10, 2026
@czy006
Contributor

czy006 commented Apr 14, 2026

Offline nodes with missing last_update_time may never be reclaimed

In AmsAssignService.detectNodeChanges (around lines 528-545), a node is marked offline only when lastUpdateTime > 0 && (currentTime - lastUpdateTime) > nodeOfflineTimeoutMs.

However, both DBBucketAssignStore#getLastUpdateTime and ZkBucketAssignStore#getLastUpdateTime return 0 when the timestamp is missing. In that case, a node that is already absent from the alive-node list will never enter the offline branch, so its buckets are never redistributed. This can leave bucket ownership stranded and prevent load recovery.

Suggested fix: either treat lastUpdateTime <= 0 as offline-eligible when the node is not in aliveNodeKeys and reclaim its buckets directly, or keep a short grace period and force the node offline once it expires, even if the timestamp is missing.
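
A minimal sketch of the first option, for illustration only. `OfflineCheck` and `shouldReclaim` are hypothetical names, not the actual code in AmsAssignService. The parameter names follow the identifiers quoted above, and `inAliveNodeKeys` stands in for a membership check against aliveNodeKeys.

```java
// Hypothetical helper for illustration only; not the actual AmsAssignService logic.
final class OfflineCheck {
  private OfflineCheck() {}

  static boolean shouldReclaim(
      boolean inAliveNodeKeys,
      long lastUpdateTime,
      long currentTime,
      long nodeOfflineTimeoutMs) {
    if (inAliveNodeKeys) {
      return false; // node is alive, keep its buckets
    }
    if (lastUpdateTime <= 0) {
      // Missing timestamp (the stores return 0): treat the absent node as
      // offline-eligible instead of skipping it forever.
      return true;
    }
    // Timestamp is present: apply the existing timeout rule.
    return (currentTime - lastUpdateTime) > nodeOfflineTimeoutMs;
  }
}
```

With the original condition, a call such as `shouldReclaim(false, 0, now, timeout)` would correspond to the case that never reaches the offline branch. In this sketch it returns true, so the node's buckets become eligible for redistribution.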

@wardlican
Contributor Author

Okay, I will fix this issue.

@czy006 czy006 requested review from czy006 and xxubai April 15, 2026 06:04
Contributor

@czy006 czy006 left a comment

LGTM

@czy006 czy006 merged commit 7b19b7e into apache:master Apr 16, 2026
6 checks passed

Development

Successfully merging this pull request may close these issues.

[Subtask]: Optimizations and troubleshooting for the Master-Slave mode.

3 participants