Optimize bookie decommission check wait interval #4070

hangc0276
Contributor

Motivation

When a bookie decommission is triggered, the maximum interval between checks of the remaining under-replicated ledgers is 10 minutes.

2023-08-10T13:56:08,911-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
2023-08-10T13:56:50,793-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 23140
2023-08-10T14:08:47,350-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
2023-08-10T14:19:02,330-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
2023-08-10T14:29:17,332-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
2023-08-10T14:39:32,395-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984

This has the following issues:

  • Each check waits the full 10 minutes whenever more than 60 ledgers are waiting to be replicated, which is too long for a small bookie decommission, for example a bookie with only 70 ledgers to replicate.
  • The sleep time per ledger is set to 10 seconds, but a ledger with little data, such as 100 KB, takes only 2 or 3 seconds to replicate.
  • The count of ledgers waiting to be replicated in the first round is inaccurate because those ledgers have not yet been validated by validateBookieIsNotPartOfEnsemble.
  • In the log above, the first count of ledgers needing replication is 23140, but after 10 minutes the count is 2984. The first check interval is nevertheless calculated from 23140.
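The arithmetic behind the first two points can be sketched as follows. This is a minimal illustration, assuming the wait is the ledger count times the per-ledger sleep time, capped at the maximum interval; the class name `CheckIntervalSketch` and the helper `sleepMillis` are hypothetical and are not taken from BookKeeperAdmin.

```java
class CheckIntervalSketch {
    // Hypothetical helper: time to sleep before the next check, in milliseconds.
    // Assumes the wait grows linearly with the pending ledger count up to a cap.
    static long sleepMillis(int pendingLedgers, long perLedgerMillis, long maxMillis) {
        return Math.min((long) pendingLedgers * perLedgerMillis, maxMillis);
    }

    public static void main(String[] args) {
        long perLedgerMillis = 10_000L; // 10 seconds per ledger (value before this PR)
        long maxMillis = 600_000L;      // 10 minute cap (value before this PR)
        for (int n : new int[] {5, 60, 70, 23140}) {
            long seconds = sleepMillis(n, perLedgerMillis, maxMillis) / 1000;
            System.out.println(n + " ledgers -> " + seconds + " s");
        }
    }
}
```

Under these assumptions, 60 ledgers already reach the 600 second cap, so a bookie with 70 ledgers waits just as long per check as one with 23140.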

Changes

  • Reduce the max check interval from 10 minutes to 5 minutes.
  • Reduce sleepTimePerLedger from 10 seconds to 3 seconds.
  • Run the validateBookieIsNotPartOfEnsemble check in the first round, before going to sleep, so that the count of ledgers waiting for replication is accurate.
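The three changes can be sketched together as a wait loop. This is an illustration under stated assumptions, not the merged code: the class `DecommissionWaitSketch`, the method `waitForReplication`, and the `alreadyReplicated` and `sleeper` parameters are hypothetical stand-ins (the predicate plays the role of validateBookieIsNotPartOfEnsemble, and the sleeper replaces a real sleep so the example finishes immediately).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongConsumer;
import java.util.function.Predicate;

class DecommissionWaitSketch {
    static final long MAX_SLEEP_MILLIS = 5 * 60 * 1000L;   // 5 minute cap (value from this PR)
    static final long SLEEP_PER_LEDGER_MILLIS = 3 * 1000L; // 3 seconds per ledger (value from this PR)

    // Returns the total time "slept" in milliseconds.
    static long waitForReplication(Collection<Long> ledgers,
                                   Predicate<Long> alreadyReplicated,
                                   LongConsumer sleeper) {
        long total = 0;
        // Validate before the first sleep so the first interval uses an accurate count.
        ledgers.removeIf(alreadyReplicated);
        while (!ledgers.isEmpty()) {
            long sleep = Math.min(ledgers.size() * SLEEP_PER_LEDGER_MILLIS, MAX_SLEEP_MILLIS);
            sleeper.accept(sleep);
            total += sleep;
            // Drop every ledger that no longer has entries on the decommissioned bookie.
            ledgers.removeIf(alreadyReplicated);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> ledgers = new ArrayList<>(List.of(1L, 2L, 3L, 4L));
        // Simulated state: only ledger 3 still has data on the bookie.
        Set<Long> stillOnBookie = new HashSet<>(Set.of(3L));
        long total = waitForReplication(
                ledgers,
                id -> !stillOnBookie.contains(id),
                millis -> stillOnBookie.clear()); // pretend replication finishes during the sleep
        System.out.println("slept " + total + " ms");
    }
}
```

In this simulated run, three of the four ledgers are filtered out before any sleep, so the single wait is sized for one ledger rather than four.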

@hangc0276
Contributor Author

@horizonzy Please help take a look at this PR, thanks.

horizonzy

This comment was marked as duplicate.

Member

@horizonzy horizonzy left a comment

LGTM

@merlimat merlimat merged commit 09aec9c into apache:master Sep 8, 2023
16 checks passed
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
* Optimise bookie decommission check wait interval

* fix a bug