Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix no known bookies after reset racks for all BKs #4128

Merged
merged 2 commits into from
Nov 20, 2023

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Nov 9, 2023

Motivation

Background

  • NetworkTopologyImpl does not allow multi Bks register with different depthOfAllLeaves.
  • Regarding NetworkTopologyImpl, once a node is added, the depthOfAllLeaves will be initiated.
  • When a new BK is registering:
    • EnsemblePlacementPolicy.networkTopology.add( newBkNode )
    • EnsemblePlacementPolicy.knownBookies.add( newBkNode )
    • Once network topology.add(newBkNode) fail, it will not call knownBookies.add( newBkNode )

The scenarios that would hit bugs.

Scenario 1: Reset racks for all BK nodes, for Example:

  • Register [BK1,BK2] with rack /r_1
    • depthOfAllLeaves is 2 now.
  • Restart [BK1,BK2] with rack /region_1/r_1
    • the depth of /region_1/r_1 is 3, different with 2
    • Since depthOfAllLeaves of NetworkTopologyImpl is still 2, [BK1, BK2] could not be added, and get an error can't add leaf node BK1 at depth 3 to the topology.

You can reproduce this by the new test testRestartBKWithNewRackDepth.


Scenario 2: A race condition caused depthOfAllLeaves to be initialized with a wrong value.

Time Event: BK1 start/shutdown thread: ZK main worker thread of RegistrationClient
1 BK1 start
2 Add a node into bkInfos of RegistrationClient
3 Notice EnsemblePlacementPolicy: a node added
4 BK1 shutdown
5 Remove the node from bkInfos of RegistrationClient, bkInfos is empty now
6 (Highlight) NetworkTopologyImpl tries to calculate the network location but gets null, so use a default value default-region/default-rack[1]
6 depthOfAllLeaves was initialized to 3
7 The issue occurs: the node BK1 could not be added to NetworkTopologyImpl anymore

[1]: https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/TopologyAwareEnsemblePlacementPolicy.java#L830

protected String resolveNetworkLocation(BookieId addr) {
  try {
      return NetUtils.resolveNetworkLocation(dnsResolver, bookieAddressResolver.resolve(addr));
  } catch (BookieAddressResolver.BookieIdNotResolvedException err) {
      return NetworkTopology.DEFAULT_REGION_AND_RACK; // "/default-region/default-rack"
  }
}

The above scenario will cause EnsemblePlacementPolicy.knownBookies to be an empty collection, leading to error Not enough non-faulty bookies available

The issue below occurs on the version 4.16.3

Screenshot 2023-11-09 at 23 49 17 Screenshot 2023-11-09 at 23 50 09
2023-11-03T10:08:03,154+0000 [pulsar-registration-client-34-1] ERROR org.apache.bookkeeper.net.NetworkTopologyImpl - Error: can't add leaf node <Bookie:bk-9:3181> at depth 3 to topology:
2023-11-03T10:08:03,154+0000 [pulsar-registration-client-34-1] ERROR org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Unexpected exception while handling joining bookie bk-9:3181
	at java.lang.Thread.run(Thread.java:833) ~[?:?]	
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]	
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]	
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.handleBookiesThatJoined(TopologyAwareEnsemblePlacementPolicy.java:720) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]	
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.onClusterChanged(TopologyAwareEnsemblePlacementPolicy.java:666) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]	
	at org.apache.bookkeeper.net.NetworkTopologyImpl.add(NetworkTopologyImpl.java:418) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]	
org.apache.bookkeeper.net.NetworkTopologyImpl$InvalidTopologyException: Invalid network topology. You cannot have a rack and a non-rack node at the same level of the network topology.	
Expected number of leaves:0	
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.93.Final.jar:4.1.93.Final]	
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.onClusterChanged(TopologyAwareEnsemblePlacementPolicy.java:666) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]	
	at org.apache.bookkeeper.client.BookieWatcherImpl.processWritableBookiesChanged(BookieWatcherImpl.java:196) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]	

Changes

Reset depthOfAllLeaves after all BKs have been removed.

Copy link
Member

@horizonzy horizonzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good resolution

// Register 2 BKs with depth 1 rack.
BookieNode dp1BkNode1 = new BookieNode(bkId1, dp1Rack);
BookieNode dp1BkNode2 = new BookieNode(bkId2, dp1Rack);
networkTopology.add(dp1BkNode1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a case for the middle state will be better.
The only one bks shutdown, then restart it with depth 2 rack, it also can't add successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Copy link
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

@hangc0276 hangc0276 requested a review from dlg99 November 14, 2023 06:19
@zymap zymap merged commit c07e72a into apache:master Nov 20, 2023
16 checks passed
zymap pushed a commit that referenced this pull request Dec 7, 2023
* Fix no known bookies after reset racks for all BKs

(cherry picked from commit c07e72a)
yangl pushed a commit to yangl/bookkeeper that referenced this pull request Dec 11, 2023
* Fix no known bookies after reset racks for all BKs
hangc0276 pushed a commit to hangc0276/bookkeeper that referenced this pull request Jan 18, 2024
* Fix no known bookies after reset racks for all BKs

(cherry picked from commit c07e72a)
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
* Fix no known bookies after reset racks for all BKs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants