Skip to content

[improve] replace bookie from different rack of other quorum when do recovery#3707

Closed
TakaHiR07 wants to merge 1 commit into
apache:masterfrom
TakaHiR07:improve_replace_bookie_from_different_rack
Closed

[improve] replace bookie from different rack of other quorum when do recovery#3707
TakaHiR07 wants to merge 1 commit into
apache:masterfrom
TakaHiR07:improve_replace_bookie_from_different_rack

Conversation

@TakaHiR07
Copy link
Copy Markdown
Contributor

Motivation

If we need to do recovery when one bookie down, in the current RackAwarePolicy, it would try picking a candidate from same rack to replace the down bookie firstly. There is a risk that if many bookie in the same rack shutdown, it would result in replacing bookie to the several bookie remain in this rack, making these bookies become hot spot.

Changes

choose the replaced bookie from different rack of other quorum when do recovery or do ensemble change.

Copy link
Copy Markdown
Contributor

@dlg99 dlg99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this configurable with current behavior enabled by default + add tests

@hangc0276
Copy link
Copy Markdown
Contributor

For RackAwarePolicy, it will try its best effort to pick bookies from different racks. When doing recovery, it is safe to replace bookies from the same rack.

@TakaHiR07
Copy link
Copy Markdown
Contributor Author

For RackAwarePolicy, it will try its best effort to pick bookies from different racks. When doing recovery, it is safe to replace bookies from the same rack.

In some extreme case, such as 5 bookies in 1 rack, 4 bookies down, 1 bookie survive. The recovery throughput would concentrate on this bookie. After recovery, it would result in data skew in this survived bookie. I think it is not really safe, maybe cause some problem.
And when do recovery, it seems data locality is not exist. Because the recover bk client in ReplicationWorker is random distribution, it would read data from rack 2 and then recover data to rack 1. So it is not neccessary to ensure the replaced bookie in the same rack

@hangc0276
Copy link
Copy Markdown
Contributor

For RackAwarePolicy, it will try its best effort to pick bookies from different racks. When doing recovery, it is safe to replace bookies from the same rack.

In some extreme case, such as 5 bookies in 1 rack, 4 bookies down, 1 bookie survive. The recovery throughput would concentrate on this bookie. After recovery, it would result in data skew in this survived bookie. I think it is not really safe, maybe cause some problem. And when do recovery, it seems data locality is not exist. Because the recover bk client in ReplicationWorker is random distribution, it would read data from rack 2 and then recover data to rack 1. So it is not neccessary to ensure the replaced bookie in the same rack

@TakaHiR07 If we not ensure the replaced bookie in the same rack, it will break the placement policy in ensemble change and auto recovery.

@TakaHiR07
Copy link
Copy Markdown
Contributor Author

For RackAwarePolicy, it will try its best effort to pick bookies from different racks. When doing recovery, it is safe to replace bookies from the same rack.

In some extreme case, such as 5 bookies in 1 rack, 4 bookies down, 1 bookie survive. The recovery throughput would concentrate on this bookie. After recovery, it would result in data skew in this survived bookie. I think it is not really safe, maybe cause some problem. And when do recovery, it seems data locality is not exist. Because the recover bk client in ReplicationWorker is random distribution, it would read data from rack 2 and then recover data to rack 1. So it is not neccessary to ensure the replaced bookie in the same rack

@TakaHiR07 If we not ensure the replaced bookie in the same rack, it will break the placement policy in ensemble change and auto recovery.

@hangc0276 This pr only use the implementation in

public BookieNode selectFromNetworkLocation(Set<String> excludeRacks,
Set<Node> excludeBookies,
Predicate<BookieNode> predicate,
Ensemble<BookieNode> ensemble,
boolean fallbackToRandom)
instead of
public BookieNode selectFromNetworkLocation(String networkLoc,
Set<String> excludeRacks,
Set<Node> excludeBookies,
Predicate<BookieNode> predicate,
Ensemble<BookieNode> ensemble,
boolean fallbackToRandom)

I think this implementation should not break the RackawarePolicy

@hangc0276
Copy link
Copy Markdown
Contributor

For RackAwarePolicy, it will try its best effort to pick bookies from different racks. When doing recovery, it is safe to replace bookies from the same rack.

In some extreme case, such as 5 bookies in 1 rack, 4 bookies down, 1 bookie survive. The recovery throughput would concentrate on this bookie. After recovery, it would result in data skew in this survived bookie. I think it is not really safe, maybe cause some problem. And when do recovery, it seems data locality is not exist. Because the recover bk client in ReplicationWorker is random distribution, it would read data from rack 2 and then recover data to rack 1. So it is not neccessary to ensure the replaced bookie in the same rack

@TakaHiR07 If we not ensure the replaced bookie in the same rack, it will break the placement policy in ensemble change and auto recovery.

@hangc0276 This pr only use the implementation in

public BookieNode selectFromNetworkLocation(Set<String> excludeRacks,
Set<Node> excludeBookies,
Predicate<BookieNode> predicate,
Ensemble<BookieNode> ensemble,
boolean fallbackToRandom)

instead of

public BookieNode selectFromNetworkLocation(String networkLoc,
Set<String> excludeRacks,
Set<Node> excludeBookies,
Predicate<BookieNode> predicate,
Ensemble<BookieNode> ensemble,
boolean fallbackToRandom)

I think this implementation should not break the RackawarePolicy

@TakaHiR07 The replaceBookie will be used in ensemble change and auto recovery.

@StevenLuMT
Copy link
Copy Markdown
Member

rerun failure checks

@TakaHiR07 TakaHiR07 closed this May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants