Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix over-replication caused by balancing when inventory is not updated yet #13114

Merged
merged 18 commits into from
Sep 29, 2022

Conversation

kfaraz
Copy link
Contributor

@kfaraz kfaraz commented Sep 18, 2022

Fixes item (1) in #12881

Description

The current implementation of segment balancing often leads to over-replication.

This can happen in the following (fairly common) scenario:

  • Segment is moving from server A to B for balancing
  • Load succeeds on B, callback is invoked to finish the move
  • If inventory view shows the segment loaded on B, drop on A is invoked
  • If inventory view for B is not updated yet, the drop on A is not invoked

This is a frequent occurrence but under normal circumstances, it is not noticeable
because load rules quickly drop the over-replicated segment from A. But if load rules get
stuck for some reason (e.g. trying to reach target replication levels on a different tier),
the number of over-replicated segments keeps increasing, thereby overloading historicals.

over_replication

Changes

  1. For http loading, do not check the server inventory to determine if the drop should be invoked.
    The HTTP response status is enough to determine load success or failure.
  2. For zk-based loading,
    • continue to check the inventory because behaviour of zk loading is a little more error prone
    • remove path check in the callback as it is redundant given that the inventory is already being checked
    • some minor cleanup in the peon
  3. Fix the corresponding test in SegmentLoadingNegativeTest

Main classes to review:

  • LoadPeonCallback
  • DruidCoordinator
  • CuratorLoadQueuePeon
  • HttpLoadQueuePeon
  • minor refactor in BalanceSegments

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz changed the base branch from coordinator_test_fwrk to master September 21, 2022 04:29
* Fix Apache #12881 to fix this test.
*/
@Test
public void testBalancingWithStaleInventoryCausesOverReplication()
Copy link
Contributor Author

@kfaraz kfaraz Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this test to SegmentLoadingTest.testBalancingWithStaleViewDoesNotOverReplicate()

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment, otherwise LGTM

Comment on lines +295 to +306
ConcurrentMap<SegmentId, BalancerSegmentHolder> movingSegments =
currentlyMovingSegments.get(toServer.getTier());
movingSegments.put(segmentId, segment);
final LoadPeonCallback callback = moveSuccess -> movingSegments.remove(segmentId);
try {
ConcurrentMap<SegmentId, BalancerSegmentHolder> movingSegments =
currentlyMovingSegments.get(toServer.getTier());
movingSegments.put(segmentId, segment);
callback = () -> movingSegments.remove(segmentId);
coordinator.moveSegment(
params,
fromServer,
toServer,
segmentToMove,
callback
);
coordinator
.moveSegment(params, fromServer, toServer, segmentToMove, callback);
return true;
}
catch (Exception e) {
log.makeAlert(e, StringUtils.format("[%s] : Moving exception", segmentId)).emit();
if (callback != null) {
callback.execute();
}
log.makeAlert(e, "[%s] : Moving exception", segmentId).emit();
callback.execute(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes fixing something or just style? It looks like just style trying to remove the null check from the callback by moving the work of building it outside of the try is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just a style change for readability.

@kfaraz
Copy link
Contributor Author

kfaraz commented Sep 29, 2022

Thanks a lot for the review, @imply-cheddar !

@abhishekagarwal87 abhishekagarwal87 merged commit ce5f55e into apache:master Sep 29, 2022
@kfaraz kfaraz deleted the fix_balance_overrep branch September 29, 2022 07:15
@kfaraz kfaraz added this to the 25.0 milestone Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants