Avoid deletion of load/drop entry from CuratorLoadQueuePeon in case of load timeout #10213

Merged: 6 commits merged into apache:master on Mar 17, 2021

Conversation

@a2l007 (Contributor) commented on Jul 24, 2020

Fixes #10193.

CuratorLoadQueuePeon no longer deletes segment load/drop entries when druid.coordinator.load.timeout expires. Deleting these entries after a timeout can cause the balancer to work incorrectly, as described in the linked issue.

With this fix, the segment entries remain in a peon's load/drop queue until the ZK entry is deleted by the historical, unless a non-timeout exception occurs. This helps the balancer account for the actual queue size of each historical and can lead to better balancing decisions.
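
A minimal sketch of the behavior described above; the names roughly follow the patch but should be read as illustrative rather than as the exact diff:

// On failure, only remove the load/drop entry for genuine (non-timeout) errors.
// On a timeout, the entry stays queued and the segment is tracked as timed out;
// the historical may still finish the work and delete the ZK node, which is what
// ultimately cleans up the queue entry.
private void failAssign(SegmentHolder segmentHolder, boolean handleTimeout, Exception e)
{
  if (e != null) {
    log.error(e, "Server[%s], throwable caught when submitting [%s].", basePath, segmentHolder);
  }
  if (handleTimeout) {
    // Keep the entry in the load/drop queue so the balancer sees the real queue size.
    timedOutSegments.add(segmentHolder.getSegment());
  } else {
    // Non-timeout failure: remove the entry and run the failure callbacks as before.
    cleanUpAndRunCallbacks(segmentHolder); // hypothetical cleanup helper
  }
}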


This PR has:

  • been self-reviewed.
  • using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

stale bot commented on Oct 4, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Oct 4, 2020
stale bot commented on Nov 14, 2020

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Nov 14, 2020
@jihoonson jihoonson reopened this Jan 30, 2021
stale bot commented on Jan 30, 2021

This pull request/issue is no longer marked as stale.

@stale stale bot removed the stale label Jan 30, 2021
@clintropolis (Member) left a comment

This seems like a useful change. I tested it out and it does seem to alleviate the issue described in #10193 (comment).

@a2l007 any chance you can fix up the conflicts?

@@ -282,14 +297,14 @@ public void run()
() -> {
try {
if (curator.checkExists().forPath(path) != null) {
failAssign(segmentHolder, new ISE("%s was never removed! Failing this operation!", path));
failAssign(segmentHolder, true, new ISE("%s was never removed! Failing this operation!", path));
Member:

I think it would be worth clarifying this log message to indicate that, for load operations, while the coordinator has given up, the historical might still process and load the requested segments. Maybe something like "Load segments operation timed out, %s was never removed! Abandoning attempt (but these segments might still be loaded)". I guess the message would need to be adjusted based on whether it was a load or a drop.
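
As a rough illustration of that suggestion (the isLoad() accessor and the exact wording are hypothetical, not the code in this PR):

// Hypothetical sketch: branch the timeout message on the request type.
String message = segmentHolder.isLoad()
    ? "Load of segments timed out, %s was never removed! Abandoning the attempt,"
      + " but the historical may still load these segments."
    : "Drop of segments timed out, %s was never removed! Abandoning the attempt.";
failAssign(segmentHolder, true, new ISE(message, path));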

Contributor Author:

I've modified the message here. Please let me know if this works.

@clintropolis (Member) left a comment

👍

}

private void failAssign(SegmentHolder segmentHolder, Exception e)
private void failAssign(SegmentHolder segmentHolder, boolean handleTimeout, Exception e)
{
if (e != null) {
log.error(e, "Server[%s], throwable caught when submitting [%s].", basePath, segmentHolder);
Contributor:

I'm not sure why we don't currently emit exceptions (using EmittingLogger.makeAlert()), but should we? At least for the segment loading timeout error, it would be nice to emit those errors so that cluster operators can notice that something is going wrong with segment loading.
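
For context, a per-segment alert via EmittingLogger would look roughly like this, assuming log here is an EmittingLogger (a sketch only, not part of this PR):

// Emits an alert event to the configured emitter in addition to logging the error.
log.makeAlert(e, "Segment load/drop timed out on server[%s]", basePath)
   .addData("segment", segmentHolder.getSegment().getId())
   .emit();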

Contributor Author:

Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, slowness on the historical side could generate a large number of alerts on a fairly large cluster. What do you think?

Contributor Author:

Also, as a follow-up PR, I was planning to add the timed-out segment list to /druid/coordinator/v1/loadqueue, along with some docs about using it to understand cluster behavior.

Contributor:

> Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, slowness on the historical side could generate a large number of alerts on a fairly large cluster. What do you think?

I think it's a valid concern. We may be able to emit those exceptions in bulk if they are thrown within a short time frame. Even if we want that, I believe it should be done in a separate PR, so my comment is not a blocker for this one.

> Also, as a follow-up PR, I was planning to add the timed-out segment list to /druid/coordinator/v1/loadqueue, along with some docs about using it to understand cluster behavior.

Thanks. It sounds good to me.

loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
// Timed out segments need to be replicated in another server for faster availability
if (!serverHolder.getPeon().getTimedOutSegments().contains(segment)) {
loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
Contributor:

loadingSegments is no longer just the set of segments currently loading. Please add some javadoc in SegmentReplicantLookup about this.
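
Something along these lines could serve as that javadoc, assuming loadingSegments is a Guava Table keyed by segment id and tier (field type and wording are a suggestion only):

/**
 * Number of replicas of each segment currently queued for loading, per tier.
 * Segments whose load request has timed out on a server are deliberately not
 * counted here, so that the coordinator may schedule another replica elsewhere.
 */
private final Table<SegmentId, String, Integer> loadingSegments;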

Contributor:

As @himanshug pointed out in #10193 (comment), there could be two types of slow segment loading.

  • There are a few historicals being slow in segment loading in the cluster. This can be caused by unbalanced load queues or some intermittent failures.
  • Historicals are OK, but ingestion might outpace the ability to load segments.

This particular change in SegmentReplicantLookup could help in the former case, but make things worse in the latter. In an extreme case, all historicals could have the same set of timed-out segments in their load queue. That might still be OK, though, because if that happens Druid cannot get out of that state by itself anyway; the system administrator should add more historicals or use more threads for parallel segment loading. However, we should provide relevant data so that system administrators can tell what's happening. I left another comment about emitting exceptions to provide such data.

Contributor Author:

@jihoonson @himanshug Would it make sense to make the replication behavior user configurable? We could have a dynamic config like replicateAfterLoadTimeout that controls whether segments are attempted on a different historical after a load timeout on the current historical. The default could be true, but a cluster operator could set it to false if they wish to avoid the additional churn and know the historicals are OK and will eventually load the segments.
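
A sketch of how such a flag might plug into the check shown in the diff above (the getter name on the dynamic config is hypothetical):

// Count a timed-out segment as still loading unless the operator has opted in
// to re-replicating segments after a load timeout.
boolean countAsLoading = !dynamicConfig.isReplicateAfterLoadTimeout()
    || !serverHolder.getPeon().getTimedOutSegments().contains(segment);
if (countAsLoading) {
  loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
}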

Member:

Adding a config seems reasonable to me 👍

Contributor:

It sounds good to me too.

Contributor Author:

Added a config. I've set replicateAfterLoadTimeout to false as the default; I feel it is better to preserve the existing behaviour, and admins should understand this property's implications before setting it to true. Let me know what you think.

Contributor:

It sounds good to me to preserve the existing behavior by default.

@jihoonson (Contributor) left a comment

+1 after CI. Thanks @a2l007

@jihoonson jihoonson merged commit 3d7e7c2 into apache:master Mar 17, 2021
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Successfully merging this pull request may close these issues.

Balancer can work incorrectly in case of slow historicals or large number of segments to move