Avoid deletion of load/drop entry from CuratorLoadQueuePeon in case of load timeout #10213
Conversation
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This pull request/issue is no longer marked as stale.
This seems like a useful change. I tested it out and it seems to alleviate the issue described in #10193 (comment).
@a2l007 any chance you can fix up conflicts?
@@ -282,14 +297,14 @@ public void run()
        () -> {
          try {
            if (curator.checkExists().forPath(path) != null) {
-             failAssign(segmentHolder, new ISE("%s was never removed! Failing this operation!", path));
+             failAssign(segmentHolder, true, new ISE("%s was never removed! Failing this operation!", path));
I think it would be worth clarifying this log message to indicate that, for load operations, while the coordinator has given up, the historical might still process and load the requested segments. Maybe something like "Load segments operation timed out, %s was never removed! Abandoning attempt (but these segments might still be loaded)". I guess the message would need to be adjusted based on whether it was a load or a drop.
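A minimal sketch of what such a load/drop-aware message might look like. The class and method names are hypothetical, purely for illustration; this is not the wording the PR actually adopted.

```java
class TimeoutMessageSketch {
    // The verb differs for load vs drop; only a load may still complete on the
    // historical after the coordinator has given up, so only that variant gets
    // the extra caveat.
    static String timeoutMessage(String path, boolean isLoad) {
        String verb = isLoad ? "Load" : "Drop";
        String suffix = isLoad
            ? " (but the historical may still load these segments)."
            : ".";
        return verb + " segments operation timed out, " + path
            + " was never removed! Abandoning the attempt" + suffix;
    }
}
```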
I've modified the message here. Please let me know if this works.
👍
            }
-  private void failAssign(SegmentHolder segmentHolder, Exception e)
+  private void failAssign(SegmentHolder segmentHolder, boolean handleTimeout, Exception e)
   {
     if (e != null) {
       log.error(e, "Server[%s], throwable caught when submitting [%s].", basePath, segmentHolder);
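A tiny, self-contained model of the contract the new `handleTimeout` flag appears to introduce: on a timeout the queue entry is kept and the segment is tracked as timed out, while any other failure still drops the entry. The class and field names here are illustrative only, not Druid's actual `CuratorLoadQueuePeon` code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the reworked failAssign behavior.
class PeonSketch {
    final Set<String> queue = new HashSet<>();            // pending load/drop entries
    final Set<String> timedOutSegments = new HashSet<>(); // segments whose load timed out

    void failAssign(String segmentId, boolean handleTimeout) {
        if (handleTimeout) {
            // Timeout: keep the entry so the balancer sees the real queue size,
            // and remember the segment so replication logic can treat it specially.
            timedOutSegments.add(segmentId);
        } else {
            // Non-timeout failure: abandon the entry as before.
            queue.remove(segmentId);
        }
    }
}
```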
I'm not sure why we don't currently emit exceptions (using EmittingLogger.makeAlert()), but should we? At least for the segment loading timeout error, it would be nice to emit those errors so that cluster operators can notice something is going wrong with segment loading.
Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, slowness on the historical side could generate a large number of alerts in a fairly large cluster. What do you think?
Also, as a follow-up PR, I was planning to add the timedOut segment list to /druid/coordinator/v1/loadqueue, along with some docs about its usage in understanding cluster behavior.
> Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, a slowness on the historical side can generate a large number of alerts for a fairly large cluster. What do you think?
I think it's a valid concern. We may be able to emit those exceptions in bulk if they are thrown within a short time frame. I believe this should be done in a separate PR if we decide to do it, so my comment is not a blocker for this PR.
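A sketch of the bulk-emission idea discussed above: buffer timed-out segment ids and emit a single alert once a time window elapses, rather than one alert per segment. All names are hypothetical, and Druid's EmittingLogger is deliberately not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batcher: one alert per time window instead of one per segment.
class AlertBatcher {
    private final long windowMillis;
    private final List<String> buffer = new ArrayList<>();
    private long windowStart = -1;
    int alertsEmitted = 0; // stands in for EmittingLogger.makeAlert() calls

    AlertBatcher(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    void onTimeout(String segmentId, long nowMillis) {
        if (windowStart < 0) {
            windowStart = nowMillis;
        }
        buffer.add(segmentId);
        if (nowMillis - windowStart >= windowMillis) {
            // Flush: a single alert covering every segment buffered in the window.
            alertsEmitted++;
            buffer.clear();
            windowStart = -1;
        }
    }
}
```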
> Also as a followup PR I was planning to add the timedOut segment list to the /druid/coordinator/v1/loadqueue along with some docs about its usage in understanding the cluster behavior.
Thanks. It sounds good to me.
-      loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
+      // Timed out segments need to be replicated in another server for faster availability
+      if (!serverHolder.getPeon().getTimedOutSegments().contains(segment)) {
+        loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
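A simplified model of what this diff appears to do: a segment whose load timed out on a server is excluded from the "currently loading" count, so the coordinator may schedule an extra replica elsewhere. The real SegmentReplicantLookup keys by (segment, tier); this sketch collapses that to a per-segment count, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified replicant-count bookkeeping.
class ReplicantLookupSketch {
    final Map<String, Integer> loadingCounts = new HashMap<>();

    void countLoading(String segmentId, Set<String> timedOutOnServer) {
        // Timed-out segments are not counted as loading, which prompts the
        // coordinator to consider replicating them on another server.
        if (!timedOutOnServer.contains(segmentId)) {
            loadingCounts.merge(segmentId, 1, Integer::sum);
        }
    }
}
```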
loadingSegments is not just a set of loading segments anymore. Please add some javadoc in SegmentReplicantLookup about this.
As @himanshug pointed out in #10193 (comment), there could be two types of slow segment loading.
- There are a few historicals being slow in segment loading in the cluster. This can be caused by unbalanced load queues or some intermittent failures.
- Historicals are OK, but ingestion might outpace the ability to load segments.
This particular change in SegmentReplicantLookup could help in the former case, but make things worse in the latter. In an extreme case, all historicals could have the same set of timed-out segments in their load queues. This might still be OK though, because, if that's the case, Druid cannot get out of that state by itself anyway. The system administrator should add more historicals or use more threads for parallel segment loading. However, we should provide relevant data so that system administrators can tell what's happening. I left another comment about emitting exceptions to provide such data.
@jihoonson @himanshug Would it make sense to make the replication behavior user configurable? We could have a dynamic config like replicateAfterLoadTimeout, which would control whether segments should be replicated to a different historical when a load times out on the current one. The default could be true, but a cluster operator could set it to false if they wish to avoid the additional churn, know the historicals are OK, and expect them to eventually load the segments.
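The proposed gate can be sketched in a few lines. The config name replicateAfterLoadTimeout comes from the discussion; the class and method around it are hypothetical illustration, not Druid's CoordinatorDynamicConfig.

```java
// Hypothetical model of the proposed dynamic-config gate.
class DynamicConfigSketch {
    final boolean replicateAfterLoadTimeout;

    DynamicConfigSketch(boolean replicateAfterLoadTimeout) {
        this.replicateAfterLoadTimeout = replicateAfterLoadTimeout;
    }

    // Replicate a segment elsewhere only if its load timed out AND the
    // operator opted in; with the flag off, pre-PR behavior is preserved.
    boolean shouldReplicate(boolean segmentTimedOut) {
        return segmentTimedOut && replicateAfterLoadTimeout;
    }
}
```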
Adding a config seems reasonable to me 👍
It sounds good to me too.
Added a config. I've set replicateAfterLoadTimeout to false as the default; I feel it might be better to preserve the existing behaviour, and admins should be aware of this property's behavior before setting it to true. Let me know what you think.
It sounds good to me to preserve the existing behavior by default.
+1 after CI. Thanks @a2l007
Fixes #10193.
CuratorLoadQueuePeon no longer deletes segment load/drop entries when druid.coordinator.load.timeout expires. Deleting these entries after a timeout can cause the balancer to work incorrectly, as described in the linked issue. With this fix, segment entries remain in a peon's load/drop queue until the ZK entry is deleted by the historical, unless a non-timeout-related exception occurs. This helps the balancer account for the actual queue size of each historical and can lead to better balancing decisions.
This PR has: