Avoid deletion of load/drop entry from CuratorLoadQueuePeon in case of load timeout #10213

Merged: 6 commits merged into apache:master on Mar 17, 2021

Conversation

@a2l007 (Contributor) commented on Jul 24, 2020

Fixes #10193.

CuratorLoadQueuePeon no longer deletes segment load/drop entries when druid.coordinator.load.timeout expires. Deleting these entries after a timeout can cause the balancer to work incorrectly, as described in the linked issue.

With this fix, the segment entries remain in a peon's load/drop queue until the ZK entry is deleted by the historical, unless a non-timeout exception occurs. This helps the balancer account for the actual queue size of each historical and can lead to better balancing decisions.
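
A minimal sketch of the behavior described above; the names roughly follow the patch but should be read as illustrative rather than as the exact diff:

// On failure, only remove the load/drop entry for genuine (non-timeout) errors.
// On a timeout, the entry stays queued and the segment is tracked as timed out;
// the historical may still finish the work and delete the ZK node, which is what
// ultimately cleans up the queue entry.
private void failAssign(SegmentHolder segmentHolder, boolean handleTimeout, Exception e)
{
  if (e != null) {
    log.error(e, "Server[%s], throwable caught when submitting [%s].", basePath, segmentHolder);
  }
  if (handleTimeout) {
    // Keep the entry in the load/drop queue so the balancer sees the real queue size.
    timedOutSegments.add(segmentHolder.getSegment());
  } else {
    // Non-timeout failure: remove the entry and run the failure callbacks as before.
    cleanUpAndRunCallbacks(segmentHolder); // hypothetical cleanup helper
  }
}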


This PR has:

  • been self-reviewed.
  • using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

stale bot commented on Oct 4, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Oct 4, 2020
stale bot commented on Nov 14, 2020

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Nov 14, 2020
@jihoonson jihoonson reopened this Jan 30, 2021
stale bot commented on Jan 30, 2021

This pull request/issue is no longer marked as stale.

@stale stale bot removed the stale label Jan 30, 2021
@clintropolis (Member) left a comment

This seems like a useful change. I tested it out and it does seem to alleviate the issue described in #10193 (comment).

@a2l007 any chance you can fix up the conflicts?

@@ -282,14 +297,14 @@ public void run()
() -> {
try {
if (curator.checkExists().forPath(path) != null) {
failAssign(segmentHolder, new ISE("%s was never removed! Failing this operation!", path));
failAssign(segmentHolder, true, new ISE("%s was never removed! Failing this operation!", path));
Member:

I think it would be worth clarifying this log message to indicate that, for load operations, while the coordinator has given up, the historical might still process and load the requested segments. Maybe something like "Load segments operation timed out, %s was never removed! Abandoning attempt (but these segments might still be loaded)". I guess the message would need to be adjusted based on whether it was a load or a drop.
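
As a rough illustration of that suggestion (the isLoad() accessor and the exact wording are hypothetical, not the code in this PR):

// Hypothetical sketch: branch the timeout message on the request type.
String message = segmentHolder.isLoad()
    ? "Load of segments timed out, %s was never removed! Abandoning the attempt,"
      + " but the historical may still load these segments."
    : "Drop of segments timed out, %s was never removed! Abandoning the attempt.";
failAssign(segmentHolder, true, new ISE(message, path));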

Contributor Author:

I've modified the message here. Please let me know if this works.

@clintropolis (Member) left a comment

👍

}

private void failAssign(SegmentHolder segmentHolder, Exception e)
private void failAssign(SegmentHolder segmentHolder, boolean handleTimeout, Exception e)
{
if (e != null) {
log.error(e, "Server[%s], throwable caught when submitting [%s].", basePath, segmentHolder);
Contributor:

I'm not sure why we don't currently emit exceptions (using EmittingLogger.makeAlert()), but should we? At least for the segment loading timeout error, it would be nice to emit those errors so that cluster operators can notice that something is going wrong with segment loading.
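
For context, a per-segment alert via EmittingLogger would look roughly like this, assuming log here is an EmittingLogger (a sketch only, not part of this PR):

// Emits an alert event to the configured emitter in addition to logging the error.
log.makeAlert(e, "Segment load/drop timed out on server[%s]", basePath)
   .addData("segment", segmentHolder.getSegment().getId())
   .emit();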

Contributor Author:

Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, slowness on the historical side could generate a large number of alerts on a fairly large cluster. What do you think?

Contributor Author:

Also, as a follow-up PR, I was planning to add the timed-out segment list to /druid/coordinator/v1/loadqueue, along with some docs about using it to understand cluster behavior.

Contributor:

> Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, slowness on the historical side could generate a large number of alerts on a fairly large cluster. What do you think?

I think it's a valid concern. We may be able to emit those exceptions in bulk if they are thrown within a short time frame. Even if we want that, I believe it should be done in a separate PR, so my comment is not a blocker for this one.

> Also, as a follow-up PR, I was planning to add the timed-out segment list to /druid/coordinator/v1/loadqueue, along with some docs about using it to understand cluster behavior.

Thanks. It sounds good to me.

loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
// Timed out segments need to be replicated in another server for faster availability
if (!serverHolder.getPeon().getTimedOutSegments().contains(segment)) {
loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
Contributor:

loadingSegments is no longer just the set of segments currently loading. Please add some javadoc in SegmentReplicantLookup about this.
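
Something along these lines could serve as that javadoc, assuming loadingSegments is a Guava Table keyed by segment id and tier (field type and wording are a suggestion only):

/**
 * Number of replicas of each segment currently queued for loading, per tier.
 * Segments whose load request has timed out on a server are deliberately not
 * counted here, so that the coordinator may schedule another replica elsewhere.
 */
private final Table<SegmentId, String, Integer> loadingSegments;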

Contributor:

As @himanshug pointed out in #10193 (comment), there could be two types of slow segment loading.

  • There are a few historicals being slow in segment loading in the cluster. This can be caused by unbalanced load queues or some intermittent failures.
  • Historicals are OK, but ingestion might outpace the ability to load segments.

This particular change in SegmentReplicantLookup could help in the former case, but make things worse in the latter. In an extreme case, all historicals could have the same set of timed-out segments in their load queue. That might still be OK, though, because if that happens Druid cannot get out of that state by itself anyway; the system administrator should add more historicals or use more threads for parallel segment loading. However, we should provide relevant data so that system administrators can tell what's happening. I left another comment about emitting exceptions to provide such data.

Contributor Author:

@jihoonson @himanshug Would it make sense to make the replication behavior user configurable? We could have a dynamic config like replicateAfterLoadTimeout that controls whether segments are attempted on a different historical after a load timeout on the current historical. The default could be true, but a cluster operator could set it to false if they wish to avoid the additional churn and know the historicals are OK and will eventually load the segments.
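
A sketch of how such a flag might plug into the check shown in the diff above (the getter name on the dynamic config is hypothetical):

// Count a timed-out segment as still loading unless the operator has opted in
// to re-replicating segments after a load timeout.
boolean countAsLoading = !dynamicConfig.isReplicateAfterLoadTimeout()
    || !serverHolder.getPeon().getTimedOutSegments().contains(segment);
if (countAsLoading) {
  loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
}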

Member:

Adding a config seems reasonable to me 👍

Contributor:

It sounds good to me too.

Contributor Author:

Added a config. I've set replicateAfterLoadTimeout to false as the default; I feel it is better to preserve the existing behaviour, and admins should understand this property's implications before setting it to true. Let me know what you think.

Contributor:

It sounds good to me to preserve the existing behavior by default.

@jihoonson (Contributor) left a comment

+1 after CI. Thanks @a2l007

@jihoonson jihoonson merged commit 3d7e7c2 into apache:master Mar 17, 2021
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Successfully merging this pull request may close these issues.

Balancer can work incorrectly in case of slow historicals or large number of segments to move