
Coordinator rule runner and balancer can trample on each other, often producing unstable/undefined results, particularly when the cluster is heavily imbalanced #5521

Closed
clintropolis opened this issue Mar 22, 2018 · 1 comment

Comments

@clintropolis (Member)

TL;DR: The deeper the load queue, the less likely a historical is to actually retain the segments it is told to load, due to a couple of bugs and competing coordinator logic. Assigning and moving segments is done by the balancer, which makes placement decisions taking into account the load and drop queues of servers. Dropping segments, however, considers only the ideal state of all historicals (currently bytes loaded + bytes in the load queue), producing quirky, undefined behavior when trying to balance a very imbalanced cluster (e.g. after replacing a failed historical or adding a new historical to the cluster) if that value surpasses the size of a 'typical' loaded historical.

A potential workaround is to set maxSegmentsInNodeLoadingQueue in the coordinator dynamic config to a value small enough that the size actually loaded plus the size of the load queue is less likely to exceed the size of a 'typical' historical. Reducing maxSegmentsToMove can also have a positive effect.
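For reference, both settings are part of the coordinator dynamic config, which can be POSTed to /druid/coordinator/v1/config. The snippet below is illustrative only; the values should be tuned to the cluster:

{
  "maxSegmentsToMove": 5,
  "maxSegmentsInNodeLoadingQueue": 100
}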

This was a fun one to hunt down! We ran into an issue where a cluster had replaced a historical server, but after a time began to emit 'stuck segment' log messages. This was not a production cluster, so we had the rare luxury of investigating a bit deeper than would normally be possible, and noticed that the coordinator was constantly telling this historical to load, and then almost immediately drop, the same segments it had just loaded, every cycle.

The basic pattern played out like this in the coordinator logs:

2018-03-22T01:22:35,505 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - Moving [some_segment] from [happy.historical.internal:8283] to [unlucky.historical.internal:8283]
2018-03-22T01:22:35,505 INFO [Coordinator-Exec--0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/unlucky.historical.internal:8283] to load segment[some_segment]
...
2018-03-22T01:22:55,646 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] loading [some_segment]
2018-03-22T01:22:55,646 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] processing segment[some_segment]
...
2018-03-22T01:22:58,422 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] done processing [/druid/loadQueue/unlucky.historical.internal:8283/some_segment]

Then the next coordinator run:

...
2018-03-22T01:23:05,542 INFO [Coordinator-Exec--0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/unlucky.historical.internal:8283] to drop segment[some_segment]
2018-03-22T01:23:08,327 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] dropping [some_segment]
2018-03-22T01:23:08,327 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] processing segment[some_segment]

This happened for many segments. Looking deeper, the first issue we noticed is that the segment move code in the cost balancer is partially broken: segments fail to be dropped from the source server after successfully loading on the destination server, as the design intends. There is a mismatch between how BatchDataSegmentAnnouncer announces a set of segments under a single, partially random zk node path, and the assumption AbstractCuratorServerInventoryView.isSegmentLoadedByServer makes that each segment is advertised in a zk node named by its identifier. Because of this, execution never reaches the logic that drops the segment after a move, so the load happens but the drop cannot (a sketch of the mismatch follows).
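To illustrate the mismatch (a sketch only, not the actual Druid code; the znode layout and class shown here are assumptions): the inventory view effectively checks for a child znode named after the segment identifier, while the batch announcer writes children with partially random, counter-based names that each contain many segments, so the existence check can never match.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.utils.ZKPaths;

// Illustrative sketch only -- not the real Druid implementation. It shows why a
// per-segment znode existence check cannot see segments that were announced in
// batches under opaque znode names.
public class SegmentAnnouncementMismatchSketch
{
  // The inventory-view side effectively assumes one child znode per segment,
  // named by the segment identifier (assumed layout, for illustration).
  public static boolean isSegmentLoadedByServer(
      CuratorFramework curator,
      String serverInventoryPath, // e.g. /druid/segments/unlucky.historical.internal:8283
      String segmentId
  )
  {
    try {
      final String path = ZKPaths.makePath(serverInventoryPath, segmentId);
      return curator.checkExists().forPath(path) != null;
    }
    catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // The batch announcer, however, creates children with partially random,
  // counter-based names, each holding a payload that describes many segments,
  // so the per-segment path checked above never exists and the post-move drop
  // logic is never reached.
}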

On the next coordinator run, the rule runner notices that this segment is now over-replicated and decides that one of the replicas needs to be dropped.

However, the logic that decides which server to drop from chooses the server with the least space available, and unfortunately this calculation includes the load queue. In an already balanced cluster this isn't as painful, but when filling a historical that has only a small number of segments actually loaded, it means that if the bytes loaded plus the bytes queued on the mostly empty server add up to more than the bytes served by the historical the segment was moved from, the segment will be dropped from the mostly empty server (see the sketch after the log lines below). We confirmed that this was indeed the case with the sizes reported in the logs:

2018-03-22T01:23:05,569 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[happy.historical.internal:8283, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 517,477,739,359 bytes served.
2018-03-22T01:23:05,569 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[unlucky.historical.internal:8283, historical, _default_tier] has 1,957 left to load, 10 left to drop, 545,943,116,991 bytes queued, 28,668,295,253 bytes served.
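Working through those numbers: the unlucky server's 28,668,295,253 bytes served plus 545,943,116,991 bytes queued comes to roughly 574 GB, which exceeds the 517,477,739,359 bytes served by the happy server, so the drop logic sees the nearly empty server as the "fullest" one and drops the freshly moved replica from it.

A minimal sketch of that selection rule, with hypothetical class and field names (only the "drop from the server with the least available space, counting the load queue" behavior is taken from the observation above):

import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the problematic drop-target selection. Queued but not
// yet loaded bytes are counted as if they were already served, which makes a
// nearly empty but heavily queued historical look like the fullest server.
public class DropTargetSketch
{
  static class ServerStats
  {
    final String name;
    final long maxSizeBytes;
    final long servedBytes; // bytes of segments actually loaded
    final long queuedBytes; // bytes of segments still waiting in the load queue

    ServerStats(String name, long maxSizeBytes, long servedBytes, long queuedBytes)
    {
      this.name = name;
      this.maxSizeBytes = maxSizeBytes;
      this.servedBytes = servedBytes;
      this.queuedBytes = queuedBytes;
    }

    long availableBytes()
    {
      return maxSizeBytes - (servedBytes + queuedBytes);
    }
  }

  // Drop the replica from the server with the least available space.
  static ServerStats pickDropTarget(List<ServerStats> serversHoldingReplica)
  {
    return serversHoldingReplica.stream()
        .min(Comparator.comparingLong(ServerStats::availableBytes))
        .orElseThrow(IllegalStateException::new);
  }

  public static void main(String[] args)
  {
    // Sizes from the coordinator log above; a 700 GB max size is assumed for both servers.
    ServerStats happy = new ServerStats("happy", 700_000_000_000L, 517_477_739_359L, 0L);
    ServerStats unlucky = new ServerStats("unlucky", 700_000_000_000L, 28_668_295_253L, 545_943_116_991L);
    // Prints "unlucky": the nearly empty server is chosen, so the segment it just loaded is dropped.
    System.out.println(pickDropTarget(List.of(happy, unlucky)).name);
  }
}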

Now of course the balancer, wanting to balance the cluster and fill up the still mostly empty historical, will decide to move that segment (or others) back to it, repeating the process, possibly with different segments each time. The number of segments affected per pair of coordinator runs is bounded by the maxSegmentsToMove configuration.

This is very likely the cause of a wide range of totally strange coordinator behavior, from stuck and slow balancing, to overly large load queues, etc.

The assign-segments part of the rule runner delegates the decision of where to place a segment to the balancer; the logic that decides which server should drop over-replicated segments should do the same. Additionally, the segment drop at the end of a move should either actually work, or be removed in favor of an explicit two-pass strategy like the one that is effectively happening today. A rough sketch of the delegation idea follows.
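One possible shape of that delegation, sketched as a hypothetical extension of the balancer strategy (the interface, method names, and signatures below are assumptions for illustration, not the actual Druid interfaces):

import java.util.List;

// Hypothetical sketch: let the balancer strategy choose drop targets the same way
// it already chooses placement targets, so both decisions share the same
// cost- and queue-aware view of the cluster. All names here are illustrative.
public interface BalancerStrategySketch
{
  // Existing responsibility (conceptually): choose where to assign or move a segment.
  ServerCandidate findServerToLoadSegment(String segmentId, List<ServerCandidate> candidates);

  // Proposed responsibility: given the servers currently holding replicas of an
  // over-replicated segment, choose which of them should drop it, instead of
  // applying a raw "least available space, including the load queue" rule.
  List<ServerCandidate> pickServersToDrop(String segmentId, List<ServerCandidate> replicaHolders);

  // Minimal stand-in for a coordinator server holder; real code would also carry
  // load/drop queue state, maximum size, and currently served segments.
  interface ServerCandidate
  {
    String name();
    long availableBytes();
  }
}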

clintropolis added a commit to clintropolis/druid that referenced this issue Mar 23, 2018
jihoonson pushed a commit that referenced this issue Mar 29, 2018
jon-wei pushed a commit to implydata/druid-public that referenced this issue Apr 4, 2018
gianm pushed a commit to implydata/druid-public that referenced this issue May 16, 2018
jihoonson pushed a commit to jihoonson/druid that referenced this issue Jul 5, 2018
fjy pushed a commit that referenced this issue Jul 6, 2018
leventov pushed a commit to metamx/druid that referenced this issue Jul 20, 2018
jihoonson added this to the 0.12.2 milestone Aug 6, 2018
@jihoonson (Contributor)

Looks like we've fixed this issue. We can reopen it if we run into this again. Closing now.
