
Coordinator rule runner and balancer can trample on each other, often producing unstable/undefined results, particularly when the cluster is heavily imbalanced #5521

Closed
clintropolis opened this issue Mar 22, 2018 · 1 comment

Comments

@clintropolis (Member)

TL;DR: The deeper the load queue, the less likely a historical is to actually retain the segments it is told to load, due to a couple of bugs and competing coordinator logic. Assigning and moving segments is done by the balancer, which makes placement decisions taking into account the load and drop queues of servers. Dropping segments, however, considers only the ideal state of all historicals (currently bytes loaded + bytes in the load queue), producing quirky, undefined behavior when trying to balance a very imbalanced cluster (e.g. after replacing a failed historical or adding a new historical to the cluster) if that value surpasses the size of a 'typical' loaded historical.

A potential workaround is to set maxSegmentsInNodeLoadingQueue in the coordinator dynamic config to a value small enough that the size actually loaded plus the size of the load queue is less likely to exceed the size of a 'typical' historical. Reducing maxSegmentsToMove can also have a positive effect.
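For reference, both settings are part of the coordinator dynamic config, which can be POSTed to /druid/coordinator/v1/config. The snippet below is illustrative only; the values should be tuned to the cluster:

{
  "maxSegmentsToMove": 5,
  "maxSegmentsInNodeLoadingQueue": 100
}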

This was a fun one to hunt down! We ran into an issue where a cluster had replaced a historical server, but after a time began to emit 'stuck segment' log messages. This was not a production cluster, so we had the rare luxury of investigating a bit deeper than would normally be possible, and noticed that the coordinator was constantly telling this historical to load, and then almost immediately drop, the same segments it had just loaded, every cycle.

The basic pattern played out like this in the coordinator logs:

2018-03-22T01:22:35,505 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorBalancer - Moving [some_segment] from [happy.historical.internal:8283] to [unlucky.historical.internal:8283]
2018-03-22T01:22:35,505 INFO [Coordinator-Exec--0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/unlucky.historical.internal:8283] to load segment[some_segment]
...
2018-03-22T01:22:55,646 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] loading [some_segment]
2018-03-22T01:22:55,646 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] processing segment[some_segment]
...
2018-03-22T01:22:58,422 INFO [main-EventThread] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] done processing [/druid/loadQueue/unlucky.historical.internal:8283/some_segment]

Then the next coordinator run:

...
2018-03-22T01:23:05,542 INFO [Coordinator-Exec--0] io.druid.server.coordinator.LoadQueuePeon - Asking server peon[/druid/loadQueue/unlucky.historical.internal:8283] to drop segment[some_segment]
2018-03-22T01:23:08,327 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] dropping [some_segment]
2018-03-22T01:23:08,327 INFO [Master-PeonExec--0] io.druid.server.coordinator.LoadQueuePeon - Server[/druid/loadQueue/unlucky.historical.internal:8283] processing segment[some_segment]

This happened for many segments. Looking deeper, the first issue we noticed is that the segment move code in the cost balancer is partially broken: segments fail to be dropped from the source server after successfully loading on the destination server, as the design intends. There is a mismatch between how BatchDataSegmentAnnouncer announces a set of segments under a single, partially random zk node path, and the assumption AbstractCuratorServerInventoryView.isSegmentLoadedByServer makes that each segment is advertised in a zk node named by its identifier. Because of this, execution never reaches the logic that drops the segment after a move, so the load happens but the drop cannot (a sketch of the mismatch follows).
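To illustrate the mismatch (a sketch only, not the actual Druid code; the znode layout and class shown here are assumptions): the inventory view effectively checks for a child znode named after the segment identifier, while the batch announcer writes children with partially random, counter-based names that each contain many segments, so the existence check can never match.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.utils.ZKPaths;

// Illustrative sketch only -- not the real Druid implementation. It shows why a
// per-segment znode existence check cannot see segments that were announced in
// batches under opaque znode names.
public class SegmentAnnouncementMismatchSketch
{
  // The inventory-view side effectively assumes one child znode per segment,
  // named by the segment identifier (assumed layout, for illustration).
  public static boolean isSegmentLoadedByServer(
      CuratorFramework curator,
      String serverInventoryPath, // e.g. /druid/segments/unlucky.historical.internal:8283
      String segmentId
  )
  {
    try {
      final String path = ZKPaths.makePath(serverInventoryPath, segmentId);
      return curator.checkExists().forPath(path) != null;
    }
    catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // The batch announcer, however, creates children with partially random,
  // counter-based names, each holding a payload that describes many segments,
  // so the per-segment path checked above never exists and the post-move drop
  // logic is never reached.
}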

On the next coordinator run, the rule runner notices that this segment is now over-replicated and decides that one of the replicas needs to be dropped.

However, the logic that decides which server to drop from chooses the server with the least space available, and unfortunately this calculation includes the load queue. In an already balanced cluster this isn't as painful, but when filling a historical that has only a small number of segments actually loaded, it means that if the bytes loaded plus the bytes queued on the mostly empty server add up to more than the bytes served by the historical the segment was moved from, the segment will be dropped from the mostly empty server (see the sketch after the log lines below). We confirmed that this was indeed the case with the sizes reported in the logs:

2018-03-22T01:23:05,569 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[happy.historical.internal:8283, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 517,477,739,359 bytes served.
2018-03-22T01:23:05,569 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[unlucky.historical.internal:8283, historical, _default_tier] has 1,957 left to load, 10 left to drop, 545,943,116,991 bytes queued, 28,668,295,253 bytes served.
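Working through those numbers: the unlucky server's 28,668,295,253 bytes served plus 545,943,116,991 bytes queued comes to roughly 574 GB, which exceeds the 517,477,739,359 bytes served by the happy server, so the drop logic sees the nearly empty server as the "fullest" one and drops the freshly moved replica from it.

A minimal sketch of that selection rule, with hypothetical class and field names (only the "drop from the server with the least available space, counting the load queue" behavior is taken from the observation above):

import java.util.Comparator;
import java.util.List;

// Illustrative sketch of the problematic drop-target selection. Queued but not
// yet loaded bytes are counted as if they were already served, which makes a
// nearly empty but heavily queued historical look like the fullest server.
public class DropTargetSketch
{
  static class ServerStats
  {
    final String name;
    final long maxSizeBytes;
    final long servedBytes; // bytes of segments actually loaded
    final long queuedBytes; // bytes of segments still waiting in the load queue

    ServerStats(String name, long maxSizeBytes, long servedBytes, long queuedBytes)
    {
      this.name = name;
      this.maxSizeBytes = maxSizeBytes;
      this.servedBytes = servedBytes;
      this.queuedBytes = queuedBytes;
    }

    long availableBytes()
    {
      return maxSizeBytes - (servedBytes + queuedBytes);
    }
  }

  // Drop the replica from the server with the least available space.
  static ServerStats pickDropTarget(List<ServerStats> serversHoldingReplica)
  {
    return serversHoldingReplica.stream()
        .min(Comparator.comparingLong(ServerStats::availableBytes))
        .orElseThrow(IllegalStateException::new);
  }

  public static void main(String[] args)
  {
    // Sizes from the coordinator log above; a 700 GB max size is assumed for both servers.
    ServerStats happy = new ServerStats("happy", 700_000_000_000L, 517_477_739_359L, 0L);
    ServerStats unlucky = new ServerStats("unlucky", 700_000_000_000L, 28_668_295_253L, 545_943_116_991L);
    // Prints "unlucky": the nearly empty server is chosen, so the segment it just loaded is dropped.
    System.out.println(pickDropTarget(List.of(happy, unlucky)).name);
  }
}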

Now of course the balancer, wanting to balance the cluster and fill up the still mostly empty historical, will decide to move that segment (or others) back to it, repeating the process, possibly with different segments each time. The number of segments affected per pair of coordinator runs is bounded by the maxSegmentsToMove configuration.

This is very likely the cause of a wide range of totally strange coordinator behavior, from stuck and slow balancing, to overly large load queues, etc.

The assign-segments part of the rule runner delegates the decision of where to place a segment to the balancer; the logic that decides which server should drop over-replicated segments should do the same. Additionally, the segment drop at the end of a move should either actually work, or be removed in favor of an explicit two-pass strategy like the one that is effectively happening today. A rough sketch of the delegation idea follows.
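One possible shape of that delegation, sketched as a hypothetical extension of the balancer strategy (the interface, method names, and signatures below are assumptions for illustration, not the actual Druid interfaces):

import java.util.List;

// Hypothetical sketch: let the balancer strategy choose drop targets the same way
// it already chooses placement targets, so both decisions share the same
// cost- and queue-aware view of the cluster. All names here are illustrative.
public interface BalancerStrategySketch
{
  // Existing responsibility (conceptually): choose where to assign or move a segment.
  ServerCandidate findServerToLoadSegment(String segmentId, List<ServerCandidate> candidates);

  // Proposed responsibility: given the servers currently holding replicas of an
  // over-replicated segment, choose which of them should drop it, instead of
  // applying a raw "least available space, including the load queue" rule.
  List<ServerCandidate> pickServersToDrop(String segmentId, List<ServerCandidate> replicaHolders);

  // Minimal stand-in for a coordinator server holder; real code would also carry
  // load/drop queue state, maximum size, and currently served segments.
  interface ServerCandidate
  {
    String name();
    long availableBytes();
  }
}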

clintropolis added a commit to clintropolis/druid that referenced this issue Mar 23, 2018
jihoonson pushed a commit that referenced this issue Mar 29, 2018
jon-wei pushed a commit to implydata/druid-public that referenced this issue Apr 4, 2018
gianm pushed a commit to implydata/druid-public that referenced this issue May 16, 2018
jihoonson pushed a commit to jihoonson/druid that referenced this issue Jul 5, 2018
fjy pushed a commit that referenced this issue Jul 6, 2018
leventov pushed a commit to metamx/druid that referenced this issue Jul 20, 2018
jihoonson added this to the 0.12.2 milestone Aug 6, 2018
@jihoonson (Contributor)

Looks like we've fixed this issue. We can reopen it if we run into this again. Closing now.
