Analysis and action for UnreachableMasterWithLaggingReplicas #572

shlomi-noach · 2018-08-08T06:16:54Z

This PR adds a new failure analysis: UnreachableMasterWithLaggingReplicas. It identifies the case of a master that is unreachable to orchestrator while all of its replicas are seemingly OK, yet all of them are lagging.

Failure scenario

This is a known scenario in production. A well-known cause is Too Many Connections on the master: the master is overloaded, connections keep coming in, and eventually the master refuses to accept new connections.

orchestrator would suddenly be unable to reach the master. But long-running replicas benefit from holding good old connections, and may actually still be able to replicate.
The master may eventually refuse any/all writes. The replicas would still think everything's fine, but they would get nothing through the replication stream.

If you use pt-heartbeat or a similar heartbeat mechanism, replication lag will be seen to increase even though Seconds_Behind_Master may still indicate 0.

Some notes:

We've seen this scenario happen even when slave_net_timeout is configured and replicas are using heartbeats.
We've seen this scenario in the past while we were still running trigger-based pt-online-schema-change (before moving to triggerless gh-ost).
And we've seen a similar scenario for other reasons.

Analysis

To avoid false positives, the analysis checks:

The master is unreachable to orchestrator
At least one replica thinks the master is reachable
All replicas are lagging. You should be using the ReplicationLagQuery configuration, i.e. a heartbeat mechanism such as pt-heartbeat, and not trust Seconds_Behind_Master to do the right thing (it doesn't).
Not all replicas are delayed (if all replicas have SQL_Delay then they are in fact expected to lag).
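
The four checks above can be sketched as a single predicate. This is a minimal illustration and not the code in this PR: the replicaState type, its fields, the function name and the lag threshold parameter are all hypothetical. The threshold plays the role that a setting such as ReasonableReplicationLagSeconds might play, and the lag value is assumed to come from a heartbeat-based query rather than Seconds_Behind_Master.

```go
package main

import "fmt"

// replicaState is a hypothetical, simplified view of one direct replica.
type replicaState struct {
	IOThreadRunning bool  // the replica believes it is still connected to the master
	LagSeconds      int64 // lag as measured by a heartbeat query (assumption)
	SQLDelay        int64 // configured SQL_Delay; > 0 means intentionally delayed
}

// isUnreachableMasterWithLaggingReplicas applies the four checks:
// master unreachable, at least one replica still connected,
// all replicas lagging, and not all replicas intentionally delayed.
func isUnreachableMasterWithLaggingReplicas(masterReachable bool, replicas []replicaState, reasonableLagSeconds int64) bool {
	if masterReachable || len(replicas) == 0 {
		return false
	}
	connected, lagging, delayed := 0, 0, 0
	for _, r := range replicas {
		if r.IOThreadRunning {
			connected++
		}
		if r.LagSeconds > reasonableLagSeconds {
			lagging++
		}
		if r.SQLDelay > 0 {
			delayed++
		}
	}
	return connected > 0 && lagging == len(replicas) && delayed < len(replicas)
}

func main() {
	replicas := []replicaState{
		{IOThreadRunning: true, LagSeconds: 42},
		{IOThreadRunning: false, LagSeconds: 57},
	}
	fmt.Println(isUnreachableMasterWithLaggingReplicas(false, replicas, 10)) // true

	// A single replica that keeps up is enough to reject the analysis.
	replicas[1].LagSeconds = 0
	fmt.Println(isUnreachableMasterWithLaggingReplicas(false, replicas, 10)) // false
}
```

Counting connected, lagging and delayed replicas separately keeps each condition independent, so any one of them can veto the analysis.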

Action

There are two potential courses of action. One would be to immediately initiate a failover. We chose the other, because this analysis is somewhat in a gray zone: there could be a failure of pt-heartbeat on the master together with a very brief network isolation of the master from orchestrator. The chance is slim, but because this type of analysis is new, we choose to tread carefully and avoid false positive failovers.

The action we chose: issue STOP SLAVE; START SLAVE on all of the master's direct replicas (credit @tomkrouper).

This would kick the connections on the replicas, and hopefully the re-connection and re-authentication process would make each replica realize the master is broken, just as orchestrator and any app connection already have.

That would shortly lead to all replicas being broken, which would lead to a DeadMaster analysis, and a failover action.

Note that this analysis is re-generated every second or so, and that the action taken (restarting replication on replicas) is not affected by RecoveryPeriodBlockSeconds. An internal throttling mechanism avoids flooding the replicas with STOP SLAVE; START SLAVE operations.

cc @github/database-infrastructure

shlomi-noach · 2018-08-08T06:28:13Z

TODO: a subtlety. If we restart replication on all replicas, that in itself would cause a situation where all replicas are not replicating at the same time. This would trigger a DeadMaster analysis, and it shouldn't.

ggunson · 2018-08-09T05:00:46Z

go/logic/topology_recovery.go

+	return found
+}
+
+// emergentlyRestartReplicationOnTopologyInstanceReplicas forces a stop slave + start slave on


Is it useful to just do a stop/start of the IO thread, rather than both threads? On the off chance that the replica is behind?

shlomi-noach · Aug 9, 2018

Yeah, that's a good idea.
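
The difference between the two restart flavors discussed here can be sketched as follows. The helper name restartStatements is hypothetical and is not taken from this PR. STOP SLAVE IO_THREAD and START SLAVE IO_THREAD are standard MySQL statements that affect only the IO thread, so the SQL thread keeps applying whatever is already in the relay log, which helps a replica that is behind.

```go
package main

import "fmt"

// restartStatements returns the statements a (hypothetical) emergency restart
// would issue on a replica. With ioThreadOnly set, the SQL thread is left
// untouched and continues applying already-fetched relay log events.
func restartStatements(ioThreadOnly bool) []string {
	if ioThreadOnly {
		return []string{"STOP SLAVE IO_THREAD", "START SLAVE IO_THREAD"}
	}
	return []string{"STOP SLAVE", "START SLAVE"}
}

func main() {
	for _, stmt := range restartStatements(true) {
		fmt.Println(stmt)
	}
}
```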

Shlomi Noach added 6 commits August 6, 2018 16:57

terminology

babf5da

CountDelayedReplicas

e9b3272

CountLaggingReplicas

b354c5b

new analysis: UnreachableMasterWithLaggingReplicas

0909784

tests for UnreachableMasterWithLaggingReplicas

6f0c630

Taking action on UnreachableMasterWithLaggingReplicas: forcing a Rest…

55b5f87

…artSlave on master's direct replicas

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 8, 2018 06:17 Inactive

shlomi-noach changed the title ~~analysis acnd action for UnreachableMasterWithLaggingReplicas~~ Analysis and action for UnreachableMasterWithLaggingReplicas Aug 8, 2018

emergency operation graceful period

33305be

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 9, 2018 04:59 Inactive

ggunson reviewed Aug 9, 2018

Restarting just the IO thread

7bed7e3

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 9, 2018 06:26 Inactive

Shlomi Noach added 2 commits August 9, 2018 09:43

whoops, wrong implementation of RestartIOThread

f874f8a

reverting earlier change

3f8e5d1

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 9, 2018 06:44 Inactive

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 9, 2018 06:45 Inactive

Merge branch 'master' into unreachable-master-with-lagging-replicas

53a332c

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 12, 2018 05:11 Inactive

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 12, 2018 05:11 Inactive

Shlomi Noach added 2 commits August 14, 2018 16:22

using config.Config.ReasonableReplicationLagSeconds

e7381c6

Merge branch 'unreachable-master-with-lagging-replicas' of github.com…

787be50

…:github/orchestrator into unreachable-master-with-lagging-replicas

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 14, 2018 13:24 Inactive

shlomi-noach had a problem deploying to production/mysql_cluster=conductor August 14, 2018 13:24 Failure

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 14, 2018 13:26 Inactive

temporary debug message for visibility

b12207b

shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster August 19, 2018 05:21 Inactive

shlomi-noach had a problem deploying to production/mysql_cluster=conductor August 19, 2018 05:21 Failure

Merge branch 'master' into unreachable-master-with-lagging-replicas

250f019

shlomi-noach had a problem deploying to production/mysql_cluster=concertmaster August 19, 2018 05:28 Failure

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 19, 2018 05:28 Inactive

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 19, 2018 07:22 Inactive

Shlomi Noach added 4 commits August 19, 2018 10:24

debug messages more informative

4e1ef35

Merge branch 'unreachable-master-with-lagging-replicas' of github.com…

c819f50

…:github/orchestrator into unreachable-master-with-lagging-replicas

Documentation for UnreachableMasterWithLaggingReplicas

3bb1fa2

Removed legacy (and long since non-existing) UnreachableMasterWithSta…

ebaf07a

…leSlaves, AllMasterSlavesStale

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 20, 2018 07:40 Inactive

removed use of legacy database_instance_binlog_files_history and data…

3825de3

…base_instance_recent_relaylog_history

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 20, 2018 07:58 Inactive

too much sleep/retry time for orchestrator-client, reduced a bit

cb39182

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 20, 2018 08:06 Inactive

Merge branch 'master' into unreachable-master-with-lagging-replicas

048146a

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 23, 2018 05:18 Inactive

analysis message cached

5042119

shlomi-noach temporarily deployed to production/mysql_cluster=conductor August 23, 2018 05:37 Inactive

shlomi-noach merged commit 464a3c1 into master Aug 23, 2018

shlomi-noach deleted the unreachable-master-with-lagging-replicas branch August 23, 2018 06:09
