Analysis and action for UnreachableMasterWithLaggingReplicas #572

Merged
shlomi-noach merged 23 commits into master from unreachable-master-with-lagging-replicas on Aug 23, 2018

shlomi-noach
Collaborator

A new failure analysis, UnreachableMasterWithLaggingReplicas, identifies the case where the master is unreachable to orchestrator while all of its replicas appear healthy but are all lagging.

Failure scenario

This is a known scenario in production. A well-known cause is Too Many Connections on the master: the master is overloaded, connections keep coming in, and eventually the master refuses to accept new ones.

orchestrator is suddenly unable to reach the master. Long-running replicas, however, hold connections that were established earlier, so they may still be able to replicate.
The master may eventually refuse any or all writes. The replicas still think everything is fine, but nothing is arriving through the replication stream.

If you use pt-heartbeat or a similar mechanism, replication lag will be seen to increase even while Seconds_Behind_Master may still report 0.

Some notes:

  • We've seen this scenario happen even when slave_net_timeout is configured and replicas are using heartbeats.
  • We've seen this scenario in the past while we were still running trigger-based pt-online-schema-change (before moving to triggerless gh-ost).
  • And we've seen a similar scenario for other reasons.

Analysis

To avoid false positives, the analysis checks:

  • The master is unreachable to orchestrator
  • At least one replica thinks the master is reachable
  • All replicas show lag. You should be using the ReplicationLagQuery configuration, i.e. a heartbeat mechanism such as pt-heartbeat, rather than trusting Seconds_Behind_Master to do the right thing (it doesn't).
  • Not all replicas are delayed (if all replicas have SQL_Delay then they are in fact expected to lag).
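The four checks above can be sketched as a single predicate. This is an illustration only, not the code in this PR: the `ReplicaStatus` type, its fields, the function name, and the `lagThresholdSeconds` parameter are all hypothetical, and "replica thinks the master is reachable" is approximated here by the replica's IO thread running.

```go
package main

// ReplicaStatus is a hypothetical, simplified view of one direct replica.
type ReplicaStatus struct {
	IOThreadRunning bool  // assumption: stands in for "replica thinks the master is reachable"
	LagSeconds      int64 // lag as measured by a heartbeat query
	SQLDelaySeconds int64 // configured SQL_Delay; greater than 0 means intentionally delayed
}

// isUnreachableMasterWithLaggingReplicas applies the four checks:
// master unreachable, at least one replica believes it is connected,
// every replica lags, and not every replica is intentionally delayed.
func isUnreachableMasterWithLaggingReplicas(masterReachable bool, replicas []ReplicaStatus, lagThresholdSeconds int64) bool {
	if masterReachable || len(replicas) == 0 {
		return false
	}
	anyThinksReachable := false
	allLagging := true
	allDelayed := true
	for _, r := range replicas {
		if r.IOThreadRunning {
			anyThinksReachable = true
		}
		if r.LagSeconds < lagThresholdSeconds {
			allLagging = false
		}
		if r.SQLDelaySeconds == 0 {
			allDelayed = false
		}
	}
	return anyThinksReachable && allLagging && !allDelayed
}
```

The last condition matters because a topology in which every replica has SQL_Delay set would otherwise always look like it is lagging.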

Action

There are two potential courses of action. One would be to immediately initiate a failover. We chose the other, because this analysis sits somewhat in a gray zone: pt-heartbeat could fail on the master at the same time as a very brief network isolation of the master from orchestrator. The chance is slim, but because this type of analysis is new, we choose to tread carefully and avoid false-positive failovers.

The action we chose: issue STOP SLAVE; START SLAVE on all of the master's direct replicas (credit: @tomkrouper).

This kicks the replicas' connections, and hopefully the re-connection and re-authentication process makes each replica realize the master is broken, just as orchestrator and any app connection already have.

That would shortly lead to all replicas being broken, which would lead to a DeadMaster analysis, and a failover action.

Note that this analysis is re-generated every second or so, and that the action taken (restarting replication on replicas) is not affected by RecoveryPeriodBlockSeconds. An internal throttling mechanism avoids flooding the replicas with stop slave; start slave operations.

cc @github/database-infrastructure

@shlomi-noach shlomi-noach changed the title analysis acnd action for UnreachableMasterWithLaggingReplicas Analysis and action for UnreachableMasterWithLaggingReplicas Aug 8, 2018
@shlomi-noach
Collaborator Author

TODO: a subtlety. If we restart replication on all replicas, that in itself causes a situation where all replicas are not replicating at the same time. This would trigger a DeadMaster analysis, and it shouldn't.

return found
}

// emergentlyRestartReplicationOnTopologyInstanceReplicas forces a stop slave + start slave on
Contributor

Is it useful to just do a stop/start of the IO thread, rather than both threads? On the off chance that the replica is behind?

Collaborator Author

Yeah, that's a good idea.
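For reference, the difference between the two options discussed here can be sketched as the statements sent to each replica. The function name `restartReplicationStatements` is hypothetical; the SQL statements themselves are standard MySQL syntax. Restarting only the IO thread forces the replica to reconnect to the master while leaving the SQL thread free to keep applying relay log events it has already received.

```go
package main

// restartReplicationStatements is a hypothetical helper returning the SQL
// statements to run, in order, on a replica.
// With ioThreadOnly set, only the thread that talks to the master is bounced.
func restartReplicationStatements(ioThreadOnly bool) []string {
	if ioThreadOnly {
		return []string{"STOP SLAVE IO_THREAD", "START SLAVE IO_THREAD"}
	}
	return []string{"STOP SLAVE", "START SLAVE"}
}
```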

Shlomi Noach added 2 commits August 14, 2018 16:22
@shlomi-noach shlomi-noach merged commit 464a3c1 into master Aug 23, 2018
@shlomi-noach shlomi-noach deleted the unreachable-master-with-lagging-replicas branch August 23, 2018 06:09