Analysis and action for UnreachableMasterWithLaggingReplicas #572
Conversation
…artSlave on master's direct replicas
TODO: subtlety, if we restart replication on all replicas, that in itself would cause a situation where all replicas are not replicating at the same time. This would trigger a …
// emergentlyRestartReplicationOnTopologyInstanceReplicas forces a stop slave + start slave on
Is it useful to just do a stop/start of the IO thread, rather than both threads? On the off chance that the replica is behind?
Yeah, that's a good idea.
…:github/orchestrator into unreachable-master-with-lagging-replicas
…leSlaves, AllMasterSlavesStale
…base_instance_recent_relaylog_history
A new failure analysis: `UnreachableMasterWithLaggingReplicas` identifies the case of a master unreachable to `orchestrator`, where all of its replicas are seemingly OK, but all lagging.

## Failure scenario
This is a known scenario in production. A well-known particular cause is the `Too many connections` problem on a master: the master is overloaded, connections keep coming in, and finally the master refuses to accept new connections. `orchestrator` would suddenly be unable to reach the master. But long-running replicas may enjoy the fact that they're using good old, established connections, and may actually still be able to replicate.

The master may eventually refuse any/all writes. The replicas would still think everything's fine, but they're not getting anything through the replication stream.
If using `pt-heartbeat` or similar, replication lag will be seen to increase even as `Seconds_Behind_Master` may still indicate `0`.

Some notes:
- Replicas can eventually detect the dead connection on their own when `slave_net_timeout` is configured and replicas are using heartbeats.
- This scenario was seen in the past with `pt-online-schema-change` (before moving to triggerless `gh-ost`).

## Analysis
To avoid false positives, the analysis checks that:

- `orchestrator` has a `ReplicationLagQuery` configuration, i.e. utilizes a heartbeat mechanism such as `pt-heartbeat`, and does not trust `Seconds_Behind_Master` to do the right thing (it doesn't).
- Replicas are not configured with `SQL_Delay` (if they were, then they are in fact expected to lag).
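For reference, a heartbeat-based lag query is wired in via the `ReplicationLagQuery` setting in the `orchestrator` configuration. The fragment below is a sketch: it assumes a `pt-heartbeat`-style table at `meta.heartbeat` with a `ts` timestamp column — substitute your own schema and table names.

```json
{
  "ReplicationLagQuery": "SELECT ROUND(ABS(UNIX_TIMESTAMP() - MAX(UNIX_TIMESTAMP(ts)))) FROM meta.heartbeat"
}
```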
## Action

There are two potential courses of action, and we picked one over the other. One would be to immediately initiate a failover. However, we chose differently, because this analysis is a bit in the gray zone: there could be a failure of `pt-heartbeat` on the master together with a very brief network isolation of the master from `orchestrator`. The chance is slim, but because this type of analysis is new, we choose to tread carefully and avoid false positive failovers.

Instead, we issue a `STOP SLAVE; START SLAVE` on all of the master's direct replicas (credit @tomkrouper). This kicks the replicas' connections, and hopefully the re-authentication and re-connection process makes each replica realize the master is broken, just as `orchestrator` did, and as any app connection did.

That would shortly lead to all replicas being broken, which would lead to a `DeadMaster` analysis, and a failover action.

Noteworthy: this analysis is re-generated every second or so, and the action taken (restart replication on replicas) is not affected by `RecoveryPeriodBlockSeconds`. There is an internal throttling mechanism to avoid flooding the replicas with `stop slave; start slave` operations.

cc @github/database-infrastructure