correlating relay logs #32

Merged
merged 17 commits into master from failover-correlate-relay-logs on Jan 8, 2017

@shlomi-noach
Collaborator

Following up on #1, with the objective of aligning replicas at failover.

At this time this will cooperate with orchestrator-agent at github/orchestrator-agent#13, though the same can be accomplished via remote SSH.

When a master fails, orchestrator is able to use GTID/Pseudo-GTID to match replicas. However, there are a few constraints: what if the most up-to-date replica doesn't have binary logging or log-slave-updates enabled? What if it uses ROW based replication where all others use STATEMENT based? What if it runs a higher MySQL version? orchestrator would have to lose it, even though it contains more data than the others.

This PR follows the MHA approach, which requires either remote agents on the MySQL boxes or remote SSH. The intention is to correlate relay logs between failed replicas (done, optimized for speed), then copy & apply such logs from the most up-to-date replica onto a (single) candidate replica.

Why single? Because the candidate replica would have log-slave-updates and orchestrator would be able to point all other replicas under that one. It is yet to be seen whether comparing and copying relay logs onto a single replica, then applying Pseudo-GTID/GTID logic to heal the rest of the replicas, is faster or slower than comparing and copying relay logs from the most up-to-date replica onto all other replicas.
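
The two roles described above (the most up-to-date replica as the source of relay logs, and a single candidate with log-slave-updates as the target) can be sketched as follows. This is a minimal illustration and not code from this PR: the `replica` and `coords` types and both functions are hypothetical, and comparing log file names as strings assumes zero-padded names of equal length.

```go
package main

import "fmt"

// coords is a hypothetical (log file, position) pair on the failed master.
type coords struct {
	LogFile string
	LogPos  int64
}

// less reports whether c is strictly behind other.
// Assumption: file names are zero-padded and of equal length,
// so string comparison matches numeric order.
func (c coords) less(other coords) bool {
	if c.LogFile != other.LogFile {
		return c.LogFile < other.LogFile
	}
	return c.LogPos < other.LogPos
}

// replica is a hypothetical, simplified view of a replica at failover time.
type replica struct {
	Name            string
	Applied         coords // how far into the master's logs this replica got
	LogBin          bool
	LogSlaveUpdates bool
}

// mostUpToDate returns the replica with the most advanced coordinates.
func mostUpToDate(replicas []replica) (best replica, found bool) {
	for _, r := range replicas {
		if !found || best.Applied.less(r.Applied) {
			best, found = r, true
		}
	}
	return best, found
}

// pickCandidate returns the most advanced replica that writes binary logs
// with log-slave-updates, so other replicas can later be pointed under it.
func pickCandidate(replicas []replica) (replica, bool) {
	var eligible []replica
	for _, r := range replicas {
		if r.LogBin && r.LogSlaveUpdates {
			eligible = append(eligible, r)
		}
	}
	return mostUpToDate(eligible)
}

func main() {
	replicas := []replica{
		{Name: "a", Applied: coords{"mysql-bin.000012", 500}},
		{Name: "b", Applied: coords{"mysql-bin.000012", 300}, LogBin: true, LogSlaveUpdates: true},
		{Name: "c", Applied: coords{"mysql-bin.000011", 900}, LogBin: true, LogSlaveUpdates: true},
	}
	source, _ := mostUpToDate(replicas)
	candidate, _ := pickCandidate(replicas)
	fmt.Printf("copy relay logs from %s onto %s\n", source.Name, candidate.Name)
}
```

In this sketch, replica `a` holds the most data but cannot be promoted because it has no binary logs, which is the case the PR description is concerned with.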

The initial commits in this PR provide a heuristic search for relay log coordinates & entries, which turns relay-log correlation into a subsecond operation, within 1 minute from failure.
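
As a rough illustration of why a recent lower bound can make such a search cheap, the sketch below walks relay log files newest first and stops at the file named by the lower bound. It is an assumption about the general idea only, not the heuristic implemented in this PR; the `coordinates` type and `filesToScan` function are hypothetical.

```go
package main

import "fmt"

// coordinates is a hypothetical (file, position) pair.
type coordinates struct {
	File string
	Pos  int64
}

// filesToScan takes relay log file names ordered oldest to newest and a
// lower bound known to precede the entry being searched for. It returns
// the files worth scanning, newest first, stopping at the lower-bound file.
// If the lower-bound file is not in the list, all files are returned.
func filesToScan(ordered []string, lowerBound coordinates) []string {
	var result []string
	for i := len(ordered) - 1; i >= 0; i-- {
		result = append(result, ordered[i])
		if ordered[i] == lowerBound.File {
			break
		}
	}
	return result
}

func main() {
	files := []string{"relay.000001", "relay.000002", "relay.000003", "relay.000004"}
	bound := coordinates{File: "relay.000003", Pos: 4}
	fmt.Println(filesToScan(files, bound))
}
```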

@shlomi-noach shlomi-noach deployed to production/github-mysql6 Jan 2, 2017 Active
@shlomi-noach shlomi-noach had a problem deploying to production/github-mysqlutil Jan 2, 2017 Failure
go/inst/instance_topology.go
@@ -1261,7 +1261,7 @@ func FindLastPseudoGTIDEntry(instance *Instance, recordedInstanceRelayLogCoordin
}
minBinlogCoordinates, minRelaylogCoordinates, err := GetHeuristiclyRecentCoordinatesForInstance(&instance.Key)
- if instance.LogBinEnabled && instance.LogSlaveUpdatesEnabled && (expectedBinlogFormat == nil || instance.Binlog_format == *expectedBinlogFormat) {
+ if instance.LogBinEnabled && instance.LogSlaveUpdatesEnabled && !*config.RuntimeCLIFlags.SkipBinlogSearch && (expectedBinlogFormat == nil || instance.Binlog_format == *expectedBinlogFormat) {
@shlomi-noach
shlomi-noach Jan 2, 2017 Collaborator

This change is actually unrelated to this PR. However, the need for it came at a good time, and it makes sense to include it here.
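
The condition in the diff above can be read in isolation as the following predicate. This is a simplified sketch, not orchestrator's code: the `instance` struct and `shouldSearchBinlogs` function are hypothetical, and the skip flag is passed as a plain `bool` rather than read from `config.RuntimeCLIFlags`. What the caller does when the predicate is false is not shown in the diff and is not modeled here.

```go
package main

import "fmt"

// instance is a hypothetical, trimmed-down stand-in for the fields
// the diff's condition reads.
type instance struct {
	LogBinEnabled          bool
	LogSlaveUpdatesEnabled bool
	BinlogFormat           string
}

// shouldSearchBinlogs mirrors the boolean structure of the changed line:
// binary logs are searched only if the instance writes them with
// log-slave-updates, the skip flag is off, and the binlog format matches
// the expected one (a nil expectation matches any format).
func shouldSearchBinlogs(inst instance, skipBinlogSearch bool, expectedBinlogFormat *string) bool {
	if !inst.LogBinEnabled || !inst.LogSlaveUpdatesEnabled {
		return false
	}
	if skipBinlogSearch {
		return false
	}
	return expectedBinlogFormat == nil || inst.BinlogFormat == *expectedBinlogFormat
}

func main() {
	inst := instance{LogBinEnabled: true, LogSlaveUpdatesEnabled: true, BinlogFormat: "ROW"}
	statement := "STATEMENT"

	fmt.Println(shouldSearchBinlogs(inst, false, nil))        // no format expectation
	fmt.Println(shouldSearchBinlogs(inst, true, nil))         // skip flag set
	fmt.Println(shouldSearchBinlogs(inst, false, &statement)) // format mismatch
}
```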

@shlomi-noach shlomi-noach deployed to production/github-mysql6 Jan 3, 2017 Active
@shlomi-noach shlomi-noach deployed to production/github-mysqlutil Jan 3, 2017 Active
@shlomi-noach shlomi-noach changed the title from WIP: correlating relay logs on failover to correlating relay logs on failover Jan 3, 2017
@shlomi-noach shlomi-noach changed the title from correlating relay logs on failover to correlating relay logs Jan 3, 2017
@shlomi-noach
Collaborator

Originally this PR was meant to go the whole way to relaylog-sync on failover.
However, the functionality at this time is substantial enough, and can (and should) be tested. Once tested to work, I suggest merging this and opening a new PR to take it from here.

@shlomi-noach shlomi-noach deployed to production/github-mysql6 Jan 4, 2017 Active
@shlomi-noach shlomi-noach deployed to production/github-mysql6 Jan 5, 2017 Active
@shlomi-noach
Collaborator

merging for now, continued work in other PRs

@shlomi-noach shlomi-noach merged commit 4b07d99 into master Jan 8, 2017

1 check passed

orchestrator-build-deploy-tarball Build #5099024 succeeded in 15s
@shlomi-noach shlomi-noach deleted the failover-correlate-relay-logs branch Jan 8, 2017