operational question: how does one decide whether a _gho or _ghc table is abandoned? #99

Open
shlomi-noach opened this issue Jul 22, 2016 · 5 comments

Comments

@shlomi-noach
Contributor

With trigger-based tools, the absence of triggers is an indication that the operation is dead. How does one recognize that a triggerless gh-ost operation is dead?

Once closed, this question should move into the documentation.

@shlomi-noach
Contributor Author

The quick answer is to select max(last_update) from _whatever_ghc. An active gh-ost operation routinely updates that table (within the second), even while throttling. Thus, if max(last_update) is, say, a minute old, the migration is dead. If you are checking on the master, that is definitive. If you are checking on a replica, first make sure replication lag is OK.

@jonahberquist
Contributor

I think abandoned and dead are two different states. With trigger-based migration tools, if the triggers are gone, we have to give up and start over, but that's not the case here. If the gh-ost process dies, we could theoretically resume the operation, as long as we know where to resume in the table copy process, know where to resume in the binary log for incoming changes, and have the binary logs to resume from.
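
The three preconditions listed above can be written down as a small check. This is only a sketch: `ResumeState`, `can_resume`, and the idea of passing in a set of available binary log file names are hypothetical names made up for illustration, not anything gh-ost provides.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ResumeState:
    """Hypothetical record of what a dead migration left behind."""
    copy_position: Optional[int]  # last primary-key value copied, if known
    binlog_file: Optional[str]    # binary log file to resume from, if known
    binlog_pos: Optional[int]     # offset within that file, if known


def can_resume(state: ResumeState, available_binlogs: Set[str]) -> bool:
    """True only if all three preconditions from the comment hold."""
    knows_copy_position = state.copy_position is not None
    knows_binlog_position = (
        state.binlog_file is not None and state.binlog_pos is not None
    )
    # The binary log we need must not have been purged.
    has_binlogs = state.binlog_file in available_binlogs
    return knows_copy_position and knows_binlog_position and has_binlogs


ok_state = ResumeState(copy_position=5000, binlog_file="mysql-bin.000042", binlog_pos=1337)
purged_state = ResumeState(copy_position=5000, binlog_file="mysql-bin.000001", binlog_pos=4)
available = {"mysql-bin.000041", "mysql-bin.000042"}
```

Here `can_resume(ok_state, available)` holds, while `purged_state` fails because its binary log file is no longer available.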

@shlomi-noach
Contributor Author

shlomi-noach commented Jul 24, 2016

The official query to see if a migration is running is:

select last_update from _tbl_ghc where hint='heartbeat';

(to get last known activity), or

select last_update > now() - interval 1 minute as is_alive from _tbl_ghc where hint='heartbeat';

for a heuristic "if it hasn't been updated in the last minute it must be dead"
cc @tomkrouper
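
For anyone scripting this check outside of SQL, the same heuristic can be sketched in Python. The function name `is_alive` and the `replica_lag` allowance are assumptions added for illustration; the one-minute threshold mirrors the query above, and fetching `last_update` from the `_ghc` table is left to the caller.

```python
from datetime import datetime, timedelta


def is_alive(last_update, now, threshold=timedelta(minutes=1),
             replica_lag=timedelta(0)):
    """Heuristic: the migration is alive if the heartbeat is recent.

    Mirrors `last_update > now() - interval 1 minute`. When reading from a
    replica, the heartbeat row is stale by the replication lag, so the
    window is widened by `replica_lag` (an assumption, not a gh-ost rule).
    """
    return now - last_update < threshold + replica_lag


now = datetime(2016, 7, 24, 12, 0, 0)
fresh = now - timedelta(seconds=5)
stale = now - timedelta(seconds=90)
```

With these values, `is_alive(fresh, now)` is true and `is_alive(stale, now)` is false, unless the caller knows the replica is lagging by a minute or so and passes that in as `replica_lag`.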

@shlomi-noach
Contributor Author

> If the gh-ost process dies, we could theoretically resume the operation, as long as we know where to resume in the table copy process, know where to resume in the binary log for incoming changes, and have the binary logs to resume from.

omg.

So, I've been thinking about this: we actually only need to know where to resume in the table-copy process. It turns out (and I need to put this in detailed writing) that replaying the RBR (row-based replication) events is idempotent!! This means we can replay the binary log from some point in the past, as long as we never skip entries.

But, I would suggest this is still way in the future.
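
The idempotency claim above can be illustrated with a toy model. This is a sketch under a simplifying assumption, not gh-ost's actual applier: every insert or update event is treated as "write the full row image for this primary key", and every delete as "remove this key if present". The `apply_events` function and the sample `log` are made up for the example.

```python
def apply_events(table, events):
    """Apply row events to `table` (a dict keyed by primary key) in place."""
    for op, key, row in events:
        if op == "delete":
            table.pop(key, None)  # deleting a missing row is a no-op
        else:
            # "insert" and "update" both write the full after-image
            table[key] = row
    return table


log = [
    ("insert", 1, {"name": "a"}),
    ("insert", 2, {"name": "b"}),
    ("update", 1, {"name": "a2"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "c"}),
]

# Apply the whole log once.
once = apply_events({}, log)

# Replay from an earlier point on top of the already-applied state.
replayed = apply_events(dict(once), log[1:])

# Skip one entry (the update): the result diverges.
skipped = apply_events({}, log[:2] + log[3:])
```

In this model `replayed` ends up identical to `once`, because the last event touching each key decides that key's final state. `skipped` differs, which is why entries must never be skipped.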

@pbitty
Contributor

pbitty commented Aug 19, 2016

The above sounds great. Having the ability to resume would be a great feature.

Now that binlog events are applied in a transaction, it should be possible to confidently store and read the last-processed binlog position. The same could be done with the table copy process, if the statements are also wrapped in a transaction.
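
A minimal sketch of that idea, using Python's built-in sqlite3 in place of MySQL so it runs anywhere: the applied events and the last-processed position are written in one transaction, so they either both commit or both roll back. The table names `ghost` and `checkpoint`, the `binlog` hint, and the `apply_batch` helper are hypothetical and do not reflect gh-ost's schema.

```python
import sqlite3


def apply_batch(conn, events, end_position):
    """Apply events and record end_position in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        for op, key, val in events:
            if op == "upsert":
                conn.execute(
                    "REPLACE INTO ghost (id, val) VALUES (?, ?)", (key, val)
                )
            elif op == "delete":
                conn.execute("DELETE FROM ghost WHERE id = ?", (key,))
            else:
                raise ValueError("unknown event type: %r" % (op,))
        conn.execute(
            "REPLACE INTO checkpoint (hint, pos) VALUES ('binlog', ?)",
            (end_position,),
        )


def last_position(conn):
    """Return the last committed position, or None if nothing was stored."""
    row = conn.execute(
        "SELECT pos FROM checkpoint WHERE hint = 'binlog'"
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ghost (id INTEGER PRIMARY KEY, val TEXT NOT NULL)")
conn.execute("CREATE TABLE checkpoint (hint TEXT PRIMARY KEY, pos INTEGER NOT NULL)")

apply_batch(conn, [("upsert", 1, "a"), ("upsert", 2, "b")], end_position=120)

try:
    # The second event is invalid, so the whole batch must roll back.
    apply_batch(conn, [("upsert", 3, "c"), ("bogus", 4, "d")], end_position=250)
except ValueError:
    pass

ids = [r[0] for r in conn.execute("SELECT id FROM ghost ORDER BY id")]
```

After the failed batch, the stored position is still the one from the first batch and row 3 is absent, so a restarted process reading `last_position` would resume from a point consistent with the data.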
