-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potential too-early-rename via killed connections #26
Comments
To solve half the problem, connection death at the time of the I concur the case where the connection dies right after obtaining the initial I'm still unsure how to solve it mathematically. There is an easy way of substantially reducing the risk: as you recall, connection #2 issues a We are able to inject a suspense. Such that we force a This will delay the swap by The above is not a clean solution, and not a complete solution. Let's further think of how this can be solved mathematically. |
The solution presented above is mutt. The |
I know you both understand this way better than me, so this might be a horribly stupid idea, but could we just add a lock right after the rename to do a quick check to make sure a previous lock didn't get killed to cause a premature rename? I'm guessing it won't help the situation other than alerting us to the mess up. |
The current implementation already makes it known should something go I'm trying to think if we can employ a different storage engine which would On Wednesday, May 4, 2016, Tom Krouper notifications@github.com wrote:
Shlomi Noach |
In my manual testing of just 4 connections:
What seems to return from the With 5.6, if the GET_LOCK() conn is killed, the SELECT RELEASE_LOCK() returns a 1. In 5.7, if the GET_LOCK() is killed, the SELECT returns the empty set. And it looks like @shlomi-noach is using mysql1v, which is 5.7. I know with 5.7 they deal with locks differently (now as metadata locks), so maybe that's why. But I'm not sure how to get, with either version, |
I'm not testing the lock on mysql1v. It's a replica so this algorithm does On Wednesday, May 4, 2016, Gillian Gunson notifications@github.com wrote:
Shlomi Noach |
Okay, I guess my points from my last comment were:
|
I was intentionally injecting the wrong lock name into the release_lock and Elaborating for the uninitiated: we are using voluntary locks, which are On Wednesday, May 4, 2016, Gillian Gunson notifications@github.com wrote:
Shlomi Noach |
@shlomi-noach I'm going to drop out of this conversation, because
and
are contradictory, and there's no point my trying to help if I'm testing the wrong behaviour. |
@ggunson sorry, please give me a couple hours till I can explain more On Wednesday, May 4, 2016, Gillian Gunson notifications@github.com wrote:
Shlomi Noach |
What does strike me as an enlightenment though is that we are not talking On Wednesday, May 4, 2016, Shlomi Noach shlomi-noach@github.com wrote:
Shlomi Noach |
The result of this query is actually ignored and not interesting to the algorithm. For now, please just ignore completely the fact that there's a Recall that connection #1 issues a I hope this clarifies some confusion throughout this thread. It is the result of that Let's now complete the flow:
If there's a reasonable error along the way, such as timeout on The initial problem @ggunson refers to is the case where Connection #1 dies in between step #1 and step #4, in which case the lock gets released prematurely, leading to the Actually, this is only half the danger. The same problem applies should connection #2 die. That's the connection issuing the In both those cases we get a I hope this clarifies better 😄 |
I stated both in the original comment:
|
Indeed you have 😮 |
At this time I've fried my brains trying to solve either this loophole or find yet another atomic method of swapping tables. I'm going to have a brain massage and do other stuff in the hope something comes up. |
I have several directions (all of them dead ends at this time) that I'll share. In this problem death of either two connections would lead to the premature We can statistically reduce the risk: no one says we only have to have a single blocking query and a single |
Thank you for spending the time reviewing this. I'm closing this issue as #65 came up, which I believe to solve the cut-over even in face of connection failure. |
GENERATED column as part of PK: handling UPDATE scenario
Paraphrasing a lot due to lack of ability to understand Go, at the point where the tool is ready to swap tables, there are these connections:
And then there's a check for the parsed binlog writes to have all gone to original_new, and then locks are released and the rename happens.
However, if either the GET_LOCK() connection or the SELECT RELEASE_LOCK() connection gets killed, the RENAME happens. So there are two places where a killed connection could result in a too-early-swap and some binlog events being lost.
I like the idea of confidence in being able to say that gh-ost wouldn't result in unintentional lost rows/writes. Maybe in this case it might involve locking the now-original-table after the rename and writing the missed events then (e.g. we'd know because the
REPLACE
/DELETE
to original_new would have failed). Just brainstorming.The text was updated successfully, but these errors were encountered: