
Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Closed

shlomi-noach opened this issue Jun 26, 2016 · 19 comments

shlomi-noach commented Jun 26, 2016

Final-finally-finalizationally, here's an asynchronous, safe, atomic cut-over phase.
This solution doesn't cause "table outage" (as in #65).

Here are the steps for a safe, atomic cut-over:

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE

  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'

  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK

  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl

    • This is blocked due to the LOCK, but is prioritized over connections C11..C19 and over C1..C9, or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue

  • Connection C10: checks that C20's RENAME is applied (looks for the blocked RENAME in SHOW PROCESSLIST)

  • Connection C10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections are still blocked.

  • Connection C10: UNLOCK TABLES

    BAM! The RENAME is first to execute, the ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl

Some notes

  • We create tbl_old to block a premature swap
  • It is allowed for a connection to DROP a table it has under a WRITE LOCK
  • A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first
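The prioritization in the last note can be illustrated with a small simulation. This is a toy model of MySQL's documented behavior, not gh-ost code; the `Statement` class and `release_lock` function are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Statement:
    connection: str
    kind: str  # "RENAME" or "DML"

def release_lock(blocked: list[Statement]) -> list[Statement]:
    """Order in which blocked statements execute once UNLOCK TABLES runs:
    any pending RENAME goes first, then the DML in arrival order."""
    renames = [s for s in blocked if s.kind == "RENAME"]
    dml = [s for s in blocked if s.kind == "DML"]
    return renames + dml

# Arrival order: some DML blocks first, the RENAME arrives later,
# more DML queues up behind the RENAME.
queue = [
    Statement("C11", "DML"),
    Statement("C20", "RENAME"),  # the atomic swap
    Statement("C21", "DML"),
]

order = release_lock(queue)
print([s.connection for s in order])  # ['C20', 'C11', 'C21']
```

Even though C11's DML arrived before C20's RENAME, the RENAME wins the queue; this is exactly what makes the swap land before any app statement touches the table.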

What happens on failures?

Much fun. Just works; no rollback required.

  • If C10 errors on the CREATE we do not proceed.
  • If C10 errors on the LOCK statement, we do not proceed. The table is not locked. App continues to operate as normal.
  • If C10 dies just as C20 is about to issue the RENAME:
    • The lock is released, the queries C1..C9, C11..C19 immediately operate on tbl.
    • C20's RENAME immediately fails because tbl_old exists.
    • The entire operation fails, but nothing terrible happens; some queries were blocked for a while, is all. We will need to retry everything
  • If C10 dies while C20 is blocked on RENAME: mostly similar to the above. The lock is released, then C20 fails the RENAME (because tbl_old exists), then all queries resume normal operation
  • If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: DROP, UNLOCK. Nothing terrible happens; some queries were blocked for a while. We will need to retry
  • If C20 dies just after C10 DROPs the table but before the unlock, same as above.
  • If both C10 and C20 die, no problem: LOCK is cleared; RENAME lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on tbl.

No matter what happens, at the end of operation we look for the ghost table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl, and the operation worked atomically.
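That end-of-operation check can be sketched as a few lines. This is a hypothetical helper, assuming we can list the tables in the schema; the function name and table names are illustrative only:

```python
def cut_over_succeeded(tables: set[str], ghost: str = "ghost") -> bool:
    """Success is decided solely by whether the ghost table still exists:
    if it does, the RENAME never ran; if it is gone, it was swapped in."""
    return ghost not in tables

# Failure case: the swap never ran; the ghost table remains.
assert cut_over_succeeded({"tbl", "ghost", "tbl_old"}) is False

# Success case: ghost was renamed to tbl; only the magic table remains.
assert cut_over_succeeded({"tbl", "tbl_old"}) is True
```

The point is that there is no intermediate state to inspect: the single RENAME either happened in full or not at all.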

A side note on failure is cleaning up the magic tbl_old. This is a matter of taste: either leave it in place (and avoid recreating it on retry), or drop it if you like.

Impact on app

App connections are guaranteed to be blocked, either until ghost is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.

Impact on replication

Replication only sees the RENAME. There is no LOCK in the binary logs. Thus, replication sees an atomic two-table swap. There is no table outage.

baloo commented Aug 1, 2016

Hello @shlomi-noach ,

First, thanks for open-sourcing gh-ost :) It looks very good :)
The only question remaining in my mind was about this particular subject (the cut-over phase); thanks for documenting it. But I believe I now have two more:

  • What ensures that no other connection writes to the tbl table in between those events:

    • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
    • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

    wouldn't it be safer to:

    • Connection C10: LOCK TABLES tbl WRITE;
    • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
    • Connection C10: LOCK TABLES tbl_old WRITE;

    I believe this way the replication client knows that once the CREATE TABLE occurs in the replication stream, no more writes to tbl will ever occur, and it's safe to proceed further (with the rename and so on).

  • Other question: You assume:

    A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

    From my understanding of the MySQL documentation, this is only guaranteed by the lock acquisition algorithm that MySQL uses [https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html]:

    LOCK TABLES acquires locks as follows:

    • Sort all tables to be locked in an internally defined order. From the user standpoint, this order is undefined.
    • If a table is to be locked with a read and a write lock, put the write lock request before the read lock request.
    • Lock one table at a time until the session gets all locks.

    I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
    Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user code shouldn't assume any ordering. Am I missing something?

shlomi-noach commented Aug 2, 2016

@baloo, thank you for your review!

What ensures that no other connection writes to the tbl table in between those events:

There is nothing to ensure that, and there is no need to ensure that; the table creation is merely a step in the direction of beginning the cut-over. There is no problem with tbl still taking writes; these are still read from the binary log and propagated to the ghost table.

Moreover, if we reverse the order as you suggest, the implication is that we spend more time in the locked phase.

wouldn't it be safer to:

  • Connection C10: LOCK TABLES tbl WRITE;
  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
  • Connection C10: LOCK TABLES tbl_old WRITE;

This doesn't work like that in MySQL, see: http://dev.mysql.com/doc/refman/5.7/en/lock-tables.html:

... A session that requires locks must acquire all the locks that it needs in a single LOCK TABLES statement. ...

You cannot just grab more locks as you go along; you have to grab them all at once.

I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user code shouldn't assume any ordering. Am I missing something?

The relevant code in MySQL is https://github.com/mysql/mysql-server/blob/a533e2c786164af9bd276660b972d93649434297/sql/mdl.cc#L2312. You are correct that this relies on internal behavior. This behavior has existed for 15 years or so, and, having discussed it (informally, Safe Harbour etc.) with engineers, it has no intention of going away.

I'm glad you pointed it out; I'll make it explicit in the fine print.

This entire scheme came to be because of a limitation in MySQL, where you cannot rename a table you have under lock in the same session. I have a good mind to influence the next MySQL version to support the above, rendering this entire algorithm unnecessary.

baloo commented Aug 2, 2016

Thanks for the clarifications :)

shlomi-noach commented Oct 29, 2018

Noteworthy that as of MySQL 8.0.13 there's support for RENAME TABLES under LOCK TABLES, specifically designed for gh-ost, see: https://mysqlserverteam.com/the-mysql-8-0-13-maintenance-release-is-generally-available/

gh-ost is yet to utilize this new functionality.

youzipi commented Dec 1, 2019

Why does the origin table need to be locked?
The only reason I can figure out: to prevent continuous transactions on the origin table, so as to create a time window for the swap operation? Otherwise we may fall into a try-fail cycle?
As far as I know, a rename operation is in-place, with no need to rebuild the table, and it permits concurrent DML.
Will that happen? Or are there some other reasons?

shlomi-noach commented Dec 3, 2019

a rename operation is in-place, with no need to rebuild the table, and it permits concurrent DML.

True, but because gh-ost works asynchronously by applying binary log content, there may be some backlog in the binary logs. And so we lock the table to ensure we've applied any relevant changes from the binary log.
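This reasoning can be sketched as a toy simulation (illustrative names only, not gh-ost's actual code): the lock stops new writes, so the binlog backlog can drain to zero before the swap; without the lock, trailing writes could be lost.

```python
def cut_over(binlog_backlog: list[dict], ghost_rows: dict, apply_event):
    # 1. LOCK TABLES tbl WRITE -- from here on, no new events enter the backlog.
    # 2. Drain whatever is still queued; without the lock this list could keep
    #    growing, and renaming early would lose the trailing writes.
    while binlog_backlog:
        apply_event(ghost_rows, binlog_backlog.pop(0))
    # 3. Only now is the ghost table a faithful copy; safe to RENAME.
    return ghost_rows

def apply_event(rows, event):
    # Toy row-image apply: last write per id wins.
    rows[event["id"]] = event["value"]

ghost = {1: "a"}
backlog = [{"id": 2, "value": "b"}, {"id": 1, "value": "a2"}]
print(cut_over(backlog, ghost, apply_event))  # {1: 'a2', 2: 'b'}
```

The in-place nature of RENAME is not the issue; the issue is that the ghost table lags behind the origin table until the backlog is fully applied.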

youzipi commented Dec 26, 2019

Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

Is it necessary to lock tbl_old?
Just locking tbl is enough to block C20's RENAME, and if C10 fails, the existence of tbl_old will make the RENAME fail.

shlomi-noach commented Dec 30, 2019

Is it necessary to lock the tbl_old?

It is necessary but for a different reason than you'd expect: in MySQL, if session A runs a LOCK TABLE some_table, then it is only allowed to operate on some_table. Specifically, it will not be allowed to rename another table.

youzipi commented Jan 7, 2020

if session A runs a LOCK TABLE some_table, then it is only allowed to operate on some_table. Specifically, it will not be allowed to rename another table.

But the lock and the rename are in different sessions;
C20 does not hold a lock.
Does this come from the old Facebook approach, but no longer makes sense in the new solution?

shlomi-noach commented Jan 7, 2020

but lock and rename are in different session.

You're right. I wasn't paying enough attention.
So, I'm trying to go three years back in time to recall my experiments. I'm assuming I had good reason to lock tbl_old. Possibly you are right and there is no need to lock it -- but also, there is no overhead to it. It is not being used in any way. I have little capacity right now to set up the testing scripts which checked all the locking behaviors. If you have the time, please go ahead. Again, I believe this may be interesting academically, but with little (no?) impact in reality.

Does this come from the old Facebook approach, but no longer makes sense in the new solution?

I don't understand the question. Facebook did not use any of this, they used two RENAME operations which left a hole/gap/puncture where for a brief period of time, the table does not exist.

My solution avoids that and ensures the table exists at all times.

Shawn001 commented Apr 25, 2020

Hello,

Why can't we use RENAME TABLE t1 TO t1_del, t1_gho TO t1 directly?

Are there some problems with that?

shlomi-noach commented Apr 25, 2020

It can’t happen because gh-ost tails the binary log, either on the master or on a replica, and there may still be events to handle in the binary logs during the cut-over.

Shawn001 commented Apr 26, 2020

It can’t happen because gh-ost tails the binary log, either on the master or on a replica, and there may still be events to handle in the binary logs during the cut-over.

Thanks.
I found the three chapters you posted several years ago:
http://code.openark.org/blog/mysql/solving-the-non-atomic-table-swap-take-ii

wukongHH commented Jun 11, 2021

Noteworthy that as of MySQL 8.0.13 there's support for RENAME TABLES under LOCK TABLES, specifically designed for gh-ost, see: https://mysqlserverteam.com/the-mysql-8-0-13-maintenance-release-is-generally-available/

gh-ost is yet to utilize this new functionality.

Hi, has gh-ost already utilized this new functionality?

shlomi-noach commented Jun 11, 2021

It is unfortunate that the implementation actually does not meet gh-ost's requirement, and using the new implementation sadly complicates the logic rather than simplifying it. I therefore made no progress adopting the new implementation.

wukongHH commented Jun 11, 2021

Can I ask why the new implementation can't simplify the logic? I don't understand it myself. o(╥﹏╥)o

shlomi-noach commented Jun 11, 2021

cenkore commented Jun 16, 2021

@shlomi-noach

If C20 dies just after C10 DROPs the table but before the unlock, same as above.

This entry appears to be incorrect. In this case the cut-over should complete normally: once the session is broken, C10's lock is released, C20 acquires the lock, and the tables are swapped directly. gh-ost will then fail in the retry stage with the error Expected magic comment on _xxx_del, did not find it and exit, but the cut-over has actually been completed.

shlomi-noach commented Jun 21, 2021

I don't think I have the bandwidth to solve this again. Unless someone else has a definitive solution, consider using -cut-over=two-step
