Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Closed
shlomi-noach opened this Issue Jun 26, 2016 · 4 comments

Comments

Projects
None yet
2 participants
@shlomi-noach
Copy link
Collaborator

shlomi-noach commented Jun 26, 2016

Final-finally-finalizationally, here's an asynchronous, safe, atomic cut-over phase.
This solution doesn't cause "table outage" (as in #65).

Here are the steps for a safe, atomic cut-over:

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE

  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'

  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK

  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl

    • This is blocked due to the LOCK, but gets prioritized on top connections C11..C19 and on top C1..C9 or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue

  • Connection C10: checks that C20's RENAME is applied (looks for the blocked RENAME in show processlist)

  • Connection 10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections still blocked.

  • Connection 10: UNLOCK TABLES

    BAM! The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl

Some notes

  • We create tbl_old as a blocker for a premature swap
  • It is allowed for a connection to DROP a table it has under a WRITE LOCK
  • A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

What happens on failures?

Much fun. Just works; no rollback required.

  • If C10 errors on the CREATE we do not proceed.
  • If C10 errors on the LOCK statement, we do not proceed. The table is not locked. App continues to operate as normal.
  • If C10 dies just as C20 is about to issue the RENAME:
    • The lock is released, the queries C1..C9, C11..C19 immediately operate on tbl.
    • C20's RENAME immediately fails because tbl_old exists.
    • The entire operation is failed, but nothing terrible happens; some queries were blocked for some time is all. We will need to retry everything
  • If C10 dies while C20 is blocked on RENAME: Mostly similar to the above. Lock released, then C20 fails the RENAME (because tbl_old exists), then all queries resume normal operation
  • If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: DROP, UNLOCK. Nothing terrible happens, some queries were blocked for some time. We will need to retry
  • If C20 dies just after C10 DROPs the table but before the unlock, same as above.
  • If both C10 and C20 die, no problem: LOCK is cleared; RENAME lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on tbl.

No matter what happens, at the end of operation we look for the ghost table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl, and the operation worked atomically.

A side note on failure is the matter of cleaning up the magic tbl_old. Here this is a matter of taste. Maybe just let it live and avoid recreating it, or you can drop it if you like.

Impact on app

App connections are guaranteed to be blocked, either until ghost is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.

Impact on replication

Replication only sees the RENAME. There is no LOCK in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.

@baloo

This comment has been minimized.

Copy link

baloo commented Aug 1, 2016

Hello @shlomi-noach ,

First thanks for opensourcing gh-ost :) It looks very good :)
The only question remaining in my mind was about this particular subject (cut-over phase), thanks for documenting it. But I believe I now have two more:

  • What ensures that no other connections writes to tbl table in-between those events:

    • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
    • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

    wouldn't it be safer to:

    • Connection C10: LOCK TABLES tbl WRITE;
    • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
    • Connection C10: LOCK TABLES tbl_old WRITE;

    I believe this way the replication client knowns once the CREATE_TABLE occurs in the replication stream, no writes to tbl will occur ever. And it's safe to proceed further down (with the rename and so on).

  • Other question: You assume:

    A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

    From my understanding of the mysql documentation this is only guaranteed by the lock acquirement algorithm that mysql uses [https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html]:

    LOCK TABLES acquires locks as follows:

    • Sort all tables to be locked in an internally defined order. From the user standpoint, this order is undefined.
    • If a table is to be locked with a read and a write lock, put the write lock request before the read lock request.
    • Lock one table at a time until the session gets all locks.

    I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
    Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?

@shlomi-noach

This comment has been minimized.

Copy link
Collaborator Author

shlomi-noach commented Aug 2, 2016

@baloo, thank you for your review!

What ensures that no other connections writes to tbl table in-between those events:

There is nothing to ensure that, and there is no need to ensure that; the table creation is merely a step in the direction of beginning the cut-over. There is no problem that tbl still takes writes, these are still being read in the binary log and propagated to the ghost table.

Moreover, if we reverse the order as you suggest, implication is we waste more time during the locked phase.

wouldn't it be safer to:

  • Connection C10: LOCK TABLES tbl WRITE;
  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
  • Connection C10: LOCK TABLES tbl_old WRITE;

This doesn't work like that in MySQL, see: http://dev.mysql.com/doc/refman/5.7/en/lock-tables.html:

... A session that requires locks must acquire all the locks that it needs in a single LOCK TABLES statement. ...

You cannot just grab more locks as you go along, you have to grab them all at once.

I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?

The relevant code in MySQL is https://github.com/mysql/mysql-server/blob/a533e2c786164af9bd276660b972d93649434297/sql/mdl.cc#L2312. You are correct this relies on internal behavior. This behavior exists for over 15 years or so, and, having discussed (informally, Safe Harbour etc.) with engineers - has no intention of going away.

I'm glad you point it out, I'll make it explicit in the fine print.

This entire scheme came to be because of a limitation in MySQL, where you cannot rename a table you have under lock in the same session. I'm in good mind to influence next MySQL version to support the above, rendering this entire algorithm unnecessary.

@baloo

This comment has been minimized.

Copy link

baloo commented Aug 2, 2016

Thanks for clarifications :)

@shlomi-noach

This comment has been minimized.

Copy link
Collaborator Author

shlomi-noach commented Oct 29, 2018

Noteworthy that as of MySQL 8.0.13 there's support for RENAME TABLES under LOCK TABLES, specifically designed for gh-ost, see: https://mysqlserverteam.com/the-mysql-8-0-13-maintenance-release-is-generally-available/

gh-ost is yet to utilize this new functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.