New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot Isolation may not be as isolated as one would like #336

Open
aphyr opened this Issue Sep 1, 2015 · 7 comments

Comments

Projects
None yet
4 participants
@aphyr

aphyr commented Sep 1, 2015

I might be wrong about this, and I'd like some help checking my work--but it looks like MariaDB + Galera Cluster, version 7.4.7 from the MariaDB Debian Jessie repos, may not correctly enforce snapshot isolation between transactions.

This Jepsen test sets up a five node Galera cluster and runs a simple simulation of transfers between two bank accounts, both starting with balance = 10.

create table if not exists accounts
  (id      int not null primary key,
   balance bigint not null)
INSERT INTO accounts ( id, balance ) VALUES ( 0, 10 )
INSERT INTO accounts ( id, balance ) VALUES ( 1, 10 )

We then have several clients, connected to all 5 nodes, execute a mix of read transactions and balance-transfer transactions at SERIALIZABLE. Read transactions simply read all balances:

           67 Connect   jepsen@192.168.122.1 as anonymous on jepsen
           67 Query show variables like 'max_allowed_packet'
           67 Query SELECT @@tx_isolation
           67 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
           67 Query set autocommit=0
           67 Query select * from accounts
           67 Query COMMIT
           67 Query set autocommit=1
           67 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
           67 Quit

Transfer transactions read the balance of both accounts, move some from one to the other, and, if both remain positive, write back the new balances. This one transfers 0, and writes 10 and 10 respectively. Note that the write set always covers the read set, and there are only two rows in the table to begin with--I don't think we should be subject to write skew anomalies here, and phantoms shouldn't be an issue.

           62 Connect   jepsen@192.168.122.1 as anonymous on jepsen
           62 Query show variables like 'max_allowed_packet'
           62 Query SELECT @@tx_isolation
           62 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
           62 Query set autocommit=0
           62 Query select * from accounts where id = 0
           62 Query select * from accounts where id = 1
           62 Query UPDATE accounts SET balance = 10 WHERE id = 0
           62 Query UPDATE accounts SET balance = 10 WHERE id = 1
           62 Query COMMIT
           62 Query ROLLBACK
           62 Query set autocommit=1
           62 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
           62 Quit  

As I understand the isolation level documentation, Galera Cluster should provide Serializable isolation locally and Snapshot Isolation between nodes. Since SI is higher than RR, I assume you have to set ISOLATION LEVEL SERIALIZABLE to actually obtain SI.

As I understand SI, given fixed working sets and with write sets that cover reads, all balances should sum to the same value. As a test invariant, we measure that every read transaction sees a consistent snapshot--in this case, that the balances always sum to 20. This works for low levels of concurrency, but with 50 concurrent clients, we start seeing anomalous reads. For instance:

4   :ok :read   [3 17]
0   :invoke :read   nil
5   :invoke :transfer   {:from 0, :to 1, :amount 3}
23  :invoke :read   nil
9   :ok :transfer   {:from 1, :to 0, :amount 1}
6   :invoke :transfer   {:from 1, :to 0, :amount 0}
34  :ok :transfer   {:from 1, :to 0, :amount 3}
23  :ok :read   [4 16]
5   :fail   :transfer   [:negative 0 -1]
0   :ok :read   [2 17]
33  :invoke :read   nil
18  :invoke :transfer   {:from 1, :to 0, :amount 4}
29  :invoke :transfer   {:from 0, :to 1, :amount 2}
33  :ok :read   [7 13]

See it? Process 0 reads the balances 2 and 17, which only sum to 19. A correct value, read by process 4, might have been [3 17], or the subsequent [7 13]. There are a lot of transactions in flight at any given point in this log, so it's a little hard to reconstruct the order of events. :(

Transfer transactions see inconsistent reads as well. For instance, this transaction only sees a total balance of 17, not 20, and attempts to write [9 8].

           66 Connect   jepsen@192.168.122.1 as anonymous on jepsen
           66 Query show variables like 'max_allowed_packet'
           66 Query SELECT @@tx_isolation
           66 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
           66 Query set autocommit=0
           66 Query select * from accounts where id = 1
           66 Query select * from accounts where id = 0
           66 Query UPDATE accounts SET balance = 8 WHERE id = 1
           66 Query UPDATE accounts SET balance = 9 WHERE id = 0
           66 Query COMMIT
           66 Query ROLLBACK
           66 Query set autocommit=1
           66 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
           66 Quit

This particular commit conflicts and hits a rollback--and in fact, every transfer transaction I've analyzed so far appears to roll back, which suggests write promotion is doing the right thing here, and only read transactions fail to serialize appropriately. I'm struggling to understand whether this qualifies as an instance of http://www.cs.umb.edu/~poneil/ROAnom.pdf, or if these kinds of reads should be prevented in this case.

The complete logfiles from each host, including query logs, and the history of operations from Jepsen, plus analysis showing all inconsistent reads, are available here.

   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op
    {:value [2 14],
     :time 15245300529,
     :process 1,
     :type :ok,
     :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op
    {:value [2 14],
     :time 15245328813,
     :process 21,
     :type :ok,
     :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op
    {:value [2 14],
     :time 15245350309,
     :process 16,
     :type :ok,
     :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 19,
    :op
    {:value [5 14],
     :time 97109842242,
     :process 86,
     :type :ok,
     :f :read}}]}}

Do you have any suggestions here? Have I misunderstood Galera's guarantees? Or is there additional debugging information I can provide?

@aphyr aphyr changed the title from Snapshot Isolation may not be so isolated? to Snapshot Isolation may not be as isolated as one would like Sep 1, 2015

@sjaakola

This comment has been minimized.

Show comment
Hide comment
@sjaakola

sjaakola Sep 1, 2015

Contributor

The first committer wins rule is not honored by galera certification, this has been optimized for performance reasons, at some point. However, it is possible to support SI and make it configurable per session, so we should leave this as a potential task for some future release.

Serializable transactions are supported only if all writes are targeted to one node. Any replication coming from other nodes will not honor read locks held by serializable transaction. Achieving true serializability would require populating also read sets in Galera replication. This would cause major impact for performance, but again it would be configurable per session, and a potential new task for future, as well.

Thanks for the test case, and detailed bug report!

Contributor

sjaakola commented Sep 1, 2015

The first committer wins rule is not honored by galera certification, this has been optimized for performance reasons, at some point. However, it is possible to support SI and make it configurable per session, so we should leave this as a potential task for some future release.

Serializable transactions are supported only if all writes are targeted to one node. Any replication coming from other nodes will not honor read locks held by serializable transaction. Achieving true serializability would require populating also read sets in Galera replication. This would cause major impact for performance, but again it would be configurable per session, and a potential new task for future, as well.

Thanks for the test case, and detailed bug report!

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Sep 1, 2015

The first committer wins rule is not honored by galera certification

That's... interesting. So when the docs say

Transaction isolation also occurs at the cluster level. Between transactions processing on separate nodes, Galera Cluster implements a transaction level called SNAPSHOT-ISOLATION. The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.

Does that mean Galera should provide snapshot isolation? Without first-committer-wins, you can't really have snapshot isolation, right? Does Galera actually provide some weaker isolation level--e.g. snapshot isolation plus read skew, and, if so, should the docs be changed?

aphyr commented Sep 1, 2015

The first committer wins rule is not honored by galera certification

That's... interesting. So when the docs say

Transaction isolation also occurs at the cluster level. Between transactions processing on separate nodes, Galera Cluster implements a transaction level called SNAPSHOT-ISOLATION. The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.

Does that mean Galera should provide snapshot isolation? Without first-committer-wins, you can't really have snapshot isolation, right? Does Galera actually provide some weaker isolation level--e.g. snapshot isolation plus read skew, and, if so, should the docs be changed?

@aphyr

This comment has been minimized.

Show comment
Hide comment
@aphyr

aphyr Sep 3, 2015

Adjusting the concurrency and duration of the test, I can now confirm that transfer transactions are also affected: not only can read-only transactions see intermediate results from transfers, but transfers can write back intermediate results successfully, causing the permanent creation or destruction of money in the system.

Here are two test cases; in the first, the total rises from $20 to $22, and in the second, it falls to $15.

I believe this kind of data corruption is a more serious issue than inconsistent pure-read transactions--it may be a good idea to revise your documentation to reflect the possibility of data corruption for users relying on Snapshot Isolation's invariants.

aphyr commented Sep 3, 2015

Adjusting the concurrency and duration of the test, I can now confirm that transfer transactions are also affected: not only can read-only transactions see intermediate results from transfers, but transfers can write back intermediate results successfully, causing the permanent creation or destruction of money in the system.

Here are two test cases; in the first, the total rises from $20 to $22, and in the second, it falls to $15.

I believe this kind of data corruption is a more serious issue than inconsistent pure-read transactions--it may be a good idea to revise your documentation to reflect the possibility of data corruption for users relying on Snapshot Isolation's invariants.

@sjaakola

This comment has been minimized.

Show comment
Hide comment
@sjaakola

sjaakola Sep 7, 2015

Contributor

By trading the "first committer wins rule" for performance, Galera does not any more provide SI. We need to change the documentation accordingly.
As said, it is possible to bring back SI, as session property, we will consider it for future road map.

Contributor

sjaakola commented Sep 7, 2015

By trading the "first committer wins rule" for performance, Galera does not any more provide SI. We need to change the documentation accordingly.
As said, it is possible to bring back SI, as session property, we will consider it for future road map.

@temeo

This comment has been minimized.

Show comment
Hide comment
@temeo

temeo Sep 9, 2015

Contributor

@aphyr
I did some experimenting around Jepsen Galera bank test (with latest Galera Cluster release binaries from http://releases.galeracluster.com/) and found out the reason the test fails might be in SERIALIZABLE transaction isolation mode, or rather in LOCK IN SHARE MODE. MySQL documentation https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html states about SERIALIZABLE:

This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to
SELECT ... LOCK IN SHARE MODE if autocommitis disabled.

so running test in SERIALIZABLE mode is essentially the same as running the test in REPEATABLE READ mode with SELECT statements modified to LOCK IN SHARE MODE.

After some experimenting I found out that setting transaction isolation mode to REPEATABLE READ made bank test pass, but making any of the SELECT statements to LOCK IN SHARE MODE explicitly made the test fail.

Also, setting transaction isolation mode to SERIALIZABLE and doing SELECTs in FOR UPDATE mode (see https://dev.mysql.com/doc/refman/5.6/en/innodb-locking-reads.html for explanation) made the test pass.

So it seems that something in galera/mysql-wsrep implementation (probably in prioritized transaction processing) does not honor shared locks properly or the effect of brute force abort caused by prioritized transaction is not propagated properly for locally executing (so far read only) transaction.

Concurrency level in tests was 50 concurrent clients, so I'm quite confident that the success result is valid.

Edit: My changes to jepsen can be found from: https://github.com/codership/jepsen/tree/galeracluster

Contributor

temeo commented Sep 9, 2015

@aphyr
I did some experimenting around Jepsen Galera bank test (with latest Galera Cluster release binaries from http://releases.galeracluster.com/) and found out the reason the test fails might be in SERIALIZABLE transaction isolation mode, or rather in LOCK IN SHARE MODE. MySQL documentation https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html states about SERIALIZABLE:

This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to
SELECT ... LOCK IN SHARE MODE if autocommitis disabled.

so running test in SERIALIZABLE mode is essentially the same as running the test in REPEATABLE READ mode with SELECT statements modified to LOCK IN SHARE MODE.

After some experimenting I found out that setting transaction isolation mode to REPEATABLE READ made bank test pass, but making any of the SELECT statements to LOCK IN SHARE MODE explicitly made the test fail.

Also, setting transaction isolation mode to SERIALIZABLE and doing SELECTs in FOR UPDATE mode (see https://dev.mysql.com/doc/refman/5.6/en/innodb-locking-reads.html for explanation) made the test pass.

So it seems that something in galera/mysql-wsrep implementation (probably in prioritized transaction processing) does not honor shared locks properly or the effect of brute force abort caused by prioritized transaction is not propagated properly for locally executing (so far read only) transaction.

Concurrency level in tests was 50 concurrent clients, so I'm quite confident that the success result is valid.

Edit: My changes to jepsen can be found from: https://github.com/codership/jepsen/tree/galeracluster

@temeo

This comment has been minimized.

Show comment
Hide comment
@temeo

temeo Sep 9, 2015

Contributor

The following simple MTR test case demonstrates that read-only transaction with SERIALIZABLE isolation level will normally abort if conflicting high priority (certified) transaction commits before it.

--source include/galera_cluster.inc
--source include/have_innodb.inc

#
# Test a local read-only transaction with SERIALIZABLE transaction isolation
# mode in precense of BF abort
#

CREATE TABLE t1 (f1 INTEGER PRIMARY KEY, f2 INTEGER) ENGINE=InnoDB;

--connection node_2
--let $wsrep_local_bf_aborts_before = `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_local_bf_aborts'`
INSERT INTO t1 VALUES (1, 1);
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET AUTOCOMMIT=OFF;
START TRANSACTION;
SELECT * FROM t1 WHERE f1 = 1;

--connection node_1
UPDATE t1 SET f2 = 2 WHERE f1 = 1;

--connection node_2
--error ER_LOCK_DEADLOCK
SELECT * FROM t1 where f1 = 1;


--let $wsrep_local_bf_aborts_after = `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_local_bf_aborts'`

--disable_query_log
--eval SELECT $wsrep_local_bf_aborts_after - $wsrep_local_bf_aborts_before = 1 AS wsrep_local_aborts_increment;
--enable_query_log

DROP TABLE t1;
Contributor

temeo commented Sep 9, 2015

The following simple MTR test case demonstrates that read-only transaction with SERIALIZABLE isolation level will normally abort if conflicting high priority (certified) transaction commits before it.

--source include/galera_cluster.inc
--source include/have_innodb.inc

#
# Test a local read-only transaction with SERIALIZABLE transaction isolation
# mode in precense of BF abort
#

CREATE TABLE t1 (f1 INTEGER PRIMARY KEY, f2 INTEGER) ENGINE=InnoDB;

--connection node_2
--let $wsrep_local_bf_aborts_before = `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_local_bf_aborts'`
INSERT INTO t1 VALUES (1, 1);
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET AUTOCOMMIT=OFF;
START TRANSACTION;
SELECT * FROM t1 WHERE f1 = 1;

--connection node_1
UPDATE t1 SET f2 = 2 WHERE f1 = 1;

--connection node_2
--error ER_LOCK_DEADLOCK
SELECT * FROM t1 where f1 = 1;


--let $wsrep_local_bf_aborts_after = `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_local_bf_aborts'`

--disable_query_log
--eval SELECT $wsrep_local_bf_aborts_after - $wsrep_local_bf_aborts_before = 1 AS wsrep_local_aborts_increment;
--enable_query_log

DROP TABLE t1;
@Marqin

This comment has been minimized.

Show comment
Hide comment
@Marqin

Marqin Jan 25, 2016

How's the progress on this issue?

Marqin commented Jan 25, 2016

How's the progress on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment