Change the locking system to use RocksDB's pessimistic transactions #46

Closed
hermanlee opened this issue Sep 28, 2015 · 12 comments


Issue by spetrunia
Wednesday Jun 17, 2015 at 21:02 GMT
Originally opened as MySQLOnRocksDB#86


RocksDB's pessimistic transaction system handles locking and also takes care of storing not-yet-committed changes made by the transaction. That is, it has two counterparts in MyRocks:

  1. Row locking module (rdb_locks.h: class LockTable)
  2. Table of transaction's changes (rdb_rowmods.h: class Row_table).

If we just replace #1, there will be data duplication (Row_table will hold the same data as the WriteBatchWithIndex).

Using the API to get SQL semantics

At the start, we call
transaction->SetSnapshot()

this gives us:

+  // If SetSnapshot() is used, then any attempt to read/write a key in this
+  // transaction will fail if either another transaction has this key locked
+  // OR if this key has been written by someone else since the most recent
+  // time SetSnapshot() was called on this transaction.

Then, reading, modifying, and writing a key can be done with a simple

rocksdb_trx->Get() 
... modify the record as needed
rocksdb_trx->Put()
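
For reference, here is a minimal end-to-end sketch of this flow against the pessimistic transaction API (the DB path, key name, and use of assert() are illustrative; this is a sketch, not MyRocks code):

    #include <cassert>
    #include <string>
    #include "rocksdb/utilities/transaction.h"
    #include "rocksdb/utilities/transaction_db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      rocksdb::TransactionDBOptions txn_db_options;
      rocksdb::TransactionDB* txn_db;
      rocksdb::Status s =
          rocksdb::TransactionDB::Open(options, txn_db_options, "/tmp/txn_db", &txn_db);
      assert(s.ok());

      rocksdb::WriteOptions write_options;
      rocksdb::TransactionOptions txn_options;
      rocksdb::Transaction* txn = txn_db->BeginTransaction(write_options, txn_options);

      // Take the snapshot that gives the "fail on conflicting writes" semantics
      // quoted above.
      txn->SetSnapshot();

      rocksdb::ReadOptions read_options;
      read_options.snapshot = txn->GetSnapshot();

      std::string value;
      s = txn->Get(read_options, "pk:1", &value);  // or GetForUpdate() to also lock the key
      // ... modify `value` as needed ...
      s = txn->Put("pk:1", value);

      s = txn->Commit();
      delete txn;
      delete txn_db;
      return 0;
    }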

Misc notes

  • Modifying the same row multiple times will work, it seems.

Open issues

  • The Transaction API doesn't allow creating iterators. How does one create an Iterator that looks at a snapshot that agrees with the transaction? Pass write_options.snapshot into Transaction::BeginTransaction()? (See the note after this list.)
  • Statement rollback. If a statement within a transaction fails, it is rolled back: all its changes are undone and its locks should be released, while the transaction's own changes/locks should remain. There seems to be no way to achieve this in the API?
  • SELECT ... LOCK IN SHARE MODE. There seems to be no way to obtain shared read locks in the API.
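
For reference, the transaction API as it later landed answers the iterator question with rocksdb::Transaction::GetIterator(), which merges the transaction's own pending writes with its snapshot. A small sketch, reusing txn from the sketch above (the key prefix is illustrative):

    #include <memory>

    rocksdb::ReadOptions read_options;
    read_options.snapshot = txn->GetSnapshot();
    // Sees committed data as of the snapshot plus this transaction's own writes.
    std::unique_ptr<rocksdb::Iterator> it(txn->GetIterator(read_options));
    for (it->Seek("pk:"); it->Valid(); it->Next()) {
      // it->key(), it->value()
    }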

Comment by agiardullo
Thursday Jun 18, 2015 at 03:39 GMT


  1. Yes, I still need to add iterator support. A hacky workaround would be to temporarily re-read keys that you iterate through until iterator support is present. (FYI, we do not have plans to add gap-locking support right now.)

  2. From discussions with Yoshinori, I was under the impression that we did not need to support partial rollback of a transaction. But this is do-able if needed. How critical is this support in the short-term?

  3. Correct. We discussed not supporting shared locks in v1 and will likely add support for this in the future. The initial plan was to have lock in shared mode simply take an exclusive lock in the meantime.


Comment by spetrunia
Tuesday Jun 23, 2015 at 17:38 GMT


  1. From discussions with Yoshinori, I was under the impression that we did not need to support partial rollback of a transaction. But this is do-able if needed. How critical is this support in the short-term?

In order to have SQL semantics, one has to be able to roll back a failed statement. The statement may be part of a transaction; in that case the statement must be rolled back, but the transaction must remain open (i.e. neither committed nor rolled back).

If we only use the Pessimistic Transaction API for locking, we can achieve correct behavior by simply not releasing a failed statement's locks until the transaction has either committed or rolled back. This limits concurrency but is correct behavior.

If we use the Pessimistic Transaction API to also hold the transaction's changes, then we need to be able to roll back: if a statement inside a transaction changes a few rows and then fails, these changes must not be visible to the rest of the transaction.
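
For what it's worth, the transaction API eventually gained savepoints (rocksdb::Transaction::SetSavePoint() / RollbackToSavePoint()), which map naturally onto statement rollback. A minimal sketch, assuming an open transaction txn and an illustrative key/value:

    const std::string new_value = "v2";   // illustrative

    // Mark the start of the statement inside the already-open transaction.
    txn->SetSavePoint();

    rocksdb::Status s = txn->Put("pk:1", new_value);
    if (!s.ok()) {
      // The statement failed: undo only the writes (and, in the pessimistic
      // implementation, the locks) acquired since the savepoint; the
      // transaction itself stays open.
      txn->RollbackToSavePoint();
    }
    // The enclosing transaction later decides: txn->Commit() or txn->Rollback().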


Comment by spetrunia
Tuesday Jul 14, 2015 at 10:45 GMT


I've read the new Pessimistic Transactions patch (posted at https://reviews.facebook.net/D40869 ). It looks like the patch provides everything that MyRocks needs.


Comment by spetrunia
Friday Jul 24, 2015 at 11:02 GMT


I'm getting close to having something working; I found one thing that I had forgotten about.

In the current system, row lock waits are integrated into MySQL. SHOW PROCESSLIST shows waits as:

+----+------+-----------+------+---------+------+----------------------+-------------------------------------+---------------+-----------+------+
| Id | User | Host      | db   | Command | Time | State                | Info                                | Rows examined | Rows sent | Tid  |
+----+------+-----------+------+---------+------+----------------------+-------------------------------------+---------------+-----------+------+
|  1 | root | localhost | j8   | Sleep   |   87 |                      | NULL                                |             0 |         0 | 9827 |
|  3 | root | localhost | NULL | Query   |    0 | init                 | show processlist                    |             0 |         0 | 9907 |
|  4 | root | localhost | j8   | Query   |    9 | Waiting for row lock | insert into t1 values (1,'one-one') |             0 |         0 | 9830 |
+----+------+-----------+------+---------+------+----------------------+-------------------------------------+---------------+-----------+------+

and one can use KILL QUERY to immediately terminate the query that is waiting for the lock.

With my new code that uses the pessimistic RocksDB transaction API, we get state=updating and the thread is not KILLable:

+----+------+-----------+------+---------+------+-------------+-----------------------------------------+---------------+-----------+------+
| Id | User | Host      | db   | Command | Time | State       | Info                                    | Rows examined | Rows sent | Tid  |
+----+------+-----------+------+---------+------+-------------+-----------------------------------------+---------------+-----------+------+
| 18 | root | localhost | j5   | Query   |    0 | cleaning up | show processlist                        |             0 |         0 | 5583 |
| 19 | root | localhost | j5   | Sleep   |   19 |             | NULL                                    |             1 |         0 | 5584 |
| 20 | root | localhost | j5   | Query   |   15 | updating    | update test set value = 12 where id = 1 |             0 |         0 | 7194 |
+----+------+-----------+------+---------+------+-------------+-----------------------------------------+---------------+-----------+------+

In order to achieve state="Waiting for row lock" and KILLability, MyRocks does the following in storage/rocksdb/rdb_locks.cc:

    thd= current_thd;
    thd_enter_cond(thd, &found_lock->cond, &found_lock->mutex,
                   &stage_waiting_on_row_lock, &old_stage);
    ...
    res= mysql_cond_timedwait(&found_lock->cond, &found_lock->mutex,
                              &wait_timeout);
    // Check whether there was a SQL KILL command:
    killed= thd_killed(thd);

    thd_exit_cond(current_thd, &old_stage);

That is, we use MySQL's wrappers over pthread_cond_t and pthread_mutex_t and MySQL's wait function, and we also inform MySQL when we start/finish waiting.

Possible ways out:

  • Make "Pessimistic Transaction API" support all of the above (i.e. user-provided wait condition, mutex, and wait function)
  • Forget about all this for now.
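
If the first option were pursued, one possible shape is a wait hook that the storage engine installs. This is purely a hypothetical sketch, not an existing RocksDB interface:

    // Hypothetical interface sketch -- not part of RocksDB.
    class LockWaitHook {
     public:
      virtual ~LockWaitHook() {}
      // Called when a transaction is about to block on a row lock
      // (MyRocks would call thd_enter_cond() here).
      virtual void EnterWait() = 0;
      // Polled while waiting; returning true aborts the wait
      // (MyRocks would return thd_killed(thd) here).
      virtual bool ShouldAbortWait() = 0;
      // Called once the wait ends (MyRocks would call thd_exit_cond() here).
      virtual void ExitWait() = 0;
    };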


Comment by spetrunia
Friday Jul 24, 2015 at 18:13 GMT


Consider two cases:

Case 1.

trx1> txn1= txn_db->BeginTransaction(...);
trx2> txn2= txn_db->BeginTransaction(...); 

trx1> txn1->Put($key, $value1);  // This gets a lock

trx2> txn2->Put($key, $value2); // This waits for the lock

The last call times out and returns:

(gdb) p s
  $52 = {code_ = rocksdb::Status::kBusy, state_ = 0x0}

I have made MyRocks return ER_LOCK_WAIT_TIMEOUT to the SQL layer in this case.

Case 2:

trx2> txn2= txn_db->BeginTransaction(...); 

trx1> txn1= txn_db->BeginTransaction(...);
trx1> txn1->Put($key, $value1);
trx1> txn1->Commit();

trx2> txn2->Put($key, $value2);

The returned value is the same as in Case 1:

(gdb) print s
  $54 = {code_ = rocksdb::Status::kBusy, state_ = 0x0}

But the situation is different. In Case 1, it makes sense to wait longer. In Case 2, increasing the wait timeout or retrying won't help.

The return value is the same, though, so MyRocks returns ER_LOCK_WAIT_TIMEOUT in Case 2, too. For the SQL user, the error message is misleading.

I am not sure how big of a problem this is.


Comment by agiardullo
Friday Jul 24, 2015 at 22:43 GMT


This is good feedback. Regarding distinguishing error cases, I can change the API to return a different status in each case. I believe there are 4 interesting cases:

Case 1 above: Transaction timed out waiting to acquire a lock. In this case, it is possible that a longer timeout could have succeeded (but we don't know, since we never got the lock).

Case 2 above: Detected a write conflict.

Case 3: Transaction has an expiration time set and is expired.

Case 4: We don't have enough memtable history to determine whether there are any conflicts (User could then choose to tune max_write_buffer_number_to_maintain).

Should we have a different Status for each of these 4 cases? How about:
case1: Status::LockWaitTimeout (new status code)
case2: Status::Busy
case3: Status::TimedOut (there is already precedent in rocksdb for using TimedOut to mean 'expired').
case4: Status::TryAgain (new status code)
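
A sketch of how the storage engine side could map such statuses to handler errors, using the predicates of the current rocksdb::Status class (IsTimedOut(), IsBusy(), IsTryAgain()); the error-code choices are illustrative, not the actual MyRocks mapping:

    // Illustrative mapping only.
    int map_txn_status(const rocksdb::Status& s) {
      if (s.ok())         return 0;
      if (s.IsTimedOut()) return HA_ERR_LOCK_WAIT_TIMEOUT; // lock wait timed out / expired
      if (s.IsBusy())     return HA_ERR_LOCK_DEADLOCK;     // write conflict: retrying on the
                                                           // same snapshot won't help
      if (s.IsTryAgain()) return HA_ERR_LOCK_WAIT_TIMEOUT; // not enough memtable history;
                                                           // a retry may succeed
      return HA_ERR_INTERNAL_ERROR;
    }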


Comment by agiardullo
Friday Jul 24, 2015 at 22:49 GMT


Re MySQL locking: I think we could come up with an API that lets you override which mutex/condvar is used for locking. But I will need to look into MySQL a bit and think about this some more.


Comment by agiardullo
Thursday Aug 06, 2015 at 03:10 GMT


Sergei, how about we chat about the MyRocks locking API? I just sent you a FB message.


Comment by spetrunia
Tuesday Sep 08, 2015 at 10:50 GMT


Got an interesting problem while rebasing the patch over the current tree.

The rocksdb.rocksdb test started to fail with an error like this:

MySQL [test]> create table t1 (
    ->   id int not null,
    ->   key1 int,
    ->   PRIMARY KEY (id),
    ->   index (key1) comment '$per_index_cf'
    -> ) engine=rocksdb;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]> insert into t1 values (1,1);
Query OK, 1 row affected (0.00 sec)

MySQL [test]> select * from t1;
+----+------+
| id | key1 |
+----+------+
|  1 |    1 |
+----+------+
1 row in set (0.00 sec)

MySQL [test]> alter table t1 add col2 int;
ERROR 1815 (HY000): Internal error: Operation failed. Try again.: Transaction
could not check for conflicts for opearation at SequenceNumber 9 as the
MemTable only contains changes newer than SequenceNumber 10.  Increasing the
value of the max_write_buffer_number_to_maintain option could reduce the
frequency of this er

while the expected error was:

ERROR 42000: This version of MySQL doesn't yet support 'ALTER TABLE on table with per-index CF'

The new error is caused by this scenario:

  • ALTER TABLE starts execution
  • ha_rocksdb::external_lock() gets a snapshot with current sequence number (it's 9)
  • The code in Sequence_generator::get_and_update_next_number
    makes a write which increases current sequence number to 10 (NEW-WRITE)
  • A new column family is created for $per_index_cf. As part of that, a MemTable
    is created. The MemTable holds changes that start from sequence_no=10.
  • We attempt to write a row into the new column family. The transaction's
    snapshot has SequenceNumber=9, while the MemTable has data starting from
    SequenceNumber=10 (SEQ-CMP). We get an error.

The problem is not observable in the current repository, because old-style
locking doesn't care about Sequence Numbers.

The problem was not observable before the rebase, because the write (NEW-WRITE) didn't
happen, and the comparison (SEQ-CMP) compared two equal sequence numbers.


Comment by spetrunia
Tuesday Sep 08, 2015 at 12:17 GMT


  • Tried to disallow creation of "temporary" tables (temporary here means the name is #sql_nnnn) with $per_index_cf column families. There is no clear way to do this, it seems.
  • The problem only affects ALTER TABLE statements. CREATE TABLE ... SELECT is not affected, because it creates the destination table before it starts the transaction.


Comment by yoshinorim
Tuesday Sep 08, 2015 at 17:28 GMT


I recently committed https://reviews.facebook.net/D45963, which increments index_id and commits it into the data dictionary at Sequence_generator::get_and_update_next_number(). The internal begin->commit happens per index creation. I discussed this with Herman in the diff:

I think there are two ways -- one is using Transaction API and doing begin -> 
select for update -> update -> commit. The other is what Herman suggested 
-- begin->update->commit at get_next_number(). I think the latter is easier and fine. 
Performance is not a concern since get_next_number() is called at 
index creation (DDL) only.

Maybe it's better to switch to select for update -> update at Sequence_generator::get_and_update_next_number(), within a transaction created at ha_rocksdb::create_key_defs(), and then commit it all together?
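
A rough sketch of that flavour, assuming the counter lives under a single data-dictionary key; the key name and the encode/decode helpers below are made up for illustration:

    #include <cstdint>
    #include <string>

    // Inside the DDL transaction created at ha_rocksdb::create_key_defs()
    // (key name and encode_uint32()/decode_uint32() are hypothetical helpers).
    rocksdb::Status get_next_index_id(rocksdb::Transaction* txn, uint32_t* next_id) {
      const rocksdb::Slice seq_key("dict:next_index_id");
      std::string value;
      // The "select for update": read and lock the counter within the transaction.
      rocksdb::Status s = txn->GetForUpdate(rocksdb::ReadOptions(), seq_key, &value);
      if (!s.ok() && !s.IsNotFound()) return s;
      *next_id = s.IsNotFound() ? 1 : decode_uint32(value) + 1;
      // The lock on seq_key is held until the enclosing DDL transaction commits,
      // which is the serialization (and potential blocking) discussed below.
      return txn->Put(seq_key, encode_uint32(*next_id));
    }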


Comment by hermanlee
Tuesday Sep 08, 2015 at 17:51 GMT


Wouldn't the select for update on the sequence_number block other create table requests until the transaction is committed? If we're performing a restore of multiple databases where tables are created in parallel, could one of the table creations fail due to timing out on the select for update?

spetrunia added a commit that referenced this issue Jan 5, 2016
…verse CF

Summary:
Make ha_rocksdb::index_read_map() correctly handle
find_flag=HA_READ_BEFORE_KEY.
Explanation how it should be handled is provided in
storage/rocksdb/rocksdb-range-access.txt

Test Plan: mtr t/rocksdb_range.test, used gcov to check the new code is covered

Reviewers: maykov, hermanlee4, jtolmer, yoshinorim

Differential Revision: https://reviews.facebook.net/D35331