Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range Locking - rebased on latest fb-mysql-8.0.20 #1165

Open
wants to merge 29 commits into
base: fb-mysql-8.0.20
Choose a base branch
from

Conversation

spetrunia
Copy link
Contributor

@spetrunia spetrunia commented May 17, 2021

This includes:

  1. Iterator patches, re-based on top of the latest fb-mysql-8.0.20 tree
  2. Range Locking patch

Copy link
Contributor

@hermanlee hermanlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few initial comments. I will continue to read through this.

storage/rocksdb/rdb_locking_iter.h Outdated Show resolved Hide resolved
storage/rocksdb/rdb_locking_iter.h Outdated Show resolved Hide resolved
storage/rocksdb/rdb_locking_iter.h Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Show resolved Hide resolved
@spetrunia spetrunia force-pushed the fb-mysql-8.0.20-range-locking-may branch from c89e6b3 to f8effb0 Compare July 26, 2021 16:00
@spetrunia spetrunia changed the title Range Locking - rebased on latest fb-mysql-8.0.20. Includes iterator patches Range Locking - rebased on latest fb-mysql-8.0.20 Jul 26, 2021
Copy link
Contributor

@hermanlee hermanlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing a different behavior from InnoDB where the gap from the last key to supremum is not being acquired.

On InnoDB:

TXN 1:
create table t1 (pk int primary key) engine=innodb;
insert into t1 values (1), (10), (100);
begin;
select * from t1 for update;

TXN 2:
begin;
select * from t1;
insert into t1 values (150); # times out on lock

On MyRocks:
insert into t1 values (150); # succeeds

Copy link
Contributor

@hermanlee hermanlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another difference with InnoDB:

TXN 1:
create table t1 (pk int primary key) engine=rocksdb;
insert into t1 values (1), (10), (100);
begin;
select * from t1;

TXN 2:
begin;
select * from t1;
insert into t1 values (90);
commit;

TXN 1:
select * from t1 lock in share mode; /* InnoDB returns 4 rows, but MyRocks only returns 3 rows: 90 is missing. */

Using select * from t1 for update, MyRocks correctly returns 4 rows instead of 3.

storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/ha_rocksdb.cc Outdated Show resolved Hide resolved
storage/rocksdb/rdb_locking_iter.h Show resolved Hide resolved
storage/rocksdb/rdb_i_s.cc Show resolved Hide resolved
storage/rocksdb/rdb_locking_iter.h Outdated Show resolved Hide resolved
(Cherry-picked on top of the pushed iterator patches)

When it is ON, MyRocks will:
- initialize RocksDB to use range-locking lock manager
- for all DML operations (including SELECT .. FOR UPDATE) will lock
the scanned range before reading/modifying rows.
- In range locking mode, there is no snapshot checking (cannot do that
  for ranges). Instead, MyRocks will read and modify latest committed
  data, just like InnoDB does (in the code, grep for (start|end)
  _ignore_snapshot)
- Queries that do not have a finite range to scan, like
  UPDATE t1 .... ORDER BY t1.key LIMIT n
  will use a  "Locking iterator" which will read rows, lock the range,
  and re-read the rows. See class LockingIterator.
LockingIterator may attempt to lock the same range multiple times
if another transaction is inserting records into the range it is scanning
through. Make the code avoid this.
Do not decrement it in unlock_row, as there may be multiple unlock_row
calls per range.
Range Locking code locks the gap after the last value now, which
means for 2-row table "SELECT * FROM t FOR UPDATE" will place 3 locks.
Initializing it in ha_rocksdb::external_lock was not enough. MyRocks uses
index read calls before external_lock() is called:

myrocks::Rdb_iterator_base::seek()
myrocks::ha_rocksdb::index_read_intern()
myrocks::ha_rocksdb::index_read_intern()
myrocks::ha_rocksdb::index_last()
myrocks::ha_rocksdb::load_auto_incr_value_from_index()
myrocks::ha_rocksdb::load_auto_incr_value()
myrocks::ha_rocksdb::open()
handler::ha_open()
This is a followup to:

  Address review input: lock the "gap" after the last row that was read

Made the code work when iterator bounds are not available, in which
case it locks till index start/end.
@spetrunia spetrunia force-pushed the fb-mysql-8.0.20-range-locking-may branch from 04b29be to 8c6c1b8 Compare August 23, 2021 19:04
Copy link
Contributor

@hermanlee hermanlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following doesn't look right:

/* TXN 1 */
mysql> create table t1 (pk int not null auto_increment, a int, b int, primary key (pk), unique key (a)) engine=rocksdb;
Query OK, 0 rows affected (0.05 sec)
                                         
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
                                                                                   
mysql> insert into t1 (a, b) values (1, 1), (2, 2);                               
Query OK, 2 rows affected (0.01 sec)                                               
Records: 2  Duplicates: 0  Warnings: 0                                             
                                                                                   
mysql> select * from information_schema.rocksdb_locks;                            
+------------------+----------------+--------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                | MODE |
+------------------+----------------+--------------------+------+
|                0 |             17 | 000001080180000002 | X    |
|                0 |             17 | 000001080180000001 | X    |
|                0 |             17 | 0000010780000002   | X    |
|                0 |             17 | 0000010780000001   | X    |
+------------------+----------------+--------------------+------+
4 rows in set (0.01 sec)

/* TXN 2 */
mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 (a, b) values (10, 10), (20, 20);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction: Timeout on index: test.t1.PRIMARY

/* TXN 1 */
mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+-------------------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                                 | MODE |
+------------------+----------------+-------------------------------------+------+
|                0 |             17 | 0000010780000001-000001080180000002 | X    |
+------------------+----------------+-------------------------------------+------+
1 row in set (0.01 sec)

The range lock seems to span multiple indexes.

Another example:

mysql> create table t6 (pk int not null auto_increment, a int, b int, c int, primary key (pk), unique key (a), key (b), unique key (c)) engine=rocksdb;
Query OK, 0 rows affected (0.03 sec)
                                          
mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t6 (a, b, c) values (1, 1, 1), (2, 2, 2);                                                                                                             
Query OK, 2 rows affected (0.00 sec)                                                                                                                                     
Records: 2  Duplicates: 0  Warnings: 0    
                                          
mysql> select * from information_schema.rocksdb_locks;                                                                                                                   
+------------------+----------------+---------------------------------------------+------+                                                                               
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                                         | MODE |                                                                               
+------------------+----------------+---------------------------------------------+------+                                                                               
|                0 |             21 | 00000113018000000280000002                  | X    |                                                                               
|                0 |             21 | 0000011080000001-00000113018000000180000001 | X    |                                                                               
+------------------+----------------+---------------------------------------------+------+                                                                               
2 rows in set (0.00 sec)                  
                                                                                                                                                                         
mysql> select index_name, hex(index_number) from information_schema.rocksdb_ddl where table_name = 't6';                                                                 
+------------+-------------------+
| index_name | hex(index_number) |  
+------------+-------------------+        
| PRIMARY    | 110               |
| a          | 111               |  
| c          | 112               |        
| b          | 113               |                                                                                                                                       
+------------+-------------------+  
4 rows in set (0.00 sec)                  

Use InnoDB-like concurrency mode: make the DML statements always read
the latest data (instead of using transaction's snapshot).
This "downgrades" the transaction isolation to READ-COMMITTED on the
primary, but in return the actions can be replayed on the replica.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first half of the sentence makes sense, but the second half: "but in return the actions can be replayed on the replica" seems to involve replication and it's not clear how replication to secondaries is affected by this concurrency mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it doesn't sound clear.

IIRC I had this in mind when writing this:

With statement-based replication and point locking, one can get the slave
out of sync. Using gap(or range) locking allows to avoid that. But when
one uses Gap (or range) locking, the locking reads must read the latest
committed data.

I guess this should either be stated in full, or not at all...

storage/rocksdb/rdb_locking_iter.cc Outdated Show resolved Hide resolved
@spetrunia
Copy link
Contributor Author

#1165 (review)

The following doesn't look right:
...

I can't reproduce this behavior... This is what I'm getting:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 (a, b) values (1, 1), (2, 2);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+--------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                | MODE |
+------------------+----------------+--------------------+------+
|                0 |              2 | 0000016e0180000002 | X    |
|                0 |              2 | 0000016d80000002   | X    |
|                0 |              2 | 0000016e0180000001 | X    |
|                0 |              2 | 0000016d80000001   | X    |
+------------------+----------------+--------------------+------+
4 rows in set (0.00 sec)

so far the same. But TXN 2 will succeed to get the lock:

/* TXN 2 */

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 (a, b) values (10, 10), (20, 20);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+--------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                | MODE |
+------------------+----------------+--------------------+------+
|                0 |              3 | 0000016e0180000014 | X    |
|                0 |              3 | 0000016e018000000a | X    |
|                0 |              2 | 0000016e0180000002 | X    |
|                0 |              2 | 0000016e0180000001 | X    |
|                0 |              3 | 0000016d80000004   | X    |
|                0 |              3 | 0000016d80000003   | X    |
|                0 |              2 | 0000016d80000002   | X    |
|                0 |              2 | 0000016d80000001   | X    |
+------------------+----------------+--------------------+------+

..which looks correct to me

@spetrunia
Copy link
Contributor Author

Another example:

mysql> create table t6 (pk int not null auto_increment, a int, b int, c int, primary key (pk), unique key (a), key (b), unique key (c)) engine=rocksdb;
Query OK, 0 rows affected (0.03 sec)
                                          
mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t6 (a, b, c) values (1, 1, 1), (2, 2, 2);                                                                                                             
Query OK, 2 rows affected (0.00 sec)                                                                                                                                     
Records: 2  Duplicates: 0  Warnings: 0    
                                          
mysql> select * from information_schema.rocksdb_locks;                                                                                                                   
+------------------+----------------+---------------------------------------------+------+                                                                               
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                                         | MODE |                                                                               
+------------------+----------------+---------------------------------------------+------+                                                                               
|                0 |             21 | 00000113018000000280000002                  | X    |                                                                               
|                0 |             21 | 0000011080000001-00000113018000000180000001 | X    |                                                                               
+------------------+----------------+---------------------------------------------+------+                                                                               
2 rows in set (0.00 sec)                  

For me, I get this:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t6 (a, b, c) values (1, 1, 1), (2, 2, 2);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> 
mysql> select * from information_schema.rocksdb_locks;
+------------------+----------------+----------------------------+------+
| COLUMN_FAMILY_ID | TRANSACTION_ID | KEY                        | MODE |
+------------------+----------------+----------------------------+------+
|                0 |              1 | 00000172018000000280000002 | X    |
|                0 |              1 | 000001710180000002         | X    |
|                0 |              1 | 000001700180000002         | X    |
|                0 |              1 | 0000016f80000002           | X    |
|                0 |              1 | 00000172018000000180000001 | X    |
|                0 |              1 | 000001710180000001         | X    |
|                0 |              1 | 000001700180000001         | X    |
|                0 |              1 | 0000016f80000001           | X    |
+------------------+----------------+----------------------------+------+
8 rows in set (0.00 sec)

mysql>  select index_name, hex(index_number) from information_schema.rocksdb_ddl where table_name = 't6';
+------------+-------------------+
| index_name | hex(index_number) |
+------------+-------------------+
| PRIMARY    | 16F               |
| a          | 170               |
| c          | 171               |
| b          | 172               |
+------------+-------------------+
4 rows in set (0.01 sec)

which looks ok at first glance

@spetrunia
Copy link
Contributor Author

Double checking if I am using the right tree... it is https://github.com/spetrunia/mysql-5.6/commits/fb-mysql-8.0.20-range-locking-may

commit 9f29d708d47db4aaf4e015697dd088d2ee408b44 (HEAD -> fb-mysql-8.0.20-range-locking-may, origin/fb-mysql-8.0.20-range-locking-may)
Author: Sergei Petrunia <psergey@askmonty.org>
Date:   Mon Sep 13 20:23:12 2021 +0300

    Range Locking: Make partial index support LockingIterator.

@hermanlee
Copy link
Contributor

I ran my tests with the range locking change based on top of: b22013d.

In your lock outputs, none of the locks look like they were escalated and merged together. What triggers that process?

@spetrunia
Copy link
Contributor Author

lock escalation is triggered by lock tree memory exceeding the limit. The current lock memory usage should be shown in rocksdb_current_lock_memory, and rocksdb_max_lock_memory is where one sets the limit.

@hermanlee
Copy link
Contributor

Ah, my setup also inherited these settings from an existing test:

-rocksdb_use_range_locking=1 --rocksdb_max_lock_memory=1024

- Truncate the table before starting the test.
- Added n_insert_failed counter
If an attempt to insert a row or create a group fails (because we've
found unsuitable data), do not count it.
It starts by doing a locking read of 5 rows in a random location.

If it encounters a grouped row that's not a group leader, it should
follow till the end of this group and then delete+create the next group.
Also: We are now using skewed distribution, in order for INSERTs
to not become stuck, make them do deletes when they are not able
to insert.
@laurynas-biveinis
Copy link
Contributor

Should this one be closed because of #1185 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants