Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MyRocks executes index_merge query plan incorrectly. #624

Closed
hermanlee opened this issue May 17, 2017 · 7 comments
Closed

MyRocks executes index_merge query plan incorrectly. #624

hermanlee opened this issue May 17, 2017 · 7 comments

Comments

@hermanlee
Copy link
Contributor

Test case:

create table t1 (key1 int, key2 int, key3 int, key (key1), key (key2), key(key3)) engine=rocksdb;
insert into t1 values (1, 100, 100), (1, 200, 200), (1, 300, 300);
set global rocksdb_force_flush_memtable_now = 1;
explain select * from t1 where key1 = 1;
explain select * from t1 where key1 = 1 or key2 = 1;
select * from t1 where key1 = 1;
select * from t1 where key1 = 1 or key2 = 1;

Output: Note the two selects should return the same rows.

mysql> create table t1 (key1 int, key2 int, key3 int, key (key1), key (key2), key(key3)) engine=rocksdb;
Query OK, 0 rows affected (1.07 sec)

mysql> insert into t1 values (1, 100, 100), (1, 200, 200), (1, 300, 300);
Query OK, 3 rows affected (0.07 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> set global rocksdb_force_flush_memtable_now = 1;
Query OK, 0 rows affected (0.42 sec)

mysql> explain select * from t1 where key1 = 1;
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
|  1 | SIMPLE      | t1    | ref  | key1          | key1 | 5       | const |    1 | NULL  |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------+
1 row in set (0.00 sec)

mysql> explain select * from t1 where key1 = 1 or key2 = 1;
+----+-------------+-------+-------------+---------------+-----------+---------+------+------+-------------------------------------+
| id | select_type | table | type        | possible_keys | key       | key_len | ref  | rows | Extra                               |
+----+-------------+-------+-------------+---------------+-----------+---------+------+------+-------------------------------------+
|  1 | SIMPLE      | t1    | index_merge | key1,key2     | key1,key2 | 5,5     | NULL |    2 | Using union(key1,key2); Using where |
+----+-------------+-------+-------------+---------------+-----------+---------+------+------+-------------------------------------+
1 row in set (0.01 sec)

mysql> select * from t1 where key1 = 1;
+------+------+------+
| key1 | key2 | key3 |
+------+------+------+
|    1 |  100 |  100 |
|    1 |  200 |  200 |
|    1 |  300 |  300 |
+------+------+------+
3 rows in set (0.00 sec)

mysql> select * from t1 where key1 = 1 or key2 = 1;
+------+------+------+
| key1 | key2 | key3 |
+------+------+------+
|    1 |  100 |  100 |
+------+------+------+
1 row in set (0.01 sec)
@yoshinorim
Copy link
Contributor

yoshinorim commented May 18, 2017

There is a bug in void ha_rocksdb::setup_read_decoders().

    if (m_lock_rows == RDB_LOCK_WRITE || m_verify_row_debug_checksums ||
        bitmap_is_set(table->read_set, table->field[i]->field_index)) {

bitmap_is_set() returned false with index merge, which made the decoder not added and could not decode keys properly. In the above case, "set session rocksdb_verify_row_debug_checksums=1;" makes the index merge return correct rows..

I confirmed table->read_set bitmap (table->read_set->bitmap[0]) was 7 with index lookup, and 0 with index merge. With index merge, the bitmap was set to 0 in QUICK_RANGE_SELECT::init_ror_merged_scan() (at head->column_bitmaps_set(&column_bitmap, &column_bitmap()).

@yoshinorim
Copy link
Contributor

I confirmed the following change made the index merge return correct rows.

$ git diff
diff --git a/storage/rocksdb/ha_rocksdb.cc b/storage/rocksdb/ha_rocksdb.cc
index 6686f69..a7d817f 100644
--- a/storage/rocksdb/ha_rocksdb.cc
+++ b/storage/rocksdb/ha_rocksdb.cc
@@ -4427,7 +4428,9 @@ void ha_rocksdb::setup_read_decoders() {
       continue;
     }

+    // bitmap is cleared on index merge but it needs to decode columns
     if (m_lock_rows == RDB_LOCK_WRITE || m_verify_row_debug_checksums ||
+        bitmap_is_clear_all(table->read_set) ||
         bitmap_is_set(table->read_set, table->field[i]->field_index)) {
       // We will need to decode this field
       m_decoders_vect.push_back({&m_encoder_arr[i], true, skip_size});

@spetrunia
Copy link
Contributor

More Analysis:

QUICK_RANGE_SELECT::init_ror_merged_scan first calls ha_index_init:

  #0  myrocks::ha_rocksdb::setup_read_decoders
  #1  0x000000000149f18e in myrocks::ha_rocksdb::index_init
  #2  0x0000000000c2da60 in handler::ha_index_init
  #3  0x0000000000fdffe2 in QUICK_RANGE_SELECT::reset
  #4  0x0000000000fca745 in QUICK_RANGE_SELECT::init_ror_merged_scan
  #5  0x0000000000fcb1ea in QUICK_ROR_UNION_SELECT::reset

When the above call is made, QUICK_RANGE_SELECT::column_bitmap has all bits cleared.

The code in init_ror_merged_scan() slightly below sets column_bitmap to only include columns that are part of the used index:

  if (!head->no_keyread)
    head->mark_columns_used_by_index(index);
  head->prepare_for_position();
  head->file= org_file;
  bitmap_copy(&column_bitmap, head->read_set);

(it is also interesting is that it is set to include all columns in the index, even those that are not read by the query)

@spetrunia
Copy link
Contributor

@yoshinorim, as far as I can see your fix will not break anything. Perhaps, it will cause extra columns to be read for queries SELECT COUNT(*) FROM table (without the WHERE), but it's a very degenerate case.

The fix doesn't look nice from an architecture POV, though - there is no reason why SQL layer would call index_init() with table->read_set having all bits cleared (InnoDB apparently doesn't care, but it's still wrong). Making the storage engine reflect the quirks of the SQL layer doesn't look nice

@yoshinorim
Copy link
Contributor

Thanks for review @spetrunia . I'm going to commit the diff to fix the S1, and will consider long term fix and mitigating other perf issues.

@spetrunia
Copy link
Contributor

spetrunia commented May 18, 2017

A possible alternative fix

index d6acf12..e009c02 100644
--- a/sql/opt_range.cc
+++ b/sql/opt_range.cc
@@ -1714,6 +1714,7 @@ end:
   if (!head->no_keyread)
     head->mark_columns_used_by_index(index);
   head->prepare_for_position();
+  head->file->column_bitmaps_signal(); // psergey-new
   head->file= org_file;
   bitmap_copy(&column_bitmap, head->read_set);
 
diff --git a/storage/rocksdb/ha_rocksdb.h b/storage/rocksdb/ha_rocksdb.h
index 8744243..a0fbc24 100644
--- a/storage/rocksdb/ha_rocksdb.h
+++ b/storage/rocksdb/ha_rocksdb.h
@@ -630,6 +630,7 @@ class ha_rocksdb : public my_core::handler {
   /* Setup field_decoders based on type of scan and table->read_set */
   void setup_read_decoders();
 
+  virtual void column_bitmaps_signal() override { setup_read_decoders(); }
   /*
     Number of bytes in on-disk (storage) record format that are used for
     storing SQL NULL flags.

Doesn't cause extra testcase failures for me, but I would still use yours to fix the S1 issue.

facebook-github-bot pushed a commit that referenced this issue May 18, 2017
Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes #626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: f025617
@yoshinorim
Copy link
Contributor

Fix committed. I'm keeping this task opened for future refactoring.

IslamAbdelRahman pushed a commit to IslamAbdelRahman/mysql-5.6 that referenced this issue May 31, 2017
…sults (facebook#624)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626
Github Author: Yoshinori Matsunobu <yoshinori@fb.com>

Github PR Sync: {sync, type="child", parent="github", parentrepo="facebook/mysql-5.6", parentprnum="626", parentprfbid="174177239779441"}

Test Plan: Imported from GitHub, without a `Test Plan:` line.

Reviewers: herman

Reviewed By: herman

Subscribers: herman, webscalesql-eng@fb.com

Differential Revision: https://phabricator.intern.facebook.com/D5088547

Signature: t1:5088547:1495137850:d336321f11952106f380e06cc34f56330b3ee02c
facebook-github-bot pushed a commit that referenced this issue Dec 23, 2019
Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes #626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: 3cc770c
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Aug 12, 2020
…book#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: 3cc770c
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Sep 9, 2020
…book#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: 3cc770c
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Sep 16, 2020
…book#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: 3cc770c
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Oct 5, 2020
…book#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: 3cc770c
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Nov 11, 2020
…book#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Reviewed By: hermanlee

Differential Revision: D5088547

Pulled By: yoshinorim

fbshipit-source-id: 3cc770c
facebook-github-bot pushed a commit that referenced this issue Feb 4, 2021
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```

As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](70f2bcd), and [issue 624](#624) and [PR 626](#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reviewed By: lth

Differential Revision: D25521518

fbshipit-source-id: a46ed3e71fa
facebook-github-bot pushed a commit that referenced this issue Mar 8, 2021
Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes #626

Differential Revision: D5088547 (3b2cc9f)

Pulled By: yoshinorim

fbshipit-source-id: 9f0aee0da20
facebook-github-bot pushed a commit that referenced this issue Apr 22, 2021
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](70f2bcd), and [issue 624](#624) and [PR 626](#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: 44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep pushed a commit to inikep/percona-server that referenced this issue Jun 24, 2021
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep pushed a commit to inikep/percona-server that referenced this issue Jun 24, 2021
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep added a commit to inikep/percona-server that referenced this issue Jun 29, 2021
Fix buffer management issues during online alter

Upstream commit ID : fb-mysql-5.6.35/edce7e9f7518a0a9159a2cf5c63e8e8f340a0481
PS-7731 : Merge percona-202102

Summary:
This fixes two issues related to online alter found by the rocksdb.rqg_runtime test.

1. A buffer overflow would happen occasionally, and this was caused by the fact that if an online alter fails for whatever reason (eg. duplicate key error), then we never restore the old buffer from the old schema, and if the old buffers were larger than the new buffers, you could run into buffer overflow in later queries. The fix is to reallocate the old buffers.
2. Memory allocated for holding blobs was sometimes leaked during online alter. This is because in `Rdb_key_def::unpack_record`, the blob is allocated from the handler on the `TABLE` object passed in, and this is usually usually the correct handler (ie. `table->file` matches the handler object that called the unpack function). However, during online alter, we pass in `altered_table` instead, so the buffer ends up being allocated on `altered_table->file`. That handler object was never opened with `ha_rocksdb::open`, so it never gets closed and the buffer leaks. To fix this, call we free the buffers after online alter.

Reviewed By: Pushapgl

Differential Revision: D26891145

fbshipit-source-id: 2a38e485cd1

Fix a memory leak in "rocksdb_init_internal"

fbshipit-source-id: 07ffe574130

Remove some dead code and unused variables

Upstream commit ID : fb-mysql-5.6.35/99affe6c5677b2dbdc0aaeb31b8e5182c622fb6d
PS-7731 : Merge percona-202102

Summary:
1. dup_ref is unused unless the storage engine supports HA_DUPLICATE_POS
as a table flag. This means that m_pk_tuple is also useless, and is
removed.
2. Remove unused parameter end_key_packed_size on calc_eq_cond_len.
3. Remove unused m_pk_key_parts member.
4. Remove unused is_ascending declaration.
5. Remove unused parameter on read_key_exact
6. The ha_rocksdb::read_range_first override was only useful when
   index_read_map_impl required an explicit end_key parameter. However,
   there is now code to fallback to using end_range, this override is
   no longer useful. This also obviates the distinction between
   index_read_map_impl and index_read_map.
7. Remove unused function remove_rows.

Reviewed By: luqun

Differential Revision: D23300062

fbshipit-source-id: 0220b839690

2x integer key decoding perf

Upstream commit ID : fb-mysql-5.6.35/0bf45d644729f4beed6280b38c55a557fbad5698
PS-7731 : Merge percona-202102

Summary:
In my testing this reduces the overhead of integer decoding by 2x. The idea of the fix is straight-forward:
* Don't call Field* to get information, instead cache as much information as possible in our Rdb_field_packing, similar to InnoDB's row_prebuilt_t. This has a few advantages:
  * Avoid expensive virtual function calls
  * More cache friendly
* Similarly, don't use field->move_field / move_field_offset  to move field to the target buffer and then back - just write to the buffer using pre-calculated field offset directly
* Use template integer unpacking function to enable compiler to unroll expensive loops into a constant number of mov instructions.

NOTE: FDO data needs to be updated for this diff in order to take full advantage of this change.

Reference Patch: facebook/mysql-5.6@9af9aa5
 ---
Porting Notes:

There are some additional changes required to account for covering TEXT/BLOB changes, especially for blob I need to be able to call into ha_rocksdb->get_blob_buffer. I've added a new struct Rdb_unpack_func_context to be able to pass any additional required arguments without having to update all the unpack functions. For now I'm only passing in the table that we can use to get the ha_rocksdb handler.

Reviewed By: Pushapgl

Differential Revision: D25800694

fbshipit-source-id: d5f347d0a98

Faster value decoding

Upstream commit ID : fb-mysql-5.6.35/0e45b20a6a7e205f230314e1c6d1d91b752ac372
PS-7731 : Merge percona-202102

Summary:
Similar to integer key decoding improvement, but for values:
* Remove dependency to field* and cache those information in Rdb_field_encoder
* No more Field::move_field - just write to buffer directly

We were originally planning to use Rdb_protocol_value_decoder for MySQL bypass project to eliminate overhead of decoding to integer then convert to string, but in practice it probably makes little difference and it is not going to be needed when we go with bypass RPC (which is thrift binary protocol). So this changes removes Rdb_protocol_value_decoder completely.

One thing that worth mentioning is I've unified pack_length and pack_length_in_rec because they are essentially identical in our scenarios and in the case they could be different (more packed BITS format that borrows bits from null byte when storage engine supports it) we don't really support it at all, so unifying it to avoid confusion and assert it.

With this MyRocks is now either faster or on par with InnoDB when it comes to SELECT integer columns stored in value.

NOTE:FDO data needs to be updated for this diff in order to take full advantage of this change.

Reference Patch: facebook/mysql-5.6@f4cefba

Reviewed By: Pushapgl

Differential Revision: D25880543

fbshipit-source-id: 19faf536554

Improve count(*) performance and fix index merge bug

Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab

Skip decoding completely during count and improve SK count perf by 20x

Upstream commit ID : fb-mysql-5.6.35/c059dde397f327982bc6abe555ca63756f4b9fa9
PS-7731 : Merge percona-202102

Summary:
8.0 is taking a new approach with doing counts - MySQL will call handler::records / records_with_index which will then use rnd_init+rnd_next / index_init+index_next to iterate through everything. If you try count(*) you'll see MyRocks 8.0 is actually much slower when iterating the table with secondary keys than 5.6 (the optimizer picks secondary key for whatever reason).

This is because MySQL no longer calls `handler::extra(KEY_READ)` so we no longer get the hints to only decode the key - instead we'll see that the SK is not covering so we fall back to PK, which dramatically slows down by *a lot*.

This fixes it by adding a m_iteration_only = true hint -  with this hint we'll just quickly iterate through everything without any decoding nor falling back to PK.

Note even with this fix, if you compare MyRocks 8.0 with InnoDB 8.0, you'll see that InnoDB 8.0 is also much faster - this is because InnoDB overrides the implementation and iterate the records directly in parallel (default to 4). This is something we can do as well.

Reviewed By: Pushapgl

Differential Revision: D26272733

fbshipit-source-id: 53b1ae328f6

Bump rocksdb to 6.19

Upstream commit ID : fb-mysql-5.6.35/a69277edd3e9766aead9634dc09253fdff2df908
PS-7731 : Merge percona-202102

Summary:
This re-introduces Luqun's get_rocksdb_files.sh (which was reverted due to RocksDB perf regression in 6.17) to support building RocksDB 6.17+ and updates RocksDB to 6.19. The regression in 6.17 was reverted in RocksDB 6.18 so we should be able to proceed with new version.

update-submodule: rocksdb

Reviewed By: Pushapgl

Differential Revision: D27471365

fbshipit-source-id: 75a1ff9b4f3

Integrate rocksdb fault injection framework with myrocks

Upstream commit ID : fb-mysql-5.6.35/136b9ecdb2427b7468f832be5ef98a8ed057e2fc
PS-7731 : Merge percona-202102

Summary:
The change integrates rocksdb::FaultInjectionTestFS with MyRocks
code. MyRocks exposes system variable
rocksdb_fault_injection_options that takes in options in json format.
This is first iteration which exposes write error injection. The injection
framework will end up supporting more functionality in the future.

```
rocksdb_fault_injection_options={"retry":true,"onein":1000,"filetypes":["kTableFile","kDescriptorFile"]}
```

Reviewed By: yizhang82

Differential Revision: D27477107

fbshipit-source-id: eff000c3e93

Check for errors during inplace_populate_sk

Upstream commit ID : fb-mysql-5.6.35/c7c590acbcbfd60b1afbbc90c8162d8b5cadbc67
PS-7731 : Merge percona-202102

Summary: In `inplace_populate_sk`, `ha_index_init` can return an error if the query has been killed, and if that happens, it's not safe to continue with calling `index_first`/`index_next`. The fix is to check for errors and return early if needed.

Reviewed By: luqun

Differential Revision: D26960027 (facebook/mysql-5.6@f12eea3)

fbshipit-source-id: eafc511dce4

Return stats from storage engine directly without table open

Upstream commit ID : fb-mysql-5.6.35/18d7f5ba8304c3bf470b8fc27de601eafe467910
PS-7731 : Merge percona-202102

Summary:
In 8.0 query from i_s.tables are *much* slower (~5x) most likely due to table open/close are much slower - profiling data showed that 74% time spent on opening tables and 19.71% time spent on reading histograms, only 2.41% spent on the actual stats inside MyRocks.

8.0 provides a new callback get_table_statistics and get_index_column_cardinality that when "overridden", signal MySQL to go call those functions to fill in stats directly, which should be much faster, with some caveats:
* Our stats code path need to work without table open - this fix updates our stat code path to strictly use Rdb_tbl_def and Rdb_key_def.

One catch is that we won't be able to update the auto_inc value correctly because we won't know if the table actually has AUTO_INCREMENT without opening it, so there are some corner cases the value might be NULL (such as immediately after creating a table with AUTO_INCREMENT). This actually aligns with ROCKSDB_DDL table behavior and InnoDB behavior as well. Once you open the table, the cached Rdb_tbl_def::m_auto_inc_val should be updated properly.

* We also need to make sure the lifetime / locking works properly with regarding to DDL and special cases like drop cf.

Fortunately the table_stats code path does take MDL_EXPLICIT lock with the db_name.table_name, so correctness should be straight-forward. One special case if DROP TABLE is running at the same time, the MDL_EXPLICIT lock may succeed *after* drop table succeed, and Rdbl_ddl_manager::find would return null, and we would give default values in the columns and with a error message saying table cannot be found.

Note this version only overrides get_table_statistics (and it updates SQL layer to treat them separately) and we can add get_index_column_cardinality in the future if there is a need - right now there is little evidence that we are querying information_schema.statistics much (if not at all).

Reviewed By: hermanlee

Differential Revision: D26624028

fbshipit-source-id: 3b41f25d61c

Fix rows_read and queries_range stats in bypass

Upstream commit ID : fb-mysql-5.6.35/50bfa5b89b2240eb84ad81f6c0e90bf8e0cbae60
PS-7767 : Merge percona-202103

Summary:
Bypass updates rows_sent and rows_examined but didn't update rows_read. This fixes that and adds tests. We also get rocksdb_queries_point and rocksdb_queries_range for free as we are using MyRocks API to access data. Note I didn't include rows_examined in the test because querying unrelated information_schema tables will affect rows_examined and it is kinda flaky depending on many factors like perf schema. I did found a double counting issue with rocksdb_queries_range and fixed it as well.

Reference Patch: facebook/mysql-5.6@6106eba

----
Porting Notes: Removed updates to table stats which is not ported to 8.0.

Differential Revision: D25809584

fbshipit-source-id: 4cbb31e5b35

Refactoring index covering check to match MyRocks and add non-covering SK tests and rocksdb_covering_secondary_key stats

Upstream commit ID : fb-mysql-5.6.35/7194dfd7679aa70bda91236d0d6455d5324e615b
PS-7731 : Merge percona-202102

Summary:
When considering when a secondary key is covering, we need to also consider lookup bitmap for some special cases like using covering bitmap w/ prefix keys. This isn't much of an issue with 5.6 today because we don't actually enable covering bitmap format in prod but in 8.0 it might be an issue. Either way this refactors the covering check to be exactly the same as MyRocks does. Also adds more tests for covering / non covering prefix tests.

Also in order to catch cases where we incorrectly consider covering secondary keys to be non-covering, I've added rocksdb_covering_secondary_key_count support into bypass (should've done that earlier) and added checks in secondary range key tests to ensure covering check is done correctly.

Reference Patch: facebook/mysql-5.6@30e9039

----

Porting Notes:
* Fixed an issue that we need to explicitly compare with Rdb_key_def::KEY_COVERED
* The scenario with prefix key is now correctly handled and re-enabled in bypass_select_scenarios.inc

Reviewed By: Pushapgl

Differential Revision: D26566848

fbshipit-source-id: f149f119da4
ldonoso pushed a commit to ldonoso/percona-server that referenced this issue Jul 6, 2021
Upstream commit ID : fb-mysql-5.6.35/25b42b5d4365b6d2deb6bf40da9e776cd2e56698
PS-7780 : Merge fb-prod202101

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```

As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reviewed By: lth

Differential Revision: D25521518

fbshipit-source-id: a46ed3e71fa
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Aug 16, 2021
…book#626) (facebook#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547 (facebook@3b2cc9f)

Pulled By: yoshinorim

fbshipit-source-id: 9f0aee0da20
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Aug 17, 2021
…book#626) (facebook#626)

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547 (facebook@3b2cc9f)

Pulled By: yoshinorim

fbshipit-source-id: 9f0aee0da20
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 13, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
oleksandr-kachan pushed a commit to oleksandr-kachan/percona-server that referenced this issue May 13, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 15, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 15, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 16, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 16, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 17, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 17, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 17, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 21, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 21, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 21, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
oleksandr-kachan pushed a commit to oleksandr-kachan/percona-server that referenced this issue May 24, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
oleksandr-kachan pushed a commit to oleksandr-kachan/percona-server that referenced this issue May 27, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 30, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue May 30, 2024
Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook@70f2bcd), and [issue 624](facebook#624) and [PR 626](facebook#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470
VarunNagaraju pushed a commit to VarunNagaraju/percona-server that referenced this issue May 31, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
VarunNagaraju pushed a commit to VarunNagaraju/percona-server that referenced this issue Jun 5, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 5, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 6, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 6, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 6, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 7, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 10, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 10, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
VarunNagaraju pushed a commit to VarunNagaraju/percona-server that referenced this issue Jun 10, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
VarunNagaraju pushed a commit to VarunNagaraju/percona-server that referenced this issue Jun 12, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
VarunNagaraju pushed a commit to VarunNagaraju/percona-server that referenced this issue Jun 12, 2024
Upstream commit ID : fb-mysql-5.6.35/e025cf1c47e63aada985d78e4083f2e02fba434f
PS-7731 : Merge percona-202102

Summary:
Today in `SELECT count(*)` MyRocks would still decode every single column due to this check, despite the readset being empty:

```
 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);
```
As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in [this commit](facebook/mysql-5.6@70f2bcd), and [issue 624](facebook/mysql-5.6#624) and [PR 626](facebook/mysql-5.6#626). So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

We have a few options to address this:
1. Fix index merge optimizer - looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.
2. Overriding `column_bitmaps_signal` and setup decoders whenever the bitmap changes - however this doesn't work by itself. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution - InnoDB delays it actually initialize its template again in index_read for the 2nd time (relying on `prebuilt->sql_stat_start`), and during index_read `QUICK_RANGE_SELECT::column_bitmap` is already fixed up and the table read/write set is switched to it, so the new template would be built correctly.

In order to make it easier to maintain and port, after discussing with Manuel, I'm going with a simplified version of #3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), and setting the delay flag in index_init / rnd_init / multi_range_read_init.

Also, I ran into a bug with truncation_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def), but it is fine because it is not being used after table open. But my change moves the lookup_bitmap initialization into Rdb_converter which takes a dependency on Rdb_converter::m_tbl_def so now we need to reset it properly.

Reference Patch: facebook/mysql-5.6@44d6a8d

---------
Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. Will send out a better fix that works better with 8.0 new counting infra.

Reviewed By: Pushapgl

Differential Revision: D26265470

fbshipit-source-id: f142be681ab
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 13, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
inikep pushed a commit to inikep/mysql-5.6 that referenced this issue Jun 14, 2024
…book#626) [non-RocksDB part]

Summary:
This diff fixed MyRocks returning missing rows if index
merge query plan was used. When index merge is selected,
table->read_map is cleared at
QUICK_RANGE_SELECT::init_ror_merged_scan(). Then
ha_rocksdb::setup_read_decoders() wrongly decided not to decode
all fields, which made MySQL layer decide that keys did not match.
This diff changes ha_rocksdb::setup_read_decoders() to
always decode if read_map is cleared. A side effect is decoding
all fetched fields on index merge. There is a slight performance
penalty but much better than returning wrong results.

This diff reuses index_merge_ror.inc and index_merge_2sweeps.inc
test cases for MyRocks. Since MyRocks query plan is less stable than
InnoDB, it skips using explain and just verifies data correctness.
Closes facebook#626

Differential Revision: D5088547

Pulled By: yoshinorim
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants