Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Improve count(*) performance and fix index merge bug #1155

Closed
wants to merge 1 commit into from

Conversation

yizhang82
Copy link
Contributor

@yizhang82 yizhang82 commented Jan 15, 2021

(Sending this in github to get broader feedback)

Today in SELECT count(*) MyRocks would still decode every single column due to this check, despite the readset being empty:

 // bitmap is cleared on index merge, but it still needs to decode columns
    bool field_requested =
        decode_all_fields || m_verify_row_debug_checksums ||
        bitmap_is_clear_all(field_map) ||
        bitmap_is_set(field_map, m_table->field[i]->field_index);

As a result MyRocks is significantly slower than InnoDB in this particular scenario.

Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in this commit, and issue 624 and PR 626. So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).

Looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call mark_columns_used_by_index_no_reset(s->primary_key, read_set) if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.

Note that InnoDB doesn't have this problem, because it actually initialize its template again in index_read for the 2nd time (relying on prebuilt->sql_stat_start), and during index_read QUICK_RANGE_SELECT::column_bitmap is already fixed up and the table read/write set is switched to it, so the new template would be built correctly. In theory we could also use similar technique, but I decided to go with a change that has smaller scope limiting to index merges, and is the better long term fix anyway.

With this MyRocks is on par with InnoDB with count(*) scenario in the scenario we tested, roughly 40% faster than before the fix.

Another alternative solution is to override handler::column_bitmaps_signal and recreate the MyRocks encoders there, however doing so revealed some additional issues. Because no storage engine today actually use handler::column_bitmaps_signal this path haven't been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan should call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).

Given that column_bitmaps_signal is not being used by any storage engines (in the tree) today so it is not well tested, I think it is better to stick to the more established pattern to make sure init functions see the correct read/write set.

@yizhang82
Copy link
Contributor Author

/cc @yoshinorim @lth

Copy link
Contributor

@spetrunia spetrunia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

The general idea looks good, but I have some comments on the specific details of the patch. Added them to this review.

/* We don't have to set 'head->keyread' here as the 'file' is unique */
if (!head->no_keyread)
head->mark_columns_used_by_index(index);
head->prepare_for_position();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really get here when head->no_keyread==true? (As far I remember, this happens when the table is updated, and index_merge is not used in such cases. If one is not allowed to use index-only reads, then there is no point to use index_merge, because you're going to read the same full row using multiple indexes...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is old code that I've refactored into this function, so I'm not sure exactly when it is required. Looking at the history, this commit added it in 2006 but unfortunately had little information. Interestingly, I did find this MariaDB bug report that seems to indicate that this scenario could be triggered by DELETE tbl FROM tbl WHERE key1=const1 AND key2=const2; but the scenario in question doesn't seem to actually work and we might need to call mark_columns_used_by_index regardless, based on your investigation in the bug back in 2012. If we want to change it I think it should be a separate diff with a test case similar to what's in mariadb bug.

@@ -1671,11 +1691,19 @@ int QUICK_RANGE_SELECT::init_ror_merged_scan(bool reuse_handler)
if (reuse_handler)
{
DBUG_PRINT("info", ("Reusing handler %p", file));

head->column_bitmaps_set_no_signal(&column_bitmap, &column_bitmap);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something, but this call looks redundant to me: mark_columns_used_by_index() calls TABLE::mark_columns_used_by_index which makes this call

  column_bitmaps_set(bitmap, bitmap);

where bitmap points to TABLE::tmp_set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is old code so I'm not 100% positive - based on my reading of the code it seems to be related to the earlier question regarding no_keyread - if no_keyread is true, then prepare_position call below might update column_bitmap (not tmp, if mark_columns_used_by_index isn't called):

  /* We don't have to set 'head->keyread' here as the 'file' is unique */
  if (!head->no_keyread)
    head->mark_columns_used_by_index(index);
  head->prepare_for_positon();
void TABLE::prepare_for_position()
{
  DBUG_ENTER("TABLE::prepare_for_position");

  if ((file->ha_table_flags() & HA_PRIMARY_KEY_REQUIRED_FOR_POSITION) &&
      s->primary_key < MAX_KEY)
  {
    mark_columns_used_by_index_no_reset(s->primary_key, read_set);
    /* signal change */
    file->column_bitmaps_signal();
  }
  DBUG_VOID_RETURN;
}

The whole thing seems rather needlessly convoluted IMO. I think the better option here is to call mark_columns_used_by_index_no_reset that modifies column_bitmap directly without relying on tmp_set, and it would also get rid of bitmap_copy(&column_bitmap, head->read_set); since all changes would've been made in column_bitmap directly, but there might be some subtle side effects that I may have missed so for now my preference is to keep it as-is (and also make future merges a bit easier when we port this to 8.0).

We are only going to read key fields and call position() on 'file'
The following sets head->tmp_set to only use this key and then updates
head->read_set and head->write_set to use this bitmap.
The now bitmap is stored in 'column_bitmap' which is used in ::get_next()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, s/The now/The new/ ?

@yizhang82
Copy link
Contributor Author

hi @spetrunia, thanks for the review/comments! This patch is mostly intended to fix the index merge issue by moving the read/write set manipulation to before init, and tries to avoid touching existing index merge implementation as much as possible. I think your comments are mostly for existing code in 5.6 (that was simply moved around / refactored into its own function to be reused in both reuse_handler = true/false in my patch) which I didn't plan to change. But I'll try to add my thoughts anyway.

@george-lorch
Copy link

We just had a recent report of something similar against 8.0 here https://jira.percona.com/browse/PS-7500. W.r.t. this patch, it really would be ideal if it could be limited to MyRocks with no change to the SQL layer, that makes it much more difficult for us (Percona) to justify merging it. We would be glad to help out or potentially send back our engine only patch if we arrive at a similar conclusion.

@yizhang82
Copy link
Contributor Author

hi @georgelorchpercona I agree it would be ideal if we can limit this to MyRocks engine. I tried two approaches:

  • Overriding columns_bitmap_signal. As previously mentioned it didn't work and require further changes in init_ror_merged_scan, and as no engine currently is using it so it's not tested and have bugs
  • Following innodb practice. The biggest reason InnoDB worked is because in first ha_innobase::index_read call, InnoDB will reinitialize its prebuilt template with the correct column_bitmap
	if (prebuilt->sql_stat_start) {
		build_template(false);

I looked at how InnoDB sets sql_stat_start = TRUE/FALSE and basically it will reconstruct templates in index reads until a first read/index positioning completes, though the details seems a bit convoluted. I'm trying to see if I can replicate the same logic in MyRocks - it is definitely doable but getting it exactly right will take some time and can be error prone (mostly resulting in performance regressions if we are reinitializing decoders unnecessarily in a missed path and hard to catch). On the other hand, fixing in SQL layer is mostly refactoring and is lower risk / easier to validate in this particular case.

Thoughts?

@lth
Copy link
Contributor

lth commented Jan 19, 2021

I wonder if we can do something simpler than what InnoDB does, since I think sql_stat_start does a lot more than what we need it to do. We could:

  1. Delay initializing calling setup_field_decoders until the first read, or
  2. Store what read_set the decoders were initialized with, and reinitialize them if the read_set does not match anymore at index_read.

At least for InnoDB, it seems like they are paying the cost of calling build_template twice in a query (in index_init and in index_read). We can probably be a bit smarter.

@yizhang82
Copy link
Contributor Author

Discussed with @lth offilne and I like the idea simplifying to just delaying building decoders until first index / rnd operation, setting the delayed flag in rnd_init / index_init. Technically it's still kinda working around the bug in optimizer but if we can limit this patch to MyRocks only there is definitely lot of value there like @georgelorchpercona mentioned. So far this seems quite promising and straight-forward enough. thanks @spetrunia for the review/comments as well

@yizhang82
Copy link
Contributor Author

This is the new MyRocks only version: 25b42b5

@yizhang82 yizhang82 closed this Feb 4, 2021
@yizhang82
Copy link
Contributor Author

btw, the patch above works for 5.6 but for 8.0 the count code has moved into records / records_from_index which doesn't call extra(KEY_READ) by default so we'll need some changes to account for that - otherwise for secondary index the reads are really bad (and for some reason MySQL likes to read from SK in 5.6 and 8.0, for whatever reason). Primary key shouldn't be impacted. My 8.0 port (coming soon) should address this - we can do better by adding some additional hints to skip decoding completely as well so we'd only need the iteration.

Also innodb has parallel scanning with default to 4 threads so it'll be faster as well. But that's probably a feature for another day.

@yizhang82
Copy link
Contributor Author

It turns out MyRocks 8.0 when doing count(*) with secondary key full table scan, it would actually read the PK as well (and due to this issue it would also unpack the entire PK + value, making it even worse). This is a bug that have something to do with new counting overrides handler::record / handler::record_with_index. I have a patch that fixes this for 8.0 and should be able to get it in next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants