-
Notifications
You must be signed in to change notification settings - Fork 712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Improve count(*) performance and fix index merge bug #1155
Conversation
/cc @yoshinorim @lth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello,
The general idea looks good, but I have some comments on the specific details of the patch. Added them to this review.
/* We don't have to set 'head->keyread' here as the 'file' is unique */ | ||
if (!head->no_keyread) | ||
head->mark_columns_used_by_index(index); | ||
head->prepare_for_position(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really get here when head->no_keyread==true? (As far I remember, this happens when the table is updated, and index_merge is not used in such cases. If one is not allowed to use index-only reads, then there is no point to use index_merge, because you're going to read the same full row using multiple indexes...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is old code that I've refactored into this function, so I'm not sure exactly when it is required. Looking at the history, this commit added it in 2006 but unfortunately had little information. Interestingly, I did find this MariaDB bug report that seems to indicate that this scenario could be triggered by DELETE tbl FROM tbl WHERE key1=const1 AND key2=const2;
but the scenario in question doesn't seem to actually work and we might need to call mark_columns_used_by_index
regardless, based on your investigation in the bug back in 2012. If we want to change it I think it should be a separate diff with a test case similar to what's in mariadb bug.
@@ -1671,11 +1691,19 @@ int QUICK_RANGE_SELECT::init_ror_merged_scan(bool reuse_handler) | |||
if (reuse_handler) | |||
{ | |||
DBUG_PRINT("info", ("Reusing handler %p", file)); | |||
|
|||
head->column_bitmaps_set_no_signal(&column_bitmap, &column_bitmap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, but this call looks redundant to me: mark_columns_used_by_index()
calls TABLE::mark_columns_used_by_index
which makes this call
column_bitmaps_set(bitmap, bitmap);
where bitmap
points to TABLE::tmp_set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is old code so I'm not 100% positive - based on my reading of the code it seems to be related to the earlier question regarding no_keyread
- if no_keyread is true, then prepare_position
call below might update column_bitmap (not tmp, if mark_columns_used_by_index
isn't called):
/* We don't have to set 'head->keyread' here as the 'file' is unique */
if (!head->no_keyread)
head->mark_columns_used_by_index(index);
head->prepare_for_positon();
void TABLE::prepare_for_position()
{
DBUG_ENTER("TABLE::prepare_for_position");
if ((file->ha_table_flags() & HA_PRIMARY_KEY_REQUIRED_FOR_POSITION) &&
s->primary_key < MAX_KEY)
{
mark_columns_used_by_index_no_reset(s->primary_key, read_set);
/* signal change */
file->column_bitmaps_signal();
}
DBUG_VOID_RETURN;
}
The whole thing seems rather needlessly convoluted IMO. I think the better option here is to call mark_columns_used_by_index_no_reset
that modifies column_bitmap
directly without relying on tmp_set
, and it would also get rid of bitmap_copy(&column_bitmap, head->read_set);
since all changes would've been made in column_bitmap
directly, but there might be some subtle side effects that I may have missed so for now my preference is to keep it as-is (and also make future merges a bit easier when we port this to 8.0).
We are only going to read key fields and call position() on 'file' | ||
The following sets head->tmp_set to only use this key and then updates | ||
head->read_set and head->write_set to use this bitmap. | ||
The now bitmap is stored in 'column_bitmap' which is used in ::get_next() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo, s/The now/The new/ ?
hi @spetrunia, thanks for the review/comments! This patch is mostly intended to fix the index merge issue by moving the read/write set manipulation to before init, and tries to avoid touching existing index merge implementation as much as possible. I think your comments are mostly for existing code in 5.6 (that was simply moved around / refactored into its own function to be reused in both reuse_handler = true/false in my patch) which I didn't plan to change. But I'll try to add my thoughts anyway. |
We just had a recent report of something similar against 8.0 here https://jira.percona.com/browse/PS-7500. W.r.t. this patch, it really would be ideal if it could be limited to MyRocks with no change to the SQL layer, that makes it much more difficult for us (Percona) to justify merging it. We would be glad to help out or potentially send back our engine only patch if we arrive at a similar conclusion. |
hi @georgelorchpercona I agree it would be ideal if we can limit this to MyRocks engine. I tried two approaches:
I looked at how InnoDB sets sql_stat_start = TRUE/FALSE and basically it will reconstruct templates in index reads until a first read/index positioning completes, though the details seems a bit convoluted. I'm trying to see if I can replicate the same logic in MyRocks - it is definitely doable but getting it exactly right will take some time and can be error prone (mostly resulting in performance regressions if we are reinitializing decoders unnecessarily in a missed path and hard to catch). On the other hand, fixing in SQL layer is mostly refactoring and is lower risk / easier to validate in this particular case. Thoughts? |
I wonder if we can do something simpler than what InnoDB does, since I think
At least for InnoDB, it seems like they are paying the cost of calling |
Discussed with @lth offilne and I like the idea simplifying to just delaying building decoders until first index / rnd operation, setting the delayed flag in rnd_init / index_init. Technically it's still kinda working around the bug in optimizer but if we can limit this patch to MyRocks only there is definitely lot of value there like @georgelorchpercona mentioned. So far this seems quite promising and straight-forward enough. thanks @spetrunia for the review/comments as well |
This is the new MyRocks only version: 25b42b5 |
btw, the patch above works for 5.6 but for 8.0 the count code has moved into records / records_from_index which doesn't call extra(KEY_READ) by default so we'll need some changes to account for that - otherwise for secondary index the reads are really bad (and for some reason MySQL likes to read from SK in 5.6 and 8.0, for whatever reason). Primary key shouldn't be impacted. My 8.0 port (coming soon) should address this - we can do better by adding some additional hints to skip decoding completely as well so we'd only need the iteration. Also innodb has parallel scanning with default to 4 threads so it'll be faster as well. But that's probably a feature for another day. |
It turns out MyRocks 8.0 when doing count(*) with secondary key full table scan, it would actually read the PK as well (and due to this issue it would also unpack the entire PK + value, making it even worse). This is a bug that have something to do with new counting overrides |
(Sending this in github to get broader feedback)
Today in
SELECT count(*)
MyRocks would still decode every single column due to this check, despite the readset being empty:As a result MyRocks is significantly slower than InnoDB in this particular scenario.
Turns out in index merge, when it tries to reset, it calls ha_index_init with an empty column_bitmap, so our field decoders didn't know it needs to decode anything, so the entire query would return nothing. This is discussed in this commit, and issue 624 and PR 626. So the workaround we had at that time is to simply treat empty map as implicitly everything, and the side effect is massively slowed down count(*).
Looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but after init/reset, so the fix would simply be moving the bitmap set code up. For secondary keys, prepare_for_position will automatically call
mark_columns_used_by_index_no_reset(s->primary_key, read_set)
if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would know correctly that we need to unpack PK when walking SK during index merge.Note that InnoDB doesn't have this problem, because it actually initialize its template again in index_read for the 2nd time (relying on
prebuilt->sql_stat_start
), and during index_readQUICK_RANGE_SELECT::column_bitmap
is already fixed up and the table read/write set is switched to it, so the new template would be built correctly. In theory we could also use similar technique, but I decided to go with a change that has smaller scope limiting to index merges, and is the better long term fix anyway.With this MyRocks is on par with InnoDB with count(*) scenario in the scenario we tested, roughly 40% faster than before the fix.
Another alternative solution is to override
handler::column_bitmaps_signal
and recreate the MyRocks encoders there, however doing so revealed some additional issues. Because no storage engine today actually usehandler::column_bitmaps_signal
this path haven't been tested properly in index merge. In this case,QUICK_RANGE_SELECT::init_ror_merged_scan
should callset_column_bitmaps_no_signal
to avoid resetting the correct read/write set of head since head is used as first handler (reuses_handler=true) and subsequent place holders for read/write set updates (reuse_handler=false).Given that
column_bitmaps_signal
is not being used by any storage engines (in the tree) today so it is not well tested, I think it is better to stick to the more established pattern to make sure init functions see the correct read/write set.