Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip #12653

Merged
merged 2 commits into from
Oct 21, 2023

Conversation

shubhamvishu
Copy link
Contributor

@shubhamvishu shubhamvishu commented Oct 11, 2023

Description

While going through MultiLevelSkipListWriter I happened to see we always calculate the numLevels(number of skip levels) here for a particular doc frequency df. Since we know the skipInterval and skipMultiplier(in the cx) upfront we could those to replace arithmetic operations of dividing df by skipInterval and then (df % skipMultiplier) == 0) with a single modulo operation for cases where numLevels is supposed to be 1 (which is more often?) i.e. the only df(doc frequency) which could have numLevels > 1 must suffice the check (df % windowLength == 0) where windowLength is skipInterval * skipMultiplier i.e. Length of the window at which the skips are placed on skip level 1.

 

* Example for skipInterval = 3:
 *                                                     c            (skip level 2)
 *                 c                 c                 c            (skip level 1)
 *     x     x     x     x     x     x     x     x     x     x      (skip level 0)
 * d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
 *     3     6     9     12    15    18    21    24    27    30     (df)
 *
 * d - document
 * x - skip data
 * c - skip data with child pointer

Here windowLength = skipInterval * skipMultiplier = 3 x 3 = 9 (interval at which skips are placed on skip level 1). So for df=6,12,18,24,30... we would perform a single modulo operation to quickly evaluate numLevels which would be 1 in those cases.

NOTE - This would be more helpful as we increase the skipMultiplier. Eg : for skipInterval=BLOCK_SIZE=128 and skipMultiplier=8 we would early evaluate numLevels to 1 for 6 doc frequencies out of 8 when buffering the skip data.

@@ -63,6 +63,9 @@ public abstract class MultiLevelSkipListWriter {
/** for every skip level a different buffer is used */
private ByteBuffersDataOutput[] skipBuffer;

/** Length of the window at which the skips are placed on skip level 1 */
private final long windowLength;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we could make it int here considering the values of skipInterval(BLOCK_SIZE=128?) and skipMultiplier(8?) but I have kept it long to consider any unknown cases of overflowing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to use int here, and then use Math.toIntExact to cast the (long) multiplied value back down to int. A single postings list is at most Integer.MAX_VALUE (actually a bit less than this) documents since a single Lucene index can hold at most that many documents, and all docids in a postings list are unique.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shubhamvishu -- Lucene's skipping implementation is very, very old (more than a decade?) and could badly use some love. I'm happy you are poking around in it!

if (df > skipInterval) {
// also make sure it does not exceed maxSkipLevels
numberOfSkipLevels =
Math.min(1 + MathUtil.log(df / skipInterval, skipMultiplier), maxSkipLevels);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move the numberOfSkiLevels = 1 onto an else clause here?

Copy link
Contributor Author

@shubhamvishu shubhamvishu Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure...btw do you find thats more readable or is there any specific reason I'm missing on(just curious)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there is no hard standard in Lucene or anything (that I am aware of). I just generally prefer "write once" to variables like this (write in the if, write in the else) instead of always writing a value, and then sometimes overwriting it. I do feel it's more readable? In the first way, when I glance at the code, it looks at first like numberOfSkipLevels is always set to 1, and I might miss (on first glance) the if that then overwrites it with a new values? It also makes final possible. (Hmm in your previous approach this instance variable was also final? And javac did not complain that it was being assigned twice? Curious...).

Copy link
Contributor Author

@shubhamvishu shubhamvishu Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting points! I agree it makes it easy to read at first galnce.

It also makes final possible. (Hmm in your previous approach this instance variable was also final? And javac did not complain that it was being assigned twice? Curious...).

Actually it was not the instance variable that you are thinking of and rather was a local variable with duplicate name(i.e. numberOfSkipLevels) which was totally unnecessary here so I cleaned that up as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it was not the instance variable that you are thinking of and rather was a local variable with duplicate name(i.e. numberOfSkipLevels) which was totally unnecessary here so I cleaned that up as well.

Aha! That explains my confusion. Such shadowing (x and this.x being different) is so confusing/dangerous. Thanks for cleaning it up!


// determine max level
while ((df % skipMultiplier) == 0 && numLevels < numberOfSkipLevels) {
if (df % windowLength == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this optimizes for the common case when numLevels will be 1, right? It does a single modulo check to catch that case, and only if numLevels will be > 1 does it fall into the while loop case.

Maybe add some comments explaining this? Perhaps even the beautiful ascii art you put in the opening description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly! I'll add a comment to mention this. The ascii art is taken from this class javadocs(top of this file) itself.

@shubhamvishu
Copy link
Contributor Author

Thanks for the review @mikemccand ! I have addressed the comments in the new revision.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shubhamvishu -- looks good. Could you add a CHANGES entry too, under the 9.9 Optimizations section?

@shubhamvishu
Copy link
Contributor Author

@mikemccand I have added a CHANGES entry to 9.9. Thanks!

@mikemccand
Copy link
Member

Thanks @shubhamvishu -- looks great! I plan to merge later today.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shubhamvishu!

@mikemccand mikemccand merged commit de8ae1d into apache:main Oct 21, 2023
4 checks passed
mikemccand pushed a commit that referenced this pull request Oct 21, 2023
…rSkip (#12653)

* Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip

* CHANGES.txt entry
@mikemccand
Copy link
Member

I merged to main and 9.x (9.9)! Thanks @shubhamvishu.

@mikemccand mikemccand added this to the 9.9.0 milestone Oct 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants