-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip #12653
Conversation
13536b0
to
6f4749d
Compare
@@ -63,6 +63,9 @@ public abstract class MultiLevelSkipListWriter { | |||
/** for every skip level a different buffer is used */ | |||
private ByteBuffersDataOutput[] skipBuffer; | |||
|
|||
/** Length of the window at which the skips are placed on skip level 1 */ | |||
private final long windowLength; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we could make it int
here considering the values of skipInterval
(BLOCK_SIZE=128?) and skipMultiplier
(8?) but I have kept it long
to consider any unknown cases of overflowing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to use int
here, and then use Math.toIntExact
to cast the (long) multiplied value back down to int. A single postings list is at most Integer.MAX_VALUE
(actually a bit less than this) documents since a single Lucene index can hold at most that many documents, and all docids in a postings list are unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
6f4749d
to
ad80835
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shubhamvishu -- Lucene's skipping implementation is very, very old (more than a decade?) and could badly use some love. I'm happy you are poking around in it!
if (df > skipInterval) { | ||
// also make sure it does not exceed maxSkipLevels | ||
numberOfSkipLevels = | ||
Math.min(1 + MathUtil.log(df / skipInterval, skipMultiplier), maxSkipLevels); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move the numberOfSkiLevels = 1
onto an else
clause here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure...btw do you find thats more readable or is there any specific reason I'm missing on(just curious)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there is no hard standard in Lucene or anything (that I am aware of). I just generally prefer "write once" to variables like this (write in the if, write in the else) instead of always writing a value, and then sometimes overwriting it. I do feel it's more readable? In the first way, when I glance at the code, it looks at first like numberOfSkipLevels
is always set to 1
, and I might miss (on first glance) the if
that then overwrites it with a new values? It also makes final
possible. (Hmm in your previous approach this instance variable was also final? And javac
did not complain that it was being assigned twice? Curious...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting points! I agree it makes it easy to read at first galnce.
It also makes final possible. (Hmm in your previous approach this instance variable was also final? And javac did not complain that it was being assigned twice? Curious...).
Actually it was not the instance variable that you are thinking of and rather was a local variable with duplicate name(i.e. numberOfSkipLevels
) which was totally unnecessary here so I cleaned that up as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it was not the instance variable that you are thinking of and rather was a local variable with duplicate name(i.e.
numberOfSkipLevels
) which was totally unnecessary here so I cleaned that up as well.
Aha! That explains my confusion. Such shadowing (x
and this.x
being different) is so confusing/dangerous. Thanks for cleaning it up!
|
||
// determine max level | ||
while ((df % skipMultiplier) == 0 && numLevels < numberOfSkipLevels) { | ||
if (df % windowLength == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this optimizes for the common case when numLevels
will be 1
, right? It does a single modulo check to catch that case, and only if numLevels
will be > 1 does it fall into the while
loop case.
Maybe add some comments explaining this? Perhaps even the beautiful ascii art you put in the opening description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly! I'll add a comment to mention this. The ascii art is taken from this class javadocs(top of this file) itself.
ad80835
to
a126fc5
Compare
Thanks for the review @mikemccand ! I have addressed the comments in the new revision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shubhamvishu -- looks good. Could you add a CHANGES
entry too, under the 9.9
Optimizations section?
@mikemccand I have added a |
Thanks @shubhamvishu -- looks great! I plan to merge later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shubhamvishu!
…rSkip (#12653) * Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip * CHANGES.txt entry
I merged to |
Description
While going through MultiLevelSkipListWriter I happened to see we always calculate the
numLevels
(number of skip levels) here for a particular doc frequencydf
. Since we know theskipInterval
andskipMultiplier
(in the cx) upfront we could those to replace arithmetic operations of dividingdf
byskipInterval
and then(df % skipMultiplier) == 0
) with a single modulo operation for cases wherenumLevels
is supposed to be 1 (which is more often?) i.e. the onlydf
(doc frequency) which could havenumLevels > 1
must suffice the check (df % windowLength == 0
) where windowLength isskipInterval * skipMultiplier
i.e. Length of the window at which the skips are placed on skip level 1.Here
windowLength = skipInterval * skipMultiplier = 3 x 3 = 9
(interval at which skips are placed on skip level 1). So fordf=6,12,18,24,30...
we would perform a single modulo operation to quickly evaluatenumLevels
which would be1
in those cases.NOTE - This would be more helpful as we increase the
skipMultiplier
. Eg : forskipInterval=BLOCK_SIZE=128
andskipMultiplier=8
we would early evaluatenumLevels
to1
for 6 doc frequencies out of 8 when buffering the skip data.