Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167) #12555
Extended existing nightly random tests to catch the issue most of the time. Would that be enough, or do we need a test that catches it every single time?
It would be great if we could have a unit test that consistently reproduces the previous failure. If my analysis is correct, then we probably just need to reproduce it with
te.seekExact(0);
te.seekCeil(new BytesRef());
te.next();
@@ -1205,7 +1205,15 @@ public SeekStatus seekCeil(BytesRef text) throws IOException {
ord = 0;
return SeekStatus.END;
} else {
seekExact(0L);
// seekBlock doesn't update ord and it repositions bytes when calls getFirstTermFromBlock
So, I think the root cause might be that previously, when we called seekExact(0), this if condition was not true, and thus bytes and ord were not reset correctly.
if (ord < this.ord || blockIndex != currentBlockIndex) {
// The looked up ord is before the current ord or belongs to a different block, seek again
final long blockAddress = blockAddresses.get(blockIndex);
bytes.seek(blockAddress);
this.ord = (blockIndex << TERMS_DICT_BLOCK_LZ4_SHIFT) - 1;
}
So I would prefer we do
ord = 1; // Probably need to add some comments on why we do this weird thing
seekExact(0L);
Instead of doing it manually here in case we're changing the seek behavior in the future.
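A small model can make the guard's behaviour concrete. The sketch below is not Lucene code: the class name `SeekGuardSketch`, the helper `needsReseek`, and the shift value of 6 are assumptions made for illustration, and only the condition itself is taken from the snippet quoted above.

```java
// Toy model of the re-seek guard quoted above; not the Lucene implementation.
final class SeekGuardSketch {
  // Assumed block shift for this sketch: 2^6 = 64 terms per block.
  static final int BLOCK_SHIFT = 6;

  // Returns true when the guard would re-seek bytes and reset ord.
  static boolean needsReseek(long currentOrd, long targetOrd) {
    // Signed shift, so an unpositioned enum (ord == -1) maps to block -1.
    long currentBlockIndex = currentOrd >> BLOCK_SHIFT;
    long blockIndex = targetOrd >> BLOCK_SHIFT;
    return targetOrd < currentOrd || blockIndex != currentBlockIndex;
  }

  public static void main(String[] args) {
    System.out.println(needsReseek(0, 0));  // false: same ord, same block, nothing is reset
    System.out.println(needsReseek(1, 0));  // true: target is before the current ord
    System.out.println(needsReseek(-1, 0)); // true: unpositioned enum is in "block -1"
  }
}
```

Under this model, a call to seekExact(0) while ord is already 0 skips the reset entirely, which is why setting ord = 1 first would force it.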
in case we're changing the seek behavior in the future.
This is the reason why I think we should not rely on seekExact. For example, if we do:
ord = 1;
seekExact(0L);
and in the future we optimize seekExact to not reset bytes, this can break again, because seekExact relies on TermsDict being in a consistent state. Just to give an example of a potential optimization: both ords 0 and 1 are in the same block, so when we seek from 1 to 0 we could try using data from the current block without re-reading data from bytes. Even if this particular idea doesn't work, I think that the public methods of this class rely on the data being in a consistent state, so we should probably not rely on them to fix the state.
That makes sense. Can we instead refactor this code, together with the code below, into a method seekBlock(int)? Essentially these 4 lines:
final long blockAddress = blockAddresses.get(block);
this.ord = block << TERMS_DICT_BLOCK_LZ4_SHIFT;
bytes.seek(blockAddress);
decompressBlock();
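The point of such a helper is that ord and the byte position always move together. The following is a minimal sketch of that invariant, not the actual patch: `SeekBlockSketch`, `moveBytesOnly`, `isConsistent`, the plain `long[]` of addresses, and the shift value of 6 are all assumptions, and decompression is left out.

```java
// Toy model of a terms dictionary cursor; not the Lucene implementation.
final class SeekBlockSketch {
  // Assumed block shift for this sketch: 2^6 = 64 terms per block.
  static final int BLOCK_SHIFT = 6;

  private final long[] blockAddresses; // start address of each block
  long ord = -1;      // -1 means "not positioned"
  long bytesPos = -1; // stand-in for the file pointer of bytes

  SeekBlockSketch(long[] blockAddresses) {
    this.blockAddresses = blockAddresses.clone();
  }

  // Moves only the byte position and leaves ord stale (the risky pattern).
  void moveBytesOnly(int block) {
    bytesPos = blockAddresses[block];
  }

  // Positions ord and bytes together on the first term of the block.
  // Rejecting out-of-range blocks is a choice made for this sketch only.
  void seekBlock(int block) {
    if (block < 0 || block >= blockAddresses.length) {
      throw new IndexOutOfBoundsException("block " + block);
    }
    ord = (long) block << BLOCK_SHIFT;
    bytesPos = blockAddresses[block];
  }

  // True when bytes sits at the start of the block that ord belongs to.
  boolean isConsistent() {
    return ord >= 0 && bytesPos == blockAddresses[(int) (ord >> BLOCK_SHIFT)];
  }

  public static void main(String[] args) {
    SeekBlockSketch dict = new SeekBlockSketch(new long[] {0L, 100L, 250L});
    dict.seekBlock(0);
    dict.moveBytesOnly(2);
    System.out.println(dict.isConsistent()); // false: ord is in block 0, bytes in block 2
    dict.seekBlock(2);
    System.out.println(dict.isConsistent()); // true: both now point at block 2
  }
}
```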
Good idea, thanks for the suggestion! We still need to handle block < 0, but it looks better now and seekBlock leaves the terms dict in a consistent state.
Actually I just tried it myself and this will always reproduce the error:
What happened is really tricky:
// Testing termsEnum seekCeil edge case, where inconsistent internal state led to
// IndexOutOfBoundsException
// see https://github.com/apache/lucene/pull/12555 for details
public void testTermsEnumConsistency() throws IOException {
Thanks for making the unit test!
Also, please add an entry to CHANGES.txt :)
LGTM, thanks!
Merged and backported.
@zhaih thank you for reviewing and merging!
Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)
TermsDict ord and bytes can be out of sync after a call to seekCeil. This can lead to a state where ord indicates the beginning of the next block ((ord & blockMask) == 0L), but bytes is positioned to read the compressed part of the block.
The test failure in #12167 is triggered when, after seekCeil, we call next, which needs to decompressBlock. We read a vInt into term.length, where in fact this vInt refers to the compressed block length. This length is greater than the max term length of the block, which leads to IndexOutOfBoundsException.
TODO: write a test
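The failure mode described above can be imitated with a few lines of plain Java. This is only a sketch under stated assumptions: the block layout (a length-prefixed first term followed by a length-prefixed payload standing in for the compressed data), the class `MisreadVIntSketch`, and all method names are invented for illustration and do not mirror the real file format or Lucene's API.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Toy illustration of reading the wrong vInt as a term length; not Lucene code.
final class MisreadVIntSketch {
  // Writes a variable-length int: 7 bits per byte, high bit set means more bytes follow.
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  // Reads a vInt at pos[0] and advances pos[0] past it.
  static int readVInt(byte[] data, int[] pos) {
    int result = 0;
    int shift = 0;
    byte b;
    do {
      b = data[pos[0]++];
      result |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }

  // Reads a length, then copies that many bytes into a buffer sized for the longest term.
  static byte[] readFirstTerm(byte[] data, int[] pos, int maxTermLength) {
    byte[] term = new byte[maxTermLength];
    int length = readVInt(data, pos);
    // Throws an IndexOutOfBoundsException when length exceeds the term buffer.
    System.arraycopy(data, pos[0], term, 0, length);
    pos[0] += length;
    return Arrays.copyOf(term, length);
  }

  // Assumed layout: vInt(term length), term bytes, vInt(payload length), payload bytes.
  static byte[] buildBlock(byte[] firstTerm, int payloadLength) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, firstTerm.length);
    out.write(firstTerm, 0, firstTerm.length);
    writeVInt(out, payloadLength);
    out.write(new byte[payloadLength], 0, payloadLength);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] block = buildBlock(new byte[] {'a', 'b', 'c'}, 300);
    int[] pos = {0};
    readFirstTerm(block, pos, 16); // correct position: reads the 3-byte term
    try {
      readFirstTerm(block, pos, 16); // stale position: 300 is taken as a term length
    } catch (IndexOutOfBoundsException e) {
      System.out.println("misread payload length as term length");
    }
  }
}
```

The second read starts at the payload-length vInt instead of at a block start, so a value meant to describe the payload overflows the term buffer.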