Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167) #12555

Merged
merged 2 commits into from
Sep 16, 2023

Conversation

epotyom
Copy link
Contributor

@epotyom epotyom commented Sep 14, 2023

Fix: Lucene90DocValuesProducer.TermsDict.seekCeil doesn't always position bytes correctly (#12167)

TermsDict ord and bytes can be out of sync after a call to seekCeil. It can lead to a state where ord indicates beginning of the next block ((ord & blockMask) == 0L), but bytes is positioned to read the compressed part of the block.

Test failure in #12167 is triggered when after seekCeil we call next which needs to decompressBlock. We read vInt to term.length, where in fact this vInt refers to the compressed block length. This length is greater than max term length of the block, which leads to IndexOutOfBoundsException.

TODO: write a test

@epotyom epotyom force-pushed the bug-Lucene90DocValuesProducer-seekCeil branch from 2fc4090 to f8c0e43 Compare September 14, 2023 18:16
@epotyom
Copy link
Contributor Author

epotyom commented Sep 14, 2023

Extended existing nightly random tests to catch the issue most of the time. Would that be enough or do we need a test that catches it every single time?

Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we can have a unit test that can constantly reproduce the previous failure. If my analysis is correct, then probably we just need to reproduce it with

te.seekExact(0);
te.seekCeil(new BytesRef());
te.next()

@@ -1205,7 +1205,15 @@ public SeekStatus seekCeil(BytesRef text) throws IOException {
ord = 0;
return SeekStatus.END;
} else {
seekExact(0L);
// seekBlock doesn't update ord and it repositions bytes when calls getFirstTermFromBlock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think the ultimate reason might be previously when we call seekExact(0), this if condition is not true, and thus the bytes and ord are not reset correctly.

      if (ord < this.ord || blockIndex != currentBlockIndex) {
        // The looked up ord is before the current ord or belongs to a different block, seek again
        final long blockAddress = blockAddresses.get(blockIndex);
        bytes.seek(blockAddress);
        this.ord = (blockIndex << TERMS_DICT_BLOCK_LZ4_SHIFT) - 1;
      }

So I would prefer we do

ord = 1; // Probably need to add some comments on why we do this weird thing
seekExact(0L);

Instead of doing it manually here in case we're changing the seek behavior in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case we're changing the seek behavior in the future.

This is the reason why I think we should not rely on seekExact. For example, if we do:

ord = 1;
seekExact(0L);

and in the future we optimize seekExact to not reset bytes, this can break again, because seekExact relies on TermDict to be in consistent state. Just to give an example of potential optimization: both ords 0 and 1 are in the same block. So, when we seek from 1 to 0 we can try using data from the current block without re-reading data from bytes? Even if this particular idea doesn't work, I think that that public methods of this class rely on data to be in a consistent state, so we should probably not rely on them to fix the state?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, can we instead refactor this code with the code below to a method seekBlock(int)? Essentially these 4 lines:

      final long blockAddress = blockAddresses.get(block);
      this.ord = block << TERMS_DICT_BLOCK_LZ4_SHIFT;
      bytes.seek(blockAddress);
      decompressBlock();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks for the suggestion! We still need to handle block < 0, but it looks better now and seekBlock leaves term dict in a consistent state.

@zhaih
Copy link
Contributor

zhaih commented Sep 15, 2023

Actually I just tried it myself and this will always reproduce the error:

        actual.seekExact(0);
        actual.seekCeil(new BytesRef(""));
        for (int i = 0; i < TERMS_DICT_BLOCK_LZ4_SIZE; i++) {
          actual.next();
        }

What happened is really tricky:
First, when the bug happens, the ord must be 0, otherwise as I previously said everything will be nicely reset by seekExact(0). So only when ord == 0, when we call a seekCeil with some small value the ord and bytesRef will be out of sync.
However at this moment it's totally ok to call next() because the next ord is 1 (so we don't need to decompressBlock) and we have a blockInput that is caching the decompressed bytes when the last time we call decompressBlock (remember the ord must be 0, so before the seekCeil, someone must have already called decompressBlock in some way, either next or seekXX)
The disaster happens when we keep calling next() until we have to decompress the next block, as we're previously always using cached blockInput so bytes kept untouched, and as @epotyom described we read block length and thought it was term length and overfilled the term buffer.

@epotyom epotyom force-pushed the bug-Lucene90DocValuesProducer-seekCeil branch from f8c0e43 to 0885807 Compare September 15, 2023 14:49
// Testing termsEnum seekCeil edge case, where inconsistent internal state led to
// IndexOutOfBoundsException
// see https://github.com/apache/lucene/pull/12555 for details
public void testTermsEnumConsistency() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the unit test!

@zhaih
Copy link
Contributor

zhaih commented Sep 15, 2023

Also pls add an entry to CHANGES.txt :)

@epotyom epotyom force-pushed the bug-Lucene90DocValuesProducer-seekCeil branch from 0885807 to 99a9616 Compare September 15, 2023 23:03
@epotyom epotyom force-pushed the bug-Lucene90DocValuesProducer-seekCeil branch from 99a9616 to 52833f9 Compare September 15, 2023 23:08
Copy link
Contributor

@zhaih zhaih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@zhaih zhaih merged commit d633c9b into apache:main Sep 16, 2023
4 checks passed
zhaih pushed a commit that referenced this pull request Sep 16, 2023
@zhaih
Copy link
Contributor

zhaih commented Sep 16, 2023

Merged and backported

@epotyom
Copy link
Contributor Author

epotyom commented Sep 17, 2023

@zhaih thank you for reviewing and merging!

@zhaih zhaih added this to the 9.8.0 milestone Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

org.apache.lucene.search.grouping.TestGroupFacetCollector.testRandom fails reproducibly
2 participants