DataOutput.writeGroupVInts throws IntegerOverflow exception during merging #13373

Closed
iamsanjay opened this issue May 15, 2024 · 5 comments · Fixed by #13376

@iamsanjay (Contributor)

Description

As discussed on the email list, DataOutput.writeGroupVInts throws an integer-overflow ArithmeticException during merging. The goal is to find the root cause and also to improve the exception message.

Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArithmeticException: integer overflow
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
Caused by: java.lang.ArithmeticException: integer overflow
    at java.base/java.lang.Math.toIntExact(Math.java:1135)
    at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
    at org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
    at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)

More context from the reporter

Looking deeper into this, I think we overflowed a term frequency field. Looking at some statistics, in a previous release we had 1,288,526,281 instances of a certain field, and this would be larger now. Each of these would have had a limited set of values, but crucially nearly all of them would have had the term "positional" or "non-positional" added to the document.

There is no good reason to do this today; we should just turn this into a boolean field and update the UI. I will do this and report back.

Do you think a patch adding a try/catch with a more informative log message would be appreciated by the community? E.g. mentioning the field name in the exception?
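For illustration, the kind of wrapping being suggested might look like the sketch below; the call site and the fieldName variable are hypothetical, not actual Lucene code:

// Hypothetical sketch: re-throw with the field name, so the user can
// tell which field overflowed instead of seeing a bare "integer overflow".
try {
  writePostingsForTerm(term); // hypothetical call that ends up in writeGroupVInts
} catch (ArithmeticException e) {
  throw new IllegalStateException(
      "integer overflow while writing postings for field '" + fieldName + "'", e);
}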

The index that had an issue when merging into one segment definitely had more than 1 billion occurrences of the word "positional" in it. I hope to be able to give a closer number once re-indexing has finished with a "work-around".

Of course the "work-around" is to just fix this correctly, by not having that word so often in the index and definitely not as docs, freqs and postings.

For background information:

The use case was to find the set of documents that were either "positional" or "non-positional". This was present in the first check-in of our code 18 years ago! Since then our data has grown a bit ;) The code was using Lucene 1.4.3 at that time. Users would search using this as what would now be a facet, type:positional. I changed this to a field with only IndexOptions.DOCS, which is called 'positional' and searched as positional:yes, rewriting the previous query syntax behind the scenes so as not to break any user tools.

Version and environment details

No response

iamsanjay (Contributor, Author) commented May 15, 2024

The code snippet below is from the 9_10 branch, where this issue has been observed. As per the latest changes for 10, a few of these lines have been moved out of this method into a new method in another class. The relevant frames are:

    at java.base/java.lang.Math.toIntExact(Math.java:1135)
    at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)

public void writeGroupVInts(long[] values, int limit) throws IOException {
  int off = 0;
  // encode each group
  while ((limit - off) >= 4) {
    byte flag = 0;
    groupVIntBytes.setLength(1);
    flag |= (encodeGroupValue(Math.toIntExact(values[off++])) - 1) << 6;
    flag |= (encodeGroupValue(Math.toIntExact(values[off++])) - 1) << 4;
    flag |= (encodeGroupValue(Math.toIntExact(values[off++])) - 1) << 2;
    flag |= (encodeGroupValue(Math.toIntExact(values[off++])) - 1);
    groupVIntBytes.setByteAt(0, flag);
    writeBytes(groupVIntBytes.bytes(), groupVIntBytes.length());
  }
  // tail vints
  for (; off < limit; off++) {
    writeVInt(Math.toIntExact(values[off]));
  }
}
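Every value passes through Math.toIntExact, which is exactly where the reported ArithmeticException originates: it throws as soon as one of the buffered long values no longer fits in an int. A minimal standalone demonstration (not Lucene code):

long fits = 42L;
long tooBig = Integer.MAX_VALUE + 1L; // e.g. an accumulated count past 2^31 - 1

int ok = Math.toIntExact(fits);      // returns 42
int boom = Math.toIntExact(tooBig);  // throws java.lang.ArithmeticException: integer overflow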

easyice (Contributor) commented May 16, 2024

Sorry for missing the email list. It seems the docDeltaBuffer should not overflow, just from reading the code, so I will try to reproduce this issue. Could you show me your source code for indexing, and some sample data? @iamsanjay

@JervenBolleman

Hi @easyice, I am the original reporter on the mailing list.

As the code around indexing is a bit abstracted, it might be hard to follow. What I do have is the index that failed merging; it is, however, 173 GB xz-compressed. I could use Luke or a similar tool to extract more information for the Lucene team.

The field type that we are indexing into is:

FieldType UNSTORED_POSITIONAL = new FieldType();
UNSTORED_POSITIONAL.setOmitNorms(true);
UNSTORED_POSITIONAL.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
UNSTORED_POSITIONAL.setStored(false);
UNSTORED_POSITIONAL.setTokenized(false);
UNSTORED_POSITIONAL.freeze();

Then we add fields like so:

doc.add(new Field("type", value.toLowerCase(Locale.US), UNSTORED_POSITIONAL));

There are over 1,177,800,000 documents in this index, all with the term "positional" at least once, and on average three fields of this type in each document. At that scale the per-term occurrence count can easily run past Integer.MAX_VALUE (2,147,483,647).

So to create local sample data I would just do something like this ;)

for (int i = 0; i < 2_000_000_000; i++) {
    Document doc = new Document();
    doc.add(new Field("type", "number", UNSTORED_POSITIONAL));
    if (i % 2 == 0) {
        doc.add(new Field("type", "even", UNSTORED_POSITIONAL));
    } else {
        doc.add(new Field("type", "un-even", UNSTORED_POSITIONAL));
    }
    writer.addDocument(doc);
}
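For completeness, a minimal harness around that loop might look like the sketch below; the directory path and analyzer choice are assumptions, not from the original report:

import java.nio.file.Path;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Illustrative harness for the repro loop above; path and analyzer are
// assumptions. The forceMerge(1) call drives the single-segment merge
// that failed in the report.
try (Directory dir = FSDirectory.open(Path.of("/tmp/overflow-repro"));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
  // ... run the loop above ...
  writer.forceMerge(1);
}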

easyice (Contributor) commented May 16, 2024

Thank you @JervenBolleman. I have found the cause of the issue with @gf2121; I will raise a PR later.

@mikemccand (Member)

Here is the java-user discussion that led to this issue.

Thank you for reporting this @iamsanjay! It looks like it was a real bug, phew, and somewhat serious (not sure).

And thank you @easyice and @gf2121 for the quick repro/fix.
