Indexing a large collection of about 20 million paragraph-sized documents results in an ArrayIndexOutOfBoundsException in org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).
The bug is possibly related to issues described in here and SOLR-10936 – but I am not using SOLR, I am directly using Lucene Core.
The issue can be reproduced using code from GitHub trec-car-tools-example
- compile with
mvn compile assembly:single
- run with
java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir
Where paragraphCorpus.cbor is contained in this archive
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536 at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198) at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224) at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159) at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185) at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786) at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281) at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451) at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)
Migrated from LUCENE-8118 by Laura Dietz, 2 votes, updated Aug 19 2022
Environment:
Debian/Stretch
java version "1.8.0_144" Java(TM) SE Runtime Environment (build 1.8.0_144-b01) Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
Attachments: LUCENE-8118_test.patch
Linked issues:
Indexing a large collection of about 20 million paragraph-sized documents results in an ArrayIndexOutOfBoundsException in org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).
The bug is possibly related to issues described in here and SOLR-10936 – but I am not using SOLR, I am directly using Lucene Core.
The issue can be reproduced using code from GitHub trec-car-tools-example
mvn compile assembly:singlejava -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDirWhere paragraphCorpus.cbor is contained in this archive
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536 at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198) at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224) at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159) at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185) at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786) at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430) at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392) at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281) at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451) at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)
Migrated from LUCENE-8118 by Laura Dietz, 2 votes, updated Aug 19 2022
Environment:
Attachments: LUCENE-8118_test.patch
Linked issues: