Add monster test that indexes 1M vectors #11867

Draft
wants to merge 6 commits into
base: main

jtibshirani

This is a rough draft of a large-scale test for kNN vectors.

It tests a large dataset of kNN vectors to check for issues that only show up when
segments are very large, like overflow. The dataset is based on the StackOverflow
track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
I tried developing a test using random vectors, but HNSW can become quite slow
and ineffective when the data doesn't have structure.

Steps to run the test

  1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
  2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
  3. Start the test: ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1

Relates to #11863.

*
* Steps to run the test
* 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
* 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/

This tries to build a 3GB jar file as part of the :lucene:core:jar task. For me it takes an eternity because the file gets zipped into the jar. I dropped the file in the src/test folder instead, and the test runs fine with it.

For this one, I just suggest changing the code comment to say mv documents.bin lucene/core/src/test/. It makes for a faster experience.

assertNotNull(documentsPath);

try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));

If we use newFSDirectory() instead, we also get a CheckIndex run at the end. That gives more confidence in tests like these (as well as confidence that there is no overflow in CheckIndex itself).

Maybe by using newFSDirectory() instead, we can remove the loop that reads the vectors from all the docs at the end? I would just remove that loop through all the docs myself, and keep the checks that e.g. the vector field exists with the dimensions you expect. Those are good to have in the test.

CheckIndex will still read all the vectors, but more thoroughly, and it probably won't cost the test any more runtime either.

if (VERBOSE && i % 10_000 == 0) {
System.out.println("Indexed " + i + " vectors out of " + numVectors);
}
}

We can improve the output for this long-running test. As it stands, I had to fill in the gaps with jstack.
I would also consider changing the loop to be for (int i = 1; i <= numVectors; i++). Then the print will say "Indexed 1000000 vectors out of 1000000" at the very end, so that you know indexing is complete. This does not happen today.

Maybe also here before the forceMerge:

if (VERBOSE) {
   System.out.println("forceMerge()ing to one segment...");
}
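
To illustrate the loop suggestion above, here is a minimal, self-contained sketch comparing the progress lines from a 0-based loop with those from a 1-based loop. The class name ProgressLoopDemo and its methods are hypothetical and not part of the PR; the sketch collects the messages in a list instead of indexing anything, and it assumes the vector count is a multiple of the print interval.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the PR: it only models the progress messages.
public class ProgressLoopDemo {

  // Progress lines from a 0-based loop, as in the draft test.
  static List<String> zeroBased(int numVectors, int interval) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < numVectors; i++) {
      if (i % interval == 0) {
        lines.add("Indexed " + i + " vectors out of " + numVectors);
      }
    }
    return lines;
  }

  // Progress lines from the suggested 1-based loop.
  static List<String> oneBased(int numVectors, int interval) {
    List<String> lines = new ArrayList<>();
    for (int i = 1; i <= numVectors; i++) {
      if (i % interval == 0) {
        lines.add("Indexed " + i + " vectors out of " + numVectors);
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    List<String> before = zeroBased(30, 10);
    List<String> after = oneBased(30, 10);
    // The 0-based loop never reports the final count; the 1-based loop does.
    System.out.println(before.get(before.size() - 1));
    System.out.println(after.get(after.size() - 1));
  }
}
```

With 30 vectors and an interval of 10, the 0-based loop stops at "Indexed 20 vectors out of 30", while the 1-based loop ends on "Indexed 30 vectors out of 30".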

@rmuir

rmuir commented Oct 21, 2022

With the current test I hit the exception on the 9.4 tag: BUILD FAILED in 2h 24m 45s.
This was with a 2GB heap. I never saw any significant time (e.g. 0.1%) in GC or other JVM threads when inspecting the running test.
The initial indexing takes about an hour and then the forceMerge takes an eternity (over an hour), but it works:

org.apache.lucene.document.TestManyKnnVectors > testLargeSegment FAILED
    java.lang.IllegalStateException: Vector data length 3072000000 not matching size=1000000 * dim=768 * byteSize=4 = -1222967296
        at __randomizedtesting.SeedInfo.seed([CF186B7BCEFCCF79:EBD7012A6CACC57]:0)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.validateFieldEntry(Lucene94HnswVectorsReader.java:185)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readFields(Lucene94HnswVectorsReader.java:156)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readMetadata(Lucene94HnswVectorsReader.java:103)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.<init>(Lucene94HnswVectorsReader.java:64)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsFormat.fieldsReader(Lucene94HnswVectorsFormat.java:157)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.<init>(PerFieldKnnVectorsFormat.java:219)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat.fieldsReader(PerFieldKnnVectorsFormat.java:81)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:157)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:91)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179)
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221)
        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:536)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:598)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:91)
        at org.apache.lucene.document.TestManyKnnVectors.testLargeSegment(TestManyKnnVectors.java:94)
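
The negative value in the exception message is consistent with 32-bit integer overflow: 1000000 * 768 * 4 is 3072000000, which exceeds Integer.MAX_VALUE and wraps to -1222967296. The following is a minimal sketch of that arithmetic only. It is not the code from Lucene94HnswVectorsReader, and the class and method names are hypothetical.

```java
// Hypothetical demo class: reproduces the numbers from the exception message,
// not the actual Lucene validation code.
public class VectorSizeOverflowDemo {

  // All-int multiplication: the intermediate product wraps past Integer.MAX_VALUE.
  static int sizeWithIntMath(int size, int dim, int byteSize) {
    return size * dim * byteSize;
  }

  // Widening the first operand to long makes the whole product 64-bit.
  static long sizeWithLongMath(int size, int dim, int byteSize) {
    return (long) size * dim * byteSize;
  }

  // Math.multiplyExact throws ArithmeticException instead of wrapping silently.
  static int sizeChecked(int size, int dim, int byteSize) {
    return Math.multiplyExact(Math.multiplyExact(size, dim), byteSize);
  }

  public static void main(String[] args) {
    System.out.println(sizeWithIntMath(1_000_000, 768, 4));  // -1222967296
    System.out.println(sizeWithLongMath(1_000_000, 768, 4)); // 3072000000
    try {
      sizeChecked(1_000_000, 768, 4);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage());
    }
  }
}
```

The 64-bit result matches the "Vector data length 3072000000" in the message, and the wrapped 32-bit result matches the reported -1222967296.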

@mikemccand

I love this idea of a "high scale" KNN monster test! It can catch overflow exceptions that we otherwise miss, and @rmuir hit a spooky exception that might be just such an example? @jtibshirani can we finish iterating on this PR and roll this test into Lucene, at least when running the @Monster tests?
