Add monster test that indexes 1M vectors #11867

jtibshirani · 2022-10-20T19:42:19Z

This is a rough draft of a large-scale test for kNN vectors.

It indexes a large dataset of kNN vectors to catch issues that only show up when
segments are very large, such as integer overflow. The dataset is based on the StackOverflow
track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
I tried developing a test using random vectors, but HNSW can become quite slow
and ineffective when the data doesn't have structure.

Steps to run the test

1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
3. Start the test: ./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1

Relates to #11863.

rmuir · 2022-10-20T20:18:23Z

lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java

+ *
+ * Steps to run the test
+ *   1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
+ *   2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/


This tries to build a 3GB jar file as part of the :lucene:core:jar task. For me it takes an eternity because the file gets zipped into the jar. I dropped the file in the src/test folder instead and the test runs with it.

For this one I would just suggest changing the code comment to say mv documents.bin lucene/core/src/test/. It makes for a faster experience.

rmuir · 2022-10-20T20:21:29Z

lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java

+    assertNotNull(documentsPath);
+
+    try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
+         Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));


If we use newFSDirectory() instead, we also get a CheckIndex run at the end. That can give more confidence in tests like these (as well as confidence that there is no overflow in CheckIndex itself).

Maybe by using newFSDirectory() we can also remove the loop at the end that reads the vectors from all the docs? I would just nuke the loop through all the docs myself, and keep the checks that e.g. the vector field exists with the dimensions you expect. That's good to have in the test.

CheckIndex will still read all the vectors, but more thoroughly, and it probably won't cost the test any real extra runtime either.

rmuir · 2022-10-20T23:36:36Z

lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java

+        if (VERBOSE && i % 10_000 == 0) {
+          System.out.println("Indexed " + i + " vectors out of " + numVectors);
+        }
+      }


We can improve the output for this long-running test. Otherwise I had to fill in the gaps with jstack.
I would also consider changing the loop to be for (int i = 1; i <= numVectors; i++). Then the print will say "Indexed 1000000 vectors out of 1000000" at the very end, so that you know indexing is complete. This does not happen today.

Maybe also here before the forceMerge:

if (VERBOSE) { System.out.println("forceMerge()ing to one segment..."); }
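The loop suggestion can be sketched in isolation. This is a minimal illustration, not the PR's code: the class ProgressLoopSketch and the method progressMessages are hypothetical names, the indexing work is replaced by a comment, and a small vector count stands in for the 1M vectors. It only shows that a 1-based loop reports the final count, whereas a 0-based loop over 1,000,000 vectors would print its last message at 990000.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR: simulates the progress output of the indexing loop.
final class ProgressLoopSketch {

  // Collects the progress lines that a 1-based loop would print every `interval` vectors.
  static List<String> progressMessages(int numVectors, int interval) {
    List<String> messages = new ArrayList<>();
    for (int i = 1; i <= numVectors; i++) {
      // The real test would read and index vector number i here.
      if (i % interval == 0) {
        messages.add("Indexed " + i + " vectors out of " + numVectors);
      }
    }
    return messages;
  }

  public static void main(String[] args) {
    for (String line : progressMessages(30_000, 10_000)) {
      System.out.println(line);
    }
    // Announce the slow phase so a silent console is not mistaken for a hang.
    System.out.println("forceMerge()ing to one segment...");
  }
}
```

With numVectors = 30000 and an interval of 10000, the last progress line is "Indexed 30000 vectors out of 30000", so the end of indexing is visible before the forceMerge message.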

rmuir · 2022-10-21T01:16:04Z

With the current test I hit the exception below on the 9.4 tag (BUILD FAILED in 2h 24m 45s).
The initial indexing takes about an hour and then the forceMerge takes an eternity (over an hour), but it works.
This was with a 2GB heap. I never saw any significant time (e.g. 0.1%) in GC or other JVM threads when inspecting the running test.

org.apache.lucene.document.TestManyKnnVectors > testLargeSegment FAILED
    java.lang.IllegalStateException: Vector data length 3072000000 not matching size=1000000 * dim=768 * byteSize=4 = -1222967296
        at __randomizedtesting.SeedInfo.seed([CF186B7BCEFCCF79:EBD7012A6CACC57]:0)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.validateFieldEntry(Lucene94HnswVectorsReader.java:185)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readFields(Lucene94HnswVectorsReader.java:156)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.readMetadata(Lucene94HnswVectorsReader.java:103)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsReader.<init>(Lucene94HnswVectorsReader.java:64)
        at org.apache.lucene.codecs.lucene94.Lucene94HnswVectorsFormat.fieldsReader(Lucene94HnswVectorsFormat.java:157)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.<init>(PerFieldKnnVectorsFormat.java:219)
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat.fieldsReader(PerFieldKnnVectorsFormat.java:81)
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:157)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:91)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179)
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221)
        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:536)
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:598)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:91)
        at org.apache.lucene.document.TestManyKnnVectors.testLargeSegment(TestManyKnnVectors.java:94)
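
The negative expected value in the message is consistent with the product size * dim * byteSize being evaluated in 32-bit int arithmetic. The sketch below only reproduces that arithmetic with the numbers from the exception; the class and method names are hypothetical and this is not Lucene's validation code.

```java
// Illustration only: reproduces the arithmetic from the exception message, not Lucene's code.
final class VectorLengthOverflow {

  // Product computed entirely in int: wraps around silently past Integer.MAX_VALUE.
  static int lengthAsInt(int size, int dim, int byteSize) {
    return size * dim * byteSize;
  }

  // Widening the first operand to long makes the whole product a 64-bit computation.
  static long lengthAsLong(int size, int dim, int byteSize) {
    return (long) size * dim * byteSize;
  }

  // Math.multiplyExact throws ArithmeticException instead of wrapping.
  static boolean overflowsInt(int size, int dim, int byteSize) {
    try {
      Math.multiplyExact(Math.multiplyExact(size, dim), byteSize);
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    int size = 1_000_000, dim = 768, byteSize = 4;
    System.out.println(lengthAsInt(size, dim, byteSize));  // prints -1222967296
    System.out.println(lengthAsLong(size, dim, byteSize)); // prints 3072000000
    System.out.println(overflowsInt(size, dim, byteSize)); // prints true
  }
}
```

The int result matches the value on the right-hand side of the message, and the long result matches the reported vector data length of 3072000000.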

mikemccand · 2023-11-02T13:03:22Z

I love this idea of a "high scale" KNN monster test! It can catch overflow exceptions that we otherwise miss, and @rmuir hit a spooky exception that might be just such an example? @jtibshirani can we finish iterating on this PR and roll this test into Lucene, at least when running the @Monster tests?

Add monster test that indexes 1M vectors

d992964

rmuir reviewed Oct 20, 2022

jtibshirani added 4 commits October 20, 2022 13:40

Make sure to use default codec

c6c0851

Increase RAM buffer

e68f41a

Remove index sort

8804217

Fix checkstyle

6af6556

rmuir reviewed Oct 20, 2022

Fix RAM buffer size

09d2b68

rmuir reviewed Oct 20, 2022

This was referenced Nov 8, 2022

Fix integer overflow when seeking the vector index for connections #11905

Merged

Add monster test for many knn docs #11906

Closed

jtibshirani mentioned this pull request Nov 10, 2022

Add large-scale test for kNN vectors #11863

Open

alessandrobenedetti added the vector-based-search label Jun 15, 2023
