Add monster test that indexes 1M vectors #11867
 *
 * Steps to run the test
 * 1. Download the dataset: wget https://rally-tracks.elastic.co/so_vector/documents.bin
 * 2. Move the dataset to the resources folder: mv documents.bin lucene/core/src/resources/
This tries to make a 3GB jar file as part of the :lucene:core:jar task. For me it takes an eternity due to the zipping of the file into the jar. I dropped the file in the src/test folder instead and the test is running with it.
For this one, I just suggest changing the code comment to say mv documents.bin lucene/core/src/test/. It makes for a faster experience.
assertNotNull(documentsPath);

try (FileChannel input = FileChannel.open(Paths.get(documentsPath.toURI()));
    Directory dir = FSDirectory.open(createTempDir("ManyKnnVectors"));
If we use newFSDirectory() instead, then we get a CheckIndex run at the end too. It can give more confidence in tests like these (as well as confidence that there is no overflow in CheckIndex itself).
Maybe by using newFSDirectory instead, we can remove the loop at the end that reads the vectors from all the docs? I would just nuke the loop through all the docs myself, and keep the checks that, e.g., the vector field exists with the dimensions you expect. That's good to have in the test.
CheckIndex will still read all the vectors, but more thoroughly, and it probably won't cost the test any more runtime either.
lucene/core/src/test/org/apache/lucene/document/TestManyKnnVectors.java
  if (VERBOSE && i % 10_000 == 0) {
    System.out.println("Indexed " + i + " vectors out of " + numVectors);
  }
}
We can improve the output for this long-running test. I had to fill in the gaps with jstack otherwise.
I would also consider changing the loop to be for (int i = 1; i <= numVectors; i++). Then the print will say "Indexed 1000000 vectors out of 1000000" at the very end, so that you know indexing is complete. This does not happen today.
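To make the effect of the loop bounds concrete, here is a small standalone sketch. It is not the test's code: the class name `ProgressLoopSketch` and the helper `progress` are hypothetical, and the vector count is scaled down from 1M to 50,000 so the output stays short. It only collects the progress lines that each loop shape would print.

```java
import java.util.ArrayList;
import java.util.List;

public class ProgressLoopSketch {

  // Hypothetical helper: returns the progress lines printed by a loop that
  // runs i from start to endInclusive and reports every `interval` iterations.
  static List<String> progress(int start, int endInclusive, int numVectors, int interval) {
    List<String> lines = new ArrayList<>();
    for (int i = start; i <= endInclusive; i++) {
      if (i % interval == 0) {
        lines.add("Indexed " + i + " vectors out of " + numVectors);
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    int numVectors = 50_000; // scaled down from 1M for illustration
    int interval = 10_000;

    // Zero-based loop (i = 0 .. numVectors - 1), as in the hunk under review.
    List<String> zeroBased = progress(0, numVectors - 1, numVectors, interval);
    // One-based loop (i = 1 .. numVectors), as suggested in the review.
    List<String> oneBased = progress(1, numVectors, numVectors, interval);

    System.out.println("zero-based last line: " + zeroBased.get(zeroBased.size() - 1));
    System.out.println("one-based last line:  " + oneBased.get(oneBased.size() - 1));
  }
}
```

With the zero-based bounds the first line reports 0 vectors and the last one stops short of the total; with the one-based bounds the final line reports the full count.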
Maybe also here before the forceMerge:
if (VERBOSE) {
  System.out.println("forceMerge()ing to one segment...");
}
With the current test I hit the exception on the 9.4 tag: BUILD FAILED in 2h 24m 45s.
I love this idea of a "high scale" KNN monster test! It can catch overflow exceptions that we otherwise miss, and @rmuir hit a spooky exception that might be just such an example? @jtibshirani can we finish iterating on this PR and roll this test into Lucene, at least when running the
This is a rough draft of a large-scale test for kNN vectors.
It tests a large dataset of kNN vectors to check for issues that only show up when
segments are very large, like overflow. The dataset is based on the StackOverflow
track from Elasticsearch's rally benchmarks: https://github.com/elastic/rally-tracks/tree/master/so_vector.
I tried developing a test using random vectors, but HNSW can become quite slow
and ineffective when the data doesn't have structure.
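As a rough illustration of the kind of overflow that only appears at this scale, the sketch below computes the total byte size of the raw vector data with `int` and with `long` arithmetic. It is not taken from Lucene: the class and method names are hypothetical, and the dimension of 768 is an assumption (chosen because 1M float vectors of that size come to about 3GB, in line with the file size mentioned in the review).

```java
public class VectorSizeOverflowSketch {

  // Buggy variant: the product is computed in 32-bit int arithmetic
  // and wraps around silently once it exceeds Integer.MAX_VALUE.
  static int byteSizeInt(int numVectors, int dim) {
    return numVectors * dim * Float.BYTES;
  }

  // Safe variant: widen to long before multiplying.
  static long byteSizeLong(int numVectors, int dim) {
    return (long) numVectors * dim * Float.BYTES;
  }

  // Defensive variant: Math.multiplyExact throws ArithmeticException on int overflow.
  static int byteSizeExact(int numVectors, int dim) {
    return Math.multiplyExact(Math.multiplyExact(numVectors, dim), Float.BYTES);
  }

  public static void main(String[] args) {
    int numVectors = 1_000_000;
    int dim = 768; // assumed dimension, not stated in the PR

    System.out.println("int arithmetic:  " + byteSizeInt(numVectors, dim));
    System.out.println("long arithmetic: " + byteSizeLong(numVectors, dim));
    try {
      byteSizeExact(numVectors, dim);
    } catch (ArithmeticException e) {
      System.out.println("multiplyExact rejected the overflow: " + e.getMessage());
    }
  }
}
```

A small unit test with a few thousand vectors would never reach the wrap-around point, which is why a test against a segment of this size is useful.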
Steps to run the test
wget https://rally-tracks.elastic.co/so_vector/documents.bin
mv documents.bin lucene/core/src/resources/
./gradlew test --tests TestManyKnnVectors.testLargeSegment -Dtests.monster=true -Dtests.verbose=true -Dorg.gradle.jvmargs="-Xms2g -Xmx2g" --max-workers=1
Relates to #11863.