Fix integer overflow when seeking the vector index for connections #11905

Merged (5 commits) on Nov 10, 2022

Conversation

@benwtrent (Member)

This bug has been around since 9.1. It relates directly to the number of nodes contained in level 0 of the HNSW graph. Since level 0 contains all the nodes, this implies the following:

  • In Lucene 9.1, the bug probably would have appeared once a single segment held 31580641 (Integer.MAX_VALUE/(maxConn + 1)) vectors.
  • In Lucene 9.2+, the bug appears when there are 16268814 (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment.

The stack trace shows an EOF failure, because Lucene attempts to seek to a negative offset in ByteBufferIndexInput.

This commit fixes the integer casts and uses Math.multiplyExact/Math.addExact for the multiplication and addition. The overhead here is minimal, as these calculations are done in constructors and then reused repeatedly afterwards.
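To illustrate the class of bug being fixed, here is a minimal sketch of the pattern (the method names and exact layout are illustrative, not the actual Lucene code): level 0 stores roughly (M * 2 + 1) ints per node, so a node's byte offset computed in int arithmetic can wrap negative.

// Illustrative sketch only; names and layout are not the actual Lucene code.
class GraphOffsetSketch {
  // Broken: the right-hand side is evaluated entirely in int arithmetic, so it can
  // wrap to a negative value before being widened to long for the seek.
  static long brokenOffset(int node, int m) {
    return node * (m * 2 + 1) * Integer.BYTES;
  }

  // Fixed: widen to long first; Math.multiplyExact throws instead of silently wrapping.
  static long fixedOffset(int node, int m) {
    return Math.multiplyExact((long) node, (long) (m * 2 + 1) * Integer.BYTES);
  }
}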

I put fixes in the older codecs as well. I don't know if that is typically done, but if somebody has a large segment and wants to read its vectors, they could build this jar and read them now (the bug is only on the read path, and the data layout is unchanged).

@rmuir (Member) commented Nov 8, 2022

Not good: thanks for splitting this out from your other PR! I wonder if we can start cooking up something similar to #11867? It looks like we need more vectors, but they can have fewer dimensions to trigger this one? Maybe they can be artificially generated so we don't need a dataset?

@rmuir (Member) commented Nov 8, 2022

OK, I took a stab at a test in that draft PR, but it's still pretty slow, so I'm going to leave it running. We have to start building up tests for these cases, because this feels like déjà vu as far as int overflows in this area go.

@benwtrent (Member, Author)

> We have to start building up tests for these cases, because this feels like déjà vu as far as int overflows in this area go.

I am right there with you, @rmuir. It 100% feels like "whack-a-mole".

> It looks like we need more vectors, but they can have fewer dimensions to trigger this one?

Yeah, we can probably trigger this overflow by using 16268815 byte vectors with only a few dimensions. Something as small as 2 dimensions could work.

One issue with HNSW is that completely random vectors can make it dog-slow at index time. Maybe having few dimensions could alleviate this.
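For illustration, the indexing loop of such a test might look roughly like this (a sketch only: the field name, doc count, and directory path are made up, and the actual draft test differs):

// Sketch: index enough tiny vectors to push level 0 of the HNSW graph past the
// overflow point, then force-merge so they all land in a single segment.
try (Directory dir = FSDirectory.open(Paths.get("/tmp/knn-overflow-test"));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  int numDocs = 16_268_816; // hypothetical count just past the suspected threshold
  for (int i = 0; i < numDocs; i++) {
    Document doc = new Document();
    // very low dimensionality keeps HNSW graph construction from crawling
    doc.add(new KnnVectorField("vector", new float[] {i % 256}));
    writer.addDocument(doc);
  }
  writer.forceMerge(1);
}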

> but it's still pretty slow, so I'm going to leave it running.

Thanks for digging into writing up a test. I am thinking about how to test it. I was initially thinking about moving all these numeric calculations outside of the I/O path, but that pattern is not prevalent in Lucene. Also, this type of "unit test" that wouldn't ACTUALLY use large amounts of data still won't solve our lack of coverage in larger test scenarios.


As an aside, @rmuir, all of the spots fixed here had a Java auto-cast warning. Running IntelliJ's analyzer, there are over 100 instances of integer multiplication being implicitly cast to long. It may be prudent to dig through these warnings and see if they need fixing. What do you think?

@dweiss (Contributor) commented Nov 8, 2022

There's a whole bunch of automated checks you could go through selectively and try to enable for the future. This includes IntLongMath, which is currently off.

https://github.com/apache/lucene/blob/main/gradle/validation/error-prone.gradle#L71-L183

@rmuir (Member) commented Nov 8, 2022

> Thanks for digging into writing up a test. I am thinking about how to test it. I was initially thinking about moving all these numeric calculations outside of the I/O path, but that pattern is not prevalent in Lucene. Also, this type of "unit test" that wouldn't ACTUALLY use large amounts of data still won't solve our lack of coverage in larger test scenarios.

Agreed, it is an annoying problem. But it's just incredibly painful for vectors because of the way they don't really scale with large data (heap/CPU). It is true that most of these "Test2BXXX" tests are monstrous, but not all. For example, Test2BPostings runs in every single nightly build. It's an example of a @Monster test that eventually became less monstrous and was able to become @Nightly.
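For readers unfamiliar with those annotations, this is roughly how such tests are gated in Lucene's test framework (the class and test names here are made up):

// Sketch: @Monster tests are excluded from normal and nightly runs and must be
// invoked explicitly; @Nightly tests run as part of the nightly builds.
@Monster("requires tens of millions of vectors and several hours of runtime")
public class Test2BKnnVectorsSketch extends LuceneTestCase {
  public void testHugeSegment() throws Exception {
    // ... index > 16M vectors into a single segment, then run a KNN search ...
  }
}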

Still, monster tests are better than nothing, even though they don't run in CI, e.g. when verifying releases or even reviewing PRs. I was already planning on running the #11867 test against your other PR (#11860), because I saw that the PR might be older than the last int-overflow bug and I wanted to make sure we didn't regress. Not an ideal situation, but some kind of test is better than nothing.

@mikemccand even has a test somewhere that powers a machine on and off with an X10 controller to validate that data is synced to disk correctly :)

> As an aside, @rmuir, all of the spots fixed here had a Java auto-cast warning. Running IntelliJ's analyzer, there are over 100 instances of integer multiplication being implicitly cast to long. It may be prudent to dig through these warnings and see if they need fixing. What do you think?

I agree with using static analysis to try to address more of these. +1

@rmuir (Member) commented Nov 8, 2022

> Yeah, we can probably trigger this overflow by using 16268815 byte vectors with only a few dimensions. Something as small as 2 dimensions could work.
>
> One issue with HNSW is that completely random vectors can make it dog-slow at index time. Maybe having few dimensions could alleviate this.

We may need to modify the draft test, then, to trigger the bug. I used only one dimension and a simple docid % 256 to assign the vector value. I also used only 16268814 documents, so it may need one more. I am also unsure whether CheckIndex at the end will trigger the issue you describe; maybe it only calls next and not advance, or something like that. In that case we may need to fix CheckIndex to do some "advancing", too. It already does a similar thing for postings, see the // Test skipping section of the checkFields() method. So it's definitely all a WIP.

@rmuir (Member) commented Nov 8, 2022

> • In Lucene 9.2+, the bug appears when there are 16268814 (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment.

If this is correct, we should just be able to create new Lucene94HnswVectorsFormat(BIG_ASS_M, defaultBeamWidth) in the test and only use a few documents to trigger this overflow? It could be a normal unit test. I'm not having luck reproducing this issue with a monster test.
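Presumably that would look something like the following (a sketch only; BIG_ASS_M is a placeholder, and the format may enforce an upper bound on maxConn, in which case this shortcut would not work as written):

// Sketch: make the per-node connection block huge so the graph offset overflows
// with only a handful of documents instead of ~16M.
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(new Lucene94Codec() {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    // BIG_ASS_M is hypothetical; Lucene94HnswVectorsFormat may reject very large values.
    return new Lucene94HnswVectorsFormat(BIG_ASS_M, Lucene94HnswVectorsFormat.DEFAULT_BEAM_WIDTH);
  }
});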

@benwtrent (Member, Author)

@rmuir Thinking outside the box! I will try that. It would definitely blow the graph offset calculation completely out of proportion, which is the cause of this overflow.

@rmuir (Member) commented Nov 8, 2022

Yes, if such a test works it may at least prevent similar regressions.

Another possible idea is to give every vector a value of 0 and then zip up the index; it should be ~16MB of zeros, which would get compressed away. But then we would have to regenerate the index (e.g. like the backwards-codecs tests), which could be a hassle.

I still haven't given up on the monster test as a backup plan. I tried 16268814 and 16268815, but neither failed. I'm trying 20000000 right now, and I configured the tests to keep the index afterwards. If it doesn't fail, I will start messing with the CheckIndex code...

@rmuir (Member) commented Nov 8, 2022

@jdconrad helped with some math that may explain why the previous tests didn't fail:

jshell> int M = 16;
M ==> 16

jshell> long v1 = (1 + (M*2)) * 4 * 16268814;
v1 ==> 2147483448

jshell> long v2 = (1 + (M*2)) * 4 * 16268815;
v2 ==> 2147483580

jshell> long v3 = (1 + (M*2)) * 4 * 20000000;
v3 ==> -1654967296
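
jshell> // note: each right-hand side above is evaluated in int arithmetic before being
jshell> // assigned to long, which is why v3 wraps negative once the product passes Integer.MAX_VALUE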

So maybe this run with 20M docs will trip. I'll know in a few hours :)

@rmuir (Member) commented Nov 8, 2022

With the 20M docs it still didn't fail. I have the index saved so I can play around; maybe CheckIndex doesn't trigger what is needed here (e.g. advance vs. next).

It is a little crazy that this index has a 2.5GB .vex file that, if I run zip, deflates 98% down to 75MB. Very wasteful.

@benwtrent (Member, Author)

> It is a little crazy that this index has a 2.5GB .vex file that, if I run zip, deflates 98% down to 75MB. Very wasteful.

Agreed :). Once this stuff is solved, I hope to further iterate on reducing the size of the vector index. There is LOTS of room for improvement there.

@benwtrent (Member, Author)

@rmuir I took your test and modified it slightly (changing the number of vectors and the assertion). It ran for 3.5 hours and failed on the old code in exactly the right spot (overflowing on the graph location).

I am running it again with my fix to verify the fix. Thank you for iterating on this with me.

@rmuir (Member) commented Nov 9, 2022

Yeah, it's your search at the end that triggers the issue, because CheckIndex on vectors is currently too wimpy and never calls seek(). This is also an issue that we must address here.

You can make the test much faster by sticking with just one dimension and passing some additional args. The one on the other PR runs in 1.5 hours in its current config.
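For reference, the search at the end of such a test could be as simple as the following sketch (the field name, query vector, and directory variable are illustrative):

// Sketch: a KNN query forces HNSW graph traversal (and the graph-offset seek) that
// the current CheckIndex vector checks never exercise.
try (IndexReader reader = DirectoryReader.open(dir)) {
  IndexSearcher searcher = new IndexSearcher(reader);
  TopDocs hits = searcher.search(new KnnVectorQuery("vector", new float[] {0f}, 10), 10);
  // with the overflow present, this search is where the negative seek / EOF surfaces
  assert hits.scoreDocs.length > 0;
}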

@benwtrent (Member, Author)

Updated the monster test.

It ran for about 2 hours on my laptop (but I was working, in virtual meetings, etc. during the entire run).

Confirmed it fails without this patch. This patch allows it to pass.

@rmuir @jdconrad

It would be good to get this into a 9.4.2 patch release.

@rmuir (Member) commented Nov 9, 2022

Nice. I'm fine with the changes. We can open another issue to fix the CheckIndex stuff. I really do think we should do that before releasing, to look for more trouble. It would also be good if we can go through a candidate list of other potential problems (even if it's noisy, it's still less than the entire codebase).

@rmuir (Member) commented Nov 9, 2022

And thanks for helping battle the testing. It will get better!

rmuir pushed a commit to rmuir/lucene that referenced this pull request Nov 10, 2022
…pache#11905)

* Fix integer overflow when seeking the vector index for connections
* Adding monster test to cause overflow failure
asfgit pushed a commit that referenced this pull request Nov 10, 2022
…11905)

* Fix integer overflow when seeking the vector index for connections
* Adding monster test to cause overflow failure
jpountz added a commit to elastic/elasticsearch that referenced this pull request Nov 23, 2022
Most interestingly, this includes a fix for KNN vectors: apache/lucene#11905.
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Nov 23, 2022
Most interestingly, this includes a fix for KNN vectors: apache/lucene#11905.
jpountz added a commit to elastic/elasticsearch that referenced this pull request Nov 24, 2022
Most interestingly, this includes a fix for KNN vectors: apache/lucene#11905.
benwtrent deleted the bugfix/fix-int-overflow-hnsw branch on March 13, 2024.