Fix integer overflow when seeking the vector index for connections #11905

Merged (5 commits) on Nov 10, 2022

Conversation

@benwtrent (Member)

This bug has been around since 9.1. It relates directly to the number of nodes contained in level 0 of the HNSW graph. Since level 0 contains all the nodes, this implies the following:

  • In Lucene 9.1, the bug probably would have appeared once a single segment held 31580641 (Integer.MAX_VALUE/(maxConn + 1)) vectors.
  • In Lucene 9.2+, the bug appears when there are 16268814 (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment.

The stack trace shows an EOF failure, because Lucene attempts to seek to a negative offset in ByteBufferIndexInput.

This commit fixes the integer casts and uses Math.multiplyExact/Math.addExact for the multiplication and addition. The overhead here is minimal, as these calculations are done in constructors and then reused repeatedly afterwards.
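To illustrate the class of bug being fixed, here is a minimal sketch of the pattern (the method names and exact layout are illustrative, not the actual Lucene code): level 0 stores roughly (M * 2 + 1) ints per node, so a node's byte offset computed in int arithmetic can wrap negative.

// Illustrative sketch only; names and layout are not the actual Lucene code.
class GraphOffsetSketch {
  // Broken: the right-hand side is evaluated entirely in int arithmetic, so it can
  // wrap to a negative value before being widened to long for the seek.
  static long brokenOffset(int node, int m) {
    return node * (m * 2 + 1) * Integer.BYTES;
  }

  // Fixed: widen to long first; Math.multiplyExact throws instead of silently wrapping.
  static long fixedOffset(int node, int m) {
    return Math.multiplyExact((long) node, (long) (m * 2 + 1) * Integer.BYTES);
  }
}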

I put fixes in the older codecs as well. I don't know if that is typically done, but if somebody has a large segment and wants to read its vectors, they could build this jar and read them now (the bug is only on the read path, and the data layout is unchanged).

@rmuir (Member) commented Nov 8, 2022

Not good: thanks for splitting this out from your other PR! I wonder if we can start cooking up something similar to #11867? It looks like we need more vectors, but they can have fewer dimensions to trigger this one? Maybe they can be artificially generated so we don't need a dataset?

@rmuir (Member) commented Nov 8, 2022

OK, I took a stab at a test in that draft PR, but it's still pretty slow, so I'm going to leave it running. We have to start building up tests for these cases, because this feels like déjà vu as far as int overflows in this area go.

@benwtrent (Member, Author)

> We have to start building up tests for these cases, because this feels like déjà vu as far as int overflows in this area go.

I am right there with you, @rmuir. It 100% feels like "whack-a-mole".

> It looks like we need more vectors, but they can have fewer dimensions to trigger this one?

Yeah, we can probably trigger this overflow by using 16268815 byte vectors with only a few dimensions. Something as small as 2 dimensions could work.

One issue with HNSW is that completely random vectors can make it dog-slow at index time. Maybe having few dimensions could alleviate this.
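For illustration, the indexing loop of such a test might look roughly like this (a sketch only: the field name, doc count, and directory path are made up, and the actual draft test differs):

// Sketch: index enough tiny vectors to push level 0 of the HNSW graph past the
// overflow point, then force-merge so they all land in a single segment.
try (Directory dir = FSDirectory.open(Paths.get("/tmp/knn-overflow-test"));
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  int numDocs = 16_268_816; // hypothetical count just past the suspected threshold
  for (int i = 0; i < numDocs; i++) {
    Document doc = new Document();
    // very low dimensionality keeps HNSW graph construction from crawling
    doc.add(new KnnVectorField("vector", new float[] {i % 256}));
    writer.addDocument(doc);
  }
  writer.forceMerge(1);
}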

> but it's still pretty slow, so I'm going to leave it running.

Thanks for digging into writing up a test. I am thinking about how to test it. I was initially thinking about moving all these numeric calculations outside of the I/O path, but that pattern is not prevalent in Lucene. Also, this type of "unit test" that wouldn't ACTUALLY use large amounts of data still won't solve our lack of coverage in larger test scenarios.


As an aside, @rmuir, all of the spots fixed here had a Java auto-cast warning. Running IntelliJ's analyzer, there are over 100 instances of integer multiplication being implicitly cast to long. It may be prudent to dig through these warnings and see if they need fixing. What do you think?

@dweiss (Contributor) commented Nov 8, 2022

There's a whole bunch of automated checks you could go through selectively and try to enable for the future. This includes IntLongMath, which is currently off.

https://github.com/apache/lucene/blob/main/gradle/validation/error-prone.gradle#L71-L183

@rmuir (Member) commented Nov 8, 2022

> Thanks for digging into writing up a test. I am thinking about how to test it. I was initially thinking about moving all these numeric calculations outside of the I/O path, but that pattern is not prevalent in Lucene. Also, this type of "unit test" that wouldn't ACTUALLY use large amounts of data still won't solve our lack of coverage in larger test scenarios.

Agreed, it is an annoying problem. But it's just incredibly painful for vectors because of the way they don't really scale with large data (heap/CPU). It is true that most of these "Test2BXXX" tests are monstrous, but not all. For example, Test2BPostings runs in every single nightly build. It's an example of a @Monster test that eventually became less monstrous and was able to become @Nightly.
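For readers unfamiliar with those annotations, this is roughly how such tests are gated in Lucene's test framework (the class and test names here are made up):

// Sketch: @Monster tests are excluded from normal and nightly runs and must be
// invoked explicitly; @Nightly tests run as part of the nightly builds.
@Monster("requires tens of millions of vectors and several hours of runtime")
public class Test2BKnnVectorsSketch extends LuceneTestCase {
  public void testHugeSegment() throws Exception {
    // ... index > 16M vectors into a single segment, then run a KNN search ...
  }
}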

Still, monster tests are better than nothing, even though they don't run in CI, e.g. when verifying releases or even reviewing PRs. I was already planning on running the #11867 test against your other PR (#11860), because I saw that the PR might be older than the last int-overflow bug and I wanted to make sure we didn't regress. Not an ideal situation, but some kind of test is better than nothing.

@mikemccand even has a test somewhere that powers a machine on and off with an X10 controller to validate that data is synced to disk correctly :)

> As an aside, @rmuir, all of the spots fixed here had a Java auto-cast warning. Running IntelliJ's analyzer, there are over 100 instances of integer multiplication being implicitly cast to long. It may be prudent to dig through these warnings and see if they need fixing. What do you think?

I agree with using static analysis to try to address more of these. +1

@rmuir (Member) commented Nov 8, 2022

> Yeah, we can probably trigger this overflow by using 16268815 byte vectors with only a few dimensions. Something as small as 2 dimensions could work.
>
> One issue with HNSW is that completely random vectors can make it dog-slow at index time. Maybe having few dimensions could alleviate this.

We may need to modify the draft test, then, to trigger the bug. I used only one dimension and a simple docid % 256 to assign the vector value. I also used only 16268814 documents, so it may need one more. I am also unsure whether CheckIndex at the end will trigger the issue you describe; maybe it only calls next and not advance, or something like that. In that case we may need to fix CheckIndex to do some "advancing", too. It already does a similar thing for postings, see the // Test skipping section of the checkFields() method. So it's definitely all a WIP.

@rmuir (Member) commented Nov 8, 2022

> • In Lucene 9.2+, the bug appears when there are 16268814 (Integer.MAX_VALUE/(M * 2 + 1)) or more vectors in a single segment.

If this is correct, we should just be able to create new Lucene94HnswVectorsFormat(BIG_ASS_M, defaultBeamWidth) in the test and only use a few documents to trigger this overflow? It could be a normal unit test. I'm not having luck reproducing this issue with a monster test.
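Presumably that would look something like the following (a sketch only; BIG_ASS_M is a placeholder, and the format may enforce an upper bound on maxConn, in which case this shortcut would not work as written):

// Sketch: make the per-node connection block huge so the graph offset overflows
// with only a handful of documents instead of ~16M.
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(new Lucene94Codec() {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    // BIG_ASS_M is hypothetical; Lucene94HnswVectorsFormat may reject very large values.
    return new Lucene94HnswVectorsFormat(BIG_ASS_M, Lucene94HnswVectorsFormat.DEFAULT_BEAM_WIDTH);
  }
});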

@benwtrent (Member, Author)

@rmuir Thinking outside the box! I will try that. It would definitely blow the graph offset calculation completely out of proportion, which is the cause of this overflow.

@rmuir (Member) commented Nov 8, 2022

Yes, if such a test works it may at least prevent similar regressions.

Another possible idea is to give every vector a value of 0 and then zip up the index; it should be ~16MB of zeros, which would get compressed away. But then we would have to regenerate the index (e.g. like the backwards-codecs tests), which could be a hassle.

I still haven't given up on the monster test as a backup plan. I tried 16268814 and 16268815, but neither failed. I'm trying 20000000 right now, and I configured the tests to keep the index afterwards. If it doesn't fail, I will start messing with the CheckIndex code...

@rmuir (Member) commented Nov 8, 2022

@jdconrad helped with some math that may explain why the previous tests didn't fail:

jshell> int M = 16;
M ==> 16

jshell> long v1 = (1 + (M*2)) * 4 * 16268814;
v1 ==> 2147483448

jshell> long v2 = (1 + (M*2)) * 4 * 16268815;
v2 ==> 2147483580

jshell> long v3 = (1 + (M*2)) * 4 * 20000000;
v3 ==> -1654967296
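
jshell> // note: each right-hand side above is evaluated in int arithmetic before being
jshell> // assigned to long, which is why v3 wraps negative once the product passes Integer.MAX_VALUE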

So maybe this run with 20M docs will trip. I'll know in a few hours :)

@rmuir (Member) commented Nov 8, 2022

With the 20M docs it still didn't fail. I have the index saved so I can play around; maybe CheckIndex doesn't trigger what is needed here (e.g. advance vs. next).

It is a little crazy that this index has a 2.5GB .vex file that, if I run zip, deflates 98% down to 75MB. Very wasteful.

@benwtrent (Member, Author)

> It is a little crazy that this index has a 2.5GB .vex file that, if I run zip, deflates 98% down to 75MB. Very wasteful.

Agreed :). Once this stuff is solved, I hope to further iterate on reducing the size of the vector index. There is LOTS of room for improvement there.

@benwtrent (Member, Author)

@rmuir I took your test and modified it slightly (changing the number of vectors and the assertion). It ran for 3.5 hours and failed on the old code in exactly the right spot (overflowing on the graph location).

I am running it again with my fix to verify the fix. Thank you for iterating on this with me.

@rmuir (Member) commented Nov 9, 2022

Yeah, it's your search at the end that triggers the issue, because CheckIndex on vectors is currently too wimpy and never calls seek(). This is also an issue that we must address here.

You can make the test much faster by sticking with just one dimension and passing some additional args. The one on the other PR runs in 1.5 hours in its current config.
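For reference, the search at the end of such a test could be as simple as the following sketch (the field name, query vector, and directory variable are illustrative):

// Sketch: a KNN query forces HNSW graph traversal (and the graph-offset seek) that
// the current CheckIndex vector checks never exercise.
try (IndexReader reader = DirectoryReader.open(dir)) {
  IndexSearcher searcher = new IndexSearcher(reader);
  TopDocs hits = searcher.search(new KnnVectorQuery("vector", new float[] {0f}, 10), 10);
  // with the overflow present, this search is where the negative seek / EOF surfaces
  assert hits.scoreDocs.length > 0;
}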

@benwtrent (Member, Author)

Updated the monster test.

It ran for about 2 hours on my laptop (but I was working, in virtual meetings, etc. during the entire run).

Confirmed it fails without this patch. This patch allows it to pass.

@rmuir @jdconrad

It would be good to get this into a 9.4.2 patch release.

@rmuir (Member) commented Nov 9, 2022

Nice. I'm fine with the changes. We can open another issue to fix the CheckIndex stuff. I really do think we should do that before releasing, to look for more trouble. It would also be good if we can go through a candidate list of other potential problems (even if it's noisy, it's still less than the entire codebase).

@rmuir (Member) commented Nov 9, 2022

And thanks for helping battle the testing. It will get better!

rmuir pushed a commit to rmuir/lucene that referenced this pull request Nov 10, 2022
…pache#11905)

* Fix integer overflow when seeking the vector index for connections
* Adding monster test to cause overflow failure
asfgit pushed a commit that referenced this pull request Nov 10, 2022
…11905)

* Fix integer overflow when seeking the vector index for connections
* Adding monster test to cause overflow failure
jpountz added a commit to elastic/elasticsearch that referenced this pull request Nov 23, 2022
Most interestingly, this includes a fix for KNN vectors: apache/lucene#11905.
jpountz added a commit to jpountz/elasticsearch that referenced this pull request Nov 23, 2022
Most interestingly, this includes a fix for KNN vectors: apache/lucene#11905.
jpountz added a commit to elastic/elasticsearch that referenced this pull request Nov 24, 2022
Most interestingly, this includes a fix for KNN vectors: apache/lucene#11905.
benwtrent deleted the bugfix/fix-int-overflow-hnsw branch on March 13, 2024.