Validate binary index consistency during deserialization (#4978)
Closed
scsiguy wants to merge 2 commits into
scsiguy added a commit to scsiguy/faiss that referenced this pull request on Mar 24, 2026

…earch#4978)

Summary:
Add validation checks for binary index types at deserialization time. Without these checks, malformed index files can cause out-of-bounds memory accesses, null pointer dereferences, and type confusion during subsequent search or add operations.

- **`read_index_binary_header`**: strengthen `d >= 0` to `d > 0 && d % 8 == 0` and add `code_size == d / 8`. Binary indices represent vectors as bit arrays with one bit per dimension; all operations compute byte offsets via `d / 8`. A dimension that is not a multiple of 8 causes integer truncation in `code_size = d / 8`, making every vector stride too short and causing out-of-bounds reads/writes in `add()`, `search()`, and distance computations. A `d = 0` causes division by zero. A mismatched `code_size` has the same effect: vector data is strided by `code_size` bytes (`x + i * code_size`), so a wrong value means every vector access is at the wrong offset.
- **`read_binary_hash_invlists`**: validate `vecs.size() == ids.size() * code_size`. Each inverted list entry in `IndexBinaryHash` pairs a vector ID with `code_size` bytes of raw vector data. If `vecs` is shorter than `ids.size() * code_size`, distance computations that iterate through the vector bytes using the ID count as stride will read past the end of the `vecs` buffer.
- **`read_binary_multi_hash_map`**: validate `id < ntotal` for each deserialized ID. IDs in the multi-hash map index into the underlying `IndexBinaryFlat` storage array. An out-of-range ID causes an out-of-bounds read when the vector is looked up during search, either returning garbage distances or crashing.
- **IndexBinaryHNSW (IBHf) and IndexBinaryHNSWCagra (IBHc)**: validate three consistency invariants:
  1. `hnsw.levels.size() == ntotal`: The HNSW graph stores one level entry per vector. A mismatch causes out-of-bounds reads in the levels array during graph traversal, which runs inside OpenMP parallel regions.
  2. Storage is `IndexBinaryFlat`: `get_distance_computer()` does `dynamic_cast<IndexBinaryFlat*>(storage)` and dereferences the result. A non-flat storage type yields a null pointer, crashing the process.
  3. `storage->ntotal == ntotal`: HNSW graph node IDs index into the storage. If storage has fewer vectors than the graph expects, search traversal accesses out-of-bounds storage entries.

Reviewed By: mnorris11

Differential Revision: D97819033
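The header, inverted-list, and ID checks described in this commit message can be sketched in isolation as follows. This is a minimal illustration, not the Faiss code: `BinaryHeader`, `require`, `validate_header`, `validate_invlist`, and `validate_ids` are hypothetical names, and `std::runtime_error` stands in for whatever exception the library actually throws. The lower bound `id >= 0` is an added assumption for a signed ID type; the commit message itself only states `id < ntotal`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the header fields of a serialized binary index.
struct BinaryHeader {
    int d;           // number of bits per vector
    int code_size;   // bytes per vector, expected to equal d / 8
    int64_t ntotal;  // number of stored vectors
};

// Throw a catchable exception instead of asserting.
inline void require(bool ok, const std::string& what) {
    if (!ok) {
        throw std::runtime_error("invalid binary index: " + what);
    }
}

// d must be a positive multiple of 8, and code_size must match d / 8.
void validate_header(const BinaryHeader& h) {
    require(h.d > 0 && h.d % 8 == 0, "d must be a positive multiple of 8");
    require(h.code_size == h.d / 8, "code_size must equal d / 8");
}

// Each ID in an inverted list owns exactly code_size bytes of vecs.
void validate_invlist(const std::vector<int64_t>& ids,
                      const std::vector<uint8_t>& vecs,
                      int code_size) {
    require(code_size > 0, "code_size must be positive");
    require(vecs.size() == ids.size() * static_cast<std::size_t>(code_size),
            "vecs.size() must equal ids.size() * code_size");
}

// Every deserialized ID must address an existing vector.
// The id >= 0 bound is an assumption added here for signed IDs.
void validate_ids(const std::vector<int64_t>& ids, int64_t ntotal) {
    for (int64_t id : ids) {
        require(id >= 0 && id < ntotal, "id out of range");
    }
}
```

Under these assumptions, a loader would call each function right after reading the corresponding fields, so a malformed file is rejected before any `x + i * code_size` style pointer arithmetic runs.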
scsiguy added a commit to scsiguy/faiss that referenced this pull request on Mar 25, 2026
Summary:
Validate that the `qb` (query quantization bits) field is in [0, 8] at all 6 RaBitQ deserialization points (IndexRaBitQ, IndexRaBitQFastScan, IndexIVFRaBitQ, IndexIVFRaBitQFastScan, and their multi-bit variants).

The `qb` parameter controls the scalar quantization precision of query vectors in the RaBitQ algorithm. The [0, 8] range is enforced because:

- **`qb = 0` is valid and means "use unquantized float32 distance computation."** When `qb == 0`, the code returns a `RaBitQDistanceComputerNotQ` that operates on raw fp32 values without any scalar quantization of the query.
- **`qb > 8` overflows `uint8_t` quantized values.** The query vector is scalar-quantized into `rotated_qq`, a `std::vector<uint8_t>`, with values in `[0, (1 << qb) - 1]`. For `qb > 8`, `(1 << qb) - 1` exceeds 255 and the cast to `uint8_t` silently truncates, producing incorrect quantized values. The bit-rearrangement loop in `set_query` also extracts individual bits via `rotated_qq[idim] & (1 << iv)` for `iv` in `[0, qb)`; with `qb > 8` this shifts beyond the 8-bit width of `uint8_t`, yielding undefined behavior.
- **Existing runtime checks are `FAISS_ASSERT`, not `FAISS_THROW`.** Both `RaBitQDistanceComputerQ::set_query()` and `compute_query_factors()` have `FAISS_ASSERT(qb > 0)` / `FAISS_ASSERT(qb <= 8)` checks, but these compile to no-ops in release builds and call `abort()` in debug builds. In either case, when reached inside an OMP parallel region (as happens during search), `abort()` terminates the entire process without cleanup. Validating at deserialization time converts this into a catchable exception before the invalid value can reach any parallel code path.

Differential Revision: D97815297
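The range check on `qb` and the truncation it guards against can be illustrated with a small standalone sketch. The names `validate_qb` and `max_quantized_value` are hypothetical and do not come from Faiss; `std::runtime_error` is used as a generic stand-in for a catchable exception.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: reject an out-of-range qb right after it is read.
// qb == 0 is accepted because it selects the unquantized path.
void validate_qb(unsigned qb) {
    if (qb > 8) {
        throw std::runtime_error(
                "invalid qb " + std::to_string(qb) + ": expected [0, 8]");
    }
}

// Largest quantized query value for a given qb: (1 << qb) - 1.
// For qb <= 8 this fits in a uint8_t; for qb == 9 it is already 511.
uint32_t max_quantized_value(unsigned qb) {
    return (uint32_t{1} << qb) - 1;
}
```

As a concrete case of the truncation, a quantized value of 300 (representable when `qb` is 9) becomes 44 once narrowed to `uint8_t`, since 300 - 256 = 44. Rejecting `qb > 8` while loading means no such value is ever produced.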
8d1aafb to ab62ac6
ab62ac6 to 2c3bae8
This pull request has been merged in d26c15d.
Summary:
Add validation checks for binary index types at deserialization time. Without these checks, malformed index files can cause out-of-bounds memory accesses, null pointer dereferences, and type confusion during subsequent search or add operations.

- **`read_index_binary_header`**: strengthen `d >= 0` to `d > 0 && d % 8 == 0` and add `code_size == d / 8`. Binary indices represent vectors as bit arrays with one bit per dimension; all operations compute byte offsets via `d / 8`. A dimension that is not a multiple of 8 causes integer truncation in `code_size = d / 8`, making every vector stride too short and causing out-of-bounds reads/writes in `add()`, `search()`, and distance computations. A `d = 0` causes division by zero. A mismatched `code_size` has the same effect: vector data is strided by `code_size` bytes (`x + i * code_size`), so a wrong value means every vector access is at the wrong offset.
- **`read_binary_hash_invlists`**: validate `vecs.size() == ids.size() * code_size`. Each inverted list entry in `IndexBinaryHash` pairs a vector ID with `code_size` bytes of raw vector data. If `vecs` is shorter than `ids.size() * code_size`, distance computations that iterate through the vector bytes using the ID count as stride will read past the end of the `vecs` buffer.
- **`read_binary_multi_hash_map`**: validate `id < ntotal` for each deserialized ID. IDs in the multi-hash map index into the underlying `IndexBinaryFlat` storage array. An out-of-range ID causes an out-of-bounds read when the vector is looked up during search, either returning garbage distances or crashing.
- **IndexBinaryHNSW (IBHf) and IndexBinaryHNSWCagra (IBHc)**: validate three consistency invariants:
  1. `hnsw.levels.size() == ntotal`: The HNSW graph stores one level entry per vector. A mismatch causes out-of-bounds reads in the levels array during graph traversal, which runs inside OpenMP parallel regions.
  2. Storage is `IndexBinaryFlat`: `get_distance_computer()` does `dynamic_cast<IndexBinaryFlat*>(storage)` and dereferences the result. A non-flat storage type yields a null pointer, crashing the process.
  3. `storage->ntotal == ntotal`: HNSW graph node IDs index into the storage. If storage has fewer vectors than the graph expects, search traversal accesses out-of-bounds storage entries.

Reviewed By: mnorris11

Differential Revision: D97819033
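The three HNSW invariants listed above can be sketched with a miniature class hierarchy. The types below (`BinaryIndex`, `FlatBinaryIndex`, `OtherBinaryIndex`, `HnswBinaryIndex`) and the function `validate_hnsw` are hypothetical stand-ins invented for illustration; they only mirror the relationships the description names and are not the Faiss definitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical miniature of the class layout; not the Faiss types.
struct BinaryIndex {
    int64_t ntotal = 0;
    virtual ~BinaryIndex() = default;
};
struct FlatBinaryIndex : BinaryIndex {};
struct OtherBinaryIndex : BinaryIndex {};

struct HnswBinaryIndex {
    int64_t ntotal = 0;
    std::vector<int> levels;               // one level entry per vector
    std::unique_ptr<BinaryIndex> storage;  // expected to be flat
};

// Check the three invariants once, at load time, and throw on violation.
void validate_hnsw(const HnswBinaryIndex& idx) {
    // 1. One level entry per vector.
    if (idx.ntotal < 0 ||
        idx.levels.size() != static_cast<std::size_t>(idx.ntotal)) {
        throw std::runtime_error("hnsw.levels.size() != ntotal");
    }
    // 2. Storage must be the flat type; test the cast result here
    //    rather than dereferencing it blindly later.
    const auto* flat =
            dynamic_cast<const FlatBinaryIndex*>(idx.storage.get());
    if (flat == nullptr) {
        throw std::runtime_error("storage is not a flat binary index");
    }
    // 3. Storage must hold exactly as many vectors as the graph expects.
    if (flat->ntotal != idx.ntotal) {
        throw std::runtime_error("storage->ntotal != ntotal");
    }
}
```

The point of the sketch is the ordering: the `dynamic_cast` result is tested before any field of the storage is read, so a wrong storage type surfaces as an exception during loading instead of a null dereference during search.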