Validate binary index consistency during deserialization (#4978) by scsiguy · Pull Request #4978 · facebookresearch/faiss

scsiguy · 2026-03-24T17:34:30Z

Summary:

Add validation checks for binary index types at deserialization time.
Without these checks, malformed index files can cause out-of-bounds
memory accesses, null pointer dereferences, and type confusion during
subsequent search or add operations.

- **`read_index_binary_header`**: strengthen `d >= 0` to `d > 0 && d % 8 == 0` and add `code_size == d / 8`. Binary indices represent vectors as bit arrays with one bit per dimension; all operations compute byte offsets via `d / 8`. A dimension that is not a multiple of 8 causes integer truncation in `code_size = d / 8`, making every vector stride too short and causing out-of-bounds reads/writes in `add()`, `search()`, and distance computations. A `d` of 0 causes division by zero. A mismatched `code_size` has the same effect: vector data is strided by `code_size` bytes (`x + i * code_size`), so a wrong value means every vector access is at the wrong offset.
- **`read_binary_hash_invlists`**: validate `vecs.size() == ids.size() * code_size`. Each inverted list entry in `IndexBinaryHash` pairs a vector ID with `code_size` bytes of raw vector data. If `vecs` is shorter than `ids.size() * code_size`, distance computations that step through `vecs` in `code_size`-byte strides, once per ID, will read past the end of the `vecs` buffer.
- **`read_binary_multi_hash_map`**: validate `id < ntotal` for each deserialized ID. IDs in the multi-hash map index into the underlying `IndexBinaryFlat` storage array. An out-of-range ID causes an out-of-bounds read when the vector is looked up during search, either returning garbage distances or crashing.
- **IndexBinaryHNSW (IBHf) and IndexBinaryHNSWCagra (IBHc)**: validate three consistency invariants:
  1. `hnsw.levels.size() == ntotal`: The HNSW graph stores one level entry per vector. A mismatch causes out-of-bounds reads in the levels array during graph traversal, which runs inside OpenMP parallel regions.
  2. Storage is `IndexBinaryFlat`: `get_distance_computer()` does `dynamic_cast<IndexBinaryFlat*>(storage)` and dereferences the result. A non-flat storage type yields a null pointer, crashing the process.
  3. `storage->ntotal == ntotal`: HNSW graph node IDs index into the storage. If storage has fewer vectors than the graph expects, search traversal accesses out-of-bounds storage entries.
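
The invariants listed above can be sketched as standalone checks. This is a minimal illustration under stated assumptions, not the code from this PR: the structs `BinaryHeader`, `HashInvList`, `Storage`, `FlatStorage`, `OtherStorage`, and `HnswIndex`, and the helpers `check` and `validate_*`, are hypothetical stand-ins for the faiss types and read functions, and the lower bound on IDs is an added assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for deserialized state; these are not the faiss types.
struct BinaryHeader {
    int d = 0;           // dimension in bits
    int code_size = 0;   // bytes per vector
    int64_t ntotal = 0;  // number of stored vectors
};

struct HashInvList {
    std::vector<int64_t> ids;   // one ID per entry
    std::vector<uint8_t> vecs;  // code_size bytes per entry
};

struct Storage {
    virtual ~Storage() = default;
    int64_t ntotal = 0;
};
struct FlatStorage : Storage {};   // plays the role of IndexBinaryFlat
struct OtherStorage : Storage {};  // any non-flat storage

struct HnswIndex {
    int64_t ntotal = 0;
    std::vector<int> levels;  // one level entry per vector
    const Storage* storage = nullptr;
};

// Throw rather than abort, so a bad file surfaces as a catchable error.
inline void check(bool ok, const std::string& what) {
    if (!ok) {
        throw std::runtime_error("invalid binary index: " + what);
    }
}

void validate_header(const BinaryHeader& h) {
    check(h.d > 0 && h.d % 8 == 0, "d must be a positive multiple of 8");
    check(h.code_size == h.d / 8, "code_size must equal d / 8");
}

void validate_invlist(const HashInvList& il, int code_size) {
    check(code_size > 0, "code_size must be positive");
    check(il.vecs.size() == il.ids.size() * static_cast<std::size_t>(code_size),
          "vecs.size() must equal ids.size() * code_size");
}

void validate_ids(const std::vector<int64_t>& ids, int64_t ntotal) {
    for (int64_t id : ids) {
        // The lower bound is an added assumption for signed IDs.
        check(id >= 0 && id < ntotal, "id must be in [0, ntotal)");
    }
}

void validate_hnsw(const HnswIndex& idx) {
    check(idx.levels.size() == static_cast<std::size_t>(idx.ntotal),
          "hnsw.levels.size() must equal ntotal");
    // A failed dynamic_cast yields nullptr; reject it before any dereference.
    check(dynamic_cast<const FlatStorage*>(idx.storage) != nullptr,
          "storage must be flat");
    check(idx.storage->ntotal == idx.ntotal,
          "storage->ntotal must equal ntotal");
}
```

Throwing from each check lets a caller wrap the load in a `try` block and reject the file, rather than hitting the fault later during `search()` or `add()`.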

Reviewed By: mnorris11

Differential Revision: D97819033

meta-codesync · 2026-03-24T17:34:38Z

@scsiguy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97819033.

Summary: Validate that the `qb` (query quantization bits) field is in [0, 8] at all 6 RaBitQ deserialization points (IndexRaBitQ, IndexRaBitQFastScan, IndexIVFRaBitQ, IndexIVFRaBitQFastScan, and their multi-bit variants).

The `qb` parameter controls the scalar quantization precision of query vectors in the RaBitQ algorithm. The [0, 8] range is enforced because:

- **`qb = 0` is valid and means "use unquantized float32 distance computation."** When `qb == 0`, the code returns a `RaBitQDistanceComputerNotQ` that operates on raw fp32 values without any scalar quantization of the query.
- **`qb > 8` overflows `uint8_t` quantized values.** The query vector is scalar-quantized into `rotated_qq`, a `std::vector<uint8_t>`, with values in `[0, (1 << qb) - 1]`. For `qb > 8`, `(1 << qb) - 1` exceeds 255 and the cast to `uint8_t` silently truncates, producing incorrect quantized values. The bit-rearrangement loop in `set_query` also extracts individual bits via `rotated_qq[idim] & (1 << iv)` for `iv` in `[0, qb)`; with `qb > 8` this shifts beyond the 8-bit width of `uint8_t`, yielding undefined behavior.
- **Existing runtime checks are `FAISS_ASSERT`, not `FAISS_THROW`.** Both `RaBitQDistanceComputerQ::set_query()` and `compute_query_factors()` have `FAISS_ASSERT(qb > 0)` / `FAISS_ASSERT(qb <= 8)` checks, but these compile to no-ops in release builds and call `abort()` in debug builds. When `abort()` is reached inside an OMP parallel region (as happens during search), it terminates the entire process without cleanup. Validating at deserialization time converts this into a catchable exception before the invalid value can reach any parallel code path.

Differential Revision: D97815297
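
The range check and the truncation it guards against can be sketched as follows. This is an illustration only, assuming hypothetical helper names (`validate_qb`, `max_quantized_value`, `narrowed`) that do not appear in faiss; it mirrors the arithmetic described in the commit message rather than the library's code.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: reject qb outside [0, 8] with a catchable exception.
void validate_qb(int qb) {
    if (qb < 0 || qb > 8) {
        throw std::runtime_error(
                "invalid qb " + std::to_string(qb) +
                ": expected a value in [0, 8]");
    }
}

// Largest quantized query value for a given qb, computed before narrowing.
// Only meaningful for small non-negative qb (well below 32).
unsigned max_quantized_value(int qb) {
    return (1u << qb) - 1u;
}

// What a narrowing store into a uint8_t element keeps: the low 8 bits.
uint8_t narrowed(unsigned v) {
    return static_cast<uint8_t>(v);
}
```

With `qb = 8` the largest value is 255 and fits in a byte; with `qb = 9` it is 511, and narrowing keeps only the low 8 bits, so distinct quantized values collapse onto each other.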

meta-codesync · 2026-03-26T19:20:17Z

This pull request has been merged in d26c15d.

meta-cla Bot added the CLA Signed label Mar 24, 2026

meta-codesync Bot added fb-exported meta-exported labels Mar 24, 2026

meta-codesync Bot changed the title ~~Validate binary index consistency during deserialization~~ Validate binary index consistency during deserialization (#4978) Mar 25, 2026

scsiguy force-pushed the export-D97819033 branch from 8d1aafb to ab62ac6 Compare March 25, 2026 17:44

scsiguy force-pushed the export-D97819033 branch from ab62ac6 to 2c3bae8 Compare March 25, 2026 17:48

meta-codesync Bot closed this in d26c15d Mar 26, 2026

facebook-github-tools Bot added the Merged label Mar 26, 2026

scsiguy wants to merge 2 commits into facebookresearch:main from scsiguy:export-D97819033
