Validate binary index consistency during deserialization (#4978)

Closed
scsiguy wants to merge 2 commits into facebookresearch:main from scsiguy:export-D97819033

@scsiguy scsiguy commented Mar 24, 2026

Summary:

Add validation checks for binary index types at deserialization time.
Without these checks, malformed index files can cause out-of-bounds
memory accesses, null pointer dereferences, and type confusion during
subsequent search or add operations.

- **`read_index_binary_header`**:
  strengthen `d >= 0` to `d > 0 && d % 8 == 0` and add `code_size == d / 8`.
  Binary indices represent vectors as bit arrays with one bit per
  dimension; all operations compute byte offsets via `d / 8`. If `d` is
  not a multiple of 8, integer division truncates `code_size = d / 8`,
  so every vector stride is too short, causing out-of-bounds
  reads/writes in `add()`, `search()`, and distance computations. A
  `d = 0` causes division by zero. A mismatched `code_size` has the same
  effect as a truncated one: vector data is strided by `code_size` bytes
  (`x + i * code_size`), so a wrong value places every vector access at
  the wrong offset.

- **`read_binary_hash_invlists`**:
  validate `vecs.size() == ids.size() * code_size`. Each inverted list
  entry in `IndexBinaryHash` pairs a vector ID with `code_size` bytes of
  raw vector data. If `vecs` is shorter than `ids.size() * code_size`,
  distance computations that visit one `code_size`-byte vector per ID
  will read past the end of the `vecs` buffer.

- **`read_binary_multi_hash_map`**:
  validate `id < ntotal` for each deserialized ID. IDs in the multi-hash
  map index into the underlying `IndexBinaryFlat` storage array. An
  out-of-range ID causes an out-of-bounds read when the vector is looked
  up during search, either returning garbage distances or crashing.

- **IndexBinaryHNSW (IBHf) and IndexBinaryHNSWCagra (IBHc)**:
  validate three consistency invariants:
  1. `hnsw.levels.size() == ntotal`: The HNSW graph stores one level entry
     per vector. A mismatch causes out-of-bounds reads in the levels array
     during graph traversal, which runs inside OpenMP parallel regions.
  2. Storage is `IndexBinaryFlat`: `get_distance_computer()` does
     `dynamic_cast<IndexBinaryFlat*>(storage)` and dereferences the result.
     A non-flat storage type yields a null pointer, crashing the process.
  3. `storage->ntotal == ntotal`: HNSW graph node IDs index into the
     storage. If storage has fewer vectors than the graph expects,
     search traversal accesses out-of-bounds storage entries.
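
The invariants listed above can be sketched as standalone checks. This is only an illustration of the conditions, not the code in this PR: `BinaryHeader`, `HashInvList`, `HnswSummary`, `require`, `rejects`, and the `validate_*` functions are hypothetical names, the real readers operate on Faiss index objects, and the `id >= 0` lower bound is an added assumption that the summary does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for deserialized fields; these are not Faiss types.
struct BinaryHeader {
    int d;         // dimension in bits
    int code_size; // bytes per vector, expected to equal d / 8
};

struct HashInvList {
    std::vector<int64_t> ids;  // one ID per stored vector
    std::vector<uint8_t> vecs; // expected: ids.size() * code_size bytes
};

struct HnswSummary {
    int64_t ntotal;         // vectors the index claims to hold
    size_t num_levels;      // stands in for hnsw.levels.size()
    bool storage_is_flat;   // stands in for a non-null dynamic_cast result
    int64_t storage_ntotal; // stands in for storage->ntotal
};

// Throw (rather than abort) so a caller can catch a malformed file.
inline void require(bool ok, const std::string& what) {
    if (!ok) {
        throw std::runtime_error("invalid binary index: " + what);
    }
}

// Header invariant: d > 0, d % 8 == 0, code_size == d / 8.
void validate_header(const BinaryHeader& h) {
    require(h.d > 0, "d must be positive");
    require(h.d % 8 == 0, "d must be a multiple of 8");
    require(h.code_size == h.d / 8, "code_size must equal d / 8");
}

// Inverted-list invariant: exactly code_size bytes of codes per ID.
void validate_invlist(const HashInvList& il, int code_size) {
    require(il.vecs.size() == il.ids.size() * static_cast<size_t>(code_size),
            "vecs.size() must equal ids.size() * code_size");
}

// Multi-hash invariant: every ID must address a stored vector.
// The id >= 0 bound is an assumption added for this sketch.
void validate_ids(const std::vector<int64_t>& ids, int64_t ntotal) {
    for (int64_t id : ids) {
        require(id >= 0 && id < ntotal, "id out of range");
    }
}

// The three HNSW consistency invariants.
void validate_hnsw(const HnswSummary& s) {
    require(s.num_levels == static_cast<size_t>(s.ntotal),
            "hnsw.levels.size() must equal ntotal");
    require(s.storage_is_flat, "storage must be IndexBinaryFlat");
    require(s.storage_ntotal == s.ntotal, "storage->ntotal must equal ntotal");
}

// Returns true when the given check throws; handy for quick experiments.
template <typename F>
bool rejects(F&& check) {
    try {
        check();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

For instance, a header with `d = 100` and `code_size = 12` fails the `d % 8 == 0` check, while `d = 128` with `code_size = 16` passes. Throwing at load time keeps the failure catchable before any search or add code runs.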

Reviewed By: mnorris11

Differential Revision: D97819033

@meta-cla meta-cla Bot added the CLA Signed label Mar 24, 2026
meta-codesync Bot commented Mar 24, 2026

@scsiguy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97819033.

scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 24, 2026
…earch#4978)

scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 24, 2026
…earch#4978)

scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 25, 2026
…earch#4978)

Summary:
Validate that the `qb` (query quantization bits) field is in [0, 8] at all
6 RaBitQ deserialization points (IndexRaBitQ, IndexRaBitQFastScan,
IndexIVFRaBitQ, IndexIVFRaBitQFastScan, and their multi-bit variants).

The `qb` parameter controls the scalar quantization precision of query
vectors in the RaBitQ algorithm. The [0, 8] range is enforced because:

- **`qb = 0` is valid and means "use unquantized float32 distance computation."**
  When `qb == 0`, the code returns a `RaBitQDistanceComputerNotQ` that
  operates on raw fp32 values without any scalar quantization of the query.

- **`qb > 8` overflows `uint8_t` quantized values.**
  The query vector is scalar-quantized into `rotated_qq`, a
  `std::vector<uint8_t>`, with values in `[0, (1 << qb) - 1]`. For `qb
  > 8`, `(1 << qb) - 1` exceeds 255 and the cast to `uint8_t` silently
  truncates, producing incorrect quantized values. The bit-rearrangement
  loop in `set_query` also extracts individual bits via `rotated_qq[idim]
  & (1 << iv)` for `iv` in `[0, qb)`; with `qb > 8` this tests bit
  positions beyond the 8-bit width of `uint8_t`, which are always zero,
  and the shift itself becomes undefined behavior once `iv` reaches the
  width of `int`.

- **Existing runtime checks are `FAISS_ASSERT`, not `FAISS_THROW`.**
  Both `RaBitQDistanceComputerQ::set_query()` and
  `compute_query_factors()` have `FAISS_ASSERT(qb > 0)` / `FAISS_ASSERT(qb
  <= 8)` checks, but these compile to no-ops in release builds and call
  `abort()` in debug builds. In release builds the invalid value passes
  through unchecked; in debug builds, when the assertion is reached inside
  an OMP parallel region (as happens during search), `abort()` terminates
  the entire process without cleanup. Validating at deserialization time
  converts both outcomes into a catchable exception before the invalid
  value can reach any parallel code path.
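
The range argument above can be illustrated with a small sketch. It assumes nothing about the actual Faiss readers: `validate_qb`, `qb_is_accepted`, `max_level`, and `store_level` are hypothetical helpers that only show the `[0, 8]` check and the `uint8_t` narrowing it guards against.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical load-time check: reject qb outside [0, 8] with an exception
// that the caller can catch, instead of asserting later during search.
void validate_qb(int qb) {
    if (qb < 0 || qb > 8) {
        throw std::runtime_error("invalid RaBitQ qb: " + std::to_string(qb));
    }
}

// Convenience wrapper: true when validate_qb does not throw.
bool qb_is_accepted(int qb) {
    try {
        validate_qb(qb);
    } catch (const std::runtime_error&) {
        return false;
    }
    return true;
}

// Largest quantization level for a given qb, (1 << qb) - 1.
// Assumes 0 <= qb < 31 so the shift is well defined for a 32-bit int.
int max_level(int qb) {
    return (1 << qb) - 1;
}

// What a uint8_t slot holds after a level is narrowed into it.
// Out-of-range values wrap modulo 256, which is the silent truncation.
uint8_t store_level(int level) {
    return static_cast<uint8_t>(level);
}
```

With `qb = 9` the largest level is 511, so a level such as 300 is legal for the quantizer but is stored as 44 in a `uint8_t`. With `qb = 8` the largest level is 255 and every level fits.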

Differential Revision: D97815297
scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 25, 2026
…earch#4978)

@meta-codesync meta-codesync Bot changed the title from "Validate binary index consistency during deserialization" to "Validate binary index consistency during deserialization (#4978)" Mar 25, 2026
scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 25, 2026
…earch#4978)

scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 25, 2026
…earch#4978)

scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 25, 2026
…earch#4978)

…earch#4978)

Summary:
Pull Request resolved: facebookresearch#4978

scsiguy added a commit to scsiguy/faiss that referenced this pull request Mar 25, 2026
…earch#4978)

@meta-codesync meta-codesync Bot closed this in d26c15d Mar 26, 2026
meta-codesync Bot commented Mar 26, 2026

This pull request has been merged in d26c15d.
