Skip to content

feat: enlarge indice size limit for sparse vectors#229

Merged
zhourrr merged 7 commits intomainfrom
feat/enlarge_sparse_vector_limit
Mar 16, 2026
Merged

feat: enlarge indice size limit for sparse vectors#229
zhourrr merged 7 commits intomainfrom
feat/enlarge_sparse_vector_limit

Conversation

@zhourrr
Copy link
Collaborator

@zhourrr zhourrr commented Mar 16, 2026

Greptile Summary

This PR enlarges the maximum non-zero element count for sparse vectors from 4096 to 16384 across both the flat and HNSW sparse index implementations, and fixes a pre-existing off-by-one inconsistency where some code paths used >= (rejecting the boundary value) while others used > (accepting it).

Key changes:

  • kSparseMaxDimSize / PARAM_FLAT_SPARSE_MAX_DIM_SIZE raised from 409616384 in all three constant definitions (constants.h, hnsw_sparse_entity.h, flat_sparse_utility.h)
  • Boundary checks in HnswSparseStreamer::add_impl, add_with_id_impl, and HnswSparseBuilderEntity::add_vector corrected from >= to >, making them consistent with the flat-sparse paths and the doc-layer validation
  • New sparse_indices.size() > kSparseMaxDimSize guards added to Doc::validate for both SPARSE_VECTOR_FP16 and SPARSE_VECTOR_FP32 paths — previously validation at the document layer was missing entirely
  • VectorQuery::validate byte-level check updated from >= to >, keeping it semantically aligned with the index-layer checks
  • Error messages improved across all touched paths to include the actual limit value
  • Test updated to use 16385 as the over-limit sentinel, but missing: a test asserting that exactly 16384 indices now passes (boundary regression risk), and tests for the two new Doc::validate guards

Confidence Score: 4/5

  • Safe to merge; the boundary-check fixes are correct and consistent, but the test suite has gaps at the new boundary value.
  • All three constant definitions are updated to the same value (16384), all >= checks that caused the off-by-one inconsistency have been corrected to >, and the logic across doc-layer validation and the core index paths is now aligned. The only concerns are missing boundary test coverage (exact kSparseMaxDimSize case) and no tests for the newly added Doc::validate guards, which reduce confidence slightly.
  • Pay close attention to tests/db/index/common/doc_test.cc — the boundary case for exactly 16384 indices is not tested, and the new Doc::validate sparse size guards have no coverage at all.

Important Files Changed

Filename Overview
src/db/index/common/doc.cc Adds sparse index count validation (> kSparseMaxDimSize) to Doc::validate for both FP16 and FP32 sparse types, and updates VectorQuery::validate boundary from >= to >. Logic is correct and consistent, but the new validation paths lack test coverage.
tests/db/index/common/doc_test.cc Updates the over-limit test vector to 16385 elements to match the new constant. Missing: a boundary test for exactly 16384 elements (the key semantic change from >= to >), and tests for the new Doc::validate sparse size checks.
src/core/algorithm/hnsw_sparse/hnsw_sparse_entity.h Increases kSparseMaxDimSize from 4096 to 16384. Straightforward constant update, consistent with the other two limit constants.
src/core/algorithm/hnsw_sparse/hnsw_sparse_streamer.cc Both add_impl and add_with_id_impl updated from >= to > and receive improved error messages. Boundary check is now consistent with the builder and doc-layer validation.
src/core/algorithm/hnsw_sparse/hnsw_sparse_builder_entity.cc Boundary check in add_vector updated from >= to > and error message improved. Now consistent with streamer paths.
src/core/algorithm/flat_sparse/flat_sparse_streamer.cc Log messages improved in add_impl and add_with_id_impl to include the actual limit value. Both already used > so no semantic change here.
src/core/algorithm/flat_sparse/flat_sparse_utility.h Increases PARAM_FLAT_SPARSE_MAX_DIM_SIZE from 4096 to 16384. Straightforward constant update.
src/db/common/constants.h Increases kSparseMaxDimSize from 4096 to 16384. Consistent with the algorithm-layer constants.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Sparse Vector Insert / Query] --> B{Layer}

    B --> C[Doc Layer\ndoc.cc]
    B --> D[HNSW Sparse\nStreamer / Builder]
    B --> E[Flat Sparse\nStreamer]

    C --> C1["Doc::validate\nsparse_indices.size() > kSparseMaxDimSize\n(constants.h → 16384)"]
    C --> C2["VectorQuery::validate\nbyte_size > kSparseMaxDimSize × sizeof(uint32_t)\n(constants.h → 16384)"]

    D --> D1["HnswSparseStreamer::add_impl\nsparse_count > kSparseMaxDimSize\n(hnsw_sparse_entity.h → 16384)"]
    D --> D2["HnswSparseStreamer::add_with_id_impl\nsparse_count > kSparseMaxDimSize"]
    D --> D3["HnswSparseBuilderEntity::add_vector\nsparse_count > kSparseMaxDimSize"]

    E --> E1["FlatSparseStreamer::add_impl\nsparse_count > PARAM_FLAT_SPARSE_MAX_DIM_SIZE\n(flat_sparse_utility.h → 16384)"]
    E --> E2["FlatSparseStreamer::add_with_id_impl\nsparse_count > PARAM_FLAT_SPARSE_MAX_DIM_SIZE"]

    C1 --> F[InvalidArgument]
    C2 --> F
    D1 --> G[IndexError_InvalidValue]
    D2 --> G
    D3 --> G
    E1 --> G
    E2 --> G
Loading

Comments Outside Diff (1)

  1. tests/db/index/common/doc_test.cc, line 1258-1271 (link)

    Missing exact-boundary test case

    The test only checks a clearly over-limit value (16385) and a clearly under-limit value (3 bytes). Because the semantic of the boundary changed from >= (rejected) to > (accepted) in this PR, there is no assertion that verifies the new allowed boundary — exactly kSparseMaxDimSize (16384) indices (= 65536 bytes) must now pass.

    Without this test, a future regression that accidentally reverts the > to >= in VectorQuery::validate would not be caught.

    Consider adding:

    // Exactly at the limit: should now pass
    std::vector<uint32_t> exact_limit_indices(16384);
    std::string exact_limit_str =
        std::string(reinterpret_cast<char *>(exact_limit_indices.data()),
                    exact_limit_indices.size() * sizeof(uint32_t));
    query.query_sparse_indices_ = exact_limit_str;
    s = query.validate(&schema);
    EXPECT_TRUE(s.ok());

Last reviewed commit: c2c1aee

Greptile also left 1 inline comment on this PR.

@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 16, 2026

@greptile

@zhourrr zhourrr requested a review from iaojnh March 16, 2026 07:41
@zhourrr zhourrr force-pushed the feat/enlarge_sparse_vector_limit branch from 9562156 to bc33e81 Compare March 16, 2026 07:46
@zhourrr
Copy link
Collaborator Author

zhourrr commented Mar 16, 2026

@greptile

@zhourrr zhourrr merged commit 3c1241f into main Mar 16, 2026
10 checks passed
@zhourrr zhourrr deleted the feat/enlarge_sparse_vector_limit branch March 16, 2026 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants