fix: some minor bugs by zhourrr · Pull Request #370 · alibaba/zvec

zhourrr · 2026-04-22T13:59:03Z

Asked qoder to do a full round of review before quota expired.

Bug Report: zvec Codebase Audit

Date: April 22, 2026
Scope: Full source audit of the zvec in-process vector database
Files changed: 9 source files edited, 1 file renamed

Bug 1 — Race condition in `DropIndex()` on `destroyed_` flag

Severity: High
File: src/db/collection.cc

Description

CollectionImpl::DropIndex() checks the destroyed_ flag before acquiring the schema_handle_mtx_ lock. Every other mutating method in the class (Destroy(), Flush(), CreateIndex(), Optimize(), AddColumn(), DropColumn(), AlterColumn()) acquires the lock first, then checks destroyed_. This creates a TOCTOU (time-of-check-to-time-of-use) race: another thread could call Destroy() between the check and the lock acquisition, allowing DropIndex() to proceed on a destroyed collection, potentially causing use-after-free or null pointer dereferences on released resources.

Before

Status CollectionImpl::DropIndex(const std::string &column_name) {
  CHECK_COLLECTION_READONLY_RETURN_STATUS;
  CHECK_DESTROY_RETURN_STATUS(destroyed_, false);   // check before lock

  std::lock_guard lock(schema_handle_mtx_);          // lock after check

Fix

Swapped the ordering to match every other method — acquire the lock first, then check:

Status CollectionImpl::DropIndex(const std::string &column_name) {
  CHECK_COLLECTION_READONLY_RETURN_STATUS;

  std::lock_guard lock(schema_handle_mtx_);

  CHECK_DESTROY_RETURN_STATUS(destroyed_, false);

Bug 2 — Integer underflow in `ConcurrentRoaringBitmap32::range_cardinality()`

Severity: High
File: src/db/common/concurrent_roaring_bitmap.h

Description

range_cardinality(uint32_t min_doc_id, uint32_t max_doc_id) computes min_doc_id - 1 to calculate the rank at the position just before min_doc_id. When min_doc_id == 0, this arithmetic wraps to UINT32_MAX (4,294,967,295) because min_doc_id is an unsigned uint32_t. This causes roaring_bitmap_rank() to return the total cardinality, making the subtraction max_rank - min_rank return 0 (or a bogus value) instead of the correct range count.

The 64-bit counterpart (ConcurrentRoaringBitmap64::range_cardinality) already had a guard for this case (min_doc_id <= 0 ? 0 : ...), but the 32-bit version was missing it.

Before

max_rank = roaring_bitmap_rank(bitmap_, max_doc_id);
min_rank = roaring_bitmap_rank(bitmap_, min_doc_id - 1);  // underflows when min_doc_id == 0

Fix

max_rank = roaring_bitmap_rank(bitmap_, max_doc_id);
min_rank = min_doc_id == 0 ? 0 : roaring_bitmap_rank(bitmap_, min_doc_id - 1);

Bug 3 — Uninitialized member variables in `Doc`

Severity: High
File: src/include/zvec/db/doc.h

Description

Doc::doc_id_ (uint64_t) and Doc::op_ (Operator enum) have no default member initializers and are not initialized in the default constructor (Doc() = default). Reading these members before explicit assignment is undefined behavior per the C++ standard. In practice, this means a default-constructed Doc could report a garbage doc_id or operation type, which could silently corrupt data during writes or produce incorrect log output.

All other member variables in the class (pk_, score_, fields_) are properly initialized.

Before

uint64_t doc_id_;
Operator op_;

Fix

uint64_t doc_id_{0};
Operator op_{Operator::INSERT};

Bug 4 — Source filename typo: `rocbsdb_context.cc`

Severity: Medium
File: src/db/common/rocbsdb_context.cc

Description

The RocksDB wrapper implementation file was named rocbsdb_context.cc (missing the letter 'k' — "rocbs" instead of "rocks"). While the build system (SRCS *.cc glob) compiled it without issues, the typo creates confusion and makes it hard to find the file by searching for "rocksdb". The corresponding header is correctly named rocksdb_context.h.

Fix

Renamed rocbsdb_context.cc to rocksdb_context.cc via git mv.

Bug 5 — Wrong error descriptions for `InvalidChannelCount` and `InvalidReplicaCount`

Severity: Medium
File: src/db/common/error_code.cc

Description

Error codes 2045 (InvalidChannelCount) and 2046 (InvalidReplicaCount) both had their description string set to "Invalid field name", which is a copy-paste error from the preceding line (error code 2044, InvalidFieldName). This means any error message surfaced to users or logs for channel count or replica count validation failures would misleadingly say "Invalid field name".

Before

PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidFieldName, 2044, "Invalid field name");
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidChannelCount, 2045, "Invalid field name");   // wrong
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidReplicaCount, 2046, "Invalid field name");   // wrong

Fix

PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidFieldName, 2044, "Invalid field name");
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidChannelCount, 2045, "Invalid channel count");
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidReplicaCount, 2046, "Invalid replica count");

Bug 6 — `DirectoryAlreadyExists` and `DirectoryNotExists` declared but never defined

Severity: Medium
File: src/db/common/error_code.h

Description

error_code.h declares two error codes via PROXIMA_ZVEC_ERROR_CODE_DECLARE:

PROXIMA_ZVEC_ERROR_CODE_DECLARE(DirectoryAlreadyExists);
PROXIMA_ZVEC_ERROR_CODE_DECLARE(DirectoryNotExists);

However, error_code.cc contains no corresponding PROXIMA_ZVEC_ERROR_CODE_DEFINE for either. These are extern declarations of global const objects that have no definition in any translation unit. If any code ever references them, it would cause a linker error. A codebase search confirmed they are not referenced anywhere.

Fix

Removed both dead declarations from error_code.h.

Bug 7 — Incorrect comment: `64 * 1024 * 1024` labeled as "128M"

Severity: Low
File: src/include/zvec/db/options.h

Description

The constant DEFAULT_MAX_BUFFER_SIZE is defined as 64 * 1024 * 1024 (= 67,108,864 bytes = 64 MB), but the comment says // 128M. This is misleading for anyone reading the code to understand the default buffer size.

Before

const uint32_t DEFAULT_MAX_BUFFER_SIZE = 64 * 1024 * 1024;  // 128M

Fix

const uint32_t DEFAULT_MAX_BUFFER_SIZE = 64 * 1024 * 1024;  // 64M

Bug 8 — Write lock used for read-only `storage_size_in_bytes()`

Severity: Low
File: src/db/common/concurrent_roaring_bitmap.h

Description

ConcurrentRoaringBitmap32::storage_size_in_bytes() is a read-only operation (it calls roaring_bitmap_portable_size_in_bytes which does not mutate the bitmap), but it acquires a std::unique_lock (exclusive/write lock). This unnecessarily blocks all concurrent readers, reducing throughput. Every other read-only method in the class correctly uses std::shared_lock.

Before

size_t storage_size_in_bytes() const {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    return roaring_bitmap_portable_size_in_bytes(bitmap_);
}

Fix

size_t storage_size_in_bytes() const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return roaring_bitmap_portable_size_in_bytes(bitmap_);
}

Bug 9 — `static` function definition in header causes per-TU duplication

Severity: Low
File: src/db/common/file_helper.h

Description

GetFileName(FileID) is declared static in a header file. In C++, static at file/namespace scope gives each translation unit (.cc file) its own private copy of the function. Since this header is included across many .cc files, the linker includes duplicate copies of the function in the final binary, increasing code size. The function should be inline instead, which tells the linker to deduplicate identical definitions while still allowing the definition in the header.

Before

static const char *GetFileName(FileID t) {

Fix

inline const char *GetFileName(FileID t) {

Bug 10 — Missing `#pragma once` include guard in `utils.h`

Severity: Low
File: src/db/common/utils.h

Description

utils.h is the only header in src/db/common/ that does not have a #pragma once include guard. Without it, if the header is included multiple times (directly or transitively) in the same translation unit, it causes redefinition errors for std::string indent(int level). While the current include graph may not trigger this, it is fragile and inconsistent with the rest of the codebase.

Fix

Added #pragma once at the top of the file.

Bug 11 — Comment indices in `is_valid_type_v` do not match `Value` variant ordering

Severity: Info
File: src/include/zvec/db/doc.h

Description

The Value variant and the is_valid_type_v type trait both list the same types, but the comment indices diverge in two places:

In the Value variant, index 13 is std::vector<int64_t> and index 14 is std::vector<uint32_t>. In is_valid_type_v, the comments had these swapped (13 for uint32_t, 14 for int64_t).
In the Value variant, index 20 is std::pair<..., std::vector<float>> and index 21 is std::pair<..., std::vector<float16_t>>. In is_valid_type_v, the comments had these swapped.

While is_valid_type_v is a simple OR-chain (so the ordering does not affect runtime behavior), mismatched index comments are dangerous for anyone writing serialization or deserialization code that relies on variant indices. The wrong comment could lead someone to use the wrong index in a std::get<N>() call.

Fix

Reordered the entries in is_valid_type_v and corrected the comments to match the actual Value variant ordering exactly.

Verification note: Serialization in doc.cc uses std::visit with if constexpr type matching and a separate ValueType enum — not variant indices — so reordering is_valid_type_v entries has zero impact on existing serialized data. The memory_usage() and operator== functions do use .index() on the variant, but those already have the correct index-to-type mapping.

Bug 12 — `DeleteByFilter()` computes `get_all_segments()` twice (one result unused)

Severity: Medium
File: src/db/collection.cc

Description

CollectionImpl::DeleteByFilter() calls get_all_segments() at line 1552 and stores the result in a local variable segments, but never uses it. Then at line 1560, get_all_segments() is called a second time as an argument to sql_engine_->execute(). This is a copy-paste artifact that wastes work — get_all_segments() iterates and copies all segments each time it is called.

Before

auto segments = get_all_segments();   // result never used

VectorQuery query;
query.filter_ = filter;
query.topk_ = INT32_MAX;
query.output_fields_ = std::vector<std::string>{};
query.include_doc_id_ = true;

auto ret = sql_engine_->execute(schema_, query, get_all_segments());  // called again

Fix

Removed the unused segments variable. The single remaining get_all_segments() call inside execute() is sufficient:

VectorQuery query;
query.filter_ = filter;
query.topk_ = INT32_MAX;
query.output_fields_ = std::vector<std::string>{};
query.include_doc_id_ = true;

auto ret = sql_engine_->execute(schema_, query, get_all_segments());

Bug 13 — `Open()` allocates `sql_engine_` even after `create()`/`recovery()` failure

Severity: Medium
File: src/db/collection.cc

Description

CollectionImpl::Open() calls create() or recovery() and stores the status in s, but then unconditionally proceeds to allocate a Profiler and create the sql_engine_ before returning s. If create() or recovery() failed, these resources are allocated for no reason — the callers (Collection::CreateAndOpen() and Collection::Open()) check the status and discard the CollectionImpl, so the engine is immediately destroyed.

Before

Status s;
if (schema_ == nullptr) {
  s = recovery();
} else {
  s = create();
}

auto profiler = std::make_shared<Profiler>();          // runs even if s is error
sql_engine_ = sqlengine::SQLEngine::create(profiler);  // wasted allocation

return s;

Fix

Added an early return via CHECK_RETURN_STATUS(s) before the sql_engine_ allocation:

Status s;
if (schema_ == nullptr) {
  s = recovery();
} else {
  s = create();
}
CHECK_RETURN_STATUS(s);

auto profiler = std::make_shared<Profiler>();
sql_engine_ = sqlengine::SQLEngine::create(profiler);

return s;

Bug 14 — Typo "ignnored" in `options.h` comment

Severity: Low
File: src/include/zvec/db/options.h

Description

The comment on enable_mmap_ has a double-n typo: "ignnored" instead of "ignored".

Before

bool enable_mmap_{true};  // ignnored when load collection

Fix

bool enable_mmap_{true};  // ignored when load collection

Verification Notes

All 14 fixes were reviewed for correctness and compatibility:

No data structure changes: No serialization formats, on-disk layouts, or public API signatures were modified.
No behavioral changes to serialization: The Doc serialization uses an explicit ValueType enum with std::visit / if constexpr type matching, not variant indices. Reordering is_valid_type_v entries (Bug 11) and initializing doc_id_/op_ (Bug 3) have no impact on existing persisted data.
Thread safety preserved: Lock reordering (Bug 1) and lock type change (Bug 8) both align with the established patterns used by all other methods in their respective classes.
doc_id_{0} initialization (Bug 3): The value 0 is consistent with Doc::clear() which already resets doc_id_ to 0. In all code paths, doc_id is reassigned by Segment::internal_insert() or Segment::Update() before any database operation, so the default value is never persisted.
Backward compatibility: All fixes are internal implementation corrections. No public headers changed their API contract. Existing client code requires no modifications.

chinaux

lgtm

zhourrr requested review from chinaux and feihongxu0824 as code owners April 22, 2026 13:59

github-actions Bot assigned zhourrr Apr 22, 2026

zhourrr force-pushed the fix/alot branch from 2f64c00 to 5ab93a4 Compare April 22, 2026 13:59

zhourrr added 2 commits April 23, 2026 13:48

fix

a0aae99

fix format

f7b142c

zhourrr force-pushed the fix/alot branch from c1558f2 to f7b142c Compare April 23, 2026 05:48

update

361d3d4

chinaux approved these changes Apr 23, 2026

View reviewed changes

zhourrr merged commit 4e7fa0f into alibaba:main Apr 23, 2026
13 checks passed

zhourrr deleted the fix/alot branch April 23, 2026 11:02

ihb2032 pushed a commit to ihb2032/zvec that referenced this pull request Apr 29, 2026

fix: some minor bugs (alibaba#370)

28408cf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: some minor bugs#370

fix: some minor bugs#370
zhourrr merged 3 commits into
alibaba:mainfrom
zhourrr:fix/alot

zhourrr commented Apr 22, 2026

Uh oh!

chinaux left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zhourrr commented Apr 22, 2026

Bug Report: zvec Codebase Audit

Bug 1 — Race condition in DropIndex() on destroyed_ flag

Description

Before

Fix

Bug 2 — Integer underflow in ConcurrentRoaringBitmap32::range_cardinality()

Description

Before

Fix

Bug 3 — Uninitialized member variables in Doc

Description

Before

Fix

Bug 4 — Source filename typo: rocbsdb_context.cc

Description

Fix

Bug 5 — Wrong error descriptions for InvalidChannelCount and InvalidReplicaCount

Description

Before

Fix

Bug 6 — DirectoryAlreadyExists and DirectoryNotExists declared but never defined

Description

Fix

Bug 7 — Incorrect comment: 64 * 1024 * 1024 labeled as "128M"

Description

Before

Fix

Bug 8 — Write lock used for read-only storage_size_in_bytes()

Description

Before

Fix

Bug 9 — static function definition in header causes per-TU duplication

Description

Before

Fix

Bug 10 — Missing #pragma once include guard in utils.h

Description

Fix

Bug 11 — Comment indices in is_valid_type_v do not match Value variant ordering

Description

Fix

Bug 12 — DeleteByFilter() computes get_all_segments() twice (one result unused)

Description

Before

Fix

Bug 13 — Open() allocates sql_engine_ even after create()/recovery() failure

Description

Before

Fix

Bug 14 — Typo "ignnored" in options.h comment

Description

Before

Fix

Verification Notes

Uh oh!

chinaux left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bug 1 — Race condition in `DropIndex()` on `destroyed_` flag

Bug 2 — Integer underflow in `ConcurrentRoaringBitmap32::range_cardinality()`

Bug 3 — Uninitialized member variables in `Doc`

Bug 4 — Source filename typo: `rocbsdb_context.cc`

Bug 5 — Wrong error descriptions for `InvalidChannelCount` and `InvalidReplicaCount`

Bug 6 — `DirectoryAlreadyExists` and `DirectoryNotExists` declared but never defined

Bug 7 — Incorrect comment: `64 * 1024 * 1024` labeled as "128M"

Bug 8 — Write lock used for read-only `storage_size_in_bytes()`

Bug 9 — `static` function definition in header causes per-TU duplication

Bug 10 — Missing `#pragma once` include guard in `utils.h`

Bug 11 — Comment indices in `is_valid_type_v` do not match `Value` variant ordering

Bug 12 — `DeleteByFilter()` computes `get_all_segments()` twice (one result unused)

Bug 13 — `Open()` allocates `sql_engine_` even after `create()`/`recovery()` failure

Bug 14 — Typo "ignnored" in `options.h` comment